
AIs Discovering Vulnerabilities

I’ve been writing about the possibility of AIs automatically discovering code vulnerabilities since at least 2018. This is an ongoing area of research: AIs doing source code scanning, AIs finding zero-days in the wild, and everything in between. The AIs aren’t very good at it yet, but they’re getting better.

Here’s some anecdotal data from this summer:

Since July 2024, ZeroPath is taking a novel approach combining deep program analysis with adversarial AI agents for validation. Our methodology has uncovered numerous critical vulnerabilities in production systems, including several that traditional Static Application Security Testing (SAST) tools were ill-equipped to find. This post provides a technical deep-dive into our research methodology and a living summary of the bugs found in popular open-source tools.

Expect lots of developments in this area over the next few years.

This is what I said in a recent interview:

Let’s stick with software. Imagine that we have an AI that finds software vulnerabilities. Yes, the attackers can use those AIs to break into systems. But the defenders can use the same AIs to find software vulnerabilities and then patch them. This capability, once it exists, will probably be built into the standard suite of software development tools. We can imagine a future where all the easily findable vulnerabilities (not all the vulnerabilities; there are lots of theoretical results about that) are removed in software before shipping.

When that day comes, all legacy code would be vulnerable. But all new code would be secure. And, eventually, those software vulnerabilities will be a thing of the past. In my head, some future programmer shakes their head and says, “Remember the early decades of this century when software was full of vulnerabilities? That’s before the AIs found them all. Wow, that was a crazy time.” We’re not there yet. We’re not even remotely there yet. But it’s a reasonable extrapolation.

EDITED TO ADD: And Google’s LLM just discovered an exploitable zero-day.

Posted on November 5, 2024 at 7:08 AM • 15 Comments

Comments

Clive Robinson November 5, 2024 9:20 AM

With regards,

“AIs doing source code scanning, AIs finding zero-days in the wild, and everything in between. The AIs aren’t very good at it yet, but they’re getting better.”

An obvious observation applies,

“The blind leading the blind.”

Let’s be honest, humans are not very good at it either… So the usual way of learning to do it is by,

“Fumbling around in the dark till you develop a hinky/spidey sense.”

Eventually you get a feel for “patterns” and these warn you of problem areas and where to look more closely.

Thus it’s learning not just to hear the signal in the noise, but also what is just noise and what the various types of signal within it are.

We’ve done this before, back at the end of the last century, with “extracting secrets” from “Power Signatures” on smart cards and what are now seen as obsolete 8-bit microcontrollers and similar.

For some reason, once discovered, few people went further with it, even though much earlier, back in the mid-1960s, one software diagnostic tool was an AM radio tuned to a harmonic of the computer clock frequency, where the “envelope modulation” produced a sound that was indicative of what the code was doing.

In the “academic community” this kind of got relegated to student projects written in assembler to play “tunes” and similar.

Some years ago, when talking about “Castles -v- Prisons”, I went into quite some detail about doing this by having a hypervisor look for aberrant signals or changes in expected signals from programs executing in small single-task computing units I called Prisons.

Such changes in signal signatures indicated that the functioning of a tasklet’s code in a prison was not as expected, thus something was wrong, such as an error or (in otherwise debugged code) untested/unallowed input or malware, and an exception needed to be raised and investigated.

In theory this is very easy to do: you just simplify the code into individual tasks with clear functions, so that a clear “signal against time” can be established, and then develop an appropriate monitoring mask.

Thus we get back to the idea of “Digital Signal Processing” (DSP) and “Adaptive filtering” around a base mask.
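As a toy illustration of that “base mask” idea (a minimal sketch of envelope-based anomaly detection, not Clive’s actual design; the function names, thresholds, and synthetic traces below are all hypothetical): sample a tasklet’s power or timing waveform at fixed intervals, learn an envelope from known-good runs, and flag any run whose signal escapes it.

```python
import numpy as np

def build_mask(good_traces, slack=4.0):
    """Learn a per-sample envelope (mean +/- slack * std) from known-good runs."""
    traces = np.asarray(good_traces, dtype=float)
    mean = traces.mean(axis=0)
    std = traces.std(axis=0) + 1e-9            # avoid a zero-width mask
    return mean - slack * std, mean + slack * std

def signature_ok(trace, mask, max_violations=5):
    """Return True if the run stays inside the envelope (a few stray samples allowed)."""
    lo, hi = mask
    trace = np.asarray(trace, dtype=float)
    violations = np.count_nonzero((trace < lo) | (trace > hi))
    return violations <= max_violations

# Synthetic example: ten noisy but well-behaved runs, then one run with extra work injected.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 4 * np.pi, 200))
mask = build_mask([base + rng.normal(0, 0.05, 200) for _ in range(10)])

normal = base + rng.normal(0, 0.05, 200)
tampered = normal.copy()
tampered[120:140] += 0.8                       # unexpected activity mid-run

print(signature_ok(normal, mask))              # True  -> signature as expected
print(signature_ok(tampered, mask))            # False -> aberrant signature, raise an exception
```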

As we actually know how to do this, I would expect us to be able to get an LLM system to do it, provided we keep the tasks sufficiently defined and simplified.

This has been one of the promises of very small RISC cores in superscalar arrangements using highly parallel processing. Only the way we’ve gone down the superscalar path on the hardware side of things has made it a very rough road.

This is because humans are very bad at doing parallel programming and appear stuck in an increasingly awkward form of sequential programming.

Thus I predict that what will probably need to happen is that some form of yet-to-be-developed AI will need to take human-written “high level code” that will be in effect “sequential”, and produce, via the compiler, optimised parallel code that will give clear signatures.

Yes I can see the arguments, but remember we have been walking down a similar path with RISC designs, which in the 1990s led to Intel’s idea of “EPIC” on the Itanium processors that turned into a bit of a disaster. Especially with AMD coming up with a cleaner way, their x64 system, which Intel ended up having to licence; some say because of Microsoft digging their heels in and refusing to go down Intel’s way.

Thus a second warning is that for AI to do this in a cost-efficient way, it needs the “software tool chain” developers to be very much on board with it…

And that will probably be the major stumbling block, with patents used to put up roadblocks to protect what are effectively monopoly positions.

Alex M November 5, 2024 9:30 AM

It’s a strange paradox how LLMs both enhance and compromise security, because they’re trained on code with a lot of security debt.

But on the positive side, once inserted into a development lifecycle, we might see LLMs produce flawed code that, once caught in the testing phase, could be fixed by the very same LLM.

Jelo 117 November 5, 2024 10:55 AM

AI seems forever to be spoken of in a way that lends it capabilities of the same nature as those of humans, whereas it is just a tool devised by humans for certain human ends. The tool gets better by humans getting better at toolmaking. People who work in this area are all too familiar with the fact that these AI tools are only algorithms and hardware, not different in essential nature from any calculator.

Daniel Popescu November 5, 2024 2:11 PM

Fascinating post, Bruce. Somewhat off-topic: a time will come (not soon, I hope :)) when you’ll need to create a digital twin for yourself.

Person November 5, 2024 2:24 PM

AI seems like a decent way to target software fuzzing to reduce its computation time and increase its hit percent.

Clive Robinson November 5, 2024 7:21 PM

@ Person,

With regards,

“AI seems like a decent way to target software fuzzing to reduce its computation time and increase its hit percent.”

Hmm, not sure what you actually mean… Your comment is, shall we say, ambiguous due to its brevity and lack of detail.

Care to amplify what you are thinking.

Long Sleep November 5, 2024 7:32 PM

Re: Google Claims World First As AI Finds 0-Day Security Vulnerability

“We found this issue before it appeared in an official release,” the Big Sleep team from Google said, “so SQLite users were not impacted.”

That may be easy to verify now, but if any parts of the LLM are trained by scraping the entire datasphere, it might pick up on some obscure clues of prior exploitation.

Clive Robinson November 6, 2024 6:41 AM

@ Jelo 117

Re : Being a toolmaker.

You note,

“The tool gets better by humans getting better at toolmaking.”

It’s not just humans that think up and make and use tools. Even “bird brains” like Corvids do it.

For a while anthropologists looked on “new toolmaking and use” as a sign of intelligence.

But is it?

Later studies showed that chimpanzees and other primates “learned their toolmaking” from the humans who were watching them.

That is, it was found repeatedly that a troop of primates observed long-term by naturalists would in return observe the naturalists and start to “mimic” the activities they saw.

Thus they were “learning and applying” rather than “inventing”.

Some years ago this came to a head due to the observation of the “eating of meat”, specifically of other primates in other troops. It appears that primate troops near humans who hunt and eat primates had “picked up the habit”…

This in turn reopened the “nature v nurture” argument about “innate v acquired” behaviours, which through group-memory learning in effect become cultural learning, as it confers an advantage.

The advantage is effectively a gain, that is, some kind of expenditure minimisation to achieve a given level of outcome. Less energy, less time, or less movement are the obvious examples, but others, such as certain types of collective hunting and foraging, are less obvious.

Some have pointed out that we see the same in “natural physical systems” like thermodynamics, which gives rise to all sorts of other questions.

Thus an argument could be made that what we call intelligence is based on,

1, Observe (see)
2, Experiment (try)
3, Equate (benefit)
4, Repeat.

The second and third steps allow for “random” or “Stochastic” behaviours. So you can start to see why some think as they do.

The real question, though, is the third step: whether it is an “observation” or a “reason” based system. That is, do you ask a question and find an answer by reason/experiment, or do you just observe a benefit?

Arguably, current AI LLM and ML systems are capable of “measuring” benefit but not of “reasoning” about what a benefit might be. What reasoning is shown by LLM/ML use comes “not from the tool” but from “how the tool is used”, that is, “from the tool user”, who asks a series of directed questions of the LLM to “test a hypothesis of the user” and then optimises or discards the results by whatever measure “the tool user thinks fit”.

Which then brings us into the realm of “Monte Carlo Methods” [1], based on stochastic or, as some call them, “scatter gun” methods. And from there we move into the likes of the thermodynamics we get to see in school science (with that deep purple potassium permanganate swirl caused by heating water), where what appears individually as totally random movement shows overall to have predictable results to quite a high degree, through chaotic and complex processes.

But still no “reasoning” being evident… Even though others might wish it was so.

[1] The method is named after a famous gambling casino by the mathematician Stanislaw Ulam, who thought it up and used it as part of the Manhattan Project to find answers that were not amenable to traditional analysis methods. It was apparently inspired by his uncle’s gambling habits…
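For readers who have not met the technique, here is a minimal Monte Carlo example (my own illustration, not something from the comment above): estimating π by throwing random points at the unit square and counting how many land inside the quarter circle. Each point is random, yet the aggregate converges on a predictable value, the same “individually random, collectively predictable” behaviour described above.

```python
import random

def estimate_pi(samples=1_000_000):
    """Monte Carlo estimate of pi: the fraction of random points that land in the
    unit quarter circle, times 4. Individually random, collectively predictable."""
    inside = sum(1 for _ in range(samples)
                 if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

print(estimate_pi())   # ~3.14, with the error shrinking roughly as 1/sqrt(samples)
```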

Marcelo Rinesi November 6, 2024 11:03 AM

There’s an economic counterargument. Oversimplifying:

1) LLMs make it very much cheaper than before for poorly trained coders to create large, unreviewed code artifacts for under-designed systems.

2) In a competitive, unregulated technology market with poor ex ante visibility on software quality this shifts the market equilibrium towards more code, with less review, and less investment in non-coding work and processes, through both cost and speed-to-market pressures. Historically and in the aggregate, higher-productivity coding tools have usually led to more code by less trained people, not to reinvestment of the productivity dividend into, say, testing, design, training, etc.

3) Overall, security gets better through the individual quality of each line of code, and worse through less investment in other things that also impact security.

In a more anecdotal formulation: (a) LLMs make it possible for very untrained and/or time-pressured teams to ship good-enough-for-demo software with good quality code and pretty much nothing else. (b) Given market dynamics, a lot of deployed software is going to look like that. (c) It’s not obvious, and I suspect it’s not true, that in the balance that’s more secure software than before.

Or, in a historical analogy: High-level languages and libraries do work as a form of code-generation LLM (just from Python instead of English), and they do create better-quality and safer low-level code (JIT assembler or whatever). But we don’t have a safer software ecosystem, and it’s at least arguable that the way their productivity features interact with the regulatory and market environment has made things worse, or at least has not had a positive impact.

I don’t think any of the above is a definite proof or anything like that, but it seems plausible enough to make me doubt that the outcome is clear-cut.

Person November 6, 2024 11:12 AM

@ Clive Robinson

With Regards to “Care to amplify what you are thinking.”

Software fuzzing is just another tool devs use to check their code: the concept is to check every possible (or as many as possible) variation of instructions or inputs for a given system, to try to find edge cases and other vulnerabilities that are hard to spot when looking at a complex system as a whole. The main drawback of fuzzing is that the more complex the system becomes, the exponentially longer it takes to do a comprehensive fuzz.

The current solutions to this problem are algorithmic tailoring, using a subset that you hope covers what you want, or a mixture of both. The downsides are quite obvious; an algorithm might not work for the specific style of program you are testing, and subsets are always going to miss something.

If someone smarter than me were instead to use AI to tweak the algorithms used to choose which branches to test, and/or decide which subsets of branches are most likely to have vulnerabilities, one might be able to drastically reduce the time needed to run a fuzz, or simply increase the “hit” probability or coverage of a program given the same timing constraints.
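A minimal sketch of that idea (everything here is hypothetical: the target, the names, and the scoring heuristic, which merely stands in for whatever trained model one might plug in): a mutational fuzzer keeps a pool of seeds and spends more of its budget on the seeds a scoring function rates as most promising.

```python
import random

def score(seed, coverage_history):
    """Stand-in for an ML ranker: prefer seeds that have recently produced new coverage."""
    return coverage_history.get(seed, 0) + 1

def mutate(data):
    """Simplest possible mutation: overwrite one random byte."""
    i = random.randrange(len(data))
    return data[:i] + bytes([random.randrange(256)]) + data[i + 1:]

def fuzz(target, seeds, iterations=10_000, max_pool=200):
    """Score-weighted seed selection; collect inputs that crash the target."""
    coverage_history, crashes = {}, []
    for _ in range(iterations):
        weights = [score(s, coverage_history) for s in seeds]
        seed = random.choices(seeds, weights=weights, k=1)[0]
        candidate = mutate(seed)
        try:
            new_behaviour = target(candidate)
        except Exception:
            crashes.append(candidate)
            continue
        if new_behaviour:                                  # pretend "new coverage" was seen
            coverage_history[seed] = coverage_history.get(seed, 0) + 1
            if len(seeds) < max_pool:
                seeds.append(candidate)
    return crashes

def toy_target(data):
    """Toy program under test: 'crashes' when the first byte is 0x42."""
    if data[:1] == b"\x42":
        raise RuntimeError("boom")
    return max(data) > 200                                 # fake coverage signal

print(len(fuzz(toy_target, [b"hello"])), "crashing inputs found")
```

Swapping score() for a model trained on which code regions or input features have historically harboured bugs is, in essence, the change being proposed.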

Clive Robinson November 6, 2024 11:30 PM

@ Person

Thanks for the reply. If I’m reading things right at this early hour of the morning (0345 where I am),

You are proposing a divide and conquer type attack to reduce the number of “tests” needed.

But rather than “stochastic” or random selection of tests across the whole range, which in effect gives broad probabilistic results, use the AI system to direct the selection of tests to narrow ranges that have from previous analysis a much higher probability of error.

Yes I would expect this to show significantly increased performance on “known” errors, but at a cost.

The chance of it finding a new error is going to fall off with the multidimensional vector distance from the known errors.

Whilst I would expect it to have increased performance in finding variations on “known cases” and maybe find some new “edge cases”, I would not expect it to find new “corner cases”.

Which begs the obvious question,

“Does this really matter?”

Yes we would see a reduction in errors in “known cases” in the general software population, thus hopefully reducing “low hanging fruit” significantly.

But it would not reduce the expected rate of new corner cases being found, as that would remain the domain of the “hinky thinker”.

But once the new cases become known cases, such an AI system would in effect reduce the time to find new matches that are close in the vector space, which obviously has a reduced time component.

In the past I’ve noted the difference between,

1, New instance in an unknown class of attack.
2, New instance in a known class of attack.

Where the latter are expected to be far more prevalent.

I would expect the system you are proposing to in effect focus on this second more prevalent group.

Which I think many would regard as “a good thing”.

However… We also have to consider “new tool” lead time.

In manufacturing, new tools are known to be expensive, and this delays general take-up (development cost amortization). That is, the first users are those who benefit most from what a new tool specialises in, as it has a greater impact on their effectiveness.

In the case of software vulnerabilities those who would benefit most by this tool would initially be attackers.

So we would expect there to initially be a rise in new attacks, which would taper off over time as take-up of the new tool increases among those who are in effect defenders.

Clive Robinson November 9, 2024 9:40 PM

@ Bruce, ALL,

Re : Software signatures

Above I mention “Castles v Prisons”, private research I did and wrote about on this blog quite a few years ago.

It came out of work I’d done on designing tiny computers running in parallel, the idea being to run simplified “tasklet” software that generated monitorable signatures. The tasklets would in effect work a bit like *nix shell command scripts to build applications, the point being that the tasklets had well-defined signatures that could be effectively monitored by an over-watching hypervisor. If the signature was “not right” then malware, or input that was not being correctly handled, was causing a situation that needed to be brought to the attention of the security system.

Well, many years after I put up the “Castle v Prison” information, the academic community finally started getting its act together…

In the meantime Machine Learning had advanced, so much more complex signatures could be instrumented and seen, and thus it could work with high-end CPU cores (which in effect did not exist when I started doing my research).

Well… a couple of years back there was an “academic paper” that finally started looking into the security issues of “signatures”,

https://jackcook.github.io/bigger-fish/

Funnily enough it was in part about things I’d discussed here with @Nick P, @Wael, and @RobertT, and also, as I had mentioned, a way that the likes of “Television Licencing” and others such as the RIAA etc. could tell what movies, audio, and similar media streams you were watching/listening to via the power-usage waveforms seen at the Smart-Meter, such that they could be detected to see if supposed “Digital Rights” were being abused.

The two improvements the paper had were due mainly to advances in hardware, which gave

1, Use of another Web App running on another core as the instrument head.
2, The use of improved AI ML

Well, today one of the authors of that academic paper wrote a blog post about it, which is a little more friendly to readers who are not hardware gurus,

https://jackcook.com/2024/11/09/bigger-fish.html

The use of instrumentation to detect “Side Channels” and AI ML to automate the “surveillance” on users’ computers is something I expect to become a major use of AI in the next few years, as I’ve previously mentioned.

For those that don’t implicitly get “side channels”, they are in this case due to “resource limitations”.

Put simply think about your home,

“There is only so much hot water in the storage tank.”

So as you use hot water it gets replaced with cold water that has to be heated. The length of time it takes to heat the water is proportional to the hot water taken out of the tank.

So if someone is having a shower whilst someone else is doing the dishes, both can tell not only that hot water is being used by the other, but how much and for how long.

All systems that have multiple-task capability have a “side channel” issue from the “shared resource” use. Put simply, it cannot be avoided unless you go to some quite interesting techniques… which are in effect “imposed inefficiency”, and mostly we just don’t design PCs and similar IT systems that way. We do however design “Real Time OS” (RTOS) and some high-security, highly segregated systems that way.
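A minimal sketch of that shared-resource effect (my own illustration of contention, not a real attack; on CPython the contention is mostly on the interpreter lock rather than a hardware resource, but the principle is the same): one thread alternates bursts of CPU work with idle periods, while a second thread repeatedly times a fixed computation. The observer’s timings rise whenever the “neighbour” is busy, so the neighbour’s activity pattern leaks through timing alone.

```python
import threading, time

def other_tenant(stop, busy_period=0.5, idle_period=0.5):
    """The 'neighbour': alternates bursts of CPU work with idle periods."""
    while not stop.is_set():
        end = time.monotonic() + busy_period
        while time.monotonic() < end:          # burn CPU, contending for the shared core/lock
            sum(i * i for i in range(1000))
        time.sleep(idle_period)

def observer(samples=40):
    """Time a fixed workload repeatedly; spikes reveal when the neighbour is active."""
    timings = []
    for _ in range(samples):
        t0 = time.monotonic()
        sum(i * i for i in range(100_000))     # constant work, variable duration under contention
        timings.append(time.monotonic() - t0)
        time.sleep(0.05)
    return timings

stop = threading.Event()
threading.Thread(target=other_tenant, args=(stop,), daemon=True).start()
trace = observer()
stop.set()

# Samples slower than the median roughly mirror the neighbour's busy/idle cycle,
# even though the observer never touches the other thread's data.
median = sorted(trace)[len(trace) // 2]
print(["busy" if t > median else "idle" for t in trace])
```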

Keith Levkoff November 21, 2024 4:58 PM

The biggest problem I see in all this is that the process of finding AND FIXING vulnerabilities is very often far from symmetrical.

The company I work for manufactures consumer products that have relatively complex operating firmware. (We’re talking about home theater gear, so the actual risk associated with weaknesses in our firmware is somewhat limited…)

However, even when it comes to “plain old bugs”, the process of finding a bug is often much less complicated or expensive than the process of FIXING it, testing the patch, and ensuring that nothing else was broken in the process.

As I see it we’ve always had the asymmetry that, while the vendor has to find and fix ALL of the vulnerabilities in order for their product to remain secure, the hacker only needs to find ONE unpatched vulnerability to compromise it. (It’s the old analogy that, in order to secure a building, the owner must secure ALL of the doors and windows, whereas a burglar only needs to find ONE that wasn’t properly locked in order to get in.)

However, as AI gets better at finding vulnerabilities, that ability is going to be of FAR more benefit to the hacker than the vendor. Assuming equal capabilities on both sides – so both “get the same list at the same time” – the hacker is getting a more comprehensive list of opportunities to exploit more quickly, whereas the vendor is getting a longer list of problems that need to be solved more urgently. This is going to make it more difficult for the vendor to stay ahead of the hacker.

To go with the physical security analogy… imagine an AI tool that could offer a list of every unlocked door and window in a neighborhood… updated every ten seconds… to both police departments and burglars. It seems obvious to me that, at least initially, such a system would provide far more advantage to the burglars than to the police.

The situation is actually worse. With my physical security analogy you could theoretically figure out a way to notify individual homeowners when their home was unlocked… but with software, most companies aren’t going to have the option of “not using a certain program for a few days because a new exploit for it was discovered and announced five minutes ago”.

And, at the development level, we might hope that all software would be thoroughly checked before being released, but I’m afraid that would be too much to hope for… Especially if we’re talking about finding and fixing more problems, I’m afraid some software might never get released if it actually had to be “totally free of critical bugs”. (And, at this rate, we might end up talking about “zero MINUTE exploits” instead of “zero DAY exploits”.)

I don’t know what the answer is but I fear that this is going to make the overall situation more problematic, and more “frenetic”, than it is already.
