I used o3 to find a remote zeroday in the Linux SMB implementation

nxobject · 2025-05-24T19:43:35 1748115815

A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for system prompt, background information, and auxiliary instructions [1], and then running it through `llm`.

It reveals how good LLM use, like any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.

[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899

epolanski · 2025-05-24T22:41:28 1748126488

I find your take amusing considering that's literally the only part of the post he admits to just vibing it:

> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering

conradev · 2025-05-25T01:07:39 1748135259

A good engineer can vibe good engineering plans!

Just like Eisenhower's famous "plans are useless, planning is indispensable" quote. The muscle you build is creating new plans, not memorizing them.

moffkalast · 2025-05-25T07:41:17 1748158877

People also underestimate how much winging it is actually the ideal approach for a natural language interface, since that's the kind of thing it was trained on anyway.

NitpickLawyer · 2025-05-25T04:35:32 1748147732

The difference between vibing and "engineering" is keeping good records, logs and prompt provenance in a methodical way? Also have a (manual) way of reviewing the results. :) (paraphrased from mythbusters)

chii · 2025-05-25T07:39:58 1748158798

as the mythbusters have famously said, the only difference between science and fucking around is writing it down.

_boffin_ · 2025-05-25T01:03:49 1748135029

One person’s Vibe is another person’s dream? In my mind, the person is able to formulate a mental model complete enough to even go after vurln, unlike me, where I wouldn’t have even considered thinking about it.

kweingar · 2025-05-24T20:19:25 1748117965

How do we benchmark these different methodologies?

It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

nindalf · 2025-05-24T20:29:38 1748118578

The author is up front about the limitations of their prompt. They say

> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.

0points · 2025-05-25T12:05:27 1748174727

Author seems to downplay their own expertise and attribute it to the LLM, while at the same time admitting he's vibe prompting the LLM and dismissing wrong results while hyping the ones that happen to work out for him.

This seems more like wishful thinking and fringe stuff than CS.

pixl97 · 2025-05-25T15:12:17 1748185937

Science starts at the fringe with a "that's interesting"

The interesting thing here is the LLM can come to very complex correct answers some of the time. The problem space of understanding and finding bugs is so large that this isn't just by chance, it's not like flipping a coin.

The issue for any particular user is the amount of testing required to make this into science is really massive.

mrlongroots · 2025-05-24T21:37:34 1748122654

I think there's two aspects around LLM usage:

1. Having workflows to be able to provide meaningful context quickly. Very helpful.

2. Arbitrary incantations.

I think No. 2 may provide some random amounts of value with one model and not the other, but as a practitioner you shouldn't need to worry about it long-term. Patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is at.

As my example as a systems grad student, I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about some design discussions that I want archived, I ask them to emit markdown which I then copy paste into the wiki. It's not perfectly organized but it keeps the key bits there and makes generating papers etc. that much easier.

TrapLord_Rhodo · 2025-05-25T15:08:02 1748185682

> ksmbd has too much code for it all to fit in your context window in one go. Therefore you are going to audit each SMB command in turn. Commands are handled by the __process_request function from server.c, which selects a command from the conn->cmds list and calls it. We are currently auditing the smb2_sess_setup command. The code context you have been given includes all of the work setup code code up to the __process_request function, the smb2_sess_setup function and a breadth first expansion of smb2_sess_setup up to a depth of 3 function calls.

The author deserves more credit here, than just "vibing".

kristopolous · 2025-05-24T22:26:29 1748125589

I usually like fear, shame and guilt based prompting: "You are a frightened and nervous engineer that is very weary about doing incorrect things so you tread cautiously and carefully, making sure everything is coherent and justifiable. You enjoy going over your previous work and checking it repeatedly for accuracy, especially after discovering new information. You are self-effacing and responsible and feel no shame in correcting yourself. Only after you've come up with a thorough plan ... "

I use these prompts everywhere. I get significantly better results mostly because it encourages backtracking and if I were to guess, enforces a higher confidence threshold before acting.

The expert engineering ones usually end up creating mountains of slop, refactoring things, and touching a bunch of code it has no business messing with.

I also have used lazy prompts: "You are positively allergic to rewriting anything that already exists. You have multiple mcps at your disposal to look for existing solutions and thoroughly read their documentation, bug reports, and git history. You really strongly prefer finding appropriate libraries instead of maintaining your own code"

hollerith · 2025-05-24T22:28:27 1748125707

Should be "wary".

kristopolous · 2025-05-24T22:52:38 1748127158

oh interesting, I somehow survived 42 years and didn't know there were 2 words there. I'll check my prompts and give it a go. Thanks.

ValentineC · 2025-05-24T23:37:09 1748129829

I'd be weary of the model doing incorrect things too. Nice prompt though! I'll try it out in Roo soon.

Now I wonder how the model reasons between the two words in that black box of theirs.

kristopolous · 2025-05-25T00:38:04 1748133484

I was coding a chatting bot with an agent like everyone else at https://github.com/day50-dev/llmehelp and I called the agent "DUI" mode because it's funny.

However, as I was testing it, it would do reckless and irresponsible things. After I changed it, as far as bot communication, to "Do-Ur-Inspection" mode and it became radically better.

None of the words you give it are free from consequences. It didn't just discard the "DUI" name as a mere title and move on. Fascinating lesson.

gundmc · 2025-05-24T22:54:42 1748127282

[flagged]

kristopolous · 2025-05-24T22:56:20 1748127380

yeah I removed it. I grew up catholic, went to catholic school, was an altar boy, and spent decades in the church but people reading it don't know this.

The point is when you instruct it that it's some kind of god-like expert, this is part of the reason that it keeps doing prompt refusal by redoing mistakes despite every insistence by you to the contrary. After all what do you know, It's the expert here!

When you use this approach in cline/roo it stops going in and moving shit around when you just ask it questions

tiahura · 2025-05-24T23:45:47 1748130347

As a former alterboy from before there were altergirls can you uncensor.

kristopolous · 2025-05-25T00:18:32 1748132312

I refer to the first method as "catholic prompting" - shame, fear and guilt.

georgemcbay · 2025-05-25T03:52:25 1748145145

As someone from a traditional Boston Catholic family who graduated from Catholic grade and high school and who has since moved away from religion but still has a lot of family and friends who are Catholic, the fact that someone found the idea that Catholics are prone to shame, fear and guilt offensive almost makes me doubt they are Catholic.

I've yet to meet one Catholic IRL who wouldn't have a laugh about that, regardless of the current state of their faith.

kristopolous · 2025-05-25T05:32:52 1748151172

I think the proper thing to do is if someone is offended is "alright sure, whatever, there you go".

Others being offended isn't something you control, responding to it is

ptdnxyz · 2025-05-25T05:46:16 1748151976

Giving in to people who are truly unreasonably offended (by proxy, for social validation, and so on) rewards and incentivizes the behavior, and in fact I believe you have an ethical obligation not to allow people to do this. Promoting antisocial behavior is antisocial.

It's worth having some grace about it though.

0xDEAFBEAD · 2025-05-25T09:14:39 1748164479

To be fair, gundmc's original comment was rather prosocial as these things go:

>I find this use of "Catholic" pretty offensive and distasteful.

They didn't claim their preferences were universal. Nor did they attempt any personal attack on the person they were responding to. They simply described their emotional reaction and left it at that.

If everyone's initial default was the "gundmc approach" when they were offended, the internet would be a much nicer place :-)

So yeah, as far as I'm concerned, everyone in this comment chain is simply lovely :-)

tptacek · 2025-05-25T01:40:07 1748137207

I'm Catholic and it's fine.

naasking · 2025-05-25T09:55:40 1748166940

> Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

You just described one critical aspect of engineering: discovering a property of a system and feeding that knowledge back into a systematic, iterative process of refinement.

kweingar · 2025-05-25T18:35:09 1748198109

I can't think of many engineering disciplines that do things this way. "This seems to work, I don't know how or why it works, I don't even know if it's possible to know how or why it works, but I will just apply this moving forward, crossing my fingers that in future situations it will work by analogy."

If the act of discovery and iterative refinement makes prompting an engineering discipline, then is raising a baby also an engineering discipline?

naasking · 2025-05-25T19:53:08 1748202788

Lots of engineering disciplines work this way. For instance, materials science is still crude, we don't have perfect theories for why some materials have the properties they do (like concrete or superconductors), we simply quantify what those properties are under a wide range of conditions and then make use of those materials under suitable conditions.

> then is raising a baby also an engineering discipline?

The key to science and engineering is repeatability. Raising a baby is an N=1 trial, no guarantees of repeatability.

limflick · 2025-05-25T22:38:52 1748212732

I think the point is that it's more about trial and error, and less about blindly winging it. When you don't know how a system seems to work, you latch on to whatever seems to initially work and proceed from there to find patterns. It's not an entire approach to engineering, just a small part of the process.

p0w3n3d · 2025-05-24T21:09:11 1748120951

Listen to a video made by Karpathy about LLM, he explains why made up html tags work. It's to help the tokenizer

dotancohen · 2025-05-25T11:15:44 1748171744

I recall this even being in the Anthropic documentation.

dotancohen · 2025-05-25T11:18:28 1748171908

Here, found it:

  > Use XML tags to structure your prompts

  > There are no canonical “best” XML tags that Claude has been trained with in particular, although we recommend that your tag names make sense with the information they surround.

https://docs.anthropic.com/en/docs/build-with-claude/prompt-...

justsomehnguy · 2025-05-25T20:42:07 1748205727

My guess would be there is enough training materiel what a mere tagging sometging is enough to have a bigger SNR.

victor106 · 2025-05-25T04:56:08 1748148968

Could not find it. Can you please provide a link?

p0w3n3d · 2025-05-25T19:43:37 1748202217

https://youtu.be/7xTGNNLPyMI?si=eaqVjx8maPtl1STJ

He shows how the prompt is parsed etc. Very nice and eye opening. Also superstition dispelling

stingraycharles · 2025-05-25T06:27:29 1748154449

It’s not that difficult to benchmark these things, eg have an expected result and a few variants of templates.

But yeah prompt engineering is a field for a reason, as it takes time and experience to get it right.

Problem with LLMs as well is that it’s inherently probabilistic, so sometimes it’ll just choose an answer with a super low probability. We’ll probably get better at this in the next few years.

ptdnxyz · 2025-05-25T05:40:49 1748151649

How do you benchmark different ways to interact with employees? Neural networks are somewhere between opaque and translucent to inspection, and your only interface with them is language.

Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.

kweingar · 2025-05-25T18:26:01 1748197561

I think we agree. Interacting with employees is not an engineering discipline, and neither is prompting.

I'm not objecting to the incantations or the vibes per se. I'm happy to use AI and try different methods to get the results I want. I just don't understand the claims that prompting is a type of engineering. If it were, then you would need benchmarks.

threeseed · 2025-05-24T21:47:50 1748123270

It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

Those prompts should be renamed as hints. Because that's all they are. Every LLM today ignores prompts if they conflict with its sole overarching goal: to give you an answer no matter whether it's true or not.

Terr_ · 2025-05-25T06:56:33 1748156193

> Those prompts should be renamed as hints. [...] its sole overarching goal: to give you an answer no matter whether it's true or not.

I like to think of them as beginnings of an arbitrary document which I hope will be autocompleted in a direction I find useful... By an algorithm with the overarching "goal" of Make Document Bigger.

baq · 2025-05-25T07:47:15 1748159235

You’re confusing engineering with maths. You engineer your prompting to maximize the chance the LLM does what you need - in your example, the true answer - to get you closer to solving your problem. It doesn’t matter what the LLM does internally as long as the problem is being solved correctly.

(As an engineer it’s part of your job to know if the problem is being solved correctly.)

ngneer · 2025-05-25T12:32:24 1748176344

Maybe very very soft "engineering". Do you have metrics on which prompt is best? What units are you measuring this in? Can you follow a repeatable process to obtain a repeatable result?

CharlesW · 2025-05-24T23:03:22 1748127802

> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

You invoke "engineering principles", but software engineers constantly trade in likelihoods, confidence intervals, and risk envelopes. Using LLMs is no different in that respect. It's not rocket science. It's manageable.

th0ma5 · 2025-05-24T23:07:14 1748128034

But the threshold between correct and incorrect inference is dependent on an intersection of the model and the document so far. That is not manageable by definition, I mean... It is a chaotic system.

bredren · 2025-05-25T03:38:26 1748144306

Is this dissimilar to what the human brain produces? Are we not producing chaos controlled by wanting to give the right answer?

th0ma5 · 2025-05-26T00:29:04 1748219344

Yes it is very dissimilar. Life isn't a sum of the discrete inputs. I mean maybe it is at times but the context is several orders of magnitude greater, the inputs several orders of magnitude input, etc but the theory that it can be quantified like this is unproven let alone a good basis for an artificial system.

skydhash · 2025-05-25T04:10:58 1748146258

> but software engineers constantly trade in likelihoods, confidence intervals, and risk envelopes

Software engineering is mostly about dealing with human limitations (both the writer of the code and its readers). SO you have principles like modularization and cohesion which is for the people working on the code, not the computer. We also have tests, which is an imperfect, but economical approach to ensure the correctness of the software. Every design decision can be justified or argued and the outcome can be predicted and weighted. You're not cajoling a model to get results. You take a decision and just do it.

limflick · 2025-05-25T22:42:36 1748212956

> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

Are you Insinuating that dealing with unstable and unpredictable systems isn't somewhere engineering principles are frequently applied to solve complex problems?

roywiggins · 2025-05-24T22:07:57 1748124477

Engineering principles are probably the best we've got when it comes to trying to work with a poorly understood system? That doesn't mean they'll work necessarily, but...

avianlyric · 2025-05-24T23:41:37 1748130097

> Engineering principles are probably the best we've got when it comes to trying to work with a poorly understood system?

At its heart that all engineering principles exist to do. Allow us to extract useful value, and hopefully predictable outcomes from systems that are either poorly understood, or too expensive to economically characterise. Engineering is more-or-less the science of “good enough”.

There’s a reason why computer science, and software engineering are two different disciplines.

skydhash · 2025-05-25T04:20:43 1748146843

From "Modern Software Engineering" by David Farley

> Software engineering is the application of an empirical, scientific approach to finding efficient, economic solutions to practical problems in software.

> The adoption of an engineering approach to software development is important for two main reasons. First, software development is always an exercise in discovery and learning, and second, if our aim is to be “efficient” and “economic,” then our ability to learn must be sustainable.

> This means that we must manage the complexity of the systems that we create in ways that maintain our ability to learn new things and adapt to them.

That is why I don't care about LLMs per se, but their usage is highly correlated to the wish of the user to not learn anything, just have some answer, even incorrect, as long as it passes the evaluation process (compilation, review, ci tests,..). If the usage is to learn, I don't have anything to say.

As for efficient and economical solutions that can be found with them,...

avianlyric · 2025-05-25T13:01:53 1748178113

I think you’re being a little over critical of LLMs. They certainly have their issues, and most assuredly people often use them inappropriately. But it rather intellectually lazy to declare that because many people use LLMs inappropriately, that means they can’t offer real value.

I’ve personally found them extremely useful to test and experiment new ideas. Having an LLM throw together a PoC which would have taken me an hour to create, in less than 5mins, is a huge time saver. Makes it possible to iterate through many more ideas and test my understanding of systems far more efficiently than doing the same by hand.

skydhash · 2025-05-25T15:49:50 1748188190

Maybe that’s alien to me because I don’t tend to build PoC, mostly using wireframes to convey ideas. Most of my coding is fully planned to get to the end. The experiment part is on a much smaller scale (module level).

avianlyric · 2025-05-25T18:00:38 1748196038

Ah my apologies. I didn’t realise you’re an individual capable of designing and building complex systems made of multiple interconnected novel modules using only wireframes, and having all that work without any prior experimentation.

For the rest of us less fortunate, LLMs can be a fantastic tool to sketch out novel modules quickly, and then test assumptions and interactions between them, before committing to a specific high level design.

skydhash · 2025-05-25T18:16:04 1748196964

> I didn’t realise you’re an individual capable of designing and building complex systems made of multiple interconnected novel modules using only wireframes, and having all that work without any prior experimentation.

Not really. It's just that there's a lot of prior works out there, so I don't need to do experimentation when someone has already done it and describe the lessons learned. Then you do requirement analysis and some designs (system, api, and ux), plus with the platform constraints, there aren't a lot of flexible points left. I'm not doing research on software engineering.

For a lot of projects, the objective is to get something working out there. Then I can focus on refining if needs be. I don't need to optimize every parameter with my own experiments.

avianlyric · 2025-05-25T22:35:32 1748212532

How do you handle work that involves building novel systems, where good prior art simply doesn’t exist?

I’m currently dealing with a project that involves developing systems where the existing prior art is either completely proprietary and inaccessible, or public, but extremely nacient and thus documented learnings are less developed than our own learnings and designs.

Many projects may have the primary objective of getting something working. But we don’t all have the luxury of being able to declare something working and walk away. I specifically have requirements around long term evolution of our project (I.e. over a 5-10 year time horizon at a minimum), plus long term operational burden and cost. While also delivering value in the short term.

LLM provide are an invaluable tool for exploring the many possible solutions to what we’re building, and helping to evaluate the longer term consequences of our design decisions, before we’ve committed significant resources to developing them completely.

Of course we could do all this without LLMs, but LLMs substantially increase the distance we can explore before timelines force us to commit.

skydhash · 2025-05-26T01:20:24 1748222424

Maybe the main problem is not solved yet, but I highly doubt that the subproblems are not. Because that would be cutting edge domain, which is very much an outlier.

avianlyric · 2025-05-26T14:32:43 1748269963

Ah so what exactly do you mean when you say

> Most of my coding is fully planned to get to the end. The experiment part is on a much smaller scale (module level).

I would seem that these statements taken together mean you don’t experiment at all?

skydhash · 2025-05-26T21:15:09 1748294109

That means that I take time to analyze the problem and come up with a convincing design (mostly research, and experience). After that I've just got a few parameters that I don't know much about. But that doesn't mean that I can't build the stuff. I just isolate them so that I can tweak them later. Why? Because they are often accidental complexities, not essential ones.

avianlyric · 2025-05-27T08:24:40 1748334280

> That means that I take time to analyze the problem and come up with a convincing design (mostly research, and experience).

Ah I think we’re finally getting somewhere. My point is that you can use LLM as part of that research process. Not just as a poor substitute for proper research, but as a tool for experimental research. It’s supplemental to the normal research process, and is certainly not a tool for creating final outputs.

Using LLMs like that can make a meaningful difference to speed and quality of the analysis and final design. And something you should consider, rather than dismissing out of hand.

jcims · 2025-05-24T22:55:45 1748127345

>people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

What's the alternative?

bigstrat2003 · 2025-05-25T01:02:28 1748134948

Not use such a poor tool.

shakna · 2025-05-25T07:08:22 1748156902

Using predictable systems.

If your C compiler invents a new function call for a non-existent function while generating code, that's usually a bug.

If an LLM does, that's... Normal. And a non-event.

pixl97 · 2025-05-25T15:18:28 1748186308

If we have to use predictable systems, how could we use humans in the first place?

jcims · 2025-05-25T14:48:10 1748184490

And?

What other engineering domain operates on a fundamentally predictable substrate? Even computer science at any appreciable scale or complexity becomes unpredictable.

shakna · 2025-05-26T05:05:39 1748235939

Every engineering domain operates within "known bounds". That makes it dependable.

An engineer doesn't just shrug and pick up slag because it contains the same materials as the original bauxite.

jcims · 2025-05-26T12:16:02 1748261762

Of course not, but how did we get to that point with materials science and chemistry?

We’re basically in the stone ages of understanding how to interact with synthetic intelligence.

shakna · 2025-05-27T06:01:06 1748325666

Through research and experimentation, yes.

But attempts to integrate little understood things in daily life gave us radium toothpaste and lead poisoning. Let's not repeat stone age mistakes. Research first, integrate later.

what-the-grump · 2025-05-25T01:07:26 1748135246

Pretending that the world is stable predictable and feeling in control while making fun of other people. Obviously.

ngneer · 2025-05-25T12:34:20 1748176460

Math and physics are pretty stable. So is computer science. Avoid voodoo.

brookst · 2025-05-25T16:51:32 1748191892

LLMs are just math.

It’s reasonable to scope one’s interest down to easily predictable, simple systems.

But most of the value in math and computer science is at the scale where there is unpredictability arising from complexity.

ngneer · 2025-05-25T18:45:49 1748198749

It's reasonable to perceive most of the value in math and computer science being "at the scale" where there is unpredictability arising from complexity, though scale may not really be the reason for the unpredictability.

But a lot of the trouble in these domains that I have observed comes from unmodeled effects, that must be modeled and reasoned about. GPZ work shows the same thing shown by the researcher here, which is that it requires a lot of tinkering and a lot of context in order to produce semi-usable results. SNR appears quite low for now. In security specifically, there is much value in sanitizing input data and ensuring correct parsing. Do you think LLMs are in a position to do so?

brookst · 2025-05-25T20:36:41 1748205401

I see LLMs as tools, so, sure I think they’re in a position to do so the same way pen testing tools or spreadsheets are.

In the hands of an expert, I believe they can help. In the hands of someone clueless, they will just confuse everyone, much like any other tool the clueless person uses.

iknowstuff · 2025-05-24T22:25:52 1748125552

Are you using 2023 LLMs? o3 and Gemini 2.5 Pro will gladly say no or declare uncertainty in my experience

Filligree · 2025-05-24T23:59:58 1748131198

90% of people only use ChatGTP, typically 4o. Of course you're right, but that's where the disconnect comes from.

stingraycharles · 2025-05-25T01:52:29 1748137949

Fun fact: if you ask an LLM about best practices and how to organize your prompts, it will hint you towards this direction.

It’s surprisingly effective to ask LLMs to help you write prompts as well, i.e. all my prompt snippets were designed with help of an LLM.

I personally keep them all in an org-mode file and copy/paste them on demand in a ChatGPT chat as I prefer more “discussion”-style interactions, but the approach is the same.

abeindoria · 2025-05-25T04:10:47 1748146247

Hah. Same. I have a step by step "reasoning" agent that asks me for confirmation after each step (understanding of problem, solutions proposed, solutions selection, and final wrap) - just so it gets red back the previous prompts and answers rather than one word salad essay.

Works incredibly well, and I created it with its own help.

rcarmo · 2025-05-25T11:38:53 1748173133

It’s all about being organized: https://taoofmac.com/space/blog/2025/05/13/2230

conception · 2025-05-25T00:40:12 1748133612

https://github.com/jezweb/roo-commander has something like 1700 prompts in it with 50+ prompts modes. And it seems to work pretty well. For me at least. It’s task/session management is really well thought out.

Enginerrrd · 2025-05-25T13:22:21 1748179341

Wrangling LLM's is remarkably like wrangling interns in my experience. Except that the LLM will surprise you by being both much smarter and much dumber.

The more you can frame the problem with your expertise, the better the results you will get.

Retr0id · 2025-05-24T18:23:24 1748111004

The article cites a signal to noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating this part will be where the real wins are, so I'll be watching this closely.

Aurornis · 2025-05-25T02:03:34 1748138614

I’ve developed a few take-home interview problems over the years that were designed to be short, easy for an experienced developer, but challenging for anyone who didn’t know the language. All were extracted from real problems we solved on the job, reduced into something minimal.

Every time a new frontier LLM is released (excluding LLMs that use input as training data) I run the interview questions through it. I’ve been surprised that my rate of working responses remains consistently around 1:10 for the first pass, and often takes upwards of 10 rounds of poking to get it to find its own mistakes.

So this level of signal to noise ratio makes sense for even more obscure topics.

Aachen · 2025-05-25T14:48:01 1748184481

> challenging for anyone who didn’t know the language.

Interviewees don't get to pick the language?

If you're hiring based on proficiency in a particular tech stack, I'm curious why. Are there that many candidates that you can be this selective? Is the language so dissimilar that the uninitiated would need a long time to get up to speed? Does the job involve working on the language itself and so a specifically deep understanding is required?

tuetuopay · 2025-05-27T02:37:19 1748313439

> Interviewees don't get to pick the language?

For leetcode interviews, sure. Other than that, at least familiarity with the language is paramount, or with the same class of language.

ponector · 2025-05-26T17:09:18 1748279358

That is the market nowadays. Employers seek not only deep knowledge in particular language, but also particular libraries. If you cannot answer interview questions about implementation of some features - you are out.

limflick · 2025-05-25T22:53:19 1748213599

Aren't most interviews like this? Most dev openings I see posted mention the specific language who's expertise they're looking for and the number of years of experience needed working with said language as well.

It can be annoying, but manageable. I've never coded in Java for example, but knowing C#, C++ and Python I imagine it wouldn't be too hard to pick up.

Aachen · 2025-05-26T22:14:14 1748297654

Huh, okay. That's not how we run interviews but I guess it's at least a thing, even if not common around here that I've seen yet (I'm not super current on interview practices though)

Regarding the job ads, yes they'd describe the ideal candidate but I haven't the experience that the perfect candidate ever actually shows up. Like you say, knowing J, T and Z, the company is confident enough that you'll be able to quickly pick up dotting the Is and crossing the 7s

ngneer · 2025-05-25T12:36:25 1748176585

I do the same, but entry level problems that require healthy analysis. New frontier LLMs do not manage to do so well at all.

ianbutler · 2025-05-24T19:07:00 1748113620

We’ve been working on a system that increases signal to noise dramatically for finding bugs, we’ve at the same time been thoroughly benchmarking the entire popular software agents space for this

We’ve found a wide range of results and we have a conference talk coming up soon where we’ll be releasing everything publicly so stay tuned for that itll be pretty illuminating on the state of the space

Edit: confusing wording

sebmellen · 2025-05-24T19:31:43 1748115103

Interesting. This is for Bismuth? I saw your pilot program link — what does that involve?

ianbutler · 2025-05-24T20:38:55 1748119135

Yup! So we have multiple businesses working with us and for pilots its deploying the tool, providing feedback (we're connected over slack with all our partners for a direct line to us), and making sure the uses fit expectations for your business and working towards long term partnership.

We have several deployments in other peoples clouds right now as well as usage of our own cloud version, so we're flexible here.

tough · 2025-05-24T18:38:36 1748111916

I was thinking about this the other day, wouldn't it be feasible to make fine-tune or something like that into every git change, mailist, etc, the linux kernel has ever hard?

Wouldn't such an LLM be the closer -synth- version of a person who has worked on a codebase for years, learnt all its quirks etc.

There's so much you can fit on a high context, some codebases are already 200k Tokens just for the code as is, so idk

sodality2 · 2025-05-24T18:46:01 1748112361

I'd be willing to bet the sum of all code submitted via patches, ideas discussed via lists, etc doesn't come close to the true amount of knowledge collected by the average kernel developer's tinkering, experimenting, etc that never leaves their computer. I also wonder if that would lead to overfitting: the same bugs being perpetuated because they were in the training data.

antirez · 2025-05-25T05:41:08 1748151668

I bet automatic this part will be simple. In general LLMs that have a given semantical ability "X" to do some task, have greater than X ability to check, among N replies about doing the same task, which reply is the best, especially if via binary tournament like RAInk did (it was posted here a few weeks ago). There is also the possibility to use agreement among different LLMs. I'm surprised Gemini 2.5 PRO was not used here, in my experience it is the most powerful LLM to do that kind of stuff.

andix · 2025-05-24T18:45:56 1748112356

1:50 is a great detection ratio for finding a needle in a haystack.

epolanski · 2025-05-24T22:37:03 1748126223

I don't think the author agrees as he points out the bugs weren't that difficult to find.

Aachen · 2025-05-25T14:54:06 1748184846

Nah. I'm not an expert code auditor myself but I've seen my colleagues do it and I've seen ChatGPT try its hand. Even when I give it a specific piece of code and probe/hint in the right direction, it produces five paragraphs of vulnerabilities, none of which are real, while overlooking the one real concern we identified

You can spend all day reading slop or you can get good at this yourself and be much more efficient at this task. Especially if you're the developer and know where to look and how things work already, catching up on security issues relevant to your situation will be much faster than looking for this needle in the haystack that is LLM output

manmal · 2025-05-24T19:33:11 1748115191

If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically. It’s just quite expensive to do all that right now.

threeseed · 2025-05-24T21:51:33 1748123493

Except that in my experience half the time it will modify the implementation in order to make the tests pass.

And it will do this no matter how many prompts you try or you forcefully you ask it.

moyix · 2025-05-24T22:16:44 1748125004

With security vulnerabilities, you don't give the agent the ability to modify the potentially vulnerable software, naturally. Instead you make them do what an attacker would have to do: come up with an input that, when sent to the unmodified program, triggers the vulnerability.

How do you know if it triggered the vulnerability? Luckily for low-level memory safety issues like the ones Sean (and o3) found we have very good oracles for detecting memory safety, like KASAN, so you can basically just let the agent throw inputs at ksmbd until you see something that looks kind of like this: https://groups.google.com/g/syzkaller/c/TzmTYZVXk_Q/m/Tzh7SN...

klabb3 · 2025-05-25T02:07:42 1748138862

> If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically.

Designing and building meaningfully testable non-trivial software is orders of magnitude more complex than writing the business logic itself. And that’s if you compare writing greenfield code from scratch. Making an old legacy code base testable in a way conducive to finding security vulns is not something you just throw together. You can be lucky with standard tooling like sanitizers and valgrind but it’s far from a panacea.

quentinp · 2025-05-24T18:54:09 1748112849

Exactly. Many AI users can’t triage effectively, as a result open source projects get a lot of spam now: https://arstechnica.com/gadgets/2025/05/open-source-project-...

nialv7 · 2025-05-25T16:59:18 1748192358

maybe we ask the AI to come up with an exploit, run it and see if it works? then you can RL on this.

iandanforth · 2025-05-24T18:47:30 1748112450

The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!

seanheelan · 2025-05-25T09:21:48 1748164908

I realised I didn't mention it in the article, so in case you're curious it cost about $116 to run the 100k token version 100 times.

egorfine · 2025-05-26T09:21:55 1748251315

So, half that for batch processing [1], which presumably would be just fine for this task?

[1] https://platform.openai.com/docs/guides/batch

wyldfire · 2025-05-25T15:57:25 1748188645

How many years/generations behind o3 are the freely available / local models?

ramy_d · 2025-05-25T13:36:31 1748180191

thank you, I was going to ask about this. It's not a crazy amount...

Aachen · 2025-05-25T14:58:57 1748185137

Do we know how that relates to actual operating cost? My understanding is that this is below cost price because we're still in the investor hype part of the cycle where they're trying to capture market share by pumping many millions into these companies and projects

Does this really reflect the resource cost of finding this vulnerability?

remram · 2025-05-26T01:33:39 1748223219

It sounds like a crazy amount to me. I can run code analyzers/sanitizers/fuzzers on every commit to my repo at virtually no cost. Would they have caught a problem like this? Maybe not, certainly not without some amount of false positives. Still this LLM approach costs many millions of times more than previous tooling, and might still have brought up nothing (we just don't read the blog posts about those attempts).

JFingleton · 2025-05-24T21:31:41 1748122301

Zero days can go for $$$, or you can go down the bug bounty route and also get $$. The cost of the LLM would be a drop in the bucket.

When the cost of inference gets near zero, I have no idea what the world of cyber security will look like, but it's going to be a very different space from today.

yencabulator · 2025-05-25T16:23:56 1748190236

Except in this case the LLM was pointed at a known-to-exist vulnerability. $116 per handler per vulnerability type, unknown how many vulnerabilities exist.

roncesvalles · 2025-05-24T19:17:28 1748114248

A lot of money is all you need~

bbarnett · 2025-05-24T20:12:10 1748117530

A lot of burned coal, is what.

The "don't blame the victim" trope is valid in many contexts. This one application might be "hackers are attacking vital infrastructure, so we need to fund vulnerabilities first". And hackers use AI now, likely hacked into and for free, to discover vulnerabilities. So we must use AI!

Therefore, the hackers are contributing to global warming. We, dear reader, are innocent.

sdoering · 2025-05-24T20:38:49 1748119129

So basically running a microwave for about 800 seconds, or a bit more than 13 minutes per model?

Oh my god - the world is gonna end. Too bad, we panicked because of exaggerated energy consumption numbers for using an LLM when doing individual work.

Yes - when a lot of people do a lot of prompting, these 0ne tenth of a second to 8 seconds of running the microwave per prompt adds up. But I strongly suggest, that we could all drop our energy consumption significantly using other means, instead of blaming the blog post's author about his energy consumption.

The "lot of burned coal" is probably not that much in this blog post's case given that 1 kWh is about 0.12 kg coal equivalent (and yes, I know that we need to burn more than that for 1kWh. Still not that much, compared to quite a few other human activities.

If you want to read up on it, James O'Donnell and Casey Crownhart try to pull together a detailed account of AI energy usage for MIT Technology Review.[1] I found that quite enlightening.

[1]: https://www.technologyreview.com/2025/05/20/1116327/ai-energ...

XorNot · 2025-05-25T00:49:18 1748134158

The better answer is just "I don't care".

Because I definitely don't care. Energy expenditure numbers are always used in isolation, lest any one have to deal with anything real about them, and always are content to ignore the abstraction which electricity is - namely, electricity is not coal. It's electricity. Unlike say, driving my petrol powered car, the power for my computers might come from solar panels, coal, nuclear power stations, geothermal power hydro...

Which is to say, if people want to worry about electricity usage: go worry about it by either building more clean energy, or campaigning to raise electricity prices.

sdoering · 2025-05-25T09:57:52 1748167072

Funny, I actually care. But I try to direct my care towards the real culprits.

So about 50% of CO2 emissions in Germany come from 20 sources. The campaigns like personal footprint (invented by BP) are there to shift the blame to consumers. Away from those with the biggest impact and the most options for action.

So yes, I f**ng don’t care if a security researcher leaves his microwave equivalent running for a few minutes. But I care, campaign in the bigger sense and also orient my own consumption wherever possible towards cleaner options.

Full well knowing that even as mostly being reasonable in my consumption, I definitely belong to those 5-10% of earth's population who drive the problem. Because more than half of the population in the so called first world live according to the Paris Climate Agreement. And it’s not the upper half of.

Balooga · 2025-05-24T20:34:09 1748118849

Between $3k and $30k to solve a single ARC-AGI problem [1]. Not sure if "100 runs" makes this comparable.

[1] https://techcrunch.com/2025/04/02/openais-o3-model-might-be-...

mcbuilder · 2025-05-25T17:35:18 1748194518

I think it gave up trying to solve Pokemon. :) Seriously, aren't these ARC-AGI problems easy for most people? They usually involve some sort of pattern recognition and visual reasoning.

umbra07 · 2025-05-24T23:58:23 1748131103

And how do you know what the purely-human-driven energy expenditure would have been?

wongarsu · 2025-05-24T22:07:40 1748124460

How much longer would OP have needed to find the same vulnerability without LLM help? Then multiply that by the energy used to produce 2000kcal/day of food as well as the electricity for running their computer.

Usually LLMs come out far ahead in those types of calculations. Compared to humans they are quite energy efficient

topaz0 · 2025-05-25T03:32:15 1748143935

Those types of calculation are extremely disingenuous.

sadeshmukh · 2025-05-25T03:56:25 1748145385

What exactly is disingenuous about it?

topaz0 · 2025-05-25T21:50:24 1748209824

It reduces the value of a human life to the incremental rate at which they produce some concrete product. It is absurd.

sadeshmukh · 2025-05-26T04:27:12 1748233632

Or, it elevates the tasks artificial intelligence produces to the actual difficulty of them - the human effort.

topaz0 · 2025-05-27T11:39:02 1748345942

You're not thinking this through. Your human life (with its associated 2000 Cal/day) does so much more than find bugs in obscure codebases. Or at least, one would hope.

xyst · 2025-05-24T23:27:45 1748129265

"100 times for each of the models" represents a significant amount of energy burned. The achievement of finding the most common vulnerability in C based codebases becomes less of an achievement. And more of a celebration of decadence and waste.

We are facing global climate change event, yet continue to burn resources for trivial shit like it’s 1950.

geraneum · 2025-05-25T10:48:33 1748170113

This has become a common recurrence recently.

Have a problem with clear definition and evaluation function. Let LLM reduce the size of solution space. LLMs are very good at pattern reconstruction, and if the solution has a similar pattern to what was known before, it can work very well.

In this case the problem is a specific type of security vulnerability and the evaluator is the expert. This is similar in spirit to other recent endeavors where LLMs are used in genetic optimization; on a different scale.

Here’s an interesting read on “Mathematical discoveries from program search with large language models” which was I believe was also featured in HN the past:

https://www.nature.com/articles/s41586-023-06924-6

One small note, concluding that the LLM is “reasoning” about code just _based on this experiment_ is bit of a stretch IMHO.

antirez · 2025-05-25T12:40:08 1748176808

Either I'm very lucky or as I suspected Gemini 2.5 PRO can more easily identify the vulnerability. My success rate is so high that running the following prompt a few times is enough: https://gist.github.com/antirez/8b76cd9abf29f1902d46b2aed3cd...

firesteelrain · 2025-05-24T20:08:03 1748117283

I really hope this is legit and not what keeps happening to curl

[1] https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...

meander_water · 2025-05-25T03:05:26 1748142326

I'm not sure about the assertion that this is the first vulnerability found with an LLM. For e.g. OSS-Fuzz [0] has found a few using fuzzing, and Big Sleep using an agent approach [1].

[0] https://security.googleblog.com/2024/11/leveling-up-fuzzing-...

[1] https://googleprojectzero.blogspot.com/2024/10/from-naptime-...

seanheelan · 2025-05-25T09:07:31 1748164051

It's certainly not the first vulnerability found with an LLM =) Perhaps I should have been more clear though.

What the post says is "Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not referenced counted is freed while still being accessible by another thread. As far as I'm aware, this is the first public discussion of a vulnerability of that nature being found by a LLM."

The point I was trying to make is that, as far as I'm aware, this is the first public documentation of an LLM figuring out that sort of bug (non-trivial amount of code, bug results from concurrent access to shared resources). To me at least, this is an interesting marker of LLM progress.

empath75 · 2025-05-24T19:21:43 1748114503

Given the value of finding zero days, pretty much every intelligence agency in the world is going to be pouring money into this if it can reliably find them with just a few hundred api calls. Especially if you can fine tune a model with lots of examples, which I don't think open ai, etc are going to do with any public api.

treebeard901 · 2025-05-24T21:56:34 1748123794

Yeah, the amount of engineering they have around controlling (censoring) the output, along with the terms of service, creates an incentive to still look for any possible bugs, but not allow it in the output.

Certainly for Govt agencies and others this will not be a factor. It is just for everyone else. This will cause people to use other models and agents without these restrictions.

It is safe to assume that a large number of vulnerabilities exist in important software all over the place. Now they can be found. This is going to set off arms race game theory applied to computer security and hacking. Probably sooner than expected...

stonepresto · 2025-05-25T10:16:47 1748168207

I know there were at least a few kernel devs who "validated" this bug, but did anyone actually build a PoC and test it? It's such a critical piece of the process yet a proof of concept is completely omitted? If you don't have a PoC, you don't know what sort of hiccups would come along the way and therefore can't determine exploitability or impact. At least the author avoided calling it an RCE without validation.

But what if there's a missing piece of the puzzle that the author and devs missed or assumed o3 covered, but in fact was out of o3's context, that would invalidate this vulnerability?

I'm not saying there is, nor am I going to take the time to do the author's work for them, rather I am saying this report is not fully validated which feels like a dangerous precedent to set with what will likely be an influential blog post in the LLM VR space moving forward.

IMO the idea of PoC || GTFO should be applied more strictly than ever before to any vulnerability report generated by a model.

The underlying perspective that o3 is much better than previous or other current models still remains, and the methodology is still interesting. I understand the desire and need to get people to focus on something by wording it a specific way, it's the clickbait problem. But dammit, do better. Build a PoC and validate your claims, don't be lazy. If you're going to write a blog post that might influence how vulnerability researchers conduct their research, you should promote validation and not theoretical assumption. The alternative is the proliferation of ignorance through false-but-seemingly-true reporting, versus deepening the community's understanding of a system through vetted and provable reports.

seanheelan · 2025-05-25T13:33:45 1748180025

Hi, author here. Yes, I built a PoC. Yes, it triggered a KASAN report/crash.

stonepresto · 2025-05-25T13:42:35 1748180555

Thank you! I'm really happy to hear you did that. But why not mention that in your blog post? I understand not wanting to include a PoC for responsible disclosure reasons, but including it would have added a lot of credibility to your work for assholes like me lol

seanheelan · 2025-05-25T13:52:03 1748181123

I honestly hadn’t anticipated someone would think I hadn’t bothered to verify the vulnerability is real ;)

Since you’re interested: the bug is real but it is, I think, hard to exploit in real world scenarios. I haven’t tried. The timing you need to achieve is quite precise and tight. There are better bugs in ksmbd from an exploitation point of view. All of that is a bit of a “luxury problem” from the PoV of assessing progress in LLM capabilities at finding vulnerabilities though. We can worry about ranking bugs based on convenience for RCE once we can reliably find them at all.

stonepresto · 2025-05-25T14:11:18 1748182278

I'm too much of a skeptic to not do so lol. Great post though overall, don't let my assholery dissuade you! I was pleasantly surprised that it was actually a researcher behind the news story and there was some real evidence / scientific procedure. I thought you had a lot of good insights into how to use LLMs in the VR space specifically, and I'm glad you did benchmarking. It's interesting to see how they're improving.

Yeah race conditions like that are always tricky to make reliable. And yeah I do realize that the purpose of the writeup was more about the efficacy of using LLMs vs the bug itself, and I did get a lot out of that part, I just hyper-focused on the bug because it's what I tend to care the most about. In the end I agree with your conclusion, I believe LLMs are going to become a key part of the VR workflow as they improve and I'm grateful for folks like yourself documenting a way forward for their integration.

Anyways, solid writeup and really appreciate the follow-up!

lyu07282 · 2025-05-25T10:50:16 1748170216

Are you saying you want PoCs that trigger a crash from the use-after-free or you would only be satisfied by full on RCE PoCs?

stonepresto · 2025-05-25T13:38:08 1748180288

PoCs should at least trigger a crash, overwrite a register, or have some other provable effect, the point being to determine:

1) If it is actually a UAF or if there is some other mechanism missing from the context that prevents UAF. 2) The category and severity of the vulnerability. Is it even a DoS, RCE, or is the only impact causing a thread to segfault?

This is all part of the standard vulnerability research process. I'm honestly surprised it got merged in without a PoC, although with high profile projects even the suggestion of a vulnerability in code that can clearly be improved will probably end up getting merged.

lyu07282 · 2025-05-25T15:26:57 1748186817

Even a rudimentary exploit can be a significant time investment, it is absolutely not common practice to develop, publish or to demand such exploits from researchers to demonstrate memory corruption vulnerabilities. Everyone thinks they are an expert in infosec its so funny.

stonepresto · 2025-05-25T21:41:56 1748209316

Well, in another subthread the author said he did in fact make a crashing PoC. I guess it depends on the customer's standards, but I would say in the vast majority of cases (especially for nuanced memory corruptions in which the ability to make something exploitable depends on your ability to demonstrate control of the heap) a crashing PoC is the bare minimum. In most VDPs, BBPs, or red team engagements you are required to provide some sort of proof to claim, otherwise you'll be laughed out of the room.

I'm curious which sector of infosec you're referring to in which vulnerability researchers are not required to provide proofs of concept? Maybe internal product VR where there is already an established trust?

simonw · 2025-05-24T21:02:10 1748120530

There's a beautiful little snippet here that perfectly captures how most of my prompt development sessions go:

> I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I’d like it to help, so here we are. In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.

logifail · 2025-05-24T18:42:18 1748112138

My understanding is that ksmbd is a kernel-space SMB server "developed as a lightweight, high-performance alternative" to the traditional (user-space) Samba server...

Q1: Who is using ksmbd in production?

Q2: Why?

donnachangstein · 2025-05-24T19:16:23 1748114183

1. People that were using the in-kernel SMB server in Solaris or Windows.

2. Samba performance sucks (by comparison) which is why people still regularly deploy Windows for file sharing in 2025.

Anybody know if this supports native Windows-style ACLs for file permissions? That is the last remaining reason to still run Solaris but I think it relies on ZFS to do so.

Samba's reliance on Unix UID/GID and the syncing as part of its security model is still stuck in the 1970s unfortunately.

The caveat is the in-kernel SMB server has been the source of at least one holy-shit-this-is-bad zero-day remote root hole in Windows (not sure about Solaris) so there are tradeoffs.

raverbashing · 2025-05-24T19:41:23 1748115683

> Samba's reliance on Unix UID/GID and the syncing as part of its security model is still stuck in the 1970s unfortunately.

Sigh. This is why we can't have nice things

Like yeah having smb in kernel is faster but honestly it's not fundamentally faster. But it seems the will to make samba better isn't there

AshamedCaptain · 2025-05-24T20:15:17 1748117717

Licensing. Samba is GPLv3, Linux is only GPLv2.

pixl97 · 2025-05-24T18:44:51 1748112291

I would assume for the reason of being lightweight and high performance?

foobar10000 · 2025-05-24T18:49:01 1748112541

Smb over 25gbit networks - user space samba is much worse there.

Henchman21 · 2025-05-24T18:53:43 1748112823

This is interesting to me! I regularly deploy 25G network connections, but I don’t think we’d run SMB over that. I am super curious the industry and use case if you’re willing to share!

hackernudes · 2025-05-24T19:05:06 1748113506

"SMB Direct" is RDMA based and ksmbd supports it. Samba does not. Disclaimer: I have not used it but was looking it up just yesterday.

Henchman21 · 2025-05-24T21:57:03 1748123823

Appreciated, thank you.

tinco · 2025-05-25T09:32:17 1748165537

I ran SMB over a 20gbit network (2x 10gbit). The use case was 3D rendering (photogrammetry specifically). There were multiple render nodes, and a central service coordinating the rendering process. The projects would be on SSD's on the central SMBD server, and after they were manually configured (using Agisoft Metashape) they'd be rendered. Projects would sometimes start as tens of gigabytes worth of photos, and the artifacts (including intermediates) would balloon into the hundreds of gigabytes, we'd have dozens of these projects per week.

I researched quite extensively prior to landing on SMB, but it really seems like there isn't a better way of doing this. The environment was mixed windows/linux, but if there was a better pure linux solution I would've pushed our office staff to switch to Ubuntu.

noname120 · 2025-05-24T20:42:58 1748119378

The same reason people use kmod-trelay instead of relayd I guess

zielmicha · 2025-05-24T14:43:38 1748097818

(To be clear, I'm not the author of the post, the title just starts with "How I")

jp0001 · 2025-05-25T10:20:44 1748168444

We followed a very similar approach at work, created a test harness and tested all the models available in AWS bedrock and the OpenAI. We created our own code challenges not available on the Internet for training with vulnerable and non-vulnerable inline snippets and more contextual multi-file bugs. We also used 100 tests per challenge - I wanted to do 1000 test per challenge but realized that these models are not even close to 2 Sigma in accuracy! Overall we found very similar results. But, we were also able to increase accuracy using additional methods - which comes as additional costs. The issue I see overall is that we found is when dealing with large codebases you'll need to put blinders on the LLMs to shorten context windows so that hallucinated results are less likely to happen. The worst thing would be to follow red herrings - perhaps in 5 years we'll have models used for more engineering specific tasks that can be rated with Six Sigma accuracy if posed with the same questions and problems sets.

bandrami · 2025-05-25T12:08:13 1748174893

The blinders give you a problem in that a lot of security issues aren't at a single point in the code but at where two remote points in the code interact.

jp0001 · 2025-05-26T14:22:04 1748269324

Correct. Dynamic runtime interactions will always be a hard problem as it’s hard to see in static code even for humans.

eqvinox · 2025-05-24T23:41:18 1748130078

Anyone else feel like this is a best case application for LLMs?

You could in theory automate the entire process, treat the LLM as a very advanced fuzzer. Run it against your target in one or more VMs. If the VM crashes or otherwise exhibits anomalous behavior, you've found something. (Most exploits like this will crash the machine initially, before you refine them.)

On one hand: great application for LLMs.

On the other hand: conversely implies that demonstrating this doesn't mean that much.

paulddraper · 2025-05-25T06:36:52 1748155012

https://security.googleblog.com/2024/11/leveling-up-fuzzing-...

eqvinox · 2025-05-25T17:04:34 1748192674

I mean, yes, they're doing it, but my question was really whether people share my belief that it's a particularly well-fitting application ;)

(Also yeah feels like the "FIRST!!1!eleven" thing metastasized from comment sections into C-level executives…)

ngneer · 2025-05-25T12:48:46 1748177326

https://news.ycombinator.com/item?id=42017771

Meh.

paulddraper · 2025-05-25T16:45:35 1748191535

That seems to be really preoccupied with who was first, without looking at the magnitude of the results, which is far from "meh."

ngneer · 2025-05-25T18:52:44 1748199164

I think it was more a PoC. I would be more impressed if it was deployed in production. "we want to reiterate that these are highly experimental results". If the dividends are massive, would they not deploy it in production and tell the world about it?

KTibow · 2025-05-24T19:40:50 1748115650

> With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log.

This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see if using the official thinking mechanism gives it enough space to get differing results.

gizmodo59 · 2025-05-24T20:41:11 1748119271

Having tried both I’d say o3 is in a league of it’s own compared to 3.7 or even Gemini 2.5 pro. The benchmarks may show not a lot of gain but that matters a lot when the task is very complex. What’s surprising is that they announced it last November and only now it’s released a month back now? (I’m guessing lots of safety took time but no idea). Can’t wait for o4!

dieortin · 2025-05-25T10:19:58 1748168398

All your content threads from the past months consist on you saying how much better OpenAI products are than the competition, so that doesn’t inspire a ton of trust.

gizmodo59 · 2025-05-25T11:48:15 1748173695

Because in my use cases they are? Coding and math, science research are my primary use cases and codex with o3 and o3 consistently outperforms others in complex tasks for me. I can’t say a model is better just to appeal to HN. If another model is as good as o3 id use that in a second.

sothatsit · 2025-05-26T00:33:07 1748219587

I also feel similarly. o3 feels quite distinct in what it is good at compared to other models.

For example, I think 2.5 Pro and Claude 4 are probably better at programming. But, for debugging, or not-super-well-defined reasoning tasks, or even just as a better search, o3 is in a league of its own. It feels like it can do a wider breadth of tasks than other models.

iamdanieljohns · 2025-05-24T22:18:13 1748125093

Could you provide some links to relevant work/research on using a "scratchpad" that you liked?

KTibow · 2025-05-24T22:49:06 1748126946

I'm not much of an ML engineer but I can point you to the original chain of thought paper [0] and Anthropic's docs on how to enable their official thinking scratchpad [1].

[0] https://arxiv.org/pdf/2201.11903

[1] https://docs.anthropic.com/en/docs/build-with-claude/extende...

resiros · 2025-05-25T10:43:15 1748169795

I think an approach like AlphaEvolve is very likely to work well for this space.

You've got all the elements for a successful optimization algorithm: 1) A fast and good enough sampling function + 2) a fairly good energy function.

For 1) this post shows that LLMs (even unoptimized) are quite good at sampling candidate vulnerabilities in large code bases. A 1% accuracy rate isn't bad at all, and they can be made quite fast (at least very parallelizable).

For 2) theoretically you can test any exploit easily and programmatically determine if it works. The main challenge is getting the energy function to provide gradient—some signal when you're close to finding a vulnerability/exploit.

I expect we'll see such a system within the next 12 months (or maybe not, since it's the kind of system that many lettered agencies would be very interested in).

martinald · 2025-05-24T20:54:41 1748120081

I think this is the biggest alignment problem with LLMs in the short term imo. It is getting scarily good at this.

I recently found a pretty serious security vulnerability in an open source very niche server I sometimes use. This took virtually no effort using LLMs. I'm worried that there is a huge long tail of software out there which wasn't worth finding vulnerabilities in for nefarious means manually but if it was automated could lead to really serious problems.

tekacs · 2025-05-24T20:57:29 1748120249

The (obvious) flipside of this coin is that it allows us to run this adversarially against our own codebases, catching bugs that could otherwise have been found by a researcher, but that we can instead patch proactively.\

I wouldn't (personally) call it an alignment issue, as such.

Legend2440 · 2025-05-24T21:00:47 1748120447

If attackers can automatically scan code for vulnerabilities, so can defenders. You could make it part of your commit approval process or scan every build or something.

martinald · 2025-05-25T05:34:01 1748151241

A lot of this code isn't updated though. Think of how many abandoned wordpress plugins there are (for example). So the defenders could, but how do they get that code to fix it?

I agree after time you end up with a steady state but in the short medium term the attackers have a huge advantage.

roywiggins · 2025-05-25T03:11:07 1748142667

Is it an alignment problem if it's doing what was asked of it? It's "aligned" with a human's wishes.

bongodongobob · 2025-05-25T03:06:58 1748142418

It's a moot point unless attackers have better LLMs don't have access to.

dboreham · 2025-05-24T21:00:55 1748120455

I feel like our jobs are reasonably secure for a while because the LLM didn't immediately say "SMB implemented in the kernel, are you f-ing joking!?"

gerdesj · 2025-05-25T02:03:02 1748138582

I'll have to get my facts straight but I'm pretty sure that ksmbd is ... not used much (by me).

https://lwn.net/Articles/871866/ This is also nothing to do with Samba which is a well trodden path.

So why not attack a codebase that is rather more heavily used and older? Why not go for vi?

usr1106 · 2025-05-25T04:38:12 1748147892

Good link. After reading this it's not a surprise that this code has security vulnerabilities. But of course from knowing that there must be more to actually finding it, it's still a big leap.

4 years after the article, does any relevant distro have that implementation enabled?

zison · 2025-05-24T14:38:29 1748097509

Very interesting. Is the bug it found exploitable in practice? Could this have been found by syzkaller?

mdaniel · 2025-05-24T16:12:14 1748103134

I case anyone else didn't recognize that word: https://github.com/google/syzkaller

mezyt · 2025-05-24T18:54:43 1748112883

Meanwhile, as a maintainer, I've been reviewing more than a dozen false positives slop CVEs in my library and not a single one found an actual issue. This article's is probably going to make my situation worse.

SamuelAdams · 2025-05-24T19:50:58 1748116258

Maybe, but the author is an experienced vulnerability analyst. Obviously if you get a lot of people who have no experience with this you may get a lot of sloppy, false reports.

But this poster actually understands the AI output and is able to find real issues (in this case, use-after-free). From the article:

> Before I get into the technical details, the main takeaway from this post is this: with o3 LLMs have made a leap forward in their ability to reason about code, and if you work in vulnerability research you should start paying close attention. If you’re an expert-level vulnerability researcher or exploit developer the machines aren’t about to replace you. In fact, it is quite the opposite: they are now at a stage where they can make you significantly more efficient and effective.

tecleandor · 2025-05-25T09:02:02 1748163722

Not even that. The author already knew the bug was there, and fed the LLM just the files related to the bug, with the explanation on how the methods worked and where to search, and even then, only 1 out of 100 times did it find the bug.

sweetjuly · 2025-05-25T17:13:31 1748193211

There are two bugs in the article: one the author previously knew about and was trying to rediscover as an exploration as well as a second the author did not know about and stumbled into. The second bug is novel, and is what makes the blog post interesting.

baq · 2025-05-24T21:10:08 1748121008

probably not. o3 is not free to use.

whbrown · 2025-05-25T01:02:35 1748134955

Who says you need to use a top model to produce cybersecurity slop? Did this person use o3?

https://hackerone.com/reports/3125832

mehulashah · 2025-05-25T13:11:40 1748178700

The scary part of this is that the bad guys are doing the same thing. They’re looking for zero day exploits, and their ability to find them just got better. More importantly, it’s now almost automated. While the arms race will always continue, I wonder if this change of speed hurts the good guys more than the bad guys. There are many of these, and they take time to fix.

1oooqooq · 2025-05-25T12:22:39 1748175759

i can ask offline o3 about that cve and get a reply, does that mean the author used a model that knew about the vulnerability?

akomtu · 2025-05-24T19:22:09 1748114529

This made me think that the near future will be LLMs trained specifically on Linux or another large project. The source code is a small part of the dataset fed to LLMs. The more interesting is runtime data flow, similar to what we observe in a debugger. Looking at the codebase alone is like trying to understand a waterfall by looking at equations that describe the water flow.

baq · 2025-05-24T21:11:12 1748121072

it needs to be trained on on enough TLA+ traces, too.

qoez · 2025-05-25T10:37:13 1748169433

This is why AI safety is going to be impossible. This easily could have been a bad actor who would use this finding for nefarious acts. A person can just lie and there really isn't any safety finetuning that would let it separate the two intents.

theptip · 2025-05-24T23:05:55 1748127955

This is a great case study. I wonder how hard o3 would find it to build a minimal repro for these vulns? This would of course make it easier to identify true positives and discard false positives.

This is I suppose an area where the engineer can apply their expertise to build a validation rig that the LLM may be able to utilize.

jobswithgptcom · 2025-05-24T18:58:25 1748113105

Wow, interesting. I been hacking a tool called https://diffwithgpt.com with a similar angle but indexing git changelogs with qwen to have it raise risks for backward compat issues, risks including security when upgrading k8s etc.

baby · 2025-05-24T23:59:30 1748131170

I have a presentation here on doing it to target zk bugs https://youtu.be/MN2LJ5XBQS0?si=x3nX1iQy7iex0K66