
A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for the system prompt, background information, and auxiliary instructions [1], and then running them through `llm`.

It reveals how good LLM use, like the use of any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for the best results.

[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899
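
Roughly, the setup amounts to keeping each concern in its own .prompt file and stitching them together at call time. Here's a minimal sketch of the idea using the `llm` library's Python API rather than the author's exact CLI invocation (file and model names are made up for illustration):

  import llm
  from pathlib import Path

  # Each concern lives in its own file (names are illustrative, not the author's).
  system = Path("system.prompt").read_text()
  background = Path("background.prompt").read_text()
  instructions = Path("audit_instructions.prompt").read_text()

  # Assumes an o3-capable plugin is installed and an API key is configured.
  model = llm.get_model("o3")
  response = model.prompt(background + "\n\n" + instructions, system=system)
  print(response.text())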






I find your take amusing considering that's literally the only part of the post where he admits to just vibing it:

> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering


A good engineer can vibe good engineering plans!

Just like Eisenhower's famous "plans are useless, planning is indispensable" quote. The muscle you build is creating new plans, not memorizing them.


People also underestimate how much winging it is actually the ideal approach for a natural language interface, since that's the kind of thing it was trained on anyway.

The difference between vibing and "engineering" is keeping good records, logs, and prompt provenance in a methodical way? Also having a (manual) way of reviewing the results. :) (paraphrased from mythbusters)

as the mythbusters have famously said, the only difference between science and fucking around is writing it down.

One person’s vibe is another person’s dream? In my mind, the person is able to formulate a mental model complete enough to even go after a vuln, unlike me; I wouldn’t have even considered thinking about it.

How do we benchmark these different methodologies?

It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?


The author is up front about the limitations of their prompt. They say

> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.


The author seems to downplay his own expertise and attribute it to the LLM, while at the same time admitting he's vibe-prompting the LLM, dismissing wrong results, and hyping the ones that happen to work out for him.

This seems more like wishful thinking and fringe stuff than CS.


Science starts at the fringe with a "that's interesting"

The interesting thing here is the LLM can come to very complex correct answers some of the time. The problem space of understanding and finding bugs is so large that this isn't just chance; it's not like flipping a coin.

The issue for any particular user is that the amount of testing required to turn this into science is really massive.


I think there are two aspects to LLM usage:

1. Having workflows to be able to provide meaningful context quickly. Very helpful.

2. Arbitrary incantations.

I think No. 2 may provide some random amount of value with one model and not another, but as a practitioner you shouldn't need to worry about it long-term. The patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is at.

As an example, as a systems grad student I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy-paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about some design discussion that I want archived, I ask it to emit markdown which I then copy-paste into the wiki. It's not perfectly organized, but it keeps the key bits there and makes generating papers etc. that much easier.
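
Concretely it's nothing fancier than gluing the wiki into one context block before the first message. A rough sketch, assuming the `llm` Python API and a flat directory of markdown pages (paths, model name, and the question are placeholders):

  import llm
  from pathlib import Path

  # Concatenate every wiki page into one big context block.
  wiki = "\n\n".join(p.read_text() for p in sorted(Path("wiki").glob("*.md")))

  model = llm.get_model("gpt-4o")  # whichever model you normally talk to
  convo = model.conversation()
  # Prime the conversation with the wiki, then ask the actual question.
  convo.prompt("Project wiki, for context:\n\n" + wiki).text()
  print(convo.prompt("Given the wiki, critique the locking design in module X.").text())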


> ksmbd has too much code for it all to fit in your context window in one go. Therefore you are going to audit each SMB command in turn. Commands are handled by the __process_request function from server.c, which selects a command from the conn->cmds list and calls it. We are currently auditing the smb2_sess_setup command. The code context you have been given includes all of the work setup code up to the __process_request function, the smb2_sess_setup function and a breadth first expansion of smb2_sess_setup up to a depth of 3 function calls.

The author deserves more credit here than just "vibing".
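
The depth-limited breadth-first expansion is doing a lot of the work there: it bounds how much of ksmbd lands in the window while still handing the model the callees it needs. A minimal sketch of that context-gathering step, assuming you already have a call graph and a way to fetch function source (both helpers here are made up, not the author's tooling):

  from collections import deque

  def gather_context(call_graph, get_source, entry, max_depth=3):
      """Collect the source of `entry` plus everything reachable
      within `max_depth` calls, breadth-first."""
      seen = {entry}
      queue = deque([(entry, 0)])
      chunks = []
      while queue:
          func, depth = queue.popleft()
          chunks.append(get_source(func))
          if depth == max_depth:
              continue
          for callee in call_graph.get(func, []):
              if callee not in seen:
                  seen.add(callee)
                  queue.append((callee, depth + 1))
      return "\n\n".join(chunks)

  # e.g. gather_context(graph, read_func_source, "smb2_sess_setup", max_depth=3)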


I usually like fear, shame and guilt based prompting: "You are a frightened and nervous engineer that is very weary about doing incorrect things so you tread cautiously and carefully, making sure everything is coherent and justifiable. You enjoy going over your previous work and checking it repeatedly for accuracy, especially after discovering new information. You are self-effacing and responsible and feel no shame in correcting yourself. Only after you've come up with a thorough plan ... "

I use these prompts everywhere. I get significantly better results mostly because it encourages backtracking and if I were to guess, enforces a higher confidence threshold before acting.

The expert engineering ones usually end up creating mountains of slop, refactoring things, and touching a bunch of code it has no business messing with.

I also have used lazy prompts: "You are positively allergic to rewriting anything that already exists. You have multiple mcps at your disposal to look for existing solutions and thoroughly read their documentation, bug reports, and git history. You really strongly prefer finding appropriate libraries instead of maintaining your own code"


Should be "wary".

oh interesting, I somehow survived 42 years and didn't know there were 2 words there. I'll check my prompts and give it a go. Thanks.

I'd be weary of the model doing incorrect things too. Nice prompt though! I'll try it out in Roo soon.

Now I wonder how the model reasons between the two words in that black box of theirs.


I was coding a chat bot with an agent like everyone else at https://github.com/day50-dev/llmehelp and I called the agent "DUI" mode because it's funny.

However, as I was testing it, it would do reckless and irresponsible things. After I changed it, as far as what gets communicated to the bot, to "Do-Ur-Inspection" mode, it became radically better.

None of the words you give it are free from consequences. It didn't just discard the "DUI" name as a mere title and move on. Fascinating lesson.


[flagged]


yeah I removed it. I grew up Catholic, went to Catholic school, was an altar boy, and spent decades in the church, but people reading it don't know this.

The point is that when you instruct it that it's some kind of god-like expert, that's part of the reason it keeps refusing your prompts, redoing mistakes despite every insistence by you to the contrary. After all, what do you know? It's the expert here!

When you use this approach in cline/roo, it stops going in and moving shit around when you just ask it questions.


As a former altar boy from before there were altar girls, can you uncensor?

I refer to the first method as "catholic prompting" - shame, fear and guilt.

As someone from a traditional Boston Catholic family who graduated from Catholic grade and high school and who has since moved away from religion but still has a lot of family and friends who are Catholic, the fact that someone found the idea that Catholics are prone to shame, fear and guilt offensive almost makes me doubt they are Catholic.

I've yet to meet one Catholic IRL who wouldn't have a laugh about that, regardless of the current state of their faith.


I think the proper thing to do if someone is offended is "alright sure, whatever, there you go".

Others being offended isn't something you control; responding to it is.


Giving in to people who are truly unreasonably offended (by proxy, for social validation, and so on) rewards and incentivizes the behavior, and in fact I believe you have an ethical obligation not to allow people to do this. Promoting antisocial behavior is antisocial.

It's worth having some grace about it though.


To be fair, gundmc's original comment was rather prosocial as these things go:

>I find this use of "Catholic" pretty offensive and distasteful.

They didn't claim their preferences were universal. Nor did they attempt any personal attack on the person they were responding to. They simply described their emotional reaction and left it at that.

If everyone's initial default was the "gundmc approach" when they were offended, the internet would be a much nicer place :-)

So yeah, as far as I'm concerned, everyone in this comment chain is simply lovely :-)


I'm Catholic and it's fine.

> Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

You just described one critical aspect of engineering: discovering a property of a system and feeding that knowledge back into a systematic, iterative process of refinement.


I can't think of many engineering disciplines that do things this way. "This seems to work, I don't know how or why it works, I don't even know if it's possible to know how or why it works, but I will just apply this moving forward, crossing my fingers that in future situations it will work by analogy."

If the act of discovery and iterative refinement makes prompting an engineering discipline, then is raising a baby also an engineering discipline?


Lots of engineering disciplines work this way. For instance, materials science is still crude, we don't have perfect theories for why some materials have the properties they do (like concrete or superconductors), we simply quantify what those properties are under a wide range of conditions and then make use of those materials under suitable conditions.

> then is raising a baby also an engineering discipline?

The key to science and engineering is repeatability. Raising a baby is an N=1 trial, no guarantees of repeatability.


I think the point is that it's more about trial and error, and less about blindly winging it. When you don't know how a system seems to work, you latch on to whatever seems to initially work and proceed from there to find patterns. It's not an entire approach to engineering, just a small part of the process.

Listen to a video made by Karpathy about LLMs; he explains why made-up HTML tags work. It's to help the tokenizer.

I recall this even being in the Anthropic documentation.

Here, found it:

  > Use XML tags to structure your prompts

  > There are no canonical “best” XML tags that Claude has been trained with in particular, although we recommend that your tag names make sense with the information they surround.
https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
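
In practice that just means wrapping each kind of input in a descriptive tag so the model can tell instructions from data. A toy example of building such a prompt (tag names are arbitrary, per the docs above, and the input file is a placeholder):

  from pathlib import Path

  function_source = Path("smb2_sess_setup.c").read_text()  # placeholder input
  prompt = f"""<instructions>
  Audit the code below for memory-safety issues only.
  Report a finding only if you can point to the exact line.
  </instructions>

  <code>
  {function_source}
  </code>"""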

My guess would be there is enough training material that merely tagging something is enough to get a better SNR.

Could not find it. Can you please provide a link?

https://youtu.be/7xTGNNLPyMI?si=eaqVjx8maPtl1STJ

He shows how the prompt is parsed, etc. Very nice and eye-opening. Also superstition-dispelling.


It’s not that difficult to benchmark these things, e.g. have an expected result and a few variants of templates.

But yeah prompt engineering is a field for a reason, as it takes time and experience to get it right.

A problem with LLMs as well is that they're inherently probabilistic, so sometimes they'll just choose an answer with a super low probability. We'll probably get better at this in the next few years.
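
For what it's worth, "not that difficult" can be as small as running each template variant against a handful of cases with known answers and counting hits, repeated a few times to average out the sampling noise. A toy sketch (the cases and the `ask_model` call are stand-ins for whatever client and eval set you actually have):

  import statistics

  TEMPLATES = {
      "plain": "Find any memory-safety bug in:\n{code}",
      "expert": "You are an expert auditor. Find any memory-safety bug in:\n{code}",
  }

  def score(template, cases, ask_model, trials=5):
      """Average fraction of known-bug cases the model flags across trials."""
      runs = []
      for _ in range(trials):
          hits = sum(
              case["expected"] in ask_model(template.format(code=case["code"]))
              for case in cases
          )
          runs.append(hits / len(cases))
      return statistics.mean(runs)

  # for name, tpl in TEMPLATES.items():
  #     print(name, score(tpl, cases, ask_model))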


How do you benchmark different ways to interact with employees? Neural networks are somewhere between opaque and translucent to inspection, and your only interface with them is language.

Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.


I think we agree. Interacting with employees is not an engineering discipline, and neither is prompting.

I'm not objecting to the incantations or the vibes per se. I'm happy to use AI and try different methods to get the results I want. I just don't understand the claims that prompting is a type of engineering. If it were, then you would need benchmarks.


It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

Those prompts should be renamed as hints. Because that's all they are. Every LLM today ignores prompts if they conflict with its sole overarching goal: to give you an answer no matter whether it's true or not.


> Those prompts should be renamed as hints. [...] its sole overarching goal: to give you an answer no matter whether it's true or not.

I like to think of them as beginnings of an arbitrary document which I hope will be autocompleted in a direction I find useful... By an algorithm with the overarching "goal" of Make Document Bigger.


You’re confusing engineering with maths. You engineer your prompting to maximize the chance the LLM does what you need - in your example, the true answer - to get you closer to solving your problem. It doesn’t matter what the LLM does internally as long as the problem is being solved correctly.

(As an engineer it’s part of your job to know if the problem is being solved correctly.)


Maybe very very soft "engineering". Do you have metrics on which prompt is best? What units are you measuring this in? Can you follow a repeatable process to obtain a repeatable result?

> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

You invoke "engineering principles", but software engineers constantly trade in likelihoods, confidence intervals, and risk envelopes. Using LLMs is no different in that respect. It's not rocket science. It's manageable.


But the threshold between correct and incorrect inference is dependent on an intersection of the model and the document so far. That is not manageable by definition, I mean... It is a chaotic system.

Is this dissimilar to what the human brain produces? Are we not producing chaos controlled by wanting to give the right answer?

Yes, it is very dissimilar. Life isn't a sum of discrete inputs. I mean, maybe it is at times, but the context is several orders of magnitude greater, the inputs several orders of magnitude more numerous, etc.; the theory that it can be quantified like this is unproven, let alone a good basis for an artificial system.

> but software engineers constantly trade in likelihoods, confidence intervals, and risk envelopes

Software engineering is mostly about dealing with human limitations (both the writer of the code and its readers). So you have principles like modularization and cohesion, which exist for the people working on the code, not the computer. We also have tests, which are an imperfect but economical approach to ensuring the correctness of the software. Every design decision can be justified or argued, and the outcome can be predicted and weighed. You're not cajoling a model to get results. You take a decision and just do it.


Engineering principles are probably the best we've got when it comes to trying to work with a poorly understood system? That doesn't mean they'll work necessarily, but...

> Engineering principles are probably the best we've got when it comes to trying to work with a poorly understood system?

At its heart, that's all engineering principles exist to do: allow us to extract useful value, and hopefully predictable outcomes, from systems that are either poorly understood or too expensive to economically characterise. Engineering is more or less the science of “good enough”.

There’s a reason why computer science, and software engineering are two different disciplines.


From "Modern Software Engineering" by David Farley

> Software engineering is the application of an empirical, scientific approach to finding efficient, economic solutions to practical problems in software.

> The adoption of an engineering approach to software development is important for two main reasons. First, software development is always an exercise in discovery and learning, and second, if our aim is to be “efficient” and “economic,” then our ability to learn must be sustainable.

> This means that we must manage the complexity of the systems that we create in ways that maintain our ability to learn new things and adapt to them.

That is why I don't care about LLMs per se, but their usage is highly correlated with the user's wish not to learn anything, just to get some answer, even an incorrect one, as long as it passes the evaluation process (compilation, review, CI tests, ...). If the usage is to learn, I don't have anything to say.

As for efficient and economical solutions that can be found with them,...


I think you’re being a little over-critical of LLMs. They certainly have their issues, and most assuredly people often use them inappropriately. But it is rather intellectually lazy to declare that because many people use LLMs inappropriately, they can’t offer real value.

I’ve personally found them extremely useful to test and experiment new ideas. Having an LLM throw together a PoC which would have taken me an hour to create, in less than 5mins, is a huge time saver. Makes it possible to iterate through many more ideas and test my understanding of systems far more efficiently than doing the same by hand.


Maybe that’s alien to me because I don’t tend to build PoCs, mostly using wireframes to convey ideas. Most of my coding is fully planned through to the end. The experimental part is on a much smaller scale (module level).

Ah my apologies. I didn’t realise you’re an individual capable of designing and building complex systems made of multiple interconnected novel modules using only wireframes, and having all that work without any prior experimentation.

For the rest of us less fortunate, LLMs can be a fantastic tool to sketch out novel modules quickly, and then test assumptions and interactions between them, before committing to a specific high level design.


> I didn’t realise you’re an individual capable of designing and building complex systems made of multiple interconnected novel modules using only wireframes, and having all that work without any prior experimentation.

Not really. It's just that there's a lot of prior work out there, so I don't need to do experimentation when someone has already done it and described the lessons learned. Then you do requirements analysis and some design (system, API, and UX), and with the platform constraints, there aren't a lot of flexible points left. I'm not doing research on software engineering.

For a lot of projects, the objective is to get something working out there. Then I can focus on refining if needs be. I don't need to optimize every parameter with my own experiments.


How do you handle work that involves building novel systems, where good prior art simply doesn’t exist?

I’m currently dealing with a project that involves developing systems where the existing prior art is either completely proprietary and inaccessible, or public but extremely nascent, and thus the documented learnings are less developed than our own learnings and designs.

Many projects may have the primary objective of getting something working. But we don’t all have the luxury of being able to declare something working and walk away. I specifically have requirements around the long-term evolution of our project (i.e. over a 5-10 year time horizon at a minimum), plus long-term operational burden and cost, while also delivering value in the short term.

LLMs are an invaluable tool for exploring the many possible solutions to what we’re building, and for helping to evaluate the longer-term consequences of our design decisions before we’ve committed significant resources to developing them completely.

Of course we could do all this without LLMs, but LLMs substantially increase the distance we can explore before timelines force us to commit.


Maybe the main problem is not solved yet, but I highly doubt the subproblems aren't, because that would be a cutting-edge domain, which is very much an outlier.

Ah so what exactly do you mean when you say

> Most of my coding is fully planned to get to the end. The experiment part is on a much smaller scale (module level).

It would seem that these statements taken together mean you don’t experiment at all?


That means that I take time to analyze the problem and come up with a convincing design (mostly research, and experience). After that I've just got a few parameters that I don't know much about. But that doesn't mean that I can't build the stuff. I just isolate them so that I can tweak them later. Why? Because they are often accidental complexities, not essential ones.

> That means that I take time to analyze the problem and come up with a convincing design (mostly research, and experience).

Ah, I think we’re finally getting somewhere. My point is that you can use an LLM as part of that research process. Not just as a poor substitute for proper research, but as a tool for experimental research. It’s supplemental to the normal research process, and is certainly not a tool for creating final outputs.

Using LLMs like that can make a meaningful difference to speed and quality of the analysis and final design. And something you should consider, rather than dismissing out of hand.


> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

Are you insinuating that dealing with unstable and unpredictable systems isn't somewhere engineering principles are frequently applied to solve complex problems?


>people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

What's the alternative?


Not use such a poor tool.

Using predictable systems.

If your C compiler invents a new function call for a non-existent function while generating code, that's usually a bug.

If an LLM does, that's... Normal. And a non-event.


If we have to use predictable systems, how could we use humans in the first place?

And?

What other engineering domain operates on a fundamentally predictable substrate? Even computer science at any appreciable scale or complexity becomes unpredictable.


Every engineering domain operates within "known bounds". That makes it dependable.

An engineer doesn't just shrug and pick up slag because it contains the same materials as the original bauxite.


Of course not, but how did we get to that point with materials science and chemistry?

We’re basically in the stone ages of understanding how to interact with synthetic intelligence.


Through research and experimentation, yes.

But attempts to integrate little understood things in daily life gave us radium toothpaste and lead poisoning. Let's not repeat stone age mistakes. Research first, integrate later.


Pretending that the world is stable and predictable, and feeling in control while making fun of other people. Obviously.

Math and physics are pretty stable. So is computer science. Avoid voodoo.

LLMs are just math.

It’s reasonable to scope one’s interest down to easily predictable, simple systems.

But most of the value in math and computer science is at the scale where there is unpredictability arising from complexity.


It's reasonable to perceive most of the value in math and computer science as being "at the scale" where there is unpredictability arising from complexity, though scale may not really be the reason for the unpredictability.

But a lot of the trouble in these domains that I have observed comes from unmodeled effects, that must be modeled and reasoned about. GPZ work shows the same thing shown by the researcher here, which is that it requires a lot of tinkering and a lot of context in order to produce semi-usable results. SNR appears quite low for now. In security specifically, there is much value in sanitizing input data and ensuring correct parsing. Do you think LLMs are in a position to do so?


I see LLMs as tools, so, sure I think they’re in a position to do so the same way pen testing tools or spreadsheets are.

In the hands of an expert, I believe they can help. In the hands of someone clueless, they will just confuse everyone, much like any other tool the clueless person uses.


Are you using 2023 LLMs? o3 and Gemini 2.5 Pro will gladly say no or declare uncertainty in my experience

90% of people only use ChatGPT, typically 4o. Of course you're right, but that's where the disconnect comes from.

Fun fact: if you ask an LLM about best practices and how to organize your prompts, it will point you in this direction.

It’s surprisingly effective to ask LLMs to help you write prompts as well; i.e., all my prompt snippets were designed with the help of an LLM.

I personally keep them all in an org-mode file and copy/paste them on demand in a ChatGPT chat as I prefer more “discussion”-style interactions, but the approach is the same.


Hah. Same. I have a step-by-step "reasoning" agent that asks me for confirmation after each step (understanding of the problem, solutions proposed, solution selection, and final wrap), just so it gets the previous prompts and answers read back to it rather than one word-salad essay.

Works incredibly well, and I created it with its own help.


It’s all about being organized: https://taoofmac.com/space/blog/2025/05/13/2230

https://github.com/jezweb/roo-commander has something like 1700 prompts in it, with 50+ prompt modes. And it seems to work pretty well. For me at least. Its task/session management is really well thought out.

Wrangling LLMs is remarkably like wrangling interns, in my experience. Except that the LLM will surprise you by being both much smarter and much dumber.

The more you can frame the problem with your expertise, the better the results you will get.



