I don't understand why generative AI gets a pass at constantly being wrong, but an average worker would be fired if they performed the same way. If a manager needed to constantly correct you or double check your work, you'd be out. Why are we lowering the bar for generative AI?
* Gen AI never disagrees with or objects to boss's ideas, even if they are bad or harmful to the company or others. In fact, it always praises them no matter what. Brenda, being a well-intentioned human being, might object to bad or immoral ideas to prevent harm. Since boss's ego is too fragile to accept criticism, he prefers gen AI.
* Boss is usually not qualified, willing, or free to do Brenda's job to the same quality standard as Brenda. This compels him to pay Brenda and treat her with basic decency, which is a nuisance. Gen AI does not demand fair or decent treatment and (at least for now) is cheaper than Brenda. It can work at any time and under conditions Brenda refuses to. So boss prefers gen AI.
* Brenda takes accountability for and pride in her work, making sure it is of high quality and as free of errors as she can manage. This is wasteful: boss only needs output that is good enough to make it someone else's problem, and as fast as possible. This is exactly what gen AI gives him, so boss prefers gen AI.
My kneejerk reaction is the sunk cost fallacy (AI is expensive), but I'm pretty sure it's actually because businesses have spent the last couple of decades doing absolutely everything they can to automate as many humans out of the workforce as possible.
So now you don't have to pay people to do their actual work, you assign the work to ML ("AI") and then pay the people to check what it generated. That's a very different task, menial and boring, but if it produces more value for the same amount of input money, then it's economical to do so.
And since checking the output is often a lower-skilled job, you can even pay the people less, pocketing more as an owner.
You don't have a human to manage. The relationship is completely one-sided; you can query a generative AI at 3 in the morning on New Year's Eve. This entity has no emotions to manage and no interests of its own.
There's cost.
There's an implicit promise of improvement over time.
There's the domain of expertise being inhumanly wide. You can ask about cookies right now, then about twelfth-century France, then about biochemistry.
The fact that an average worker would be fired for performing the same way is exactly what the human competes with. They carry responsibility, which is not something AI can offer. If, say, Anthropic actually signed contracts stating that they are liable for any mistakes, then humans would be absolutely toast.
I've been trying to open my mind and "give AI a chance" lately. I spent all day yesterday struggling with Claude Code's utter incompetence. It behaves worse than any junior engineer I've ever worked with:
- It says it's done when its code does not even work, sometimes when it does not even compile.
- When asked to fix a bug, it confidently declares victory without actually having fixed the bug.
- It gets into this mode where, when it doesn't know what to do, it just tries random things over and over, each time confidently telling me "Perfect! I found the error!" and then waiting for the inevitable response from me: "No, you didn't. Revert that change".
- Only when you give it explicit, detailed commands, "modify fade_output to be -90," will it actually produce decent results, but by the time I get to that level of detail, I might as well be writing the code myself.
To top it off, unlike the junior engineer, Claude never learns from its mistakes. It makes the same ones over and over and over, even if you include "don't make XYZ mistake" in the prompt. If I were an eng manager, Claude would be on a PIP.
Recently I've used Claude Code to build a couple TUIs that I've wanted for a long time but couldn't justify the time investment to write myself.
My experience is that I think of a new feature I want, I take a minute or so to explain it to Claude, press enter, and go off and do something else. When I come back in a few minutes, the desired feature has been implemented correctly with reasonable design choices. I'm not saying this happens most of the time, I'm saying it happens every time. Claude makes mistakes but corrects them before coming to rest. (Often my taste will differ from Claude's slightly, so I'll ask for some tweaks, but that's it.)
The takeaway I'm suggesting is that not everyone has the same experience when it comes to getting useful results from Claude. Presumably it depends on what you're asking for, how you ask, the size of the codebase, how the context is structured, etc.
It's great for demos, it's lousy for production code. The different cost of errors in these two use cases explains (almost) everything about the suitability of AI for various coding tasks. If you are the only one who will ever run it, it's a demo. If you expect others to use it, it's not.
As the name indicates, a demo is used for demonstration purposes. A personal tool is not a demo. I've seen a handful of folks assert this definition, and it seems like a very strange idea to me. But whatever.
Implicit in your claim about the cost of errors is the idea that LLMs introduce errors at a higher rate than human developers. This depends on how you're using the LLMs and on how good the developers are. But I would agree that in most cases, a human saying "this is done" carries a lot more weight than an LLM saying it.
Regardless, it is not good analysis to try to do something with an LLM, fail, and conclude that LLMs are stupid. The reality is that LLMs can be impressively and usefully effective at certain tasks in certain contexts; they can also be very ineffective in other contexts, and they are especially bad at being sure whether they've done something correctly.
> But I would agree that in most cases, a human saying "this is done" carries a lot more weight than an LLM saying it.
That's because humans have stakes. If a human tells me something is done and I later find out that it isn't, they damage their credibility with me in the future - and they know that.
> Learning to use Claude Code (and similar coding agents) effectively takes quite a lot of work.
I've tried to put in the work. I can even get it working well for a while. But then all of a sudden it is like the model suffers a massive blow to the head and can't produce anything coherent anymore. Then it is back to the drawing board, trying all over again.
It is exhausting. The promise of what it could be is really tempting fruit, but I am at the point that I can't find the value. The cost of my time to put in the work is not being multiplied in return.
> Did you have it creating and running automated tests as it worked?
Yes. I work in a professional capacity. This is a necessity regardless of who (or what) is producing the product.
> - It says it's done when its code does not even work, sometimes when it does not even compile.
> - When asked to fix a bug, it confidently declares victory without actually having fixed the bug.
You need to give it ways to validate its work. A junior dev will also hand you code that doesn't compile, or that supposedly fixes a bug but doesn't, if they never actually compile the code and test that the bug is truly fixed.
Believe me, I've tried that, too. Even after giving detailed instructions on how to validate its work, it often fails to do it, or it follows those instructions and still gets it wrong.
Don't get me wrong: Claude seems to be very useful if it's on a well-trodden train track and never has to go off the tracks. But it struggles when its output is incorrect.
The worst behavior is this "try things over and over" mode, which is also very common among junior developers and is one of the habits I try to break in real humans, too. I've gone so far as to put this into the root CLAUDE.md system prompt:
--NEVER-- try fixes that you are not sure will work.
--ALWAYS-- prove that something is expected to work and is the correct fix, before implementing it, and then verify the expected output after applying the fix.
...which is a fundamental thing I'd ask of a real software engineer, too. The problem is that, as an LLM, it's just spitting out probabilistic sentences: it is always 100% confident of its next few words. That makes it a poor investigator.
It’s much cheaper than Brenda (superficially, at least). I’m not sure a worker that costs a few dollars a day would be fired, especially given the occasional brilliance they exhibit.
How much does the compute cost for the AI to do Brenda's job? Not the total AI spend, but the fraction that replaced Brenda. That's why they'd fire a human but keep using the AI.
Brenda's job involves being accountable for the output. In many types of jobs, posting false numbers would expose her to dismissal, a lawsuit, or even jail.
I'd like to see the cost of a model where the model provider (Anthropic etc) can assume that kind of financial and legal accountability.
To the extent that this output is possible only when Anthropic is not held to the same standard as Brenda, we will have to conclude that the cost savings accrue from the reduced liability standards rather than from the technical capabilities of the model.
It's not just compute, it's also the setup cost: how much did you have to pay someone to feed the AI Brenda's decades of company-specific knowledge and all the little special cases of how it does business?
> Because it doesn’t have to be as accurate as a human to be a helpful tool.
I disagree. If something can't be as accurate as a (good) human, then it's useless to me. I'll just ask the human instead, because I know that the human is going to be worth listening to.
That's a downright insane comparison. The whole problem with generative AI is how extremely unreliable it is. You cannot really trust it with anything because irrespective of its average performance, it has absolutely zero guarantees on its worst-case behavior.
Aviation autopilot systems are the complete opposite. They are arguably the most reliable computer-based systems ever created. While they cannot fly a plane alone, pilots can trust them blindly to do specific, known tasks consistently well in over 99.99999% of cases, and provide clear diagnostics in case they cannot.
If gen AI agents were this consistently good at anything, this discussion would not be happening.
The autopilots in aircraft have predictable behaviors based on the data and inputs available to them.
This can still be problematic! If sensors are feeding the autopilot bad data, the autopilot may do the wrong thing for a situation. Likewise, if the pilot(s) do not understand the autopilot's behaviors, they may misuse the autopilot, or take actions that interfere with the autopilot's operation.
Generative AI has unpredictable results. You cannot make confident statements like "if inputs X, Y, and Z are at these values, the system will always produce this set of outputs".
In the very short timeline of reacting to a critical mid-flight situation, confidence in the behavior of the systems is critical. A lot of plane crashes have "the pilot didn't understand what the automation was doing" as a significant contributing factor. We get enough of that from lack of training, differences between aircraft manufacturers, and plain old human fallibility. We don't need to introduce a randomized source of opportunities for the pilots to not understand what the automation is doing.
It started out as, "AI can make more errors than a human. Therefore, it is not useful to humans." Which I disagreed with.
But now it seems like the argument is, "AI is not useful to humans because its output is non-deterministic?" Is that an accurate representation of what you're saying?
My problem with generative AI is that it makes different errors than humans tend to make. And these errors can be harder to predict and detect than the kinds of errors humans tend to make, because fundamentally the error source is the non-determinism.
Remember "garbage in, garbage out"? We expect technology systems to generate expected outputs in response to inputs. With generative AI, you can get a garbage output regardless of the input quality.
Gen AI doesn't just get a pass at being wrong. It gets a pass for everything.
Look at Grok. If a human employee went around sexually harassing their CEO in public and giving themselves a Hitler nickname, they'd be fired immediately and face criminal charges. In the case of Grok, the CEO had to quit the company after being sexually harassed.
We've not lowered the bar for AI, we've removed it entirely.
It’s not even greater trust. It’s just passive trust. The thing is, Brenda is her own QA department. Every good Brenda is precisely good because she checks her own work before shipping it. AI does not do this. It doesn’t even fully understand the problem or question sometimes, yet it provides a smart, definitive-sounding answer. It’s like the doctor on The Simpsons: if you can’t tell he’s a quack, you’d probably follow his medical advice.
> Every good Brenda is precisely good because she checks her own work before shipping it. AI does not do this.
A confident statement that's trivially disproved. I use Claude Code to build and deploy services on my NAS. I can ask it to spin up a new container on my subdomain and make it available internally only, or also externally. It knows it has access to my Cloudflare API key. It knows I am running rootless podman and knows my file storage conventions. It will create the DNS records for a cloudflared tunnel, or just set up DNS on my Pi-hole for internal-only resolution. It will check that podman launched the container and will then make an HTTP request to the site to verify that it is up. It will reach for network tools to test both the public and private interfaces. It will check the podman logs for any errors or warnings. If it detects errors, it will attempt to resolve them, and it is typically successful for the types of services I'm hosting.
Instructions like "Set up Jellyfin in a container on the NAS and integrate it with the rest of the *arr stack. I'd like it to be available internally and externally on watch.<domain>.com" have worked extremely well for me. It delivers working, integrated services reliably, and it checks that what it deployed is working, all without my explicit prompting.
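To make "verify that it is up" concrete, here is a rough Python sketch of the kind of checks involved. This is not Claude's actual procedure, and the container name and URLs are placeholders I made up: is the container running, do the internal and external addresses answer, and do the recent logs contain errors or warnings?

```python
# Rough sketch of the post-deploy checks described above (not Claude's
# actual procedure). Container name and URLs are hypothetical placeholders.
import subprocess
import urllib.request

CONTAINER = "jellyfin"                # hypothetical container name
URLS = [
    "http://nas.local:8096",          # hypothetical internal address
    "https://watch.example.com",      # hypothetical external address
]

def container_running(name: str) -> bool:
    """True if `podman ps` lists a running container with this name."""
    out = subprocess.run(
        ["podman", "ps", "--filter", f"name={name}", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return name in out.split()

def url_reachable(url: str) -> bool:
    """True if the URL answers successfully within 10 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=10):
            return True
    except Exception:
        return False

def recent_log_issues(name: str) -> list[str]:
    """Lines from the last 200 log entries that look like errors or warnings."""
    proc = subprocess.run(
        ["podman", "logs", "--tail", "200", name],
        capture_output=True, text=True,
    )
    lines = (proc.stdout + proc.stderr).splitlines()
    return [l for l in lines if "error" in l.lower() or "warn" in l.lower()]

if __name__ == "__main__":
    print("container running:", container_running(CONTAINER))
    for url in URLS:
        print(url, "reachable:", url_reachable(url))
    for line in recent_log_issues(CONTAINER):
        print("log issue:", line)
```

The point is just that "verify it's up" is a mechanical, checkable loop the agent can run via the shell, not a judgment call.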
You’ve switched contexts completely with your strawman. Meaning, you’ve pivoted from Brenda in finance to a technical/software-engineering task. You’ve pointed the conversation specifically at a use case that AI is good at: writing code and solving those problems. The world at large is much more complex than helping you be a 10x engineer. To live up to the hype, it has to do this reliably for every vertical in a majority of situations. It’s not even close to being there.
Also, context-equivalent counterexamples abound. Just read HN or any tech forum and it takes no time to find people talking about the hallucinations and garbage that AI sometimes generates. The whole vibe-coding trend is built on “make this app” followed by hundreds of “fix this” and “fix that” prompts, because it doesn’t get much right on the first attempt.
You're moving the goalposts. You claimed "AI" cannot verify results, and that's trivially false. Claude Code verifies results on a regular basis. You don't have a clue what you're talking about and are just pushing ignorant FUD.
What I'm saying is that it can't do so reliably. I'm not doubting you built one singular use case where it has. When I feed Copilot a PDF contract and ask it what monthly minimum I can charge this client, it tells me $1000. I ask it a dozen other questions and it changes its response, but never to the correct value. Then, when I ask it to cite where it found that information, it points me to a paragraph that clearly says $1500, spelled out clear as day, not entangled in a bunch of legalese or anything else. How is that reliable for a Brenda in finance? (This is a real case I tried out.)
In the above scenario, if Claude accidentally wipes out your Jellyfin movies, will Claude deal with consequences (ie an unhappy family/friends) or will you?
That exemption from accountability is a massive factor that renders any comparison meaningless.
In a business scenario, a model provider that could assume financial and legal liability for mistakes (as humans need to do) would be massively more expensive.
That’s definitely the hype. But I don’t know if I agree. I’m essentially a Brenda in my corporate finance job and so far have struggled to find any useful scenarios to use AI for.
I once thought it could build me a Gantt chart, because that’s an annoying task in Excel. I had the data. When I asked it to help me: “I can’t do that, but I can summarize your data.” Not helpful.
Any type of analysis is exactly what I don’t want to trust it with. But I could use help actually building things, which it wouldn’t do.
Also, Brendas are usually fast. Having them use a tool like AI that can’t be fully trusted just slows them down. So IMO, we haven’t proven that the AI variable in your equation is actually a positive value.
I can't speak to finance. In programming, it can be useful but it takes some time and effort to find where it works well.
I have had no success in using it to create production code. It's just not good enough. It tends to pattern-match the problem in somewhat broad strokes and produce something that looks good but collapses if you dig into it. It might work great for CRUD apps but my work is a lot more fiddly than that.
I've had good success in using it to create one-off helper scripts to analyze data or test things. For code that doesn't have to be good and doesn't have to stand the test of time, it can do alright.
I've had great success in having it do relatively simple analysis on large amounts of code. I see a bug that involves X, and I know that it's happening in Y. There's no immediately obvious connection between X and Y. I can dig into the codebase and trace the connection. Or I can ask the machine to do it. The latter is a hundred times faster.
The key is finding things where it can produce useful results and you can verify them quickly. If it says X and Y are connected by such-and-such path and here's how that triggers the bug, I can go look at the stuff and see if that's actually true. If it is, I've saved a lot of time. If it isn't, no big loss. If I ask it to make some one-off data analysis script, I can evaluate the script and spot-check the results and have some confidence. If I ask it to modify some complicated multithreaded code, it's not likely to get it right, and the effort it takes to evaluate its output is way too much for it to be worthwhile.
I'd agree. Programming is a solid use case for AI. Programming is a part of my job, and hobby too, and that's the main place where I've seen some value with it. It still is not living up to the hype but for simple things, like building a website or helping me generate the proper SQL to get what I want - it helps and can be faster than writing by hand. It's pretty much replaced StackOverflow for helping me debug things or look up how to do something that I know is already solved somewhere and I don't want to reinvent. But, I've also seen it make a complete mess of my codebase anytime I try to build something larger. It might technically give me a working widget after some vibe coding, but I'm probably going to have to clean the whole thing up manually and refactor some of it. I'm not certain that it's more efficient than just doing it myself from the start.
Every other facet of the world that AI is trying to 'take over' is not programming. Programming is writing text, which is what AI is good at. It's using references to other code, which AI has been specifically trained on. And so on. It makes sense that that use case is coming along well. Everything else is not even close, IMO, unless it's similar: it's probably great at helping people draft emails and finish their homework. I don't have those pain points.
> No, no. We disavow AI because our great leaders inexplicably trust it more than Brenda.
I would add a little nuance here.
I know a lot of people who don't have technical ability, either because they advanced out of hands-on work or because it was never their job or interest.
These types of people are usually the folks who set direction or govern the purse strings.
Here's the thing: they are empowered by AI. They can do things themselves.
And every one of them is so happy. They are tickled pink.