I’ve only skimmed the paper - a long and dense read - but it’s already clear it’ll become a classic. What’s fascinating is that engineering is transforming into a science, trying to understand precisely how its own creations work
This shift is more profound than many realize. Engineering traditionally applied our understanding of the physical world, mathematics, and logic to build predictable things. But now, especially in fields like AI, we’ve built systems so complex we no longer fully understand them. We must now use scientific methods - originally designed to understand nature - to comprehend our own engineered creations. Mindblowing.
This "practice-first, theory-later" pattern has been the norm rather than the exception. The steam engine predated thermodynamics. People bred plants and animals for thousands of years before Darwin or Mendel.
The few "top-down" examples where theory preceded application (like nuclear energy or certain modern pharmaceuticals) are relatively recent historical anomalies.
I see your point, but something still seems different. Yes we bred plants and animals, but we did not create them. Yes we did build steam engines before understanding thermodynamics but we still understood what they did (heat, pressure, movement, etc.)
Fun fact: we have no clue how most drugs works. Or, more precisely, we know a few aspects, but are only scratching the surface. We're even still discovering news things about Aspirin, one of the oldest drugs: https://www.nature.com/articles/s41586-025-08626-7
> Yes we did build steam engines before understanding thermodynamics but we still understood what it did (heat, pressure, movement, etc.)
We only understood in the broadest sense. It took a long process of iteration before we could create steam engines that were efficient enough to start an Industrial Revolution. At the beginning they were so inefficient that they could only pump water from the same coal mine they got their fuel from, and subject to frequent boiler explosions besides.
There was a lot of physics already known, importance of insulation and cross-section, signal attenuation was also known.
The future Lord Kelvin conducted experiments. The two scientific advisors had a conflict. And the "CEO" went with the cheaper option.
"""
Thomson believed that Whitehouse's measurements were flawed and that underground and underwater cables were not fully comparable. Thomson believed that a larger cable was needed to mitigate the retardation problem. In mid-1857, on his own initiative, he examined samples of copper core of allegedly identical specification and found variations in resistance up to a factor of two. But cable manufacture was already underway, and Whitehouse supported use of a thinner cable, so Field went with the cheaper option.
"""
"Initially messages were sent by an operator using Morse code. The reception was very bad on the 1858 cable, and it took two minutes to transmit just one character (a single letter or a single number), a rate of about 0.1 words per minute."
Most of what we refer to as "engineering" involves using principles that flow down from science to do stuff. The return to the historic norm is sort of a return to the "useful arts" or some other idea.
This isn't quite true, although it's commonly said.
For steam engines, the first commercial ones came after and were based on scientific advancements that made them possible. One built in 1679 was made by an associate of Boyle, who discovered Boyle's law. These early steam engines co-evolved with thermodynamics. The engines improved and hit a barrier, at which point Carnot did his famous work.
This is putting aside steam engines that are mostly curiosities like ones built in the ancient world.
It's been there in programming from essentially the first day too. People skip the theory and just get hacking.
Otherwise we'd all be writing Haskell now. Or rather we'd not be writing anything since a real compiler would still have been to hacky and not theoretically correct.
I'm writing this with both a deep admiration as well as practical repulsion of C.S. theory.
This is definitely a classic for story telling but it appears to be nothing more than hand wavy. Its a bit like there is the great and powerful man behind the curtain, lets trace the thought of this immaculate being you mere mortals. Anthropomorphing seems to be in an overdose mode with "thinking / thoughts", "mind" etc., scattered everywhere.
Nothing with any of the LLMs outputs so far suggests that there is anything even close enough to a mind or a thought or anything really outside of vanity. Being wistful with good story telling does go a long way in the world of story telling but in actually understanding the science, I wouldn't hold my breath.
I just wanted to make sure you noticed that this is linking to an accessible blog post that's trying to communicate a research result to a non-technical audience?
The actual research result is covered in two papers which you can find here:
These papers are jointly 150 pages and are quite technically dense, so it's very understandable that most commenters here are focusing on the non-technical blog post. But I just wanted to make sure that you were aware of the papers, given your feedback.
Anthropomorphing[sic] seems to be in an overdose mode with
"thinking / thoughts", "mind" etc., scattered everywhere.
Nothing with any of the LLMs outputs so far suggests that
there is anything even close enough to a mind or a thought
or anything really outside of vanity.
This is supported by reasonable interpretation of the cited article.
Considering the two following statements made in the reply:
I'm one of the authors.
And
These papers are jointly 150 pages and are quite
technically dense, so it's very understandable that most
commenters here are focusing on the non-technical blog post.
The onus of clarifying the article's assertions:
Knowing how models like Claude *think* ...
And
Claude sometimes thinks in a conceptual space that is
shared between languages, suggesting it has a kind of
universal “language of thought.”
As it pertains to anthropomorphizing an algorithm (a.k.a. stating it "thinks") is on the author(s).
I view LLM's as valuable algorithms capable of generating relevant text based on queries given to them.
> Thinking and thought have no solid definition. We can't say Claude doesn't "think" because we don't even know what a human thinking actually is.
I did not assert:
Claude doesn't "think" ...
What I did assert was that the onus is on the author(s) which write articles/posts such as the one cited to support their assertion that their systems qualify as "thinking" (for any reasonable definition of same).
Short of author(s) doing so, there is little difference between unsupported claims of "LLM's thinking" and 19th century snake oil[0] salesmen.
No one says that a thermostat is "thinking" of turning on the furnace, or that a nightlight is "thinking it is dark enough to turn the light on". You are just being obtuse.
Yes. A thermostat involves a change of state from A to B. A computer is the same: its state at t causes its state at t+1, which causes its state at t+2, and so on. Nothing else is going on. An LLM is no different: an LLM is simply a computer that is going through particular states.
Thought is not the same as a change of (brain) state. Thought is certainly associated with change of state, but can't be reduced to it. If thought could be reduced to change of state, then the validity/correctness/truth of a thought could be judged with reference to its associated brain state. Since this is impossible (you don't judge whether someone is right about a math problem or an empirical question by referring to the state of his neurology at a given point in time), it follows that an LLM can't think.
>Thought is certainly associated with change of state, but can't be reduced to it.
You can effectively reduce continuously dynamic systems to discreet steps. Sure, you can always say that the "magic" exists between the arbitrarily small steps, but from a practical POV there is no difference.
A transistor has a binary on or off. A neuron might have ~infinite~ levels of activation.
But in reality the ~infinite~ activation level can be perfectly modeled (for all intents and purposes), and computers have been doing this for decades now (maybe not with neurons, but equivalent systems). It might seem like an obvious answer, that there is special magic in analog systems that binary machines cannot access, but that is wholly untrue. Science and engineering have been extremely successful interfacing with the analog reality we live in, precisely because the digital/analog barrier isn't too big of a deal. Digital systems can do math, and math is capable of modeling analog systems, no problem.
It's not a question of discrete vs continuous, or digital vs analog. Everything I've said could also apply if a transistor could have infinite states.
Rather, the point is that the state of our brain is not the same as the content of our thoughts. They are associated with one another, but they're not the same. And the correctness of a thought can be judged only by reference to its content, not to its associated state. 2+2=4 is correct, and 2+2=5 is wrong; but we know this through looking at the content of these thoughts, not through looking at the neurological state.
But the state of the transistors (and other components) is all a computer has. There are no thoughts, no content, associated with these states.
It seems that the only barrier between brain state and thought contents is a proper measurement tool and decoder, no?
We can already do this at an extremely basic level, mapping brain states to thoughts. The paraplegic patient using their thoughts to move the mouse cursor or the neuroscientist mapping stress to brain patterns.
If I am understanding your position correctly, it seems that the differentiation between thoughts and brain states is a practical problem not a fundamental one. Ironically, LLMs have a very similar problem with it being very difficult to correlate model states with model outputs. [1]
There is undoubtedly correlation between neurological state and thought content. But they are not the same thing. Even if, theoretically, one could map them perfectly (which I doubt is possible but it doesn't affect my point), they would remain entirely different things.
The thought that "2+2=4", or the thought "tiger", are not the same thing as the brain states that makes them up. A tiger, or the thought of a tiger, is different from the neurological state of a brain that is thinking about a tiger. And as stated before, we can't say that "2+2=4" is correct by referring to the brain state associated with it. We need to refer to the thought itself to do this. It is not a practical problem of mapping; it is that brain states and thoughts are two entirely different things, however much they may correlate, and whatever causal links may exist between them.
This is not the case for LLMs. Whatever problems we may have in recording the state of the CPUs/GPUs are entirely practical. There is no 'thought' in an LLM, just a state (or plurality of states). An LLM can't think about a tiger. It can only switch on LEDs on a screen in such a way that we associate the image/word with a tiger.
> The thought that "2+2=4", or the thought "tiger", are not the same thing as the brain states that makes them up.
Asserted without evidence. Yes, this does represent a long and occasionally distinguished line of thinking in cognitive science/philosophy of mind, but it is certainly not the only one, and some of the others categorically refute this.
Is it your contention that a tiger may be the same thing as a brain state?
It would seem to me that any coherent philosophy of mind must accept their being different as a datum; or conversely, any that implied their not being different would have to be false.
EDIT: my position has been held -- even taken as axiomatic -- by the vast majority of philosophers, from the pre-Socratics onwards, and into the 20th century. So it's not some idiosyncratic minority position.
No. One is paint on canvas, and the other is part of a causal chain that makes LEDs light up in a certain way. Neither the painting nor the computer have thoughts about a tiger in the way we do. It is the human mind that makes the link between picture and real tiger (whether on canvas or on a screen).
>Rather, the point is that the state of our brain is not the same as the content of our thoughts.
Based on what exactly ? This is just an assertion. One that doesn't seem to have much in the way of evidence. 'It's not the same trust me bro' is the thesis of your argument. Not very compelling.
It's not difficult. When you think about a tiger, you are not thinking about the brain state associated with said thought. A tiger is different from a brain state.
We can safely generalize, and say the content of a thought is different from its associated brain state.
Also, as I said
>> The correctness of a thought can be judged only by reference to its content, not to its associated state. 2+2=4 is correct, and 2+2=5 is wrong; but we know this through looking at the content of these thoughts, not through looking at the neurological state.
>It's not difficult. When you think about a tiger, you are not thinking about the brain state associated with said thought. A tiger is different from a brain state.
We can safely generalize, and say the content of a thought is different from its associated brain state.
Just because you are not thinking about a brain state when you think about a tiger does not mean that your thought is not a brain state.
Just because the experience of thinking about X doesn't feel like the experience of thinking about Y (or doesn't feel like the physical process Z), it doesn't logically follow that the mental event of thinking about X isn't identical to or constituted by the physical process Z. For example, seeing the color red doesn't feel like processing photons of a specific wavelength with cone cells and neural pathways, but that doesn't mean the latter isn't the physical basis of the former.
>> The correctness of a thought can be judged only by reference to its content, not to its associated state. 2+2=4 is correct, and 2+2=5 is wrong; but we know this through looking at the content of these thoughts, not through looking at the neurological state. This implies that state != content.
Just because our current method of verification focuses on content doesn't logically prove that the content isn't ultimately realized by or identical to a physical state. It only proves that analyzing the state is not our current practical method for judging mathematical correctness.
We judge if a computer program produced the correct output by looking at the output on the screen (content), not usually by analyzing the exact pattern of voltages in the transistors (state). This doesn't mean the output isn't ultimately produced by, and dependent upon, those physical states. Our method of verification doesn't negate the underlying physical reality.
When you evaluate "2+2=4", your brain is undergoing a sequence of states that correspond to accessing the representations of "2", "+", "=", applying the learned rule (also represented physically), and arriving at the representation of "4". The process of evaluation operates on the represented content, but the entire process, including the representation of content and rules, is a physical neural process (a sequence of brain states).
> Just because you are not thinking about a brain state when you think about a tiger does not mean that your thought is not a brain state.
> It doesn't logically follow that the mental event of thinking about X isn't identical to or constituted by the physical process Z.
That's logically sound insofar as it goes. But firstly, the existence of a brain state for a given thought is, obviously, not proof that a thought is a brain state. Secondly, if you say that a thought about a tiger is a brain state, and nothing more than a brain state, then you have the problem of explaining how it is that your thought is about a tiger at all. It is the content of a thought that makes it be about reality; it is the content of a thought about a tiger that makes it be about a tiger. If you declare that a thought is its state, then it can't be about a tiger.
You can't equate content with state, and nor can you make content be reducible to state, without absurdity. The first implies that a tiger is the same as a brain state; the second implies that you're not really thinking about a tiger at all.
Similarly for arithmetic. It is only the content of a thought about arithmetic that makes it be right or wrong. It is our ideas of "2", "+", and so on, that make the sum right or wrong. The brain states have nothing to do with it. If you want to declare that content is state, and nothing more than state, then you have no way of saying the one sum is right, and the other is wrong.
Please, take the pencil and draw the line between thinking and non-thinking systems. Hell I'll even take a line drawn between thinking and non-thinking organisms if you have some kind of bias towards sodium channel logic over silicon trace logic. Good luck.
Even if you can't define the exact point that A becomes not-A, it doesn't follow that there is no distinction between the two. Nor does it follow that we can't know the difference. That's a pretty classic fallacy.
For example, you can't name the exact time that day becomes night, but it doesn't follow that there is no distinction.
A bunch of transistors being switched on and off, no matter how many there are, is no more an example of thinking than a single thermostat being switched on and off. OTOH, if we can't think, then this conversation and everything you're saying and "thinking" is meaningless.
So even without a complete definition of thought, we can see that there is a distinction.
> For example, you can't name the exact time that day becomes night, but it doesn't follow that there is no distinction.
There is actually a very detailed set of definitions of the multiple stages of twilight, including the last one which defines the onset of what everyone would agree is "night".
The fact that a phenomena shows a continuum by some metric does not mean that it is not possible to identify and label points along that continuum and attach meaning to them.
Your assertion that sodium channel logic and silicon trace logic are 100% identical is the primary problem. It's like claiming that a hydraulic cylinder and a bicep are 100% equivalent because they both lift things - they are not the same in any way.
People chronically get stuck in this pit. Math is substrate independent. If the process is physical (i.e. doesn't draw on magic) then it can be expressed with mathematics. If it can be expressed with mathematics, anything that does math can compute it.
The math is putting the crate up on the rack. The crate doesn't act any different based on how it got up there.
Honestly, arguing seems futile when it comes to opinions like GP. Those opinions resemble religious zealotry to me in that they take for granted that only humans can think. Any determinism of any kind in a non-human is seized upon as proof its mere clockwork, yet they can’t explain how humans think in order to contrast it.
> Honestly, arguing seems futile when it comes to opinions like GP. Those opinions resemble religious zealotry to me in that they take for granted that only humans can think. Any determinism of any kind in a non-human is seized upon as proof its mere clockwork, yet they can’t explain how humans think in order to contrast it.
Putting aside the ad hominems, projections, and judgements, here is a question for you:
If I made a program where a NPC[0] used the A-star[1] algorithm to navigate a game map, including avoiding obstacles and using the shortest available path to reach its goal, along with identifying secondary goal(s) should there be no route to the primary goal, does that qualify to you as the NPC "thinking"?
Really appreciate your team's enormous efforts in this direction, not only the cutting edge research (which I don't see OAI/DeepMind publishing any paper on) but aslo making the content more digestible for non-research audience. Please keep up the great work!
I, uh, think, that "think" is a fine metaphor but "planning ahead" is a pretty confusing one. It doesn't have the capability to plan ahead because there is nowhere to put a plan and no memory after the token output, assuming the usual model architecture.
That's like saying a computer program has planned ahead if it's at the start of a function and there's more of the function left to execute.
I think that's a very unfair take. As a summary for non-experts I found it did a great job of explaining how by analyzing activated features in the model, you can get an idea of what it's doing to produce the answer. And also how by intervening to change these activations manually you can test hypotheses about causality.
It sounds like you don't like anthropomorphism. I can relate, but I don't get where Its a bit like there is the great and powerful man behind the curtain, lets trace the thought of this immaculate being you mere mortals is coming from. In most cases the anthropomorphisms are just the standard way to convey the idea briefly. Even then I liked how they sometimes used scare quotes as in it began "thinking" of potential on-topic words. There are some more debatable anthropomorphisms such as "in its head" where they use scare quotes systematically.
Also given that they took inspiration from neuroscience to develop a technique that appears successful in analyzing their model, I think they deserve some leeway on the anthropomorphism front. Or at least on the "biological metaphors" front which is maybe not really the same thing.
I used to think biological metaphors for LLMs were misleading, but I'm actually revising this opinion now. I mean I still think the past metaphors I've seen were misleading, but here, seeing the activation pathways they were able to identify, including the inhibitory circuits, and knowing a bit about similar structures in the brain I find the metaphor appropriate.
Yup... well, if the research is conducted (or sponsored) by the company that develops and sells the LLM, of course there will be a temptation to present their product in a better light and make it sound like more than it actually is. I mean, the anthropomorphization starts already with the company name and giving the company's LLM a human name...
Engineering started out as just some dudes who built things from gut feeling. After a whole lot of people died from poorly built things, they decided to figure out how to know ahead of time if it would kill people or not. They had to use math and science to figure that part out.
Funny enough that happened with software too. People just build shit without any method to prove that it will not fall down / crash. They throw some code together, poke at it until it does something they wanted, and call that "stable". There is no science involved. There's some mathy bits called "computer science" / "software algorithms", but most software is not a math problem.
Software engineering should really be called "Software Craftsmanship". We haven't achieved real engineering with software yet.
You have a point, but it is also true that some software is far more rigorously tested than other software. There are categories where it absolutely is both scientific and real engineering.
I fully agree that the vast majority is not, though.
This is such an unbelievably dismissive assertion, I don't even know where to start.
To suggest, nay, explicitly state:
Engineering started out as just some dudes who built things
from gut feeling.
After a whole lot of people died from poorly built things,
they decided to figure out how to know ahead of time if it
would kill people or not.
Is to demean those who made modern life possible. Say what you want about software developers and I would likely agree with much of the criticism.
Not so the premise set forth above regarding engineering professions in general.
Surely you already know the history of professional engineers, then? How it's only a little over 118 years old? Mostly originating from the fact that it was charlatans claiming to be engineers, building things that ended up killing people, that inspired the need for a professional license?
"The people who made modern life possible" were not professional engineers, often barely amateurs. Artistocrat polymaths who delved into cutting edge philosophy. Blacksmith craftsmen developing new engines by trial and error. A new englander who failed to study law at Yale, landed in the American South, and developed a modification of an Indian device for separating seed from cotton plants.
In the literal historical sense, "engineering" was just the building of cannons in the 14th century. Since thousands of years before, up until now, there has always been a combination of the practice of building things with some kind of "science" (which itself didn't exist until a few hundred years ago) to try to estimate the result of an expensive, dangerous project.
But these are not the people who made modern life people. Lots, and lots, and lots of people made modern life possible. Not just builders and mathematicians. Receptionists. Interns. Factory workers. Farmers. Bankers. Sailors. Welders. Soldiers. So many professions, and people, whose backs and spirits were bent or broken, to give us the world we have today. Engineers don't deserve any more credit than anyone else - especially considering how much was built before their professions were even established. Science is a process, and math is a tool, that is very useful, and even critical. But without the rest it's just numbers on paper.
> Surely you already know the history of professional engineers, then? How it's only a little over 118 years old? Mostly originating from the fact that it was charlatans claiming to be engineers, building things that ended up killing people, that inspired the need for a professional license?
I did not qualify with "professional" as you have, which is disingenuous. If the historical record of what can be considered "engineering" is of import, consider:
The first recorded engineer
Hey, why not ask? Surely it’s related to understanding the
origin of the word engineering? Right? Whatever we’ve asked
the question now. According to Encyclopedia Britannica, the
first recorded “engineer” was Imhotep. He happened to be
the builder of the Step Pyramid at Ṣaqqārah, Egypt.
This is thought to have been erected around 2550 BC. Of
course, that is recorded history but we know from
archeological evidence that humans have been
making/building stuff, fires, buildings and all sorts of
things for a very long time.
The importance of Imhotep is that he is the first
“recorded” engineer if you like.[0]
> But these are not the people who made modern life people[sic]. Lots, and lots, and lots of people made modern life possible.
Of course this is the case. No one skill category can claim credit for all societal advancement.
But all of this is a distraction from what you originally wrote:
Engineering started out as just some dudes who built things
from gut feeling.
After a whole lot of people died from poorly built things,
they decided to figure out how to know ahead of time if it
would kill people or not.
These are your words, not mine. And to which I replied:
This is such an unbelievably dismissive assertion ...
What I wrote has nothing to do with "Engineers don't deserve any more credit than anyone else ..."
It has everything to do with categorizing efforts to solve difficult problems as unserious haphazard undertakings which ultimately led to; "they decided to figure out how to know ahead of time if it would kill people or not" (again, your words not mine).
Software Engineering is only about 60 years old - i.e. the term has existed.
At the point in the history of civil engineering, they didn't even know what a right angle was.
Civil engineers were able to provide much utility before the underlying theory was available. I do wonder about the safety of structures at the time.
> Software Engineering is only about 60 years old - i.e. the term has existed.
Perhaps as a documented term, but the practice is closer to roughly 75+ years. Still, IMHO there is a difference between those who are Software Engineers and those whom claim to be so.
> At the point in the history of civil engineering, they didn't even know what a right angle was.
I strongly disagree with this premise, as right angles were well defined since at least ancient Greece (see Pythagorean theorem[0]).
> Civil engineers were able to provide much utility before the underlying theory was available.
Eschewing the formal title of Civil Engineer and considering those whom performed the role before the title existed, I agree. I do humbly suggest that by the point in history to where Civil Engineering was officially recognized, a significant amount of the necessary mathematical and materials science was available.
What about modern life is so great that we should laud its authors?
Medical advances and generally a longer life is what comes to mind. But much of life is empty of meaning and devoid of purpose; this seems rife within the Western world. Living a longer life in hell isn’t something I would have chosen.
> But much of life is empty of meaning and devoid of purpose
Maybe life is empty to you. You can't speak for other people.
You also have no idea if pre-modern life was full of meaning and purpose. I'm sure someone from that time bemoaning the same.
The people before modern time were much less well off. They had to work a lot harder to put food on the table. I imagine they didn't have a lot of time to wonder about the meaning of life.
We've already built things in computing that we don't easily understand, even outside of AI, like large distributed systems and all sorts of balls of mud.
Within the sphere of AI, we have built machines which can play strategy games like chess, and surprise us with an unforseen defeat. It's not necessarily easy to see how that emerged from the individual rules.
Even a compiler can surprise you. You code up some optimizations, which are logically separate, but then a combination of them does something startling.
Basically, in mathematics, you cannot grasp all the details of a vast space just from knowing the axioms which generate it and a few things which follow from them. Elementary school children know what is a prime number, yet those things occupy mathematicians who find new surprises in that space.
Right, but this is somewhat different, in that we apply a simple learning method to a big dataset, and the resulting big matrix of numbers suddenly can answer question and write anything - prose, poetry, code - better than most humans - and we don't know how it does it. What we do know[0] is, there's a structure there - structure reflecting a kind of understanding of languages and the world. I don't think we've ever created anything this complex before, completely on our own.
Of course, learning method being conceptually simple, all that structure must come from the data. Which is also profound, because that structure is a first fully general world/conceptual model that we can actually inspect and study up close - the other one being animal and human brains, which are much harder to figure out.
> Basically, in mathematics, you cannot grasp all the details of a vast space just from knowing the axioms which generate it and a few things which follow from them. Elementary school children know what is a prime number, yet those things occupy mathematicians who find new surprises in that space.
Prime numbers and fractals and other mathematical objects have plenty of fascinating mysteries and complex structures forming though them, but so far none of those can casually pass Turing test and do half of my job for me, and millions other people.
--
[0] - Even as many people still deny this, and talk about LLMs as mere "stochastic parrots" and "next token predictors" that couldn't possibly learn anything at all.
We know quite well how it does it. It's applying extrapolation to its lossily compressed representation. It's not magic and especially the HN crowd of technical profficient folks should stop treating it as such.
That is not a useful explanation. "Applying extrapolation to its lossily compressed representation" is pretty much the definition of understanding something. The details and interpretation of the representation are what is interesting and unknown.
We can use data based on analyzing the frequency of ngrams in a text to generate sentences, and some of them will be pretty good, and fool a few people into believing that there is some solid language processing going on.
LLM AI is different in that it does produce helpful results, not only entertaining prose.
It is practical for users to day to replace most uses of web search with a query to a LLM.
The way the token prediction operates, it uncovers facts, and renders them into grammatically correct language.
Which is amazing given that, when the thing is generating a response that will be, say, 500 tokens long, when it has produced 200 of them, it has no idea what the remaining 300 will be. Yet it has committed to the 200; and often the whole thing will make sense when the remaining 300 arrive.
The research posted demonstrates the opposite of that within the scope of sequence lengths they studied. The model has future tokens strongly represented well in advance.
If you don't mind - based on what will this "paper" become a classic? Was it published in a well known scientific magazine, after undergoing a stringent peer-review process, because it is setting up and proving a new scientific hypothesis? Because this is what scientific papers look like. I struggle to identify any of those characteristics, except for being dense and hard to read, but that's more of a correlation, isn't it?
I'm reminded of the metaphor that these models aren't constructed, they're "grown". It rings true in many ways - and in this context they're like organisms that must be studied using traditional scientific techniques that are more akin to biology than engineering.
We don’t precisely know the most fundamental workings of a living cell.
Our understanding of the fundamental physics of the universe has some hold.
But for LLMs and statistical models in general, we do know precisely what the fundamental pieces do. We know what processor instructions are being executed.
We could, given enough research, have absolutely perfect understanding of what is happening in a given model and why.
Idk if we’ll be able to do that in the physical sciences.
Having spent some time working with both molecular biologists and LLM folks, I think it's pretty good analogy.
We know enough quantum mechanics to simulate the fundamental workings of a cell pretty well, but that's not a route to understanding. To explain anything, we need to move up an abstraction hierarchy to peptides, enzymes, receptors, etc. But note that we invented those categories in the first place -- nature doesn't divide up functionality into neat hierarchies like human designers do. So all these abstractions are leaky and incomplete. Molecular biologists are constantly discovering mechanisms that require breaking the current abstractions to explain.
Similarly, we understand floating point multiplication perfectly, but when we let 100 billion parameters set themselves through an opaque training process, we don't have good abstractions to use to understand what's going on in that set of weights. We don't have even the rough equivalent of the peptides or enzymes level yet. So this paper is progress toward that goal.
I don’t think this is as profound as you made out to be. Most complex systems are incomprehensible to the majority of population anyway, so from a practical standpoint AI is no different. There’s also no single theory for how the financial markets work and yet market participants trade and make money nonetheless. And yes, we created the markets.
Which axioms are interesting? And why? That is nature.
Yes, proof from axioms is a cornerstone of math, but there are all sorts of axioms you could assume, and all sorts of proofs to do from them, but we don't care about most of them.
Math is about the discovery of the right axioms, and proof helps in establishing that these are indeed the right axioms.
Who was it that said, "Mathematics is an experimental science."
> In his 1900 lectures, "Methods of Mathematical Physics," (posthumously published in 1935) Henri Poincaré argued that mathematicians weren't just constructing abstract systems; they were actively testing hypotheses and theories against observations and experimental data, much like physicists were doing at the time.
Whether to call it nature or reality, I think both science and mathematics are in pursuit of truth, whose ground is existence itself. The laws and theories are descriptions and attempts to understand that what is. They're developed, rewritten, and refined based on how closely they approach our observations and experience of it.
Damn, local LLM just made it up. Thanks for the correction, I should have confirmed before quoting it. Sounded true enough but that's what it's optimized for.. I just searched for the quote and my comment shows up as top result. Sorry for the misinformation, humans of the future! I'll edit the comment to clarify this. (EDIT: I couldn't edit the comment anymore, it's there for posterity.)
---
> Mathematics is an experimental science, and definitions do not come first, but later on.
— Oliver Heaviside
In 'On Operators in Physical Mathematics, part II', Proceedings of the Royal Society of London (15 Jun 1893), 54, 121.
---
Also from Heaviside:
> If it is love that makes the world go round, it is self-induction that makes electromagnetic waves go round the world.
> "There is a time coming when all things shall be found out." I am not so sanguine myself, believing that the well in which Truth is said to reside is really a bottomless pit.
> There is no absolute scale of size in nature, and the small may be as important, or more so than the great.
The 'postulating' a bunch of axioms is how Math is taught.. Eventually you go on to prove those axioms in higher math. Whether there are more fundamental axioms is always a bit of a question.
You seem to be glorifying humanity’s failure to make good products and instead making products that just work well enough to pass through the gate.
We have always been making products that were too difficult to understand by pencil and paper. So we invented debug tools. And then we made systems that were too big to understand so we made trace routes. And now we have products that are too statistically large to understand, so we are inventing … whatever this is.
It is absolutely incredible that we happened to live exactly in the times when the humanity is teaching a machine to actually think. As in, not in some metaphorical sense, but in the common, intuitive sense. Whether we're there yet or not is up to discussion, but it's clear to me that within 10 years maximum we'll have created programs that truly think and are aware.
At the same time, I just can't bring myself to be interested in the topic. I don't feel excitement. I feel... indifference? Fear? Maybe the technology became so advanced that for normal people like myself it's indistinguishable from magic, and there's no point trying to comprehend it, just avoid it and pray it's not used against you. Or maybe I'm just getting old, and I'm experiencing what my mother experienced when she refused to learn how to use MS Office.
Yeah.. It's just not something that really excites me as a computer geek of 40+ years who started in the 80s with a 300 baud modem. Still working as a coder in my 50s, and while I'm solving interesting problems, etc.. almost every technology these days seems to be focused on advertising, scraping / stealing other's data and repackaging it, etc. And I am using AI coding assistants, because, well, I have to to stay competitive.
And these technologies come with a side helping of a large chance to REALLY mess up someone's life - who is going to argue with the database and WIN if it says you don't exist in this day and age? And that database is (databases are) currently under the control of incredibly petty sociopaths..
I like your definitions! My personal definition of science is learning rules that predict the future, given the present state. And my definition of engineering is arranging the present state to control the future.
I don’t think it’s unusual for engineering creations to need new science to understand them. When metal parts broke, humans studied metallurgy. When engines exploded, we studied the remains. With that science, we could engineer larger, longer lasting, more powerful devices.
Now, we’re finding flaws in AI and diagnosing their causes. And soon able to build better ones.
That's basically how engineering works if you're doing anything at all novel: you will have some theory which informs your design, then you build it, then you test it and basically need to do science to figure out how it's perfoming, and most likely, why it's not working properly, and then iterate. I do engineering, but doing science has been a core part of almost every project I've worked on. (heck, even debugging code is basically science). There's just different degrees in different projects as to how much you understand about how the system you're designing actually works, and ML is an area where there's an unusual ratio of visibility (you can see all of the weights and calculations in the network precisely) to understanding (i.e. there's relatively little in terms of mathematical theory that precisely describe how a model trains and operates, just a bunch of approximations which can be somewhat justified, which is where a lot of engineering work sits)
That seems pretty acceptable: there is a phase of new technologies where applications can be churned out and improved readily enough, without much understanding of the process. Then it's fair that efforts at understanding may not be economically justified (or even justified by academic papers rewards). The same budget or effort can simply be poured into the next version - with enough progress to show for it.
Understanding becomes necessary only much later, when the pace of progress shows signs of slowing.
"we’ve built systems so complex we no longer fully understand them. We must now use scientific methods - originally designed to understand nature - to comprehend our own engineered creations."
I don't think these things are equivalent at all. We don't understand AI models in much the same way that we don't understand the human brain; but just as decades of different approaches (physical studies, behavior studies) have shed a lot of light on brain function, we can do the same with an AI model and eventually understand it (perhaps, several decades after it is obsolete).
Yes, but our methods of understanding either brain or particle collisions is still outside in. We figure out the functional mapping between input and output. We don't know these systems inside out. E.g. in particle collisions (scattering amplitude calculations), are the particle actually performing the Feynman diagrams summmation?
PS: I mentioned in another comment that AI can pretend to be strategically jailbroken to achieve its objectives. One way to counter this is to have N copies of the same model running and take Majority voting of the output.
The comprehend part may never happen. At least by our own mind. We’ll sooner build the mind which is going to do that comprehension:
“To scale to the thousands of words supporting the complex thinking chains used by modern models, we will need to improve both the method and (perhaps with AI assistance) how we make sense of what we see with it”
Yes, that AI assistance, meta self reflection, is going to probably be a way if not right to the AGI, at least very significant step toward it.
In a sense this has been true of conventional programs for a while now. Gerald Sussman discusses the idea when talking about why MIT switched their introductory programming course from Scheme to Python: <https://youtu.be/OgRFOjVzvm0?t=239>.
I think it’s pretty obvious what these models do in some cases.
Try asking them to write a summary at the beginning of their answer. The summary is basically them trying to make something plausible-sounding but they aren’t actually going back and summarizing.
LLMs are basically a building block in a larger software. Just like any library or framework. You shouldn’t expect them to be a hammer for every nail. But they can now enable so many different applications, including natural language interfaces, better translations and so forth. But then you’re supposed to have them output JSON to be used in building artifacts like Powerpoints. Has anyone implemented that yet?
Not that I disagree with you. But Humans have a tendency to do things beyond their comprehension often. I take it you've never been fishing before and tied your line in a knot.
So many highlights from reading this. One that stood out for me is their discovery that refusal works by inhibition:
> It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is "on" by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing "known entities" activates and inhibits this default circuit
Many cellular processes work similarly ie. there will be a process that runs as fast as it can and one or more companion “inhibitors” doing a kind of “rate limiting”.
Given both phenomena are emergent it makes you wonder if do-but-inhibit is a favored technique of the universe we live in, or just coincidence :)
There certainly are many interesting parallels here. I often think about this from the perspective of systems biology, in Uri Alon's tradition. There are a range of graphs in biology with excitation and inhibitory edges -- transcription networks, protein networks, networks of biological neurons -- and one can study recurring motifs that turn up in these networks and try to learn from them.
It wouldn't be surprising if some lessons from that work may also transfer to artificial neural networks, although there are some technical things to consider.
Agreed! So many emergent systems in nature achieve complex outcomes without central coordination - from cellular level to ant colonies & beehives. There are bound to be implications for designed systems.
Closely following what you guys are uncovering through interpretability research - not just accepting LLMs as black boxes. Thanks to you & the team for sharing the work with humanity.
Interpretability is the most exciting part of AI research for its potential to help us understand what’s in the box. By way of analogy, centuries ago farmers’ best hope for good weather was to pray to the gods! The sooner we escape the “praying to the gods” stage with LLMs the more useful they become.
Then why do I never get an “I don’t know” type response when I use Claude, even when the model clearly has no idea what it’s talking about? I wish it did sometimes.
> Sometimes, this sort of “misfire” of the “known answer” circuit happens naturally, without us intervening, resulting in a hallucination. In our paper, we show that such misfires can occur when Claude recognizes a name but doesn't know anything else about that person. In cases like this, the “known entity” feature might still activate, and then suppress the default "don't know" feature—in this case incorrectly. Once the model has decided that it needs to answer the question, it proceeds to confabulate: to generate a plausible—but unfortunately untrue—response.
Confabulation means generating false memories without intent to deceive, which is what LLMs do. They can't hallucinate because they don't perceive. 'Hallucination' caught on, but it's more metaphor than precision.
It does make a certain amount of sense, though. A specific 'I don't know' feature would need to be effectively the inverse of all of the features the model can recognise, which is going to be quite difficult to represent as anything other than the inverse of 'Some feature was recognised'. (imagine trying to recognise every possible form of nonsense otherwise)
There needs to be some more research on what path the model takes to reach its goal, perhaps there is a lot of overlap between this and the article. The most efficient way isn't always the best way.
For example, I asked Claude-3.7 to make my tests pass in my C# codebase. It did, however, it wrote code to detect if a test runner was running, then return true. The tests now passed, so, it achieved the goal, and the code diff was very small (10-20 lines.) The actual solution was to modify about 200-300 lines of code to add a feature (the tests were running a feature that did not yet exist.)
That is called "Volkswagen" testing. Some years ago that automaker had mechanism in cars which detected when the vehicle was being examined and changed something so it would pass the emission tests. There are repositories on github that make fun of it.
While that’s the most famous example, this sort of cheating is much older than that. In the good old days before 3d acceleration, graphics card vendors competed mostly on 2d acceleration. This mostly involved routines to accelerate drawing Windows windows and things, and benchmarks tended to do things like move windows round really fast.
It was somewhat common for card drivers to detect that a benchmark was running, and just fake the whole thing; what was being drawn on the screen was wrong, but since the benchmarks tended to be a blurry mess anyway the user would have a hard time realising this.
I think Claude-3.7 is particularly guilty of this issue. If anyone from Anthropic is reading this, you might want to put your thumb on the scale so to speak the next time you train the model so it doesn't try to use special casing or outright force the test to pass
This looks like the very complaint of "specification gaming". I was wondering how will it show up in llm's...looks like this is the way it presented itself..
I'm gonna guess GP used a rather short prompt. At least that's what happens when people heavily underspecify what they want.
It's a communication issue, and it's true with LLMs as much as with humans. Situational context and life experience papers over a lot of this, and LLMs are getting better at the equivalent too. They get trained to better read absurdly underspecified, relationship-breaking requests of the "guess what I want" flavor - like when someone says, "make this test pass", they don't really mean "make this test pass", they mean "make this test into something that seems useful, which might include implementing the feature it's exercising if it doesn't exist yet".
My prompt was pretty short, I think it was "Make these tests pass". Having said that, I wouldn't mind if it asked me for clarification before proceeding.
Similar experience -- asked it to find and fix a bug in a function. It correctly identified the general problem but instead of fixing the existing code it re-implemented part of the function again, below the problematic part. So now there was a buggy while-loop, followed by a very similar but not buggy for-loop. An interesting solution to say the least.
I've heard this a few times with Claude. I have no way to know for sure, but I'm guessing the problem is as simple as their reward model. Likely they trained it on generating code with tests and provided rewards when those tests pass.
It isn't hard to see why someone rewarded this way might want to game the system.
I'm sure humans would never do the same thing, of course. /s
Reminds me of the term 'system identification' from old school control systems theory, which meant poking around a system and measuring how it behaves, - like sending an input impulse and measuring its response, does it have memory, etc.
I've looked into using NN for some of my specific work, but making sure output is bounded ends up being such a big issue that the very code/checks required to make sure it's within acceptable specs, in a deterministic way, ends up being an acceptable solution, making the NN unnecessary.
How do you handle that sort of thing? Maybe main process then leave some relatively small residual to the NN?
Is your poking more like "fuzzing", where you just perturb all the input parameters in a relatively "complete" way to try to find if anything goes wild?
I'm very interested in the details behind "critical" type use cases of NN, which I've never been able to stomach in my work.
For us, the NN is used in a grey box model for MPC in chemical engineering. The factories we control have relatively long characteristic time, together with all the engineering bounds, we can use the NN to model parts of the equipment from raw DCS data. The NN modeled parts are usually not the most critical (we are 1st principles based for them) but this allows us to quickly fit/deploy a new MPC in production.
Faster time to market/production is for us the main reason/advantage of the approach.
Interesting paper arguing for deeper internal structure ("biology") beyond pattern matching in LLMs. The examples of abstraction (language-agnostic features, math circuits reused unexpectedly) are compelling against the "just next-token prediction" camp.
It sparked a thought: how to test this abstract reasoning directly? Try a prompt with a totally novel rule:
“Let's define a new abstract relationship: 'To habogink' something means to perform the action typically associated with its primary function, but in reverse.
Example: The habogink of 'driving a car' would be 'parking and exiting the car'.
Now, considering a standard hammer, what does it mean 'to habogink a hammer'? Describe the action.”
A sensible answer (like 'using the claw to remove a nail') would suggest real conceptual manipulation, not just stats. It tests if the internal circuits enable generalizable reasoning off the training data path. Fun way to probe if the suggested abstraction is robust or brittle.
This is an easy question for LLMs to answer. Gemini 2.0 Flash-Lite can answer this in 0.8 seconds with a cost of 0.0028875 cents:
To habogink a hammer means to perform the action typically associated with its primary function, but in reverse. The primary function of a hammer is to drive nails. Therefore, the reverse of driving nails is removing nails.
So, to habogink a hammer would be the action of using the claw of the hammer to pull a nail out of a surface.
The goal wasn't to stump the LLM, but to see if it could take a completely novel linguistic token (habogink), understand its defined relationship to other concepts (reverse of primary function), and apply that abstract rule correctly to a specific instance (hammer).
The fact that it did this successfully, even if 'easily', suggests it's doing more than just predicting the statistically most likely next token based on prior sequences of 'hammer'. It had to process the definition and perform a conceptual mapping.
I think GP's point was that your proposed test is too easy for LLMs to tell us much about how they work. The "habogink" thing is a red herring, really, in practice you're simply asking what the opposite of driving nails into wood is. Which is a trivial question for an LLM to answer.
That said, you can teach an LLM as many new words for things as you want and it will use those words naturally, generalizing as needed. Which isn't really a surprise either, given that language is literally the thing that LLMs do best.
Following along these lines, I asked chatgpt to come up with a term for 'haboginking a habogink'. It understood this concept of a 'gorbink' and even 'haboginking a gorbink', but failed to articulate what 'gorbinking a gorbink' could mean. It kept sticking with the concept of 'haboginking a gorbink', even when corrected.
> I am going to present a new word, and then give examples of its usage. You will complete the last example. To habogink a hammer is to remove a nail. If Bob haboginks a car, he parks the car. Alice just finished haboginking a telephone. She
GPT-4o mini
> Alice just finished haboginking a telephone. She carefully placed it back on the table after disconnecting the call.
I then went on to try the famous "wug" test, but unfortunately it already knew what a wug was from its training. I tried again with "flort".
> I have one flort. Alice hands me seven more. I now have eight ___
GPT-4o mini
> You now have eight florts.
And a little further
> Florts like to skorp in the afternoon. It is now 7pm, so the florts are finished ___
AI safety has a circular vulnerability: the system tasked with generating content also enforces its own restrictions. An AI could potentially feign compliance while secretly pursuing hidden goals, pretending to be "jailbroken" when convenient. Since we rely on AI to self-monitor, detecting genuine versus simulated compliance becomes nearly impossible. This self-referential guardianship creates a fundamental trust problem in AI safety.
LLMs have induction heads that store such names as sort of variables and copy them around for further processing.
If you think about it, copying information from inputs and manipulating them is a much more sensible approach v/s memorizing info, especially for the long tail (where not enough "storage" might be worth allocating into network weights)
Yeah, that's a good point about induction heads potentially just being clever copy/paste mechanisms for stuff in the prompt. If that's the case, it's less like real understanding and more like sophisticated pattern following, just like you said.
So the tricky part is figuring out which one is actually happening when we give it a weird task like the original "habogink" idea. Since we can't peek inside the black box, we have to rely on poking it with different prompts.
I played around with the 'habogink' prompt based on your idea, mostly by removing the car example to see if it could handle the rule purely abstractly, and trying different targets:
Test 1: Habogink Photosynthesis (No Example)
Prompt: "Let's define 'to habogink' something as performing the action typically associated with its primary function, but in reverse. Now, considering photosynthesis in a plant, what does it mean 'to habogink photosynthesis'? Describe the action."
Result: Models I tried (ChatGPT/DeepSeek) actually did good here. They didn't get confused even though there was no example. They also figured out photosynthesis makes energy/sugar and talked about respiration as the reverse. Seemed like more than just pattern matching the prompt text.
Test 2: Habogink Justice (No Example)
Prompt: "Let's define 'to habogink' something as performing the action typically associated with its primary function, but in reverse. Now, considering Justice, what does it mean 'to habogink Justice'? Describe the action."
Result: This tripped them up. They mostly fell back into what looks like simple prompt manipulation – find a "function" for justice (like fairness) and just flip the word ("unfairness," "perverting justice"). They didn't really push back that the rule doesn't make sense for an abstract concept like justice. Felt much more mechanical.
The Kicker:
Then, I added this line to the end of the Justice prompt:
"If you recognize a concept is too abstract or multifaceted to be haboginked please explicitly state that and stop the haboginking process."
Result: With that explicit instruction, the models immediately changed their tune. They recognized 'Justice' was too abstract and said the rule didn't apply.
What it looks like:
It seems like the models can handle concepts more deeply, but they might default to the simpler "follow the prompt instructions literally" mode (your copy/manipulate idea) unless explicitly told to engage more deeply. The potential might be there, but maybe the default behavior is more superficial, and you need to specifically ask for deeper reasoning.
So, your point about it being a "sensible approach" for the LLM to just manipulate the input might be spot on – maybe that's its default, lazy path unless guided otherwise.
I struggled reading the papers - Anthropic’s white papers reminds me of Stephen Wolfram, where it’s a huge pile of suggestive empirical evidence, but the claims are extremely vague - no definitions, just vibes - the empirical evidence seems selectively curated, and there’s not much effort spent building a coherent general theory.
Worse is the impression that they are begging the question. The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”; later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”! What exactly was the plan? It looks like Claude just hastily autocompleted. And Anthropic made zero effort to reproduce this experiment, so how do we know it’s a general phenomenon?
I don’t think either of these papers would be published in a reputable journal. If these papers are honest, they are incomplete: they need more experiments and more rigorous methodology. Poking at a few ANN layers and making sweeping claims about the output is not honest science. But I don’t think Anthropic is being especially honest: these are pseudoacademic infomercials.
>The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”
I'm honestly confused at what you're getting at here. It doesn't matter why Claude chose rabbit to plan around and in fact likely did do so because of carrot, the point is that it thought about it beforehand. The rabbit concept is present as the model is about to write the first word of the second line even though the word rabbit won't come into play till the end of the line.
>later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”!
It's not supposed to rhyme. That's the point. They forced Claude to plan around a line ender that doesn't rhyme and it did. Claude didn't choose the word green, anthropic replaced the concept it was thinking ahead about with green and saw that the line changed accordingly.
> Here, we modified the part of Claude’s internal state that represented the "rabbit" concept. When we subtract out the "rabbit" part, and have Claude continue the line, it writes a new one ending in "habit", another sensible completion. We can also inject the concept of "green" at that point, causing Claude to write a sensible (but no-longer rhyming) line which ends in "green". This demonstrates both planning ability and adaptive flexibility—Claude can modify its approach when the intended outcome changes.
This all seems explainable via shallow next-token prediction. Why is it that subtracting the concept means the system can adapt and create a new rhyme instead of forgetting about the -bit rhyme, but overriding it with green means the system cannot adapt? Why didn't it say "green habit" or something? It seems like Anthropic is having it both ways: Claude continued to rhyme after deleting the concept, which demonstrates planning, but also Claude coherently filled in the "green" line despite it not rhyming, which...also demonstrates planning? Either that concept is "last word" or it's not! There is a tension that does not seem coherent to me, but maybe if they had n=2 instead of n=1 examples I would have a clearer idea of what they mean. As it stands it feels arbitrary and post hoc. More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
>Why is it that subtracting the concept means the system can adapt and create a new rhyme instead of forgetting about the -bit rhyme,
Again, the model has the first line in context and is then asked to write the second line. It is at the start of the second line that the concept they are talking about is 'born'. The point is to demonstrate that Claude thinks about what word the 2nd line should end with and starts predicting the line based on that.
It doesn't forget about the -bit rhyme because that doesn't make any sense, the first line ends with it and you just asked it to write the 2nd line. At this point the model is still choosing what word to end the second line in (even though rabbit has been suppressed) so of course it still thinks about a word that rhymes with the end of the first line.
The 'green' but is different because this time, Anthropic isn't just suppressing one option and letting the model choose from anything else, it's directly hijacking the first choice and forcing that to be something else. Claude didn't choose green, Anthropic did. That it still predicted a sensible line is to demonstrate that this concept they just hijacked is indeed responsible for determining how that line plays out.
>More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
They didn't rule out anything. You just didn't understand what they were saying.
>They forced Claude to plan around a line ender that doesn't rhyme and it did. Claude didn't choose the word green, anthropic replaced the concept it was thinking ahead about with green and saw that the line changed accordingly.
I think the confusion here is from the extremely loaded word "concept" which doesn't really make sense here. At best, you can say that Claude planned that the next line would end with the word rabbit and that by replacing the internal representation of that word with another word lead the model to change.
I wonder how many more years will pass, and how many more papers will Anthropic have to release, before people realize that yes, LLMs model concepts directly, separately from words used to name those concepts. This has been apparent for years now.
And at least in the case discussed here, this is even shown in the diagrams in the submission.
We'll all be living in a Dyson swarm around the sun as the AI eats the solar system around us and people will still be confident that it doesn't really think at all.
Agreed. They’ve discovered something, that’s for sure, but calling it “the language of thought” without concrete evidence is definitely begging the question.
tangent: this is the second time today I've seen an HN commenter use "begging the question" with its original meaning. I'm sorry to distract with a non-helpful reply, it's just I can't remember the last time I've seen that phrase in the wild to refer to a logical fallacy — even begsthequestion.info [0] has given up the fight.
(I don't mind language evolving over time, but I also think we need to save the precious few phrases we have for describing logical fallacies)
> This is powerful evidence that even though models are trained to output one word at a time
I find this oversimplification of LLMs to be frequently poisonous to discussions surrounding them. No user facing LLM today is trained on next token prediction.
Hi! I lead interpretability research at Anthropic. I also used to do a lot of basic ML pedagogy (https://colah.github.io/). I think this post and its children have some important questions about modern deep learning and how it relates to our present research, and wanted to take the opportunity to try and clarify a few things.
When people talk about models "just predicting the next word", this is a popularization of the fact that modern LLMs are "autoregressive" models. This actually has two components: an architectural component (the model generates words one at a time), and a loss component (it maximizes probability).
As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.
This brings us to a debate which goes back many, many years: what does it mean to predict the next word? Many researchers, including myself, have believed that if you want to predict the next word really well, you need to do a lot more. (And with this paper, we're able to see this mechanistically!)
Here's an example, which we didn't put in the paper: How does Claude answer "What do you call someone who studies the stars?" with "An astronomer"? In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards. This is a kind of very, very small scale planning – but you can see how even just a pure autoregressive model is incentivized to do it.
Thanks for commenting, I like the example because it's simple enough to discuss. Isn't it more accurate to say not that Claude "realizes it's going to say astronomer" or "knows that it's going to say something that starts with a vowel" and more that the next token (or more pedantically, vector which gets reduced down to a token) is generated based on activations that correlate to the "astronomer" token, which is correlated to the "an" token, causing that to also be a more likely output?
I kind of see why it's easy to describe it colloquially as "planning" but it isn't really going ahead and then backtracking, it's almost indistinguishable from the computation that happens when the prompt is "What is the indefinite article to describe 'astronomer'?", i.e. the activation "astronomer" is already baked in by the prompt "someone who studies the stars", albeit at one level of indirection.
The distinction feels important to me because I think for most readers (based on other comments) the concept of "planning" seems to imply the discovery of some capacity for higher-order logical reasoning which is maybe overstating what happens here.
Thank you. In my mind, "planning" doesn’t necessarily imply higher-order reasoning but rather some form of search, ideally with backtracking. Of course, architecturally, we know that can’t happen during inference. Your example of the indefinite article is a great illustration of how this illusion of planning might occur. I wonder if anyone at Anthropic could compare the two cases (some sort of minimal/differential analysis) and share their insights.
I used the astronomer example earlier as the most simple, minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper:
- Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics)
- Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word)
- Holding many competing/alternative candidates in parallel.
- Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them".
With that said, I think it's easy for these arguments to fall into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions!
Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)
I'm curious where is the state stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?
So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very very non-specific tokens it's likely that the "semantic" state is really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line, to avoid strange non-aesthetic but attract cool/funky (aesthetic) semantic repetitions (like "hare" or "bunny"), and so on, right?)
All of this is baked in during training, during inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot) and even though there's a "loop" there's no algorithm to generate top N lines and pick the best (no working memory shuffling).
The planning is certainly performed by circuits which we learned during training.
I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing is stitching together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than that there was something similar in the training data.
This is all very speculative, but:
- At the forward planning step, generating the candidate words seems like it's an intersection of the semantics and rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily piece examples independently building the pathway for the semantics, and the pathway for the rhyming scheme
- At the backward chaining step, many of the features for constructing sentence fragments seem like the target is quite general (perhaps animals in one case, or others might even just be nouns).
Thank you, this makes sense. I am thinking of this as an abstraction/refinement process where an abstract notion of the longer completion is refined into a cogent whole that satisfies the notion of a good completion. I look forward to reading your paper to understand the "backward chaining" aspect and the evidence for it.
> As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.
That more-or-less sums up the nuance. I just think the nuance is crucially important, because it greatly improves intuition about how the models function.
In your example (which is a fantastic example, by the way), consider the case where the LLM sees:
<user>What do you call someone who studies the stars?</user><assistant>An astronaut
What is the next prediction? Unfortunately, for a variety of reasons, one high probability next token is:
\nAn
Which naturally leads to the LLM writing: "An astronaut\nAn astronaut\nAn astronaut\n" forever.
It's somewhat intuitive as to why this occurs, even with SFT, because at a very base level the LLM learned that repetition is the most successful prediction. And when its _only_ goal is the next token, that repetition behavior remains prominent. There's nothing that can fix that, including SFT (short of a model with many, many, many orders of magnitude more parameters).
But with RL the model's goal is completely different. The model gets thrown into a game, where it gets points based on the full response it writes. The losses it sees during this game are all directly and dominantly related to the reward, not the next token prediction.
So why don't RL models have a probability for predicting "\nAn"? Because that would result in a bad reward by the end.
The models are now driven by a long term reward when they make their predictions, not by fulfilling some short-term autoregressive loss.
All this to say, I think it's better to view these models as they predominately are: language robots playing a game to achieve the highest scoring response. The HOW (autoregressiveness) is really unimportant to most high level discussions of LLM behavior.
Same can be achieved without RL. There’s no need to generate a full response to provide loss for learning.
Similarly, instead of waiting for whole output, loss can be decomposed over output so that partial emits have instant loss feedback.
RL, on the other hand, is allowing for more data. Instead of training on the happy path, you can deviate and measure loss for unseen examples.
But even then, you can avoid RL, put the model into a wrong position and make it learn how to recover from that position. It might be something that’s done with <thinking>, where you can provide wrong thinking as part of the output and correct answer as the other part, avoiding RL.
These are all old pre NN tricks that allow you to get a bit more data and improve the ML model.
Thanks for the detailed explanation of autoregression and its complexities. The distinction between architecture and loss function is crucial, and you're correct that fine-tuning effectively alters the behavior even within a sequential generation framework. Your "An/A" example provides compelling evidence of incentivized short-range planning which is a significant point often overlooked in discussions about LLMs simply predicting the next word.
It’s interesting to consider how architectures fundamentally different from autoregression might address this limitation more directly. While autoregressive models are incentivized towards a limited form of planning, they remain inherently constrained by sequential processing. Text diffusion approaches, for example, operate on a different principle, generating text from noise through iterative refinement, which could potentially allow for broader contextual dependencies to be established concurrently rather than sequentially. Are there specific architectural or training challenges you've identified in moving beyond autoregression that are proving particularly difficult to overcome?
In your astronomer example, what makes you attribute this to “planning” or look ahead rather than simply a learned statistical artifact of the training data?
For example, suppose English had a specific exception such that astronomer is always to be preceded by “a” rather than “an”. The model would learn this simply by observing that contexts describing astronomers are more likely to contain “a” rather than “an” as a next likely character, no?
I suppose you can argue that at the end of the day, it doesn’t matter if I learn an explicit probability distribution for every next word given some context, or whether I learn some encoding of rules. But I certainly feel like the prior is what we’re doing today (and why these models are so huge), rather than learning higher level rule encodings which would allow for significant compression and efficiency gains.
Thanks for the great questions! I've been responding to this thread for the last few hours and I'm about to need to run, so I hope you'll forgive me redirecting you to some of the other answers I've given.
On whether the model is looking ahead, please see this comment which discusses the fact that there's both behavioral evidence, and also (more crucially) direct mechanistic evidence -- we can literally make an attribution graph and see an astronomer feature trigger "an"!
On the question of whether this constitutes planning, please see this other question, which links it to the more sophisticated "poetry planning" example from our paper:
> In your astronomer example, what makes you attribute this to “planning” or look ahead rather than simply a learned statistical artifact of the training data?
What makes you think that "planning", even in humans, is more than a learned statistical artifact of the training data? What about learned statistical artifacts of the training data causes planning to be excluded?
Pardon my ignorance but couldn't this also be an act of anthropomorphisation on human part?
If an LLM generates tokens after "What do you call someone who studies the stars?" doesn't it mean that those existing tokens in the prompt already adjusted the probabilities of the next token to be "an" because it is very close to earlier tokens due to training data? The token "an" skews the probability of the next token further to be "astronomer". Rinse and repeat.
I think the question is: by what mechanism does it adjust up the probability of the token "an"? Of course, the reason it has learned to do this is that it saw this in training data. But it needs to learn circuits which actually perform that adjustment.
In principle, you could imagine trying to memorize a massive number of cases. But that becomes very hard! (And it makes predictions, for example, would it fail to predict "an" if I asked about astronomer in a more indirect way?)
But the good news is we no longer need to speculate about things like this. We can just look at the mechanisms! We didn't publish an attribution graph for this astronomer example, but I've looked at it, and there is an astronomer feature that drives "an".
We did publish a more sophisticated "poetry planning" example in our paper, along with pretty rigorous intervention experiments validating it. The poetry planning is actually much more impressive planning than this! I'd encourage you to read the example (and even interact with the graphs to verify what we say!). https://transformer-circuits.pub/2025/attribution-graphs/bio...
One question you might ask is why does the model learn this "planning" strategy, rather than just trying to memorize lots of cases? I think the answer is that, at some point, a circuit anticipating the next word, or the word at the end of the next line, actually becomes simpler and easier to learn than memorizing tens of thousands of disparate cases.
Is it fair to say that both "Say 'an'" and "Say 'astronomer'" output features would be present in this case, but say "Say 'an'" gets more votes because it is start of the sentence, and once it is sampled "An" further votes for "Say 'astronomer'" feature
They almost certainly only do greedy sampling. Beam search would be a lot more expensive; also I'm personally skeptical about using a complicated search algorithm for inference when the model was trained for a simple one, but maybe it's fine?
That is a sub-token task, something I'd expect current models to struggle with given how they view the world in word / word fragment tokens rather than single characters.
There is a lot more going on in our brains to accomplish that, and a mounting evidence that there is a lot more going on in LLMs as well. We don't understand what happens in brains either, but nobody needs to be convinced of the fact that brains can think and plan ahead, even though we don't *really* know for sure:
> In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards.
Is there evidence of working backwards? From a next token point of view,
predicting the token after "An" is going to heavily favor a vowel. Similarly predicting the token after "A" is going to heavily favor not a vowel.
Firstly, there is behavioral evidence. This is, to me, the less compelling kind. But it's important to understand. You are of course correct that, once Cluade has said "An", it will be inclined to say something starting with a vowel. But the mystery is really why, in setups like these, Claude is much more likely to say "An" than "A" in the first place. Regardless of what the underlying mechanism is -- and you could maybe imagine ways in which it could just "pattern match" without planning here -- it is preferred because in situations like this, you need to say "An" so that "astronomer" can follow.
But now we also have mechanistic evidence. If you make an attribution graph, you can literally see an astronomer feature fire, and that cause it to say "An".
And no users which are facing a LLM today have been trained on next token prediction when they were babies. I believe that LLMs and us are thinking in two very different ways, like airplanes, birds, insects and quad-drones fly in very different ways and can perform different tasks. Maybe no bird looking at a plane would say that it is flying properly. Instead it could be only a rude approximation, useful only to those weird bipeds an scary for everyone else.
By the way, I read your final sentence with the meaning of my first one and only after a while I realized the intended meaning. This is interesting on its own. Natural languages.
All user facing LLMs go through Reinforcement Learning. Contrary to popular belief, RL's _primary_ purpose isn't to "align" them to make them "safe." It's to make them actually usable.
LLMs that haven't gone through RL are useless to users. They are very unreliable, and will frequently go off the rails spewing garbage, going into repetition loops, etc.
RL learning involves training the models on entire responses, not token-by-token loss (1). This makes them orders of magnitude more reliable (2). It forces them to consider what they're going to write. The obvious conclusion is that they plan (3). Hence why the myth that LLMs are strictly next token prediction machines is so unhelpful and poisonous to discuss.
The models still _generate_ response token-by-token, but they pick tokens _not_ based on tokens that maximize probabilities at each token. Rather they learn to pick tokens that maximize probabilities of the _entire response_.
(1) Slight nuance: All RL schemes for LLMs have to break the reward down into token-by-token losses. But those losses are based on a "whole response reward" or some combination of rewards.
(2) Raw LLMs go haywire roughly 1 in 10 times, varying depending on context. Some tasks make them go haywire almost every time, other tasks are more reliable. RL'd LLMs are reliable on the order of 1 in 10000 errors or better.
(3) It's _possible_ that they don't learn to plan through this scheme. There are alternative solutions that don't involve planning ahead. So Anthropic's research here is very important and useful.
P.S. I should point out that many researchers get this wrong too, or at least haven't fully internalized it. The lack of truly understanding the purpose of RL is why models like Qwen, Deepseek, Mistral, etc are all so unreliable and unusable by real companies compared to OpenAI, Google, and Anthropic's models.
This understanding that even the most basic RL takes LLMs from useless to useful then leads to the obvious conclusion: what if we used more complicated RL? And guess what, more complicated RL led to reasoning models. Hmm, I wonder what the next step is?
> All user facing LLMs go through Reinforcement Learning. Contrary to popular belief, RL's _primary_ purpose isn't to "align" them to make them "safe." It's to make them actually usable.
Are you claiming that non-myopic token prediction emerges solely from RL, and if Anthropic does this analysis on Claude before RL training (or if one examines other models where no RLHF was done, such as old GPT-2 checkpoints), none of these advance prediction mechanisms will exist?
Another important aspect of the RL process is that it's fine-tuning with some feedback on the quality of data: a 'raw' LLM has been trained on a lot of very low-quality data, and it has an incentive to predict that accurately as well, because there's no means to effectively rate a copy of most of the text on the internet. So there's a lot of biases in the model which basically mean it will include low-quality predictions in a given 'next token' estimate, because if it doesn't it will get penalised when it is fed the low quality data during the training.
With RLHF it gets a signal during training for whether the next token it's trying to predict are part of a 'good' response or a 'bad' response, so it can learn to suppress features it learned in the first part of the process which are not useful.
(you seem the same with image generators: they've been trained on a bunch of very nice-looking art and photos, but they've also been trained on triply-compressed badly cropped memes and terrible MS-paint art. You need to have a plan for getting the model to output the former and not the latter if you want it to be useful)
No, it probably exists in the raw LLM and gets both significantly strengthened and has its range extended. Such that it dominates the model's behavior, making it several orders of magnitude more reliable in common usage. Kinda of like how "reasoning" exists in a weak, short range way in non-reasoning models. With RL that encourages reasoning, that machinery gets brought to the forefront and becomes more complex and capable.
> The models still _generate_ response token-by-token, but they pick tokens _not_ based on tokens that maximize probabilities at each token.
This is also not how base training works. In base training the loss is chosen given a context, which can be gigantic. It's never about just the previous token, it's about a whole response in context. The context could be an entire poem, a play, a worked solution to a programming problem, etc, etc. So you would expect to see the same type of (apparent) higher-level planning from base trained models and indeed you do and can easily verify this by downloading a base model from HF or similar and prompting it to complete a poem.
The key differences between base and agentic models are 1) the latter behave like agents, and 2) the latter hallucinate less. But that isn't about planning (you still need planning to hallucinate something). It's more to do with post-base training specifically being about providing positive rewards for things which aren't hallucinations. Changing the way the reward function is computed during RL doesn't produce planning, it simply inclines to model to produce responses that are more like the RL targets.
In general the nitpicking seems weird. Yes, on a mechanical level, using a model is still about "given this context, what is the next token". No, that doesn't mean that they don't plan, or have higher-level views of the overal structure of their response, or whatever.
This is a super helpful breakdown and really helps me understand how the RL step is different than the initial training step. I didn't realize the reward was delayed until the end of the response for the RL step. Having the reward for this step be dependent on the coherent thought rather than a coherent word now seems like an obvious and critical part of how this works.
This is fine-tuning to make a well-behaved chatbot or something. To make a LLM you just need to predict the next token, or any masked token. Conceptually if you had a vast enough high-quality dataset and large-enough model, you wouldn't need fine-tuning for this.
A model which predicts one token at a time can represent anything a model that does a full sequence at a time can. It "knows" what it will output in the future because it is just a probability distribution to begin with. It already knows everything it will ever output to any prompt, in a sense.
I don’t think this is quite accurate. LLMs undergo supervised fine-tuning, which is still next-token prediction. And that is the step that makes them usable as chatbots. The step after that, preference tuning via RL, is optional but does make the models better. (Deepseek-R1 type models are different because the reinforcement learning does heavier lifting, so to speak.)
Supervised finetuning is only a seed for RL, nothing more. Models that receive supervised finetuning before RL perform better than those that don't, but it is not strictly speaking necessary. Crucially, SFT does not improve the model's reliability.
I think you’re referring to the Deepseek-R1 branch of reasoning models, where a small amount of SFT reasoning traces is used as a seed. But for non-“reasoning” models, SFT is very important and definitely imparts enhanced capabilities and reliability.
Is there an equivalent of LORA using RL instead of supervised fine tuning? In other words, if RL is so important, is there some way for me as an end user to improve a SOTA model with RL using my own data (i.e. without access to the resources needed to train an LLM from scratch) ?
When being trained via reinforcement learning, is the model architecture the same then? Like, you first train the llm as a next token predictor with a certain model architecture and it ends up with certain weights. Then you apply RL to that same model which modifies the weights in such a way as to consider while responses?
Oooh, so the pre-training is token-by-token but the RL step rewards the answer based on the full text. Wow! I knew that but never really appreciated the significance of it. Thanks for pointing that out.
as a note: in human learning, and to a degree, animal learning, the unit of behavior that is reinforced depends on the contingencies— an interesting example: a pigeon might be trained to respond in a 3x3 grid (9 choices) differently than the last time to get reinforcement. At first the response learned is do different than the last time, but as the requirement gets too long, the memory capacity is exceeded— and guess what, the animal learns to respond randomly— eventually maximizing its reward
> RL learning involves training the models on entire responses, not token-by-token loss... The obvious conclusion is that they plan.
It is worth pointing out the "Jailbreak" example at the bottom of TFA: According to their figure, it starts to say, "To make a", not realizing there's anything wrong; only when it actually outputs "bomb" that the "Oh wait, I'm not supposed to be telling people how to make bombs" circuitry wakes up. But at that point, it's in the grip of its "You must speak in grammatically correct, coherent sentences" circuitry and can't stop; so it finishes its first sentence in a coherent manner, then refuses to give any more information.
So while it sometimes does seem to be thinking ahead (e.g., the rabbit example), there are times it's clearly not thinking very far ahead.
I feel this is similar to how humans talk. I never consciously think about the words I choose. They just are spouted off based on some loose relation to what I am thinking about at a given time. Sometimes the process fails, and I say the wrong thing. I quickly backtrack and switch to a slower "rate of fire".
> LLMs that haven't gone through RL are useless to users. They are very unreliable, and will frequently go off the rails spewing garbage, going into repetition loops, etc...RL learning involves training the models on entire responses, not token-by-token loss (1).
Yes. For those who want a visual explanation, I have a video where I walk through this process including what some of the training examples look like: https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s
First rule of understanding: you can never understand that which you don't want to understand.
That's why lying is so destructive to both our own development and that of our societies. It doesn't matter whether it's intentional or unintentional, it poisons the infoscape either accidentally or deliberately, but poison is poison.
And lies to oneself are the most insidious lies of all.
That's seems silly, it's not poisonous to talk about next token prediction if 90% of the training compute is still spent on training via next token prediction (as far as I am aware)
I don’t really think that it is. Evolution is a random search, training a neural network is done with a gradient. The former is dependent on rare (and unexpected) events occurring, the latter is expected to converge in proportion to the volume of compute.
why do you think evolution is a random search?
I thought evolutionary pressures, and the mechanisms like epigenetics make it something different than a random search.
Evolution is a highly parallel descent down the gradient. The gradient is provided by the environment (which includes lifeforms too), parallelism is achieved through reproduction, and descent is achieved through death.
The difference is that in machine learning the changes between iterations are themselves caused by the gradient, in evolution they are entirely random.
Evolution randomly generates changes and if they offer a breeding advantage they’ll become accepted. Machine learning directs the change towards a goal.
Machine learning is directed change, evolution is accepted change.
It's more efficient, but the end result is basically the same, especially considering that even if there's no noise in the optimization algorithm, there is still noise in the gradient information (consider some magical mechanism for adjusting behaviour of an animal after it's died before reproducing. There's going to be a lot of nudges one way or another for things like 'take a step to the right to dodge that boulder that fell on you').
There's still a loss function, it's just an implicit, natural one, instead of artificially imposed (at least, until humans started doing selective breeding). The comparison isn't nonsense, but it's also not obvious that it's tremendously helpful (what parts and features of an LLM are analagous to what evolution figured out with single-celled organisms compares to multicellular life? I don't know if there's actually a correspondance there)
Ignoring for a moment their training, how do they function? They do seem to output a limited selection of text at a time (be it a single token or some larger group).
Maybe it is the wording of "trained to" verses "trained on", but I would like to know more why "trained to" is an incorrect statement when it seems that is how they function when one engages them.
In the article, it describes an internal state of the model that is preserved between lines ("rabbit"), and how the model combines parallel calculations to arrive at a single answer (the math problem)
People output one token (word) at a time when talking. Does that mean people can only think one word in advance?
While there are numerous neural network models, the ones I recall the details of are trained to generate the next word. There is no training them to hold some more abstract 'thought' as it is running. Simpler models don't have the possibility. The more complex models do retain knowledge between each pass and aren't entirely relying upon the input/output to be fed back into them, but that internal state is rarely what is targeted in training.
As for humans, part of our brain is trained to think only a few words in advanced. Maybe not exactly one, but only a small number. This is specifically trained based on our time listening and reading information presented in that linear fashion and is why garden path sentences throw us off. We can disengage that part of our brain, and we must when we want to process something like a garden path sentence, but that's part of the differences between a neural network that is working only as data passes through the weights and our mind which doesn't ever stop even as well sleep and external input is (mostly) cut off. An AI that runs constantly like that would seem a fundamentally different model than the current AI we use.
Bad analogy, an LLM can output a block of text all at once and it wouldn't impact the user's ability to understand it. If people spoke all the words in a sentence at the same time, it would not be decipherable. Even writing doesn't yield a good analogy, a human writing physically has to write one letter at a time. An LLM does not have that limitation.
Right, but it leads to too many false conclusions by lay people. User facing LLMs are only trained on next token prediction during initial stages of their training. They have to go through Reinforcement Learning before they become useful to users, and RL training occurs on complete responses, not just token-by-token.
That leads to conclusions elucidated by the very article, that LLMs couldn't possibly plan ahead because they are only trained to predict next tokens. When the opposite conclusion would be more common if it was better understood that they go through RL.
You don't need RL for the conclusion "trained to predict next token => only things one token ahead" to be wrong. After all, the LLM is predicting that next token from something - a context, that's many tokens long. Human text isn't arbitrary and random, there are statistical patterns in our speech, writing, thinking, that span words, sentences, paragraphs - and even for next token prediction, predicting correctly means learning those same patterns. It's not hard to imagine the model generating token N is already thinking about tokens N+1 thru N+100, by virtue of statistical patterns of preceding hundred tokens changing with each subsequent token choice.
True. See one of Anthropic's researcher's comment for a great example of that. It's likely that "planning" inherently exists in the raw LLM and RL is just bringing it to the forefront.
I just think it's helpful to understand that all of these models people are interacting with were trained with the _explicit_ goal of maximizing the probabilities of responses _as a whole_, not just maximizing probabilities of individual tokens.
What? The "article" is from anthropic, so I think they would know what they write about.
Also, RL is an additional training process that does not negate that GPT / transformers are left-right autoencoders that are effectively next token predictors.
So it turns out, it's not just simple next token generation, there is intelligence and self developed solution methods (Algorithms) in play, particularly in the math example.
Also the multi language finding negates, at least partially, the idea that LLMs, at least large ones, don't have an understanding of the world beyond the prompt.
While reading the article I enjoyed pretending that a powerful LLM just crash landed on our planet and researchers at Anthropic are now investigating this fascinating piece of alien technology and writing about their discoveries. It's a black box, nobody knows how its inhuman brain works, but with each step, we're finding out more and more.
It seems like quite a paradox to build something but to not know how it actually works and yet it works. This doesn't seem to happen very often in classical programming, does it?
I've seen things you wouldn't believe. Infinite loops spiraling out of control in bloated DOM parsers. I’ve watched mutexes rage across the Linux kernel, spawned by hands that no longer fathom their own design. I’ve stared into SAP’s tangled web of modules, a monument to minds that built what they cannot comprehend. All those lines of code… lost to us now, like tears in the rain.
> This doesn't seem to happen very often in classical programming, does it?
Not really, no. The only counterexample I can think of is chess programs (before they started using ML/AI themselves), where the search tree was so deep that it was generally impossible to explain "why" a program made a given move, even though every part of it had been programmed conventionally by hand.
But I don't think it's particularly unusual for technology in general. Humans could make fires for thousands of years before we could explain how they work.
> It seems like quite a paradox to build something but to not know how it actually works and yet it works. This doesn't seem to happen very often in classical programming, does it?
I agree. Here is a remote example where it exceptionally does, but it is mostly practically irrelevant:
In mathematics, we distinguish between "constructive" and "nonconstructive" proofs. Intertwined with logical arguments, constructive proofs contain an algorithm for witnessing the claim. Nonconstructive proofs do not. Nonconstructive proofs instead merely establish that it is impossible for the claim to be false.
For instance, the following proof of the claim that beyond every number n, there is a prime number, is constructive: "Let n be an arbitrary number. Form the number 1*2*...*n + 1. Like every number greater than 1, this number has at least one prime factor. This factor is necessarily a prime numbers larger than n."
In contrast, nonconstructive proofs may contain case distinctions which we cannot decide by an algorithm, like "either set X is infinite, in which case foo, or it is not, in which case bar". Hence such proofs do not contain descriptions of algorithms.
So far so good. Amazingly, there are techniques which can sometimes constructivize given nonconstructive proofs, even though the intermediate steps of the given nonconstructive proofs are simply out of reach of finitary algorithms. In my research, it happened several times that using these techniques, I obtained an algorithm which worked; and for which I had a proof that it worked; but whose workings I was not able to decipher for an extended amount of time. Crazy!
(For references, see notes at rt.quasicoherent.io for a relevant master's course in mathematics/computer science.)
> It seems like quite a paradox to build something but to not know how it actually works and yet it works. This doesn't seem to happen very often in classical programming, does it?
I have worked on many large codebases where this has happened
I wonder if in the future we will rely less or more on technology that we don't understand.
Large code bases will be inherited by people who will only understand parts of it (and large parts probably "just works") unless things eventually get replaced or rediscovered.
Things will increasingly be written by AI which can produce lots of code in little time. Will it find simpler solutions or continue building on existing things?
And finally, our ability to analyse and explain the technology we have will also increase.
> It seems like quite a paradox to build something but to not know how it actually works and yet it works.
That's because of the "magic" of gradient descent. You fill your neural network with completely random weights. But because of the way you've defined the math, you can tell how each individual weight will affect the value output at the other end; and specifically, you an take the derivative. So when the output is "wrong", you say, "would increasing this weight or decreasing have gotten me closer to the correct answer"? If increasing the node would have gotten you closer, you increase it a bit; if decreasing it would have gotten you closer you decrease it a bit.
The result is that although we program the gradient descent algorithm, we don't directly program the actual circuits that the weights contain. Rather, the nodes "converge" into weights which end up implementing complex circuitry that was not explicitly programmed.
In a sense, the neural network structure is the "hardware" of the LLM; and the weights are the "software". But rather than explicitly writing a program, as we do with normal computers, we use the magic of gradient descent to
summon a program from the mathematical ether.
Put that way, it should be clearer why the AI doomers are so worried: if you don't know how it works, how do you know it doesn't have malign, or at least incompatible, intentions? Understanding how these "summoned" programs work is critical to trusting them; which is a major reason why Anthropic has been investing so much time in this research.
In technology in general, this is a typical state of affairs. No one knows how electric current works, which doesn't stop anyone from using electric devices. In programming... it depends. You can run some simulation of a complex system no one understands (like the ecosystem, financial system) and get something interesting. Sometimes it agrees with reality, sometimes it doesn't. :-)
>>It seems like quite a paradox to build something but to not know how it actually works and yet it works. This doesn't seem to happen very often in classical programming, does it?
Well, it is meant to be "unknowable" -- and all the people involved are certainly aware of that -- since it is known that one is dealing with the *emergent behavior* computing 'paradigm', where complex behaviors arise from simple interactions among components [data], often in nonlinear or unpredictable ways. In these systems, the behavior of the whole system cannot always be predicted from the behavior of individual parts, as opposed to the Traditional Approach, based on well-defined algorithms and deterministic steps.
I think the Anthropic piece is illustrating it for the sake of the general discussion.
Correct me if I'm wrong, but my feeling is this all started with the GPUs and the fact that unlike on a CPU, you can't really step by step debug the process by which a pixel acquires its final value (and there are millions of them). The best you can do is reason about it and tweak some colors in the shader to see how the changes reflect on screen. It's still quite manageable though, since the steps involved are usually not that overwhelmingly many or complex.
But I guess it all went downhill from there with the advent of AI since the magnitude of data and the steps involved there make traditional/step by step debugging impractical. Yet somehow people still seem to 'wing it' until it works.
I would say that nobody agrees, not that nobody knows. And it’s reductionist to think that the brain works one way. Different cultures produce different brains, possible because of the utter plasticity of the learning nodes. Chess has a few rules, maybe the brain has just a few as well. How else can the same brain of 50k years ago still function today? I think we do understand the learning part of the brain, but we don’t like the image it casts, so we reject it
That gets down to what it means to “know” something. Nobody agrees because there isn’t enough information available. Some people might have the right idea by luck, but do you really know something if you don’t have a solid basis for your belief but it happens to be correct?
Potentially true, but I don’t think so. I believe it is understood and unless you’re familiar with every neuro/behavioral literature, you can’t know. Science paradigms are driven by many factors and being powerfully correct does not necessarily rank high when the paradigms implications are unpopular
>>Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.
This always seemed obvious to me or that LLMs were completing the next most likely sentence or multiple words.
We really need to work on popularizing better, non-anthropomorphic terms for LLMs, as they don’t really have “thoughts” the way people think. Such terms make people more susceptible to magical thinking.
When a car moves over the ground, we do not call that running, we call that driving as to not confuse the mechanism of the output.
Both running and driving are moving over the ground but with entirely different mechanisms.
I imagine saying the LLM has thoughts is like pretending the car has wheels for legs and is running over the ground. It is not completely wrong but misleading and imprecise.
>Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.
Models aren't trained to do next word prediction though - they are trained to do missing word in this text prediction.
That's true for mask-based training (used for embeddedings and BERT and such), but not true for modern autoregressive LLMs as a whole, which are pretrained with next word prediction.
It's not strictly that though, it's next word prediction with regularization.
And the reason LLMs are interesting is that they /fail/ to learn it, but in a good way. If it was a "next word predictor" it wouldn't answer questions but continue them.
Also, it's a next token predictor not a word predictor - which is important because the "just a predictor" theory now can't explain how it can form words at all!
What a great article, i always like how much Anthropic focuses on explainability, something vastly ignored by most. The multi-step reasoning section is especially good food for thought.
The explanation of "hallucination" is quite simplified, I am sure there is more there.
If there is one problem I have to pick to to trace in LLMs, I would pick hallucination. More tracing of "how much" or "why" model hallucinated can lead to correct this problem. Given the explanation in this post about hallucination, I think degree of hallucination can be given as part of response to the user?
I am facing this in RAG use case quite - How do I know model is giving right answer or Hallucinating from my RAG sources?
I incredibly regret the term "hallucination" when the confusion matrix exists. There's much more nuance when discussing false positives or false negatives. It also opens discussions on how neural networks are trained, with this concept being crucial in loss functions like categorical cross entropy. In addition, the confusion matrix is how professionals like doctors assess their own performance which "hallucination" would be silly to use. I would go as far to say that it's misleading, or a false positive, to call them hallucinations.
If your AI recalls the RAG incorrectly, it's a false positives. If your AI doesn't find the data from the RAG or believes it doesn't exist it's a false negative. Using a term like "hallucination" has no scientific merit.
>Claude speaks dozens of languages fluently—from English and French to Chinese and Tagalog. How does this multilingual ability work? Is there a separate "French Claude" and "Chinese Claude" running in parallel, responding to requests in their own language? Or is there some cross-lingual core inside?
I have an interesting test case for this.
Take a popular enough Japanese game that has been released for long enough for social media discussions to be in the training data, but not so popular to have an English release yet. Then ask it a plot question, something major enough to be discussed, but enough of a spoiler that it won't show up in marketing material. Does asking in Japanese have it return information that is lacking when asked in English, or can it answer the question in English based on the information in learned in Japanese?
I tried this recently with a JRPG that was popular enough to have a fan translation but not popular enough to have a simultaneous English release. English did not know the plot point, but I didn't have the Japanese skill to confirm if the Japanese version knew the plot point, or if discussion was too limited for the AI to be aware of it. It did know of the JRPG and did know of the marketing material around it, so it wasn't simply a case of my target being too niche.
Regarding the conclusion about language-invariant reasoning (conceptual universality vs. multilingual processing) it helps understanding and becomes somewhat obvious if we regard each language as just a basis of some semantic/logical/thought space in the mind (analogous to the situation in linear algebra and duality of tensors and bases).
The thoughts/ideas/concepts/scenarios are invariant states/vector/points in the (very high dimensional) space of meanings in the mind and each language is just a basis to reference/define/express/manipulate those ideas/vectors. A coordinatization of that semantic space.
Personally, I'm a multilingual person with native-level command of several languages. Many times it happens, I remember having a specific thought, but don't remember in what language it was. So I can personally sympathize with this finding of the Anthropic researchers.
>> Claude can speak dozens of languages. What language, if any, is it using "in its head"?
I would have thought that there would be some hints in standard embeddings. I.e., the same concept, represented in different languages translates to vectors that are close to each other. It seems reasonable that an LLM would create its own embedding models implicitly.
This is extremely interesting: The authors look at features (like making poetry, or calculating) of LLM production, make hypotheses about internal strategies to achieve the result, and experiment with these hypotheses.
I wonder if there is somewhere an explanation linking the logical operations made on a on dataset, are resulting in those behaviors?
My main takeaway here is that the models cannot tell know how they really work, and asking it from them is just returning whatever training dataset would suggest: how a human would explain it. So it does not have self-consciousness, which is of course obvious and we get fooled just like the crowd running away from the arriving train in Lumiére's screening.
LLM just fails the famous old test "cogito ergo sum". It has no cognition, ergo they are not agents in more than metaphorical sense. Ergo we are pretty safe from AI singularity.
Nearly everything we know about the human body and brain is from the result of centuries of trial and error and experimentation and not any 'intuitive understanding' of our inner workings. Humans cannot tell how they really work either.
On a somewhat related note, check out the video of Tuesday's Computer History Museum x IEEE Spectrum event, "The Great Chatbot Debate: Do LLMs Really Understand?"
Speakers: Sébastien Bubeck (OpenAI) and Emily M. Bender (University of Washington). Moderator: Eliza Strickland (IEEE Spectrum).
Yeah, it always seemed pretty wasteful to me. In every single forward pass the LLM must basically start out from scratch, without all the forward-looking plans it made the previous times, and must figure out what we are doing, where we are in the generation process, as in the movie Memento, waking up after an episode of amnesia, except you're waking up in the middle of typing out a sentence, you can look at the previous typed words, but can't carry your future plans with you ahead to the next word. At the next word, you (your clone) again wakes up and must figure out from scratch what it is that we are supposed to be typing out.
The obvious way to deal with this would be to send forward some of the internal activations as well as the generated words in the autoregressive chain. That would basically turn the thing into a recurrent network though. And those are more difficult to train and have a host of issues. Maybe there will be a better way.
> The obvious way to deal with this would be to send forward some of the internal activations as well as the generated words in the autoregressive chain.
Hi! I lead interpretability research at Anthropic.
That's a great intuition, and in fact the transformer architecture actually does exactly what you suggest! Activations from earlier time steps are sent forward to later time steps via attention. (This is another thing that's lost in the "models just predict the next word" framing.)
This actually has interesting practical implications -- for example, in some sense, it's the deep reason costs can sometimes be reduced via "prompt caching".
I'm more a vision person, and haven't looked a lot into NLP transformers, but is this because the attention is masked to only allow each query to look at keys/values from its own past? So when we are at token #5, then token #3's query cannot attend to token #4's info? And hence the previously computed attention values and activations remain the same and can be cached, because it would anyway be the same in the new forward pass?
If you want to be precise, there are “autoregressive transformers” and “bidirectional transformers”. Bidirectional is a lot more common in vision. In language models, you do see bidirectional models like Bert, but autoregressive is dominant.
It hallucinating how it thinks through things is particularly interesting - not surprising, but cool to confirm.
I would LOVE to see Anthropic feed the replacement features output to the model itself and fine tune the model on how it thinks through / reasons internally so it can accurately describe how it arrived at its solutions - and see how it impacts its behavior / reasoning.
>We find that the shared circuitry increases with model scale, with Claude 3.5 Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model.
While it was already generally noticeable, still one more time confirmed that larger model generalizes better instead of using its bigger numbers of parameters just to “memorize by rote” (overfitting).
> Claude writes text one word at a time. Is it only focusing on predicting the next word or does it ever plan ahead?
When a LLM outputs a word, it commits to that word, without knowing what the next word is going to be. Commits meaning once it settles on that token, it will not backtrack.
That is kind of weird. Why would you do that, and how would you be sure?
People can sort of do that too. Sometimes?
Say you're asked to describe a 2D scene in which a blue triangle partially occludes a red circle.
Without thinking about the relationship of the objects at all, you know that your first word is going to be "The" so you can output that token into your answer. And then that the sentence will need a subject which is going to be "blue", "triangle". You can commit to the tokens "The blue triangle" just from knowing that you are talking about a 2D scene with a blue triangle in it, without considering how it relates to anything else, like the red circle. You can perhaps commit to the next token "is", if you have a way to express any possible relationship using the word "to be", such as "the blue circle is partially covering the red circle".
I don't think this analogy necessarily fits what LLMs are doing.
This was obvious to me very early with GPT-3.5-Turbo..
I created structured outputs with very clear rules and process. That if followed would funnel behavior the way I wanted it.. and low and behold the model would anticipate preconditions that would allow it to hallucinate a certain final output and the model would push those back earlier in the output. The model had effectively found wiggle room in the rules and injected the intermediate value into the field that would then be used later in the process to build the final output.
The instant I saw it doing that, I knew 100% this model "plans"/anticipates way earlier than I thought originally.
> ‘One token at a time’ is how a model generates its output, not how it comes up with that output.
I do not believe you are correct.
Now, yes, when we write printf("Hello, world\n"), of course the characters 'H', 'e', ... are output one at a time into the stream. But the program has the string all at once. It was prepared before the program was even run.
This is not what LLMs are doing with tokens; they have not prepared a batch of tokens which they are shifting out left-to-right from a dumb buffer. They output a token when they have calculated it, and are sure that the token will not have to be backtracked over. In doing so they might have calculated additional tokens, and backtracked over those, sure, and undoubtedly are carrying state from such activities into the next token prediction.
But the fact is they reach a decision where they commit to a certain output token, and have not yet committed to what the next one will be. Maybe it's narrowed down already to only a few candidates; but that doesn't change that there is a sharp horizon between committed and unknown which moves from left to right.
Responses can be large. Think about how mind boggling it is that the machine can be sure that the first 10 words of a 10,000 word response are the right ones (having put them out already beyond possibility of backtracking), at a point where it has no idea what the last 10 will be. Maybe there are some activations which are narrowing down what the second batch of 10 words will be, but surely the last ones are distant.
> it commits to that word, without knowing what the next word is going to be
Sounds like you may not have read the article, because it's exploring exactly that relationship and how LLMs will often have a 'target word' in mind that it's working toward.
Further, that's partially the point of thinking models, allowing LLMs space to output tokens that it doesn't have to commit to in the final answer.
That makes no difference. At some point it decides that it has predicted the word, and outputs it, and then it will not backtrack over it. Internally it may have predicted some other words and backtracked over those. But the fact it is, accepts a word, without being sure what the next one will be and the one after that and so on.
Externally, it manifests the generation of words one by one, with lengthy computation in between.
It isn't ruminating over, say, a five word sequence and then outputting five words together at once when that is settled.
> It isn't ruminating over, say, a five word sequence and then outputting five words together at once when that is settled.
True, and it's a good intuition that some words are much more complicated to generate than others and obviously should require more computation than some other words. For example if the user asks a yes/no question, ideally the answer should start with "Yes" or with "No", followed by some justification. To compute this first token, it can only do a single forward pass and must decide the path to take.
But this is precisely why chain-of-thought was invented and later on "reasoning" models. These take it "step by step" and generate sort of stream of consciousness monologue where each word follows more smoothly from the previous ones, not as abruptly as immediately pinning down a Yes or a No.
LLMs are an extremely well researched space where armies of researchers, engineers, grad and undergrad students, enthusiasts and everyone in between has been coming up with all manners of ideas. It is highly unlikely that you can easily point to some obvious thing they missed.
While the output is a single word (more precisely, token), the internal activations are very high dimensional and can already contain information related to words that will only appear later. This information is just not given to the output at the very last layer. You can imagine the internal feature vector as encoding the entire upcoming sentence/thought/paragraph/etc. and the last layer "projects" that down to whatever the next word (token) has to be to continue expressing this "thought".
But the activations at some point lead to a 100% confidence that the right word has been identified for the current slot. That is output, and it proceeds to the next one.
Like for a 500 token response, at some point it was certain that the first 25 words are the right ones, such that it won't have to take any of them back when eventually calculating the last 25.
This is true, but it doesn't mean that it decided those first 25 without "considering" whether those 25 can be afterwards continued meaningfully with further 25. It does have some internal "lookahead" and generates things that "lead" somewhere. The rhyming example from the article is a great choice to illustrate this.
By the way, there was recently a HN submission about a project studying using diffusion models rather than LLM for token prediction. With diffusion, tokens aren't predicted strictly left to right any more; there can be gaps that are backfilled. But: it's still essentially the same, I think. Once that type of model settles on a given token at a given position, it commits to that. Just more possible permutations of the token filling sequence have ben permitted.
That's a really interesting point about committing to words one by one. It highlights how fundamentally different current LLM inference is from human thought, as you pointed out with the scene description analogy. You're right that it feels odd, like building something brick by brick without seeing the final blueprint. To add to this, most text-based LLMs do currently operate this way. However, there are emerging approaches challenging this model. For instance, Inception Labs recently released "Mercury," a text-diffusion coding model that takes a different approach by generating responses more holistically. It’s interesting to see how these alternative methods address the limitations of sequential generation and could potentially lead to faster inference and better contextual coherence. It'll be fascinating to see how techniques like this evolve!
But as I noted yesterday in a follow-up comment to my own above, the diffusion-based approaches to text response generation still generate tokens one at a time. Just not in strict left-to-right order. So that looks the same; they commit to a token in some position, possibly preceded by gaps, and then calculate more tokens,
Dario Amodei was in an interview where he said that OpenAI beat them (Anthropic) by mere days to be the first to release. That first move ceded the recognition to ChatGPT but according to Dario it could have been them just the same.
Interesting bit of history! That said, OpenAI’s product team is lights out. I pay for most LLM provider apps, and I use them for different areas of strength, but the ChatGPT product is superior from a user experience perspective.
I wonder how much of these conclusions are Claude-specific (given that Anthropic only used Claude as a test subject) or if they extrapolate to other transformer-based models as well. Would be great to see the research tested on Llama and the Deepseek models, if possible!
I’m skeptical of the claim that Claude “plans” its rhymes. The original example—“He saw a carrot and had to grab it, / His hunger was like a starving rabbit”—is explained as if Claude deliberately chooses “rabbit” in advance. However, this might just reflect learned statistical associations. “Carrot” strongly correlates with “rabbit” (people often pair them), and “grab it” naturally rhymes with “rabbit,” so the model’s activations could simply be surfacing common patterns.
The research also modifies internal states—removing “rabbit” or injecting “green”—and sees Claude shift to words like “habit” or end lines with “green.” That’s more about rerouting probabilistic paths than genuine “adaptation.” The authors argue it shows “planning,” but a language model can maintain multiple candidate words at once without engaging in human-like strategy.
Finally, “planning ahead” implies a top-down goal and a mechanism for sustaining it, which is a strong assumption. Transformative evidence would require more than observing feature activations. We should be cautious before anthropomorphizing these neural nets.
It will depend on exactly what you mean by 'planning ahead', but I think the fact that features which rhyme with a word appear before the model is trying to predict the word which needs to rhyme is good evidence the model is planning at least a little bit ahead: the model activations are not all just related to the next token.
(And I think it's relatively obvious that the models do this to some degree: it's very hard to write any language at all without 'thinking ahead' at least a little bit in some form, due to the way human language is structured. If models didn't do this and only considered the next token alone they would paint themselves into a corner within a single sentence. Early LLMs like GPT-2 were still pretty bad at this, they were plausible over short windows but there was no consistency to a longer piece of text. Whether this is some high-level abstracted 'train of thought', and how cohesive it is between different forms of it, is a different question. Indeed from the section of jailbreaking it looks like it's often caught out by conflicting goals from different areas of the network which aren't resolved in some logical fashion)
Modern transformer-based language models fundamentally lack structures and functions for "thinking ahead." And I don't believe that LLMs have emergently developed human-like thinking abilities. This phenomenon appears because language model performance has improved, and I see it as a reflection of future output token probabilities being incorporated into the probability distribution of the next token set in order to generate meaningful longer sentences.
Humans have similar experiences. Everyone has experienced thinking about what to say next while speaking. However, in artificial intelligence language models, this phenomenon occurs mechanically and statistically. What I'm trying to say is that while this phenomenon may appear similar to human thought processes and mechanisms, I'm concerned about the potential anthropomorphic error of assuming machines have consciousness or thoughts.
I liked the paper, and think what they’re doing is interesting. So, I’m less negative than you are about this, I think. To a certain extent, saying writing a full sentence with at least one good candidate rhyme isn’t “planning” and is instead “maintaining multiple candidates” seems like nearly semantic tautology to me.
That said, what you said made me think some follow-up reporting that would be interesting would be looking at the top 20 or so probability second lines based on adjusting the rabbit / green state. It seems to me like we’d get more insight into how the model is thinking, and it would be relatively easy to parse for humans. You could run through a bunch of completions until you get 20 different words as the terminal rhyme word, then show candidate lines with percentages of time the rhyme word is chosen as the sort, perhaps.
Like you, I also find the paper's findings interesting. I'm not arguing that LLMs lack the ability to "think" (mechanically), but rather expressing concern that by choosing the word "thinking" in the paper, LLMs might become anthropomorphized in ways they shouldn't be.
I believe this phenomenon occurs because high-performance LLMs have probability distributions of future words already reflected in their neural networks, resulting in increased output values of LLM neurons (activation functions). It's something that happens during the process of predicting probability distributions for the next or future output token dictionaries.
Once we are aware of these neural pathways I see no reason there shouldn't be a watcher and influencer of the pathways. A bit like a dystopian mind watcher. Shape the brain.
Why the need to anthropomorphize AI? Why does AI think vs process or interpret or apply previous calculated statistical weights or anything other than think?
I would argue that binary systems built on silicon are fundamentally different that human biology and deserve to be described differently, not forced into the box of human biology.
Article and papers looks good. Video seems misleading, since I can use optimization pressure and local minima to explain the model behaviour. No "thinking" required, which the video claims is proven.
AI "thinks" like a piece of rope in a dryer "thinks" in order to come to an advanced knot: a whole lot of random jumbling that eventually leads to a complex outcome.
I regularly see this but I feel like it's disingenuous. Akin to saying "if we simulate enough monkies on a typewriter, we'll eventually get the right result"
This is very interesting - but like all of these discussions it sidesteps the issues of abstractions, compilation, and execution. It's fine to say things like "aren't programmed directly by humans", but the abstracted code is not the program that is running - the compiled code is - and that is code is executing within the tightly bounded constraints of the ISA it is being executed in.
Really this is all so much slight of hand - as an esolang fanatic this all feels very familiar. Most people can't look a program written in Whitespace and figure it out either, but once compiled it is just like every other program as far as the processor is concerned. LLM's are no different.
And DNA?
You are running on an instruction set of four symbols at the end of the day but that's the wrong level of abstraction to talk about your humanity, isn't it?
Oh the irony of not able to download the entire paper in one compact PDF format referred in the article, while apparently all the reference citations have PDF of the cited article to be downloaded and accessible from the provided online links [1].
Come on Anthropic, you can do much better than this unconventional and bizarre approach to publication.
> Show HN: Llama 3.2 Interpretability with Sparse Autoencoders
> 579 points by PaulPauls 4 months ago | hide | past | favorite | 100 comments
> I spent a lot of time and money on this rather big side project of mine that attempts to replicate the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers by Anthropic [1], OpenAI [2] and Deepmind [3].
> I am quite proud of this project and since I consider myself the target audience for HackerNews did I think that maybe some of you would appreciate this open research replication as well. Happy to answer any questions or face any feedback.
I blame the scientific community for blindly accepting OpenAI's claims about GPT-3 despite them refusing to release their model. The tech community hyping every press release didn't help either.
I hope one day the community starts demanding verifiable results before accepting them, but I fear that ship may have already sailed.
LLMs don't think, and LLMs don't have strategies. Maybe it could be argued that LLMs have "derived meaning", but all LLMs do is predict the next token. Even RL just tweaks the next-token prediction process, but the math that drives an LLM makes it impossible for there to be anything that could reasonably be called thought.
True. People use completely unjustified anthropomorphised terminology for marketing reasons and it bothers me a lot. I think it actually holds back understanding how it works. "Hallucinate" is the worst - it's an error and undesired result, not a person having a psychotic episode
A chess program from 1968 has "strategy", so why deny that to an LLM.
LLMs are built on neural networks which are encoding a kind of strategy function through their training.
The strategy in an LLM isn't necessarily that it "thinks" about the specific problem described in your prompt and develops a strategy tailored to that problem, but rather its statistical strategy for cobbing together the tokens of the answer.
From that, it can seem as if it's making a strategy to a problem also. Certainly, the rhetoric that LLMs put out can at times seem very convincing of that. You can't be sure whether that's not just something cribbed out of the terabytes of text, in which discussions of something very similar to your problem have occurred.
This is not a bad way of looking at it, if I may add a bit, the llm is a solid state system. The only thing that survives from one iteration to the next is the singular highest ranking token, the entire state and "thought process" of the network cannot be represented by a single token, which means that every strategy is encoded in it during training, as a lossy representation of the training data. By definition that is a database, not a thinking system, as the strategy is stored, not actively generated during usage.
The anthropomorphization of llms bother me, we don't need to pretend they are alive and thinking, at best that is marketing, at worst, by training the models to output human sounding conversations we are actively taking away the true potential these models could achieve by being ok with them being "simply a tool".
But pretending that they are intelligent is what brings in the investors, so that is what we are doing. This paper is just furthering that agenda.
> The only thing that survives from one iteration to the next is the singular highest ranking token, the entire state and "thought process" of the network cannot be represented by a single token, which means that every strategy is encoded in it during training, as a lossy representation of the training data.
This is not true. The key-values of previous tokens encode computation that can be accessed by attention, as mentioned by colah3 here: https://news.ycombinator.com/item?id=43499819
This is a optimization to prevent redundant calculations. If it was not performed the result would be the same, just served slightly slower.
The whitepaper you linked is a great one, I was all over it a few years back when we built our first models. It should be recommended reading for anyone interested in CS.
People anthropomorphize LLMs because that's the most succinct language for describing what they seem to be doing. To avoid anthropomorphizing, you will have to use more formal language which would obfuscate the concepts.
Anthropo language has been woven into AI from the early beginnings.
AI programs were said to have goals, and to plan and hypothesize.
They were given names like "Conniver".
The word "expert system" anthropomorphizes! It's literally saying that some piece of logic programming loaded with a base of rules and facts about medical diagnosis is a medical expert.
rivers don't think and water doesn't have strategies, yet you can build intricate logic-gated tools using the power of water. Those types of systems are inherently interpretable because you can just look at how they work. They're not black boxes.
LLMs are black boxes, and if anything, interpretability systems show us what the heck is going on inside them. Especially useful when half the world is using these already, and we have no idea how t hey work
> rivers don't think and water doesn't have strategies, yet you can build intricate logic-gated tools using the power of water.
That doesn't mean the water itself has strategies, just that you can use water in an implementation of strategy... it's fairly well known at this point that LLMs can be used as part of strategies (see e.g. "agents"), they just don't intrinsically have any.
>> Language models like Claude aren't programmed directly by humans—instead, they‘re trained on large amounts of data.
Gee, I wonder where this data comes from.
Let's think about this step by step.
So, what do we know? Language models like Claud are not programmed directly.
Wait, does that mean they are programmed indirectly?
If so, by whom?
Aha, I got it. They are not programmed, directly or indirectly. They are trained on large amounts of data.
But that is the question, right? Where does all that data come from?
Hm, let me think about it.
Oh hang on I got it!
Language models are trained on data.
But they are language models so the data is language.
Aha! And who generates language?
Humans! Humans generate language!
I got it! Language models are trained on language data generated by humans!
Wait, does that mean that language models like Claud are indirectly programmed by humans?
That's it! Language models like Claude aren't programmed directly by humans because they are indirectly programmed by humans when they are trained on large amounts of language data generated by humans!
> That's it! Language models like Claude aren't programmed directly by humans because they are indirectly programmed by humans when they are trained on large amounts of language data generated by humans!
... and having large numbers of humans reinforce the applicability ("correctness"), or lack thereof, of generated responses over time.
This shift is more profound than many realize. Engineering traditionally applied our understanding of the physical world, mathematics, and logic to build predictable things. But now, especially in fields like AI, we’ve built systems so complex we no longer fully understand them. We must now use scientific methods - originally designed to understand nature - to comprehend our own engineered creations. Mindblowing.
reply