> LLMs do not have corporeal experience. But it's not obvious that this means th...

bubblyworld · on July 21, 2024

Transformer models have been shown to spontaneously form internal, predictive models of their input spaces. The belief that they don't is one of the most pervasive misunderstandings about LLMs (and other transformers) around. It is of course also true that the quality of these internal models depends a lot on the kind of task the model is trained on. A GPT must be able to reproduce a huge swathe of human output, so the internal models it picks out would be those that are the most useful for that task, and might not include models of common mathematical tasks, for instance, unless they are common in the training set.

Have a look at the OthelloGPT papers (can provide links if you're interested). This is one of the reasons people are so interested in them!

brnt · on July 21, 2024

> can provide links if you're interested

Please do :)

persnickety · on July 21, 2024

https://thegradient.pub/othello/

bubblyworld · on July 21, 2024

Here's the paper on OthelloGPT's internal models I mentioned: https://arxiv.org/abs/2309.00941

The references in that paper are also good reading!

fshbbdssbbgdd · on July 21, 2024

> How could a system which produces a single next word based upon a likelihood and a parameter called a "temperature" have a conceptual model underpinning it? Even theoretically?

Could a creature that simply evolved to survive and reproduce possibly have a conceptual model underpinning it? Model training and evolution are very different processes, but they are both ways of optimizing a physical system. It may be the case that evolution can give rise to intelligence and model training can’t, but we need some argument to prove that.

abernard1 · on July 22, 2024

> Could a creature that simply evolved to survive and reproduce possibly have a conceptual model underpinning it?

Yes. Obviously. I can create plans and think on them. I can think without the need for an internal monologue or talking to myself. This has nothing to do with modalities either: I do not think through text. I use text when I do as a way of conveying the thoughts I already have.

Anyone who claims we don't have the ability to form concepts in our head distinct from the medium in which they're transmitted is saying we're effectively ears, eyes, and skin. That the modality is what is important for intelligence.

This is clearly false, and academic silliness aside, yes—emphatically—humans and intelligent agents have internal concepts and models of the world.

gwervc · on July 20, 2024

> generative systems are basically always hallucinating: they do not have any concepts at all. That they are "right" as often as they are is a testament to the power of curve fitting and compression of basis functions in high dimensionality spaces

It's refreshing to read someone who "got it". Sad that before my upvote the comment was grayed out.

Any proponent of conceptual or other wishful/magical thinking should come with proofs, since it is the hypothesis that diverges from the definition of an LLM.

GeneralMayhem · on July 20, 2024

The argument would be that that conceptual model is encoded in the intermediate-layer parameters of the model, in a different but analogous way to how it's encoded in the graph and chemical structure of your neurons.

abernard1 · on July 20, 2024

I agree that's an argument. I would contend that argument is obviously false. If it were true, LLMs could multiply scalar numbers together trivially. It should be the easiest thing in the world for them. The network required to do that well is extremely small, the parameter sizes of these models are gigantic, and the textual expression is highly regular: multiplication is the simplest concept imaginable.

That they cannot do that basic task implies to me that they have almost no conceptual understanding unless the fit is almost memorizable or the space is highly regular. That LLMs can't multiply numbers properly isn't surprising if they don't really understand concepts prior to emitting text. Where they do logical tasks, that can be done with minimal or no understanding, because syllogisms and logical formalisms are highly structured in text arguments.

GaggiX · on July 21, 2024

Multiplication requires O(n^2) operations with the usual algorithm used by humans, while LLMs have a constant amount of computation available and are not really efficient machines for math evaluation. They can definitely evaluate unseen expressions, and you can train a neural network to learn how to do sums and multiplications. I have trained models on sums, and they are able to do sums never seen during training; the model learns the algorithm just from being given inputs and outputs.
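To make the O(n^2) point concrete, here is a minimal sketch of the schoolbook algorithm that counts single-digit multiplications. The function name `schoolbook_multiply` is made up for illustration, and the sketch says nothing about how any model computes products internally.

```python
def schoolbook_multiply(a: str, b: str) -> tuple[str, int]:
    """Multiply two non-negative decimal strings the way it is done on paper.

    Returns the product as a string plus the number of single-digit
    multiplications performed, which is len(a) * len(b).
    """
    x = [int(ch) for ch in reversed(a)]  # least significant digit first
    y = [int(ch) for ch in reversed(b)]
    result = [0] * (len(x) + len(y))
    digit_products = 0
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10
            carry = total // 10
            digit_products += 1
        result[i + len(y)] += carry
    # Drop leading zeros, keeping at least one digit.
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return "".join(str(d) for d in reversed(result)), digit_products


product_str, work = schoolbook_multiply("12345678", "87654321")
print(product_str, work)
```

Doubling the number of digits in both operands quadruples the count, which is the sense in which a fixed amount of computation per output token is a poor fit for long multiplications.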

jdietrich · on July 21, 2024

LLMs do contain conceptual representations and LLMs are capable of abstract reasoning. This is trivially provable by asking them to reason about something that is a) purely abstract and b) not in the training data, e.g. "All floots are gronks. Some gronks are klorps. Are any floots klorps?" Any of the leading LLMs will correctly answer questions of this type much more often than chance.

abernard1 · on July 22, 2024

This does not indicate abstract reasoning. I said:

> Where they do logical tasks, that can be done with minimal or no understanding, because syllogisms and logical formalisms are highly structured in text arguments.

There is an enormous amount of text in the training set that is structured in the way you said such that syntactic replacement would be effective. That is also unsurprising and does not represent abstract reasoning any more than "King - Man + Woman = Queen" in word2vec. It's showing that there are high degrees of structure in syllogisms, and that it need know nothing about what a gronk, floot, or klorp is at all because the structure of the syllogism is repeated all over the internet.

stirfish · on July 21, 2024

I just asked chatgpt

"All floots are gronks. Some gronks are klorps. Are any floots klorps?"

------

To determine if any floots are klorps, let's analyze the given statements:

1. All floots are gronks. This means every floot falls into the category of gronks.
2. Some gronks are klorps. This means there is an overlap between the set of gronks and the set of klorps.

Since all floots are included in the set of gronks and some gronks are klorps, it is possible that some floots are klorps. However, we cannot conclusively say that any floots are klorps without additional information. It is only certain that if there is any overlap between floots and klorps, it is possible, but not guaranteed, that some floots are klorps.

card_zero · on July 21, 2024

Huh, almost right. ("possible, but not guaranteed?" it's necessarily true. That whole sentence was a waste of space, and wrong.)

Edit: I mean "if there is any overlap", it's necessarily true. I should have quoted the whole thing.

jdietrich · on July 21, 2024

Nope, ChatGPT was right, the answer is indeterminable. The klorps that are gronks could be a wholly distinct subset to the klorps that are floots. It also correctly evaluates "All gronks are floots. Some gronks are klorps. Are any floots klorps?", to which the answer is definitively yes.
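For anyone who wants to check the two syllogisms mechanically, here is a rough brute-force sketch. It enumerates every way of assigning three individuals to the floot/gronk/klorp categories and looks for a counterexample. The helper names (`entails`, `all_are`, `some_are`) are invented for this example, and a search over a small universe is a sanity check rather than a general proof procedure.

```python
from itertools import product

FLOOT, GRONK, KLORP = 0, 1, 2


def all_are(a, b):
    """'All a are b': every individual that is an a is also a b."""
    return lambda world: all(x[b] for x in world if x[a])


def some_are(a, b):
    """'Some a are b': at least one individual is both."""
    return lambda world: any(x[a] and x[b] for x in world)


def entails(premises, conclusion, universe_size=3):
    """True if no small world satisfies the premises but violates the conclusion."""
    # Each individual is a triple of booleans: (is_floot, is_gronk, is_klorp).
    kinds = list(product([False, True], repeat=3))
    for world in product(kinds, repeat=universe_size):
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # counterexample found
    return True


original = entails(
    [all_are(FLOOT, GRONK), some_are(GRONK, KLORP)], some_are(FLOOT, KLORP)
)
reversed_form = entails(
    [all_are(GRONK, FLOOT), some_are(GRONK, KLORP)], some_are(FLOOT, KLORP)
)
print(original, reversed_form)  # False True
```

The first form fails because a world can contain a gronk that is a klorp but not a floot; the second holds because any gronk that is a klorp must also be a floot.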

card_zero · on July 21, 2024

> The klorps that are gronks could be a wholly distinct subset to the klorps that are floots.

So? It's still the case that "if there is any overlap between floots and klorps," it is "guaranteed, that some floots are klorps." It's tautological.

Unless there's a way to read "overlap" so that it doesn't mean "some of one category are also in the other category, and vice versa"?

Oh, when I said "it's necessarily true" I was referring to this last sentence of the output, not the question posed in the input. Hence we are at cross purposes I think.

LetsGetTechnicl · on July 21, 2024

That is not an example of an LLM being capable of abstract reasoning. Changing the question from "What is the capital of the United States?" which is easily answerable to something completely abstract and "not in the training model" doesn't change that LLMs are just very advanced text prediction, and always will be. The nature of their design means they are incapable of AGI.

roenxi · on July 21, 2024

> LLMs are just very advanced text prediction, and always will be

How do you predict the next word in answering an abstract logic question without being capable of abstract reasoning, though?

In some sense it probably is possible, but this is a gaping flaw in your argument. A sufficiently advanced text prediction process has to encompass the process of abstract reasoning. The text prediction problem is necessarily a superset of the abstract reasoning problem. Ie, in the limit text prediction is fundamentally harder than abstract reasoning.

jdietrich · on July 21, 2024

The question I gave is a literal textbook example of abstract reasoning. LLMs are just very advanced text prediction, but they are also provably capable of abstract reasoning. If you think that those statements are contradictory, I would encourage you to read up on the Bayesian hypotheses in cognitive science - it is highly plausible that our brains are also just very advanced prediction models.

nsagent · on July 21, 2024

You're quite right that LLMs can seemingly do some abstract reasoning problems, but I would not say they aren't in the training data.

Sure, the exact form using the made up word gronk might not be in the training data, but the general form of that reasoning problem definitely exists, quite frequently in fact.

cgag · on July 21, 2024

Have you seen this?

```
You will be given a name of an object (such as Car, Chair, Elephant) and a letter in the alphabet. Your goal is to first produce a 1-line description of how that object can be combined with the letter in an image (for example, for an elephant and the letter J, the trunk of the elephant can have a J shape, and for the letter A and a house, the house can have an A shape with the upper triangle of the A being the roof). Following the short description, please create SVG code to produce this (in the SVG use shapes like ellipses, triangles etc and polygons but try to defer from using quadratic curves).
```

```
Round 5: A car and the letter E. Description: The car has an E shape on its front bumper, with the horizontal lines of the E being lights and the vertical line being the license plate.
```

Image generated here: https://imgur.com/a/Ia4Q2h3

How does it "just" predict the letter E could be used in such a way to draw a car? How does it just text predict working SVG code that draws the car made out of basic shapes and the letter E?

I don't know how anyone could suggest there are no conceptual models embedded in there.

jdietrich · on July 21, 2024

Yes, but the general form of the problem tells you nothing about the answer to any specific case. To perform any better than chance, the model has to actually reason through the problem.

smolder · on July 21, 2024

Pleasure and pain, along with subtler emotions that regulate our behavior, aren't things that arise from word prediction, or even from understanding the world, I don't think. So to say human brains are just prediction models seems like a mischaracterization.

brookst · on July 21, 2024

That’s a tautology that seems just as applicable to humans.

wonnage · on July 21, 2024

Or maybe they're just pattern matching on the very particular sentence structure you've chosen. This isn't a convincing example at all

jdietrich · on July 21, 2024

This isn't something I should convince you of. Just open up ChatGPT or Claude and try it for yourself. Think up a batch of your own questions and see how a modern LLM fares. I assure you that it'll do much better than chance. If you're so inclined, you can run enough tests to achieve statistical significance in the course of your lunch break.

It depresses me that we seem to be spending more time arguing and hypothesising about LLMs than empirically testing them. The question of whether LLMs can think is completely settled, as their performance at zero-shot problems is simply impossible through pure memorisation or pattern-matching. The question that remains is far more interesting - how do they think?

https://arxiv.org/pdf/2205.11916

nickpsecurity · on July 21, 2024

Given their training set, our hypothesis so far should be that they're just tweaking things they've already seen by applying a series of simple rules. They're still not doing what human beings do. We have introspection, creativity operating outside what we've seen, modeling others' thoughts, planning in new domains, and so on. We also operate without hallucination most of the time. I've yet to see an A.I. do all of this reliably and consistently. Then, that it did that without training input similar to the output.

So, they don't just pattern match or purely memorize. They do more than that. They do way less than humans. Unlike humans, they also try to do everything with one or a few components vs our (100-200?) brain components. Crossing that gap might be achievable. It will not be done by current architectures, though.

Zambyte · on July 21, 2024

> If it were true, LLMs could multiply scalar numbers together trivially.

FWIW most large models can do it better than I can in my head.

famouswaffles · on July 21, 2024

>If it were true, LLMs could multiply scalar numbers together trivially.

I mean, it's not like GPT-4 can't do this with more accuracy than a human without a calculator.

nsagent · on July 21, 2024

Using Occam's razor, that is less probable than the model picking up on statistical regularities in human language, especially since that's what they are trained to do.

mitthrowaway2 · on July 21, 2024

That's hard to conclude from Occam's razor here. Or, "statistical regularities" may have less explanatory power than you think, especially if the simplest statistical regularity is itself a fully predictive understanding of the concept of temperature.

drdeca · on July 21, 2024

> I would argue it is (obviously) impossible the way the current implementation of models work.

> How could a system which produces a single next word based upon a likelihood and a parameter called a "temperature" have a conceptual model underpinning it? Even theoretically?

Any probability distribution over strings can theoretically be factored into a product of such a “probability that next token is x given that the text so far is y”. Now, whether a probability distribution over strings can be efficiently computed in this form is another question. But, if we are being so theoretical that we don’t care about the computational cost (as long as it is finite), then the “it is next token prediction” can’t preclude anything which “it produces a probability distribution over strings” doesn’t already preclude.

As for the temperature, given any probability distribution over a discrete set, we can modify it by adding a temperature parameter. Just take the log of the probabilities according to the original probability distribution, scale them all by a factor (the inverse of the temperature, I think. Either that or the temperature, but I think it is the inverse of the temperature.), then exponentiate each of these, and then normalize to produce a probability distribution.
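A minimal sketch of that recipe, for reference. In the usual softmax convention the scaling factor is the inverse of the temperature, so each probability is raised to the power 1/T and the result is renormalized. The function name `apply_temperature` is hypothetical, and the input is assumed to be a valid distribution with at least one nonzero entry.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a discrete distribution: p_i -> p_i ** (1 / T), then renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Take logs and scale by the inverse temperature; zero stays at zero.
    scaled = [math.log(p) / temperature if p > 0 else -math.inf for p in probs]
    peak = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.5, 0.3, 0.2]
print(apply_temperature(base, 1.0))    # unchanged, up to rounding
print(apply_temperature(base, 0.1))    # sharply peaked on the first option
print(apply_temperature(base, 100.0))  # close to uniform
```

As T shrinks toward 0 the distribution concentrates on the most likely option, and as T grows it flattens toward uniform.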

So, the fact that they work by next token prediction, and have a temperature parameter, cannot imply any theoretical limitation that wouldn’t apply to any other way of expressing a probability distribution over strings, as far as discussing probability distributions in the abstract, over strings, rather than talking about computational processes that implement such probability distributions over strings.

But also like, going between P(next token is x | initial string so far is y) and P(the string begins with z) , isn’t that computationally costly? Well, in one direction anyway. Because like, P(next token is x|string so far is y) = P(string begins with yx) / P(string begins with y) .

Though, one might object to P(string starts with y) over P(string is y) ?
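Here is a small sketch of the factorization on a toy distribution, assuming single-character tokens. The strings, their probabilities, and the end-of-string marker `$` are all invented for illustration. The end marker is one way to handle the "starts with y" versus "is y" distinction: the probability that the string is exactly y becomes the probability that it begins with y followed by the marker.

```python
import math

END = "$"  # hypothetical end-of-string token

# Toy distribution over whole strings (probabilities sum to 1).
STRINGS = {"a": 0.1, "ab": 0.4, "ac": 0.2, "b": 0.3}


def prob_begins_with(prefix):
    """P(string begins with prefix); a trailing END means 'is exactly'."""
    if prefix.endswith(END):
        return STRINGS.get(prefix[:-1], 0.0)
    return sum(p for s, p in STRINGS.items() if s.startswith(prefix))


def next_token_prob(token, prefix):
    """P(next token is `token` | string so far is `prefix`)."""
    return prob_begins_with(prefix + token) / prob_begins_with(prefix)


def string_prob_from_factors(s):
    """Rebuild P(string is s) as a product of next-token probabilities."""
    prob, prefix = 1.0, ""
    for token in s + END:
        prob *= next_token_prob(token, prefix)
        prefix += token
    return prob


for s, p in STRINGS.items():
    assert math.isclose(string_prob_from_factors(s), p)
print("next-token factors reproduce the string probabilities")
```

The cheap direction is the one shown in `next_token_prob`: two prefix probabilities give one conditional. Going the other way takes one conditional per token of the string.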

abernard1 · on July 22, 2024

> Any probability distribution over strings can theoretically be factored into a product of such a “probability that next token is x given that the text so far is y”.

And such a probability distribution would not generally understand concepts, efficient or otherwise. The P(next_token) is based upon the syntactical structure built via the model and some basic semantic distance that LLMs provide. They don't have enough conceptual power to reliably generate new facts and know that they are facts consistent with the model. That would be an internal representation system.

The academic exercise here is similar to monads: "yes, any computed function f(x) can be expressed as a sufficiently pre-computed large lookup table." With LLMs we're dealing with approximate lookups due to lossy compression, but that's still what these prior probabilities are: lookup tables. Lookup tables are not smart, do not understand concepts, and they have little to no capacity to generate new results not sufficiently represented in the training set.

drdeca · on July 22, 2024

My main concern here is the theoretical point, and so I’m not addressing the “this is what current (e.g. transformer based) models do” parts.

> The P(next_token) is based upon the syntactical structure built via the model and some basic semantic distance that LLMs provide.

Regardless of whether this is true for existing transformer-based models, this is not true for all computable conditional probability distributions over text.

Any computable task can be framed as sampling from some conditional probability distribution. (If the task is deterministic, that just means that the conditional probability distribution to sample from is one which has probability 1 for some string, when conditioned on the thing it is to be conditioned on.)

Whether transformer based models are lookup tables or not, not all computable probability distributions over text are. (As, of course, not all computable tasks can be expressed as a simple finite lookup table.)

I don’t know exactly what you mean by “generally understand concepts”, though I suppose

> They don't have enough conceptual power to reliably generate new facts and know that they are facts consistent with the model. That would be an internal representation system.

is describing that somewhat. And, in that case, if there is any computational process which counts as having “enough conceptual power to generate new facts and know that they are facts consistent with the model”, then, a computable conditional probability distribution over strings conditioned on their prefixes, and therefore also a computable probability distribution over next tokens given all-tokens-so-far , is also (theoretically) capable of that.

And so, it would follow that “it only predicts the next token” doesn’t (in principle/theory) preclude it having such an understanding of concepts, unless no computational process ever can.

abernard1 · on July 22, 2024

> “it only predicts the next token” doesn’t (in principle/theory) preclude it having such an understanding of concepts, unless no computational process ever can.

In my opinion, this is highly reductive and academic. Whether these models are transformers or not, lookup likelihood is not indicative of understanding of concepts in any reasonable way.

If the response to an algebraic equation was based upon probability of tokens in a corpus... and not an actual deterministic application of the rules of algebra, would that response know concepts? Would it be intelligent?

With math, specifically given the unbounded size of the tokens compared to language, it's clear that token prediction is not a useful methodology.

Let's say we're just trying to multiply two integers. Even if a model had Rain Man powers of memorization, and it memorized phone book after phone book of multiplication tables, the probabilistic likelihood model would fail for the very obvious reason that we cannot enumerate (and train on) all the possible outcomes of math and calculate their frequencies. We can however understand and use the concepts of math, which is distinct from their symbolic representation.

drdeca · on July 22, 2024

> lookup likelihood is not indicative of understanding of concepts in any reasonable way.

Where did I ever say that the thing was doing lookup? I only said it was producing a probability distribution.

Is your claim that all programs are just doing lookup?

> If the response to an algebraic equation was based upon probability of tokens in a corpus...

Ah, I see the confusion. When I say “probability distribution” I do not mean “for each option, an empirical fraction out of all the options, that this particular option appeared in the corpus”. Rather, by “probability distribution”, I mean (in the discrete case) “an assignment of a number which is at least zero and at most one, to each of the options, and such that the sum of the assigned values add up to 1”. I am allowing that this assignment of values is computed (from what is being conditioned on) in any way whatsoever .

If the correct answer is a number, it may compute the entire correct number through some standard means, and then look at however many correct tokens from the number are already present, and assign a probability of 1 to the correct next one, and 0 to all other tokens. If conditioning on a partial answer that has parts wrong, it may use an arbitrary distribution.
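A sketch of that construction for integer multiplication, to show that "outputs a next-token distribution" and "computes the answer by applying the rules" are compatible. Everything here is hypothetical: the digit vocabulary, the `<end>` token, and the choice of a uniform distribution when the partial answer is already wrong.

```python
DIGITS = "0123456789"
END = "<end>"  # hypothetical end-of-answer token
VOCAB = list(DIGITS) + [END]


def next_token_distribution(a, b, partial):
    """Distribution over the next token of the answer to a * b.

    `partial` is the string of answer digits emitted so far.
    """
    correct = str(a * b)  # computed by ordinary arithmetic, not looked up
    if not correct.startswith(partial):
        # Conditioning on a wrong partial answer: any distribution will do.
        return {tok: 1.0 / len(VOCAB) for tok in VOCAB}
    target = END if len(partial) == len(correct) else correct[len(partial)]
    return {tok: (1.0 if tok == target else 0.0) for tok in VOCAB}


def greedy_decode(a, b):
    """Generate the answer one token at a time, always taking the likeliest token."""
    answer = ""
    while True:
        dist = next_token_distribution(a, b, answer)
        token = max(dist, key=dist.get)
        if token == END:
            return answer
        answer += token


print(greedy_decode(12345, 6789) == str(12345 * 6789))  # True
```

Nothing in this sketch counts frequencies in a corpus; the distribution is produced by a computation, which is the sense of "probability distribution" used above.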

Davidzheng · on July 20, 2024

It's only because you can essentially put LLMs in a simulation that you can have this argument. We can imagine the human brain also in a simulation which we can replay over and over again and adjust various parameters of the physical brain to change the temperature. These sorts of arguments can never distinguish between LLMs and humans.

buu700 · on July 21, 2024

On that point, I would dispute the premise that "it's impossible to have true language skills without implicitly having a representation of self and environment". I don't see any contradiction between the following two ideas:

1. LLMs inherently lack any form of consciousness, subjective experience, emotions, or will

2. A sufficiently advanced LLM with sufficient compute resources would perform on par with human intelligence at any given task, insofar as the task is applicable to LLMs

IanCal · on July 21, 2024

> How could a system which produces a single next word based upon a likelihood and a parameter called a "temperature" have a conceptual model underpinning it? Even theoretically?

You're limiting your view of their capabilities on the output format.

> Not so with LLMs!! Generative LLMs do not have a prior concept available before they start emitting text.

How do you establish that? What do you think of OthelloGPT? That seems to form an internal world model.

> That the "temperature" can chaotically change the output as the tokens proceed

Changing the temperature forcibly makes the model pick words it thinks fit worse. Of course it changes the output. It's like an improv game with someone shouting "CHANGE!".

Let's make two tiny changes.

One, let's tell a model to use the format

<innerthought>askjdhas</innerthought> as the voice in their head, and <speak>blah</speak> for the output.

Second, let's remove temperature and keep it at 0 so we're not playing a game where we force them to choose different words.

Now what remains of the argument?

abernard1 · on July 22, 2024

> You're limiting your view of their capabilities on the output format.

The "generation" of strings is related to this output format. It's critical to how they work. Legerdemain has been performed to argue that that's irrelevant, and the real intelligence or concepts are sitting inside the network architecture of the trained model prior to generation. But if that were the case, generation could be done based upon the conceptual representation, and not a syntactical representation token by token. This is not currently the case with LLMs.

I'd turn this question around: if my question is irrelevant, how would one go about building an effective real-world LLM that understands concepts and doesn't use likelihood lookups on a token-per-token basis, but instead generates directly from the conceptual basis? Such an argument, if it exists, would make me very happy.

Please note, I understand that there are prior systems which do this. Generative zero-shot transformer models didn't adopt this approach because it's elegant, but because it is efficient to compute with large data sets and has useful efficacy in generating strings. Some are creative. Some are more "accurate." The temperature parameter can affect which of those cases it selects, if any.

People have short memories, but the people who are both appreciative of LLMs as an engineering feat and critical of their claims of intelligence have been saying this for years. They've said that their token likelihood model is effective for seeing things well covered in the data set. They've been saying that due to the sparsity and structure of human language, large scale approximate compression ("curve fits") would be highly effective and efficient. They've been saying that due to the fact these are large scale fits of a data set, the models would eventually converge to something looking like the known knowledge they're trained on, and not exponentially accelerate in knowledge. All of these predictions have proved to be correct or looking highly likely at this point.

Transformer-based LLMs are a neat algorithmic approach to curve fits. But they are curve fits. Things like cosine transforms in JPEGs, wavelet or Fourier reconstruction in CAT scans, and audio signal reconstruction from basis functions are also approximate reconstruction models that function along these lines, albeit in a nice Euclidean space without the generative parts of a transformer. But it was precisely knowledge of how systems like that worked which allowed scientists to understand and predict the limitations of these systems a long time ago. Lots of money and fresh eyes have created a useful computation technique, but these insights have been forgotten. I hope, truly, that progress happens in this space. But the critiques stand and there would be lots to gain by a less-hyped acknowledgement of where we are with these models and the tradeoffs baked into them as a compromise for them to be useful.

l33tbro · on July 21, 2024

I've always loved your takes on AI. You should air them here a bit more.