Gosh. Can somebody help me to understand how an LLM has achieved this capability?
I had thought that an LLM was essentially only doing completions under the hood using statistical likelihood of next words, wrapped in some lexical sugar/clever prompt mods. But evidently far more is going on here since, in OP’s example, some of its output (e.g. future time stamps) will not be present in its training data. Even with several billion parameters that seems impossible to me. (Clearly not!)
Could somebody join the dots for me on the nature of whatever framework the LLM is embedded inside that allows it to achieve these behaviours that emulate logic and allow generation of unique content?
I mean, that's really the mystery of it. One of the most notable advances of recent LLMs has been emergent, one-shot, and highly-contextually-aware behaviors that seem to only manifest in models with extremely large numbers of parameters.
Clearly the latent space of the model is able to encode some sort of reasoning around how time flows, typical information about houses, understanding how to transform and format information in a JSON snippet, etc. That's the "magic" of it all; amazing and powerful emergent behaviors from billions of weights.
Of course, that's also the limitation. They're opaque and incredibly difficult (impossible?) to inspect and reason about.
“seem to only manifest in models with extremely large numbers of parameters”
You can do something similar with models trained on your laptop in a few seconds. The trick is the attention mechanism, which lets the model attend to what it has learned; the same approach also works on images and other types of data.
The benefit of a large parameter count is more about being able to train in parallel, train faster, and “remember” more of the data.
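To make the "attention" claim above concrete, here is a minimal scaled dot-product self-attention sketch in NumPy. The inputs are random placeholder embeddings; in a real model the query/key/value projections are learned during training.

```python
import numpy as np

def attention(Q, K, V):
    # scores[i][j]: how strongly token i attends to token j
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over each row turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output vector is a weighted mix of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # 4 tokens, 8-dim embeddings
out, w = attention(x, x, x)    # self-attention: tokens attend to each other
print(out.shape)               # (4, 8); each row of w sums to 1
```

This is the core operation; a full Transformer stacks many of these layers with learned projections, which is where the parameter count comes in.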
Any idea of how clever the wrapper around these things is? For example, would OP’s use case simply get forwarded to the neural network as one single input (string of words), or is there some clever preprocessing going on?
EDIT: I personally would state it as: compression is understanding (statistical distribution + correlation), where I view intelligence as raw processing power and intellect as understanding + intelligence. Wisdom, I think, is understanding replayed over past events, which allows one to infer the causation of things that couldn't be computed in anything like real time.
I think understanding has to be related to generalisation, because we need to clearly separate it from simple memorisation. In math, for example, being able to solve equations not in the training set would show generalisation. But understanding is mostly a receptive process.
On the other hand, intelligence is about acting, and is an emissive process. It will select the next action trying to achieve its goals. Intelligence also needs to generalise from few examples and work in new contexts, otherwise it is not really intelligent, just a custom solution for a custom problem.
Compression alone is not equal to intelligence because compression is looking at past data, so it needs to learn only past patterns, while intelligence is oriented forward and needs to reserve capacity for continual learning.
What if the universe is deterministic and you can compute the entire Hutter Prize text using just a few physics rules? Then compression!=intelligence, I suppose.
> Can somebody help me to understand how a LLM has achieved this capability.
It's worth clarifying what is being accomplished here. iOS is handling speech recognition, and Shortcuts is handling task execution with an unspecified and presumably long user script. What GPT does here is convert text instructions into JSON formatted slot filling[1] responses.
It's somewhat amazing that GPT is emitting valid JSON, but I guess it's seen enough JSON in the training set to understand the grammar, and we shouldn't be too surprised it can learn regular grammars if it can learn multiple human languages. Slot filling is a well studied topic, and with the very limited vocabulary of slots, it doesn't have as many options to go wrong as commercial voice assistants. I would be way more amazed if this were able to generate Shortcuts code directly, but I don't think that's allowed.
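The slot-filling setup described above can be sketched as follows. The slot names here ("intent", "device", "time") are hypothetical; the real Shortcuts script presumably defines its own vocabulary. The point is that the wrapper only has to parse a small, fixed JSON shape out of the model's output.

```python
import json

# Assumed slot vocabulary, for illustration only
EXPECTED_SLOTS = {"intent", "device", "time"}

def parse_slots(model_output: str) -> dict:
    """Extract and validate a JSON slot-filling response from the model."""
    filled = json.loads(model_output)  # raises if the JSON is malformed
    unknown = set(filled) - EXPECTED_SLOTS
    if unknown:
        raise ValueError(f"model invented slots: {unknown}")
    return filled

# e.g. the model might emit:
reply = '{"intent": "set_alarm", "device": "bedroom", "time": "2023-01-05T07:30:00"}'
print(parse_slots(reply)["intent"])  # set_alarm
```

With so few valid slots, malformed or invented output is cheap to detect and reject, which is part of why this works better than open-ended generation.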
> some of its output (eg future time stamps) will not be present in its training data. Even with several billion parameters that seems impossible
Maybe this is a feature of attention, which lets each token look back to modify its own prediction, and special token segmentation[2] for dates?
I have also been having ChatGPT respond in JSON and it works incredibly well. Adding information into the middle of existing JSON output is easy as well. I find that ChatGPT allows you to "generate" random (terribly so) and unique IDs for references. It used to work a lot better, and was seemingly altered just before the Dec 15 build was replaced.
With this you can create the structure for replies and have ChatGPT respond in an easy-to-read way. When you walk the Assistant through each method to fill the variables, the result is that the Assistant can then run the same steps on any new topic while accurately re-following them.
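The "structure for replies" idea can be sketched as a fixed JSON template whose variables get filled per topic. All field names here are invented for illustration; the comment's actual tests presumably used their own schema.

```python
import json

# A fixed reply template the model is asked to fill for each new topic
TEMPLATE = '{"topic": null, "items": [], "tasks": {}}'

def fill_template(topic, items, tasks_per_item):
    reply = json.loads(TEMPLATE)   # fresh copy of the structure each time
    reply["topic"] = topic
    reply["items"] = list(items)
    # each interactable item gets its own list of viable tasks, mirroring
    # the 20-30 item scene descriptions mentioned above
    reply["tasks"] = {item: list(tasks_per_item.get(item, [])) for item in items}
    return reply

scene = fill_template("kitchen", ["door", "kettle"],
                      {"door": ["open", "close"], "kettle": ["boil"]})
print(scene["tasks"]["door"])  # ['open', 'close']
```

Keeping the template fixed is what lets the same walkthrough be replayed on any new topic.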
I have used this method to hold values for things such as image descriptions with "coordinates", behavioural chemistry, and moving "limbs" with limitations baked in using variables. In my image descriptions I would have 20-30 items capable of being interacted with. Each would be assigned a list of viable tasks and each item on the list would go through special instructions.
The interesting part is that the team running the Assistant layer has recently blocked the bot from doing some of the final steps of my tests. My first test ended at a closed door with an override preventing the bot from using the door. Assistant got through the door anyway, successfully, with 6 different options to choose from, by correctly identifying the word use as the problem to focus on.
The largest advantage I can see is the ability to locate invalid information very quickly for targeted retraining.
Have you ever had a seemingly brilliant idea, only to later find out somebody else had it too, and that on actually analyzing the idea it turns out to be a combination of existing ideas?
I would say that part of our human intelligence is not much more than doing exactly that: learning patterns in different languages. English, emotions, and experiences are all interfaces between the world and the self (whatever that is).
When you learn to speak, at first you reproduce simple sounds, then you add more sounds, more patterns. These patterns and "meta-patterns" are what intelligence is, in my opinion. The creepy part is just that they usually don't appear as patterns and pattern manipulation to us, but rather in a form that is useful to "us" as a way to interact with "the world". "The world", whatever that means to the individual, is also just a useful representation accumulated in a similar manner. But what is this "self"? Does it exist at all? Is it the mere accumulation of meta-patterns? With a pair of eyes connected to it and some other useful appendages?
“Language model” just means a probabilistic model of a token sequence. I think what we’re seeing with LLMs is that something that we thought was very hard—approximating the true joint probability function of language in this case—might be possible with much higher compression than we expected. (Sure, 170B parameters is a lot, but the structure of Transformers like GPT is so highly redundant that it still seems simpler than I would have expected.)
We started out with very simple probabilistic models by making very strong limiting assumptions. A naive Bayes model assumes every token is independent; a Markov model assumes that a token only depends on the single token before it. But ultimately these are just models for the underlying “real” joint distribution, and in theory we should expect that the simple models are bad because they’re bad estimates of that true distribution. I think my experience with these poor approximations limited what I expected any language model could do, and the success of LLMs is making me recalibrate my expectations.
But I think it’s important to keep in mind that the model is still just spitting out the most likely next tokens according to its own estimation of the joint distribution. It’s so good because the model of “most likely next token” is very accurate. It’s easy to fall into the trap of looking at how specific the output is and thinking “wow, out of all the possible outputs, how did it know to give this one? It could have said anything and the right thing is a very low probability event, so it must really understand the question in order to answer properly.” But another way to think about it might be “wow, it’s incredible that the model accurately estimates the probability distribution for this kind of token sequence.”
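A toy bigram Markov model makes the "most likely next token" framing above concrete. An LLM is doing the same kind of thing, but conditioning on vastly more context than the single previous token used here.

```python
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

# count bigram transitions: how often each token follows each other token
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_dist(prev):
    """Estimated P(next token | previous token)."""
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

def greedy_continue(prev, steps):
    """Always emit the single most likely next token."""
    out = [prev]
    for _ in range(steps):
        prev = counts[prev].most_common(1)[0][0]
        out.append(prev)
    return " ".join(out)

print(next_token_dist("the"))   # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(greedy_continue("the", 3))
```

The model "can't do anything else" in exactly the sense described above: it samples from (or maximizes over) its estimated distribution, and the quality of the output is entirely a function of how good that estimate is.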
As to your specific example of things like future time stamps, I think I’ve seen folks say in the past that there is some additional input prefix or suffix information in addition to the prompt you send that is passed into the model. So the model is still a static function—it is not learning from experience within an interaction session; it doesn’t experience time or other external contextual factors as part of its execution. The operators update the contextual input to the model as time passes in the world.
Do you think our own brains when they “understand” are also “just” estimating probability well? If so perhaps we’re quite close to discovery of something important.
It’s really hard to say. I very much doubt that brains and LLMs operate on the same principle; but that doesn’t mean we aren’t discovering something important.
A few years ago Google claimed credit for achieving Quantum Supremacy by building a quantum processor that simulated itself…by running itself. If that sounds tautological, then you see the problem.
They fed in a program that described a quantum circuit. When they ran the program, the processor used its physical quantum gates to execute the circuit described by the program and sampled from the resulting distribution of the quantum state. It’s a bit like saying we “simulated” an electrical circuit by physically running the actual circuit. (Their point was that the processor could run many different circuits depending on the input program, so that makes it a “computer”.)
Ignoring the task, that processor did exactly what an LLM does: it sampled from a very complicated probability distribution. Did the processor “understand” the program? Did it “understand” quantum physics or mathematics? Did it “understand” the quantum state of the internal wave function? It definitely produced samples that came from the distribution of the wave function for the program under test. But it’s hard to argue that the processor was “understanding” anything—it was doing exactly the thing that it does.
If we had enough qubits, then in theory we could approximate the distribution of an LLM like GPT. Would we say then that the processor “understands” what it’s saying? I don’t think the Google processor understands the circuit, so at what point does approximating and sampling from a bigger distribution transition into “understanding”?
Like the quantum device, the LLM is just doing exactly the thing that it does: sampling from a probability distribution. In the case of the LLM it’s a distribution where the samples seem to mean something, so it’s tempting to think that the model _intended_ for the output to have that meaning. But in reality the model can’t do anything else.
None of that proves that humans do anything different, but it certainly seems like we (and many other animals) are more complex than that.
Yes, the training is based on predicting a missing word. At first, this task alone allowed GPT-2 to generate surprisingly coherent paragraphs from a user's prompt. ChatGPT added the novel ability to hold a conversation.
Regarding timestamps, the training did include many examples of timestamps in JSON format.
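The "predict the missing/next word" training signal mentioned above can be sketched as follows: the loss at each position is the negative log probability the model assigned to the token that actually came next. The toy distribution below is made up for illustration.

```python
import math

def next_token_loss(pred_dist, true_token):
    # low when the model put high probability on the real next token
    return -math.log(pred_dist[true_token])

# suppose, given "the cat", a model predicts this distribution:
pred = {"the": 0.05, "cat": 0.05, "sat": 0.8, "mat": 0.1}

print(round(next_token_loss(pred, "sat"), 3))  # 0.223  (a good prediction)
print(round(next_token_loss(pred, "mat"), 3))  # 2.303  (a bad one)
```

Minimizing this loss over a corpus full of JSON-formatted timestamps is enough to teach the model their format, even though any particular future timestamp never appeared verbatim.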