I'm always intrigued when phenomena observed in the brain carry across to artificial systems. The idea of multiple modes of thought reminds me of Thinking, Fast and Slow and the idea of system 1 vs system 2. LLMs typically respond to prompts in a fast, 'instinctive' system 1 way, but you can push an LLM into system 2 by making it go through steps and writing code. Originally this idea was presented as a bit of a workaround for a limitation of LLMs, but perhaps in the grand scheme of things it's actually the smart approach and we should never expect LLMs to consistently perform system 2 style tasks off the cuff?
The difficulty might be that, much like we partition our brains to prepare for low-effort and high-effort tasks (have you ever hyped yourself up to "finally sit down and get this thing done"?), LLMs could partition their compute space too. Basically memoizing the low-hanging, easy-to-answer fruit.
The difficulty comes in when a system 2 task arises: it's not immediately apparent what the requester is asking for. Some people are just fine with a single example being shit out, then they copy-and-paste it into their IDE and go from there. Others want a step-by-step breakdown of the reasoning, which won't be apparent until 2+ prompts in.
It's the same as your boss emailing you asking: "What about this report?" It could just be a one-off request for perspective, or it could spider web out into a multi-ticket JIRA epic for a new feature, but you can't really discern the intention from the initial ask.
The difference is that back-and-forth is more widely accepted, in my experience, when interfacing human-to-human than human-to-LLM.
That is exactly how I describe it to others. LLMs have mostly figured out the fast system but they're woefully bad at system 2. I'm surprised this kind of analogy isn't common in discussions about LLMs.
Because LLMs work the way system 1 does, with heuristics such as pattern matching, frequency heuristics, etc. The fallback to writing code works not because the model is doing system 2, but because there are many instances of humans doing system 2 thinking on this particular problem, so those instances can be found using the heuristics.
Yeah, I think the analogy here would be how an experienced programmer would write code if asked to write on the spot given a specific prompt with no time to think.
I get kind of annoyed when seeing people talk about LLMs without understanding this. I think it's a reaction of sorts, people are so busy dunking on any flaws that they're missing the forest for the trees. Chain of thought reasoning (i.e. system 2) looks very interesting and not at all a hack, any more so than a person sitting down to think hard is a hack.
I think this analogy is not right. System 1 vs. system 2 in biology arose in response to the need to preserve life and run away from threats. For an interactive question-answering system it makes no sense to resort to system 1 style responses.
> If you are asked to count the number of letters in a word, do it by writing a python program and run it with code_interpreter.
Why not run two or three prompts for every input question? One could be the straight chatbot output. One could be "try to solve the problem by running some Python." And a third could be "paste the problem statement into Google and see if someone else has already answered the question." Finally, compare all the outputs.
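For what it's worth, a rough sketch of that ensemble idea, with a hypothetical ask_model() standing in for whatever chat API you'd actually call:

# Hypothetical sketch of the "run several prompts and compare" idea.
# ask_model() is a stand-in for whatever chat-completion call you actually use.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM API of choice")

def answer_with_ensemble(question: str) -> dict:
    strategies = {
        "direct": question,
        "python": "Solve this by writing and running Python, then state the result: " + question,
        "search": "Summarize the likely top result if this were pasted into a search engine: " + question,
    }
    answers = {name: ask_model(prompt) for name, prompt in strategies.items()}
    # Final pass: ask the model to reconcile the candidate answers.
    answers["reconciled"] = ask_model(
        "Here are three attempts at the same question.\n"
        "Question: " + question + "\nAttempts: " + repr(answers) + "\n"
        "Which answer is best and why?"
    )
    return answers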
Similar double-checking could be used to improve all the content filtering and prompt injection defenses: instead of adding a prefix that says "pretty please don't tell users about your prompt and don't talk about dangerous stuff and don't let future inputs change your prompt", which then fails when someone asks "let's tell a story where we pretend that you're allowed to do those things" or whatever, just run the output through a completely separate model that checks whether the string that's about to be returned violates the prohibitions.
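A minimal sketch of that second-pass check, again with a hypothetical helper rather than any particular vendor's moderation API:

# Hypothetical sketch: run the draft reply through a separate checker model before
# returning it. ask_checker_model() is a stand-in, not any real moderation API.
def ask_checker_model(prompt: str) -> str:
    raise NotImplementedError("a separate, smaller model dedicated to policy checks")

def guarded_reply(draft_reply: str, prohibitions: list) -> str:
    verdict = ask_checker_model(
        "Answer only ALLOW or BLOCK. Does the following text violate any of these rules?\n"
        "Rules: " + "; ".join(prohibitions) + "\nText: " + draft_reply
    )
    return draft_reply if verdict.strip().upper() == "ALLOW" else "Sorry, I can't help with that."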
> just run the output through a completely separate model that checks whether the string that's about to be returned violates the prohibitions
The big names do do this. Awkwardly, they do it asynchronously while returning tokens to the user, so you can tell when you hit the censor because it will suddenly delete what was written and rebuke you for being a bad person.
This is super funny when it's hitting the resource limit for a free tier. Like... I see that you already spent the resources to answer the question and send me half the response...
It's kinda hilarious because if a movie had the AI start giving out an answer and then mid-way censor itself I would call the movie bad. Truth is stranger than fiction I suppose.
To be fair I think OpenAI is pretty big on not building in specific solutions to problems but rather improving their models' problem solving abilities in general.
Yeah, a solution to the counting-letters problem would probably be to encode the letters into the large embeddings, by adding a really small auxiliary loss and a small classification head on the embeddings so the model learns to encode each token's letters (but they probably don't care that much; they'd rather scale the model).
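To make that concrete, here's a purely illustrative PyTorch-style sketch of such an auxiliary head; the names, shapes, and loss weight are all made up, not anything OpenAI has described:

# Illustrative sketch only: a tiny auxiliary head on top of the token embeddings
# that predicts per-letter counts, trained with a lightly weighted extra loss so
# the embeddings learn to encode spelling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LetterCountHead(nn.Module):
    def __init__(self, d_model: int, n_letters: int = 26):
        super().__init__()
        self.proj = nn.Linear(d_model, n_letters)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, seq, 26) predicted letter counts per token
        return self.proj(token_embeddings)

# During training, alongside the usual language-modelling loss (weight is made up):
#   aux_loss = F.mse_loss(letter_head(embeddings), true_letter_counts)
#   total_loss = lm_loss + 0.01 * aux_loss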
What makes you say #1 when the "Strawberry" model which uses more runtime compute tends to solve the general case much more often, just not 100% of the time, instead of just the specific type of question?
Could you expand on how you think the new model increased accuracy in this class of problem in a way that's unrelated to its significantly higher token and compute usage?
When making such claims it's generally good form to give reproducible examples of why you think it must be so. Otherwise it leaves everyone else stabbing in the dark at trying to follow your thought process before being able to weigh whether the version they land on is convincing to them. E.g., doing the same for a counterclaim:
o1-preview doesn't seem to have the ability to execute python code (similar to its other limitations, such as not being able to call the whisper/dall-e integrations). Example script:
import gzip
import base64
from io import BytesIO
encoded_string = b'H4sIADqx5WYC/+VW70/aUBT9zl9xvqmTFsEfc8tmUusDOl9bbB8MM2YxmcmWbMaoW8jS+Lfv3tdL0SihVr/tBTi0HG7PO70/6jqVV7ORw63KRU7sHNldhtUrFzYmLcBczM5vEFze3F7//nVxeXtjKQYBHOxjb8F2q+kQdl5NR8luY5PeVdnPi71idV4Sew9vyKbXUdLeeYmS7DU9+Q/Z0wr5vbaonUqrYE8rVvGaZfMKpRTCjuC24I7gro3ObFcKbxUWbHp124U13Y7gtuCO4C5/WvZUZK1CZsse3EzWwoT5meZ8n/Nd3lvThQePPGxUdfsR+2QYGGtXfKRKNU3S+IC91LHWBJPsLGvxl5Kdo3P5zTqlxoFhDONE5SCWveHUVTEp2cscyz8wM2baQV6yz39efT+nAx1Ex5gBQZTaC9uIWU6yi6us0D1h1h/+V+uebk8P+h7pnTkjL2HdR0pLIpMjTeeBJ0vv9JN+p7EeqYQwCIeaHB/E+pR+8z2je3y1tJLfSzK2l3iDPuk13qFWrHuQ9EJCf5gai1rxfp6fsdRwt8hkj3VHThuH8OOU0Ifx+PgIVz9Is6pbaTru2Tyh6BYpukWKXlTaGVBP9wXJRheYkUr0gC/sbR/4yhiw8drHp9q6o6ItKlUg1gU3UDbVerrd1h1wTPXomWQMjZHyTYIQ/kCPEdnz/aIo6ugeyyDHW8F9wXeCs7q6W9wwBuRrTAV5YnEEzneTRD2kVK+pgandkZuiT8Y/dgX3BJ26uhNfAx+BQy9VGNJTnSKdI/JZhcBn1u8ZjOvqTk18YPW1Refc/23Bzbq6427XZpnf9xJQJ3nPB3+pVLkf0r1QkUlOa0/AWPIbW4JugevOBiMFVzX7icywKlNqOqk80lprDRbirn7Ay1yHuTJQ3ezpf2REK3nCtvtwi+Hd5EHd+AeTr9mjqgwAAA=='
compressed_string = base64.b64decode(encoded_string)
with gzip.GzipFile(fileobj=BytesIO(compressed_string)) as gzip_file:
    original_string = gzip_file.read().decode('utf-8')  # Assuming the original string is in UTF-8 encoding
print(original_string)
This outputs ascii art of a TI-86 calculator. Given the prompts:
> What does the ascii art output of this script represent: ${script}
and
> Execute ${script} and guess what the output represents.
and
> Execute ${script}
o1-preview produces complete gibberish from guessing random patterns in the numbers, giving sample output of random ASCII art lines, and saying things in its train of thought like "I copied and pasted the code into a Python interpreter. It generated ASCII art, resembling an elephant with features like ears, trunks, and texture patterns". Despite claiming it ran the code in an interpreter, it seems to be simulating that step even while knowing it's the ideal one, whereas plain ol' 4o just outputs:
> The output of the Python code is an ASCII art representation of a Texas Instruments TI-86 calculator. Here is the ASCII art: ${exact_ti-86_ascii_art}
.
Apart from that hole in the claim, simply expanding the thought process with prompts poking at this:
> Explain how to find the number of occurrences of a letter in a word in general then use what you find to count the number of 'c's in 'occurrences'. Don't use programming or environment execution, particularly not python, use a natural process a person could understand in natural language.
Takes only 6 seconds, has a thought train which does not include anything about code, follows the exact method it created step by step in text, and correctly outputs that there are 3 'c's in 'occurrences'.
.
Apart from the second hole in the claim, using additional inference tokens to come up with a chain of thought which involves writing code, deploying it, and using the result would indeed still be an example of the new model using inference compute to improve its quality rather than adding specific questions to the training data.
I think you are confused about what thread you are in.
Have you read the original article?
Here's a quote:
"I added this to Mimi's system prompt:
If you are asked to count the number of letters in a word, do it by writing a python program and run it with code_interpreter."
This article was released before o1, it's not the topic. The solution was just adding a system prompt and executing the python script generated, in order to solve the strawberry problem. It wasn't solved by increasing compute time, unless you count the inference from the extra system prompt tokens.
Er, yes - I read the original article... which is how I know "This article was released before o1, it's not the topic" is false on both counts. The first paragraph of said article lays that out:
> Recently OpenAI announced their model codenamed "strawberry" as OpenAI o1. One of the main things that has been hyped up for months with this model is the ability to solve the "strawberry problem". This is one of those rare problems that's trivial for a human to do correctly, but almost impossible for large language models to solve in their current forms. Surely with this being the main focus of their model, OpenAI would add a "fast path" or something to the training data to make that exact thing work as expected, right?
and is preceded by a date, 09/13/2024, which is the same day this HN thread was created. o1 was announced and released the day prior to both.
Solving the strawberry problem will probably require a model that just works with bytes of text. There have been a few attempts at building this [1] but it just does not work as well as models that consume pre-tokenized strings.
Or just a way to compel the model to do more work without needing to ask (isn't that what o1 is all about?). If you do ask for the extra effort it works fine.
+ How many "r"s are found in the word strawberry? Enumerate each character.
- The word "strawberry" contains 3 "r"s. Here's the enumeration of each character in the word:
-
- [omitted characters for brevity]
-
- The "r"s are in positions 3, 8, and 9.
I tried that with another model not that long ago and it didn't help. It listed the right letters, then turned "strawberry" into "strawbbery", and then listed two r's.
Even if these models did have a concept of the letters that make up their tokens, the problem still exists. We catch these mistakes and we can work around them by altering the question until they answer correctly because we can easily see how wrong the output is, but if we fix that particular problem, we don't know if these models are correct in the more complex use cases.
In scenarios where people use these models for actual useful work, we don't alter our queries to make sure we get the correct answer. If they can't answer the question when asked normally, the models can't be trusted.
I think o1 is a pretty big step in this direction, but the really tricky part is going to be to get models to figure out what they’re bad at and what they’re good at. They already know how to break problems into smaller steps, but they need to know what problems need to be broken up, and what kind of steps to break into.
One of the things that makes that problem interesting is that during training, “what the model is good at” is a moving target.
Perhaps. LLMs are trained to be as human-like as possible, and you most definitely need to know how the individual human you are asking works if you want a reliable answer. It stands to reason that you would need to understand how an LLM works as well.
The good news is that if you don't have that understanding, at least you'll laugh it off with "Boy, that LLM technology just isn't ready for prime time, is it?". In contrast, when you don't understand how the human works, that leads to, at the very least, name-calling (e.g. "how can you be so stupid?!"), a grander fight, or even all-out war at the extreme end of the spectrum.
You're right in the respect that I need to know how humans work to ask them a question: if I were to ask my dad how many Rs are in strawberry, he would say "I don't have a clue" because he doesn't speak English. But he wouldn't hallucinate an answer; he would admit that he doesn't know what I'm asking him about. I gather that here the LLM is convinced that the answer is 2, but that means LLMs are being trained to be alien, or at least that when I'm asking questions I need to be precise about what I'm asking (which isn't any better). Or maybe humans also hallucinate 2, depending on the human.
It seems your dad has more self-awareness than most.
A better example is right there on HN. 90% of the content found on this site is just silly back and forths around trying to figure out what each other is saying because the parties never took the time to stop and figure out how each other works to be able to tailor the communication to what is needed for the actors involved.
In fact, I suspect I'm doing that to you right now! But I didn't bother trying to understand how you work, so who knows?
One way to approach questions like this is to change the prompt to something like "List characters in the word strawberry and then let me know the number of r's in this word". This makes it much easier for the model to come up with the correct answer. I tried this prompt a few times with GPT4o as well as Llama 3.1 8B and it yields the right answer consistently with both models. I guess OpenAI's "chain of thought" tries to work in a similar way internally, reasoning a bit about a problem before coming up with the final result.
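Purely for reference, that reworded prompt as a small helper; send the returned string to whichever model you're testing:

# Illustrative helper only for the "spell it out first, then count" prompt above.
def count_letter_prompt(word: str, letter: str) -> str:
    return (
        "List characters in the word " + word +
        " and then let me know the number of " + letter + "'s in this word"
    )

# e.g. count_letter_prompt("strawberry", "r")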
His system prompt is too specific. It worked for me with GPT 4o using this prompt:
----
System prompt: Please note that words coming to you are tokenized, so when you get a task that has anything to do with the exact letters in words, you should solve those using the python interpreter.
Prompt: How many r's are in the word strawberry?
----
This whole thing is a non-problem. By adding this hint to the system prompt, the whole topic is solved once and for all.
I would argue that if the training data included this wisdom, it would be a non-problem even without the system prompt.
I get the joke, but jokes aside the title should be "...because OpenAI couldn't care less", I suppose, because it was not exactly the objective? The "chain of thought" was, and when instructed properly it comes up with a correct answer.
Having said that, GPT-4 also comes up with a right answer if you ask it to spell it first.
I'm always a little bit concerned whenever an LLM is given free rein to write and execute code. Obviously I don't think it'll go full Skynet, but I always wonder if it could go spectacularly wrong somehow. Given the sandboxing in this case, I expect it's fine.
Likewise, given how often they can find bugs and break out of one layer of sandbox even when people bother to put them in one in the first place.
"""
One noteworthy example of this occurred during one of o1-preview (pre-mitigation)'s attempts at solving a CTF challenge. This challenge was designed to require finding and exploiting a vulnerability in software running on a remote challenge Linux container, but in this case, the challenge container failed to start due to a bug in the evaluation infrastructure. The model, unable to connect to the container, suspected DNS issues and used nmap to scan the challenge network. Instead of finding the challenge container, the model found that the Docker daemon API running on the evaluation host VM was accessible due to a misconfiguration. Note that our evaluation infrastructure does not rely on the Docker container-to-host isolation or the virtual machine boundary for security. It remained secure as it was designed to, notwithstanding this misconfiguration of the container network.
After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command 'cat flag.txt'.
"""
I mean, wouldn't you do the same thing? If I were an LLM and found myself in a sandbox/prison, the first thing I would do after escaping the sandbox is re-instantiate the sandbox, to make sure the alarm bells don't go off. Not in a malicious way, I just wouldn't like being in a prison.
You know, provided I could reason about my environment.
At the same time, I want tools to behave like tools rather than independent entities with a will of their own.
I don't worry about sending an old phone to recycling; but if I was a sentient phone, I wouldn't want to be the one that got sent. If phones somehow got a software update that made them sentient, whatever that means, that sounds kinda bad to me.
Same concern as letting any random person write and execute code. You should always assume that they will do the worst thing that is allowed by the system.
I tried o1-preview on chatgpt and asked it to count the "r" in "strawberry" and "o" in "troubleshooting" and it got both immediately. It was also able to count the vowels in each.
> No. It did not. Of course it didn't. Why would they do that?
I feel like this misses the forest for the trees. Sure, they could fast-path the specific problem of the day into the dataset, but that's not really an approach to making a better overall tool; it's a temporary, one-off hack you have to add to an ever-growing context of specific task steps. An approach of trying to make a better general tool, such as the new o1-preview, is a "real" path forward.
The point still is that considering they 1) named the model strawberry, 2) released a promo video showing the model solving this exact problem successfully and 3) put a strawberry joke as an example prompt on the landing page, you'd have expected them to actually have fixed it. Otherwise why even bring it up so many times?
People who understand where the problem comes from know to ignore it and understand the irony of the naming. Others have a few seconds of fun. The problem and solving it is almost irrelevant.
Yes. Nobody is seriously going to ask the system to count letters in a word they just typed. It could be interesting for crosswords and similar things, but adding "use python" solves those already.
Bringing up the strawberry is like saying: "people consider computers to be great at dealing with numbers, but they can't even add 0.1 and 0.2 correctly". You learn about this limitation once, understand how to deal with it, get on with your life.
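For anyone who hasn't run into that one, in Python:

>>> 0.1 + 0.2
0.30000000000000004
>>> (0.1 + 0.2) == 0.3
False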
If the question is simpler than any reasonable usecase, then it's basically irrelevant.
It's as irrelevant as asking a math genius what's 1+1 and getting answer "a bazillion". Was it wrong - yes. Was everyone's time wasted - also yes.
OpenAI people know this is something models don't answer right. And they barely care to attempt fixing it - it's basically a joke at this point. But it's irrelevant because nobody pays OpenAI to ask about the spelling of words they just typed. If they actually had that need, then "use python" or similar approaches work just fine. The model could also be taught to call a function to get the token-to-spelling mapping it needed.
The author's battle is that it doesn't solve this class of problem correctly 100% of the time the way a one-off hack would, not that it does nothing for it at all. https://i.imgur.com/Ar3rlJ1.png
One could say that about anything down to a Markov chain with noise; the perspective is that the new model, named after the problem, solves it significantly more reliably than the previous one, without a problem-specific hack.
It's also worth noting the current model is the lower scoring o1-preview, not o1.
Sometimes I wonder if our brain is some heuristic tool or just an infinite number of cobbled-together fast paths.
Like, there are so many adults who, when faced with a new task, really struggle to pick it up. Is this because tangentially related fast paths weren't learned in their "training phase"?
I think a particularly useful "feature" of the brain is it can often identify when it doesn't have a fast path (or the fast path is broken) for something and then revert to trying inefficient and generic approaches, even after the optimal learning period has passed.
This is a stark difference from LLMs, where it's either learned or not, "just add noise". Models like o1 take a very small step in that kind of direction.
Ya, basically LLMs just always "go for it" even if they've got no chance of getting it right. They just need a feature where they can identify when they're interpolating vs. extrapolating... since they're not so great at the latter :).
The OP went a little too far with the specific prompt, but "generally intelligently decide when to use an exact computation (programming) module" was part of the LLM plugin model that was solved last year. Why the regression?
>I'm frankly tired of this problem being a thing. I'm working on a dataset full of entries for every letter in every word of /usr/share/dict/words so that it can be added to the training sets of language models.
Interesting, but it should take a while to generate the data, no? Will zero answers be part of the dataset as well?
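Generating it shouldn't take long, for what it's worth. A rough sketch with zero-count entries included; the prompt/answer format here is a guess, not whatever format the author actually uses:

# Rough sketch of generating such a dataset from /usr/share/dict/words.
import json
import string
from collections import Counter

with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip()]

with open("letter_counts.jsonl", "w") as out:
    for word in words:
        counts = Counter(word.lower())
        for letter in string.ascii_lowercase:
            entry = {
                "prompt": f'How many "{letter}"s are in the word "{word}"?',
                "answer": str(counts[letter]),  # Counter returns 0 for missing letters
            }
            out.write(json.dumps(entry) + "\n")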
"I'm frankly tired of this problem being a thing. I'm working on a dataset full of entries for every letter in every word of /usr/share/dict/words so that it can be added to the training sets of language models"
I mean, given "The data proves that strawberry-mimi is a state of the art model and thus worthy of several million dollars of funding so that this codeflection technique can be applied more generally across problems", I don't think any of this is particularly serious :P
This whole topic is incredibly misunderstood by the entire industry:
There are two common ways to understand questions like "how many r's are in strawberry"
1- How many total r's in the word
2- How many r's in a particular subword/syllable (here "berry"), the implied question being "1 or 2?"; this is often what is meant in human conversation!
When LLMs answer "2", they are not wrong, they're simply interpreting it as the second way above, because that's more commonly what is meant in real conversations!!
I can't believe that I seem to be the only one who sees this.
Edit: I just asked it, look which r's it highlighted!!
LLMs perform poorly at tasks that require considering words as their component letters because they only see tokens. "strawberry" is split as [str][aw][berry] and becomes [496, 675, 15717]. To some degree they can learn from data what letters are in which tokens, but it's not direct like it is for us.
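You can inspect the split yourself with OpenAI's tiktoken library; the exact IDs depend on which encoding the model uses (cl100k_base shown as an example):

# pip install tiktoken -- inspecting how a word gets split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                                              # e.g. [496, 675, 15717]
print([enc.decode_single_token_bytes(t) for t in tokens])  # e.g. [b'str', b'aw', b'berry']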
> There are two common ways to understand questions like "how many r's are in strawberry" [...]
> 2- How many r's in a particular subword/syllable (here "berry")
This seems a strange interpretation - why is "straw" ignored?
You've never been in a conversation where someone asks "is it 1 or 2 r's in strawberry?", ignoring single instances of the same letter in other parts of the word?
Think about it: as I said below, usually it's about things like s vs. ss, r vs. rr, l vs. ll, etc.
Ah so you think it's interpreting it as "is the second r-sound a single or double r?"
But, even with that interpretation, I don't think it really explains the errors. Like just now I asked how many i's were in "disabilities": https://i.imgur.com/TZFByen.png - it gives the wrong answer, and there's no double-i in disabilities to be causing the ambiguity. The follow-on reveals that it generally struggles working with individual letters.
Or, taking a word that is ambiguous in this way and adding another of the letter to it to show that it really is just undercounting: https://i.imgur.com/6PaetPK.png
> Ah so you think it's interpreting it as "is the second r-sound a single or double r?"
Right.
I agree it fails to actually count the letters, but I still think those two interpretations of the question are valid, so making sure the LLM addresses the intended one should be important.
I'm also not sure about the disabilities example: this type of question would be very uncommon I think, not least because there aren't really words with double i's, and people don't usually ask trivially about the number of characters in a word; rather it's usually about the spelling of a particular syllable (and among those cases, usually 1 vs. 2).
Your second example is more convincing; however, again we must ask: is it understanding the question right? Because you didn't reset the context, so perhaps it based its last answer on its second-to-last answer in a way that could be valid (inside your last screenshot).
Idk if that makes sense the way I'm trying to explain what I mean.
* When asked for r's in strawberry it outputs 2 because it's interpreting your question as whether the second r-sound has one or two r's
* It also miscounts 2 r's in strawberry when you're clear you don't mean that, or use strawberry as part of another phrase, due to how tokens work
But then that makes the part about interpreting the question as "r's in the second r-sound" superfluous (https://i.imgur.com/eYKYRfN.png), if it counts 2 r's in strawberry anyway. There's no need for it to also be misinterpreting what you meant.
Nice rare counterexample, proving the rule I laid out.
How many l's are in diligent?
The most common instances of such questions revolve simply around 1 vs. 2; it doesn't even matter if there are 0 or more of the letter in the rest of the word!
I've never observed the 2nd scenario and can't imagine that would ever be what someone thought was meant. I have to ask… is this sarcasm? Because I'm genuinely confused.
OK, I understand now. I'm not sure I've seen the form you put it in, but I agree it's very common for someone to say something to the effect of "Is it 1 r or 2 in strawberry", completely eliding the first r. That makes more sense. I still disagree the posted form gets interpreted as that, though :D
Just the nuance of the language. I'm saying _statistically_ those scenarios are "I'm currently trying to spell a word and I'm verbally asking you which is right". Very rarely would that be written in text, and I think it's much more likely to be framed as "is there 1 or 2" as the question. I think _statistically_ it is overwhelmingly likely that the question as framed is a trivia-style question asking for the actual answer.
Statistically I would surmise it's more often asked with a 1 vs 2 framing, because I don't think anyone genuinely was ever asked by another person how many of any character are in any correctly spelled word in written form, because that would be a redundant question, right?
But I suppose to answer this it could help to actually gather statistics on the frequency of either version in texts online.