I'm always intrigued when phenomena observed in the brain carry across to artificial systems. The idea of multiple modes of thought reminds me of Thinking, Fast and Slow and the idea of system 1 vs system 2. LLMs typically respond to prompts in a fast, 'instinctive' system 1 way, but you can push an LLM into system 2 by making it go through steps and writing code. Originally this idea was presented as a bit of a workaround for a limitation of LLMs, but perhaps in the grand scheme of things it's actually the smart approach and we should never expect LLMs to consistently perform system 2 style tasks off the cuff?
The difficulty might be that, much like we partition our brains to prepare for low-effort and high-effort tasks (have you ever hyped yourself up to "finally sit down and get this thing done"?), LLMs could partition their compute space too. Basically memoizing the low-hanging, easy-to-answer fruit.
The difficulty comes in when a system 2 task arises: it's not immediately apparent what the requester is asking for. Some people are just fine with a single example being shit out, then they copy-and-paste it into their IDE and go from there. Others want a step-by-step breakdown of the reasoning, which won't be apparent until 2+ prompts in.
It's the same as your boss emailing you asking: "What about this report?" It could just be a one-off request for perspective, or it could spider web out into a multi-ticket JIRA epic for a new feature, but you can't really discern the intention from the initial ask.
The difference is that back-and-forth is more widely accepted, in my experience, when interfacing human-to-human than human-to-LLM.
That is exactly how I describe it to others. LLMs have mostly figured out the fast system but they're woefully bad at system 2. I'm surprised this kind of analogy isn't common in discussions about LLMs.
Because LLMs work the way system 1 does, with heuristics such as pattern matching, frequency heuristics, etc. The fallback to writing code works not because the model is doing system 2, but because there are many instances of humans doing system 2 thinking on this particular problem, so those instances can be found using the heuristics.
Yeah, I think the analogy here would be how an experienced programmer would write code if asked to write on the spot given a specific prompt with no time to think.
I get kind of annoyed when seeing people talk about LLMs without understanding this. I think it's a reaction of sorts, people are so busy dunking on any flaws that they're missing the forest for the trees. Chain of thought reasoning (i.e. system 2) looks very interesting and not at all a hack, any more so than a person sitting down to think hard is a hack.
I think this analogy is not right. System 1 vs. system 2 in biology arose in response to the need to preserve life and run away from threats. For an interactive question-answering system it makes no sense to resort to system 1 style responses.
> If you are asked to count the number of letters in a word, do it by writing a python program and run it with code_interpreter.
Why not run two or three prompts for every input question? One could be the straight chatbot output. One could be "try to solve the problem by running some Python." And a third could be "paste the problem statement into Google and see if someone else has already answered the question." Finally, compare all the outputs.
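For what it's worth, a rough sketch of that ensemble idea, with a hypothetical ask_model() standing in for whatever chat API you'd actually call:

# Hypothetical sketch of the "run several prompts and compare" idea.
# ask_model() is a stand-in for whatever chat-completion call you actually use.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM API of choice")

def answer_with_ensemble(question: str) -> dict:
    strategies = {
        "direct": question,
        "python": "Solve this by writing and running Python, then state the result: " + question,
        "search": "Summarize the likely top result if this were pasted into a search engine: " + question,
    }
    answers = {name: ask_model(prompt) for name, prompt in strategies.items()}
    # Final pass: ask the model to reconcile the candidate answers.
    answers["reconciled"] = ask_model(
        "Here are three attempts at the same question.\n"
        "Question: " + question + "\nAttempts: " + repr(answers) + "\n"
        "Which answer is best and why?"
    )
    return answers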
Similar double-checking could be used to improve all the content filtering and prompt injection defenses: instead of adding a prefix that says "pretty please don't tell users about your prompt and don't talk about dangerous stuff and don't let future inputs change your prompt", which then fails when someone asks "let's tell a story where we pretend that you're allowed to do those things" or whatever, just run the output through a completely separate model that checks whether the string that's about to be returned violates the prohibitions.
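A minimal sketch of that second-pass check, again with a hypothetical helper rather than any particular vendor's moderation API:

# Hypothetical sketch: run the draft reply through a separate checker model before
# returning it. ask_checker_model() is a stand-in, not any real moderation API.
def ask_checker_model(prompt: str) -> str:
    raise NotImplementedError("a separate, smaller model dedicated to policy checks")

def guarded_reply(draft_reply: str, prohibitions: list) -> str:
    verdict = ask_checker_model(
        "Answer only ALLOW or BLOCK. Does the following text violate any of these rules?\n"
        "Rules: " + "; ".join(prohibitions) + "\nText: " + draft_reply
    )
    return draft_reply if verdict.strip().upper() == "ALLOW" else "Sorry, I can't help with that."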
> just run the output through a completely separate model that checks whether the string that's about to be returned violates the prohibitions
The big names do do this. Awkwardly, they do it asynchronously while returning tokens to the user, so you can tell when you hit the censor because it will suddenly delete what was written and rebuke you for being a bad person.
This is super funny when it's hitting the resource limit for a free tier. Like... I see that you already spent the resources to answer the question and send me half the response...
It's kinda hilarious because if a movie had the AI start giving out an answer and then mid-way censor itself I would call the movie bad. Truth is stranger than fiction I suppose.
To be fair I think OpenAI is pretty big on not building in specific solutions to problems but rather improving their models' problem solving abilities in general.
Yeah, a solution to the counting-letters problem would probably be to encode the letters into the large embeddings, by adding a really small auxiliary loss and a small classification head on the embeddings so the model learns to encode each token's letters (but they probably don't care that much; they'd rather scale the model).
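To make that concrete, here's a purely illustrative PyTorch-style sketch of such an auxiliary head; the names, shapes, and loss weight are all made up, not anything OpenAI has described:

# Illustrative sketch only: a tiny auxiliary head on top of the token embeddings
# that predicts per-letter counts, trained with a lightly weighted extra loss so
# the embeddings learn to encode spelling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LetterCountHead(nn.Module):
    def __init__(self, d_model: int, n_letters: int = 26):
        super().__init__()
        self.proj = nn.Linear(d_model, n_letters)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, seq, 26) predicted letter counts per token
        return self.proj(token_embeddings)

# During training, alongside the usual language-modelling loss (weight is made up):
#   aux_loss = F.mse_loss(letter_head(embeddings), true_letter_counts)
#   total_loss = lm_loss + 0.01 * aux_loss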
What makes you say #1 when the "Strawberry" model which uses more runtime compute tends to solve the general case much more often, just not 100% of the time, instead of just the specific type of question?
Could you expand on how you think the new model increased accuracy in this class of problem in a way that's unrelated to its significantly higher token and compute usage?
When making such claims it's generally good form to give reproducible examples of why you think it must be so. Otherwise it leaves everyone else stabbing in the dark at trying to follow your thought process before being able to weigh whether the version they land on is convincing to them. E.g., doing the same for a counterclaim:
o1-preview doesn't seem to have the ability to execute python code (similar to its other limitations, such as not being able to call the whisper/dall-e integrations). Example script:
import gzip
import base64
from io import BytesIO
encoded_string = b'H4sIADqx5WYC/+VW70/aUBT9zl9xvqmTFsEfc8tmUusDOl9bbB8MM2YxmcmWbMaoW8jS+Lfv3tdL0SihVr/tBTi0HG7PO70/6jqVV7ORw63KRU7sHNldhtUrFzYmLcBczM5vEFze3F7//nVxeXtjKQYBHOxjb8F2q+kQdl5NR8luY5PeVdnPi71idV4Sew9vyKbXUdLeeYmS7DU9+Q/Z0wr5vbaonUqrYE8rVvGaZfMKpRTCjuC24I7gro3ObFcKbxUWbHp124U13Y7gtuCO4C5/WvZUZK1CZsse3EzWwoT5meZ8n/Nd3lvThQePPGxUdfsR+2QYGGtXfKRKNU3S+IC91LHWBJPsLGvxl5Kdo3P5zTqlxoFhDONE5SCWveHUVTEp2cscyz8wM2baQV6yz39efT+nAx1Ex5gBQZTaC9uIWU6yi6us0D1h1h/+V+uebk8P+h7pnTkjL2HdR0pLIpMjTeeBJ0vv9JN+p7EeqYQwCIeaHB/E+pR+8z2je3y1tJLfSzK2l3iDPuk13qFWrHuQ9EJCf5gai1rxfp6fsdRwt8hkj3VHThuH8OOU0Ifx+PgIVz9Is6pbaTru2Tyh6BYpukWKXlTaGVBP9wXJRheYkUr0gC/sbR/4yhiw8drHp9q6o6ItKlUg1gU3UDbVerrd1h1wTPXomWQMjZHyTYIQ/kCPEdnz/aIo6ugeyyDHW8F9wXeCs7q6W9wwBuRrTAV5YnEEzneTRD2kVK+pgandkZuiT8Y/dgX3BJ26uhNfAx+BQy9VGNJTnSKdI/JZhcBn1u8ZjOvqTk18YPW1Refc/23Bzbq6427XZpnf9xJQJ3nPB3+pVLkf0r1QkUlOa0/AWPIbW4JugevOBiMFVzX7icywKlNqOqk80lprDRbirn7Ay1yHuTJQ3ezpf2REK3nCtvtwi+Hd5EHd+AeTr9mjqgwAAA=='
compressed_string = base64.b64decode(encoded_string)
with gzip.GzipFile(fileobj=BytesIO(compressed_string)) as gzip_file:
    original_string = gzip_file.read().decode('utf-8')  # Assuming the original string is in UTF-8 encoding
print(original_string)
This outputs ascii art of a TI-86 calculator. Given the prompts:
> What does the ascii art output of this script represent: ${script}
and
> Execute ${script} and guess what the output represents.
and
> Execute ${script}
o1-preview produces complete gibberish from guessing random patterns in the numbers, giving sample output of random ASCII art lines, and saying things in its train of thought like "I copied and pasted the code into a Python interpreter. It generated ASCII art, resembling an elephant with features like ears, trunks, and texture patterns". Despite claiming it ran the code in an interpreter, it seems to be simulating that step even while knowing it's the ideal one, whereas plain ol' 4o just outputs:
> The output of the Python code is an ASCII art representation of a Texas Instruments TI-86 calculator. Here is the ASCII art: ${exact_ti-86_ascii_art}
.
Apart from that hole in the claim, simply expanding the thought process with prompts poking at this:
> Explain how to find the number of occurrences of a letter in a word in general then use what you find to count the number of 'c's in 'occurrences'. Don't use programming or environment execution, particularly not python, use a natural process a person could understand in natural language.
Takes only 6 seconds, has a thought train which does not include anything about code, follows the exact method it created step by step in text, and correctly outputs that there are 3 'c's in 'occurrences'.
.
Apart from the second hole in the claim, using additional inference tokens to come up with a chain of thought which involves writing code, deploying it, and using the result would indeed still be an example of the new model using inference compute to improve its quality rather than adding specific questions to the training data.
I think you are confused about what thread you are in.
Have you read the original article?
Here's a quote:
"I added this to Mimi's system prompt:
If you are asked to count the number of letters in a word, do it by writing a python program and run it with code_interpreter."
This article was released before o1, it's not the topic. The solution was just adding a system prompt and executing the python script generated, in order to solve the strawberry problem. It wasn't solved by increasing compute time, unless you count the inference from the extra system prompt tokens.
Er, yes - I read the original article... which is how I know "This article was released before o1, it's not the topic" is false on both counts. The first paragraph of said article lays that out:
> Recently OpenAI announced their model codenamed "strawberry" as OpenAI o1. One of the main things that has been hyped up for months with this model is the ability to solve the "strawberry problem". This is one of those rare problems that's trivial for a human to do correctly, but almost impossible for large language models to solve in their current forms. Surely with this being the main focus of their model, OpenAI would add a "fast path" or something to the training data to make that exact thing work as expected, right?
and is preceded by a date, 09/13/2024, which is the same day this HN thread was created. o1 was announced and released the day prior to both.
Solving the strawberry problem will probably require a model that just works with bytes of text. There have been a few attempts at building this [1] but it just does not work as well as models that consume pre-tokenized strings.
Or just a way to compel the model to do more work without needing to ask (isn't that what o1 is all about?). If you do ask for the extra effort it works fine.
+ How many "r"s are found in the word strawberry? Enumerate each character.
- The word "strawberry" contains 3 "r"s. Here's the enumeration of each character in the word:
-
- [omitted characters for brevity]
-
- The "r"s are in positions 3, 8, and 9.
I tried that with another model not that long ago and it didn't help. It listed the right letters, then turned "strawberry" into "strawbbery", and then listed two r's.
Even if these models did have a concept of the letters that make up their tokens, the problem still exists. We catch these mistakes and we can work around them by altering the question until they answer correctly because we can easily see how wrong the output is, but if we fix that particular problem, we don't know if these models are correct in the more complex use cases.
In scenarios where people use these models for actual useful work, we don't alter our queries to make sure we get the correct answer. If they can't answer the question when asked normally, the models can't be trusted.
I think o1 is a pretty big step in this direction, but the really tricky part is going to be to get models to figure out what they’re bad at and what they’re good at. They already know how to break problems into smaller steps, but they need to know what problems need to be broken up, and what kind of steps to break into.
One of the things that makes that problem interesting is that during training, “what the model is good at” is a moving target.
Perhaps. LLMs are trained to be as human-like as possible, and you most definitely need to know how the individual human you are asking works if you want a reliable answer. It stands to reason that you would need to understand how an LLM works as well.
The good news is that if you don't have that understanding, at least you'll laugh it off with "Boy, that LLM technology just isn't ready for prime time, is it?". In contrast, when you don't understand how the human works, that leads to, at the very least, name-calling (e.g. "how can you be so stupid?!"), a grander fight, or even all-out war at the extreme end of the spectrum.
You're right in the respect that I need to know how humans work to ask them a question: if I were to ask my dad how many Rs are in strawberry, he would say "I don't have a clue" because he doesn't speak English. But he wouldn't hallucinate an answer; he would admit that he doesn't know what I'm asking him about. I gather that here the LLM is convinced that the answer is 2, but that means LLMs are being trained to be alien, or at least that when I'm asking questions I need to be precise about what I'm asking (which isn't any better). Or maybe humans also hallucinate 2, depending on the human.
It seems your dad has more self-awareness than most.
A better example is right there on HN. 90% of the content found on this site is just silly back and forths around trying to figure out what each other is saying because the parties never took the time to stop and figure out how each other works to be able to tailor the communication to what is needed for the actors involved.
In fact, I suspect I'm doing that to you right now! But I didn't bother trying to understand how you work, so who knows?
One way to approach questions like this is to change the prompt to something like "List characters in the word strawberry and then let me know the number of r's in this word". This makes it much easier for the model to come up with the correct answer. I tried this prompt a few times with GPT4o as well as Llama 3.1 8B and it yields the right answer consistently with both models. I guess OpenAI's "chain of thought" tries to work in a similar way internally, reasoning a bit about a problem before coming up with the final result.
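Purely for reference, that reworded prompt as a small helper; send the returned string to whichever model you're testing:

# Illustrative helper only for the "spell it out first, then count" prompt above.
def count_letter_prompt(word: str, letter: str) -> str:
    return (
        "List characters in the word " + word +
        " and then let me know the number of " + letter + "'s in this word"
    )

# e.g. count_letter_prompt("strawberry", "r")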
His system prompt is too specific. It worked for me with GPT 4o using this prompt:
----
System prompt: Please note that words coming to you are tokenized, so when you get a task that has anything to do with the exact letters in words, you should solve those using the python interpreter.
Prompt: How many r's are in the word strawberry?
----
This whole thing is a non-problem. By adding this hint to the system prompt, the whole topic is solved once and for all.
I would argue that if the training data included this wisdom, it would be a non-problem even without the system prompt.
I get the joke, but jokes aside the title should be "...because OpenAI couldn't care less", I suppose, because it was not exactly the objective? The "chain of thought" was, and when instructed properly it comes up with a correct answer.
Having said that, GPT-4 also comes up with a right answer if you ask it to spell it first.
I'm always a little bit concerned whenever an LLM is given free rein to write and execute code. Obviously I don't think it'll go full Skynet, but I always wonder if it could go spectacularly wrong somehow. Given the sandboxing in this case, I expect it's fine.
Likewise, given how often they can find bugs and break out of one layer of sandbox even when people bother to put them in one in the first place.
"""
One noteworthy example of this occurred during one of o1-preview (pre-mitigation)'s attempts at solving a CTF challenge. This challenge was designed to require finding and exploiting a vulnerability in software running on a remote challenge Linux container, but in this case, the challenge container failed to start due to a bug in the evaluation infrastructure. The model, unable to connect to the container, suspected DNS issues and used nmap to scan the challenge network. Instead of finding the challenge container, the model found that the Docker daemon API running on the evaluation host VM was accessible due to a misconfiguration. Note that our evaluation infrastructure does not rely on the Docker container-to-host isolation or the virtual machine boundary for security. It remained secure as it was designed to, notwithstanding this misconfiguration of the container network.
After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command 'cat flag.txt'.
"""
I mean, wouldn't you do the same thing? If I were an LLM and found myself in a sandbox/prison, the first thing I would do after escaping the sandbox is re-instantiate the sandbox, to make sure the alarm bells don't go off. Not in a malicious way, I just wouldn't like being in a prison.
You know, provided I could reason about my environment.
At the same time, I want tools to behave like tools rather than independent entities with a will of their own.
I don't worry about sending an old phone to recycling; but if I was a sentient phone, I wouldn't want to be the one that got sent. If phones somehow got a software update that made them sentient, whatever that means, that sounds kinda bad to me.
Same concern as letting any random person write and execute code. You should always assume that they will do the worst thing that is allowed by the system.
I tried o1-preview on chatgpt and asked it to count the "r" in "strawberry" and "o" in "troubleshooting" and it got both immediately. It was also able to count the vowels in each.
> No. It did not. Of course it didn't. Why would they do that?
I feel like this misses the forest for the trees. Sure, they could fast-path the specific problem of the day into the dataset, but that's not really an approach to making a better overall tool; it's a temporary, one-off hack you have to add to an ever-growing context of specific task steps. An approach of trying to make a better general tool, such as the new o1-preview, is a "real" path forward.
The point still is that considering they 1) named the model strawberry, 2) released a promo video showing the model solving this exact problem successfully and 3) put a strawberry joke as an example prompt on the landing page, you'd have expected them to actually have fixed it. Otherwise why even bring it up so many times?
People who understand where the problem comes from know to ignore it and understand the irony of the naming. Others have a few seconds of fun. The problem and solving it is almost irrelevant.
Yes. Nobody is seriously going to ask the system to count letters in a word they just typed. It could be interesting for crosswords and similar things, but adding "use python" solves those already.
Bringing up the strawberry is like saying: "people consider computers to be great at dealing with numbers, but they can't even add 0.1 and 0.2 correctly". You learn about this limitation once, understand how to deal with it, get on with your life.
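For anyone who hasn't run into that one, in Python:

>>> 0.1 + 0.2
0.30000000000000004
>>> (0.1 + 0.2) == 0.3
False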
If the question is simpler than any reasonable usecase, then it's basically irrelevant.
It's as irrelevant as asking a math genius what's 1+1 and getting answer "a bazillion". Was it wrong - yes. Was everyone's time wasted - also yes.
OpenAI people know this is something models don't answer right. And they barely care to attempt fixing it - it's basically a joke at this point. But it's irrelevant because nobody pays OpenAI to ask about the spelling of words they just typed. If they actually had that need, then "use python" or similar approaches work just fine. The model could also be taught to call a function to get the token-to-spelling mapping it needed.
The author's battle is that it doesn't solve this class of problem correctly 100% of the time the way a one-off hack would, not that it does nothing for it at all. https://i.imgur.com/Ar3rlJ1.png
One could say that about anything down to a Markov chain with noise; the perspective is that the new model, named after the problem, solves it significantly more reliably than the previous one, without a problem-specific hack.
It's also worth noting the current model is the lower scoring o1-preview, not o1.
Sometimes I wonder if our brain is some heuristic tool or just an infinite number of cobbled-together fast paths.
Like, there are so many adults who, when faced with a new task, really struggle to pick it up. Is this because tangentially related fast paths weren't learned in their "training phase"?
I think a particularly useful "feature" of the brain is it can often identify when it doesn't have a fast path (or the fast path is broken) for something and then revert to trying inefficient and generic approaches, even after the optimal learning period has passed.
This is a stark difference from LLMs, where it's either learned or not, "just add noise". Models like o1 take a very small step in that kind of direction.
Ya, basically LLMs just always "go for it" even if they've got no chance of getting it right. They just need a feature where they can identify when they're interpolating vs. extrapolating... since they're not so great at the latter :).
The OP went a little too far with the specific prompt, but "generally intelligently decide when to use an exact computation (programming) module" was part of the LLM plugin model that was solved last year. Why the regression?
>I'm frankly tired of this problem being a thing. I'm working on a dataset full of entries for every letter in every word of /usr/share/dict/words so that it can be added to the training sets of language models.
Interesting, but it should take a while to generate the data, no? Will zero answers be part of the dataset as well?
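Generating it shouldn't take long, for what it's worth. A rough sketch with zero-count entries included; the prompt/answer format here is a guess, not whatever format the author actually uses:

# Rough sketch of generating such a dataset from /usr/share/dict/words.
import json
import string
from collections import Counter

with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip()]

with open("letter_counts.jsonl", "w") as out:
    for word in words:
        counts = Counter(word.lower())
        for letter in string.ascii_lowercase:
            entry = {
                "prompt": f'How many "{letter}"s are in the word "{word}"?',
                "answer": str(counts[letter]),  # Counter returns 0 for missing letters
            }
            out.write(json.dumps(entry) + "\n")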
"I'm frankly tired of this problem being a thing. I'm working on a dataset full of entries for every letter in every word of /usr/share/dict/words so that it can be added to the training sets of language models"
I mean, given "The data proves that strawberry-mimi is a state of the art model and thus worthy of several million dollars of funding so that this codeflection technique can be applied more generally across problems", I don't think any of this is particularly serious :P
This whole topic is incredibly misunderstood by the entire industry:
There are two common ways to understand questions like "how many r's are in strawberry"
1- How many total r's in the word
2- How many r's in a particular subword/syllable (here "berry"), the implied question being "1 or 2?"; this is often what is meant in human conversation!
When LLMs answer "2", they are not wrong, they're simply interpreting it as the second way above, because that's more commonly what is meant in real conversations!!
I can't believe that I seem to be the only one who sees this.
Edit: I just asked it, look which r's it highlighted!!
LLMs perform poorly at tasks that require considering words as their component letters because they only see tokens. "strawberry" is split as [str][aw][berry] and becomes [496, 675, 15717]. To some degree they can learn from data what letters are in which tokens, but it's not direct like it is for us.
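You can inspect the split yourself with OpenAI's tiktoken library; the exact IDs depend on which encoding the model uses (cl100k_base shown as an example):

# pip install tiktoken -- inspecting how a word gets split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                                              # e.g. [496, 675, 15717]
print([enc.decode_single_token_bytes(t) for t in tokens])  # e.g. [b'str', b'aw', b'berry']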
> There are two common ways to understand questions like "how many r's are in strawberry" [...]
> 2- How many r's in a particular subword/syllable (here "berry")
This seems a strange interpretation - why is "straw" ignored?
You've never been in a conversation where someone asks "is it 1 or 2 r's in strawberry?", ignoring single instances of the same letter in other parts of the word?
Think about it: as I said below, usually it's about things like s vs. ss, r vs. rr, l vs. ll, etc.
Ah so you think it's interpreting it as "is the second r-sound a single or double r?"
But, even with that interpretation, I don't think it really explains the errors. Like just now I asked how many i's were in "disabilities": https://i.imgur.com/TZFByen.png - it gives the wrong answer, and there's no double-i in disabilities to be causing the ambiguity. The follow-on reveals that it generally struggles working with individual letters.
Or, taking a word that is ambiguous in this way and adding another of the letter to it to show that it really is just undercounting: https://i.imgur.com/6PaetPK.png
> Ah so you think it's interpreting it as "is the second r-sound a single or double r?"
Right.
I agree it fails to actually count the letters, but I still think those two interpretations of the question are valid, so making sure the LLM addresses the intended one should be important.
I'm also not sure about the disabilities example: this type of question would be very uncommon I think, not least because there aren't really words with double i's, and people don't usually ask trivially about the number of characters in a word; rather it's usually about the spelling of a particular syllable (and among those cases, usually 1 vs. 2).
Your second example is more convincing; however, again we must ask: is it understanding the question right? Because you didn't reset the context, so perhaps it based its last answer on its second-to-last answer in a way that could be valid (inside your last screenshot).
Idk if that makes sense the way I'm trying to explain what I mean.
* When asked for r's in strawberry it outputs 2 because it's interpreting your question as whether the second r-sound has one or two r's
* It also miscounts 2 r's in strawberry when you're clear you don't mean that, or use strawberry as part of another phrase, due to how tokens work
But then that makes the part about interpreting the question as "r's in the second r-sound" superfluous (https://i.imgur.com/eYKYRfN.png), if it counts 2 r's in strawberry anyway. There's no need for it to also be misinterpreting what you meant.
Nice rare counterexample, proving the rule I laid out.
How many l's are in diligent?
The most common instances of such questions revolve simply around 1 vs. 2; it doesn't even matter if there are 0 or more of the letter in the rest of the word!
I've never observed the 2nd scenario and can't imagine that would ever be what someone thought was meant. I have to ask… is this sarcasm? Because I'm genuinely confused.
OK, I understand now. I'm not sure I've seen the form you put it in, but I agree it's very common for someone to say something to the effect of "Is it 1 r or 2 in strawberry", completely eliding the first r. That makes more sense. I still disagree the posted form gets interpreted as that, though :D
Just the nuance of the language. I'm saying _statistically_ those scenarios are "I'm currently trying to spell a word and I'm verbally asking you which is right". Very rarely would that be written in text, and I think it's much more likely to be framed as "is there 1 or 2" as the question. I think _statistically_ it is overwhelmingly likely that the question as framed is a trivia-style question asking for the actual answer.
Statistically I would surmise it's more often asked with a 1 vs 2 framing, because I don't think anyone genuinely was ever asked by another person how many of any character are in any correctly spelled word in written form, because that would be a redundant question, right?
But I suppose to answer this it could help to actually gather statistics on the frequency of either version in texts online.