
Could you expand on how you think the new model increased accuracy in this class of problem in a way that's unrelated to its significantly higher token and compute usage?


It writes and executes python code.


When making such claims it's generally good form to give reproducible examples of why you think it must be so. Otherwise everyone else is left stabbing in the dark trying to follow your thought process before they can weigh whether the version they land on is convincing. E.g., the same goes for a counterclaim:

o1-preview doesn't seem to have the ability to execute Python code (similar to its other limitations, such as not being able to call the Whisper/DALL-E integrations). Example script:

    import gzip
    import base64
    from io import BytesIO
    
    encoded_string = b'H4sIADqx5WYC/+VW70/aUBT9zl9xvqmTFsEfc8tmUusDOl9bbB8MM2YxmcmWbMaoW8jS+Lfv3tdL0SihVr/tBTi0HG7PO70/6jqVV7ORw63KRU7sHNldhtUrFzYmLcBczM5vEFze3F7//nVxeXtjKQYBHOxjb8F2q+kQdl5NR8luY5PeVdnPi71idV4Sew9vyKbXUdLeeYmS7DU9+Q/Z0wr5vbaonUqrYE8rVvGaZfMKpRTCjuC24I7gro3ObFcKbxUWbHp124U13Y7gtuCO4C5/WvZUZK1CZsse3EzWwoT5meZ8n/Nd3lvThQePPGxUdfsR+2QYGGtXfKRKNU3S+IC91LHWBJPsLGvxl5Kdo3P5zTqlxoFhDONE5SCWveHUVTEp2cscyz8wM2baQV6yz39efT+nAx1Ex5gBQZTaC9uIWU6yi6us0D1h1h/+V+uebk8P+h7pnTkjL2HdR0pLIpMjTeeBJ0vv9JN+p7EeqYQwCIeaHB/E+pR+8z2je3y1tJLfSzK2l3iDPuk13qFWrHuQ9EJCf5gai1rxfp6fsdRwt8hkj3VHThuH8OOU0Ifx+PgIVz9Is6pbaTru2Tyh6BYpukWKXlTaGVBP9wXJRheYkUr0gC/sbR/4yhiw8drHp9q6o6ItKlUg1gU3UDbVerrd1h1wTPXomWQMjZHyTYIQ/kCPEdnz/aIo6ugeyyDHW8F9wXeCs7q6W9wwBuRrTAV5YnEEzneTRD2kVK+pgandkZuiT8Y/dgX3BJ26uhNfAx+BQy9VGNJTnSKdI/JZhcBn1u8ZjOvqTk18YPW1Refc/23Bzbq6427XZpnf9xJQJ3nPB3+pVLkf0r1QkUlOa0/AWPIbW4JugevOBiMFVzX7icywKlNqOqk80lprDRbirn7Ay1yHuTJQ3ezpf2REK3nCtvtwi+Hd5EHd+AeTr9mjqgwAAA=='
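    # Undo the base64 encoding, then decompress the gzip stream back to text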
    compressed_string = base64.b64decode(encoded_string)
    with gzip.GzipFile(fileobj=BytesIO(compressed_string)) as gzip_file:
        original_string = gzip_file.read().decode('utf-8')  # Assuming the original string is in UTF-8 encoding
    
    print(original_string)
This outputs ASCII art of a TI-86 calculator. Given the prompts:

> What does the ascii art output of this script represent: ${script}

and

> Execute ${script} and guess what the output represents.

and

> Execute ${script}

o1-preview produces complete gibberish: it guesses at random patterns in the numbers, gives sample output of random ASCII art lines, and says things in its train of thought like "I copied and pasted the code into a Python interpreter. It generated ASCII art, resembling an elephant with features like ears, trunks, and texture patterns". Despite claiming it ran the code in an interpreter, it appears to be merely simulating that step even though it knows it's the ideal one, while plain ol' 4o just outputs:

> The output of the Python code is an ASCII art representation of a Texas Instruments TI-86 calculator. Here is the ASCII art: ${exact_ti-86_ascii_art}

.

Apart from that hole in the claim, simply probing the thought process with a prompt like:

> Explain how to find the number of occurrences of a letter in a word in general then use what you find to count the number of 'c's in 'occurrences'. Don't use programming or environment execution, particularly not python, use a natural process a person could understand in natural language.

takes only 6 seconds, has a chain of thought that includes nothing about code, follows the exact method it created step by step in text, and correctly outputs that there are 3 'c's in 'occurrences'.
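
For comparison, that natural tally process maps directly onto a trivial loop (my own sketch for checking the answer, not anything the model emitted):

    count = 0
    for letter in 'occurrences':  # walk the word letter by letter
        if letter == 'c':         # tally each 'c' seen
            count += 1
    print(count)  # prints 3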

.

Apart from the second hole in the claim: even if the model did use additional inference tokens to come up with a chain of thought that involved writing code, deploying it, and using the result, that would still be an example of the new model using inference compute to improve its quality rather than adding specific questions to the training data.


I think you are confused about what thread you are in.

Have you read the original article?

Here's a quote:

"I added this to Mimi's system prompt:

If you are asked to count the number of letters in a word, do it by writing a python program and run it with code_interpreter."

This article was released before o1; it's not the topic. The solution was just adding a system prompt and executing the generated Python script in order to solve the strawberry problem. It wasn't solved by increasing compute time, unless you count the inference over the extra system prompt tokens.
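
Mechanically that pattern is something like this (a minimal sketch of my own, assuming the host simply executes whatever Python the model emits; run_generated_code is a hypothetical helper, not the article's code):

    import io
    import contextlib
    
    SYSTEM_PROMPT = (
        "If you are asked to count the number of letters in a word, do it "
        "by writing a python program and run it with code_interpreter."
    )
    
    def run_generated_code(code: str) -> str:
        # Execute model-generated Python and capture its stdout.
        # A real code_interpreter tool would sandbox this.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue().strip()
    
    # Hypothetical model output for "How many r's are in 'strawberry'?"
    generated = "print('strawberry'.count('r'))"
    print(run_generated_code(generated))  # -> 3

All the heavy lifting is in the prompt; the host just pipes the generated code through an interpreter.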


Er, yes - I read the original article... which is how I know "This article was released before o1, it's not the topic" is false on both counts. The first paragraph of said article lays that out:

> Recently OpenAI announced their model codenamed "strawberry" as OpenAI o1. One of the main things that has been hyped up for months with this model is the ability to solve the "strawberry problem". This is one of those rare problems that's trivial for a human to do correctly, but almost impossible for large language models to solve in their current forms. Surely with this being the main focus of their model, OpenAI would add a "fast path" or something to the training data to make that exact thing work as expected, right?

and is preceded by a date, 09/13/2024, which is the same day this HN thread was created. o1 was announced and released the day prior to both.


You are right.

Please note that strawberry-mimi is the model I was referring to.


All good, hopefully my original question about Strawberry makes more sense now.



