
>Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.

Models aren't trained to do next-word prediction though; they're trained to predict a missing word within the text.

That's true for mask-based training (used for embeddings, BERT, and the like), but not for modern autoregressive LLMs, which are pretrained with next-word prediction.
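For anyone who wants the difference concretely, here's a minimal sketch of the two objectives, assuming PyTorch; the random tensors just stand in for real model outputs and a real training sequence:

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 50_000, 128
    logits = torch.randn(seq_len, vocab_size)          # stand-in for model outputs
    tokens = torch.randint(0, vocab_size, (seq_len,))  # the training sequence

    # Causal (autoregressive) LM, GPT-style pretraining:
    # every position predicts the *next* token, so targets are inputs shifted by one.
    causal_loss = F.cross_entropy(logits[:-1], tokens[1:])

    # Masked LM, BERT-style pretraining:
    # a random subset of positions is hidden and only those are predicted.
    mask = torch.rand(seq_len) < 0.15  # roughly 15% of positions, BERT-style
    mask[0] = True                     # ensure at least one masked position in this toy example
    masked_loss = F.cross_entropy(logits[mask], tokens[mask])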

It's not strictly that, though; it's next-word prediction with regularization.

And the reason LLMs are interesting is that they /fail/ to learn it, but in a good way: if it were just a "next word predictor" it wouldn't answer questions, it would continue them.

Also, it's a next-token predictor, not a word predictor, which matters because the "just a predictor" framing then can't explain how it manages to form words at all!
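You can see the word/token distinction for yourself, assuming the Hugging Face transformers library is installed (the "gpt2" tokenizer is downloaded on first use):

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    for word in ["cat", "antidisestablishmentarianism"]:
        pieces = tok.tokenize(word)
        # Common words tend to map to a single token; rarer ones are split into
        # several subword pieces the model has to assemble back into a word.
        print(word, "->", pieces)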


Yes, I know; I was clarifying their immediate misunderstanding using the same terminology as them.

There's obviously a lot more going on behind the scenes, especially with today's mid- and post-training work!

