
>Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.

Models aren't trained to do next-word prediction though; they're trained to predict a missing word within the text.

That's true for mask-based training (used for embeddings, BERT, and the like), but not for modern autoregressive LLMs, which are pretrained with next-word prediction.
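For anyone who wants the difference concretely, here's a minimal sketch of the two objectives, assuming PyTorch; the random tensors just stand in for real model outputs and a real training sequence:

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 50_000, 128
    logits = torch.randn(seq_len, vocab_size)          # stand-in for model outputs
    tokens = torch.randint(0, vocab_size, (seq_len,))  # the training sequence

    # Causal (autoregressive) LM, GPT-style pretraining:
    # every position predicts the *next* token, so targets are inputs shifted by one.
    causal_loss = F.cross_entropy(logits[:-1], tokens[1:])

    # Masked LM, BERT-style pretraining:
    # a random subset of positions is hidden and only those are predicted.
    mask = torch.rand(seq_len) < 0.15  # roughly 15% of positions, BERT-style
    mask[0] = True                     # ensure at least one masked position in this toy example
    masked_loss = F.cross_entropy(logits[mask], tokens[mask])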

It's not strictly that, though; it's next-word prediction with regularization.

And the reason LLMs are interesting is that they /fail/ to learn it, but in a good way: if it were just a "next word predictor" it wouldn't answer questions, it would continue them.

Also, it's a next-token predictor, not a word predictor, which matters because the "just a predictor" framing then can't explain how it manages to form words at all!
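You can see the word/token distinction for yourself, assuming the Hugging Face transformers library is installed (the "gpt2" tokenizer is downloaded on first use):

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    for word in ["cat", "antidisestablishmentarianism"]:
        pieces = tok.tokenize(word)
        # Common words tend to map to a single token; rarer ones are split into
        # several subword pieces the model has to assemble back into a word.
        print(word, "->", pieces)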


Yes, I know; I was clarifying their immediate misunderstanding using the same terminology as them.

There's obviously a lot more going on behind the scenes, especially with today's mid- and post-training work!

