That's a really interesting point about committing to words one by one. It highlights how fundamentally different current LLM inference is from human thought, as you pointed out with the scene-description analogy. You're right that it feels odd, like building something brick by brick without seeing the final blueprint. Most text-based LLMs do currently operate this way, but there are emerging approaches challenging it. For instance, Inception Labs recently released "Mercury," a text-diffusion coding model that generates a response by refining the whole sequence over several denoising steps rather than appending one token at a time. These alternative methods sidestep some limitations of strictly left-to-right generation and could potentially lead to faster inference and better contextual coherence. It'll be fascinating to see how techniques like this evolve!
But as I noted yesterday in a follow-up comment to my own above, the diffusion-based approaches to text generation still commit to tokens one (or a few) at a time, just not in strict left-to-right order. So from this angle it looks much the same: the model commits to a token in some position, possibly with gaps still open before it, and then predicts more tokens conditioned on everything committed so far.
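To make that concrete, here's a toy sketch of the confidence-based parallel decoding loop I'm describing, which is the flavor used by masked-diffusion text models. It's not any real model's API; the `predict` function is a hypothetical stand-in for the network, and the names and numbers are illustrative only.

```python
import random

# All names here are hypothetical illustrations, not a real model's API.
MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def predict(tokens):
    """Stand-in for the model: returns a (token, confidence) guess for
    every still-masked position, conditioned on the whole sequence."""
    return {
        i: (random.choice(VOCAB), random.random())
        for i, t in enumerate(tokens) if t == MASK
    }

def decode(length=8, per_step=2):
    # Start from an all-masked canvas rather than an empty prefix.
    tokens = [MASK] * length
    while MASK in tokens:
        guesses = predict(tokens)
        # Commit only the highest-confidence positions this step; the
        # rest stay masked and are re-predicted next iteration. Committed
        # tokens are final -- the "commit to a token in some position,
        # possibly with gaps before it" behavior described above.
        best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:per_step]
        for pos, (tok, _) in best:
            tokens[pos] = tok
        print(tokens)  # watch positions fill in out of order
    return " ".join(tokens)

if __name__ == "__main__":
    print(decode())
```

Run it and you'll see positions fill in scattered across the sequence rather than left to right, a few per step, which is exactly why the end result still amounts to committing to tokens incrementally rather than producing the whole response "holistically" in one shot.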