There are far more people running llama.cpp, various image generators, etc. than there are people developing that code. Even when the "users" are corporate entities, they're not necessarily doing any development in excess of integrating the existing code with their other systems.
We're also likely to see a stronger swing away from "do inference in the cloud" because of the aligned incentives of "companies don't want to pay for all that hardware and electricity" and "users have privacy concerns" such that companies doing inference on the local device will have both lower costs and a feature they can advertise over the competition.
What this is waiting for is hardware in the hands of users that can actually do this at a mass-market price, and there is no shortage of companies wanting a piece of that. In particular, Apple is going to push it hard, and despite their prices they do a lot of volume. Then you'll start seeing more PCs with high-VRAM GPUs, or iGPUs with dedicated GDDR/HBM on the package, as competitors want feature parity for the thing everybody is talking about. The cost of that isn't actually high, e.g. 40GB of GDDR6 is less than $100.
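For scale, here's a back-of-envelope sketch of why ~40GB of memory is the interesting threshold. The parameter counts, quantization level, and overhead fraction are illustrative assumptions, not benchmarks:

```python
# Rough VRAM estimate for running an LLM locally.
# All figures are illustrative assumptions, not measurements.

def model_vram_gb(params_billion: float, bits_per_weight: float,
                  overhead_frac: float = 0.15) -> float:
    """Approximate VRAM (GB) for the weights plus an assumed ~15%
    for KV cache and activations."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_frac) / 1e9

# A 70B-parameter model at 4-bit quantization:
print(f"{model_vram_gb(70, 4):.1f} GB")  # roughly 40 GB
```

The point being that 4-bit quantization puts a 70B-class model right around that 40GB mark, which is why cheap GDDR in that quantity changes what's feasible on a local device.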