
The quantization breakthrough here is amazing: you can run a ~200B-parameter model on a single desktop machine.
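For intuition about the memory side of that claim, here is a minimal sketch of rounding fp32 weights to int8 and back, using plain per-row absmax quantization in NumPy; this is an illustration of the general idea, not necessarily the exact scheme used in the paper:

  import numpy as np

  def absmax_quantize(W):
      # One fp32 scale per row; int8 values live in [-127, 127].
      scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
      q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
      return q, scale

  def dequantize(q, scale):
      return q.astype(np.float32) * scale

  W = np.random.randn(4096, 4096).astype(np.float32)
  q, scale = absmax_quantize(W)
  print(q.nbytes / W.nbytes)                    # 0.25: int8 needs a quarter of the fp32 memory
  print(np.abs(W - dequantize(q, scale)).max()) # per-row rounding error is at most scale / 2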

But there are a few really interesting and new insights:

With transformer models above 6.7B parameters, a "phase shift" (their language) occurs where features (a dimension that "offers some weak explanation for the label") are shared between layers, in that all the layers agree on which dimension to use for a given feature.

This is really important because these key features are where the "knowledge" of the neural network is concentrated. The attention layers are very sparse ("Almost all sequence dimensions have zero probability.").

But the fully connected layers are very dense. The author compares them to computer vision, where fully connected layers can be pruned of 95% of their weights without serious impact, while a transformer past this 6.7B-parameter point can only be pruned of about 5% of its weights.
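For a concrete picture of what "pruned of 95% of the weights" means, here is a hedged sketch of plain magnitude pruning; the 95%/5% figures are just the numbers quoted above, not measurements, and real pruning pipelines are more involved:

  import numpy as np

  def magnitude_prune(W, sparsity):
      # Zero out the `sparsity` fraction of weights with the smallest absolute value.
      threshold = np.quantile(np.abs(W), sparsity)
      return np.where(np.abs(W) < threshold, 0.0, W)

  W = np.random.randn(1024, 1024).astype(np.float32)
  W_cv  = magnitude_prune(W, 0.95)  # the tolerance claimed for CV fully connected layers
  W_llm = magnitude_prune(W, 0.05)  # the tolerance claimed for large transformers
  print((W_cv == 0).mean(), (W_llm == 0).mean())  # roughly 0.95 and 0.05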

And this is really interesting:

> Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.

The possibility of training networks with hundreds of billions of parameters in 8-bit (or less!) precision would be a real breakthrough.
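A rough sketch of the "treat the outlier features separately" idea from the quote above: route the few hidden dimensions with unusually large activations through a small full-precision matmul, and everything else through an int8 path. The threshold of 6.0 and the absmax scales here are illustrative assumptions, not necessarily the paper's exact recipe:

  import numpy as np

  def quantize_absmax(A, axis):
      scale = np.abs(A).max(axis=axis, keepdims=True) / 127.0 + 1e-12
      return np.clip(np.round(A / scale), -127, 127).astype(np.int8), scale

  def mixed_precision_matmul(X, W, threshold=6.0):
      # X @ W with outlier feature columns in float and the rest in int8.
      outlier = np.abs(X).max(axis=0) > threshold       # hidden dims carrying outlier features
      # Small dense matmul in full precision for the few outlier dimensions.
      out_fp = X[:, outlier] @ W[outlier, :]
      # int8 matmul (emulated here with an int32 accumulate) for everything else.
      Xq, sx = quantize_absmax(X[:, ~outlier], axis=1)  # per-row scales for activations
      Wq, sw = quantize_absmax(W[~outlier, :], axis=0)  # per-column scales for weights
      out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw
      return out_fp + out_int8

  X = np.random.randn(8, 4096).astype(np.float32)
  X[:, :4] *= 20.0                                      # fake a few outlier feature dimensions
  W = np.random.randn(4096, 1024).astype(np.float32)
  print(np.abs(mixed_precision_matmul(X, W) - X @ W).max())  # error stays bounded because
                                                             # the outliers bypass the int8 path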




From a neuroscience perspective it would seem obvious that neural networks can work with less than 8 bits. According to a study from 2015 [1], synapses in the hippocampus can store about 4.7 bits of information (26 discrete connection strengths). While the real brain graph is very different from a transformer, I think this should still be achievable for other architectures, as it is most likely just a question of extra stabilization during training.
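The 4.7-bit figure is just the base-2 logarithm of the number of distinguishable states:

  import math
  print(math.log2(26))   # ≈ 4.7 bits of information for 26 discrete connection strengths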

[1] https://elifesciences.org/articles/10778


That paper showed a minimum of 26 states, not a maximum. Later papers have increased this estimate significantly.

[1], for example, increased the number 10-fold. Papers like [2] have pushed the estimated complexity per synapse much higher still (so much so that they don't even put a number on it).
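Taking that 10-fold figure at face value, the gain in bit terms is still modest, since information grows only logarithmically with the number of states:

  import math
  print(math.log2(26 * 10))   # ≈ 8.0 bits, up from ≈ 4.7 bits for 26 states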

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5247597/

[2] https://www.nature.com/articles/s41598-020-64874-9


Tbh I wouldn't be surprised if the sizes are not quantised at all. But if you look at the histogram in [1], most synapses fall into the low-number-of-states range. This is probably related to the aforementioned sparsity of certain neural network layers. Pruning outliers from the brain is really difficult from an evolutionary perspective, but the approach linked in this post, where you simply treat them differently from other parts, seems like a reasonable way to go for artificial neural networks.



