The quantization breakthrough here is amazing: you can run ~200B parameter models on a single desktop machine.
But there are a few really interesting and new insights:
With transformer models above 6.7B parameters, a "phase shift" (their language) occurs where features (a dimension that "offers some weak explanation for the label") are shared between layers (in that all the layers agree on which dimension to use for that feature).
This is really important because these key features are where the "knowledge" of the neural network is concentrated. The attention layers are very sparse ("Almost all sequence dimensions have zero probability.")
But the fully connected layers are very dense. The post compares this to computer vision, where fully connected layers can be pruned of 95% of the weights without serious impact, while a transformer past this 6.7B parameter point can only be pruned of 5% of the weights.
And this is really interesting:
> Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.
The possibility of training networks with hundreds of billions of parameters in 8-bit (or less!) precision would be a real breakthrough.
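To make the quoted idea a bit more concrete, here is a minimal sketch of what "treating the outlier features separately" can look like for a single matrix multiply: the few feature dimensions with unusually large magnitudes stay in full precision, and everything else goes through simulated absmax int8 quantization. The function name and the threshold value are just illustrative assumptions, not the actual bitsandbytes implementation.

```python
import torch

def outlier_aware_matmul(x, w, threshold=6.0):
    """Toy sketch: keep large-magnitude feature dimensions of x in full
    precision, run the rest through absmax int8-style quantization."""
    # Feature dimensions (columns of x) where any value exceeds the threshold.
    outlier_cols = (x.abs() > threshold).any(dim=0)
    regular_cols = ~outlier_cols

    # High-precision path for the few outlier dimensions.
    out_hi = x[:, outlier_cols] @ w[outlier_cols, :]

    # Quantized path for the dense-but-small remainder (absmax scaling to 127).
    xr, wr = x[:, regular_cols], w[regular_cols, :]
    sx = xr.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127  # per-row scale
    sw = wr.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127  # per-column scale
    xq = torch.round(xr / sx)  # int8 codes, kept as floats here for portability
    wq = torch.round(wr / sw)
    out_lo = (xq @ wq) * (sx * sw)

    return out_hi + out_lo

x = torch.randn(4, 16); x[:, 3] += 20.0  # make one feature dimension an outlier
w = torch.randn(16, 8)
print((outlier_aware_matmul(x, w) - x @ w).abs().max())  # close to the exact product
```

The point being made in the post is exactly this split: only a handful of dimensions need the expensive high-precision path, and everything else tolerates 8 bits.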
From a neuroscience perspective it would seem obvious that neural networks can work with less than 8 bits. According to a study from 2015 [1], the synapses in the hippocampus can store about 4.7 bits of information (26 discrete connection strengths). While the real brain graph is very different from a transformer, I think this should still be achievable for other architectures, as it is most likely just a question of extra stabilization during training.
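For reference, the 4.7 bits figure is simply the information content of ~26 distinguishable states:

```python
import math
print(math.log2(26))  # ≈ 4.70 bits per synapse
```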
That paper showed a minimum of 26 states, not a maximum. Later papers have increased this significantly.
[1], for example, increased the number 10-fold. Papers like [2] have pushed the estimated complexity per synapse much higher still (so much so that they don't even put a number on it).
Tbh I wouldn't be surprised if the sizes are not quantised at all. But if you look at the histogram in [1], most synapses fall into the low end of the range of states. This is probably related to the aforementioned sparsity of certain neural network layers. Pruning outliers from the brain is really difficult from an evolutionary perspective, but the approach linked in this post, where you simply treat them differently from other parts, seems like a reasonable way to go for artificial neural networks.
I’m in your shoes, but if you push through the first section they go back and define things so you can understand what they’re on about. Whether you’ll find it interesting is a different topic though!
Could anyone explain this section a little more? I didn’t quite follow this part. What does it mean to squish by 4 or by 2?
> The only way to improve quantization is through more normalization constants. A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3. We can increase precision, by squishing each vector only as much as is needed. For example, if you have the two vectors:
>
> [3, 1, 2, 3]
> [0, 2, 2, 0]
>
> Then you can squish the first by 4 and the second by 2. This will give you twice the precision to quantize the second vector because the inputs are now spread over a broader range of the I3 data type.
It means to bring the range of the vector to between 0 and 1. In this case the operation is a simple division by a constant, and the argument is that you can decide on this constant per-vector so that you maximize the utilization of this space between 0 and 1.
If you had squished both by 4, the second vector would be [0, 0.5, 0.5, 0], leaving essentially half the quantization space (0.5-1.0) unused and giving you less precision.
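Here is a small sketch of that argument in code; the 16-level grid is just a stand-in for some low-bit data type and isn't from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_roundtrip(v, constant, levels=16):
    """Divide by `constant` to squish v into [0, 1], round onto a uniform grid
    of `levels` points, then scale back up. The grid step (and thus the
    rounding error) grows with the constant used for squishing."""
    codes = np.round(v / constant * (levels - 1))
    return codes / (levels - 1) * constant

# A vector whose values only reach 2, living next to vectors that reach 4.
v = rng.uniform(0, 2, size=1000)

err_own_constant    = np.abs(quantize_roundtrip(v, 2.0) - v).mean()
err_shared_constant = np.abs(quantize_roundtrip(v, 4.0) - v).mean()
print(err_own_constant, err_shared_constant)  # shared constant ≈ double the error
```

With its own constant the vector spreads over all 16 grid points; with the shared (larger) constant it only ever lands on the lower half of them, which is the "half the quantization space unused" above.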
Ah, I think I got it, thanks. The broader range is the range (0-1) as opposed to (0-0.5)?
This is what the author means by "This will give you twice the precision to quantize the second vector because the inputs are now spread over a broader range of the I3 data type."?
The wording is a bit weird, but "squishing" here just means taking a constant number (the normalization constant) and multiplying all numbers in a vector by it. E.g. multiplying [0,2,2,0] by 2 gives you [0,4,4,0], which is better distributed over the I3 distribution [0,2,4] (which goes all the way to 4). Values between 0 and 2 that would otherwise get lost to rounding are more easily preserved that way. E.g. [0,1,2,2,0] would become [0,2,2,2,0] without squishing and [0,2,4,4,0] with it.
Ah, thanks for the explanation; the second part is still a bit confusing for me. When the author writes "A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3.", it makes me think of normalisation into the range (0-2) that I3 is in. Do I have that right?