The quantization breakthrough here is amazing: you can run ~200B parameter models on a single desktop machine.
But there are a few really interesting and new insights:
With transformer models above 6.7B parameters, a "phase shift" (their language) occurs where features (a dimension that "offers some weak explanation for the label") are shared between layers (in that all the layers agree on which dimension to use for that feature).
This is really important because these key features are where the "knowledge" of the neural network is concentrated. The attention layers are very sparse ("Almost all sequence dimensions have zero probability.")
But the fully connected layers are very dense. The post compares this to computer vision, where fully connected layers can be pruned of 95% of the weights without serious impact, while a transformer past this 6.7B parameter point can only be pruned of 5% of the weights.
And this is really interesting:
> Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.
The possibility of training networks with hundreds of billions of parameters in 8-bit (or less!) precision would be a real breakthrough.
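To make the quoted idea a bit more concrete, here is a minimal sketch of what "treating the outlier features separately" can look like for a single matrix multiply: the few feature dimensions with unusually large magnitudes stay in full precision, and everything else goes through simulated absmax int8 quantization. The function name and the threshold value are just illustrative assumptions, not the actual bitsandbytes implementation.

```python
import torch

def outlier_aware_matmul(x, w, threshold=6.0):
    """Toy sketch: keep large-magnitude feature dimensions of x in full
    precision, run the rest through absmax int8-style quantization."""
    # Feature dimensions (columns of x) where any value exceeds the threshold.
    outlier_cols = (x.abs() > threshold).any(dim=0)
    regular_cols = ~outlier_cols

    # High-precision path for the few outlier dimensions.
    out_hi = x[:, outlier_cols] @ w[outlier_cols, :]

    # Quantized path for the dense-but-small remainder (absmax scaling to 127).
    xr, wr = x[:, regular_cols], w[regular_cols, :]
    sx = xr.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127  # per-row scale
    sw = wr.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127  # per-column scale
    xq = torch.round(xr / sx)  # int8 codes, kept as floats here for portability
    wq = torch.round(wr / sw)
    out_lo = (xq @ wq) * (sx * sw)

    return out_hi + out_lo

x = torch.randn(4, 16); x[:, 3] += 20.0  # make one feature dimension an outlier
w = torch.randn(16, 8)
print((outlier_aware_matmul(x, w) - x @ w).abs().max())  # close to the exact product
```

The point being made in the post is exactly this split: only a handful of dimensions need the expensive high-precision path, and everything else tolerates 8 bits.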
From a neuroscience perspective it would seem obvious that neural networks can work with less than 8 bits. According to a study from 2015 [1], the synapses in the hippocampus can store about 4.7 bits of information (26 discrete connection strengths). While the real brain graph is very different from a transformer, I think this should still be achievable for other architectures, as it is most likely just a question of extra stabilization during training.
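For reference, the 4.7 bits figure is simply the information content of ~26 distinguishable states:

```python
import math
print(math.log2(26))  # ≈ 4.70 bits per synapse
```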
That paper showed a minimum of 26 states, not a maximum. Later papers have increased this significantly.
[1], for example, increased the number 10-fold. Papers like [2] have pushed the estimated complexity per synapse much higher still (so much so that they don't even put a number on it).
Tbh I wouldn't be surprised if the sizes are not quantised at all. But if you look at the histogram in [1], most synapses fall into the low end of the range of states. This is probably related to the aforementioned sparsity of certain neural network layers. Pruning outliers from the brain is really difficult from an evolutionary perspective, but the approach linked in this post, where you simply treat them differently from other parts, seems like a reasonable way to go for artificial neural networks.
I’m in your shoes, but if you push through the first section they go back and define things so you can understand what they’re on about. Whether you’ll find it interesting is a different topic though!
Could anyone explain this section a little more? I didn’t quite follow this part. What does it mean to squish by 4 or by 2?
> The only way to improve quantization is through more normalization constants. A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3. We can increase precision, by squishing each vector only as much as is needed. For example, if you have the two vectors:
>
> [3, 1, 2, 3]
> [0, 2, 2, 0]
>
> Then you can squish the first by 4 and the second by 2. This will give you twice the precision to quantize the second vector because the inputs are now spread over a broader range of the I3 data type.
It means to bring the range of the vector to between 0 and 1. In this case the operation is a simple division by a constant, and the argument is that you can decide on this constant per-vector so that you maximize the utilization of this space between 0 and 1.
If you had squished both by 4, the second vector would be [0, 0.5, 0.5, 0], leaving essentially half the quantization space (0.5-1.0) unused and giving you less precision.
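Here is a small sketch of that argument in code; the 16-level grid is just a stand-in for some low-bit data type and isn't from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_roundtrip(v, constant, levels=16):
    """Divide by `constant` to squish v into [0, 1], round onto a uniform grid
    of `levels` points, then scale back up. The grid step (and thus the
    rounding error) grows with the constant used for squishing."""
    codes = np.round(v / constant * (levels - 1))
    return codes / (levels - 1) * constant

# A vector whose values only reach 2, living next to vectors that reach 4.
v = rng.uniform(0, 2, size=1000)

err_own_constant    = np.abs(quantize_roundtrip(v, 2.0) - v).mean()
err_shared_constant = np.abs(quantize_roundtrip(v, 4.0) - v).mean()
print(err_own_constant, err_shared_constant)  # shared constant ≈ double the error
```

With its own constant the vector spreads over all 16 grid points; with the shared (larger) constant it only ever lands on the lower half of them, which is the "half the quantization space unused" above.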
Ah, I think I got it, thanks. The broader range is the range (0-1) as opposed to (0-0.5)?
This is what the author means by "This will give you twice the precision to quantize the second vector because the inputs are now spread over a broader range of the I3 data type."?
The wording is a bit weird, but "squishing" here just means taking a constant number (the normalization constant) and multiplying all numbers in a vector by it. E.g. multiplying [0,2,2,0] by 2 gives you [0,4,4,0], which is better distributed over the I3 distribution [0,2,4] (which goes all the way to 4). Values between 0 and 2 that would otherwise get lost to rounding are more easily preserved that way. E.g. [0,1,2,2,0] would become [0,2,2,2,0] without squishing and [0,2,4,4,0] with it.
Ah, thanks for the explanation; the second part is still a bit confusing for me. When the author writes "A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3.", it makes me think of normalisation into the range (0-2) that I3 is in. Do I have that right?