He's advertising it as fixing the spiking outliers. Did your variant have those ...

chessgecko · on July 24, 2023

I guess, yeah. I was mostly responding to this:

> Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).

I didn't test for outliers, but I don't think this will lead to a large improvement in attention overall, or that it fixes a lurking bug.

zackangelo · on July 24, 2023

He’s not trying or claiming to improve attention. He’s trying to reduce outliers to improve the ability to quantize the parameters.

chessgecko · on July 24, 2023

He refers to an "error" in attention all over the blog post. Specifically, he says:

> The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man.

I'm saying the model already uses the current position to do this, and that if it were a significant error I would expect fixing it to improve the training loss. I interpreted the blog post as being more positive on the idea than just improving quantization.
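
For anyone following along, here is a rough sketch of the two functions being argued about. It assumes the "very small tweak" is the variant with an extra 1 in the denominator, i.e. exp(x_i) / (1 + sum_j exp(x_j)); the name `softmax_one` and the example scores are made up for illustration and are not from the blog post.

```python
import math

def softmax(scores):
    # Standard softmax: the weights always sum to 1, so a head
    # cannot "say nothing" even when every score is very negative.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_one(scores):
    # Assumed tweak: exp(x_i) / (1 + sum_j exp(x_j)).
    # Shifting by m = max(max(scores), 0) keeps the exponentials stable;
    # the constant 1 in the denominator becomes exp(-m) after the shift.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

very_negative = [-20.0, -20.0, -20.0]  # hypothetical "nothing to add" scores
print(sum(softmax(very_negative)))      # weights still sum to 1
print(sum(softmax_one(very_negative)))  # weights sum to nearly 0
```

With ordinary softmax the head still hands out a full unit of attention; with the extra 1 the total weight can shrink toward zero. Whether that shows up in training loss or only in outlier statistics is exactly what is being debated here.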

lyjackal · on July 25, 2023

I agree that he used the term "error" somewhat incorrectly. But he seems mainly to be making the point that softmax introduces a large outlier, which is only becoming an issue now that the community is aggressively trying to quantize models.
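
To make the quantization point concrete, here is a toy sketch of why one large outlier hurts. It assumes simple symmetric absmax quantization to 127 levels, which is only one of several schemes in use; the values, including the outlier of 60.0, are invented for illustration.

```python
def quantize_roundtrip(values, levels=127):
    # Symmetric absmax quantization: the largest magnitude sets the scale,
    # so one big value stretches the grid for everything else.
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values):
    # Average error introduced by quantizing and then dequantizing.
    restored = quantize_roundtrip(values)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

typical = [0.1 * i - 0.8 for i in range(17)]  # made-up values in [-0.8, 0.8]
with_outlier = typical + [60.0]               # same values plus one outlier

print(mean_abs_error(typical))
print(mean_abs_error(with_outlier))
```

The second error is far larger because nearly all of the 127 levels are spent covering the range up to the outlier, leaving only a few levels for the typical values.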

nextaccountic · on July 25, 2023

> I didn't test for outliers

Then you don't know if the approach he is advocating actually improves what he is aiming for