Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).
I didn't test for outliers, but I don't think this will lead to a large improvement in attention overall, or that it fixes a lurking bug.
He refers to an "error" in attention all over the blog post. Specifically, he says:
The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man.
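To make the quoted complaint concrete, here is a minimal sketch. It assumes the tweak is the one the blog post describes as adding 1 to the softmax denominator, i.e. exp(x_i) / (1 + sum_j exp(x_j)). The function names and the example scores are made up for illustration and are not taken from the post.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_plus_one(scores):
    """Hypothetical name for the tweak: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra 1 in the denominator lets the weights sum to less than 1.
    """
    # Shift by max(max(scores), 0) so no exponent is positive;
    # the "+1" term then becomes exp(-m) after the shift.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]


# A head with nothing to say: every score is very negative.
scores = [-10.0, -10.0, -10.0, -10.0]

standard = softmax(scores)
tweaked = softmax_plus_one(scores)

print("standard weights sum:", sum(standard))  # forced to 1
print("tweaked weights sum: ", sum(tweaked))   # close to 0
```

With the standard version the head still hands out a full unit of weight even though every score is tiny, which is the "forced annotation" in the quote. With the extra 1, the same scores give weights that sum to nearly zero. Whether that matters for training loss is exactly what's being argued here.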
I'm saying it uses the current position to do this, and that if this were a significant error I would expect fixing it to improve the training loss. I sort of read the blog post as being more positive on the idea than just being about improving quantization.
I agree that he used the term "error" somewhat incorrectly. But he seems mainly to be making the point that softmax introduces a large outlier, which is only an issue now that the community is aggressively trying to quantize models.