

thomasahle on July 25, 2023 | on: Attention Is Off By One

> it might be that the model trains all of the inputs to become very negative

It still can't do this because of L2 regularization / weight decay. If two vectors have norm 1, their inner product is at least -1, so with 2000 vectors the sum of exponentials in the softmax is still at least 2000 * e^(-1) =~ 736.

Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.
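To put a number on that bound, here is a rough sketch of the arithmetic. It assumes the attention scores are plain inner products of unit-norm vectors (so each score is at least -1) and that the denominator is the "off by one" form, 1 + sum of exp(score). The function name is made up for illustration.

```python
import math

def min_exp_sum(n_keys: int, min_score: float = -1.0) -> float:
    """Lower bound on sum_j exp(score_j) when every score is >= min_score."""
    return n_keys * math.exp(min_score)

bound = min_exp_sum(2000)
print(round(bound, 2))  # 735.76

# Assumption: the denominator is 1 + sum(exp(scores)).
# Then the total attention weight handed out is at least bound / (1 + bound).
min_total_weight = bound / (1.0 + bound)
print(min_total_weight > 0.99)  # True
```

So even in the worst case the extra 1 in the denominator is swamped, and nearly all of the weight still gets distributed over the 2000 positions.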

redox99 on July 25, 2023

I guess you could add a sort of gating operation with a learnable parameter that sends the value to -inf if it doesn't reach the threshold.

Of course it might have some other serious repercussions.
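A toy sketch of what I mean, in plain Python. Everything here is hypothetical: `gate` and `softmax1` are names I made up, `threshold` is an ordinary float standing in for the learnable parameter, and `softmax1` assumes the denominator is 1 + sum of exp(score).

```python
import math

def gate(scores, threshold):
    """Send every score below the threshold to -inf; keep the rest unchanged."""
    return [s if s >= threshold else -math.inf for s in scores]

def softmax1(scores):
    """Softmax with an extra 1 in the denominator (assumed form).

    No max-subtraction here, so this is only safe for small scores.
    """
    exps = [math.exp(s) for s in scores]  # math.exp(-inf) is 0.0
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps]

scores = [-1.0] * 2000  # worst case for unit-norm vectors

ungated_total = sum(softmax1(scores))
gated_total = sum(softmax1(gate(scores, threshold=0.0)))

print(ungated_total > 0.99)  # True: almost all weight is still handed out
print(gated_total)           # 0.0: every position was gated off
```

With the gate, scores of -1 no longer contribute anything, so the head really can attend to nothing.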
