> it might be that the model trains all of the inputs to become very negative
It still can't do this because of L2 regularization / weight decay. If two vectors have norm 1, their inner product is at least -1, so each exponential term is at least e^(-1), and with 2000 vectors the sum is still at least 2000 * e^(-1) =~ 736.
Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.
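
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes every vector is held at exactly norm 1, which weight decay only encourages rather than enforces, and the names (`anchor`, `others`, `sum_exp_inner_products`) and the dimension of 16 are made up for illustration:

```python
import math
import random

def unit_vector(dim, rng):
    """Draw a random direction and scale it to norm 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sum_exp_inner_products(anchor, others):
    """Sum of exp(<anchor, other>) over all vectors in `others`."""
    return sum(
        math.exp(sum(a * b for a, b in zip(anchor, other)))
        for other in others
    )

rng = random.Random(0)
n, dim = 2000, 16  # dim is an arbitrary choice for this sketch

anchor = unit_vector(dim, rng)
others = [unit_vector(dim, rng) for _ in range(n)]

# Floor on the sum: every inner product of unit vectors is >= -1.
lower_bound = n * math.exp(-1)

# Typical case: random unit vectors.
random_total = sum_exp_inner_products(anchor, others)

# Worst case: every vector points exactly opposite to the anchor.
worst_others = [[-x for x in anchor]] * n
worst_total = sum_exp_inner_products(anchor, worst_others)

print(f"lower bound:  {lower_bound:.2f}")
print(f"random case:  {random_total:.2f}")
print(f"worst case:   {worst_total:.2f}")
```

The worst case only reaches the floor when all 2000 vectors are exactly antiparallel to the anchor, which is the "try really hard" scenario.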