No. We don't know. My favorite hypothesis: SGD is...well, stochastic. Meaning you're not computing the gradient w.r.t. the full training corpus, but a tiny mini-batch, so each gradient step isn't quite right. Over-training lets you bulldoze over local optima and converge toward the true distribution rather than drive around in a local over-fitting basin.
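(A minimal sketch of the "noisy gradient" point above, on a hypothetical toy least-squares problem, not anything from the paper: the mini-batch gradient one SGD step actually uses is only an approximation of the full-corpus gradient.)

    # Toy illustration: compare the mini-batch gradient to the full-batch gradient.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                 # hypothetical "training corpus"
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)
    w = np.zeros(10)                                # current model parameters

    def grad(Xb, yb, w):
        # Gradient of mean squared error on the batch (Xb, yb).
        return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

    full = grad(X, y, w)                            # gradient w.r.t. the whole corpus
    idx = rng.choice(len(y), size=32, replace=False)
    mini = grad(X[idx], y[idx], w)                  # what one SGD step actually sees

    # Cosine similarity shows how "not quite right" the stochastic direction is.
    cos = full @ mini / (np.linalg.norm(full) * np.linalg.norm(mini))
    print(f"cosine similarity, mini-batch vs full gradient: {cos:.3f}")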



You can get it with full gradient descent though... https://www.nature.com/articles/s41467-020-14663-9

Honestly, the fact that there doesn't seem to be a good explanation for this makes me think we just fundamentally don't understand learning.



