No. We don't know. My favorite hypothesis: SGD is... well, stochastic. Meaning you're not optimizing w.r.t. the full training corpus, but a tiny subset of it, so your gradient isn't quite right. Over-training lets you bulldoze over local optima and converge toward the true distribution rather than drive around a local over-fitting basin.
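For a concrete picture of the "gradient isn't quite right" part, here's a minimal numpy sketch (toy linear regression with made-up data, names like `grad` are just for illustration) comparing the gradient over the whole corpus with what a single 32-example mini-batch actually sees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: y = 2x + noise, 1000 training examples
X = rng.normal(size=(1000, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

def grad(w, idx):
    """Gradient of mean squared error, computed only over the rows in idx."""
    residual = X[idx, 0] * w - y[idx]
    return np.mean(2.0 * residual * X[idx, 0])

w = 0.0
full = grad(w, np.arange(len(X)))             # gradient w.r.t. the full corpus
mini = grad(w, rng.choice(len(X), size=32))   # what one SGD step actually sees

print(f"full-batch gradient: {full:+.4f}")
print(f"mini-batch gradient: {mini:+.4f}  (noisy estimate)")
```

Every step follows the noisy estimate, not the true one, which is exactly why the trajectory can jitter its way out of shallow basins instead of settling into the first one it finds.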