
LLM scaling laws tell us that more parameters make models better, in general.

The key intuition behind why MoE works is that as long as those parameters are available during training, they count toward this scaling effect (to some extent).

Empirically, we can see that even when the architecture is such that you only have to consult a subset of those parameters at inference time, the optimizer finds a way.

The inductive bias in this style of MoE model is to specialize (as there is a gating effect between 'experts'), which does not seem to present much of an impediment to gradient flow.
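Concretely, something like this (a minimal PyTorch sketch with illustrative sizes, not any particular model's routing scheme): the gate picks the top-k experts per token, only those experts run in the forward pass, but across a batch tokens get routed to different experts, so every expert's parameters still receive gradients during training.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            # Gating network: scores every token against every expert.
            self.gate = nn.Linear(d_model, n_experts)
            # Each "expert" is an ordinary feed-forward block.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                        # x: (n_tokens, d_model)
            scores = self.gate(x)                    # (n_tokens, n_experts)
            top_scores, top_idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(top_scores, dim=-1)  # renormalize over chosen experts
            out = torch.zeros_like(x)
            # Only the selected experts run for each token; the rest are skipped,
            # so a forward pass touches a fraction of the total parameter count.
            for e, expert in enumerate(self.experts):
                for k in range(self.top_k):
                    mask = top_idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

    x = torch.randn(16, 512)       # 16 tokens
    print(MoELayer()(x).shape)     # torch.Size([16, 512])

Production MoE layers usually add load-balancing losses and expert capacity limits so routing doesn't collapse onto a few experts; that's omitted here for brevity.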




> LLM scaling laws tell us that more parameters make models better, in general.

That depends heavily on the amount and complexity of the training data you have. This is actually one of the areas where OpenAI has an advantage: they scraped a lot of data from the internet before it became too hard for new players to get.
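To put rough numbers on the data dependence (a back-of-the-envelope sketch using the parametric loss fit from the Chinchilla paper, Hoffmann et al. 2022; the constants are their approximate published values, and the fixed token budget is made up for illustration):

    # Predicted pretraining loss as a function of parameter count N and
    # training tokens D, using approximate fitted constants from
    # Hoffmann et al. (2022). Purely illustrative.
    def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
        return E + A / N**alpha + B / D**beta

    D = 300e9  # hypothetical fixed token budget
    for N in (1e9, 10e9, 100e9, 1e12):
        print(f"{N:.0e} params: predicted loss ~ {chinchilla_loss(N, D):.2f}")

    # With D held fixed, the B / D**beta term eventually dominates, so piling
    # on parameters stops helping: the data budget becomes the bottleneck.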



