
I don't know if it's a generalized result, but the Circuits team at Anthropic has a very compelling thesis: the first phase of descent corresponds to the model memorizing data points, the second phase corresponds to it shifting geometrically toward learning "features".

Here a "feature" might be seen as an abstract, very, very high dimensional vector space. The team is pretty deep in investigating the idea of superposition, where individual neurons encode for multiple concepts. They experiment with a toy model and toy data set where the latent features are represented explicitly and then compressed into a small set of data dimensions. This forces superposition. Then they show how that superposition looks under varying sizes of training data.

It's obviously a toy model, but it's a compelling idea. At least for any model which might suffer from superposition.

https://transformer-circuits.pub/2023/toy-double-descent/ind...




> The team is pretty deep in investigating the idea of superposition, where individual neurons encode for multiple concepts.

Wonder if it's a matter of perspective - that is, of transform. Consider an image. Most real-world images have pixels with high locality - distant pixels are less correlated than immediate neighbours.

Now take an FFT of that. You get an equivalent 2D image containing the same information, but suddenly each pixel contains information about every pixel of the original image! You can do some interesting things there, like erasing the centre of the frequency image (the higher frequencies, in an unshifted FFT), which will give you a blurred version of the original image when you run the inverse FFT to get proper pixels again.
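
A quick numpy sketch of that trick (the array here is just a random stand-in for a real grayscale image): zero out the centre of the unshifted 2D FFT, inverse-transform, and you get a low-passed, i.e. blurred, version of the original.

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.random((128, 128))     # stand-in for a real grayscale image

    F = np.fft.fft2(img)             # unshifted: DC term at [0, 0], highest frequencies near the centre

    # "Erase the centre" of the frequency image, i.e. drop the high frequencies.
    h, w = F.shape
    F[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = 0

    blurred = np.fft.ifft2(F).real   # inverse FFT back to pixel space -> blurred image

Each entry of F is a weighted sum over every pixel of img, which is exactly the "each pixel contains information about every pixel" property above.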


I think that’s basically correct: the FFT representation is a better feature representation.



