Let's suppose that you are correct. In which direction are the weights updated?

Implementations of this compute the gradients locally.




It makes sense that all gradients are local. Does it make sense to say that gradient propagation through the layers is memoryless?


In my opinion, yes, if and only if the update does not use a stateful optimiser and the computation is simple enough that the updated parameter value can be computed immediately.
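For contrast, here is a minimal sketch (hypothetical names, plain NumPy) of why a stateful optimiser breaks this: momentum keeps a per-parameter velocity buffer that persists across steps, so the new parameter value cannot be computed from the current local gradient alone.

    import numpy as np

    # Hypothetical sketch: SGD with momentum. The velocity buffer v is extra
    # state carried across steps, so the update depends on gradient history,
    # not just the gradient computed locally at this step.
    def momentum_step(w, grad, v, eta=0.01, mu=0.9):
        v = mu * v + grad      # accumulated state from previous steps
        w = w - eta * v        # update uses that history
        return w, v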

In linear layers it is possible. Once you have computed the gradient with respect to the i-th component of the output vector, which is a scalar, you scale the input by that value and add it to the corresponding row of the parameters.

This is a simple FMA op: a=fma(eta*z, x, a), with z the gradient of the vector, x the input, a the parameters, and eta the learning rate. This computes a = a + eta*z*x in place.



