Let's suppose that you are correct. In which direction are the weights updated?

Implementations of this compute the gradients locally.




It makes sense that all gradients are local. Does it make sense to say that gradient propagation through the layers is memoryless?


In my opinion, yes, if and only if the update does not use a stateful optimiser and the computation is simple enough that the updated parameter value can be computed immediately.
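For contrast, here is a minimal sketch (hypothetical names, plain NumPy) of why a stateful optimiser breaks this: momentum keeps a per-parameter velocity buffer that persists across steps, so the new parameter value cannot be computed from the current local gradient alone.

    import numpy as np

    # Hypothetical sketch: SGD with momentum. The velocity buffer v is extra
    # state carried across steps, so the update depends on gradient history,
    # not just the gradient computed locally at this step.
    def momentum_step(w, grad, v, eta=0.01, mu=0.9):
        v = mu * v + grad      # accumulated state from previous steps
        w = w - eta * v        # update uses that history
        return w, v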

In linear layers it is possible. Once you have computed the gradient with respect to the i-th component of the output vector, which is a scalar, you scale the input by that value and add it to the corresponding row of the parameters.

This is a simple FMA op: a=fma(eta*z, x, a), with z the gradient of the vector, x the input, a the parameters, and eta the learning rate. This computes a = a + eta*z*x in place.



