Although the naming has a rationale (as explained in the footnote), using the name Temporal Difference Model for this method is a recipe for a lot of confusion.
You are not alone. I personally think (and I suspect many others do too) that AI tooling will reach the point where you don't really need to understand the math behind a method, only what it does. For example, you only need to know what backpropagation does and how to use it, not the exact formulas behind it. You can already see this happening with Keras. You would still need the math and the nitty-gritty details if you want to build or research state-of-the-art ML/DL.
I have been steeped in math for as long as I can remember, so I frequently have a hard time telling whether an explanation is actually helpful to someone who doesn't already understand the underlying concept, and I'd like to use this opportunity to improve.
Could you specify where exactly you get lost, and why? (Is it something that's just not explained, or is there an explanation, but one that doesn't make it easier to understand?)
Things that weren't obvious to me as a non-mathematician:
f:S×A↦S
(I assumed it meant: the model is represented by function f, whose inputs can be any combination of S and A [domain], and will produce an output value in S [codomain])
Why do we set the Q to 0 below?
The constraint that Q(s_t, a_t, s_{t+K}, K) = 0 enforces the feasibility of the trajectory
> the model is represented by function f, whose inputs can be any combination of S and A [domain], and will produce an output value in S [codomain]
Exactly. f:S×A↦S is a function signature, just like in a programming language. Basically, the model tells you which state you end up in after taking a certain action in the given state.
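To make the signature concrete, here's a tiny Python sketch (my own illustration, not from the post): anything callable that takes a state and an action and returns a predicted next state fits f:S×A↦S. The linear dynamics and all the names here are made up just to show the shape of the interface; a real model would be learned from data.

```python
import numpy as np


def make_toy_model(state_dim: int, action_dim: int):
    """Return a callable f(s, a) -> s_next, i.e. a function with signature S x A -> S."""
    # Fixed random linear dynamics, purely illustrative.
    A = 0.1 * np.random.randn(state_dim, state_dim)
    B = 0.1 * np.random.randn(state_dim, action_dim)

    def f(s: np.ndarray, a: np.ndarray) -> np.ndarray:
        # "Which state do I end up in after taking action a in state s?"
        return s + A @ s + B @ a

    return f


f = make_toy_model(state_dim=4, action_dim=2)
next_state = f(np.zeros(4), np.ones(2))  # predicted state after one action
```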
> Why do we set the Q to 0 below?
Q is introduced as
A temporal difference model (TDM)†, which we will write as Q(s,a,s_g,τ), is a function that, given a state s∈S, action a∈A, and goal state s_g∈S, predicts how close an agent can get to the goal within τ time steps. Intuitively, a TDM answers the question, “If I try to bike to San Francisco in 30 minutes, how close will I get?”
That means that setting Q(s_t,a_t,s_{t+K},K)=0 is the same as enforcing that the state s_{t+K} can actually be reached (distance 0) from s_t in K time steps. Without the constraint, it would be possible to plan a trajectory that can't be executed because the intermediate goals are too far away.
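To show how that constraint gets used, here's a rough planning sketch (my own, not from the paper), assuming a trained TDM is available as a callable `q`; for brevity I drop the action argument and treat `q(s, s_g, K)` as the predicted distance remaining to `s_g` after K steps. The idea: choose one intermediate goal state per K-step segment that maximizes reward, subject to each segment being feasible, i.e. its TDM value being zero.

```python
import numpy as np
from scipy.optimize import minimize


def plan_waypoints(q, reward, s_0, s_goal, K, n_segments):
    """Pick one intermediate goal state per K-step segment, maximizing reward
    subject to the feasibility constraint q(prev, waypoint, K) == 0."""
    dim = s_0.shape[0]

    def neg_reward(flat):
        waypoints = flat.reshape(n_segments, dim)
        return -sum(reward(w) for w in waypoints)

    def feasibility(flat):
        # Each waypoint must be reachable from the previous one within K steps,
        # i.e. the TDM's predicted remaining distance is zero.
        waypoints = flat.reshape(n_segments, dim)
        prev, values = s_0, []
        for w in waypoints:
            values.append(q(prev, w, K))
            prev = w
        return np.array(values)

    x0 = np.tile(s_goal, n_segments)  # start with every waypoint at the goal
    res = minimize(neg_reward, x0,
                   constraints={"type": "eq", "fun": feasibility})
    return res.x.reshape(n_segments, dim)
```

Without the `feasibility` constraint, the optimizer would happily place waypoints the agent can't actually reach in K steps, which is exactly the failure mode the quoted sentence is about.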
Models that have an explicit goal state and a distance estimator are naturally going to be more efficient than vanilla RL, which lacks this side information and has to learn purely by exploration.
How is the distance estimated? As another comment said, if a good distance estimator is provided, that simplifies the task considerably. Is there a baseline that also uses distance in its input?
https://en.wikipedia.org/wiki/Temporal_difference_learning