I have been steeped in math for as long as I can remember, so I frequently have a hard time telling whether an explanation is actually helpful to someone who doesn't already understand the underlying concept. I'd like to use this opportunity to improve.
Could you specify where exactly you get lost, and why? (Is it something that's just not explained, or is there an explanation, but one that doesn't make it easier to understand?)
Things that weren't obvious to me as a non-mathematician:
f:S×A↦S
(I assumed it meant: the model is represented by function f, whose inputs can be any combination of S and A [domain], and will produce an output value in S [codomain])
Why do we set the Q to 0 below?
The constraint that Q(s_t,a_t,s_{t+K},K)=0 enforces the feasibility of the trajectory
> the model is represented by function f, whose inputs can be any combination of S and A [domain], and will produce an output value in S [codomain]
Exactly. f:S×A↦S is a function signature, just like in a programming language. Basically, the model tells you which state you end up in after taking a certain action in the given state.
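To make the programming analogy concrete, here is a minimal sketch of the same signature in Python type hints. The grid world, the `State` and `Action` types, and the `MOVES` table are all made up for illustration; the only point is the shape "takes a state and an action, returns a state".

```python
from typing import Dict, Tuple

# Hypothetical toy setting: a state is a grid position, an action is a direction.
State = Tuple[int, int]
Action = str

# Assumed effect of each action on the position (illustrative only).
MOVES: Dict[Action, Tuple[int, int]] = {
    "up": (0, 1),
    "down": (0, -1),
    "left": (-1, 0),
    "right": (1, 0),
}


def f(state: State, action: Action) -> State:
    """Model f: S x A -> S. Given a state and an action, return the next state."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


# Taking action "up" in state (0, 0) leads to state (0, 1).
next_state = f((0, 0), "up")
```

The domain S×A corresponds to the two parameters, and the codomain S corresponds to the return type.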
> Why do we set the Q to 0 below?
Q is introduced as:
> A temporal difference model (TDM), which we will write as Q(s,a,s_g,τ), is a function that, given a state s∈S, action a∈A, and goal state s_g∈S, predicts how close an agent can get to the goal within τ time steps. Intuitively, a TDM answers the question, “If I try to bike to San Francisco in 30 minutes, how close will I get?”
That means that setting Q(s_t,a_t,s_{t+K},K)=0 is the same as enforcing that the state s_{t+K} can actually be reached (distance 0) from s_t in K time steps. Without the constraint, it would be possible to plan a trajectory that can't be executed because the intermediate goals are too far away.
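If code is easier to read than notation, here is a rough sketch of what the constraint does. It is not a real TDM: I'm assuming a toy 1-D world where the agent moves at most one cell per step, and a hand-written `q` that returns the negative of the remaining distance (so 0 means "goal reached"). The names `q`, `feasible`, and `ACTIONS` are hypothetical.

```python
from typing import Sequence

# Assumed toy world: positions are integers, each step moves by -1, 0, or +1.
ACTIONS = (-1, 0, 1)


def q(s: int, a: int, s_g: int, tau: int) -> int:
    """Toy stand-in for Q(s, a, s_g, tau), assuming tau >= 1.

    Returns 0 if the goal s_g can be reached within tau steps when the
    first step is action a, otherwise the negative leftover distance.
    """
    leftover = abs((s + a) - s_g) - (tau - 1)
    return -max(0, leftover)


def feasible(waypoints: Sequence[int], K: int) -> bool:
    """Check the constraint Q(s_t, a_t, s_{t+K}, K) = 0 for each pair of
    consecutive waypoints, allowing any first action a_t."""
    return all(
        any(q(s, a, s_next, K) == 0 for a in ACTIONS)
        for s, s_next in zip(waypoints, waypoints[1:])
    )


# Waypoints 3 cells apart can be reached in K = 3 steps, so the plan is feasible.
plan_ok = feasible([0, 3, 6], K=3)
# A waypoint 10 cells away cannot be reached in 3 steps, so the constraint fails.
plan_bad = feasible([0, 10], K=3)
```

In this sketch, a planner that ignored the constraint could happily output `[0, 10]` as a plan, even though no sequence of three steps gets from 0 to 10.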