It works by iterating two steps:
- Keep the policy $\pi$ fixed, and determine the $Q$ values for that policy
- Keep the $Q$ value estimates fixed, and create a policy $\pi'$ such that
  - $\pi'(s) = \arg\max_a Q(s,a)$ for the ordinary greedy algorithm, or
  - $\pi'(s,a) \propto \exp Q(s, a)$ if using the greedy soft-max.
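The two alternating steps above can be sketched in code. The following is a minimal illustration on a hypothetical two-state, two-action MDP (the transition tensor `P`, reward matrix `R`, and discount `gamma` are made up for the example): policy evaluation iterates the Bellman equation for a fixed $\pi$ to estimate $Q^\pi$, then the policy is improved either greedily or via the soft-max rule.

```python
import numpy as np

# Hypothetical toy MDP for illustration (not from the original text):
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

def evaluate_policy(pi, tol=1e-8):
    """Step 1: keep pi fixed, compute Q^pi by iterating the Bellman equation."""
    Q = np.zeros((n_states, n_actions))
    while True:
        # V(s') = sum_a' pi(a'|s') Q(s', a')
        V = (pi * Q).sum(axis=1)
        Q_new = R + gamma * P @ V   # Q(s,a) = R(s,a) + gamma * E[V(s')]
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def greedy(Q):
    """Step 2 (greedy): pi'(s) = argmax_a Q(s,a), as a deterministic distribution."""
    pi = np.zeros_like(Q)
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi

def softmax(Q, tau=1.0):
    """Step 2 (soft-max): pi'(s,a) proportional to exp(Q(s,a)/tau)."""
    e = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)  # shift for stability
    return e / e.sum(axis=1, keepdims=True)

# Iterate the two steps until the policy stops changing.
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start uniform
for _ in range(50):
    Q = evaluate_policy(pi)
    pi_new = greedy(Q)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
```

Swapping `greedy(Q)` for `softmax(Q)` gives the soft-max variant, which keeps the policy stochastic and so retains some exploration.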