It works by iterating two steps:
- Keep the policy $\pi$ fixed, and determine the $Q$ values for that policy
- Keep the $Q$ value estimates fixed, and create a policy $\pi'$ such that
  - $\pi'(s) = \arg\max_a Q(s,a)$ for the ordinary greedy algorithm, or
  - $\pi'(s,a) \propto \exp Q(s, a)$ if using the greedy soft-max.
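The two alternating steps above can be sketched in code. The following is a minimal illustration on a hypothetical two-state, two-action MDP (the transition tensor `P`, reward matrix `R`, and discount `gamma` are made up for the example): policy evaluation iterates the Bellman equation for a fixed $\pi$ to estimate $Q^\pi$, then the policy is improved either greedily or via the soft-max rule.

```python
import numpy as np

# Hypothetical toy MDP for illustration (not from the original text):
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

def evaluate_policy(pi, tol=1e-8):
    """Step 1: keep pi fixed, compute Q^pi by iterating the Bellman equation."""
    Q = np.zeros((n_states, n_actions))
    while True:
        # V(s') = sum_a' pi(a'|s') Q(s', a')
        V = (pi * Q).sum(axis=1)
        Q_new = R + gamma * P @ V   # Q(s,a) = R(s,a) + gamma * E[V(s')]
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def greedy(Q):
    """Step 2 (greedy): pi'(s) = argmax_a Q(s,a), as a deterministic distribution."""
    pi = np.zeros_like(Q)
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi

def softmax(Q, tau=1.0):
    """Step 2 (soft-max): pi'(s,a) proportional to exp(Q(s,a)/tau)."""
    e = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)  # shift for stability
    return e / e.sum(axis=1, keepdims=True)

# Iterate the two steps until the policy stops changing.
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start uniform
for _ in range(50):
    Q = evaluate_policy(pi)
    pi_new = greedy(Q)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
```

Swapping `greedy(Q)` for `softmax(Q)` gives the soft-max variant, which keeps the policy stochastic and so retains some exploration.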