
It's probably referring to policy iteration[1].

It works by iterating two steps:

- Keep the policy $\pi$ fixed, and determine the $Q$ values for that policy (policy evaluation)

- Keep the $Q$ value estimates fixed, and create a new policy $\pi'$ (policy improvement) such that

   -- $\pi'(s) = \arg\max_a Q(s,a)$ for the ordinary greedy algorithm,

   -- $\pi'(s,a) \propto \exp Q(s, a)$ for a soft-max (Boltzmann) policy.
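
Here is a rough sketch of those two steps in plain Python, in case it helps. The toy MDP (`NEXT`, `REWARD`, `GAMMA`) and all function names are made up for illustration and are not from the lecture. Evaluation is done with repeated Bellman backups, which is only one of several ways to get the $Q$ values for a fixed policy.

```python
import math

# Hypothetical toy MDP (an assumption for illustration): 2 states, 2 actions,
# deterministic transitions. NEXT[s][a] is the next state, REWARD[s][a] the reward.
NEXT = [[0, 1], [0, 1]]
REWARD = [[0.0, 1.0], [0.0, 2.0]]
GAMMA = 0.9


def evaluate(policy, sweeps=500):
    """Step 1: keep pi fixed and estimate its Q values by repeated Bellman backups."""
    n_s, n_a = len(NEXT), len(NEXT[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        new_q = [[0.0] * n_a for _ in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                s2 = NEXT[s][a]
                # Expected value of the next state under the fixed policy.
                v_next = sum(policy[s2][b] * q[s2][b] for b in range(n_a))
                new_q[s][a] = REWARD[s][a] + GAMMA * v_next
        q = new_q
    return q


def greedy(q):
    """Step 2, greedy variant: all probability on argmax_a Q(s, a) (one-hot rows)."""
    policy = []
    for row in q:
        best = row.index(max(row))
        policy.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return policy


def softmax(q):
    """Step 2, soft-max variant: pi'(s, a) proportional to exp Q(s, a)."""
    policy = []
    for row in q:
        m = max(row)  # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in row]
        total = sum(weights)
        policy.append([w / total for w in weights])
    return policy


def policy_iteration(improve, iterations=20):
    """Alternate evaluation and improvement, starting from a uniform policy."""
    policy = [[0.5, 0.5], [0.5, 0.5]]
    q = evaluate(policy)
    for _ in range(iterations):
        q = evaluate(policy)   # step 1: Q values for the current policy
        policy = improve(q)    # step 2: new policy from those Q values
    return policy, q
```

Calling `policy_iteration(greedy)` or `policy_iteration(softmax)` switches between the two improvement rules. In this toy problem the greedy version settles on always taking action 1, while the soft-max version stays stochastic by construction.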

[1] - Lecture 3 of http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

