Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning (bair.berkeley.edu)
139 points by endswapper on Oct 22, 2017 | hide | past | favorite | 15 comments


DeepMind's Combining Policy Gradient and Q-Learning provides good historical background:

https://arxiv.org/pdf/1611.01626.pdf

And the archived videos from Berkeley's Deep RL Bootcamp provide a crash course to get one up to speed:

https://sites.google.com/view/deep-rl-bootcamp/lectures

But consider the multi-modal Q-distribution in Fig. 3. That graph, and the variance of its peaks, is the key to understanding why training Deep RL algorithms becomes so intractable in stochastic, real-world settings.

Maximum entropy estimation and approximate Bellman equations will do very well in Atari games, for example, or in creating unbeatable MOBA competitors. But for robotics, what's needed is a breakthrough that allows quick generalization from sparse input data.

One-Shot Visual Imitation Learning via Meta-Learning

https://sites.google.com/view/one-shot-imitation


But what approach would it take to beat a human at Ms. Pac Man?

When people hear ML has felled the world's classic strategy games like Chess and Go, leaving world champions helpless, the idea of humans beating computers at any game seems quaint.

But forget world champions, the best systems cannot beat a mediocre player at Ms. Pac Man. It's not because no one has tried; there have been competitions and research papers.

Of course, the challenge seems pretty damn unimportant compared to self driving cars. But still, if it were trivial to do, I would have guessed someone would have done it.

What other games exist where AI can't beat a human with less than 60 minutes of lifetime experience?


Is that true about Ms. Pac Man? Microsoft seems to have posted a near-perfect score: http://www.businessinsider.com/microsoft-ai-ms-pac-man-highe...


Wow, that’s really new, within the last few months, thanks for the link. That makes the challenge sound smaller than it was. It had been worked on for years prior, as I mentioned, with poor results.

Now on to Starcraft 2, where some major research labs are competing including DeepMind.

It looks like the best bots this year struggled to beat average players. https://www.wired.com/story/facebook-quietly-enters-starcraf...


I am somewhat curious which game you will be latching on to when SC2 is also thoroughly beaten in a year (or whenever).


“latching on to”? It’s not about trying to disrespect the research, or downplay the capabilities of AI after the last decade’s amazing progress.

It’s about noting all the little milestones and unsolved problems along the way, not just the headlines. The idea is to improve perspective on the whole of where the state of the art is.

As for the next game, I don’t imagine it will be too many years before we reach the 99th percentile of AI playing all games at least as well as the best humans. Why not throw out a prediction? Less than 5 years?


I fail to see how this is different from keeping non-zero weights on suboptimal actions for exploration.


In standard policy gradient, the weights for (apparently) suboptimal actions are free to trend downwards to arbitrarily small values. If you're using a Gibbs distribution[0], the probability of executing alternative actions is nonzero but potentially very small.

With entropy methods, you add an artificial bonus towards policies that execute more varied actions[1]. So while there's a tendency towards improving the policy by executing apparently favorable actions, it is counter-balanced by a term incentivizing different choices.

Why not just optimize using standard gradient descent? Typically, the variance is high and the algorithms optimize very greedily, so you end up in situations where a few bad trajectories cause certain actions to be avoided. Or the agent might need to refine its policy for the states that come afterwards, but because that hasn't happened yet, the actions that lead to those states are ignored[2].

So exploration needs to be incentivized, and while this can be done with ε-greedy or optimistic initialization, those techniques are kinda ad hoc, while entropy methods have a bit more mathematical/philosophical heft behind them. Plus they seem to work, which is always nice.

-----

0. That is, if you'll excuse the notation: p(a) = e^{α_a} / Σ_b e^{α_b}, where 'p(a)' is the probability of taking action 'a' with weight 'α_a', and the sum in the denominator runs over the weights of all possible actions 'b'.

1. For a discrete space (absent other information), the maximum entropy distribution is the uniform distribution, i.e., all actions are equally likely.

2. For example, a human-like robot might try stepping forward, but hasn't learned how to balance after it lifts its foot, so it stumbles and falls. Crawling is less efficient, but provides constant progress and is easier to learn. So the robot 'gets stuck' crawling since it had a bad experience walking and doesn't attempt it again.
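To make the entropy-bonus idea concrete, here's a rough numpy sketch of a single-step objective. The function names and the 'beta' coefficient are my own illustrative choices, not from the paper; real implementations compute this over batches of trajectories with a learned advantage estimate.

```python
import numpy as np

def softmax(logits):
    """Gibbs/Boltzmann distribution over action logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy_bonus_objective(logits, action, advantage, beta=0.01):
    """Per-step policy-gradient objective with an entropy bonus.

    The first term pushes probability toward apparently good actions;
    the beta-weighted entropy term counteracts collapse onto a single
    action by rewarding more varied (higher-entropy) policies.
    """
    probs = softmax(logits)
    log_prob = np.log(probs[action])
    entropy = -np.sum(probs * np.log(probs))
    return advantage * log_prob + beta * entropy
```

With beta = 0 this reduces to the vanilla policy-gradient term; larger beta values keep more probability mass on alternative actions.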


> In a similar vein to general-to-specific transfer, we can compose new skills from existing policies—even without any fine-tuning—by intersecting different skills. The idea is simple: take two soft policies, each corresponding to a different set of behaviors, and combine them by adding together their Q-functions. In fact, it is possible to show that the combined policy is approximately optimal for the combined task, obtained by simply adding the reward functions of the constituent tasks, up to a bounded error.

> Because the maximum entropy formulation encourages agents to try all possible solutions, the agents learn to explore a large portion of the state space. Thus they learn to act in various situations, and are more robust against perturbations in the environment.

it is keeping non-zero weights on suboptimal actions during exploration. but that tells you almost nothing. there are an infinite number of ways to keep non-zero weights on suboptimal actions and some of those are much better than the rest. the maximum-entropy approach is a very specific way to use non-greedy weights.
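The composition-by-addition trick in the quoted passage is easy to sketch in the tabular case. The Q-tables below are made-up toy values, not anything from the paper:

```python
import numpy as np

# Hypothetical tabular Q-functions for two separately-trained skills,
# indexed as Q[state, action]. Composing the skills is just addition.
q_skill_a = np.array([[1.0, 0.0], [0.5, 0.5]])
q_skill_b = np.array([[0.0, 1.0], [0.5, 0.5]])
q_combined = q_skill_a + q_skill_b

def soft_policy(q_row):
    """Boltzmann policy: P(a|s) proportional to exp(Q(s, a))."""
    z = np.exp(q_row - q_row.max())
    return z / z.sum()

# Per the post, this combined soft policy is approximately optimal for
# the task whose reward is the sum of the constituent tasks' rewards.
policy_state0 = soft_policy(q_combined[0])
```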


Could someone please explain the explanation of eq. (18) in the appendix to the paper? It seems key, but isn't clear to me.

> if one greedily maximize the sum of entropy and value with one-step look-ahead, then one obtains π˜ from π


It wasn't clear to me either, and I really dislike when papers expect the reader to 'notice' some rather non-trivial manipulations. So I wrote up my unpacking of the equivalence: http://rl.ai/posts/max-entropy-rl-explanation.html Please forgive the possibly screwy LaTeX, as I am in a battle with my static site generator.

If you're just wondering about the meaning of (18), it claims that, for a given state, the entropy of the original policy plus its expected Q-value is at most the entropy plus expected Q-value of the π-tilde policy, where they use Q_{soft}^{\pi} to denote the state-action values (with entropy term) for an arbitrary policy. In (17) they introduce a new policy, \tilde{\pi}, defined through the Q_{soft} values for some policy π, with the probability of taking action 'a' in state 's' proportional to the exponentiated Q(s,a).
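Numerically the claim is easy to check for a single state. Below is a small sketch (my own function names) of the π-tilde construction from (17) and the quantity compared in (18); since softmax(Q) maximizes H(π) + E_π[Q], it beats any other policy on that score:

```python
import numpy as np

def pi_tilde(q_soft_row):
    """Policy from eq. (17): pi~(a|s) proportional to exp(Q_soft(s, a))."""
    z = np.exp(q_soft_row - q_soft_row.max())
    return z / z.sum()

def entropy_plus_value(policy, q_soft_row):
    """The quantity compared in (18): H(pi(.|s)) + E_{a~pi}[Q_soft(s, a)]."""
    return -np.sum(policy * np.log(policy)) + np.dot(policy, q_soft_row)
```

For any policy π with full support, entropy_plus_value(pi_tilde(q), q) >= entropy_plus_value(pi, q), and the maximum equals log Σ_a exp(Q(s,a)), the soft value.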


Thanks, this is really generous of you.


It's probably referring to policy iteration[1].

It works by iterating two steps:

- Keep the policy $\pi$ fixed, and determine the $Q$ values for that policy

- Keep the $Q$ value estimates fixed, and create a policy $\pi'$ such that

   -- $\pi'(s) = \arg\max_a Q(s,a)$ for the ordinary greedy algorithm,

   -- $\pi'(s,a) \propto \exp Q(s, a)$ if using greedy soft-max.
[1] - Lecture 3 of http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
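The two alternating steps above can be sketched in a few lines of numpy for a toy tabular MDP. This is an illustrative sketch (my own naming and the 'soft' flag), not the paper's algorithm: it solves the evaluation step exactly by a linear solve, then applies either the greedy or the soft-max improvement step.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, soft=False, temp=1.0, iters=50):
    """Tabular policy iteration.

    P[s, a, t] are transition probabilities, R[s, a] are rewards.
    Each iteration evaluates the current policy exactly, then improves
    it greedily (soft=False) or via soft-max over Q (soft=True).
    """
    n_s, n_a = R.shape
    pi = np.full((n_s, n_a), 1.0 / n_a)
    for _ in range(iters):
        # Policy evaluation: solve V = R_pi + gamma * P_pi V
        R_pi = np.sum(pi * R, axis=1)
        P_pi = np.einsum('sa,sat->st', pi, P)
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        # Policy improvement
        if soft:
            z = np.exp(Q / temp - (Q / temp).max(axis=1, keepdims=True))
            pi = z / z.sum(axis=1, keepdims=True)
        else:
            pi = np.eye(n_a)[np.argmax(Q, axis=1)]
    return pi, Q
```

On a two-state MDP where one state pays reward, the greedy version converges to "move to the rewarding state, then stay"; the soft version keeps nonzero probability on every action, with the spread controlled by 'temp'.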


Hi Alex! I don't have the answer, but I just sent this over to Seth. Looking forward to seeing you all again in December!


Hey, Skander! Hope SF's treating you well.



