
I fail to see how this is different from keeping non-zero weights on suboptimal actions for exploration.


In standard policy gradient, the weights for (apparently) suboptimal actions are free to trend downwards to arbitrarily small values. If you're using a Gibbs distribution[0], the probability of executing alternative actions is nonzero but potentially very small.

With entropy methods, you add an artificial bonus towards policies that execute more varied actions[1]. So while there's a tendency towards improving the policy by executing apparently favorable actions, it is counter-balanced by a term incentivizing different choices.
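To make the counter-balancing concrete, here's a minimal sketch in plain Python of a per-step objective with an entropy bonus (the `beta` coefficient and the example distributions are made up for illustration, not from any particular paper):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_regularized_score(log_prob, advantage, probs, beta=0.1):
    """Policy-gradient objective with an entropy bonus:
    log pi(a|s) * A(s,a) + beta * H(pi(.|s)).
    beta (hypothetical value) trades off greed against exploration."""
    return log_prob * advantage + beta * entropy(probs)

# A near-deterministic policy earns a smaller bonus than a spread-out one,
# so, all else equal, the objective nudges the policy toward varied actions.
peaked = [0.98, 0.01, 0.01]
spread = [0.5, 0.3, 0.2]
```

Holding the advantage term fixed, the spread-out policy scores higher purely because of its larger entropy term.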

Why not just optimize using standard gradient descent? Typically, the variance is high and the algorithms optimize very greedily, so you end up in situations where a few bad trajectories cause certain actions to be avoided. Or the agent might need to refine its policy for the states that come afterwards, but because that hasn't happened yet, the actions that lead to those states are ignored[2].

So exploration needs to be incentivized, and while this can be done with ε-greedy or optimistic initialization, those techniques are kinda ad hoc, while entropy methods have a bit more mathematical/philosophical heft behind them. Plus they seem to work, which is always nice.
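For contrast, here's what the ad hoc option looks like — ε-greedy is essentially a coin flip bolted onto the greedy rule (a toy sketch; the function name and defaults are mine, not from any library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Ad hoc exploration: with probability epsilon take a uniformly
    random action, otherwise take the greedy (highest-value) one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Note that ε doesn't depend on the values at all — a barely-worse action and a catastrophic one get explored equally often, which is part of why this feels less principled than an entropy term.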

-----

0. That is, if you'll excuse the notation: p(a) = e^(w_a) / Σ_b e^(w_b), where 'p(a)' is the probability of taking action 'a', 'w_a' is its weight, and the sum in the denominator runs over the weights of all possible actions 'b'.
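A quick numerical sketch of that footnote in plain Python (the weights are arbitrary): even a heavily favored action leaves its alternatives with small but non-zero probability.

```python
import math

def gibbs_probs(weights):
    """Softmax / Gibbs distribution over action weights."""
    # Subtract the max weight for numerical stability.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    z = sum(exps)
    return [e / z for e in exps]

# One action with a much larger weight dominates, but the
# others keep small non-zero probability.
probs = gibbs_probs([5.0, 0.0, 0.0])
```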

1. For a discrete space (absent other information), the maximum entropy distribution is the uniform distribution, i.e., all actions are equally likely.
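To illustrate footnote [1] numerically (distributions chosen arbitrarily): the uniform distribution over n outcomes attains the maximum entropy, log(n), and any more peaked distribution scores lower.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25] * 4     # maximum-entropy case: H = log(4)
peaked = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic: low entropy
```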

2. For example, a human-like robot might try stepping forward, but hasn't learned how to balance after it lifts its foot, so it stumbles and falls. Crawling is less efficient, but provides constant progress and is easier to learn. So the robot 'gets stuck' crawling since it had a bad experience walking and doesn't attempt it again.


> In a similar vein to general-to-specific transfer, we can compose new skills from existing policies—even without any fine-tuning—by intersecting different skills. The idea is simple: take two soft policies, each corresponding to a different set of behaviors, and combine them by adding together their Q-functions. In fact, it is possible to show that the combined policy is approximately optimal for the combined task, obtained by simply adding the reward functions of the constituent tasks, up to a bounded error.
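A toy sketch of the composition idea described in that quote, assuming soft policies of the form π(a|s) ∝ exp(Q(s,a)/α) — the Q-values and task names here are made up for illustration:

```python
import math

def soft_policy(q_values, alpha=1.0):
    """Soft (max-entropy) policy: pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def composed_policy(q_a, q_b, alpha=1.0):
    """Compose two skills by summing their Q-functions; per the quoted
    claim, this is approximately optimal for the summed rewards."""
    return soft_policy([qa + qb for qa, qb in zip(q_a, q_b)], alpha)

# Hypothetical Q-values for two tasks over the same 3 actions.
q_reach = [1.0, 0.2, 0.0]   # task A prefers action 0
q_avoid = [0.0, 0.3, 1.0]   # task B prefers action 2
pi = composed_policy(q_reach, q_avoid)
```

With these numbers the combined policy splits its mass between the actions each constituent task favors, rather than committing to either one — which is the intersection behavior the quote describes.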

> Because the maximum entropy formulation encourages agents to try all possible solutions, the agents learn to explore a large portion of the state space. Thus they learn to act in various situations, and are more robust against perturbations in the environment.

It is keeping non-zero weights on suboptimal actions during exploration, but that tells you almost nothing. There are infinitely many ways to keep non-zero weights on suboptimal actions, and some of them are much better than the rest. The maximum-entropy approach is one very specific way to use non-greedy weights.



