What are GAEs (Generalized Advantage Estimations) in Reinforcement Learning?
Background
GAEs were introduced in this paper. They are one of the building blocks of PPO, and one of the fundamental reasons why PPO works so well.
They are based on eligibility traces and TD-λ, which I explained in my other post. To understand GAEs, let’s step back and understand why we need them.
What problem are we trying to solve?
In policy gradient methods, and in reinforcement learning in general, there is a credit assignment problem: how do we know which action to reinforce when the reward only arrives long after the action was taken?
The goal is to find a balance between bias and variance. We introduce variance when we look far ahead into the future to reinforce actions, because we then need many sampled trajectories to approach the true average, i.e. the expectation of our returns. We introduce bias when we are short-sighted, i.e. when we estimate our current state’s value from n-step returns with a small n (n = 1 gives the TD residual), because that estimate leans heavily on an imperfect learned value function.
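To make the tradeoff concrete, here is the standard n-step return behind those estimates, with V our learned value estimate:
\[G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\]With n = 1 only a single sampled reward appears and the rest comes from V, so the estimate has low variance but inherits V’s bias; as n grows, more sampled rewards enter, the bias of V fades, and the variance increases.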
What tools do we have?
We know about eligibility traces and λ-returns, as explained in my latest post on TD-λ (see Reinforcement Learning: an Introduction). But λ-returns require a full episode before they can be computed, and TD-λ is a full algorithm that incorporates eligibility traces directly into its gradient update. In PPO, we want to incorporate the advantage function into the loss function, so we can’t use the TD-λ algorithm directly.
What do GAEs bring?
GAEs essentially bring the TD-λ idea to policy gradient methods, by estimating the advantage function, which can then be used in the loss function. Recall that the advantage function is the difference between an action’s value and the value of the state under the current policy: it measures how much better or worse that action is compared to what the policy would do on average.
\[A(s_t, a_t) = Q(s_t,a_t) - V(s_t)\]
How do they work?
Let’s talk intuition here. In the advantage function, you need the value of the state-action pair, Q(s_t, a_t), so you can measure its difference with the state value V(s_t) under your current policy. But you don’t have Q, so you need an estimate, and you want that estimate to have both low variance and low bias. For instance, you can take an exponentially weighted average of the n-step advantage estimates, each of which is built from the one-step TD residual:
\[\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)\]Each n-step advantage estimate can then be written this way:
\[\begin{align*} A_t^{(1)} &= \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t) \\ A_t^{(2)} &= \delta_t^V + \gamma \delta_{t+1}^V = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\ &\cdots \\ A_t^{(\infty)} &= \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V, \quad \delta_{t+l}^V = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}) \end{align*}\]Each of these estimates has its pros and cons: the first one is the TD(0) estimate, with high bias but low variance; the last one is the Monte Carlo estimate, with high variance and low bias. You can then take any weighted sum of them. By doing that, we are saying that our estimate of the advantage at time t is a weighted (exponentially decaying, in the case of GAEs) sum of the n-step estimates. This is exactly what we were looking for: we reduce bias because the longer-horizon estimates rely less on the approximate value function, and we reduce variance because we don’t put too much weight on those long-horizon estimates.
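To make this concrete, here is a small Python sketch of the n-step advantages and their exponentially weighted combination. It assumes a finite trajectory with rewards r_t, …, r_{T-1} and value estimates V(s_t), …, V(s_T) stored as arrays; the function names are my own illustration, not from the GAE paper.

```python
def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_t^(n): n sampled rewards, a bootstrapped tail gamma^n * V(s_{t+n}), minus V(s_t)."""
    n = min(n, len(rewards) - t)  # truncate at the end of the trajectory
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    ret += gamma ** n * values[t + n]  # bootstrap with the value estimate
    return ret - values[t]

def lambda_weighted_advantage(rewards, values, t, gamma=0.99, lam=0.95):
    """Exponentially weighted average of the n-step advantages, with weights (1 - lam) * lam**(n - 1)."""
    N = len(rewards) - t
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, N + 1)]
    weights[-1] = lam ** (N - 1)  # remaining weight mass goes to the longest (Monte-Carlo-like) estimate
    advantages = [n_step_advantage(rewards, values, t, n, gamma) for n in range(1, N + 1)]
    return sum(w * a for w, a in zip(weights, advantages))
```

With λ close to 0 the short, biased estimates dominate; with λ close to 1 the long, high-variance ones do. That is exactly the knob GAE exposes.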
How are they different from TD-λ?
TD-λ really is just an algorithm that uses eligibility traces in a “backward” manner, so that we can update at every timestep, and it folds those traces directly into the gradient update of the value-function weights:
\[w_{t+1} = w_t + \alpha \delta_t z_t, \quad z_t = \gamma \lambda z_{t-1} + \nabla \hat{v}(s_t)\]This is fine for learning a value function, but in policy gradient methods we want to define our own loss function, for instance the clipped surrogate used in PPO, and we don’t want to update the value-function weights directly: we want an advantage estimate that we can plug into the policy loss. So the authors of GAE found a way to bring this λ-weighting into policy gradient methods.
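For contrast, here is a minimal sketch of the TD-λ update above, assuming a linear value function v(s) = w · φ(s), so that the gradient ∇v̂(s_t) in the formula is just the feature vector φ(s_t); the function name and shapes are illustrative only.

```python
import numpy as np

def td_lambda_step(w, z, phi_s, phi_s_next, reward, alpha=0.1, gamma=0.99, lam=0.9, done=False):
    """One semi-gradient TD(lambda) step for a linear value function v(s) = w @ phi(s)."""
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    delta = reward + gamma * v_next - v_s  # TD error delta_t
    z = gamma * lam * z + phi_s            # eligibility trace: z_t = gamma * lam * z_{t-1} + grad v(s_t)
    w = w + alpha * delta * z              # w_{t+1} = w_t + alpha * delta_t * z_t
    return w, z
```

Notice that the trace directly drives the update of the value weights w: there is no standalone advantage estimate that we could feed into a policy loss.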
The result is the GAE estimator, an exponentially discounted sum of the TD residuals: \[A_t^{GAE} = G_t^{\lambda} - \hat{v}(s_t) = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V\]which we then plug into our loss function to weight the policy update (in PPO, it multiplies the clipped probability ratio).
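In practice, the infinite sum is truncated at the end of a (possibly partial) trajectory and computed with a single backward pass, using the recursion A_t = δ_t + γλ A_{t+1}. Here is a minimal sketch, assuming rewards of length T and value estimates of length T + 1 (the last value bootstraps the tail); the function name is my own, but this mirrors what typical PPO implementations do.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward-recursion GAE: A_t = delta_t + gamma * lam * A_{t+1}."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual delta_t (terminal states would need the bootstrap term masked out, omitted here)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting lam=0 recovers the one-step TD advantage and lam=1 the Monte Carlo one, matching the bias/variance discussion above.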
Conclusion
This short article builds upon the previous one, which introduced eligibility traces, and makes the link between the two. We now have a clear understanding of what GAEs are, where they come from, and how useful they are in current state-of-the-art RL algorithms, namely PPO.