Expected Return - What Drives a Reinforcement Learning Agent in an MDP
What's up, guys? In this post, we're going to build on the way we think about the cumulative rewards that an agent receives in a Markov decision process and introduce the important concept of return. We'll see that the return is exactly what's driving the agent to make the decisions it makes, so let's get to it!
Expected return
Recall from the last post that the goal of an agent in an MDP is to maximize its cumulative rewards. We need a way to aggregate and formalize these cumulative rewards. For this, we introduce the concept of the expected return of the rewards at a given time step.
For now, we can think of the return simply as the sum of future rewards. Mathematically, we define the return \(G\) at time \(t\) as \begin{equation*} G_{t}=R_{t+1}+R_{t+2}+R_{t+3}+\cdots +R_{T}\text{,} \end{equation*} where \(T\) is the final time step.
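To make this concrete, here's a minimal Python sketch (not from the original post) that computes this sum for a hypothetical episode. The reward values and the `compute_return` helper are made up for illustration.

```python
# A minimal sketch: the return G_t as the plain sum of future rewards
# in an episodic task. Reward values here are made-up example data.

def compute_return(rewards, t):
    """Return G_t = R_{t+1} + R_{t+2} + ... + R_T for one episode.

    rewards[k] is assumed to hold R_{k+1}, the reward received after
    the action taken at time step k.
    """
    return sum(rewards[t:])

episode_rewards = [1.0, 0.0, 2.0, 5.0]  # hypothetical rewards R_1 .. R_4
print(compute_return(episode_rewards, t=0))  # 8.0
```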
This concept of the expected return is super important because it's the agent's objective to maximize the expected return. The expected return is what's driving the agent to make the decisions it makes.
Episodic vs. continuing tasks
In our definition of the expected return, we introduced \(T\), the final time step. When the notion of having a final time step makes sense, the agent-environment interaction naturally breaks up into subsequences, called episodes. For example, think about playing a game of Pong. Each new round of the game can be thought of as an episode, and the final time step of an episode occurs when a player scores a point.
Each episode ends in a terminal state at time \(T\), which is followed by resetting the environment to some standard starting state or to a random sample from a distribution of possible starting states. The next episode then begins independently from how the previous episode ended.
Formally, tasks with episodes are called episodic tasks.
There exist other types of tasks, though, where the agent-environment interaction doesn't break up naturally into episodes but instead continues without limit. These types of tasks are called continuing tasks.
Continuing tasks make our definition of the return at each time \(t\) problematic because our final time step would be \(T = \infty\), and therefore the return itself could be infinite since we have \begin{equation*} G_{t}=R_{t+1}+R_{t+2}+R_{t+3}+\cdots +R_{T}\text{.} \end{equation*} Because of this, we need to refine the way we're working with the return.
Discounted return
Our revision of the way we think about return will make use of discounting. Rather than the agent's goal being to maximize the expected return of rewards, it will instead be to maximize the expected discounted return of rewards. Specifically, the agent will be choosing action \(A_t\) at each time \(t\) to maximize the expected discounted return.
To define the discounted return, we first define the discount rate, \(\gamma\), to be a number between \(0\) and \(1\). The discount rate is the rate at which we discount future rewards, and it determines the present value of future rewards. With this, we define the discounted return as \begin{eqnarray*} G_{t} &=&R_{t+1}+\gamma R_{t+2}+\gamma ^{2}R_{t+3}+\cdots \\ &=&\sum_{k=0}^{\infty }\gamma ^{k}R_{t+k+1}\text{.} \end{eqnarray*}
This definition of the discounted return means that our agent cares more about immediate rewards than about future rewards, since future rewards are more heavily discounted. So, while the agent does consider the rewards it expects to receive in the future, the more immediate rewards have greater influence when it comes to the agent making a decision about taking a particular action.
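Here's a similar sketch, again with made-up reward values and a hypothetical `discounted_return` helper, that computes the discounted return directly from the definition above.

```python
# A minimal sketch: G_t = sum over k of gamma^k * R_{t+k+1}, computed
# directly from the definition. Reward values are made-up example data.

def discounted_return(rewards, t, gamma):
    """Discounted return G_t, where rewards[k] holds R_{k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

rewards = [1.0, 0.0, 2.0, 5.0]
print(discounted_return(rewards, t=0, gamma=0.9))
# 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0, approximately 6.265
```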
Now, check out the relationship below showing how returns at successive time steps are related to each other. We'll make use of this relationship later. \begin{eqnarray*} G_{t} &=&R_{t+1}+\gamma R_{t+2}+\gamma ^{2}R_{t+3}+\cdots \\ &=&R_{t+1}+\gamma \left( R_{t+2}+\gamma R_{t+3}+\cdots \right) \\ &=&R_{t+1}+\gamma G_{t+1}\text{.} \end{eqnarray*}
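As a quick sketch using the same made-up rewards as before, this recursion lets us compute the return at every time step with a single backward pass, and the result at \(t=0\) matches the direct sum above.

```python
# A minimal sketch: use the recursion G_t = R_{t+1} + gamma * G_{t+1}
# to compute the return at every time step in one backward pass.
# Reward values are made-up example data.

def returns_from_recursion(rewards, gamma):
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G   # rewards[t] holds R_{t+1}
        returns[t] = G
    return returns

print(returns_from_recursion([1.0, 0.0, 2.0, 5.0], gamma=0.9))
# approximately [6.265, 5.85, 6.5, 5.0]
```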
Also, check this out. Even though the return at time \(t\) is a sum of an infinite number of terms, the return is actually finite as long as the rewards are bounded and \(\gamma \lt 1\).
For example, if the reward at each time step is a constant \(1\) and \(\gamma \lt 1\), then the return is \begin{equation*} G_{t}=\sum_{k=0}^{\infty }\gamma ^{k}=\frac{1}{1-\gamma }\text{.} \end{equation*}
This infinite sum yields a finite result. If you want to understand this concept more deeply, look into the convergence of infinite series. For our purposes, though, you're free to just trust that this is true and understand that the infinite sum of discounted rewards is finite when the conditions we outlined are met.
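If you'd like a quick numeric sanity check rather than a proof, this short sketch (with an arbitrarily chosen \(\gamma = 0.9\)) shows the partial sums of \(\gamma^{k}\) settling at \(1/(1-\gamma)\).

```python
# A minimal numeric sanity check (not a proof): with constant reward 1 and
# gamma < 1, the partial sums of gamma^k approach 1 / (1 - gamma).

gamma = 0.9
partial_sum = sum(gamma ** k for k in range(1000))
print(partial_sum)        # approximately 10.0
print(1 / (1 - gamma))    # 10.0 exactly, since 1 / (1 - 0.9) = 10
```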
Wrapping up
Now we should have a good feel for the discounted return. The main takeaway here is that it's the agent's objective to maximize the expected discounted return of rewards. While the agent does consider all of the expected future rewards when selecting an action, the more immediate rewards influence the agent more than rewards it expects to receive further out, due to the discount rate.
Next time we'll be building on the ideas from our introduction to MDPs and discounted return to see how we can measure how "good" any particular state or any particular action is for the agent. I'll see ya in the next one!