# Reinforcement Learning - Introducing Goal Oriented Intelligence

with deeplizard.

## Expected Return - What Drives a Reinforcement Learning Agent in an MDP

September 23, 2018 by

What’s up, guys? In this post, we're going to build on the way we think about the cumulative rewards that an agent receives in a Markov decision process and introduce the important concept of return. We'll see that the return is exactly what's driving the agent to make the decisions it makes, so let’s get to it!

#### Expected return

Recall that in the last post, we stated that the goal of an agent in an MDP is to maximize its cumulative rewards. We need a way to aggregate and formalize these cumulative rewards. For this, we introduce the concept of the expected return of the rewards at a given time step.

For now, we can think of the return simply as the sum of future rewards. Mathematically, we define the return $$G$$ at time $$t$$ as \begin{equation*} G_{t}=R_{t+1}+R_{t+2}+R_{t+3}+\cdots +R_{T}\text{,} \end{equation*} where $$T$$ is the final time step.
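To make the sum concrete, here's a minimal sketch in Python. The reward values are made up for illustration, and the convention that `rewards[i]` holds $$R_{i+1}$$ is an assumption of this sketch, not something from the post.

```python
def undiscounted_return(rewards):
    """Sum of the rewards received after a given time step."""
    return sum(rewards)

# Hypothetical rewards R_1, ..., R_5 from one episode (made-up numbers).
# rewards[i] stands in for R_{i+1}, so the final time step here is T = 5.
episode_rewards = [1, 0, 2, 0, 3]

# G_0 = R_1 + R_2 + R_3 + R_4 + R_5
g_0 = undiscounted_return(episode_rewards)

# G_2 = R_3 + R_4 + R_5, so we skip the first two rewards.
g_2 = undiscounted_return(episode_rewards[2:])

print(g_0, g_2)
```

Notice that the return at a later time step only counts the rewards that are still ahead of the agent.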

It is the agent’s goal to maximize the expected return of rewards.

This concept of the expected return is super important because it's the agent's objective to maximize the expected return. The expected return is what's driving the agent to make the decisions it makes.

### Episodic vs. continuing tasks

In our definition of the return, we introduced $$T$$, the final time step. When the notion of having a final time step makes sense, the agent-environment interaction naturally breaks up into subsequences, called episodes. For example, think about playing a game of Pong. Each new round of the game can be thought of as an episode, and the final time step of an episode occurs when a player scores a point.

Each episode ends in a terminal state at time $$T$$, which is followed by resetting the environment to some standard starting state or to a random sample from a distribution of possible starting states. The next episode then begins independently from how the previous episode ended.

Formally, tasks with episodes are called episodic tasks.

There exist other types of tasks, though, where the agent-environment interactions don’t break up naturally into episodes, but instead continue without limit. These types of tasks are called continuing tasks.

Can you think of any examples of continuing tasks?

Continuing tasks make our definition of the return at each time $$t$$ problematic because our final time step would be $$T = \infty$$, and therefore the return itself could be infinite since we have \begin{equation*} G_{t}=R_{t+1}+R_{t+2}+R_{t+3}+\cdots +R_{T}\text{.} \end{equation*} Because of this, we need to refine the way we're working with the return.

### Discounted return

Our revision of the way we think about return will make use of discounting. Rather than the agent’s goal being to maximize the expected return of rewards, it will instead be to maximize the expected discounted return of rewards. Specifically, the agent will be choosing action $$A_t$$ at each time $$t$$ to maximize the expected discounted return.

It is the agent’s goal to maximize the expected discounted return of rewards.

To define the discounted return, we first define the discount rate, $$\gamma$$, to be a number between $$0$$ and $$1$$. The discount rate is the rate by which we discount future rewards, and it determines the present value of those rewards. With this, we define the discounted return as \begin{eqnarray*} G_{t} &=&R_{t+1}+\gamma R_{t+2}+\gamma ^{2}R_{t+3}+\cdots \\ &=&\sum_{k=0}^{\infty }\gamma ^{k}R_{t+k+1}\text{.} \end{eqnarray*}
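Here's a small sketch of that sum in Python. It works on a finite list of rewards, so it only approximates the infinite sum by truncating it. The function name and the reward values are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * R_{t+k+1} over a finite list of rewards.

    rewards[k] stands in for R_{t+k+1}, so rewards[0] is the very next reward.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical rewards: a reward of 1 at each of the next four time steps.
rewards = [1.0, 1.0, 1.0, 1.0]

# With gamma = 0.5 this is 1 + 0.5 + 0.25 + 0.125.
g = discounted_return(rewards, gamma=0.5)
print(g)
```

Setting `gamma=1.0` gives back the plain sum of rewards, and `gamma=0.0` keeps only the immediate reward.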

With this definition of the discounted return, our agent will care more about the immediate reward than about future rewards, since future rewards are more heavily discounted. So, while the agent does consider the rewards it expects to receive in the future, the more immediate rewards have more influence when it comes to the agent making a decision about taking a particular action.
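To get a feel for this, the sketch below compares two made-up reward sequences under two different discount rates. The sequences, the function name, and the chosen values of $$\gamma$$ are all assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward list; rewards[k] stands in for R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Two hypothetical reward sequences (made-up numbers, not from the post).
reward_now = [1.0, 0.0, 0.0, 0.0]    # small reward right away
reward_later = [0.0, 0.0, 0.0, 2.0]  # bigger reward three steps later

for gamma in (0.9, 0.5):
    g_now = discounted_return(reward_now, gamma)
    g_later = discounted_return(reward_later, gamma)
    print(f"gamma={gamma}: now={g_now:.3f}, later={g_later:.3f}")
```

With $$\gamma = 0.9$$ the delayed reward still comes out ahead, but with $$\gamma = 0.5$$ it is discounted so heavily that the smaller immediate reward gives the larger return.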

Now, check out the relationship below showing how returns at successive time steps are related to each other. We’ll make use of this relationship later. \begin{eqnarray*} G_{t} &=&R_{t+1}+\gamma R_{t+2}+\gamma ^{2}R_{t+3}+\gamma ^{3}R_{t+4}+\cdots \\ &=&R_{t+1}+\gamma \left( R_{t+2}+\gamma R_{t+3}+\gamma ^{2}R_{t+4}+\cdots \right) \\ &=&R_{t+1}+\gamma G_{t+1} \end{eqnarray*}
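One place this relationship comes in handy is computing the return at every time step of a finished episode in a single backward pass. The sketch below assumes a finite list of rewards and treats the return after the final time step as zero. The function name and reward values are made up for illustration.

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every t using G_t = R_{t+1} + gamma * G_{t+1}.

    rewards[t] stands in for R_{t+1}. The return after the last
    time step is taken to be 0 (an assumption for episodic tasks).
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # G_T = 0: nothing left to collect after the episode ends
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * g_next
        g_next = returns[t]
    return returns

# Hypothetical rewards R_1, R_2, R_3 from a short episode.
returns = returns_from_rewards([1.0, 2.0, 3.0], gamma=0.5)
print(returns)
```

Working backward means each return reuses the one computed just before it, so we never have to recompute the whole discounted sum from scratch.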

Also, check this out. Even though the return at time $$t$$ is a sum of an infinite number of terms, the return is actually finite as long as the rewards are bounded (a constant reward is the simplest case) and $$\gamma \lt 1$$.

For example, if the reward at each time step is a constant $$1$$ and $$\gamma \lt 1$$, then the return is \begin{equation*} G_{t}=\sum_{k=0}^{\infty }\gamma ^{k}=\frac{1}{1-\gamma }\text{.} \end{equation*}
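If you'd rather see this numerically than take it on faith, the sketch below adds up a finite number of terms of the series and compares the result to $$\frac{1}{1-\gamma}$$. The value $$\gamma = 0.9$$ and the number of terms are arbitrary choices for illustration.

```python
def partial_sum(gamma, n_terms):
    """Sum of gamma^k for k = 0, ..., n_terms - 1 (constant reward of 1)."""
    return sum(gamma ** k for k in range(n_terms))

gamma = 0.9
limit = 1.0 / (1.0 - gamma)  # closed-form value of the infinite sum

# The partial sums get closer and closer to the limit as we add terms.
for n_terms in (10, 100, 1000):
    print(n_terms, partial_sum(gamma, n_terms), limit)

approx = partial_sum(gamma, 1000)
```

The later terms shrink so quickly that, after enough of them, adding more barely changes the total.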

This infinite sum yields a finite result. If you want to understand this concept more deeply, then research infinite series convergence. For our purposes though, you’re free to just trust the fact that this is true, and understand that the infinite sum of discounted rewards is finite if the conditions we outlined are met.

#### Wrapping up

Now we should have a good feel for the discounted return. The main take-away here is that it is the agent's objective to maximize the expected discounted return of rewards. While the agent does consider all of the expected future rewards when selecting an action, the discount rate means that the more immediate rewards influence the agent more than rewards that are expected to be received further out.

Next time we’ll be building on the ideas from our introduction to MDPs and discounted return to see how we can measure “how good” any particular state or any particular action is for the agent. I’ll see ya in the next one!
