What do Reinforcement Learning Algorithms Learn - Optimal Policies

What do reinforcement learning algorithms learn?

What's up, guys? In this post, we're going to focus on what it is exactly that reinforcement learning algorithms learn: optimal policies. Let's get to it!

Policies and value functions review

We're going to be learning about optimal policies, and in turn, we'll also learn about optimal value functions. Before discussing this new topic of optimality in detail, let's have a quick rundown of policies and value functions in general.

Recall that last time, we got acquainted with value functions, which tell us how good a given state or a given state-action pair is for an agent in terms of expected return. We also talked about policies, which map each state in the state space to the probabilities of taking each possible action from that state.

We brought these two ideas together by discussing how value functions are defined in terms of policies, where the value of a state \(s\) under a given policy \(\pi\) is the expected return from starting in state \(s\) and following \(\pi\) thereafter.
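For reference, this definition can be written out in the same notation used later in this post. The symbols \(G_t\) (the discounted return from time \(t\)) and \(E_{\pi}\) (the expectation when the agent follows \(\pi\)) are standard conventions assumed here rather than taken from the text above.

\begin{equation*} v_{\pi}(s) = E_{\pi}\left[ G_t \mid S_t = s \right], \qquad G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \end{equation*}

Likewise, the value of a state-action pair under \(\pi\) would be \(q_{\pi}(s,a) = E_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]\).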

Optimality

It is the goal of reinforcement learning algorithms to find a policy that will yield a lot of reward for the agent if the agent indeed follows that policy. Specifically, reinforcement learning algorithms seek a policy whose expected return is at least as large as that of every other policy.

Optimal policy

In terms of return, a policy \(\pi\) is considered to be better than or the same as policy \(\pi^\prime\) if the expected return of \(\pi\) is greater than or equal to the expected return of \(\pi^\prime\) for all states. In other words,

\begin{equation*} \pi \geq \pi^\prime \text{ if and only if } v_{\pi}(s) \geq v_{\pi^\prime}(s) \text{ for all } s\in\boldsymbol{S}\text{.} \end{equation*}

Remember, \(v_{\pi}(s)\) gives the expected return for starting in state \(s\) and following \(\pi\) thereafter. A policy that is better than or at least the same as all other policies is called an optimal policy. There may be more than one optimal policy, but they all share the same value functions.
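To make this ordering concrete, here's a minimal sketch in Python that evaluates two policies and compares them state by state. The two-state environment, its rewards, the discount of 0.9, and the names `TRANSITIONS`, `evaluate_policy`, and `is_at_least_as_good` are all made up for illustration and don't come from this series.

```python
# Hypothetical toy environment (an assumption for illustration only).
# Each (state, action) pair leads deterministically to (next_state, reward).
GAMMA = 0.9
TRANSITIONS = {
    (0, "stay"): (0, 0.0),
    (0, "move"): (1, 1.0),
    (1, "stay"): (1, 2.0),
    (1, "move"): (0, 0.0),
}


def evaluate_policy(policy, gamma=GAMMA, tol=1e-10):
    """Estimate v_pi(s) for every state by repeated sweeps until values settle."""
    values = {state: 0.0 for state in policy}
    while True:
        delta = 0.0
        for state, action_probs in policy.items():
            new_value = 0.0
            for action, prob in action_probs.items():
                next_state, reward = TRANSITIONS[(state, action)]
                new_value += prob * (reward + gamma * values[next_state])
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:
            return values


def is_at_least_as_good(policy_a, policy_b):
    """True if v_a(s) >= v_b(s) for all states, i.e. policy_a >= policy_b."""
    v_a = evaluate_policy(policy_a)
    v_b = evaluate_policy(policy_b)
    return all(v_a[s] >= v_b[s] for s in v_a)


# A policy maps each state to a probability for each action.
pi_good = {0: {"move": 1.0}, 1: {"stay": 1.0}}
pi_lazy = {0: {"stay": 1.0}, 1: {"move": 1.0}}

print(evaluate_policy(pi_good))
print(evaluate_policy(pi_lazy))
print(is_at_least_as_good(pi_good, pi_lazy))
```

In this toy setup, `pi_good` is worth roughly 19 and 20 in the two states while `pi_lazy` is worth 0 in both, so `pi_good` is at least as good as `pi_lazy`. Note that the comparison can return `False` in both directions when each policy wins in a different state.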

Optimal state-value function

The optimal policy has an associated optimal state-value function. Recall, we covered state-value functions in detail last time. We denote the optimal state-value function as \(v_{*}\) and define it as \begin{equation*} v_{\ast }\left( s\right) =\max_{\pi }v_{\pi }\left( s\right) \end{equation*} for all \(s\in\boldsymbol{S}\text{.}\) In other words, \(v_{*}\) gives the largest expected return achievable by any policy \(\pi\) for each state.

Optimal action-value function

Similarly, the optimal policy has an optimal action-value function, or optimal Q-function, which we denote as \(q_{*}\) and define as \begin{equation*} q_{\ast }\left( s,a\right) =\max_{\pi }q_{\pi }\left( s,a\right) \end{equation*} for all \(s\in \boldsymbol{S}\) and \(a\in \boldsymbol{A}\left( s\right)\). In other words, \(q_{*}\) gives the largest expected return achievable by any policy \(\pi\) for each possible state-action pair.
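The two optimal value functions are tied together: acting optimally from state \(s\) means picking the action with the highest optimal Q-value. As a short worked relation, stated in the notation above:

\begin{equation*} v_{\ast }\left( s\right) =\max_{a\in \boldsymbol{A}\left( s\right) }q_{\ast }\left( s,a\right) \end{equation*}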

Bellman optimality equation

One fundamental property of \(q_*\) is that it must satisfy the following equation.

\begin{eqnarray*} q_{\ast }\left( s,a\right) &=&E\left[ R_{t+1}+\gamma \max_{a^{\prime }}q_{\ast }\left( s^\prime,a^{\prime }\right)\right] \end{eqnarray*}

This is called the Bellman optimality equation. It states that, for any state-action pair \((s,a)\) at time \(t\), the expected return from starting in state \(s\), selecting action \(a\) and following the optimal policy thereafter (AKA the Q-value of this pair) is going to be the expected reward we get from taking action \(a\) in state \(s\), which is \(R_{t+1}\), plus the maximum expected discounted return that can be achieved from any possible next state-action pair \((s^\prime,a^\prime)\).

Since the agent follows an optimal policy after taking action \(a\), once it lands in the next state \(s^\prime\) at time \(t+1\), it takes the best possible action \(a^\prime\) from there. This is why the maximum over \(a^\prime\) appears inside the expectation.
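One way to see the Bellman optimality equation at work is to apply its right-hand side over and over to a table of Q-values until the table stops changing. The sketch below does this (a technique usually called Q-value iteration) on a made-up two-state environment whose dynamics are fully known. That known model, the discount of 0.9, and the names `DYNAMICS`, `bellman_backup`, and `q_value_iteration` are assumptions for illustration, and this isn't necessarily the approach the series takes later.

```python
# Hypothetical toy environment (an assumption for illustration only).
# Each (state, action) pair maps to a list of (probability, next_state, reward).
GAMMA = 0.9
ACTIONS = ("stay", "move")
DYNAMICS = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],  # moving sometimes fails
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def bellman_backup(q, state, action, gamma=GAMMA):
    """Right-hand side of the equation: E[R + gamma * max_a' q(s', a')]."""
    return sum(
        prob * (reward + gamma * max(q[(next_state, a)] for a in ACTIONS))
        for prob, next_state, reward in DYNAMICS[(state, action)]
    )


def q_value_iteration(gamma=GAMMA, tol=1e-10):
    """Repeat the backup for every pair until the Q-table stops changing."""
    q = {pair: 0.0 for pair in DYNAMICS}
    while True:
        new_q = {(s, a): bellman_backup(q, s, a, gamma) for (s, a) in DYNAMICS}
        delta = max(abs(new_q[pair] - q[pair]) for pair in q)
        q = new_q
        if delta < tol:
            return q


q_star = q_value_iteration()
for pair, value in sorted(q_star.items()):
    print(pair, round(value, 3))
```

Once the loop stops, each entry of `q_star` matches its own backup up to the tolerance, which is exactly what the equation requires. In this toy environment, staying in state 1 ends up with a Q-value of about 20.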

We're going to see how we can use the Bellman optimality equation to find \(q_{*}\). Once we have \(q_{*}\), we can determine the optimal policy because, with \(q_{*}\), for any state \(s\), a reinforcement learning algorithm can find the action \(a\) that maximizes \(q_{*}(s,a)\).
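Reading a policy off a table of Q-values can be sketched in a few lines of Python. The table below is filled with made-up numbers, and the names `q_table` and `greedy_policy` are hypothetical. In practice the values would come from whatever algorithm estimated \(q_{*}\).

```python
# Hypothetical Q-values for two states and two actions (made-up numbers).
q_table = {
    ("s0", "left"): 1.0,
    ("s0", "right"): 2.5,
    ("s1", "left"): 0.3,
    ("s1", "right"): -1.0,
}


def greedy_policy(q):
    """For each state, pick the action with the largest Q-value.

    Ties go to the action that appears first in the table.
    """
    best = {}
    for (state, action), value in q.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}


policy = greedy_policy(q_table)
print(policy)
```

With these numbers, the resulting policy picks `right` in `s0` and `left` in `s1`.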

We're going to use the Bellman optimality equation a lot going forward, so it will become clearer as we progress.

Wrapping up

Keep me posted in the comments on how you're progressing so far, give a thumbs up to let us know you're learning, and be sure to check out the deeplizard hivemind for exclusive perks and rewards. Thanks for contributing to collective intelligence, and I'll see ya in the next one!

Sources

Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto: http://incompleteideas.net/book/RLbook2020.pdf
Playing Atari with Deep Reinforcement Learning by DeepMind Technologies: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Yoshua Bengio's TEDx Talk: https://www.youtube.com/watch?v=uawLjkSI7Mo

Hey, we're Chris and Mandy, the creators of deeplizard!
