Policies and Value Functions

video

expand_more

text

expand_more

What's up, guys? In this post, we're going to pick up where we left off with Markov Decision Processes and discuss the topics of policies and value functions. This will give us a way to measure “how good” it is for an agent to be in a given state or to select a given action, so let's get to it!

How good is a state or action?

Last time, we discussed the general idea of MDPs and how an agent in an environment can perform actions and get a rewarded for those actions.

With all the possible actions that an agent may be able to take in all the possible states of an environment, there are a couple of things that we might be interested in understanding.

First, we'd probably like to know how likely it is for an agent to take any given action from any given state. In other words, what is the probability that an agent will select a specific action from a specific state? This is where the notion of policies come into play, and we'll expand on this in just a moment.

Secondly, in addition to understanding the probability of selecting an action, we'd probably also like to know how good a given action or a given state is for the agent. In terms of rewards, selecting one action over another in a given state may increase or decrease the agent's rewards, so knowing this in advance will probably help our agent out with deciding which actions to take in which states. This is where value functions become useful, and we'll also expand on this idea in just a bit.

Question	Addressed by
How probable is it for an agent to select any action from a given state?	Policies
How good is any given action or any given state for an agent?	Value functions

Policies

A policy is a function that maps a given state to probabilities of selecting each possible action from that state. We will use the symbol \(\pi\) to denote a policy.

When speaking about policies, formally we say that an agent “follows a policy.” For example, if an agent follows policy \(\pi\) at time \(t\), then \(\pi(a|s)\) is the probability that \(A_t=a\) if \(S_t=s\). This means that, at time \(t\), under policy \(\pi\), the probability of taking action \(a\) in state \(s\) is \(\pi(a|s)\).

Note that, for each state \(s \in \boldsymbol{S}\), \(\pi\) is a probability distribution over \(a \in \boldsymbol{A}(s)\).

Value functions

Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.

This notion of how good a state or state-action pair is is given in terms of expected return. Remember, the rewards an agent expects to receive are dependent on what actions the agent takes in given states. So, value functions are defined with respect to specific ways of acting. Since the way an agent acts is influenced by the policy it's following, then we can see that value functions are defined with respect to policies.

State-value function

The state-value function for policy \(\pi\), denoted as \(v_\pi\), tells us how good any given state is for an agent following policy \(\pi\). In other words, it gives us the value of a state under \(\pi\).

Formally, the value of state \(s\) under policy \(\pi\) is the expected return from starting from state \(s\) at time \(t\) and following policy \(\pi\) thereafter. Mathematically we define \(v_\pi(s)\) as \begin{eqnarray*} v_{\pi }\left( s\right) &=&E_{\pi}\left[ \rule[-0.05in]{0in}{0.2in}G_{t}\mid S_{t}=s\right] \\ &=&E_{\pi }\left[ \sum_{k=0}^{\infty }\gamma ^{k}R_{t+k+1}\mid S_{t}=s\right] \text{.} \end{eqnarray*}

Action-value function

Similarly, the action-value function for policy \(\pi\), denoted as \(q_\pi\), tells us how good it is for the agent to take any given action from a given state while following policy \(\pi\). In other words, it gives us the value of an action under \(\pi\).

Formally, the value of action \(a\) in state \(s\) under policy \(\pi\) is the expected return from starting from state \(s\) at time \(t\), taking action \(a\), and following policy \(\pi\) thereafter. Mathematically, we define \(q_\pi(s,a)\) as

\begin{eqnarray*} q_{\pi }\left( s,a\right) &=&E_{\pi }\left[ G_{t}\mid S_{t}=s,A_{t}=a \rule[-0.05in]{0in}{0.2in}\right] \\ &=&E_{\pi }\left[ \sum_{k=0}^{\infty }\gamma ^{k}R_{t+k+1}\mid S_{t}=s,A_{t}=a\right] \text{.} \end{eqnarray*}

Conventionally, the action-value function \(q_\pi\) is referred to as the Q-function, and the output from the function for any given state-action pair is called a Q-value. The letter “ Q” is used to represent the quality of taking a given action in a given state. We'll be working with Q-value functions a lot going forward.

Wrapping up

At this point, we now have an idea of the structure of MDPs, all the key components, and how, within an MDP, we can measure how good different states or different state-action pairs are for an agent through the use of value functions.

Reinforcement learning algorithms estimate value functions as a way to determine best routes for the agent to take. In the next post, we'll continue this discussion by covering optimal value functions and optimal policies.

Keep me posted on how your understanding of reinforcement learning is progressing so far in the comments, let me know what questions you have, and I'll see ya in the next one!

quiz

expand_more

resources

expand_more

Welcome back to this series on reinforcement learning! In this video, we're going to pick up where we left off with Markov Decision Processes and discuss the topics of policies and value functions. This will give us a way to measure “how good” it is for an agent to be in a given state or to select a given action. Sources: Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Bartow http://incompleteideas.net/book/RLbook2020.pdf Playing Atari with Deep Reinforcement Learning by Deep Mind Technologies https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf 🕒🦎 VIDEO SECTIONS 🦎🕒 00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources 00:30 Help deeplizard add video timestamps - See example in the description 06:22 Collective Intelligence and the DEEPLIZARD HIVEMIND 💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥 👋 Hey, we're Chris and Mandy, the creators of deeplizard! 👀 CHECK OUT OUR VLOG: 🔗 https://youtube.com/deeplizardvlog 💪 CHECK OUT OUR FITNESS CHANNEL: 🔗 https://www.youtube.com/channel/UCdCxHNCexDrAx78VfAuyKiA 🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order: 🔗 https://neurohacker.com/shop?rfsn=6488344.d171c6 ❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind: Mano Prime 👀 Follow deeplizard: Our vlog: https://youtube.com/deeplizardvlog Fitness: https://www.youtube.com/channel/UCdCxHNCexDrAx78VfAuyKiA Facebook: https://facebook.com/deeplizard Instagram: https://instagram.com/deeplizard Twitter: https://twitter.com/deeplizard Patreon: https://patreon.com/deeplizard YouTube: https://youtube.com/deeplizard 🎓 Deep Learning with deeplizard: AI Art for Beginners - https://deeplizard.com/course/sdcpailzrd Deep Learning Dictionary - https://deeplizard.com/course/ddcpailzrd Deep Learning Fundamentals - https://deeplizard.com/course/dlcpailzrd Learn TensorFlow - https://deeplizard.com/course/tfcpailzrd Learn PyTorch - https://deeplizard.com/course/ptcpailzrd Natural Language Processing - https://deeplizard.com/course/txtcpailzrd Reinforcement Learning - https://deeplizard.com/course/rlcpailzrd Generative Adversarial Networks - https://deeplizard.com/course/gacpailzrd Stable Diffusion Masterclass - https://deeplizard.com/course/dicpailzrd 🎓 Other Courses: DL Fundamentals Classic - https://deeplizard.com/learn/video/gZmobeGL0Yg Deep Learning Deployment - https://deeplizard.com/learn/video/SI1hVGvbbZ4 Data Science - https://deeplizard.com/learn/video/d11chG7Z-xk Trading - https://deeplizard.com/learn/video/ZpfCK_uHL9Y 🛒 Check out products deeplizard recommends on Amazon: 🔗 https://amazon.com/shop/deeplizard 📕 Get a FREE 30-day Audible trial and 2 FREE audio books using deeplizard's link: 🔗 https://amzn.to/2yoqWRn 🎵 deeplizard uses music by Kevin MacLeod 🔗 https://youtube.com/channel/UCSZXFhRIx6b0dFX3xS8L1yQ ❤️ Please use the knowledge gained from deeplizard content for good, not evil.

updates

expand_more

DEEPLIZARD Message notifications

Update history for this page

Did you know you that deeplizard content is regularly updated and maintained?

Updated
Maintained

Spot something that needs to be updated? Don't hesitate to let us know. We'll fix it!

All relevant updates for the content on this page are listed below.

Reinforcement Learning - Developing Intelligent Agents