Reinforcement Learning - Introducing Goal Oriented Intelligence

with deeplizard.

Policies and Value Functions - Good Actions for a Reinforcement Learning Agent

September 27, 2018

Policies and value functions

What’s up, guys? In this post, we’re going to pick up where we left off with Markov Decision Processes and discuss the topics of policies and value functions. This will give us a way to measure “how good” it is for an agent to be in a given state or to select a given action, so let’s get to it!

How good is a state or action?

Last time, we discussed the general idea of MDPs and how an agent in an environment can perform actions and get rewarded for those actions.

With all the possible actions that an agent may be able to take in all the possible states of an environment, there are a couple of things that we might be interested in understanding.

First, we'd probably like to know how likely it is for an agent to take any given action from any given state. In other words, what is the probability that an agent will select a specific action from a specific state? This is where the notion of policies comes into play, and we'll expand on this in just a moment.

Secondly, in addition to understanding the probability of selecting an action, we'd probably also like to know how good a given action or a given state is for the agent. In terms of rewards, selecting one action over another in a given state may increase or decrease the agent's rewards, so knowing this in advance will probably help our agent out with deciding which actions to take in which states. This is where value functions become useful, and we'll also expand on this idea in just a bit.

Question                                                                  | Addressed by
How probable is it for an agent to select any action from a given state? | Policies
How good is any given action or any given state for an agent?            | Value functions

Policies

A policy is a function that maps a given state to probabilities of selecting each possible action from that state. We will use the symbol \(\pi\) to denote a policy.

When speaking about policies, formally we say that an agent “follows a policy.” For example, if an agent follows policy \(\pi\) at time \(t\), then \(\pi(a|s)\) is the probability that \(A_t=a\) if \(S_t=s\). This means that, at time \(t\), under policy \(\pi\), the probability of taking action \(a\) in state \(s\) is \(\pi(a|s)\).

Note that, for each state \(s \in \boldsymbol{S}\), \(\pi\) is a probability distribution over \(a \in \boldsymbol{A}(s)\).
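To make this concrete, here's a minimal sketch in Python of what a stochastic policy might look like for a tiny, made-up environment. The states, actions, and probabilities below are purely illustrative, and numpy is assumed to be available.

```python
import numpy as np

# A minimal sketch of a stochastic policy for a tiny, hypothetical environment.
# pi[s][a] is the probability of selecting action a when in state s, i.e. pi(a|s).
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state):
    """Sample an action from the policy's distribution for the given state."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return np.random.choice(actions, p=probs)

# For each state, the action probabilities form a distribution and sum to 1.
for state, dist in pi.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(sample_action(pi, "s0"))  # e.g. 'right' about 80% of the time
```

Just as in the definition above, the probabilities listed for each state sum to \(1\), so each entry of the policy really is a probability distribution over the actions available in that state.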

Value functions

Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.

This notion of how good a state or state-action pair is gets expressed in terms of expected return. Remember, the rewards an agent expects to receive depend on what actions the agent takes in given states. So, value functions are defined with respect to specific ways of acting. Since the way an agent acts is influenced by the policy it's following, we can see that value functions are defined with respect to policies.

State-value function

The state-value function for policy \(\pi\), denoted as \(v_\pi\), tells us how good any given state is for an agent following policy \(\pi\). In other words, it gives us the value of a state under \(\pi\).

Formally, the value of state \(s\) under policy \(\pi\) is the expected return starting from state \(s\) at time \(t\) and following policy \(\pi\) thereafter. Mathematically, we define \(v_\pi(s)\) as \begin{align*} v_{\pi}(s) &= E_{\pi}\left[ G_{t} \mid S_{t}=s \right] \\ &= E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s \right] \text{.} \end{align*}
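As a quick sanity check on the math, here's a small sketch of the discounted return \(G_t\) that \(v_\pi(s)\) takes the expectation of. The reward sequence and discount rate below are made up for illustration.

```python
# A small sketch of the discounted return G_t that v_pi(s) is the expectation of.
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical rewards received after leaving state s at time t: R_{t+1}, R_{t+2}, ...
rewards_after_s = [1.0, 0.0, 2.0, 5.0]
print(discounted_return(rewards_after_s))  # 1.0 + 0.9*0 + 0.81*2.0 + 0.729*5.0 = 6.265
```

Since \(v_\pi(s)\) is an expectation, one simple way to approximate it is to average returns like this over many episodes that start from \(s\) and then follow \(\pi\).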

Action-value function

Similarly, the action-value function for policy \(\pi\), denoted as \(q_\pi\), tells us how good it is for the agent to take any given action from a given state while following policy \(\pi\). In other words, it gives us the value of an action under \(\pi\).

Formally, the value of action \(a\) in state \(s\) under policy \(\pi\) is the expected return starting from state \(s\) at time \(t\), taking action \(a\), and following policy \(\pi\) thereafter. Mathematically, we define \(q_\pi(s,a)\) as \begin{align*} q_{\pi}(s,a) &= E_{\pi}\left[ G_{t} \mid S_{t}=s, A_{t}=a \right] \\ &= E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s, A_{t}=a \right] \text{.} \end{align*}

Conventionally, the action-value function \(q_\pi\) is referred to as the Q-function, and the output from the function for any given state-action pair is called a Q-value. The letter “Q” is used to represent the quality of taking a given action in a given state. We’ll be working with Q-value functions a lot going forward.
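For small environments, Q-values are often simply stored in a table with one entry per state-action pair. Here's a minimal sketch of that idea; the states, actions, and values below are made up for illustration.

```python
# A minimal sketch of storing Q-values in a table, one entry per state-action pair.
q_table = {
    ("s0", "left"): 0.4,
    ("s0", "right"): 1.3,
    ("s1", "left"): 0.9,
    ("s1", "right"): 0.2,
}

def greedy_action(q, state, actions):
    """Pick the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])

print(q_table[("s0", "right")])                          # Q-value of one state-action pair
print(greedy_action(q_table, "s0", ["left", "right"]))   # 'right'
```

Looking up the action with the highest Q-value in a state, as `greedy_action` does here, is one way an agent can use the Q-function to decide how to act.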

Wrapping up

At this point, we have an idea of the structure of MDPs and all their key components, and we've seen how, within an MDP, we can measure how good different states or different state-action pairs are for an agent through the use of value functions.

Reinforcement learning algorithms estimate value functions as a way to determine the best actions for the agent to take. In the next post, we’ll continue this discussion by covering optimal value functions and optimal policies.

Keep me posted on how your understanding of reinforcement learning is progressing so far in the comments, let me know what questions you have, and I’ll see ya in the next one!
