Policies and value functions
What’s up, guys? In this post, we’re going to pick up where we left off with Markov Decision Processes and discuss the topics of policies and value functions. This will give us a way to measure “how good” it is for an agent to be in a given state or to select a given action, so let’s get to it!
How good is a state or action?
Last time, we discussed the general idea of MDPs and how an agent in an environment can perform actions and get rewarded for those actions.
With all the possible actions that an agent may be able to take in all the possible states of an environment, there are a couple of things that we might be interested in understanding.
First, we'd probably like to know how likely it is for an agent to take any given action from any given state. In other words, what is the probability that an agent will select a specific action from a specific state? This is where the notion of policies comes into play, and we'll expand on this in just a moment.
Secondly, in addition to understanding the probability of selecting an action, we'd probably also like to know how good a given action or a given state is for the agent. In terms of rewards, selecting one action over another in a given state may increase or decrease the agent's rewards, so knowing this in advance will probably help our agent out with deciding which actions to take in which states. This is where value functions become useful, and we'll also expand on this idea in just a bit.
| Question | Addressed by |
|---|---|
| How probable is it for an agent to select any action from a given state? | Policies |
| How good is any given action or any given state for an agent? | Value functions |
Policies
A policy is a function that maps a given state to probabilities of selecting each possible action from that state. We will use the symbol \(\pi\) to denote a policy.
When speaking about policies, formally we say that an agent “follows a policy.” For example, if an agent follows policy \(\pi\) at time \(t\), then \(\pi(a|s)\) is the probability that \(A_t=a\) if \(S_t=s\). This means that, at time \(t\), under policy \(\pi\), the probability of taking action \(a\) in state \(s\) is \(\pi(a|s)\).
Note that, for each state \(s \in \boldsymbol{S}\), \(\pi\) is a probability distribution over \(a \in \boldsymbol{A}(s)\).
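To make this concrete, here's a minimal sketch of a policy stored as a lookup table. The state names (`s0`, `s1`), the action names (`left`, `right`), and the probabilities are all made up for illustration, and the helper functions are hypothetical, not part of any library.

```python
import random

# Hypothetical policy: for each state, a probability for each available action.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def pi(action, state):
    """Return pi(a|s), the probability of selecting `action` in `state`."""
    return policy[state].get(action, 0.0)

def sample_action(state, rng=random):
    """Draw an action A_t according to pi(.|S_t = state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

def is_valid_policy(table, tol=1e-9):
    """Check that pi(.|s) is a probability distribution for every state."""
    return all(
        all(p >= 0 for p in dist.values())
        and abs(sum(dist.values()) - 1.0) < tol
        for dist in table.values()
    )
```

Here `pi("right", "s0")` plays the role of \(\pi(a|s)\), and `is_valid_policy` checks that the probabilities for each state are non-negative and sum to one.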
Value functions
Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.
This notion of how good a state or state-action pair is gets expressed in terms of expected return. Remember, the rewards an agent expects to receive depend on what actions the agent takes in given states. So, value functions are defined with respect to specific ways of acting. Since the way an agent acts is determined by the policy it's following, value functions are defined with respect to policies.
State-value function
The state-value function for policy \(\pi\), denoted as \(v_\pi\), tells us how good any given state is for an agent following policy \(\pi\). In other words, it gives us the value of a state under \(\pi\).
Formally, the value of state \(s\) under policy \(\pi\) is the expected return when starting from state \(s\) at time \(t\) and following policy \(\pi\) thereafter. Mathematically, we define \(v_\pi(s)\) as \begin{eqnarray*} v_{\pi }\left( s\right) &=&E_{\pi }\left[ G_{t}\mid S_{t}=s\right] \\ &=&E_{\pi }\left[ \sum_{k=0}^{\infty }\gamma ^{k}R_{t+k+1}\mid S_{t}=s\right] \text{.} \end{eqnarray*}
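One way to get a feel for this definition is to approximate the expectation by averaging sampled returns. The sketch below does that for a tiny made-up environment: a two-step chain `s0` → `s1` → `end` with actions `safe` and `risky`, a uniform random policy, and \(\gamma = 0.9\). All of these names and numbers are assumptions for illustration. Because every episode ends, the infinite sum reduces to a finite one.

```python
import random

GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical two-step chain: every action moves the agent one state forward.
NEXT_STATE = {"s0": "s1", "s1": "end"}
# Uniform random policy over the two made-up actions.
POLICY = {s: {"safe": 0.5, "risky": 0.5} for s in NEXT_STATE}

def discounted_return(rewards, gamma=GAMMA):
    """G_t = sum over k of gamma^k * R_{t+k+1}, for a finite reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def step(state, action, rng):
    """Return (reward, next_state): 'safe' pays 1, 'risky' pays 10 or 0."""
    if action == "safe":
        reward = 1.0
    else:
        reward = 10.0 if rng.random() < 0.5 else 0.0
    return reward, NEXT_STATE[state]

def sample_rewards(state, rng):
    """Follow POLICY from `state` until the episode ends; collect rewards."""
    rewards = []
    while state != "end":
        actions = list(POLICY[state])
        weights = [POLICY[state][a] for a in actions]
        action = rng.choices(actions, weights=weights)[0]
        reward, state = step(state, action, rng)
        rewards.append(reward)
    return rewards

def estimate_v(state, episodes=20000, seed=0):
    """Approximate v_pi(state) as the average of sampled returns."""
    rng = random.Random(seed)
    returns = [discounted_return(sample_rewards(state, rng))
               for _ in range(episodes)]
    return sum(returns) / episodes
```

For this toy environment, the expected reward per step is \(0.5 \cdot 1 + 0.5 \cdot 5 = 3\), so the exact values are \(v_\pi(s1) = 3\) and \(v_\pi(s0) = 3 + 0.9 \cdot 3 = 5.7\). The estimates from `estimate_v` should land close to these.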
Action-value function
Similarly, the action-value function for policy \(\pi\), denoted as \(q_\pi\), tells us how good it is for the agent to take any given action from a given state while following policy \(\pi\). In other words, it gives us the value of an action under \(\pi\).
Formally, the value of action \(a\) in state \(s\) under policy \(\pi\) is the expected return when starting from state \(s\) at time \(t\), taking action \(a\), and following policy \(\pi\) thereafter. Mathematically, we define \(q_\pi(s,a)\) as \begin{eqnarray*} q_{\pi }\left( s,a\right) &=&E_{\pi }\left[ G_{t}\mid S_{t}=s,A_{t}=a\right] \\ &=&E_{\pi }\left[ \sum_{k=0}^{\infty }\gamma ^{k}R_{t+k+1}\mid S_{t}=s,A_{t}=a\right] \text{.} \end{eqnarray*}
Conventionally, the action-value function \(q_\pi\) is referred to as the Q-function, and the output from the function for any given state-action pair is called a Q-value. The letter “Q” is used to represent the quality of taking a given action in a given state. We’ll be working with Q-value functions a lot going forward.
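As a rough illustration of Q-values, the sketch below estimates \(q_\pi(s,a)\) for every state-action pair by fixing the first action, following the policy afterward, and averaging the sampled returns. The environment is made up: a two-step chain `s0` → `s1` → `end`, actions `safe` and `risky`, a uniform random policy, and \(\gamma = 0.9\). The function names are hypothetical too.

```python
import random

GAMMA = 0.9  # discount factor (assumed value)
NEXT_STATE = {"s0": "s1", "s1": "end"}  # hypothetical two-step chain
ACTIONS = ("safe", "risky")

def step(state, action, rng):
    """Return (reward, next_state): 'safe' pays 1, 'risky' pays 10 or 0."""
    if action == "safe":
        reward = 1.0
    else:
        reward = 10.0 if rng.random() < 0.5 else 0.0
    return reward, NEXT_STATE[state]

def policy_action(state, rng):
    """Uniform random policy: pi(a|s) = 0.5 for both actions."""
    return rng.choice(ACTIONS)

def sample_return(state, first_action, rng):
    """Sample G_t given S_t = state and A_t = first_action, then follow pi."""
    g, discount, action = 0.0, 1.0, first_action
    while state != "end":
        reward, state = step(state, action, rng)
        g += discount * reward
        discount *= GAMMA
        if state != "end":
            action = policy_action(state, rng)
    return g

def estimate_q_table(episodes=20000, seed=0):
    """Approximate q_pi(s, a) for every pair by averaging sampled returns."""
    rng = random.Random(seed)
    q = {}
    for state in NEXT_STATE:
        for action in ACTIONS:
            total = sum(sample_return(state, action, rng)
                        for _ in range(episodes))
            q[(state, action)] = total / episodes
    return q
```

The returned dictionary maps each `(state, action)` pair to its estimated Q-value. In this toy setup the exact values are \(q_\pi(s0, \text{safe}) = 1 + 0.9 \cdot 3 = 3.7\) and \(q_\pi(s0, \text{risky}) = 5 + 0.9 \cdot 3 = 7.7\), so the estimates should come out near those numbers.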
Wrapping up
At this point, we now have an idea of the structure of MDPs, all the key components, and how, within an MDP, we can measure how good different states or different state-action pairs are for an agent through the use of value functions.
Reinforcement learning algorithms estimate value functions as a way to decide which actions are best for the agent to take. In the next post, we’ll continue this discussion by covering optimal value functions and optimal policies.
Keep me posted in the comments on how your understanding of reinforcement learning is progressing so far, and let me know what questions you have. I’ll see ya in the next one!