Exploration vs. Exploitation - Learning the Optimal Reinforcement Learning Policy
Q-learning - Choosing actions with an epsilon greedy strategy
What's up, guys? Last time, we left our discussion of Q-learning with the question of how an agent chooses to either explore the environment or to exploit it in order to select its actions. To answer this question, we'll introduce a type of strategy called an epsilon greedy strategy, so let's get to it!
Make sure you're up-to-speed on all that we've already introduced about Q-learning in the last post, and refresh your memory on where exactly we were in the example lizard game that we were using to illustrate Q-learning. We'll be picking up with that now.
Review: The lizard game
Now, let's jump back into the lizard game.
Remember, in each episode, the agent (the lizard in our case) starts out by choosing an action from the starting state based on the current Q-value estimates in the Q-table. The lizard chooses its action based on the highest Q-value for the given state.
Since we know that all of the Q-values are first initialized to zero, there's no way for the lizard to differentiate between them at the starting state of the first episode. So, the question remains: what action does it start with? Furthermore, for subsequent states, is it really as straightforward as just selecting the action with the highest Q-value for the given state?
Additionally, we know that we need a balance of exploration and exploitation to choose our actions. This balance is achieved with an epsilon greedy strategy, so let's explore that now.
Epsilon greedy strategy
To get this balance between exploitation and exploration, we use what is called an epsilon greedy strategy. With this strategy, we define an exploration rate, denoted by epsilon (ε), that we initially set to 1. This exploration rate is the probability that our agent will explore the environment rather than exploit it. With ε set to 1, it is certain that the agent will start out by exploring the environment.
As the agent learns more about the environment, at the start of each new episode, ε decays by some rate that we choose, so that exploration becomes less and less likely as the agent learns more and more about the environment. The agent becomes greedy in terms of exploiting the environment once it has had the chance to explore it.
To determine whether the agent will choose exploration or exploitation at each time step, we generate a random number between 0 and 1. If this number is greater than ε, the agent chooses its next action via exploitation, meaning it picks the action with the highest Q-value for its current state from the Q-table. Otherwise, it chooses its next action via exploration, meaning it picks an action at random and sees what happens in the environment. In code, this decision looks like the following:
if random_num > epsilon:
    # choose action via exploitation
else:
    # choose action via exploration
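To make this concrete, here is a minimal runnable sketch of epsilon greedy action selection with a decaying exploration rate. The Q-table layout, the state and action names, and the specific decay value are illustrative assumptions, not something fixed by the algorithm itself.

import random

# Hypothetical Q-table: each state maps to a dict of action -> Q-value
q_table = {
    "start": {"left": 0.0, "right": 0.0, "up": 0.0, "down": 0.0},
}

epsilon = 1.0         # exploration rate, initially 1 so the agent starts out exploring
epsilon_decay = 0.99  # assumed per-episode decay factor, chosen only for illustration

def choose_action(state):
    # Generate a random number between 0 and 1 to decide explore vs. exploit
    random_num = random.uniform(0, 1)
    if random_num > epsilon:
        # Exploit: pick the action with the highest Q-value for this state
        return max(q_table[state], key=q_table[state].get)
    # Explore: pick a random action
    return random.choice(list(q_table[state]))

print(choose_action("start"))

# At the start of each new episode, we would decay epsilon:
epsilon = epsilon * epsilon_decay

With ε starting at 1, the first calls always take the exploration branch; as ε decays toward 0 over the episodes, the exploitation branch gradually takes over.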
Choosing an action
So, recall, we first started talking about the exploration-exploitation trade-off last time because we were discussing how the lizard should choose its very first action, given that all the actions have a Q-value of 0 at that point.
Well, now we should know that the action will be chosen randomly via exploration, since our exploration rate is initially set to 1. In other words, the lizard starts out by exploring the environment rather than exploiting it.
Alright, so after the lizard takes an action, it observes the next state, the reward gained from its action, and updates the Q-value in the Q-table for the action it took from the previous state.
Let's suppose the lizard chooses to move right as its action from the starting state. The lizard then observes the reward it receives for landing on that new tile, as given by the reward scheme we defined for the game.
Updating the Q-value
To update the Q-value for the action of moving right taken from the previous state, we use the Bellman equation that we highlighted previously:
q_*(s, a) = E[ R_{t+1} + γ * max_{a'} q_*(s', a') ]
We want to make the Q-value for the given state-action pair as close as we can to the right hand side of the Bellman equation so that the Q-value will eventually converge to the optimal Q-value q_*(s, a).
This will happen over time by iteratively comparing the loss between the Q-value and the optimal Q-value for the given state-action pair and then updating the Q-value over and over again each time we encounter this same state-action pair to reduce the loss.
To actually see how we update the Q-value, we first need to introduce the idea of a learning rate.
The learning rate
The learning rate is a number between 0 and 1 that can be thought of as how quickly the agent abandons the previous Q-value in the Q-table for a given state-action pair in favor of a newly calculated one.
So, for example, suppose we have a Q-value in the Q-table for some arbitrary state-action pair that the agent has experienced in a previous time step. Well, if the agent experiences that same state-action pair at a later time step once it's learned more about the environment, the Q-value will need to be updated to reflect the change in expectations the agent now has for the future returns.
We don't want to just overwrite the old Q-value, but rather, we use the learning rate as a tool to determine how much information we keep about the previously computed Q-value for the given state-action pair versus the new Q-value calculated for the same state-action pair at a later time step. We'll denote the learning rate with the Greek letter alpha (α).
The higher the learning rate, the more quickly the agent will adopt the new Q-value. For example, if the learning rate is 1, the estimate for the Q-value of a given state-action pair would simply be the newly calculated value, and the previously computed Q-value would not be considered at all.
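To see how this weighting plays out numerically (using the update formula introduced in the next section), here is a tiny illustration. The value 0.7 for α and the two Q-value estimates are made-up numbers, chosen purely for demonstration.

alpha = 0.7            # assumed learning rate, chosen only for illustration
old_q = 2.0            # previously stored Q-value (made-up number)
learned_value = 5.0    # newly computed target value (made-up number)

# The new Q-value is a weighted sum of the old value and the learned value
new_q = (1 - alpha) * old_q + alpha * learned_value
print(new_q)           # 0.3 * 2.0 + 0.7 * 5.0 = 4.1

# With alpha = 1, the old value is discarded entirely
print((1 - 1) * old_q + 1 * learned_value)  # 5.0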
Calculating the new Q-value
The formula for calculating the new Q-value for state-action pair (s, a) at time t is:
q_new(s, a) = (1 - α) * q(s, a) + α * (R_{t+1} + γ * max_{a'} q(s', a'))
where α is the learning rate, R_{t+1} is the reward received for taking action a from state s, γ is the discount rate, and the max term is the highest Q-value among all actions available in the next state s'.
So, our new Q-value is equal to a weighted sum of our old value and the learned value. The old value in our case is 0, since this is the first time the agent is experiencing this particular state-action pair, and we multiply this old value by (1 - α).
Our learned value is the reward the agent receives from moving right from the starting state plus the discounted estimate of the optimal future Q-value for the next state-action pair (s', a') at time t + 1. This entire learned value is then multiplied by our learning rate α.
All of the math for this calculation of our concrete example, the state-action pair of moving right from the starting state, works out once we plug in the reward the lizard received along with whatever values we've chosen for the discount rate γ and the learning rate α.
Let's pause for a moment and focus on the term max_{a'} q(s', a'), the highest Q-value among the actions available in the next state. Since we're still on the very first time step of the very first episode, no Q-values have been updated yet, so every entry for the next state is still 0, and this whole term is 0.
Now, we can substitute the value 0 in for that term, and the new Q-value simplifies to (1 - α) * 0 + α * (reward + γ * 0), which is just α times the reward the lizard received for moving right.
Alright, so now we'll take this new Q-value we just calculated and store it in our Q-table for this particular state-action pair.
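Here is a minimal sketch of this update step in code. The q_table structure, the state and action labels, and the sample values for alpha, gamma, and the reward are all illustrative assumptions rather than values fixed by the post.

# Q-learning update: blend the old Q-value with the newly learned value
def update_q_value(q_table, state, action, reward, next_state, alpha, gamma):
    old_value = q_table[state][action]
    # Highest Q-value among all actions available in the next state
    max_next_q = max(q_table[next_state].values())
    learned_value = reward + gamma * max_next_q
    # Weighted sum of the old value and the learned value
    q_table[state][action] = (1 - alpha) * old_value + alpha * learned_value

# Hypothetical usage with made-up values
actions = ["left", "right", "up", "down"]
q_table = {s: {a: 0.0 for a in actions} for s in ["start", "tile_right"]}
update_q_value(q_table, "start", "right", reward=-1, next_state="tile_right",
               alpha=0.7, gamma=0.99)
print(q_table["start"]["right"])  # (1 - 0.7) * 0 + 0.7 * (-1 + 0.99 * 0) = -0.7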
We've now done everything needed for a single time step. This same process will happen for each time step until termination in each episode.
Once the Q-function converges to the optimal Q-function, we will have our optimal policy.
Max steps
Oh, and speaking of termination, we can also specify a max number of steps that our agent can take before the episode auto-terminates. With the way the game is set up right now, termination will only occur if the lizard reaches the state with five crickets or the state with the bird.
We could define some condition that states that if the lizard hasn't reached termination via either one of these two states after some maximum number of steps that we choose, then the episode terminates once that step count is reached.
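A minimal sketch of how such a cap might look in code is below. The max_steps_per_episode value and the stand-in for detecting a terminal state are assumptions made purely for illustration.

import random

max_steps_per_episode = 100  # assumed cap, chosen only for illustration

for step in range(max_steps_per_episode):
    # In the real game we would choose an action, observe the reward and
    # next state, and update the Q-value here.
    episode_done = random.random() < 0.05  # stand-in for reaching the crickets or the bird
    if episode_done:
        break
# Whether we break early or exhaust the loop, the episode ends here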
Wrapping up
Now, I know that was a lot, so I've summarized all of the steps for Q-learning we went through in the previous post and in this post! Again, go ahead and pause, check this out, let it simmer, and make sure you've got it. A minimal code sketch of these steps follows the list.
- Initialize all Q-values in the Q-table to 0.
- For each time step in each episode:
  - Choose an action (considering the exploration-exploitation trade-off).
  - Observe the reward and next state.
  - Update the Q-value function (using the formula we gave that will, over time, make the Q-value function converge to the right-hand side of the Bellman equation).
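Here is a minimal end-to-end sketch that strings these steps together for a toy grid like the lizard game. The environment class, its state and reward definitions, and every numeric setting (number of episodes, decay rate, alpha, gamma, step cap) are assumptions made purely for illustration; the next post walks through a real implementation.

import random

class ToyLizardEnv:
    # A made-up stand-in environment: states are tile names, actions move between them
    def __init__(self):
        self.states = ["start", "empty", "one_cricket", "five_crickets", "bird"]
        self.actions = ["left", "right", "up", "down"]

    def reset(self):
        return "start"

    def step(self, state, action):
        # Randomly pick a next state to keep the sketch short; a real environment
        # would move the lizard deterministically around the grid
        next_state = random.choice(self.states)
        rewards = {"start": -1, "empty": -1, "one_cricket": 1,
                   "five_crickets": 10, "bird": -10}
        done = next_state in ("five_crickets", "bird")
        return next_state, rewards[next_state], done

env = ToyLizardEnv()

# Step 1: initialize all Q-values in the Q-table to 0
q_table = {s: {a: 0.0 for a in env.actions} for s in env.states}

alpha, gamma = 0.7, 0.99             # assumed learning and discount rates
epsilon, epsilon_decay = 1.0, 0.99   # assumed exploration settings
num_episodes, max_steps = 1000, 100  # assumed training lengths

for episode in range(num_episodes):
    state = env.reset()
    for step in range(max_steps):
        # Step 2: choose an action (exploration-exploitation trade-off)
        if random.uniform(0, 1) > epsilon:
            action = max(q_table[state], key=q_table[state].get)
        else:
            action = random.choice(env.actions)

        # Step 3: observe the reward and next state
        next_state, reward, done = env.step(state, action)

        # Step 4: update the Q-value with the weighted-sum formula
        old_value = q_table[state][action]
        learned_value = reward + gamma * max(q_table[next_state].values())
        q_table[state][action] = (1 - alpha) * old_value + alpha * learned_value

        state = next_state
        if done:
            break

    # Decay epsilon so later episodes lean more on exploitation
    epsilon *= epsilon_decay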
In the next post, we're going to see how we can implement this Q-learning algorithm step-by-step in code using Python to play a simple game. We'll have lots more to talk about there.
Let me know in the comments if you're stoked to actually start getting your hands dirty to implement some of this stuff! And be sure to leave a thumbs up if you are. Thanks for contributing to collective intelligence, and I'll see ya in the next one!