Reinforcement Learning - Developing Intelligent Agents

Deep Learning Course - Level: Advanced

Train Q-learning Agent with Python - Reinforcement Learning Code Project



Implementing Q-learning in code with Python

What's up, guys? As promised, in this post, we're going to write the code to implement our first reinforcement learning algorithm. Specifically, we'll implement the Q-learning algorithm to train an agent to play OpenAI Gym's Frozen Lake game that we introduced in the previous post, so let's get to it!

Last time, we went over the details for how the game Frozen Lake works, got our environments ready, and got our initial code written to create the Frozen Lake environment, initialize our Q-table, and configure our algorithm parameters.

Quick update

Speaking of those algorithm parameters, remember how we set our exploration_decay_rate to 0.01? Well, since the last post, I've done some further experimentation with this game and decided to change the exploration_decay_rate to 0.001.

I did this because I was seeing some inconsistent results with the larger decay rate. As a challenge, after finishing this post, why don't you try testing both of these decay rates repeatedly, compare the results, and let me know your thoughts on why you think the change was needed. And don't be shy, I wanna hear from you!

With all that behind us, we'll now get straight into implementing the algorithm.

Coding the Q-learning algorithm training loop

Let's start from the top.

First, we create this list to hold all of the rewards we'll get from each episode. This will be so we can see how our game score changes over time. We'll discuss this more in a bit.

rewards_all_episodes = []

In the following block of code, we'll implement the entire Q-learning algorithm we discussed in detail a couple of posts back. When this code is executed, this is exactly where the training will take place. The outer for-loop contains everything that happens within a single episode. The nested loop contains everything that happens for a single time-step.

# Q-learning algorithm
for episode in range(num_episodes):
    # initialize new episode params

    for step in range(max_steps_per_episode): 
        # Exploration-exploitation trade-off
        # Take new action
        # Update Q-table
        # Set new state
        # Add new reward        

    # Exploration rate decay   
    # Add current episode reward to total rewards list

For each episode

Let's get inside of our first loop. For each episode, we're going to first reset the state of the environment back to the starting state.

for episode in range(num_episodes):
    state = env.reset()[0]
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode): 

The done variable just keeps track of whether or not our episode is finished, so we initialize it to False when we first start the episode, and we'll see later where it will get updated to notify us when the episode is over.

Then, we need to keep track of the rewards within the current episode as well, so we set rewards_current_episode to 0 since we start out with no rewards at the beginning of each episode.

For each time-step

Now we're entering into the nested loop, which runs for each time-step within an episode. The remaining steps, until we say otherwise, will occur for each time-step.

Exploration vs. exploitation

for step in range(max_steps_per_episode): 

    # Exploration-exploitation trade-off
    exploration_rate_threshold = random.uniform(0, 1)
    if exploration_rate_threshold > exploration_rate:
        action = np.argmax(q_table[state,:]) 
    else:
        action = env.action_space.sample()

For each time-step within an episode, we set our exploration_rate_threshold to a random number between 0 and 1. This will be used to determine whether our agent will explore or exploit the environment in this time-step, and we discussed the details of this exploration-exploitation trade-off in a previous post of this series.

If the threshold is greater than the exploration_rate, which remember, is initially set to 1, then our agent will exploit the environment and choose the action that has the highest Q-value in the Q-table for the current state. If, on the other hand, the threshold is less than or equal to the exploration_rate, then the agent will explore the environment, and sample an action randomly.
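To see the rule in isolation, here's a minimal sketch of the same choice wrapped in a function. The helper name choose_action and the sample Q-values are made up for illustration, and it uses plain Python lists instead of NumPy so it runs on its own.

```python
import random

def choose_action(q_row, exploration_rate, rng=random):
    """Pick an action index for one state using the rule described above.

    q_row is that state's list of Q-values, one entry per action.
    """
    exploration_rate_threshold = rng.uniform(0, 1)
    if exploration_rate_threshold > exploration_rate:
        # Exploit: index of the largest Q-value (the first one wins on ties)
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: sample an action uniformly at random
    return rng.randrange(len(q_row))

rng = random.Random(0)        # seeded only so the demo is repeatable
q_row = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for a single state

greedy_picks = [choose_action(q_row, 0.0, rng) for _ in range(1000)]
random_picks = [choose_action(q_row, 1.0, rng) for _ in range(1000)]
print(set(greedy_picks))          # exploration_rate of 0: always the best action
print(sorted(set(random_picks)))  # exploration_rate of 1: a mix of all actions
```

With the exploration rate at 1, which is where training starts, every pick is random. As the rate decays, more and more picks come from the Q-table.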

Taking action

new_state, reward, done, truncated, info = env.step(action)

After our action is chosen, we then take that action by calling step() on our env object and passing our action to it. The function step() returns a tuple containing the new state, the reward for the action we took, whether or not the action ended our episode, whether the episode was truncated (cut short, for example by a time limit), and diagnostic information regarding our environment, which may be helpful for us if we end up needing to do any debugging.

Update the Q-value

After we observe the reward we obtained from taking the action from the previous state, we can then update the Q-value for that state-action pair in the Q-table. This is done using the formula we introduced in an earlier post, and remember, there we walked through a concrete example showing how to implement the Q-table update.

Here is the formula:

\begin{equation*} q^{new}\left( s,a\right) =\left( 1-\alpha \right) \underbrace{q\left( s,a\right) }_{\text{old value}}+\alpha \overbrace{\left( R_{t+1}+\gamma \max_{a^{\prime }}q\left( s^{\prime },a^{\prime }\right) \right) }^{\text{learned value}} \end{equation*}

And here is the same formula in code:

# Update Q-table for Q(s,a)
q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
    learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

So, remember, the new Q-value for this state-action pair is a weighted sum of our old value and the “learned value.” So we have our new Q-value equal to the old Q-value times one minus the learning rate, plus the learning rate times the “learned value,” which is the reward we just received from our last action plus the discounted estimate of the optimal future Q-value for the next state-action pair.
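If it helps to see the arithmetic with actual numbers, here's a small sketch of a single update. The function name updated_q_value and every number passed to it are illustrative assumptions, not values taken from a real training run.

```python
def updated_q_value(old_q, reward, max_next_q, learning_rate, discount_rate):
    """Weighted sum of the old Q-value and the learned value."""
    learned_value = reward + discount_rate * max_next_q
    return (1 - learning_rate) * old_q + learning_rate * learned_value

# Made-up inputs: old Q-value 0.5, reward 1, best next-state Q-value 0.8
new_q = updated_q_value(
    old_q=0.5, reward=1.0, max_next_q=0.8,
    learning_rate=0.1, discount_rate=0.99,
)
# 0.9 * 0.5 + 0.1 * (1.0 + 0.99 * 0.8) = 0.45 + 0.1792
print(round(new_q, 4))  # 0.6292
```

Notice that with a learning rate of 0.1, the old value keeps ninety percent of the weight, so a single lucky reward only nudges the Q-value rather than overwriting it.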

Transition to the next state

state = new_state
rewards_current_episode += reward 

Next, we set our current state to the new_state that was returned to us once we took our last action, and we then update the rewards from our current episode by adding the reward we received for our previous action.

if done == True: 
    break

We then check to see if our last action ended the episode for us, meaning, did our agent step in a hole or reach the goal? If the action did end the episode, then we jump out of this loop and move on to the next episode. Otherwise, we transition to the next time-step.

Exploration rate decay

# Exploration rate decay
exploration_rate = min_exploration_rate + \
    (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)

Once an episode is finished, we need to update our exploration_rate using exponential decay, which just means that the gap between the exploration rate and its minimum shrinks by the same proportion with every episode. We can decay the exploration_rate using the formula above, which makes use of all the exploration rate parameters that we defined last time.
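To get a feel for what the decay rate does, here's a quick sketch that evaluates the same formula at a few episode numbers for both 0.01 and 0.001. The function name is made up, and the minimum and maximum exploration rates of 0.01 and 1 are assumed here for illustration, so swap in your own values.

```python
import math

def decayed_exploration_rate(episode, decay_rate, min_rate=0.01, max_rate=1.0):
    """Exploration rate after the given episode, using the decay formula above."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)

for decay_rate in (0.01, 0.001):
    schedule = [
        round(decayed_exploration_rate(episode, decay_rate), 3)
        for episode in (0, 100, 1000, 5000)
    ]
    print(decay_rate, schedule)
```

Under these assumed values, a decay rate of 0.01 leaves the exploration rate sitting almost at its minimum by episode 1,000, while 0.001 still has it at roughly 0.37 at that point, so the agent keeps exploring for much longer.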


rewards_all_episodes.append(rewards_current_episode)

We then just append the rewards from the current episode to the list of rewards from all episodes, and that's it! We're good to move on to the next episode.
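Here's a sketch that stitches all of the steps from this section into one loop you can run without Gym installed. TinyCorridorEnv is a hypothetical stand-in environment invented for this example (it is not Frozen Lake), the parameter values are assumptions, and plain Python lists replace the NumPy Q-table.

```python
import math
import random

class TinyCorridorEnv:
    """Hypothetical stand-in for Frozen Lake: 5 cells in a row.

    The agent starts in cell 0. Reaching cell 4 gives a reward of 1 and ends
    the episode. Actions: 0 = left, 1 = right. reset() and step() only mimic
    the shape of the return values used in this post.
    """
    n_states, n_actions = 5, 2

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done, False, {}

random.seed(0)  # seeded only so the demo is repeatable
env = TinyCorridorEnv()
q_table = [[0.0] * env.n_actions for _ in range(env.n_states)]

# Assumed parameter values, chosen for this toy problem
num_episodes, max_steps_per_episode = 500, 50
learning_rate, discount_rate = 0.1, 0.99
exploration_rate, max_exploration_rate, min_exploration_rate = 1.0, 1.0, 0.01
exploration_decay_rate = 0.01
rewards_all_episodes = []

for episode in range(num_episodes):
    state = env.reset()[0]
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        # Exploration-exploitation trade-off
        if random.uniform(0, 1) > exploration_rate:
            action = max(range(env.n_actions), key=lambda a: q_table[state][a])
        else:
            action = random.randrange(env.n_actions)

        # Take new action
        new_state, reward, done, truncated, info = env.step(action)

        # Update Q-table for Q(s,a)
        q_table[state][action] = q_table[state][action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * max(q_table[new_state]))

        # Set new state and add new reward
        state = new_state
        rewards_current_episode += reward
        if done:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * math.exp(-exploration_decay_rate * episode)
    rewards_all_episodes.append(rewards_current_episode)

# Greedy action for every non-terminal cell
greedy_policy = [
    max(range(env.n_actions), key=lambda a: q_table[s][a])
    for s in range(env.n_states - 1)
]
print(greedy_policy)
print(sum(rewards_all_episodes[-100:]) / 100)
```

On a corridor this small, the greedy policy should typically end up preferring the move to the right in every cell. The point is just to have a loop you can step through before pointing the same code at Frozen Lake.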

After all episodes complete

After all episodes are finished, we now just calculate the average reward per thousand episodes from our list that contains the rewards for all episodes so that we can print it out and see how the rewards changed over time.

# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.16800000000000012
2000 :  0.32800000000000024
3000 :  0.46900000000000036
4000 :  0.5350000000000004
5000 :  0.6580000000000005
6000 :  0.6910000000000005
7000 :  0.6470000000000005
8000 :  0.6550000000000005
9000 :  0.6980000000000005
10000 :  0.7000000000000005

From this printout, we can see our average reward per thousand episodes did indeed progress over time. When the algorithm first started training, the first thousand episodes only averaged a reward of about 0.17, but by the time it got to its last thousand episodes, the average reward had improved drastically to 0.7.
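One thing to watch out for: np.split raises an error when the array can't be divided into equal parts, so the code above needs the number of episodes to be a multiple of 1000. Here's a sketch of a plain-Python alternative that also copes with a shorter final chunk. The function name and the synthetic rewards are made up for the demo.

```python
def average_reward_per_chunk(rewards, chunk_size=1000):
    """Return (episode_count, average_reward) pairs, one per chunk."""
    averages = []
    for start in range(0, len(rewards), chunk_size):
        chunk = rewards[start:start + chunk_size]
        averages.append((start + len(chunk), sum(chunk) / len(chunk)))
    return averages

# Synthetic rewards: 1 win in 4, then 3 wins in 4, then 500 straight wins
demo_rewards = [1, 0, 0, 0] * 250 + [1, 1, 1, 0] * 250 + [1] * 500

for count, average in average_reward_per_chunk(demo_rewards):
    print(count, ": ", average)
```

With 2,500 synthetic episodes, the last line of output covers only the final 500 of them, and the episode count in the first column makes that visible.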

Interpreting the training results

Let's take a second to understand how we can interpret these results. Our agent played 10,000 episodes. At each time step within an episode, the agent received a reward of 1 if it reached the frisbee. Otherwise, it received a reward of 0. If the agent did indeed reach the frisbee, then the episode finished at that time-step.

So, that means for each episode, the total reward received by the agent for the entire episode is either 1 or 0. So, for the first thousand episodes, we can interpret this score as meaning that roughly \(17\%\) of the time, the agent received a reward of 1 and won the episode. And by the last thousand episodes from a total of 10,000, the agent was winning \(70\%\) of the time.

By analyzing the grid of the game, we can see it's a lot more likely that the agent would fall in a hole or perhaps reach the max time steps than it is to reach the frisbee, so reaching the frisbee \(70\%\) of the time isn't too shabby, especially since the agent had no explicit instructions to reach the frisbee. It learned that this is the correct thing to do.


Lastly, we print out our updated Q-table to see how that has transitioned from its initial state of all zeros.

# Print updated Q-table
print(q_table)


[[0.57804676 0.51767675 0.50499139 0.47330103]
[0.07903519 0.16544989 0.16052137 0.45023559]
[0.37592905 0.18333739 0.18905787 0.17227745]
[0.01504804 0.         0.         0.        ]
[0.59422496 0.42787803 0.43837162 0.45604075]
[0.         0.         0.         0.        ]
[0.1814022  0.13794979 0.31651935 0.09308381]
[0.         0.         0.         0.        ]
[0.43529839 0.32298132 0.36007182 0.64475741]
[0.3369853  0.75303211 0.42246585 0.50627733]
[0.65743421 0.48185693 0.32179817 0.35823251]
[0.         0.         0.         0.        ]
[0.         0.         0.         0.        ]
[0.53127669 0.63965638 0.86112718 0.53141807]
[0.68753949 0.94078659 0.76545158 0.71566071]
[0.         0.         0.         0.        ]]
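A table of numbers like this is easier to read once we turn each row into the action the agent would pick greedily. Here's a sketch that does that for the values printed above. It assumes Frozen Lake's documented action encoding (0 = Left, 1 = Down, 2 = Right, 3 = Up) and the 4x4 grid layout, and the helper name greedy_action_name is made up.

```python
# Q-values copied from the printout above: one row per state, one column per action
q_table = [
    [0.57804676, 0.51767675, 0.50499139, 0.47330103],
    [0.07903519, 0.16544989, 0.16052137, 0.45023559],
    [0.37592905, 0.18333739, 0.18905787, 0.17227745],
    [0.01504804, 0.0, 0.0, 0.0],
    [0.59422496, 0.42787803, 0.43837162, 0.45604075],
    [0.0, 0.0, 0.0, 0.0],
    [0.1814022, 0.13794979, 0.31651935, 0.09308381],
    [0.0, 0.0, 0.0, 0.0],
    [0.43529839, 0.32298132, 0.36007182, 0.64475741],
    [0.3369853, 0.75303211, 0.42246585, 0.50627733],
    [0.65743421, 0.48185693, 0.32179817, 0.35823251],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.53127669, 0.63965638, 0.86112718, 0.53141807],
    [0.68753949, 0.94078659, 0.76545158, 0.71566071],
    [0.0, 0.0, 0.0, 0.0],
]

# Assumed action encoding for Frozen Lake
action_names = ["Left", "Down", "Right", "Up"]

def greedy_action_name(q_row):
    """Name of the highest-valued action, or '-' for a row that was never updated."""
    if not any(q_row):
        return "-"
    best_action = max(range(len(q_row)), key=lambda a: q_row[a])
    return action_names[best_action]

policy = [greedy_action_name(row) for row in q_table]

# Print the policy laid out like the 4x4 game grid
for grid_row in range(4):
    cells = policy[grid_row * 4:(grid_row + 1) * 4]
    print(" ".join(f"{name:>5}" for name in cells))
```

The rows that are still all zeros are states the agent never took an action from. On the default 4x4 map, those line up with the holes and the goal, since the episode ends as soon as the agent lands on one of them.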

Wrapping up

In the next post, we're going to get into the super fun part where we get to watch our trained agent play Frozen Lake. We'll get straight into the code for that then!

Until then, try to beat my score! Test and tune your parameters to try to get better than \(70\%\) wins in your last thousand episodes. Let me know in the comments what happens to your score when you tune your parameters, for better or worse, and what values you used! I'll see ya in the next one!



