Reinforcement Learning - Introducing Goal Oriented Intelligence

with deeplizard.

Build Deep Q-Network - Reinforcement Learning Code Project

July 9, 2019

Building a Deep Q-Network in Code

Welcome back to this series on reinforcement learning! In this episode, we’ll get started with building our deep Q-network to be able to perform in the cart and pole environment. Let’s get to it!

After following the environment prep we covered last time, we’re now ready to start writing our code. We’ll be making use of everything we’ve learned about deep Q-networks so far, including the topics of experience replay, fixed Q-targets, and epsilon greedy strategies, to develop our code.

We’ll use the final summary of the DQN training process below, which we discussed in an earlier episode, to guide our understanding while developing our code. Make sure you’ve familiarized yourself with these fundamental concepts first so that you have a solid grasp of why we’re doing what we’re doing in the upcoming code.

  1. Initialize replay memory capacity.
  2. Initialize the policy network with random weights.
  3. Clone the policy network, and call it the target network.
  4. For each episode:
    1. Initialize the starting state.
    2. For each time step:
      1. Select an action.
        • Via exploration or exploitation
      2. Execute selected action in an emulator.
      3. Observe reward and next state.
      4. Store experience in replay memory.
      5. Sample random batch from replay memory.
      6. Preprocess states from batch.
      7. Pass batch of preprocessed states to policy network.
      8. Calculate loss between output Q-values and target Q-values.
        • Requires a pass to the target network for the next state
      9. Gradient descent updates weights in the policy network to minimize loss.
        • After \(x\) time steps, weights in the target network are updated to the weights in the policy network.

Also, remember that I mentioned last time that we’ll be using PyTorch to train our DQN. I just wanted to quickly mention that the PyTorch code we use can be adapted to whatever other neural network API you may want to use as well; the code and implementation should be easily generalizable.

Just a quick announcement before getting to the code: recall that last time we also discussed how the code we’d be writing would be based on the original PyTorch deep Q-network code available on PyTorch’s website, with just some minor tweaks and modifications of my own. Since the last episode, though, I’ve spent more time going over the code and decided on several more changes that will differ from the original tutorial on PyTorch’s site. I just wanted to give you a heads-up on that since there will now be considerably more differences than what I originally alluded to last time.

Without further ado, let’s jump into it!

Code set up

Import libraries

%matplotlib inline
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple
from itertools import count
from PIL import Image
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T    

As expected, the first thing we’re doing is importing all of the libraries we’re going to be making use of. We’ve got gym and some PyTorch modules here plus many standard libraries like numpy, matplotlib, random, and a few others.

Set up display

Next, we import IPython’s display module to aid us in plotting images to the screen later.

is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython: from IPython import display

Now that we’ve gotten past the overarching initial set up, we can move on to implementing some of the concepts we’ve been discussing throughout this series.

I’ve organized this code in a very object oriented way, which I think makes things a lot easier to understand. We’re going to start out by covering all of the classes and functions we need to create, and then at the end, we’ll see the use of all the classes and functions come into play in our main program.

Deep Q-network

Let’s start first with our deep Q-network. This is where PyTorch comes into play. To build a neural network in PyTorch, we use the torch.nn package, which we gave the alias nn when we imported it earlier. This package contains all of the typical components needed to build neural networks.

Within the nn package, there is a class called Module. Module is the base class for all neural network modules, and so our network and all of its layers will extend the nn.Module class.

We define our DQN as a class that extends nn.Module. Our DQN will receive screenshot-like images of the cart and pole environment as input, so to create a DQN object, we’ll require the height and width of the image input that will be coming into this model.

class DQN(nn.Module):
    def __init__(self, img_height, img_width):
        super().__init__()
            
        self.fc1 = nn.Linear(in_features=img_height*img_width*3, out_features=24)   
        self.fc2 = nn.Linear(in_features=24, out_features=32)
        self.out = nn.Linear(in_features=32, out_features=2)

To start out with a very simple network, our network will consist only of two fully connected hidden layers, and an output layer. PyTorch refers to fully connected layers as Linear layers.

Our first Linear layer accepts input with dimensions equal to the passed-in img_height times img_width times 3. The 3 corresponds to the three color channels from the RGB images that will be received by the network as input.

This first Linear layer will have 24 outputs, and therefore our second Linear layer will accept 24 inputs. Our second layer will have 32 outputs, and lastly, our output layer will have 32 inputs from the previous layer, and will have 2 outputs.

In our particular cart and pole example, remember that the network will be outputting the Q-values that correspond to each possible action that the agent can take from a given state. Our only available actions are to move right or to move left; therefore, the number of outputs will be equal to two.

As you can see, this architecture is being built within the DQN class constructor, and we’ve given the arbitrary names fc1 and fc2 to the two fully connected layers, and out to the output layer.

Also, note that this network is pretty arbitrary and also very basic. It doesn’t even contain any convolutional layers. I wanted to start out with something very straightforward at first, and then once we see how this network performs, we can start tuning the architecture and experimenting with different variations.

The last thing we have to do for our DQN class is to define a function called forward(). This function will implement a forward pass to the network. Note that all PyTorch neural networks require an implementation of forward().

def forward(self, t):
    t = t.flatten(start_dim=1)
    t = F.relu(self.fc1(t))
    t = F.relu(self.fc2(t))
    t = self.out(t)
    return t

For any particular image tensor, t, passed to the network, t will first need to be flattened before it can be passed to the first fully connected layer. After this, t will be passed to the first fully connected layer and then have relu applied to it. Then, this result will be passed to the second fully connected layer, and again have relu applied. This result will then be passed to the output layer. The result from the output layer will be returned by the forward() function.
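
As a quick sanity check, we can pass a random image-shaped tensor through the network and confirm that we get one Q-value per action back. Note that this snippet isn’t part of the original code, and the \(40 \times 90\) screen size is just an assumed example.

# Quick sanity check (assumed example dimensions, not part of the original code):
# one random RGB "screen" of height 40 and width 90 should produce one Q-value
# per action, i.e. an output of shape [1, 2].
net = DQN(img_height=40, img_width=90)
dummy_screen = torch.rand(1, 3, 40, 90) # (batch, channels, height, width)
net(dummy_screen).shape
> torch.Size([1, 2])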

If this is your first time being exposed to PyTorch and you want to go deeper into understanding the steps that we just covered to build a network, be sure to check out our PyTorch series, where all of this is covered in complete and thorough detail. Otherwise, if you’re at all shaky on the fundamental concepts of forward passes, relu, or layer inputs and outputs, then you’ll definitely want to spend some time on the Deep Learning Fundamentals series.

Experience

Now that we have our network, let’s move on to experiences. Recall that experiences from replay memory are what we’ll use to train our network. To create experiences, we’ll create a class called Experience. This class will be used to create instances of Experience objects that will get stored in and sampled from replay memory later.

Experience = namedtuple(
    'Experience',
    ('state', 'action', 'next_state', 'reward')
)

As you can see, we’re creating this class by calling namedtuple(), which is a Python function for creating tuples with named fields.

Here, namedtuple() is returning a new tuple subclass named Experience, which is specified by our first argument. This new Experience class will be used to create tuple-like objects that have the fields state, action, next state, and reward. Remember, these are the exact fields that we previously discussed, which make up an individual experience.

Let’s show a quick example of an Experience object.

e = Experience(2,3,1,4)    

We’ll set e equal to an instance of the Experience class and pass in the parameters 2, 3, 1, 4. Given the way we set up the Experience class, 2 will be the state of experience e, 3 will be the action, 1 will be the next_state, and 4 will be the reward.

e 
> Experience(state=2, action=3, next_state=1, reward=4)   

Replay Memory

Now that we have our Experience class, let’s define our ReplayMemory class, which is where these experiences will be stored.

Recall that replay memory will have some set capacity. This capacity is the only parameter that needs to be specified when creating a ReplayMemory object.

class ReplayMemory():
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.push_count = 0

We initialize ReplayMemory’s capacity to whatever was passed in, and we also define a memory attribute equal to an empty list. memory will be the structure that actually holds the stored experiences. We also create a push_count attribute, which we initialize to 0, and we’ll use this to keep track of how many experiences we’ve added to memory.

Now, we need a way to store experiences in replay memory as they occur, so we define this push() function to do just that.

def push(self, experience):
    if len(self.memory) < self.capacity:
        self.memory.append(experience)
    else:
        self.memory[self.push_count % self.capacity] = experience
    self.push_count += 1

push() accepts an experience, and when we want to push a new experience into replay memory, we first have to check that the number of experiences we already have in memory is less than the memory capacity. If it is, then we append the experience to memory.

If, on the other hand, the number of experiences in memory has reached capacity, then we begin overwriting the oldest experiences, starting at the front of memory and wrapping around as new experiences come in, so memory behaves like a circular buffer. We then increment push_count by 1.
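
To make the wrap-around behavior concrete, here’s a small illustration that isn’t part of the original code. With a capacity of \(3\), the fourth experience pushed overwrites the oldest one at the front of memory:

# Small illustration (not part of the original code): with a capacity of 3,
# the fourth push lands at index 3 % 3 = 0, overwriting the oldest experience.
memory = ReplayMemory(capacity=3)
for i in [1, 2, 3, 4]:
    memory.push(Experience(i, i, i, i))
[e.state for e in memory.memory]
> [4, 2, 3]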

Aside from storing experiences in replay memory, we also want to be able to sample experiences from replay memory. Remember, these sampled experiences will be what we use to train our DQN.

We define this sample() function, which returns a random sample of experiences. The number of randomly sampled experiences returned will be equal to the batch_size parameter passed to the function.

def sample(self, batch_size):
    return random.sample(self.memory, batch_size)

Finally, we have this can_provide_sample() function that returns a boolean to tell us whether or not we can sample from memory. Recall that the size of a sample we’ll obtain from memory will be equal to the batch size we use to train our network.

def can_provide_sample(self, batch_size):
    return len(self.memory) >= batch_size

For example, suppose we only have \(20\) experiences in replay memory and that our batch size is \(50\). Then, we will be unable to sample because we do not have \(50\) experiences yet. Therefore, before we try to sample from memory, we’ll do a check to see if it’s possible to do so by calling the can_provide_sample() function first. We’ll see this in practice later.
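
Continuing with this hypothetical example (again, not part of the original code), we can see can_provide_sample() in action:

# Hypothetical example: with only 20 experiences stored, a batch size of 50
# can't be sampled yet, but a smaller batch size can.
memory = ReplayMemory(capacity=100000)
for i in range(20):
    memory.push(Experience(i, i, i, i))
memory.can_provide_sample(batch_size=50)
> False
memory.can_provide_sample(batch_size=10)
> True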

Epsilon Greedy Strategy

Hopefully you remember from earlier in this series the concept of exploration versus exploitation. This has to do with the way our agent selects actions. Recall, our agent’s actions will either fall in the category of exploration, where the agent is just exploring the environment by taking a random action from a given state, or the category of exploitation, where the agent exploits what it’s learned about the environment to take the best action from a given state.

To get a balance of exploration and exploitation, we use what we previously introduced as an epsilon greedy strategy. With this strategy, we define an exploration rate called epsilon that we initially set to \(1\). This exploration rate is the probability that our agent will explore the environment rather than exploit it. With epsilon equal to \(1\), it is \(100\) percent certain that the agent will start out by exploring the environment.

As the agent learns more about the environment, though, epsilon will decay by some decay rate that we set so that the likelihood of exploration becomes less and less probable as the agent learns more and more about the environment. We’re now going to write an EpsilonGreedyStrategy class that puts this idea into code.

class EpsilonGreedyStrategy():
    def __init__(self, start, end, decay):
        self.start = start
        self.end = end
        self.decay = decay

Our EpsilonGreedyStrategy accepts start, end, and decay, which correspond to the starting, ending, and decay values of epsilon. These attributes all get initialized based on the values that are passed in during object creation.

def get_exploration_rate(self, current_step):
    return self.end + (self.start - self.end) * \
        math.exp(-1. * current_step * self.decay)

We then have this single function get_exploration_rate(), which requires the current_step of the agent to be passed. This function returns the calculated exploration rate, which is based on the formula that we covered in an earlier episode. Our agent is going to be able to use the exploration rate to determine how it should select its actions, either by exploring or exploiting the environment.
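
To get a feel for the decay, we can print the exploration rate at a few different time steps. The specific start, end, and decay values below are assumptions for illustration only; we’ll choose the actual hyperparameters when we set up our main program.

# Illustrative hyperparameters only; the real values are set later in the main program.
strategy = EpsilonGreedyStrategy(start=1, end=0.01, decay=0.001)
for step in [0, 1000, 5000]:
    print(step, round(strategy.get_exploration_rate(step), 3))
> 0 1.0
> 1000 0.374
> 5000 0.017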

Reinforcement Learning Agent

Speaking of our agent, an Agent class is where we’re headed next. Our Agent class will require a strategy and num_actions.

So, later when we create an Agent object, we’ll need to already have an instance of the EpsilonGreedyStrategy class created so that we can use that strategy to create our agent. num_actions corresponds to how many possible actions the agent can take from a given state. In our cart and pole example, this number will always be two since the agent can only ever choose to move left or right.

class Agent():
    def __init__(self, strategy, num_actions):
        self.current_step = 0
        self.strategy = strategy
        self.num_actions = num_actions

We initialize the agent’s strategy and num_actions accordingly, and we also initialize the current_step attribute to 0. This corresponds to the agent’s current step in the environment. The Agent class has a single function called select_action(), which requires a state and a policy_net.

def select_action(self, state, policy_net):
    rate = self.strategy.get_exploration_rate(self.current_step)
    self.current_step += 1

    if rate > random.random():
        return random.randrange(self.num_actions) # explore      
    else:
        with torch.no_grad():
            return policy_net(state).argmax(dim=1).item() # exploit    

Remember a policy network is the name we give to our deep Q-network that we train to learn the optimal policy.

Within this function, we first initialize rate to be equal to the exploration rate returned from the epsilon greedy strategy that was passed in when we created our agent, and we increment the agent’s current_step by 1.

We then check to see if the exploration rate is greater than a randomly generated number between 0 and 1. If it is, then we explore the environment by randomly selecting an action, either 0 or 1, corresponding to left or right moves.

If the exploration rate is not greater than the random number, then we exploit the environment by selecting the action that corresponds to the highest Q-value output from our policy network for the given state.

We’re specifying with torch.no_grad() before we pass data to our policy_net to turn off gradient tracking since we’re currently using the model for inference and not training.

During training, PyTorch keeps track of all the forward pass calculations that happen within the network. It needs to do this so that it can know how to apply backpropagation later. Since we’re only using the model for inference at the moment, we’re telling PyTorch not to keep track of any forward pass calculations.
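
To tie these pieces together, here’s a hypothetical usage sketch with assumed image dimensions and strategy values; it isn’t part of the original code, and the real setup happens in our main program later.

# Hypothetical usage sketch (assumed image size and strategy values):
strategy = EpsilonGreedyStrategy(start=1, end=0.01, decay=0.001)
agent = Agent(strategy, num_actions=2)
policy_net = DQN(img_height=40, img_width=90)
state = torch.rand(1, 3, 40, 90) # stand-in for a preprocessed screen
action = agent.select_action(state, policy_net) # returns 0 (left) or 1 (right)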

Wrapping up

Next time, we’ll pick up with the code for how we’ll be extracting and preprocessing the cart and pole input for our DQN.

Let me know in the video comments how you’re doing so far, and please like this video to let us know you’re learning! Don’t forget to take the corresponding quiz to test your own understanding. See ya in the next one!
