Reinforcement Learning - Introducing Goal Oriented Intelligence

with deeplizard.

Deep Q-Network Training Code - Reinforcement Learning Code Project

July 28, 2019 by


Training a Deep Q-Network - Reinforcement Learning Code Project

Welcome back to this series on reinforcement learning! In this episode we’ll be bringing together all the classes and functions we’ve developed so far, and incorporating them into our main program to train our deep Q-network for the cart and pole environment.

Then, we’ll see the training process live as we watch our agent’s ability to balance the pole on the cart increase as it learns.

Main Program

Within our main program, we’re first initializing all of our hyperparameters.


Note that these parameters are ones that we’ll want to tune and experiment with to try to improve performance. In later episodes, we’ll see some of this experimentation in action.

batch_size = 256
gamma = 0.999
eps_start = 1
eps_end = 0.01
eps_decay = 0.001
target_update = 10
memory_size = 100000
lr = 0.001
num_episodes = 1000

We’re first setting the batch_size for our network to 256. gamma, which is the discount factor used in the Bellman equation, is being set to 0.999.

We then have these three eps variables: eps_start, eps_end, and eps_decay.

eps_start is the starting value of epsilon. Remember, epsilon is the name we’ve given to the exploration rate. eps_end is the ending value of epsilon, and eps_decay is the decay rate we’ll use to decay epsilon over time.

We’ve covered the exploration rate in full detail in an earlier episode, so if you need a refresher, be sure to check that out.
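As a quick reminder of how these three values interact, here is a minimal sketch. It assumes epsilon decays exponentially with the number of steps the agent has taken, which is a common choice; the helper name exploration_rate is hypothetical and is not necessarily how our EpsilonGreedyStrategy class is written.

```python
import math

eps_start = 1
eps_end = 0.01
eps_decay = 0.001

def exploration_rate(current_step):
    # Hypothetical helper: assumes epsilon decays exponentially from
    # eps_start toward eps_end as the agent takes more steps.
    return eps_end + (eps_start - eps_end) * math.exp(-1.0 * current_step * eps_decay)

# Epsilon starts at eps_start and shrinks toward eps_end over time.
for step in (0, 1000, 5000):
    print(step, round(exploration_rate(step), 4))
```

Under this assumption, a smaller eps_decay keeps the agent exploring for longer before it settles near eps_end.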

Next, we set target_update to 10, and this is how frequently, in terms of episodes, we’ll update the target network weights with the policy network weights. So, with target_update set to 10, we’re choosing to update the target network every 10 episodes.

Next, we set the memory_size, which is the capacity of the replay memory, to 100,000. We then set the learning rate lr that is used during training of the policy network to 0.001, and the number of episodes we want to play num_episodes to 1000.

Essential objects

That’s it for the hyperparameters. Now, we’ll set up all of the essential objects using the classes we’ve built in the previous episodes.

First though, let’s set up our device for PyTorch. This tells PyTorch to use a GPU if it’s available, otherwise use the CPU.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Now, we set up our environment manager em using the CartPoleEnvManager class, and we pass in the required device. We then set our strategy to be an instance of the EpsilonGreedyStrategy class, and we pass in the required start, end, and decay values for epsilon.

em = CartPoleEnvManager(device)
strategy = EpsilonGreedyStrategy(eps_start, eps_end, eps_decay)

We then define an agent using our Agent class and pass in the required strategy, number of actions available, and device. We then initialize memory to be an instance of ReplayMemory and pass in the capacity using memory_size.

agent = Agent(strategy, em.num_actions_available(), device)
memory = ReplayMemory(memory_size)

Now, we define both our policy network and target network by creating two instances of our DQN class and passing in the height and width of the screen to set up the appropriate input shape of the networks. We put these networks on our defined device using PyTorch’s to() function.

policy_net = DQN(em.get_screen_height(), em.get_screen_width()).to(device)
target_net = DQN(em.get_screen_height(), em.get_screen_width()).to(device)

We then set the weights and biases in the target_net to be the same as those in the policy_net using PyTorch’s state_dict() and load_state_dict() functions. We also put the target_net into eval mode, which tells PyTorch that this network is not in training mode. In other words, this network will only be used for inference.

target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

Lastly, we set optimizer equal to the Adam optimizer, which accepts our policy_net.parameters() as those for which we’ll be optimizing, and our defined learning rate lr.

optimizer = optim.Adam(params=policy_net.parameters(), lr=lr)

Training loop

We’re now all set up to start training.

We’re going to be storing our episode_durations during training in order to plot them using the plot() function we developed last time, so we create an empty list to store them in.

episode_durations = []

The steps that we’ll be covering now in our main training loop will be the implementation of the algorithm below that we covered in a previous episode. So, if at any point you get lost in where something in the upcoming code fits in, be sure to refresh your memory by taking a look at this.

  1. Initialize replay memory capacity.
  2. Initialize the policy network with random weights.
  3. Clone the policy network, and call it the target network.
  4. For each episode:
    1. Initialize the starting state.
    2. For each time step:
      1. Select an action.
        • Via exploration or exploitation
      2. Execute selected action in an emulator.
      3. Observe reward and next state.
      4. Store experience in replay memory.
      5. Sample random batch from replay memory.
      6. Preprocess states from batch.
      7. Pass batch of preprocessed states to policy network.
      8. Calculate loss between output Q-values and target Q-values.
        • Requires a pass to the target network for the next state
      9. Gradient descent updates weights in the policy network to minimize loss.
        • After \(x\) time steps, weights in the target network are updated to the weights in the policy network.

We’ll now step into our training loop. The first for loop is going to iterate over each episode.

for episode in range(num_episodes):
    em.reset()
    state = em.get_state()

For each episode, we first reset the environment, then get the initial state.

Now, we'll step into the nested for loop that will iterate over each time step within each episode.

for timestep in count():
    action = agent.select_action(state, policy_net)
    reward = em.take_action(action)
    next_state = em.get_state()
    memory.push(Experience(state, action, next_state, reward))
    state = next_state

For each time step, our agent selects an action based on the current state. Recall, we also need to pass in the required policy_net since the agent will use this network to select its action if it exploits the environment rather than explores it.

The agent then takes the chosen action and receives the associated reward, and we get the next_state.

We now can create an Experience using the state, action, next_state, and reward and push this onto replay memory. After which, we transition to the next state by setting our current state to next_state.

Now that our agent has had an experience and stored it in replay memory, we’ll check to see if we can get a sample from replay memory to train our policy_net. Remember, we covered in a previous episode that we can get a sample equal to the batch_size from replay memory as long as the current size of memory is at least the batch_size.
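If you want a refresher on that check, here is a minimal sketch of a replay memory. The class name SketchReplayMemory is a hypothetical stand-in, and the details (like overwriting the oldest experience once capacity is reached) are assumptions rather than a copy of the ReplayMemory class from the earlier episode.

```python
import random
from collections import namedtuple

Experience = namedtuple('Experience', ('state', 'action', 'next_state', 'reward'))

class SketchReplayMemory:
    # Hypothetical stand-in for the ReplayMemory class from the earlier episode.
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.push_count = 0

    def push(self, experience):
        if len(self.memory) < self.capacity:
            self.memory.append(experience)
        else:
            # Assumption: overwrite the oldest experience once capacity is reached.
            self.memory[self.push_count % self.capacity] = experience
        self.push_count += 1

    def can_provide_sample(self, batch_size):
        # We can only sample once we hold at least batch_size experiences.
        return len(self.memory) >= batch_size

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

memory = SketchReplayMemory(capacity=5)
for i in range(3):
    memory.push(Experience(i, 0, i + 1, 1))

print(memory.can_provide_sample(4))  # False, only 3 experiences stored
memory.push(Experience(3, 0, 4, 1))
print(memory.can_provide_sample(4))  # True, 4 experiences stored
```

With our batch_size of 256, this means no training happens until the agent has collected 256 experiences.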

if memory.can_provide_sample(batch_size):
    experiences = memory.sample(batch_size)
    states, actions, rewards, next_states = extract_tensors(experiences)
    current_q_values = QValues.get_current(policy_net, states, actions)
    next_q_values = QValues.get_next(target_net, next_states)
    target_q_values = (next_q_values * gamma) + rewards

    loss = F.mse_loss(current_q_values, target_q_values.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

If we can get a sample from memory, then we get a sample equal to batch_size and assign this sample to the variable experiences. We’re then going to do some data manipulation to extract all the states, actions, rewards, and next_states into their own tensors from the experiences list.

We do this using the extract_tensors() function. We haven’t covered the inner workings of this function yet, but stick around until the end, and we’ll circle back around to cover it in detail. For now, let’s continue with our training loop so that we can stay in flow.

Continuing with the code above, we now get the q-values for the corresponding state-action pairs that we’ve extracted from the experiences in our batch. We do this using QValues.get_current(), to which we pass our policy_net, states, and actions.

We’ll be covering the QValues class later as well, but for now, just know that get_current() will return the q-values for any given state-action pairs, as predicted from the policy network. The q-values will be returned as a PyTorch tensor.

We also need to get the q-values for the next states in the batch as well. We’re able to do this using QValues.get_next(), and passing in the target_net and next_states that we extracted from the experiences.

This function will return the maximum q-values for the next states using the best corresponding next actions. It does this using the target network because, remember from our episode on fixed Q-targets, the q-values for next states are calculated using the target network.

These q-values will also be returned as a PyTorch tensor.

Now, we’re able to calculate the target_q_values using this formula that we also covered in that previous episode.

\begin{eqnarray*} q_{\ast }\left( s,a\right)= E\left[ R_{t+1}+\gamma \max_{a^{\prime }}q_{\ast }\left( s^\prime,a^{\prime }\right)\right] \end{eqnarray*}

We multiply each of the next_q_values by our discount rate gamma and add this result to the corresponding reward in the rewards tensor to create a new tensor of target_q_values.
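To make the arithmetic concrete, here is a small worked example using plain Python lists in place of tensors. The numbers are made up for illustration only.

```python
gamma = 0.999

# Made-up values for a batch of three experiences.
rewards = [1.0, 1.0, 1.0]
next_q_values = [10.0, 0.0, 5.0]  # 0.0 stands for a final next state

# Same arithmetic as in the training loop, applied element by element.
target_q_values = [(q * gamma) + r for q, r in zip(next_q_values, rewards)]
print(target_q_values)  # approximately [10.99, 1.0, 5.995]
```

Notice that for the final next state, the target is just the reward, since there is no future return to discount.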

We now can calculate the loss between the current_q_values and the target_q_values using mean squared error mse as our loss function, and then we zero out the gradients using optimizer.zero_grad().

This function sets the gradients of all the weights and biases in the policy_net to zero. Since PyTorch accumulates the gradients when it does backprop, we need to call zero_grad() before backprop occurs. Otherwise, if we didn’t zero out the gradients each time, then we’d be accumulating gradients across all backprop runs.
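Here is a tiny sketch of that accumulation behavior on a single made-up weight w. It zeroes the gradient by hand with w.grad.zero_() as a stand-in for what the optimizer does for every parameter, so treat it as an illustration rather than the exact mechanism inside optimizer.zero_grad().

```python
import torch

# A single made-up weight that tracks gradients.
w = torch.tensor(2.0, requires_grad=True)

# The gradient of 3 * w with respect to w is 3.
(3 * w).backward()
first = w.grad.item()

# Calling backward() again without zeroing adds to the stored gradient.
(3 * w).backward()
accumulated = w.grad.item()

# After zeroing, the next backward pass starts fresh.
w.grad.zero_()
(3 * w).backward()
after_zeroing = w.grad.item()

print(first, accumulated, after_zeroing)  # 3.0 6.0 3.0
```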

We then call loss.backward(), which computes the gradient of the loss with respect to all the weights and biases in the policy_net.

We now call step() on our optimizer, which updates the weights and biases with the gradients that were computed when we called backward() on our loss.

We then check to see if the last action our agent took ended the episode by getting the value of done from our environment manager em. If the episode ended, then we append the current timestep to the episode_durations list to store how long this particular episode lasted.

if em.done:
    episode_durations.append(timestep)
    plot(episode_durations, 100)
    break

We then plot the duration and the 100-period moving average to the screen and break out of the inner loop so that we can start a new episode.
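To see how the two loops and the break fit together, here is a stripped-down sketch of just the control flow. ToyEnv is a hypothetical stand-in for our environment manager in which every episode ends after a fixed number of steps, and it assumes count comes from Python's itertools module.

```python
from itertools import count

class ToyEnv:
    # Hypothetical stand-in for the environment manager: every episode
    # simply ends after a fixed number of steps.
    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.steps = 0
        self.done = False

    def reset(self):
        self.steps = 0
        self.done = False

    def take_action(self, action):
        self.steps += 1
        self.done = self.steps >= self.episode_length
        return 1  # reward

em = ToyEnv(episode_length=5)
episode_durations = []

for episode in range(3):
    em.reset()
    for timestep in count():  # counts 0, 1, 2, ... until we break
        em.take_action(action=0)
        if em.done:
            episode_durations.append(timestep)
            break

print(episode_durations)  # [4, 4, 4]
```

Since count() starts at 0, the stored timestep is one less than the number of actions taken in the episode.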

Before starting a new episode though, we have one final check to see if we should do an update to our target_net.

if episode % target_update == 0:
    target_net.load_state_dict(policy_net.state_dict())

Recall, our target_update variable is set to 10, so we check if our current episode is a multiple of 10, and if it is, then we update the target_net weights with the policy_net weights.
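As a quick illustration of which episodes trigger an update, here is the same check run over a shortened range of episodes.

```python
target_update = 10
num_episodes = 35  # shortened here just for illustration

# Collect every episode index that passes the modulo check.
update_episodes = [episode for episode in range(num_episodes)
                   if episode % target_update == 0]
print(update_episodes)  # [0, 10, 20, 30]
```

Note that episode 0 counts as a multiple of 10 as well, so the first update happens at the end of the very first episode.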

At this point, we can start a new episode. This whole process will end once we’ve reached the number of episodes set in num_episodes. At that point, we’ll close the environment manager.

em.close()

That’s it for the training loop! We’ll run this in just a moment to see what our training looks like, both from our progress being plotted on the chart, as well as by checking out how the cart and pole performance in the environment changes as it learns.

Tensor processing

Before we do that though, let’s circle back to the extract_tensors() function that I said we’d return to and see what’s happening there. Remember, this is the function that we called to extract all the states, actions, rewards, and next_states into their own tensors from a given batch of experiences.

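def extract_tensors(experiences):
    # Convert batch of Experiences to Experience of batches
    batch = Experience(*zip(*experiences))

    # Concatenate each field's tuple of tensors into a single batch tensor
    t1 = torch.cat(batch.state)
    t2 = torch.cat(batch.action)
    t3 = torch.cat(batch.reward)
    t4 = torch.cat(batch.next_state)

    return (t1,t2,t3,t4)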

extract_tensors() accepts a batch of Experiences and first transposes it into an Experience of batches. I know this step sounds kinda weird, right? So, let’s look at an example of what we’re doing in this step before we move on.

First, let’s create three sample experiences, and put them in a list and see how that looks.

e1 = Experience(1,1,1,1)
e2 = Experience(2,2,2,2)
e3 = Experience(3,3,3,3)

experiences = [e1,e2,e3]

> [Experience(state=1, action=1, next_state=1, reward=1),
   Experience(state=2, action=2, next_state=2, reward=2),
   Experience(state=3, action=3, next_state=3, reward=3)]

We can see the first Experience in the list has a state, action, next_state, and reward all equal to 1. The second Experience has 2 as the value of all of these attributes, and the third Experience has 3 as the value for all attributes.

Now, we execute the same line from our function that we’re trying to understand.

batch = Experience(*zip(*experiences))

> Experience(state=(1, 2, 3), action=(1, 2, 3), next_state=(1, 2, 3), reward=(1, 2, 3))

We can see that we now do indeed have this Experience object where the state attribute is set to the tuple containing all the states from e1, e2, and e3 in the original experiences list. Similarly, the action, next_state, and reward attributes contain tuples containing all the corresponding values from the experiences list.

So now that we see what this line does, let’s go back to extract_tensors() to see what happens next.

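def extract_tensors(experiences):
    # Convert batch of Experiences to Experience of batches
    batch = Experience(*zip(*experiences))

    # Concatenate each field's tuple of tensors into a single batch tensor
    t1 = torch.cat(batch.state)
    t2 = torch.cat(batch.action)
    t3 = torch.cat(batch.reward)
    t4 = torch.cat(batch.next_state)

    return (t1,t2,t3,t4)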

We call the result of this operation we just demonstrated batch, and then, by calling torch.cat() on batch.state, we extract all the states from this batch into their own state tensor.

We go through this same process with all the actions, rewards, and next states as well, and then return a tuple that contains the states tensor, actions tensor, rewards tensor, and next_states tensor.
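To get a feel for the shapes involved, here is a small sketch that builds a few made-up experiences and combines each field with torch.cat(). It assumes each stored state already has a leading batch dimension of 1 and that actions and rewards are stored as one-element tensors; the helper make_experience and the tiny 2 x 2 screens are purely illustrative.

```python
import torch
from collections import namedtuple

Experience = namedtuple('Experience', ('state', 'action', 'next_state', 'reward'))

def make_experience(value):
    # Hypothetical helper that builds an experience from tiny fake screens.
    state = torch.full((1, 3, 2, 2), float(value))
    next_state = torch.full((1, 3, 2, 2), float(value + 1))
    action = torch.tensor([value % 2])
    reward = torch.tensor([1.0])
    return Experience(state, action, next_state, reward)

experiences = [make_experience(v) for v in range(4)]

# Batch of Experiences -> Experience of batches
batch = Experience(*zip(*experiences))

# Concatenate along the leading dimension to get one tensor per field.
states = torch.cat(batch.state)
actions = torch.cat(batch.action)
rewards = torch.cat(batch.reward)
next_states = torch.cat(batch.next_state)

print(states.shape)   # torch.Size([4, 3, 2, 2])
print(actions.shape)  # torch.Size([4])
```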

Calculating Q-values

We now have one last thing to cover, the QValues class. This is the class that we used to calculate the q-values for the current states using the policy_net, and the next states using the target_net.

This class contains two static methods, meaning that we’re able to call these methods without creating an instance of the class first. Because we’re creating the class in this way, we’re setting up its own device since we won’t be creating an instance of this class and passing in our device from our main program.

class QValues():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

As you can see, device is defined in the same way as we defined it in our main program earlier.

The first static method is get_current(). This function accepts a policy_net, states, and actions. When we call this function in our main program, recall that these states and actions are the state-action pairs that were sampled from replay memory. So, the states and actions correspond with each other.

@staticmethod
def get_current(policy_net, states, actions):
    return policy_net(states).gather(dim=1, index=actions.unsqueeze(-1))

The function just returns the predicted q-values from the policy_net for the specific state-action pairs that were passed in.
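If gather() is new to you, here is a small sketch of what it does. The q_values tensor below is made up and stands in for the output of the policy network for a batch of three states with two possible actions each.

```python
import torch

# Made-up network output: one row per state, one column per action.
q_values = torch.tensor([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])

# The action that was actually taken in each state.
actions = torch.tensor([0, 1, 0])

# For each row, pick the q-value in the column given by the action.
chosen = q_values.gather(dim=1, index=actions.unsqueeze(-1))

print(chosen.squeeze(1).tolist())  # [1.0, 4.0, 5.0]
print(chosen.shape)                # torch.Size([3, 1])
```

The result keeps a second dimension of size 1, which is why the loss calculation calls unsqueeze(1) on the target_q_values so that both tensors have the same shape.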

The next static method is get_next().

This function is a bit more technical than anything else we’ve covered so far in this episode, so if you have any trouble understanding the first time, don’t worry. Just take it slow. It requires your attention, so here we go!

@staticmethod
def get_next(target_net, next_states):
    # A final state is an all-black screen, so its maximum value is 0
    final_state_locations = next_states.flatten(start_dim=1) \
        .max(dim=1)[0].eq(0).type(torch.bool)
    non_final_state_locations = (final_state_locations == False)
    non_final_states = next_states[non_final_state_locations]
    batch_size = next_states.shape[0]
    values = torch.zeros(batch_size).to(QValues.device)
    values[non_final_state_locations] = target_net(non_final_states).max(dim=1)[0].detach()
    return values

This function accepts a target_net and next_states. Recall that for each next state, we want to obtain the maximum q-value predicted by the target_net among all possible next actions.

To do that, we first look in our next_states tensor and find the locations of all the final states. If an episode is ended by a given action, then we’re calling the next_state that occurs after that action was taken the final state.

Remember, last time we discussed that final states are represented with an all black screen. Therefore, all the values within the tensor that represent that final state would be zero.

We want to know where the final states are, if we even have any at all in a given batch, because we’re not going to want to pass these final states to our target_net to get a predicted q-value. We know that the q-value for final states is zero because the agent is unable to receive any reward once an episode has ended.

So, we’re finding the locations of these final states so that we know not to pass them to the target_net for q-value predictions when we pass our non-final next states.

To find the locations of these potential final states, we flatten each next state in the next_states tensor, starting at dimension 1, so that every next state becomes a single row of values. We then check each row to find its maximum value. If its maximum value is equal to 0, then we know that this particular next state is a final state, and we represent that as a True within this final_state_locations tensor. next_states that are not final are represented by a False value in the tensor.

We then create a second tensor non_final_state_locations, which is just an exact opposite of final_state_locations. It contains True for each location in the next_states tensor that corresponds to a non-final state and a False for each location that corresponds to a final state.

Now that we know the locations of the non-final states, we can now get the values of these states by indexing into the next_states tensor and getting all of the corresponding non_final_states.

Next, we find out the batch_size by checking to see how many next states are in the next_states tensor. Using this, we create a new tensor of zeros that has a length equal to the batch size. We also send this tensor to the device defined at the start of this class.

We then index into this tensor of zeros with the non_final_state_locations, and we set the corresponding values for all of these locations equal to the maximum predicted q-values from the target_net across each action.

This leaves us with a tensor that contains zeros as the q-values associated with any final state and contains the target_net's maximum predicted q-value across all actions for each non-final state. This result is what is finally returned by get_next().

The whole point of all this code in this function was to find out if we have any final states in our next_states tensor. If we do, then we need to find out where they are so that we don’t pass them to the target_net. We don’t want to pass them to the target_net for a predicted q-value since we know that their associated q-values will be zero.
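If the indexing is hard to picture, here is a small sketch that walks through the same steps on a made-up batch of three tiny next states, where the middle one is all zeros and so counts as final. The function fake_target_net is a hypothetical stand-in for our target network, and the sketch stays on the CPU for simplicity.

```python
import torch

def fake_target_net(states):
    # Hypothetical stand-in for target_net: returns two "q-values" per state,
    # the sum of the pixels and twice that sum.
    total = states.flatten(start_dim=1).sum(dim=1)
    return torch.stack([total, 2 * total], dim=1)

# Three tiny fake next states; the second is all black (all zeros).
next_states = torch.stack([
    torch.ones(1, 2, 2),
    torch.zeros(1, 2, 2),
    torch.full((1, 2, 2), 0.5),
])

# True wherever the maximum value of a next state is 0.
final_state_locations = next_states.flatten(start_dim=1).max(dim=1)[0].eq(0)
non_final_state_locations = (final_state_locations == False)

# Only the non-final states are passed to the network.
non_final_states = next_states[non_final_state_locations]
values = torch.zeros(next_states.shape[0])
values[non_final_state_locations] = fake_target_net(non_final_states).max(dim=1)[0].detach()

print(final_state_locations.tolist())  # [False, True, False]
print(values.tolist())                 # [8.0, 0.0, 4.0]
```

The final state keeps its q-value of zero, while the other two positions are filled with the largest value the stand-in network predicted for them.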

The only reason this may seem a little more complicated is due to the use of tensors and how we’re indexing into the tensors. So, again, just spend a little more time on this one, and it should all come together for you.

Update to plot

Just one quick update before we get to training. I’ve added a print statement in the plot() function we defined last time so that, in addition to the plot on the screen, we could also have a print out of the moving average at the current episode.

def plot(values, moving_avg_period):

    moving_avg = get_moving_average(moving_avg_period, values)
    print("Episode", len(values), "\n", \
        moving_avg_period, "episode moving avg:", moving_avg[-1])
    if is_ipython: display.clear_output(wait=True)

So make sure to update your code with this change if you were following along in real time with the code last time.
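In case you want to sanity check the number that gets printed, here is a pure-Python sketch of a moving average. The function moving_average is a hypothetical stand-in for get_moving_average(), and it assumes the average is reported as 0 until enough values exist to fill one full period.

```python
def moving_average(values, period):
    # Hypothetical pure-Python stand-in for get_moving_average(); assumes the
    # average is reported as 0 until at least `period` values exist.
    averages = []
    for i in range(len(values)):
        if i + 1 < period:
            averages.append(0.0)
        else:
            window = values[i + 1 - period:i + 1]
            averages.append(sum(window) / period)
    return averages

durations = [10, 20, 30, 40, 50]
print(moving_average(durations, 3))  # [0.0, 0.0, 20.0, 30.0, 40.0]
```

The last entry of this list corresponds to the moving average at the current episode.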

Here is an example plot with the added print out.

plot(np.random.rand(300), 100)

Train agent

Alright, that wraps up all the code! Let’s finally run our main program and check out our performance. Check out the video to see the training occur over time in a timelapse recording.

Here is the final plot after 1000 episodes.

We can see that the agent definitely did learn over time, but it didn’t solve cart and pole, as our 100-episode moving average never reached a duration of 195 or more.
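If you’d like to check this criterion on your own runs, here is a minimal sketch. The helper names best_moving_average and is_solved are hypothetical, and the durations passed in below are made up for illustration.

```python
def best_moving_average(durations, period=100):
    # Highest average over any window of `period` consecutive episodes,
    # or None if fewer than `period` episodes have been played.
    if len(durations) < period:
        return None
    window_sum = sum(durations[:period])
    best = window_sum / period
    for i in range(period, len(durations)):
        # Slide the window forward by one episode.
        window_sum += durations[i] - durations[i - period]
        best = max(best, window_sum / period)
    return best

def is_solved(durations, threshold=195, period=100):
    best = best_moving_average(durations, period)
    return best is not None and best >= threshold

print(is_solved([150] * 300))                # False
print(is_solved([150] * 100 + [200] * 100))  # True
```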

Wrapping up

We’ll see in a future episode how we can experiment with tuning our hyperparameters and network architecture to increase performance. In the meantime, I encourage you to test and tune these parameters yourself and see if you can get the agent to perform any better than this. If you can, comment with what you changed and what 100-episode moving average you were able to achieve!

Until then, please like this video to let us know you’re learning, and take the corresponding quiz to test your own understanding! Don’t forget about the deeplizard hivemind for exclusive perks and rewards. See ya in the next one!

