Reinforcement Learning - Developing Intelligent Agents

Deep Learning Course - Level: Advanced

Replay Memory Explained - Experience for Deep Q-Network Training



What's up, guys? In this post, we'll continue our discussion of deep Q-networks and focus on an important technique called experience replay that is used during the training process of a DQN. So, let's get to it!

Last time, we covered each piece of the architecture that makes up a typical deep Q-network. Now, before we can move on to discussing exactly how a DQN is trained, we're first going to explain the concepts of experience replay and replay memory.

Experience Replay and Replay Memory

With deep Q-networks, we often utilize this technique called experience replay during training. With experience replay, we store the agent's experiences at each time step in a data set called the replay memory. We represent the agent's experience at time \(t\) as \(e_t\).

At time \(t\), the agent's experience \(e_t\) is defined as this tuple:

\[
e_t = \left(s_t, a_t, r_{t+1}, s_{t+1}\right)
\]
This tuple contains the state of the environment \(s_t\), the action \(a_t\) taken from state \(s_t\), the reward \(r_{t+1}\) given to the agent at time \(t+1\) as a result of the previous state-action pair \((s_t,a_t)\), and the next state of the environment \(s_{t+1}\). This tuple indeed gives us a summary of the agent's experience at time \(t\).
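As a quick illustration, an experience tuple can be modeled with a named tuple. The field names and sample values below are our own, just to make the structure concrete:

```python
from collections import namedtuple

# One experience: (state, action, reward, next state)
Experience = namedtuple('Experience', ('state', 'action', 'reward', 'next_state'))

# Hypothetical experience: the agent moved right and received a reward of 1.0
e_t = Experience(state=(0, 0), action='right', reward=1.0, next_state=(0, 1))
```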


All of the agent's experiences at each time step over all episodes played by the agent are stored in the replay memory. Well actually, in practice, we'll usually see the replay memory set to some finite size limit, \(N\), and therefore, it will only store the last \(N\) experiences.

This replay memory data set is what we'll randomly sample from to train the network. The act of gaining experience and sampling from the replay memory that stores these experiences is called experience replay.
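A minimal sketch of such a replay memory, using a `deque` with a maximum length so that only the last \(N\) experiences are kept (the class and method names here are our own):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last N experiences and supports random sampling."""

    def __init__(self, capacity):
        # Once capacity is reached, the oldest experiences drop off automatically
        self.memory = deque(maxlen=capacity)

    def push(self, experience):
        # Store a single experience tuple
        self.memory.append(experience)

    def sample(self, batch_size):
        # Draw a random batch of experiences (without replacement)
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Pushing ten experiences into a memory of capacity five leaves only the five most recent ones available for sampling.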

Why use experience replay?

Why would we choose to train the network on random samples from replay memory, rather than just providing the network with the sequential experiences as they occur in the environment?

A key reason for using replay memory is to break the correlation between consecutive samples.

If the network learned only from consecutive samples of experience as they occurred sequentially in the environment, the samples would be highly correlated and would therefore lead to inefficient learning. Taking random samples from replay memory breaks this correlation.

Combining a deep Q-network with experience replay

Alright, we now have the idea of experience replay down. From last time, we should also have an understanding of a general deep Q-network architecture, the data that the network accepts, and the output from the network.

As a quick refresher, remember that the network is passed a state from the environment, and in turn, the network outputs the Q-value for each action that can be taken from that state.

Let's now bring all of this together with experience replay to see how the pieces fit with each other.

Setting up

Before training starts, we first initialize the replay memory data set \(D\) to capacity \(N\). So, the replay memory \(D\) will hold \(N\) total experiences.

Next, we initialize the network with random weights. We've covered weight initialization in the Deep Learning Fundamentals series, so if you need a refresher on this topic, check that out. The exact same concepts we covered there apply to deep Q-network weight initialization.

Next, for each episode, we initialize the starting state of the episode. In our previous discussion, we talked about states, including the starting state, being a frame of raw pixels from a game screen as an example.

Gaining experience

Now, for each time step \(t\) within the episode, we either explore the environment and select a random action, or we exploit the environment and select the greedy action for the given state that gives the highest Q-value. Remember, this is the exploration-exploitation trade-off that we discussed in detail in a previous post.

We then execute the selected action \(a_t\) in an emulator. So, for example, if the selected action was to move right, then in the emulator, where actions are executed in the actual game environment, the agent would actually move right. We then observe the reward \(r_{t+1}\) given for this action, and we also observe the next state of the environment, \(s_{t+1}\). We then store the entire experience tuple \(e_t=(s_t,a_t,r_{t+1},s_{t+1})\) in replay memory \(D\).
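The explore-or-exploit choice at each time step can be sketched as epsilon-greedy action selection. The `q_values` lookup and fixed `epsilon` below are hypothetical stand-ins for the network's output and an exploration schedule:

```python
import random

def select_action(state, q_values, actions, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise exploit by taking the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: pick a random action
    # exploit: pick the greedy action for this state
    return max(actions, key=lambda a: q_values[(state, a)])
```

With `epsilon=0`, this always returns the greedy action for the given state.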

Wrapping up

Here's a summary of what we have so far:

  1. Initialize replay memory capacity.
  2. Initialize the network with random weights.
  3. For each episode:
    1. Initialize the starting state.
    2. For each time step:
      1. Select an action.
        • Via exploration or exploitation
      2. Execute selected action in an emulator.
      3. Observe reward and next state.
      4. Store experience in replay memory.
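The steps above can be sketched as a single loop. The `env` object here is a hypothetical emulator with `reset()` and `step(action)` methods, and `select_action` stands in for the exploration-exploitation choice; the network weight initialization from step 2 is only noted in a comment:

```python
from collections import deque

def gather_experience(env, select_action, num_episodes, capacity):
    """Sketch of the experience-gathering phase of DQN training."""
    memory = deque(maxlen=capacity)        # 1. initialize replay memory capacity
    # 2. network weights would be initialized here
    for _ in range(num_episodes):          # 3. for each episode:
        state = env.reset()                #    initialize the starting state
        done = False
        while not done:                    #    for each time step:
            action = select_action(state)                 # select an action
            next_state, reward, done = env.step(action)   # execute in emulator,
                                                          # observe reward and next state
            memory.append((state, action, reward, next_state))  # store experience
            state = next_state
    return memory
```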

In the next post, we're going to discover how exactly we sample from replay memory during training, as well as all the other details we need to know about training a DQN. Thanks for contributing to collective intelligence, and I'll see ya in the next one!


Sources:

  • Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto
  • Playing Atari with Deep Reinforcement Learning by DeepMind Technologies

