Reinforcement Learning - Introducing Goal Oriented Intelligence

with deeplizard.

Replay Memory Explained - Experience for Deep Q-Network Training

November 3, 2018


What’s up, guys? In this post, we’ll continue our discussion of deep Q-networks and focus on an important technique called experience replay that is utilized during the training process of a DQN. So, let’s get to it!

Last time, we covered each piece of the architecture that makes up a typical deep Q-network. Now, before we can move on to discussing exactly how a DQN is trained, we're first going to explain the concepts of experience replay and replay memory.

Experience Replay and Replay Memory

With deep Q-networks, we often utilize this technique called experience replay during training. With experience replay, we store the agent’s experiences at each time step in a data set called the replay memory. We represent the agent’s experience at time $$t$$ as $$e_t$$.

At time $$t$$, the agent's experience $$e_t$$ is defined as this tuple:

$$e_t=(s_t,a_t,r_{t+1},s_{t+1})$$

This tuple contains the state of the environment $$s_t$$, the action $$a_t$$ taken from state $$s_t$$, the reward $$r_{t+1}$$ given to the agent at time $$t+1$$ as a result of the previous state-action pair $$(s_t,a_t)$$, and the next state of the environment $$s_{t+1}$$. This tuple indeed gives us a summary of the agent’s experience at time $$t$$.
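As a minimal sketch, we could represent this experience tuple in Python with a `namedtuple`; the field names and example values here are hypothetical, just to make the structure concrete:

```python
from collections import namedtuple

# One experience e_t = (s_t, a_t, r_{t+1}, s_{t+1})
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state'])

# Example: in state 3, the agent took action 1, received reward 1.0,
# and the environment transitioned to state 4.
e = Experience(state=3, action=1, reward=1.0, next_state=4)
```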

All of the agent's experiences at each time step over all episodes played by the agent are stored in the replay memory. Well actually, in practice, we’ll usually see the replay memory set to some finite size limit, $$N$$, and therefore, it will only store the last $$N$$ experiences.

This replay memory data set is what we’ll randomly sample from to train the network. The act of gaining experience and sampling from the replay memory that stores these experiences is called experience replay.
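A minimal replay memory can be sketched with a `deque`, which automatically discards the oldest experience once the capacity $$N$$ is exceeded; the class and method names here are illustrative, not a fixed API:

```python
import random
from collections import deque

class ReplayMemory:
    """A fixed-capacity buffer holding the last N experiences."""

    def __init__(self, capacity):
        # maxlen makes the deque drop its oldest entry once full
        self.memory = deque(maxlen=capacity)

    def push(self, experience):
        """Store one experience tuple (s_t, a_t, r_{t+1}, s_{t+1})."""
        self.memory.append(experience)

    def sample(self, batch_size):
        """Return a uniformly random batch of stored experiences."""
        return random.sample(self.memory, batch_size)

    def can_sample(self, batch_size):
        return len(self.memory) >= batch_size

# Usage: a memory of capacity 3 keeps only the 3 most recent experiences.
memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push((t, 0, 0.0, t + 1))  # dummy (s, a, r, s') tuples
```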

Why use experience replay?

Why would we choose to train the network on random samples from replay memory, rather than just providing the network with the sequential experiences as they occur in the environment?

A key reason for using replay memory is to break the correlation between consecutive samples.

If the network learned only from consecutive samples of experience as they occurred sequentially in the environment, the samples would be highly correlated and would therefore lead to inefficient learning. Taking random samples from replay memory breaks this correlation.

Combining a deep Q-network with experience replay

Alright, we now have the idea of experience replay down. From last time, we should also have an understanding of a general deep Q-network architecture, the data that the network accepts, and the output from the network.

As a quick refresher, remember that the network is passed a state from the environment, and in turn, the network outputs the Q-value for each action that can be taken from that state.
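To make this input-output relationship concrete, here is a tiny hypothetical fully connected network: the dimensions, weights, and activation are made up for illustration, but the shape of the result is the point — one Q-value per action:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 4-dimensional state and 2 possible actions.
state_dim, n_actions, hidden = 4, 2, 16

# Random weight initialization for a one-hidden-layer network.
W1 = rng.normal(0, 0.1, (state_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, n_actions))
b2 = np.zeros(n_actions)

def q_values(state):
    """Forward pass: map a state to one Q-value per available action."""
    h = np.maximum(0, state @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2

q = q_values(rng.normal(size=state_dim))  # one Q-value per action
```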

Let’s now bring all of this together with experience replay to see how the pieces fit with each other.

Setting up

Before training starts, we first initialize the replay memory data set $$D$$ to capacity $$N$$. So, the replay memory $$D$$ will hold $$N$$ total experiences.

Next, we initialize the network with random weights. We've covered weight initialization in the Deep Learning Fundamentals series, so if you need a refresher on this topic, check that out. The exact same concepts we covered there apply for deep Q-network weight initialization.

Next, for each episode, we initialize the starting state of the episode. In our previous discussion, we talked about states, including the starting state, being a frame of raw pixels from a game screen as an example.

Gaining experience

Now, for each time step $$t$$ within the episode, we either explore the environment and select a random action, or we exploit the environment and select the greedy action for the given state that gives the highest Q-value. Remember, this is the exploration-exploitation trade-off that we discussed in detail in a previous post.
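This explore-or-exploit choice is exactly an epsilon-greedy selection, which we might sketch like this; the function names and dummy Q-values are assumptions for illustration:

```python
import random

def select_action(state, q_values_fn, n_actions, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: pick a random action
    q = q_values_fn(state)
    # exploit: pick the action with the highest Q-value for this state
    return max(range(n_actions), key=lambda a: q[a])

# With epsilon = 0, we always exploit these dummy Q-values.
q_fn = lambda s: [0.1, 0.9, 0.3]
a = select_action(state=None, q_values_fn=q_fn, n_actions=3, epsilon=0.0)
# a == 1, the action with the highest Q-value
```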

We then execute the selected action $$a_t$$ in an emulator. So, for example, if the selected action was to move right, then the agent would actually move right in the emulated game environment. We then observe the reward $$r_{t+1}$$ given for this action, and we also observe the next state of the environment, $$s_{t+1}$$. We then store the entire experience tuple $$e_t=(s_t,a_t,r_{t+1},s_{t+1})$$ in replay memory $$D$$.

Wrapping up

Here's a summary of what we have so far:

1. Initialize replay memory capacity.
2. Initialize the network with random weights.
3. For each episode:
   1. Initialize the starting state.
   2. For each time step:
      1. Select an action.
         - Via exploration or exploitation
      2. Execute selected action in an emulator.
      3. Observe reward and next state.
      4. Store experience in replay memory.
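The steps above can be sketched as a single loop. Everything here is a hypothetical stand-in: the environment is assumed to expose a Gym-style `reset()` and `step()` interface, and the Q-value function is assumed to come from an already-initialized network:

```python
import random
from collections import deque

def gather_experience(env, q_values_fn, n_actions, capacity=10000,
                      n_episodes=5, max_steps=100, epsilon=0.1):
    """Fill a replay memory by interacting with an emulator.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    memory = deque(maxlen=capacity)          # 1. initialize replay memory
    # 2. network weights assumed already initialized inside q_values_fn
    for _ in range(n_episodes):              # 3. for each episode:
        state = env.reset()                  #    1. initialize starting state
        for _ in range(max_steps):           #    2. for each time step:
            if random.random() < epsilon:    #       select action via
                action = random.randrange(n_actions)          # exploration...
            else:                            #       ...or exploitation
                q = q_values_fn(state)
                action = max(range(n_actions), key=lambda a: q[a])
            next_state, reward, done = env.step(action)  # execute in emulator
            memory.append((state, action, reward, next_state))  # store e_t
            state = next_state
            if done:
                break
    return memory
```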

In the next post, we're going to discover how exactly we sample from replay memory during training, as well as all the other details we need to know about training a DQN. Thanks for contributing to collective intelligence, and I'll see ya in the next one!
