In this assignment we will implement deep Q learning, following DeepMinds paper ([mnih2015human] and [mnih-atari-2013]) that learns to play Atari from raw pixels. The purpose is to understand the effectiveness of deep neural network as well as some of the techniques used in practice to stabilize training and achieve better performance. Youll also have to get comfortable with Tensorflow. We will train our networks on the Pong-v0 environment from OpenAI gym, but the code can easily be applied to any other environment.
In Pong, one player wins if the ball passes by the other player. Winning a game gives a reward of 1, while losing gives a negative reward of -1. An episode is over when one of the two players reaches 21 wins. Thus, the final score is between -21 (lost episode) or +21 (won episode). Our agent plays against a decent hardcoded AI player. Average human performance is 3 (reported in [mnih-atari-2013]). If you go to the end of the homework successfully, you will train an AI agent with super-human performance, reaching at least +10 (hopefully more!).
1 Test Environment
Before running our code on Pong, it is crucial to test our code on a test environment. You should be able to run your models on CPU in no more than a few minutes on the following environment:
- 4 states: 0,1,2,3
- 5 actions: 0,1,2,3, Action 0 i 3 goes to state i, while action 4 makes the agent stay in the same state.
- Rewards: Going to state i from states 0, 1, and 3 gives a reward R(i), where R(0) = 0.1,R(1) = 0.2,R(2) = 0,R(3) = 0. If we start in state 2, then the rewards defind above are multiplied by 10. See Table 1 for the full transition and reward structure.
1
- One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the initial state).
State (s) | Action (a) | Next State (s) | Reward (R) |
0 | 0 | 0 | 0.1 |
0 | 1 | 1 | -0.2 |
0 | 2 | 2 | 0.0 |
0 | 3 | 3 | -0.1 |
0 | 4 | 0 | 0.1 |
1 | 0 | 0 | 0.1 |
1 | 1 | 1 | -0.2 |
1 | 2 | 2 | 0.0 |
1 | 3 | 3 | -0.1 |
1 | 4 | 1 | -0.2 |
2 | 0 | 0 | -1.0 |
2 | 1 | 1 | 2.0 |
2 | 2 | 2 | 0.0 |
2 | 3 | 3 | 1.0 |
2 | 4 | 2 | 0.0 |
3 | 0 | 0 | 0.1 |
3 | 1 | 1 | -0.2 |
3 | 2 | 2 | 0.0 |
3 | 3 | 3 | -0.1 |
3 | 4 | 3 | -0.1 |
Table 1: Transition table for the Test Environment
An example of a path (or an episode) in the test environment is shown in Figure 1, and the trajectory can be represented in terms of st,at,Rt as: s0 = 0,a0 = 1,R0 = 0.2,s1 = 1,a1 = 2,R1 = 0,s2 = 2,a2 = 4,R2 = 0,s3 = 2,a3 = 3,R3 = (0.1) (10) = 1,s4 = 3,a4 = 0,R4 = 0.1,s5 = 0.
Figure 1: Example of a path in the Test Environment
- (written 5pts) What is the maximum sum of rewards that can be achieved in a single episode in the test environment, assuming = 1?
2 Q-learning
Tabular setting In the tabular setting, we maintain a table Q(s,a) for each tuple state-action. Given an experience sample (s,a,r,s0), our update rule is
, (1)
where R is the learning rate, the discount factor.
Approximation setting Due to the scale of Atari environments, we cannot reasonably learn and store a Q value for each state-action tuple. We will instead represent our Q values as a function q(s,a,w) where w are parameters of the function (typically a neural networks weights and bias parameters). In this approximation setting, our update rule becomes
w . (2)
In other words, we are try to minimize
(3)
Target Network DeepMinds paper [mnih2015human] [mnih-atari-2013] maintains two sets of parameters, w (to compute q(s,a)) and w (target network, to compute q(s0,a0)) such that our update rule becomes
w . (4)
The target networks parameters are updated with the Q-networks parameters occasionally and are kept fixed between individual updates. Note that when computing the update, we dont compute gradients with respect to w (these are considered fixed weights).
Replay Memory As we play, we store our transitions (s,a,r,s0) in a buffer. Old examples are deleted as we store new transitions. To update our parameters, we sample a minibatch from the buffer and perform a stochastic gradient descent update.
-Greedy Exploration Strategy During training, we use an -greedy strategy. DeepMinds paper [mnih2015human] [mnih-atari-2013] decreases from 1 to 0.1 during the first million steps. At test time, the agent choses a random action with probability
There are several things to be noted:
- In this assignment, we will update w every learningfreq steps by using a minibatch of experiences sampled from the replay buffer.
- DeepMinds deep Q network takes as input the state s and outputs a vector of size = number of actions.
In the Pong environment, we have 6 actions, thus q(s,w) R6.
- The input of the deep Q network is the concatenation 4 consecutive steps, which results in an input after preprocessing of shape (80 80 4).
We will now examine these assumptions and implement the epsilon-greedy strategy.
- (written 3pts) What is one benefit of using experience replay?
- (written 3pts) What is one benefit of the target network?
- (written 3pts) What is one benefit of representing the Q function as q(s,w) RK
- (coding 3pts) Implement the getaction and update functions in py. Test your implementation by running python q1schedule.py.
3 Linear Approximation
- (written 3pts) Show that Equations (1) and (2) from section 2 above are exactly the same when q(s,a,w) = wTx(s,a), where w R|S||A| and x : S A R|S||A| such that
1 if s0 = s,a0 = a
s,a s0,a0
0 otherwise
for all (s,a) SA, x(s,a) is a vector of length |S||A| where the element corresponding to s0 S,a0 A is 1 when s0 = s,a0 = a and is 0 otherwise.
- (written 3pts) Derive the gradient with regard to the value function parameter w Rn given q(s,a,w) = wTx(s,a) for any function x(s,a) 7 x Rn and write the update rule for w.
- (coding 15pts) We will now implement linear approximation in Tensorflow. This question will setup the whole pipeline for the remiander of the assignment. Youll need to implement the following functions in py (pleasd read throughq2linear.py) :
- addplaceholdersop
- getqvaluesop
- addupdatetargetop
- addlossop
- addoptimizerop
Test your code by running python q2linear.py locally on CPU. This will run linear approximation with Tensorflow on the test environment. Running this implementation should only take a minute or two.
- (written 5pts) Do you reach the optimal achievable reward on the test environment? Attach the plot png from the directory results/q2linear to your writeup.
4 Implementing DeepMinds DQN
- (coding 10pts) Implement the deep Q-network as described in [mnih2015human] by implementing getq valuesop in py. The rest of the code inherits from what you wrote for linear approximation. Test your implementation locally on CPU on the test environment by running python q3nature.py. Running this implementation should only take a minute or two.
- (written 5pts) Attach the plot of scores, png, from the directory results/q3nature to your writeup. Compare this model with linear approximation. How do the final performances compare?
How about the training time?
5 DQN on Atari
The Atari environment from OpenAI gym returns observations (or original frames) of size (2101603), the last dimension corresponds to the RGB channels filled with values between 0 and 255 (uint8). Following DeepMinds paper [mnih2015human], we will apply some preprocessing to the observations:
- Single frame encoding: To encode a single frame, we take the maximum value for each pixel color value over the frame being encoded and the previous frame. In other words, we return a pixel-wise max-pooling of the last 2 observations.
- Dimensionality reduction: Convert the encoded frame to grey scale, and rescale it to (80 80 1). (See Figure 2)
The above preprocessing is applied to the 4 most recent observations and these encoded frames are stacked together to produce the input (of shape (80804)) to the Q-function. Also, for each time we decide on an action, we perform that action for 4 time steps. This reduces the frequency of decisions without impacting the performance too much and enables us to play 4 times as many games while training. You can refer to the Methods Section of [mnih2015human] for more details.
(a) Original input (210 160 3) with RGB colors (b) After preprocessing in grey scale of shape (80801) Figure 2: Pong-v0 environment
- (written 2pts) Why do we use the last 4 time steps as input to the network for playing Atari games?
- (written 5pts) Whats the number of parameters of the DQN model (for Pong) if the input to the Q-network is a tensor of shape (80,80,4) and we use SAME padding? How many parameters are required for the linear Q-network, assuming the input is still of shape (80,80,4)? How do the number of parameters compare between the two models?
- (coding and written 5pts). Now, were ready to train on the Atari Pong-v0 First, launch linear approximation on pong with python q4trainatarilinear.pyon Azures GPU. This will train the model for 500,000 steps and should take approximately an hour. What do you notice about the performance?
- (coding and written 10 pts). In this question, well train the agent with DeepMinds architecture on the Atari Pong-v0 Run python q5trainatarinature.py on Azures GPU. This will train the model for 5 million steps and should take around 12 hours. Attach the plot scores.png from the directory results/q5trainatarinature to your writeup. You should get a score of around 13-15 after 5 million time steps. As stated previously, the Deepmind paper claims average human performance is 3.
As the training time is roughly 12 hours, you may want to check after a few epochs that your network is making progress. The following are some training tips: If you terminate your terminal session, the training will stop. In order to avoid this, you should use screen to run your training in the background.
- The evaluation score printed on terminal should start at -21 and increase.
- The max of the q values should also be increasing
- The standard deviation of q shouldnt be too small. Otherwise it means that all states have similar q values
- You may want to use Tensorboard to track the history of the printed metrics. You can monitor your training with Tensorboard by typing the command tensorboard logdir=results and then connecting to ip-of-you-machine:6006. Below are our Tensorboard graphs from one training session:
- (written 5pts) Compare the performance of the DeepMind architecure with the linear Q-network approximation. How can you explain the gap in performance?
6 Real world RL with neural networks
Given a stream of batches of n environment interactions (si,ai,ri,s0i) we want to learn the optimal value function using a neural network. The underlying MDP has a finite sized action space.
- (written 4pts) Your friend first suggests the following approach
- Initialize parameters of neural network V
- For each batch of k tuples () do Stochastic Gradient Descent with loss function V(si)|2 where yi = maxai[ri + V(s0i)]
What is the problem with this approach? (Hint: Think about the type of data we have)
- (written 3pts) Your friend now suggests the following
- Initialize parameters of neural network for state-action value function Q(s,a)
- For each batch of k tuples (si,ai,ri,s0i) do Stochastic Gradient Descent with loss function Q(si,ai)|2 where
Now as we just have the network Q(s,a) how would you determine V (s) needed for the above training procedure?
- (written 3pts) Is the above method of learning the Q network guaranteed to give us an approximation of the optimal state action value function?
Reviews
There are no reviews yet.