For this assignment, you will work with the Pushing2D-v1 environment. Note that this environment is slightly different from the one you used in the previous homework: the state is now 10-dimensional and includes the velocities of the pusher and the box. To recap, you can create the environment with gym as follows:
import gym
import envs
env = gym.make("Pushing2D-v1")
Then you can interact with the environment as you usually do. As in Homework 4, the environment is considered solved once the rate of success (i.e., the box reaching the goal within the episode) reaches 90%. A render option is provided in the step function for visualization.
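For reference, a minimal interaction loop through the standard gym interface might look like the sketch below; the random-action policy is only a placeholder and is not part of the assignment.

# Minimal sketch of interacting with Pushing2D-v1 through the standard gym API.
# The random-action policy below is only a placeholder.
import gym
import envs  # registers Pushing2D-v1

env = gym.make("Pushing2D-v1")
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # placeholder policy
    state, reward, done, info = env.step(action)  # step the environment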
1 Model-Based Reinforcement Learning with PETS
For this section, you will implement a model-based reinforcement learning (MBRL) method called PETS which stands for probabilistic ensemble and trajectory sampling [1]. There are 3 main components to MBRL with PETS:
- Probabilistic ensemble of networks: you will be using probabilistic networks that output a distribution over the resulting states given a state and action pair.
- Trajectory sampling: propagate hallucinated trajectories through time by passing hypothetical state-action pairs through different networks of the ensemble.
- Planning with model predictive control: Use the trajectory sampling method along with a cost function to perform planning and select good actions.
We will start with the third component. You will first implement CEM and MPC and test them on the ground-truth dynamics. Then, you will replace the ground-truth dynamics with a learned, probabilistic ensemble dynamics model. We recommend reading through all the questions before you start the implementation.
1.1 Planning with the Cross Entropy Method (CEM) [20 pts]
Before learning the model, let us first understand how we obtain a policy from a dynamics model p(s_{t+1} | s_t, a_t). One approach is to train a model-free policy on imagined trajectories generated by the model. Another, simpler approach is random shooting, where an action sequence a_{t:t+T} is optimized on the model to minimize a cost; this often works well. The Cross Entropy Method (CEM) is one of the most popular random shooting algorithms. To make things precise, we show CEM in Algorithm 1. Note that when updating the variance of the actions, we assume that the actions across time steps are independent.

Algorithm 1 Cross Entropy Method (CEM)
1: procedure CEM(population size M, # elites e, # iters I, initial mean μ, initial std σ, plan horizon T, action dimensionality A)
2:   for i in 1 : I:
3:     Sample M action sequences a_{1:T,m} according to μ and σ from a normal distribution N(μ, Σ),
4:       where a_{1:T,m}, μ ∈ R^{AT} and Σ ∈ R^{AT×AT} is a diagonal covariance matrix.
5:     for m in 1 : M:
6:       Sample a trajectory τ_m = (s_1, a_1, ..., a_T, s_{T+1})_m using the dynamics
7:         model p(s_{t+1} | s_t, a_t) and the action sequence a_{1:T,m}, where s_t ∈ R^S and a_t ∈ R^A.
8:       Calculate the cost of a_{1:T,m} based on τ_m.
9:     Update μ and σ using the top e action sequences.
10:    return μ
11: end procedure
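A minimal NumPy sketch of Algorithm 1 is given below for reference. The cost_of_sequence callback (which rolls an action sequence out through the dynamics model and returns its cost) and the flat action-sequence layout of shape (T*A,) are assumptions made for illustration, not the starter code's actual interface.

import numpy as np

def cem(cost_of_sequence, mu, sigma, M=200, num_elites=20, num_iters=5):
    # mu, sigma: arrays of shape (T*A,); the covariance is diagonal, so actions
    # across time steps are treated as independent, as noted above.
    for _ in range(num_iters):
        # Sample M action sequences from N(mu, diag(sigma^2)).
        samples = mu + sigma * np.random.randn(M, mu.shape[0])
        # Evaluate each sequence under the dynamics model.
        costs = np.array([cost_of_sequence(seq) for seq in samples])
        # Refit mu and sigma to the e lowest-cost ("elite") sequences.
        elites = samples[np.argsort(costs)[:num_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0)
    return mu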
Cost Function In Line 8 of Algorithm 1, we need a cost function to evaluate the fitness of different state and action pairs. Defining the right cost function is often the hardest part of getting model-based reinforcement learning to work, since the actions selected by CEM and the resulting trajectories depend on it. For this homework, you will be using the following cost function:
cost(pusher, box, goal, a) = d(pusher, box) + 2 d(box, goal) + 5 ‖(goal_x, goal_y)ᵀ‖
where d(p, q) is the Euclidean distance between points p ∈ R² and q ∈ R². This function is already implemented in obs_cost_fn() of mpc.py. Given this state-based cost, you can obtain the cost of a trajectory by summing the costs of all the states within that trajectory.
Policy from CEM When sampling trajectories using a CEM policy, given a state we can run Algorithm 1 to generate an action sequence of length T, which we could use for the next T time steps. On top of CEM, one thing that can be done to give better planning results is Model Predictive Control (MPC), where we only use the first action of the planned sequence and re-run CEM at every time step. The pseudocode is shown in Algorithm 2. MPC starts with an initial μ and σ and uses them as input to CEM. We then take the updated μ from CEM and execute its first action in the environment to obtain the new state used for MPC at the next time step. The μ used at the next time step is initialized to the μ returned by CEM for the remaining steps of the plan horizon, with the last time step initialized to 0. Finally, we return all the state transitions gathered so that they can be appended to the dataset D.
Dynamics Model For Problem 1.1, use the ground-truth dynamics, which can be accessed through the env.get_nxt_state function; it takes in the current state and action and generates the next state.
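As a reference for predict_next_state_gt and the cost computation, a minimal sketch is shown below; the flat (T*A,) action-sequence layout and the obs_cost_fn argument are assumptions made for illustration, not the starter code's actual interface.

import numpy as np

def ground_truth_sequence_cost(env, obs_cost_fn, state, action_seq, action_dim=2):
    # Roll the action sequence forward with the ground-truth dynamics and
    # accumulate the per-state cost along the way.
    total_cost, s = 0.0, state
    for a in action_seq.reshape(-1, action_dim):  # one action per time step
        s = env.get_nxt_state(s, a)               # ground-truth next state
        total_cost += obs_cost_fn(s)
    return total_cost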
The hyper-parameters are summarized in Table 1.
Parameter | Value
Population size M | 200
Action dimensionality A | 2
State dimensionality S | 8
# elites e | 20
# iters I | 5
Plan horizon T | 5
Initial mean μ | 0 ∈ R^{AT}
Initial std σ | 0.5 I ∈ R^{AT×AT}
Table 1: Summarized hyper-parameters.
Algorithm 2 Generating an episode using MPC
1: procedure MPC(env, plan horizon T)
2:   transitions = []
3:   s = env.reset()
4:   μ = 0, σ = 0.5 I
5:   while not done:
6:     μ = CEM(200, 20, 5, μ, σ)
7:     a = μ[0, :]
8:     s′ = env.step(a)
9:     transitions.append((s, a, s′))
10:    s = s′
11:    μ = μ[1 : T].append(0)
12:  return transitions
13: end procedure
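A minimal sketch of Algorithm 2, reusing the cem sketch from Section 1.1, is shown below. The plan_cost_fn(state, action_sequence) closure and the flat (T*A,) layout are assumptions about the interface, not the starter code's actual signatures.

import numpy as np

def mpc_episode(env, plan_cost_fn, T=5, A=2):
    # plan_cost_fn(state, action_sequence) should return the cost of rolling
    # the sequence out from `state` under the dynamics model.
    transitions = []
    s, done = env.reset(), False
    mu, sigma = np.zeros(T * A), 0.5 * np.ones(T * A)
    while not done:
        # Re-plan from the current state at every time step (MPC).
        mu = cem(lambda seq: plan_cost_fn(s, seq), mu, sigma)
        a = mu[:A]                                 # execute only the first action
        s_next, reward, done, info = env.step(a)
        transitions.append((s, a, s_next))
        s = s_next
        # Warm start the next plan: shift by one step and zero the new tail.
        mu = np.concatenate([mu[A:], np.zeros(A)])
    return transitions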
- Before you begin to implement, we recommend going through the starter code in a top-down manner to see how its components work together. For this question, implement CEM (the solve function) in cem.py. You also need to implement the act and predict_next_state_gt functions in mpc.py. Then test it on the Pushing2D-v1 environment. The CEM policy you implement will also be used later for planning over a learned dynamics model. For all the questions in 1.1, we provide starter code in the ExperimentGTDynamics class in run.py. All the hyper-parameters are provided in the code. Report the percentage of successes over 50 episodes.
- Instead of CEM, plan with random action sequences, where each action is generated independently from a normal distribution N(0, 0.5I), where I is the identity matrix. Use the same number of trajectories for planning as CEM, i.e., for each state, sample M*I trajectories of length T and pick the best one. Implement this random planning method (its solve function) in the corresponding starter file. Report the percentage of successes over 50 episodes. How does its performance compare to CEM? (A minimal sketch of this random-shooting planner is given after this list.)
- Implement MPC for sampling trajectories in the act function in mpc.py.
- We provide another environment, Pushing2DNoisyControl-v1, in which noise is added to the controls. Test the two algorithms, CEM and CEM+MPC, on both environments, and report the percentage of successes over 50 episodes for each of them.
- Which algorithm performs better on which environments? Why? Discuss the pros and cons of MPC.
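As referenced in the second question above, a minimal sketch of the random-shooting planner is given below; cost_of_sequence and the flat (T*A,) layout are the same illustrative assumptions as in the CEM sketch, and whether 0.5 is interpreted as a standard deviation or a covariance should follow your CEM implementation.

import numpy as np

def random_shooting(cost_of_sequence, T=5, A=2, num_sequences=200 * 5, std=0.5):
    # Sample num_sequences independent action sequences with each action drawn
    # from a zero-mean normal, then return the sequence with the lowest cost.
    samples = std * np.random.randn(num_sequences, T * A)
    costs = np.array([cost_of_sequence(seq) for seq in samples])
    return samples[np.argmin(costs)]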
1.2 Probabilistic Ensemble and Trajectory Sampling (PETS) [70 pts]
Probabilistic Ensemble
Now, we will first learn a single probabilistic dynamics model, but we present the algorithm in its general form of learning a probabilistic ensemble.
You are provided starter code in model.py that specifies the model architecture that you will be using for each member of the ensemble. Specifically, each network is a fully connected network with 3 hidden layers of 400 hidden units each. If you have trouble running this network, a smaller network may work as well, but may require additional hyperparameter tuning. The starter code also includes a method for calculating the output of each network, which returns the mean and log variance of the resulting states.
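As an illustration only, one way to lay out such a network with mean and log-variance heads is sketched below using tf.keras; the provided model.py uses its own (session-based) structure, so treat the names, activation, and API here as assumptions.

import tensorflow as tf

def build_member(state_dim, action_dim=2, hidden_units=400):
    # Input: concatenated (state, action); outputs: mean and log variance of
    # the predicted next-state distribution. The ReLU activation is an assumption.
    inputs = tf.keras.Input(shape=(state_dim + action_dim,))
    x = inputs
    for _ in range(3):  # 3 hidden layers of 400 units
        x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
    mean = tf.keras.layers.Dense(state_dim)(x)
    log_var = tf.keras.layers.Dense(state_dim)(x)
    return tf.keras.Model(inputs, [mean, log_var])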
The training routine for the ensemble is shown in Algorithm 3. Consider defining a list of operations that you can run simultaneously with a single call to Session.run(). This will help speed up training significantly.
Algorithm 3 Training the Probabilistic Ensemble
1: procedure train(data D, # networks N, # epochs E)
2: Sample |D| transitions from D for each network (sample with replacement).
3: for e in 1 : E:
4: for n in 1 : N:
5: Shuffle the sampled transitions for network n.
6: Form batches of size 128.
7: Loop through batches and take a gradient step of the loss for each batch.
8: end procedure
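A NumPy-style sketch of Algorithm 3 follows; train_on_batch(n, batch), standing in for one gradient step on network n's loss, is a placeholder and not the starter code's actual interface.

import numpy as np

def train_ensemble(data, train_on_batch, num_networks, num_epochs, batch_size=128):
    # Each network trains on its own bootstrap sample of |D| transitions
    # (drawn with replacement), reshuffled every epoch.
    D = len(data)
    bootstraps = [np.random.randint(0, D, size=D) for _ in range(num_networks)]
    for epoch in range(num_epochs):
        for n in range(num_networks):
            idx = np.random.permutation(bootstraps[n])
            for start in range(0, D, batch_size):
                batch = [data[i] for i in idx[start:start + batch_size]]
                train_on_batch(n, batch)  # one gradient step on this batch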
Single probabilistic network Implement and train a single probabilistic network for 100 epochs on transitions from 1000 randomly sampled episodes, and answer the following questions. You need to implement the functions in model.py and the predict_next_state_model function in mpc.py.
Note: As the dynamics model is now probabilistic, when planning over the model using CEM (i.e. when generating trajectories from the dynamics model), you should sample P trajectories and use the average cost of the P trajectories. In this homework, we recommend using P = 6.
- The loss that you should use to train each network is the negative log likelihood of the actual resulting state under the predicted mean and variance from the network. Given state transition pairs (s_t, a_t, s_{t+1}), assume the output distribution of the network is a Gaussian N(μ(s_t, a_t), Σ(s_t, a_t)), where μ and Σ are outputs of the network. Derive the loss function L_Gauss for training the network. Note that both μ and Σ depend on the input state and action. (A reference form of this loss is given after this list.)
- Plot the loss and RMSE vs number of epochs trained for the single network.
- Combine your model with planning using randomly sampled actions + MPC. Evaluate the performance of your model when planning using a time horizon of 5 and 1000 possible action sequences. Do this by reporting the percent successes on 50 episodes.
- Combine your model with planning using CEM+MPC. Evaluate the performance of your model when planning using a time horizon of 5, a population of 200, 20 elites, and 5 epochs. Do this by reporting the percent successes on 50 episodes.
- Which planning method performs better, random or CEM? How did MPC using this model perform in general? When is the derived policy able to succeed and when does it fail?
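For checking your derivation in the first question above, one standard form of the Gaussian negative log likelihood (assuming a diagonal covariance, and up to a factor of 1/2 and additive constants) is

L_Gauss = Σ_n [ (μ_θ(s_n, a_n) − s_{n+1})ᵀ Σ_θ⁻¹(s_n, a_n) (μ_θ(s_n, a_n) − s_{n+1}) + log det Σ_θ(s_n, a_n) ],

where μ_θ and Σ_θ are the mean and covariance output by the network for the input (s_n, a_n).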
Trajectory Sampling
With an ensemble of probabilistic dynamics models, how do we sample trajectories from the ensemble model? One approach, called TS1 sampling, is to randomly pick one of the dynamics models at each time step of the trajectory. Algorithm 4 shows how TS1 determines which network to use at each time step. It is only one component of model predictive control and will be combined with the model and the planning method. Implement TS1 in the predict_next_state_model function in mpc.py.
Similar to the single-network case, we will sample P trajectories, or particles, each of which represents a trajectory of length T.
Algorithm 4 Trajectory Sampling with TS1
1: procedure TS1(# networks N, # particles P, plan horizon T)
2: Initialize an array S of dimension P × T to store the network assignments for each particle.
3: for p in 1 : P:
4: Randomly sample a sequence s of length T, where s ∈ {1, ..., N}^T.
5: Set S[p,:] = s.
6: end procedure
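A minimal sketch of TS1's bookkeeping is shown below; how the assignments are consumed when propagating the P particles (whose average cost is used for planning, as noted above) is left to your predict_next_state_model implementation.

import numpy as np

def ts1_assignments(num_networks, num_particles, plan_horizon):
    # S[p, t] is the index of the ensemble member used to propagate particle p
    # at time step t; each entry is resampled independently and uniformly.
    return np.random.randint(0, num_networks, size=(num_particles, plan_horizon))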
Algorithm 5 MBRL with PETS
1: procedure MBRL(# of epochs I)
2: Initialize empty data array D and initialize probabilistic ensemble (PE) of models.
3: Sample 100 episodes from the environment using random actions and store into D.
4: Train ensemble of networks for 10 epochs.
5: repeat for I epochs:
6: Sample 1 episode using MPC and latest PE and add to D.
7: Train PE for 5 epochs over D.
8: end procedure
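A sketch of the outer loop in Algorithm 5 is given below; sample_random_episode, sample_mpc_episode, and train_ensemble are placeholders for the pieces described above and implemented in the starter code.

def pets(env, sample_random_episode, sample_mpc_episode, train_ensemble, num_pets_epochs=500):
    # sample_random_episode(env) / sample_mpc_episode(env) return lists of
    # transitions; train_ensemble(D, epochs) fits the probabilistic ensemble on D.
    D = []
    for _ in range(100):                    # 100 warm-up episodes with random actions
        D.extend(sample_random_episode(env))
    train_ensemble(D, epochs=10)            # initial fit
    for _ in range(num_pets_epochs):
        D.extend(sample_mpc_episode(env))   # one episode with MPC + the latest PE
        train_ensemble(D, epochs=5)         # retrain on the aggregated dataset
    return D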
MBRL with PETS
- Describe in detail your implementation of PETS, including how you implement CEM, MPC, and TS1, and any additional hyper-parameters you needed to tune beyond the defaults. Your response should walk the reader through your submitted code so that they can understand the key components of your implementation and run your code with different arguments.
- Run Algorithm 5 for 500 epochs (i.e., collect 100 initial episodes and then 500 episodes using MPC). Plot the loss and RMSE vs. the number of epochs trained. This can take hours to run, so read through all the questions below before you start this experiment.
- Every 50 epochs, test your model with both CEM+MPC and random actions+MPC on 20 episodes and report the percentage of successes. Plot this as a function of the number of PETS epochs. Which planning algorithm performs better? Combine this result with your observations from the single dynamics model and the ground-truth dynamics cases, and discuss the comparison between CEM+MPC and random actions+MPC. Make sure to include both your observations and your explanation.
- What are some limitations of MBRL? Under which scenarios would you prefer MBRL to policy gradient methods like the ones you implemented in the previous homeworks?
- Theoretical Questions
- Aleatoric vs. epistemic uncertainty. Aleatoric uncertainty, also known as statistical uncertainty, represents random variability that is unpredictable, such as the outcome of a dice roll; it cannot be resolved by gathering more information. Epistemic uncertainty, also known as systematic uncertainty, can be resolved by gathering more information. In the case of model-based RL, aleatoric uncertainty refers to uncertainty due to stochasticity in the environment, while epistemic uncertainty refers to model error due to the limited capacity of the neural network or the difficulty of optimization. In the PETS framework, describe how we can measure the aleatoric and epistemic uncertainty. Make sure your description is specific enough that a reader would be able to write a practical algorithm to measure the two uncertainties based on it.
- What is the failure mode if the aleatoric uncertainty is not considered? What is the failure mode if the epistemic uncertainty is not considered? In other words, describe in what ways a model-based agent will fail in these two cases.
Important Implementation Advice
It takes quite a while to run the experiments in this homework. Please plan accordingly and get started early! Please turn your homework in on time to facilitate grading! Again, to make debugging easier, you should implement your code piecewise and test each component as you build up your implementation, i.e., test the probabilistic ensemble, test the trajectory sampling, test the cross entropy method, test the cost calculation, test the MPC, etc.