Reinforcement Learning
1 Introduction

The Actor-Critic method reduces the variance of the Monte-Carlo policy gradient by directly estimating the action-value function. The goal of this assignment is to experiment with Asynchronous Advantage Actor-Critic (A3C). Owing to the inherent advantage of policy gradient methods, the Actor-Critic method can handle environments with continuous action spaces. However, a naive application of the Actor-Critic method with neural network approximation is unstable on challenging problems. In this assignment, you are required to train an agent with a continuous action space and have some fun in some classical RL continuous control scenarios.

2 Actor-Critic Algorithm

In the original policy gradient $\nabla_\theta \log \pi_\theta(s_t, a_t) \, v_t$, the return $v_t$ is an unbiased estimate of the expected long-term value $Q^{\pi}(s, a)$ under the policy $\pi_\theta(s)$ (the Actor). However, the original policy gradient suffers from high variance. The Actor-Critic algorithm uses an action-value function $Q_w(s, a)$, called the Critic, to estimate $Q^{\pi}(s, a)$.

In A3C, we maintain several local agents and one global agent. Instead of using experience replay, we execute all the local agents asynchronously in parallel, and the parameters of the global agent are updated from the experience of all the local agents.

The pseudocode is listed at the end of the article. For more details of the algorithm, refer to the original paper:

Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]//International Conference on Machine Learning. 2016: 1928-1937.

3 Experiment Description

Programming language: python3.

You are required to implement the A3C algorithm. You should test your agent in a classical RL control environment, Pendulum. OpenAI Gym provides this environment, which is implemented in Python (https://github.com/openai/gym/wiki/Pendulum-v0).
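The article's pseudocode is referenced above but not reproduced here. As a rough orientation only, the following is a minimal single-process sketch, assuming PyTorch and the classic gym reset/step API, of the advantage actor-critic update that each A3C worker would perform on Pendulum. The asynchronous part (several workers copying the global parameters, computing gradients locally, and pushing them to the shared global network) is omitted, and all class and variable names (PolicyValueNet, etc.) are illustrative rather than part of the assignment.

```python
# Illustrative sketch only: a single-worker advantage actor-critic update on Pendulum.
# Assumes PyTorch and the classic gym API used by Pendulum-v0 (newer gym/gymnasium
# uses Pendulum-v1 and returns extra values from reset/step).
import gym
import torch
import torch.nn as nn
from torch.distributions import Normal
from torch.optim import Adam


class PolicyValueNet(nn.Module):
    """Shared trunk with a Gaussian policy head (actor) and a value head (critic)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)          # mean of the Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(hidden, 1)             # state-value estimate V(s)

    def forward(self, obs):
        h = self.trunk(obs)
        dist = Normal(self.mu(h), self.log_std.exp())
        return dist, self.value(h).squeeze(-1)


env = gym.make("Pendulum-v0")
net = PolicyValueNet(env.observation_space.shape[0], env.action_space.shape[0])
opt = Adam(net.parameters(), lr=1e-3)
gamma = 0.9

obs = env.reset()
for step in range(200):                               # one short rollout for illustration
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    dist, value = net(obs_t)
    action = dist.sample()
    next_obs, reward, done, _ = env.step(action.numpy())

    # One-step advantage r + gamma * V(s') - V(s); the critic supplies the baseline.
    with torch.no_grad():
        _, next_value = net(torch.as_tensor(next_obs, dtype=torch.float32))
        target = reward + gamma * next_value * (0.0 if done else 1.0)
    advantage = target - value

    # Actor loss: -log pi(a|s) * advantage (advantage treated as a constant);
    # critic loss: squared TD error, which trains V(s) toward the bootstrapped target.
    actor_loss = -dist.log_prob(action).sum() * advantage.detach()
    critic_loss = advantage.pow(2)
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()                                        # in A3C these gradients update the global net

    obs = env.reset() if done else next_obs
```

A Gaussian policy head is one common way to handle Pendulum's continuous torque action; the critic here estimates $V(s)$ and the one-step TD error serves as the advantage, in the spirit of the A3C paper, which uses a value-function baseline and n-step returns.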