In this homework you will implement Q-learning with function approximation for the cart pole task  in OpenAI Gym environment. As in previous homework, do not care about done variable. Terminate the episode after 500 iterations. You can consider the task is solved if you consistently get +400 reward.
2 Function Approximation
Instead of using a large table which is not feasible for continuous-valued variables, we can use a function. As you might have noticed in the first homework, you have to discretize states to keep a table. However, it is cumbersome in general since you might not know anything about the environment, how to discretize and so on. What we do here instead is using a function approximator which will directly give us action values. After all, all we need is to select the best action.
As this task is quite easy, a linear transformation should suffice. You will observe a fourdimensional state. You will have a [4, 2] sized matrix A, and  sized vector b as your parameter set. The computation is:
out = np.matmul(observation, A) + b
which will correspond to Q(s,a). Here, out will be two-dimensional, one Q value for each action. To update A and b, you need some sort of direction, supervision. Remember the update rule for Q-table learning:
Qnew(st,at) = Q(st,at) + α(rt + γ maxQ(st+1,a0) − Q(st,at)) (1)
Motivated from this update rule, we will use the following function as our loss function (also known as: objective function, error function) and update our parameters with respect to this loss function:
This is also known as temporal difference learning . Since this loss function is differentiable with respect to our parameter set, we can use gradient-based learning. You need to find the
gradients, ∂L/∂A, ∂L/∂b.
∂L ∂L ∂out
∂A ∂out ∂A
∂L ∂L ∂out
∂b ∂out ∂b
After you correctly calculate these gradients, you can update your parameters using stochastic gradient descent.
A (5) ∂A
where η is the learning rate. As in the previous homework, the convergence of the algorithm depends on your hyperparameter settings. One of the most important thing is to clip your parameters into a range, [-lim, lim], to stabilize learning.
- Stochastic Gradient Descent. https://en.wikipedia.org/wiki/Stochastic_gradient_ descent#Momentum.
- Temporal Difference Learning. https://en.wikipedia.org/wiki/Temporal_difference_ learning.
- Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, (5):834–846, 1983.