In this homework you will implement Q-learning with **function approximation** for the cart pole task [3] in the OpenAI Gym environment. As in the previous homework, ignore the `done` variable; instead, terminate each episode after 500 iterations. You can consider the task solved if you consistently obtain a reward of +400 or more.
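For reference, the episode structure might look like the sketch below. This is only an illustration, assuming the classic Gym API in which `env.step` returns a four-tuple; `select_action` is a hypothetical stand-in for whatever policy you implement.

```python
import gym

env = gym.make("CartPole-v1")

observation = env.reset()
total_reward = 0.0
for t in range(500):  # terminate after 500 iterations, ignoring `done`
    action = select_action(observation)  # hypothetical policy function
    observation, reward, done, info = env.step(action)
    total_reward += reward
```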

## 2 Function Approximation

A large table is not feasible for continuous-valued variables, so we use a function instead. As you may have noticed in the first homework, you have to discretize the states to keep a table. However, discretization is cumbersome in general, since you may not know anything about the environment or how it should be discretized. What we do here instead is use a function approximator that directly gives us the action values. After all, all we need is to select the best action.
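For instance, once any approximator maps a state to a vector of action values, picking the best action reduces to an argmax (a minimal illustration; `q_function` is a hypothetical stand-in for the approximator defined below):

```python
import numpy as np

q_values = q_function(observation)  # hypothetical approximator: one value per action
action = int(np.argmax(q_values))   # greedy action: highest Q value
```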

As this task is quite easy, a linear transformation should suffice. You will observe a four-dimensional state. You will have a [4, 2] matrix *A* and a [2] vector *b* as your parameter set. The computation is:

```python
out = np.matmul(observation, A) + b
```

which will correspond to *Q*(*s, a*). Here, out will be a two-element vector, one Q value for each action. To update *A* and *b*, you need some sort of direction, i.e., supervision. Remember the update rule for Q-table learning:

$$Q^{\text{new}}(s_t, a_t) = Q(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big) \tag{1}$$

Motivated by this update rule, we will use the following function as our loss function (also known as an objective function or error function) and update our parameters with respect to it:

$$L = \Big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - \text{out}[a_t] \Big)^2 \tag{2}$$

This is also known as **temporal difference learning** [2]. Since this loss function is differentiable with respect to our parameter set, we can use gradient-based learning. You need to find the gradients *∂L/∂A* and *∂L/∂b*, which follow from the chain rule:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial \text{out}} \frac{\partial \text{out}}{\partial A} \tag{3}$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \text{out}} \frac{\partial \text{out}}{\partial b} \tag{4}$$
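As a concrete illustration of what these chain-rule terms work out to for the linear model above, consider the sketch below. It is not the required solution; it assumes the common semi-gradient convention in which the TD target is treated as a constant, and the helper name `td_gradients` is ours, not part of the assignment.

```python
import numpy as np

def td_gradients(observation, action, reward, next_observation, A, b, gamma=0.99):
    """Semi-gradient of the squared TD error in Eq. (2) w.r.t. A and b."""
    out = np.matmul(observation, A) + b            # Q(s_t, .)
    next_out = np.matmul(next_observation, A) + b  # Q(s_{t+1}, .)
    target = reward + gamma * np.max(next_out)     # r_t + gamma * max_a' Q(s_{t+1}, a')
    td_error = target - out[action]                # target is treated as a constant

    # dL/dout is zero except at the chosen action, where it equals -2 * td_error.
    dA = np.zeros_like(A)
    db = np.zeros_like(b)
    dA[:, action] = -2.0 * td_error * observation  # since dout[a]/dA[:, a] = observation
    db[action] = -2.0 * td_error                   # since dout[a]/db[a] = 1
    return dA, db
```

Note that only the column of *A* (and entry of *b*) belonging to the taken action receives a nonzero gradient, because the loss only involves out[*a_t*].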

After you have correctly calculated these gradients, you can update your parameters using stochastic gradient descent [1]:

$$A \leftarrow A - \eta \frac{\partial L}{\partial A} \tag{5}$$

$$b \leftarrow b - \eta \frac{\partial L}{\partial b} \tag{6}$$

where *η* is the learning rate. As in the previous homework, the convergence of the algorithm depends on your hyperparameter settings. One of the most important things is to clip your parameters to a range [-lim, lim] to stabilize learning.
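Putting equations (5) and (6) together with the clipping trick, a single update step might look like the following sketch. The values of `eta` and `lim` are illustrative placeholders that you would have to tune yourself, and `td_gradients` is the hypothetical helper sketched above.

```python
eta = 0.01  # learning rate (illustrative value, needs tuning)
lim = 10.0  # clipping bound (illustrative value, needs tuning)

dA, db = td_gradients(observation, action, reward, next_observation, A, b)
A = np.clip(A - eta * dA, -lim, lim)  # Eq. (5), then clip to [-lim, lim]
b = np.clip(b - eta * db, -lim, lim)  # Eq. (6), then clip to [-lim, lim]
```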

## References

1. Stochastic Gradient Descent. https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum.
2. Temporal Difference Learning. https://en.wikipedia.org/wiki/Temporal_difference_learning.
3. Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. *IEEE Transactions on Systems, Man, and Cybernetics*, (5):834–846, 1983.
