# [Solved] CMPE58y Homework2-Q-learning with function approximation

In this homework you will implement Q-learning with function approximation for the cart pole task in the OpenAI Gym environment. As in the previous homework, ignore the done variable and terminate each episode after 500 iterations. You can consider the task solved if you consistently get a reward of +400.

## 2 Function Approximation

Instead of using a large table, which is not feasible for continuous-valued variables, we can use a function. As you might have noticed in the first homework, you have to discretize states to keep a table. However, this is cumbersome in general, since you might not know anything about the environment or how to discretize it. What we do here instead is use a function approximator that directly gives us action values. After all, all we need is to select the best action.

As this task is quite easy, a linear transformation should suffice. You will observe a four-dimensional state. Your parameter set will be a [4, 2] sized matrix A and a [2] sized vector b. The computation is:

out = np.matmul(observation, A) + b

which will correspond to Q(s,a). Here, out will be two-dimensional, with one Q value for each action. To update A and b, you need some sort of direction or supervision. Remember the update rule for Q-table learning:

$$Q_{\text{new}}(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right) \tag{1}$$
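As a reminder of how the tabular rule works in practice, here is a minimal sketch of one update step. The function name `q_table_update`, the table shape, and the numeric values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def q_table_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # TD error: r_t + gamma * max_a' Q(s_{t+1}, a') - Q(s_t, a_t)
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table: 3 discretized states, 2 actions.
Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]
td = q_table_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

With these made-up numbers the TD error is 1.0 + 0.9 * 2.0 - 0.0, and half of it (alpha = 0.5) is added to the entry for state 0, action 1.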

Motivated by this update rule, we will use the following function as our loss function (also known as the objective function or error function) and update our parameters with respect to this loss function:

$$L = \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - \text{out}[a_t] \right)^2 \tag{2}$$

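This is also known as temporal difference learning. Since this loss function is differentiable with respect to our parameter set, we can use gradient-based learning. You need to find the gradients of the loss with respect to A and b using the chain rule:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial \text{out}} \frac{\partial \text{out}}{\partial A} \tag{3}$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \text{out}} \frac{\partial \text{out}}{\partial b} \tag{4}$$

and then update the parameters:

$$A \leftarrow A - \eta \frac{\partial L}{\partial A} \tag{5}$$

$$b \leftarrow b - \eta \frac{\partial L}{\partial b} \tag{6}$$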

where η is the learning rate. As in the previous homework, the convergence of the algorithm depends on your hyperparameter settings. One of the most important things is to clip your parameters into a range, [-lim, lim], to stabilize learning.
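The gradient step and the clipping can be sketched together as follows. This is one possible reading rather than the reference solution. It assumes the loss is the squared TD error on the chosen action's output, and that the target r + γ max Q(s', a') is treated as a constant when differentiating (a common simplification the text does not state). The names `q_values` and `td_update` and the default hyperparameter values are hypothetical.

```python
import numpy as np

def q_values(observation, A, b):
    """Linear action values: one Q value per action."""
    return np.matmul(observation, A) + b

def td_update(A, b, obs, action, reward, next_obs,
              gamma=0.99, eta=1e-3, lim=10.0):
    """One gradient step on the squared TD error; returns (A, b, loss)."""
    out = q_values(obs, A, b)
    # Assumption: the target is held constant, so no gradient flows through it.
    target = reward + gamma * np.max(q_values(next_obs, A, b))
    td_error = target - out[action]
    loss = td_error ** 2

    # dL/dout is nonzero only for the action that was taken.
    grad_out = np.zeros_like(out)
    grad_out[action] = -2.0 * td_error

    # Chain rule: out = obs @ A + b, so dL/dA = outer(obs, dL/dout), dL/db = dL/dout.
    grad_A = np.outer(obs, grad_out)  # shape [4, 2]
    grad_b = grad_out                 # shape [2]

    # Gradient descent step followed by clipping into [-lim, lim].
    A_new = np.clip(A - eta * grad_A, -lim, lim)
    b_new = np.clip(b - eta * grad_b, -lim, lim)
    return A_new, b_new, loss
```

In a training loop, `td_update` would be called once per environment step with the observed transition. Suitable values for `eta`, `gamma`, and `lim` are left to tuning.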
