HW5
ε-greedy Q-learning
I implement the Q-learning algorithm, using the given reward function together with the ε-greedy
exploration strategy, for specific choices of εk, αk, and γ. The Q-function serves as a metric to
evaluate the value of a state-action pair (s, a) in relation to the rewards obtained when
an agent performs a task. An optimal policy is obtained by maximizing the Q-function
values, which are computed for all possible (s, a) pairs of the task at hand.
In order to obtain the Q-function, the following steps are taken:
(1) Initialize parameters: discount factor γ; exploration probability εk; learning rate αk.
(2) Initialize Q-function.
(3) Determine the initial state s0.
(4) For time step k, select action ak according to:
\[
a_k =
\begin{cases}
\arg\max_a Q_k(s_k, a), & \text{with probability } 1 - \epsilon_k,\\
\text{an action selected uniformly at random from those available at } s_k, & \text{with probability } \epsilon_k.
\end{cases}
\]
(5) Apply action ak, receive reward rk+1, then observe next state sk+1.
(6) Update Q-function using Bellman equation:
\[
Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha_k \bigl( r_{k+1} + \gamma \max_{a'} Q_k(s_{k+1}, a') - Q_k(s_k, a_k) \bigr).
\]
(7) Set k = k + 1 and repeat from step (4).
(8) Break the loop if any of these conditions is reached:
the robot reaches the goal state sk = 100;
εk = αk < 0.005 (optional);
the maximum number of time steps kmax = 3000 is reached.
(9) Run steps (2)–(8) 10 times to obtain the results for one set of fixed parameters. A minimal code sketch of this loop is given below.
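For reference, here is a minimal Python sketch of steps (1)–(8). It assumes the grid world is flattened into 100 states described by a reward matrix reward[s, a] (NaN marking unavailable actions) and a deterministic successor table next_state[s, a]; these names, the index-0 start state, and the eps_fn argument are my own choices rather than part of the assignment.

```python
import numpy as np

def run_q_learning(reward, next_state, eps_fn=lambda k: 100.0 / (100.0 + k),
                   gamma=0.9, k_max=3000, seed=5904):
    """One trial of epsilon-greedy Q-learning (steps (1)-(8)).

    reward[s, a]     : immediate reward r_{k+1} for action a in state s (NaN = invalid)
    next_state[s, a] : deterministic successor state
    eps_fn(k)        : schedule used for both eps_k and alpha_k
    Returns the Q-table and whether the goal state was reached.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = reward.shape
    goal = n_states - 1                          # state 100 in 1-based numbering
    Q = np.zeros((n_states, n_actions))          # step (2): initialize Q-function
    s = 0                                        # step (3): initial state s0 (assumed)
    for k in range(1, k_max + 1):                # steps (7)-(8): loop up to k_max
        eps = alpha = eps_fn(k)                  # step (1): eps_k = alpha_k schedule
        if alpha < 0.005:                        # optional stopping condition
            break
        valid = np.flatnonzero(~np.isnan(reward[s]))
        if rng.random() < 1.0 - eps:             # step (4): greedy with prob. 1 - eps_k
            a = valid[np.argmax(Q[s, valid])]
        else:                                    # ...otherwise explore uniformly
            a = rng.choice(valid)
        r, s_next = reward[s, a], next_state[s, a]   # step (5)
        valid_next = np.flatnonzero(~np.isnan(reward[s_next]))
        # step (6): Bellman update
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next, valid_next]) - Q[s, a])
        s = s_next
        if s == goal:                            # goal state reached
            return Q, True
    return Q, False
```

Calling run_q_learning ten times (e.g. with different seeds) then reproduces step (9).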
Task 1
Here, I set the random seed to 5904. The performance for Task 1 is shown in Table 1:
Table 1: Parameter values and performance of Q-Learning

εk, αk          | No. of goal-reached runs | Execution time (sec.)
                |   γ = 0.5     γ = 0.9    |   γ = 0.5     γ = 0.9
1/k             |      0           0       |    N.A.        N.A.
100/(100+k)     |      0           9       |    N.A.        1.089
(1+log(k))/k    |      0           0       |    N.A.        N.A.
(1+5log(k))/k   |      0          10       |    N.A.        1.7712
The optimal policies and optimal paths are shown in Figs. 1 and 2.
Figure 1: Performance of γ = 0.9 and εk = 100/(100+k): (a) optimal path; (b) optimal policy.
Figure 2: Performance of γ = 0.9 and εk = (1+5log(k))/k: (a) optimal path; (b) optimal policy.
Comments
(1) Only when γ = 0.9 can the robot find the goal state, and only with εk = 100/(100+k) or
εk = (1+5log(k))/k does the exploring robot reach the goal.
(2) The results may not be reproducible because of the randomness of exploration; therefore I fix
the random seed at 5904. The execution time fluctuates with the load on the computing resources.
(3) The reason why εk = 100/(100+k) and εk = (1+5log(k))/k are able to find the goal is that
they decrease more slowly, as shown in Fig. 3.
Figure 3: Decreasing speeds of the different εk, αk expressions.
If εk drops too quickly, there is less exploration, and the robot more easily gets stuck in the
trap of the greedy choice argmax_a Qk(sk, a).
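To make the comparison concrete, the small sketch below (assuming the log in these expressions is the natural logarithm) checks when each schedule from Table 1 first falls below the 0.005 cutoff within kmax = 3000 steps:

```python
import numpy as np

# Decay schedules from Table 1 (natural log assumed).
schedules = {
    "1/k":           lambda k: 1.0 / k,
    "100/(100+k)":   lambda k: 100.0 / (100.0 + k),
    "(1+log(k))/k":  lambda k: (1.0 + np.log(k)) / k,
    "(1+5log(k))/k": lambda k: (1.0 + 5.0 * np.log(k)) / k,
}

k = np.arange(1.0, 3001.0)
for name, f in schedules.items():
    below = np.flatnonzero(f(k) < 0.005)
    first = int(k[below[0]]) if below.size else "never (within 3000 steps)"
    print(f"{name:>15}: eps_k < 0.005 first at k = {first}")
```

Under this reading, only 100/(100+k) and (1+5log(k))/k keep εk above the cutoff for all 3000 steps, which matches the two rows of Table 1 that reach the goal.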
Task 2
To design a Q-learning scheme with my own parameters, I first test some new εk and γ values.
Table 2: γ = [0.7, 0.8, 0.9] and performance of Q-Learning

εk, αk          | No. of goal-reached runs      | Execution time (sec.)
                |  γ = 0.7   γ = 0.8   γ = 0.9  |  γ = 0.7   γ = 0.8   γ = 0.9
100/(100+k)     |    10        10        10     |   0.65      0.92      1.08
(1+5log(k))/k   |     9         9         7     |   0.82      1.06      1.04
100/(100+√k)    |    10        10        10     |   0.16      0.16      0.17
(1+10log(k))/k  |    10        10        10     |   0.69      0.63      0.82
exp(−0.001k)    |    10        10        10     |   0.09      0.11      0.13
1/k^0.1         |    10        10        10     |   0.06      0.07      0.07
Figure 4: Decreasing speeds of the different εk, αk expressions in Table 2.
From Table 2, I choose the best-performing parameters: εk = αk = 1/k^0.1 and γ = 0.7.
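The selection in Table 2 can also be scripted as a sweep, reusing the run_q_learning sketch above together with the same hypothetical reward and next_state tables; the snippet below takes three of the candidate schedules, counts goal-reached runs out of 10 for each γ, and times each setting (the per-run seed offset is my own choice):

```python
import time
import numpy as np

# Sweep a few of the Table 2 candidates; reward and next_state are the same
# hypothetical grid-world tables used by run_q_learning above.
candidates = {
    "100/(100+k)":  lambda k: 100.0 / (100.0 + k),
    "exp(-0.001k)": lambda k: np.exp(-0.001 * k),
    "1/k^0.1":      lambda k: k ** -0.1,
}

for name, eps_fn in candidates.items():
    for gamma in (0.7, 0.8, 0.9):
        t0 = time.perf_counter()
        reached = sum(run_q_learning(reward, next_state, eps_fn=eps_fn,
                                     gamma=gamma, seed=5904 + run)[1]
                      for run in range(10))       # count goal-reached runs
        elapsed = time.perf_counter() - t0
        print(f"{name:>13}  gamma={gamma}: {reached}/10 runs reached the goal, "
              f"{elapsed:.2f} s")
```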
Conclusion
From this project, I have reviewed the structure of reinforcement learning and learned how
different εk and γ influence the model. A more complex εk is not always the best: it
depends on the task, and a balance between decay speed and computational complexity should
be reached. I also learned that the Bellman equation is central to Q-learning. In the
second task, I designed a new εk inspired by the given ones; it decreases faster over the first 200
time steps and more slowly afterwards than the other functions. I then found the best γ by
testing different values on the given reward. In my opinion, this may cause the model to
overfit to the Task 1 reward and perform worse on an evaluation set.