Exploration in Reinforcement Learning (theory)
1 UCB

We find ourselves in the setting of multi-armed bandits. For arm $j$ at time $t$, define

$$
S_{j,t} = \sum_{k=1}^{t} X_{i_k,k}\,\mathbb{1}(i_k = j), \qquad
N_{j,t} = \sum_{k=1}^{t} \mathbb{1}(i_k = j), \qquad
\hat{\mu}_{j,t} = \frac{S_{j,t}}{N_{j,t}}.
$$

The question is to prove whether or not $\hat{\mu}_{j,t}$ is an unbiased estimator of $\mu_j$. At first sight, one could interpret $\hat{\mu}_{j,t}$ as the simple sample-mean estimate of $\mu_j$ and conclude that it is unbiased. However, this would only hold if the samples $X_{i_k,k}$ were independent and identically distributed (i.i.d.), which is not the case here in the online, on-policy learning of UCB: whether an arm is pulled or not depends on the previous samples, so one can expect the estimate to carry some bias. To prove the biasedness of $\hat{\mu}_{j,t}$, or rather to show that it is not unbiased in the general case, we will consider a simple case and compute its bias analytically.

Let us consider the setting of Bernoulli bandits as in Section 3, with $K = 2$ binary arms of parameters $\mu_1$ and $\mu_2$. One pulls the arm $i_t$ such that

$$
i_t \in \arg\max_j \; \hat{\mu}_{j,t} + U(N_{j,t}, \delta).
$$

We assume here that arms are pulled uniformly at random in case of a tie. The UCB exploration term is infinite for $t \in \{1, 2\}$, where both arms are pulled successively. At $t = 3$, both arms have been pulled once and one of them is going to be pulled again: since $N_{1,2} = N_{2,2} = 1$, the exploration bonuses are equal and the arm with the larger sample mean is selected. We look at the sample-mean estimates $\hat{\mu}_{1,3}$ and $\hat{\mu}_{2,3}$ after the third action. Conditioning on the first two rewards $X_1 \sim \mathrm{Ber}(\mu_1)$ and $X_2 \sim \mathrm{Ber}(\mu_2)$ and on the tie-breaking coin,

$$
\mathbb{E}[\hat{\mu}_{1,3}]
= \mu_1(1-\mu_2)\,\frac{1+\mu_1}{2}
+ \mu_1\mu_2 \left( \frac{1}{2}\cdot\frac{1+\mu_1}{2} + \frac{1}{2}\cdot 1 \right)
+ (1-\mu_1)(1-\mu_2)\,\frac{1}{2}\cdot\frac{\mu_1}{2}
= \frac{\mu_1(3+\mu_1)}{4},
$$

where the three terms correspond to the cases $(X_1, X_2) = (1,0)$, $(1,1)$ and $(0,0)$; the case $(0,1)$ contributes zero since arm 2 is then pulled and $\hat{\mu}_{1,3} = X_1 = 0$. Expanding the products, all cross-terms in $\mu_2$ cancel. By symmetry,

$$
\mathbb{E}[\hat{\mu}_{2,3}] = \frac{\mu_2(3+\mu_2)}{4}.
$$

Hence $\mathbb{E}[\hat{\mu}_{j,3}] - \mu_j = -\frac{\mu_j(1-\mu_j)}{4}$, which is strictly negative for $\mu_j \in (0,1)$: the sample mean under UCB is biased downward, so $\hat{\mu}_{j,t}$ is not an unbiased estimator of $\mu_j$ in general.
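As a sanity check on the computation above, here is a minimal Monte Carlo sketch (not part of the assignment; the function name `simulate_ucb_t3`, the seed, and the parameter values $\mu_1 = 0.6$, $\mu_2 = 0.3$ are our own choices) that simulates the first three rounds of UCB on the two-armed Bernoulli bandit and compares the empirical mean of $\hat{\mu}_{1,3}$ with the analytical value $\mu_1(3+\mu_1)/4$:

```python
import numpy as np

def simulate_ucb_t3(mu1, mu2, n_runs=1_000_000, seed=0):
    """Monte Carlo estimate of E[mu1_hat at t=3] for a 2-armed Bernoulli UCB.

    Rounds t=1,2: each arm is pulled once (the exploration bonus is infinite).
    Round t=3: both bonuses are equal (N=1 for each arm), so the arm with the
    larger sample mean is re-pulled; ties are broken by a fair coin.
    """
    rng = np.random.default_rng(seed)
    x1 = (rng.random(n_runs) < mu1).astype(float)  # first reward of arm 1
    x2 = (rng.random(n_runs) < mu2).astype(float)  # first reward of arm 2
    x3 = (rng.random(n_runs) < mu1).astype(float)  # arm 1's reward if re-pulled
    coin = rng.random(n_runs) < 0.5                # tie-breaking coin
    pull1 = (x1 > x2) | ((x1 == x2) & coin)        # is arm 1 pulled at t=3?
    # If arm 1 is re-pulled its sample mean averages two draws; otherwise it
    # remains the single first draw x1.
    mu1_hat = np.where(pull1, (x1 + x3) / 2.0, x1)
    return mu1_hat.mean()

mu1, mu2 = 0.6, 0.3
print(f"simulated : {simulate_ucb_t3(mu1, mu2):.4f}")  # ~0.54
print(f"analytical: {mu1 * (3 + mu1) / 4:.4f}")        # 0.54 < mu1 = 0.6
```

The simulated value sits below $\mu_1$, in line with the negative bias derived above; note also that, just as in the derivation, $\mu_2$ only influences which arm is re-pulled at $t = 3$, not the expectation of $\hat{\mu}_{1,3}$.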
