CS234_HW1_2021

1 Flappy Karel MDP

There is a hot new mobile game on the market called Flappy Karel, where Karel the robot must dodge the red pillars of doom and flap its way to the green pasture. Consider the following two grid environments (Flappy World 1 and Flappy World 2). Starting from any unshaded square, Karel can move either right and up, or right and down (e.g., from state 4 Karel can move to state 10 or 12; think checkers). Actions are deterministic and always succeed unless they would cause Karel to run into a wall. The thicker edges indicate walls, and attempting to move in the direction of a wall results in falling down one square (e.g., going in any direction from state 30 leads to falling into state 31). A successful run by Karel in Flappy World 1 is shown in Figure 1b. Taking any action from the green target square (no. 32) earns a reward of $r_g$ and ends the episode. Taking any action from the red squares of doom (no. 1, 7, 8, 12, 13) earns a reward of $r_r$ and ends the episode. From every other square, taking any action earns a reward of $r_s$. Assume the discount factor $\gamma = 0.9$, $r_g = +5$, and $r_r = -5$ unless otherwise specified. Note that the horizon is technically infinite in both worlds.

1	8	15	22	29
2	9	16	23	30
3	10	17	24	31
4	11	18	25	32
5	12	19	26	33
6	13	20	27	34
7	14	21	28	35

Figure 1
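The state numbering and the two diagonal moves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`to_state`, `to_cell`, `step`) are hypothetical, only the outer boundary of the grid is treated as a wall (the interior walls drawn in the figure are not reproduced here), a blocked move on the bottom row is assumed to leave Karel in place, and terminal squares are not modelled.

```python
ROWS, COLS = 7, 5  # grid shape read off the state table above


def to_state(row, col):
    """Column-major, 1-indexed state number for a 0-indexed (row, col) cell."""
    return col * ROWS + row + 1


def to_cell(state):
    """Inverse of to_state: 0-indexed (row, col) of a 1-indexed state."""
    return (state - 1) % ROWS, (state - 1) // ROWS


def step(state, action):
    """Apply 'up' (right and up) or 'down' (right and down).

    Assumption: only the outer boundary is a wall; interior walls are ignored.
    """
    row, col = to_cell(state)
    new_row = row - 1 if action == "up" else row + 1
    new_col = col + 1
    if 0 <= new_row < ROWS and new_col < COLS:
        return to_state(new_row, new_col)
    # Blocked by a wall: fall one square (assumed to stay put on the bottom row).
    return to_state(min(row + 1, ROWS - 1), col)


print(step(4, "up"), step(4, "down"), step(30, "up"))
```

The printed states match the two examples given in the problem statement: state 4 leads to 10 or 12, and any move from state 30 falls into 31.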

(a) Let $r_s \in \{-4, -1, 0, 1\}$. Starting in square 2, for each of the possible values of $r_s$, briefly explain what the optimal policy would be in Flappy World 1. In each case, is the optimal policy unique, and does it depend on the value of the discount factor $\gamma$? Explain your answer. [5 pts]
(b) What value of $r_s$ would cause the optimal policy to return the shortest path to the green target square? Using this value of $r_s$, find the optimal value function for each square in Flappy World 1. What is the optimal action from square 27? [5 pts]

Now consider Flappy World 2. It is the same as Flappy World 1, except that there are no walls on the right and left sides. Going past the right end of Flappy World 2 simply loops you around to the left-hand side. See Figure 2b for a successful run by Karel in Flappy World 2.

(a) Flappy World 2

(b) A successful run by Karel in Flappy World 2

Figure 2

(c) Let $r_s \in \{-4, -1, 0, 1\}$. Starting in square 3, for each of the possible values of $r_s$, briefly explain what the optimal policy would be in Flappy World 2. Using the value of $r_s$ that would cause the optimal policy to return the shortest path to the green target square, find the optimal value function for each square in Flappy World 2. What is the optimal action from square 27? [5 pts]
(d) Consider a general MDP with rewards and transitions, and a discount factor of $\gamma$. For this case, assume that the horizon is infinite (so there is no termination). A policy $\pi$ in this MDP induces a value function $V^{\pi}$ (let's refer to this as $V^{\pi}_{old}$). Now suppose we have the same MDP, except that all rewards have a constant $c$ added to them and are then scaled by a constant $a$ (i.e., $r_{new} = a(c + r_{old})$). Can you come up with an expression for the new value function $V^{\pi}_{new}$ induced by $\pi$ in this second MDP in terms of $V^{\pi}_{old}$, $c$, $a$, and $\gamma$? [5 pts]
(e) Can scaling all the rewards by a fixed amount change the optimal policy of an MDP? If so, describe how different ranges of the constant $a$ (where $r_{new} = a \cdot r_{old}$) would change the optimal policy of the MDP from part (c). [5 pts]
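A candidate expression for the value function under the transformed rewards $r_{new} = a(c + r_{old})$ can be sanity-checked numerically. The sketch below is not part of the handout: it builds a small random Markov chain (standing in for an MDP with the policy already fixed), evaluates it exactly by solving the linear Bellman system, and prints both value functions so that a hand-derived formula can be compared against them. The function name `evaluate_policy` and the values of `a`, `c`, and the number of states are arbitrary illustrative choices.

```python
import numpy as np


def evaluate_policy(P_pi, r_pi, gamma):
    """Exact infinite-horizon policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


rng = np.random.default_rng(0)
n_states, gamma, a, c = 4, 0.9, 2.0, 3.0  # illustrative values only

# Random row-stochastic transition matrix under a fixed policy.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)

r_old = rng.normal(size=n_states)
r_new = a * (c + r_old)

V_old = evaluate_policy(P_pi, r_old, gamma)
V_new = evaluate_policy(P_pi, r_new, gamma)

print("V_old:", V_old)
print("V_new:", V_new)
```

Plugging `V_old`, `a`, `c`, and `gamma` into a proposed closed form and comparing the result with `V_new` (for instance with `np.allclose`) gives a quick check before writing up the derivation.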

2 Applications of the Performance Difference Lemma

The purpose of this exercise is to get familiar with how to compare the values of two different policies, $\pi_1$ and $\pi_2$, on a fixed-horizon MDP. A fixed-horizon MDP is an MDP where the agent's state is reset after $H$ timesteps; $H$ is called the horizon of the MDP. There is no discount (i.e., $\gamma = 1$) and policies are allowed to be non-stationary, i.e., the action identified by a policy depends on the timestep in addition to the state. Let $x_t \sim \pi$ denote the distribution over states at timestep $t$ (for $1 \le t \le H$) upon following policy $\pi$, let $V_t^{\pi}(x_t)$ denote the value function of policy $\pi$ in state $x_t$ at timestep $t$, and let $Q_t^{\pi}(x_t, a)$ denote the corresponding Q-value associated with action $a$. As a clarifying example, $\mathbb{E}_{x_t \sim \pi_1} V(x_t)$ denotes the average of the value function $V(\cdot)$ over the states encountered at timestep $t$ upon following policy $\pi_1$. The following equality is called the performance difference lemma:

$$V_1^{\pi_1}(x_1) - V_1^{\pi_2}(x_1) = \sum_{t=1}^{H} \mathbb{E}_{x_t \sim \pi_2} \left[ Q_t^{\pi_1}(x_t, \pi_1(x_t, t)) - Q_t^{\pi_1}(x_t, \pi_2(x_t, t)) \right]$$

Intuition: The above expression can be interpreted in the following way. For concreteness, assume that $\pi_1$ is the better policy, i.e., achieving $V_1^{\pi_1}(x_1) \ge V_1^{\pi_2}(x_1)$. Suppose you're following policy $\pi_2$ and you are at timestep $t$ in state $x_t$. You have the option to follow $\pi_1$ (the better policy) until the end of the episode, totalling return $Q_t^{\pi_1}(x_t, \pi_1(x_t, t))$ from the current state-timestep; or you have the option to follow $\pi_2$ for one timestep and then follow $\pi_1$ instead until the end of the episode (you can follow many other policies, of course). The second option gives you a loss of $Q_t^{\pi_1}(x_t, \pi_1(x_t, t)) - Q_t^{\pi_1}(x_t, \pi_2(x_t, t))$ that originates from following the worse policy $\pi_2$ instead of $\pi_1$ in that timestep. The equation above then says that the value difference of the two policies is the sum of all the losses induced by following the suboptimal policy at every timestep, weighted by the expected trajectory of the policy you're following.
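The interpretation above (the value gap equals the per-timestep losses $Q_t^{\pi_1}(x_t, \pi_1(x_t,t)) - Q_t^{\pi_1}(x_t, \pi_2(x_t,t))$, summed over $t$ and averaged over the states visited by $\pi_2$) can be checked numerically. The following is a minimal sketch on a randomly generated finite-horizon MDP with deterministic, non-stationary policies stored as arrays `pi[t, s]`. The sizes, the random MDP, and the helper names `q_values` and `state_distributions` are all illustrative assumptions, and timesteps are 0-indexed in the code.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5  # illustrative sizes

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # P[s, a, s'] transition probabilities
R = rng.random((S, A))              # R[s, a] rewards
pi1 = rng.integers(A, size=(H, S))  # pi[t, s]: action at timestep t in state s
pi2 = rng.integers(A, size=(H, S))
idx = np.arange(S)


def q_values(pi):
    """Backward induction with no discount: Q[t, s, a] when pi is followed after timestep t."""
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for t in reversed(range(H)):
        Q[t] = R + P @ V_next
        V_next = Q[t, idx, pi[t]]
    return Q


def state_distributions(pi, start):
    """d[t, s]: probability of being in state s at timestep t when following pi from start."""
    d = np.zeros((H, S))
    d[0, start] = 1.0
    for t in range(H - 1):
        d[t + 1] = d[t] @ P[idx, pi[t]]
    return d


Q1, Q2 = q_values(pi1), q_values(pi2)
start = 0
lhs = Q1[0, start, pi1[0, start]] - Q2[0, start, pi2[0, start]]
d2 = state_distributions(pi2, start)
rhs = sum(d2[t] @ (Q1[t, idx, pi1[t]] - Q1[t, idx, pi2[t]]) for t in range(H))

print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error, whichever of the two policies happens to be better.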

Question. You will use the performance difference lemma to solve this problem. Consider an MDP where the state space $S$ is partitioned into two sets of states: $S^+$ and its complement $S \setminus S^+$.

In every state $s \in S^+$ there exists an action $a^+$ that leads to the same state with probability 1 and gives a unitary reward:

$$p(s_{t+1} = s \mid s_t = s, a_t = a^+) = 1, \qquad p(s_{t+1} \ne s \mid s_t = s, a_t = a^+) = 0.$$

The reward function is always positive. In $S^+$ the reward equals 1 upon playing $a^+$ and $H$ upon playing any action $a \ne a^+$. Therefore, in $S^+$,

$$r(s, a^+) = 1, \qquad r(s, a) = H \quad \forall a \ne a^+.$$

Conversely, in any state $s \notin S^+$, the reward function is in $[0,1]$ (i.e., $\forall s \notin S^+, \ \forall a: \ r(s,a) \in [0,1]$).

Consider a policy $\pi$ and define a policy $\pi^+$ that takes action $a^+$ in any state in $S^+$ and is otherwise equal to $\pi$:

$$\pi^+(s) = a^+ \ \text{if } s \in S^+, \qquad \pi^+(s) = \pi(s) \ \text{if } s \notin S^+.$$

Intuitively, $\pi$ accumulates higher return than $\pi^+$: in any state in $S^+$ the policy $\pi^+$ chooses to take a unitary reward forever instead of a reward of $H$ and then maybe more. Using the performance difference lemma, show that at any state $s_0$

3 Nonstationary Discount Factor

In this problem you will consider a variable discount factor $\gamma$. In lecture 2, we proved that the Bellman backup is a contraction for $\gamma < 1$ in the infinity norm.

In this problem we consider a non-stationary discount factor and assume you want to run $K$ iterations of value iteration. Let $V_K$ and $V_K'$ be any two arbitrary initial value functions (at timestep $K$). The time-dependent Bellman backup operator $B_k$ is defined as

$$V_{k-1} = B_k V_k \stackrel{\text{def}}{=} \max_a \Big[ R(s,a) + \gamma_k \sum_{s' \in S} p(s' \mid s, a) \, V_k(s') \Big]$$

where

Notice that the value function index is decreasing: $K, K-1, \ldots, 2, 1$.

[10 pts] Similarly to what you've done in class, show that the Bellman operator with non-stationary discount factor $\gamma_k$ at time step $k$ is still a contraction, i.e.,

$$\| B_k V - B_k V' \| \le \gamma_k \| V - V' \|$$

[10 pts] Using the above inequality, prove that

$$\| B_1 B_2 \cdots B_K V_K - B_1 B_2 \cdots B_K V_K' \| \le \gamma_1 \gamma_2 \cdots \gamma_K \| V_K - V_K' \|$$

[10 pts] Unfortunately, $\gamma_k \to 1$ when $k$ is large, so you cannot conclude that the convergence occurs exponentially fast. However, the error still shrinks: show that

which allows you to write

and ensure convergence, albeit at a slower rate.
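The contraction bounds in this problem can be explored numerically before attempting the proofs. The sketch below applies the time-dependent backup to two arbitrary value functions on a random MDP and compares the infinity-norm gap with the product of discount factors. The schedule `gamma(k) = k / (k + 1)` is purely a hypothetical choice that tends to 1 as `k` grows; it is an assumption made for illustration and is not taken from the problem statement above. The MDP sizes and helper names are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, K = 5, 3, 20  # illustrative sizes

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # P[s, a, s'] transition probabilities
R = rng.random((S, A))             # R[s, a] rewards


def gamma(k):
    """Hypothetical non-stationary discount schedule (assumption, tends to 1)."""
    return k / (k + 1)


def backup(V, k):
    """Time-dependent Bellman backup B_k applied to V."""
    return np.max(R + gamma(k) * (P @ V), axis=1)


V, W = rng.normal(size=S), rng.normal(size=S)
bound = np.max(np.abs(V - W))  # infinity-norm gap between the initial functions
one_step_ok = []

# The value function index decreases: K, K-1, ..., 1.
for k in range(K, 0, -1):
    V_next, W_next = backup(V, k), backup(W, k)
    gap_before = np.max(np.abs(V - W))
    gap_after = np.max(np.abs(V_next - W_next))
    one_step_ok.append(gap_after <= gamma(k) * gap_before + 1e-12)
    bound *= gamma(k)
    V, W = V_next, W_next

final_gap = np.max(np.abs(V - W))
print("final gap:", final_gap, "product bound:", bound)
print("single-step contraction held at every k:", all(one_step_ok))
```

A numerical check like this does not replace the proof, but it shows how the gap keeps shrinking even though each individual factor approaches 1.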

4 Frozen Lake MDP

Now you will implement value iteration and policy iteration for the Frozen Lake environment from OpenAI Gym. We have provided custom versions of this environment in the starter code.

(coding) Read through vi_and_pi.py and implement policy_evaluation, policy_improvement and policy_iteration. The stopping tolerance (defined as $\max_s |V_{old}(s) - V_{new}(s)|$) is tol = $10^{-3}$. Use $\gamma = 0.9$. Return the optimal value function and the optimal policy. [10 pts]
(coding) Implement value_iteration in vi_and_pi.py. The stopping tolerance is tol = $10^{-3}$. Use $\gamma = 0.9$. Return the optimal value function and the optimal policy. [10 pts]

(written) Run both methods on the Deterministic-4x4-FrozenLake-v0 and Stochastic-4x4-FrozenLake-v0 environments. In the second environment, the dynamics of the world are stochastic. How does stochasticity affect the number of iterations required and the resulting policy? [5 pts]
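As a point of reference for the stopping rule described above, here is a minimal value iteration sketch on a hand-built three-state chain. It assumes the transition model is stored in the layout used by Gym's FrozenLake (`P[s][a]` is a list of `(probability, next_state, reward, terminal)` tuples) and that terminal states are absorbing with zero reward. The signature `value_iteration(P, nS, nA, gamma, tol)` is an assumption and may not match the starter code, and the toy MDP is not one of the provided environments.

```python
import numpy as np


def value_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    """Iterate Bellman optimality backups until max_s |V_old(s) - V_new(s)| < tol."""
    V = np.zeros(nS)
    while True:
        Q = np.zeros((nS, nA))
        for s in range(nS):
            for a in range(nA):
                for prob, next_s, reward, _terminal in P[s][a]:
                    Q[s, a] += prob * (reward + gamma * V[next_s])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new


# Toy chain: states 0 and 1 are ordinary, state 2 is an absorbing goal.
# Action 0 moves left, action 1 moves right; entering the goal pays reward 1.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

V, policy = value_iteration(P, nS=3, nA=2)
print(V, policy)
```

On this chain the greedy policy moves right in states 0 and 1, and the value of state 0 is the value of state 1 discounted once by `gamma`.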
