[Solved] CS234 Week1

$25

File Name: CS234_Week1.zip
File Size: 103.62 KB

  • [CA Session] Problem 1: MDP Design

You are in a Las Vegas casino! You have $20 for this casino venture and will play until you lose it all or as soon as you double your money (i.e., increase your holding to at least $40). You can choose to play two slot machines: 1) slot machine A costs $10 to play and will return $20 with probability 0.05 and $0 otherwise; and 2) slot machine B costs $20 to play and will return $30 with probability 0.01 and $0 otherwise. Until you are done, you will choose to play machine A or machine B in each turn. In the space below, provide an MDP that captures the above description.

Describe the state space, action space, rewards, and transition probabilities. Assume the discount factor γ = 1. Rewards should yield a higher reward when terminating with $40 than when terminating with $0. Also, the reward for terminating with $40 should be the same regardless of how we got there (and equivalently for $0).
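As a hint, here is a minimal sketch of one possible encoding in Python (not the official solution; the helper names are illustrative). States are your current holdings in dollars, with $0 and $40 terminal:

```python
# One possible encoding of the casino MDP (illustrative sketch).
# States are current holdings in dollars; $0 and $40 are terminal.
STATES = [0, 10, 20, 30, 40]
ACTIONS = ["A", "B"]

def transitions(s, a):
    """Return a list of (probability, next_state) pairs for playing machine a."""
    if s in (0, 40):                    # terminal states: no further play
        return []
    if a == "A" and s >= 10:            # pay $10; win $20 w.p. 0.05 (net +10)
        return [(0.05, min(s + 10, 40)), (0.95, s - 10)]
    if a == "B" and s >= 20:            # pay $20; win $30 w.p. 0.01 (net +10)
        return [(0.01, min(s + 10, 40)), (0.99, s - 20)]
    return []                           # machine not affordable from s

def reward(s, a, s_next):
    """Terminal-only reward: higher for leaving with $40 than with $0,
    and independent of the path taken, as required."""
    return 1.0 if s_next == 40 else 0.0
```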

  • Problem 2: Contradicting Contractions?

Consider an MDP M = (S, A, P, R, γ) with 2 states S = {S1, S2}. From each state there are 2 available actions A = {stay, go}. Choosing stay from any state leaves you in the same state and gives reward -1. Choosing go from state S1 takes you to state S2 deterministically, giving reward -2, while choosing go from state S2 ends the episode, giving reward 3. Let γ = 1.

Let V*(s) be the optimal value function in state s. As you learned in class, value iteration produces iterates V1(s), V2(s), V3(s), … that eventually converge to V*(s).

  • [CA Session]

Prove that the ∞-norm distance between the current value function Vk and the optimal value function V* decreases after each iteration of value iteration.
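For reference, one standard outline (a sketch, not a complete writeup) uses the fact that the Bellman backup B is a γ-contraction in the ∞-norm:

```latex
\begin{align*}
\|V_{k+1} - V^*\|_\infty
  &= \|B V_k - B V^*\|_\infty \\
  &= \max_s \Big|\, \max_a \big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_k(s')\big]
      - \max_a \big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s')\big] \,\Big| \\
  &\le \max_{s,a}\; \gamma \sum_{s'} P(s' \mid s,a)\, \big|V_k(s') - V^*(s')\big| \\
  &\le \gamma\, \|V_k - V^*\|_\infty .
\end{align*}
```

Note that with γ = 1, as in this MDP, this argument alone only guarantees that the distance does not increase.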

  • [Breakout Rooms]

Now let us consider exactly what forms of convergence are ensured.

For the given MDP, let us initialize value iteration as V0 = [0, 0]. Then V1 = [-1, 3] and V2 = [1, 3]. We also have V* = [1, 3].

Is there monotonic improvement in the V estimates for all states? If not, does this contradict the result in Q2(a)? Why or why not?
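A few lines of Python (the names are illustrative) reproduce the iterates above for this two-state MDP:

```python
import numpy as np

GAMMA = 1.0

def backup(V):
    """One synchronous Bellman backup for the 2-state MDP above."""
    q_s1 = [-1 + GAMMA * V[0], -2 + GAMMA * V[1]]  # S1: stay, go (to S2)
    q_s2 = [-1 + GAMMA * V[1], 3]                  # S2: stay, go (terminal)
    return np.array([max(q_s1), max(q_s2)])

V = np.array([0.0, 0.0])                           # V0 = [0, 0]
for k in range(1, 4):
    V = backup(V)
    print(f"V{k} = {V}")  # V1 = [-1, 3], V2 = [1, 3], V3 = [1, 3]
```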

  • [Breakout Rooms] Problem 3: Stochastic Optimal Policies

Given an optimal policy that is stochastic in an MDP, show that there is always another deterministic policy that has the same (optimal) value.
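One way to start (a sketch of the key step only): write the value of the stochastic optimal policy π via its Q-function, using the fact that Q^π = Q* when π is optimal, and compare with the Bellman optimality equation:

```latex
V^*(s) \;=\; V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\, Q^{*}(s,a)
\;\le\; \max_{a} Q^{*}(s,a) \;=\; V^*(s),
```

so equality holds throughout, every action with π(a | s) > 0 must attain the maximum, and the deterministic policy π′(s) ∈ argmax_a Q*(s, a) achieves the same optimal value.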

  • [Breakout Rooms] Problem 4: Parallelizing Value Iteration

During a single iteration of the Value Iteration algorithm, we typically iterate over the states in S in some order to update Vt(s) to Vt+1(s) for all states s. Is it possible to do this iterative process in parallel? Explain why or why not.
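As a hint, here is a minimal sketch (illustrative names, assuming dense arrays P[s][a][s'] and R[s][a][s']) of the synchronous, Jacobi-style sweep: every entry of Vt+1 is computed from the frozen Vt, so the per-state updates are independent of one another.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def bellman_update(s, V, P, R, gamma):
    """Backup for one state: reads only the frozen V_t, writes nothing shared."""
    n_states, n_actions = len(V), P.shape[1]
    return max(
        sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2]) for s2 in range(n_states))
        for a in range(n_actions)
    )

def parallel_sweep(V, P, R, gamma):
    # Each state's new value depends only on the old V, so the map below has
    # no data races; it demonstrates that parallelization is correct (actual
    # speedup in Python would need processes or vectorization, not threads).
    with ThreadPoolExecutor() as ex:
        return np.array(list(ex.map(
            lambda s: bellman_update(s, V, P, R, gamma), range(len(V)))))
```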
