Problem 1.
- Find the shortest path from start to end in the following figure.
- Find the longest path from start to end in the following figure.
Problem 2. Consider an MDP with two states S = {0, 1}, two actions A = {1, 2}, and the following reward function
$$
R(s,a) =
\begin{cases}
1, & (s,a) = (0,1) \\
4, & (s,a) = (0,2) \\
3, & (s,a) = (1,1) \\
2, & (s,a) = (1,2)
\end{cases}
\tag{1}
$$
and the following transition probabilities:
(2)
The other probabilities can be deduced, for example:
. (3)
The discount factor is
$$
\gamma = 3/4. \tag{4}
$$
- For the policy π that chooses action 1 in state 0 and action 2 in state 1, find the state value function v(s) by writing out the Bellman expectation equation and solving it explicitly.
- For the same policy π, obtain the state value function using iterative updates based on the Bellman expectation equation. List the first 5 iteration values of v(s). (A code sketch covering this and the remaining items appears after this list.)
- For the policy π, calculate the action-value function q(s,a).
- Based on the value function v(s), obtain an improved policy π′ via
  $$
  \pi'(s) = \arg\max_a q(s,a). \tag{5}
  $$
- Obtain the optimal value function v*(s) using value iteration based on the Bellman optimality equation, with all initial values set to 0.
- Obtain the optimal policy.
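The transition probabilities in Eq. (2) are not reproduced in this text, so the sketch below uses placeholder uniform transitions (clearly marked in the code); substitute the values from the problem statement. It is a minimal NumPy sketch, under those assumptions, of the explicit Bellman solve, iterative policy evaluation, the q-function, policy improvement per Eq. (5), and value iteration, not a definitive solution.

```python
import numpy as np

# Rewards R[s][a] from Eq. (1): R(0,1)=1, R(0,2)=4, R(1,1)=3, R(1,2)=2.
# Actions are 0-indexed here: column 0 is action 1, column 1 is action 2.
R = np.array([[1.0, 4.0],
              [3.0, 2.0]])

# PLACEHOLDER transition probabilities P[s, a, s'].  Eq. (2) is not reproduced
# above, so these uniform values are illustrative only; replace them with the
# probabilities given in the problem statement.
P = np.array([[[0.5, 0.5],    # p(. | s=0, a=1)
               [0.5, 0.5]],   # p(. | s=0, a=2)
              [[0.5, 0.5],    # p(. | s=1, a=1)
               [0.5, 0.5]]])  # p(. | s=1, a=2)

gamma = 0.75                  # discount factor, Eq. (4)
pi = np.array([0, 1])         # policy pi: action 1 in state 0, action 2 in state 1

# Explicit solution of the Bellman expectation equation:
#   v = R_pi + gamma * P_pi v   =>   v = (I - gamma * P_pi)^{-1} R_pi
R_pi = R[np.arange(2), pi]
P_pi = P[np.arange(2), pi]
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print("explicit solution:", v_exact)

# Iterative policy evaluation (first 5 sweeps), starting from v = 0.
v = np.zeros(2)
for k in range(1, 6):
    v = R_pi + gamma * P_pi @ v
    print(f"iteration {k}: v = {v}")

# Action values for the same policy: q(s,a) = R(s,a) + gamma * sum_s' p(s'|s,a) v(s').
q = R + gamma * (P @ v_exact)
print("q(s,a) =\n", q)

# Policy improvement, Eq. (5): pi'(s) = argmax_a q(s,a).
print("improved policy:", q.argmax(axis=1) + 1)   # +1 reports actions as 1/2

# Value iteration with the Bellman optimality equation, all initial values 0.
v_star = np.zeros(2)
for _ in range(200):
    v_star = (R + gamma * (P @ v_star)).max(axis=1)
print("optimal v:", v_star)
print("optimal policy:", (R + gamma * (P @ v_star)).argmax(axis=1) + 1)
```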
Problem 3. Consider an MDP with two states S = {0, 1}, two actions A = {1, 2}, and the following reward function
$$
R(s,a) =
\begin{cases}
1, & (s,a) = (0,1) \\
4, & (s,a) = (0,2) \\
3, & (s,a) = (1,1) \\
2, & (s,a) = (1,2)
\end{cases}
\tag{6}
$$
and the following transition probabilities:
(7)
The other probabilities can be deduced, for example:
. (8)
The discount factor is
$$
\gamma = 3/4. \tag{9}
$$
Exercise on model-free prediction:
- For the policy π that chooses action 1 in state 0 and action 2 in state 1, starting from state 0, generate one episode E of 10000 triplets (R_i, S_i, A_i), i = 0, 1, ..., 9999, with R_0 = 0 and S_0 = 0.
- Based on the episode E, use Monte Carlo policy evaluation to estimate the value function v(s).
- Based on the episode E, use n-step temporal-difference policy evaluation to estimate the value function v(s). (A code sketch of these prediction exercises follows this list.)
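A minimal sketch of the prediction exercises, again with placeholder transition probabilities (Eq. (7) is not reproduced above) and with illustrative choices of n and the step size; the episode convention matches the problem statement, with R_{i+1} being the reward received after taking A_i in state S_i.

```python
import numpy as np

rng = np.random.default_rng(0)

R = np.array([[1.0, 4.0],
              [3.0, 2.0]])                 # rewards from Eq. (6)
# PLACEHOLDER transitions P[s, a, s'] (Eq. (7) is not reproduced above);
# replace with the probabilities from the problem statement.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])
gamma = 0.75
pi = np.array([0, 1])                      # action 1 in state 0, action 2 in state 1

# Generate one episode of 10000 (R_i, S_i, A_i) triplets with R_0 = 0, S_0 = 0.
T = 10000
S, A, Rew = np.zeros(T, dtype=int), np.zeros(T, dtype=int), np.zeros(T)
for t in range(T):
    A[t] = pi[S[t]]
    if t + 1 < T:
        Rew[t + 1] = R[S[t], A[t]]                    # reward R_{t+1}
        S[t + 1] = rng.choice(2, p=P[S[t], A[t]])     # next state S_{t+1}

# Every-visit Monte Carlo policy evaluation: average the return following each
# visit to a state, computing returns backwards via G_t = R_{t+1} + gamma * G_{t+1}.
G, returns = 0.0, {0: [], 1: []}
for t in range(T - 2, -1, -1):
    G = Rew[t + 1] + gamma * G
    returns[S[t]].append(G)
v_mc = np.array([np.mean(returns[s]) for s in range(2)])
print("Monte Carlo estimate:", v_mc)

# n-step TD policy evaluation; n and the step size alpha are illustrative choices.
n, alpha = 3, 0.05
v_td = np.zeros(2)
for t in range(T - n - 1):
    # n-step return from time t, bootstrapping with the current estimate.
    G_n = sum(gamma**k * Rew[t + 1 + k] for k in range(n)) + gamma**n * v_td[S[t + n]]
    v_td[S[t]] += alpha * (G_n - v_td[S[t]])
print("n-step TD estimate:", v_td)
```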
Exercise on model-free control:
- Use the SARSA algorithm to estimate the optimal action-value function q*(s,a), by running the algorithm in Sutton and Barto's book (2nd edition, available online).
- Use the Q-learning algorithm to estimate the optimal action-value function q*(s,a), by running the algorithm in Sutton and Barto's book (2nd edition, available online).
You only need to simulate one episode. In both cases, you will need to choose an appropriate fixed step size α, exploration probability ε, and number of time steps in the episode. A sketch of both update rules is given below.
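A minimal sketch of ε-greedy SARSA and Q-learning in the style of Sutton and Barto, under the same assumptions as above: the transition probabilities are placeholders, and the values of α, ε, and the episode length are illustrative choices that you still need to pick and justify.

```python
import numpy as np

rng = np.random.default_rng(0)

R = np.array([[1.0, 4.0],
              [3.0, 2.0]])                 # rewards from Eq. (6)
# PLACEHOLDER transitions; replace with the probabilities from the problem.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])
gamma = 0.75
alpha, eps, T = 0.05, 0.1, 10000           # step size, exploration prob., episode length

def eps_greedy(q_row):
    """Pick a greedy action with prob. 1 - eps, otherwise a uniformly random one."""
    return rng.integers(2) if rng.random() < eps else int(q_row.argmax())

def step(s, a):
    """Sample the reward and next state from the (placeholder) model."""
    return R[s, a], rng.choice(2, p=P[s, a])

# SARSA (on-policy): q(S,A) <- q(S,A) + alpha * [R + gamma * q(S',A') - q(S,A)]
q_sarsa = np.zeros((2, 2))
s = 0
a = eps_greedy(q_sarsa[s])
for _ in range(T):
    r, s2 = step(s, a)
    a2 = eps_greedy(q_sarsa[s2])
    q_sarsa[s, a] += alpha * (r + gamma * q_sarsa[s2, a2] - q_sarsa[s, a])
    s, a = s2, a2
print("SARSA q estimate:\n", q_sarsa)

# Q-learning (off-policy): q(S,A) <- q(S,A) + alpha * [R + gamma * max_a q(S',a) - q(S,A)]
q_ql = np.zeros((2, 2))
s = 0
for _ in range(T):
    a = eps_greedy(q_ql[s])
    r, s2 = step(s, a)
    q_ql[s, a] += alpha * (r + gamma * q_ql[s2].max() - q_ql[s, a])
    s = s2
print("Q-learning q estimate:\n", q_ql)
```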
