Implement (in C++) the Value Iteration algorithm (detailed in chapter 3 of [Sutton and Barto, 1998] and chapter 13 of [Mitchell, 1997]) in order to find the optimal value (V*) for each state in a small grid-world (figure 1). Use the following information:
- The agent has 4 actions {left, right, up, down}, and the grid-world has 6 states {s1, s2, s3, s4, s5, s6}. Figure 1 shows the possible transitions between states (the actions available in each state).
- The state transition distribution is deterministic, i.e., every transition probability is either 1 or 0 for all states and actions.
- Rewards for all state transitions are zero (r = 0), except the following:
- State s3 is the terminal state.
- The discount factor is 0.8, i.e., γ = 0.8.
Figure 1: A small grid-world, where arrows show possible transitions between states. Note that state s3 is a terminal state.
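For reference, a minimal C++ sketch of the algorithm is given below. The 2x3 layout, the transition table, and the GOAL_REWARD placeholder are assumptions made purely for illustration; they must be replaced with the actual transitions and rewards shown in figure 1 before the program is run.

// Sketch of Value Iteration for the 6-state grid-world.
// NOTE: the transition and reward tables below are illustrative placeholders;
// they assume a 2x3 layout (s1 s2 s3 / s4 s5 s6) and must be replaced with
// the transitions and rewards shown in figure 1.
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdio>

int main() {
    const int NUM_STATES = 6;          // s1..s6 mapped to indices 0..5
    const int NUM_ACTIONS = 4;         // 0 = left, 1 = right, 2 = up, 3 = down
    const int TERMINAL = 2;            // s3 is the terminal (absorbing) state
    const double GAMMA = 0.8;          // discount factor from the handout
    const double EPSILON = 1e-9;       // convergence threshold
    const double GOAL_REWARD = 100.0;  // placeholder: reward for entering s3 (use the value from figure 1)

    // next_state[s][a] = successor of s under action a, or -1 if the action is unavailable.
    // Assumed layout (placeholder):  s1 s2 s3
    //                                s4 s5 s6
    int next_state[NUM_STATES][NUM_ACTIONS] = {
        /* s1 */ {-1,  1, -1,  3},
        /* s2 */ { 0,  2, -1,  4},
        /* s3 */ {-1, -1, -1, -1},   // terminal: no outgoing transitions
        /* s4 */ {-1,  4,  0, -1},
        /* s5 */ { 3,  5,  1, -1},
        /* s6 */ { 4, -1,  2, -1},
    };

    // reward[s][a]: zero everywhere except transitions that enter the terminal state.
    double reward[NUM_STATES][NUM_ACTIONS] = {};
    for (int s = 0; s < NUM_STATES; ++s)
        for (int a = 0; a < NUM_ACTIONS; ++a)
            if (next_state[s][a] == TERMINAL) reward[s][a] = GOAL_REWARD;

    std::array<double, NUM_STATES> V{};  // value estimates, initialised to 0
    int iterations = 0;

    while (true) {
        double max_change = 0.0;
        std::array<double, NUM_STATES> V_new = V;
        for (int s = 0; s < NUM_STATES; ++s) {
            if (s == TERMINAL) continue;             // terminal value stays 0
            double best = -1e300;
            for (int a = 0; a < NUM_ACTIONS; ++a) {
                int sp = next_state[s][a];
                if (sp < 0) continue;                // action not available in this state
                double q = reward[s][a] + GAMMA * V[sp];  // deterministic one-step backup
                best = std::max(best, q);
            }
            V_new[s] = best;
            max_change = std::max(max_change, std::fabs(V_new[s] - V[s]));
        }
        V = V_new;
        ++iterations;
        if (max_change < EPSILON) break;             // converged
    }

    std::printf("Converged after %d iterations\n", iterations);
    for (int s = 0; s < NUM_STATES; ++s)
        std::printf("V(s%d) = %.6f\n", s + 1, V[s]);
    return 0;
}

The iteration counter and the printed values correspond directly to what Question 1 asks for in the output text file.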
Question 1: How many iterations does the Value Iteration algorithm take to converge? In an output text file, list the optimal value (V*) for each state.
Question 2: Assuming we start in state s1, give the sequence of states that forms the optimal policy (π*) to reach the terminal state (s3).
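One way to answer this is to follow the greedy policy implied by the converged values. The helper sketched below assumes the next_state and reward tables and the converged V array from the code above (s1 and s3 correspond to indices 0 and 2).

// Sketch: read the greedy (optimal) path off the converged values.
// Assumes the same next_state/reward tables and the converged V array
// produced by the value-iteration sketch above.
#include <vector>

std::vector<int> greedyPath(int start, int terminal,
                            int next_state[6][4], double reward[6][4],
                            const double V[6], double gamma) {
    std::vector<int> path{start};
    int s = start;
    while (s != terminal) {
        int best_a = -1;
        double best_q = -1e300;
        for (int a = 0; a < 4; ++a) {
            int sp = next_state[s][a];
            if (sp < 0) continue;                     // unavailable action
            double q = reward[s][a] + gamma * V[sp];  // one-step greedy lookahead
            if (q > best_q) { best_q = q; best_a = a; }
        }
        if (best_a < 0) break;                        // dead end (should not occur in this grid)
        s = next_state[s][best_a];                    // follow the greedy action
        path.push_back(s);
    }
    return path;
}

After convergence, a call such as greedyPath(0, 2, next_state, reward, V.data(), GAMMA) returns the state indices visited from s1 to s3, which can be written to the output file as the answer to Question 2.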
Question 3: Is it possible to change the reward function so that V* changes but the optimal policy (π*) remains unchanged?
If yes, describe how the reward function must be changed and the resulting change to V*. Otherwise, briefly explain why this is impossible.
In a ZIP file, place the source code, makefile, and output text file (answers to questions 1, 2, 3). Upload the ZIP file to Vula before 10.00 AM, Monday, 23 September.
References
[Mitchell, 1997] Mitchell, T. (1997). Machine Learning, Chapter 13: Reinforcement Learning. McGraw-Hill, New York, USA.
[Sutton and Barto, 1998] Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA.