Figure 1: Hidden Markov Model (HMM)

Consider the HMM described in Figure 1. let the states of *H _{i }*be {1

*,*2

*,*3}, and the states of

*O*be {”

_{i }*the*”

*,*”

*cat*”

*,*”

*jumps*”

*,*”

*up*”}.

- Let
*T*= 3. Write out the factorized form of*P*(*O*_{1 }= ”*the*”*,O*_{2 }= ”*cat*”*,O*_{3 }= ”*jumps*”) in terms of the initial, transition and emission probabilities. Specify all summations and products clearly. - Show that if any initial parameter
*π*or transition parameter_{i }*p*is initially set to 0, then it will remain 0 in all subsequent updates of the EM algorithm._{ij }

# Problem 2

In this problem, we will study the multi-armed (k = 10) bandit problem. Begin by including the following code in your script:

**import **numpy as np **import **matplotlib . pyplot as plt

T = 1000

- Define the greedy and UCB policy functions

def e greedy (Q, N, t , e ):

. . .

def UCB(Q, N, t , c ):

. . .

which both take in *Q*, an array representing the action-values, *N*, an array representing the number of times different actions have been taken, *t*, the total number of actions taken thus far, and finally a policy-specific parameter.

- Write a function that performs one run (1000 time steps), updates
*Q*incrementally and records the reward received at each time step:

def testrun ( policy , param ): truemeans = np. random . normal (0 , 1 , 10) reward = np. zeros (T + 1)

Q = np. zeros (10)

. . .

r = np. random . normal( true means [ a ] , 1)

. . .

return reward

At each time step, when action *a *is taken, the reward *r *is sampled from a normal distribution with mean *true means*[*a*] and standard deviation 1.

- Use the following function to average over 2000 runs and plot the results:

def main ():

aveg = np. zeros (T+1) aveeg = np. zeros (T+1) aveucb = np. zeros (T+1)

for i in range (2000):

g = test run ( e greedy , 0.0) eg = test run ( e greedy , # choose parameter ) ucb = test run (UCB, # choose parameter )

aveg += (g − ave g ) / ( i + 1) aveeg += (eg − ave eg ) / ( i + 1) aveucb += (ucb − ave ucb ) / ( i + 1)

t = np. arange (T + 1) plt . plot (t , ave g , ’b−’, t , ave eg , ’r −’, t , ave ucb , ’g−’) plt . show()

At approximately what value of * *does the -greedy method switch from being better than the greedy policy to being worse than it?

# Problem 3: Robot vs Snakes

In this problem, we will simulate a robot navigating a room trying to find the exit. The robot is bleeding and loses 1 hitpoint for every step it takes before it finds the exit. To make matters worse, there are 2 snakes in the centre of the room. If the robot steps onto a square with a snake, the snake will bite it and cause it to lose 15 hitpoints! As the robot is terrified of being bitten, it is very nervous and does not strictly respond to commands; if told to move in a certain direction, it will only do so half the time. One quarter of the time it will instead move to the left of the commanded direction, and another quarter of the time it will move to the right of the commanded direction. Can you help train the robot to find its way out?

Begin by installing OpenAI’s gym module (pip install gym), and include the following code in your script.

**import **numpy as np **import **sys **from **gym. envs . toy text **import **discrete UP = 0

RIGHT = 1

DOWN = 2

LEFT = 3

GOAL = 4 | # upper−right |
corner |

START = 20SNAKE1 = 7SNAKE2 = 17 | # lower−l e f t |
corner |

eps = 0.25

**class **Robot vs snakes world ( discrete . DiscreteEnv ):

**def **i n i t ( s e l f ): s e l f . shape = [5 , 5]

nS = np. prod( s e l f . shape ) *# total number of states*

nA = 4 *# total number of actions per state*

MAXY = s e l f . shape [0]

MAXX = s e l f . shape [1]

P = {}

grid = np. arange (nS ). reshape ( s e l f . shape ) it = np. nditer ( grid , flags =[ ’ multi index ’ ])

**while not **it . finished : s = it . iterindex

y , x = it . multi index P[ s ] = {a : [ ] **for **a **in range**(nA)}

is done = lambda s :if is done ( s ): reward = 0.0 |
s == GOAL |

elif s == SNAKE1 or reward = −15.0 |
s == SNAKE2: |

**else **:

reward = −1.0

**if **is done ( s ):

P[ s ] [UP] = [(1.0 , s , reward , True )]

P[ s ] [RIGHT] = [(1.0 , s , reward , True )]

P[ s ] [DOWN] = [(1.0 , s , reward , True )]

P[ s ] [LEFT] = [(1.0 , s , reward , True )] **else **:

nsup = s **if **y == 0 **else **s − MAXX nsright = s **if **x == (MAXX − 1) **else **s + 1

nsdown = s **if **y == (MAXY − 1) **else **s + MAXX

nsleft = s **if **x == 0 **else **s − 1

P[ s ] [UP] = [(1 − (2 ∗ eps ) , ns up , reward , is done ( ns up )) ,

(eps , nsright , reward , is done ( ns right )) ,

(eps , nsleft , reward , is done ( ns left ))]

P[ s ] [RIGHT] = [(1 − (2 ∗ eps ) , ns right , reward , is done ( ns right )) ,

(eps , nsup , reward , is done ( ns up )) ,

(eps , nsdown , reward , is done (ns down ))]

P[ s ] [DOWN] = [(1 − (2 ∗ eps ) , ns down , reward , is done (ns down )) ,

(eps , nsright , reward , is done ( ns right )) ,

(eps , nsleft , reward , is done ( ns left ))]

P[ s ] [LEFT] = [(1 − (2 ∗ eps ) , ns left , reward , is done ( ns left )) ,

(eps , nsup , reward , is done ( ns up )) ,

(eps , nsdown , reward , is done (ns down ))]

it . iternext ()

isd = np. zeros (nS) isd [START] = 1.0 s e l f .P = P **super**( Robot vs snakes world , s e l f ). i n i t (nS , nA, P, isd )

**def **render ( s e l f ):

grid = np. arange ( s e l f .nS ). reshape ( s e l f . shape ) it = np. nditer ( grid , flags =[ ’ multi index ’ ])

**while not **it . finished : s = it . iterindex

y , x = it . multi index

**if **s e l f . s == s :

output = ” R ”

**elif **s == GOAL:

output = ” G ”

**elif **s == SNAKE1 **or **s == SNAKE2:

output = ” S ” **else **:

output = ” o ”

**if **x == 0:

output = output . lstrip ()

**if **x == s e l f . shape [1] − 1:

output = output . rstrip () sys . stdout . write ( output )

**if **x == s e l f . shape [1] − 1: sys . stdout . write (”

”) it . iternext () sys . stdout . write (”

”)

You can write

env = Robot vs snakes world ()

to start the environment, which is a 5 x 5 grid. Locations on the grid are numbered starting from 0 on the upper-left, and increasing in a row-wise manner; i.e. the upper-right is numbered 4 (exit/goal state), the lower-left (starting location of robot) is numbered 20, the lower-right corner is numbered

24, and so on. You can access the robot’s location at any time using env . s ,

and issue a command to move using env . step (DIR) ,

where *DIR *is *UP*, *DOWN*, *LEFT *or *RIGHT*. Finally,

env .p[ state ] [ action ]

returns a list of tuples, each corresponding to a possible next state *j *and of the form (probability of transitioning to state *j*, *j*, reward, goal state?).

- Begin by writing a function which performs value-iteration (note that there is no discountingof the rewards, i.e. the discount factor is 1):

**def **value iteration (env ): policy = np. zeros ([ env .nS , env .nA])

V = np. zeros (env .nS)

. . .

. . .

**return **policy , V

This function is to return the value function, as well as the greedy policy function. Print out both the policy and the value function.

- Next, use the policy function obtained above to navigate the robot through the room. At eachtime step, use env . render ()

to print out the layout of the room with the robot’s current location. Terminate the program when the robot reaches the exit.

- Does the robot try to avoid the snakes? If so, refer to the policy function obtained in (a) toexplain in what way it does so.

## Reviews

There are no reviews yet.