COMP 424 Artificial Intelligence Markov Decision Processes
Instructor: Jackie CK Cheung. Readings: R&N Ch. 17

Markov decision processes


Policies and value functions
Computing optimal value functions for MDPs
Policy iteration algorithm
Policy evaluation
Policy improvement

Sequential decision-making
Utility theory provides a foundation for one-shot decisions.
If more than one decision has to be made, reasoning about all of them jointly is in general very expensive.
Agents need to be able to make decisions in a repeated interaction with the environment over time, where the effects of one decision affect the next one.
Markov Decision Processes (MDPs) provide a framework for modeling sequential decision-making.

Applications of MDPs
AI / Computer Science:
Robotic control
Air campaign planning
Elevator control
Computation scheduling
Control and automation
Spoken dialogue management
Cellular channel allocation
Football play selection

More applications of MDPs
Economics / Operations Research
Inventory management
Fleet maintenance
Road maintenance
Packet retransmission
Nuclear plant management
Agriculture
Herd management
Fish stock management

Sequential decision-making
At each time step t, the agent is in some state s_t.
It chooses an action a_t, and as a result it receives a numerical reward R_{t+1} and observes the new state s_{t+1}.
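To make this loop concrete, here is a minimal sketch in Python (not from the slides); the `env` and `policy` objects and their reset/step interface are assumptions made for illustration:

```python
# Minimal sketch of the agent-environment loop described above (illustrative only).
# `env` and `policy` are hypothetical objects, assumed to provide:
#   env.reset()  -> initial state
#   env.step(a)  -> (next_state, reward, done)
#   policy(s)    -> action to take in state s

def run_episode(env, policy, max_steps=1000):
    s = env.reset()                     # the agent starts in some state s_t
    total_reward = 0.0
    for t in range(max_steps):
        a = policy(s)                   # choose action a_t
        s_next, r, done = env.step(a)   # receive reward R_{t+1}, observe s_{t+1}
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```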

Markov Decision Processes (MDPs)
Set of states S
Set of actions A
Transition model (dynamics): T : S × A × S → [0, 1]
T(s, a, s′) = P(s_{t+1} = s′ | s_t = s, a_t = a) is the probability of going from s to s′ under action a. (Same as in the HMM model.)
Reward function R : S × A → ℝ
R(s, a) is the short-term utility of the action.
Discount factor γ, between 0 and 1, usually close to 1.
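To make the notation concrete, here is a minimal sketch (not from the slides) of how a small finite MDP could be stored in Python; the dictionary layout and the toy states, actions, and numbers are assumptions made for illustration:

```python
# Illustrative sketch: a tiny finite MDP stored as plain Python containers.
# T[s][a] maps next states s' to probabilities P(s' | s, a),
# R[(s, a)] is the immediate reward, and gamma is the discount factor.

states  = ["s0", "s1"]
actions = ["stay", "go"]

T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.1, "s1": 0.9}},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.5}
gamma = 0.9

# Sanity check: each T[s][a] must be a probability distribution over next states.
for s in states:
    for a in actions:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```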

MDP Assumptions
Markov property:
Outcome of transition depends on current state and action
Reward function depends on current state and action
Distributions are stationary (transition and reward functions do not change over time)

Discount factor γ
Future rewards are worth less than current rewards. Why?
Two interpretations:
Inflation rate: an amount of money received in a year is worth less than the same amount received today.
At each time step, there is a 1 − γ chance that the agent dies and receives no rewards afterwards.
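A quick numerical illustration (not from the original slides): summing the geometric series gives 1 + γ + γ² + … = 1/(1 − γ), so under the second interpretation the agent survives for 1/(1 − γ) steps in expectation; with γ = 0.9 that is 10 steps, and a reward received 10 steps in the future is weighted by γ¹⁰ ≈ 0.35.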

Planning in MDPs
The goal of an agent in an MDP is to be rational: maximize its expected utility (i.e., the MEU principle again).
Maximizing the immediate utility (reward) is not sufficient.
E.g., the agent might pick an action that gives instant gratification, even if it later causes it to die.
The goal is to maximize the long-term utility (also called the return).
The return is an additive function of all rewards received by the agent.

Long-term utilities
The utility Ut for a trajectory, starting from step t, is defined depending on the type of task.
Episodic / finite-horizon tasks:
(e.g., games, trips through a maze, etc.)
U_t = R_t + R_{t+1} + R_{t+2} + … + R_T
where T is the time when a terminal state is reached.
Continuing / infinite-horizon tasks:
(e.g., tasks which may go on forever)
U_t = R_t + γR_{t+1} + γ²R_{t+2} + γ³R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k}
Discount factor γ < 1 ensures that the return is finite if the rewards are bounded.

Example: Mountain-Car
States: position and velocity
Actions: accelerate forward, accelerate backward, coast
Goal: get the car to the top of the hill as quickly as possible.
Reward: -1 for every time step, until the car reaches the top (then 0).
(Alternately: reward = 1 at the top, 0 otherwise, with γ < 1.)

Policies
A policy defines how the agent should act in a state.
Two types of policies:
1. Deterministic policy: in each state the agent chooses a unique action.
π : S → A, π(s) = a
2. Stochastic policy: in the same state, the agent can roll a die and choose different actions.
π : S × A → [0, 1], π(s, a) = P(a_t = a | s_t = s)

Finding a good policy
Our agent needs to learn a good policy π.
E.g., how should this robot behave in this map?
(Figure: gridworld transition model; the agent moves in the intended direction with probability 0.7 and to each side with probability 0.1.)
One approach:
Figure out how good each state is (i.e., its expected utility).
Choose a policy that takes actions with high expected utility, using the states the agent might end up in after an action.

State-value function
The state-value function (or simply value function) of a policy π is a function V^π : S → ℝ.
The value of state s under policy π is the expected return if the agent starts from state s and picks actions according to π:
V^π(s) = E_π[ U_t | s_t = s ]
For a finite state space, we can represent this as an array, with one entry for every state.

Policy iteration

Policy evaluation
Recall our definition of the return:
U_t = R_t + γR_{t+1} + γ²R_{t+2} + γ³R_{t+3} + …
    = R_t + γ( R_{t+1} + γR_{t+2} + … )
    = R_t + γU_{t+1}
Based on this observation, V^π(s) becomes:
V^π(s) = E_π[ U_t | s_t = s ] = E_π[ R_t + γU_{t+1} | s_t = s ]
By writing the expectation explicitly, we get:
Deterministic policy: V^π(s) = R(s, π(s)) + γ Σ_{s′∈S} T(s, π(s), s′) V^π(s′)
Stochastic policy: V^π(s) = Σ_{a∈A} π(s, a) ( R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V^π(s′) )
This is a system of linear equations (one per state) with a unique solution V^π.

Bellman equation
Consider the general, stochastic version:
V^π(s) = Σ_{a∈A} π(s, a) ( R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V^π(s′) )
This equation is recursive: it rewrites the value of a state in terms of the values of other states.
We will use it to derive algorithms for policy evaluation and policy improvement.

Policy evaluation in matrix form
Bellman's equation in matrix form: V^π = R^π + γ T^π V^π
What are V^π, R^π and T^π?
V^π is a vector containing the value of each state under policy π.
R^π is a vector containing the immediate reward at each state: R(s, π(s)).
T^π is a matrix containing the transition probabilities under π: T(s, π(s), s′).
In some cases, we can solve this exactly: V^π = ( I − γ T^π )⁻¹ R^π
Can we do this iteratively? (Necessary for large state spaces.)

Iterative policy evaluation
Main idea: turn the Bellman equations into update rules.
1. Start with some initial guess V_0.
2. During every iteration k, update the value function for all states:
V_{k+1}(s) ← R(s, π(s)) + γ Σ_{s′∈S} T(s, π(s), s′) V_k(s′)
3. Stop when the maximum change between two iterations is smaller than a desired threshold (the values stop changing).
This is a bootstrapping idea: the value of one state is updated based on the current estimates of the values of successor states.
This is a dynamic programming algorithm which is guaranteed to converge!
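The update rule above translates almost directly into code. Below is an illustrative sketch (not from the slides) of iterative policy evaluation for a deterministic policy, reusing the dictionary-style MDP layout assumed earlier, together with the exact matrix solution for comparison; the function names are hypothetical:

```python
import numpy as np

def evaluate_policy(states, T, R, gamma, pi, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy pi (dict: state -> action).
    Repeats V_{k+1}(s) <- R(s, pi(s)) + gamma * sum_{s'} T(s, pi(s), s') V_k(s')
    until the largest change over all states drops below the threshold theta."""
    V = {s: 0.0 for s in states}                      # initial guess V_0 = 0
    while True:
        delta = 0.0
        V_new = {}
        for s in states:
            a = pi[s]
            V_new[s] = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:                             # values have (numerically) stopped changing
            return V

def evaluate_policy_exact(states, T, R, gamma, pi):
    """Exact solution of the linear system V = R^pi + gamma T^pi V,
    i.e. V = (I - gamma T^pi)^{-1} R^pi."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in states:
        a = pi[s]
        R_pi[idx[s]] = R[(s, a)]
        for s2, p in T[s][a].items():
            T_pi[idx[s], idx[s2]] = p
    v = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: float(v[idx[s]]) for s in states}
```

On the toy MDP sketched earlier, the two functions should agree up to the stopping tolerance.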
Convergence of iterative policy evaluation
Consider the absolute error in our estimate V_{k+1}(s):
|V_{k+1}(s) − V^π(s)| = γ | Σ_{s′∈S} T(s, π(s), s′) ( V_k(s′) − V^π(s′) ) | ≤ γ max_{s′} |V_k(s′) − V^π(s′)|
Let ε_k be the worst error at iteration k: ε_k = max_s |V_k(s) − V^π(s)|.
From the previous calculation, we have ε_{k+1} ≤ γ ε_k.
Because γ < 1, this means that the error shrinks to 0 as k grows.
We say that the error contracts by a contraction factor of γ.

Policy iteration

Searching for a good policy
We say that π ≥ π′ if V^π(s) ≥ V^{π′}(s) for all s ∈ S.
This gives a partial ordering of policies.
If one policy is better at one state but worse at another state, the two policies are not comparable.
Since we know how to compute values for policies, we can search through the space of policies.
Local search seems like a good fit.

Policy improvement
Recall Bellman's equation:
V^π(s) = Σ_{a∈A} π(s, a) ( R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V^π(s′) )
Suppose that there is some action a* such that:
R(s, a*) + γ Σ_{s′∈S} T(s, a*, s′) V^π(s′) > V^π(s)
Then if we set π(s, a*) ← 1, the value of state s will increase, because we replaced each element in the sum in V^π(s) with a bigger quantity.
The values of states that can transition to s increase as well.
The values of all other states stay the same.
So the new policy using a* is better than the initial policy π.

Policy iteration
More generally, we can change the policy π to a new policy π′ which is greedy with respect to the computed values V^π:
π′(s) = argmax_{a∈A} ( R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V^π(s′) )
This gives us a local search through the space of policies.
We stop when the values of two successive policies are identical.
Because we only look for deterministic policies, and there is a finite number of them, the search is guaranteed to terminate.
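As an illustrative sketch (not from the slides), the greedy improvement step described above could be written as follows, under the same assumed MDP layout:

```python
def improve_policy(states, actions, T, R, gamma, V):
    """Return the deterministic policy that is greedy with respect to V:
    pi'(s) = argmax_a [ R(s, a) + gamma * sum_{s'} T(s, a, s') V(s') ]."""
    pi_new = {}
    for s in states:
        def q(a, s=s):
            # One-step lookahead value of taking action a in state s, then following V.
            return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
        pi_new[s] = max(actions, key=q)
    return pi_new
```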

Policy iteration algorithm
Start with an initial policy π_0 (e.g., random).
Repeat:
1. Compute V^{π_i}, using policy evaluation.
2. Compute a new policy π_{i+1} that is greedy with respect to V^{π_i}.
Terminate when V^{π_i} = V^{π_{i+1}}.
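Combining the two steps, a minimal policy iteration loop might look like the sketch below (again an illustration under the earlier assumptions, reusing the evaluate_policy and improve_policy functions sketched above); it stops when the greedy policy no longer changes, which also means the values no longer change:

```python
def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    pi = {s: actions[0] for s in states}          # arbitrary (deterministic) initial policy
    while True:
        V = evaluate_policy(states, T, R, gamma, pi)               # step 1: policy evaluation
        pi_new = improve_policy(states, actions, T, R, gamma, V)   # step 2: greedy improvement
        if pi_new == pi:                          # policy (and hence its value) stopped changing
            return pi, V
        pi = pi_new
```

On the toy two-state MDP sketched earlier, policy_iteration(states, actions, T, R, gamma) should return an optimal deterministic policy and its values after only a few iterations.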

A 4×3 gridworld example
Transitions are stochastic, as shown in the figure: the agent moves in the intended direction with probability 0.7, and to each side with probability 0.1.

Policy Iteration (1)
(Figures: the initial policy; the state values after policy evaluation.)

Policy Iteration (2)
(Figures: the policy after policy improvement; the state values after another policy evaluation.)

Policy Iteration (3)
(Figures: the policy after another policy improvement; the state values have converged.)

A 4×3 gridworld example
New version:
Transitions are still stochastic.
Change the reward of the pit from -10 to -500.
Agent actively tries to avoid the goal, for fear of falling into the pit!

Generalized Policy Iteration
Any combination of policy evaluation and policy improvement steps, e.g., only update the value of one state, and immediately improve the policy at that state.

Optimal policies and optimal value functions
The optimal value function, V*, is defined as the best value that can be achieved at any state:
V*(s) = max_π V^π(s)
In a finite MDP, there exists a unique optimal value
function (shown by Bellman, 1957).
Any policy that achieves the optimal value function is called an optimal policy (denoted *). The optimal policy is not necessarily unique.

Optimal policies in the gridworld example
Optimal state values give information about the shortest path to the goal.
One of the deterministic optimal policies is shown in the lecture figure.
There can be an infinite number of optimal policies (think stochastic policies).

Complexity of policy iteration
Repeat 2 basic steps:
1. Compute V^π, using policy evaluation.
2. Compute a new policy π′ that is greedy with respect to V^π.
Repeat for how many iterations?

Complexity of policy iteration
Repeat 2 basic steps:
1. Compute V^π, using policy evaluation.
Per iteration: O(|S|³)
2. Compute a new policy π′ that is greedy with respect to V^π.
Per iteration: O(|S|²|A|)
Repeat for how many iterations? At most |A|^|S|, the number of deterministic policies.
Can get very expensive when there are many states!

What you should know
Definition of MDP framework.
Differences/similarities between MDPs and other AI approaches (e.g. general search, game playing, STRIPS planning).
Basic MDP algorithms and their properties:
Policy iteration
Policy evaluation
Policy improvement

