COMP 424 – Artificial Intelligence
Markov Decision Processes
Instructor: Jackie CK Cheung. Readings: R&N Ch. 17

Outline

• Markov decision processes


• Policies and value functions
• Computing optimal value functions for MDPs
• Policy iteration algorithm
• Policy evaluation
• Policy improvement

Sequential decision-making
• Utility theory provides a foundation for one-shot decisions.
• If more than one decision has to be taken, reasoning about all of them jointly is in general very expensive.
• Agents need to be able to make decisions in a repeated interaction with the environment over time, where the effects of one decision affect the next one.
• Markov Decision Processes (MDPs) provide a framework for modeling sequential decision-making.

Applications of MDPs
• AI / Computer Science:
• Robotic control
• Air campaign planning
• Elevator control
• Computation scheduling
• Control and automation
• Spoken dialogue management
• Cellular channel allocation
• Football play selection

More applications of MDPs
• Economics / Operations Research
• Inventory management
• Fleet maintenance
• Road maintenance
• Packet retransmission
• Nuclear plant management
• Agriculture
• Herd management
• Fish stock management

Sequential decision-making
• At each time step t, the agent is in some state s_t.
• It chooses an action a_t; as a result, it receives a numerical reward R_{t+1} and observes the new state s_{t+1}.
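A minimal sketch of this interaction loop in Python (not from the slides); the env object with reset()/step() methods and the policy callable are hypothetical stand-ins:

    # Minimal sketch of the agent-environment loop. The env object with
    # reset()/step() methods and the policy function are hypothetical.
    def run_episode(env, policy, max_steps=100):
        s = env.reset()                  # observe the initial state s_0
        total_reward = 0.0
        for t in range(max_steps):
            a = policy(s)                # choose action a_t in state s_t
            s_next, r = env.step(a)      # receive reward R_{t+1}, observe s_{t+1}
            total_reward += r
            s = s_next
        return total_reward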

Markov Decision Processes (MDPs)
• Set of states S
• Set of actions A
• Transition model (dynamics): T: S × A × S → [0, 1]
• T(s,a,s') = P(s_{t+1} = s' | s_t = s, a_t = a) is the probability of going from s to s' under action a. (This has the same form as an HMM's transition model.)
• Reward function R: S × A → ℝ
• R(s,a) is the short-term utility of the action.
• Discount factor γ
• γ is between 0 and 1, usually close to 1.
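As an illustration, a finite MDP can be stored as a pair of NumPy arrays plus a discount factor. This container is an assumption of ours, not course code; the later sketches in this section reuse it:

    from dataclasses import dataclass
    import numpy as np

    # Illustrative container for a finite MDP (S, A, T, R, gamma).
    @dataclass
    class MDP:
        T: np.ndarray    # transition model, shape (|S|, |A|, |S|): T[s, a, s2] = P(s2 | s, a)
        R: np.ndarray    # reward function, shape (|S|, |A|): R[s, a]
        gamma: float     # discount factor in [0, 1)

        @property
        def n_states(self) -> int:
            return self.T.shape[0]

        @property
        def n_actions(self) -> int:
            return self.T.shape[1]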

MDP Assumptions
• Markov property:
• The outcome of a transition depends only on the current state and action.
• The reward depends only on the current state and action.
• Distributions are stationary (the transition and reward functions do not change over time).

Discount factor, γ
• Future rewards are worth less than current rewards. Why?
• Two interpretations:
• Inflation: an amount of money received a year from now is worth less than the same amount today.
• At each time step, there is a 1 − γ chance that the agent dies and receives no rewards afterwards.

Planning in MDPs
• The goal of an agent in an MDP is to be rational: to maximize its expected utility (the MEU principle again).
• Maximizing the immediate utility (reward) is not sufficient.
• E.g., the agent might pick an action that gives instant gratification, even if it later makes it “die”.
• The goal is to maximize long-term utility, also called the return.
• The return is an additive function of all rewards received by the agent.

Long-term utilities
The utility U_t for a trajectory, starting from step t, is defined depending on the type of task.
Episodic / finite-horizon tasks:
(e.g., games, trips through a maze, etc.)
U_t = R_t + R_{t+1} + R_{t+2} + … + R_T
where T is the time when a terminal state is reached.
Continuing / infinite-horizon tasks:
(e.g., tasks which may go on forever)
U_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + γ^3 R_{t+3} + … = ∑_{k=0}^{∞} γ^k R_{t+k}

The discount factor γ < 1 ensures that the return is finite if the rewards are bounded (e.g., with γ = 0.9 and a constant reward of 1 per step, the return is 1/(1 − γ) = 10).

Example: Mountain-Car
• States: position and velocity
• Actions: accelerate forward, accelerate backward, coast
• Goal: get the car to the top of the hill as quickly as possible.
• Reward: −1 for every time step, until the car reaches the top (then 0).
(Alternately: reward = 1 at the top, 0 otherwise, with γ < 1.)

Policies
• A policy defines how the agent should act in a state.
• Two types of policies:
1. Deterministic policy: in each state the agent chooses a unique action.
π: S → A, π(s) = a
2. Stochastic policy: in the same state, the agent can “roll a die” and choose different actions.
π: S × A → [0, 1], π(s,a) = P(a_t = a | s_t = s)

Finding a good policy
• Our agent needs to learn a good policy.
• E.g., how should this robot behave in this map?
[Figure: a gridworld map; the robot moves in the intended direction with probability 0.7 and in each other direction with probability 0.1.]
• One approach:
• Figure out how good each state is (i.e., its expected utility).
• Choose a policy that takes actions with high expected utility, using the states the agent might end up in after an action.

State-value function
• The state-value function (or simply value function) of a policy π is a function V^π: S → ℝ.
• The value of state s under policy π is the expected return if the agent starts from state s and picks actions according to policy π:
V^π(s) = E_π [ U_t | s_t = s ]
• For a finite state space, we can represent this as an array, with one entry for every state.

Policy iteration

Policy evaluation
• Recall our definition of the return:
U_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + γ^3 R_{t+3} + …
= R_t + γ ( R_{t+1} + γ R_{t+2} + … )
= R_t + γ U_{t+1}
• Based on this observation, V^π(s) becomes:
V^π(s) = E_π [ U_t | s_t = s ] = E_π [ R_t + γ U_{t+1} | s_t = s ]
• By writing the expectation explicitly, we get:
• Deterministic policy: V^π(s) = R(s, π(s)) + γ ∑_{s'∈S} T(s, π(s), s') V^π(s')
• Stochastic policy: V^π(s) = ∑_{a∈A} π(s,a) ( R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s') )
• This is a system of linear equations (one per state) with a unique solution V^π.

Bellman's equation
• Consider the general, stochastic version:
V^π(s) = ∑_{a∈A} π(s,a) ( R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s') )
• This equation is recursive: it rewrites the value of a state in terms of the values of other states.
• We will use it to derive algorithms for policy evaluation and policy improvement.

Policy evaluation in matrix form
• Bellman's equation in matrix form: V^π = R^π + γ T^π V^π
• What are V^π, R^π and T^π?
• V^π is a vector containing the value of each state under policy π.
• R^π is a vector containing the immediate reward at each state: R(s, π(s)).
• T^π is a matrix containing the transition probability at each state: T(s, π(s), s').
• In some cases, we can solve this exactly:
V^π = ( I − γ T^π )^{-1} R^π
• Can we do this iteratively? (Necessary for large state spaces.)

Iterative policy evaluation
Main idea: turn the Bellman equations into update rules.
1. Start with some initial guess V_0.
2. During every iteration k, update the value function for all states:
V_{k+1}(s) ← R(s, π(s)) + γ ∑_{s'∈S} T(s, π(s), s') V_k(s')
3. Stop when the maximum change between two iterations is smaller than a desired threshold (the values stop changing).
This is a bootstrapping idea: the value of one state is updated based on the current estimates of the values of successor states.
This is a dynamic programming algorithm which is guaranteed to converge!
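Both the closed-form solve and the iterative update rule fit in a few lines of NumPy. This is a sketch under the assumption of the illustrative MDP container from earlier, with a deterministic policy pi stored as an integer array:

    import numpy as np

    def evaluate_exact(mdp, pi):
        """Solve V = R_pi + gamma T_pi V directly: V = (I - gamma T_pi)^{-1} R_pi."""
        idx = np.arange(mdp.n_states)
        T_pi = mdp.T[idx, pi]                 # matrix of T(s, pi(s), s'), shape (|S|, |S|)
        R_pi = mdp.R[idx, pi]                 # vector of R(s, pi(s)), shape (|S|,)
        return np.linalg.solve(np.eye(mdp.n_states) - mdp.gamma * T_pi, R_pi)

    def evaluate_iterative(mdp, pi, tol=1e-8):
        """Bellman equation as an update rule; stop when the max change is below tol."""
        idx = np.arange(mdp.n_states)
        T_pi, R_pi = mdp.T[idx, pi], mdp.R[idx, pi]
        V = np.zeros(mdp.n_states)            # initial guess V_0
        while True:
            V_new = R_pi + mdp.gamma * T_pi @ V   # one synchronous backup for every state
            if np.max(np.abs(V_new - V)) < tol:   # values stopped changing
                return V_new
            V = V_new

Each sweep of evaluate_iterative shrinks the worst-case error by a factor of γ, which is exactly the contraction argument made next.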
Convergence of iterative policy evaluation
• Consider the absolute error in our estimate V_{k+1}(s):
|V_{k+1}(s) − V^π(s)| = | R(s, π(s)) + γ ∑_{s'∈S} T(s, π(s), s') V_k(s') − ( R(s, π(s)) + γ ∑_{s'∈S} T(s, π(s), s') V^π(s') ) |
= γ | ∑_{s'∈S} T(s, π(s), s') ( V_k(s') − V^π(s') ) |
≤ γ max_{s'} | V_k(s') − V^π(s') |
(the immediate-reward terms cancel, leaving only the discounted difference in successor values).
• Let ε_k be the worst error at iteration k: ε_k = max_s |V_k(s) − V^π(s)|.
• From the previous calculation, we have: ε_{k+1} ≤ γ ε_k.
• Because γ < 1, this means that the error shrinks towards 0 as k grows.
• We say that the error contracts, by a contraction factor of γ.

Policy iteration

Searching for a good policy
• We say that π ≥ π' if V^π(s) ≥ V^{π'}(s), ∀s∈S.
• This gives a partial ordering of policies.
• If one policy is better at one state but worse at another state, the two policies are not comparable.
• Since we know how to compute values for policies, we can search through the space of policies.
• Local search seems like a good fit.

Policy improvement
• Recall Bellman's equation:
V^π(s) = ∑_{a∈A} π(s,a) ( R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s') )
• Suppose that there is some action a*, such that:
R(s, a*) + γ ∑_{s'∈S} T(s, a*, s') V^π(s') > V^π(s)
• Then if we set π(s, a*) ← 1, the value of state s will increase,
• because we replaced each element in the sum defining V^π(s) with a larger quantity.
• The values of states that can transition to s increase as well.
• The values of all other states stay the same.
• So the new policy using a* is better than the initial policy π.
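The improvement step is then a one-step greedy lookahead; a sketch, again assuming the illustrative MDP container:

    import numpy as np

    def improve(mdp, V):
        """Greedy improvement: argmax_a R(s,a) + gamma * sum_s' T(s,a,s') V(s')."""
        Q = mdp.R + mdp.gamma * (mdp.T @ V)   # one-step lookahead Q[s, a], shape (|S|, |A|)
        return np.argmax(Q, axis=1)           # new deterministic policy pi'(s)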

Policy iteration
• More generally, we can change the policy π to a new policy π' which is greedy with respect to the computed values V^π:
π'(s) = argmax_{a∈A} ( R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s') )
• This gives us a local search through the space of policies.
• We stop when the values of two successive policies are identical.
• Because we only look for deterministic policies, and there is a finite number of them, the search is guaranteed to terminate.

Policy iteration algorithm
• Start with an initial policy π_0 (e.g., random).
• Compute V^π, using policy evaluation.
• Compute a new policy π' that is greedy with respect to V^π.
• Terminate when V^π = V^{π'}; otherwise, repeat from the policy evaluation step with π ← π'.
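Combining the earlier evaluation and improvement sketches gives the full loop. Here termination is detected by a repeated policy, which with a deterministic argmax tie-break is equivalent to the value function no longer changing:

    import numpy as np

    def policy_iteration(mdp):
        """Alternate policy evaluation and greedy improvement until the policy is stable."""
        pi = np.zeros(mdp.n_states, dtype=int)   # arbitrary initial policy pi_0
        while True:
            V = evaluate_exact(mdp, pi)          # policy evaluation
            pi_new = improve(mdp, V)             # greedy policy improvement
            if np.array_equal(pi_new, pi):       # policy (hence value) stopped changing
                return pi, V
            pi = pi_new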

A 4×3 gridworld example
• Transitions are stochastic, as shown in the figure.
[Figure: transition model. The agent moves in the intended direction with probability 0.7, and in each of the other directions with probability 0.1.]
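To make the dynamics concrete, here is a sketch of one plausible reading of the figure: the agent moves in the intended direction with probability 0.7 and in each of the other three directions with probability 0.1, staying in place when a move would leave the grid. The grid dimensions and the bounce-back rule are assumptions for illustration:

    N_ROWS, N_COLS = 3, 4
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def next_cell(row, col, move):
        """Move one step, staying in place when the move would leave the grid (assumed)."""
        r, c = row + move[0], col + move[1]
        if 0 <= r < N_ROWS and 0 <= c < N_COLS:
            return r, c
        return row, col

    def transition_probs(row, col, action):
        """Distribution over next cells: 0.7 intended direction, 0.1 each other (assumed)."""
        probs = {}
        for i, move in enumerate(MOVES):
            p = 0.7 if i == action else 0.1
            cell = next_cell(row, col, move)
            probs[cell] = probs.get(cell, 0.0) + p
        return probs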

Policy Iteration (1)
[Figures: the initial policy, and the state values after the first policy evaluation.]

Policy Iteration (2)
[Figures: the policy after policy improvement, and the state values after another policy evaluation.]

Policy Iteration (3)
[Figures: the policy after another improvement step, and the final, converged state values.]

A 4×3 gridworld example
• New version:
– Transitions are still stochastic.
– Change the reward of the pit from -10 to -500.
Result: the agent actively tries to avoid the goal, for fear of falling into the pit!

Generalized Policy Iteration
• Generalized policy iteration is any combination of policy evaluation and policy improvement steps, e.g., updating the value of only one state and then immediately improving the policy at that state, as in the sketch below.
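For instance, an asynchronous variant can back up a single state and immediately re-greedify the policy there. A sketch, reusing the illustrative MDP container:

    import numpy as np

    def gpi_step(mdp, V, pi, s):
        """One generalized policy iteration step, touching only state s (in place)."""
        q = mdp.R[s] + mdp.gamma * (mdp.T[s] @ V)  # one-step lookahead values for s
        pi[s] = int(np.argmax(q))                  # improve the policy at s only
        V[s] = q[pi[s]]                            # update the value of s only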

Optimal policies and optimal value functions
• The optimal value function, V*, is defined as the best value that can be achieved at any state:
V*(s) = max_π V^π(s)
• In a finite MDP, there exists a unique optimal value function (shown by Bellman, 1957).
• Any policy that achieves the optimal value function is called an optimal policy (denoted π*). The optimal policy is not necessarily unique.
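As a usage sketch, running the earlier policy_iteration function on a made-up two-state, two-action MDP returns one optimal policy along with V*; the numbers are purely illustrative:

    import numpy as np

    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])   # T[s, a, s']; each row sums to 1
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])                 # R[s, a]
    mdp = MDP(T=T, R=R, gamma=0.9)

    pi_star, V_star = policy_iteration(mdp)
    print("optimal policy:", pi_star)          # V* is unique; pi* may not be
    print("optimal values:", V_star)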

Optimal policies in the gridworld example
• Optimal state values give information about the shortest path to the goal.
• One of the deterministic optimal policies is shown in the figure below.
[Figure: a deterministic optimal policy for the gridworld.]
• There can be an infinite number of optimal policies (think stochastic policies).


Complexity of policy iteration
Repeat two basic steps:
1. Compute V^π, using policy evaluation.
Per iteration: O(|S|^3)
2. Compute a new policy π' that is greedy with respect to V^π.
Per iteration: O(|S|^2 |A|)
Repeat for how many iterations? At most |A|^|S|, the number of deterministic policies.
This can get very expensive when there are many states!

What you should know
• Definition of MDP framework.
• Differences and similarities between MDPs and other AI approaches (e.g., general search, game playing, STRIPS planning).
• Basic MDP algorithms and their properties:
• Policy iteration
• Policy evaluation
• Policy improvement
