[SOLVED] chain algorithm 1. You MUST answer this question.

$25

File Name: chain_algorithm_1._You_MUST_answer_this_question..zip
File Size: 461.58 KB

5/5 - (1 vote)

1. You MUST answer this question.

(a) For what types of decision problems would the application of reinforcement
learning methods be preferable over other methods? [3 marks ]

(b) Explain the difference between value and reward signal. What properties are
desirable for the choice of a reward signal when setting-up a reinforcement
learning algorithm? [3 marks ]

(c) Why is the ergodicity of a Markov chain an important property in the con-
text of reinforcement learning? [4 marks ]

(d) Explain the concept of regret in the context of performance evaluation of
on-line algorithms. Does the -greedy method achieve zero regret? If so,
why? If not, why not? [3 marks ]

(e) How can one achieve sufficient exploration of states and actions in a rein-
forcement learning problem? [3 marks ]

(f) Using eligibility traces can speed-up learning. Give an example and briefly
explain your answer, also mentioning potential risks. [3 marks ]

(g) Explain how the stationary solution of a well-behaved Markovian decision
problem can be calculated when the transition probabilities and the starting
state are known. How can this solution be used to compare a number of
applicable control policies with respect to a given cost function? [3 marks ]

(h) Consider a one-dimensional track of discrete states which are numbered from
1 to N . Rewards are always zero, except for state 1 where a reward of r = 1
is given. The discount factor is . Assume the agent is currently in state k
(1 < k < N) and can move either to k + 1 or to k 1. How do the valuesof the two reachable states differ? [3 marks ]Page 1 of 32. (a) What is a state in a reinforcement learning problem and what is the reasonthe states are sometimes not fully observable? What is a belief state in thecontext of Partially Observable Markov Decision Processes (POMDP)? [3 marks ](b) Classical reinforcement algorithms aim at maximising the mean discountedreward over an unlimited future. In POMDPs, however, often a finite timehorizon is considered. Why is this so? [2 marks ](c) What do you understand by the term particle filter? Explain with equa-tions how this can be used to define an approximate solution method forPOMDPs. [3 marks ](d) In principle, partial observability considerably increases the complexity ofthe learning algorithms. Explain two ways (different from question (c)) toreduce the complexity of POMDPs. [4 marks ](e) How do on- and off-policy methods differ at the algorithmic level? What isthe key requirement for off-policy methods in terms of policy architecture? [2 marks ](f) How does a semi-Markov Decision Process (SMDP) differ from an MDP?Explain, using expressions for reward, state transition law and cost criterionfor the example of a robot navigating in a large building, why SMDPs canbe beneficial. If the robot receives information about its position, but doesnot have access to a map of the building, would it still be able to realisethese benefits? [4 marks ](g) Describe an application of inverse reinforcement learning in robotics. Pro-vide details of an algorithm for the problem by a graphical representationor by relevant equations. [4 marks ](h) Give an example where a stochastic policy is better than a deterministicpolicy even after perfect exploration? [3 marks ]Page 2 of 33. Connect Four (according to wikipedia) is a game in which two players first choosea colour and then take turns dropping a disc of their respective colour from thetop into a seven-column, six-row grid. The pieces fall straight down, occupyingthe next available space within the column. The objective of the game is toconnect four of ones own discs next to each other vertically, horizontally, ordiagonally before your opponent.Consider the problem of training an artificial player using reinforcement learning.(a) How would you design a reward signal for this problem? How would youapply the algorithm and organise the training? [2 marks ](b) While the number of actions is quite small, the state space is rather large:Each of the 42 grid positions can be empty or occupied by one of the colours,which leads to an upper bound of 342 states, but not all of these statesactually occur. Would it still make sense to use a look-up table for therepresentation of the value function? [2 marks ](c) Assuming that the answer for (b) is yes, what reinforcement learning algo-rithm would you choose? Describe the main steps of the algorithm. Howwould you initialise and control its parameters? [6 marks ](d) Assuming that the answer of (b) is no, explain two ways how the algo-rithms can be adapted to be applicable to the problem? Which way wouldyou choose? How would you initialise and control the parameters of thealgorithm? [7 marks ](e) If only a few thousands of games have been played for training, which im-plementation, (c) or (d), can be expected to perform better against a skilledhuman player? [2 marks ](f) How does the two-player problem differ from the standard reinforcementlearning problem? Extending upon this, how does a multi-agent reinforce-ment learning problem differ from its single-agent counterpart? Explainwhat aspects of the problem formulation change owing to the introductionof additional agents? Outline a learning procedure that works in the pres-ence of multiple agents. [6 marks ]Page 3 of 3

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] chain algorithm 1. You MUST answer this question.
$25