1. You MUST answer this question.
(a) Consider a bandit problem with two arms. It is known that one of the arms
leads to rewards that are homogeneously distributed in the interval [0, 1],
while for the other one rewards in [0, 2] are possible. How many exploratory
actions will you need take on average in order to identify which of the two
arms which has a higher reward average? In the above example, give a value
for the optimistic initialisation of the reward estimate which is sufficient to
identify the optimal solution in at least 80% of all cases. [6 marks ]
(b) The concept of value is important in reinforcement learning. Explain the
difference between value and reward signal. What properties are desirable
for the choice of a reward signal when setting-up a reinforcement learning
algorithm? [3 marks ]
(c) Explain the difference between on-policy learning and off-policy learning for
the example of an agent that moves in a grid world that contains one or
more pits (positions with very large negative reward). [3 marks ]
(d) Describe (using symbols and pseudocode) the SARSA and Q-learning algo-
rithms. What is the essential difference between them? Explain using an
example. In some applications, it is empirically observed, although not the-
oretically justified, that SARSA converges faster than Q-learning. Describe
possible reasons for this effect. [6 marks ]
(e) Exploration is essential in most reinforcement learning algorithms. Identify
three different approaches to exploration. [3 marks ]
(f) Consider a one-dimensional track of discrete states which are numbered from
1 to N . Rewards are always zero, except for state 1 where a reward of r = 1
is given. The discount factor is . Assume the agent is currently in state k
(1 < k < N) and can move either to k + 1 or to k 1. How do the valuesof the two reachable states differ? [4 marks ]Page 1 of 32. Markov Decision Processes(a) Why is the Markov property important for most reinforcement learning algo-rithms? Explain the concept of ergodicity in the context of discrete Markovchains. Why is it important for reinforcement learning? [4 marks ](b) What are the advantages and disadvantages of TD(0), Monte Carlo andTD() learning? Discuss your answer also for an inventory control problem. [3 marks ](c) Explain the role of the discount parameter in reinforcement learning. Un-der what condition is = 1 possible? Under what conditions would = 0make sense? [3 marks ](d) How does a semi-Markov Decision Process (SMDP) differ from an MDP?Explain, using expressions for reward, state transition law and cost criterionfor the example of a robot navigating in a large building, why SMDPs canbe beneficial. If the robot receives information about its position, but doesnot have access to a map of the building, would it still be able to realisethese benefits? [4 marks ](e) What is a belief state in the context of Partially Observable Markov DecisionProcesses (POMDP)? How does one modify the Bellman equation for theoptimal value function to accommodate this concept? [4 marks ](f) A mobile robot is travelling through a maze and receives the more reward thefaster it reaches at a certain goal location after departing from a given startstate. At low speed the robot obtains precise state information through itscamera, but at higher speeds the camera image shows movement artefactssuch that the state information cannot provide unambiguous state informa-tion. Describe the details of the task and its solution in terms of a POMDP,and describe what behaviours the robot might learn within this framework.[7 marks ]Page 2 of 33. Application of Reinforcement Learning to real-world problems(a) In the past century, many algorithms for reinforcement learning were de-signed and analysed for grid-world problems. In what respects were thesealgorithms developed in order to become competitive in real-world prob-lems? Explain your answer and choose an example to illustrate your argu-ment. [5 marks ](b) Suppose you are using function approximation in reinforcement learning.What would be the advantage of using a linear combination of basis functionsin order to represent the value function of a state rather than a non-linearmethod such as a neural network with hidden layers? How would you designa set of basis functions for the example that you have discussed in Question3(a)? [3 marks ](c) How can the learning speed of reinforcement learning algorithms be im-proved? Answer this question by discussing whether this is possiblei. without using detailed domain knowledge,ii. based on pre-training in a simulation, andiii. using trajectories obtained from human demonstration.Explain your answer in the context of specific reinforcement learning algo-rithms and discuss also potential drawbacks. [7 marks ](d) What generalisation issues may arise when applying reinforcement learningto a real-world problem? How can their effect be reduced? [3 marks ](e) Design a reinforcement learning algorithm for the control of an online adver-tising system that places ads on webpages based on (incomplete) informationabout the user. Discuss states, rewards, actions, type and structure of thealgorithm and reasonable values for its parameters. In order to convince aadvertising agency to adopt your algorithm you will also need to provide es-timates of training times and expected performance. Given the uncertaintyof the problem, these estimates will not be very reliable. What general ar-guments could you give in addition to the estimates in order to convince theagency of the reinforcement learning approach? [7 marks ]Page 3 of 3
Reviews
There are no reviews yet.