Human-level performance in first-person multiplayer games with population-based deep reinforcement learning

Max Jaderberg1, Wojciech M. Czarnecki1, Iain Dunning1, Luke Marris1,
Guy Lever1, Antonio Garcia Castaneda1, Charles Beattie1, Neil C. Rabinowitz1,
Ari S. Morcos1, Avraham Ruderman1, Nicolas Sonnerat1, Tim Green1, Louise Deason1,
Joel Z. Leibo1, David Silver1, Demis Hassabis1, Koray Kavukcuoglu1, Thore Graepel1 (Equal contribution.)
1 DeepMind, London, UK
Recent progress in artificial intelligence through reinforcement learning (RL) has shown great success on increasingly complex single-agent environments [30, 40, 45, 46, 56] and two-player turn-based games [47, 58, 66]. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents, and environments reflecting this degree of complexity remain an open challenge. In this work, we demonstrate for the first time that an agent can achieve human-level performance in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag [28], using only pixels and game points as input. These results were achieved by a novel two-tier optimisation process in which a population of independent RL agents are trained concurrently from thousands of parallel matches with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables the agent to reason at multiple timescales. During gameplay, these agents display human-like behaviours such as navigating, following, and defending based on a rich learned representation that is shown to encode high-level game knowledge. In an extensive tournament-style evaluation the trained agents exceeded the win rate of strong human players both as teammates and opponents, and proved far stronger than existing state-of-the-art agents. These results demonstrate a
significant jump in the capabilities of artificial agents, bringing us closer to the goal of human-level intelligence.
We demonstrate how intelligent behaviour can emerge from training sophisticated new learning agents within complex multi-agent environments. End-to-end reinforcement learning methods [45, 46] have so far not succeeded in training agents in multi-agent games that combine team and competitive play, due to the high complexity of the learning problem [7, 43] that arises from the concurrent adaptation of other learning agents in the environment. We approach this challenge by studying team-based multiplayer 3D first-person video games, a genre which is particularly immersive for humans [16] and has even been shown to improve a wide range of cognitive abilities [21]. We focus specifically on a modified version [5] of Quake III Arena [28], the canonical multiplayer 3D first-person video game, whose game mechanics served as the basis for many subsequent games, and which has a thriving professional scene [1]. The task we consider is the game mode Capture the Flag (CTF) on per-game randomly generated maps of both indoor and outdoor theme (Figure 1 a, b). Two opposing teams consisting of multiple individual players compete to capture each other's flags by strategically navigating, tagging, and evading opponents. The team with the greatest number of flag captures after five minutes wins. CTF is played in a visually rich simulated physical environment (Supplementary Video: https://youtu.be/dltN4MxV1RI), and agents interact with the environment and with other agents through their actions and observations. In contrast to previous work [18, 41, 42, 47, 48, 53, 58, 63, 64], agents do not have access to models of the environment, other agents, or human policy priors, nor can they communicate with each other outside of the game environment. Each agent acts and learns independently, resulting in decentralised control within a team.
Since we wish to develop a learning agent capable of acquiring generalisable skills, we go beyond training fixed teams of agents on a fixed map, and instead devise an algorithm and training procedure that enables agents to acquire policies that are robust to the variability of maps, number of players, and choice of teammates, a paradigm closely related to ad-hoc team play [62]. The proposed training algorithm stabilises the learning process in partially observable multi-agent environments by concurrently training a diverse population of agents who learn by playing with each other, and in addition the agent population provides a mechanism for meta-optimisation. We solve the prohibitively hard credit assignment problem of learning from the sparse and delayed episodic team win/loss signal (optimising thousands of actions based on a single final reward) by enabling agents to evolve an internal reward signal that acts as a proxy for winning and provides denser rewards. Finally, we meet the memory and long-term temporal reasoning requirements of high-level, strategic CTF play by introducing an agent architecture that features a multi-timescale representation, reminiscent of what has been observed in primate cerebral cortex [11], and an external working memory module, broadly inspired by human episodic memory [22]. These three innovations, integrated within a scalable, massively distributed, asynchronous computational framework, enable the training of highly skilled CTF agents through solely multi-agent interaction and single bits of feedback about game outcomes.

[Figure 1 panel labels: (a) outdoor procedural maps (red flag); (b) indoor procedural maps (blue flag carrier, example map); (c) first-person observations that the agents see; (d) thousands of parallel CTF games generate experience to train from; (e) reinforcement learning updates each agent's respective policy; (f) population-based training provides diverse policies for training games and enables internal reward optimisation.]
Figure 1: CTF task and computational training framework. Shown are two example maps that have been sampled from the distribution of outdoor maps (a) and indoor maps (b). Each agent in the game only sees its own first-person pixel view of the environment (c). Training data is generated by playing thousands of CTF games in parallel on a diverse distribution of procedurally generated maps (d), and used to train the agents that played in each game with reinforcement learning (e). We train a population of 30 different agents together, which provides a diverse set of teammates and opponents to play with, and is also used to evolve the internal rewards and hyperparameters of the agents and learning process (f). Gameplay footage and further exposition of the environment variability can be found in Supplementary Video: https://youtu.be/dltN4MxV1RI.
In our formulation, the agent's policy π uses the same interface available to human players. It receives raw RGB pixel input x_t from the agent's first-person perspective at timestep t, produces control actions a_t simulating a gamepad, and receives game points ρ_t attained (the points received by the player for various game events, which are visible on the in-game scoreboard). The goal of reinforcement learning (RL) is to find a policy that maximises the expected cumulative discounted reward E[Σ_{t=0}^{T} γ^t r_t] over a CTF game with T time steps. The agent's policy π is parameterised by a multi-timescale recurrent neural network with external memory [20] (Figure 2 a, Figure S10). Actions in this model are generated conditional on a stochastic latent variable, whose distribution is modulated by a more slowly evolving prior process. The variational objective function encodes a trade-off between maximising expected reward and consistency between the two timescales of inference (more details are given in Supplementary Materials Section 2.1). Whereas some previous hierarchical RL agents construct explicit hierarchical goals or skills [3, 65, 70], this agent architecture is conceptually more closely related to work on building hierarchical temporal representations [12, 14, 33, 55] and recurrent

latent variable models for sequential data [13, 19]. The resulting model constructs a temporally hierarchical representation space in a novel way to promote the use of memory (Figure S7) and temporally coherent action sequences.
For ad-hoc teams, we postulate that an agent's policy π_0 should maximise the probability of winning for its team, {π_0, π_1, ..., π_{N/2−1}}, which is composed of π_0 itself and its teammates' policies π_1, ..., π_{N/2−1}, for a total of N players in the game:

$$ P\big(\pi_0\text{'s team wins} \mid \omega, (\pi_n)_{n=0}^{N-1}\big) = \mathbb{E}_{\,\omega,\ (a_t^n \sim \pi_n)_{n=0}^{N-1}} \Big[ \big(\pi_0, \ldots, \pi_{\frac{N}{2}-1}\big) \succ_\omega \big(\pi_{\frac{N}{2}}, \ldots, \pi_{N-1}\big) \Big]. \quad (1) $$
The winning operator ≻_ω returns 1 if the left team wins, 0 for losing, and randomly breaks ties; ω represents the specific map instance and random seeds, which are stochastic in learning and testing. Since the game outcome as the only reward signal is too sparse for RL to be effective, we require rewards r_t that direct the learning process towards winning yet are more frequently available than the game outcome. In our approach, we operationalise the idea that each agent has a dense internal reward function [60, 61, 74] by specifying r_t = w(ρ_t) based on the available game points signals ρ_t (points are registered for events such as capturing a flag), and, crucially, allowing the agent to learn the transformation w such that policy optimisation on the internal rewards r_t optimises the policy For The Win, giving us the FTW agent.
Training agents in multi-agent systems requires instantiations of other agents in the environment, such as teammates and opponents, to generate learning experience. A solution could be self-play RL, in which an agent is trained by playing against its own policy. While self-play variants can prove effective in some multi-agent games [4, 9, 24, 37, 47, 57, 58], these methods can be unstable and in their basic form do not support concurrent training, which is crucial for scalability. Our solution is to train a population of P different agents {π_p}_{p=1}^{P} in parallel that play with each other, introducing diversity amongst players to stabilise training [54]. Each agent within this population learns from experience generated by playing with teammates and opponents sampled from the population. We sample the agents for a training game using a stochastic matchmaking scheme m_p that biases co-players to be of similar skill to player p. This scheme ensures that, a priori, the outcome is sufficiently uncertain to provide a meaningful learning signal, and that a diverse set of teammates and opponents are seen during training. Agents' skill levels are estimated online by calculating Elo scores (adapted from chess [15]) based on outcomes of training games. We also use the population to meta-optimise the internal rewards and hyperparameters of the RL process itself, which results in the joint maximisation of:
$$ J_{\text{inner}}(\pi_p \mid w_p, \phi_p) = \mathbb{E}_{\,\text{co-players} \sim m_p,\ \omega}\ \mathbb{E}_{\,a_t \sim \pi_p} \left[ \sum_{t=0}^{T} \gamma^t\, w_p(\rho_{p,t}) \right], $$

$$ J_{\text{outer}}(w_p, \phi_p) = \mathbb{E}_{\,\text{co-players} \sim m_p,\ \omega}\ P\big(\pi_{w_p,\phi_p}\text{'s team wins} \mid \omega\big), \quad \text{where } \pi_{w,\phi} = \operatorname{optimise}(J_{\text{inner}}, w, \phi). \quad (2) $$

This can be seen as a two-tier reinforcement learning problem. The inner optimisation maximises J_inner, the agents' expected future discounted internal rewards. The outer optimisation of J_outer can be viewed as a meta-game, in which the meta-reward of winning the match is maximised with respect to internal reward schemes w_p and hyperparameters φ_p, with the inner optimisation providing the meta transition dynamics. We solve the inner optimisation with RL as previously described, and the outer optimisation with Population Based Training (PBT) [29]. PBT is an online evolutionary process which adapts internal rewards and hyperparameters and performs model selection by replacing underperforming agents with mutated versions of better agents. This joint optimisation of the agent policy using RL, together with the optimisation of the RL procedure itself towards a high-level goal, proves to be an effective and generally applicable strategy, and utilises the potential of combining learning and evolution [2] in large-scale learning systems.
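To make the two-tier structure concrete, the toy sketch below interleaves inner gradient steps with an outer exploit-and-explore step over a population. It is only an illustration of the control flow: the quadratic "fitness", the Agent class, and all constants are invented here and are unrelated to the actual distributed system.

```python
import random
import numpy as np

class Agent:
    """Toy population member: a 'policy' scalar theta, an internal reward weight w,
    and a learning-rate hyperparameter (all stand-ins for the real quantities)."""
    def __init__(self):
        self.theta = np.random.randn()             # stands in for policy parameters
        self.w = np.random.uniform(0.1, 1.0)       # stands in for the reward transformation w_p
        self.lr = 10 ** np.random.uniform(-3, -1)  # stands in for hyperparameters phi_p
        self.fitness = 0.0                         # stands in for Elo / win probability

def inner_rl_step(agent):
    # Inner tier: gradient ascent on a toy surrogate of J_inner.  The real agent
    # instead runs multi-step policy-gradient RL on internal rewards w_p(rho_t).
    grad = -2.0 * agent.w * agent.theta            # gradient of -w * theta^2
    agent.theta += agent.lr * grad
    agent.fitness = -agent.theta ** 2              # toy proxy for "probability of winning"

def outer_pbt_step(population):
    # Outer tier: exploit-and-explore on (w, lr), the meta-parameters of J_outer.
    for agent in population:
        rival = random.choice([a for a in population if a is not agent])
        if agent.fitness < rival.fitness:                                    # underperforming
            agent.theta, agent.w, agent.lr = rival.theta, rival.w, rival.lr  # exploit
            agent.w *= random.uniform(0.8, 1.2)                              # explore
            agent.lr *= random.uniform(0.8, 1.2)

population = [Agent() for _ in range(8)]
for step in range(1, 501):
    for agent in population:
        inner_rl_step(agent)
    if step % 50 == 0:
        outer_pbt_step(population)
print("best toy fitness:", max(a.fitness for a in population))
```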
To assess the generalisation performance of agents at different points during training, we performed a large tournament on procedurally generated maps with ad-hoc matches involving three types of agents as teammates and opponents: ablated versions of FTW including state-of-the-art baselines, Quake III Arena scripted bots of various levels [69], and human participants with first-person video game experience. Figure 2 b and Figure S2 show the Elo scores and derived winning probabilities for different ablations of FTW, and how the combination of components provides superior performance. The FTW agents clearly exceeded the win rate of humans on maps which neither agent nor human had seen previously, i.e. zero-shot generalisation, with a team of two humans on average capturing 16 flags per game fewer than a team of two FTW agents (Figure S2 Bottom, FF vs hh). Interestingly, only as part of a human-agent team did we observe a human winning over an agent-agent team (5% win probability). This result suggests that trained agents are capable of cooperating with never-seen-before teammates, such as humans. In a separate study, we probed the exploitability of the FTW agent by allowing a team of two professional games testers with full communication to play continuously against a fixed pair of FTW agents. Even after twelve hours of practice the human game testers were only able to win 25% (6.3% draw rate) of games against the agent team.
Interpreting the difference in performance between agents and humans must take into account the subtle differences in observation resolution, frame rate, control fidelity, and intrinsic limitations in reaction time and sensorimotor skills (Figure S11 a, Supplementary Materials Section 3.1). For example, humans have superior observation and control resolution; this may be responsible for humans successfully tagging at long range where agents could not (humans: 17% of tags above 5 map units, agents: 0.5%). In contrast, at short range, agents have superior tagging reaction times to humans: by one measure FTW agents respond to newly appeared opponents in 258 ms, compared with 559 ms for humans (Figure S11 b). Another advantage exhibited by agents is their tagging accuracy, where FTW agents achieve 80% accuracy compared to humans' 48%. By artificially reducing the FTW agents' tagging accuracy to be similar to humans' (without retraining them), the agents' win rate was reduced, though it still exceeded that of humans (Figure S11 c). Thus, while agents learn to make use of their potential for better tagging accuracy, this is only one factor contributing to their overall performance.

[Figure 2 panel content: (a) FTW agent architecture: the observation x_t is encoded and passed to a fast RNN and, at a slower timescale, a slow RNN; a latent variable is sampled from the posterior Q_t, regularised towards the prior P_t; the policy produces action a_t; game points ρ_t are transformed by w into the internal reward r_t. (b) Progression during training: agent Elo for FTW, Self-play + RS, Self-play, and a random agent, against Strong Human and Average Human reference levels, plus the evolution of learning rate, KL weighting, and internal timescale.]
Figure 2: Agent architecture and benchmarking. (a) Shown is how the agent processes a temporal sequence of observations x_t from the environment. The model operates at two different time scales, faster at the bottom, and slower by a factor of τ at the top. A stochastic vector-valued latent variable is sampled at the fast time scale from distribution Q_t based on observations x_t. The action distribution π_t is conditioned on the latent variable at each time step t. The latent variable is regularised by the slow-moving prior P_t, which helps capture long-range temporal correlations and promotes memory. The network parameters are updated using reinforcement learning based on the agent's own internal reward signal r_t, which is obtained from a learnt transformation w of game points ρ_t. w is optimised for winning probability through population based training, another level of training performed at yet a slower time scale than RL. Detailed network architectures are described in Figure S10. (b) Top: Shown are the Elo skill ratings of the FTW agent population throughout training (blue) together with those of the best baseline agents using hand-tuned reward shaping (RS, red) and game-winning reward signal only (black), compared to human and random-agent reference points (violet; shaded region shows strength between 10th and 90th percentile). It can be seen that the FTW agent achieves a skill level considerably beyond strong human subjects, whereas the baseline agents' skill plateaus below, and does not learn anything without reward shaping (see Supplementary Materials for evaluation procedure). (b) Bottom: Shown is the evolution of three hyperparameters of the FTW agent population: learning rate, KL weighting, and internal time scale, plotted as mean and standard deviation across the population.
We hypothesise that trained agents of such high skill have learned a rich representation of the game. To investigate this, we extracted ground-truth state from the game engine at each point in time in terms of 200 binary features such as "Do I have the flag?", "Did I see my teammate recently?", and "Will I be in the opponent's base soon?". We say that the agent has knowledge of a given feature if logistic regression on the internal state of the agent accurately models the feature. In this sense, the internal representation of the agent was found to encode a wide variety of knowledge about the game situation (Figure S4). Interestingly, the FTW agent's representation was found to encode features related to the past particularly well: e.g. the FTW

[Figure 3 panel content: (a) agent state t-SNE embedding, with points coloured by conjunction of basic CTF situations; (b) basic CTF situations (agent flag taken, opponent flag held by teammate, opponent flag held by agent, agent is respawning, agent flag stray variants); (c) expected neural response map; (d) single neuron selectivity; (e) true returns; (f) value function; (g) agent surprise (KL > 10); (h) base camping strategy as a four-step sequence: (1) waiting in opponent's base, (2) teammate drops the flag, (3) opponent flag returned, (4) agent quickly picks up and runs back; (i) automatically discovered behaviours, e.g. behaviour 12 (following teammate), behaviour 14 (opponent base camping), behaviour 32 (home base defence), with occurrence rates for FTW, FTW without TH, Self-play + RS, and Human.]
Figure 3: Knowledge representation and behavioural analysis. (a) The 2D t-SNE embedding of an FTW agent's internal states during gameplay. Each point represents the internal state (h^p, h^q) at a particular point in the game, and is coloured according to the high-level game state at this time: the conjunction of four basic CTF situations (b). Colour clusters form, showing that nearby regions in the internal representation of the agent correspond to the same high-level game state. (c) A visualisation of the expected internal state arranged in a similarity-preserving topological embedding (Figure S5). (d) We show distributions of situation-conditional activations for particular single neurons which are distinctly selective for these CTF situations, and show the predictive accuracy of this neuron. (e) The true return of the agent's internal reward signal and (f) the agent's prediction, its value function. (g) Regions where the agent's internal two-timescale representation diverges, the agent's surprise. (h) The four-step temporal sequence of the high-level strategy "opponent base camping". (i) Three automatically discovered high-level behaviours of agents and corresponding regions in the t-SNE embedding. To the right, average occurrence per game of each behaviour for the FTW agent, the FTW agent without temporal hierarchy (TH), the self-play with reward shaping agent, and human subjects (more detail in Figure S9).
agent was able to classify the state "both flags are stray" (flags dropped and not at base) with 91% AUC-ROC (area under the receiver operating characteristic curve), compared to 70% with the self-play baseline. Looking at the acquisition of knowledge as training progresses, the agent first learned about its own base, then about the opponent's base, and then about picking up the flag. Immediately

useful flag knowledge was learned prior to knowledge related to tagging or one's teammate's situation. Note that agents were never explicitly trained to model this knowledge; thus these results show the spontaneous emergence of these concepts purely through RL-based training.
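As a concrete illustration of this decoding analysis, the sketch below (Python with scikit-learn; the array names, shapes, and synthetic data are assumptions rather than the authors' pipeline) fits a logistic-regression probe from recorded internal states to a single binary game-state feature and reports its AUC-ROC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed inputs: internal states h (timesteps x state_dim) recorded during play,
# and a binary ground-truth feature y (e.g. "agent has the flag") read from the
# game engine at the same timesteps.  Placeholder data is used here.
rng = np.random.default_rng(0)
h = rng.normal(size=(5000, 256))
y = (h[:, 0] + 0.5 * rng.normal(size=5000)) > 0

h_train, h_test, y_train, y_test = train_test_split(h, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000)
probe.fit(h_train, y_train)

auc = roc_auc_score(y_test, probe.predict_proba(h_test)[:, 1])
print(f"probe AUC-ROC: {auc:.2f}")  # the agent "knows" the feature if this is high
```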
A visualisation of how the agent represents knowledge was obtained by performing dimensionality reduction of the agent's activations using t-SNE [67]. As can be seen from Figure 3, internal agent state clustered in accordance with conjunctions of high-level game state features: flag status, respawn state, and room type. We also found individual neurons whose activations coded directly for some of these features, e.g. a neuron that was active if and only if the agent's teammate was holding the flag, reminiscent of concept cells [51]. This knowledge was acquired in a distributed manner early in training (after 45K games), but then represented by a single, highly discriminative neuron later in training (at around 200K games). This observed disentangling of game state is most pronounced in the FTW agent (Figure S8).
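A single-neuron selectivity check can be sketched in the same spirit (again on placeholder data): score each unit of the internal state by how well its raw activation alone separates the two values of a binary game feature.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
h = rng.normal(size=(5000, 256))       # placeholder recorded internal states
teammate_has_flag = h[:, 42] > 0.8     # placeholder binary feature (illustrative only)

# Per-neuron selectivity: AUC of each unit's activation as a detector of the feature,
# made orientation-invariant by taking max(AUC, 1 - AUC).
aucs = np.array([
    max(roc_auc_score(teammate_has_flag, h[:, i]),
        1.0 - roc_auc_score(teammate_has_flag, h[:, i]))
    for i in range(h.shape[1])
])
best = int(np.argmax(aucs))
print(f"most selective unit: {best}, AUC-ROC: {aucs[best]:.2f}")
```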
One of the most salient aspects of the CTF task is that each game takes place on a randomly generated map, with walls, bases, and flags in new locations. We hypothesise that this requires agents to develop rich representations of these spatial environments to deal with task demands, and that the temporal hierarchy and explicit memory module of the FTW agent help towards this. An analysis of the memory recall patterns of the FTW agent playing in indoor environments shows precisely that: once the agent had discovered the entrances to the two bases, it primarily recalled memories formed at these base entrances (Figure 4, Figure S7). We also found that the full FTW agent with temporal hierarchy learned a coordination strategy during maze navigation that ablated versions of the agent did not, resulting in more efficient flag capturing (Figure S3).
Analysis of temporally extended behaviours provided another view on the complexity of the behavioural strategies learned by the agent [34]. We developed an unsupervised method to automatically discover and quantitatively characterise temporally extended behaviour patterns, inspired by models of mouse behaviour [73], which groups short gameplay sequences into behavioural clusters (Figure S9, Supplementary Video: https://youtu.be/dltN4MxV1RI). The discovered behaviours included well-known tactics observed in human play, such as waiting in the opponent's base for a flag to reappear ("opponent base camping"), which we only observed in FTW agents with a temporal hierarchy. Some behaviours, such as following a flag-carrying teammate, were discovered and discarded midway through training, while others such as performing home base defence are most prominent later in training (Figure 4).
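The clustering method itself is described only at this level of detail here; as a hedged stand-in, the sketch below groups short windows of gameplay features with a Gaussian mixture model and reports cluster occupancy, which is one simple way to realise such an unsupervised behaviour discovery.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed input: per-timestep gameplay features (e.g. distances to bases, flag
# status, relative teammate position), replaced here by random placeholders.
rng = np.random.default_rng(2)
features = rng.normal(size=(20000, 12))

# Summarise short temporally extended windows (~a few seconds of play) by their
# mean feature vector, then cluster the windows into candidate behaviour prototypes.
window = 30
n_windows = features.shape[0] // window
segments = features[: n_windows * window].reshape(n_windows, window, -1).mean(axis=1)

gmm = GaussianMixture(n_components=32, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(segments)
occupancy = np.bincount(labels, minlength=32) / len(labels)
print("most frequent behaviour cluster:", int(np.argmax(occupancy)))
```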
In this work, we have demonstrated that an artificial agent using only pixels and game points as input can learn to play highly competitively in a rich multi-agent environment: a popular multiplayer first-person video game. This was achieved by combining a number of innovations in agent training (population based training of agents, internal reward optimisation, and temporally hierarchical RL) together with scalable computational architectures. The presented framework of training populations of agents, each with their own learnt rewards, makes minimal assumptions about the game structure, and therefore should be applicable for scalable and stable learning in a wide variety of multi-agent systems, and the temporally hierarchical agent represents a sophisticated new architecture for problems requiring memory and temporally

[Figure 4 panel content: training timeline over 450K games with three phases (Phase 1: learning the basics of the game; Phase 2: increasing navigation, tagging, and coordination skills; Phase 3: perfecting strategy and memory), single-neuron responses for "I am respawning", "I have the flag", "My flag is taken", and "Teammate has the flag", relative internal reward magnitude for agent tagged opponent / agent picked up flag / opponent captured flag, agent strength milestones (beating weak bots, beating average human, beating strong humans), behaviour probabilities (home base defence, opponent base camping, teammate following), and memory-usage panels showing top memory read locations against visitation maps at several training checkpoints (2K-450K games).]
Figure 4: Progression of agent during training. Shown is the development of knowledge representation and behaviours of the FTW agent over the training period of 450K games, segmented into three phases (Supplementary Video: https://youtu.be/dltN4MxV1RI). Knowledge: Shown is the percentage of game knowledge that is linearly decodable from the agent's representation, measured by average scaled AUC-ROC across 200 features of game state. Some knowledge is compressed to single neuron responses (Figure 3 a), whose emergence in training is shown at the top. Relative Internal Reward Magnitude: Shown is the relative magnitude of the agent's internal reward weights for three of the thirteen events corresponding to game points ρ. Early in training, the agent puts large reward weight on picking up the opponent flag, whereas later this weight is reduced, and the reward for tagging an opponent and the penalty when opponents capture a flag are increased by a factor of two. Behaviour Probability: Shown are the frequencies of occurrence for three of the 32 automatically discovered behaviour clusters through training. Opponent base camping (red) is discovered early on, whereas teammate following (blue) becomes very prominent midway through training before mostly disappearing. The home base defence behaviour (green) resurges in occurrence towards the end of training, in line with the agent's increased internal penalty for opponent flag captures. Memory Usage: Shown are heat maps of visitation frequencies for locations in a particular map (left), and locations of the agent at which the top ten most frequently read memories were written to memory, normalised by random reads from memory, indicating which locations the agent learned to recall. Recalled locations change considerably throughout training, eventually showing the agent recalling the entrances to both bases, presumably in order to perform more efficient navigation in unseen maps (shown more generally in Figure S7).

extended inference. Limitations of the current framework, which should be addressed in future work, include the difficulty of maintaining diversity in agent populations, the greedy nature of the meta-optimisation performed by PBT, and the variance from temporal credit assignment in the proposed RL updates. Trained agents exceeded the win rate of humans in tournaments, and were shown to be robust to previously unseen teammates, opponents, maps, and numbers of players, and to exhibit complex and cooperative behaviours. We discovered a highly compressed representation of important underlying game state in the trained agents, which enabled them to execute complex behavioural motifs. In summary, our work introduces novel techniques to train agents which can achieve human-level performance at previously insurmountable tasks. When trained in a sufficiently rich multi-agent world, complex and surprising high-level intelligent artificial behaviour emerged.
References
1. QuakeCon, 2018.
2. David Ackley and Michael Littman. Interactions between learning and evolution. Artificial Life II, 10:487-509, 1991.
3. Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of AAAI Conference on Artificial Intelligence, pages 1726-1734, 2017.
4. Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In Proceedings of International Conference on Learning Representations, 2018.
5. Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
6. François Bérard, Guangyu Wang, and Jeremy R Cooperstock. On the limits of the human motor control precision: the search for a device's human resolution. In IFIP Conference on Human-Computer Interaction, pages 107-122. Springer, 2011.
7. Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman. The complexity of decentralized control of Markov Decision Processes. In Proceedings of Conference on Uncertainty in Artificial Intelligence, pages 32-37, 2000.
8. Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

9. G. W. Brown. Iterative solutions of games by fictitious play. In T. C. Koopmans, editor, Activity Analysis of Production and Allocation, pages 374-376. John Wiley & Sons, Inc., 1951.
10. Alan D Castel, Jay Pratt, and Emily Drummond. The effects of action video game experience on the time course of inhibition of return and the efficiency of visual search. Acta Psychologica, 119(2):217-230, 2005.
11. Janice Chen, Uri Hasson, and Christopher J Honey. Processing timescales as an organizing principle for primate cortex. Neuron, 88(2):244-246, 2015.
12. Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. 2017.
13. Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Proceedings of Annual Conference on Neural Information Processing Systems, pages 2980-2988, 2015.
14. Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Proceedings of Annual Conference on Neural Information Processing Systems, pages 493-499, 1996.
15. Arpad E Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
16. Laura Ermi and Frans Mäyrä. Fundamental components of the gameplay experience: Analysing immersion. Worlds in Play: International Perspectives on Digital Games Research, 37(2):37-53, 2005.
17. Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
18. Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017.
19. Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Proceedings of Annual Conference on Neural Information Processing Systems, pages 2199-2207, 2016.
20. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.

21. C Shawn Green and Daphne Bavelier. Action video game training for cognitive enhancement. Current Opinion in Behavioral Sciences, 4:103-108, 2015.
22. Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245-258, 2017.
23. Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. In Proceedings of International Conference on Learning Representations, 2016.
24. Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. In NIPS Deep Reinforcement Learning Workshop, 2016.
25. Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136:210-271, 1909.
26. Geoffrey Hinton. Neural Networks for Machine Learning, Lecture 6e.
27. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
28. id Software. Quake III Arena, 1999.
29. Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
30. Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of International Conference on Learning Representations, 2017.
31. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
32. Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, Eiichi Osawai, and Hitoshi Matsubara. RoboCup: A challenge problem for AI and robotics. In Robot Soccer World Cup, pages 1-19. Springer, 1997.
33. Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.
34. John W Krakauer, Asif A Ghazanfar, Alex Gomez-Marin, Malcolm A MacIver, and David Poeppel. Neuroscience needs behavior: correcting a reductionist bias. Neuron, 93(3):480-490, 2017.
35. John Laird and Michael VanLent. Human-level AI's killer application: Interactive computer games. AI Magazine, 22(2):15, 2001.

36. Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In Proceedings of AAAI Conference on Artificial Intelligence, pages 2140-2146, 2017.
37. Marc Lanctot, Vinícius Flores Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of Annual Conference on Neural Information Processing Systems, pages 4193-4206, 2017.
38. Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464-473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
39. Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Proceedings of Annual Conference on Neural Information Processing Systems, pages 207-215, 2013.
40. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of International Conference on Learning Representations, 2016.
41. Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of Annual Conference on Neural Information Processing Systems, pages 6382-6393, 2017.
42. Patrick MacAlpine and Peter Stone. UT Austin Villa: RoboCup 2017 3D simulation league competition and technical challenges champions. In Claude Sammut, Oliver Obst, Flavio Tonidandel, and Hidehisa Akyama, editors, RoboCup 2017: Robot Soccer World Cup XXI, Lecture Notes in Artificial Intelligence. Springer, 2018.
43. Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. Knowledge Engineering Review, 27(1):1-31, 2012.
44. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 1928-1937, 2016.
45. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 1928-1937, 2016.

46. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
47. Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508-513, 2017.
48. Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of AAAI Conference on Artificial Intelligence, 2018.
49. Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of International Conference on Machine Learning, pages 278-287, 1999.
50. Jeff Orkin. Three states and a plan: the A.I. of F.E.A.R. In Proceedings of Game Developers Conference, 2006.
51. Rodrigo Quian Quiroga. Concept cells: the building blocks of declarative memory functions. Nature Reviews Neuroscience, 13(8):587, 2012.
52. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
53. Martin Riedmiller and Thomas Gabel. On experiences in a complex and competitive gaming domain: Reinforcement learning meets RoboCup. In Computational Intelligence and Games, 2007. CIG 2007. IEEE Symposium on, pages 17-23. IEEE, 2007.
54. Christopher D Rosin and Richard K Belew. New methods for competitive coevolution. Evolutionary Computation, 5(1):1-29, 1997.
55. Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.
56. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
57. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

58. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
59. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
60. Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from? In Proceedings of Annual Meeting of the Cognitive Science Society, pages 2601-2606, 2009.
61. Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70-82, 2010.
62. Peter Stone, Gal A Kaminka, Sarit Kraus, Jeffrey S Rosenschein, et al. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of AAAI Conference on Artificial Intelligence, 2010.
63. Peter Stone and Manuela Veloso. Layered learning. In European Conference on Machine Learning, pages 369-381. Springer, 2000.
64. Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Proceedings of Annual Conference on Neural Information Processing Systems, pages 2244-2252, 2016.
65. Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
66. G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, March 1995.
67. Laurens J P Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
68. Niels Van Hoorn, Julian Togelius, and Jürgen Schmidhuber. Hierarchical controller learning in a first-person shooter. In IEEE Symposium on Computational Intelligence and Games, pages 294-301. IEEE, 2009.
69. J. M. P. Van Waveren. The Quake III Arena Bot. Master's Thesis, 2001.

70. Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 3540-3549, 2017.
71. Nikos Vlassis, Marc Toussaint, Georgios Kontes, and Savas Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Autonomous Robots, 27(2):123-130, 2009.
72. Theophane Weber and Nicolas Heess. Reinforced variational inference. In NIPS Advances in Approximate Bayesian Inference Workshop, 2017.
73. Alexander B Wiltschko, Matthew J Johnson, Giuliano Iurilli, Ralph E Peterson, Jesse M Katon, Stan L Pashkovski, Victoria E Abraira, Ryan P Adams, and Sandeep Robert Datta. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121-1135, 2015.
74. David H Wolpert and Kagan Tumer. An introduction to collective intelligence. arXiv preprint cs/9908014, 1999.
75. Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In Proceedings of International Conference on Learning Representations, 2017.
Acknowledgments
We thank Matt Botvinick, Simon Osindero, Volodymyr Mnih, Alex Graves, Nando de Freitas, Nicolas Heess, and Karl Tuyls for helpful comments on the manuscript; Simon Green and Drew Purves for additional environment support and design; Kevin McKee and Tina Zhu for human experiment assistance; Amir Sadik and Sarah York for exploitation study participation; Adam Cain for help with figure design; Paul Lewis, Doug Fritz, and Jaume Sanchez Elias for 3D map visualisation work; Vicky Holgate, Adrian Bolton, Chloe Hillier, and Helen King for organisational support; and the rest of the DeepMind team for their invaluable support and ideas.
Supplementary Materials
1 Task
1.1 Rules of Capture the Flag
CTF is a team game with the objective of scoring more flag captures than the opposing team in five minutes of play time. To score a capture, a player must navigate to the opposing team's base, pick up the flag by touching it, carry it back to their own base, and capture it by

running into their own flag. A capture is only possible if the flag of the scoring player's team is safe at their base. Players may tag opponents, which teleports them back to their base after a delay (respawn). If a flag carrier is tagged, the flag they are carrying drops on the ground and becomes stray. If a player on the team that owns the dropped flag touches it, the flag is immediately returned to their own base. If a player on the opposing team touches the dropped flag, that player picks it up and can continue to attempt a capture.
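For clarity, the flag-handling rules above can be summarised as a small state machine. The sketch below is an interpretation of those rules for one team's flag, not the game's actual code.

```python
from enum import Enum, auto

class FlagState(Enum):
    AT_BASE = auto()
    CARRIED = auto()   # held by an opposing player
    STRAY = auto()     # dropped on the ground after its carrier was tagged

def on_touch(state: FlagState, toucher_is_owner: bool) -> FlagState:
    """Transition for one team's flag when a player touches it."""
    if state == FlagState.AT_BASE and not toucher_is_owner:
        return FlagState.CARRIED           # opponent picks up the flag
    if state == FlagState.STRAY:
        if toucher_is_owner:
            return FlagState.AT_BASE       # stray flag instantly returned to base
        return FlagState.CARRIED           # opponent picks the stray flag back up
    return state

def on_carrier_tagged(state: FlagState) -> FlagState:
    """The carrier is tagged: the flag they hold drops and becomes stray."""
    return FlagState.STRAY if state == FlagState.CARRIED else state

# A capture is scored only when the opposing flag is carried into the scorer's
# base while the scorer's own flag is AT_BASE (i.e. safe at home).
```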
1.2 Environment
The environment we use is DeepMind Lab [5], which is a modified version of Quake III Arena [28]. The modifications reduce visual connotations of violence, but retain all core game mechanics. Video games form an important domain for research [35]. Previous work on first-person games considers either much simpler games [30, 36, 45, 75], simplified agent interfaces [68], or non-learning systems [50, 69], and previously studied multi-agent domains often consist of discrete-state environments [18, 38, 64], have simplified 2D dynamics [23, 41, 53], or have fully observable or non-perceptual features [18, 23, 41, 48, 53, 64] rather than pixel observations. As an example, the RoboCup simulation league [32] is a multi-agent environment that shares some of the same challenges as our environment, and successful work there has included RL components [42, 53, 63]; however, these solutions use a combination of hand-engineering, human-specified task decompositions, centralised control, and low-dimensional non-visual inputs, compared to our approach of end-to-end machine learning of independent reinforcement learners.
CTF games are played in an artificial environment referred to as a map. In this work we consider two themes of procedurally generated maps in which agents play, indoor maps and outdoor maps, example schematics of which are shown in Figure S1. The procedural indoor maps are flat, maze-like maps, rotationally symmetric, and contain rooms connected by corridors. For each team there is a base room that contains their flag and player spawn points. Maps are contextually coloured: the red base is coloured red, the blue base blue. The procedural outdoor maps are open and hilly naturalistic maps containing randomly sized rocks, cacti, bushes, and rugged terrain that may be impassable. Each team's flag and starting positions are located in opposite quadrants of the map. Both the procedural indoor maps and the procedural outdoor maps are randomly generated each episode (some random seeds are not used for training and are reserved for performance evaluation), providing a very large set of environments. More details can be found in Section 5.3. Every player carries a disc gadget (equivalent to the railgun in Quake III Arena) which can be used for tagging, and can see their team, shield, and flag status on screen.

2 Agent
2.1 FTW Agent Architecture
The agent's policy π is represented by a neural network and optimised with reinforcement learning (RL). In a fully observed Markov Decision Process, one would aim to find a policy that maximises the expected discounted return E_{π|s_t}[R_t] in game state s_t, where R_t = Σ_{k=0}^{T−t} γ^k r_{t+k}. However, when an agent does not have information about the entire environment (which is often the case in real-world problems, including CTF), the problem becomes a Partially Observed Markov Decision Process, and hence we instead seek to maximise E_{π|x_{≤t}}[R_t], the expected return under a policy conditioned on the agent's history of individual observations. Due to the ambiguity of the true state given the observations, P(s_t | x_{≤t}), we represent the current value as a random variable, V_t = E_{π|x_{≤t}}[R_t] = E_{s ∼ P(s|x_{≤t})} E_{π|s}[R_t]. We follow the idea of RL as probabilistic inference [39, 71, 72], which leads to a Kullback-Leibler divergence (KL) regularised objective in which the policy Q is regularised against a prior policy P. We choose both to contain a latent variable z_t, the purpose of which is to model the dependence on past observations. Letting the policy and the prior differ only in the way this dependence on past observations is modelled leads to the following objective:

$$ \mathbb{E}_{Q(z_t \mid C_t^q)}\big[ R_t \big] - D_{\mathrm{KL}}\big[ Q(z_t \mid C_t^q) \,\|\, P(z_t \mid C_t^p) \big], \quad (3) $$

where P(z_t | C_t^p) and Q(z_t | C_t^q) are the prior and variational posterior distributions on z_t conditioned on different sets of variables C_t^p and C_t^q respectively, and D_KL is the Kullback-Leibler divergence. The sets of conditioning variables C_t^p and C_t^q determine the structure of the probabilistic model of the agent, and can be used to equip the model with representational priors. In addition to optimising the return as in Equation (3), we can also optimise extra modelling targets which are conditional on the latent variable z_t, such as the value function to be used as a baseline [45] and pixel control [30], whose optimisation positively shapes the shared latent representation. The conditioning variables C_t^q and C_t^p and the associated neural network structure are chosen so as to promote forward planning and the use of memory. We use a hierarchical RNN consisting of two recurrent networks (LSTMs [27]) operating at different timescales. The hierarchical RNN's fast timescale core generates a hidden state h_t^q at every environment step t, whereas its slow timescale core produces an updated hidden state every τ steps, h_t^p = h^p_{τ⌊t/τ⌋}. We use the output of the fast-ticking LSTM for the variational posterior

$$ Q(z_t \mid z_{<t}, x_{\le t}, a_{<t}, r_{<t}) = \mathcal{N}(\mu_t^q, \Sigma_t^q), $$

where the mean μ_t^q and covariance Σ_t^q = (σ_t^q)^2 I of the normal distribution are parameterised by the linear transformation (μ_t^q, log σ_t^q) = f^q(h_t^q), and at each timestep we take a sample z_t ∼ N(μ_t^q, Σ_t^q). The slow timescale LSTM output is used for the prior P(z_t | z_{<t}, x_{≤t}, a_{<t}, r_{<t}) = N(μ_t^p, Σ_t^p), where Σ_t^p = (σ_t^p)^2 I, (μ_t^p, log σ_t^p) = f^p(h_t^p), and f^p is a linear transformation. The fast timescale core takes as input the observation encoded by a convolutional neural network (CNN), u_t = CNN(x_t), the previous action a_{t−1}, the previous reward r_{t−1}, the prior distribution parameters μ_t^p and σ_t^p, and the previous sample of the variational posterior z_{t−1} ∼ N(μ_{t−1}^q, Σ_{t−1}^q). The slow core takes the fast core's hidden state as input, giving the recurrent network dynamics

$$ h_t^q = g^q\big(u_t, a_{t-1}, r_{t-1}, h_t^p, h_{t-1}^q, \mu_t^p, \sigma_t^p, z_{t-1}\big), \qquad h_t^p = \begin{cases} g^p\big(h_{t-1}^q, h_{t-1}^p\big) & \text{if } t \bmod \tau = 0 \\ h_{t-1}^p & \text{otherwise,} \end{cases} \quad (4) $$

where g^q and g^p are the fast and slow timescale LSTM cores respectively. Stochastic policy, value function, and pixel control signals are obtained from the samples z_t using further non-linear transformations. The resulting update direction is therefore

$$ \nabla\, \mathbb{E}_{z_t \sim Q(\cdot \mid C_t^q)}\big[ \mathcal{L}(z_t, x_{\le t}) \big] - \nabla\, D_{\mathrm{KL}}\big[ Q(z_t \mid C_t^q) \,\|\, P(z_t \mid C_t^p) \big], \quad (5) $$

with C_t^q = (z_{<t}, x_{≤t}, a_{<t}, r_{<t}) and C_t^p the corresponding history available to the slow timescale core, where L(·, ·) represents the objective function composed of terms for multi-step policy gradient and value function optimisation [45], as well as pixel control and reward prediction auxiliary tasks [30] (see Section 5.4). Intuitively, this objective function captures the idea that the slow LSTM generates a prior on z which predicts the evolution of z for the subsequent τ steps, while the fast LSTM generates a variational posterior on z that incorporates new observations, but adheres to the predictions made by the prior. All the while, z must be a useful representation for maximising reward and auxiliary task performance. This architecture can easily be extended to more than two hierarchical layers, but we found in practice that more layers made little difference on this task. We also augmented this dual-LSTM agent with shared DNC memory [20] to further increase its ability to store and recall past experience (this merely modifies the functional form of g^p and g^q). Finally, unlike previous work on DeepMind Lab [17, 30], the FTW agent uses a rich action space of 540 individual actions which are obtained by combining elements from six independent action dimensions. Exact agent architectures are described in Figure S10.
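The two-timescale latent-variable core described above can be sketched in a few lines of PyTorch (chosen here only for concreteness; the dimensions, wiring, and framework are assumptions, and DNC memory, policy/value heads, and the auxiliary losses are omitted):

```python
import torch
from torch import nn
from torch.distributions import Normal, kl_divergence

class TwoTimescaleCore(nn.Module):
    """Simplified sketch of the dual-LSTM core: a fast LSTM emits the posterior
    Q(z_t), a slow LSTM (ticking every `period` steps) emits the prior P(z_t)."""

    def __init__(self, obs_dim=128, hidden=256, z_dim=64, period=10):
        super().__init__()
        self.period = period                                   # tau
        self.fast = nn.LSTMCell(obs_dim + 3 * z_dim, hidden)   # u_t, prior params, z_{t-1}
        self.slow = nn.LSTMCell(hidden, hidden)
        self.posterior_head = nn.Linear(hidden, 2 * z_dim)     # mean, log-std of Q
        self.prior_head = nn.Linear(hidden, 2 * z_dim)         # mean, log-std of P

    def step(self, t, u_t, state):
        fast_h, fast_c, slow_h, slow_c, z_prev = state
        if t % self.period == 0:                               # slow core updates every tau steps
            slow_h, slow_c = self.slow(fast_h, (slow_h, slow_c))
        prior_mu, prior_logsig = self.prior_head(slow_h).chunk(2, dim=-1)

        fast_in = torch.cat([u_t, prior_mu, prior_logsig, z_prev], dim=-1)
        fast_h, fast_c = self.fast(fast_in, (fast_h, fast_c))
        post_mu, post_logsig = self.posterior_head(fast_h).chunk(2, dim=-1)

        posterior = Normal(post_mu, post_logsig.exp())
        prior = Normal(prior_mu, prior_logsig.exp())
        z_t = posterior.rsample()                              # latent used by policy/value heads
        kl = kl_divergence(posterior, prior).sum(-1)           # KL term of the update direction
        return z_t, kl, (fast_h, fast_c, slow_h, slow_c, z_t)

core = TwoTimescaleCore()
state = (torch.zeros(1, 256), torch.zeros(1, 256),
         torch.zeros(1, 256), torch.zeros(1, 256), torch.zeros(1, 64))
z, kl, state = core.step(t=0, u_t=torch.zeros(1, 128), state=state)
```

Note that, unlike this simplification, the real agent also feeds the previous action and reward into the fast core and obtains its observation encoding u_t from a CNN.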
2.2 Internal Reward and Population Based Training
We wish to optimise the FTW agent with RL as stated in Equation (5), using a reward signal that maximises the agent team's win probability. Reward purely based on game outcome, such as a win/draw/loss signal given as a reward of r_T = 1, r_T = 0, and r_T = −1 respectively, is very sparse and delayed, resulting in no learning (Figure 2 b, Self-play). Hence, we obtain more frequent rewards by considering the game points stream ρ_t. These can be used simply for reward shaping [49] (Figure 2 b, Self-play + RS) or transformed into a reward signal r_t = w(ρ_t) using a learnt transformation w (Figure 2 b, FTW). This transformation is adapted such that performing RL to optimise the resulting cumulative sum of expected future discounted rewards effectively maximises the winning probability of the agent's team, removing the need for manual reward shaping [49]. The transformation w is implemented as a table lookup for each of the 13 unique values of ρ_t, corresponding to the events listed in Section 5.5. In addition to

optimising the internal rewards of the RL optimisation, we also optimise hyperparameters φ of the agent and RL training process automatically. These include learning rate, slow LSTM time scale τ, the weight of the D_KL term in Equation (5), and the entropy cost (full list in Section 5.4). This optimisation of internal rewards and hyperparameters is performed using a process of population based training (PBT) [29]. In our case, a population of P = 30 agents was trained in parallel. For each agent we periodically sampled another agent, and estimated the win probability of a team composed of only the first agent versus a team composed of only the second from training matches using Elo scores. If the estimated win probability of an agent was found to be less than 70%, then the losing agent copied the policy, the internal reward transformation, and the hyperparameters of the better agent, and explored new internal rewards and hyperparameters. This exploration was performed by perturbing the inherited value by ±20% with a probability of 5%, with the exception of the slow LSTM time scale τ, which was uniformly sampled from the integer range [5, 20]. A burn-in time of 1K games was used after each exploration step, which prevents further exploration and allows learning to occur.
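A hedged sketch of this exploration step is shown below (Python). The event names, and the decision to gate the timescale resampling at the same 5% probability, are assumptions made for illustration; the thresholds follow the description above.

```python
import random

EVENTS = ["flag_capture", "flag_pickup", "flag_return", "tag_opponent", "tagged"]
# Illustrative names only: the real w is a lookup table over the 13 point-scoring events.

def explore(internal_reward, hyperparams):
    """Perturb an inherited reward table and hyperparameters after an exploit step
    (exploit itself happens when the estimated win probability drops below 70%)."""
    new_reward = dict(internal_reward)
    for event in new_reward:
        if random.random() < 0.05:                          # 5% chance per entry
            new_reward[event] *= random.choice([0.8, 1.2])  # +/-20% perturbation
    new_hyper = dict(hyperparams)
    for name in ("learning_rate", "kl_weight", "entropy_cost"):
        if random.random() < 0.05:
            new_hyper[name] *= random.choice([0.8, 1.2])
    if random.random() < 0.05:
        # The slow LSTM timescale is resampled uniformly rather than perturbed.
        new_hyper["slow_lstm_period"] = random.randint(5, 20)
    return new_reward, new_hyper

internal_reward = {event: random.uniform(-1.0, 1.0) for event in EVENTS}
hyperparams = {"learning_rate": 1e-4, "kl_weight": 0.1, "entropy_cost": 0.01,
               "slow_lstm_period": 10}
print(explore(internal_reward, hyperparams))
# A 1K-game burn-in follows each exploration step before further exploration is allowed.
```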
2.3 Training Architecture
We used a distributed, population-based training framework for deep reinforcement learning agents, designed for the fast optimisation of RL agents interacting with each other in an environment with high computational simulation costs. Our architecture is based on an actor-learner structure [17]: a large collection of 1920 arena processes continually play CTF games, with players sampled at the beginning of each episode from the live training population to fill the N player positions of the game (Section 5.4.1 for details). We train with N = 4 (2 vs 2) games but find the agents generalise to different team sizes (Figure S2). After every 100 agent steps, the trajectory of experience from each player's point of view (observations, actions, rewards) is sent to the learner responsible for the policy carried out by that player. The learner corresponding to an agent composes batches of the 32 trajectories most recently received from arenas, and computes a weight update to the agent's neural network parameters based on Equation (5), using V-Trace off-policy correction [17] to account for off-policy drift.
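For reference (restated from [17], not derived in this paper), the V-Trace correction computes n-step value targets from trajectories generated by a behaviour policy μ while learning a target policy π:

$$ v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big)\, \delta_t V, \qquad \delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big), $$

$$ \rho_t = \min\!\Big( \bar{\rho},\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} \Big), \qquad c_i = \min\!\Big( \bar{c},\ \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \Big), $$

where the truncation levels ρ̄ and c̄ control the bias-variance trade-off of the off-policy correction.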
3 Performance Evaluation
An important dimension of assessing the success of training agents to play CTF is to evaluate their skill in terms of the agent team's win probability. As opposed to single-agent tasks, assessing skill in multi-agent systems depends on the teammates and opponents used during evaluation. We quantified agent skill by playing evaluation games with players from the set of all agents to be assessed. Evaluation games were composed using ad-hoc matchmaking in the sense that all N players of the game, from both teams, were drawn at random from the set of agents being evaluated. This allowed us to measure skill against any set of opponent agents and robustness to any set of teammate agents. We estimate skill using the Elo rating system [15]

extended to teams (see Section 5.1 for exact details of the Elo calculation).
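The exact team extension is not reproduced here; as a rough illustration of the underlying Elo mechanics, a team can be rated by the mean of its members' ratings and the standard update shared equally, as in the sketch below (the choice of mean rating, K-factor, and equal sharing are assumptions of this sketch).

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_team_elo(ratings_a, ratings_b, score_a, k=32.0):
    """score_a is 1 for a win by team A, 0 for a loss, 0.5 for a draw."""
    expected_a = elo_expected(sum(ratings_a) / len(ratings_a),
                              sum(ratings_b) / len(ratings_b))
    delta = k * (score_a - expected_a)
    return ([r + delta for r in ratings_a], [r - delta for r in ratings_b])

# Example: a 2 vs 2 game won by team A.
print(update_team_elo([1500.0, 1520.0], [1480.0, 1510.0], score_a=1.0))
```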
We performed evaluation matches with snapshots of the FTW agent and ablation study
agents through training time, and also included built-in bots and human participants as reference agents, for evaluation purposes only. The differences between these types of players are summarised in Figure S11.
The various ablated agents in experiments are: (i) UNREAL [30] trained with self-play using the game-winning reward (this represents the state-of-the-art naive baseline); (ii) self-play with reward shaping (RS), which instead uses the Quake default points scheme as reward; (iii) PBT with RS, which replaces self-play with population based training; and (iv) FTW without temporal hierarchy, which is the full FTW agent but omitting the temporal hierarchy (Section 5.6 for full details).
The built-in bots were scripted AI bots developed for Quake III Arena. Their policy has access to the entire game engine, game state, and map layout, but has no learning component [69]. These bots were configured for various skill levels, from Bot 1 (very low skill level) to Bot 5 (very high skill level, increased shields), as described fully in Section 5.9.
The human participants consisted of 40 people with first-person video game playing experience. We collected results of evaluation games involving humans by playing five tournaments of eight human players each. Players were given instructions on the game environment and rules, and performed two games against Bot 3 built-in bots. Human players then played seven games in ad-hoc teams, being randomly matched with other humans, FTW agents, and FTW without temporal hierarchy agents as teammates and opponents. Players were not told which agent types they were playing with and were not allowed to communicate with each other. Agents were executed in real time on the CPUs of the same workstations used by human players (desktops with a commodity GPU) without adversely affecting the frame rate of the game.
Figure S2 shows the outcome of the tournaments involving humans. To obtain statistically valid Elo estimates from the small number of games played among individuals with high skill variance, we pooled the humans into two groups, the top 20% (strong) and the remaining 80% (average), according to their individual performances.
We also performed another study with human players to find out whether human ingenuity, adaptivity, and teamwork would help humans find exploitative strategies against trained agents. We asked two professional games testers to play as a team against a team of two FTW agents on a fixed, particularly complex map, which had been held out of training. After six hours of practice and experimentation, the human games testers were able to consistently win against the FTW team on this single map by employing a high-level strategy. This winning strategy involved careful study of the preferred routes of the agents on this map in exploratory games, drawing explicit maps, and then precise communication between the humans to coordinate successful flag captures by avoiding the agents' preferred routes. In a second test, the maps were changed to be procedurally generated for each episode, as during training. Under these conditions, the human games testers were not able to find a consistently winning strategy, resulting in a human win rate of only 25% (draw rate of 6.3%).
3.1 Human-Agent Differences
It is important to recognise the intrinsic differences between agents and humans when evaluating results. It is very difficult to obtain an even playing ground between humans and agents, and it is likely that this will continue to be the case for all human-machine comparisons in the domain of action video games. While we attempted to ensure that the interaction of agents and humans within their shared environment was as fair as possible, engineering limitations mean that differences still exist. Figure S11 (a) outlines these, which include the fact that the environment serves humans a richer interface than agents: observations with higher visual resolution and lower temporal latency, and a control space of higher fidelity and temporal resolution.
However, in spite of these environmental constraints, agents have a set of advantages over humans in terms of their ultimate sensorimotor precision and perception. Humans cannot take full advantage of what the environment offers: they have a visual-response feedback loop far slower than the 60Hz observation rate 10; and although a high-fidelity action space is available, humans' cognitive and motor skills limit their effective control in video games 6.
One way that this manifests in CTF games is through reaction times to salient events. While we cannot measure reaction time directly within a full CTF game, we measure possible proxies for reaction time by considering how long it takes for an agent to respond to a newly appeared opponent (Figure S11 (b)). After an opponent first appears within a player's 90-degree field of view, it must then become taggable, i.e. positioned within a 10-degree cone of the player's centre of vision. This occurs very quickly within both human and agent play, less than 200ms on average, though this does not necessarily reflect intentional reactions, and may also result from some combination of players' movement statistics and prior orientation towards opponent appearance points. However, the time between first seeing an opponent and attempting a tag (the opponent is taggable and the tag action is emitted) is much lower for FTW agents (258ms on average) compared to humans (559ms), and when a successful tag is considered this gap widens (233ms FTW, 627ms humans). Stronger agents also had lower response times in general than weaker agents, but there was no statistically significant difference in strong humans' response times compared to average humans.
The tagging accuracy of agents is also significantly higher than that of humans: 80% for FTW agents compared to 48% for humans. We measured the effect of tagging accuracy on the performance of FTW agents playing against a Bot 3 team by artificially impairing the agents' ability to fire, without retraining the agents (Figure S11 (c)). Win probability decreased as the accuracy of the agent decreased; however, at accuracies comparable to humans the FTW agents still had a greater win probability than humans, albeit with comparable mean flag capture differences. We also used this mechanism to attempt to measure the effect of successful tag time on win probability (Figure S11 (d)), and found that an average response time of up to 375ms did not affect the win probability of the FTW agent; only at 448ms did the win rate drop to 85%.
4 Analysis
4.1 Knowledge Representation
We carried out an analysis of the FTW agent's internal representation to help us understand how it represents its environment, what aspects of game state are well represented, and how it uses its memory module and parses visual observations.
We say that the agent had game-state related knowledge of a given piece of information if that information could be decoded with sufficient accuracy from the agent's recurrent hidden state (h^p_t, h^q_t) using a linear probe classifier. We defined a set of 40 binary features, taking the form of the questions found in Figure S4, about the state of the game in the distant and recent past, present, and future, resulting in a total of 200 features. Probe classifiers were trained for each of the 200 features using balanced logistic regression on 4.5 million game situations, with results reported in terms of AUC-ROC evaluated with 3-fold episode-wise cross-validation. This analysis was performed on the agent at multiple points in training to show what knowledge emerges at which point in training, with the results shown in Figure S4.
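The probing setup can be sketched with scikit-learn: a class-balanced logistic regression decodes one binary game-state feature from recurrent hidden states, scored by AUC-ROC under 3-fold episode-wise cross-validation. The data below is synthetic and the names are hypothetical; this is an illustration of the procedure, not the analysis code used for the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def probe_feature_auc(hidden_states, labels, episode_ids, n_splits=3):
    """Decode one binary feature from hidden states with a balanced linear probe."""
    aucs = []
    splitter = GroupKFold(n_splits=n_splits)  # episode-wise folds
    for train_idx, test_idx in splitter.split(hidden_states, labels, groups=episode_ids):
        probe = LogisticRegression(class_weight="balanced", max_iter=1000)
        probe.fit(hidden_states[train_idx], labels[train_idx])
        scores = probe.decision_function(hidden_states[test_idx])
        aucs.append(roc_auc_score(labels[test_idx], scores))
    return float(np.mean(aucs))

# Toy usage: synthetic "game situations" with a 512-dimensional hidden state.
rng = np.random.default_rng(0)
h = rng.normal(size=(10_000, 512))
y = (h[:, 0] + 0.5 * rng.normal(size=10_000) > 0).astype(int)
episodes = rng.integers(0, 30, size=10_000)
print(probe_feature_auc(h, y, episodes))
```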
Further insights about the geometry of the representation space were gleaned by performing a t-SNE dimensionality reduction 67 on the recurrent hidden state of the FTW agent. We found strong evidence of cluster structure in the agent's representation, reflecting conjunctions of known CTF game-state elements: flag possession, location of the agent, and the agent's respawn state. Furthermore, we introduce neural response maps which clearly highlight the differences in co-activation of individual neurons of the agent in these different game states (Figure S5). In fact, certain aspects of the game, such as whether the agent's flag is held by an opponent or not, or whether the agent's teammate holds the opponents' flag or not, are represented by the response of single neurons.
Finally, we can decode the sensitivity of the agent's value function, policy, and internal single-neuron responses to its visual observations of the environment through gradient-based saliency analysis 59 (Figure S6). Sensitivity analysis combined with knowledge classifiers seems to indicate that the agent performed a kind of task-based scene understanding, with the effect that its value function estimate was sensitive to seeing the flag, other agents, and elements of the on-screen information. The exact scene objects to which the agent's value function was sensitive were often found to be context dependent (Figure S6 bottom).
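A minimal sketch of this style of gradient-based saliency is given below: the L1 norm of the gradient of a scalar output with respect to each input pixel. The tiny convolutional value head is a stand-in for illustration only, not the agent's architecture.

```python
import torch
import torch.nn as nn

# Stand-in scalar head: a tiny CNN "value estimate" over an 84x84 RGB observation.
value_net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=8, stride=4), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(1),
)

def pixel_saliency(scalar_net, observation):
    """L1 norm of d(scalar output)/d(pixel) for every pixel of the observation."""
    obs = observation.clone().detach().requires_grad_(True)  # shape (1, 3, 84, 84)
    scalar_net(obs).sum().backward()
    # Sum absolute gradients over colour channels to get one value per pixel.
    return obs.grad.abs().sum(dim=1).squeeze(0)              # shape (84, 84)

saliency = pixel_saliency(value_net, torch.rand(1, 3, 84, 84))
```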
4.2 Agent Behaviour
The CTF games our agents played were five minutes long and consisted of 4500 elemental actions by each player. To better understand and interpret the behaviour of agents we considered modelling temporal chunks of high-level game features. We segmented games into two-second periods represented by a sequence of game features (e.g. distance from bases, agent's room, visibility of teammates and opponents, flag status; see Section 5.8) and used a variational autoencoder (VAE) consisting of an RNN encoder and decoder 8 to find a compressed vector representation of these two seconds of high-level agent-centric CTF gameplay. We used a Gaussian mixture model (GMM) with 32 components to find clusters of behaviour in the VAE-induced vector representation of gameplay segments (see Section 5.8 for more details). These discrete cluster assignments allowed us to represent high-level agent play as a sequence of cluster indices (Figure S9 (b)). These two-second behaviour prototypes were interpretable and represented a wide range of meaningful behaviours such as home base camping, opponent base camping, defensive behaviour, teammate following, respawning, and empty room navigation. Based on this representation, high-level agent behaviour could be represented by histograms of frequencies of behaviour prototypes over thousands of episodes. These behavioural fingerprints were shown to vary throughout training, differed strongly between hierarchical and non-hierarchical agent architectures, and were computed for human players as well (Figure S9 (a)). Comparing these behaviour fingerprints using the Hellinger distance 25, we found that the human behaviour was most similar to the FTW agent after 200K games of training.
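For reference, the Hellinger distance used to compare behavioural fingerprints has a simple closed form; the sketch below assumes fingerprints given as frequency histograms over the 32 behaviour prototypes, with synthetic counts standing in for real episode statistics.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over behaviour clusters."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Toy fingerprints: counts of the 32 behaviour prototypes accumulated over episodes.
rng = np.random.default_rng(0)
human_fingerprint = rng.integers(1, 100, size=32)
agent_fingerprint = rng.integers(1, 100, size=32)
print(hellinger(human_fingerprint, agent_fingerprint))
```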
5 Experiment Details
5.1 Elo Calculation
We describe the performance of players, both agents and humans, in terms of Elo ratings 15, as commonly used in both traditional games like chess and in competitive video game ranking and matchmaking services. While Elo ratings as described for chess address the one-versus-one case, we extend this for CTF to the n-versus-n case by making the assumption that the rating of a team can be decomposed as the sum of the skills of its team members.
Given a population of M agents, let φ_i ∈ R be the rating for agent i. We describe a given match between two teams, blue and red, with a vector m ∈ Z^M, where m_i is the number of times agent i appears in the blue team less the number of times the agent appears in the red team. Using our additive assumption we can then express the standard Elo formula as:

\[ P(\text{blue wins against red} \mid m, \phi) \;=\; \frac{1}{1 + 10^{-\phi^{\top} m / 400}}. \tag{6} \]

To calculate ratings given a set of matches with team assignments m_i and outcomes y_i (y_i = 1 for blue beats red and y_i = 1/2 for a draw), we optimise φ to find the ratings that maximise the likelihood of the data. Since win probabilities are determined only by absolute differences in ratings, we typically anchor a particular agent (Bot 4) to a rating of 1000 for ease of interpretation.
For the purposes of PBT, we calculate the winning probability of agent i versus agent j using m_i = 2, m_j = -2 and m_k = 0 for k ∉ {i, j}, i.e. we assume that both players on the blue team are i and similarly for the red team.
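Under the additive team model, the ratings can be fit by maximising the likelihood implied by Equation 6; the sketch below is a minimal SciPy illustration with draws encoded as y = 1/2 and one agent anchored to a rating of 1000. The function and variable names are ours, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_team_elo(match_vectors, outcomes, anchor_idx=0, anchor_rating=1000.0):
    """Maximum-likelihood team Elo ratings under the additive model.

    match_vectors: (K, M) array; entry [k, i] counts how often agent i appeared on
    the blue team minus how often it appeared on the red team in match k.
    outcomes: length-K array with y = 1 (blue wins), 0 (red wins) or 0.5 (draw).
    """
    m = np.asarray(match_vectors, dtype=float)
    y = np.asarray(outcomes, dtype=float)

    def neg_log_likelihood(phi):
        p = 1.0 / (1.0 + 10.0 ** (-(m @ phi) / 400.0))   # P(blue wins) per match
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    ratings = minimize(neg_log_likelihood, np.zeros(m.shape[1]), method="BFGS").x
    # Ratings are identified only up to a constant shift, so anchor one agent.
    return ratings + (anchor_rating - ratings[anchor_idx])

# Toy example with two agents playing 2 vs 2 against copies of themselves:
# agent 1 beats agent 0 twice, agent 0 beats agent 1 once.
matches = np.array([[2, -2], [-2, 2], [2, -2]])
print(fit_team_elo(matches, outcomes=[0.0, 1.0, 1.0]))
```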
5.2 Environment Observation and Action Space
DeepMind Lab 5 is capable of rendering colour observations at a wide range of resolutions.
We elected to use a resolution of 84×84 pixels, as in previous related work in this environment 30, 44. Each pixel is represented by a triple of three bytes, which we scale by 1/255 to produce an observation x_t ∈ [0, 1]^{84×84×3}.
The environment accepts actions as a composite of six types of partial actions: change in yaw (continuous), change in pitch (continuous), strafing left or right (ternary), moving forward or backwards (ternary), tagging and jumping (both binary). To further simplify this space, we expose only two possible values for yaw rotations (10 and 60) and just one for pitch (5). Consequently, the number of possible composite actions that the agent can produce is of size 5 × 3 × 3 × 3 × 2 × 2 = 540.
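The size of the composite action space follows directly from the six groups; a minimal enumeration is shown below. The numeric encodings of the individual partial actions are assumptions for illustration; only the option counts come from the description above.

```python
from itertools import product

yaw     = [-60, -10, 0, 10, 60]   # 5 options: two magnitudes, both signs, plus no-op
pitch   = [-5, 0, 5]              # 3 options
strafe  = [-1, 0, 1]              # ternary: left / none / right
forward = [-1, 0, 1]              # ternary: back / none / forward
tag     = [0, 1]                  # binary tag action
jump    = [0, 1]                  # binary jump action

composite_actions = list(product(yaw, pitch, strafe, forward, tag, jump))
assert len(composite_actions) == 5 * 3 * 3 * 3 * 2 * 2 == 540
```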
5.3 Procedural Environments
Indoor Procedural Maps The procedural indoor maps are flat, point-symmetric mazes consisting of rooms connected by corridors. Each map has two base rooms which contain the team's flag spawn point and several possible player spawn points. Maps are contextually coloured: the red base is red, the blue base is blue, empty rooms are grey and narrow corridors are yellow. Artwork is randomly scattered around the maps' walls.
The procedure for generating an indoor map is as follows:
1. Generate random-sized rectangular rooms within a fixed-size square area (e.g. 13×13 or 17×17 cells). Room edges were only placed on even cells, meaning rooms always have odd-sized walls. This restriction was used to work with the maze backtracking algorithm.
2. Fill the space between rooms using the backtracking maze algorithm to produce corridors (a minimal sketch of this carving step is given after this list). Backtracking only occurs on even cells to allow whole-cell gaps as walls.
3. Remove dead ends and horseshoes in the maze.
4. Searching from the top left cell, the first room encountered is declared the base room. This ensures that base rooms are typically at opposite ends of the arena.
5. The map is then made to be pointsymmetric by taking the first half of the map and concatenating it with its reversed self.
6. Flag bases and spawn points are added pointsymmetrically to the base rooms.
7. The map is then checked for solvability and for meeting certain constraints (the base room is at least 9 units in area, the flags are a minimum distance apart).
8. Finally, the map is randomly rotated to prevent agents from exploiting the skybox for navigation.
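The carving step referenced above (step 2) can be illustrated with a generic backtracking maze sketch that keeps corridor cells on odd coordinates and walls on even ones; it ignores rooms, symmetry, and the later validation steps, and is not the generator used in this work.

```python
import random

def carve_corridors(size=13, seed=0):
    """Carve corridors with the backtracking maze algorithm on a size x size grid.
    Corridor cells sit at odd coordinates; even rows/columns remain walls, giving
    whole-cell gaps between corridors."""
    rng = random.Random(seed)
    grid = [["#"] * size for _ in range(size)]
    start = (1, 1)
    grid[start[1]][start[0]] = " "
    stack = [start]
    while stack:
        x, y = stack[-1]
        # Unvisited cells two steps away, so walls stay on even lines.
        neighbours = [(x + dx, y + dy)
                      for dx, dy in ((2, 0), (-2, 0), (0, 2), (0, -2))
                      if 0 < x + dx < size and 0 < y + dy < size
                      and grid[y + dy][x + dx] == "#"]
        if neighbours:
            nx, ny = rng.choice(neighbours)
            grid[(y + ny) // 2][(x + nx) // 2] = " "   # knock through the wall between
            grid[ny][nx] = " "
            stack.append((nx, ny))
        else:
            stack.pop()                                # dead end: backtrack
    return grid

for row in carve_corridors():
    print("".join(row))
```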
Outdoor Procedural Maps The procedural outdoor maps are open, hilly, naturalistic maps containing obstacles and rugged terrain. Each team's flag and spawn locations are on opposite corners of the map. Cacti and boulders of random shapes and sizes are scattered over the landscape. To produce the levels, the height map was first generated using the diamond-square fractal algorithm. This algorithm was run twice, first with a low variance and then with a high variance, and the results were combined using the element-wise max operator. Cacti and shrubs were placed in the environment using rejection sampling. Each plant species has a preference for a distribution over the height above the water table. After initial placement, a lifecycle of the plants was simulated, with seeds being dispersed near plants and competition limiting growth in high-vegetation areas. Rocks were placed randomly and simulated sliding down the terrain to their final resting places. After all entities had been placed on the map, we performed pruning to ensure props were not overlapping too much. Flags and spawn points were placed in opposite quadrants of the map. The parameters of each map, such as water table height and cacti, shrub and rock density, were also randomly sampled per individual map. 1000 maps were generated and 10 were reserved for evaluation.
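The height-map step can be sketched with a standard diamond-square implementation, run twice with low and high noise and combined with an element-wise max as described above. The grid size, roughness values, and per-pass noise decay are conventional choices for illustration, not parameters from the paper.

```python
import numpy as np

def diamond_square(n, roughness, rng):
    """Generate a (2**n + 1) x (2**n + 1) height map with the diamond-square algorithm."""
    size = 2 ** n + 1
    h = np.zeros((size, size))
    h[0, 0], h[0, -1], h[-1, 0], h[-1, -1] = rng.normal(0.0, roughness, 4)
    step, scale = size - 1, roughness
    while step > 1:
        half = step // 2
        # Diamond step: centre of each square = mean of its four corners + noise.
        for y in range(0, size - 1, step):
            for x in range(0, size - 1, step):
                avg = (h[y, x] + h[y, x + step] + h[y + step, x] + h[y + step, x + step]) / 4.0
                h[y + half, x + half] = avg + rng.normal(0.0, scale)
        # Square step: each edge midpoint = mean of its available neighbours + noise.
        for y in range(0, size, half):
            for x in range((y + half) % step, size, step):
                neighbours = [h[ny, nx]
                              for ny, nx in ((y - half, x), (y + half, x), (y, x - half), (y, x + half))
                              if 0 <= ny < size and 0 <= nx < size]
                h[y, x] = np.mean(neighbours) + rng.normal(0.0, scale)
        step, scale = half, scale * 0.5   # halve the grid step and the noise amplitude
    return h

rng = np.random.default_rng(0)
# Two passes, low then high variance, combined with an element-wise max.
terrain = np.maximum(diamond_square(7, roughness=0.2, rng=rng),
                     diamond_square(7, roughness=1.0, rng=rng))
```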
5.4 Training Details
Agents received observations from the environment at 15 time steps per second. For each observation, the agent returns an action to the environment, which is repeated four times within the environment 30, 44. Every training game lasts for five minutes or, equivalently, for 4500 agent steps. Agents were trained for two billion steps, corresponding to approximately 450K games.
Agents' parameters were updated every time a batch of 32 trajectories of length 100 had been accumulated from the arenas in which the respective agents were playing. We used RMSProp 26 as the optimiser, with ε = 10^-5, momentum 0, and decay rate 0.99. The initial learning rate was sampled per agent from LogUniform(10^-5, 5 × 10^-3) and further tuned during training by PBT, with a population size of 30. Both V-Trace clipping thresholds ρ̄ and c̄ were set to 1. The RL discounting factor γ was set to 0.99.
All agents were trained with at least the components of the UNREAL loss 30: the losses used by A3C 44, plus pixel control and reward prediction auxiliary task losses. The baseline cost weight was fixed at 0.5, the initial entropy cost was sampled per agent from LogUniform(5 × 10^-4, 10^-2), the initial reward prediction loss weight was sampled from LogUniform(0.1, 1), and the initial pixel control loss weight was sampled from LogUniform(0.01, 0.1). All weights except the baseline cost weight were tuned during training by PBT.
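The LogUniform(a, b) draws above are simply uniform samples in log-space; a minimal helper is sketched below, with the dictionary of hyperparameter names written out only for illustration.

```python
import numpy as np

def log_uniform(low, high, rng):
    """Sample from LogUniform(low, high): uniform between log(low) and log(high)."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

rng = np.random.default_rng(0)
# Per-agent initial values as described above; PBT subsequently tunes them.
initial_hyperparameters = {
    "learning_rate":            log_uniform(1e-5, 5e-3, rng),
    "entropy_cost":             log_uniform(5e-4, 1e-2, rng),
    "reward_prediction_weight": log_uniform(0.1, 1.0, rng),
    "pixel_control_weight":     log_uniform(0.01, 0.1, rng),
}
```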
Due to the composite nature of the action space, instead of training pixel control policies directly on all 540 composite actions, we trained independent pixel control policies for each of the six action groups.
The reward prediction loss was trained using a small replay buffer, as in UNREAL 30. In particular, the replay buffer had capacity for 800 non-zero-reward and 800 zero-reward sequences. Sequences consisted of three observations. The batch size for the reward prediction loss was 32, the same as the batch size for all the other losses. The batch consisted of 16 non-zero-reward sequences and 16 zero-reward sequences.
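A minimal sketch of such a balanced replay buffer is given below: two fixed-capacity queues, one per reward class, sampled 16 and 16 to form the batch. The class and method names are ours, for illustration only.

```python
import random
from collections import deque

class RewardPredictionBuffer:
    """Holds short observation sequences, split by whether the final step has
    a non-zero reward, so that batches can be drawn balanced 16/16."""

    def __init__(self, capacity=800, sequence_length=3):
        self.zero = deque(maxlen=capacity)       # sequences ending in zero reward
        self.nonzero = deque(maxlen=capacity)    # sequences ending in non-zero reward
        self.sequence_length = sequence_length

    def add(self, observations, reward):
        sequence = tuple(observations[-self.sequence_length:])
        (self.nonzero if reward != 0 else self.zero).append((sequence, reward))

    def sample(self, batch_size=32):
        # Assumes both queues already hold at least batch_size // 2 sequences.
        half = batch_size // 2
        return random.sample(list(self.nonzero), half) + random.sample(list(self.zero), half)
```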
For the FTW agent with temporal hierarchy, the loss includes the KL divergence between the prior distribution from the slow-ticking core and the posterior distribution from the fast-ticking core, as well as the KL divergence between the prior distribution and a multivariate Gaussian with mean 0 and standard deviation 0.1. The weight on the first divergence was sampled from LogUniform(10^-3, 1), and the weight on the second divergence was sampled from LogUniform(10^-4, 10^-1). A scaling factor on the gradients flowing from the fast- to the slow-ticking core was sampled from LogUniform(0.1, 1). Finally, the initial slower-ticking core time period τ was sampled from Categorical(5, 6, ..., 20). These four quantities were further optimised during training by PBT.
5.4.1 Training Games
Each training CTF game was started by randomly sampling the level to play on. For indoor procedural maps, first the size of the map (13 or 17, each with 50% probability) and its geometry were generated according to the procedure described in Section 5.3. For outdoor procedural maps, one of the 1000 pre-generated maps was sampled uniformly. Next, a single agent π_p was randomly sampled from the population. Based on its Elo score, three more agents π were sampled without replacement from the population according to the distribution

\[ P(\pi \mid \pi_p) \;\propto\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{\big(P(\pi_p \text{ beats } \pi) - 0.5\big)^2}{2\sigma^2} \right), \qquad \sigma = \tfrac{1}{6}, \]

which is a normal distribution over Elo-based probabilities of winning, centred on agents of the same skill. For self-play ablation studies, agents were paired with their own policy instead. The agents in the game pool were randomly assigned to the red and blue teams. After each five-minute episode this process was repeated.
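The matchmaking rule can be sketched as follows, using the team win probability of Section 5.1 with m_i = 2, m_j = -2, and the σ = 1/6 given above; the function names and the synthetic Elo values are illustrative only.

```python
import numpy as np

def win_probability(elo_a, elo_b):
    """Elo-based probability that a team of two copies of a beats two copies of b."""
    return 1.0 / (1.0 + 10.0 ** (-(2 * elo_a - 2 * elo_b) / 400.0))

def sample_opponents_and_teammates(elos, p, n_samples=3, sigma=1.0 / 6.0, rng=None):
    """Sample the other three players for a training game around agent p's skill."""
    rng = rng or np.random.default_rng()
    candidates = [i for i in range(len(elos)) if i != p]
    probs = np.array([win_probability(elos[p], elos[i]) for i in candidates])
    weights = np.exp(-((probs - 0.5) ** 2) / (2.0 * sigma ** 2))
    weights /= weights.sum()
    return list(rng.choice(candidates, size=n_samples, replace=False, p=weights))

elos = np.random.default_rng(0).normal(1000.0, 100.0, size=30)   # a population of 30
print(sample_opponents_and_teammates(elos, p=0))
```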
5.5 Game Events
There are 13 binary game events, each with an associated game point signal ρ_t. These events are listed below, along with the default weights w_quake from the default Quake III Arena points system, used for manual reward shaping baselines (Self-play + RS, PBT + RS):

Event                                     Signal   w_quake
(1)  I am tagged with the flag            ρ_1      0
(2)  I am tagged without the flag         ρ_2      0
(3)  I captured the flag                  ρ_3      6
(4)  I picked up the flag                 ρ_4      1
(5)  I returned the flag                  ρ_5      1
(6)  Teammate captured the flag           ρ_6      5
(7)  Teammate picked up the flag          ρ_7      0
(8)  Teammate returned the flag           ρ_8      0
(9)  I tagged opponent with the flag      ρ_9      2
(10) I tagged opponent without the flag   ρ_10     1
(11) Opponents captured the flag          ρ_11     0
(12) Opponents picked up the flag         ρ_12     0
(13) Opponents returned the flag          ρ_13     0
Agents did not have direct access to these events. The FTW agents' initial internal reward mapping was sampled independently for each agent in the population according to w_initial(ρ_i) ∼ LogUniform(0.1, 10.0) for each event i, after which it was adapted through training with reward evolution.
5.6 Ablation
We performed two separate series of ablation studies, one on procedural indoor maps and one on procedural outdoor maps. For each environment type we ran the following experiments:
Self-play: An agent with an LSTM recurrent processing core (Figure S10 (e)) trained with the UNREAL loss functions described in Section 5.4. Four identical agent policies played in each game, two versus two. Since there was only one agent policy trained, no Elo scores could be calculated, and population-based training was disabled. A single reward was provided to the agent at the end of each episode: +1 for winning, -1 for losing, and 0 for a draw.
Self-play + Reward Shaping: Same setup as Self-play above, but with manual reward shaping given by w_quake.
PBT + Reward Shaping: Same agent and losses as Self-play + Reward Shaping above, but for each game in each arena the four participating agents were sampled without replacement from the population using the process described in Section 5.4. Based on the match outcomes, Elo scores were calculated for the agents in the population as described in Section 5.1, and were used for PBT.
FTW w/o Temporal Hierarchy: Same setup as PBT + Reward Shaping above, but with reward shaping replaced by internal reward signals evolved by PBT.
FTW: The FTW agent, using the recurrent processing core with temporal hierarchy (Figure S10 (f)), with the training setup described in Methods: matchmaking, PBT, and internal reward signal.
5.7 Distinctly Selective Neurons
To identify the neuron in a given agent that is most selective for a game-state feature y, we recorded 100 episodes of the agent playing against Bot 3. Given this dataset of activations h_i and corresponding labels y_i, we fit a decision tree of depth 1 using the Gini impurity criterion. The decision tree learner selects the most discriminative dimension of h, and hence the neuron most selective for y. If the accuracy of the resulting stump exceeds 97% over the 100 × 4500 steps, we consider it to be a distinctly selective neuron.
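This selection procedure maps directly onto fitting a depth-1 decision tree (a stump); the scikit-learn sketch below runs on synthetic activations in which one unit encodes the feature, and is an illustration rather than the original analysis code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def most_selective_neuron(hidden_states, labels, threshold=0.97):
    """Return the hidden unit chosen by a depth-1 Gini decision tree, if its
    accuracy over the dataset exceeds the given threshold."""
    stump = DecisionTreeClassifier(max_depth=1, criterion="gini")
    stump.fit(hidden_states, labels)
    neuron = int(stump.tree_.feature[0])          # dimension used at the root split
    accuracy = stump.score(hidden_states, labels)
    return (neuron, accuracy) if accuracy > threshold else (None, accuracy)

# Toy usage: 100 episodes x 4500 steps of synthetic 64-d activations where unit 7
# encodes the binary feature of interest.
rng = np.random.default_rng(0)
h = rng.normal(size=(100 * 4500, 64))
y = (h[:, 7] > 0).astype(int)
print(most_selective_neuron(h, y))
```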
5.8 Behavioural Analysis
For the behavioural analysis, we model chunks of two seconds (30 agent steps) of gameplay. Each step is represented by 56 agent-centric binary features derived from ground-truth game state:
3 features Thresholded shortest path distance from the other three agents.
4 features Thresholded shortest path distance from each team's base and flag.
4 features Whether an opponent captured, dropped, picked up, or returned a flag.
4 features Whether the agent captured, dropped, picked up, or returned a flag.
4 features Whether the agents teammate captured, dropped, picked up, or returned a flag.
4 features Whether the agent was tagged without respawning, was tagged and must respawn, tagged an opponent without them respawning, or tagged an opponent and they must respawn.
4 features What room the agent is in: home base, opponent base, corridor, empty room.
5 features Visibility of teammate and opponents: teammate visible, teammate not visible, no opponents visible, one opponent visible, two opponents visible.
5 features Which other agents are in the same room: teammate in room, teammate not in room, no opponents in room, one opponent in room, two opponents in room.
4 features Each team's base visibility.
13 features Each team's flag status and visibility. Flag status can be either at base, held by teammate, held by opponent, held by the agent, or stray.
2 features Whether agent is respawning and cannot move or not.
For each of the agents analysed, 1000 episodes of pairs of the agent playing against pairs of Bot 3 were recorded and combined into a single dataset. A variational autoencoder (VAE) 31, 52 was trained on batches of this mixed-agent dataset (each data point has dimensions 30 × 56) using an LSTM encoder (256 units) over the 30 time steps, whose final output vector is linearly projected to a 128-dimensional latent variable (diagonal Gaussian). The decoder was an LSTM (256 units) which took in the sampled latent variable at every step.
After training the VAE, a dataset of 400K data points was sampled and the latent variable means computed, and a Gaussian mixture model (GMM) was fit to this 400K × 128 dataset, with diagonal covariance and 32 mixture components. The resulting components were treated as behavioural clusters, letting us characterise a two-second clip of CTF gameplay as belonging to one of 32 behavioural clusters.
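A minimal scikit-learn sketch of this clustering step is given below; the latent means are synthetic stand-ins for the VAE outputs, and the dataset is smaller than the 400K points used for the actual analysis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
latent_means = rng.normal(size=(50_000, 128)).astype(np.float32)  # stand-in for VAE means

gmm = GaussianMixture(n_components=32, covariance_type="diag", random_state=0)
gmm.fit(latent_means)

# Each two-second chunk of gameplay maps to one of 32 behaviour clusters; per-episode
# histograms of these assignments form the behavioural fingerprints compared above.
clusters = gmm.predict(latent_means)
fingerprint = np.bincount(clusters, minlength=32) / len(clusters)
```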
5.9 Bot Details
The bots we use for evaluation are a pair of Tauri and Centauri bots from Quake III Arena, as defined below.
[Table: built-in bot characteristics for the Tauri and Centauri personalities at bot levels 1 to 5, covering ATTACK SKILL, AIM SKILL, AIM ACCURACY, VIEW FACTOR, VIEW MAXCHANGE, REACTIONTIME, CROUCHER, JUMPER, WALKER, WEAPONJUMPING, GRAPPLE USER, AGGRESSION, SELFPRESERVATION, VENGEFULNESS, CAMPER, EASY FRAGGER, ALERTNESS, AIMACCURACY and FIRETHROTTLE. Skill-related parameters such as ATTACK SKILL, AIM SKILL and AIM ACCURACY increase from 0.0 at Bot 1 to 1.0 at Bot 5, while REACTIONTIME decreases from 5.0 at Bot 1 to 0.0 at Bot 5.]
Figure S1: Shown are schematics of samples of the procedurally generated maps on which agents were trained. In order to demonstrate the robustness of our approach we trained agents on two distinct styles of maps: procedural outdoor maps (top) and procedural indoor maps (bottom).
Figure S2: Top: Shown are win probabilities of different agents, including bots and humans, in evaluation tournaments, when playing on procedurally generated maps of various sizes (13-21), team sizes (1-4), and styles (indoor/outdoor). On indoor maps, agents were trained with team size two on a mixture of 13×13 and 17×17 maps, so performance in scenarios with different map and team sizes measures their ability to successfully generalise. Teams were composed by sampling agents from the set in the figure with replacement. Bottom: Shown are win probabilities, differences in number of flags captured, and number of games played for the human evaluation tournament, in which human subjects played with agents as teammates and/or opponents on indoor procedurally generated 17×17 maps.
Two-player Fetch

Agent                        Flags
Bot                          14
Self-play + RS               9
PBT + RS                     14
FTW w/o TH                   23
FTW                          37
Fetch-trained FTW w/o TH     30
Fetch-trained FTW            44

Figure S3: Left: Average number of flags scored per match for different CTF-trained agents playing two-player fetch (CTF without opponents) on indoor procedurally generated maps of size 17. This test provides a measure of agents' ability to cooperate while navigating in previously unseen maps. Ten thousand matches were played, with teams consisting of two copies of the same agent, which had not been trained on this variant of the CTF task. All bot levels performed very similarly on this task, so we report a single number for all bot levels. In addition, we show results when agents are trained solely on the fetch task (+1 reward for picking up and capturing a flag only). Right: Heatmaps of the visitation of the FTW agent during the second half of several episodes while playing fetch.
Figure S4: Shown is prediction accuracy, in terms of percent AUC-ROC, of linear probe classifiers on 40 different high-level game-state features (columns) for different agents (rows), followed by their averages across features, for five different temporal offsets ranging from -20 to +20 frames (top to bottom). Results are shown for the baseline self-play agent with reward shaping as well as the FTW agent after different numbers of training games, and an untrained, randomly initialised FTW agent.
[Figure S5 diagram (training phase): a neural response map of a vector x ∈ R^H is built by computing a temporally consistent embedding of neurons (e.g. t-SNE applied to the dataset X ∈ R^{T×H}, or minimisation of an objective comparing pairwise similarities of the embedded positions V_i, V_j and the activations X_i, X_j), taking a Delaunay triangulation of the embedding V ∈ R^{H×2} as the topology, colouring vertices by the values of x, and colouring faces by the mean of their vertices' colours.]
Figure S5: Top: Shown are neural response maps for the FTW agent for the game-state features used in the knowledge study of Extended Data Figure S4. For each binary feature y we plot the response vector E[(h^p, h^q) | y = 1] - E[(h^p, h^q) | y = 0]. Bottom: Process for generating the similarity-based topological embedding of the elements of a vector x ∈ R^H given a dataset X ∈ R^{T×H}. Here we use two independent t-SNE embeddings, one for each of the agent's LSTM hidden state vectors at the two timescales.
[Figure panels show the agent's view alongside its value function and policy saliency, and four selective neurons: I am respawning, I have the flag, Teammate has the flag, Opponent has the flag.]
Figure S6: Top two rows: Selected saliency analysis of the FTW agent. Contours show the sensitivity ‖∂f_t/∂x_{t,ij}‖_1, where f_t is instantiated as the agent's value function at time t, its policy, or one of four highly selective neurons in the agent's hidden state, and x_{t,ij} represents the pixel at position (i, j) at time t. Brighter colour means higher gradient norm and thus higher sensitivity to the given pixels. Bottom: Saliency analysis of a single neuron that encodes whether an opponent is holding a flag. Shown is a single situation from the perspective of the FTW agent, in which attention is on an opponent flag carrier at time t, on both opponents at time t+2, and switches to the on-screen information at time t+4 once the flag carrier has been tagged and the flag returned.
[Panels correspond to 7K, 200K, 300K and 450K games of training; axes show distance to the agent's own base versus distance to the opponents' base.]
Figure S7: Top: Shown are Hinton diagrams representing how often the FTW agent reads memory slots written to at different locations, which are represented in terms of distance to home and opponent base, on 1000 procedurally generated maps, at different points during training. The size of each square represents the difference between the probability of reading from the given location compared to randomly reading from one of the locations visited earlier in the episode. Red indicates that the agent reads from this position more often than random, and blue less. At 450K the agent appears to have learned to read from near its own base and just outside the opponent base. Bottom: Shown are memory recall patterns for an example episode. The heatmap plot on the left shows memory recall frequency averaged across the episode. Shown on the right are the recall patterns during the agents first exploration of a newly encountered map. Early in the episode, the agent simply recalls its own path. In almost the same situation later in the episode, the agent recalls entering the opponent base instead.
Figure S8: Shown is a side-by-side comparison of the internal representations learned from playing CTF by the FTW and Self-play + RS agents, visualised using t-SNE and single-neuron activations (see Figure 3 for more information). The self-play agent's representation is seen to be significantly less coherently clustered by game state, especially with respect to flag possession. Furthermore, it appears to have developed only two highly selective neurons, compared to four for the FTW agent.
[Panel (a): behaviour occurrence per episode, by behaviour cluster, with Hellinger distances from the human fingerprint: FTW 7K games 0.50, FTW 45K games 0.52, FTW 200K games 0.34, FTW 300K games 0.67, FTW 450K games 0.40, FTW w/o TH 0.48, Self-play + RS 0.47. Panel (b): behaviours over episode time for a randomly sampled episode played by the FTW 450K-game agent, with annotated situations: following teammate, opponent base camping, home base defence.]
Figure S9: (a) Shown is a collection of bar plots, one for each of 32 automatically discovered behaviour clusters, representing the number of frames per episode during which the behaviour has been observed, for the FTW agent at different points in training, the FTW agent without the temporal hierarchy (TH), the Self-play + RS agent, and human players, averaged over maps and episodes. The behavioural fingerprint changes significantly throughout training, and differs considerably between models with and without the temporal hierarchy. (b) Shown is the multivariate time series of active behaviour clusters during an example episode played by the trained FTW agent. Shown are three particular situations represented by the behaviour clusters: following your teammate, enemy base camping, and home base defence.
Figure S10: Shown are the network architectures of the agents used in this study. All agents have the same high-level architecture (a), using a decomposed policy (b) (see Section 5.2), value function (c), and convolutional neural network (CNN) visual feature extractor (d). The baseline agents and the ablated FTW without temporal hierarchy agent use an LSTM for recurrent processing (e). The FTW agent uses a temporal hierarchy for recurrent processing (f), which is composed of two variational units (g). All agents use reward prediction (h) and independent pixel control (i) auxiliary task networks.
Figure S11: (a) The differences between the environment interface offered to humans, agents, and bots. Humans act through an action space of 2049 possible rotations with discrete movement, receive RGB 800×600 pixel observations at 60 Hz with no added delay, act at a 60 Hz action resolution, and receive no auxiliary information. Agents act through 6 possible rotations (large and small look left, large and small look right, look up, look down) with discrete movement, receive RGB 84×84 pixel observations at 15 Hz with a 66.7 ms delay, act at a 15 Hz action resolution, and receive no auxiliary information. Bots act through 2049 possible rotations with discrete movement at 60 Hz with no added delay, and observe the ground-truth game state, with auxiliary information comprising the map layout and all player and object states. (b) Humans and agents are in addition bound by other sensorimotor limitations. To illustrate, we measure humans' and agents' response times when playing against a Bot 3 team on indoor procedural maps. Time delays are all measured from the first appearance of an opponent in an observation. Left: delay until the opponent becomes taggable (i.e. lies within a 10-degree visual cone). Middle: delay until an attempted tag (i.e. the opponent lies within a 10-degree visual cone and the tag action is emitted). Right: delay until a successful tag. We ignore situations where opponents are further than 1.5 map units away. The shaded region represents values which are impossible to obtain due to environment constraints. (c) Post-hoc effect of tagging accuracy on win probability against a Bot 3 team on indoor procedural maps. Accuracy is the number of successful tags divided by valid tag attempts. Agents have a trained accuracy of 80%, much higher than the 48% of humans. In order to measure the effect of decreased accuracy on the FTW agent, additional evaluation matches were performed in which a proportion of tag events were artificially discarded. As the agent's accuracy increases from below human level (40%) to 80%, the win probability increases from 90% to 100%, which represents a significant change in performance. (d) Post-hoc effect of successful tag time on win probability against a Bot 3 team on indoor procedural maps. In contrast to (c), tag actions were artificially discarded a proportion p of the time; different values of p result in the spectrum of response times reported. Values of p greater than 0.9 did not reduce response time, showing the limitations of p as a proxy. Note that in both (c) and (d) the agents were not retrained with these p values, so the obtained values are only a lower bound on the potential performance of the agents; this relies on the agents generalising outside of the physical environment they were trained in.