Implement (in C++) a Q-learning [Watkins and Dayan, 1992] algorithm to solve a simulated mine-sweeping task.
Q-learning is used to control autonomous mine-sweeper agents in a two-dimensional world of mines and super-mines (Figure 1). Exact simulation parameters (number of mines and sweepers, iteration length, frame rate, etc.) are set in the CParams.ini file.
If a sweeper collides with a super-mine, both the sweeper and the super-mine are destroyed. Destroyed super-mines are re-spawned in the next iteration. If a sweeper collides with a mine, that mine is removed from the world and the sweeper's statistics are updated. The framework that sets up the world, rendering and update loop is supplied. It should not be necessary to modify the framework's core functionality (draw and update loops), aside from the Q-learning controller and its supporting methods.
Figure 1: Simulation world consisting of mine-sweepers, mines and super-mines. Mines and sweepers are spawned at random positions at the start of a simulation. The world wraps around on both the x and y axes (i.e. if a sweeper moves off one side of the world it reappears on the opposite side). Sweepers should learn to remove mines (green squares) and avoid super-mines (red squares).
Each of the controllers implements a carefully defined interface.
CQLearningController.cpp controls sweepers in a discrete (grid) environment and inherits from CDiscCollisionObject.cpp.
The controller CQLearningController.cpp overrides the following methods:
- virtual void Update(void); This method must call the Update method of the parent controller.
- virtual void InitializeLearningAlgorithm(void);
You will need to fill in the details of the methods for the controller and supporting classes, paying special attention to the comments in each file.
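The sketch below illustrates the override pattern described above. It is a minimal, self-contained example only: CBaseController is a stand-in for the framework's actual parent controller class, and the method bodies are placeholders showing where the Q-learning steps fit around the required call to the parent's Update.

```cpp
#include <iostream>

// Stand-in for the framework's parent controller (an assumption, not the real class).
struct CBaseController {
    virtual ~CBaseController() = default;
    virtual void InitializeLearningAlgorithm() {}
    virtual void Update() {
        std::cout << "parent update: move sweepers, detect collisions\n";
    }
};

struct CQLearningControllerSketch : CBaseController {
    void InitializeLearningAlgorithm() override {
        // Allocate the Q-table(s) here, one per sweeper, initialised to zero.
    }
    void Update() override {
        // 1. Choose an action for every sweeper from its Q-table.
        CBaseController::Update();  // 2. The parent's Update() must still be called.
        // 3. Observe each sweeper's reward and apply the Q-learning update.
    }
};

int main() {
    CQLearningControllerSketch controller;
    controller.InitializeLearningAlgorithm();
    controller.Update();
}
```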
Mine-sweeper statistics are drawn by the PlotStats() function when the F key is pressed. Graphs of the maximum and average number of mines swept are drawn as your mine-sweepers learn to sweep mines. The framework uses the Windows WIN32 windowing API and is bundled as a Visual Studio 2019 project. Both the Senior and the Games lab machines have Visual Studio installed.
Implementing Q-learning
For a detailed description of the Q-learning algorithm refer to the seminal Q-learning paper [Watkins and Dayan, 1992] and Chapter 13 of Machine Learning [Mitchell, 1997] (both are available on Vula)[1].
Note that the Q-learning algorithm keeps a table containing all possible state-action pairs and progressively updates the table according to the reward function. Since all possible state-action pairs have to be tracked, the world must consist of a finite number of grid positions. You can assume that each mine-sweeper can only move up, down, left or right at every step of the update cycle.
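As an illustration of what such a table might look like, the sketch below stores one Q-value per (grid cell, action) pair and selects actions epsilon-greedily. The grid dimensions, action encoding and function names are assumptions for this sketch, not part of the supplied framework.

```cpp
#include <array>
#include <random>
#include <vector>

// The four moves available to a sweeper at each update step.
enum Action { UP = 0, DOWN, LEFT, RIGHT, NUM_ACTIONS };

// Q[state][action], where state = y * gridX + x for a sweeper at grid cell (x, y).
using QTable = std::vector<std::array<double, NUM_ACTIONS>>;

QTable makeQTable(int gridX, int gridY) {
    // One row per grid cell, every Q-value initialised to zero.
    return QTable(gridX * gridY, std::array<double, NUM_ACTIONS>{});
}

// With probability epsilon explore a random action, otherwise exploit the best-known one.
int chooseAction(const QTable& Q, int state, double epsilon, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon)
        return std::uniform_int_distribution<int>(0, NUM_ACTIONS - 1)(rng);
    int best = UP;
    for (int a = 1; a < NUM_ACTIONS; ++a)
        if (Q[state][a] > Q[state][best]) best = a;
    return best;
}
```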
When implementing Q-learning you must decide on a suitable reward function and algorithm parameters, including the learning rate and discount factor. For example, sweepers could be rewarded for completing the task (clearing the world of mines) and penalized for finding nothing. Regardless of these choices, Q-learning must always run for 2000 iterations (simulation ticks), and each simulation must have mine-sweepers and mines initialised to random positions in the world.
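For reference, a single Q-learning update step could look like the sketch below. The learning rate, discount factor and reward values shown are example choices only, not values prescribed by the assignment, and should be tuned for the task.

```cpp
#include <algorithm>
#include <array>
#include <vector>

constexpr int NUM_ACTIONS = 4;  // up, down, left, right, as in the previous sketch
using QTable = std::vector<std::array<double, NUM_ACTIONS>>;

// Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
void updateQ(QTable& Q, int state, int action, double reward, int nextState,
             double alpha = 0.1, double gamma = 0.9) {
    double maxNext = *std::max_element(Q[nextState].begin(), Q[nextState].end());
    Q[state][action] += alpha * (reward + gamma * maxNext - Q[state][action]);
}

// Example reward scheme (an assumption, not prescribed by the assignment):
//   +100 for sweeping a mine, -100 for colliding with a super-mine, -1 for finding nothing.
```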
To verify that your agents have learned a suitable mine-sweeping behaviour, your Q-learned agent controllers must be evaluated on the following tests:
Test 1: Initialise the world with 1 mine-sweeper and 30 mines.
Test 2: Initialise the world with 1 mine-sweeper, 25 mines and 5 super-mines.
Test 3: Initialise the world with 1 mine-sweeper, 5 mines and 25 super-mines.
The success measure for each of these tests is mine-sweeper task performance, i.e. the average and maximum number of mines swept, where the average is calculated over 50 simulation runs (note that for each test simulation, mine-sweepers, mines and super-mines must be initialised to random locations in the world).
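A minimal sketch of how these statistics could be aggregated, assuming the total number of mines swept in each of the 50 runs has already been collected (the names below are illustrative, not part of the framework):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct TestResult { double average; int maximum; };

// minesSwept[i] is the total number of mines swept in simulation run i (50 entries expected).
TestResult summarise(const std::vector<int>& minesSwept) {
    TestResult result{};
    result.average = std::accumulate(minesSwept.begin(), minesSwept.end(), 0.0) /
                     static_cast<double>(minesSwept.size());
    result.maximum = *std::max_element(minesSwept.begin(), minesSwept.end());
    return result;
}
```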