In this project, you need to implement one reinforcement learning algorithm (e.g., value iteration, policy iteration, or Q-learning) for a grid-world-based environment: Treasure Hunting.
Figure 1: Illustration of treasure hunting in a cube. (a) 3D grid world; smiley faces represent terminal states, which give reward 1. (b) Illustration of the transition when the intended action is RIGHT.
2 Treasure Hunting in a Cube
The environment is a 3D grid world. The MDP formulation is described as follows:
- State: a 3D coordinate indicating the agent's current position. The initial state is (0, 0, 0) and there is only one terminal state: (3, 3, 3).
- Action: The action space is (forward, backward, left, right, up, down). The agent selects one of these actions to navigate in the environment.
- Reward: The agent receives a reward of 1 when it arrives at the terminal state, and a reward of -0.1 otherwise.
- Transition: The intended movement happens with probability 0.6. With probability 0.1 each, the agent instead moves in one of the four directions perpendicular to the intended one. If the movement would collide with a wall, the agent stays in its current state. A sketch of these dynamics is given after this list.
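For concreteness, here is a minimal sketch of the transition dynamics described above. It is illustrative only: the function name sample_next_state, the tuple state representation, the axis convention (forward/backward along x, left/right along y, up/down along z), and the grid size of 4 (inferred from the coordinates (0, 0, 0) to (3, 3, 3)) are assumptions, not part of the provided environment.py.

import random

# Assumed axis convention: forward/backward move along x, left/right along y, up/down along z.
ACTION_DELTAS = {
    'forward':  (1, 0, 0),  'backward': (-1, 0, 0),
    'left':     (0, -1, 0), 'right':    (0, 1, 0),
    'up':       (0, 0, 1),  'down':     (0, 0, -1),
}

def sample_next_state(state, intended_action, size=4):
    # 0.6 for the intended move, 0.1 each for the four perpendicular moves
    intended = ACTION_DELTAS[intended_action]
    perpendicular = [a for a, d in ACTION_DELTAS.items()
                     if sum(x * y for x, y in zip(d, intended)) == 0]
    moves = [intended_action] + perpendicular
    probs = [0.6] + [0.1] * len(perpendicular)
    chosen = random.choices(moves, weights=probs, k=1)[0]
    dx, dy, dz = ACTION_DELTAS[chosen]
    x, y, z = state
    nx, ny, nz = x + dx, y + dy, z + dz
    # collision with a wall: stay in the same state
    if not all(0 <= c < size for c in (nx, ny, nz)):
        return state
    return (nx, ny, nz)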
3 Code Example
We provide the environment code environment.py and the example code test.py. In environment.py, we provide the environment class TreasureCube.
In test.py, we provide a random agent. You can modify it to implement your agent. You additionally need to install the numpy package to run the code.
from collections import defaultdict
import argparse
import random
import numpy as np
from environment import TreasureCube


# you need to implement your agent based on one RL algorithm
class RandomAgent(object):
    def __init__(self):
        # action names must match those expected by TreasureCube
        self.action_space = ['left', 'right', 'forward', 'backward', 'up', 'down']
        self.Q = defaultdict(lambda: np.zeros(len(self.action_space)))

    def take_action(self, state):
        action = random.choice(self.action_space)
        return action

    # implement your train/update function to update self.V or self.Q
    # you should pass arguments to the train function
    def train(self, state, action, next_state, reward):
        pass
In addition, test.py implements a test function. You should replace the random agent with your agent on the line agent = RandomAgent().
def test_cube(max_episode, max_step):
    env = TreasureCube(max_step=max_step)
    agent = RandomAgent()
    for episode_num in range(0, max_episode):
        state = env.reset()
        terminate = False
        t = 0
        episode_reward = 0
        while not terminate:
            action = agent.take_action(state)
            reward, terminate, next_state = env.step(action)
            episode_reward += reward
            # env.render()
            # print(f'step: {t}, action: {action}, reward: {reward}')
            t += 1
            agent.train(state, action, next_state, reward)
            state = next_state
        print(f'episode: {episode_num}, total_steps: {t}, episode reward: {episode_reward}')
If you use Q-learning, you can use the following parameters: discount factor = 0.99, learning rate = 0.5, and an exploration rate.
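As a starting point, here is a minimal sketch of a Q-learning agent that fits the RandomAgent interface above. It assumes epsilon-greedy exploration with a placeholder exploration rate of 0.1 (this handout does not fix that value) and that the states returned by the environment are hashable so they can index the defaultdict Q-table.

class QLearningAgent(object):
    def __init__(self, alpha=0.5, gamma=0.99, epsilon=0.1):
        self.action_space = ['left', 'right', 'forward', 'backward', 'up', 'down']
        self.Q = defaultdict(lambda: np.zeros(len(self.action_space)))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate (placeholder value, choose your own)

    def take_action(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if random.random() < self.epsilon:
            return random.choice(self.action_space)
        return self.action_space[int(np.argmax(self.Q[state]))]

    def train(self, state, action, next_state, reward):
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        a = self.action_space.index(action)
        td_target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state][a] += self.alpha * (td_target - self.Q[state][a])

Note that this sketch bootstraps from next_state even on the final step of an episode; if you also pass the termination flag into train, you can drop the bootstrapping term for terminal transitions.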
You can run the following command to generate output and test your agent.
python test.py --max_episode 500 --max_step 500
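If you want to see how those arguments could be wired up, the following is one possible __main__ block for test.py; the exact argument definitions in the provided file may differ, so treat this as an assumption.

if __name__ == '__main__':
    # hypothetical argument definitions; check the provided test.py for the actual ones
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_episode', type=int, default=500)
    parser.add_argument('--max_step', type=int, default=500)
    args = parser.parse_args()
    test_cube(max_episode=args.max_episode, max_step=args.max_step)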