In this lab, we are required to work with Reinforcement Learning, a newer Machine Learning technique that can train an agent in an environment. The agent will navigate the classic 4x4 grid-world environment to a specific goal. The agent will learn an optimal policy through Q-Learning, which will allow it to take actions to reach the goal while avoiding the boundaries. We use a platform called OpenAI Gym to facilitate the construction of the agent and the environment.
1. Introduction
1.1 What is Reinforcement Learning?
Reinforcement Learning (RL) is one of the hottest research topics in the field of modern Artificial Intelligence, and its popularity is only growing. RL is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
Though both supervised and reinforcement learning use a mapping between input and output, unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior.
Compared to unsupervised learning, reinforcement learning is different in terms of its goals. While the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the total cumulative reward of the agent. The figure below illustrates the action-reward feedback loop of a generic RL model.
2. OpenAI Gym
Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent and is compatible with any numerical computation library, such as TensorFlow or Theano. The gym library is a collection of test-problem environments that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.
3. The Environment
The environment is a 4 by 4 grid world described by two things: the grid itself and the agent. It has an observation space, which is defined as a vector of elements. This can be particularly useful for environments which return measurements, such as robotic environments.
The core gym interface is env, which is the unified environment interface. The following methods will be quite helpful to us:
env.reset: Resets the environment and returns a random initial state.
env.step(action): Steps the environment forward by one timestep. Returns:
- observation: observations of the environment
- reward: whether your action was beneficial or not
- done: indicates whether the episode has finished, i.e., the agent has reached the goal
- info: additional info such as performance and latency, for debugging purposes
env.render: Renders one frame of the environment (helpful in visualizing the environment).
We have an Action Space of size 4:
- 0 = down
- 1 = up
- 2 = right
- 3 = left
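To make this interface concrete, here is a minimal interaction sketch. It assumes a FrozenLake-style 4x4 Gym environment; the environment id and its built-in action numbering may differ from the mapping listed above, so treat the names as illustrative rather than as our exact setup.

import gym

# Illustrative only: a 4x4 grid-world similar to ours. The exact environment
# id and action numbering may differ from the lab's own setup.
env = gym.make("FrozenLake-v0")

state = env.reset()      # reset the environment and get the initial state
done = False
total_reward = 0

while not done:
    env.render()                                    # draw one frame of the grid
    action = env.action_space.sample()              # random action for now
    state, reward, done, info = env.step(action)    # advance one timestep
    total_reward += reward

print("Episode finished with total reward:", total_reward)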
5. The Multi-Armed Bandit Problem (Epsilon-Greedy Algorithm)
The multi-armed bandit problem is a classic reinforcement learning example where we are given a slot machine with n arms (bandits), each arm having its own rigged probability distribution of success. Pulling any one of the arms gives you a stochastic reward of either R=+1 for success or R=0 for failure. Our objective is to pull the arms one by one in sequence such that we maximize our total reward collected in the long run.
The non-triviality of the multi-armed bandit problem lies in the fact that we (the agent) cannot access the true bandit probability distributions; all learning is carried out via trial-and-error and value estimation. So the question is: what strategy should we use to maximize the total reward collected in the long run? This is our goal for the multi-armed bandit problem, and having such a strategy would prove very useful in many real-world situations where one would like to select the best bandit out of a group of bandits.
In this project, we approach the multi-armed bandit problem with the classical reinforcement learning technique of an epsilon-greedy agent, using a learning framework of reward-average sampling to compute the action-value Q(a), which helps the agent improve its future action decisions for long-term reward maximization.
In a nutshell, the epsilon-greedy agent is a hybrid of (1) a completely exploratory agent and (2) a completely greedy agent. In the multi-armed bandit problem, a completely exploratory agent will sample all the bandits at a uniform rate and acquire knowledge about every bandit over time; the caveat of such an agent is that this knowledge is never utilized to help it make better future decisions. At the other extreme, a completely greedy agent will choose a bandit and stick with its choice for the rest of eternity; it will not make an effort to try out other bandits in the system to see whether they have better success rates that would help it maximize its long-term rewards, so it is very narrow-minded.
How do we perform this in our code? We do it by assigning a variable called epsilon, which switches between the exploratory and the greedy agent. We choose a random number between 0 and 1: if this number is less than epsilon, we tell the agent to explore; if it is greater, we tell the agent to be greedy. This tactic is used in the policy part of our code, as sketched below.
Q-Learning Algorithm
Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.
In our environment, we have the reward table that the agent will learn from. It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.
The values stored in the Q-table are called Q-values, and they map to a (state, action) combination. A Q-value for a particular state-action combination represents the quality of an action taken from that state. Better Q-values imply better chances of getting greater rewards.
Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:
Q(state, action) ← (1 − α) Q(state, action) + α (reward + γ max_a Q(next state, all actions))
Where:
- α (alpha) is the learning rate (0 < α ≤ 1). Just like in supervised learning settings, α is the extent to which our Q-values are being updated in every iteration.
- γ (gamma) is the discount factor (0 ≤ γ ≤ 1) and determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
What is this saying? We are assigning (←), or updating, the Q-value of the agent's current state and action by first taking a weight (1 − α) of the old Q-value, then adding the learned value. The learned value is a combination of the reward for taking the current action in the current state, and the discounted maximum reward from the next state we will be in once we take the current action. Basically, we are learning the proper action to take in the current state by looking at the reward for the current state/action combo and the maximum rewards for the next state. This will eventually cause our agent to follow the route with the best rewards strung together.
The Q-value of a state-action pair is the sum of the instant reward and the discounted future reward (of the resulting state). The way we store the Q-values for each state and action is through a Q-table.
- Q-Table
The Q-table is a matrix with a row for every state and a column for every action. It is first initialized to 0, and its values are then updated as training proceeds.
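As an illustrative sketch of the Q-table and the update equation above (the values of alpha and gamma here are placeholders, not necessarily the ones used in our training runs):

import numpy as np

# One row per state, one column per action: a 4x4 grid gives 16 states,
# and our action space has 4 actions. The table starts at zero.
q_table = np.zeros((16, 4))

alpha = 0.1    # learning rate (illustrative value)
gamma = 0.9    # discount factor (illustrative value)

def q_update(state, action, reward, next_state):
    # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)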
- Epsilon
We want the odds of the agent exploring to decrease as time goes on. One way to do this is by updating epsilon at every move of the agent during the training phase. Choosing a decay rate was the difficult part: we want epsilon not to decay too soon, so the agent has time to explore. The way I went about this is that I want the agent to be able to explore the region for at least the size of the grid in movements, so within 25 movements the agent should still have a good chance of being in the exploratory phase. I chose decay = 1/1.01 because (1/1.01)^25 ≈ 0.78, which is a good chance of still being in the exploratory phase within 0-25 moves. After 50 moves epsilon drops to about 0.61, which gives the agent good odds of both exploring and acting greedily.
Results and Charts
Here we plot epsilon vs. episode. We see an exponential decay, as expected from applying the decay factor at every move.
Then we plot total reward vs. episode. Here we see that the total reward slowly goes up and then plateaus after hitting 8. This is because 8 steps is the optimal path for our algorithm.
CONCLUSION
Our reinforcement learning algorithm did fairly well. Our agent was able to learn the environment effectively, reaching its goal in 8 steps.