CS 4/5789 – Programming Assignment 2


For this assignment, we will ask you to implement LQR (and DDP) in order to solve a cartpole environment.
While we are using a custom cartpole environment, it would be useful to go through the following OpenAI
Gym introduction.

It may also be useful to look up how gym environments work in general. Wen-Ding, a graduate TA of CS 4789 from SP21, created a tutorial that may help.

OpenAI Gym contains many environments that can be used for testing RL algorithms. Additionally, it allows users to create their own custom environments. All these environments share a standard API, making it easy to test different algorithms on different environments. As mentioned earlier, we are using a custom version¹ of the cartpole environment. The goal is to keep the pole upright by applying a variable force to the cart, either left or right, as shown below.

There are several files in this project.
• lqr.py: Contains the code to find the optimal control policy for LQR given the dynamics and cost matrices.

• finite_difference_method.py: Used to compute finite difference approximations. These will be used to help compute the cost and transition matrices from the environment, which we then pass to the LQR.

• test.py: Test case file. Run it with python test.py. It contains tests for different parts of the code.

• ddp.py: Contains the code that implements DDP.

• local_controller.py: Contains cartpole-environment-specific elements, including compute_local_expansion.

• cartpole.py: Contains the standard loop for interacting with the environment. Use --env DDP to run DDP or --env LQR to run local LQR. You can use the -h flag for more information about usage.

¹ In the standard cartpole environments, the only possible action is to apply a constant force to the left or right. In our environment, we allow a variable force to be applied.

There are 3 parts to this programming assignment.
• Section 3 goes through the derivation of the generalized LQR formulation. We break the proof down into intermediate steps and show the final results. You should try the proofs yourself and verify the results. You are responsible for understanding the proof for exams. There is no coding in this section.

• Section 4 describes a local linearization approach to stabilizing the cartpole. Section 4.3's task is to implement local linearization control to keep the cartpole vertically upright. You will need to complete the functions in
  – finite_difference_method.py
  – local_controller.py
  – lqr.py

• Section 5 goes through the derivation of DDP. After understanding the procedure, you are asked to complete functions in ddp.py.

Once the programming tasks are done, you will need to test and visualize the cartpole. You will turn in a report containing the test output and images. Complete instructions can be found at the end of Section 6.

Section 3: Generalized LQR Formulation

In class, we introduced the most basic formulation of LQR. In this section, we make the model slightly more general. We are interested in solving the following problem:
\[
\min_{\pi_0,\dots,\pi_{T-1}} \; \mathbb{E}\left[ \sum_{t=0}^{T-1} s_t^\top Q s_t + a_t^\top R a_t + s_t^\top M a_t + q^\top s_t + r^\top a_t + b \right] \tag{1}
\]
\[
\text{subject to } s_{t+1} = A s_t + B a_t + m, \quad a_t = \pi_t(s_t), \quad s_0 \sim \mu_0 \tag{2}
\]
Here we have $s \in \mathbb{R}^{n_s}$, $a \in \mathbb{R}^{n_a}$, $Q \in \mathbb{R}^{n_s \times n_s}$ ($Q$ is positive definite), $M \in \mathbb{R}^{n_s \times n_a}$, $q \in \mathbb{R}^{n_s}$, $R \in \mathbb{R}^{n_a \times n_a}$ ($R$ is positive definite), $r \in \mathbb{R}^{n_a}$, $b \in \mathbb{R}$, $A \in \mathbb{R}^{n_s \times n_s}$, $B \in \mathbb{R}^{n_s \times n_a}$, and $m \in \mathbb{R}^{n_s}$. We also always have the following matrix being positive definite:
\[
\begin{bmatrix} Q & M/2 \\ M^\top/2 & R \end{bmatrix} \tag{3}
\]

The difference between the above formulation and the one we had in class is that the cost function contains an additional second-order term $s_t^\top M a_t$, first-order terms $q^\top s_t$ and $r^\top a_t$, and a zeroth-order term $b$, and the dynamics contain a zeroth-order term $m$. The transitions here are deterministic.

In this problem, we will derive the expressions for $Q^\star_t$, $\pi^\star_t$, and $V^\star_t$ for the base and inductive cases as in class. The difference this time is that $V^\star_t$ and $\pi^\star_t$ can be thought of as complete quadratic and complete linear functions as follows:
\[
V^\star_t(s) = s^\top P_t s + y_t^\top s + p_t, \qquad \pi^\star_t(s) = K^\star_t s + k^\star_t,
\]
where $P_t \in \mathbb{R}^{n_s \times n_s}$ ($P_t$ is PSD), $y_t \in \mathbb{R}^{n_s}$, $p_t \in \mathbb{R}$, $K^\star_t \in \mathbb{R}^{n_a \times n_s}$, and $k^\star_t \in \mathbb{R}^{n_a}$.

The following sections outline
the derivation but do not go through all the steps. These are left as an exercise to students.
3.1 Base Case

3.1.1 $Q^\star_{T-1}$
\[
Q^\star_{T-1}(s, a) = s^\top Q s + a^\top R a + s^\top M a + q^\top s + r^\top a + b
\]
3.1.2 Extracting the policy

We can derive the expression for $\pi^\star_{T-1}$ as done in class. Then, we can write out the expressions for $K^\star_{T-1}$ and $k^\star_{T-1}$. If you've done things correctly, you should get the following:
\[
\pi^\star_{T-1}(s) = -\tfrac{1}{2} R^{-1} M^\top s - \tfrac{1}{2} R^{-1} r
\]
\[
\therefore \; K^\star_{T-1} = -\tfrac{1}{2} R^{-1} M^\top, \qquad k^\star_{T-1} = -\tfrac{1}{2} R^{-1} r
\]
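For reference, one way to verify this (a sketch, assuming $R$ is symmetric, as it is when obtained as a Hessian): since $R$ is positive definite, $Q^\star_{T-1}(s, \cdot)$ is strictly convex in $a$, so the minimizer is found by setting the gradient with respect to $a$ to zero:
\[
\nabla_a Q^\star_{T-1}(s, a) = 2 R a + M^\top s + r = 0
\;\Longrightarrow\;
a = -\tfrac{1}{2} R^{-1} M^\top s - \tfrac{1}{2} R^{-1} r = \pi^\star_{T-1}(s).
\]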
3.1.3 $V^\star_{T-1}$

Recall from class that $V^\star_t(s) = Q^\star_t(s, \pi^\star_t(s))$. Using $Q^\star_{T-1}$ and $\pi^\star_{T-1}$, we can derive the expression for $V^\star_{T-1}(s)$. We'll keep $K^\star_{T-1}$, $k^\star_{T-1}$ in the expression and then read off $P_{T-1}$, $y_{T-1}$, $p_{T-1}$. If you've done everything correctly, you should get the following:
\[
P_{T-1} = Q + K^{\star\top}_{T-1} R K^\star_{T-1} + M K^\star_{T-1}
\]
\[
y_{T-1}^\top = q^\top + 2 (k^\star_{T-1})^\top R K^\star_{T-1} + (k^\star_{T-1})^\top M^\top + r^\top K^\star_{T-1}
\]
\[
p_{T-1} = (k^\star_{T-1})^\top R k^\star_{T-1} + r^\top k^\star_{T-1} + b
\]
\[
V^\star_{T-1}(s) = s^\top P_{T-1} s + y_{T-1}^\top s + p_{T-1}
\]

These expressions can be simplified as shown below, but the unsimplified forms above will be useful when deriving the inductive case.
\[
P_{T-1} = Q - \tfrac{1}{4} M R^{-1} M^\top
\]
\[
y_{T-1}^\top = q^\top - \tfrac{1}{2} r^\top R^{-1} M^\top
\]
\[
p_{T-1} = b - \tfrac{1}{4} r^\top R^{-1} r
\]
Note that we technically need to show that $P_{T-1}$ is PSD before moving on. You can try to show this.
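One possible route (a sketch, not a required step): $P_{T-1}$ is exactly the Schur complement of $R$ in the block matrix (3), and the Schur complement of a positive definite block matrix is positive definite:
\[
\begin{bmatrix} Q & M/2 \\ M^\top/2 & R \end{bmatrix} \succ 0
\;\Longrightarrow\;
Q - \tfrac{M}{2} R^{-1} \tfrac{M^\top}{2} = Q - \tfrac{1}{4} M R^{-1} M^\top = P_{T-1} \succ 0.
\]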

3.2 Inductive step

For the inductive step, assume $V^\star_{t+1}(s) = s^\top P_{t+1} s + y_{t+1}^\top s + p_{t+1}$, where $P_{t+1}$ is PSD.

3.2.1 $Q^\star_t$

We can derive the expression for $Q^\star_t(s, a)$ following the steps done in class. If done correctly, you should get the following:
\[
Q^\star_t(s, a) = s^\top C s + a^\top D a + s^\top E a + f^\top s + g^\top a + h
\]
where
\[
C = Q + A^\top P_{t+1} A
\]
\[
D = R + B^\top P_{t+1} B
\]
\[
E = M + 2 A^\top P_{t+1} B
\]
\[
f^\top = q^\top + 2 m^\top P_{t+1} A + y_{t+1}^\top A
\]
\[
g^\top = r^\top + 2 m^\top P_{t+1} B + y_{t+1}^\top B
\]
\[
h = b + m^\top P_{t+1} m + y_{t+1}^\top m + p_{t+1}
\]
You should notice that $Q^\star_t(s, a)$ is similar in form to $Q^\star_{T-1}(s, a)$. Although we are not asking you to prove this, you should verify for yourself that $C$ and $D$ are positive definite matrices. Thus, we can use the exact same steps as in 3.1.2 and 3.1.3 to find $\pi^\star_t$ and $V^\star_t$.
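As a sanity check on where these coefficients come from (a sketch of the step done in class): substituting the deterministic dynamics into the Bellman backup gives
\[
Q^\star_t(s,a) = c(s,a) + V^\star_{t+1}(A s + B a + m)
= s^\top Q s + a^\top R a + s^\top M a + q^\top s + r^\top a + b
+ (A s + B a + m)^\top P_{t+1} (A s + B a + m) + y_{t+1}^\top (A s + B a + m) + p_{t+1},
\]
and grouping the quadratic, cross, linear, and constant terms in $(s, a)$ yields $C, D, E, f, g, h$ above.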
3.2.2 Extracting the policy

Following the same steps as in 3.1.2, we get
\[
\pi^\star_t(s) = -\tfrac{1}{2} D^{-1} E^\top s - \tfrac{1}{2} D^{-1} g
\]
\[
K^\star_t = -\tfrac{1}{2} D^{-1} E^\top, \qquad k^\star_t = -\tfrac{1}{2} D^{-1} g
\]
Please verify these for yourself.
3.2.3 $V^\star_t$

Using the same steps as in 3.1.3, we can derive $V^\star_t(s)$. Since we know
\[
Q^\star_t(s, a) = s^\top C s + a^\top D a + s^\top E a + f^\top s + g^\top a + h,
\]
we have
\[
P_t = C + K^{\star\top}_t D K^\star_t + E K^\star_t
\]
\[
y_t^\top = f^\top + 2 (k^\star_t)^\top D K^\star_t + (k^\star_t)^\top E^\top + g^\top K^\star_t
\]
\[
p_t = (k^\star_t)^\top D k^\star_t + g^\top k^\star_t + h
\]
\[
V^\star_t(s) = s^\top P_t s + y_t^\top s + p_t
\]
Section 4: Programming: Local Linearization Approach for Controlling CartPole

4.1 Setup of Simulated CartPole

The simulated CartPole has nonlinear deterministic dynamics $s_{t+1} = f(s_t, a_t)$ and a potentially non-quadratic cost function $c(s, a)$ that penalizes the deviation of the state from the balance point $(s^\star, a^\star)$, where $a^\star = 0$ and $s^\star$ represents the state of the CartPole where the pole is straight and the cart is in a pre-specified position. A state $s_t$ is a 4-dimensional vector defined as
\[
s_t = \begin{bmatrix} x_t \\ v_t \\ \theta_t \\ \omega_t \end{bmatrix}
\]
It consists of the position of the cart, the speed of the cart, the angle of the pole in radians, and the angular velocity of the pole. The action $a_t$ is a 1-dimensional vector corresponding to the force applied to the cart.

Throughout this section, we assume that we have black-box access to $c$ and $f$: we can feed any $(s, a)$ to $f$ and $c$ and get back $f(s, a)$ and $c(s, a)$, respectively. We do not know the analytical form of $f$ and $c$ (e.g., imagine that we are trying to control some complicated simulated humanoid robot; the simulator is the black-box $f$).

In this assignment, we will use our customized OpenAI Gym CartPole environment provided in the following repository. The environment is under the env directory. The goal is to finish the implementation of lqr.py, which contains a class to compute the locally linearized optimal policy of our customized CartPole environment. We also provide other files to help you get started.

4.1.1 TODO:
Please review the files for the programming assignment.
4.2 Finite Difference for Taylor Expansion

Since we do not know the analytical form of $f$ and $c$, we cannot directly compute analytical formulations for the Jacobians, Hessians, and gradients. However, given black-box access to $f$ and $c$, we can use finite differences to approximately compute these quantities.

Below we first explain the finite differencing approach for approximately computing derivatives. You will later use finite differencing to compute $A, B, Q, R, M, q, r$.

To illustrate finite differencing, assume that we are given a function $g : \mathbb{R} \to \mathbb{R}$. Given any $\alpha_0 \in \mathbb{R}$, to compute $g'(\alpha_0)$ we can perform the following process:
\[
\text{Finite Difference for derivative:}\quad \hat{g}'(\alpha_0) := \frac{g(\alpha_0 + \delta) - g(\alpha_0 - \delta)}{2\delta},
\]
for some $\delta \in \mathbb{R}^+$. Note that by the definition of the derivative, when $\delta \to 0^+$ the approximation approaches $g'(\alpha_0)$. In practice, $\delta$ is a tuning parameter: we do not want to set it too close to 0 due to potential numerical issues, and we also do not want to set it too large, as that gives a bad approximation of $g'(\alpha_0)$.

With $\hat{g}'(\alpha)$ as a black-box function, we can compute the second derivative using finite differencing on top of it:
\[
\text{Finite Difference for second derivative:}\quad \hat{g}''(\alpha_0) := \frac{\hat{g}'(\alpha_0 + \delta) - \hat{g}'(\alpha_0 - \delta)}{2\delta}.
\]
Note that to implement the second-derivative approximator $\hat{g}''(\alpha)$, we first implement the function $\hat{g}'(\alpha)$ and treat it as a black box inside the implementation of $\hat{g}''(\alpha)$. You can see that we need to query the black-box $g$ twice to compute $\hat{g}'(\alpha_0)$ and four times to compute $\hat{g}''(\alpha_0)$.

Similar ideas can be used to approximate gradients, Jacobians, and Hessians. To this end, we can use the provided CartPole simulator, which gives black-box access to $f$ and $c$, together with the goal balance point $(s^\star, a^\star)$, to compute the Taylor expansion around the balance point.
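As a concrete illustration of the idea above (a minimal sketch only; the function names and signatures here are illustrative and may not match the starter code in finite_difference_method.py), central differences extend naturally from scalars to gradients, Jacobians, and Hessians:

import numpy as np

def gradient(f, x, delta=1e-5):
    # f: R^n -> R (black box), x: (n,) array.
    # Central difference in each coordinate: (f(x + d e_i) - f(x - d e_i)) / (2 d).
    n = x.shape[0]
    grad = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = delta
        grad[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return grad

def jacobian(f, x, delta=1e-5):
    # f: R^n -> R^m (black box). Column j holds the central difference in coordinate j.
    n = x.shape[0]
    m = f(x).shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = delta
        J[:, j] = (f(x + e) - f(x - e)) / (2 * delta)
    return J

def hessian(f, x, delta=1e-5):
    # f: R^n -> R (black box). Apply finite differences to the gradient function,
    # treating gradient(f, .) itself as a black box, exactly as described above.
    return jacobian(lambda y: gradient(f, y, delta), x, delta)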

4.2.1 TODO:
We provide minimal barebones functions and some tests for the finite difference method in the file finite_difference_method.py. Complete the implementation of the gradient, Jacobian, and Hessian functions so that we can use them in the next section.

4.3 Local Linearization Approach for Nonlinear Control

Formally, consider our objective:
\[
\min_{\pi_0,\dots,\pi_{T-1}} \sum_{t=0}^{T-1} c(s_t, a_t) \tag{4}
\]
\[
\text{subject to: } s_{t+1} = f(s_t, a_t), \quad a_t = \pi_t(s_t), \quad s_0 \sim \mu_0; \tag{5}
\]
where $f : \mathbb{R}^{n_s} \times \mathbb{R}^{n_a} \to \mathbb{R}^{n_s}$.

In general the cost $c(s, a)$ could be anything. Here we focus on a special instance where we try to keep the system stable around some stable point $(s^\star, a^\star)$, i.e., our cost function penalizes the deviation from $(s^\star, a^\star)$: $c(s, a) = \rho(s - s^\star) + \rho(a - a^\star)$, where $\rho$ could be some distance metric such as the $\ell_1$ or $\ell_2$ distance.

To deal with the nonlinear $f$ and non-quadratic $c$, we use the linearization approach. Since the goal is to control the robot to stay at the pre-specified stable point $(s^\star, a^\star)$, it is reasonable to assume that the system is approximately linear around $(s^\star, a^\star)$ and the cost is approximately quadratic around $(s^\star, a^\star)$. Namely, we perform a first-order Taylor expansion of $f$ around $(s^\star, a^\star)$ and a second-order Taylor expansion of $c$ around $(s^\star, a^\star)$:
\[
f(s, a) \approx A(s - s^\star) + B(a - a^\star) + f(s^\star, a^\star),
\]
\[
c(s, a) \approx \frac{1}{2}
\begin{bmatrix} s - s^\star \\ a - a^\star \end{bmatrix}^\top
\begin{bmatrix} Q & M \\ M^\top & R \end{bmatrix}
\begin{bmatrix} s - s^\star \\ a - a^\star \end{bmatrix}
+
\begin{bmatrix} q \\ r \end{bmatrix}^\top
\begin{bmatrix} s - s^\star \\ a - a^\star \end{bmatrix}
+ c(s^\star, a^\star);
\]
Here $A$ and $B$ are Jacobians, i.e.,
\[
A \in \mathbb{R}^{n_s \times n_s}: \; A[i, j] = \frac{\partial f[i]}{\partial s[j]}\Big|_{(s^\star, a^\star)}, \qquad
B \in \mathbb{R}^{n_s \times n_a}: \; B[i, j] = \frac{\partial f[i]}{\partial a[j]}\Big|_{(s^\star, a^\star)},
\]
where $f[i](s, a)$ stands for the $i$-th entry of $f(s, a)$, and $s[i]$ stands for the $i$-th entry of the vector $s$. Similarly, for the cost function, we have Hessians and gradients as follows:
\[
Q \in \mathbb{R}^{n_s \times n_s}: \; Q[i, j] = \frac{\partial^2 c}{\partial s[i]\,\partial s[j]}\Big|_{(s^\star, a^\star)}, \qquad
R \in \mathbb{R}^{n_a \times n_a}: \; R[i, j] = \frac{\partial^2 c}{\partial a[i]\,\partial a[j]}\Big|_{(s^\star, a^\star)},
\]
\[
M \in \mathbb{R}^{n_s \times n_a}: \; M[i, j] = \frac{\partial^2 c}{\partial s[i]\,\partial a[j]}\Big|_{(s^\star, a^\star)}, \qquad
q \in \mathbb{R}^{n_s}: \; q[i] = \frac{\partial c}{\partial s[i]}\Big|_{(s^\star, a^\star)},
\]
\[
r \in \mathbb{R}^{n_a}: \; r[i] = \frac{\partial c}{\partial a[i]}\Big|_{(s^\star, a^\star)}.
\]
We are almost ready to compute a control policy using $A, B, Q, R, M, q, r$ together with the optimal control we derived for the system in Eq. 1. One potential issue is that the original cost function $c(s, a)$ may not even be convex; thus the Hessian matrix
\[
H := \begin{bmatrix} Q & M \\ M^\top & R \end{bmatrix}
\]
may not be positive definite. We apply a further approximation to make it positive definite. Denote the eigendecomposition of $H$ as $H = \sum_{i=1}^{n_s + n_a} \sigma_i v_i v_i^\top$, where the $\sigma_i$ are eigenvalues and the $v_i$ are the corresponding eigenvectors. We force $H$ to be positive definite as follows:
\[
H \leftarrow \sum_{i=1}^{n_s + n_a} \mathbb{1}\{\sigma_i > 0\}\, \sigma_i v_i v_i^\top + \lambda I, \tag{6}
\]
where $\lambda \in \mathbb{R}^+$ is some small positive real number for regularization, which ensures that after the approximation we get an $H$ that is positive definite with minimum eigenvalue lower bounded by $\lambda$.

Note that this Hessian matrix is slightly different from the matrix in (3). You should not view these matrices as the same as those in Eq. 1 yet; we still need to reformulate the problem in that form, as shown in Section 4.4.
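A minimal sketch of the fix in Eq. 6 (illustrative only; the helper name ensure_positive_definite is not part of the starter code):

import numpy as np

def ensure_positive_definite(H, lam=1e-5):
    # Eigendecomposition of the symmetric matrix H (Eq. 6).
    # Keep only the components with positive eigenvalues, then add lam * I
    # so the result is positive definite with smallest eigenvalue >= lam.
    H = (H + H.T) / 2                      # symmetrize to guard against round-off
    sigma, V = np.linalg.eigh(H)           # sigma: eigenvalues, V: eigenvectors (columns)
    sigma_clipped = np.where(sigma > 0, sigma, 0.0)
    return (V * sigma_clipped) @ V.T + lam * np.eye(H.shape[0])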
4.3.1 TODOs:
Review the concepts of gradient, Jacobian, Hessian, and positive definite matrices. With your implementation of the finite difference methods, complete the function compute_local_expansion in local_controller.py to calculate $A, B, Q, R, M, q, r$ as defined above.
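One way the pieces above might fit together (a sketch under stated assumptions: it reuses the hypothetical gradient/jacobian/hessian helpers from the Section 4.2 sketch, treats f and c as black-box callables, and is not the required signature of compute_local_expansion):

import numpy as np

def compute_local_expansion_sketch(f, c, s_star, a_star, delta=1e-5):
    # f: R^{ns} x R^{na} -> R^{ns}, c: R^{ns} x R^{na} -> R (both black boxes).
    ns, na = s_star.shape[0], a_star.shape[0]

    # Jacobians of the dynamics at (s*, a*): A[i, j] = df[i]/ds[j], B[i, j] = df[i]/da[j].
    A = jacobian(lambda s: f(s, a_star), s_star, delta)
    B = jacobian(lambda a: f(s_star, a), a_star, delta)

    # Hessian and gradient of the cost with respect to the stacked vector z = [s; a],
    # then read off the Q, M, R blocks and the q, r pieces.
    z_star = np.concatenate([s_star, a_star])
    c_z = lambda z: c(z[:ns], z[ns:])
    H = hessian(c_z, z_star, delta)
    g = gradient(c_z, z_star, delta)
    Q, M, R = H[:ns, :ns], H[:ns, ns:], H[ns:, ns:]
    q, r = g[:ns], g[ns:]
    return A, B, Q, R, M, q, r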

4.4 Computing Locally Optimal Control

With $A, B, Q, R, M, q, r$ computed, let us first check whether
\[
H = \begin{bmatrix} Q & M \\ M^\top & R \end{bmatrix}
\]
is a positive definite matrix (if it is not PD, the LQR formulation may run into the case where a matrix inverse does not exist, and numerically you will observe NaNs as well). If not, you must use the trick in Eq. 6 to modify $H$. We are now ready to solve the following linear quadratic system:
\[
\min_{\pi_0,\dots,\pi_{T-1}} \sum_{t=0}^{T-1}
\frac{1}{2}
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}^\top
H
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+
\begin{bmatrix} q \\ r \end{bmatrix}^\top
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+ c(s^\star, a^\star), \tag{7}
\]
\[
\text{subject to } s_{t+1} = A s_t + B a_t + m, \quad a_t = \pi_t(s_t), \quad s_0 \sim \mu_0. \tag{8}
\]

With some rearranging of terms, we can re-write the above program in the format of Eq. 1, as shown below. We can expand the cost function as
\begin{align*}
&\frac{1}{2}
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}^\top
H
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+
\begin{bmatrix} q \\ r \end{bmatrix}^\top
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+ c(s^\star, a^\star) \\
&= \frac{1}{2}
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}^\top
\begin{bmatrix} Q & M \\ M^\top & R \end{bmatrix}
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+
\begin{bmatrix} q \\ r \end{bmatrix}^\top
\begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+ c(s^\star, a^\star) \\
&= \frac{1}{2}(s_t - s^\star)^\top Q (s_t - s^\star)
+ \frac{1}{2}\, 2 (s_t - s^\star)^\top M (a_t - a^\star)
+ \frac{1}{2}(a_t - a^\star)^\top R (a_t - a^\star)
+ \begin{bmatrix} q \\ r \end{bmatrix}^\top \begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+ c(s^\star, a^\star) \\
&= \frac{1}{2}\big(s_t^\top Q s_t - 2 s^{\star\top} Q s_t + s^{\star\top} Q s^\star\big)
+ \big(s_t^\top M a_t - a^{\star\top} M^\top s_t - s^{\star\top} M a_t + s^{\star\top} M a^\star\big) \\
&\quad + \frac{1}{2}\big(a_t^\top R a_t - 2 a^{\star\top} R a_t + a^{\star\top} R a^\star\big)
+ \begin{bmatrix} q \\ r \end{bmatrix}^\top \begin{bmatrix} s_t - s^\star \\ a_t - a^\star \end{bmatrix}
+ c(s^\star, a^\star) \\
&= \frac{1}{2} s_t^\top Q s_t - s^{\star\top} Q s_t + \frac{1}{2} s^{\star\top} Q s^\star
+ s_t^\top M a_t - a^{\star\top} M^\top s_t - s^{\star\top} M a_t + s^{\star\top} M a^\star
+ \frac{1}{2} a_t^\top R a_t - a^{\star\top} R a_t + \frac{1}{2} a^{\star\top} R a^\star \\
&\quad + q^\top s_t - q^\top s^\star + r^\top a_t - r^\top a^\star + c(s^\star, a^\star) \\
&= s_t^\top \Big(\frac{Q}{2}\Big) s_t + a_t^\top \Big(\frac{R}{2}\Big) a_t + s_t^\top M a_t
+ \big(q^\top - s^{\star\top} Q - a^{\star\top} M^\top\big) s_t
+ \big(r^\top - a^{\star\top} R - s^{\star\top} M\big) a_t \\
&\quad + \Big(c(s^\star, a^\star) + \frac{1}{2} s^{\star\top} Q s^\star + \frac{1}{2} a^{\star\top} R a^\star + s^{\star\top} M a^\star - q^\top s^\star - r^\top a^\star\Big)
\end{align*}
Next, let us expand the transition function $f$. As in Section 4.3, we perform a first-order Taylor expansion of $f$ around $(s^\star, a^\star)$, so our transition is
\begin{align*}
s_{t+1} &= A(s_t - s^\star) + B(a_t - a^\star) + f(s^\star, a^\star) \\
&= A s_t - A s^\star + B a_t - B a^\star + f(s^\star, a^\star) \\
&= A s_t + B a_t + \big(f(s^\star, a^\star) - A s^\star - B a^\star\big)
\end{align*}
Thus, let us define the following variables:
\begin{align*}
Q_2 &= \frac{Q}{2}, \qquad R_2 = \frac{R}{2}, \\
q_2^\top &= q^\top - s^{\star\top} Q - a^{\star\top} M^\top, \\
r_2^\top &= r^\top - a^{\star\top} R - s^{\star\top} M, \\
b &= c(s^\star, a^\star) + \frac{1}{2} s^{\star\top} Q s^\star + \frac{1}{2} a^{\star\top} R a^\star + s^{\star\top} M a^\star - q^\top s^\star - r^\top a^\star, \\
m &= f(s^\star, a^\star) - A s^\star - B a^\star.
\end{align*}
Thus, we can rewrite our formulation as
\[
\min_{\pi_0,\dots,\pi_{T-1}} \sum_{t=0}^{T-1} s_t^\top Q_2 s_t + a_t^\top R_2 a_t + s_t^\top M a_t + q_2^\top s_t + r_2^\top a_t + b
\]
\[
\text{subject to } s_{t+1} = A s_t + B a_t + m, \quad a_t = \pi_t(s_t), \quad s_0 \sim \mu_0.
\]
This is exactly the formulation from Section 3. Thus, we use the results derived there to obtain the optimal policies.
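A compact sketch of this reparameterization in code (illustrative only; the function and variable names are hypothetical and not part of the starter code; Q and R are assumed symmetric, as they are when obtained as Hessians):

import numpy as np

def reparameterize(A, B, Q, R, M, q, r, s_star, a_star, f_star, c_star):
    # Rewrites the cost/dynamics expanded around (s*, a*) in the form of Eq. 1.
    # f_star = f(s*, a*) and c_star = c(s*, a*) come from one black-box query each.
    Q2 = Q / 2
    R2 = R / 2
    q2 = q - Q @ s_star - M @ a_star            # column form of q^T - s*^T Q - a*^T M^T
    r2 = r - R @ a_star - M.T @ s_star          # column form of r^T - a*^T R - s*^T M
    b = (c_star + 0.5 * s_star @ Q @ s_star + 0.5 * a_star @ R @ a_star
         + s_star @ M @ a_star - q @ s_star - r @ a_star)
    m = f_star - A @ s_star - B @ a_star
    return Q2, R2, M, q2, r2, b, A, B, m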

4.4.1 TODO:
Please complete the function lqr in lqr.py using the formulation derived above. You will need to compute the optimal policies for time steps $T-1, \dots, 0$. At timestep $t$, we find $\pi^\star_t$ using $Q^\star_t$. A sketch of the backward recursion is shown below.
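A minimal sketch of that backward recursion, directly transcribing the Section 3 formulas (illustrative only; it assumes the Eq. 1 parameters are already in the generalized form, e.g. via the hypothetical reparameterization sketched above, and it returns the gains K_t, k_t for each step):

import numpy as np

def lqr_backward(Q, R, M, q, r, b, A, B, m, T):
    # Backward recursion from Section 3: at t = T-1 use (Q, R, M, q, r, b) directly;
    # for t < T-1 replace them with (C, D, E, f, g, h) built from V*_{t+1}.
    Ks, ks = [None] * T, [None] * T

    # Coefficients of Q*_t, initialized for the base case t = T-1.
    C, D, E, f, g, h = Q, R, M, q, r, b
    for t in reversed(range(T)):
        D_inv = np.linalg.inv(D)
        K = -0.5 * D_inv @ E.T                 # K*_t = -1/2 D^{-1} E^T
        k = -0.5 * D_inv @ g                   # k*_t = -1/2 D^{-1} g
        Ks[t], ks[t] = K, k

        # V*_t coefficients (Section 3.2.3), stored in column form.
        P = C + K.T @ D @ K + E @ K
        y = f + 2 * (K.T @ D @ k) + E @ k + K.T @ g
        p = k @ D @ k + g @ k + h

        if t > 0:
            # Q*_{t-1} coefficients (Section 3.2.1), also in column form.
            C = Q + A.T @ P @ A
            D = R + B.T @ P @ B
            E = M + 2 * A.T @ P @ B
            f = q + 2 * (A.T @ P @ m) + A.T @ y
            g = r + 2 * (B.T @ P @ m) + B.T @ y
            h = b + m @ P @ m + y @ m + p
    return Ks, ks

The resulting time-varying affine policy is a_t = Ks[t] @ s_t + ks[t].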

Section 5: Differential Dynamic Programming (DDP)

In the LQR setting of Eq. 1, where the cost is written with respect to some fixed matrices and vectors (i.e., $Q, R, M, q, r, b$), we are able to derive closed-form solutions for the policy. So far, we used this insight to approximate nonlinear control as LQR. However, recall that we originally derived the LQR policy using dynamic programming. What if we apply the idea of approximation to the dynamic programming steps directly? For our policy at time step $i$, we care about the cost-to-go, the accumulated sum of the costs over future steps. The optimal value at time $i$ is
\[
V^\star_i(s) \equiv \min_{\pi_i,\dots,\pi_{T-1}} \sum_{t=i}^{T-1} c(s_t, a_t) \quad \text{s.t.} \quad s_i = s, \; s_{t+1} = f(s_t, a_t), \; a_t = \pi_t(s_t).
\]

Using the finite-horizon Bellman Optimality Equation (i.e., the principle behind dynamic programming), we can reduce this minimization over the entire sequence of controls to a minimization over one action:
\[
V^\star_i(s) = \min_a \left[ c(s, a) + V^\star_{i+1}(f(s, a)) \right] = \min_a Q^\star_i(s, a).
\]
Let’s try to do a second order approximation to Q⋆
i
(s, a) directly. While you are responsible for understanding the motivation and basic ideas (quadratic approximation, dynamic programming) behind DDP,
you are not required understand the technical details of the following derivation precisely. We will approximate around a point s0, a0. Let s = s0 + δs and a = a0 + δa be small perturbations. Then,
Q

i
(s, a) − Q

i
(s0, a0) = c(s0 + δs, a0 + δa) − c(s0, a0) + V

i+1(f(s0 + δs, a0 + δa)) − V

i+1(f(s0, a0)) .

Approximating around δs = 0, δa = 0 to second order results in
Q

i
(s, a) − Q

i
(s0, a0) ≈
1
2


1
δs
δa


⊤ 

0 Q⊤
s Q⊤
a
Qs Qss Qsa
Qa Qas Qaa




1
δs
δa

 . (9)
The expansion coefficients are
\[
Q_s = c_s + f_s^\top V_s^{i+1}, \qquad Q_s \in \mathbb{R}^{n_s}
\]
\[
Q_a = c_a + f_a^\top V_s^{i+1}, \qquad Q_a \in \mathbb{R}^{n_a}
\]
\[
Q_{ss} = c_{ss} + f_s^\top V_{ss}^{i+1} f_s + V_s^{i+1} \cdot f_{ss}, \qquad Q_{ss} \in \mathbb{R}^{n_s \times n_s}
\]
\[
Q_{aa} = c_{aa} + f_a^\top V_{ss}^{i+1} f_a + V_s^{i+1} \cdot f_{aa}, \qquad Q_{aa} \in \mathbb{R}^{n_a \times n_a}
\]
\[
Q_{as} = c_{as} + f_a^\top V_{ss}^{i+1} f_s + V_s^{i+1} \cdot f_{as}, \qquad Q_{as} \in \mathbb{R}^{n_a \times n_s}
\]
where $c_s$ and $c_a$ denote the gradients of the cost function; $c_{ss}$, $c_{aa}$, $c_{as}$ the Hessians of the cost function; $f_s$ and $f_a$ the Jacobians of the transition function; $f_{ss}$, $f_{aa}$, and $f_{as}$ the Jacobians of the Jacobians of the transition function (these are 3D tensors!); and $V_s^{i+1}$ and $V_{ss}^{i+1}$ the coefficients of the expanded value function at time step $i+1$, which will be defined below.

Our policy chooses $\delta a$ to minimize the quadratic approximation of $Q$ in Eq. 9. With a simple computation, we get
\[
\delta a^\star = -Q_{aa}^{-1}(Q_a + Q_{as}\,\delta s).
\]
We define $k_i = -Q_{aa}^{-1} Q_a$ and $K_i = -Q_{aa}^{-1} Q_{as}$. Plugging the policy into Eq. 9, we can derive (iteratively, as in dynamic programming) the quadratic approximation of the value function at timestep $i$:
\[
V^\star_i(s) - V^\star_i(s_0) \approx -\frac{1}{2} Q_a^\top Q_{aa}^{-1} Q_a
\]
\[
V_s^i = Q_s - Q_{sa} Q_{aa}^{-1} Q_a
\]
\[
V_{ss}^i = Q_{ss} - Q_{sa} Q_{aa}^{-1} Q_{as}
\]
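A minimal sketch of this backward pass (illustrative only; it assumes the per-timestep derivatives cs, ca, css, caa, cas, fs, fa have already been computed along the current trajectory, and it drops the second-order dynamics terms fss, faa, fas, a common iLQR-style simplification):

import numpy as np

def ddp_backward_pass(cs, ca, css, caa, cas, fs, fa):
    # cs[i], ca[i]: cost gradients at (s_i, a_i); css[i], caa[i], cas[i]: cost Hessians;
    # fs[i], fa[i]: dynamics Jacobians. All lists of length T (one entry per timestep).
    T = len(cs)
    ns = cs[0].shape[0]
    ks, Ks = [None] * T, [None] * T

    # Terminal value expansion: zero beyond the horizon.
    V_s, V_ss = np.zeros(ns), np.zeros((ns, ns))
    for i in reversed(range(T)):
        # Expansion coefficients of Q*_i (second-order dynamics terms omitted).
        Q_s = cs[i] + fs[i].T @ V_s
        Q_a = ca[i] + fa[i].T @ V_s
        Q_ss = css[i] + fs[i].T @ V_ss @ fs[i]
        Q_aa = caa[i] + fa[i].T @ V_ss @ fa[i]
        Q_as = cas[i] + fa[i].T @ V_ss @ fs[i]

        Q_aa_inv = np.linalg.inv(Q_aa)
        ks[i] = -Q_aa_inv @ Q_a            # k_i = -Q_aa^{-1} Q_a
        Ks[i] = -Q_aa_inv @ Q_as           # K_i = -Q_aa^{-1} Q_as

        # Value expansion at timestep i (note Q_sa = Q_as^T).
        V_s = Q_s - Q_as.T @ Q_aa_inv @ Q_a
        V_ss = Q_ss - Q_as.T @ Q_aa_inv @ Q_as
    return ks, Ks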
The above procedure computes the control gains and value expansions from $T-1$ down to $0$ in a "backward pass." Once the backward pass is complete, we use the policy to roll out a new trajectory, which gives us new states and actions $(s_0, a_0)$ to linearize around. Formally, the forward pass calculates:
\[
\hat{s}_0 = s_0, \qquad \hat{a}_i = a_i + k_i + K_i(\hat{s}_i - s_i), \qquad \hat{s}_{i+1} = f(\hat{s}_i, \hat{a}_i).
\]
The algorithm iterates until the forward pass and the backward pass converge.
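A matching sketch of the forward pass (illustrative only; f is the black-box dynamics, and s_traj, a_traj are the states and actions of the previous rollout that the backward pass linearized around):

def ddp_forward_pass(f, s_traj, a_traj, ks, Ks):
    # Roll out the updated policy a_hat_i = a_i + k_i + K_i (s_hat_i - s_i),
    # starting from the same initial state as the previous trajectory.
    T = len(a_traj)
    s_hat = [s_traj[0]]
    a_hat = []
    for i in range(T):
        a_i = a_traj[i] + ks[i] + Ks[i] @ (s_hat[i] - s_traj[i])
        a_hat.append(a_i)
        s_hat.append(f(s_hat[i], a_i))
    return s_hat, a_hat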

5.1 DDP Implementation
We will implement DDP to complete a more challenging task: making the cartpole swing up.

5.1.1 TODO:
There are two main sections that need to be completed in ddp.py: forward(), which is the forward dynamics, and the main driver code in train(). We have provided a small starter implementation with the number of timesteps that it took our solution to swing up the cartpole, but if yours takes fewer, feel free to change it.

6.1 LQR Stress Tests

Given policies $\pi_0, \dots, \pi_{T-1}$ computed in the previous sections, we will evaluate their performance by executing them on the real system $f$ and the real cost $c$. To generate a $T$-step trajectory, we first sample $s_0 \sim \mu_0$, then take $a_t = \pi_t(s_t)$ and call the black-box $f$ to compute $s_{t+1} = f(s_t, a_t)$, until $t = T-1$. This gives a trajectory $\tau = \{s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1}\}$. The total cost of the trajectory is $C(\tau) := \sum_{t=0}^{T-1} c(s_t, a_t)$; a rollout computing it is sketched below.
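A small sketch of such a rollout (illustrative only; f, c, and the representation of the policy as per-timestep gains are assumptions, not the interface of cartpole.py):

def rollout_cost(f, c, s0, Ks, ks, T):
    # Execute the time-varying affine policy a_t = K_t s_t + k_t on the real
    # dynamics f and accumulate the real cost c along the trajectory.
    s, total_cost = s0, 0.0
    for t in range(T):
        a = Ks[t] @ s + ks[t]
        total_cost += c(s, a)
        s = f(s, a)
    return total_cost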

In the main file cartpole.py, we provide several different initialization states. These initial states are ordered based on the distance from their means to the goal $(s^\star, a^\star)$. Intuitively, we should expect our controller to perform worse when the initial states are far away from the Taylor-expansion point $(s^\star, a^\star)$, as our linear and quadratic approximations become less accurate. Test your computed policies on these different initializations and report the performance for all of them.

6.1.1 TODO
Run cartpole.py --env LQR and take a screenshot of the output costs printed by the LQR version. The costs can vary depending on the versions of the packages used (even with the environment being seeded). However, your costs should fall in the following ranges:
• Case 0 < 10
• Case 1 < 200
• Case 2 < 1000
• Case 3 < 2000
• Case 4 < 5000
• Case 5 = ∞
• Case 6 = ∞
• Case 7 = ∞
Please note that these are loose upper bounds, and the actual costs for a correct implementation may be much lower than the bounds listed (apart from Cases 5-7).
Please briefly explain why you think the costs are different for each case.

6.2 DDP Tests
Likewise, we also provide different initial states for the swing-up task, ordered so that their distances from the goal state are increasing. This time, since DDP no longer depends on the assumption of local linearization, we should expect our controller to be able to complete the swing-up task even far away from the goal state.

6.2.1 TODO
Run cartpole.py --env DDP --init_s far and describe the behavior of the trained cartpole. Since the training process for DDP takes longer, we ask you to manually test the initial states one by one. You can simply change --init_s to one of {far, close, exact}.

Observe the generated videos. For each test case you run: how does DDP perform with the different initial states?
Compare the LQR policy and the DDP policy: how do they perform on different initial states? What is the reason behind their differences?

6.3 Submission
Please submit a zip file containing your implementation as follows:

YOUR_NET_ID/
    Answers.pdf
    README.md
    __init__.py
    cartpole.py
    local_controller.py
    finite_difference_method.py
    lqr.py
    ddp.py
    test.py
    env/
        __init__.py
        cartpole_lqr_env.py
        cartpole_ilqr_env.py

where Answers.pdf contains the screenshot of the costs of the cartpole simulations, the description of what happens for the three initial states in the cartpole swing-up task, and your explanations for the three highlighted research questions above.
