Please finish reading everything on this page before you start.
• Open book, web, and everything. No ChatGPT. Do all the work completely by yourself, without any discussion with others. Any form of communication related to the course before the submission deadline will be considered an academic integrity violation.
• For clarification questions, ask me via direct message on Slack (@Sean) and do not post anything in public channels. If any general correction/clarification is needed, I’ll make announcements in the #general channel on Slack.
• Be assured that I will read every message. If you do not see my reply within an hour or so (when my Slack status is green), it means my reply is “no comment” because the answer will be clear after re-reading the question or going through the class materials.
• Submit the following two items by the deadline:
1. Submit photos or PDFs of your written/typed answers on Gradescope (Assignment: “Final”).
2. Submit a video recording of yourself briefly explaining all your answers, at the following link: https://www.dropbox.com/request/MlYRr5sFww1kj9BQcbTu.
• Make sure to leave enough time for uploading, especially if your network is spotty. You need to show your face and a photo ID at the beginning of the video (feel free to turn off the camera and only show the answer sheets after that). If you don’t have a camera, message me.
• There is no strict rule about how long the video should be, but do not make it too long (20 min should be more than enough). It just needs to show that you get the right ideas for each question you answered. Think of it as an interview in which you are trying to convince the examiner of what you understand, without overdoing it.
• The grading will be entirely based on the written part. The video is used to ensure academic integrity and also to help you get partial credit. In particular, if a question explicitly asks you to explain something, you need to explain it in the written part (not just in the video); otherwise that question will be considered unanswered.
• Make sure the file name of your video contains your first name, last name, and PID. I don’t know if Dropbox sends you a confirmation, but for your own record you can always take a screenshot of the page indicating that you have successfully uploaded your files. You can upload multiple times, but we do not guarantee that we will only check the last one, so try your best to send only a final version of the video.
• Zoom is highly recommended for doing your recording. If you recorded using your phone or in some other way, try compressing the video into mp4 format (a relatively low resolution is fine) and avoid making the file too large. If you find yourself uploading gigabytes of video, you should change the file.
• Your overall letter grade for the course will only depend on the sum of all raw points you obtained in the class; there is no notion of “percentage” in this class. You only need to get enough points for the threshold of the grade you want.
1 Classical Search
Question 1 (7 Points). The undirected graph in Figure 1-(Left) has four nodes (A, B, C, D). The number on each edge indicates the cost of going through that edge (e.g., going from A to B has a cost of 5, the same as going from B to A). Set node A as the start node and D as the goal node. Answer the following questions and explain your reasoning.
1. In the correct Dijkstra algorithm, if we visit a node that is already in the frontier, we need to compare the cost of this node along the current path with its cost in the frontier (the last two lines in the pseudocode on Slide 36 of this chapter, 1 Classical Search.pdf; see also the sketch below). Now, on this given graph, if we do not do that comparison and replacement (i.e., remove the last line in the pseudocode), then what is the path from A to D that this wrong modification of the Dijkstra algorithm will return? Explain the steps of how you obtained it.
2. What is the path that will be returned by the correct Dijkstra algorithm instead?
3. In your PA1, we agreed that it was OK to not do this frontier check and replacement, and still always get the optimal path from the wrong modification of the Dijkstra algorithm. Explain why it was OK on the graphs in PA1 and how they differ from the graph in Figure 1. In your explanation, you can use the fact that the Dijkstra algorithm always pops nodes in the order of non-decreasing path cost.
4. Define the following heuristic function with heuristic values h(A) = 3, h(B) = 1, h(C) = 7, h(D) = 0. Is h a consistent heuristic function, and why?
5. What is the path that the A* algorithm will return from A to D under this heuristic h?
Figure 1: (Left) Graph for the classical search question. (Right) Game for adversarial search question.
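For reference when thinking about parts 1 and 2, here is a minimal sketch of Dijkstra's algorithm with the frontier comparison/replacement made explicit. This is an illustration only, not the pseudocode from Slide 36 or PA1; the function and flag names are made up, and setting frontier_check=False corresponds to the "wrong modification" in part 1.

```python
import heapq

def dijkstra(graph, start, goal, frontier_check=True):
    """graph[u] maps each neighbour of u to the edge cost.
    frontier_check=False skips the comparison/replacement: a node already in the
    frontier is never re-inserted, even if a cheaper path to it is found."""
    frontier = [(0, start, [start])]   # heap of (path cost so far, node, path)
    best_in_frontier = {start: 0}      # cheapest cost with which each node entered the frontier
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if node in explored:
            continue                   # stale heap entry left over after a replacement
        explored.add(node)
        for nbr, w in graph[node].items():
            if nbr in explored:
                continue
            new_cost = cost + w
            if nbr in best_in_frontier:
                if not frontier_check or new_cost >= best_in_frontier[nbr]:
                    continue           # keep the existing frontier entry
            best_in_frontier[nbr] = new_cost
            heapq.heappush(frontier, (new_cost, nbr, path + [nbr]))
    return None, float("inf")

# Usage: dijkstra({"A": {"B": 5, ...}, ...}, "A", "D") with the edges from Figure 1.
```

With the check enabled, a cheaper path found to a frontier node replaces the old entry; without it, whichever path reached the node first is the one that stays.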
2 Adversarial Search
Question 2 (3 Points). Consider the game in Figure 1-(Right). There are two players: the column player and the row player. The column player goes first and chooses either the first column (C1) or the second (C2); after that, the row player chooses either the first row (R1) or the second (R2). After these two steps, the game ends, and the pair of numbers located by these two choices is the outcome of the game. In each pair of numbers, the first number is the reward that the column player will receive, and the second number is what the row player will receive. For instance, if the column player chooses C1 and then the row player chooses R1, the outcome is (10, 2), which means the column player will receive a reward of 10 and the row player a reward of 2.
Assume that both players are rational and want to maximize their own reward (no minimizers). What should each player's action be, and what is the final outcome of the game under those actions?
In your answer, draw a game tree similar to the minimax tree, except that the two players are no longer the max and min players (just annotate their nodes as the column and row players). Explain your reasoning on the tree. I hope you realize this is how the algorithms we talked about for minimax can be applied to games where the players are not directly adversarial to each other (not zero-sum).
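The question asks for a drawn game tree; purely as a cross-check of the reasoning, here is a backward-induction sketch in code. Every payoff except the (C1, R1) = (10, 2) entry given in the text is a made-up placeholder, since Figure 1-(Right) is not reproduced here.

```python
# Backward induction on a two-move, general-sum game (a sketch only).
payoffs = {  # payoffs[(col, row)] = (column player's reward, row player's reward)
    ("C1", "R1"): (10, 2), ("C1", "R2"): (0, 0),   # (C1, R2) is a placeholder
    ("C2", "R1"): (0, 0),  ("C2", "R2"): (0, 0),   # placeholders
}

def solve(payoffs):
    best = {}
    for col in ("C1", "C2"):
        # The row player moves second and maximizes its own (second) payoff.
        best[col] = max(("R1", "R2"), key=lambda row: payoffs[(col, row)][1])
    # The column player moves first, anticipates the row player's response,
    # and maximizes its own (first) payoff.
    col = max(("C1", "C2"), key=lambda c: payoffs[(c, best[c])][0])
    return col, best[col], payoffs[(col, best[col])]
```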
3 Markov Decision Processes and Reinforcement Learning
Consider the following MDP:
• State space: S = {s1, s2}
• Actions: A(s1) = {a1, a2} and A(s2) = {a1, a2}.
• Transition model:
– P(s1|s1, a1) = 0.2, P(s2|s1, a1) = 0.8, P(s1|s1, a2) = 0.8, P(s2|s1, a2) = 0.2
– P(s1|s2, a1) = 0.7, P(s2|s2, a1) = 0.3, P(s1|s2, a2) = 0.3, P(s2|s2, a2) = 0.7
• Rewards: R(s1) = 1 and R(s2) = −2.
• Discount factor: γ = 0.8.
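For concreteness, here is the MDP above written out as plain Python dictionaries (a sketch for the questions below; the variable names are illustrative, not from the course code):

```python
S = ["s1", "s2"]
A = {"s1": ["a1", "a2"], "s2": ["a1", "a2"]}
P = {  # P[(s, a)][s_next] = transition probability, as specified above
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 0.8, "s2": 0.2},
    ("s2", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s2", "a2"): {"s1": 0.3, "s2": 0.7},
}
R = {"s1": 1.0, "s2": -2.0}
gamma = 0.8
```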
Question 3 (2 Points). Draw a graph that fully represents the MDP. In the graph, show the states, actions, Q-states (i.e., state-action pairs), and rewards. Label transitions with probabilities on the appropriate edges in the graph.
Question 4 (3 Points). Perform two steps of value iteration from the initial guess V0(s1) = V0(s2) = 0. That is, perform two steps of the Bellman update Vi+1 ← B(Vi) (check the notation in the slides). Note that each V is a vector of the values on both states, i.e., V0 = (V0(s1), V0(s2)). So you should show how to compute V1 and V2, where each Vi is a vector of two numbers.
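A minimal sketch of the two Bellman updates, reusing the dictionaries S, A, P, R, gamma from the sketch above and assuming the state-reward convention V(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) V(s'); check the slides for the exact form used in class:

```python
def bellman_update(V):
    """One Bellman update V_{i+1} = B(V_i) under the stated convention."""
    return {
        s: R[s] + gamma * max(
            sum(p * V[s_next] for s_next, p in P[(s, a)].items())
            for a in A[s]
        )
        for s in S
    }

V0 = {"s1": 0.0, "s2": 0.0}
V1 = bellman_update(V0)   # first update
V2 = bellman_update(V1)   # second update
```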
Question 5 (2 Points). Roughly how many iterations will it take for value iteration to converge to within 0.1 error of the optimal values? That is, suppose the optimal values are V∗ = (V∗(s1), V∗(s2)), and we want |Vk(s1) − V∗(s1)| < 0.1 and |Vk(s2) − V∗(s2)| < 0.1; how large does k need to be to get within this threshold? Explain your reasoning mathematically; no implementation is involved.
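For reference, the standard contraction property of the Bellman update that this kind of bound usually relies on (stated here as a general fact, not as the intended solution):

```latex
\|B(V) - B(V')\|_\infty \le \gamma \,\|V - V'\|_\infty
\qquad\Longrightarrow\qquad
\|V_k - V^*\|_\infty \le \gamma^k \,\|V_0 - V^*\|_\infty .
```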
Question 6 (2 Points). Suppose we refer to the MDP defined above as M1, and write V∗M1(s1), V∗M1(s2) to represent the optimal values of M1. Now define another MDP M2 that is exactly the same as M1 except that the rewards are now RM2(s1) = 2 and RM2(s2) = −4. Namely, M2 differs from M1 only in that the reward on each state is twice as large. We write the optimal values of M2 as V∗M2(s1) and V∗M2(s2).
Now, suppose you know what V∗M1(s1) and V∗M1(s2) are (which you don't, and you do not need to implement anything to find out); what should V∗M2(s1) and V∗M2(s2) be? That is, write down the values of V∗M2(s1) and V∗M2(s2) using the values of V∗M1(s1) and V∗M1(s2). Explain your reasoning.
Question 7 (4 Points). We now perform Q-learning in the MDP defined at the beginning, i.e., M1, without accessing the transition probabilities. We initialize all Q-values to zero: Q(s, a) = 0 for every state-action pair. For simplicity we will use a fixed learning rate parameter α = 0.1 in all temporal-difference updates (so it does not change with the number of visits). We start from the initial state s2, and suppose we perform/observe the following three steps in the learning process in the MDP:
• (Step 1) At the initial state s2, we take the action a1, and then observe that we are transitioned back to s2 by the MDP.
• (Step 2) At the state s2, we take the action a1 again, and then observe that we are transitioned to state s1 by the MDP.
• (Step 3) Now at the state s1, we take the action a1 yet again, and then observe that we are transitioned to state s2 by the MDP.
After each of these three steps, we perform a temporal-difference update for the Q-value of the relevant state-action pair. Now answer the following questions.
1. What values do we have for Q(s1, a1) and Q(s2, a1) now, after these three steps of updates? Write down how you obtained them.
2. Suppose from here we use the ε-greedy strategy with ε = 0.3, which means that with probability ε we use an arbitrary action (each of the two actions is chosen with equal probability in this case), and with probability 1 − ε we choose the best action according to the current Q-values. Now that we are in s2 after Step 3, what is the probability of seeing the transition (s2, a1, s1) in the next step? That is, calculate the probability of the event “according to the ε-greedy policy, we obtain the action a1 in the current state s2, and after applying this action, the MDP puts us in s1 as the next state.”
3. If instead of the ε-greedy policy we take the greedy policy that always takes the action maximizing the Q-values in each step, then what is the probability of seeing (s2, a1, s1) in the next step?
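A minimal sketch of the temporal-difference update and the ε-greedy action probability described above. This is illustrative only: the exact reward convention (R(s) versus R(s')) should follow the slides, and the replayed rewards r1, r2, r3 in the comments are placeholders.

```python
def td_update(Q, s, a, r, s_next, actions=("a1", "a2"), alpha=0.1, gamma=0.8):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy_prob(Q, s, a, eps=0.3, actions=("a1", "a2")):
    """Probability that the epsilon-greedy policy picks action a in state s (ties ignored)."""
    greedy = max(actions, key=lambda b: Q[(s, b)])
    return eps / len(actions) + (1 - eps) * (a == greedy)

Q = {(s, a): 0.0 for s in ("s1", "s2") for a in ("a1", "a2")}
# The three observed transitions would then be replayed as, e.g.
# (with r1, r2, r3 given by the reward convention from the slides):
# td_update(Q, "s2", "a1", r1, "s2")   # Step 1
# td_update(Q, "s2", "a1", r2, "s1")   # Step 2
# td_update(Q, "s1", "a1", r3, "s2")   # Step 3
```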
4 Monte Carlo Tree Search
Question 8 (2 Points). Consider the simple game of betting on the blue and red coins, as shown in the slides for this chapter. Explain why, from a regret perspective, choosing one of the two coins uniformly at random in each round is the same as randomly picking one coin at the very beginning and sticking with it. You should use the definition of regret over N rounds of play of the game in your explanation.
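For reference, one standard way of writing the expected regret over N rounds of a two-coin bandit (use the exact definition from the slides if it differs), where μ1, μ2 are the expected payoffs of the two coins and a_t is the coin chosen in round t:

```latex
R_N \;=\; N\,\max(\mu_1, \mu_2) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{N} \mu_{a_t}\right].
```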
Question 9 (Extra 2 Points). This question is an advanced topic for extra credit, and it can be confusing if you are not familiar with the notion of i.i.d. samples. In that case there is no need to attempt it.
In MCTS we use the Upper Confidence Bound (UCB) strategy to make decisions that balance exploration and exploitation. We explained that for multi-armed bandit problems such as the two-coin betting game, the UCB strategy achieves logarithmic regret. That result needs to assume that we can draw i.i.d. samples from each coin, i.e., that the reward distribution of each coin does not change over time.
Give a concrete example to explain why this may not be the case in the MCTS setting. That is, design a minimax tree such that the win-rate distributions for the choices at the root node change over time as the MCTS algorithm expands the tree. Recall that the win-rate distribution at a node is what we use roll-outs from that node to perform Monte Carlo estimation of.
The mathematical detail does not need to be very precise; just focus on explaining what the MCTS algorithm does on the tree that you design, so that it is clear that the win-rate distribution for the options at the root node may shift over time.
5 Constraint Solving and Propositional Reasoning
Question 10 (3 Points). Consider the following constraint system. The variables are X = {x1, x2} with x1 ∈ D1 = [0, 1] and x2 ∈ D2 = [0, 1]; both are continuous intervals of real numbers. The constraints are
C1 : x1 = x2 + 0.2
C2 : x1 = x2^2
Perform propagation on the domain D = D1 × D2 using the constraints {C1, C2} by enforcing arc-consistency. Show how the domains of each variable will be updated in the first several propagation steps, then show what end result you will obtain by iterating such propagation, and explain why.
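A rough sketch of the interval narrowing that arc-consistency performs here, assuming the squared reading of C2 reconstructed above (illustrative mechanics only, not the official propagation procedure from class):

```python
def propagate(D1, D2, steps=6):
    """Alternately narrow D1 = [lo1, hi1] and D2 = [lo2, hi2] against
    C1: x1 = x2 + 0.2 and (assumed) C2: x1 = x2^2, with both domains inside [0, 1]."""
    for _ in range(steps):
        (lo1, hi1), (lo2, hi2) = D1, D2
        # C1: x1 = x2 + 0.2  =>  x1 must lie in D2 + 0.2, and x2 in D1 - 0.2
        lo1, hi1 = max(lo1, lo2 + 0.2), min(hi1, hi2 + 0.2)
        lo2, hi2 = max(lo2, lo1 - 0.2), min(hi2, hi1 - 0.2)
        # C2: x1 = x2^2      =>  x1 must lie in D2^2, and x2 in sqrt(D1) (domains are non-negative)
        lo1, hi1 = max(lo1, lo2 ** 2), min(hi1, hi2 ** 2)
        lo2, hi2 = max(lo2, lo1 ** 0.5), min(hi2, hi1 ** 0.5)
        if lo1 > hi1 or lo2 > hi2:
            print("empty domain reached")
            return
        D1, D2 = (lo1, hi1), (lo2, hi2)
        print(D1, D2)

propagate((0.0, 1.0), (0.0, 1.0))
```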
Question 11. Convert the following formula into conjunctive normal form (CNF):
φ : ¬(p2 → (p3 → ¬p1)) → p1
You only need to expand the definition of the logical connective “→”; then using the De Morgan laws should be enough (i.e., there is no need to go through the general procedure of introducing new variables, etc.).
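The hint relies on the standard identities (stated here as general facts, not as part of the original question):

```latex
p \to q \;\equiv\; \neg p \lor q, \qquad
\neg(p \land q) \;\equiv\; \neg p \lor \neg q, \qquad
\neg(p \lor q) \;\equiv\; \neg p \land \neg q, \qquad
\neg\neg p \;\equiv\; p.
```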