
Please finish reading everything on this page before you start.

Open book, web, and everything. No ChatGPT. Do all the work completely by yourself, without any discussion with others. Any form of communication related to the course before the submission deadline will be considered an academic integrity violation.

For clarification questions, ask me via direct messages on Slack (@Sean) and do not post anything in public channels. If any general correction/clarification is needed, I'll make announcements in the #general channel on Slack. Be assured that I will read every message, and if you do not see my reply within an hour or so (when my Slack status is green), it means my reply is "no comment" because the answer will be clear after re-reading the questions or going through the class materials.

Submit the following two items by the deadline:

1. Submit photos or PDFs of your written/typed answers on Gradescope (Assignment: "Final").

2. Submit a video recording of yourself briefly explaining all your answers, at the following link: https://www.dropbox.com/request/MlYRr5sFww1kj9BQcbTu.

Make sure to leave enough time for uploading, especially if your network is spotty. You need to show your face and a photo ID at the beginning of the video (feel free to turn off the camera and only show the answer sheets after that). If you don't have a camera, message me.

There is no strict rule about how long the video should be, but do not make it too long (20 minutes should be more than enough). It only needs to show that you have the right ideas for each question you answered. Think of it as an interview in which you are trying to convince the examiner of what you understand, without overdoing it.

The grading will be based entirely on the written part. The video is used to ensure academic integrity and also to help you get partial credit. In particular, if a question explicitly asks you to explain something, you need to explain it in the written part (not just in the video); otherwise that question will be considered unanswered.

Make sure the file name of your video contains your first name, last name, and PID. I don't know whether Dropbox sends you a confirmation, but for your own record you can always take a screenshot of the page indicating that you have successfully uploaded your files. You can upload multiple times, but we do not guarantee that we will only check the last one, so try your best to send only a final version of the video.

Zoom is highly recommended for doing your recording. If you recorded using your phone or in some other way, try compressing the video into mp4 format (a relatively low resolution is fine) and avoid making the file too large. If you find yourself uploading gigabytes of video, you should change the file.

Your overall letter grade for the course will depend only on the sum of all raw points you obtain in the class; there is no notion of "percentage" in this class. You only need to get enough points for the threshold of the grade you want.

1 Classical Search

Question 1 (7 Points). The undirected graph in Figure 1-(Left) has four nodes (A, B, C, D). The number on each edge indicates the cost of going through the edge; e.g., going from A to B has a cost of 5, the same as going from B to A. Set node A as the start node and D as the goal node. Answer the following questions and explain your reasoning.

1. In the correct Dijkstra algorithm, if we visit a node that is already in the frontier, we need to compare the cost of this node along the current path with its cost in the frontier (the last two lines in the pseudocode on Slide 36 of this chapter's slides, 1 Classical Search.pdf). Now, on this given graph, if we do not do that comparison and replacement (i.e., remove the last line of the pseudocode), what path from A to D will this wrong modification of the Dijkstra algorithm return? Explain the steps of how you obtained it. (See the sketch after this list.)

2. What path will the correct Dijkstra algorithm return instead?

3. In your PA1, we agreed that it was OK not to do this frontier check and replacement and still always get the optimal path from the wrong modification of the Dijkstra algorithm. Explain why it was OK for the graphs in PA1, and what makes them different from the graph in Figure 1. In your explanation, you can use the fact that the Dijkstra algorithm always pops nodes in order of non-decreasing path cost.

4. Define the following heuristic function with heuristic values h(A) = 3, h(B) = 1, h(C) = 7, h(D) = 0. Is h a consistent heuristic function, and why?

5. What path will the A* algorithm return from A to D under this heuristic h?
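For concreteness, here is a minimal Python sketch of the frontier check discussed in item 1. This is an illustrative sketch, not the course's pseudocode, and the example graph at the bottom is hypothetical: Figure 1 is not reproduced here, and only the A-B cost of 5 is stated in the question text.

# Dijkstra / uniform-cost search with an explicit frontier.
# If do_frontier_replacement is False, a node already in the frontier is never
# updated with a cheaper path: the "wrong modification" described in item 1.
def dijkstra(graph, start, goal, do_frontier_replacement=True):
    frontier = {start: (0, [start])}   # node -> (best known path cost, path)
    explored = set()
    while frontier:
        node = min(frontier, key=lambda n: frontier[n][0])  # pop cheapest node
        cost, path = frontier.pop(node)
        if node == goal:
            return cost, path
        explored.add(node)
        for nbr, w in graph[node].items():
            if nbr in explored:
                continue
            new_cost = cost + w
            if nbr not in frontier:
                frontier[nbr] = (new_cost, path + [nbr])
            elif do_frontier_replacement and new_cost < frontier[nbr][0]:
                # The comparison-and-replacement step (the last lines of the
                # pseudocode): keep the cheaper of the two competing paths.
                frontier[nbr] = (new_cost, path + [nbr])
    return None

# Hypothetical 4-node graph; only the A-B cost of 5 comes from the text.
example_graph = {
    "A": {"B": 5, "C": 1},
    "B": {"A": 5, "C": 1, "D": 1},
    "C": {"A": 1, "B": 1},
    "D": {"B": 1},
}
print(dijkstra(example_graph, "A", "D", do_frontier_replacement=False))
print(dijkstra(example_graph, "A", "D", do_frontier_replacement=True))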

Figure 1: (Left) Graph for the classical search question. (Right) Game for adversarial search question.

2 Adversarial Search

Question 2 (3 Points). Consider the game in Figure 1-(Right). There are two players: the column player and the row player. The column player goes first and chooses either the first column (C1) or the second (C2); after that, the row player chooses either the first row (R1) or the second (R2). After these two steps, the game ends, and the pair of numbers located by these two choices is the outcome of the game. In each pair of numbers, the first number is the reward that the column player will receive, and the second number is what the row player will receive. For instance, if the column player chooses C1 and then the row player chooses R1, the outcome is (10, 2), which means the column player will receive a reward of 10 and the row player a reward of 2.

Assuming that both players are rational and want to maximize their own reward (no minimizers), what should each player's action be, and what is the final outcome of the game under those actions?

In your answer, draw a game tree similar to the minimax tree, except that the two players are no longer the max and min players (just annotate their nodes as the column and row players). Explain your reasoning on the tree. I hope you realize that this is how the algorithms we discussed for minimax can be applied to games where the players are not directly adversarial to each other (not zero-sum).
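As a companion illustration of the backward induction described above, here is a minimal Python sketch. The payoff table is hypothetical except for the (C1, R1) = (10, 2) entry, which is the only outcome stated in the text (Figure 1-(Right) is not reproduced here).

# Backward induction for a two-step, general-sum game.
# Payoffs are (column player's reward, row player's reward).
# Only the (C1, R1) = (10, 2) entry comes from the question text;
# the other entries are made-up placeholders.
payoffs = {
    ("C1", "R1"): (10, 2),
    ("C1", "R2"): (3, 7),
    ("C2", "R1"): (4, 4),
    ("C2", "R2"): (6, 1),
}

def solve():
    best_outcome, best_moves = None, None
    for col in ("C1", "C2"):
        # The row player moves second and maximizes its own (second) payoff.
        row = max(("R1", "R2"), key=lambda r: payoffs[(col, r)][1])
        outcome = payoffs[(col, row)]
        # The column player anticipates this reply and maximizes its own
        # (first) payoff over its two choices.
        if best_outcome is None or outcome[0] > best_outcome[0]:
            best_outcome, best_moves = outcome, (col, row)
    return best_moves, best_outcome

print(solve())   # (('C2', 'R1'), (4, 4)) for the placeholder payoffs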

3 Markov Decision Processes and Reinforcement Learning

Consider the following MDP:

State space: S = {s1, s2}
Actions: A(s1) = {a1, a2} and A(s2) = {a1, a2}.
Transition model:
P(s1|s1, a1) = 0.2, P(s2|s1, a1) = 0.8, P(s1|s1, a2) = 0.8, P(s2|s1, a2) = 0.2
P(s1|s2, a1) = 0.7, P(s2|s2, a1) = 0.3, P(s1|s2, a2) = 0.3, P(s2|s2, a2) = 0.7
Rewards: R(s1) = 1 and R(s2) = 2.
Discount factor: γ = 0.8.

Question 3 (2 Points). Draw a graph that fully represents the MDP. In the graph, show the states, actions, Q-states (i.e., state-action pairs), and rewards. Label transitions with probabilities on the appropriate edges in the graph.

Question 4 (3 Points). Perform two steps of value iteration from the initial guess V0(s1) = V0(s2) = 0. That is, perform two steps of the Bellman update Vi+1 ← B(Vi) (check the notation in the slides). Note that each Vi is a vector of the values on both states, i.e., V0 = (V0(s1), V0(s2)). So you should show how to compute V1 and V2, where each Vi is a vector of two numbers.
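For readers who want to sanity-check their hand calculation, the MDP defined above can be transcribed directly into Python and the two Bellman updates run mechanically. The sketch below assumes the state-based-reward form of the Bellman update, V_{i+1}(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) V_i(s'); confirm that this matches the slides' exact definition of B.

# Transcription of the MDP M1 defined above (states, actions, transitions,
# rewards, discount) plus two applications of the Bellman update.
A = {"s1": ["a1", "a2"], "s2": ["a1", "a2"]}
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 0.8, "s2": 0.2},
    ("s2", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s2", "a2"): {"s1": 0.3, "s2": 0.7},
}
R = {"s1": 1.0, "s2": 2.0}
gamma = 0.8

def bellman_update(V):
    # Assumed convention: V'(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) V(s').
    return {
        s: R[s] + gamma * max(
            sum(p * V[s2] for s2, p in P[(s, a)].items()) for a in A[s]
        )
        for s in V
    }

V0 = {"s1": 0.0, "s2": 0.0}
V1 = bellman_update(V0)
V2 = bellman_update(V1)
print(V1, V2)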

Question 5 (2 Points). Roughly how many iterations will it take for value iteration to converge to within 0.1 error of the optimal values? That is, suppose the optimal values are V∗ = (V∗(s1), V∗(s2)), and we want |Vk(s1) − V∗(s1)| < 0.1 and |Vk(s2) − V∗(s2)| < 0.1; how large does k need to be to converge under this threshold? Explain your reasoning mathematically; no implementation is involved.

Question 6 (2 Points). Suppose we refer to the MDP defined above as M1, and write V∗_M1(s1), V∗_M1(s2) for the optimal values of M1. Now define another MDP M2 that is exactly the same as M1 except that the rewards are now R_M2(s1) = 2 and R_M2(s2) = 4. Namely, M2 differs from M1 only in that the reward on each state is twice as large. We write the optimal values of M2 as V∗_M2(s1) and V∗_M2(s2).

Now, suppose you know what V∗_M1(s1) and V∗_M1(s2) are (which you don't, and you do not need to implement anything to find out); what should V∗_M2(s1) and V∗_M2(s2) be? That is, write down the values of V∗_M2(s1) and V∗_M2(s2) using the values of V∗_M1(s1) and V∗_M1(s2). Explain your reasoning.

Question 7 (4 Points). We now perform Q-learning in the MDP defined at the beginning, i.e., M1, without accessing the transition probabilities. We initialize all Q-values to zero, Q(si, aj) = 0 for all i and j. For simplicity we will use a fixed learning-rate parameter α = 0.1 in all temporal-difference updates (so it does not change with the number of visits). We start from the initial state s2, and suppose we perform/observe the following three steps in the learning process in the MDP:

(Step 1) At the initial state s2, we take the action a1, and then observe that we are transitioned back to s2 by the MDP.

(Step 2) At the state s2 again, we take the action a1, and then observe that we are transitioned to the state s1 by the MDP.

(Step 3) Now at the state s1, we take the action a1 yet again, and then observe that we are transitioned to the state s2 by the MDP.

After each of these three steps, we perform temporal-difference updates for the Q-values on the relevant state-action pairs. Now answer the following questions. (A sketch of the update rule assumed here appears after the sub-questions.)

1. What values do we have for Q(s1, a1) and Q(s2, a1) now, after these three steps of updates? Write down how you obtained them.

2. Suppose from here we use the ε-greedy strategy with ε = 0.3, which means that with probability ε we use an arbitrary action (each of the two actions is chosen with equal probability in this case), and with probability 1 − ε we choose the best action according to the current Q-values. Now that we are in s2 after Step 3, what is the probability of seeing the transition (s2, a1, s1) in the next step? That is, calculate the probability of the event "according to the ε-greedy policy, we obtain the action a1 in the current state s2, and after applying this action, the MDP puts us in s1 as the next state."

3. If instead of the ε-greedy policy we take the greedy policy that always takes the action maximizing the Q-value in each step, what is the probability of seeing (s2, a1, s1) in the next step?
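For reference, here is a minimal Python sketch of the temporal-difference update used in this question, applied to Step 1 only. It assumes the state-based reward convention Q(s, a) ← Q(s, a) + α (R(s) + γ max_a' Q(s', a') − Q(s, a)); check the slides for the exact reward convention used in class.

# Q-learning TD update, shown on Step 1 (in s2, take a1, land back in s2).
# Assumed convention: Q(s,a) += alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a));
# the exact reward convention should be taken from the course slides.
alpha, gamma = 0.1, 0.8
R = {"s1": 1.0, "s2": 2.0}
Q = {(s, a): 0.0 for s in ("s1", "s2") for a in ("a1", "a2")}

def td_update(Q, s, a, s_next):
    target = R[s] + gamma * max(Q[(s_next, "a1")], Q[(s_next, "a2")])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

td_update(Q, "s2", "a1", "s2")   # Step 1
print(Q[("s2", "a1")])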

4 Monte Carlo Tree Search

Question 8 (2 Points). Consider the simple game of betting on the blue and red coins, as shown in the slides for this chapter. Explain why, from a regret perspective, choosing one of the two coins uniformly at random in each round is the same as randomly sticking with one coin from the very beginning. You should use the definition of regret over N rounds of play of the game in your explanation.
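For reference, the notion of regret used above is standardly defined as follows (the slides' notation may differ slightly): if μ∗ denotes the expected per-round reward of the best single coin and r_t the reward actually received at round t, then the regret over N rounds is

Regret(N) = N · μ∗ − E[r_1 + r_2 + ... + r_N].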

Question 9 (Extra 2 Points). This question is an advanced topic for extra credit, and it can be confusing if you are not familiar with the notion of i.i.d. samples. In that case there is no need to attempt it.

In MCTS we use the Upper Confidence Bound (UCB) strategy to make decisions that balance exploration and exploitation. We explained that for multi-armed bandit problems such as the two-coin betting game, the UCB strategy achieves logarithmic regret. That result needs to assume that we can draw i.i.d. samples for each coin, because the reward distribution of each coin does not change over time.

Give a concrete example to explain why this may not be the case in the MCTS setting. That is, design a minimax tree such that the win-rate distributions for the choices at the root node change over time as the MCTS algorithm expands the tree. Recall that the win-rate distribution at a node is what we are using roll-outs from that node to perform Monte Carlo estimation of.

The mathematical detail does not need to be very precise; just focus on explaining what the MCTS algorithm does on the tree that you design, so that it is clear that the win-rate distribution for the options at the root node may shift over time.
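For context, the UCB strategy referred to above is usually written in the following UCB1 form (the exact constant in the exploration term may differ from the slides): at each round, choose the option i that maximizes

x̄_i + sqrt(2 ln N / n_i),

where x̄_i is the empirical mean reward observed for option i so far, n_i is the number of times option i has been tried, and N is the total number of plays across all options.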

5 Constraint Solving and Propositional Reasoning

Question 10 (3 Points). Consider the following constraint system. The variables are X = {x1, x2} with x1 ∈ D1 = [0, 1] and x2 ∈ D2 = [0, 1]; both are continuous intervals of real numbers. The constraints are

C1 : x1 = x2 + 0.2
C2 : x1 = x2^2

Perform propagation on the domain D = D1 × D2 using the constraints {C1, C2} by enforcing arc-consistency. Show how the domains of each variable will be updated in the first several propagation steps, then show what end result you will obtain by iterating such propagation, and explain why.
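As a small illustration of what a single arc-consistency revision looks like on interval domains, here is a Python sketch of the revision step for constraint C1 alone. It is an illustrative sketch only; the full propagation loop over both constraints, which the question asks for, is left out.

# One arc-consistency revision for C1: x1 = x2 + 0.2 on interval domains.
def revise_c1(d1, d2):
    """d1, d2 are (low, high) intervals for x1 and x2.
    Returns the tightened intervals implied by x1 = x2 + 0.2."""
    lo1, hi1 = d1
    lo2, hi2 = d2
    # x1 must lie in (d2 + 0.2) intersected with d1.
    new_d1 = (max(lo1, lo2 + 0.2), min(hi1, hi2 + 0.2))
    # x2 must lie in (d1 - 0.2) intersected with d2.
    new_d2 = (max(lo2, lo1 - 0.2), min(hi2, hi1 - 0.2))
    return new_d1, new_d2

print(revise_c1((0.0, 1.0), (0.0, 1.0)))   # ((0.2, 1.0), (0.0, 0.8))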

Question 11. Convert the following formula into CNF:

φ : ¬ (p2 (p3 → ¬p1)) p1

You only need to expand the definition of the logical connective "→", and then using De Morgan's laws should be enough (i.e., there is no need to go through the general procedure of introducing new variables, etc.).
