Assignment 3
Q1
10 Points
Q1.1 BERT
5 Points
What is the optimization objective of BERT?
Predicting a masked word.
Predicting whether two sentences follow each other.
Both a and b.
Explanation:
Q1.2 Positional Encoding
5 Points
Why do we use positional encoding in transformers?
Because it helps locate the most important word in the sentence.
Because attention encoder outputs depend on the order of the inputs.
Because we replaced recurrent connections with attention modules.
Because it decreases overfitting in RNN and transformers.
Explanation:
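For reference, a minimal NumPy sketch (not part of the original question) of the sinusoidal positional encoding from the original Transformer; the function name `sinusoidal_pe` is illustrative only.

```python
# A sketch of sinusoidal positional encoding (illustrative only).
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Return the (seq_len, d_model) positional-encoding matrix (d_model assumed even)."""
    pos = np.arange(seq_len)[:, None]                # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]             # dimension-pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

print(sinusoidal_pe(seq_len=4, d_model=8).shape)     # (4, 8); added to the token embeddings
```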
Q2 Time Complexity of Transformers
10 Points
Q2.1
5 Points
What is the time complexity of Transformers as a function of the number of heads h?
O(log h)
O(h)
O(h^2)
None of the above.
Explanation:
Q2.2
5 Points
What is the time complexity of Transformers as a function of sequence length n?
O(n)
O(n^2)
O(n^3)
None of the above.
Explanation:
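For intuition, a small NumPy sketch (not part of the original question) showing that the self-attention score matrix $QK^T$ has one entry per pair of sequence positions; the toy sizes are arbitrary.

```python
# Self-attention scores: one dot product per (query position, key position) pair.
import numpy as np

n, d = 16, 4                                   # sequence length, per-head dimension
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
scores = Q @ K.T / np.sqrt(d)                  # scaled dot-product scores
print(scores.shape)                            # (16, 16): grows with the square of n
```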
Q3 Transformers
10 Points
Q3.1
5 Points
What is the main difference between the Transformer and the encoder-decoder architecture?
Unlike encoder-decoder, transformer uses attention only in encoders.
Unlike encoder-decoder, transformer uses attention only in decoders.
Unlike encoder-decoder, transformers stack multiple encoder-decoder structures together.
Unlike encoder-decoder, transformer uses attention in both encoders and decoders.
Explanation:
Q3.2
5 Points
Which of the following architectures cannot be parallelized?
CNNs
RNNs
Transformers
Explanation:
Q4 Transformers
10 Points
Q4.1
5 Points
For the vectors $x_i$, consider the weighted average $y_i = \sum_j \alpha_{i,j} \cdot x_j$, where $w_{i,j} = x_i^T x_j$ and $\alpha_{i,j} = \mathrm{softmax}(w_{i,j})$.
What is $\sum_j \alpha_{i,j}$ for any $i$?
Answer and Explanation:
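A quick numerical check (a sketch, not an official solution) of the weights defined above, using random 2-D vectors $x_i$.

```python
# Row-wise softmax of the pairwise scores w_{i,j} = x_i^T x_j.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))                                # rows are the vectors x_i
W = X @ X.T                                                # w_{i,j} = x_i^T x_j
alpha = np.exp(W) / np.exp(W).sum(axis=1, keepdims=True)   # softmax over j for each i
print(alpha.sum(axis=1))                                   # each entry is 1 up to float error
```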
Q4.2
5 Points
In Seq2Seq models with $n$ words, let $h_1, h_2, \dots, h_n \in \mathbb{R}^h$ be the encoder hidden states ($h$ is the embedding dimensionality) and $s_t \in \mathbb{R}^h$ be the decoder hidden state at step $t$. The attention scores for step $t$ are $e_t = [(s_t)^T h_1, \dots, (s_t)^T h_n] \in \mathbb{R}^n$.
Taking the softmax gives the attention distribution $\alpha_t$ for this step: $\alpha_t = \mathrm{softmax}(e_t) \in \mathbb{R}^n$. What is the formula for the attention output $a_t$?
Please type any math input using LaTeX format enclosed by $$…$$.
Answer and Explanation:
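A small sketch (not an official solution) of the quantities defined above, with random toy values for the encoder states and the decoder state.

```python
# Attention output as the alpha_t-weighted sum of encoder hidden states.
import numpy as np

rng = np.random.default_rng(0)
n, h = 5, 8
H = rng.normal(size=(n, h))                    # encoder hidden states h_1 .. h_n (rows)
s_t = rng.normal(size=h)                       # decoder hidden state at step t

e_t = H @ s_t                                  # e_t[i] = (s_t)^T h_i
alpha_t = np.exp(e_t) / np.exp(e_t).sum()      # softmax over the n scores
a_t = alpha_t @ H                              # weighted sum of the h_i, lives in R^h
print(a_t.shape)                               # (8,)
```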
Q5 GNNs
10 Points
Q5.1
5 Points
What are the two key operations used for updating a node representation in a GNN?
Aggregate and Combine.
Aggregate and Message.
Combine and Update.
Aggregate and Max Pooling.
Explanation:
Q5.2
5 Points
What is the difference between a node embedding and a node representation?
A node representation is a special case of node embedding.
A node embedding is a special case of node representation.
There is no difference between the two.
Explanation:
Q6 GNNs
10 Points
Q6.1
5 Points
Citation networks can be treated as graphs in which researchers cite fellow researchers’ papers. Given a paper “A”, we would like to predict whether it cites another paper “B”. Which type of prediction task does this correspond to?
Node prediction.
Link prediction.
Graph prediction.
Subgraph prediction.
Explanation:
Q6.2
5 Points
Select the correct statement among the following:
Combine operation gathers information from all nodes and aggregate operation updates the collected information with its self information.
Aggregate operation gathers information from all nodes and combine operation updates the collected information with its self information.
Combine operation gathers information from its neighboring nodes and aggregate operation updates the collected information with its self information.
Aggregate operation gathers information from its neighboring nodes and combine operation updates the collected information with its self information.
Explanation:
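For intuition, a minimal sketch of one message-passing update, assuming mean aggregation over neighbors and an additive tanh combine with the node's own state; the function name `gnn_layer` and the toy graph are illustrative, not from any specific GNN library.

```python
# One GNN layer: AGGREGATE over neighbours, then COMBINE with the node's own state.
import numpy as np

def gnn_layer(H, adj):
    """H: (num_nodes, dim) features; adj: (num_nodes, num_nodes) 0/1 adjacency."""
    agg = np.zeros_like(H)
    for v in range(H.shape[0]):
        neighbors = np.nonzero(adj[v])[0]
        if neighbors.size > 0:
            agg[v] = H[neighbors].mean(axis=0)   # AGGREGATE: mean of neighbour features
    return np.tanh(H + agg)                      # COMBINE: merge with the node's own features

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
H = np.eye(3)                                    # toy one-hot node features
print(gnn_layer(H, adj))
```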
Q7 GNNs
10 Points
What is the Laplacian matrix for a graph with nodes {1,2,3,4,5} and edges {(1,5),(1,3),(2,3),(2,5),(3,4)}?
Answer and Explanation:
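A small sketch (not an official solution) that constructs the Laplacian as $L = D - A$ for the graph in the question, with nodes 1–5 mapped to indices 0–4.

```python
# Laplacian L = D - A for nodes {1,...,5} and the edge list from the question.
import numpy as np

edges = [(1, 5), (1, 3), (2, 3), (2, 5), (3, 4)]
A = np.zeros((5, 5), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1        # undirected adjacency (nodes shifted to 0-based)
D = np.diag(A.sum(axis=1))                       # diagonal degree matrix
L = D - A
print(L)
```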
Q8 RNNs
10 Points
Given an RNN defined by $s_t = W \cdot s_{t-1} + U \cdot x_t$ with
$W = \begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}$, $U = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, and $s_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$,
what is $s_2$ for $x = (x_1, x_2) = (1, 0)$?
Answer:
Explanation:
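A small sketch (not an official solution) that simply unrolls the recursion with the matrices stated in the question.

```python
# Unroll s_t = W s_{t-1} + U x_t for two steps with the values from the question.
import numpy as np

W = np.array([[0, -1],
              [-1, 0]])
U = np.array([1, 1])
s = np.array([0, 0])                 # s_0
for x_t in (1, 0):                   # x_1 = 1, x_2 = 0
    s = W @ s + U * x_t
print(s)                             # this is s_2
```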
Q9 Convolutional view of a linear RNN
10 Points
Q9.1
8 Points
Given an RNN defined by
$s_t = W \cdot s_{t-1} + U \cdot x_t$
$y_t = C \cdot s_t$
$s_0 = U \cdot x_0$
where $s_t \in \mathbb{R}^2$, $x_t \in \mathbb{R}$, and $y_t \in \mathbb{R}$ denote the hidden state, input, and output of the RNN at timestep $t$, respectively, and
$W = \begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}$, $U = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$, $C = \begin{pmatrix} -1 & 1 \end{pmatrix}$.
This linear RNN can be written as a convolution:
$y_t = K_t * x$
where $x = (x_0, x_1, \dots, x_t)$ denotes the inputs to the RNN up to time $t$, $*$ denotes convolution, and $K_t$ denotes a convolutional kernel matrix of shape $1 \times (t+1)$.
What are the entries of $K_2$ such that the output $y_2$ obtained by convolution is the same as the output obtained from the above recursive definition of the RNN?
Hint: This blog post might help:
https://huggingface.co/blog/lbourdois/get-on-the-ssm-train
Answer: K2 =
Explanation:
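A small sketch (not an official solution) that builds the kernel entries from the unrolled recursion $y_t = \sum_k C W^{t-k} U x_k$ and checks them against the recursive definition; here the convolution $K_2 * x$ is taken to mean the dot product of $K_2$ with $(x_0, x_1, x_2)$ in that order.

```python
# Build K_2 from the unrolled recursion and check it against the recursive RNN.
import numpy as np

W = np.array([[0, -1],
              [-1, 0]])
U = np.array([-1, 1])
C = np.array([-1, 1])

# Unrolling gives y_2 = C W^2 U x_0 + C W U x_1 + C U x_2, so:
K2 = np.array([C @ np.linalg.matrix_power(W, 2 - k) @ U for k in range(3)])

x = np.random.default_rng(0).normal(size=3)      # arbitrary inputs x_0, x_1, x_2
s = U * x[0]                                     # s_0 = U x_0
for t in (1, 2):
    s = W @ s + U * x[t]
print(K2, np.allclose(K2 @ x, C @ s))            # kernel entries, and True if both views agree
```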
Q9.2
2 Points
Give one reason why we might want to implement a linear RNN using a convolution instead of recursion:
Q10 Parameter Efficient Tuning of Attention layers
10 Points
Q10.1
2.5 Points
Consider a multihead self attention block with $h$ heads:
$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$
where $\mathrm{head}_i = \mathrm{Attention}(X W_i^Q, X W_i^K, X W_i^V)$.
Assume bias=False, $X \in \mathbb{R}^{N \times D}$, and $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{D \times d}$.
What should the shape of $W^O$ be, so that the output of the multihead self attention has the same shape as the input, in terms of $h$, $d$, $D$, and $N$?
Answer:
Explanation:
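A quick shape check (a sketch, not an official solution), assuming $W^O$ maps the concatenated heads back to the input dimension; the toy sizes are arbitrary.

```python
# Shape check: Concat of h heads is (N, h*d); W_O of shape (h*d, D) restores (N, D).
import numpy as np

N, D, d, h = 4, 16, 8, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
heads = [rng.normal(size=(N, d)) for _ in range(h)]   # stand-ins for Attention(XW_i^Q, XW_i^K, XW_i^V)
W_O = rng.normal(size=(h * d, D))
out = np.concatenate(heads, axis=1) @ W_O
print(out.shape == X.shape)                           # True
```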
Q10.2
2.5 Points
How many parameters are there in total, in terms of h, d, D, and N?
Answer:
Explanation:
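A small helper (a sketch, not an official solution) that counts the weights under the assumptions stated in the question (no biases, $h$ sets of query/key/value projections plus one $W^O$); the values passed in the call are arbitrary toy numbers.

```python
# Count weights: h heads x (W^Q_i, W^K_i, W^V_i of shape D x d) plus W^O of shape (h*d) x D.
def mha_param_count(h, d, D):
    per_head = 3 * D * d              # query, key, value projections for one head
    w_o = (h * d) * D                 # output projection
    return h * per_head + w_o         # = 4 * h * D * d

print(mha_param_count(h=8, d=64, D=512))   # 1048576 for these arbitrary toy values
```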
Q10.3
2.5 Points
I have some data I want to finetune the multihead attention block on, but I don’t want to finetune all the parameters.
I want to use LoRA to finetune the model (LoRA is described here: https://arxiv.org/pdf/2106.09685).
Specifically, for every trainable matrix $W$, I decompose it as
$W = W_0 + \Delta W = W_0 + BA$,
where $W_0$ is the pretrained weight matrix, which is not trained, and $\Delta W = BA$ is a residual matrix with the same shape as $W$ that I do train, with $B \in \mathbb{R}^{D \times r}$ and $A \in \mathbb{R}^{r \times d}$. I decompose every query, key, and value projection matrix in the multihead attention block in this way and train only the residual matrices. I leave $W^O$ frozen.
How many parameters are being trained in the above procedure, in terms of h, d, D, N, and r?
Answer:
Explanation:
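A small helper (a sketch, not an official solution) counting only the LoRA factors, under the assumption that each of the $3h$ projection matrices gets its own $B \in \mathbb{R}^{D \times r}$ and $A \in \mathbb{R}^{r \times d}$ while $W^O$ stays frozen; the values in the call are arbitrary toy numbers.

```python
# Count LoRA factors: for each of the 3*h frozen projections, train B (D x r) and A (r x d).
def lora_param_count(h, d, D, r):
    per_matrix = r * D + r * d        # B and A for one projection matrix
    return 3 * h * per_matrix         # W^O is frozen, so it contributes nothing

print(lora_param_count(h=8, d=64, D=512, r=4))   # 55296 for these arbitrary toy values
```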
Q10.4
2.5 Points
State two separate advantages of finetuning a model with LoRA: