
Assignment 3

Q1

10 Points

Q1.1 BERT

5 Points

What is the optimization objective of BERT?

Predicting a masked word.

Predicting whether two sentences follow each other.

Both a and b.

Explanation:


Q1.2 Positional Encoding

5 Points

Why do we use positional encoding in transformers?

Because it helps locate the most important word in the sentence.

Because attention encoder outputs depend on the order of the inputs.

Because we replaced recurrent connections with attention modules.

Because it decreases overfitting in RNNs and transformers.

Explanation:

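As an illustration of what positional encoding adds, here is a minimal NumPy sketch of the sinusoidal scheme from the original Transformer paper; the question does not name a specific scheme, so this particular variant (and the assumption of an even d_model) is a choice made here for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model // 2), assumes even d_model
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings; attention by itself is
# permutation-equivariant, so this is how order information gets injected.
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```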

Q2 Time Complexity of Transformers

10 Points

Q2.1

5 Points

What is the time complexity of Transformers as a function of the number of heads h?

O(log h)

O(h)

O(h^2)

None of the above.

Explanation:


Q2.2

5 Points

What is the time complexity of Transformers as a function of sequence length n?

O(n)

O(n^2)

O(n^3)

None of the above.

Explanation:

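To make the complexity questions above concrete, here is a small NumPy sketch of a single self-attention head with made-up sizes; the score matrix it materializes has one entry per pair of positions, while each of the h heads repeats the same computation independently on its own slice.

```python
import numpy as np

n, d = 128, 64                      # sequence length, per-head dimension (arbitrary)
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

scores = Q @ K.T / np.sqrt(d)       # (n, n): every position attends to every position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                   # (n, d)

# The (n, n) score matrix is what makes self-attention scale quadratically with
# sequence length; the h heads each run this same computation independently, so
# the cost grows linearly with the number of heads.
```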

Q3 Transformers

10 Points

Q3.1

5 Points

What is the main difference between the Transformer and the encoder-decoder architecture?

Unlike the encoder-decoder architecture, the Transformer uses attention only in its encoders.

Unlike the encoder-decoder architecture, the Transformer uses attention only in its decoders.

Unlike the encoder-decoder architecture, the Transformer stacks multiple encoder-decoder structures on top of each other.

Unlike the encoder-decoder architecture, the Transformer uses attention in both its encoders and decoders.

Explanation:


Q3.2

5 Points

Which of the following architectures cannot be parallelized?

CNNs

RNNs

Transformers

Explanation:

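As a companion to the parallelization question, a toy NumPy sketch contrasting the step-by-step recurrence of an RNN with the batched matrix products of self-attention; all shapes, weights, and the tanh/softmax choices are made up for illustration.

```python
import numpy as np

n, d = 16, 8
X = np.random.randn(n, d)
W_h, W_x = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1

# RNN: an inherently sequential loop -- step t cannot start before step t-1.
h = np.zeros(d)
states = []
for t in range(n):
    h = np.tanh(W_h @ h + W_x @ X[t])
    states.append(h)

# Self-attention: every output position comes from the same matrix products,
# so all n timesteps can be processed in parallel.
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ X              # all positions at once
```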

Q4 Transformers

10 Points

Q4.1

5 Points

For the vectors $x_i$, consider the weighted average $y_i = \sum_j \alpha_{i,j} x_j$, where $w_{i,j} = x_i^T x_j$ and $\alpha_{i,j} = \mathrm{softmax}(w_{i,j})$.

What is $\sum_j \alpha_{i,j}$ for any $i$?

Answer and Explanation:

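A quick numerical sanity check of the quantity asked about, using arbitrary made-up vectors $x_i$: each row of the $\alpha_{i,j}$ matrix is a softmax, i.e. a probability distribution over $j$.

```python
import numpy as np

x = np.random.randn(5, 4)                         # five arbitrary vectors x_i (rows)
w = x @ x.T                                       # w[i, j] = x_i^T x_j
alpha = np.exp(w - w.max(axis=1, keepdims=True))  # softmax over j, computed stably
alpha /= alpha.sum(axis=1, keepdims=True)

print(alpha.sum(axis=1))                          # every row sums to 1 (up to float error)
```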

Q4.2

5 Points

In Seq2Seq models with $n$ words, let $h_1, h_2, \ldots, h_n \in \mathbb{R}^h$ be the encoder hidden states ($h$ is the embedding dimensionality) and $s_t \in \mathbb{R}^h$ be the decoder hidden state at step $t$. The attention scores for step $t$ are

$$e^t = [(s_t)^T h_1, \ldots, (s_t)^T h_n] \in \mathbb{R}^n.$$

Taking the softmax gives the attention distribution for this step: $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^n$. What is the formula for the attention output $a_t$?

Please type any math input using LaTeX format, enclosed by $$…$$.

Answer and Explanation:

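For reference, a NumPy sketch of the usual Seq2Seq attention output, namely the attention distribution used to take a weighted sum of the encoder hidden states; the dimensions below are made up, and whether this matches the expected answer format is an assumption.

```python
import numpy as np

n, hdim = 6, 10
H = np.random.randn(n, hdim)            # encoder hidden states h_1..h_n as rows
s_t = np.random.randn(hdim)             # decoder hidden state at step t

e_t = H @ s_t                           # scores: e^t[i] = (s_t)^T h_i, shape (n,)
alpha_t = np.exp(e_t - e_t.max())
alpha_t /= alpha_t.sum()                # attention distribution, shape (n,)

a_t = alpha_t @ H                       # weighted sum of encoder states, shape (hdim,)
# i.e. a_t = sum_i alpha^t_i * h_i
```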

Q5 GNNs

10 Points

Q5.1

5 Points

What are the two key operations used for updating a node representation in a GNN?

Aggregate and Combine.

Aggregate and Message.

Combine and Update.

Aggregate and Max Pooling.

Explanation:


Q5.2

5 Points

What is the difference between a node embedding and a node representation?

A node representation is a special case of a node embedding.

A node embedding is a special case of a node representation.

There is no difference between the two.

Explanation:


Q6 GNNs

10 Points

Q6.1

5 Points

Citation networks can be treated as graphs in which researchers cite fellow researchers’ papers. Given a paper “A”, we would like to predict whether it cites another paper “B”. Which type of prediction task would you model this as?

Node prediction.

Link prediction.

Graph prediction.

Subgraph prediction.

Explanation:


Q6.2

5 Points

Select the correct statement among the following:

The combine operation gathers information from all nodes, and the aggregate operation updates the collected information with the node’s own information.

The aggregate operation gathers information from all nodes, and the combine operation updates the collected information with the node’s own information.

The combine operation gathers information from the node’s neighboring nodes, and the aggregate operation updates the collected information with the node’s own information.

The aggregate operation gathers information from the node’s neighboring nodes, and the combine operation updates the collected information with the node’s own information.

Explanation:

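To make the aggregate/combine terminology in Q5.1 and Q6.2 concrete, here is a toy NumPy message-passing layer; the mean aggregator, the linear combine step, and the ReLU are arbitrary illustrative choices rather than the specific functions the course may have in mind.

```python
import numpy as np

def gnn_layer(H, adj, W_self, W_neigh):
    """One message-passing layer: AGGREGATE over neighbors, then COMBINE with self.

    H:   (num_nodes, d) current node representations
    adj: (num_nodes, num_nodes) binary adjacency matrix of an undirected graph
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    aggregated = (adj @ H) / deg                  # AGGREGATE: mean over each node's neighbors
    combined = H @ W_self + aggregated @ W_neigh  # COMBINE: mix self and neighbor information
    return np.maximum(combined, 0)                # nonlinearity

num_nodes, d = 5, 4
H = np.random.randn(num_nodes, d)
adj = (np.random.rand(num_nodes, num_nodes) < 0.4).astype(float)
adj = np.maximum(adj, adj.T)
np.fill_diagonal(adj, 0)                          # simple undirected graph, no self-loops
H_next = gnn_layer(H, adj, np.random.randn(d, d), np.random.randn(d, d))
```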

Q7 GNNs

10 Points

What is the Laplacian matrix for a graph with nodes {1, 2, 3, 4, 5} and edges {(1,5), (1,3), (2,3), (2,5), (3,4)}?

Answer and Explanation:

You can also upload a picture of your work:

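A short NumPy sketch of the standard construction $L = D - A$ (degree matrix minus adjacency matrix) for the graph in Q7; that the graph is undirected is assumed.

```python
import numpy as np

nodes = [1, 2, 3, 4, 5]
edges = [(1, 5), (1, 3), (2, 3), (2, 5), (3, 4)]

A = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1      # undirected adjacency matrix

D = np.diag(A.sum(axis=1))                     # degree matrix
L = D - A                                      # combinatorial graph Laplacian
print(L)
```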

Q8 RNNs

10 Points

Given an RNN defined by $s_t = W s_{t-1} + U x_t$ with

$$W = \begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}, \quad U = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad s_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

what is $s_2$ for $x = (x_1, x_2) = (1, 0)$?

Answer:

Explanation:

You can also upload a picture of your work:

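A sketch that simply iterates the recurrence $s_t = W s_{t-1} + U x_t$; the values of $W$, $U$, and $s_0$ below follow the reconstruction of the garbled matrices in the prompt above and should be checked against the original assignment.

```python
import numpy as np

W = np.array([[0., -1.],
              [-1., 0.]])
U = np.array([1., 1.])        # assumed reconstruction of U from the prompt
s = np.array([0., 0.])        # assumed s_0
x = [1.0, 0.0]                # x_1, x_2

for x_t in x:                 # s_t = W s_{t-1} + U x_t
    s = W @ s + U * x_t
print(s)                      # s_2
```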

Q9 Convolutional view of a linear RNN

10 Points

Q9.1

8 Points

Given an RNN defined by

$$s_t = W s_{t-1} + U x_t, \qquad y_t = C s_t, \qquad s_0 = U x_0,$$

where $s_t \in \mathbb{R}^2$, $x_t \in \mathbb{R}$, and $y_t \in \mathbb{R}$ denote the hidden state, input, and output of the RNN at timestep $t$, respectively, and

$$W = \begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}, \quad U = \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \quad C = \begin{pmatrix} -1 & 1 \end{pmatrix}.$$

This linear RNN can be written as a convolution:

$$y_t = K_t * x,$$

where $x = (x_0, x_1, \ldots, x_t)$ denotes the inputs to the RNN up to time $t$, $*$ denotes convolution, and $K_t$ denotes a convolutional kernel matrix of shape $1 \times (t + 1)$.

What are the entries of K2 such that the output y2 obtained by convolution is the same as the output obtained from the above recursive definition of the RNN?

Hint: This blog post might help:

https://huggingface.co/blog/lbourdois/get-on-the-ssm-train

Answer: K2 =

Explanation:

You can also upload a picture of your work:

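A sketch that unrolls the recurrence to read off the kernel: with $s_0 = U x_0$, unrolling gives $y_2 = C W^2 U\, x_0 + C W U\, x_1 + C U\, x_2$, so the entries of $K_2$ are the three scalars computed below. Their left-to-right order depends on the convolution convention, and the matrix values follow the reconstruction above.

```python
import numpy as np

W = np.array([[0., -1.],
              [-1., 0.]])
U = np.array([[-1.],
              [1.]])                     # column vector, as reconstructed from the prompt
C = np.array([[-1., 1.]])

# y_2 = C W^2 U x_0 + C W U x_1 + C U x_2  =>  candidate kernel coefficients:
coeffs = [(C @ np.linalg.matrix_power(W, k) @ U).item() for k in (2, 1, 0)]
print(coeffs)

# Sanity check against the recursive definition for arbitrary inputs:
x = np.random.randn(3)
s = U.ravel() * x[0]                     # s_0 = U x_0
for x_t in x[1:]:
    s = W @ s + U.ravel() * x_t          # s_t = W s_{t-1} + U x_t
y2_recursive = (C @ s).item()
y2_kernel = sum(c * xi for c, xi in zip(coeffs, x))
assert np.isclose(y2_recursive, y2_kernel)
```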

Q9.2

2 Points

Give one reason why we might want to implement a linear RNN using a convolution instead of recursion:


Q10 Parameter Efficient Tuning of Attention layers

10 Points

Q10.1

2.5 Points

Consider a multihead self attention block with h heads.

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O,$$

where $\mathrm{head}_i = \mathrm{Attention}(X W_i^Q, X W_i^K, X W_i^V)$.

Assume bias=False, $X \in \mathbb{R}^{N \times D}$, and $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{D \times d}$.

What should the shape of $W^O$ be, so that the output of the multihead self-attention has the same shape as the input, in terms of $h$, $d$, $D$, and $N$?

Answer:

Explanation:

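A shape-checking sketch of the block as written in Q10.1, with arbitrary small sizes; the concatenation step makes explicit which shape $W^O$ needs in order to map the output back to $(N, D)$.

```python
import numpy as np

N, D, d, h = 7, 12, 4, 3          # arbitrary small sizes for a shape check
X = np.random.randn(N, D)

def softmax(z):
    z = np.exp(z - z.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    Wq, Wk, Wv = (np.random.randn(D, d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # each (N, d)
    heads.append(softmax(Q @ K.T / np.sqrt(d)) @ V)   # each head: (N, d)

concat = np.concatenate(heads, axis=1)                # (N, h*d)
W_O = np.random.randn(h * d, D)                       # the shape W^O needs for an (N, D) output
out = concat @ W_O
assert out.shape == X.shape

# Parameters with bias=False: h heads x 3 projections of size D*d each,
# plus the h*d x D entries of W^O.
```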

Q10.2

2.5 Points

How many parameters are there in total, in terms of h, d, D, and N?

Answer:

Explanation:


Q10.3

2.5 Points

I have some data I want to finetune the multihead attention block on, but I don’t want to finetune all the parameters.

Specifically, I want to use LoRA to finetune the model. (LoRA is described here: https://arxiv.org/pdf/2106.09685)

For every trainable matrix $W$, I decompose it as

$$W = W_0 + \Delta W = W_0 + BA,$$

where $W_0$ holds the pretrained weights and is not trained, and $\Delta W = BA$ is a residual matrix with the same shape as $W$ that I do train, with $B \in \mathbb{R}^{D \times r}$ and $A \in \mathbb{R}^{r \times d}$. I decompose every query, key, and value projection matrix in the multihead attention block in this way, and train only the residual matrices. I leave $W^O$ frozen.

How many parameters are being trained in the above procedure, in terms of h, d, D, N, and r?

Answer:

Explanation:

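A sketch of the counting implied by the LoRA setup above: each of the $h$ heads has three projection matrices, each decomposed into $B \in \mathbb{R}^{D \times r}$ and $A \in \mathbb{R}^{r \times d}$, and only those factors are trained ($W^O$ and every $W_0$ stay frozen). The function name and the example sizes are made up.

```python
def lora_trainable_params(h: int, D: int, d: int, r: int) -> int:
    """Trainable parameters when every W^Q, W^K, W^V gets a B (D x r) and A (r x d) factor."""
    per_matrix = D * r + r * d      # parameters in one (B, A) pair
    return h * 3 * per_matrix       # 3 projections per head, h heads; W^O and all W_0 frozen

# Example with made-up sizes:
print(lora_trainable_params(h=8, D=512, d=64, r=4))
```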

Q10.4

2.5 Points

State two separate advantages of finetuning a model with LoRA:

