Note. Here are some things to keep in mind as you plan your time for this assignment. There are math questions again! The total amount of PyTorch code to write, and its complexity, are lower than in Assignment 4; however, you're also given less guidance and scaffolding in how to write the code. This assignment involves a pretraining step that takes approximately 2 hours to perform on Azure, and you'll have to do it twice.
This assignment is an investigation into Transformer self-attention building blocks and the effects of pretraining. It covers mathematical properties of Transformers and self-attention through written questions. Further, you'll get experience with practical system-building by repurposing an existing codebase. The assignment is split into a written (mathematical) part and a coding part, which has its own written questions. Here's a quick summary:
- Mathematical exploration: What kinds of operations can self-attention easily implement? Why should we use fancier things like multi-headed self-attention? This section will use some mathematical investigations to illuminate a few of the motivations of self-attention and Transformer networks. Note: for all questions, you should justify your answer with mathematical reasoning when required.
- Extending a research codebase: In this portion of the assignment, you'll get some experience and intuition for a cutting-edge research topic in NLP: teaching NLP models facts about the world through pretraining, and accessing that knowledge through finetuning. You'll train a Transformer model to attempt to answer simple questions of the form "Where was person [x] born?" without providing any input text from which to draw the answer. You'll find that models are able to learn some facts about where people were born through pretraining, and access that information during fine-tuning to answer the questions.
Then, you'll take a harder look at the system you built and reason about the implications and concerns of relying on such implicit pretrained knowledge.
This assignment was originally created by John Hewitt, CS 224N Head TA in Winter 2021.
1. Attention exploration (22 points)
Multi-headed self-attention is the core modeling component of Transformers. In this question, we'll get some practice working with the self-attention equations, and motivate why multi-headed self-attention can be preferable to single-headed self-attention.
Recall that attention can be viewed as an operation on a query $q \in \mathbb{R}^d$, a set of value vectors $\{v_1,\dots,v_n\}, v_i \in \mathbb{R}^d$, and a set of key vectors $\{k_1,\dots,k_n\}, k_i \in \mathbb{R}^d$, specified as follows:

$$c = \sum_{i=1}^{n} v_i \alpha_i \qquad (1)$$

$$\alpha_i = \frac{\exp(k_i^\top q)}{\sum_{j=1}^{n} \exp(k_j^\top q)} \qquad (2)$$

with $\alpha_i$ termed the attention weights. Observe that the output $c \in \mathbb{R}^d$ is an average over the value vectors weighted with respect to $\alpha_i$.
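To make these definitions concrete, here is a small PyTorch sanity check of Equations (1) and (2) on random toy vectors. It is purely illustrative and not part of the assignment code; the dimensions and tensors are chosen arbitrarily.

import torch

# Toy check of Equations (1) and (2): n = 3 key/value pairs of dimension d = 4.
d, n = 4, 3
q = torch.randn(d)
keys = torch.randn(n, d)    # rows are k_1, ..., k_n
values = torch.randn(n, d)  # rows are v_1, ..., v_n

alpha = torch.softmax(keys @ q, dim=0)  # attention weights, Eq. (2)
c = alpha @ values                      # weighted average of the values, Eq. (1)

print(alpha.sum())  # tensor(1.) -- the weights form a categorical distribution
print(c.shape)      # torch.Size([4])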
(a) (4 points) Copying in attention. One advantage of attention is that it's particularly easy to copy a value vector to the output c. In this problem, we'll motivate why this is the case.
i. (1 point) Explain why $\alpha$ can be interpreted as a categorical probability distribution.
ii. (2 points) The distribution $\alpha$ is typically relatively diffuse; the probability mass is spread out between many different $\alpha_i$. However, this is not always the case. Describe (in one sentence) under what conditions the categorical distribution $\alpha$ puts almost all of its weight on some $\alpha_j$, where $j \in \{1,\dots,n\}$ (i.e. $\alpha_j \gg \sum_{i \neq j} \alpha_i$). What must be true about the query $q$ and/or the keys $\{k_1,\dots,k_n\}$?
iii. (1 point) Under the conditions you gave in (ii), describe what properties the output $c$ might have.
iv. (1 point) Explain (in two sentences or fewer) what your answers to (ii) and (iii) mean intuitively.
(b) (7 points) An average of two. Instead of focusing on just one vector $v_j$, a Transformer model might want to incorporate information from multiple source vectors. Consider the case where we instead want to incorporate information from two vectors $v_a$ and $v_b$, with corresponding key vectors $k_a$ and $k_b$.
i. (3 points) How should we combine two d-dimensional vectors $v_a, v_b$ into one output vector $c$ in a way that preserves information from both vectors? In machine learning, one common way to do so is to take the average: $c = \frac{1}{2}(v_a + v_b)$. It might seem hard to extract information about the original vectors $v_a$ and $v_b$ from the resulting $c$, but under certain conditions one can do so. In this problem, we'll see why this is the case.
Suppose that although we don't know $v_a$ or $v_b$, we do know that $v_a$ lies in a subspace $A$ formed by the $m$ basis vectors $\{a_1, a_2, \dots, a_m\}$, while $v_b$ lies in a subspace $B$ formed by the $p$ basis vectors $\{b_1, b_2, \dots, b_p\}$. (This means that any $v_a$ can be expressed as a linear combination of its basis vectors, as can $v_b$. All basis vectors have norm 1 and are orthogonal to each other.) Additionally, suppose that the two subspaces are orthogonal; i.e. $a_j^\top b_k = 0$ for all $j, k$.
Using the basis vectors $\{a_1, a_2, \dots, a_m\}$, construct a matrix $M$ such that for arbitrary vectors $v_a \in A$ and $v_b \in B$, we can use $M$ to extract $v_a$ from the sum vector $s = v_a + v_b$. In other words, we want to construct $M$ such that for any $v_a, v_b$, $Ms = v_a$.
Note: both $M$ and $v_a, v_b$ should be expressed with respect to $\mathbb{R}^d$ (so $M$ is a $d \times d$ matrix and $v_a, v_b$ are $d$-dimensional vectors), not in terms of coordinates relative to the subspaces $A$ and $B$.
Hint: Given that the vectors $\{a_1, a_2, \dots, a_m\}$ are both orthogonal and form a basis for $v_a$, we know that there exist some $c_1, c_2, \dots, c_m$ such that $v_a = c_1 a_1 + c_2 a_2 + \dots + c_m a_m$. Can you create a vector of these weights $c$?
ii. (4 points) As before, let $v_a$ and $v_b$ be two value vectors corresponding to key vectors $k_a$ and $k_b$, respectively. Assume that (1) all key vectors are orthogonal, so $k_i^\top k_j = 0$ for all $i \neq j$; and (2) all key vectors have norm 1.[1] Find an expression for a query vector $q$ such that $c \approx \frac{1}{2}(v_a + v_b)$.
(c) (5 points) Drawbacks of single-headed attention: In the previous part, we saw how it was possible for single-headed attention to focus equally on two values. The same concept could easily be extended to any subset of values. In this question, we'll see why this is not a practical solution. Consider a set of key vectors $\{k_1,\dots,k_n\}$ that are now randomly sampled, $k_i \sim \mathcal{N}(\mu_i, \Sigma_i)$, where the means $\mu_i \in \mathbb{R}^d$ are known to you, but the covariances $\Sigma_i$ are unknown. Further, assume that the means $\mu_i$ are all perpendicular; $\mu_i^\top \mu_j = 0$ if $i \neq j$, and unit norm, $\|\mu_i\| = 1$.
i. (2 points) Assume that the covariance matrices are $\Sigma_i = \alpha I$, $\forall i \in \{1, 2, \dots, n\}$, for vanishingly small $\alpha$. Design a query $q$ in terms of the $\mu_i$ such that, as before, $c \approx \frac{1}{2}(v_a + v_b)$,[2] and provide a brief argument as to why it works.
ii. (3 points) Though single-headed attention is resistant to small perturbations in the keys, some types of larger perturbations may pose a bigger issue. Specifically, in some cases, one key vector $k_a$ may be larger or smaller in norm than the others, while still pointing in the same direction as $\mu_a$. As an example, let us consider a covariance for item $a$ of $\Sigma_a = \alpha I + \frac{1}{2}(\mu_a \mu_a^\top)$ for vanishingly small $\alpha$ (as shown in Figure 1). This causes $k_a$ to point in roughly the same direction as $\mu_a$, but with large variances in magnitude. Further, let $\Sigma_i = \alpha I$ for all $i \neq a$.
Figure 1: The vector $\mu_a$ (shown here in 2D as an example), with the range of possible values of $k_a$ shown in red. As mentioned previously, $k_a$ points in roughly the same direction as $\mu_a$, but may have a larger or smaller magnitude.
When you sample $\{k_1,\dots,k_n\}$ multiple times, and use the query vector $q$ that you defined in part i., what, qualitatively, do you expect the output $c$ to look like for different samples?
(d) (3 points) Benefits of multi-headed attention: Now we'll see some of the power of multi-headed attention. We'll consider a simple version of multi-headed attention which is identical to single-headed self-attention as we've presented it in this homework, except that two query vectors ($q_1$ and $q_2$) are defined, leading to a pair of vectors ($c_1$ and $c_2$), each the output of single-headed attention given its respective query vector. The final output of the multi-headed attention is their average, $c = \frac{1}{2}(c_1 + c_2)$.
As in question 1(c), consider a set of key vectors $\{k_1,\dots,k_n\}$ that are randomly sampled, $k_i \sim \mathcal{N}(\mu_i, \Sigma_i)$, where the means $\mu_i$ are known to you, but the covariances $\Sigma_i$ are unknown. Also as before, assume that the means $\mu_i$ are mutually orthogonal; $\mu_i^\top \mu_j = 0$ if $i \neq j$, and unit norm, $\|\mu_i\| = 1$.
i. (1 point) Assume that the covariance matrices are $\Sigma_i = \alpha I$, for vanishingly small $\alpha$. Design $q_1$ and $q_2$ such that $c$ is approximately equal to $\frac{1}{2}(v_a + v_b)$.
ii. (2 points) Assume that the covariance matrices are $\Sigma_a = \alpha I + \frac{1}{2}(\mu_a \mu_a^\top)$ for vanishingly small $\alpha$, and $\Sigma_i = \alpha I$ for all $i \neq a$. Take the query vectors $q_1$ and $q_2$ that you designed in part i. What, qualitatively, do you expect the output $c$ to look like across different samples of the key vectors? Please briefly explain why. You can ignore cases in which $q_i^\top k_a < 0$.
2. Pretrained Transformer models and knowledge access (35 points)
You'll train a Transformer to perform a task that involves accessing knowledge about the world, knowledge which isn't provided via the task's training data (at least if you want to generalize outside the training set). You'll find that it more or less fails entirely at the task. You'll then learn how to pretrain that Transformer on Wikipedia text that contains world knowledge, and find that finetuning that Transformer on the same knowledge-intensive task enables the model to access some of the knowledge learned at pretraining time. You'll find that this enables models to perform considerably above chance on a held-out development set.
The code you're provided with is a fork of Andrej Karpathy's minGPT. It's nicer than most research code in that it's relatively simple and transparent. The "GPT" in minGPT refers to the Transformer language model of OpenAI, originally described in this paper [1].
As in previous assignments, you will want to develop on your machine locally, then run training on Azure. You can use the same conda environment from previous assignments for local development, and the same process for training on Azure (see the CS224n Azure Guide for a refresher). Specifically, you'll still be running conda activate py37_pytorch on the Azure machine. You'll need around 5 hours for training, so budget your time accordingly!
Your work with this codebase is as follows:
(a) (0 points) Check out the demo.
In the mingpt-demo/ folder is a Jupyter notebook that trains and samples from a Transformer language model. Take a look at it (locally on your computer) to get somewhat familiar with how it defines and trains models. Some of the code you're writing below will be inspired by what you see in this notebook.
Note that you do not have to write any code or submit written answers for this part.
(b) (0 points) Read through NameDataset, our dataset for reading name/birthplace pairs.
The task we'll be working on with our pretrained models is attempting to access the birth place of a notable person, as written in their Wikipedia page. We'll think of this as a particularly simple form of question answering:
Q: Where was [person] born?
A: [place]
From now on, you'll be working with the src/ folder. The code in mingpt-demo/ won't be changed or evaluated for this assignment. In dataset.py, you'll find the class NameDataset, which reads a TSV (tab-separated values) file of name/place pairs and produces examples of the above form that we can feed to our Transformer model.
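For intuition only, the sketch below shows one way name/place rows from a TSV file could be turned into question-answering strings of the form above. It is a hypothetical simplification, not the provided NameDataset: the class in dataset.py defines the exact example format (including padding and masking characters) that the model expects, so follow that code rather than this sketch.

# Hypothetical illustration of turning TSV rows into QA-style strings; the real
# NameDataset in dataset.py defines the actual format used by the model.
def read_name_place_pairs(tsv_path: str):
    with open(tsv_path, encoding='utf-8') as f:
        for line in f:
            name, place = line.rstrip('\n').split('\t')
            yield f"Where was {name} born?\t{place}"

# Example usage (requires the TSV file to be present):
# for example in read_name_place_pairs('birth_places_train.tsv'):
#     print(example)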
To get a sense of the examples we'll be working with, if you run the following code, it'll load your NameDataset on the training set birth_places_train.tsv and print out a few examples.
python src/dataset.py namedata
Note that you do not have to write any code or submit written answers for this part.
(c) (0 points) Implement finetuning (without pretraining).
Take a look at run.py. It has some skeleton code specifying flags you'll eventually need to handle as command-line arguments. In particular, you might want to pretrain, finetune, or evaluate a model with this code. For now, we'll focus on the finetuning function, in the case without pretraining.
Taking inspiration from the training code in the play_char.ipynb file, write code to finetune a Transformer model on the name/birthplace dataset, via examples from the NameDataset class. For now, implement the case without pretraining (i.e. create a model from scratch and train it on the birthplace prediction task from part (b)). You'll have to modify two sections, marked [part c] in the code: one to initialize the model, and one to finetune it. Note that you only need to initialize the model in the case labeled "vanilla" for now (later, in section (g), we will explore a model variant). Use the hyperparameters for the Trainer specified in run.py.
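Below is a minimal sketch of what the vanilla, no-pretraining setup might look like, modeled on the training loop in play_char.ipynb. The import paths and every hyperparameter value here are placeholders (assumptions, not the required values); use the module names in src/ and the Trainer hyperparameters specified in run.py. Here train_dataset is the NameDataset instance built in run.py, and args comes from its argument parser.

import torch
from model import GPT, GPTConfig            # assumed module names; check src/
from trainer import Trainer, TrainerConfig  # assumed module names; check src/

# Build a model from scratch with placeholder sizes (use the values in run.py).
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=4, n_head=8, n_embd=256)
model = GPT(mconf)

# Finetune directly on the NameDataset examples (no pretrained weights loaded).
tconf = TrainerConfig(max_epochs=75, batch_size=256, learning_rate=6e-4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

# Save the finetuned parameters for the evaluation commands in part (d).
torch.save(model.state_dict(), args.writing_params_path)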
Also take a look at the evaluation code which has been implemented for you. It samples predictions from the trained model and calls evaluate_places() to get the total percentage of correct place predictions. You will run this code in part (d) to evaluate your trained models.
This is an intermediate step for later portions, including part (d), which contains commands you can run to check your implementation. No written answer is required for this part.
(d) (5 points) Make predictions (without pretraining).
Train your model on the names dataset birth_places_train.tsv, and evaluate on birth_dev.tsv. Specifically, you should now be able to run the following three commands:
# Train on the names dataset
python src/run.py finetune vanilla wiki.txt --writing_params_path vanilla.model.params --finetune_corpus_path birth_places_train.tsv
# Evaluate on the dev set, writing out predictions
python src/run.py evaluate vanilla wiki.txt --reading_params_path vanilla.model.params --eval_corpus_path birth_dev.tsv --outputs_path vanilla.nopretrain.dev.predictions
# Evaluate on the test set, writing out predictions
python src/run.py evaluate vanilla wiki.txt --reading_params_path vanilla.model.params --eval_corpus_path birth_test_inputs.tsv --outputs_path vanilla.nopretrain.test.predictions
Training will take less than 10 minutes (on Azure). Report your model's accuracy on the dev set (as printed by the second command above). Don't be surprised if it is well below 10%; we will be digging into why in Part 3. As a reference point, we also want to calculate the accuracy the model would have achieved if it had just predicted "London" as the birth place for everyone in the dev set. Fill in london_baseline.py to calculate the accuracy of that approach and report your result in your write-up. You should be able to leverage existing code such that the file is only a few lines long.
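A possible shape for london_baseline.py is sketched below. It assumes the same evaluate_places helper that the evaluation code in run.py calls (taking the path to the TSV of examples and a list of predicted places, and returning the total and correct counts); verify the helper's actual name, location, and signature in src/ before relying on this.

# Sketch of a 'predict London for everyone' baseline; assumes utils.evaluate_places
# behaves as described above -- check src/utils.py for the real API.
import utils

eval_path = 'birth_dev.tsv'
with open(eval_path, encoding='utf-8') as f:
    n_examples = len(f.readlines())

predictions = ['London'] * n_examples
total, correct = utils.evaluate_places(eval_path, predictions)
if total > 0:
    print(f'Correct: {correct} out of {total}: {correct / total * 100:.2f}%')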
(e) (10 points) Define a span corruption function for pretraining.
In the file src/dataset.py, implement the __getitem__() function for the dataset class CharCorruptionDataset. Follow the instructions provided in the comments in dataset.py. Span corruption is explored in the T5 paper [2]. It randomly selects spans of text in a document and replaces them with unique tokens (noising). Models take this noised text, and are required to output a pattern of each unique sentinel followed by the tokens that were replaced by that sentinel in the input. In this question, you'll implement a simplification that only masks out a single sequence of characters.
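As a toy illustration of the idea (and only the idea), the snippet below corrupts a single character span and appends it after a sentinel character. It is not the spec you must implement: CharCorruptionDataset in dataset.py has precise rules for truncation, span-length sampling, masking characters, and padding, and the sentinel character used here is just an example.

import random

MASK_CHAR = u"\u2047"  # example sentinel character, chosen for illustration only

def toy_corrupt(document: str) -> str:
    """Mask one random character span and append it after a sentinel (toy version)."""
    span_len = random.randint(1, max(1, len(document) // 4))
    start = random.randint(0, len(document) - span_len)
    masked = document[start:start + span_len]
    prefix, suffix = document[:start], document[start + span_len:]
    # Noised input, then the sentinel followed by the content that was masked out.
    return prefix + MASK_CHAR + suffix + MASK_CHAR + masked

print(toy_corrupt("Khatchig Mouradian. Khatchig Mouradian is a journalist."))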
This question will be graded via autograder based on whether your span corruption function implements some basic properties of our spec. We'll instantiate the CharCorruptionDataset with our own data, and draw examples from it.
To help you debug, if you run the following code, it'll sample a few examples from your CharCorruptionDataset on the pretraining dataset wiki.txt and print them out for you.
python src/dataset.py charcorruption
No written answer is required for this part.
(f) (10 points) Pretrain, finetune, and make predictions. Budget about 2 hours for training.
Now fill in the pretrain portion of run.py, which will pretrain a model on the span corruption task. Additionally, modify your finetune portion to handle finetuning in the case with pretraining. In particular, if a path to a pretrained model is provided in the bash command, load this model before finetuning it on the birthplace prediction task (a minimal sketch of this checkpoint handling appears after the commands below). Pretrain your model on wiki.txt (which should take approximately two hours), finetune it on NameDataset, and evaluate it. Specifically, you should be able to run the following four commands. (Don't be concerned if the loss appears to plateau in the middle of pretraining; it will eventually go back down.)
# Pretrain the model
python src/run.py pretrain vanilla wiki.txt --writing_params_path vanilla.pretrain.params
# Finetune the model
python src/run.py finetune vanilla wiki.txt --reading_params_path vanilla.pretrain.params --writing_params_path vanilla.finetune.params --finetune_corpus_path birth_places_train.tsv
# Evaluate on the dev set; write to disk
python src/run.py evaluate vanilla wiki.txt --reading_params_path vanilla.finetune.params --eval_corpus_path birth_dev.tsv --outputs_path vanilla.pretrain.dev.predictions
# Evaluate on the test set; write to disk
python src/run.py evaluate vanilla wiki.txt --reading_params_path vanilla.finetune.params --eval_corpus_path birth_test_inputs.tsv --outputs_path vanilla.pretrain.test.predictions
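The following sketch shows the checkpoint handling referred to above, assuming the standard PyTorch state_dict save/load pattern; the variable names (model, args) mirror run.py's skeleton and are otherwise placeholders.

import torch

def maybe_load_pretrained(model: torch.nn.Module, reading_params_path: str = None) -> torch.nn.Module:
    """If a pretrained checkpoint path was supplied, load its weights into the model."""
    if reading_params_path is not None:
        model.load_state_dict(torch.load(reading_params_path))
    return model

# In the pretrain routine, after training:
#     torch.save(model.state_dict(), args.writing_params_path)
# In the finetune routine, before training:
#     model = maybe_load_pretrained(model, args.reading_params_path)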
Report the accuracy on the dev set (printed by the third command above). We expect the dev accuracy to be at least 10%, and expect a similar accuracy on the held-out test set.
(g) (10 points) Research! Write and try out the synthesizer variant (budget 2 hours for pretraining!).
We'll now change the Transformer architecture itself, specifically the self-attention module. While we've been using a self-attention scoring function based on dot products, this involves a rather intensive computation that's quadratic in the sequence length, because the dot product is computed between every pair of word vectors in the sequence. Synthesized attention [3] is a very recent alternative that has potential benefits by removing this dot product (and quadratic computation) entirely. It's a promising idea, and one way for us to ask, "What's important/right about the Transformer architecture, and where can we improve/prune aspects of it?" In attention.py, implement the forward() method of SynthesizerAttention, which implements a variant of the Synthesizer proposed in the cited paper.
The provided CausalSelfAttention implements the following attention for each head of the multi-headed attention: Let $X \in \mathbb{R}^{\ell \times d}$ (where $\ell$ is the block size and $d$ is the total dimensionality; $d/h$ is the dimensionality per head).[3] Let $Q_i, K_i, V_i \in \mathbb{R}^{d \times d/h}$. Then the output of the self-attention head is

$$Y_i = \mathrm{softmax}\!\left(\frac{(XQ_i)(XK_i)^\top}{\sqrt{d/h}}\right)(XV_i) \qquad (3)$$

where $Y_i \in \mathbb{R}^{\ell \times d/h}$. Then the output of the self-attention is a linear transformation of the concatenation of the heads:

$$Y = [Y_1; \dots; Y_h]A \qquad (4)$$

where $A \in \mathbb{R}^{d \times d}$ and $[Y_1; \dots; Y_h] \in \mathbb{R}^{\ell \times d}$. The code also includes dropout layers which we haven't written here. We suggest looking at the provided code and noting how this equation is implemented in PyTorch.
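As a shape-level illustration of Equations (3) and (4), here is a small PyTorch snippet for a single head and a single (batchless) input; the provided CausalSelfAttention additionally applies the causal mask, dropout, and batching, so treat this only as a reading aid. All dimensions below are arbitrary.

import math
import torch
import torch.nn.functional as F

l, d, h = 8, 16, 4                    # block size, model dim, number of heads (arbitrary)
X = torch.randn(l, d)
Q, K, V = (torch.randn(d, d // h) for _ in range(3))  # per-head projections
A = torch.randn(d, d)                 # output projection applied after concatenating heads

scores = (X @ Q) @ (X @ K).T / math.sqrt(d / h)       # l x l attention scores
Y_i = F.softmax(scores, dim=-1) @ (X @ V)             # Eq. (3): l x (d/h)
print(Y_i.shape)                                      # torch.Size([8, 4])
# Eq. (4): concatenate the h head outputs along the last dimension and multiply by A.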
Your job is to implement the following variant of attention. Instead of Equation 3, implement the following in SynthesizerAttention:
$$Y_i = \mathrm{softmax}\!\left(\mathrm{ReLU}(XA_i + b_1)B_i + b_2\right)(XV_i) \qquad (5)$$
where $A_i \in \mathbb{R}^{d \times d/h}$, $B_i \in \mathbb{R}^{d/h \times \ell}$, and $V_i \in \mathbb{R}^{d \times d/h}$.[4] One way to interpret this is as follows: the term $(XQ_i)(XK_i)^\top$ is an $\ell \times \ell$ matrix of attention scores, computed as all pairs of dot products between word embeddings. The synthesizer variant eschews the all-pairs dot product and directly computes the $\ell \times \ell$ matrix of attention scores by mapping each $d$-dimensional vector of each head of $X$ to an $\ell$-dimensional vector of unnormalized attention weights.
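And here is the analogous shape-level sketch for Equation (5), again for a single head and with arbitrary dimensions; the real SynthesizerAttention must also handle batching, the causal mask, dropout, and the block-size bookkeeping in the provided code.

import torch
import torch.nn.functional as F

l, d, h = 8, 16, 4
X = torch.randn(l, d)
A_i = torch.randn(d, d // h)
b1 = torch.zeros(d // h)
B_i = torch.randn(d // h, l)
b2 = torch.zeros(l)
V_i = torch.randn(d, d // h)

scores = F.relu(X @ A_i + b1) @ B_i + b2     # l x l synthesized attention scores (no QK^T)
Y_i = F.softmax(scores, dim=-1) @ (X @ V_i)  # l x (d/h), analogous to Eq. (3)
print(Y_i.shape)                             # torch.Size([8, 4])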
In the rest of the code in the src/ folder, modify your model to support using either CausalSelfAttention or SynthesizerAttention. Add the ability to switch between these attention variants depending on whether vanilla (for causal self-attention) or synthesizer (for the synthesizer variant) is selected in the command line arguments (see the section marked [part g] in src/run.py). You are free to implement this functionality in any way you choose, so long as it supports these command line arguments.
Below are bash commands that your code should support in order to pretrain the model, finetune it, and make predictions on the dev and test sets. Note that the pretraining process will take approximately 2 hours.
# Pretrain the model
python src/run.py pretrain synthesizer wiki.txt --writing_params_path synthesizer.pretrain.params
# Finetune the model
python src/run.py finetune synthesizer wiki.txt --reading_params_path synthesizer.pretrain.params --writing_params_path synthesizer.finetune.params --finetune_corpus_path birth_places_train.tsv
# Evaluate on the dev set; write to disk
python src/run.py evaluate synthesizer wiki.txt --reading_params_path synthesizer.finetune.params --eval_corpus_path birth_dev.tsv --outputs_path synthesizer.pretrain.dev.predictions
# Evaluate on the test set; write to disk
python src/run.py evaluate synthesizer wiki.txt --reading_params_path synthesizer.finetune.params --eval_corpus_path birth_test_inputs.tsv --outputs_path synthesizer.pretrain.test.predictions
Report the accuracy of your synthesizer attention model on birthplace prediction on birth_dev.tsv after pretraining and fine-tuning.
i. (8 points) We'll score your model on whether it gets at least 5% accuracy on the test set, whose answers are held out.
ii. (2 points) Why might the synthesizer self-attention not be able to do, in a single layer, what the key-query-value self-attention can do?
3. Considerations in pretrained knowledge (5 points)
Please type the answers to these written questions (to make the TAs' lives easier).
(a) (1 point) Succinctly explain why the pretrained (vanilla) model was able to achieve an accuracy of above 10%, whereas the non-pretrained model was not.
(b) (2 points) Take a look at some of the correct predictions of the pretrain+finetuned vanilla model, as well as some of the errors. We think you'll find that it's impossible to tell, just by looking at the output, whether the model retrieved the correct birth place or made up an incorrect birth place. Consider the implications of this for user-facing systems that involve pretrained NLP components. Come up with two distinct reasons why this model behavior (i.e., being unable to tell whether the answer was retrieved or made up) may cause concern for such applications, and give an example for each reason.
(c) (2 points) If your model didn't see a person's name at pretraining time, and that person was not seen at fine-tuning time either, it is not possible for it to have learned where they were born. Yet, your model will produce something as a predicted birth place for that person's name if asked. Concisely describe a strategy your model might take for predicting a birth place for that person's name, and one reason why this should cause concern for the use of such applications. (You do not need to submit the same answer for 3c as for 3b.)
Submission Instructions
You will submit this assignment on GradeScope as two submissions: one for Assignment 5 [coding] and another for Assignment 5 [written]:
- Verify that the following files exist at these specified paths within your assignment directory:
- The no-pretraining model and predictions: vanilla.model.params, vanilla.nopretrain.dev.predictions, vanilla.nopretrain.test.predictions
- The pretrain-finetune model and predictions: vanilla.finetune.params, vanilla.pretrain.dev.predictions, vanilla.pretrain.test.predictions
- The synthesizer model and predictions: synthesizer.finetune.params, synthesizer.pretrain.dev.predictions, synthesizer.pretrain.test.predictions
- Run the collect_submission.sh script to produce your assignment5.zip file.
- Upload your assignment5.zip file to GradeScope to Assignment 5 [coding].
- Check that the public autograder tests passed correctly.
- Upload your written solutions, for questions 1, parts of 2, and 3, to GradeScope to Assignment 5 [written]. Tag it properly!
References
[1] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding with unsupervised learning. Technical report, OpenAI (2018).
[2] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1-67.
[3] Tay, Y., Bahri, D., Metzler, D., Juan, D.-C., Zhao, Z., and Zheng, C. Synthesizer: Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743 (2020).
[1] Recall that a vector $x$ has norm 1 iff $x^\top x = 1$.
[2] Hint: while the softmax function will never exactly average the two vectors, you can get close by using a large scalar multiple in the expression.
[3] Note that these dimensionalities do not include the minibatch dimension.
[4] Hint: copy over the CausalSelfAttention class, and modify it minimally for this.