CS 505 Homework 04: Classification
Due Friday 10/27 at midnight (1 minute after 11:59 pm) in Gradescope (with a grace period of 6 hours).
You may submit the homework up to 24 hours late (with the same grace period) for a penalty of 10%.
All homeworks will be scored with a maximum of 100 points; point values are given for individual problems, and if parts of problems do not have point values given, they will be counted equally toward the total for that problem.
Note: I strongly recommend you work in Google Colab (the free version) to complete homeworks in this class; in addition to (probably) being faster than your laptop, all the necessary libraries will already be available to you, and you don't have to hassle with conda, pip, etc. and resolving problems when the install doesn't work. But it is up to you! You should go through the necessary tutorials listed on the web site concerning Colab and storing files on a Google Drive. And of course, Dr. Google is always ready to help you resolve your problems.
I will post a "walk-through" video ASAP on my YouTube channel.
Submission Instructions
You must complete the homework by editing this notebook and submitting the following two files in Gradescope by the due date and time:
1. A file HW04.ipynb (be sure to select Kernel -> Restart and Run All before you submit, to make sure everything works); and
2. A file HW04.pdf created from the previous.
For best results obtaining a clean PDF file on the Mac, select File -> Print Preview from the Jupyter window, then choose File -> Print in your browser and then Save as PDF. Something similar should be possible on a Windows machine; just make sure it is readable and no cell contents have been cut off. Make it easy to grade!
The date and time of your submission is that of the last file you submitted, so if your IPYNB file is submitted on time but your PDF is late, then your submission is late.
Collaborators (5 pts)
Describe briefly but precisely:
1. Any persons you discussed this homework with and the nature of the discussion;
2. Any online resources you consulted and what information you got from those resources; and
3. Any AI agents (such as ChatGPT or Copilot) or other applications you used to complete the homework, and the nature of the help you received.
A few brief sentences is all that I am looking for here.
I learned about tokenization and model training from the PyTorch and spaCy documentation, and how to use the relevant machine learning models from the sklearn documentation.
import math
import re
from collections import defaultdict, Counter

import numpy as np
from numpy.random import shuffle, seed, choice
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import datasets, transforms
from torchvision.datasets import MNIST
import torchvision.transforms as T

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
Problem One: Exploring Shakespeare's Plays with PCA (45 pts)
In this problem, we will use Principal Components Analysis to look at Shakespeare's plays, as we discussed with a very different play/movie in lecture. Along the way, we shall use the tokenizer and the TF-IDF vectorizer from sklearn, a common machine learning library.
Note: There is a library for text analysis in Pytorch called Torchtext; however, in my view it is less well-developed and less well-supported than the rest of Pytorch, so we shall use sklearn for this problem.
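To make concrete what these sklearn vectorizers produce before applying them to the plays, here is a tiny illustrative sketch on a made-up three-document corpus (the toy sentences are not part of the assignment data):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = [
    "to be or not to be",
    "all the world is a stage",
    "the play is the thing",
]

cv = CountVectorizer()
counts = cv.fit_transform(toy_docs)        # sparse document-term matrix of raw counts
print(cv.get_feature_names_out())
print(counts.toarray())

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_docs)    # same shape, but TF-IDF weighted
print(weights.toarray().round(2))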
Part A: Reading and exploring the data (5 pts)
The cells below read in three files and convert them to numpy arrays (I prefer to work with the arrays rather than with pandas functions, but it is your choice).
1. The file shakespeare_plays.csv contains lines from William Shakespeare's plays. The second column of the file contains the name of the play, the third the name of the player (or the indication <Stage Direction>), and the fourth the line spoken.
2. The file play_attributes.csv stores the genres and chronology of Shakespeare's plays; the first column is the name of the play, the second the genre, and the third its order in a chronological listing of when it was first performed. The plays are in the same (arbitrary) order as in the first file.
3. The file player_genders.csv stores the name of a major character (defined somewhat arbitrarily as one whose total lines contain more than 1400 characters) in the first column and their gender in the second.
For each of the arrays, print out the shape and the first line.
plays_array = pd.read_csv('https://www.cs.bu.edu/fac/snyder/cs505/shakespeare_plays.csv').to_numpy()
print(plays_array.shape, plays_array[0])

player_genders_array = pd.read_csv('https://www.cs.bu.edu/fac/snyder/cs505/player_genders.csv').to_numpy()
print(player_genders_array.shape, player_genders_array[0])

play_attributes_array = pd.read_csv('https://www.cs.bu.edu/fac/snyder/cs505/play_attributes.csv').to_numpy()
print(play_attributes_array.shape, play_attributes_array[0])
(111582, 4) [1 'Henry IV Part 1' '<Stage Direction>' 'ACT I']
(398, 2) ['AARON' 'male']
(36, 3) ['Henry IV Part 1' 'History' 15]
Part B: Visualizing the Plays (8 pts)
1. Create an array containing 36 strings, each being the concatenation of all lines spoken in that play. Be sure NOT to include stage directions! You may wish to create an appropriate dictionary as an intermediate step.
2. Create a document-term matrix where each row represents a play and each column represents a term used in that play. Each entry in this matrix represents the number of times a particular word (defined by the column) occurs in a particular play (defined by the row). Use CountVectorizer in sklearn to create the matrix. Keep the rows in the same order as in the original files in order to associate play names with terms correctly.
3. From this matrix, use TruncatedSVD in sklearn to create a 2-dimensional representation of each play. Try to make it as similar as possible to the illustration below, including (i) an appropriate title, (ii) the name of each play, followed by its chronological order, and (iii) different colors for each genre. Use a figsize of (8,8) and a fontsize of 6 to provide the best visibility. You can follow the tutorial here to create the visualization (look at the "PCA" part).
4. Now do the same thing all over again, but with TF-IDF counts (using TfidfVectorizer in sklearn).
5. Answer the following in a few sentences: What plays are similar to each other? Do they match the grouping of Shakespeare's plays into comedies, histories, and tragedies here? Which plays are outliers (separated from the others in the same genre)? Did either TF or TF-IDF provide better insights?
genres_to_colors = {
    "History": "blue",
    "Comedy": "green",
    "Tragedy": "red",
}

def visualize_pca_plays(reduced, title):
    plt.figure(figsize=(8, 8))
    plt.title(title)
    for i, (play, genre, _) in enumerate(play_attributes_array):
        plt.plot(reduced[i][0], reduced[i][1], 'o', c=genres_to_colors[genre])
        plt.text(reduced[i][0], reduced[i][1], play, fontdict={"fontsize": 6})
    plt.show()

plays_to_lines = defaultdict(list)
for _, play, player, line in plays_array:
    if player == "<Stage Direction>":
        continue
    plays_to_lines[play].append(line)

strings = []
for play, _, _ in play_attributes_array:
    strings.append(" ".join(plays_to_lines[play]))

# Raw term counts (TF)
svd = TruncatedSVD()          # default n_components=2
cv = CountVectorizer()
doc_term_mat = cv.fit_transform(strings)
reduced = svd.fit_transform(doc_term_mat)
visualize_pca_plays(reduced, "Shakespeare Plays Visualized with PCA (TF)")

# TF-IDF weights
tfidf = TfidfVectorizer()
doc_term_mat = tfidf.fit_transform(strings)
reduced = svd.fit_transform(doc_term_mat)
visualize_pca_plays(reduced, "Shakespeare Plays Visualized with PCA (TF-IDF)")
Plays of the same genre are more similar to each other: in the PCA plot, points of the same color (plays of the same genre) cluster together on the two-dimensional plane. The Tempest, A Midsummer Night's Dream, and Love's Labour's Lost are outliers. I think TF provided better insights than TF-IDF because the genre clusters are more clearly separated in the TF plot.
Part C: Visualizing the Players (8 pts)
Now you must repeat this same kind of visualization, but instead of visualizing plays, you must visualize players. The process will be essentially the same, starting with an array of strings representing the lines spoken by each player. Use one of TF or TF-IDF, and use different colors for the genders.
Use a figsize of (8,8) and a fontsize of 4 to make this a bit more visible.
Again, comment on what you observe (it will not be as satisfying as the previous part).
genders_to_colors = {
    "male": "blue",
    "female": "red",
}
def visualize_pca_players(reduced, title):
    plt.figure(figsize=(8, 8))
    plt.title(title)
    for i, (player, gender) in enumerate(player_genders_array):
        plt.plot(reduced[i][0], reduced[i][1], 'o', c=genders_to_colors[gender])
        plt.text(reduced[i][0], reduced[i][1], player, fontdict={"fontsize": 4})
    plt.show()

players_to_lines = defaultdict(list)
for _, _, player, line in plays_array:
    if player == "<Stage Direction>":
        continue
    players_to_lines[player].append(line)

strings = []
for player, _ in player_genders_array:
    strings.append(" ".join(players_to_lines[player]))

doc_term_mat = tfidf.fit_transform(strings)
reduced = svd.fit_transform(doc_term_mat)
visualize_pca_players(reduced, "Shakespeare Players Visualized with PCA (TF-IDF)")
From the PCA visualization, the red (female) and blue (male) points are mixed together and cannot be distinguished, so there does not seem to be much difference between the lines spoken by male and female characters.
Part D: DIY Word Embeddings (8 pts)
In this part you will create a word-word matrix where each row (and each column) represents a word in the vocabulary. Each entry in this matrix represents the number of times a particular word (defined by the row) co-occurs with another word (defined by the column) in a sentence (i.e., a line in plays). Using the row word vectors, create a document-term matrix which represents a play as the average of all the word vectors in the play.
Display the plays using TruncatedSVD as you did previously.
Again, comment on what you observe: how different is this from the first visualization?
Notes:
1. Remove the punctuation marks . , ; : ? ! but leave single quotes.
2. One way to proceed is to create a nested dictionary mapping each word to a dictionary of the frequencies of words that occur in the same line, then from this to create the sparse matrix which is used to create the average document-term matrix which is input to TruncatedSVD.
3. If you have trouble with the amount of memory necessary, you may wish to eliminate "stop words" and then isolate some number (say, 5000) of the remaining most common words, and build your visualization on that instead of the complete vocabulary (a rough sketch of this vocabulary trimming appears after these notes).
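Here is an optional sketch of the vocabulary-trimming idea in note 3, assuming plays_array from Part A and the imports at the top of the notebook (re, numpy, Counter); the names TOP_K, top_vocab, and cooc are illustrative choices, and you are free to organize this differently:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

nltk_stops = set(stopwords.words('english'))
TOP_K = 5000                      # illustrative cutoff on the vocabulary size

def clean(line):
    # remove . , ; : ? ! but keep single quotes; drop stop words
    return [w for w in re.sub(r"[.,;:?!]", "", line).split(" ")
            if w and w.lower() not in nltk_stops]

# Keep only the TOP_K most common non-stop words
word_counts = Counter()
for _, _, player, line in plays_array:
    if player != "<Stage Direction>":
        word_counts.update(clean(line))
top_vocab = {w for w, _ in word_counts.most_common(TOP_K)}
word_to_idx = {w: i for i, w in enumerate(sorted(top_vocab))}

# Symmetric within-line co-occurrence counts over the reduced vocabulary
V = len(top_vocab)
cooc = np.zeros((V, V))
for _, _, player, line in plays_array:
    if player == "<Stage Direction>":
        continue
    tokens = [w for w in clean(line) if w in top_vocab]
    for i, wi in enumerate(tokens):
        for wj in tokens[i + 1:]:
            cooc[word_to_idx[wi]][word_to_idx[wj]] += 1
            cooc[word_to_idx[wj]][word_to_idx[wi]] += 1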
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

# Count word frequencies across all spoken lines (punctuation removed)
dictionary = defaultdict(int)
for _, _, player, line in plays_array:
    if player == "<Stage Direction>":
        continue
    words = line.split(" ")
    words = list(filter(None, map(lambda x: re.sub(r"[.,;:?!]", "", x), words)))
    for word in words:
        dictionary[word] += 1

# Treat very frequent words as additional stop words
my_stops = set()
for word, count in dictionary.items():
    if count > 3000:
        my_stops.add(word)
stops = stops | my_stops

vocab = set(dictionary.keys()) - stops
N = len(vocab)
words_to_idx = {}
all_lines = []
players_to_words = defaultdict(lambda: defaultdict(int))

for _, _, player, line in plays_array:
    if player == "<Stage Direction>":
        continue
    words = line.split(" ")
    words = list(filter(None, map(lambda x: re.sub(r"[.,;:?!]", "", x), words)))
    all_lines.append(words)
    for word in words:
        if word not in vocab:
            continue
        players_to_words[player][word] += 1

for i, word in enumerate(vocab):
    words_to_idx[word] = i

# Word-word co-occurrence counts within a line
word_word_mat = np.zeros((N, N))
for line in all_lines:
    n = len(line)
    for i in range(n):
        word_i = line[i]
        if word_i not in vocab:
            continue
        for j in range(i + 1, n):
            word_j = line[j]
            if word_j not in vocab:
                continue
            word_word_mat[words_to_idx[word_i]][words_to_idx[word_j]] += 1

# Represent each player as the frequency-weighted average of their words' co-occurrence vectors
doc_term_mat = []
for player, _ in player_genders_array:
    vec = np.zeros((N,))
    words_to_nums = players_to_words[player]
    n = 0
    for word, number in words_to_nums.items():
        vec += word_word_mat[words_to_idx[word]] * number
        n += number
    vec /= n
    doc_term_mat.append(vec)
doc_term_mat = np.array(doc_term_mat)

reduced = svd.fit_transform(doc_term_mat)
visualize_pca_players(reduced, "Shakespeare Players Visualized with PCA (DIY)")
Compared to the first player visualization, the points are more concentrated on the two-dimensional plane, and several blue (male) points lie farther away from the rest.
Part E: Visualizing the Plays using Word2Vec Word Embeddings (8 pts)
Now we will do the play visualization using word embeddings created by Gensim's Word2Vec, which can create word embeddings just as you did in the previous part, but using better algorithms.
You can read about how to use Word2Vec and get template code here: https://radimrehurek.com/gensim/models/word2vec.html
I strongly recommend you follow the directions for creating the model, then using
KeyedVectors to avoid recomputing the model each time.
Experiment with the window (say 5) and the min_count (try in the range 1 – 5) parameters to get the best results.
Display the plays using PCA instead of TruncatedSVD .
Again, comment on what you observe: how different is this from the other visualizations?
from gensim.models import Word2Vec

# min_count=1 shown here; the prompt suggests experimenting with values in the range 1-5
model = Word2Vec(sentences=all_lines, vector_size=100, window=5, min_count=1)
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")
from gensim.models import KeyedVectors

wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')

plays_to_words = defaultdict(list)
for _, play, player, line in plays_array:
    if player == "<Stage Direction>":
        continue
    words = line.split(" ")
    words = list(filter(None, map(lambda x: re.sub(r"[.,;:?!]", "", x), words)))
    plays_to_words[play] += words

# Represent each play as the average of its words' Word2Vec vectors
doc_term_mat = []
for play, _, _ in play_attributes_array:
    doc_term_mat.append(np.average(np.array(list(map(lambda x: wv[x], plays_to_words[play]))), axis=0))
doc_term_mat = np.array(doc_term_mat)

pca = PCA()
reduced = pca.fit_transform(doc_term_mat)
visualize_pca_plays(reduced, "Shakespeare Plays Visualized with PCA (Word2Vec)")
Unlike the previous two results, in the Word2Vec PCA plot the blue (History) points fall mostly on the left and the green (Comedy) points mostly on the right.
Part F: Visualizing the Players using Word2Vec Word Embeddings (8 pts)
Now you must repeat Part C, but using these Word2Vec embeddings. Use a figsize of (8,8) and a fontsize of 4 to make this a bit more visible.
Again, comment on what you observe. How is this different from what you saw in Part C?
doc_term_mat = []
for player, _ in player_genders_array:
    vec = 0
    words_to_nums = players_to_words[player]
    n = 0
    for word, number in words_to_nums.items():
        vec = vec + wv[word] * number
        n += number
    vec /= n
    doc_term_mat.append(vec)
doc_term_mat = np.array(doc_term_mat)
reduced = svd.fit_transform(doc_term_mat)
visualize_pca_players(reduced, "Shakespeare Players Visualized with PCA (Word2Vec)")
The Word2Vec player visualization is basically consistent with the DIY embedding result from Part D.
Problem Two: Classifying Text with a Feed-Forward Neural Network (50 pts)
In this problem, you must create an FFNN in Pytorch to classify emails from the Enron dataset as to whether they are spam or not spam ("ham"). For this problem, we will use GloVe pretrained embeddings. The dataset and the embeddings are in the following location:
https://drive.google.com/drive/folders/1cHR4VJuuN2tEpSkT3bOaGkOJrvIV–lSR?usp=sharing
(You can also download the embeddings yourself from the web; but the dataset is one created just for this problem.)
Part A: Prepare the Data (10 pts)
Compute the features of each email (the vector of 100 floats input to the NN) as the average of the word vectors of the words it contains.
Just like the previous problem, we compute the 'representation' of each message, i.e. the vector, by averaging word vectors; but this time, we are using GloVe word embeddings instead. Specifically, we use the 'glove.6B.100d' embeddings to obtain a vector for each word of each message, as long as the word is in the 'glove.6B.100d' embedding space.
Here are the steps to follow:
1. Get a basic idea of how GloVe provides pre-trained word embeddings (vectors).
2. Download and extract the word vectors from 'glove.6B.100d'.
3. Tokenize the messages (spacy is a good choice) and compute the message vectors by averaging the vectors of the words in the message. You will need to test whether a word is in the model (e.g., something like if str(word) in glove_model ...) and ignore any words which have no embeddings. A rough sketch of steps 2-3 appears after this list.
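To make step 3 concrete, here is a minimal sketch of the GloVe-averaging pipeline, assuming glove.6B.100d.txt has been extracted to the working directory and spaCy's en_core_web_sm model is installed; the names glove_model and message_vector are illustrative, not required by the assignment:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

# word -> 100-dimensional vector
glove_model = {}
with open("glove.6B.100d.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove_model[parts[0]] = np.array(parts[1:], dtype=np.float32)

def message_vector(text):
    # average the GloVe vectors of the tokens that have embeddings; ignore the rest
    vecs = [glove_model[str(word)] for word in nlp(text) if str(word) in glove_model]
    if not vecs:
        return np.zeros(100, dtype=np.float32)   # message with no known words
    return np.mean(vecs, axis=0)

# e.g. message_vector("Congratulations, you have won a free prize!")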
Part B: Create the DataLoader (15 pts)
Now you must separate the data set into training, validation, and testing sets, and build a Dataset and DataLoader for each that can feed data to train your model with Pytorch.
Use a train-validation-test split of 80%-10%-10%. You can experiment with different batch sizes, starting with 64.
Hints:
1. Make sure __init__, __len__, and __getitem__ of your defined dataset are implemented properly. In particular, __getitem__ should return the specified message vector and its label.
2. Don't compute the message vector inside __getitem__, otherwise the training process will slow down A LOT. Calculate the vectors in an array before creating the data loader in the next step.
3. The data in the .csv is randomized, so you don't need to shuffle when doing the split.
import spacy

nlp = spacy.load("en_core_web_sm")

# Load the GloVe vectors into a dictionary: word -> 100-dimensional vector
embeddings = {}
with open("./glove.6B/glove.6B.100d.txt", "r") as f:
    lines = f.readlines()
for line in lines:
    word_embedding = line.split(" ")
    word = word_embedding[0]
    embedding = np.array(list(map(lambda x: float(x), word_embedding[1:])))
    embeddings[word] = embedding
class MyDataset(Dataset):
    def __init__(self, file_path) -> None:
        df = pd.read_csv(file_path)
        self._raw_data = []
        n = len(df)
        # Precompute the message vectors so __getitem__ stays fast
        for i in range(n):
            message = df.iloc[i]["Message"]
            message = nlp(message)
            word_vecs = []
            for word in message:
                if word.text in embeddings.keys():
                    word_vecs.append(embeddings[word.text])
            vec = np.array(word_vecs)
            vec = np.average(vec, axis=0)
            self._raw_data.append((vec, df.iloc[i]["Spam"]))

    def __len__(self) -> int:
        return len(self._raw_data)

    def __getitem__(self, index) -> tuple[np.ndarray, int]:
        return self._raw_data[index]
dataset = MyDataset("./data_pa5/enron_spam_ham.csv")

seed = 42
torch.manual_seed(seed)  # set the PyTorch random seed
np.random.seed(seed)     # set the NumPy random seed

batch_size = 64
trainset, devset, testset = random_split(dataset, [0.8, 0.1, 0.1],
                                         generator=torch.Generator().manual_seed(seed))
train_dataloader = DataLoader(trainset, batch_size=batch_size)
dev_dataloader = DataLoader(devset, batch_size=batch_size)
test_dataloader = DataLoader(testset, batch_size=batch_size)
Part C: Build the neural net model (25 pts)
Once the data is ready, we need to design and implement our neural network model. The model does not need to be complicated. An example structure could be:
1. a linear layer 100 x 15
2. a ReLU activation layer
3. a linear layer 15 x 2
But feel free to test out other possible combinations of linear layers & activation functions and see whether they make a significant difference to the model performance later.
In order to perform "early stopping," you must keep track of the best validation score as you go through the epochs, and save the best model generated so far; then use the model which existed when the validation score was at its best to do the testing. (This could also be the model which is deployed, although we won't worry about that.) Read about torch.save(...) and torch.load(...) to do this.
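A minimal sketch of this save-and-restore pattern, assuming the model, optimizer, loss_fn, data loaders, and devset are defined as in the cells below; the patience value, epoch count, and the file name best_model.pt are illustrative choices, not requirements:

best_val_acc = 0.0
patience, epochs_without_improvement = 5, 0

for epoch in range(100):
    # one training pass over the training set
    model.train()
    for x, y in train_dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # validation accuracy for this epoch
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in dev_dataloader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    val_acc = correct / len(devset)

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")   # best model so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                          # early stopping

# restore the best model before evaluating on the test set
model.load_state_dict(torch.load("best_model.pt"))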
Experiment with different batch sizes and optimizers and learning rates to get the best validation score for the model you create with early stopping. (Try not to look too hard at the final accuracy!) Include your final performance charts (using
show_performance_curves ) when you submit.
Conclude with a brief analysis (a couple of sentences is fine) relating what experiments you did, and what choices of geometry, optimizer, learning rate, and batch size gave you the best results. It should not be hard to get well above 90% accuracy on the final test.
class MyModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(100, 15, dtype=torch.float64),
            nn.ReLU(),
            nn.Linear(15, 2, dtype=torch.float64),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        return self.classifier(x)
max_epoch = 100
model = MyModel()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

last_acc = 0
early_stop = 0
for i in range(max_epoch):
    # Training pass
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        x, y = batch
        pred = model(x)
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()

    # Validation pass
    model.eval()
    correct = 0
    for batch in dev_dataloader:
        x, y = batch
        pred = model(x)
        pred = pred[:, 0] < .5          # predict spam (1) when P(class 0) < 0.5
        correct = correct + (pred == y).sum()
    acc = (correct / len(devset)).item()

    if acc < last_acc:
        if last_acc - acc < 0.0001:
            early_stop += 1
            if early_stop == 5:
                break
        else:
            early_stop = 0
    else:
        last_acc = acc
        torch.save(model, "best_model.pt")   # keep the best model seen so far

# Test with the best saved model
model = torch.load("best_model.pt")
model.eval()
correct = 0
for batch in test_dataloader:
    x, y = batch
    pred = model(x)
    pred = pred[:, 0] < .5
    correct = correct + (pred == y).sum()
acc = (correct / len(testset)).item()
print("Test set accuracy: ", acc)
Test set accuracy: 0.9562744498252869
batch size | learning rate | optimizer | accuracy (%)
--- | --- | --- | ---
64 | 1e-3 | AdamW | 95.63
64 | 1e-4 | AdamW | 94.13
64 | 1e-5 | AdamW | 86.60
64 | 1e-3 | AdamW | 94.67
128 | 1e-4 | AdamW | 93.60
128 | 1e-5 | AdamW | 83.29
64 | 1e-3 | SGD | 77.11
64 | 1e-2 | SGD | 93.49
128 | 1e-3 | SGD | 70.14
128 | 1e-2 | SGD | 91.72
The experiments tested different combinations of batch sizes, learning rates, and optimizers to evaluate their impact on accuracy. Two optimizers, AdamW and SGD, were compared across batch sizes of 64 and 128 and learning rates ranging from 1e-5 to 1e-2. The highest accuracy of 95.63% was achieved with a batch size of 64, a learning rate of 1e-3, and the AdamW optimizer. Within the range tested, larger learning rates and smaller batch sizes generally led to better accuracy, with AdamW consistently outperforming SGD.