You should submit a .ipynb file with your solutions to NYU Brightspace.

In this homework, we will reuse the spam prediction dataset used in HW1.
We will encode each sentence with a word-level BiLSTM sentence encoder and classify it with a neural network classifier.

For reference, you may read this paper.

Lab 3 is especially relevant to this homework.

Points distribution
code spam_collate_func: 25 pts
code LSTMClassifier.init: 25 pts
code LSTMClassifier.forward: 20 pts
code evaluate: 10 pts
code for training loop: 10 pts
Question on early stopping: 10 pts

How we grade the code:

full points if code works and the underlying logic is correct;
half points if code works but the underlying logic is incorrect;
zero points if code does not work.

Therefore, make sure your code works, i.e., no errors are produced when you execute it.

Data Loading
First, reuse the code from HW1 to download and read the data.

Load GloVe

def load_glove(glove_path, embedding_dim):
    # Index 0 is reserved for padding (all zeros) and index 1 for unknown tokens (random vector).
    with open(glove_path) as f:
        token_ls = [PAD_TOKEN, UNK_TOKEN]
        embedding_ls = [np.zeros(embedding_dim), np.random.rand(embedding_dim)]
        for line in f:
            # Each line holds a token followed by its embedding values.
            token, raw_embedding = line.split(maxsplit=1)
            token_ls.append(token)
            embedding = np.array([float(x) for x in raw_embedding.split()])
            embedding_ls.append(embedding)
        embeddings = np.array(embedding_ls)
    return token_ls, embeddings

EMBEDDING_DIM = 300  # dimension of GloVe embeddings
glove_path = "glove.6B.300d__50k.txt"
vocab, embeddings = load_glove(glove_path, EMBEDDING_DIM)
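
To make the file format that load_glove expects concrete, here is a minimal sketch that parses two made-up lines the same way. The in-memory file, the tokens, and the 3-dimensional vectors are assumptions for illustration only; the real file has 300 values per token.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text format: "<token> <v1> <v2> ... <vd>".
fake_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "spam 0.5 0.0 -1.5\n"
)

tokens, vectors = [], []
for line in fake_glove_file:
    # maxsplit=1 separates the token from the rest of the line in one step.
    token, raw_embedding = line.split(maxsplit=1)
    tokens.append(token)
    vectors.append(np.array([float(x) for x in raw_embedding.split()]))

embedding_matrix = np.array(vectors)
print(tokens)                  # ['the', 'spam']
print(embedding_matrix.shape)  # (2, 3)
```

Row i of the matrix is the vector for token i, which is why the token list and the embedding list must be appended to in lockstep.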

Import packages

!pip install sacremoses

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
import sacremoses
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm


Tokenize text data.
We will use the tokenize function to convert the text data into sequences of indices.

def tokenize(data, labels, tokenizer, vocab, max_seq_length=128):
    vocab_to_idx = {word: i for i, word in enumerate(vocab)}
    text_data = []
    label_data = []
    for ex in tqdm(data):
        tokenized = tokenizer.tokenize(ex.lower())
        # Tokens missing from the vocabulary map to index 1 (the unknown token).
        ids = [vocab_to_idx.get(token, 1) for token in tokenized]
        text_data.append(ids)
    return text_data, labels

tokenizer = sacremoses.MosesTokenizer()
train_data_indices, train_labels = tokenize(train_texts, train_labels, tokenizer, vocab)
val_data_indices, val_labels = tokenize(val_texts, val_labels, tokenizer, vocab)
test_data_indices, test_labels = tokenize(test_texts, test_labels, tokenizer, vocab)

print("\nTrain text first 5 examples:\n", train_data_indices[:5])
print("\nTrain labels first 5 examples:\n", train_labels[:5])

Train text first 5 examples:
[[43, 1, 1079, 6, 2, 425, 7, 967, 122, 328, 201, 4004, 129], [1, 16, 2, 12873, 3, 1, 2, 302, 17, 5928, 15, 725, 18811, 91, 2751, 725, 16537, 91, 14360, 452, 1, 21, 200, 2334, 66, 1, 34, 719, 3, 146, 660, 285, 1, 1908, 190], [5281, 525, 5, 9, 302, 13, 78, 149, 17, 194, 1558, 1, 194, 704, 121, 1, 3469, 1, 194, 1110, 15477, 3816, 102, 121, 194, 1815, 4, 43, 390, 6, 1, 153, 194, 3298, 121, 38, 3469, 4], [8978, 43, 3318, 1, 271, 20, 1434, 6, 21590, 3, 20, 975, 206, 189, 415], [43897, 436, 1866, 1, 99, 129, 436]]

Train labels first 5 examples:
[0, 0, 0, 0, 0]
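
The index 1 that appears repeatedly in the output above is the unknown-token index. As a minimal sketch of that lookup, assuming a made-up five-word vocabulary (the token strings `<pad>` and `<unk>` and the helper name `to_indices` are hypothetical, not part of the assignment):

```python
# Hypothetical toy vocabulary: index 0 is padding and index 1 is the unknown token,
# mirroring the [PAD_TOKEN, UNK_TOKEN] prefix built by load_glove.
toy_vocab = ["<pad>", "<unk>", "free", "prize", "call"]
toy_vocab_to_idx = {word: i for i, word in enumerate(toy_vocab)}

def to_indices(tokens, vocab_to_idx, unk_idx=1):
    """Map each token to its vocabulary index, falling back to unk_idx."""
    return [vocab_to_idx.get(tok, unk_idx) for tok in tokens]

example_ids = to_indices("call now for a free prize".split(), toy_vocab_to_idx)
print(example_ids)  # [4, 1, 1, 1, 2, 3]
```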

Create DataLoaders (25 pts)
Now, let's create PyTorch DataLoaders for our train, val, and test data.

The SpamDataset class is based on the torch Dataset. It has an additional parameter called self.max_sent_length and a spam_collate_func.

In order to use batch processing, all the examples need to effectively be the same length. We'll do this by adding padding tokens. spam_collate_func is supposed to dynamically pad or trim the sentences in the batch based on self.max_sent_length and the length of the longest sequence in the batch.

If self.max_sent_length is less than the length of the longest sequence in the batch, use self.max_sent_length. Otherwise, use the length of the longest sequence in the batch.
We do this because the input sentences in a batch may be much shorter than self.max_sent_length, and padding them all the way to self.max_sent_length would waste computation.

Please check the comment block in the code near TODO for more details.

PAD token id = 0
max_sent_length = 5

input list of sequences:


then padded minibatch looks like this:

padded_input =
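
As an illustration of the padding rule with PAD token id 0 and max_sent_length 5, here is a minimal sketch on made-up sequences. The helper name `pad_or_trim` and the input sequences are assumptions chosen for this example; it works on plain Python lists and is not the required spam_collate_func.

```python
def pad_or_trim(sequences, max_sent_length, pad_id=0):
    """Pad or trim every sequence to a common length (hypothetical helper)."""
    # Target length: the longest sequence in the batch, capped at max_sent_length.
    target_len = min(max(len(seq) for seq in sequences), max_sent_length)
    padded = []
    for seq in sequences:
        trimmed = list(seq[:target_len])
        padded.append(trimmed + [pad_id] * (target_len - len(trimmed)))
    return padded

# The longest sequence (7 tokens) exceeds max_sent_length, so everything becomes length 5.
long_batch = pad_or_trim([[7, 8, 9], [4], [1, 2, 3, 4, 5, 6, 7]], max_sent_length=5)
print(long_batch)   # [[7, 8, 9, 0, 0], [4, 0, 0, 0, 0], [1, 2, 3, 4, 5]]

# The longest sequence (3 tokens) is shorter than max_sent_length, so the batch is only padded to 3.
short_batch = pad_or_trim([[7, 8, 9], [4]], max_sent_length=5)
print(short_batch)  # [[7, 8, 9], [4, 0, 0]]
```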

import numpy as np
import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch.
    Note that this class inherits torch.utils.data.Dataset
    """

    def __init__(self, data_list, target_list, max_sent_length=128):
        """
        @param data_list: list of data tokens
        @param target_list: list of data targets
        """
        self.data_list = data_list
        self.target_list = target_list
        self.max_sent_length = max_sent_length
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, key, max_sent_length=None):
        """
        Triggered when you call dataset[i]
        """
        if max_sent_length is None:
            max_sent_length = self.max_sent_length
        token_idx = self.data_list[key][:max_sent_length]
        label = self.target_list[key]
        return [token_idx, label]

    def spam_collate_func(self, batch):
        """
        Customized function for DataLoader that dynamically pads the batch so that all
        data have the same length
        """
        # What is the input `batch`? That's for you to figure out!
        # You can read the DataLoader documentation, or you can use the print
        # function to debug.
        data_list = []  # store padded sequences
        label_list = []
        max_batch_seq_len = None  # the length of the longest sequence in the batch
                                  # if it is less than self.max_sent_length
                                  # else max_batch_seq_len = self.max_sent_length

        # Pad the sequences in your data
        # if their length is less than max_batch_seq_len
        # or trim the sequences that are longer than self.max_sent_length
        # return padded data_list and label_list
        # 1. TODO: Your code here

        return [data_list, label_list]

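# NOTE: the original values of BATCH_SIZE and max_sent_length are not shown in this text;
# the values below are assumptions, as are the shuffle settings. Adjust them as needed.
BATCH_SIZE = 64
max_sent_length = 128

train_dataset = SpamDataset(train_data_indices, train_labels, max_sent_length)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE,
                                           collate_fn=train_dataset.spam_collate_func, shuffle=True)

val_dataset = SpamDataset(val_data_indices, val_labels, train_dataset.max_sent_length)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE,
                                         collate_fn=val_dataset.spam_collate_func, shuffle=False)

test_dataset = SpamDataset(test_data_indices, test_labels, train_dataset.max_sent_length)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE,
                                          collate_fn=test_dataset.spam_collate_func, shuffle=False)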

Let's try printing a batch from train_loader.

data_batch, labels = next(iter(train_loader))
print("data batch dimension: ", data_batch.size())
print("data_batch: ", data_batch)
print("labels: ", labels)
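
If it is unclear how a DataLoader hands data to a custom collate function, the mechanism can be observed on a toy dataset. This is a minimal sketch under stated assumptions: `ToyPairs` and `inspect_collate` are hypothetical names, the data is made up, and the collate function only records its input instead of padding anything.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyPairs(Dataset):
    """Hypothetical dataset whose items look like SpamDataset items: [token_ids, label]."""
    def __init__(self):
        self.items = [[[5, 6], 0], [[7], 1], [[8, 9, 10], 0], [[11, 12], 1]]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, key):
        return self.items[key]

def inspect_collate(batch):
    # The DataLoader calls this once per minibatch with the items it gathered.
    # Returning the input unchanged lets us look at its raw structure.
    return batch

toy_loader = DataLoader(ToyPairs(), batch_size=2, shuffle=False, collate_fn=inspect_collate)
first_batch = next(iter(toy_loader))
print(type(first_batch), len(first_batch))
print(first_batch[0])
```

A real collate function would replace the `return batch` line with code that builds tensors of equal-length sequences and labels.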

Build a BiLSTM Classifier (20 + 25 + 10 pts)
Now we are going to build a BiLSTM classifier. Check this blog post and torch.nn.LSTM for reference. Recall that we've also seen LSTMs in Lab 3.

The hyperparameters for the LSTM are already given, but they are not necessarily optimal. You should get good accuracy with them, but you may tune them or try different values to get better performance.

__init__: Class constructor. Here we define the layers / parameters of the LSTM.
forward: This function is used whenever you call your object as model(). It takes the input minibatch and returns the output representation from the LSTM.
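
Before writing forward, it can help to check the tensor shapes that torch.nn.LSTM returns when bidirectional=True. The sketch below uses made-up sizes and random input. Concatenating the final forward and backward hidden states of the last layer is one common way to get a fixed-size sentence representation; it is shown as an assumption, not as the required solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, emb_dim, hidden_size, num_layers = 4, 7, 10, 16, 2

lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_size, num_layers=num_layers,
               batch_first=True, bidirectional=True)
x = torch.randn(batch_size, seq_len, emb_dim)  # stands in for embedded tokens
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 7, 32]): (batch, seq_len, 2 * hidden_size)
print(h_n.shape)     # torch.Size([4, 4, 16]): (num_layers * 2, batch, hidden_size)

# Separate layers and directions, then keep the last layer only.
h_last = h_n.view(num_layers, 2, batch_size, hidden_size)[-1]
# Concatenate the forward and backward final states into one vector per sentence.
sentence_repr = torch.cat([h_last[0], h_last[1]], dim=1)
print(sentence_repr.shape)  # torch.Size([4, 32])
```

Note that h_n is not affected by batch_first: its first dimension always indexes layers and directions.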

# First import torch related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMClassifier(nn.Module):
    """
    LSTMClassifier classification model
    """
    def __init__(self, embeddings, hidden_size, num_layers, num_classes, bidirectional, dropout_prob=0.3):
        super().__init__()  # must run before any sub-module is assigned
        self.embedding_layer = self.load_pretrained_embeddings(embeddings)
        self.dropout = None
        self.lstm = None
        self.non_linearity = None  # For example, ReLU
        self.clf = None  # classifier layer
        # Define the components of your BiLSTM Classifier model
        # 2. TODO: Your code here
        raise NotImplementedError  # delete this line

    def load_pretrained_embeddings(self, embeddings):
        """
        The code for loading embeddings from Lab 3 Deep Learning.
        Unlike the lab, we are not setting `embedding_layer.weight.requires_grad = False`
        because we want to finetune the embeddings on our data
        """
        embedding_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1], padding_idx=0)
        embedding_layer.weight.data = torch.Tensor(embeddings).float()
        return embedding_layer
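
The effect of padding_idx=0 on a trainable embedding layer can be checked on a toy matrix. This sketch uses nn.Embedding.from_pretrained with freeze=False, which is an alternative way to build such a layer and is not the code used above; the 3 x 4 matrix is made up.

```python
import numpy as np
import torch
import torch.nn as nn

# Made-up embedding matrix: row 0 is the all-zero PAD vector, rows 1-2 are random.
toy_embeddings = np.vstack([np.zeros(4), np.random.rand(2, 4)])

toy_layer = nn.Embedding.from_pretrained(
    torch.tensor(toy_embeddings, dtype=torch.float),
    freeze=False,    # keep the weights trainable so they can be finetuned
    padding_idx=0,   # the PAD row receives no gradient updates
)
print(toy_layer.weight.requires_grad)  # True

ids = torch.tensor([[2, 1, 0]])  # one sequence ending in a PAD token
toy_layer(ids).sum().backward()
print(toy_layer.weight.grad[0])  # tensor([0., 0., 0., 0.])
print(toy_layer.weight.grad[1])  # tensor([1., 1., 1., 1.])
```

Because the PAD row never receives a gradient, it stays at zero during finetuning while the other rows are updated.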

def forward(self, i

