
HW2_Spam_Classification_with_LSTM

You should submit a .ipynb file with your solutions to NYU Brightspace.


In this homework, we will reuse the spam prediction dataset from HW1.
We will use a word-level BiLSTM sentence encoder to encode each sentence, followed by a neural network classifier.

For reference, you may read this paper.

Lab 3 is especially relevant to this homework.

Points distribution
code spam_collate_func: 25 pts
code LSTMClassifier.init: 25 pts
code LSTMClassifier.forward: 20 pts
code evaluate: 10 pts
code for training loop: 10 pts
Question on early stopping: 10 pts

How we grade the code:

full points if code works and the underlying logic is correct;
half points if code works but the underlying logic is incorrect;
zero points if code does not work.

Therefore, make sure your code works, i.e., that no error is produced when you execute it.

Data Loading
First, reuse the code from HW1 to download and read the data.

!wget "https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR" -O spam.csv

2022-02-16 23:52:19--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 74.125.202.101, 74.125.202.138, 74.125.202.139, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.101|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/va2ei70761h7r8rlq63433gnfu6orla0/1645055475000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
Warning: wildcards not supported in HTTP.
2022-02-16 23:52:19--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/va2ei70761h7r8rlq63433gnfu6orla0/1645055475000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 173.194.197.132, 2607:f8b0:4001:c1b::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)|173.194.197.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 503663 (492K) [text/csv]
Saving to: 'spam.csv'

spam.csv            100%[===================>] 491.86K  --.-KB/s    in 0.009s

2022-02-16 23:52:20 (55.1 MB/s) - 'spam.csv' saved [503663/503663]

import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding="latin-1")
# 1 spam, 0 ham
df.v1 = (df.v1 == "spam").astype(int)
df.head()

   v1                                                 v2
0   0   Go until jurong point, crazy.. Available only...
1   0                           Ok lar Joking wif u oni
2   1   Free entry in 2 a wkly comp to win FA Cup fina...
3   0     U dun say so early hor U c already then say...
4   0    Nah I dont think he goes to usf, he lives aro...

We will split the data into train, val, and test sets.

train_texts, val_texts, and test_texts should contain a list of text examples in the dataset.

# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

# Shuffle the data
df = df.sample(frac=1)
# Split df to test/val/train
test_df = df[:test_size]
val_df = df[test_size:test_size+val_size]
train_df = df[test_size+val_size:]

train_texts, train_labels = list(train_df.v2), list(train_df.v1)
val_texts, val_labels = list(val_df.v2), list(val_df.v1)
test_texts, test_labels = list(test_df.v2), list(test_df.v1)

# Check that indices do not overlap
assert set(train_df.index).intersection(set(val_df.index)) == set({})
assert set(test_df.index).intersection(set(train_df.index)) == set({})
assert set(val_df.index).intersection(set(test_df.index)) == set({})
# Check that all indices are present
assert df.shape[0] == len(train_labels) + len(val_labels) + len(test_labels)

print(f"Size of initial data: {df.shape[0]}")
print(f"Train size: {len(train_labels)}")
print(f"Val size: {len(val_labels)}")
print(f"Test size: {len(test_labels)}")

Size of initial data: 5572
Train size: 3902
Val size: 835
Test size: 835

train_texts[:10]  # Just checking the examples in train_texts

['Ill talk to the others and probably just come early tomorrow then',
 'House-Maid is the murderer, coz the man was murdered on <#> th January.. As public holiday all govt.instituitions are closed,including post office..understand?',
 'Sad story of a Man Last week was my bday. My Wife didnt wish me. My Parents forgot n so did my Kids . I went to work. Even my Colleagues did not wish.',
 'Nah I dont think he goes to usf, he lives around here though',
 'Nope C _ then',
 'I sent your maga that money yesterday oh.',
 'URGENT This is our 2nd attempt to contact U. Your 900 prize from YESTERDAY is still awaiting collection. To claim CALL NOW 09061702893. ACL03530150PM',
 'Lol I would but my mom would have a fit and tell the whole family how crazy and terrible I am',
 'Check mail.i have mailed varma and kept copy to you regarding membership.take care.insha allah.',
 '88066 FROM 88066 LOST 3POUND HELP']

Download and Load GloVe
We will use GloVe embedding parameters to initialize our word representation / embedding layer.
Let's download and load GloVe.

This is related to Lab 3: Deep Learning; please watch the recording and check the notebook for details.

Download GloVe word embeddings

# === Download GloVe word embeddings
# !wget http://nlp.stanford.edu/data/glove.6B.zip

# === Unzip word embeddings and use only the top 50000 word embeddings for speed
# !unzip glove.6B.zip
# !head -n 50000 glove.6B.300d.txt > glove.6B.300d__50k.txt

# === Download Preprocessed version
!wget https://docs.google.com/uc?id=1KMJTagaVD9hFHXFTPtNk0u2JjvNlyCAu -O glove_split.aa
!wget https://docs.google.com/uc?id=1LF2yD2jToXriyD-lsYA5hj03f7J3ZKaY -O glove_split.ab
!wget https://docs.google.com/uc?id=1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f -O glove_split.ac
!cat glove_split.?? > glove.6B.300d__50k.txt

2022-02-16 23:52:32--  https://docs.google.com/uc?id=1KMJTagaVD9hFHXFTPtNk0u2JjvNlyCAu
Resolving docs.google.com (docs.google.com)... 74.125.202.102, 74.125.202.139, 74.125.202.113, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'glove_split.aa'

glove_split.aa          [ <=>                ]   1.93K  --.-KB/s    in 0s

2022-02-16 23:52:33 (25.6 MB/s) - 'glove_split.aa' saved [1978]

2022-02-16 23:52:33--  https://docs.google.com/uc?id=1LF2yD2jToXriyD-lsYA5hj03f7J3ZKaY
Resolving docs.google.com (docs.google.com)... 74.125.202.138, 74.125.202.100, 74.125.202.102, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'glove_split.ab'

glove_split.ab          [ <=>                ]   1.93K  --.-KB/s    in 0s

2022-02-16 23:52:36 (25.5 MB/s) - 'glove_split.ab' saved [1978]

2022-02-16 23:52:37--  https://docs.google.com/uc?id=1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f
Resolving docs.google.com (docs.google.com)... 74.125.202.138, 74.125.202.113, 74.125.202.101, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-04-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/rjp7g6lolbttv0mftfnic5fr5oum8q4r/1645055550000/14514704803973256873/*/1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f [following]
Warning: wildcards not supported in HTTP.
2022-02-16 23:52:37--  https://doc-04-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/rjp7g6lolbttv0mftfnic5fr5oum8q4r/1645055550000/14514704803973256873/*/1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f
Resolving doc-04-0g-docs.googleusercontent.com (doc-04-0g-docs.googleusercontent.com)... 173.194.197.132, 2607:f8b0:4001:c1b::84
Connecting to doc-04-0g-docs.googleusercontent.com (doc-04-0g-docs.googleusercontent.com)|173.194.197.132|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2022-02-16 23:52:37 ERROR 403: Forbidden.

Since the last split could not be downloaded from Google Drive (403 Forbidden), download the preprocessed GloVe file from an alternative link instead:

!wget https://campuspro-uploads.s3.us-west-2.amazonaws.com/f14e42f6-0f57-4d3c-bf3c-6eb8982c822b/1447cf92-9ef5-4097-939d-f69337174ded/glove.6B.300d__50k.txt.zip
!unzip glove.6B.300d__50k.txt.zip

2022-02-16 23:57:18--  https://campuspro-uploads.s3.us-west-2.amazonaws.com/f14e42f6-0f57-4d3c-bf3c-6eb8982c822b/1447cf92-9ef5-4097-939d-f69337174ded/glove.6B.300d__50k.txt.zip
Resolving campuspro-uploads.s3.us-west-2.amazonaws.com (campuspro-uploads.s3.us-west-2.amazonaws.com)... 52.218.246.249
Connecting to campuspro-uploads.s3.us-west-2.amazonaws.com (campuspro-uploads.s3.us-west-2.amazonaws.com)|52.218.246.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49335722 (47M) [application/zip]
Saving to: 'glove.6B.300d__50k.txt.zip'

glove.6B.300d__50k. 100%[===================>]  47.05M  22.6MB/s    in 2.1s

2022-02-16 23:57:21 (22.6 MB/s) - 'glove.6B.300d__50k.txt.zip' saved [49335722/49335722]

Archive:  glove.6B.300d__50k.txt.zip
replace glove.6B.300d__50k.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: glove.6B.300d__50k.txt
  inflating: __MACOSX/._glove.6B.300d__50k.txt

Load GloVe

def load_glove(glove_path, embedding_dim):
    with open(glove_path) as f:
        token_ls = [PAD_TOKEN, UNK_TOKEN]
        embedding_ls = [np.zeros(embedding_dim), np.random.rand(embedding_dim)]
        for line in f:
            token, raw_embedding = line.split(maxsplit=1)
            token_ls.append(token)
            embedding = np.array([float(x) for x in raw_embedding.split()])
            embedding_ls.append(embedding)
        embeddings = np.array(embedding_ls)
    print(embedding_ls[-1].size)
    return token_ls, embeddings

PAD_TOKEN = '<pad>'  # placeholder string for the padding token (index 0)
UNK_TOKEN = '<unk>'  # placeholder string for unknown words (index 1)
EMBEDDING_DIM = 300  # dimension of GloVe embeddings
glove_path = "glove.6B.300d__50k.txt"
vocab, embeddings = load_glove(glove_path, EMBEDDING_DIM)
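
As a quick, optional sanity check (not part of the graded code), you can verify that the first two vocabulary entries are the PAD and UNK tokens we prepended and that the embedding matrix has one 300-dimensional row per token; the expected counts below assume the 50k-line GloVe file prepared above:

# Expected: 50002 tokens (50,000 GloVe words + PAD + UNK) and a (50002, 300) matrix
print(len(vocab), embeddings.shape)
print(vocab[:2])  # [PAD_TOKEN, UNK_TOKEN]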

Import packages

!pip install sacremoses

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
import sacremoses
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm

Collecting sacremoses
Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 5.4 MB/s
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses) (1.1.0)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses) (1.15.0)
Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (from sacremoses) (2019.12.20)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from sacremoses) (4.62.3)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses) (7.1.2)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.0.47

Tokenize text data.
We will use the tokenize function to convert the text data into sequences of indices.

def tokenize(data, labels, tokenizer, vocab, max_seq_length=128):
    vocab_to_idx = {word: i for i, word in enumerate(vocab)}
    text_data = []
    label_data = []
    for ex in tqdm(data):
        tokenized = tokenizer.tokenize(ex.lower())
        ids = [vocab_to_idx.get(token, 1) for token in tokenized]  # 1 = UNK index
        text_data.append(ids)
    return text_data, labels

tokenizer = sacremoses.MosesTokenizer()
train_data_indices, train_labels = tokenize(train_texts, train_labels, tokenizer, vocab)
val_data_indices, val_labels = tokenize(val_texts, val_labels, tokenizer, vocab)
test_data_indices, test_labels = tokenize(test_texts, test_labels, tokenizer, vocab)

print("Train text first 5 examples:", train_data_indices[:5])
print("Train labels first 5 examples:", train_labels[:5])

Train text first 5 examples:
[[43, 1, 1079, 6, 2, 425, 7, 967, 122, 328, 201, 4004, 129], [1, 16, 2, 12873, 3, 1, 2, 302, 17, 5928, 15, 725, 18811, 91, 2751, 725, 16537, 91, 14360, 452, 1, 21, 200, 2334, 66, 1, 34, 719, 3, 146, 660, 285, 1, 1908, 190], [5281, 525, 5, 9, 302, 13, 78, 149, 17, 194, 1558, 1, 194, 704, 121, 1, 3469, 1, 194, 1110, 15477, 3816, 102, 121, 194, 1815, 4, 43, 390, 6, 1, 153, 194, 3298, 121, 38, 3469, 4], [8978, 43, 3318, 1, 271, 20, 1434, 6, 21590, 3, 20, 975, 206, 189, 415], [43897, 436, 1866, 1, 99, 129, 436]]

Train labels first 5 examples:
[0, 0, 0, 0, 0]

Create DataLoaders (25 pts)
Now, let's create PyTorch DataLoaders for our train, val, and test data.

The SpamDataset class is based on the torch Dataset. It has an additional parameter called self.max_sent_length and a custom spam_collate_func.

In order to use batch processing, all the examples in a batch need to effectively be the same length. We'll do this by adding padding tokens. spam_collate_func is supposed to dynamically pad or trim the sentences in the batch based on self.max_sent_length and the length of the longest sequence in the batch.

If self.max_sent_length is less than the length of the longest sequence in the batch, use self.max_sent_length. Otherwise, use the length of the longest sequence in the batch.
We do this because our input sentences in the batch may be much shorter than self.max_sent_length.

Please check the comment block in the code near TODO for more details; a sketch of one possible padding helper is also shown after the class definition below.

PAD token id = 0
max_sent_length = 5

If the input list of sequences is:

[1,4,5,3,5,6,7,4,4],
[3,5,3,2],
[2,5,3,5,6,7,4],

then the padded minibatch looks like this:

padded_input =
[[1,4,5,3,5],
[3,5,3,2,0],
[2,5,3,5,6]]

import numpy as np
import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch.
    Note that this class inherits torch.utils.data.Dataset.
    """

    def __init__(self, data_list, target_list, max_sent_length=128):
        """
        @param data_list: list of data tokens
        @param target_list: list of data targets
        """
        self.data_list = data_list
        self.target_list = target_list
        self.max_sent_length = max_sent_length
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, key, max_sent_length=None):
        """Triggered when you call dataset[i]"""
        if max_sent_length is None:
            max_sent_length = self.max_sent_length
        token_idx = self.data_list[key][:max_sent_length]
        label = self.target_list[key]
        return [token_idx, label]

    def spam_collate_func(self, batch):
        """
        Customized function for DataLoader that dynamically pads the batch so that all
        data have the same length.
        """
        # What is the input `batch`? That's for you to figure out!
        # You can read the DataLoader documentation, or you can use the print
        # function to debug.
        data_list = []   # store padded sequences
        label_list = []
        max_batch_seq_len = None  # the length of the longest sequence in the batch
                                  # if it is less than self.max_sent_length,
                                  # else max_batch_seq_len = self.max_sent_length

        # Pad the sequences in your data
        # if their length is less than max_batch_seq_len,
        # or trim the sequences that are longer than self.max_sent_length.
        # Return padded data_list and label_list.
        # 1. TODO: Your code here

        return [data_list, label_list]
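
For reference only, below is a minimal sketch of the padding logic described above, written as a standalone helper (the name pad_batch and the helper itself are hypothetical, not part of the assignment skeleton); it assumes the PAD index is 0, as in the GloVe vocabulary built earlier. spam_collate_func could simply call such a helper with self.max_sent_length.

import torch

def pad_batch(batch, max_sent_length, pad_id=0):
    """Sketch of dynamic padding: pad/trim a batch of [token_ids, label] pairs."""
    token_lists = [tokens for tokens, _ in batch]
    labels = [label for _, label in batch]

    # Use the longest sequence in the batch, capped at max_sent_length.
    max_batch_seq_len = min(max_sent_length, max(len(t) for t in token_lists))

    padded = []
    for tokens in token_lists:
        tokens = tokens[:max_batch_seq_len]                             # trim long sequences
        tokens = tokens + [pad_id] * (max_batch_seq_len - len(tokens))  # pad short ones
        padded.append(tokens)

    return torch.LongTensor(padded), torch.LongTensor(labels)

With the toy example above (max_sent_length = 5), pad_batch([[[1,4,5,3,5,6,7,4,4], 1], [[3,5,3,2], 0], [[2,5,3,5,6,7,4], 0]], 5) reproduces the padded_input shown earlier.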

BATCH_SIZE = 64
max_sent_length=128
train_dataset = SpamDataset(train_data_indices, train_labels, max_sent_length)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=BATCH_SIZE,
collate_fn=train_dataset.spam_collate_func,
shuffle=True)

val_dataset = SpamDataset(val_data_indices, val_labels, train_dataset.max_sent_length)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
batch_size=BATCH_SIZE,
collate_fn=train_dataset.spam_collate_func,
shuffle=False)

test_dataset = SpamDataset(test_data_indices, test_labels, train_dataset.max_sent_length)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
batch_size=BATCH_SIZE,
collate_fn=train_dataset.spam_collate_func,
shuffle=False)

Let's try to print out a batch from train_loader.

data_batch, labels = next(iter(train_loader))
print("data batch dimension:", data_batch.size())
print("data_batch:", data_batch)
print("labels:", labels)

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      1 data_batch, labels = next(iter(train_loader))
----> 2 print("data batch dimension:", data_batch.size())
      3 print("data_batch:", data_batch)
      4 print("labels:", labels)

AttributeError: 'list' object has no attribute 'size'

(This error is expected until spam_collate_func is implemented: the placeholder above returns plain Python lists, which have no .size() method, rather than padded tensors.)

Build a BiLSTM Classifier (20 + 25 + 10 pts)
Now we are going to build a BiLSTM classifier. Check this blog post and torch.nn.LSTM for reference. Recall that we've also seen LSTM in Lab 3.

The hyperparameters for the LSTM are already given, but they are not necessarily optimal. You should get good accuracy with these hyperparameters, but you may tune them and try different settings to get better performance.

__init__: Class constructor. Here we define the layers / parameters of the LSTM.
forward: This function is called whenever you call your object as model(...). It takes the input minibatch and returns the output representation from the LSTM.

# First import torch related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMClassifier(nn.Module):
    """LSTMClassifier classification model"""

    def __init__(self, embeddings, hidden_size, num_layers, num_classes, bidirectional, dropout_prob=0.3):
        super().__init__()
        self.embedding_layer = self.load_pretrained_embeddings(embeddings)
        self.dropout = None
        self.lstm = None
        self.non_linearity = None  # For example, ReLU
        self.clf = None  # classifier layer
        # Define the components of your BiLSTM Classifier model
        # 2. TODO: Your code here
        raise NotImplementedError  # delete this line

    def load_pretrained_embeddings(self, embeddings):
        """
        The code for loading embeddings from Lab 3 Deep Learning.
        Unlike the lab, we are not setting `embedding_layer.weight.requires_grad = False`
        because we want to finetune the embeddings on our data.
        """
        embedding_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1], padding_idx=0)
        embedding_layer.weight.data = torch.Tensor(embeddings).float()
        return embedding_layer

    def forward(self, inputs):
        # 3. TODO: Your code here
        raise NotImplementedError  # delete this line
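
As a reference point only, here is a minimal sketch of one way the components above could be filled in. It is not the graded solution: the class name BiLSTMClassifierSketch is hypothetical, and the design choices inside (batch_first tensors, dropout on the embeddings, mean-pooling the BiLSTM outputs over time) are assumptions rather than requirements of the assignment.

import torch
import torch.nn as nn

class BiLSTMClassifierSketch(nn.Module):
    """Hypothetical sketch of one possible BiLSTM classifier; not the graded solution."""

    def __init__(self, embeddings, hidden_size, num_layers, num_classes, bidirectional, dropout_prob=0.3):
        super().__init__()
        # Same pretrained-embedding loading as in LSTMClassifier above.
        self.embedding_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1], padding_idx=0)
        self.embedding_layer.weight.data = torch.Tensor(embeddings).float()

        self.dropout = nn.Dropout(dropout_prob)
        self.lstm = nn.LSTM(
            input_size=embeddings.shape[1],
            hidden_size=hidden_size,
            num_layers=num_layers,
            bidirectional=bidirectional,
            batch_first=True,
        )
        self.non_linearity = nn.ReLU()
        num_directions = 2 if bidirectional else 1
        self.clf = nn.Linear(hidden_size * num_directions, num_classes)

    def forward(self, inputs):
        # inputs: LongTensor of shape (batch_size, seq_len) with padded token ids
        embedded = self.dropout(self.embedding_layer(inputs))
        outputs, _ = self.lstm(embedded)      # (batch_size, seq_len, hidden_size * num_directions)
        sentence_repr = outputs.mean(dim=1)   # simple mean-pooling over time steps
        logits = self.clf(self.non_linearity(sentence_repr))
        return logits                         # (batch_size, num_classes)

An instance could then be created with illustrative values such as hidden_size=128, num_layers=1, num_classes=2, bidirectional=True, but those are example settings, not the hyperparameters prescribed by the assignment.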
