# 1 Written Questions

Answer the following questions in the template provided. Then upload your solutions to Gradescope. You may use L^{A}T_{E}X or print the template and hand-write your answers then scan it in. Failure to use the template may result in a penalty.

1.1 Reductions to Binary Classification

- (1 point) Can we construct an ECOC matrix that yields the same algorithm as One vs. All classification?
*If yes, describe the matrix; if not, explain why not.* - (1 point) Can we construct an ECOC matrix that yields the same algorithm as All vs. All classification?

*If yes, describe the matrix; if not, explain why not.*

- (1 point) True or False: One-against-some classifier has logarithmic runtime in the number of examples.

True

False

- (1 point) True or False: Extreme classification assumes a large label set.

True

False

- (2 points) For a full binary classification tree of depth
*d*, what is the probability of making a correct classification on. Assume that the probability of a binary classifier at any node making an error is . (A full binary tree is one in which every node, except the leaves, has two children.)^{X~ } - Numerical answer: For
*k*-way multi-classification with*k*= 16, how many classifiers need to be constructed for…- (1 point) …One vs. All classification?
- (1 point) …All vs. All classification?

- (1 point) For an arbitrary classification tree with
*k*classes, what is the minimum number of binary classifiers that could be needed? - (1 point) For an arbitrary classification tree with
*k*classes, What is the maximum number of classifiers that could be needed? (Assume that the sets of classes for each pair of siblings are mutually exclusive.)

## 1.2 Learning to Search

In this section, we’ll consider a simple version of learning to search for the task of identifying names in a text. In this setting our action space will be the set {+*,*−} where + denotes labeling a word as part of a name and − denotes labeling a word as not being part of a name. Our state space will be represented by *S _{z }*where

*z*is a vector denoting the labeling of our sentence so far. For example, if we are in state

*S*

_{++− }it means we have labeled word 0 as +, word 1 as +, word 2 as −, and we are currently attempting to label word 3.

Throughout this section, we will be referring exclusively to the following input sentence *~x *and oracle labeling *~y*:

In Figure 1.1 you can see a small part of the search space for this problem.

Figure 1.1

- (1 point) How many nodes are in the search space for sentence
*~x*? - Suppose our loss function is Hamming loss.
- (1 point) What does the oracle policy return for
*S*_{++−}? - (1 point) What does the oracle policy return for
*S*_{−−−}?

- (1 point) What does the oracle policy return for
- Suppose our loss function is Hamming loss with an additional error for including only part of a name,

e.g. omitting a person’s title. More precisely,

- (1 point) What does the oracle policy return for
*S*_{−}? - (1 point) What does the oracle policy return for
*S*_{+}?

- Suppose that our scoring function is of the form
*θ*^{T }**f**(*x*) where_{i},y_{i}

**f**(*x _{i},y_{i}*) =[I{

*x*starts with a capital letter and

_{i }*y*= +} I{

_{i }*x*starts with a capital letter and

_{i }*y*= −}

_{i }*,*I{ is in our gazetteer list and

*y*= +}

_{i }*,*

I{ is in our gazetteer list and *y _{i }*= −}

*,*I{

*y*

_{i}_{−1 }= −

*,y*= −}]

_{i }and *θ *is a vector of length 5. A gazetteer list is essentially a lookup table of entities that we know are names. Our gazetteer list will include { Xavier, Mellon }.

Assume our initial weight vector *θ *= [−1*,*1*,*−1*,*1*,*0].

- (1 point) What score would our scoring function assign to the ground truth labeling?
- (1 point) What labeling would the greedy policy induced by this scoring function return? Break any ties with +.
- (1 point) Suppose we use the Supervised Approach to Imitation Learning. Which (state, action) pairs would be produced by the learning algorithm? Denote the action by either + or −. Denote a state by the partial sequence it corresponds to (e.g. the state +− corresponds to a state that took action + followed by action −).
- (1 point) Suppose we use DAgger, Which (state, action) pairs would be produced by the learning algorithm? Use the same denotation of states and actions as in the previous question.

We decide to train our linear model using the Perceptron update rule. That is, if the classifier (aka. greedy policy) makes a positive mistake (i.e. a mistake where *y _{i }*= +) in its action selection, then we

*add*the feature vector to the weights. If it makes a negative mistake (i.e. a mistake where

*y*= −), then we

_{i }*subtract*the feature vector from the weights. We treat the arrival of each (state, action) pair as a separate online example for the classifier.

- (1 point) Using the (state, action) pairs produced by the Supervised Approach to Imitation Learning, what would the new value of
*θ*be after completing the corresponding Perceptron updates? - (1 point) Using the (state, action) pairs produced by DAgger, what would the new value of
*θ*be after completing the corresponding Perceptron updates?

## 1.3 Recurrent Neural Network Language Models

In this section, we wish to use an Elman Network as a building block to design an RNN language model. Below we define the Elman network recursively.

**b*** _{t }*= relu(

**B**

^{T }**b**

*−*

_{t}_{1 }+

**A**

^{T }**a**

*+*

_{t }**d**)

**c**

*= softmax(*

_{t }**C**

^{T }**b**

*+*

_{t }**e**)

where **a**_{t}*,*∀*t *is given as input to the network; **A***,***B***,***C***,***d***,***e **are parameters of the network; and the initial hidden units **b**_{0 }are also treated as parameters. In this problem, we assume **a**_{t}*,***b**_{t}*,***c*** _{t }*∈ R

^{2 }for all

*t*, i.e. all vectors have length two, and that the parameters matrices and vectors are appropriately defined to preserve those dimensions. Above, the function relu(·) is the Rectified Linear Unit function applied elementwise to its vector-valued parameter. The function softmax(·) is likewise vector-valued and defined below. We use [·]

*to explicitly denote the*

_{i }*i*th element of its argument.

[relu(**v**)]* _{i }*, max(0

*,v*)

_{i}[softmax

Figure 1.2 depicts the Elman Network. The parameters are shown in red, but their connections into the computation graph are not shown.

Figure 1.2

Assume that we wish to build a language model *p*(**y**) of binary sequences **y **∈ {+*,*−}* ^{L }*of fixed length

*L*. We have training data consisting of sequences D = {

**y**

^{(1)}

*,…,*

**y**

^{(N)}}, , where |

**y**

^{(i)}| =

*L*. Assume further that we have pre-encoded the data as sequences of one-hot vectors D

^{0 }= {

**z**

^{(1)}

*,…,*

**z**

^{(N)}} such that:

if, then **z**, if, then **z**.

For such a pair of vectors, we write that **z**^{(i) }= one-hot(**y**^{(i)})

- (1 point) Short answer: Since we wish to treat this Elman Network as an RNN-LM, how many inputs
**a**will we need for a single training example_{t }**y**∈ D with |**y**| =*L*? Explain your answer. - (1 point) Select one: If we decide to train this RNN-LM with
*Teacher Forcing*, we will need to compute a loss function for an input example**y**∈ D. Assuming so, how would we define**a**?_{t}*Note: Be sure to account for all**t in your definition.* - (1 point) Select one: If we decide to train this RNN-LM with
*Scheduled Sampling*, we will need to compute a loss function for an input example**y**∈ D. Assuming we use a schedule that always selects the model policy with probability 1, how would we define**a**?_{t}*Note: Be sure to account for all**t in your definition.* - (1 point) Write the cross entropy loss
*`*for a single training example**z**∈ D^{0 }in terms of the units and/or parameters of the RNN-LM:**a**_{t}*,***b**_{t}*,***c**_{t}*,***A***,***B***,***C***,***d***,***e**.

Suppose we have parameter values as defined below:

**A **** B **** C ** (1.1) **d **** e **** b** (1.2)

- Numerical answer: When computing the probability of the sequence
**y**= [+*,*−*,*+], what is the value of the following three quantities?*Note: Round each numerical value to two significant figures.*- (1 point)
*b*_{1,1 }= - (1 point)
*b*_{2,1 }=

- (1 point)
- Numerical answer: When computing the probability of the sequence
**y**= [+*,*−*,*+], what is the value of the following three quantities?*Note: Round each numerical value to two significant figures.*- (1 point)
*c*_{1,1 }= - (1 point)
*c*_{2,1 }=

- (1 point)
- (1 point) Numerical answer: What is the probability of the sequence
**y**= [+*,*−*,*+] according to this RNNLM?*Note: Round the numerical value to two significant figures.*

*p*(**y**) =

- (1 point) Numerical answer: What is the probability of the
*(length one!)*sequence**y**^{0 }= [−] according to this RNNLM?*Note: Round the numerical value to two significant figures.*

*p*(**y**^{0}) =

## 1.4 Empirical Questions

The following questions should be completed after you work through the programming portion of this assignment (Section 2).

- (10 points) Record your model’s performance on the test set after 10 epochs in terms of Cross Entropy (CE) and Character Error Rate (CER) when trained with the following schemes.
*Note: Round each numerical value to two significant figures.*

Schedule | CE | CER |

All Oracle | ||

All Model | ||

β = 0.75 |
||

Linear Decay | ||

Exponential Decay |

- (10 points) Plot training and testing cross entropy curves for three different training procedures:
*All Oracle*,*All Model*, and the fixed*β*= 0*.*75 training procedure. Let the*x*-axis ranges over 10 epochs.

*Note: Your plot must be machine generated.*

- In class we saw that we can prove a no-regret bound for sequences of
*β*such that

.

- (1 point) Show that a fixed
*β*does not satisfy this condition. - (1 point) Show that exponential decay does satisfy this condition.
- (1 point) Did this theoretical difference make a difference in practice? Briefly explain why or why not with respect to this dataset and problem setting.

## 1.5 Wrap-up Questions

- (1 point) Multiple Choice: Did you correctly submit your code to Autolab?

Yes

No

- (1 point) Numerical answer: How many hours did you spend on this assignment?.

# 2 Programming

Your goal in this assignment is to implement a deep learning model for acoustic speech recognition (ASR). You will implement a function to encode speech data into a fixed length vector, a function to decode this fixed length vector into a sequence of characters, and a variety of strategies to train the overall model.

Your solution for this section must be implemented in PyTorch using the data files we have provided to you. This restriction is because we will be grading your code by hand to check your understanding as well as your model’s performance.

## 2.1 Task Background

Acoustic speech recognition is the problem of taking in recordings of people talking and predicting what was said. While many successful models try to predict linguistically meaningfully units like phonemes, we will be directly predicting characters from waveforms for simplicity.

Though we will train our model with a standard classification loss (Cross-Entropy), what we really care about is the accuracy of our transcription. For a given target sentence, we define the Character Error Rate as the number of character deletions (*d*), character insertions (*i*), and character substitutions (*s*) need to transform the output transcription to the goal transcription over the number of characters in the output transcription (n).

### CER (2.1)

Conveniently, this equation can be calculated with the following snippet of code, which runs a dynamic programming algorithm to compute edit distance:

**from **nltk.metrics.distance **import **edit_distance

cer = edit_distance(our_string, goal_string) / **len**(our_string)

## 2.2 Data

In order to reduce your workload and computational requirements, we have preprocessed a subset of the TIMIT dataset for you to use in this task.

The data is divided into two folders, train and test. In each folder is a sequence of pairs of numpy files (.npy) and text files (.txt). If a numpy file is named ”xyz.npy”, it’s corresponding transcription can be found in ”xyz.txt”.

Given Input: Preprocessed audio files in numpy array of size 20 × *L*, where *L *is the number of timesteps in the audio file.

Goal Output: Preprocessed text files of the transcribed audio.

For this section’s points, you will need to implement data loaders for PyTorch so that we can efficiently train our model batch by batch. Note that because we are working with sequences, we will need all sequences in a batch to have the same length.

(Hint: PyTorch allows you to pass a ”collate fn” to torch.utils.data.DataLoader and provides a function called ”pad sequence” in torch.nn.utils.rnn)

## 2.3 Seq2Seq Implementation

A sequence to sequence model (commonly abbreviated Seq2Seq) is a model that takes in a sequence of input, like words in an English sentence or samples of a waveform, and outputs another sequence, like words in Arabic or transcribed phonemes. Models of this type are frequently used in machine translation, text to speech applications, and automatic speech recognition.

Seq2Seq models comprise of two parts, an encoder and a decoder. The encoder takes in a sequence of inputs and outputs an *encoding *that represents all of the information contained in the input. The decode then takes in this encoding and outputs a sequence in the new domain. This process can be seen in the figure below.

Figure 2.1: The encoder and decoder are trained jointly in a Seq2Seq model.

For this section, you must implement a working Seq2Seq model. Your implementation must have both the encoder and decoder as single-layer LSTMs with 50% dropout applied to the input layer to the encoder. Every hidden dimension in your neural networks should be 128 and your embedding size should be 256. Set your optimizer to be Adam with the default PyTorch parameters. Your program should be able to run quickly and easily on a laptop due to the simplicity of our model and the limited size of our data.

## 2.4 DAgger Implementation

As we’ve seen in class, DAgger is an algorithm for collecting training examples for our model by sometimes following the actions performed by an expert and sometimes following our model’s decisions.

For this section’s points, you will need to implement DAgger as described in Algorithm 1. Note that this algorithm allows *β _{i }*to vary with timestep

*i*. This allows us to explore various schedules for the

*β*parameter, including some that don’t have the theoretical guarantees discussed in class.

Please implement a general version of DAgger and then the code necessary to run DAgger with (1) a fixed *β*, (2) a linearly decaying *β *where *β *is decreased by 0*.*05 after each epoch, and (3) an exponentially decaying *β*, where *β *= exp−1 × *i *where *i *is the current epoch.

Algorithm 1 DAgger

Initialize D ← ∅.

Initialize *π*ˆ_{1 }to any policy in Π. for i=1 to N do

Let *π _{i }*=

*β*

_{i}π^{∗ }+ (1 −

*β*)

_{i}*π*ˆ

_{i}Sample *T*-step trajectories using *π _{i}*.

Get dataset D* _{i }*= {

*s,π*

^{∗}(

*s*)} of visited states by

*π*and actions given by the expert.

_{i }Aggregate datasets: D ← D ∪ D* _{i}*.

Train classifier *π*ˆ_{i}_{+1 }on D.

end for return best *π*ˆ* _{i }*on validation.

## 2.5 Autolab Submission

You must submit a .tar file named seq2seq.tar containing seq2seq.py, which contains all of your code.

You can create that file by running:

tar -cvf seq2seq.tar seq2seq.py

from the directory containing your code.

Some additional tips: DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf. DO NOT put the above files in a folder and then tar the folder. Autolab is case sensitive, so observe that all your files should be named in lowercase. You must submit this file to the corresponding homework link on Autolab.

Your code will not be autograded on Autolab. Instead, we will grade your code by hand; that is, we will read through your code in order to grade it. As such, please carefully identify major sections of the code via comments.

## Reviews

There are no reviews yet.