
Deep Learning Systems (ENGR-E 533): Homework 1 to 4


1. Replicate the test accuracy graph on M02-S09.
2. Show me your weight visualization, too.
3. Please do not use any advanced optimization methods (Adam, batch norm, dropout, etc.) or
initialization methods (Xavier and so on). Plain SGD should just work.
4. In TF 2.x, you can do something like this to download the MNIST dataset:
mnist = tf.keras.datasets.mnist
In PT, you can use the following lines (don't worry about the batch size and normalization;
you can go for your own option for them):
import torchvision
mnist_train = torchvision.datasets.MNIST('mnist',
    train=True,
    download=True,
    transform=torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ]))
mnist_test = torchvision.datasets.MNIST('mnist',
    train=False,
    download=True,
    transform=torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ]))
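For reference, here is a minimal sketch of the plain-SGD setup in PyTorch, continuing from the loaders above. The architecture (a single 784 × 10 softmax layer), learning rate, batch size, and epoch count are illustrative assumptions, not required settings:

import torch

train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size=1000)

model = torch.nn.Linear(784, 10)            # softmax is folded into the loss below
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # plain SGD, no momentum
loss_fn = torch.nn.CrossEntropyLoss()       # log-softmax + negative log-likelihood

for epoch in range(20):
    for x, y in train_loader:
        opt.zero_grad()
        loss_fn(model(x.view(-1, 784)), y).backward()
        opt.step()
    with torch.no_grad():
        correct = sum((model(x.view(-1, 784)).argmax(1) == y).sum().item()
                      for x, y in test_loader)
    print(epoch, correct / len(mnist_test))  # test accuracy per epoch for the graph

For the weight visualization, each of the 10 rows of model.weight can be reshaped to 28 × 28 and plotted as an image.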
Problem 2: Autoencoders [4 points]
1. Replicate the test accuracy graph on M02-S12.
2. That means you also need to show the figures in M02-S11.
3. Note that your encoder weights are frozen; you only update the softmax layer weights (the
100 × 10 matrix and the bias).
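A minimal sketch of the frozen-encoder setup, assuming encoder is the pretrained module (784 → 100) and train_loader is as before; only the 100 × 10 softmax layer receives updates:

import torch

for p in encoder.parameters():
    p.requires_grad = False                     # encoder weights stay frozen

softmax_layer = torch.nn.Linear(100, 10)        # the 100 x 10 matrix and the bias
opt = torch.optim.SGD(softmax_layer.parameters(), lr=0.1)  # only these update
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in train_loader:
    with torch.no_grad():
        code = encoder(x.view(-1, 784))         # frozen features
    loss = loss_fn(softmax_layer(code), y)
    opt.zero_grad()
    loss.backward()
    opt.step()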
Problem 3: A shallow NN [3 points]
1. Replicate the test accuracy graph on M02-S14.
2. I don’t have to see the visualization of the first layer. Just show me your graphs.
Problem 4: Full BP on both layers [6 points]
1. Replicate the test accuracy graph on M02-S17.
Replicate the figures in M03 Adult Optimization, slide 22, using the following details:
1. Use the same network architecture and train five different network instances in five different
setups. The architecture has to be a fully connected network (a regular network, not a CNN
or RNN) with five hidden layers, 512 hidden units per layer.
2. Create five different networks that share the same architecture as follows:
(a) Activation function: the logistic sigmoid function; initialization: random numbers generated from the normal distribution (µ = 0, σ = 0.01)
(b) Activation function: the logistic sigmoid function; initialization: Xavier initializer
(c) Activation function: ReLU; initialization: random numbers generated from the normal
distribution (µ = 0, σ = 0.01)
(d) Activation function: ReLU; initialization: Xavier initializer
(e) Activation function: ReLU; initialization: Kaiming He’s initializer
3. You don't have to implement your own initializer. Both TF and PT come with pre-implemented
initializers (see the sketch after this list).
4. Train them with the traditional SGD. Do not improve SGD by introducing momentum or
any other advanced stuff. Your goal is to replicate the figures in slide 22. Feel free to use a pre-implemented SGD optimizer.
5. In practice, you will need to investigate different learning rates for SGD, which will give
you different convergence behaviors.
6. Don’t worry if your graphs are slightly different from mine. We will give a full mark if your
graphs show the same trend.
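Here is the sketch mentioned in item 3: a minimal PyTorch setup of the five configurations using pre-implemented initializers. The MNIST input size (784) and output size (10) are assumptions for illustration:

import torch.nn as nn
import torch.nn.init as init

def make_net(act, init_fn):
    # Fully connected net: five hidden layers, 512 units each.
    dims = [784] + [512] * 5 + [10]
    layers = []
    for i in range(len(dims) - 1):
        lin = nn.Linear(dims[i], dims[i + 1])
        init_fn(lin.weight)
        layers += [lin, act()] if i < len(dims) - 2 else [lin]
    return nn.Sequential(*layers)

normal001 = lambda w: init.normal_(w, mean=0.0, std=0.01)
nets = {
    "sigmoid + N(0, 0.01)": make_net(nn.Sigmoid, normal001),
    "sigmoid + Xavier":     make_net(nn.Sigmoid, init.xavier_uniform_),
    "ReLU + N(0, 0.01)":    make_net(nn.ReLU,    normal001),
    "ReLU + Xavier":        make_net(nn.ReLU,    init.xavier_uniform_),
    "ReLU + Kaiming He":    make_net(nn.ReLU,    init.kaiming_uniform_),
}

Each of the five nets is then trained with the same plain-SGD loop, and the five test accuracy curves are overlaid in one figure.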

Problem 1: Speech Denoising Using Deep Learning [3 points]
1. If you took my MLSP course, you may think that you've seen this problem. But, it's actually
somewhat different from what you did before, so read carefully. And, this time you SHOULD
implement a DNN with at least two hidden layers.
2. When you attended IUB, you took a course taught by Prof. K. Since you really liked his
lectures, you decided to record them without the professor’s permission. You felt awkward,
but you did it anyway because you really wanted to review his lectures later.
3. Although you meant to review the lecture every time, it turned out that you never listened
to it. After graduation, you realized that a lot of concepts you face at work were actually
covered by Prof. K’s class. So, you decided to revisit the lectures and study the materials
once again using the recordings.
4. You should have reviewed your recordings earlier. It turned out that a fellow student who used
to sit next to you always ate chips in the middle of the class right beside your microphone.
So, Prof. K’s beautiful deep voice was contaminated by the annoying chip eating noise.
5. But, you vaguely recall that you learned some things about speech denoising and source separation from Prof. K's class. So, you decided to build a simple deep learning-based speech
denoiser that takes a noisy speech spectrum (speech plus chip eating noise) and then produces
a cleaned-up speech spectrum.
6. Since you don’t have Prof. K’s clean speech signal, I prepared this male speech data recorded
by other people. train dirty male.wav and train clean male.wav are the noisy speech and
its corresponding clean speech you are going to use for training the network. Take a listen to
them. Load them and convert them into spectrograms, which are the matrix representations of
signals. To do so, you'll need to install librosa and use it with the following code:
!pip install librosa  # in Colab, you'll need to install this
import librosa
s, sr = librosa.load('train_clean_male.wav', sr=None)
S = librosa.stft(s, n_fft=1024, hop_length=512)
sn, sr = librosa.load('train_dirty_male.wav', sr=None)
X = librosa.stft(sn, n_fft=1024, hop_length=512)
which is going to give you two matrices S and X of size 513 × 2459. This procedure is
called the Short-Time Fourier Transform (STFT).
7. Take their magnitudes by using np.abs() or whatever other suitable methods, because S and
X are complex valued. Let’s call them |S| and |X|.
8. Train a fully-connected deep neural network. A couple of hidden layers would work, but
feel free to try out whatever structure, activation function, initialization scheme you’d like.
The input to the network is a column vector of |X| (a 513-dim vector) and the target is its
corresponding one in |S|. You may want to do some mini-batching for this. Make use of
whatever functions TensorFlow or PyTorch provide.
9. But, remember that your network should predict nonnegative magnitudes as output. Try to
use a proper activation function in the last layer to make sure of that. I don’t care which
activation function you use in the middle layers.
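For reference, a minimal sketch of such a network in PyTorch; the layer widths are illustrative assumptions, and the final ReLU enforces the nonnegative output discussed in item 9:

import torch.nn as nn

denoiser = nn.Sequential(
    nn.Linear(513, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 513), nn.ReLU(),   # last-layer ReLU keeps magnitudes nonnegative
)

Softplus would be another reasonable choice for the last layer.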
10. test 01 x.wav is the noisy signal for validation. Load them and apply STFT as before. Feed
the magnitude spectra of this test mixture |Xtest| to your network and predict their clean
magnitude spectra |Ŝ_test|. Then, you can recover the (complex-valued) speech spectrogram
of the test signal in this way:

Ŝ_test = (X_test / |X_test|) ⊙ |Ŝ_test|,   (1)

which means you take the phase information of the input noisy signal, X_test / |X_test|, and use it to
recover the clean speech. ⊙ stands for the Hadamard product, and the division is element-wise, too.
11. Recover the time-domain speech signal by applying an inverse STFT to Ŝ_test, which will give
you a vector. Let's call this cleaned-up test speech signal ŝ_test. I'll calculate something called
Signal-to-Noise Ratio (SNR) by comparing it with the ground-truth speech I didn't share
with you. It should be reasonably good. You can actually write it out by using the following
code:
librosa.output.write_wav('test_s_01_recons.wav', sh_test, sr)
or (note that librosa.output.write_wav was removed in newer librosa versions, so the soundfile route is safer)
import soundfile as sf
sf.write('test_s_01_recons.wav', sh_test, sr)
12. You can compute SNR if you know the ground-truth source. Load test 01 s.wav. This is
the ground-truth clean signal buried in test 01 x.wav. Compute the SNR of the predicted
validation signal by comparing it to test 01 s.wav, but do not include this example in your
training process. Once the training process is done, or even in the middle of training epochs,
you apply your model to this validation example, and compute the SNR value. That way,
you can simulate the testing environment, although it doesn’t guarantee that the model will
work well on the test example, because the validation example can be different from the test
set. This approach is related to the early stopping technique explained in M03 S37. Use this
validation signal to prevent overfitting. By the way, SNR is defined as follows:
SNR = 10 log10 ( Σ_t s(t)² / Σ_t (s(t) − ŝ(t))² ),   (2)

where s(t) and ŝ(t) are the ground-truth clean speech and the recovered one in the time
domain, respectively. Be careful with the division and logarithm: you don't want your denominator to be zero, or anything inside the log function to be zero. Adding a very small number,
e.g., 1e-20, is a good idea to prevent that.
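A direct sketch of equation (2) in NumPy (the function name is illustrative):

import numpy as np

def snr(s, s_hat, eps=1e-20):
    # SNR in dB between ground truth s and estimate s_hat, as in equation (2).
    n = min(len(s), len(s_hat))        # the inverse STFT may differ slightly in length
    s, s_hat = s[:n], s_hat[:n]
    return 10 * np.log10(np.sum(s ** 2) / (np.sum((s - s_hat) ** 2) + eps) + eps)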
13. Do the same testing procedure for test 02 x.wav, which actually contains Prof. K’s voice
along with the chip-eating noise. Enjoy his enhanced voice using your DNN.
14. Grading will be based on the denoised version of test 02 x.wav. So, submit the audio file.
Problem 2: Speech Denoising Using 1D CNN [4 points]
1. As an audio guy it's sad to admit, but a lot of audio signal processing problems can be solved
in the time-frequency domain, i.e., an image-like version of the audio signal. You've learned how
to do it in the previous homework by using the STFT and its inverse.
2. What that means is nothing stops you from applying a CNN to the same speech denoising
problem. In this question, I’m asking you to implement a 1D CNN that does the speech
denoising job in the STFT magnitude domain. 1D CNN here means a variant of CNN which
does the convolution operation along only one of the axes. In our case it’s the frequency axis.
3. Like you did in Problem 1, install/load librosa. Take the magnitude spectrograms of the
dirty signal and the clean signal |X| and |S|.
4. Both in Tensorflow and PyTorch, you’d better transpose this matrix, so that each row of
the matrix is a spectrum. Your 1D CNN will take one of these row vectors as an example,
i.e., |X|_(:,i). Since this is not an RGB image with three channels, nor will you use any other
information than just the magnitude during training, your input image has only one channel
(depth-wise). Coupled with your choice of the minibatch size, the dimensionality of your
minibatch would be like this: [(batch size) × (number of channels) × (height) × (width)] =
[B × 1 × 1 × 513]. Note that depending on the implementation of the 1D CNN layers in TF
or PT, it’s okay to omit the height information. Carefully read the definition of the function
you’ll use.
5. You’ll also need to define the size of the kernel, which will be 1 × D, or simply D depending
on the implementation (because we know that there’s no convolution along the height axis).
6. If you define K kernels in the first layer, the output feature map's dimension will be [B × K ×
1 × (513 − D + 1)]. You don't need too many kernels, but feel free to investigate. You don't
need too many hidden layers, either.
7. In the end, you know, you have to produce an output matrix of [B × 513], which is the
approximation of the clean magnitude spectra of the batch. It's a dimension that's hard to match
using a CNN only, unless you take care of the edges by padding zeros (let's not do zero-padding
for this homework). Hence, you may want to flatten the last feature map as a vector, and
add a regular linear layer to reduce that dimensionality down to 513.
8. Meanwhile, although this flattening-followed-by-linear-layer approach should work in theory,
the dimensionality of your flattened CNN feature map might be too large. To handle this
issue, we will use the concept we learned in class, striding: usually, a stride larger than
1 can reduce the dimensionality after each CNN layer. You could consider this option in
all convolutional layers to reduce the size of the feature maps gradually, so that the input
dimensionality of the last fully-connected (FC) layer is manageable. Maxpooling, coupled
with the striding technique, would be something to consider.
9. Be very careful about this dimensionality, because you have to define the input and output
dimensionality of the FC layer in advance. For example, a stride of 2 pixels will reduce the
feature dimension down to roughly 50%, though not exactly if the original dimensionality is
an odd number.
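For example, a minimal PyTorch sketch; the kernel sizes, channel counts, and strides are illustrative assumptions, and the FC input size must match whatever feature-map length they produce:

import torch.nn as nn

cnn1d = nn.Sequential(                                      # input: [B, 1, 513]
    nn.Conv1d(1, 16, kernel_size=16, stride=2), nn.ReLU(),  # -> [B, 16, 249]
    nn.Conv1d(16, 32, kernel_size=8, stride=2), nn.ReLU(),  # -> [B, 32, 121]
    nn.Flatten(),                                           # -> [B, 32 * 121]
    nn.Linear(32 * 121, 513), nn.ReLU(),                    # nonnegative magnitudes
)

Each length follows the no-padding rule floor((L − D) / stride) + 1, e.g., floor((513 − 16) / 2) + 1 = 249.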
10. Don’t forget to apply the activation function of your choice, at every layer, especially in the
last layer.
11. Try whatever optimization techniques you’ve learned so far.
12. Check on the quality of the test signal you used in P1. Submit the denoised signal.
Problem 3: Data Augmentation [4 points]
1. CIFAR10 is a pretty straightforward image classification task that consists of 10 visual object
classes.
2. Download them from here1 and be ready to use it. Both PyTorch and Tensorflow have options
to conveniently load them, but I chose to download them directly and mess around because
I found it easier.
3. Set aside 5,000 training examples for validation.
4. Build your baseline CNN classifier.
(a) The images need to be reshaped into a 32 × 32 × 3 tensor.
(b) Each pixel is an integer with 8-bit encoding (from 0 to 255). Transform them down to
floating point in the range [0, 1]. 0 means a black pixel and 1 a white one.
¹ https://www.cs.toronto.edu/~kriz/cifar.html
(c) People like to rescale the pixels to [-1, 1] so that the input to the CNN is well centered
around 0, instead of 0.5.
(d) I know you are eager to try out a fancier net architecture, but let’s stick to this simple
one:
1st 2D conv layer: 10 kernels of size 5×5×3; stride=1
Max pooling: 2×2 with stride=2
2nd 2D conv layer: 10 kernels of size 5×5×10; stride=1
Max pooling: 2×2 with stride=2
1st fully-connected layer: [flattened final feature map] × 20
2nd fully-connected layer: 20 × 10, with softmax on the 10 classes
Let's stick to ReLU for activation and the He initializer.
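The spec above, written out as a minimal PyTorch sketch; the flattened size 5 · 5 · 10 = 250 follows from 32 → 28 → 14 → 10 → 5 with these kernels and poolings:

import torch.nn as nn

baseline = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=5, stride=1), nn.ReLU(),    # 32x32 -> 28x28
    nn.MaxPool2d(2, stride=2),                               # -> 14x14
    nn.Conv2d(10, 10, kernel_size=5, stride=1), nn.ReLU(),   # -> 10x10
    nn.MaxPool2d(2, stride=2),                               # -> 5x5
    nn.Flatten(),                                            # -> 250
    nn.Linear(250, 20), nn.ReLU(),
    nn.Linear(20, 10),   # softmax is folded into the cross-entropy loss
)

(PyTorch's default initializer for these layers is already a Kaiming-style one; call nn.init.kaiming_normal_ on the weights explicitly if you want to be sure.)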
(e) Train this net with an Adam optimizer with the default initial learning rate (i.e., 0.001).
Check on the validation accuracy at the end of every epoch. Report your validation
accuracy over the epochs as a graph. This is the performance of your baseline system.
5. Build another classifier using an augmented dataset. Prepare four different datasets out
of the original CIFAR10 training set (except for the 5,000 you set aside for validation):
(a) I know you already changed the scale of the pixels from [0, 255] to [-1, +1]. Let's go back
to the intermediate range, [0, 1].
(b) Augmented dataset #1: Brighten every pixel in every image by 10%, e.g., by multiplying by 1.1. Make sure, though, that they don't exceed 1. For example, you may want to
do something like this: np.minimum(1.1*X, 1).
(c) Augmented dataset #2: Darken every pixel in every image by 10%, e.g., by multiplying by 0.9.
(d) Augmented dataset #3: Flip all images horizontally (not upside down). As if they
are mirrored.
(e) Augmented dataset #4: The original training set.
(f) Merge the four augmented datasets into one gigantic training set. Since there are 45,000
images in the original training set (after excluding the validation set), after the augmentation you have 45,000×4=180,000 images. Each original image has four different
versions: brighter, darker, horizontally flipped, and original versions. Note that the four
share the same label: a darker frog is still a frog.
(g) Don't forget to scale back to [-1, +1].
(h) You'd better visualize a few images after the augmentation to make sure what you did is
correct.
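A minimal NumPy sketch of steps (a) through (h), assuming X holds the 45,000 training images as a [45000, 32, 32, 3] float array in [0, 1] and y holds their labels:

import numpy as np

brighter = np.minimum(1.1 * X, 1.0)       # dataset #1: 10% brighter, clipped at 1
darker = 0.9 * X                          # dataset #2: 10% darker
flipped = X[:, :, ::-1, :]                # dataset #3: horizontal (left-right) flip
X_aug = np.concatenate([brighter, darker, flipped, X])   # 180,000 images
y_aug = np.concatenate([y, y, y, y])                     # the four versions share labels
X_aug = 2.0 * X_aug - 1.0                 # scale back to [-1, +1]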
(i) Train a fresh new network with the same architecture, but using this augmented dataset.
Record the validation accuracy over the epochs.
6. Overlay the validation accuracy curve from the baseline with the new curve recorded from
the augmented dataset. I ran 200 epochs for both experiments and was able to see convincing
results (i.e., the data augmentation improves the validation performance).
7. In theory you have to conduct a test run on the test set, but let’s forget about it.
Problem 4: Self-Supervised Learning via Pretext Tasks [4 points]
1. Suppose that you have only 50 labeled examples per class for your CIFAR10 classification
problem, totaling 500 training images. Presumably it might be tough to achieve a high
performance in this situation.
2. Set aside 500 examples from your training set (I chose the last 500 examples).
3. The pretext task:
(a) On the other hand, we will assume that the rest of the 49,500 training examples are
unlabeled. We will create a bogus classification problem using them. Let these unlabeled
examples (i.e., the examples whose original labels you disregard) be "class 0".
(b) “class 1”: Create a new class, by vertically flipping all the images upside down.
(c) "class 2": Create another class, by rotating the images 90 degrees counter-clockwise.
(d) Now you have three classes, each of which contains 49,500 labeled examples.
(e) This is not a classification problem one can be serious about, but the idea here is that a
classifier that is trained to solve this problem may need to learn some features that are
going to be helpful for the original CIFAR10 classification problem.
(f) Train a network with the same setup/architecture described in Problem 3. In theory
you need to validate every now and then to prevent overfitting, but who cares about this
dummy problem? Let’s forget about it and just run about a hundred epochs.
(g) Store your model somewhere safe. Both TF and PT provide a nice way to save the net
parameters.
4. The baseline:
(a) Train a classifier from scratch on the 500 CIFAR10 examples you set aside in the beginning.
Note that they are for the original 10-class classification problem, and you ARE doing
the original CIFAR10 classification, except that you use a ridiculously small amount of
data. Let's stick to the same architecture/setup. You may need to choose a reasonable
initializer, e.g., the He initializer. You know, since the training set is so small, you may
not even have to do batching.
(b) Let’s cheat here and use the test set of 10,000 examples as if they are our validation set.
If you check on the test accuracy at every 100th epoch, you will see it overfit at some
point. Record the accuracy values over iterations.
5. The transfer learning task:
(a) Train your third classifier on the 500 CIFAR10 examples you set aside in the beginning.
Again, note that they are for the original 10-class classification problem.
(b) Instead of using an initializer, you will reload the weights from the pretext network.
Yes, that’s exactly the definition of transfer learning. But, because you learned it from
an unlabeled set, and had to create a pretext task to do so, it falls in the category of
self-supervised learning.
(c) Note that you can transfer all the parameters in, except for the final softmax layer, as the
pretext task has only 3 classes. Let's randomly initialize the last-layer parameters
with He.
(d) You need to reduce the learning rates for transfer learning in general. More importantly,
for the ones you transfer in, they have to be substantially lower than 1 × 10⁻³, e.g.,
1 × 10⁻⁵ or 1 × 10⁻⁶. Meanwhile, the last softmax layer will prefer the default learning
rate 1 × 10⁻³, as it's randomly initialized.
(e) Report your test accuracy at every 100th epoch.
6. Draw two graphs from the two experiments, the baseline and the finetuning method, and
compare the results. For your information, I ran both of them 10,000 epochs, and recorded
the validation accuracy (actually, the test accuracy as I used the test set) at every 100th
epoch. Of course, the point is that the self-supervised features should give improvement.

Problem 1: Network Compression Using SVD [2 points]
1. Train a fully-connected net for MNIST classification. It should have 5 hidden layers, each
with 1024 hidden units. Feel free to use whatever techniques you learned in class. You should be
able to get the test accuracy above 98%. Let’s call this network “baseline”. You can reuse the one
from the previous homework if its accuracy is good enough. Otherwise, this would be a good chance
for you to improve your “baseline” MNIST classifier.
2. You learned that Singular Value Decomposition (SVD) can compress the weight matrices (Module
6). You have 6 different weight matrices in your baseline network, i.e., W^(1) ∈ R^(784×1024), W^(2) ∈ R^(1024×1024), ..., W^(5) ∈ R^(1024×1024), W^(6) ∈ R^(1024×10). Run SVD on each of them, except for W^(6),
which is too small already, to approximate the weight matrices:

W^(l) ≈ Ŵ^(l) = U^(l) S^(l) V^(l)⊤.   (1)

For this, feel free to use whatever implementation you can find. tf.svd or torch.svd will serve the
purpose. Note that we don't compress the biases (just because we're lazy).
3. If you look into the singular value matrix S^(l), it should be a diagonal matrix. Its values are sorted
in the order of their contribution to the approximation. What that means is that you can discard
the least important singular values by sacrificing some approximation performance. For example, if you
choose to use only D singular values and if the singular values are sorted in descending order,

W^(l) ≈ Ŵ^(l) = U^(l)_(:,1:D) S^(l)_(1:D,1:D) (V^(l)_(:,1:D))⊤.   (2)

You may expect the Ŵ^(l) in (2) to be a worse approximation of W^(l) than the one in (1) due to the
missing components. But, by doing so you can do some compression.
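A minimal PyTorch sketch of the rank-D truncation in equation (2), applied to a trained weight matrix W:

import torch

def truncate(W, D):
    # Rank-D approximation of W via SVD, as in equation (2).
    U, S, V = torch.svd(W)       # singular values are returned in descending order
    return U[:, :D] @ torch.diag(S[:D]) @ V[:, :D].t()

# e.g., replace a layer's weight with its rank-20 approximation at test time:
# layer.weight.data = truncate(layer.weight.data, 20)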
4. Vary your D from 10, 20, 50, 100, 200, to D_full, where D_full is the original size of S^(l) (so D = D_full
means you use (1) instead of (2)). For example, D_full = 784 when l = 1 and 1024 when l > 1. Now you
have 6 differently compressed versions that use Ŵ^(l) for feedforward, each of which uses one of the
6 D values of your choice. Report the test accuracy of the six approximated networks
(perhaps a graph whose x-axis is D and y-axis is the test accuracy). You'll see that when D = D_full
the test accuracy is almost as good as the baseline, while D = 10 will give you the worst performance.
Note, however, that D = D_full doesn't give you any compression, while smaller choices of D can reduce
the amount of computation during feedforward.
5. Report your test accuracies of the six SVDed versions along with your baseline performance. Report
the number of parameters of your SVDed networks and compare them to the baseline's. Be careful
with the S^(l) matrices: they are diagonal, meaning there are only D nonzero elements.
6. Note that you don’t have to run the SVD algorithm multiple times to vary D. Run it once, and extract
different versions by varying D. That’s what’s good about SVD.
Problem 2: Network Compression Using SVD [2 points]
1. Now you learned that the low-rank approximation of W^(l) gives you some compression. However, you
might not like the performance of the too-small D values. From now on, fix your D = 20 and let's
improve its performance.
2. Define a NEW network whose weight matrices W^(l) are factorized. Again, this is a new one, different
from your baseline in P1. In this new network, you don't estimate W^(l) directly anymore, but its
factor matrices, to reconstruct W^(l) as follows: W^(l) = U^(l) V^(l)⊤.
3. In other words, the feedforward is now defined like this:

x^(l+1) ← g(U^(l) V^(l)⊤ x^(l) + b^(l)).   (3)
4. But instead of randomly initializing these factor matrices, initialize them using the P1 SVD results of
the D = 20 case:

U^(l) ← U^(l)_(:,1:20),   V^(l)⊤ ← S^(l)_(1:20,1:20) (V^(l)_(:,1:20))⊤.   (4)
5. Again, note that U and V are the new variables that you need to estimate via optimization. They
are fancier though, because they are initialized using the SVD results. If you stop here, you’ll get the
same test performance as in P1.
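One way to realize this in PyTorch is a small module holding the two factors, initialized per equation (4); a minimal sketch, following the text's W^(l) ∈ R^(in×out) orientation:

import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    # Layer whose weight is U @ V^T, initialized from the rank-20 SVD of W.
    def __init__(self, W, b, D=20):
        super().__init__()
        U, S, V = torch.svd(W)                                    # W: [in, out]
        self.U = nn.Parameter(U[:, :D])                           # [in, D]
        self.Vt = nn.Parameter(torch.diag(S[:D]) @ V[:, :D].t())  # [D, out]
        self.b = nn.Parameter(b.clone())

    def forward(self, x):
        return (x @ self.U) @ self.Vt + self.b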
6. Finetune this network. Now this new network has new parameters to update, i.e., U^(l) and V^(l) (as
well as the bias terms). Update them using BP. Since you initialized the new parameters with SVD,
which is a pretty good starting point, you may want to use a smaller-than-usual learning rate.
7. Report the test-time classification accuracy.
Problem 3: Network Compression Using SVD [3 points]
1. Another way to improve our D = 20 case is to inform the training process of the SVD approximation.
It's a different method from P1, where SVD was performed once after the network training was
completed. This time, we do SVD at every epoch.
2. Initialize W^(l) using the "baseline" model. We will finetune it.
3. This time, for the feedforward pass, you never use W^(l) directly. Instead, you do SVD at every iteration and
make sure the feedforward pass always uses Ŵ^(l) = U^(l)_(:,1:20) S^(l)_(1:20,1:20) (V^(l)_(:,1:20))⊤.
4. What that means for the training algorithm is that you should think of the low-rank SVD procedure
as an approximation function W^(l) ≈ f(W^(l)) = U^(l)_(:,1:20) S^(l)_(1:20,1:20) (V^(l)_(:,1:20))⊤.
5. Hence, the update for W^(l) involves the derivative f′(W^(l)) due to the chain rule (see M6 S15, where I
explained this in the quantization context). You can naïvely assume that your SVD approximation is
near perfect (although it's not). Then, at least for the BP, you don't have to worry about the gradients,
as the derivative will be just one everywhere, because f(x) = x. By doing so, you can feedforward
using Ŵ^(l) while the updates are done on W^(l):

Feedforward:   (5)
  Perform SVD: W^(l) ≈ U^(l)_(:,1:20) S^(l)_(1:20,1:20) (V^(l)_(:,1:20))⊤   (6)
  Perform feedforward: x^(l+1) ← g(U^(l)_(:,1:20) S^(l)_(1:20,1:20) (V^(l)_(:,1:20))⊤ x^(l) + b^(l))   (7)
Backpropagation:   (8)
  Update parameters: W^(l) ← W^(l) − η · (∂L/∂f(W^(l))) · (∂f(W^(l))/∂W^(l))   (9)

Note that ∂f(W^(l))/∂W^(l) = 1 everywhere due to our identity assumption.
6. As the feedforward is always using the SVD’ed version of the weights, the network is aware of the
additional error introduced by the compression and can deal with it during training. The implementation of this technique requires you to define an SVD routine running in the middle of the feedforward
process. Both TF and PT provide their SVD implementations you can use:
Tensorflow 2x: https://www.tensorflow.org/api_docs/python/tf/linalg/svd
PyTorch: https://pytorch.org/docs/stable/generated/torch.svd.html
Although it takes more time to train (because you need to do SVD at every iteration), I like it as I can
boost the performance of the D = 20 compressed network up to around 97%. Considering the amount
of memory saving (i.e., after the compression it uses only about 2%!), this is a great way to compress
your network.
Problem 4: Speaker Verification [4 points]
1. In this problem, we are going to build a speaker verification system. It takes two utterances as input,
and predicts whether they were spoken by the same speaker (positive class) or not (negative class).
2. trs.pkl contains a 500×16,180 matrix, each row of which is a speech signal with 16,180 samples. The rows are
the vectors returned by the librosa.load function. Similarly, tes.pkl holds a 200×22,631 matrix.
3. The training matrix is ordered by speakers. Each speaker has 10 utterances, and there are 50 such
speakers (that's why there are 500 rows). Similarly, the test set has 20 speakers, each of whom has
10 utterances.
4. Randomly sample L pairs of utterances from the ten utterances of the first speaker. In theory, there are
(10 choose 2) = 45 pairs you can sample from (the order of the two utterances within a pair doesn't matter).
You can use all 45 of them if you want. These are the positive examples in your first minibatch.
5. Let's construct L negative pairs as well. First, randomly sample L utterances from the other 49 training
speakers. Second, randomly sample another L utterances from the first speaker (the speaker you
sampled the positive pairs from). Using these two sets, each of which has L examples, form another set of
L pairs. If L > 10, you'll need to repeatedly use the first speaker's utterances (i.e., sampling with
replacement). This set is your negative examples: each pair contains an utterance from the
first speaker and a random utterance spoken by a different speaker.
6. The L positive pairs and L negative pairs form your first minibatch. You have 2L pairs of utterances
in total.
7. Repeat this process for the other training speakers, so that each speaker is represented by L positive
pairs and L negative pairs. By doing so, you can form 50 minibatches with a balanced number of
positive and negative pairs.
8. Train a Siamese network that tries to predict 1 for the positive pairs and 0 for the negative ones. In a
minibatch, since you have L positive and L negative pairs, your net must predict L ones
and L zeros, respectively.
9. I found that STFT on the signals serves as the initial feature extraction process. Therefore, your Siamese
network will take as input TWO spectrograms, each of which is of size 513 × T. I wouldn't care
too much about your choice of the network architecture this time (if it works anyway), but it has to
somehow predict a fixed-length feature vector for the given sequence of spectra (consequently, TWO
fixed-length vectors for the pair of input spectrograms). Using the inner product of the two latent
embedding vectors as the input to the sigmoid function, you'll do a logistic regression. Use your
imagination and employ whatever techniques you learned in class to design/train this network.
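A minimal sketch of such a Siamese model in PyTorch; the GRU size and the use of its last hidden state as the utterance embedding are assumptions:

import torch
import torch.nn as nn

class Siamese(nn.Module):
    # Shared GRU encoder; the inner product of the two embeddings feeds a sigmoid.
    def __init__(self, hidden=128):
        super().__init__()
        self.gru = nn.GRU(input_size=513, hidden_size=hidden, batch_first=True)

    def embed(self, spec):               # spec: [B, T, 513] magnitude spectra
        _, h = self.gru(spec)
        return h[-1]                     # last hidden state as the fixed-length embedding

    def forward(self, spec_a, spec_b):
        score = (self.embed(spec_a) * self.embed(spec_b)).sum(dim=1)
        return torch.sigmoid(score)      # probability that the pair is the same speaker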
10. Construct similar batches from the test set, and test the verification accuracy of your network. Report
your test-time speaker verification performance. I was able to get a decent result (∼ 70%) with a
reasonable network architecture (e.g., a GRU working on STFT), which converged in a reasonable
amount of time (i.e. in an hour).
11. Submit your code and accuracy on the test examples.
Problem 5: Speech Denoising Using RNN [4 points]
1. Audio signals naturally contain some temporal structure to make use of for the prediction job. Speech
denoising is a good example. In this problem, we'll come up with a reasonably complicated RNN
implementation for the speech denoising job.
2. homework3.zip contains a folder tr. There are 1,200 noisy speech signals (from trx0000.wav to
trx1199.wav) in there. To create this dataset, I started from 120 clean speech signals spoken by 12
different speakers (10 sentences per speaker), and then mixed each of them with 10 different kinds
of noise signals. For example, from trx0000.wav to trx0009.wav are all saying the same sentence
spoken by the same person, while they are contaminated by different noise signals. I also provide the
original clean speech (from trs0000.wav to trs1199.wav) and the noise sources (from trn0000.wav
to trn1199.wav) in the same folder. For example, if you add up the two signals trs0000.wav and
trn0000.wav, that will make up trx0000.wav, although you don’t have to do it because I already did
it for you.
3. Load all of them and convert them into spectrograms like you did in homework 2. Don't forget to
take their magnitudes. For the mixtures (trxXXXX.wav), you'll see that there are 1,200 nonnegative
matrices whose number of rows is 513, while the number of columns depends on the length of the
original signal. Ditto for the speech and noise sources. Eventually, you'll construct three lists of
magnitude spectrograms with variable lengths: |X_tr^(l)|, |S_tr^(l)|, and |N_tr^(l)|, where l denotes one of the
1,200 examples.
4. The |X_tr^(l)| matrices are your input to the RNN for training. An RNN (either GRU or LSTM is fine)
will consider each of them as a sequence of 513-dimensional spectra. For each of the spectra, you want to do a
prediction for the speech denoising job.
5. The target of the training procedure is something called Ideal Binary Masks (IBM). You can easily
construct an IBM matrix per spectrogram as follows:

M^(l)_(f,t) = 1 if |S_tr^(l)|_(f,t) > |N_tr^(l)|_(f,t);
             0 if |S_tr^(l)|_(f,t) ≤ |N_tr^(l)|_(f,t).   (10)

IBM assumes that each time-frequency bin at (f, t), an element of the |X_tr^(l)| matrix, is from
either speech or noise. Although this is not the case in the real world, it works like a charm most of the
time by doing this operation:

S_tr^(l) ≈ Ŝ_tr^(l) = M^(l) ⊙ X_tr^(l).   (11)

Note that masking is done to the complex-valued input spectrograms. Also, since masking is element-wise, the sizes of M^(l)
and X_tr^(l) are the same. Eventually, your RNN will learn a function that approximates
this relationship:

M^(l)_(:,t) ≈ M̂^(l)_(:,t) = RNN(|X_tr^(l)|_(:,1:t); W),   (12)
where W is the network parameters to be estimated.
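Constructing the IBM targets of equation (10) is a one-liner per example; a minimal NumPy sketch, with S_mags and N_mags as the assumed names of the magnitude-spectrogram lists:

import numpy as np

IBMs = [(S_abs > N_abs).astype(np.float32)
        for S_abs, N_abs in zip(S_mags, N_mags)]   # equation (10), per example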
6. Train your RNN using this training dataset. Feel free to use whatever LSTM or GRU cells available
in Tensorflow or PyTorch. I find dropout helpful, but you may want to be gentle about the dropout
ratio. I didn’t need too complicated network structures to beat a fully-connected network.
7. Implementation note: In theory, you must be able to feed an entire sentence (one of the X_tr^(l) matrices)
as an input sequence. You know, in RNNs a sequence is an input sample. On top of that, you still
want to do mini-batching. Therefore, your mini-batch is a 3D tensor, not a matrix. For example, in
my implementation, I collect ten spectrograms, e.g., from X_tr^(0) to X_tr^(9), to form a 513 × T × 10 tensor
(where T means the number of columns in the matrix). Therefore, you can think that the mini-batch
size is 10, while each example in the batch is not a multidimensional feature vector, but a sequence of
them. This tensor is the mini-batch input to my network. Instead of feeding the full sequence as an
input, you can segment the input matrix into smaller pieces, say 513 × T_trunc × N_mb, where T_trunc is the
fixed number to truncate the input sequence and N_mb is the number of such truncated sequences in a
mini-batch, so that the recurrence is limited to T_trunc during training. In practice this doesn't make a big
difference, so either way is fine. Note that during test time the recurrence works from the beginning
of the sequence to the end (which means you don't need truncation for testing and validation).
8. I also provide a validation set in the folder v. Check out the performance of your network on this dataset.
Of course you'll need to see the validation loss, but eventually you'll need to check out the SNR values.
For example, for a recovered validation sequence in the STFT domain, Ŝ_v^(l) = M̂^(l) ⊙ X_v^(l), you'll
perform an inverse STFT using librosa.istft to produce a time-domain waveform ŝ(t). Normally
for this dataset, a well-tuned fully-connected net gives slightly above 10 dB SNR. So, your validation
set should give you a number larger than that. Once again, you don't need to come up with a too-large
network. Start from a small one.
9. We’ll test the performance of your network in terms of the test data. I provide some test signals in
te, but not their corresponding sources. So, you can’t calculate the SNR values for the test signals.
Submit your recovered test speech signals in a zip file, which are the speech denoising results on the
signals in te. We’ll calculate SNR based on the ground-truth speech we set aside from you.

Problem 1: RNNs as a generative model [4 points]
1. We will train an RNN (LSTM or GRU; you choose one) that can predict the bottom half of
an MNIST image given the top half. So, yes, this is a generative model that can "draw" handwritten
digits.
2. As the first step, let's divide every training image into 16 smaller patches. Since the original images
are 28×28 pixels, you need to chop each image into 7 × 7 patches. There is no overlap
between those patches. Above is an example from my implementation. It's an image of the
number "5" obviously, but it's just chopped into 16 patches.
3. Let’s give them an order. Let’s do it from the top left corner to the bottom right corner. Below is
going to be the order of the patches.
4. Now that we have an order, we’ll use it to turn each MNIST image into a sequence of smaller patches.
Although the layout below would be a potential way to turn these patches into a sequence,
I wouldn't use it exactly, because then it would be a sequence of 2D arrays, not vectors.
5. While I’ll keep the same order, to simplify our model architecture, we will vectorize each patch from
7×7 to a 49-dimensional vector. Finally, our sequence is a matrix X ∈ R
16×49, where 16 is the number
of “time” steps. This is an input sequence to your RNN for training.
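A minimal NumPy sketch of this patch-sequence construction, assuming images is a [N, 28, 28] array:

import numpy as np

def to_patch_sequence(img):
    # 28x28 image -> [16, 49]: flattened 7x7 patches, top-left to bottom-right.
    patches = img.reshape(4, 7, 4, 7)         # split rows and columns into 4 blocks each
    patches = patches.transpose(0, 2, 1, 3)   # -> [4, 4, 7, 7] in row-major patch order
    return patches.reshape(16, 49)

X_seq = np.stack([to_patch_sequence(img) for img in images])   # [N, 16, 49]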
6. With a proper batch size, say 100, now an input tensor is defined as a 3D array of size 100 × 16 × 49.
But, I’ll ignore the batch size in the equations below to keep the notation uncluttered.
7. Train an RNN out of these sequences. There must be 50,000 such sequences, or 500 minibatches if
your batch size is 100. I tried a couple of different model architectures but both worked quite well.
The smallest one I tried was a 2×64 LSTM. I didn’t do any fancy things like gradient clipping, as the
longest sequence length is still just 16.
8. Remember to add a dense layer, so that you can convert whatever choice of the LSTM or GRU hidden
dimension back to 49. You may also want to use an activation function for your output units so that
the output is bounded.
9. You need to train your RNN in a way that it can predict the next patch out of the so-far-observed
patches. To this end, the LSTM should predict the next patch in the following manner:

(Y_(t,:), C_(t+1,:), H_(t+1,:)) = LSTM(X_(t,:), C_(t,:), H_(t,:)),   (1)

where C and H denote the memory cell and hidden state, respectively (with a GRU, C will be omitted),
which are 0 when t = 0. To work as a predictive model, when you train, you need to compare Y_(t,:) (the
prediction) with X_(t+1,:) (the next patch) and compute the loss (I used MSE as I'm lazy).
10. In other words, you will feed the input sequence X_(1:15,:) (the full sequence except for the last patch) to
the model, whose output Y ∈ R^(15×49) will need to be compared to X_(2:16,:), vector by vector, to compute
the loss:

L = Σ_(t=2)^(16) Σ_(d=1)^(49) D(X_(t,d) || Y_(t−1,d)),   (2)

where D(·||·) is a distance metric of your choice.
11. Let’s use the test set to validate the model at every epoch, to see if it overfits. If it starts to overfit,
stop the training process early. It took from a few to tens of minutes to train the network.
12. Once your net converges, let's move on to the fun "generation" part. Pick a test image that
belongs to a digit class, and feed its first 8 patches to the trained model. It will generate eight patches
(Y ∈ R^(8×49)), and two other vectors as the last memory cell and hidden states: C_(9,:), H_(9,:). Note that
the dimension of the C and H vectors depends on your choice of model complexity.
13. Then, run the model frame by frame, by feeding the last memory cell states, last hidden states, and
the last predicted output as if it's the new input. You will need to run this 7 times using a for loop,
instead of feeding a sequence. Remember, for example, you don't know what to use as an input at
t = 9, because we pretend like we don't know X_(9,:) until you predict Y_(8,:):

(Y_(9,:), C_(10,:), H_(10,:)) = LSTM(Y_(8,:), C_(9,:), H_(9,:))   (3)
(Y_(10,:), C_(11,:), H_(11,:)) = LSTM(Y_(9,:), C_(10,:), H_(10,:))   (4)
(Y_(11,:), C_(12,:), H_(12,:)) = LSTM(Y_(10,:), C_(11,:), H_(11,:))   (5)
...   (6)
(Y_(15,:), C_(16,:), H_(16,:)) = LSTM(Y_(14,:), C_(15,:), H_(15,:))   (7)
14. Note that Y_(15,:) is the prediction for your 16th patch and, e.g., Y_(8,:) is the prediction for your 9th
patch, and so on. We will discard Y_(1:7,:), as they are the predictions of patches that are already given
(i.e., t < 9). Once again, you know, we pretend like the top half (patches 1 to 8) is given, while the
rest (patches 9 to 16) are NOT known.
15. Combine the known top half X_(1:8,:) and the predicted patches Y_(8:15,:) into a sequence of 16 patches. We
are curious how the bottom half looks, as it is the generated part.
16. Reshape the synthesized matrix [X_(1:8,:); Y_(8:15,:)] back into a 28 × 28 image. Repeat this experiment on
10 chosen images from the same digit class.
17. Repeat the experiment for all 10 digit classes. You will generate 100 images in total.
18. Below are examples from my model. On the left, you see the examples whose bottom half is "generated"
by the LSTM model, while the right images are the originals (whose top half was fed to the LSTM). I
can see that the model does a pretty good job, but at the same time there are some interesting failure
cases (blue boxes). For example, if the upper arch of a "3" is too large, the LSTM thinks it's a 2 and draws a
2. Or, for some reason, if some 5's are not making a sharp corner on their top left, the LSTM thinks they're
6's. Same story for a tilted 7 that the LSTM thinks is a 2. So, my point is, if I had to guess the bottom half
of these images, I'd have been confused as well.
19. Submit your 10 × 10 images that your LSTM generated. Submit their original images as well. Your
figures should look like mine (the two figures shown below) in terms of quality. Feel free to do it better
and embarrass me but you’ll get a full mark if the generated images look like mine. Note that these
have to be sampled from your test set, not the training set.
[Figure: (a) LSTM-generated images, (b) the original images]
Problem 2: Variational Autoencoders on Poor Sevens [3 points]
1. tr7.pkl contains 6,265 MNIST digits from its training set, but not all ten digits: I only selected 7's.
Therefore, it's a rank-3 tensor of size 6,265 × 28 × 28. Similarly, te7.pkl contains 1,028 7's.
2. The digit images in this problem are special, because I added a special effect to them. So, they are
different from the original 7’s in the MNIST dataset in a way. I want you to find out what I did to the
poor 7’s.
3. Instead of eyeballing all those images, you need to implement a VAE that finds out a few latent
dimensions, one of which should show you the effect I added.
4. Once again, I wouldn't care too much about your network architecture. This could be a good chance for
you to check out the performance of a CNN encoder, followed by a decoder with deconvolution layers
(or transposed convolution layers), but do something else if you feel like it. I found that fully-connected
networks work just fine.
5. What's important here is that, as a VAE, it needs a hidden layer that is dedicated to learning the
latent embedding. In this layer, each hidden unit is governed by a standard normal distribution as its
a priori information. Also, be careful about the re-parameterization technique and the loss function.
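For reference, the re-parameterization and the VAE loss could look like this minimal PyTorch sketch (encoder and decoder are assumed modules; the encoder returns the mean and log-variance of the K-dimensional code, and the decoder ends in a sigmoid):

import torch

def vae_loss(x, encoder, decoder):
    mu, logvar = encoder(x)                                    # each [B, K]
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # re-parameterization
    x_hat = decoder(z)
    recon = torch.nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp()) # KL to the N(0, I) prior
    return recon + kl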
6. You’ll need to limit the number of hidden units K in your code layer (the embedding vector) with a
small number (e.g. smaller than 5) to reduce your search space. Out of K, there must be a dimension
that explains the effect that I added.
7. One way to prove that you found the latent dimension of interest is to show me the digits generated
by the decoder. More specifically, you may want to "generate" new 7's by feeding a few randomly
generated code vectors, which are random samples from the K normal distributions that your VAE
learned. But, they won't be enough to show which dimension takes care of my added effect. Therefore,
your random code vectors should be formed specially.
8. What I’d do is to generate code vectors by fixing the K − 1 dimensions with the same value over the
codes, while varying only one of them.
9. For example, if K = 3 and you're interested in the third dimension, your codes should look as
follows:

Z = [ 0.23  −0.18  −5.0
      0.23  −0.18  −4.5
      0.23  −0.18  −4.0
      0.23  −0.18  −3.5
      0.23  −0.18  −3.0
      ...
      0.23  −0.18   4.5
      0.23  −0.18   5.0 ]   (8)

Note that the first two column vectors are randomly sampled from the normal distributions once, but
then shared by all the codes, so that the variation found in the decoded output relies solely on the third
dimension.
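A minimal NumPy sketch of building such a sweep matrix for K = 3, varying the third dimension:

import numpy as np

K, steps = 3, 21
Z = np.tile(np.random.randn(1, K), (steps, 1))   # one random code, repeated for all rows
Z[:, 2] = np.linspace(-5.0, 5.0, steps)          # sweep only the third dimension
# decoded = decoder(torch.from_numpy(Z).float())  # then feed Z to your trained decoder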
10. You’ll want to examine all the K dimensions by generating samples from each of them. Show me
the ones you like. They should show a bunch of similar-looking 7’s but with gradually changing
effect on them. The generated samples that show a gradual change of the thickness of the stroke, for
example, are not a good answer, because that’s not the one I added, but something that was there in
the dataset already.
11. Submit your notebook with figures and code.
Problem 3: Conditional GAN [3 points]
1. Let’s develop a GAN model that can generate MNIST digits, but based on the auxiliary input from
the user indicating which digit to create.
2. To this end, the generator has to be trained to receive two different kinds of input: the random vector
and the class label.
3. The random vector is easy to create. It should be a d-dimensional vector, sampled from a standard
normal distribution N (0, 1). d = 100 worked just fine for me.
4. As for the conditioning vector, you somehow need to inform the network of your intention. For example,
if you want to generate a “0”, you need to give that information to the generator. There are many
different ways to condition a neural network at various stages. But, this time, let’s use a simple one.
We will convert the digit label into a one-hot vector. For example, if you want to generate a “7” the
conditioning vector is [0, 0, 0, 0, 0, 0, 0, 1, 0, 0].
5. Then, we need to combine these two different kinds of information. Again, there are many different
ways, but let’s just stick to a simple solution. We will concatenate the d-dimensional random vector and
the 10-dimensional one-hot vector. Therefore, the input to your generator is with d + 10 dimensions.
If your d = 100, the input dimension is 110.
6. You are free to choose whatever network architecture you want to practice with. Here's the fully-connected one I found to be a good starting point: 110 × 200 × 400 × 784. I used ReLU as the activation
function, but for the last layer, I used tanh. It means that I'll interpret −1 as a black pixel and
+1 as a white pixel.
7. The discriminator has a similar architecture: 794 × 400 × 200 × 100 × 1. The reason why it takes a
794-dim vector is that it wants to know what the image sample is conditioned on. Also note that it
does binary classification to discern whether the conditioned input image is a real or fake example, i.e.,
you will need to set up the last layer as a logistic regression function.
8. To train this GAN model, sample a minibatch of B examples from your MNIST dataset. These are
your real examples. But, instead of feeding them directly to your discriminator, you’ll append their
label information by turning it into the one-hot representation. Don’t forget to match the scale: it has
to be from −1 to +1 instead of [0, 1] as that’s how the generator defines the pixel intensity.
5
9. Accordingly, generate a class-balanced set of fake examples by feeding B random vectors to your
generator. Again, each of your random vectors needs to be appended with a randomly chosen one-hot vector. For example, if your B = 100, you may want to generate ten ones, ten twos, and so on.
Although the generated images don't carry any label information anymore, you know that each should
belong to a particular digit class based on your conditioning vector. Therefore, when you feed these
fake examples to the discriminator, you need to append the one-hot vectors once again. Of course, the
one-hot vectors should match the ones you used to inform the generator as input.
10. To summarize, the input to your generator is a d + 10-dim vector. The last 10 elements should be
copied to augment your fake example, generated from the generator, to construct a 794-dim vector.
You have B fake examples as such. The real examples are with the same size, but their first 784
elements are from the real MNIST images, accompanied by the last 10 elements representing the class,
to which the image belongs.
11. Train this GAN model. I used Adam with lower-than-usual learning rates. Dropout helped the
discriminator. Below is the figure that shows the change of the classification accuracy over the epochs
(red for real and blue for fake examples). I can see that it converged to the Nash equilibrium, as the
discriminator seems to be confused.
12. Below are the test examples that I generated by feeding new random vectors (plus the intended class
labels). I placed ten examples per class in a row. These are of course not the best MNIST digits I can
imagine, but they look fine given the simple structure and algorithm I used.
13. Please feel free to use whatever other things you want to try out, such as WGAN, but if your results
are decent (like mine) we’ll give away the full score.
14. Report both the convergence graph and the generated examples.
Problem 4: Missing Value Imputation Using Conditional GAN [5 points]
1. We've already seen in P1 that an LSTM can act like a generative model that can "predict future patches" given
the "past patches".
2. This time, we'll do something similar, but using a GAN. It works like a missing-value
imputation system: we assume that only the center part of the image is known, while the generator
has to predict what the other surrounding pixels are.
3. We'll formulate this as a conditional GAN. First, take a batch of MNIST images. Take their center
10 × 10 patches, and then flatten them. This is your 100-dimensional conditioning vector. Since there
are 28 × 28 pixels in each image, you'll do something like X[:, 9:19, 9:19] to take the center
patch (note that a 10-pixel slice is 9:19, not 10:19). This will form a B × 100 matrix for your batch of B conditioning vectors.
4. Append this matrix to your random vectors of 100 dimensions drawn from the standard normal distribution. This B × 200 matrix is the input to your generator.
5. The generator takes these 200-dimensional vectors and synthesizes MNIST-looking digits. You will
need to prepare another set of B real examples. Eventually, you feed 2B examples in total to your
discriminator as a minibatch.
6. If both the discriminator and generator are trained properly, you can see that the results are some MNIST-looking digits. But, I found that the generator simply ignores the conditioning vector and generates
whatever it wants to generate. They all certainly look like MNIST digits, but the conditioning part
doesn't work. Below are the generated images (left) and the ground-truth images that I extracted the
center patches from (right). They are completely different from each other.
7. So, even though I did feed the center patch as the conditioning vector to the generator, it ignores it
and generates something totally different. It's because, I think, the generator has no way to know that the
conditioning vector is actually the center patch of the digit that it must generate. In other words, the
generator is generating the whole image, although it doesn't have to generate the center patch, which
is known to me. Instead, I wanted it to generate the surrounding pixels, which are the missing values.
8. As a remedy, I added another regularizer to my generator so that it functions as an autoencoder at
least for the center pixels. You know, in an ordinary GAN setup, the generator loss has to penalize the
discriminator’s decision that classifies the fake examples into the fake class (i.e., when the generator
fails to fool the discriminator). On top of this ordinary generator loss, I add a simple mean squared
error term that penalizes the difference between the conditioning vector and the center patch of the
generated image, as they have to be the same, essentially.
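A minimal sketch of this regularized generator loss in PyTorch; fake_img (the generator output, [B, 784]), cond (the [B, 100] conditioning vectors), the discriminator D (ending in a sigmoid), and the weight lam are assumed names:

import torch
import torch.nn.functional as F

d_out = D(torch.cat([fake_img, cond], dim=1))       # discriminator on fake + condition
adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))   # "fool the discriminator"
center = fake_img.view(-1, 28, 28)[:, 9:19, 9:19].reshape(-1, 100)
g_loss = adv + lam * F.mse_loss(center, cond)       # MSE ties the center patch to cond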
9. Since it's a regularizer, I needed to investigate different λ values to control its contribution to the total
loss of the generator. It turned out that the generator is not too sensitive to this choice, although it
does generate "less conditioned" examples when λ is too small. Below are the two sets of
examples when I set λ = 0.1 (left) and λ = 10 (right).
10. Replicate what I did with the regularized model and submit your code and generated examples (i.e.,
you don’t have to replicate my failed model with no regularization). Once again, you can try some
other fancy models and different ways to condition the model. But we’ll give you a full score if your
results are as good as mine.
