CSC401 Assignment 3: Speaker Identification and Speech Recognition

This assignment introduces you to Gaussian mixture modelling, and two basic tasks in speech technology: speaker identification, in which we try to determine who is talking, and speech recognition, in which we try to determine what was said.

The assignment is divided into two sections. In the first, you will experiment with speaker identification by fitting mixtures of Gaussians to the acoustic characteristics of individual speakers, and then identifying speakers based on these models. In the second section, you will evaluate two speech recognition engines.

The data come from the CSC Deceptive Speech corpus, which was developed by Columbia University, SRI International, and the University of Colorado Boulder. It consists of 32 hours of interview audio from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on features extracted from the corpus.

Data are in /u/cs401/A3/data/; each sub-folder represents speech from one speaker and contains raw audio, pre-computed MFCCs, and orthographic transcripts. Further file descriptions are in Appendix A.

1 Speaker Identification

Speaker identification is the task of correctly identifying speaker $s_c$ from among $S$ possible speakers $s_{i=1..S}$ given an input speech sequence $X$, consisting of a succession of $d$-dimensional real vectors. In the interests of efficiency, $d = 13$ in this assignment. Each vector represents a small 25 ms unit of speech called a frame. Speakers are identified by training data that are ascribed to them. This is a discrete classification task (choosing among several speakers) that uses continuous-valued data (the vectors of real numbers) as input.

Gaussian Mixture Models

Gaussian mixture models are often used to generalize models from sparse data. They can tightly constrain large-dimensional data by using a small number of components but can, with many more components, model arbitrary density distributions. Sometimes, they are simply used because the domain being modelled appears to have multiple modes.

Given $M$ components, GMMs are modelled by a collection of parameters, $\theta = \{\omega_{m=1..M}, \mu_{m=1..M}, \Sigma_{m=1..M}\}$, where $\omega_m$ is the probability that an observation is generated by the $m^{th}$ component. These are subject


to the constraints that $\sum_m \omega_m = 1$ and $0 \leq \omega_m \leq 1$. Each component is a multivariate Gaussian distribution, which is characterized by that component's mean, $\mu_m$, and covariance matrix, $\Sigma_m$. For reasons of computational efficiency, we will reintroduce some independence assumptions by assuming that every component's covariance matrix is diagonal, i.e.:

$$\Sigma_m = \begin{bmatrix} \Sigma_m[1] & 0 & \cdots & 0 \\ 0 & \Sigma_m[2] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_m[d] \end{bmatrix}$$

for some vector $\vec{\Sigma}_m$. Therefore, only $d$ parameters are necessary to characterize a component's (co)variance.

1.1 Utility functions

We begin by implementing three utility functions in /u/cs401/A3/code/a3_gmm.py. First, implement log_b_m_x, which computes the log observation probability of $\vec{x}_t$ for the $m^{th}$ mixture component, i.e., the log of:

$$b_m(\vec{x}_t) = \frac{\exp\left[-\frac{1}{2}\sum_{n=1}^{d} \frac{(x_t[n]-\mu_m[n])^2}{\Sigma_m[n]}\right]}{(2\pi)^{d/2}\left(\prod_{n=1}^{d} \Sigma_m[n]\right)^{1/2}} \tag{1}$$

Next, implement log_p_m_x, which computes the log probability of $m$ given $\vec{x}_t$ using model $\theta$, i.e., the log of:

$$p(m \mid \vec{x}_t; \theta) = \frac{\omega_m b_m(\vec{x}_t)}{\sum_{k=1}^{M} \omega_k b_k(\vec{x}_t)} \tag{2}$$

Finally, implement logLik, which is the log likelihood of a set of data X, i.e.:

$$\log P(X; \theta) = \sum_{t=1}^{T} \log p(\vec{x}_t; \theta) \tag{3}$$

where

$$p(\vec{x}_t; \theta) = \sum_{m=1}^{M} \omega_m b_m(\vec{x}_t) \tag{4}$$

and $b_m$ is defined in Equation 1. For efficiency, we just pass $\theta$ and the precomputed $b_m(\vec{x}_t)$ values to this function.
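For concreteness, here is a minimal NumPy sketch of these three utilities. It assumes $\theta$ is stored as plain arrays omega (M,), mu (M, d), and Sigma (M, d), which is an assumption about the data layout; the actual signatures in the a3_gmm.py starter code take precedence.

    import numpy as np
    from scipy.special import logsumexp

    def log_b_m_x(m, x, theta):
        # Log of Equation 1 for one frame x (d,) and component m,
        # using the diagonal covariance vector Sigma[m] (d,).
        omega, mu, Sigma = theta
        d = x.shape[0]
        quad = -0.5 * np.sum((x - mu[m]) ** 2 / Sigma[m])
        log_norm = 0.5 * d * np.log(2 * np.pi) + 0.5 * np.sum(np.log(Sigma[m]))
        return quad - log_norm

    def log_p_m_x(m, x, theta):
        # Log of Equation 2: log posterior of component m given frame x,
        # with a log-sum-exp denominator for numerical stability.
        omega = theta[0]
        log_joint = np.array([np.log(omega[k]) + log_b_m_x(k, x, theta)
                              for k in range(len(omega))])
        return log_joint[m] - logsumexp(log_joint)

    def logLik(log_Bs, theta):
        # Equation 3, given precomputed log_Bs (M, T) with
        # log_Bs[m, t] = log b_m(x_t); Equation 4 is the inner sum.
        omega = theta[0]
        return float(np.sum(logsumexp(np.log(omega)[:, None] + log_Bs, axis=0)))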

1.2 Training Gaussian mixture models

Now we train an $M$-component GMM for each of the speakers in the data set. Specifically, for each speaker $s$, train the parameters $\theta_s = \{\omega_{m=1..M}, \mu_{m=1..M}, \Sigma_{m=1..M}\}$ according to the method described in Appendix B. In all cases, assume that the covariance matrices $\Sigma_m$ are diagonal. Start with $M = 8$; you'll be asked to experiment with that in Section 1.4. Complete the function train in /u/cs401/A3/code/a3_gmm.py.
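The following is a structural sketch of the EM loop, assuming the updates are the standard re-estimation equations for diagonal-covariance GMMs; Appendix B is authoritative for the exact updates and initialization, and the stopping-parameter names epsilon and maxIter here are assumptions.

    import numpy as np
    from scipy.special import logsumexp

    def train(X, M=8, epsilon=0.0, maxIter=20):
        # EM training of a diagonal-covariance GMM on one speaker's
        # MFCC frames X (T, d); returns (omega, mu, Sigma).
        T, d = X.shape
        rng = np.random.default_rng(0)
        mu = X[rng.choice(T, M, replace=False)]   # means from random frames
        Sigma = np.ones((M, d))                   # unit diagonal variances
        omega = np.full(M, 1.0 / M)               # uniform mixture weights

        prev_L = -np.inf
        for _ in range(maxIter):
            # E-step: log b_m(x_t) for all m, t, shape (M, T)
            diff = X[None, :, :] - mu[:, None, :]
            log_Bs = (-0.5 * np.sum(diff ** 2 / Sigma[:, None, :], axis=2)
                      - 0.5 * d * np.log(2 * np.pi)
                      - 0.5 * np.sum(np.log(Sigma), axis=1)[:, None])
            log_joint = np.log(omega)[:, None] + log_Bs
            log_px = logsumexp(log_joint, axis=0)      # Eq. 4, per frame
            gamma = np.exp(log_joint - log_px)         # posteriors (M, T)

            # M-step: re-estimate weights, means, diagonal variances
            Nm = gamma.sum(axis=1)
            omega = Nm / T
            mu = (gamma @ X) / Nm[:, None]
            Sigma = (gamma @ (X ** 2)) / Nm[:, None] - mu ** 2
            Sigma = np.maximum(Sigma, 1e-6)            # variance floor

            L = float(np.sum(log_px))                  # Eq. 3
            if L - prev_L < epsilon:
                break
            prev_L = L
        return omega, mu, Sigma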

1.3 Classification with Gaussian mixture models

Now we test each of the test sequences we've already set aside for you in the main function. I.e., we check whether the actual speaker is also the most likely speaker, $\hat{s}$:

$$\hat{s} = \operatorname*{argmax}_{s=1..S} \log P(X; \theta_s) \tag{5}$$

Complete the function test in /u/cs401/A3/code/a3_gmm.py. Run through a train-test cycle, and save the output that this function writes to stdout, using the $k = 5$ top alternatives, to the file gmmLiks.txt.
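A corresponding sketch of classification by Equation 5, again assuming the (omega, mu, Sigma) tuple layout from the sketches above rather than the starter code's representation; the printed format below is illustrative, not the required stdout format.

    import numpy as np
    from scipy.special import logsumexp

    def loglik_under(theta, X):
        # Equation 3: total log likelihood of frames X (T, d) under one GMM.
        omega, mu, Sigma = theta
        d = X.shape[1]
        diff = X[None, :, :] - mu[:, None, :]
        log_Bs = (-0.5 * np.sum(diff ** 2 / Sigma[:, None, :], axis=2)
                  - 0.5 * d * np.log(2 * np.pi)
                  - 0.5 * np.sum(np.log(Sigma), axis=1)[:, None])
        return float(np.sum(logsumexp(np.log(omega)[:, None] + log_Bs, axis=0)))

    def identify(X, models, names, k=5):
        # Equation 5: report the k most likely speakers, return the best.
        scores = sorted(((loglik_under(th, X), nm)
                         for th, nm in zip(models, names)),
                        key=lambda p: p[0], reverse=True)
        for L, nm in scores[:k]:
            print(f"{nm} {L:.2f}")
        return scores[0][1]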

1.4 Experiments and discussion

Experiment with the settings of $M$ (and the other training parameters, if you wish). For example, what happens to classification accuracy as the number of components decreases? What about when the number of possible speakers, $S$, decreases? You will be marked on the detail with which you empirically answer these questions and on whether you can devise one or more additional valid experiments of this type.

Additionally, your report should include short hypothetical answers to the following questions:

  • How might you improve the classification accuracy of the Gaussian mixtures, without adding more training data?
  • When would your classifier decide that a given test utterance comes from none of the trained speaker models, and how would your classifier come to this decision?
  • Can you think of some alternative methods for doing speaker identification that don't use Gaussian mixtures?

Put your experimental analysis and answers to these questions in the file gmmDiscussion.txt.

2 Speech Recognition

Automatic speech recognition (ASR) is the task of correctly identifying a word sequence given an input speech sequence X. To simplify your lives, we have run two popular ASR engines on our data: the open-source and highly customizable Kaldi (specifically, a bi-directional LSTM model trained on the Fisher corpus), and the neither-open-source-nor-particularly-customizable Google Speech API.

We want to see which of Kaldi and Google is more accurate on our data. For each speaker in our data, we have three transcript files: transcripts.txt (the gold-standard transcripts, from humans), transcripts.Kaldi.txt (the ASR output of Kaldi), and transcripts.Google.txt (the ASR output of Google); see Appendix A.

Complete the file at /u/cs401/A3/code/a3_levenshtein.py. Specifically, in the Levenshtein function, accept lists of words r (reference) and h (hypothesis), and return a 4-item list containing the floating-point WER, the number of substitutions, the number of insertions, and the number of deletions, where

$$\mathrm{WER} = \frac{\mathrm{numSubstitutions} + \mathrm{numInsertions} + \mathrm{numDeletions}}{\text{number of words in the reference}}$$

Assume that the cost of a substitution is 0 if the words are identical and 1 otherwise. The costs of insertion and deletion are both 1.
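A sketch of the standard dynamic-programming solution, with a backtrace to split the distance into the three error types; the tie-breaking order in the backtrace and the convention for an empty reference are assumptions you may need to adjust.

    import numpy as np

    def Levenshtein(r, h):
        # Word-level edit distance between reference r and hypothesis h;
        # returns [WER, nSubstitutions, nInsertions, nDeletions].
        n, m = len(r), len(h)
        dist = np.zeros((n + 1, m + 1), dtype=int)
        dist[:, 0] = np.arange(n + 1)   # delete all of r[:i]
        dist[0, :] = np.arange(m + 1)   # insert all of h[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = dist[i - 1, j - 1] + (r[i - 1] != h[j - 1])
                dele = dist[i - 1, j] + 1
                ins = dist[i, j - 1] + 1
                dist[i, j] = min(sub, dele, ins)
        # Backtrace to count each error type
        nS = nI = nD = 0
        i, j = n, m
        while i > 0 or j > 0:
            if (i > 0 and j > 0
                    and dist[i, j] == dist[i - 1, j - 1] + (r[i - 1] != h[j - 1])):
                nS += int(r[i - 1] != h[j - 1])
                i, j = i - 1, j - 1
            elif i > 0 and dist[i, j] == dist[i - 1, j] + 1:
                nD += 1
                i -= 1
            else:
                nI += 1
                j -= 1
        wer = (nS + nI + nD) / n if n > 0 else float('inf')
        return [wer, nS, nI, nD]

For example, Levenshtein('who is there'.split(), 'is there'.split()) returns [1/3, 0, 0, 1]: one deletion against a three-word reference.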

In the main function, iterate through each of the speakers, and iterate through each line i of their transcripts. For each line, preprocess these transcripts by removing all punctuation (other than [ and ]) and setting the text to lowercase. Output the following to stdout:

[SPEAKER] [SYSTEM] [i] [WER] S:[numSubstitutions], I:[numInsertions], D:[numDeletions]

where [SYSTEM] is either Kaldi or Google.
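A possible preprocessing helper and output line follow; the exact regex, and whether bracketed event markers such as [laughter] need handling beyond keeping the brackets, are assumptions to check against the transcript files.

    import re

    def preprocess(line):
        # Lowercase, then delete everything except letters, digits,
        # spaces, and the square brackets that mark non-speech events.
        return re.sub(r"[^a-z0-9\[\] ]+", "", line.lower()).split()

    # Hypothetical variables for line i of one speaker's transcripts:
    # print(f"{speaker} {system} {i} {wer} S:{nS}, I:{nI}, D:{nD}")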

Save this output and put it into asrDiscussion.txt.

On the second-to-last line of asrDiscussion.txt, in free text, summarize your findings by reporting the average and standard deviation of WER for each of Kaldi and Google, separately, over all of these lines. If you want to be fancy, you can compute a statistical test of significance to see if one is better than the other, but you don't need to.

On the last line of asrDiscussion.txt, add a sentence or two describing anything you observe about the types of errors being made by each system, by manually examining the transcript files.

3 Bonus

We will give up to 10 bonus marks for innovative work going substantially beyond the minimal requirements. These marks can make up for marks lost in other sections of the assignment, but your overall mark for this assignment cannot exceed 100%. You may decide to pursue any number of tasks of your own design related to this assignment, although you should consult with the instructor or the TA before embarking on such exploration. Certainly, the rest of the assignment takes higher priority. Some ideas:

Voice banking

We are running a large study of normative data in which people from the general population donate their speech (and language) data so that we can learn subtle differences in pathological populations. If you go to https://www.cs.toronto.edu/talk2me/, you can obtain 5 bonus points if you complete 10 sessions, each at least 1 day apart. There is no limit on your age or first language. Currently, only Chrome and the most recent Firefox browsers are supported. Create a new username for this assignment, and indicate your username in your submission so that we can validate your submitted data.

Dimensionality reduction

Principal components analysis (PCA) is a method that converts some multivariate representation of data into a lower-dimensional representation by transforming the original data according to mutually orthogonal principal components.

Implement an algorithm that discovers a $d \times d'$ matrix $W$ that transforms a $d$-dimensional vector $\vec{x}$ into a $d'$-dimensional vector $\vec{y}$ through a linear transformation based on PCA, where $d' < d$. Repeat speaker identification using data that has been transformed by PCA and report on what you observe, e.g., for different values of $d'$. Submit all code and materials necessary to repeat your experiments.
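A minimal sketch of learning $W$ by eigendecomposition of the sample covariance; an off-the-shelf implementation such as sklearn.decomposition.PCA would behave equivalently.

    import numpy as np

    def pca_transform(X, d_prime):
        # Learn a (d, d') projection W from data X (N, d) and return
        # the projected data Y = (X - mean) @ W along with W.
        Xc = X - X.mean(axis=0)                    # center the data
        cov = (Xc.T @ Xc) / (X.shape[0] - 1)       # sample covariance (d, d)
        eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
        W = eigvecs[:, ::-1][:, :d_prime]          # top-d' components
        return Xc @ W, W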

ASR with sequence-to-sequence models

Try to do better than Kaldi or Google by implementing:

Chiu C-C, Sainath TN, Wu Y, et al. (2017) State-of-the-art Speech Recognition With Sequence-to-Sequence Models. http://arxiv.org/abs/1712.01769.

Consider using an open-source end-to-end ASR package in TensorFlow, e.g., deepSpeech.

Truth-and-lie detection

Each of the utterances has been labelled as either truthful or deceitful (see Appendix A). Train and test models to tell these utterances apart using the provided data. E.g.,

  • Train a GMM for each of the Truth and Lie categories, using your code from Section 1.
  • Try recurrent neural networks that read one MFCC vector at a time.
  • Extract engineered features, such as those extracted in Assignment 1, from the text transcripts and classify using discriminative models in scikit-learn (see the sketch below). Are words more discriminative than the audio?

Consider how errors in ASR transcripts affect those extracted features, and therefore overall system accuracy, in a manner not dissimilar to Zhou L, Fraser KC, Rudzicz F. (2016) Speech recognition in Alzheimer's disease and in its assessment. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.
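As a starting point for the scikit-learn idea above, here is a hedged sketch of a discriminative baseline; feats and labels are hypothetical placeholders for whatever engineered features and truth/lie annotations you extract, and the choice of an RBF-kernel SVM is arbitrary.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def truth_lie_baseline(feats, labels):
        # feats: (N, k) array of per-utterance features; labels: (N,)
        X_tr, X_te, y_tr, y_te = train_test_split(
            feats, labels, test_size=0.2, random_state=0)
        clf = SVC(kernel='rbf', gamma='scale').fit(X_tr, y_tr)
        return accuracy_score(y_te, clf.predict(X_te))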
