The first project involves building a language model from scratch. The first step is to prepare a corpus (choose only one) and carry out the necessary preprocessing.
Maltese Corpus: http://mlrs.research.um.edu.mt/index.php?page=downloads
(Baby) British National Corpus: http://ota.ox.ac.uk/desc/2553
First set of tasks:
- Extract the selected corpus
- Build frequency counts for n-grams
Things to keep in mind:
- Try to make your code modular. Hard-coding things is fine initially, but ideally evolve that hard-coded logic into generic, modular code.
- You might also want to hard-code a specific input string so that you can verify your frequency counts are correct.
- Consider computational issues: check how long it takes to read the corpus and build the frequency counts, how well it scales up, how much RAM the system ends up using, etc.
- When reading the files, you might want to start with a single file, but already begin investigating using all files in the corpus. Remember: the larger your corpus, the better the models will be.
- Think ahead: once you have the frequency counts sorted, remember that, as in any machine learning setup, we will need a training set and a test set. So you will eventually need to split your corpus into two distinct parts. More about this at a later stage, but it is good to start thinking about this eventuality now.
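The frequency-counting task above might be sketched as follows. This is a minimal illustration, assuming whitespace-tokenised sentences; the `<s>`/`</s>` padding markers are a common convention, not something the brief prescribes:

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams over tokenised sentences, padding each sentence
    with boundary markers so context is well-defined at the edges."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Tiny hard-coded input, as suggested above, to sanity-check the counts
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
# unigrams[("the",)] == 2 and bigrams[("the", "cat")] == 2
```

Storing counts as a `Counter` keyed on tuples keeps unigram, bigram and trigram tables in one uniform structure, which simplifies the later models.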
Building a Language Model: Part II
Now that you have selected a corpus, extracted it and built your frequency counts, you can start working towards building a language model from scratch.
Each language model will include Unigram, Bigram and Trigram variants, and we will build three different language models:
- The Vanilla language model is your regular (unsmoothed) language model.
- The Laplace language model takes the Vanilla model as its basis, but now includes Laplace smoothing.
- The UNK language model takes the Vanilla model as its basis, but sets all words with a count of 2 or less to <UNK> tokens, recalculates the counts accordingly, and then applies Laplace smoothing to this model.
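The Laplace and UNK pieces above could be sketched as below. The function names and toy counts are illustrative only; `min_count=3` encodes the "count of 2 or less" rule from the brief:

```python
from collections import Counter

def laplace_bigram_prob(word, prev, bigrams, unigrams, vocab_size):
    """Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[(prev,)] + vocab_size)

def apply_unk(sentences, min_count=3):
    """Replace words occurring fewer than min_count times with <UNK>."""
    freqs = Counter(w for s in sentences for w in s)
    return [[w if freqs[w] >= min_count else "<UNK>" for w in s]
            for s in sentences]

# Toy counts, hard-coded for illustration
unigrams = Counter({("the",): 2, ("cat",): 2, ("sat",): 1})
bigrams = Counter({("the", "cat"): 2})
V = 3
p_seen = laplace_bigram_prob("cat", "the", bigrams, unigrams, V)    # (2+1)/(2+3) = 0.6
p_unseen = laplace_bigram_prob("sat", "the", bigrams, unigrams, V)  # (0+1)/(2+3) = 0.2
```

Note that `apply_unk` runs before counting, so the UNK model's n-gram tables (and its Laplace smoothing) are computed over the remapped corpus, as the brief requires.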
Next, create a function that carries out Linear Interpolation. This function takes one of the above three flavours of language model and calculates the probability of a sentence using the following lambdas: trigram = 0.6; bigram = 0.3; unigram = 0.1.
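A minimal sketch of the per-word interpolation step, assuming the three probability estimators are passed in as callables (their signatures here are an assumption, not part of the brief):

```python
def interpolated_prob(word, prev2, prev1, p_uni, p_bi, p_tri,
                      lambdas=(0.1, 0.3, 0.6)):
    """Weighted mix of the three estimates:
    0.6 * P_trigram + 0.3 * P_bigram + 0.1 * P_unigram."""
    l_uni, l_bi, l_tri = lambdas
    return (l_tri * p_tri(word, prev2, prev1)
            + l_bi * p_bi(word, prev1)
            + l_uni * p_uni(word))

# Constant toy estimators, purely to check the weighting
p = interpolated_prob("cat", "<s>", "the",
                      p_uni=lambda w: 0.1,
                      p_bi=lambda w, p1: 0.2,
                      p_tri=lambda w, p2, p1: 0.5)
# 0.6*0.5 + 0.3*0.2 + 0.1*0.1 = 0.37
```

The sentence probability is then the product of these per-word values (in practice, a sum of their logs, to avoid underflow).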
Evaluation:
- Take the test corpus, iterate through the sentences and calculate the probabilities of each sentence with every model. You will then use this to calculate the Perplexity and produce the following table:
|  | Unigram | Bigram | Trigram | Linear Interpolation |
| --- | --- | --- | --- | --- |
| Vanilla |  |  |  |  |
| Laplace |  |  |  |  |
| UNK |  |  |  |  |
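The perplexity computation itself is small once you have per-sentence log probabilities. A sketch, assuming natural logs and `N` = total token count over the test set:

```python
import math

def perplexity(log_probs, n_tokens):
    """PP = exp(-(1/N) * sum of log probabilities over the test set)."""
    return math.exp(-sum(log_probs) / n_tokens)

# Sanity check: a uniform model over 4 words assigns p = 0.25 to every
# token, so its perplexity should be exactly 4.
log_probs = [math.log(0.25)] * 8
pp = perplexity(log_probs, 8)  # 4.0
```

Working in log space matters here: multiplying raw sentence probabilities across a test corpus underflows to zero almost immediately.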
- Create a function that allows the user to select one of the language models (Vanilla, Laplace, UNK) and input a phrase. The function will use this phrase to generate the rest of the sentence until it encounters the end-of-sentence token (i.e. the LM says stop!).
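One way to sketch the generation loop, assuming a bigram count table for simplicity (your version should consult whichever model the user selected, and a `max_len` cut-off guards against models that never emit the end token):

```python
import random

def generate(phrase, bigram_counts, end_token="</s>", max_len=20):
    """Extend a phrase one word at a time, sampling the next word in
    proportion to its bigram count, until the end token appears."""
    tokens = phrase.split()
    while tokens[-1] != end_token and len(tokens) < max_len:
        # Candidate continuations of the last word, weighted by count
        candidates = {w2: c for (w1, w2), c in bigram_counts.items()
                      if w1 == tokens[-1]}
        if not candidates:
            break  # dead end: no observed continuation
        words = list(candidates)
        weights = list(candidates.values())
        tokens.append(random.choices(words, weights=weights)[0])
    return " ".join(tokens)

# With only one continuation per word, generation is deterministic
bigrams = {("the", "cat"): 1, ("cat", "</s>"): 1}
sentence = generate("the", bigrams)  # "the cat </s>"
```

Sampling (rather than always taking the argmax) gives varied output; with greedy argmax the model produces the same continuation for a given phrase every time.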
Building a Language Model: Practicalities
Submission Requirements:
- Code & Documentation: be concise. I don't want an explanation of the class notes; I want an explanation of your implementation choices. I also accept submission as a Jupyter notebook with the documentation integrated into it.
- Demo (arranged after submission)
Marking Scheme:
Preprocessing: 5 marks
Vanilla LM: 10 marks
Laplace LM: 10 marks
UNK LM: 10 marks
Interpolation: 10 marks
Perplexity Evaluation: 10 marks
Generation Evaluation: 10 marks
Documentation (justification of choices, etc.): 20 marks
Demo: 15 marks
