Automatic Language Identification
In this project you will build and experiment with a probabilistic language identification system that determines the language of a given sentence.
Your Task
You will build a character-based n-gram model for 3 languages: English, French and a language of your choice that uses the same character set. As a basic setup, you must:
- Train a unigram and bigram character-based language model for each language.
- Use your language models to identify the most probable language of a sentence given as input.
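The identification step can be sketched as follows (an illustrative sketch only; the function name and the shape of the model tables are assumptions, not part of the handout): score the sentence under each language model by summing log probabilities, then pick the argmax.

```python
import math

def classify(sentence, models):
    """Pick the most probable language for a sentence.
    `models` maps a language name to a dict of log10 character
    probabilities (a hypothetical layout for illustration)."""
    best_lang, best_score = None, -math.inf
    for lang, logprobs in models.items():
        # Sum log probabilities over the letters of the sentence.
        score = sum(logprobs.get(c, -math.inf)
                    for c in sentence.lower() if c.isalpha())
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

The same loop works for bigrams by scoring consecutive character pairs instead of single characters.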
Training Set: As training set, you will start with the corpora available on Moodle (2 books in English, and 2 books in French, where diacritics have been removed). For the 3rd language, you must find your own training corpora. The Web is a great place to find electronic texts. Look at the Web page of Project Gutenberg[1] or Internet Archive[2] for good starting points.
Character Set: Make sure that all 3 languages use the same character set. In particular, you should not take diacritics (accents, cedillas) into account. The French corpora on Moodle have been cleaned of diacritics. For the basic setup, use only the 26 letters of the alphabet (i.e. ignore punctuation and other characters) and convert all letters to lower case.
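A minimal preprocessing sketch under these rules (the function name is an assumption):

```python
import string

def preprocess(text):
    """Keep only the 26 lower-case letters a-z, dropping
    punctuation, digits, and whitespace (basic setup)."""
    return [c for c in text.lower() if c in string.ascii_lowercase]
```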
Log Space: In order to avoid arithmetic underflow, remember to work in log space. Any logarithmic base will work, but for the basic setup, use base 10.
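To see why: a long product of small probabilities underflows to zero in double precision, while the sum of their base-10 logs stays well within range.

```python
import math

probs = [1e-5] * 100            # 100 small character probabilities

product = 1.0
for p in probs:
    product *= p                # (1e-5)**100 = 1e-500 underflows to 0.0

log_sum = sum(math.log10(p) for p in probs)   # stays at about -500
```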
Smoothing: For the basic setup, both your unigram and your bigram models must be smoothed using add-delta, with delta=0.5.
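With a 26-letter vocabulary, add-delta smoothing replaces P(c) = count(c)/N by (count(c) + delta) / (N + delta * 26). A sketch for the unigram case (the function name is an assumption; the bigram case applies the same formula per conditioning character):

```python
import string

def smoothed_unigram(counts, delta=0.5):
    """Add-delta smoothed unigram model over the 26 letters:
    P(c) = (count(c) + delta) / (N + delta * |V|)."""
    vocab = string.ascii_lowercase
    total = sum(counts.get(c, 0) for c in vocab)
    denom = total + delta * len(vocab)
    return {c: (counts.get(c, 0) + delta) / denom for c in vocab}
```

Note that every letter, even one never seen in training, receives a small non-zero probability, which is the point of smoothing.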
Programming Environment:
You can use Java, C, C++ or Python. If you wish to use another language, please check with me first.
Input:
Your program will take as input:
- 3 file names (trainFR.txt, trainEN.txt and trainOT.txt) containing the training texts for each language, and
- a file containing the sentences to be classified.
Output: Your program will output:
- a dump of the language models in the following files: unigramFR.txt, bigramFR.txt, unigramEN.txt, bigramEN.txt, unigramOT.txt, bigramOT.txt
Each file should contain a dump of the language model in list format. For example,
for the unigram models:

P(a) = 7.9834e-005  // some arbitrary value, used everywhere in the handout
P(b) = 7.9834e-005
P(c) = 7.9834e-005
...
P(z) = 7.9834e-005

for the bigram models:

P(a|a) = 7.9834e-005
P(b|a) = 7.9834e-005
P(c|a) = 7.9834e-005
...
P(z|z) = 7.9834e-005
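One way (an illustrative sketch, not the required implementation) to write such a dump:

```python
def dump_unigram(probs, path):
    """Write one 'P(c) = <probability>' line per character, in
    scientific notation as in the example above; the exact number
    of exponent digits may differ across platforms."""
    with open(path, "w") as f:
        for c in sorted(probs):
            f.write(f"P({c}) = {probs[c]:.4e}\n")
```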
- For each input sentence in the input file:
- on the console, the most probable language of the sentence
- a dump of the trace of that sentence in a file named out#.txt, where # is the number of the sentence in the input file.
Each output file must contain the sentence itself, a trace and the result of its classification, following the format below.
For example, if the input file contains 30 sentences to test, then you should generate 30 output files named out1.txt to out30.txt. If the first sentence is What will the Japanese economy be like next year? then out1.txt will contain:
What will the Japanese economy be like next year?

UNIGRAM MODEL:

UNIGRAM: w
FRENCH: P(w) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
ENGLISH: P(w) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
OTHER: P(w) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005

UNIGRAM: h
FRENCH: P(h) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
ENGLISH: P(h) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
OTHER: P(h) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005

...

UNIGRAM: r
FRENCH: P(r) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
ENGLISH: P(r) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
OTHER: P(r) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005

According to the unigram model, the sentence is in English

BIGRAM MODEL:

BIGRAM: wh
FRENCH: P(h|w) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
ENGLISH: P(h|w) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
OTHER: P(h|w) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005

BIGRAM: ha
FRENCH: P(a|h) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
ENGLISH: P(a|h) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005
OTHER: P(a|h) = 7.9834e-005 ==> log prob of sentence so far: 7.9834e-005

...

According to the bigram model, the sentence is in English
- Submit the output of your program with 30 input sentences. Your 30 sentences must include:
a. the following 10 sentences:
- What will the Japanese economy be like next year? (EN)
- She asked him if he was a student at this school. (EN)
- I'm OK. (EN)
- Birds build nests. (EN)
- I hate AI. (EN)
- L'oiseau vole. (FR)
- Woody Allen parle. (FR)
- Est-ce que l'arbitre est la? (FR)
- Cette phrase est en anglais. (FR)
- J'aime l'IA. (FR)
b. 10 sentences that your system classifies correctly
c. 10 sentences that your system gets wrong
Experiments:
As with the previous mini-projects, you are expected to perform additional experiments with your program beyond the basic setup specified above.
Examples of experiments include: trying a variety of values for n, for delta, or for the character set; trying a variety of languages (similar or different); etc. For each experiment, report and analyse your results in your report.
Deliverables:
The submission of the project will consist of 4 deliverables:
- The source code and an executable that runs on the lab machines
- The 30 output files when your language model is trained with the corpora given
- A Report
- A Demo
The Code:
Submit all files necessary to run your code, in addition to a README.txt containing specific instructions on how to run your program on the desktops in the computer labs.