CLC Project 2: East Asian Language Tokenization


For this project, you have a choice between implementing Thai tokenization and Japanese tokenization.

Thai Tokenization

You will take a finite-state approach to Thai tokenization. The FSA described here is simplified: it works on the examples provided but may not work on every Thai sentence.

./thai_tokenizer.py <input_file> <output_file>

Characters in Thai can be categorized as follows:

Category  Members                Unicode
V1        Preposed vowel         0E40, 0E41, 0E42, 0E43, 0E44
C1        Initial consonant      0E01–0E2E
C2        Clustered consonant    0E19, 0E21, 0E23, 0E25, 0E27
V2        Super/subscript vowel  0E31, 0E34, 0E35, 0E36, 0E37, 0E38, 0E39, 0E47
T         Superscript tone mark  0E48, 0E49, 0E4A, 0E4B
V3        Postposed vowel        0E22, 0E27, 0E2D, 0E32
C3        Final consonant        0E01, 0E07, 0E14, 0E19, 0E1A, 0E21, 0E22, 0E27
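
In code, these categories are just sets of code points. A minimal sketch in Python (the CATEGORIES name is mine, and reading the C1 entry as the full range 0E01–0E2E is an assumption):

# Thai character categories from the table above, as sets of Unicode code points.
CATEGORIES = {
    "V1": {0x0E40, 0x0E41, 0x0E42, 0x0E43, 0x0E44},   # preposed vowels
    "C1": set(range(0x0E01, 0x0E2E + 1)),             # initial consonants (assumed range)
    "C2": {0x0E19, 0x0E21, 0x0E23, 0x0E25, 0x0E27},   # clustered consonants
    "V2": {0x0E31, 0x0E34, 0x0E35, 0x0E36, 0x0E37,
           0x0E38, 0x0E39, 0x0E47},                   # super/subscript vowels
    "T":  {0x0E48, 0x0E49, 0x0E4A, 0x0E4B},           # superscript tone marks
    "V3": {0x0E22, 0x0E27, 0x0E2D, 0x0E32},           # postposed vowels
    "C3": {0x0E01, 0x0E07, 0x0E14, 0x0E19, 0x0E1A,
           0x0E21, 0x0E22, 0x0E27},                   # final consonants
}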

Every word contains a consonant from the list of initial consonants; characters from the other categories may or may not appear. When present, they appear in the following order:

V1? C1 C2? V2? T? V3? C3?
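
For intuition only, this word shape corresponds to the greedy regular expression below (the THAI_WORD name is mine). Because the categories overlap, the actual tokenizer should follow the state machine and acceptance order that follow, rather than relying on regex matching.

import re

# Word shape V1? C1 C2? V2? T? V3? C3? written as character classes.
THAI_WORD = re.compile(
    "[\u0e40-\u0e44]?"                   # V1  preposed vowel
    "[\u0e01-\u0e2e]"                    # C1  initial consonant
    "[\u0e19\u0e21\u0e23\u0e25\u0e27]?"  # C2  clustered consonant
    "[\u0e31\u0e34-\u0e39\u0e47]?"       # V2  super/subscript vowel
    "[\u0e48-\u0e4b]?"                   # T   superscript tone mark
    "[\u0e22\u0e27\u0e2d\u0e32]?"        # V3  postposed vowel
    "[\u0e01\u0e07\u0e14\u0e19\u0e1a\u0e21\u0e22\u0e27]?"  # C3  final consonant
)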

The states and transitions are as follows. A dash means no transition is defined from that state on that category; the # column is the move the FSA makes without consuming a character (see below).

State  V1  C1  C2  V2  T   V3  C3  #
q0     q1  q2  -   -   -   -   -   -
q1     -   q2  -   -   -   -   -   -
q2     q7  q8  q3  q4  q5  q6  q9  -
q3     -   -   -   q4  q5  q6  q9  -
q4     q7  q8  -   -   q5  q6  q9  -
q5     q7  q8  -   -   -   q6  q9  -
q6     q7  q8  -   -   -   -   q9  -
q7     -   -   -   -   -   -   -   q1
q8     -   -   -   -   -   -   -   q2
q9     -   -   -   -   -   -   -   q0

If the state becomes q7 or q8, the FSA has encountered a new starting vowel or consonant; insert a space before the character that initiated the transition. From state q7 or q8, transition to state q1 or q2 (respectively) before looking at the next character.

If the state becomes q9, the FSA has encountered a final consonant; insert a space after the character that initiated the transition. From state q9, transition to state q0.

Note that the categories overlap: the characters in C2, C3, and V3 also appear in C1. To deal with this, test each character against the classes in the following order:

State  Character Class Order
q0     V1, C1
q1     C1
q2     C2, V2, T, V3, C3, V1, C1
q3     V2, T, V3, C3
q4     T, V3, C3, V1, C1
q5     V3, C3, V1, C1
q6     C3, V1, C1
q7     (none; proceed directly to q1)
q8     (none; proceed directly to q2)
q9     (none; proceed directly to q0)
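
Combining the two tables, the tokenizer reduces to a small driver loop. A minimal sketch, assuming the CATEGORIES dict sketched earlier; the function names, the fallback for characters the FSA does not cover, and the trailing-space cleanup are assumptions of mine, not part of the spec:

import sys

# state -> ordered (category, next_state) pairs; the pair order is the
# acceptance order from the table above, which resolves the category overlaps.
TRANSITIONS = {
    "q0": [("V1", "q1"), ("C1", "q2")],
    "q1": [("C1", "q2")],
    "q2": [("C2", "q3"), ("V2", "q4"), ("T", "q5"), ("V3", "q6"),
           ("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
    "q3": [("V2", "q4"), ("T", "q5"), ("V3", "q6"), ("C3", "q9")],
    "q4": [("T", "q5"), ("V3", "q6"), ("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
    "q5": [("V3", "q6"), ("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
    "q6": [("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
}

def tokenize_line(line: str) -> str:
    out, state = [], "q0"
    for ch in line:
        for cat, nxt in TRANSITIONS.get(state, []):
            if ord(ch) in CATEGORIES[cat]:
                state = nxt
                break
        else:
            # Character not covered by the simplified FSA: emit it and restart.
            out.append(ch)
            state = "q0"
            continue
        if state in ("q7", "q8"):
            out.append(" ")                      # space BEFORE the new word's first character
            state = "q1" if state == "q7" else "q2"
        out.append(ch)
        if state == "q9":
            out.append(" ")                      # space AFTER a final consonant
            state = "q0"
    return "".join(out).rstrip(" ")              # drop a space added after a line-final consonant

def main() -> None:
    with open(sys.argv[1], encoding="utf-8") as fin, \
         open(sys.argv[2], "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(tokenize_line(line.rstrip("\n")) + "\n")

if __name__ == "__main__":
    main()

The acceptance order matters: in q2, for example, 0E19 belongs to C2, C3, and C1, and trying C2 first keeps it inside the current word instead of ending or starting one.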

The sample_out.txt file contains the correctly tokenized output of the in.txt file. Please check your program output against this file.

Japanese Tokenization

For the Japanese tokenization project, you will use the MaxMatch algorithm to read in a file with lines of Japanese text without spaces, find the word boundaries, and output a file with the same lines of text with spaces added between words.

./japanese_tokenizer.py <input_file> <output_file>

The MaxMatch algorithm uses a dictionary (japanese_wordlist.txt). Starting from the beginning of the sentence, match the longest consecutive string of characters that exists in the dictionary. Once you have found that longest string, insert a space and continue from the end of the match. If no match exists in the dictionary, treat the first character as a one-character word, insert a space, and continue.
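
A minimal sketch of the matching loop, assuming japanese_wordlist.txt holds one word per line in UTF-8; the helper names are mine:

def load_wordlist(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def max_match(sentence: str, words: set[str]) -> list[str]:
    """Greedy longest-match-first segmentation."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest remaining substring first, shrinking toward one character.
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in words:
                tokens.append(sentence[i:j])
                i = j
                break
        else:
            tokens.append(sentence[i])   # no dictionary match: one-character word
            i += 1
    return tokens

Joining each token list with spaces, as in " ".join(max_match(line, words)), then gives one output line per input line.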

The sample_out.txt file contains the correctly tokenized output of the in.txt file. Please check your program output against this file.

gold_standard.txt contains the ideal tokenization. The MaxMatch algorithm makes mistakes, so don't expect your output to match it exactly. When you're done implementing MaxMatch, compare your output file to gold_standard.txt. Make a file (by hand or programmatically) named evaluation.txt that contains the following:

# of sentences tokenized correctly: <# of lines of output that match gold_standard>

# of sentences tokenized incorrectly: <# of lines of output that don't match gold_standard>

accuracy: <# correct / total # of sentences>
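
If you build evaluation.txt programmatically, a minimal sketch (assuming one sentence per line in both files; the function name and the accuracy formatting are my choices):

def evaluate(output_path: str, gold_path: str, report_path: str = "evaluation.txt") -> None:
    with open(output_path, encoding="utf-8") as f:
        mine = [line.rstrip("\n") for line in f]
    with open(gold_path, encoding="utf-8") as f:
        gold = [line.rstrip("\n") for line in f]
    correct = sum(1 for a, b in zip(mine, gold) if a == b)
    total = len(gold)
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(f"# of sentences tokenized correctly: {correct}\n")
        f.write(f"# of sentences tokenized incorrectly: {total - correct}\n")
        f.write(f"accuracy: {correct / total:.4f}\n")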
