CLC Project 2: East Asian Language Tokenization


For this project, you have a choice between implementing Thai tokenization and Japanese tokenization.

Thai Tokenization

You will take a finite-state approach to Thai tokenization. The FSA described here is simplified: it works on the examples provided but may not work on every Thai sentence.

./thai_tokenizer.py <input_file> <output_file>

Characters in Thai can be categorized as follows:

Category  Members                Unicode
V1        Preposed vowel         0E40, 0E41, 0E42, 0E43, 0E44
C1        Initial consonant      0E01–0E2E
C2        Clustered consonant    0E19, 0E21, 0E23, 0E25, 0E27
V2        Super/subscript vowel  0E31, 0E34, 0E35, 0E36, 0E37, 0E38, 0E39, 0E47
T         Superscript tone mark  0E48, 0E49, 0E4A, 0E4B
V3        Postposed vowel        0E22, 0E27, 0E2D, 0E32
C3        Final consonant        0E01, 0E07, 0E14, 0E19, 0E1A, 0E21, 0E22, 0E27
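
In code, these categories are just sets of code points. A minimal sketch in Python (the CATEGORIES name is mine, and reading the C1 entry as the full range 0E01–0E2E is an assumption):

# Thai character categories from the table above, as sets of Unicode code points.
CATEGORIES = {
    "V1": {0x0E40, 0x0E41, 0x0E42, 0x0E43, 0x0E44},   # preposed vowels
    "C1": set(range(0x0E01, 0x0E2E + 1)),             # initial consonants (assumed range)
    "C2": {0x0E19, 0x0E21, 0x0E23, 0x0E25, 0x0E27},   # clustered consonants
    "V2": {0x0E31, 0x0E34, 0x0E35, 0x0E36, 0x0E37,
           0x0E38, 0x0E39, 0x0E47},                   # super/subscript vowels
    "T":  {0x0E48, 0x0E49, 0x0E4A, 0x0E4B},           # superscript tone marks
    "V3": {0x0E22, 0x0E27, 0x0E2D, 0x0E32},           # postposed vowels
    "C3": {0x0E01, 0x0E07, 0x0E14, 0x0E19, 0x0E1A,
           0x0E21, 0x0E22, 0x0E27},                   # final consonants
}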

Every word contains a consonant from the list of initial consonants; characters from the other categories may or may not appear. When present, they appear in the following order:

V1? C1 C2? V2? T? V3? C3?
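
For intuition only, this word shape corresponds to the greedy regular expression below (the THAI_WORD name is mine). Because the categories overlap, the actual tokenizer should follow the state machine and acceptance order that follow, rather than relying on regex matching.

import re

# Word shape V1? C1 C2? V2? T? V3? C3? written as character classes.
THAI_WORD = re.compile(
    "[\u0e40-\u0e44]?"                   # V1  preposed vowel
    "[\u0e01-\u0e2e]"                    # C1  initial consonant
    "[\u0e19\u0e21\u0e23\u0e25\u0e27]?"  # C2  clustered consonant
    "[\u0e31\u0e34-\u0e39\u0e47]?"       # V2  super/subscript vowel
    "[\u0e48-\u0e4b]?"                   # T   superscript tone mark
    "[\u0e22\u0e27\u0e2d\u0e32]?"        # V3  postposed vowel
    "[\u0e01\u0e07\u0e14\u0e19\u0e1a\u0e21\u0e22\u0e27]?"  # C3  final consonant
)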

The states and transitions are as follows. A dash means no transition is defined from that state on that category; the # column is the move the FSA makes without consuming a character (see below).

State  V1  C1  C2  V2  T   V3  C3  #
q0     q1  q2  -   -   -   -   -   -
q1     -   q2  -   -   -   -   -   -
q2     q7  q8  q3  q4  q5  q6  q9  -
q3     -   -   -   q4  q5  q6  q9  -
q4     q7  q8  -   -   q5  q6  q9  -
q5     q7  q8  -   -   -   q6  q9  -
q6     q7  q8  -   -   -   -   q9  -
q7     -   -   -   -   -   -   -   q1
q8     -   -   -   -   -   -   -   q2
q9     -   -   -   -   -   -   -   q0

If the state becomes q7 or q8, the FSA has encountered a new starting vowel or consonant; insert a space before the character that initiated the transition. From state q7 or q8, transition to state q1 or q2 (respectively) before looking at the next character.

If the state becomes q9, the FSA has encountered a final consonant; insert a space after the character that initiated the transition. From state q9, transition to state q0.

Note that the categories overlap: the characters in C2, C3, and V3 also appear in C1. To deal with this, test each character against the classes in the following order:

State  Character Class Order
q0     V1, C1
q1     C1
q2     C2, V2, T, V3, C3, V1, C1
q3     V2, T, V3, C3
q4     T, V3, C3, V1, C1
q5     V3, C3, V1, C1
q6     C3, V1, C1
q7     (none; proceed directly to q1)
q8     (none; proceed directly to q2)
q9     (none; proceed directly to q0)
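
Combining the two tables, the tokenizer reduces to a small driver loop. A minimal sketch, assuming the CATEGORIES dict sketched earlier; the function names, the fallback for characters the FSA does not cover, and the trailing-space cleanup are assumptions of mine, not part of the spec:

import sys

# state -> ordered (category, next_state) pairs; the pair order is the
# acceptance order from the table above, which resolves the category overlaps.
TRANSITIONS = {
    "q0": [("V1", "q1"), ("C1", "q2")],
    "q1": [("C1", "q2")],
    "q2": [("C2", "q3"), ("V2", "q4"), ("T", "q5"), ("V3", "q6"),
           ("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
    "q3": [("V2", "q4"), ("T", "q5"), ("V3", "q6"), ("C3", "q9")],
    "q4": [("T", "q5"), ("V3", "q6"), ("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
    "q5": [("V3", "q6"), ("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
    "q6": [("C3", "q9"), ("V1", "q7"), ("C1", "q8")],
}

def tokenize_line(line: str) -> str:
    out, state = [], "q0"
    for ch in line:
        for cat, nxt in TRANSITIONS.get(state, []):
            if ord(ch) in CATEGORIES[cat]:
                state = nxt
                break
        else:
            # Character not covered by the simplified FSA: emit it and restart.
            out.append(ch)
            state = "q0"
            continue
        if state in ("q7", "q8"):
            out.append(" ")                      # space BEFORE the new word's first character
            state = "q1" if state == "q7" else "q2"
        out.append(ch)
        if state == "q9":
            out.append(" ")                      # space AFTER a final consonant
            state = "q0"
    return "".join(out).rstrip(" ")              # drop a space added after a line-final consonant

def main() -> None:
    with open(sys.argv[1], encoding="utf-8") as fin, \
         open(sys.argv[2], "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(tokenize_line(line.rstrip("\n")) + "\n")

if __name__ == "__main__":
    main()

The acceptance order matters: in q2, for example, 0E19 belongs to C2, C3, and C1, and trying C2 first keeps it inside the current word instead of ending or starting one.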

The sample_out.txt file contains the correctly tokenized output of the in.txt file. Please check your program output against this file.

Japanese Tokenization

For the Japanese tokenization project, you will use the MaxMatch algorithm to read in a file with lines of Japanese text without spaces, find the word boundaries, and output a file with the same lines of text with spaces added between words.

./japanese_tokenizer.py <input_file> <output_file>

The MaxMatch algorithm uses a dictionary (japanese_wordlist.txt). Starting from the beginning of the sentence, match the longest consecutive string of characters that exists in the dictionary. Once you have found that longest string, insert a space and continue from the end of the match. If no match exists in the dictionary, treat the first character as a one-character word, insert a space, and continue.
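
A minimal sketch of the matching loop, assuming japanese_wordlist.txt holds one word per line in UTF-8; the helper names are mine:

def load_wordlist(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def max_match(sentence: str, words: set[str]) -> list[str]:
    """Greedy longest-match-first segmentation."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest remaining substring first, shrinking toward one character.
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in words:
                tokens.append(sentence[i:j])
                i = j
                break
        else:
            tokens.append(sentence[i])   # no dictionary match: one-character word
            i += 1
    return tokens

Joining each token list with spaces, as in " ".join(max_match(line, words)), then gives one output line per input line.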

The sample_out.txt file contains the correctly tokenized output of the in.txt file. Please check your program output against this file.

gold_standard.txt contains the ideal tokenization. The MaxMatch algorithm makes mistakes, so don't expect your output to match it exactly. When you're done implementing MaxMatch, compare your output file to gold_standard.txt. Make a file (by hand or programmatically) named evaluation.txt that contains the following:

# of sentences tokenized correctly: <# of lines of output that match gold_standard>

# of sentences tokenized incorrectly: <# of lines of output that don't match gold_standard>

accuracy: <# correct / total # of sentences>
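
If you build evaluation.txt programmatically, a minimal sketch (assuming one sentence per line in both files; the function name and the accuracy formatting are my choices):

def evaluate(output_path: str, gold_path: str, report_path: str = "evaluation.txt") -> None:
    with open(output_path, encoding="utf-8") as f:
        mine = [line.rstrip("\n") for line in f]
    with open(gold_path, encoding="utf-8") as f:
        gold = [line.rstrip("\n") for line in f]
    correct = sum(1 for a, b in zip(mine, gold) if a == b)
    total = len(gold)
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(f"# of sentences tokenized correctly: {correct}\n")
        f.write(f"# of sentences tokenized incorrectly: {total - correct}\n")
        f.write(f"accuracy: {correct / total:.4f}\n")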
