[SOLVED] CS IFN647 Workshop (Week 3)


IFN647 Workshop (Week 3)
Pre-processing Textual Data
Objectives:
Understand how to use Python to do basic data pre-processing for text documents.


Task 1: Write a program that loads (reads) an XML document and prints out the itemid and the number of words in the <text> part of the document.
The following is the format of the document:
BELGIUM: MOTOR RACING-LEHTO AND SOPER HOLD ON FOR GT VICTORY.
MOTOR RACING-LEHTO AND SOPER HOLD ON FOR GT VICTORY.
SPA FRANCORCHAMPS, Belgium

J.J. Lehto of Finland and of Britain drove their ailing McLaren to victory in the fifth round of the world GT championship on Sunday, beating the Mercedes of Schneider and Austrian Alexander Wurz by 15 seconds.

Their victory enabled them to open up a 16-point lead in the overall standings over Schneider, who mounted a strong challenge on the struggling leaders in the final minutes of the four-hour race.

But Soper, struggling with the cars handling caused by a broken undertray, just managed to hold on for the win.

Lehto had opened up a lead of over 90 seconds during a mid-race downpour in the Ardennes mountains.

I thought that everyone else was driving on dry-weather tyres, he joked afterwards.

We swapped to rain tyres at exactly the right time and I was able to push hard and open up a big lead.

Third to finish was the Porsche of Frances and and Belgian Thierry Boutsen.

The Belgian, a former Formula One driver, switched from the car he normally shares with Stuck following a power-steering failure on his own car.

(c) Reuters Limited 1997
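
A minimal sketch of one way to approach Task 1 with the standard library's xml.etree.ElementTree. The markup is not visible in the sample above, so the tag and attribute names used here (newsitem, itemid, text, p), the itemid value 1234, and the helper name count_words are all assumptions made for illustration. Words are counted as whitespace-separated tokens, which is only one possible definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document. The tag and attribute names used here
# (newsitem, itemid, text, p) and the itemid value are assumptions.
SAMPLE_XML = """<newsitem itemid="1234">
<title>BELGIUM: MOTOR RACING-LEHTO AND SOPER HOLD ON FOR GT VICTORY.</title>
<text>
<p>Lehto had opened up a lead of over 90 seconds.</p>
<p>But Soper just managed to hold on for the win.</p>
</text>
<copyright>(c) Reuters Limited 1997</copyright>
</newsitem>"""


def count_words(xml_string):
    """Return (itemid, number of whitespace-separated words inside <text>)."""
    root = ET.fromstring(xml_string)
    itemid = root.get("itemid")      # attribute on the root element (assumed)
    text_elem = root.find("text")    # direct child holding the body (assumed)
    word_count = 0
    if text_elem is not None:
        # itertext() yields only character data, so tags are never counted.
        for chunk in text_elem.itertext():
            word_count += len(chunk.split())
    return itemid, word_count


if __name__ == "__main__":
    itemid, word_count = count_words(SAMPLE_XML)
    print(f"Document itemid: {itemid} contains: {word_count} words")
```

For a file on disk, ET.parse(path).getroot() can replace ET.fromstring(xml_string); the rest of the function stays the same.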

Task 2: Design a parsing function (parse_doc(input, stops)) to read a file and represent the file as a tuple
(word_count, {docid:curr_doc})
word_count is the number of words in <text>.
docid is simply the document's itemid.
curr_doc is a dictionary of term_frequency pairs.
You only need to tokenize the <text> part of the document into words, exclude all tags, and discard punctuation and numbers. Then remove the stop words to obtain all terms used in the <text> part.
Download the stop word list (common-english-words.txt) from Blackboard and use it for this task.
You can initialize the dictionary as curr_doc = {} and add terms to it as you go through them. For each term, check whether it already exists in curr_doc and update its frequency accordingly.
The following is an example of the return value of parse_doc() for file 6146.xml (see the file in the data folder on Blackboard):
(133, {6146: {'argentine': 1, 'bonds': 1, 'slightly': 1, 'higher': 1, 'small': 1, 'technical': 2, 'bounce': 2, ..., 'newsroom': 1}})
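
A minimal sketch of parse_doc(input, stops) under stated assumptions: the document is well-formed XML whose root element carries an itemid attribute and has a text child (these names are assumed, not confirmed by the handout), word_count counts whitespace-separated tokens in that part before stop word removal, and terms are lower-cased runs of letters, so a token such as "dry-weather" contributes two terms. A different tokenization may be needed to reproduce the counts quoted for 6146.xml exactly.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_doc(input, stops):
    """Return (word_count, {docid: curr_doc}) for one XML document.

    `input` (the handout's parameter name; it shadows the builtin) is a
    file path or a binary file object. `stops` is a set of stop words.
    """
    root = ET.parse(input).getroot()
    # Kept as the string read from the attribute; wrap in int() if
    # integer keys are wanted. Attribute name is an assumption.
    docid = root.get("itemid")
    text_elem = root.find("text")  # element name is an assumption
    raw_text = " ".join(text_elem.itertext()) if text_elem is not None else ""

    word_count = 0
    curr_doc = {}
    for token in raw_text.split():
        word_count += 1
        # Keep only runs of letters: this drops punctuation and numbers.
        for term in re.findall(r"[a-z]+", token.lower()):
            if term not in stops:
                # Add the term, or update its frequency if already present.
                curr_doc[term] = curr_doc.get(term, 0) + 1
    return word_count, {docid: curr_doc}


if __name__ == "__main__":
    # Hypothetical miniature document and stop list, for illustration only.
    sample = (b'<newsitem itemid="1234"><text>'
              b"<p>Lehto had opened up a lead of over 90 seconds.</p>"
              b"</text></newsitem>")
    print(parse_doc(io.BytesIO(sample), {"had", "up", "a", "of", "over"}))
```

Using dict.get(term, 0) + 1 folds the "check if the term exists, then update" step into one line; collections.Counter would be an equally reasonable choice.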

Task 3: Design a main function that reads an XML file and common-english-words.txt (the list of stop words), calls parse_doc(input, stops), and prints the itemid (docid), the number of words (word_count), and the number of terms (len(curr_doc)).
The following is an example of the output:
Document itemid: 6146 contains: 133 words and 75 terms
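
A minimal sketch of such a main function. To keep it independent of any particular Task 2 solution, the parsing function is passed in as an argument; any callable with the parse_doc(input, stops) contract will do. The helper name load_stops, and the assumption that the stop word file separates words with commas and/or whitespace, are illustrative guesses rather than facts taken from the handout.

```python
import os
import re
import tempfile


def load_stops(stops_path):
    """Read stop words separated by commas and/or whitespace (assumed format)."""
    with open(stops_path, encoding="utf-8") as f:
        content = f.read()
    return {w for w in re.split(r"[,\s]+", content.lower()) if w}


def main(xml_path, stops_path, parse_doc):
    """Load the stop list, parse one document, and print the summary line."""
    stops = load_stops(stops_path)
    with open(xml_path, "rb") as xml_file:
        word_count, doc = parse_doc(xml_file, stops)
    # The returned dictionary holds a single {docid: curr_doc} entry.
    docid, curr_doc = next(iter(doc.items()))
    line = (f"Document itemid: {docid} contains: {word_count} words "
            f"and {len(curr_doc)} terms")
    print(line)
    return line


if __name__ == "__main__":
    # Demonstration with throwaway files and a stand-in parser (hypothetical
    # values); replace the stand-in with a real parse_doc implementation.
    def stand_in_parser(xml_file, stops):
        return 3, {"1234": {"motor": 1, "racing": 1}}

    with tempfile.TemporaryDirectory() as tmp:
        xml_path = os.path.join(tmp, "doc.xml")
        stops_path = os.path.join(tmp, "stops.txt")
        with open(xml_path, "w", encoding="utf-8") as f:
            f.write('<newsitem itemid="1234"/>')
        with open(stops_path, "w", encoding="utf-8") as f:
            f.write("a,the,of")
        main(xml_path, stops_path, stand_in_parser)
```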

