[SOLVED] COSC 472/572 Big Data

$25

File Name: COSC_472/572_Big_Data.zip
File Size: 197.82 KB

5/5 - (1 vote)

COSC 472/572 Big Data

Assignment 4 Create an Inverted Index

Due: Monday April 13, 2015

This is a paired assignment. You must have one (and only one partner). You must submit a paper with both partners names on it by Monday March 9.

Inverted indexes are one of the main data structures employed by a search engine. Basically, an inverted index has as a key value a specific word (e.g. cat). It will then contain every files (html document) that contains the word cat. A more elaborate version of an inverted index would contain a file name and word location. So for example, we might have something like the following:

cat-
a.html , carnivores.html , jungle_animals.html ,

The more elaborate model would look like:

cat-
a.html (3,7,103) , carnivores.html (10,23,44,56,1234) , etc

In the second instance, the list indicates the position of the word cat within the file. Your assignment is to create an inverted index. This will be done in several steps:

1) Use heritrix to crawl the web. You must collect at least a gigabyte of data. This will take a long time (10-20 hours) . Instructions on using heritrix , including an excellent video done by graduate student Cade Sperlich, can be found below:

Wikipedia Inverted Indexes
Heritrix home page
Cade Sperlichs excellent Heritrix web crawl video

2) Heritrix creates .warc files . .warc stands for web archive file. But we will need text files as output. Again, Cade has provided excellent documentation on this conversion, as well as a Python script:

Cades Python script for converting .warc files to text files Some notes from Cade

3) Now that you have text files, create a Hadoop/MapReduce job to create the inverted index (again, refer to Cades notes). This will take some time.

4) Import your results into sqlite

5) Create a simple search engine! Write your program in either Python or Java and interface to your sqlite database. The search engine will have two input options:

input a single word . We define word as and contiguous collection of non-blank roman letters like cat, dog, eastern, michigan. In this case, the program should return all documents containing the word

input a pound sign (#) followed by a word, for example #cat , #dog, #eastern . In this case, the

program should return with the total number of documents containing the word.

References for calling sqlite from Python or java can be found here:

Java and SQLite Python and SQLite

What to hand in:
A paper copy of your mapper and reducer programs. Make sure this code is well documented and

has both members names on each paper. You will also need to demonstrate your program to me.

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] COSC 472/572 Big Data
$25