COSC 472/572 Big Data
Assignment 4 Create an Inverted Index
Due: Monday April 13, 2015
This is a paired assignment. You must have one (and only one partner). You must submit a paper with both partners names on it by Monday March 9.
Inverted indexes are one of the main data structures employed by a search engine. Basically, an inverted index has as a key value a specific word (e.g. cat). It will then contain every files (html document) that contains the word cat. A more elaborate version of an inverted index would contain a file name and word location. So for example, we might have something like the following:
cat-
a.html , carnivores.html , jungle_animals.html ,
The more elaborate model would look like:
cat-
a.html (3,7,103) , carnivores.html (10,23,44,56,1234) , etc
In the second instance, the list indicates the position of the word cat within the file. Your assignment is to create an inverted index. This will be done in several steps:
1) Use heritrix to crawl the web. You must collect at least a gigabyte of data. This will take a long time (10-20 hours) . Instructions on using heritrix , including an excellent video done by graduate student Cade Sperlich, can be found below:
Wikipedia Inverted Indexes
Heritrix home page
Cade Sperlichs excellent Heritrix web crawl video
2) Heritrix creates .warc files . .warc stands for web archive file. But we will need text files as output. Again, Cade has provided excellent documentation on this conversion, as well as a Python script:
Cades Python script for converting .warc files to text files Some notes from Cade
3) Now that you have text files, create a Hadoop/MapReduce job to create the inverted index (again, refer to Cades notes). This will take some time.
4) Import your results into sqlite
5) Create a simple search engine! Write your program in either Python or Java and interface to your sqlite database. The search engine will have two input options:
input a single word . We define word as and contiguous collection of non-blank roman letters like cat, dog, eastern, michigan. In this case, the program should return all documents containing the word
input a pound sign (#) followed by a word, for example #cat , #dog, #eastern . In this case, the
program should return with the total number of documents containing the word.
References for calling sqlite from Python or java can be found here:
Java and SQLite Python and SQLite
What to hand in:
A paper copy of your mapper and reducer programs. Make sure this code is well documented and
has both members names on each paper. You will also need to demonstrate your program to me.
Reviews
There are no reviews yet.