Suggested reading: Mining of Massive Datasets: Chapter 2.
Hadoop: The Definitive Guide: Chapter 17 (Hive).
- Describe how you would implement a MapReduce job by giving a description of its Map and Reduce tasks. You can describe it in your own words or as pseudo-code. Keep in mind that the map task reads the input file and produces (key, value) pairs, and the reduce task takes the list of (key, value) pairs for each key and combines all of the values for that key.
Remember that Map operates on individual blocks and Reduce operates on individual keys with their set of values. Thus, for the Mapper you need to state what your code does given a block of data, and for the Reducer you need to state what it does for each key. You can assume that you know how to parse the file and extract numbers/names. (A minimal sketch of the general pattern follows below.)
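For illustration only (a sketch, not required code), here is a Hadoop Streaming style pair of Python functions that counts how many times each name occurs. The input layout (comma-separated lines with the name in the first field) and the script name are assumptions made up for this example.

import sys

# Mapper: called on each block/split; reads its lines from stdin and
# emits one (key, value) pair per record, here (name, 1).
def mapper():
    for line in sys.stdin:
        name = line.strip().split(',')[0]  # assumed: name is the first field
        print(f"{name}\t1")

# Reducer: the framework groups and sorts pairs by key, so each key's
# values arrive together; combine them, here by summing the counts.
def reducer():
    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == '__main__':
    # e.g.  python name_count.py map   or   python name_count.py reduce
    mapper() if sys.argv[1] == 'map' else reducer()

In Hadoop Streaming the two functions would normally live in separate mapper and reducer scripts; the point here is only the shape of the (key, value) flow.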
- For a Student table (ID, FullName, Address, Phone, Year, Month), convert the following query into a MapReduce job:
SELECT Year, Month, COUNT(FullName)
FROM Student
GROUP BY Year, Month;
- For Employee(EID, First, Last, Phone, Age) and Agent(AID, First, Last, Address), find everyone with the same name by converting the following query into a MapReduce job:
SELECT a.First, a.Last, EID, AID, Phone
FROM Employee as e, Agent as a
WHERE e.Last = a.Last AND e.First = a.First;
- Using the same tables, convert the following query into a MapReduce job:
SELECT Age, COUNT(DISTINCT a.Last)
FROM Employee AS e, Agent AS a
WHERE EID = AID
GROUP BY Age;
- Suppose you are tasked with the analysis of the company's web server logs. The log dump contains a large amount of information with up to 10 different attributes (columns). You regularly run a Hadoop job to perform analysis pertaining to 3 specific attributes: TimeOfAccess, OriginOfAccess, and FileName.
- If a Mapper task fails while processing a block of data, what is the best location to restart it?
- If the job is executed with 5 Reducers:
- How many output files will the job generate?
- Suggest one possible hash function that may be used to assign keys to reducers. (A sketch of one common approach follows below.)
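For reference, a minimal Python sketch of one common approach: hash the key and take it modulo the number of reducers (Hadoop's default HashPartitioner does essentially this with the key's hashCode). The key value used below is a made-up example.

NUM_REDUCERS = 5

def assign_reducer(key, num_reducers=NUM_REDUCERS):
    # Hash the key, force the hash value to be non-negative, then map it
    # into the range 0 .. num_reducers - 1 with a modulo.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# All records that share the same key land on the same reducer.
print(assign_reducer('index.html'))

Any deterministic function of the key that spreads keys roughly uniformly over the reducer numbers would do.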
- True or False?
- A message that was encrypted with a public key can be decrypted with a corresponding private key
- A message that was encrypted with a private key can be decrypted with a corresponding public key
- A message that was encrypted with a public key can only be read by its intended recipient, the holder of the private key
- Consider a Hadoop job that processes an input data file of size equal to 38 disk blocks (38 different blocks, you can assume that HDFS replication factor is set to 1). The mapper in this job requires 1 minute to read and fully process a single block of data. For the purposes of this assignment, you can assume that the reduce part of this job takes zero time.
- Approximately how long will it take to process the file if you had only one Hadoop worker node? You can assume that only one mapper is created on every node.
- 20 Hadoop worker nodes?
- 50 Hadoop worker nodes?
- 75 Hadoop worker nodes?
- Now suppose you were told that the replication factor has been changed to 3. That is, each block is stored in triplicate, but the file size is still 38 blocks. Which of the answers (if any) in a)-c) above will have to change?
You can ignore the network transfer costs and other potential overheads, as well as the possibility of node failure. If you feel some information is missing, please be sure to state your assumptions. (A sketch of the wave-based arithmetic follows below.)
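If it helps to check your reasoning, here is a minimal Python sketch of the wave-based arithmetic, assuming one mapper per node, one block per minute, and no overheads. The 10-block, 4-node example at the end is hypothetical and is not one of the assignment's cases.

import math

def map_phase_minutes(num_blocks, num_worker_nodes, minutes_per_block=1):
    # With one mapper per node, blocks are processed in waves: every node
    # takes one block per wave, so there are
    # ceil(num_blocks / num_worker_nodes) waves in total.
    waves = math.ceil(num_blocks / num_worker_nodes)
    return waves * minutes_per_block

# Hypothetical example: 10 blocks on 4 nodes -> ceil(10/4) = 3 waves -> 3 minutes.
print(map_phase_minutes(10, 4))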
- In this section we are going to use Hive to run a few queries over the Hadoop framework. These instructions assume that you are starting from a working Hadoop installation. It should be sufficient to start your instance and the Hadoop framework on it.
Hive commands to run are shown on their own lines below.
- Download and install Hive:
cd
(this command is there to make sure you start from the home directory, on the same level as where Hadoop is located)
wget http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/apache-hive-2.0.1-bin.tar.gz
gunzip apache-hive-2.0.1-bin.tar.gz
tar xvf apache-hive-2.0.1-bin.tar
Set the environment variables (this can be automated by adding these lines to ~/.bashrc). If you don't, you will have to set these variables every time you use Hive.
export HIVE_HOME=/home/ec2-user/apache-hive-2.0.1-bin
export PATH=$HIVE_HOME/bin:$PATH
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
(if you get an error here, it means that /user/hive does not exist yet. Fix that by running $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse instead)
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
We are going to use Vehicle data (originally from http://www.fueleconomy.gov/feg/download.shtml)
You can get the already unzipped, comma-separated file from here:
wget http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/vehicles.csv
You can take a look at the data file by either
nano vehicles.csv or
more vehicles.csv (you can press space to scroll and q or Ctrl-C to break out)
Note that the first row in the data is the list of column names. After the commands that start Hive, you will find the definition of the table that you will create in Hive, loading the first 5 columns. Hive is not particularly sensitive about invalid or partial data; since we only define the first 5 columns, it will simply load the first 5 columns and ignore the rest.
You can see the description of all the columns here (atvtype was added later)
http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle
Create the ec2-user directory on the HDFS side; here we are creating the user home directory in HDFS. (Commands that call hadoop via the PATH or by its absolute path work from anywhere, not just from the Hadoop directory the way bin/hadoop does.)
hadoop fs -mkdir /user/ec2-user/
Run Hive (from the Hive directory; the first command below takes you there):
cd $HIVE_HOME
$HIVE_HOME/bin/schematool -initSchema -dbType derby
(NOTE: This command initializes the database metastore. If you need to restart/reformat or see errors related to the metastore, run rm -rf metastore_db/ and then repeat the above initSchema command)
bin/hive
You can now create a table by pasting this into the Hive terminal:
CREATE TABLE VehicleData (
barrels08 FLOAT, barrelsA08 FLOAT,
charge120 FLOAT, charge240 FLOAT,
city08 FLOAT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' STORED AS TEXTFILE;
You can load the data (from the local file system, not HDFS) using:
LOAD DATA LOCAL INPATH '/home/ec2-user/vehicles.csv'
OVERWRITE INTO TABLE VehicleData;
(NOTE: If you downloaded the vehicles.csv file into the Hive directory, you have to change the path to '/home/ec2-user/apache-hive-2.0.1-bin/vehicles.csv' instead)
Verify that your table has loaded successfully by running
SELECT COUNT(*) FROM VehicleData;
(Copy the query output and report how many rows you got as an answer.)
Run a couple of HiveQL queries to verify that everything is working properly:
SELECT MIN(barrels08), AVG(barrels08), MAX(barrels08) FROM VehicleData;
(copy the output from that query)
SELECT (barrels08/city08) FROM VehicleData;
(you do not need to report the output from that query, but report Time taken)
Next, we are going to output three of the columns into a separate file (as a way to transform the data for any further manipulation you may be interested in):
INSERT OVERWRITE DIRECTORY 'ThreeColExtract'
SELECT barrels08, city08, charge120
FROM VehicleData;
You can now exit Hive by running exit;
Verify that the new output file has been created (the file will be called 000000_0).
The file will be created in HDFS under the user home directory (/user/ec2-user/ThreeColExtract).
Report the size of the newly created file.
Next, you should go back to the Hive terminal and create a new table that is going to load 8 columns instead of the 5 in our example (i.e., create and load a new table that defines 8 columns by including the columns city08U, cityA08, and cityA08U), and use Hive to generate a new output file containing only the city08U and cityA08U columns from the vehicles.csv file. Report the size of that output file as well.