[Solved] INF 551 Homework #5


In this homework, we will again consider the country data set (country.csv, city.csv, and countrylanguage.csv), similar to the one you saw in homework 1. Note, however, that all NULL values in the data set have been replaced with empty strings, and the header lines (1st line of each file) have been removed. Use the data set attached with this handout.
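Because NULLs arrive as empty strings, it can help to turn them back into an explicit missing value while parsing. The following is a minimal sketch, not a required approach: parse_row is a hypothetical helper name, the sample line is made up, and the use of Python's csv module assumes the files follow ordinary CSV quoting rules, which the handout does not state.

```python
import csv


def parse_row(line):
    """Parse one CSV line, mapping empty strings (the data set's NULLs) to None."""
    # csv.reader copes with quoted fields that themselves contain commas.
    fields = next(csv.reader([line]))
    return [field if field != "" else None for field in fields]


# Made-up line for illustration only; the real column layout is not shown here.
print(parse_row("X1,Exampleland,,1000"))
```

Keeping missing values as None rather than "" makes it harder to convert an absent number to a float by accident later on.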

  1. [Hadoop MapReduce, 55 points] Write a MapReduce program AvgExp.java that implements the following SQL query:

SELECT Continent, avg(LifeExpectancy) FROM country where GNP > 10000 group by Continent having count(*) >= 5;

You can take WordCount.java as the template. But note the following:

  • You may want to use the split function of Java instead of StringTokenizer:

    o https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)

  • Replace IntWritable with FloatWritable
  • It is OK that your implementation does not utilize a combiner.
  • Name your jar file ae.jar.

Execution format: hadoop jar ae.jar AvgExp.java <input-hw5> <output-hw5>

Ignore the angle brackets. The <input-hw5> directory stores a single file, country.csv.

Submission: AvgExp.java, ae.jar
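To make the query's semantics concrete, here is a rough sketch of how its clauses could be split between a map phase and a reduce phase. It is plain Python that simulates the data flow, not Hadoop code, and it is only one possible reading: the function names are hypothetical, and rows are assumed to be already parsed into (continent, life_expectancy, gnp) tuples with None standing for NULL. It follows standard SQL behaviour, in which count(*) counts every row while avg() skips NULLs.

```python
from collections import defaultdict


def map_row(continent, life_expectancy, gnp):
    """Map phase: apply WHERE GNP > 10000 and emit (continent, life_expectancy)."""
    if gnp is not None and gnp > 10000:
        yield continent, life_expectancy  # the value may be None (NULL)


def reduce_group(continent, values):
    """Reduce phase: apply HAVING count(*) >= 5, then average the known values."""
    values = list(values)
    if len(values) < 5:  # count(*) counts every row, NULL or not
        return None
    known = [v for v in values if v is not None]  # avg() skips NULLs
    average = sum(known) / len(known) if known else None
    return continent, average


def run(rows):
    """Simulate map, shuffle (group by key), and reduce over in-memory rows."""
    groups = defaultdict(list)  # stands in for the shuffle step
    for row in rows:
        for key, value in map_row(*row):
            groups[key].append(value)
    results = (reduce_group(key, values) for key, values in groups.items())
    return dict(result for result in results if result is not None)
```

Note that the reducer needs every value for a continent before it can divide, which is one reason a plain average does not combine partial results as directly as a word count does.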

  2. [Apache Spark, 45 points] For each of the following questions, write a Spark program in Python. You can assume that all three csv files, country.csv, city.csv, and countrylanguage.csv (note that the file names are all lowercase), are available in the same directory where you execute the code. Note that you should NOT use Spark DataFrames or Spark SQL for this homework.
    1. [10 points] Find the 10 most populated countries in a given continent. Return the names of the countries and their populations in descending order of population. Name your script pop10.py.
      • Execution format: spark-submit pop10.py Asia
      • This finds the 10 most populated countries in Asia.
      • Sample output:
      • Sample output:

China, 1277558000.0

India, 1013662000.0
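The filter, sort, and truncate steps behind this question can be sketched in plain Python as follows. This is an illustration under assumptions, not the expected Spark solution: the helper names are hypothetical, rows are assumed to be already reduced to (name, continent, population) tuples with a non-empty population, and the output format simply mirrors the sample lines above.

```python
def top_populated(countries, continent, n=10):
    """Return the n most populated (name, population) pairs in one continent."""
    in_continent = [
        (name, float(population))
        for name, cont, population in countries
        if cont == continent
    ]
    # Sort by population, largest first, then keep the first n entries.
    in_continent.sort(key=lambda pair: pair[1], reverse=True)
    return in_continent[:n]


def format_line(name, population):
    """Format one result in the 'Name, population' style of the sample output."""
    return f"{name}, {population}"
```

In a PySpark script, the same shape could plausibly be expressed with the RDD methods filter and takeOrdered, which avoids sorting the whole data set just to keep ten rows.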

INF 551 Spring 2020

    2. [10 points] Find the names of countries that do not have any languages recorded (in the country language table). Output one line per country. Name your script no-lang.py.
    3. [15 points] Find the names of countries that have at least 10 unofficial languages. Order the countries in descending order of the number of unofficial languages. Name your script unofficial10.py.
    4. [10 points] Implement the SQL query in Question 1. Name your script avgexp.py.
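The two language questions come down to an anti-join and a grouped count. The sketch below shows both in plain Python and is only one way to think about them. The function names are hypothetical, countries are assumed to be (code, name) pairs, and language rows are assumed to be (country_code, language, is_official) triples with the flag already converted to a boolean; how the flag is encoded in countrylanguage.csv is not stated in this handout.

```python
from collections import Counter


def countries_without_languages(countries, languages):
    """Anti-join: names of countries whose code never appears in languages."""
    codes_with_language = {code for code, _language, _official in languages}
    return [name for code, name in countries if code not in codes_with_language]


def countries_with_unofficial(countries, languages, minimum=10):
    """Grouped count: (name, count) pairs with at least `minimum` unofficial
    languages, ordered by descending count."""
    counts = Counter(
        code for code, _language, is_official in languages if not is_official
    )
    names = dict(countries)
    # most_common() yields (code, count) pairs from highest to lowest count.
    return [
        (names[code], count)
        for code, count in counts.most_common()
        if count >= minimum and code in names
    ]
```

With RDDs, similar results could plausibly be reached with key-based operations such as subtractByKey for the anti-join and reduceByKey followed by join for the count, though other formulations work as well.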
