Task 1:
Following the description of the NCDC data format in the Description of Data.txt file, store all the data from the .op.gz files in HDFS. Then load the data from HDFS into two HBase tables, observations and counts, whose column families are info and data respectively. The counts table stores all the count fields from the .op.gz files, and the observations table stores all the other fields.
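The split between the two tables can be sketched as follows. This is only a minimal illustration, not a required implementation: it assumes each data line of an .op file has 22 whitespace-separated fields in the order STN, WBAN, YEARMODA, six value/count pairs, then MXSPD, GUST, MAX, MIN, PRCP, SNDP and FRSHTT (check this against Description of Data.txt). The row key format `station_YYYYMMDD`, the function name `parse_op_line` and the sample line are all hypothetical, and writing the resulting cells to HBase is left out.

```python
# Fields that carry an observation count (assumed order; verify against
# Description of Data.txt).
MEASURES = ["TEMP", "DEWP", "SLP", "STP", "VISIB", "WDSP"]
# Remaining fields, which have no count of their own.
OTHERS = ["MXSPD", "GUST", "MAX", "MIN", "PRCP", "SNDP", "FRSHTT"]

# Made-up sample line, for illustration only.
SAMPLE = ("007026 99999  20160622    94.7  7    66.6  7  9999.9  0  9999.9  0"
          "    6.2  4    0.0  7  999.9  999.9   100.4*   87.8*  0.00I 999.9  000000")


def parse_op_line(line):
    """Split one .op line into (row_key, observations, counts).

    Returns None for the header line or any line without 22 fields.
    """
    tok = line.split()
    if len(tok) != 22 or tok[0].startswith("STN"):
        return None
    station = tok[0] + "-" + tok[1]
    row_key = station + "_" + tok[2]  # hypothetical row key: station_YYYYMMDD
    observations = {}  # cells for table "observations", column family "info"
    counts = {}        # cells for table "counts", column family "data"
    for i, name in enumerate(MEASURES):
        observations["info:" + name] = tok[3 + 2 * i]
        counts["data:" + name] = tok[4 + 2 * i]
    for j, name in enumerate(OTHERS):
        observations["info:" + name] = tok[15 + j]
    return row_key, observations, counts


if __name__ == "__main__":
    print(parse_op_line(SAMPLE))
```

Keeping the station ID and date in the row key means both tables can share the same key, so a later job can join a day's values with its counts by key alone.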
Task 2:
An HBase table can serve as the source of a MapReduce job, as its target, or as both. Read the data from the observations and counts tables and use MapReduce to answer the following questions:
Which station has the most records? (One row represents one data record, i.e. one day's data.)
Each station records only some of the days in a year. For example, station 007026-99999 observed only 8 days of data in 2016, from June 22 to June 29. Count which station has the most recorded days in total over the last 100 years.
Which year has the most records?
Similarly, determine which year in the last 100 years has the most data recorded across all stations.
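Both record-count questions follow the same map and reduce pattern: the map step emits a key with the value 1 for every row, and the reduce step sums the values per key. The sketch below simulates that pattern locally in plain Python rather than on Hadoop. It assumes a hypothetical row key of the form `station_YYYYMMDD`; the function names are illustrative, and no filter for the last 100 years is applied.

```python
from collections import Counter


def map_row(row_key):
    """Map step: emit one (key, 1) pair per grouping for a single row."""
    station, yearmoda = row_key.rsplit("_", 1)
    yield ("station", station), 1
    yield ("year", yearmoda[:4]), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def top(totals, kind):
    """Return (name, record_count) with the highest count for one grouping."""
    candidates = {k[1]: v for k, v in totals.items() if k[0] == kind}
    return max(candidates.items(), key=lambda kv: kv[1])


def count_records(row_keys):
    """Run the map step over all rows, then reduce."""
    pairs = (pair for rk in row_keys for pair in map_row(rk))
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Made-up row keys, for illustration only.
    keys = ["007026-99999_20160622", "007026-99999_20160623",
            "010010-99999_20150101"]
    totals = count_records(keys)
    print(top(totals, "station"), top(totals, "year"))
```

In an actual job the same logic would sit in a mapper that scans the observations table and a reducer that sums per key; only the row key is needed, so the scan does not have to fetch any column values.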
Which year and station have the most observations (the sum of all counts) of the specified information?
The number of observations of TEMP, DEWP, SLP, STP, VISIB and WDSP differs from day to day for each station. Determine the year and station with the largest number of observations of these fields.
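One possible reading of this question is to add up the six count fields of each row and then total them per (year, station) pair. The sketch below does this locally in plain Python under that assumption. The row key format `station_YYYYMMDD`, the `data:` column names and the function names are hypothetical; if the counts are meant to be compared per field instead, the key would also need to include the field name.

```python
from collections import defaultdict

# Count columns assumed to live in column family "data" of the counts table.
MEASURES = ("TEMP", "DEWP", "SLP", "STP", "VISIB", "WDSP")


def map_counts(row_key, counts):
    """Map step: emit ((year, station), sum of the six counts) for one row."""
    station, yearmoda = row_key.rsplit("_", 1)
    total = sum(int(counts["data:" + m]) for m in MEASURES)
    return (yearmoda[:4], station), total


def busiest_year_station(rows):
    """Reduce step: total per (year, station), then pick the largest."""
    totals = defaultdict(int)
    for row_key, counts in rows:
        key, value = map_counts(row_key, counts)
        totals[key] += value
    return max(totals.items(), key=lambda kv: kv[1])


def make_counts(values):
    """Helper for the demo: build a counts row from six integers."""
    return {"data:" + m: str(v) for m, v in zip(MEASURES, values)}


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        ("007026-99999_20160622", make_counts([7, 7, 0, 0, 4, 7])),
        ("007026-99999_20160623", make_counts([8, 8, 0, 0, 8, 8])),
        ("010010-99999_20150101", make_counts([24, 24, 24, 24, 24, 24])),
    ]
    print(busiest_year_station(rows))
```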
Draw one or more further conclusions from the dataset through calculation and data processing, and describe your analysis procedure in detail.