
Data Processing in the Wild: Hadoop and Spark
CSE 3244: Data Management in the Cloud Professor
This slide set was first created by Prof. Xiaoyi Lu for The Ohio State University course CSE 3244.
Thank you, Professor Lu (UC Merced).

Recall: WordCount Execution in MapReduce
The overall execution process of WordCount in MapReduce
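As a reminder of that execution flow, here is a minimal plain-Java sketch of the three phases (map, shuffle, reduce). It is only a single-process simulation: the class `WordCountPhases` and its method names are hypothetical, no Hadoop classes are involved, and the sorted `TreeMap` merely stands in for the grouping the framework performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process simulation of the WordCount phases; not Hadoop code.
class WordCountPhases {
    // Map phase: emit a (word, 1) pair for every token of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map over every line, shuffles the pairs, then reduces each key.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(run(List.of("the quick fox", "the lazy dog", "the fox")));
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; the sketch only shows what each phase computes.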
The Ohio State University CSE 3244 2

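A Hadoop MapReduce Example: WordCount
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emits (word, 1) for every token of an input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reducer: sums the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}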

MapReduce Objectives
Writing software that processes big unstructured data that has (1) many records, (2) large records, and (3) complicated structure is HARD.
Build a platform that simplifies the process for the programmer.
First attempt: Map & Reduce functions in Java
Productive
Scalable
Fault-Tolerant

Scalability Problems in MapReduce
Scale: Add blades with processor, memory, disk
Problem: In practice, disk is often the bottleneck (slowest link)
Don't forget to make multiple copies in case of failure (HDFS replicates data across nodes)
In-memory? Memory is faster than network and disk
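To make the disk bottleneck concrete, the following toy model counts how often an iterative job touches storage. Everything in it is an assumption for illustration: `readFromDisk` is a stand-in for reading input from stable storage such as HDFS, the dataset is four integers, and the class `IterativeReads` is hypothetical. It only contrasts re-reading the input on every iteration with reading it once and keeping it in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model: counts storage reads with and without an in-memory copy.
class IterativeReads {
    static int diskReads = 0;

    // Stand-in for loading a dataset from stable storage (assumed, not real I/O).
    static List<Integer> readFromDisk() {
        diskReads++;
        return List.of(1, 2, 3, 4);
    }

    // Disk-based style: every iteration loads its input from storage again.
    static int sumWithoutCache(int iterations) {
        int total = 0;
        for (int i = 0; i < iterations; i++) {
            for (int x : readFromDisk()) {
                total += x;
            }
        }
        return total;
    }

    // In-memory style: load once, keep the data in memory, iterate over the copy.
    static int sumWithCache(int iterations) {
        List<Integer> cached = new ArrayList<>(readFromDisk());
        int total = 0;
        for (int i = 0; i < iterations; i++) {
            for (int x : cached) {
                total += x;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        diskReads = 0;
        int a = sumWithoutCache(10);
        System.out.println(a + " computed with " + diskReads + " disk reads"); // 100, 10 reads
        diskReads = 0;
        int b = sumWithCache(10);
        System.out.println(b + " computed with " + diskReads + " disk reads"); // 100, 1 read
    }
}
```

Both variants compute the same result; only the number of trips to the slowest link differs, which is the motivation for keeping working sets in memory.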

Scalability in Spark: RDD Programming
Key idea: Resilient Distributed Datasets (RDDs)
Immutable distributed collections of objects that can be cached in memory across cluster nodes
Created by transforming data in stable storage using data flow operators (map, filter, groupBy, ...)
Manipulated through various parallel operators
Fault tolerance by design: an RDD is automatically rebuilt on failure, i.e. if a partition is lost
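The rebuild-on-failure idea can be sketched with a toy class that remembers how each partition was derived, so a lost partition is recomputed rather than restored from a replica. `MiniRdd` below is hypothetical and is not Spark's RDD API; "stable storage" is modeled as an in-memory list of partitions, and `losePartition` simulates a node failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical toy class (not Spark's RDD): a partitioned dataset that keeps its lineage.
final class MiniRdd<T> {
    private final int numPartitions;
    private final Function<Integer, List<T>> compute; // lineage: how to (re)build partition i
    private final List<List<T>> cache;                // cached partitions; null = not in memory

    private MiniRdd(int numPartitions, Function<Integer, List<T>> compute) {
        this.numPartitions = numPartitions;
        this.compute = compute;
        this.cache = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            cache.add(null);
        }
    }

    // Creates a dataset from "stable storage", modeled here as a list of partitions.
    static <T> MiniRdd<T> fromStorage(List<List<T>> stored) {
        return new MiniRdd<>(stored.size(), i -> stored.get(i));
    }

    // Transformation: returns a new dataset; the parent's contents are never modified.
    <R> MiniRdd<R> map(Function<T, R> f) {
        return new MiniRdd<>(numPartitions, i -> {
            List<R> out = new ArrayList<>();
            for (T x : partition(i)) {
                out.add(f.apply(x));
            }
            return out;
        });
    }

    // Returns partition i, computing it from the lineage if it is not cached.
    List<T> partition(int i) {
        if (cache.get(i) == null) {
            cache.set(i, List.copyOf(compute.apply(i)));
        }
        return cache.get(i);
    }

    // Simulates a failure that drops the in-memory copy of partition i.
    void losePartition(int i) {
        cache.set(i, null);
    }

    boolean isCached(int i) {
        return cache.get(i) != null;
    }

    public static void main(String[] args) {
        MiniRdd<Integer> base = MiniRdd.fromStorage(List.of(List.of(1, 2), List.of(3, 4)));
        MiniRdd<Integer> doubled = base.map(x -> x * 2);
        System.out.println(doubled.partition(1)); // [6, 8]
        doubled.losePartition(1);
        System.out.println(doubled.partition(1)); // rebuilt from lineage: [6, 8]
    }
}
```

Only the lost partition is recomputed; partition 0 is never touched in this run, which is the property that makes this style of recovery cheap compared with re-running a whole job.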

Productivity: RDD Operations
Clean language-integrated API in Scala (Python & Java)
Can be used interactively from Scala console
Transformations (define a new RDD)
sample, union, groupByKey, reduceByKey, sortByKey, join
Actions (return a result to driver)
countByKey, saveAsTextFile, saveAsSequenceFile
More Information:
https://spark.apache.org/docs/latest/programming-guide.html#transformations
https://spark.apache.org/docs/latest/programming-guide.html#actions
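For readers unfamiliar with the by-key operators named above, here is a small plain-Java sketch of what `reduceByKey`, `groupByKey`, and `countByKey` compute on a list of (key, value) pairs. It runs on one machine over ordinary collections and is not the Spark API; the class `ByKeyOps` is hypothetical, and the `TreeMap` is used only to get a deterministic, key-sorted printout (Spark itself does not sort unless asked, e.g. with `sortByKey`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical single-machine illustration of by-key semantics; not the Spark API.
class ByKeyOps {
    // reduceByKey: merge all values that share a key with a binary function.
    static <V> Map<String, V> reduceByKey(List<Map.Entry<String, V>> pairs, BinaryOperator<V> f) {
        Map<String, V> out = new TreeMap<>();
        for (Map.Entry<String, V> p : pairs) {
            out.merge(p.getKey(), p.getValue(), f);
        }
        return out;
    }

    // groupByKey: collect all values that share a key into one list.
    static <V> Map<String, List<V>> groupByKey(List<Map.Entry<String, V>> pairs) {
        Map<String, List<V>> out = new TreeMap<>();
        for (Map.Entry<String, V> p : pairs) {
            out.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return out;
    }

    // countByKey: an action in Spark; here it simply returns the number of pairs per key.
    static <V> Map<String, Long> countByKey(List<Map.Entry<String, V>> pairs) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, V> p : pairs) {
            out.merge(p.getKey(), 1L, Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                List.of(Map.entry("b", 2), Map.entry("a", 1), Map.entry("b", 3));
        System.out.println(reduceByKey(pairs, Integer::sum)); // {a=1, b=5}
        System.out.println(groupByKey(pairs));                // {a=[1], b=[2, 3]}
        System.out.println(countByKey(pairs));                // {a=1, b=2}
    }
}
```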

RDD Example: Word Count in Spark!
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
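The same operator chain can be tried without a cluster using Java streams, which may help when comparing it with the Hadoop version. This is an analogy only: `StreamWordCount` is a hypothetical class, the input is an in-memory list rather than an HDFS file, and `Collectors.toMap` with a merge function plays the role of `map` followed by `reduceByKey`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local analogue of the Spark pipeline, built on java.util.stream.
class StreamWordCount {
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // like flatMap(line => line.split(" "))
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // skip empty tokens produced by repeated spaces
                .filter(word -> !word.isEmpty())
                // like map(word => (word, 1)) followed by reduceByKey(_ + _)
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(count(List.of("to be or", "not to be")));
    }
}
```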

Overview of Apache Hadoop Architecture
Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
http://hadoop.apache.org
[Figure: Hadoop 1.x and Hadoop 2.x software stacks]
Hadoop 1.x: MapReduce (Cluster Resource Management & Data Processing), on top of Hadoop Distributed File System (HDFS), on top of Hadoop Common/Core (RPC, ..)
Hadoop 2.x: MapReduce and Other Models (Data Processing), on top of YARN (Cluster Resource Management & Job Scheduling), on top of Hadoop Distributed File System (HDFS), on top of Hadoop Common/Core (RPC, ..)

MapReduce on Hadoop 2.x YARN Architecture
Resource Manager: coordinates the allocation of compute resources
Node Manager: in charge of resource containers, monitoring resource usage, and reporting to Resource Manager
Application Master: in charge of the life cycle of an application, such as a MapReduce job. It negotiates with the Resource Manager for cluster resources and keeps track of task progress and status
[Figure: MapReduce on YARN, showing the data nodes and locality settings]
Courtesy: http://www.cyanny.com/2013/12/05/hadoop-mapreduce-2-yarn/
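The division of labor between the three roles can be sketched as a toy allocator. None of the names below belong to the real YARN API: `ToyYarn` is a hypothetical class in which node managers report free container slots, and an application master asks for a number of containers with a preferred node, as a crude stand-in for data locality. Real YARN scheduling is far richer (queues, memory and CPU requests, heartbeats).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the YARN roles; not the real YARN API.
class ToyYarn {
    // Resource Manager state: free container slots reported by each Node Manager.
    private final Map<String, Integer> freeSlots = new LinkedHashMap<>();

    // A Node Manager reports its capacity to the Resource Manager.
    void registerNode(String node, int slots) {
        freeSlots.put(node, slots);
    }

    // An Application Master requests containers; the preferred node (where the
    // data lives) is tried first, then any other node with free slots.
    List<String> allocate(int containers, String preferredNode) {
        List<String> granted = new ArrayList<>();
        while (granted.size() < containers && take(preferredNode)) {
            granted.add(preferredNode);
        }
        for (String node : freeSlots.keySet()) {
            while (granted.size() < containers && take(node)) {
                granted.add(node);
            }
        }
        return granted; // may be shorter than requested if the cluster is full
    }

    // Reserves one slot on the node if any is free.
    private boolean take(String node) {
        int free = freeSlots.getOrDefault(node, 0);
        if (free == 0) {
            return false;
        }
        freeSlots.put(node, free - 1);
        return true;
    }

    public static void main(String[] args) {
        ToyYarn rm = new ToyYarn();
        rm.registerNode("node1", 1);
        rm.registerNode("node2", 2);
        System.out.println(rm.allocate(2, "node2")); // [node2, node2]
        System.out.println(rm.allocate(2, "node2")); // [node1] (only one slot left)
    }
}
```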

Overview of Apache Hadoop 3.x Architecture
[Figure: Hadoop 3.x stack. Labels: Hadoop Apps, MR, HIVE, TensorFlow, YARN, docker]
HDFS
Erasure Coding
Support for more than 2 NameNodes
Intra-datanode balancer
MapReduce
Task-level native optimization (up to 30% faster for shuffle-intensive jobs)
YARN
Built-in support for Long Running Services
Better resource isolation (isolation support for disk and network) and Docker
Scheduling enhancement (enhance container scheduling throughput by 6x)
Re-architecture for YARN Timeline Service ATS v2
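To see why erasure coding is attractive compared with replication, the sketch below does two things: it computes storage overhead (3-way replication stores two extra copies per block, i.e. 200% overhead, while a Reed-Solomon layout with 6 data and 3 parity blocks has 50%), and it rebuilds a lost block from a parity block. The recovery part uses plain XOR parity, which is a deliberately simpler code than the Reed-Solomon codes used by HDFS and tolerates only one lost block; `ParitySketch` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: XOR parity stands in for Reed-Solomon, which it is not.
class ParitySketch {
    // Storage overhead in percent: extra blocks stored per data block.
    static int overheadPercent(int dataBlocks, int extraBlocks) {
        return 100 * extraBlocks / dataBlocks;
    }

    // XOR of equally sized blocks; used both to build parity and to recover a block.
    static byte[] xor(byte[]... blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= block[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(overheadPercent(1, 2)); // 3-way replication: 200
        System.out.println(overheadPercent(6, 3)); // 6 data + 3 parity blocks: 50

        byte[] d0 = "abcd".getBytes(StandardCharsets.US_ASCII);
        byte[] d1 = "wxyz".getBytes(StandardCharsets.US_ASCII);
        byte[] parity = xor(d0, d1);

        // Suppose d1 is lost: XOR of the surviving block and the parity restores it.
        byte[] rebuilt = xor(d0, parity);
        System.out.println(new String(rebuilt, StandardCharsets.US_ASCII)); // wxyz
    }
}
```

The trade-off is that recovery needs computation and reads from several surviving blocks, whereas replication just copies an intact replica.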

Apache Spark: an in-memory data-processing framework
Iterative machine learning jobs
Interactive data analytics
Scala-based implementation
Cluster managers: Standalone, YARN, Mesos
A unified engine to support Batch, Streaming, SQL, Graph, ML/DL workloads
Scalable and communication intensive
Wide dependencies between Resilient Distributed Datasets (RDDs)
MapReduce-like shuffle operations to repartition RDDs
http://spark.apache.org
[Figure: Spark stack and runtime. Labels: SparkContext, Worker, MLlib (Machine Learning), Caffe, TensorFlow, BigDL, etc., (real-time), Map Reduce, Standalone, Apache Mesos, YARN]
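Why a shuffle is communication intensive can be seen from a small hash-partitioning sketch: every input partition may send records to every output partition, which is what makes the dependency "wide". The class `ShuffleSketch` is hypothetical and runs in one process; routing keys by hash code modulo the partition count is a common scheme and is assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-process sketch of a hash-partitioned shuffle.
class ShuffleSketch {
    // Chooses the output partition for a key; floorMod keeps the result non-negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Repartitions the input so that equal keys end up in the same output partition.
    // Each output partition can receive records from every input partition.
    static List<List<String>> shuffle(List<List<String>> input, int numPartitions) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            out.add(new ArrayList<>());
        }
        for (List<String> partition : input) {
            for (String key : partition) {
                out.get(partitionFor(key, numPartitions)).add(key);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> input = List.of(List.of("a", "b"), List.of("a", "c"), List.of("b"));
        // prints [[b, b], [a, a, c]]: both output partitions draw on several input partitions
        System.out.println(shuffle(input, 2));
    }
}
```

After the shuffle, a per-key operation such as a reduce can run on each output partition independently, because all records for a given key are co-located.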
