Unstructured Data Example
Opinion pieces, box scores, summaries, ads, user comments, etc.
1 website, many types of data
Unstructured Data Example cont.
Business goal: Identify popular players
Download each web page (or frame) and store as a file
file1.txt file2.txt ..
Find out which players' names appear most frequently
How would you solve this problem?
Hire a data scientist to:
Real Map Reduce: Player name mentions
Solve the problem with MapReduce; assume a function:
bool isPlayerName (String s);
Returns true if s is the name of a current WNBA player, false otherwise
(on the board)
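A minimal sketch of how the player-name count could be written as a map function and a reduce function, simulated in a single process. The roster behind `is_player_name`, the single-word name matching, and the `run_job` driver are all assumptions made for illustration; a real job would read the downloaded files and run on a framework such as Hadoop.

```python
from collections import defaultdict

# Made-up roster; stands in for the assumed isPlayerName function.
ROSTER = {"Rivers", "Okafor", "Chen"}

def is_player_name(s):
    """True if s is the name of a current player (toy lookup)."""
    return s in ROSTER

def map_fn(filename, text):
    """map: (filename, contents) -> list of (player name, 1)."""
    pairs = []
    for token in text.split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if is_player_name(word):
            pairs.append((word, 1))
    return pairs

def reduce_fn(name, counts):
    """reduce: (player name, [1, 1, ...]) -> (player name, total)."""
    return (name, sum(counts))

def run_job(files):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for key, value in map_fn(filename, text):
            grouped[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

files = {
    "file1.txt": "Rivers scored 30. Okafor added 22, and Rivers sealed it.",
    "file2.txt": "Fans cheered for Chen and Rivers.",
}
totals = run_job(files)
most_mentioned = max(totals, key=totals.get)
```

Only `map_fn` and `reduce_fn` are specific to the problem; `run_job` plays the role that the execution framework would play on a cluster.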
Unstructured Data Example cont.
Oh, excuse me. Did I say WNBA? I meant NBA. No, I meant all professional sports? No, I meant all corporate entities?
The same problem is relevant across many domains
At some point, data doesn't fit on your laptop
How do mappers & reducers find the files they need?
A distributed file system is the answer, e.g., Hadoop Distributed File System (HDFS)
Don't move data to workers; move workers to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that holds the data locally
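The idea of moving workers to the data can be sketched as a placement plan: one map task per file, started on whichever node already stores that file. The node names, the `file_locations` table, and `assign_mappers` are hypothetical; a real scheduler also has to deal with replicas, busy nodes, and failures.

```python
# Hypothetical cluster state: which node stores each input file on local disk.
file_locations = {
    "file1.txt": "node-a",
    "file2.txt": "node-b",
    "file3.txt": "node-a",
}

def assign_mappers(file_locations):
    """Plan one map task per file, placed on the node that already holds it."""
    plan = {}
    for filename, node in file_locations.items():
        plan.setdefault(node, []).append(filename)
    return plan

plan = assign_mappers(file_locations)
```

With this plan no input file has to cross the network before its mapper starts reading it.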
Map Reduce
Programmers specify two functions:
map (k1, v1) → [(k2, v2)]
reduce (k2, [v2]) → [(k3, v3)]
All values with the same key are sent to the same reducer
The execution framework handles everything else
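One common way to guarantee that all values with the same key reach the same reducer is to hash the key and take the result modulo the number of reducers. The sketch below assumes that scheme; the `partition` and `shuffle` names and the sample pairs are illustrative only.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    # crc32 is used because Python's built-in hash() for str varies between runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Group intermediate (key, value) pairs into one dict per reducer."""
    reducers = [{} for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)].setdefault(key, []).append(value)
    return reducers

# Intermediate pairs as several mappers might emit them (made-up names).
pairs = [("Rivers", 1), ("Chen", 1), ("Rivers", 1), ("Okafor", 1), ("Rivers", 1)]
reducers = shuffle(pairs, num_reducers=3)
```

Because the reducer index depends only on the key, mappers on different machines agree on the destination without talking to each other.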
Everything Else
The execution framework handles everything else
Scheduling: assigns workers to map and reduce tasks
Data distribution: moves processes to data
Synchronization: gathers, sorts, and shuffles intermediate data
Errors and faults: detects worker failures and restarts
You don't know:
Where mappers and reducers run
When a mapper or reducer begins or finishes
Which input a particular mapper is processing
Which intermediate key a particular reducer is processing
What can you do?
Cleverly structure intermediate data to reduce network traffic
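One way to structure intermediate data, offered here as an example and not necessarily the technique the slide has in mind, is to aggregate inside the mapper so that each mapper emits one pair per distinct key instead of one pair per mention. The roster and the function names below are assumptions.

```python
from collections import Counter

ROSTER = {"Rivers", "Okafor", "Chen"}  # made-up names for illustration

def map_naive(text):
    """Emit one (name, 1) pair per mention."""
    return [(w, 1) for w in text.split() if w in ROSTER]

def map_combined(text):
    """Pre-aggregate in the mapper: one (name, count) pair per distinct name."""
    counts = Counter(w for w in text.split() if w in ROSTER)
    return sorted(counts.items())

text = "Rivers Rivers Chen Rivers Okafor Chen"
naive = map_naive(text)        # one pair for each of the 6 mentions
combined = map_combined(text)  # one pair for each of the 3 distinct names
```

A reducer that sums its values produces the same totals from either output, so the smaller output only changes how much data is sent between machines.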
Distributed File Systems
Companies like Google, Apple and Facebook run map reduce jobs over 10,000+ machines at multiple locations (called datacenters) hourly!
Challenge: Improve throughput for distributed file systems (and hence map reduce)
Namenode Responsibilities
Managing the file system namespace:
Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
Coordinating file operations:
Directs clients to datanodes for reads and writes
No data is moved through the namenode
Maintaining overall health:
Periodic communication with the datanodes
Block re-replication and rebalancing
Garbage collection
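The split between metadata and data can be illustrated with a toy model: the namenode answers "where are the blocks?" and the client then fetches the bytes from datanodes directly. This is a simplified in-memory sketch, not the HDFS API; the class, the block ids, the datanode names, and the file path are all hypothetical.

```python
class Namenode:
    """Toy namenode: holds metadata only, never file contents."""

    def __init__(self):
        self.file_to_blocks = {}      # file path -> ordered list of block ids
        self.block_to_datanodes = {}  # block id -> datanodes holding a replica

    def add_file(self, path, placements):
        """Record a file as a list of (block id, [datanode names])."""
        self.file_to_blocks[path] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_to_datanodes[block_id] = list(nodes)

    def locate(self, path):
        """Tell a client where each block of a file can be read from."""
        return [(b, self.block_to_datanodes[b]) for b in self.file_to_blocks[path]]

# Datanodes hold the actual bytes (in-memory stand-in for local disks).
datanodes = {
    "dn1": {"blk_1": b"Rivers scored "},
    "dn2": {"blk_1": b"Rivers scored ", "blk_2": b"30 points."},
    "dn3": {"blk_2": b"30 points."},
}

namenode = Namenode()
namenode.add_file(
    "/pages/file1.txt",
    [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
)

def read_file(path):
    """Client read: ask the namenode for locations, then fetch from datanodes."""
    data = b""
    for block_id, nodes in namenode.locate(path):
        data += datanodes[nodes[0]][block_id]  # read from the first listed replica
    return data

contents = read_file("/pages/file1.txt")
```

Note that `Namenode` never touches the byte strings; it only returns block ids and datanode names, which mirrors the point that no data moves through the namenode.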