, , ,

[SOLVED] Cs537 p6: concurrency (map-reduce)

$25

File Name: Cs537_p6__concurrency__map_reduce_.zip
File Size: 320.28 KB

5/5 - (1 vote)
5/5 – (1 vote)

In 2004, engineers at Google introduced a new paradigm for large-scale parallel data processing known as MapReduce (the original paper is here.  As a bonus, see that authors of the the citations at the end).  MapReduce makes developers efficiently and easily program large-scale clusters, especially for tasks with a lot of data processing.  With Map Reduce, the developer can focus on writing their task as a set of Map and Reduce functions, and let the underlying infrastructure handle parallelism/concurrency, machine crashes, and other complexities common within clusters of machines.

In this project, you’ll be building a simplified version of MapReduce for just a single machine. While it may seem simple to build MapReduce for a single machine, you will still face numerous challenges, mostly in building the correct concurrency support. You’ll have to think about how to design your MapReduce implementation, and then build it to work efficiently and correctly.

There are three specific objectives to this assignment:

  • To gain an understanding of the fundamentals of the MapReduce paradigm.
  • To implement a correct and efficient MapReduce framework using threads and related functions.
  • To gain experience designing and developing concurrent code.

Deliverables:

  • Two files named mapreduce.c and mapreduce.h
  • The files will be compiled with some test files. It should compile successfully when compiled with the -Wall and -Werror flags

Administrivia:

  • This project may be performed in groups of 2.
  • Due Date: Aug 9th at 11:59PM (slip day policy will apply to both people in the the group)
  • The final submission will be tested on lab machinesLinks to an external site., so make sure to test your code on these machines.

Background

To understand how to make progress on any project that involves concurrency, you should understand the basics of thread creation, mutual exclusion (with locks), and signaling/waiting (with condition variables). These are described in the following book chapters:

Read these chapters carefully in order to prepare yourself for this project.


General Idea

Let’s now get into the exact code you’ll have to build. The MapReduce library/infrastructure you will build should support the execution of user-defined Map() and Reduce() functions.

As from the original paper: “Map(), written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key K and passes them to the Reduce() function.”

“The Reduce() function, also written by the user, accepts an intermediate key K and a set of values for that key. It merges together these values to form a possibly smaller set of values; typically just zero or one output value is produced per Reduce() invocation. The intermediate values are supplied to the user’s reduce function via an iterator.”

A classic example, written here in pseudocode (note: your implementation will have different interfaces), shows how to count the number of occurrences of each word in a set of documents:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediateValues(w, "1");
        
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseIntermediate(v);
    print key, result;

The following image from https://dzone.com/articles/word-count-hello-word-program-in-mapreduce might help you understand the behavior of the MapReduce framework for this sample wordcount program.

111.png

What’s fascinating about MapReduce is that so many different kinds of relevant computations can be mapped onto this framework. The original paper lists many examples, including word counting (as above), a distributed grep, a URL frequency access counters, a reverse web-link graph application, a term-vector per host analysis, and others.

What’s also quite interesting is how easy it is to parallelize: many mappers can be running at the same time, and later, many reducers can be running at the same time. Users don’t have to worry about how to parallelize their application; rather, they just write Map() and Reduce() functions and the infrastructure does the rest.

Code Overview

We give you here the mapreduce.h header file that specifies exactly what you must build in your MapReduce library:

#ifndef __mapreduce_h__
#define __mapreduce_h__

// Different function pointer types used by MR
typedef char *(*Getter)(char *key, int partition_number);
typedef unsigned long (*Partitioner)(char *key, int num_partitions);
typedef void (*Mapper)(char *file_name);
typedef void (*Reducer)(char *key, Getter get_func, int partition_number);

unsigned long MR_DefaultHashPartition(char *key, int num_partitions);

// External functions: these are what *you must implement*
void MR_Emit(char *key, char *value);

void MR_Run(int argc, char *argv[], 
	    Mapper map, int num_mappers, 
	    Reducer reduce, int num_reducers, 
	    Partitioner partition, int num_partitions);

#endif // __mapreduce_h__

You must implement the MR_Run and MR_Emit functions. In addition, you will have to implement a get_next function that is described in the next section.

  1. MR_Run: The most important function is MR_Run, which takes the command line parameters of a given program: a pointer to a Map function (type Mapper, called map), the number of mapper threads your library should create (num_mappers), a pointer to a Reduce function (type Reducer, called reduce), the number of reducers (num_reducers), a pointer to a Partition function (partition, described below), and the number of partitions. Thus, when a user is writing a MapReduce computation with your library, they will implement a Map function, implement a Reduce function, possibly implement a Partition function, and then call MR_Run(). The infrastructure will then create threads as appropriate and run the computation.
    One basic assumption is that the library will create num_mappers threads (in a thread pool) that perform the map task. Another is that your library will create num_reducers threads (in a thread pool) to perform the reduction tasks. Finally, your library will create some kind of internal data structure to pass keys and values from mappers to reducers; more on this below.
  2. MR_Emit: The MR_Emit function is another key part of your library; it needs to take key/value pairs from the many different mappers and store them in a partition such that later reducers can access them, given constraints described below. Designing and implementing this data structure is thus a central challenge of the project.

 

Simple Example: Wordcount

Here is a simple (but functional) wordcount program written to use the MapReduce library:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mapreduce.h"

void Map(char *file_name) {
    FILE *fp = fopen(file_name, "r");
    assert(fp != NULL);

    char *line = NULL;
    size_t size = 0;
    while (getline(&line, &size, fp) != -1) {
        char *token, *dummy = line;
        while ((token = strsep(&dummy, " t
r")) != NULL) {
            MR_Emit(token, "1");
        }
    }
    free(line);
    fclose(fp);
}

void Reduce(char *key, Getter get_next, int partition_number) {
    int count = 0;
    char *value;
    while ((value = get_next(key, partition_number)) != NULL)
        count++;
    printf("%s %d
", key, count);
}

int main(int argc, char *argv[]) {
    MR_Run(argc, argv, Map, 10, Reduce, 10, MR_DefaultHashPartition, 10);
}

Let’s walk through this code, in order to see what it is doing and describe the requirements for this project.

First, notice that Map() is called with a file name. In general, we assume that this type of computation is being run over many files; each invocation of Map() is thus handed one file name and is expected to process that file in its entirety.

In this example, the code above just reads through the file, one line at a time, and uses strsep() to chop the line into tokens. Each token is then emitted using the MR_Emit() function, which takes two strings as input: a key and a value. The key here is the word itself, and the token is just a count, in this case, 1 (as a string). It then closes the file.

After the mappers are finished, your library should have stored the key/value pairs(passed to MR_Emit) in partitions such that the reducers can access later. Your implementation must ensure that partition i is sent to a reducer before partition i+1.

Reduce() is invoked per key, and is passed the key along with a function that enables iteration over all of the values that produced that same key. To iterate, the code just calls get_next() repeatedly until a NULL value is returned; get_next returns a pointer to the value passed in by the MR_Emit() function above, or NULL when the key’s values have been processed. The output, in the example, is just a count of how many times a given word has appeared, and is just printed to standard output.

All of this computation is started off by a call to MR_Run() in the main() routine of the user program. This function is passed the argv array, and assumes that argv[1] … argv[n-1] (with argc equal to n) all contain file names that will be passed to the mappers.

One interesting function that you also need to pass to MR_Run() is the partitioning function. For simplicity, all the test programs will use the default function (MR_DefaultHashPartition), which is defined below. Copy it into your file:

unsigned long MR_DefaultHashPartition(char *key, int num_partitions) {
    unsigned long hash = 5381;
    int c;
    while ((c = *key++) != '')
        hash = hash * 33 + c;
    return hash % num_partitions;
}

The function’s role is to take a given key and map it to a number, from 0 to num_partitions-1. Its use is internal to the MapReduce library, but critical. Specifically, your MR library should use this function to decide which partition (and hence, which reducer thread) gets a particular key/list of values to process. Note that the number of partitions can be greater (or smaller) than the number of reducers; thus, some reducers may need to handle multiple (or no) partitions.

For some applications, which reducer thread processes a particular key is not important (and thus the default function above should be passed in to MR_Run()); for others, it is, and this is why the user can pass in their own partitioning function as need be.

One last requirement: For each partition, keys (and the value list associated with said keys) should be sorted in ascending key order; thus, when a particular reducer thread (and its associated partition) are working, the Reduce() function should be called on each key in order for that partition.  For this sort within a partition, the sorted order should be the lexical order provided by strcmp() over the original keys.


Considerations

Here are a few things to consider in your implementation:

  • Thread Management: This part is fairly straightforward. You should create num_mappers mapping threads, and assign a file to each Map() invocation in some manner you think is best (e.g., Round Robin, Shortest-File-First, etc.). Think about which algorithm might lead to the best performance? You should also create num_reducers reducer threads at some point (either at the beginning or after the mappers finish), to work on the emitted key value pairs.
  • Partitioning and sorting: Your central data structure should be concurrent, allowing mappers to each put values into different partitions correctly and efficiently. Once the mappers have completed, a sorting phase should order the key/value-lists. Then, finally, each reducer thread should start calling the user-defined Reduce() function on the keys in sorted order per partition. You should think about what type of locking is needed throughout this process for correctness.
  • Memory Management One last concern is memory management. The MR_Emit() function is passed a key/value pair; it is the responsibility of the MR library to make copies of each of these. However, avoid making too many copies of them since the goal is to design an efficient concurrent data structure. Then, when the entire mapping and reduction is complete, it is the responsibility of the MR library to free everything.

Testing and handin instructions

  • Tests will be provided at ~cs537-1/tests/p6.
  • Handing it in: Copy your files to ~cs537-1/handin/login/p6 where login is your CS login. Do NOT use this handin directory for your work space.  You should keep a separate copy of your project files in your own home directory and then simply copy the relevant files to this handin directory when you are done.  The permissions to this handin directory will be turned off promptly when the deadline passes and you will no longer be able to modify files in that directory.
    Each project partner should turn in their joint code to each of their handin directories. Each person should place a file named partners.txt in their handin/p6 directory, so that we can tell who worked together on this project.  The format of partners.txt should be exactly as follows:

    cslogin1 wiscNetid1 Lastname1 Firstname1
    cslogin2 wiscNetid2 Lastname2 Firstname2

    It does not matter who is 1 and who is 2.  If you worked alone, your partners.txt file should have only one line.  There should be no spaces within your first or last name; just use spaces to separate fields. To repeat, both project partners should turn in their code and both should have a turnin/p2a/partners.txt file.

Shopping Cart

No products in the cart.

No products in the cart.

[SOLVED] Cs537 p6: concurrency (map-reduce)[SOLVED] Cs537 p6: concurrency (map-reduce)
$25