1 Project Description
A makefile is often used to simplify the compilation and build process of large software tools via
the GNU make [2] command. The software installation targets are primarily UNIX-like operating
systems. A typical makefile has the following structure [1]:
target: dependencies
	recipe(s)
Given a makefile filename and an optional target as command line arguments, your task is
to generate an executable 'mymake' from a C program. It must parse the makefile and build a
dependency graph for the (optionally) provided target. It must also execute the recipes for that target
by spawning a new process per recipe, via fork and exec, in the order determined by the graph traversal. If
no target is specified as a command line argument, the first target in the makefile is chosen.
Alternatively, if your first argument is a print flag (-p) followed by the makefile, you must only
print the targets, dependencies, and recipes in the makefile. A second type of flag (-r) followed by
the makefile must print the recipe order of the target. Do not fork/exec the recipes when running
with either of these flags. More details are described in the following three phases:
1.1 Parsing the input makefile
The makefile consists of multiple targets, and each target describes the dependencies that must be
satisfied before executing the recipes within that target. Recipes, rules, and commands mean the same
thing. In this project's makefile, each target can have multiple dependencies, with one recipe
on a new line. Targets and dependencies are separated by a colon (:). Multiple dependencies for
a target are separated by spaces. Each recipe for a target must be indented via a single tab (\t,
not 2/3/4/8 spaces). Recipes within the same target must be executed in the order they appear.
Blank lines are optional and improve readability. More assumptions about the file are presented
later, in Section 5.
To simplify your job, we will provide a helper function that reads the contents of the makefile
into a two-dimensional char array. The first step is to parse these lines from the char array to
accurately determine targets, dependencies, and recipes. We suggest adopting the technique described
in makeargv.c from the textbook https://usp.cs.utsa.edu/usp/programs/chapter02/. One sample
makefile is shown below:
all : dep1 dep2
	ls -l
dep1: dep3
	gcc -o file1.o -c file4.c file1.c
dep2:
	gcc -o file3.o -c file3.c
dep3: dep4
	gcc -o file2.o -c file2.c
dep4 :
	mkdir myproject
With the print flag and no command line target, the above makefile runs as ./mymake -p makefile_in. The program must print
the target name, the dependency count, and the recipe count for each target in the makefile.
The output for one target can look like:
target ’all’ has 2 dependencies and 1 recipe
Dependency 0 is dep1
Dependency 1 is dep2
Recipe 0 is ls -l
Similarly, you must print the remaining 4 targets and their respective dependencies and recipes to get
full credit for this phase.
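For reference, here is a minimal parsing sketch in C under the Section 5 assumptions; parse_rule and its parameters are illustrative names, not part of the provided helper:

#include <stdio.h>
#include <string.h>

#define MAX_DEP 8

/* Split one "target : dependencies" line in place. Returns the number
   of dependencies, or -1 if the line is not a rule line. Recipe lines
   are recognized elsewhere by their leading tab character. */
int parse_rule(char *line, char **target, char *deps[]) {
    char *colon = strchr(line, ':');
    if (colon == NULL)
        return -1;
    *colon = '\0';
    *target = strtok(line, " \t\n");     /* trims spaces around the name */
    int n = 0;
    for (char *tok = strtok(colon + 1, " \t\n");
         tok != NULL && n < MAX_DEP; tok = strtok(NULL, " \t\n"))
        deps[n++] = tok;                 /* pointers into the line buffer */
    return n;
}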
1.2 Determining the order of recipes
The makefile from Section 1.1, when run with the recipe flag as ./mymake -r makefile_in on the
command line, must print the order of recipes. It must pick all and visit targets in the
order dep4, dep3, dep1, dep2. Finally, it prints the following lines:
mkdir myproject
gcc -o file2.o -c file2.c
gcc -o file1.o -c file4.c file1.c
gcc -o file3.o -c file3.c
ls -l
Running ./mymake -r makefilename will print all the recipes in order of execution; no target
may be passed on the command line.
1.3 Fork and exec the recipes
Using the data structures that model the graph, execute the recipes by forking a new process per
recipe. Your program should determine which recipes are eligible to run, execute those recipes,
wait for any recipe to finish, and then repeat this process until all recipes have finished executing.
If instead the program is run with a specific target, say dep2 (./mymake makefile_in dep2), your
program must print and then fork/exec only one recipe: gcc -o file3.o -c file3.c
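A minimal sketch of running one recipe string via fork, exec, and wait follows; run_recipe and the 64-token limit are illustrative, not part of the provided framework:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, exec the recipe, and wait for it to finish.
   Returns the child's exit status, or -1 on failure. */
int run_recipe(char *recipe) {
    char *argv[64];
    int argc = 0;
    for (char *tok = strtok(recipe, " \t\n");
         tok != NULL && argc < 63; tok = strtok(NULL, " \t\n"))
        argv[argc++] = tok;
    argv[argc] = NULL;

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {                  /* child: replace its image */
        execvp(argv[0], argv);
        perror("execvp");            /* only reached if exec failed */
        exit(EXIT_FAILURE);
    }
    int status;
    waitpid(pid, &status, 0);        /* parent reaps: no zombies */
    return status;
}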
2 Implementation:
We will provide an example framework to get started and examples to test the code. However, you
are free to use your own implementation without following the given structure. In the folder pa1,
the src folder contains the code and test contains the test cases. The data structure used to store
information about a target in the desired 'container' looks like:
typedef struct target_block {
char *name; //Target name
char *depend[MAX_DEP]; // dependencies of target
char *recipe[MAX_RECIPES_PT]; // recipes per target
unsigned char dep_count; // number of dependencies in target
unsigned char recipe_count; // number of recipes in target
unsigned char visited; // used in DFS
} target;
The dependency graph uses the result of topological sort [3], an application of the Depth First
Search (DFS) algorithm. Given a set of targets, dependencies, and recipes, one must construct a directed
graph by determining the nodes and edges as targets and dependencies, respectively. Alternatively,
a less modular design involves inserting the targets into a stack based on the DFS traversal, avoiding
building a graph.
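A sketch of the DFS ordering over the target struct above; find_target, the targets[] array, and the order[] output are assumed helpers, not the provided framework:

#include <stddef.h>
#include <string.h>

/* Look up a dependency by name; NULL means it is a plain file. */
target *find_target(target targets[], int count, const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(targets[i].name, name) == 0)
            return &targets[i];
    return NULL;
}

/* Post-order DFS: every dependency lands in order[] before the
   target that needs it, which is exactly the recipe order. */
void visit(target *t, target targets[], int count, target *order[], int *n) {
    if (t == NULL || t->visited)
        return;
    t->visited = 1;
    for (int i = 0; i < t->dep_count; i++)
        visit(find_target(targets, count, t->depend[i]),
              targets, count, order, n);
    order[(*n)++] = t;
}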
3 Execution:
There are three ways to execute the project:
$ ./mymake filename [target]
$ ./mymake [-p] filename
$ ./mymake [-r] filename
Your executable can take up to two command line arguments. The filename argument is mandatory. If the optional arguments are provided, they must be in the fixed positions shown. Any other variation
must error out.
4 Testing
As described in Section 2, to test your code against the provided test cases within the framework, run
$ make -i test-main. The -i flag ignores error return values and lets you test the whole suite. We
will use more test cases than the ones provided. If you have not attempted the extra credit, discussed
later, you must modify the makefiles to suit your code.
5 Assumptions:
To simplify the project, you can assume the following about the makefile:
• The makefile is well formed and has no circular dependencies.
• It doesn't use make variables (denoted via $), macros, or wildcards.
• It can have a maximum of 128 lines, and each line is up to 128 characters long.
• Each target can have up to 8 dependencies and one recipe. This assumption is relaxed if you
are attempting the extra credit.
• There can be no more than 128 targets in the makefile. Targets having 0 dependencies and
0 recipes simultaneously do not exist. However, targets with 0 recipes are possible.
• Recipes do not span multiple lines.
• We limit the executables to those found in the default PATH environment of the CSELAB machines. Sample executables are Linux commands such as ls, cat, pwd, mkdir, or building an
executable via gcc.
• Each makefile has a single dependency tree.
• You are free to make any reasonable assumptions not stated in this document, but do mention
them early in your README file.
6 Extra credit:
Unlike the makefile described earlier, one can specify multiple recipes per target. Multiple recipes
on different lines within the same target are executed in the order they are read. Recipes
within the same line are ordered left to right. Also, multiple recipes can be run in parallel by
separating them with a comma (,). For example, assume the following recipes in your makefile
named makefile_ec:
all : dep1 dep2
	ls -l
dep1: dep3
	gcc -o file1.o -c file4.c file1.c
dep2:
	gcc -o file3.o -c file3.c
dep3: dep4
	gcc -o file2.o -c file2.c
	gcc -o file4.o -c file4.c
dep4 :
	mkdir myproject, pwd
	cat file1.c
Execute this makefile with no target as $ ./mymake makefile_ec. The program must pick all
and look for dependencies in the order dep4, dep3, dep1, dep2. The recipes are executed as:
mkdir myproject
pwd
cat file1.c
gcc -o file2.o -c file2.c
gcc -o file4.o -c file4.c
gcc -o file1.o -c file4.c file1.c
gcc -o file3.o -c file3.c
ls -l
For dep4, do not execute cat file1.c until both of the previous commands have finished executing. The
parent must use the wait system call on both commands to avoid zombie processes. If your group
chooses to attempt the extra credit, worth an additional 10% of the project grade, you must correctly
handle both parallel execution of recipes and multiple recipes per target within the same program.
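A sketch of one comma-separated recipe line run in parallel; spawn_recipe is an assumed helper (the fork/exec half of the Section 1.3 sketch, returning the child pid without waiting):

#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

pid_t spawn_recipe(char *cmd);           /* assumed helper, see Section 1.3 */

void run_parallel_line(char *line) {
    pid_t pids[8];
    int n = 0;
    for (char *cmd = strtok(line, ","); cmd != NULL && n < 8;
         cmd = strtok(NULL, ","))
        pids[n++] = spawn_recipe(cmd);   /* start all commands at once */
    for (int i = 0; i < n; i++)
        waitpid(pids[i], NULL, 0);       /* reap every child before the next line */
}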
7 Deliverables:
One student from each group should upload to Canvas a zip file containing their C source code
files, a makefile, and a README that includes the following details:
• The purpose of your program
• How to compile the program
• What exactly your program does
• Any assumptions outside this document
• Team names and x500
• Your and your partner's individual contributions
• If you have attempted extra credit
The README file does not have to be long, but it must properly describe the above points. Proper
in this case means that a first-time user can answer the above questions without any confusion. Within
your code, you should use one or two comments to describe each function that you write. You do not
need to comment every line of your code. However, you might want to comment portions of your
code to explain why, rather than how, you implement the said code. At the top of your README
file and main C source file, please include the following comment:
/*test machine: CSELAB_machine_name
* date: mm/dd/yy
* name: full_name1, [full_name2]
* x500: id_for_first_name, [id_for_second_name]
*/
8 Grading Rubric:
1. 10% Correct README contents
2. 10% Code quality such as using descriptive variable names, modularity, comments, indentation. You must stick to one style throughout the project. If you do not already have a style
for C programming, we suggest adopting K&R style [4].
3. 15% for using meaningful data structures and appropriate error checking and handling
(a) 10% No program crashes or segfault
(b) 5% uses appropriate data structures
4. 20% Parse the makefile accurately – determined via -p flag
5. 15% Implement the dependency graph algorithm for identifying recipe order – determined
via -r flag
6. 30% Correctly call fork, exec, wait system calls not leading to fork bombs or zombie processes.
7. 10% Extra credit
If your code passes the test cases, you will get 75% of the credit. The remaining 25% depends
on the implementation quality and documentation, as described in points 1, 2 & 3b.
9 References:
1. Makefile rules: https://en.wikipedia.org/wiki/Makefile#Rules
2. GNU Make: https://www.gnu.org/software/make/manual/make.html
3. Topological sorting: https://en.wikipedia.org/wiki/Topological_sorting
4. K&R Style guide: https://en.wikipedia.org/wiki/Indentation_style#K&R_style
5. AND operator in make: https://www.gnu.org/software/make/manual/make.html#Execution
Jeffrey Dean and Sanjay Ghemawat from Google published a research paper [1] in 2004 on a new programming model that can process very large datasets in an efficient, distributed manner. This programming model is called "MapReduce".
MapReduce consists of two main phases, Map and Reduce. In the Map phase, a user-written map
function processes input <key, value> pairs into intermediate <key, value> pairs. Then, in the Reduce phase, a reduce function combines the intermediate pairs based on their keys to give the final output.
Since the input dataset can be very large, there will be multiple map and reduce jobs, and it is essential to
maintain a synchronized system.
Let us consider the example of counting the number of occurrences of each word in a corpus of documents.
The map function emits each word (key) it sees with a count '1' (value). The reduce function sums together
all counts emitted for a particular word (key). The pseudocode looks as follows:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
The above example is taken from [1].
In this project, we will mimic certain core functionalities of the MapReduce programming model using OS
system calls.
Concepts covered in this project:
• Process spawning
• Directory traversal
• File I/O operations
• Pipes
• Redirection
You are given a folder with a multi-level hierarchy, i.e., folders containing multiple levels of folders and text
files. Each text file has one word per line. Your task is to count the number of words per letter of the
alphabet, i.e., compare the first letter of each word and increment the count corresponding to that letter,
across the entire folder hierarchy. The result should be reported in alphabetical order.
Key parties involved in the project:
• Master process – parent process of all the spawned processes
• Mapper processes – execute the map function on the partitioned data. The number of mapper
processes is specified at execution time as an input argument
• Reducer process – executes the reduce function over the results from all the mapper processes. For
simplicity, we have only a single reducer process in this project.
There are four phases in this project. First is the data partitioning phase, where the 'master' traverses
the folder hierarchy, identifies all the files, and splits them equally among the mapper processes.
During the second phase, the 'mappers' process the files allotted to them by the 'master', and each of
them comes up with a list of per-letter word counts. In the third phase, each 'mapper' sends its
list to the 'reducer' process, which combines them all into a single list. In the last phase, the final list
is taken by the 'master' and reported.
Now let us look at the project implementation details. An example is provided for your understanding
after the explanation of each phase.
Notice: You may use any algorithm that will help you reach the final goals. For each phase there
is an expected output, unless otherwise specified. It should be in the format specified, i.e., if the
results are to be stored as a text file with a specific name format in a folder with a specific name, you
should follow it.
4.1 Phase 1 - Data Partitioning
This phase aims at traversing the folder hierarchy and partitioning the files equally among the 'mapper'
processes.
The 'master' process creates a folder "MapperInput" in the current directory. It then recursively traverses the 'Sample' folder hierarchy (assuming 'Sample' is the relative or absolute path of the
folder that you passed to the executable) and divides the list of file paths of text files with words among 'm'
files of name format 'Mapper_i.txt', where 'm' is the number of 'mappers' and i belongs to [1, 2, . . . , m].
The 'Mapper_i.txt' files should be created inside 'MapperInput'. You may use any partitioning logic that allows
you to have nearly the same number of file paths in each text file. For example, let there be 6 text files in the
'Sample' hierarchy and 3 mapper processes. Then each of the three 'Mapper_i.txt' files created will have
two file paths. In case the count of files is not a multiple of the number of mappers, add the extra files to
'Mapper_i.txt' in a round robin fashion.
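A minimal traversal sketch for this phase; collect, the fixed-size path buffers, and the caller-provided paths array are illustrative choices:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Recursively walk dir, recording every regular file's path. */
void collect(const char *dir, char paths[][256], int *n) {
    DIR *d = opendir(dir);
    if (d == NULL)
        return;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        char path[256];
        snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
        struct stat sb;
        if (stat(path, &sb) == -1)
            continue;
        if (S_ISDIR(sb.st_mode))
            collect(path, paths, n);      /* recurse into subfolder */
        else if (S_ISREG(sb.st_mode))
            strcpy(paths[(*n)++], path);  /* record a text file */
    }
    closedir(d);
}

The recorded paths can then be dealt out round robin across the Mapper_i.txt files.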
Notice: Assume that the number of files is always greater than or equal to the number of mappers,
except for the case of an empty folder.
The expected outputs from this phase are the "Mapper_i.txt" files inside the "MapperInput" folder.
In Figure 1, ’Sample’ is the folder passed as input to your executable. F1, F2, . . . are the folders and
Tfile*.txt are the text files with words.
Figure 1: Data Partitioning
4.2 Phase 2 - Map
The master process creates 'm' pipes, one per mapper, for communication with the reducer process. The master then spawns the mapper processes. Each mapper process picks one file from "MapperInput",
opens each file path listed in it, and finds the number of words corresponding to each letter of the alphabet.
This information is then sent to the reducer process via pipe by each mapper. Note that words are processed
case-insensitively, i.e., 'a' and 'A' are both counted as 'A'.
No output is expected from this phase. The grading will be carried out based on the per-letter word
count algorithm and the pipe setup and usage.
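A sketch of the pipe and fork setup described above; NUM_MAPPERS and the 26-integer message are illustrative, the reducer side is shown later, and closing of inherited descriptors is trimmed for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_MAPPERS 3

int main(void) {
    int fds[NUM_MAPPERS][2];
    for (int i = 0; i < NUM_MAPPERS; i++) {
        if (pipe(fds[i]) == -1) {
            perror("pipe");
            exit(EXIT_FAILURE);
        }
        pid_t pid = fork();
        if (pid == 0) {                    /* mapper child */
            close(fds[i][0]);              /* child only writes */
            int counts[26] = {0};
            /* ... fill counts from this mapper's Mapper_i.txt ... */
            write(fds[i][1], counts, sizeof counts);
            close(fds[i][1]);
            exit(EXIT_SUCCESS);
        }
        close(fds[i][1]);                  /* parent keeps the read end */
    }
    /* read ends fds[i][0] are later handed to the reducer process */
    return 0;
}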
In Figure 2, assume there are two file paths per 'Mapper_i.txt'. The 'master' process forks the 'mappers'.
Mapper1 processes 'Mapper_1.txt' and Mapper2 processes 'Mapper_2.txt'. Each mapper has a list
that keeps track of the per-letter word count.
Notice: Please do not assume that the process IDs of the mappers are [1, 2, . . . ]. It is up to the OS to
decide the process ID, so there won't be a one-to-one mapping between the names of the text files and the
mapper process IDs.
Figure 2: Map
4.3 Phase 3 - Reduce
The reducer process receives the lists from each of the mapper processes via pipes and combines them
into a single list. The list is then written to a text file "ReducerResult.txt" in the current folder.
Each line in "ReducerResult.txt" has the format given below. There is exactly one space between
'letter_of_alphabet' and 'wordcount'.
letter_of_alphabet wordcount
The expected output from this phase is "ReducerResult.txt".
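A sketch of the reduce step: sum the 26-integer lists arriving on each pipe's read end, then write the result file. The read_fds array is assumed to hold the read ends created in Phase 2:

#include <stdio.h>
#include <unistd.h>

void reduce(int read_fds[], int m) {
    int total[26] = {0};
    for (int i = 0; i < m; i++) {
        int counts[26];
        if (read(read_fds[i], counts, sizeof counts) == (ssize_t)sizeof counts)
            for (int c = 0; c < 26; c++)
                total[c] += counts[c];     /* combine one mapper's list */
        close(read_fds[i]);
    }
    FILE *out = fopen("ReducerResult.txt", "w");
    if (out == NULL) {
        perror("fopen");
        return;
    }
    for (int c = 0; c < 26; c++)
        fprintf(out, "%c %d\n", 'A' + c, total[c]);  /* one space */
    fclose(out);
}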
In Figure 3, the 'reducer' receives lists from Mapper1 and Mapper2 via two pipes. The lists are then combined
and written to ReducerResult.txt in the current folder.
Figure 3: Reduce
4.4 Phase 4 - Final Result
The mapper processes will have exited by now. The 'master' process waits for the reducer to finish. It
then reads the results from 'ReducerResult.txt' and reports them to standard output. The catch here is
that standard output is redirected to a file "FinalResult.txt" in the current folder. In MapReduce, the
'master' process exits at the end, after all the other processes have completed. We are emulating that in
this phase.
The expected output from this phase is FinalResult.txt, which has the same format as ReducerResult.txt.
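One standard way to do the required redirection is the dup2 system call; a minimal sketch:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("FinalResult.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    dup2(fd, STDOUT_FILENO);   /* stdout now writes to the file */
    close(fd);
    printf("A 5\n");           /* lands in FinalResult.txt */
    return 0;
}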
In Figure 4, the ’master’ will redirect the results from standard output to ’FinalResult.txt’.
Figure 4: Final Result
5 Execution
The executable 'mapreduce' for the project accepts two parameters:
Command Line
$ ./mapreduce folderName #mappers
folderName is the name of the root folder to be traversed
#mappers is the number of mapper processes
6 Expected output
• The Sample folder is empty or there are no files in the Sample hierarchy
Command Line
$ ./mapreduce Sample 4
The Sample folder is empty
• The Sample folder has files in its hierarchy
Command Line
$ ./mapreduce Sample 4
The result is present in FinalResult.txt (below format) in the current folder
A 5
B 10
C 3
.
.
.
Z 12
7 Testing
The results from the executable 'mapreduce' generated from your program will be tested on folders of
multiple levels. The number of files present in the folders will range from 'm' to 500, where
'm' is the number of mappers. Note that in the case of an empty folder, the file count will be zero.
Notice: An exact pattern-matching algorithm will be used for grading the results obtained, so
please be sure to adhere to the format.
To test your code with the sample test folders, run
Command Line
$ make test-mapreduce
Note that if you want to run an individual test case explicitly, you will have to do it without the test-mapreduce
target. Also, there is no auto-checking enabled for the correctness of results. The Expected_Output folder
in Testcases has the expected output for each of the given test folders.
8 Assumptions
• Number of masters = 1, 0 < Number of mappers <= 32 and Number of reducers = 1.
• File count >= number of mappers except for the case of empty folder.
• Only consider the file count when splitting data equally among mapper processes. Do not look at the
size of the files.
• Use C library file operations (FILE *, fgets, fputs, fscanf, fprintf, and so on) to read from and write
to files in Phase 1, Phase 2, and Phase 3.
• In Phase 4, use OS system calls to redirect standard output to the file.
• Use pipes to communicate between Phase 2 and Phase 3.
• The executable name should be 'mapreduce', as specified in the makefile.
• The template code consists of 4 phase*.c files. Add the code for each phase as functions in the corresponding
files and call them from main.c.
• Ensure you add the function prototypes to the corresponding phase*.h files in the 'include' folder.
• You are expected to provide proper guard mechanisms for header files.
• Ensure proper error handling mechanisms for file operations and pipe handling.
• Expect empty and multilevel folder hierarchies.
9 Extra credit
Symbolic link or soft link is a kind of file that points to another file name. This can be extended to directories as well. Refer to [4] for more details.
Consider the following example
Folder1/file1.txt ——> Folder5/file6.txt ====> physical data
Here file1.txt in Folder1 is a symbolic link for file6.txt in Folder5. One can interchangeably use file1.txt
and file6.txt to read and write from the same physical location in the memory. The same can apply to
directories.
In Phase 1, there can be folders or files that are symbolic links. Your task is to identify such folders/files
and avoid reprocessing them. You may assume that all the symbolic links are valid.
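A minimal detection sketch: lstat, unlike stat, does not follow the link, so it reveals the link itself.

#include <stdio.h>
#include <sys/stat.h>

/* Returns nonzero if path is itself a symbolic link. */
int is_symlink(const char *path) {
    struct stat sb;
    if (lstat(path, &sb) == -1) {
        perror("lstat");
        return 0;
    }
    return S_ISLNK(sb.st_mode);
}

During the Phase 1 traversal, entries for which is_symlink() returns true can be skipped to avoid reprocessing the linked data.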
To test this scenario with your code run
Command Line
$ make test-extracredits
If you are attempting extra credit, mention it in the README file.
10 Deliverables
One student from each group should upload to Canvas a zip file containing their C source code files, a
makefile, and a README that includes the following details:
• The purpose of your program
• How to compile the program
• What exactly your program does
• Any assumptions outside this document
• Team names and x500
• Your and your partner's individual contributions
• If you have attempted extra credit
The README file does not have to be long, but it must properly describe the above points. Proper in this case
means that a first-time user can answer the above questions without any confusion. Within your code you
should use one or two comments to describe each function that you write. You do not need to comment
every line of your code. However, you might want to comment portions of your code to explain 'why',
rather than 'how', you implement the said code. At the top of your README file and main C source file
please include the following comment:
/*test machine: CSELAB_machine_name
 * date: mm/dd/yy
 * name: full_name1, [full_name2]
 * x500: id_for_first_name, [id_for_second_name]
 */
11 Grading Rubric
1. 10% Correct README contents
2. 10% Code quality such as using descriptive variable names, modularity, comments, indentation. You
must stick to one style throughout the project. If you do not already have a style for C programming,
we suggest adopting K&R style [5].
3. 15% Proper error handling and data structure usage
4. 15% Data partitioning and file operations
5. 20% Process handling
6. 30% Pipe communication and redirection
7. 10% Extra credit
If your code passes the test cases, you will get 75% of the credit. The remaining 25% will be based on
points 1 and 2 and proper data structure usage.
References
[1] Dean, J. and Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), pp.107-113.
[2] Robbins, K. A. and Robbins, S., 2004. Chapter 6 - Unix Systems Programming: Communication, Concurrency and
Threads, 2/E. Pearson Education India.
[3] Robbins, K. A. and Robbins, S., 2004. Chapter 4 - Unix Systems Programming: Communication, Concurrency and
Threads, 2/E. Pearson Education India.
[4] Symbolic links
https://www.gnu.org/software/libc/manual/html_node/Symbolic-Links.html
[5] K&R Style
https://en.wikipedia.org/wiki/Indentation_style#K&R_style
In multithreaded programming, threads can perform the same or different roles. In
some multithreading scenarios, like producer-consumer, the producer and consumer threads
have different functionality. In other multithreading scenarios, like many parallel
algorithms, threads have almost the same functionality.
In this programming assignment, we will work with both styles in the problem "word count statistics", which is
a multithreaded version of the MapReduce application from programming assignment 2.
An example problem is shown in Fig. 1 below. A text file contains several lines of
English characters, and this application will count the histogram of the starting letter of
each word.
Fig. 1, an example of “word count statistics”
One idea to solve this problem is to cut the file into smaller pieces and hand them
over to many workers. In this programming assignment, we will use multithreading
to create one producer to read the file and multiple consumers to process the smaller
pieces of data. We will have two forms of synchronization: a shared queue synchronized
between the producer and consumers, and a global result histogram synchronized among
consumers.
In the multithreaded "word count statistics" application, the program contains a
master thread, one producer thread, and many consumer threads (shown in Fig. 2
below). The producer and consumers share a queue for data transfer. The
program execution flow contains 4 parts: 1) the master thread
initializes the shared queue, the result histogram, the producer thread, and the consumer threads;
2) the producer thread reads the input file, cuts the data into smaller pieces, and feeds
them into the shared queue; 3) the consumers read from the shared queue, compute the
"word count statistics" for their data pieces, and synchronize the results with a global
histogram; 4) after the producer and all consumers complete their work, the master thread
writes the final result into the output file.
Fig. 2, 4 parts of multithreading “word count statistics”
In the next section, we will go through the implementation details of this project.
The implementation requirements will be provided.
Before we go into each part, the differences between PA2
(Programming Assignment 2) and PA3 (Programming Assignment 3) need to be
clarified. Even though we use the same "word count statistics" application as PA2, there
are several key differences in the input file and processing flow: 1) PA2 works on
different files located in different directories, but PA3 takes only one file as its input. 2) In PA2, each file contains multiple lines and each line contains only one word,
but PA3's input file is more "natural": it contains multiple words per line and
multiple lines per file. 3) PA2 uses multi-processing, meaning the parent
forks several children and separates the workload, but in PA3 you have only one
process with multiple threads. PA3 focuses mainly on the usage of threads and
thread-safe data structures.
The core of this multithreaded "word count statistics" application is a thread-safe
shared queue. This shared queue should be implemented as a linked-list unbounded
buffer. The producer inserts data at the tail of the linked list and the consumers
extract data from the head. Also, it should be implemented in a non-busy-waiting
way (use "mutex lock + condition variable" or "semaphore"); a sketch follows.
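A minimal sketch of such a queue, using a mutex and condition variable; the node/queue names are illustrative:

#include <pthread.h>
#include <stdlib.h>

typedef struct node {
    char *data;                /* one line of the input file */
    struct node *next;
} node;

typedef struct {
    node *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} queue;
/* Initialize with: queue q = { NULL, NULL, PTHREAD_MUTEX_INITIALIZER,
   PTHREAD_COND_INITIALIZER }; */

void enqueue(queue *q, char *data) {
    node *n = malloc(sizeof *n);
    n->data = data;
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_cond_signal(&q->nonempty);   /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

char *dequeue(queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)              /* sleep instead of busy-waiting */
        pthread_cond_wait(&q->nonempty, &q->lock);
    node *n = q->head;
    q->head = n->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    char *data = n->data;
    free(n);
    return data;
}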
In this stage, the program needs to check the input arguments and print error
messages if the arguments don't match (see Section 6, Execution, for arguments). Then it
should initialize the data structures and launch the producer/consumer
threads. The master then waits for all threads to join back (4.4 Master Finalize).
The main functionality of the producer thread is to read the input file and pass the
data into the shared queue (as a package). The file is required to be read at line
granularity (one line at a time); thus each consumer works on one line at a time.
Since there are multiple consumers, when EOF is reached the producer should send
notifications to the consumers specifying that there will be no more data. The producer
terminates after sending those notifications. Note that the package is the information
transferred between the producer and consumers via the shared queue. It should contain the data
for consumers and other information if needed.
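A producer sketch on top of the queue above; the buffer size follows Section 8's 1024-character line assumption, and a NULL package is used here as the EOF notification:

#include <stdio.h>
#include <string.h>

void produce_file(queue *q, const char *filename, int num_consumers) {
    FILE *fp = fopen(filename, "r");
    if (fp == NULL) {
        perror("fopen");
        return;
    }
    char line[1100];                    /* lines are at most 1024 chars */
    while (fgets(line, sizeof line, fp) != NULL)
        enqueue(q, strdup(line));       /* consumer frees the copy */
    fclose(fp);
    for (int i = 0; i < num_consumers; i++)
        enqueue(q, NULL);               /* one EOF notification per consumer */
}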
The consumer repeatedly checks the queue for a new package, works on it for
word count statistics, and then goes back to get the next package. This continues
until it receives the EOF notification, at which point it terminates. Besides package handling, it is
the consumer's responsibility to synchronize its result with the global final histogram.
Then, after all consumers finish their work, the master thread can summarize it and
generate the output.
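A consumer sketch: drain packages until the NULL EOF marker and fold local counts into the shared histogram under its own mutex. global_hist, hist_lock, and count_line (defined in the sketch after Fig. 3) are assumed names:

#include <pthread.h>
#include <stdlib.h>

int global_hist[26];
pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

void count_line(const char *s, int hist[26]);   /* see the Fig. 3 sketch */

void *consumer(void *arg) {
    queue *q = arg;
    char *line;
    while ((line = dequeue(q)) != NULL) {       /* NULL = EOF notification */
        int local[26] = {0};
        count_line(line, local);
        free(line);
        pthread_mutex_lock(&hist_lock);
        for (int i = 0; i < 26; i++)
            global_hist[i] += local[i];         /* synchronized merge */
        pthread_mutex_unlock(&hist_lock);
    }
    return NULL;
}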
The word count statistics check the beginning character of each word. A word is
defined as a continuous run of a-zA-Z characters (shown in Fig. 3 below). Note that
all other characters are treated as spaces; characters like '-' and '_' do not connect
words. As in PA2, we do not differentiate between uppercase and lowercase letters.
Fig. 3, the word count statistics function
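A sketch of the counting function from Fig. 3: a word is a maximal run of a-zA-Z, and only its first letter is counted, case-insensitively:

#include <ctype.h>

void count_line(const char *s, int hist[26]) {
    int in_word = 0;
    for (; *s; s++) {
        if (isalpha((unsigned char)*s)) {
            if (!in_word)                          /* first letter of a word */
                hist[tolower((unsigned char)*s) - 'a']++;
            in_word = 1;
        } else {
            in_word = 0;                           /* anything else is a space */
        }
    }
}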
4.4 Master Finalize
After the producer and consumers have joined back, the master thread writes
the final result (the global histogram) into the output file named "result.txt". The output
format should be "%c: %d\n", looping through the global histogram from 'a' to
'z' (shown in Fig. 4 below).
Fig. 4, the output file format
4.5 Log Printout
The program will also print a log file if the argument option is specified. The
log file should be named "log.txt". The producer and consumers should print their
execution information into the log:
Producer:
1. Print "producer\n" when the producer is launched
2. Print "producer: %d\n" for the current line number (starting from 0)
Consumer:
1. Print "consumer %d\n" when launched, with the consumer id (0 to
the number of consumers minus 1)
2. Print "consumer %d: %d\n" for the consumer id and the line number
it is currently working on.
There are some other notes when writing the log:
– The print library functions are usually thread-safe, so you don't need to
use a lock or worry about garbled printing (unless you encounter it).
– Since the execution order of threads is nondeterministic, you usually will
not get a stable log printout.
– The EOF notification should be printed with line number "-1".
5. Extra Credits
The extra credit will be granted if a bounded-buffer shared queue is
implemented. The application should choose between the unbounded and bounded buffer via a
command-line argument option. Also, the bounded buffer should have a configurable
buffer size specified by a command-line argument.
6. Execution Syntax, Options
The "word count statistics" program takes arguments with the following syntax:
$ ./wcs #consumer filename [option] [#queue_size]
– The second argument "#consumer" is the number of consumers the
program will create.
– The third argument "filename" is the input file name.
– Options have only three possibilities: "-p", "-b", "-bp".
1) "-p" means printing; the program will generate the log in this case.
2) "-b" means bounded buffer (extra credit); the program will use it
instead of the unbounded buffer.
3) "-bp" means both bounded buffer and log printing.
– The last argument is the queue size, used with the bounded buffer (extra
credit).
7. Testing
There will be 4 test files in the "testing" folder; each has a different
scenario specified in the first line of the test file. The solutions for the 4 test files are
contained in the "testing/sol" folder.
8. Assumptions and Hints
– One line of input file has at most 1024 chars.
– You are allowed to use library file I/O functions (instead of the open/read/write
syscalls).
– Use the keyword "extern" in the header file for global variable consistency.
– Check that a library function is thread-safe before using it in a
concurrent-threads scenario.
9. Submission
*(this section is modified from the Programming Assignment 2 document)
One student from each group should upload to Canvas a zip file containing their
C source files, a makefile, and a README that includes the following details:
– Team names and x500
– How to compile the program
– Your and your partner’s individual contributions
– Any assumptions outside this document
– What exactly your program does
– If you have attempted extra credit
The README file does not have to be long, but must properly describe the
above points. Your source code should provide appropriate comments for functions
and critical code sections (such as the purpose of the code and its logical execution flow). At
the top of your README file and each C source file please include the following
comment:
/*test machine: CSELAB_machine_name
 * date: mm/dd/yy
 * name: full_name1, [full_name2]
 * x500: id_for_first_name, [id_for_second_name]
 */
10. Grading Policy
1. (5%) correct README content
2. (5%) appropriate code style and comments
3. (40%) passed all tests (there may be some other tests when grading)
4. (15%) correct shared queue data structure and functions (insert, extract,
non-busy-waiting, etc.)
5. (5%) correct thread create and join usage
6. (10%) correct global histogram usage (like synchronization)
7. (10%) correct producer functionality
8. (10%) correct consumer functionality
9. (10%) extra credit: bounded buffer
Socket programming is a way of connecting two nodes (sockets) on a network so they can communicate with each other. One node (the server socket) listens on a particular IP address and port number, while other nodes (client sockets) reach out to the server socket to establish a connection.
Through programming assignment 4 (PA4), you will extend the word counting application of programming assignment 2 (PA2) using socket programming. Similar to PA2, there will be a master process, multiple mapper processes, and a single reducer process. However, for communication, you will use socket-based TCP connections instead of pipes, as shown in Figure 1 and Figure 2.
In PA4, you will build two executables (two separate, independent programs): a server program and a client program. The master process is the main process of the client program. The master process spawns multiple mapper processes, similar to PA2. The reducer process is the main process of the server program.
Note that the reducer process is no longer spawned by the master process. The reducer process spawns a thread whenever it establishes a new TCP connection with a client; this is called a multi-threaded server. Server, reducer, and reducer process are used interchangeably in this document. The client program has two types of clients: master clients and mapper clients. Both the master process and the mapper processes can be clients. The master client implementation is extra credit.
Details of the master client can be found in Section 5, Extra Credits. Now, let's first focus on the mapper clients. Each mapper process initiates a TCP connection to the server. Once the connection is established, the mapper client sends deterministic requests, a fixed set of requests, to the server. Mapper, mapper client, and master's child process indicate the same thing.
The relationship between client processes and server threads is shown in Figure 2.
< Figure 1: Process relationship of programming assignment 2 >
< Figure 2: Process and thread relationships of programming assignment 4 >
3.1 The server program
The server program is a multi-threaded server. The server is responsible for listening for and building connections from clients. The server waits for connections on the specified port. When it receives a connection, it should spawn a new thread to handle that connection.
A thread is responsible for handling requests from a client: it reads the client's messages from the socket set up between them, does the necessary computation (refer to Section 4, Communication Protocol), and sends responses back to the client through the socket. The thread should exit after it closes the client connection.
< Figure 4: Information and data structure the server maintains >
The server maintains two important lists: azList and updateStatus. azList is a 1-D integer list and updateStatus is a 2-D integer list, as shown in Figure 4. The server saves the sum of all the word count results from mapper clients in the azList. The updateStatus table has 3 essential attributes: MapperID, number of updates, and a check in/out flag. A new entry in the updateStatus table is created when the server receives a CHECKIN request from a new mapper client (refer to the details of request commands in Section 4.1, Request).
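A minimal sketch of the multi-threaded accept loop described in 3.1; handle_client is an assumed thread function, and the backlog of 50 follows the concurrent-connection assumption in Section 8:

#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *handle_client(void *arg) {
    int conn = *(int *)arg;
    free(arg);
    /* ... read requests, compute, send responses (Section 4) ... */
    close(conn);
    return NULL;
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: ./server <port>\n");
        return 1;
    }
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(atoi(argv[1]));
    if (bind(sock, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(sock, 50) == -1) {           /* up to 50 concurrent connections */
        perror("bind/listen");
        return 1;
    }
    printf("server is listening\n");
    for (;;) {
        int *conn = malloc(sizeof *conn);
        *conn = accept(sock, NULL, NULL);
        if (*conn == -1) {
            free(conn);
            continue;
        }
        pthread_t tid;
        pthread_create(&tid, NULL, handle_client, conn);
        pthread_detach(tid);                /* thread exits on its own */
    }
}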
You are free to add more columns to the updateStatus table if needed.
Table 1: Attributes of the updateStatus Table
– MapperID: unique mapper ID, which is greater than 0.
– Number of updates: incremented by 1 when receiving an UPDATE_AZLIST request from a mapper client; the value should equal the number of files the mapper client processes.
– Check in/out flag: 1 when a mapper client is checked in; 0 when a mapper client is checked out.
3.2 The client program
The client program is responsible for initiating a connection with the server on the IP address and port number given to it. It consists of 3 phases.
[ Phase 1 ] File Path Partitioning - By Master Client
This is exactly the same as Phase 1 of PA2. You are given the code (phase1.c and phase1.h) for Phase 1. This code will generate a folder "MapperInput" and the so-called mapper files "Mapper_i.txt" in that folder. The mapper files contain a list of file paths. MapperID starts from 1 and increments by 1. Please refer to the PA2 write-up for further details of Phase 1.
[ Phase 2 ] Deterministic Request Handling - By Mapper Clients
The master process assigns a unique mapperID to each mapper process while spawning it. The mapper ID starts from 1 and increments by 1. Mapper processes run in parallel. Each mapper client sets up a TCP connection to the server and sends the following fixed set of requests sequentially, in a deterministic manner. A single TCP connection is used to send all the following requests. Mapper clients can access their own mapper file in the MapperInput folder.
1. CHECKIN
2. UPDATE_AZLIST (if the mapper client has multiple file paths in its mapper file, this request is sent to the server multiple times; for example, if Mapper_1.txt contains 10 lines of file paths, the mapper client sends 10 UPDATE_AZLIST requests to the server, with the word count result of each file)
3. GET_AZLIST
4. GET_MAPPER_UPDATES
5. GET_ALL_UPDATES
6. CHECKOUT
For mapper clients to send any requests to the server, they should be checked in first. After check-in, they can send the other types of requests to the server. After exchanging the messages with the server, they should check out, close the TCP connection, and exit. The master waits until all mapper processes have terminated. You can find details of the requests in Section 4, and a sketch of one exchange below.
[ Phase 3 ] Dynamic Request Handling - By Master Clients
Phase 3 is extra credit. You will add the master client functionality. After the master process makes sure all mapper processes have terminated, it sends requests dynamically by reading commands from a file named "commands.txt". It uses a separate TCP connection for each request. You can find the details in Section 5, Extra Credits.
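Returning to Phase 2, here is a sketch of a mapper client's first exchange, assuming the 28-integer request layout of Table 2, CHECKIN command code 1, and a response that arrives in a single read; in the real program the client log goes to log/log_client.txt:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <unistd.h>

#define MSG_LEN 28   /* 1 command + 1 mapperID + 26 data integers */

int send_checkin(const char *server_ip, int port, int mapper_id) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, server_ip, &addr.sin_addr);
    if (connect(sock, (struct sockaddr *)&addr, sizeof addr) == -1) {
        perror("connect");
        return -1;
    }
    int req[MSG_LEN] = {0};
    req[0] = 1;                        /* CHECKIN command code */
    req[1] = mapper_id;
    write(sock, req, sizeof req);      /* request: integer array */
    int rsp[MSG_LEN];
    read(sock, rsp, sizeof rsp);       /* response: integer array */
    printf("[%d] CHECKIN: %d %d\n", mapper_id, rsp[1], rsp[2]);
    return sock;                       /* reuse for the remaining requests */
}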
4. Communication Protocol
The communication protocol is an application-layer protocol formed of requests and responses. Both requests and responses are integer arrays. Requests are sent from the client and received by the server. After the server does the necessary computation, it responds to the client. You can find the details of the requests and responses in this section.
4.1 Request
The request structure is as follows. You can find the relevant definitions in the given protocol.h and can use them in your implementation.
Table 2: Request Structure (field name, size in # of integers, purpose)
– Request Command (1): specifies the request command
– MapperID (1): specifies the mapperID (-1 for the master client)
– Data (26): relevant data for the command; if there is no data, fill with zeros
Table 3: Request Command Codes (code, command name, data, who can send)
– 1, CHECKIN, zeros, mapper clients
– 2, UPDATE_AZLIST, word count result of a file, mapper clients
– 3, GET_AZLIST, zeros, mapper clients and master clients (extra credit)
– 4, GET_MAPPER_UPDATES, zeros, mapper clients
– 5, GET_ALL_UPDATES, zeros, mapper clients and master clients (extra credit)
– 6, CHECKOUT, zeros, mapper clients
Details of each request are as follows.
● 1. CHECKIN
o [Client] Mapper clients should send this request before sending other types of requests.
o [Server] The server creates a new entry in the updateStatus table for a new mapper client if a corresponding entry does not exist in the table. If there is an existing entry, the server simply changes the check in/out field to checked-in (1).
● 2. UPDATE_AZLIST
o [Client] Mapper clients send this to the server with PER-FILE word count results. If there are multiple file paths in the mapper file, this request should be sent once per file path. If there are no files in the mapper file, mapper clients SHOULD NOT send this message.
o [Server] The server adds the word count results into the azList and increments the "number of updates" field of the updateStatus table by 1 for the corresponding mapper client. The number of updates of a particular mapper client should equal the number of file paths in that mapper's mapper file.
● 3. GET_AZLIST
o [Client] This request can be sent by both mapper clients and master clients (extra credit). Mapper clients must already be checked in to send it; master clients can send it without check-in (extra credit).
o [Server] The server returns the current values of the azList.
● 4. GET_MAPPER_UPDATES
o [Client] Only mapper clients can send this request, and they must already be checked in.
o [Server] The server returns the current value of the "number of updates" field of the updateStatus table for the corresponding mapper ID.
● 5. GET_ALL_UPDATES
o [Client] This request can be sent by both mapper clients and master clients (extra credit). Mapper clients must already be checked in; master clients can send it without check-in (extra credit).
o [Server] The server returns the sum of all values of the "number of updates" field in the updateStatus table.
● 6. CHECKOUT
o [Client] This is the last request sent by a mapper client. After getting a response, the mapper client closes its TCP connection and terminates its own process.
o [Server] The server updates the check in/out field of the updateStatus table to checked-out (0).
4.2 Response
The response structure is as follows. You can find the relevant definitions in the given protocol.h and can use them in your implementation.
Table 4: Response Structure (field name, size in # of integers, purpose)
– Request Code (1): specifies the type of request
– Response Code (1): specifies the response code
– Data (1 or 26): relevant data from the server
Table 5: Response Codes (code, meaning, purpose)
– 0, RSP_OK (success), for successful requests
– 1, RSP_NOK (error), for unsuccessful requests
Table 6: Data Returned on Success (request command, request name, data returned)
– 1, CHECKIN, 1 value (mapperID)
– 2, UPDATE_AZLIST, 1 value (mapperID)
– 3, GET_AZLIST, 26 word count values (the azList values at the moment the request is received, regardless of mapperID)
– 4, GET_MAPPER_UPDATES, 1 value (the "number of updates" field of the updateStatus table for the corresponding mapperID)
– 5, GET_ALL_UPDATES, 1 value (the sum of all entries of the "number of updates" field in the updateStatus table)
– 6, CHECKOUT, 1 value (mapperID)
The server should handle various error cases, such as:
– receiving a request with an unknown request code
– a mapper ID that is not greater than zero
– no corresponding entry in the updateStatus table
– a mapper client sending a CHECKIN request when it is already checked in
– a mapper client sending a CHECKOUT request when it is not checked in
– handling request permissions, etc.
4.3 Log Printout
The server program prints the following logs in the terminal. The client program prints the following logs in a log file "log_client.txt" in the "log" folder (refer to Section 6, Folder Structure). Please stick closely to this output format. The following items are the minimal requirements to print, so you can print additional logs or messages for failure cases. In addition, you can use the function createLogFile() to initialize a client log file in client.c. You can find example logs in the PA4_Appendix document, and other expected outputs and logs in the "Testcases/ExpectedResult" folder.
● When the server is ready to listen:
○ [Client] None
○ [Server] Print "server is listening\n"
● Establishing a TCP connection between a client and the server:
○ [Client] Print "[%d] open connection\n", mapperID
○ [Server] Print "open connection from %s:%d\n", client_ip, client_port
● Closing the TCP connection between a client and the server:
○ [Client] Print "[%d] close connection\n", mapperID
○ [Server] Print "close connection from %s:%d\n", client_ip, client_port
● 1. CHECKIN
○ [Client] Print "[%d] CHECKIN: %d %d\n", mapperID, Response Code (response[1]), Data (response[2])
○ [Server] Print "[%d] CHECKIN\n", mapperID (request[1])
● 2. UPDATE_AZLIST
○ [Client] Print "[%d] UPDATE_AZLIST: %d\n", mapperID, total number of messages sent to the server. Print this log after a mapper client sends all UPDATE_AZLIST requests to the server.
○ [Server] None
● 3. GET_AZLIST
○ [Client] Print "[%d] GET_AZLIST: %d <26 numbers>\n", mapperID, Response Code (response[1]), and all data received from the server (26 numbers) on the same line (1 space between numbers)
○ [Server] Print "[%d] GET_AZLIST\n", mapperID (request[1])
● 4. GET_MAPPER_UPDATES
○ [Client] Print "[%d] GET_MAPPER_UPDATES: %d %d\n", mapperID, Response Code (response[1]), Data (response[2])
○ [Server] Print "[%d] GET_MAPPER_UPDATES\n", mapperID (request[1])
● 5. GET_ALL_UPDATES
○ [Client] Print "[%d] GET_ALL_UPDATES: %d %d\n", mapperID, Response Code (response[1]), Data (response[2])
○ [Server] Print "[%d] GET_ALL_UPDATES\n", mapperID (request[1])
● 6. CHECKOUT
○ [Client] Print "[%d] CHECKOUT: %d %d\n", mapperID, Response Code (response[1]), Data (response[2])
○ [Server] Print "[%d] CHECKOUT\n", mapperID (request[1])
5. Extra Credits
You can get extra credit by implementing the master client's dynamic request handling (Phase 3 of the client program). Unlike the deterministic request handling by mapper clients, the master client dynamically sends requests by reading request commands from a command file named "commands.txt". The master client initiates a new TCP connection to the server after it makes sure all mapper processes have terminated. When the master client sends a request, the mapperID field of the request should be set to -1. Requests from the master client do not require check-in and check-out, unlike mapper clients, so the master client should set up a new TCP connection for each request and close it after getting a response. For example, if you use the following commands.txt, the client will get successful responses only for request commands 3 (GET_AZLIST) and 5 (GET_ALL_UPDATES). For the other requests, it will get unsuccessful responses from the server. You can assume that 2 (UPDATE_AZLIST) does not appear in commands.txt.
1
3
4
5
6
7
< example commands.txt file >
< Figure 5: Extra Credit - Master Client >
6. Folder Structure
Please strictly conform to the folder structure; the conformance will be graded. You should have two project folders named "PA4_Client" and "PA4_Server". "PA4_Client" should contain "include", "src", "log", "Testcases", and "MapperInput" folders. "PA4_Server" should contain "include" and "src". You are free to modify the provided Makefiles.
● "include" folder: all .h header files
● "src" folder: all .c source files
● "log" folder: log_client.txt
● "Testcases" folder: 5 TestCase folders and an ExpectedResult folder are provided for your testing. Your program will be tested with additional hidden test cases. The TestCases are located in this folder. Please DO NOT include this folder in the final deliverable.
● "MapperInput" folder: this folder is created as part of Phase 1 of the client program. Please do not include this folder in the final deliverable.
● Each top folder ("PA4_Client" and "PA4_Server") should contain the following files:
○ Makefile
○ executable
○ commands.txt (only in PA4_Client, if you attempt extra credit)
7. Execution Syntax
The usage of your server program is as follows. The executable name should be "server".
./server <port>
● <port> is any unsigned integer to be used as a port number
The usage of your client program is as follows. The client executable name should be "client".
./client <folderName> <# of Mappers> <server IP> <server port>
● <folderName> is the name of the root folder to be traversed
● <# of Mappers> is the number of mapper processes spawned by the master
● <server IP> is the IP address of the server to connect to
● <server port> is the port number of the server to connect to
8. Assumptions and Hints
– You should use TCP sockets.
– The maximum number of mapper clients per master client is 32.
– The maximum number of concurrent connections on the server side is 50.
– Mapper IDs should be greater than zero and unique within a client program.
– A client connects to a single server at any time.
– The server handles multiple client requests at a time.
– The server program does not terminate unless it is killed by a user.
– Request and response messages are integer arrays.
– The given phase1 code handles symbolic folders and files.
– Your server will get all types of requests, both successful and unsuccessful, so you need to handle various errors on the server side.
– commands.txt contains any integers (>0) except for 2.
– Start working on a local host for both the client and server programs, then try it on multiple machines later.
9. Submission
One student from each group should upload to Canvas a zip file containing the two project folders ("PA4_Client" and "PA4_Server") and a README.md that includes the following details:
– Team names and x500
– How to compile the client and server programs
– How to run the client and server programs
– Your and your partner's individual contributions
– Any assumptions outside this document
– If you have attempted extra credit
The README.md file does not have to be long, but must properly describe the above points. Your source code should provide appropriate comments for functions. At the top of your README.md file and each C source file please include the following comment:
/*test machine: CSELAB_machine_name
 * date: mm/dd/yy
 * name: full_name1, [full_name2]
 * x500: id_for_first_name, [id_for_second_name]
 */
10. Grading Policy
1. (5%) Correct README content
2. (5%) Appropriate code style and comments
3. (10%) Conformance check
a. (5%) Folder structure and executable names
b. (5%) Log format for client logs and server logs
4. (30%) TCP connection setup between clients and server
a. (10%) Set up a TCP connection between a mapper client and the server
b. (10%) Use threads at the server side
c. (10%) Set up multiple TCP connections between mappers and the server
5. (30%) Deterministic request handling between mapper clients and the server
a. (15%) Mapper client side
b. (15%) Server side
6. (10%) Error handling
7. (10%) Successful executions of client and server programs on different machines
8. (10%) Extra credit: master client's dynamic request handling
a. (5%) Set up a TCP connection between the master client and the server
b. (5%) Read request commands, then send and receive the commands