COMP2129 Assignment 4
Task description
In this assignment your task is to implement the PageRank algorithm in C using the power method described below and then optimise and parallelise your code to ensure peak performance is achieved.
The PageRank algorithm was developed in 1996 by Larry Page and Sergey Brin when they were graduate students at Stanford University. Google and other search engines compare words in search phrases to words in web pages and use ranking algorithms to determine the most relevant results.
PageRank assigns a score to a set of web pages that indicates their importance. The underlying idea behind PageRank is to model a user who is clicking on web pages and following links from one page to another. In this framework, important pages are those which have incoming links from many other pages, or have incoming links from other pages with a high PageRank score, or both.
You are encouraged to ask questions on Ed using the assignments category. As with any assignment, make sure that your work is your own, and you do not share your code or solutions with other students.
Working on your assignment
You can work on this assignment on your own computer or the lab machines. Students can also take advantage of the online code editor on Ed. Simply navigate to the assignment page and you are able to run, edit and submit code from within your browser. We recommend that you use Safari or Chrome.
It is important that you continually back up your assignment files onto your own machine, flash drives, external hard drives and cloud storage providers. You are encouraged to submit your assignment while you are in the process of completing it. By submitting you will obtain some feedback of your progress.
Academic declaration
By submitting this assignment you declare the following:
I declare that I have read and understood the University of Sydney Academic Dishonesty and Plagiarism in Coursework Policy, and except where specifically acknowledged, the work contained in this assignment or project is my own work, and has not been copied from other sources or been previously submitted for award or assessment.
I understand that failure to comply with the Academic Dishonesty and Plagiarism in Coursework Policy, can lead to severe penalties as outlined under Chapter 8 of the University of Sydney By-Law 1999 (as amended). These penalties may be imposed in cases where any significant portion of my submitted work has been copied without proper acknowledgement from other sources, including published works, the internet, existing programs, the work of other students, or work previously submitted for other awards or assessments.
I realise that I may be asked to identify those portions of the work contributed by me and required to demonstrate my knowledge of the relevant material by answering oral questions or by undertaking supplementary work, either written or in the laboratory, in order to arrive at the final assessment mark.
I acknowledge that the School of Information Technologies, in assessing this assignment, may reproduce it entirely, may provide a copy to another member of faculty, and or communicate a copy of this assignment to a plagiarism checking service or in house computer program, and that a copy of the assignment may be maintained by the service or the School of IT for the purpose of future plagiarism checking.
1
Due: 9:00pm Sunday, 5 June 2016
This assignment is worth 8% of your final assessment
First we construct the matrix M as follows:
8><>:1 / N,
Mij = 1/|OUT(j)|,
if |OUT(j)| = 0
if j links to i (1) otherwise
0,
For the PageRank vector, we use the notation P(t) to represent the PageRank score for page p at
iteration t. We initialise the scores for all pages to an initial value of 1 N
P(0) = 1
so that the sum equals 1.
(2)
(3)
(4)
Page 2 of ??
p
COMP2129
The PageRank algorithm
PageRank is an iterative algorithm that is repeated until a stopping criteria is met. The last iteration gives us the result of the search, which is a score per web page. A high score indicates a very relevant web page whereas a low score indicates a not so relevant web page for a search. Sorting the web pages by their scores in descending order gives us the order for the result list of a search query.
For describing the PageRank algorithm we introduce the following symbols:
S is the set of all web pages that we are computing the PageRank scores for
N = |S| is the total number of web pages
P = [P1t, P2t, , PNt ] is the vector of PageRank scores
d is a dampening factor for the probability that the user continues clicking on web pages is the convergence threshold
IN(p) is the set of all pages in S which link to page p
OUT(p) is the set of all pages in S which page p links to
Mij is the transition probability of travelling to page i from j
Mc is a matrix calculated from M, d and N that will be multiplied with P for each iteration
u
Next we construct matrix Mc as follows:
Mc = dM + 1 dE
N
whereEisanN N matrixofallones.
During each iteration of the algorithm, the value of P is updated as follows:
P(t+1) = McP(t) Operating Systems and Machine Principles
N
for some vector x = {x1,x2,,xn}:
Example
||x|| = sX x2i (6) i
|OUT(A)| = |OUT(C)| = 0 Operating Systems and Machine Principles
Page 3 of ??
- IN(A) =
- IN(B) =
- IN(C) =
- IN(D) =
{B, D} {D} {B, D} ;
OUT(A) = OUT(B) = OUT(C) = OUT(D) =
;
{A, C}
;
{A, B, C}
Each node represents a web page.
Edges in the graph indicate that the source of the edge is linking to the destination of the edge.
First calculate M:
2ABCD3 A 0.25 0.5 0.25 0.333
B64 0.25 0 0.25 0.333 75 C 0.25 0.5 0.25 0.333 D 0.25 0 0.25 0
M =
Notice that the sum of each column is 1. Columns A and C have value 1/N because
COMP2129
The algorithm continues to iterate until the convergence threshold is reached; that is, the algorithm terminates when PageRank scores stop varying between iterations. The PageRank scores have converged when the following condition is met:
||P(t+1) P(t)|| (5) The vector norm is defined to be the standard Euclidean vector norm; that is,
Our example has four web pages: S = {A, B, C, D}. In this example, d = 0.85 and = 0.005. The referencing structure of the web pages A, B, C, and D is given in the graph below.
260.250 0.463 0.250 0.32137 Mc = dM + 1 dE 640.250 0.038 0.250 0.32175
N 0.250 0.463 0.250 0.321 0.250 0.038 0.250 0.038
Initialise P(0):
260.25037
P(0) = 640.25075 0.250
0.250
0.463 0.250 0.038 0.250 0.463 0.250 0.038 0.250
The initialisation and first iteration result in the following values for P:
tABCD
0 0.250 0.250 0.250 0.250 1 0.321 0.215 0.321 0.144
Next, we check whether or not the algorithm has converged.
||P(1) P(0)|| = ||{0.071, 0.035, 0.071, 0.106}|| = p0.022543
0.150 Since 0.150 6 0.005 (), we perform another iteration.
COMP2129
Now calculate Mc:
Then perform the first iteration:
260.250 P(1) = McP(0) = 640.250
0.250 0.250
0.32137 260.25037 260.32137 0.32175 640.25075 640.21575 0.321 0.250 0.321 0.038 0.250 0.144
260.250 0.463 P(2) = McP(1) = 640.250 0.038
0.250 0.463 0.250 0.038
0.250 0.32137 260.32137 260.30637 0.250 0.32175 640.21575 640.21575 0.250 0.321 0.321 0.307 0.250 0.038 0.144 0.174
Which leaves us with the following three iterations of P:
tABCD
0 0.250 0.250 0.250 0.250 1 0.321 0.215 0.321 0.144 2 0.306 0.215 0.306 0.174
Operating Systems and Machine Principles
Page 4 of ??
COMP2129
Again, we check whether or not the algorithm has converged:
||P(2) P(1)|| 0.036 6 Convergence has not been reached, so we continue with the algorithm.
After another two iterations, we have finally reached convergence:
tABCD
0 0.250 0.250 0.250 0.250 1 0.321 0.215 0.321 0.144 2 0.306 0.215 0.306 0.174 3 0.308 0.217 0.308 0.167 4 0.308 0.216 0.308 0.168
||P(4) P(3)|| 0.001
At this point the algorithm terminates and the values of P(4) are the final PageRank scores for each of the pages. If this were a query, the resulting ranks would be A, C, B, D where we arbitrarily rank page A before page C since they have the same score.
Implementation details
Your program must implement the PageRank algorithm using the power method described above.
The header file pagerank.h contains an init function that processes the list of pages and edges from standard input and stores a linked list of pages and corresponding inlinks in the following structs:
struct page {
char* name; // name of this page
size_t index; // index of this page
size_t noutlinks; // number of outlinks from this page
node* inlinks; // linked list of pages with inlinks to this page
};
struct node {
page* page; // pointer to the page data structure node* next; // pointer to the next page in the list
};
Your program must produce no errors when built on the lab machines and ed with the command:
$ clang -O1 -std=gnu11 -march=native -lm -pthread pagerank.c -o pagerank Your program output must match the exact output format shown by the prototype and on Ed. You are
encouraged to submit your assignment while you are working on it, so you can obtain some feedback.
Operating Systems and Machine Principles Page 5 of ??
Program input
<dampening factor><number of pages><name of web page>...
<number of edges><source page> <destination page>...
The first line specifies the dampening factor to be used by the program. The second line defines the number of pages to be declared, followed by a list of page names with one name per line. Then, the number of edges in the graph is specified, followed by a list of edges with one edge per line.
The init function reads the input and terminates outputting an error message if the input is invalid. The input is considered invalid if a page name exceeds the maximum length, the dampening factor is not in the range 0 d 1, a page is declared twice, or an edge is defined to a nonexistent page.
Example
0.85 4
A
B
C
D
5 DA DB DC BA BC
This example has four web pages with names: A, B, C, and D
There are five edges: D ! A, D ! B, D ! C, B ! A, and B ! C
The output is the list of scores for each web page, for example:
A 0.30791363B 0.21580945C 0.30791363D 0.16836329
The scores are ordered in the same order they were defined in the input. Forprintingthescore,usetheformatstring%s %.8lf
For all test cases set = 0.005, use the constant defined as EPSILON
COMP2129
Operating Systems and Machine Principles
Page 6 of ??
Submission details
Final deliverable for the performance is due in week 13. Further details will be announced on Ed. You must submit your code using the assignment page on Ed. To submit, simply place all of your
files and folders into the workspace, click run to check your program works and then click submit. You are encouraged to submit multiple times, but only your last submission will be marked.
Marking
In this assignment you will receive marks for correctness and performance of your program. However, submissions are only eligible for the performance component if they pass all of the correctness tests.
5 marks are assigned based on automatic tests for the correctness of your program.
This component will use our own hidden test cases that cover every aspect of the specification. To pass test cases your solution must produce the correct output within the imposed time limit.
5 marks are assigned based on the performance of your code relative to the benchmark.
This component is tested on a separate machine in a consistent manner. Submissions faster than the benchmark implementation will receive 5 marks, with successively slower ones receiving lower marks. Submissions faster than our basic implementation will receive at least 2.5 marks.
Your program will be marked automatically, so make sure that you carefully follow the assignment specifications. Your program must match the exact output in the examples and the testcases on Ed.
Warning: Anyattemptstodeceiveordisruptthemarkingsystemwillresultinanimmediatezerofor the entire assignment. Negative marks can be assigned if you do not properly follow the assignment specification, or your code is unnecessarily or deliberately obfuscated.
COMP2129
Operating Systems and Machine Principles Page 7 of ??
Reviews
There are no reviews yet.