In this question, you will implement the PageRank algorithm in Python for a large dataset.
The PageRank algorithm was first proposed to rank web search results, so that more important web pages are ranked higher. It works by considering the number and importance of links pointing to a page, to estimate how important that page is. PageRank outputs a probability distribution over all web pages, representing the likelihood that a person randomly surfing the web (randomly clicking on links) would arrive at those pages.
As mentioned in the lectures, the PageRank values are the entries in the dominant eigenvector of the modified adjacency matrix in which each column's values add up to 1 (i.e., column-normalized). This eigenvector can be calculated by the power iteration method, which iterates through the graph's edges multiple times, updating each node's probability (scores in pagerank.py) in each iteration:
For each iteration, the PageRank computation for each node v_j is:

PR(v_j) = (1 - d) * Pd(v_j) + d * Σ_{v_i ∈ M(v_j)} PR(v_i) / out_degree(v_i)

Where:
- d is the damping factor,
- Pd(v_j) is the probability of a random jump to node v_j,
- M(v_j) is the set of nodes with an edge pointing to v_j, and
- out_degree(v_i) is the number of outgoing edges of node v_i.
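For reference, below is a minimal power-iteration sketch in plain Python. It is illustrative only: the function and variable names are made up, and it does not follow the structure of the provided pagerank.py skeleton, which you must use.

```python
# A minimal power-iteration sketch (illustrative; not the provided pagerank.py skeleton).
# Assumes `edges` is a list of (source, target) pairs and `nodes` is the set of node IDs.
def pagerank(edges, nodes, damping=0.85, max_iterations=10):
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}
    for src, dst in edges:
        out_degree[src] += 1

    pd = {v: 1.0 / n for v in nodes}       # simplified PageRank: uniform jump vector
    scores = {v: 1.0 / n for v in nodes}   # initial probabilities

    for _ in range(max_iterations):
        # Start every node at its random-jump term, then add contributions along edges.
        new_scores = {v: (1 - damping) * pd[v] for v in nodes}
        for src, dst in edges:
            new_scores[dst] += damping * scores[src] / out_degree[src]
        scores = new_scores
    return scores
```

Dangling nodes (nodes with no outgoing edges) and the exact output format for the top-10 nodes are handled by the skeleton and are omitted here.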
You will be using the Wikipedia adminship election dataset, which has almost 7K nodes and 100K edges. You can find the dataset under hw4-skeleton/Q1 as soc-wiki-elec.edges.
In pagerank.py,
- You will implement the simplified PageRank algorithm, where Pd ( vj ) = 1/n, in the script provided, and submit the output for runs of 10 and 25 iterations.
To help you verify your implementation, we provide sample output for 5 iterations of simplified PageRank.
- For personalized PageRank, the Pd ( ) vector will be assigned values based on your 9-digit GTID (e.g., 987654321), and you are asked to submit the output for runs of 10 and 25 iterations.
Deliverables:
- pagerank.py [12 pts]: your modified implementation
- simplified_pagerank_{n}.txt: 2 files (as given below) containing the top 10 node IDs and their simplified PageRank scores for n iterations
simplified_pagerank10.txt [2 pts]
simplified_pagerank25.txt [2 pts]
- personalized_pagerank_{n}.txt: 2 files (as given below) containing the top 10 node IDs and their personalized PageRank scores for n iterations
personalized_pagerank10.txt [2 pts]
personalized_pagerank25.txt [2 pts]
Random Forest Classifier
Q2.1 Random Forest Setup [45 pts]
Note: You must use Python 3.7.x for this question.
You will implement a random forest classifier in Python. The performance of the classifier will be evaluated via the out-of-bag (OOB) error estimate, using the provided dataset.
Note: You may only use the modules and libraries provided at the top of the .py files included in the skeleton for Q2 and modules from the Python Standard Library. Python wrappers (or modules) may NOT be used for this assignment. Pandas may NOT be used; while we understand that it is a useful library to learn, completing this question is not critically dependent on its functionality. In addition, to make grading more manageable and to enable our TAs to provide better, more consistent support to our students, we have decided to restrict the libraries accordingly.
The dataset you will use is the Predicting a Pulsar Star dataset. Each record consists of the parameters of a pulsar candidate. The dataset has been cleaned to remove missing attributes. The data is stored in a comma-separated values (CSV) file in your Q2 folder as pulsar_stars.csv. Each line describes an instance using 9 columns: the first 8 columns represent the attributes of the pulsar candidate, and the last column is the class label, which tells us whether the candidate is a pulsar or not (1 means it is a pulsar, 0 means it is not).
Note: The last column should not be treated as an attribute.
Note2: Do not modify the dataset.
You will perform binary classification on the dataset to determine if a pulsar candidate is a pulsar or not.
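The provided starter code already takes care of loading the data, but as a quick illustration of the file layout, a minimal sketch using only the standard library (with an assumed relative path; adjust for a header row if the file contains one) could be:

```python
import csv

# Illustrative only: the provided starter code already loads the data for you.
# Adjust the path, and skip the header row if pulsar_stars.csv contains one.
X, y = [], []
with open("pulsar_stars.csv") as f:
    for row in csv.reader(f):
        values = [float(v) for v in row]
        X.append(values[:-1])       # first 8 columns: attributes
        y.append(int(values[-1]))   # last column: class label (1 = pulsar, 0 = not)
```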
Essential Reading
Decision Trees
To complete this question, you need to develop a good understanding of how decision trees work. We recommend you review the lecture on decision trees. Specifically, you need to know how to construct decision trees using Entropy and Information Gain to select the splitting attribute and the split point for the selected attribute. These slides from CMU (also mentioned in lecture) provide an excellent example of how to construct a decision tree using Entropy and Information Gain.
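As a refresher, the two quantities can be sketched as follows (the function names and signatures here are illustrative; util.py in the skeleton defines its own):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        result -= p * math.log2(p)
    return result

def information_gain(parent_labels, partitions):
    """Parent entropy minus the size-weighted entropy of the child partitions."""
    n = len(parent_labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted
```

For example, splitting the labels [0, 0, 1, 1] into the pure partitions [0, 0] and [1, 1] yields an information gain of 1.0.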
Random Forests
To refresh your memory about random forests, see Chapter 15 of the Elements of Statistical Learning book and the lecture on random forests. Here is a blog post that introduces random forests in a fun way, in layman's terms.
Out-of-Bag Error Estimate
In random forests, it is not necessary to perform explicit cross-validation or use a separate test set for performance evaluation. The out-of-bag (OOB) error estimate has been shown to be reasonably accurate and unbiased. Below, we summarize the key points about OOB error described in the original article by Breiman and Cutler.
Each tree in the forest is constructed using a different bootstrap sample from the original data. Each bootstrap sample is drawn by randomly sampling from the original dataset with replacement (usually, a bootstrap sample has the same size as the original dataset). Statistically, about one-third of the records are left out of each bootstrap sample, since the probability that a record is never drawn is (1 - 1/n)^n ≈ e^(-1) ≈ 0.37; these records are not used in the construction of the kth tree. Each record left out of the construction of the kth tree can be assigned a class by that tree. As a result, each record receives a test-set classification from the subset of trees that treat it as an out-of-bag sample, and the majority vote among those trees is its predicted class. The proportion of records whose predicted class does not equal their true class, averaged over all records, is the OOB error estimate.
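A minimal sketch of this computation follows. The names are assumptions, not the skeleton's API: each tree is assumed to expose a classify(record) method, and bootstrap_indices[k] is assumed to hold the set of record indices used to train tree k.

```python
from collections import Counter

# Illustrative sketch of the OOB error described above (names are assumptions,
# not the skeleton's API): `trees[k]` has a classify(record) method and
# `bootstrap_indices[k]` is the set of record indices used to train tree k.
def oob_error(trees, bootstrap_indices, X, y):
    wrong, counted = 0, 0
    for i, record in enumerate(X):
        # Only trees that did NOT see record i during training get a vote.
        votes = [tree.classify(record)
                 for k, tree in enumerate(trees)
                 if i not in bootstrap_indices[k]]
        if not votes:
            continue  # record appeared in every bootstrap sample; skip it
        majority = Counter(votes).most_common(1)[0][0]
        counted += 1
        if majority != y[i]:
            wrong += 1
    return wrong / counted
```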
Starter Code
We have prepared starter code written in Python for you to use. It will help you load the data and evaluate your model. The following files are provided for you:
- util.py: utility functions that will help you build a decision tree
- decision_tree.py: a decision tree class that you will use to build your random forest
- random_forest.py: a random forest class and a main method to test your random forest
What you will implement
Below, we have summarized what you will implement to solve this question. Note that you MUST use information gain to perform the splitting in the decision tree. The starter code has detailed comments on how to implement each function.
- util.py: implement the functions to compute entropy, information gain, and perform splitting.
- decision_tree.py: implement the learn() method to build your decision tree using the utility functions above.
- decision_tree.py: implement the classify() method to predict the label of a test record using your decision tree.
- random_forest.py: implement the _bootstrapping(), fitting(), and voting() methods
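For instance, a bootstrap sample can be drawn as in the minimal sketch below (illustrative names; _bootstrapping() in the skeleton has its own signature):

```python
import random

# Illustrative sketch of drawing one bootstrap sample (not the skeleton's signature).
def bootstrap_sample(num_records, sample_size=None):
    """Sample record indices with replacement; the unsampled indices are out-of-bag."""
    if sample_size is None:
        sample_size = num_records   # usual choice: same size as the original dataset
    chosen = [random.randrange(num_records) for _ in range(sample_size)]
    out_of_bag = set(range(num_records)) - set(chosen)
    return chosen, out_of_bag
```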
Note: You must achieve a minimum accuracy of 75% for the random forest.
Note 2: Your code must take no more than 5 minutes to execute.
Note 3: Remember to remove all of your print statements from the code. Nothing other than the existing print statements in main() should be printed on the console. Failure to do so may result in point deduction. Do not remove the existing print statements in main() in random_forest.py.
As you solve this question, you will need to think about multiple parameters in your design; some may be more straightforward to determine than others (hint: study the lecture slides and the essential reading above). For example (a sketch of one common set of choices follows this list):
- Which attributes to use when building a tree?
- How to determine the split point for an attribute?
- When do you stop splitting leaf nodes?
- How many trees should the forest contain?
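One common (but not required) set of choices for the first two questions, sketched below with illustrative names, is to consider a random subset of roughly sqrt(m) of the m attributes at each split and to use a simple statistic such as the attribute's mean as a candidate split point:

```python
import math
import random

# Illustrative heuristics only; you are free to make different design choices.
def choose_split_attributes(num_attributes):
    """Pick a random subset of roughly sqrt(m) attribute indices for one split."""
    k = max(1, int(math.sqrt(num_attributes)))
    return random.sample(range(num_attributes), k)

def candidate_split_point(records, attribute_index):
    """Use the mean of an attribute's values as a simple candidate split point."""
    values = [record[attribute_index] for record in records]
    return sum(values) / len(values)
```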
Note that, as mentioned in lecture, there are other approaches to implementing random forests. For example, instead of information gain, other popular choices include the Gini index and random attribute selection (e.g., PERT, Perfect Random Tree Ensembles). We decided to ask everyone to use an information-gain-based approach in this question (instead of leaving it open-ended) to help standardize students' solutions and accelerate our grading efforts.
Q2.2 forest.txt [5 pts]
In forest.txt, report the following:
- What is the main reason to use a random forest versus a decision tree? (<= 50 words)
- How long did your random forest take to run? (in seconds)
- What accuracy (to two decimal places, xx.xx%) were you able to obtain?
Deliverables
- util.py [10 pts]: The source code of your utility functions.
- decision_tree.py [10 pts]: The source code of your decision tree implementation.
- random_forest.py [25 pts]: The source code of your random forest implementation with appropriate comments.
- forest.txt [5 pts]: The text file containing your responses to Q2.2