In this mini-project you will develop models to analyze text from the website Reddit (https://www.reddit.com/), a popular social media forum where users post and comment on content in different themed communities, or subreddits. The goal of this project is to develop a supervised classification model that can predict which community a comment came from. You will be competing with other groups to achieve the best accuracy on this prediction task. However, your performance in the competition is only one aspect of your grade. We also ask that you implement a minimum set of models and report on their performance in a write-up.
The Kaggle website has a link to the data, which is a 20-class classification problem with a (nearly) balanced dataset (i.e., there are equal numbers of comments from 20 different subreddits). The data is provided in CSVs, where the text content of the comment is enclosed in quotes. Each entry in the training CSV contains a comment ID, the text of the comment, and the name of the target subreddit for that comment. For the test CSV, each line contains a comment ID and the text for that comment. You can view and download the data via this link: https://www.kaggle.com/c/reddit-comment-classification-comp-551/data
You need to submit a prediction for each comment in the test CSV; i.e., you should produce a prediction CSV where each line contains a comment ID and the predicted subreddit for that comment. Since the data is balanced and involves multiple classes, you will be evaluated according to the accuracy of your model. An example of the proper formatting for the submission file can be viewed at: https://www.kaggle.com/c/reddit-comment-classification-comp-551/overview/evaluation.
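For illustration, here is a minimal sketch of reading the test data and writing a correctly formatted submission file. The file names and column headers used below (reddit_test.csv, id, Id, Category) are assumptions; check them against the actual files and the sample submission on Kaggle.

```python
import pandas as pd

# File and column names are assumptions; verify them against the CSVs
# and the sample submission downloaded from Kaggle.
test = pd.read_csv("reddit_test.csv")   # assumed columns: id, comments

# Placeholder predictions; replace with the output of your trained model.
predictions = ["AskReddit"] * len(test)

# One row per test comment: its ID and the predicted subreddit.
submission = pd.DataFrame({"Id": test["id"], "Category": predictions})
submission.to_csv("submission.csv", index=False)
```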
Tasks
You are welcome to try any model you like on this task, and you are free to use any libraries you like to extract features. However, you must meet the following requirements:
- You must implement a Bernoulli Naive Bayes model (i.e., the Naive Bayes model from Lecture 5) from scratch (i.e., without using any external libraries such as scikit-learn). You are free to use any text preprocessing that you like with this model. Hint 1: you may want to use Laplace smoothing with your Bernoulli Naive Bayes model. Hint 2: you can choose the vocabulary for your model (i.e., which words you include vs. ignore), but you should provide justification for the vocabulary you use. A minimal sketch of such a model appears after this list.
- You must run experiments using at least two different classifiers from the scikit-learn package (other than Bernoulli Naive Bayes). Possible options are:
- Logistic regression
(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- Decision trees
(https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- Support vector machines [to be introduced in Lecture 10 on Oct. 7th]
(https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
- You must develop a model validation pipeline (e.g., using k-fold cross-validation or a held-out validation set) and report on the performance of the above-mentioned model variants; see the sketch after this list.
- You should evaluate all the model variants above (i.e., Naive Bayes and the scikit-learn models) using your validation pipeline (i.e., without submitting to Kaggle) and report on these comparisons in your write-up. Ideally, you should only run your best model on the Kaggle competition, since you are limited to two submissions to Kaggle per day.
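As referenced in the first requirement above, the following is a minimal NumPy sketch of Bernoulli Naive Bayes with Laplace smoothing. It assumes a dense binary bag-of-words matrix (X[i, j] = 1 if word j occurs in comment i) and is meant as an illustrative starting point, not a complete or tuned solution.

```python
import numpy as np

class BernoulliNaiveBayes:
    """Minimal Bernoulli Naive Bayes with Laplace smoothing.

    Assumes X is a dense binary NumPy array of shape (n_samples,
    n_features), where X[i, j] = 1 if word j occurs in comment i.
    A sketch, not a tuned or complete implementation.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.zeros(len(self.classes_))
        self.log_theta_ = np.zeros((len(self.classes_), X.shape[1]))
        self.log_one_minus_theta_ = np.zeros_like(self.log_theta_)
        for k, c in enumerate(self.classes_):
            Xc = X[y == c]
            self.log_prior_[k] = np.log(len(Xc) / len(X))
            # Smoothed estimate of P(word j present | class c); the +alpha /
            # +2*alpha terms keep every probability strictly inside (0, 1).
            theta = (Xc.sum(axis=0) + self.alpha) / (len(Xc) + 2 * self.alpha)
            self.log_theta_[k] = np.log(theta)
            self.log_one_minus_theta_[k] = np.log(1.0 - theta)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Log-posterior (up to a constant): sum log(theta) over present words,
        # log(1 - theta) over absent words, plus the class log-prior.
        scores = (X @ self.log_theta_.T
                  + (1 - X) @ self.log_one_minus_theta_.T
                  + self.log_prior_)
        return self.classes_[np.argmax(scores, axis=1)]
```

Note that training reduces to per-class word counts and prediction to two matrix products, which keeps even a pure-NumPy implementation tractable on this dataset.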
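For the validation-pipeline requirement, a sketch along the following lines compares two scikit-learn baselines with 5-fold cross-validation before spending any of your two daily Kaggle submissions. The training file and column names are again assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumed file and column names; check them against the downloaded CSV.
train = pd.read_csv("reddit_train.csv")
X, y = train["comments"], train["subreddits"]

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(),
}
for name, clf in candidates.items():
    # Binary bag-of-words features mirror the Bernoulli Naive Bayes
    # representation; swap in TF-IDF or n-grams as you experiment.
    pipe = make_pipeline(CountVectorizer(binary=True), clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```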
Deliverables
You must submit two separate files to MyCourses (using the exact filenames and file types outlined below):
- zip: A collection of .py, .ipynb, and other supporting code files, which must work with Python version 3. You must include your implementation of Bernoulli Naive Bayes, and it must be possible for the TAs to reproduce all the results in your report and your Kaggle leaderboard submissions using your submitted code. Please include a README detailing the packages you used and providing instructions to replicate your results.
- pdf: Your (max 5-page) project write-up as a pdf (details below).
Project write-up
Your team must submit a project write-up that is a maximum of five pages (single-spaced, 10pt font or larger; extra pages may be used for references/bibliographical content and appendices). We highly recommend that students use LaTeX to complete their write-ups and use the bibtex feature for citations. You are free to structure the report as you see fit; the guidelines and recommendations below are only a suggested structure, and you may deviate from it.
Abstract (100-250 words) Summarize the project task and your most important findings.
Introduction (5+ sentences) Summarize the project task, the dataset, and your most important findings. This should be similar to the abstract but more detailed.
Related work (4+ sentences) Summarize previous literature related to this text classification problem.
Dataset and setup (3+ sentences) Very briefly describe the dataset and any basic data pre-processing methods that are common to all your approaches (e.g., tokenizing). Note: You do not need to explicitly verify that the data satisfies the i.i.d. assumption (or any of the other formal assumptions for linear classification).
Proposed approach (7+ sentences) Briefly describe the different models you implemented/compared and the features you designed, providing citations as necessary. If you use or build upon an existing model based on previously published work, it is essential that you properly cite and acknowledge this previous work. Discuss algorithm selection and implementation. Include any decisions about the training/validation split, regularization strategies, optimization tricks, hyper-parameter settings, etc. It is not necessary to provide detailed derivations for the models you use, but you should provide at least a few sentences of background (and motivation) for each model.
Results (7+ sentences, possibly with figures or tables) Provide results on the different models you implemented (e.g., accuracy on the validation set, runtimes). You should report your leaderboard test set accuracy in this section, but most of your results should be on your validation set (or from cross validation).
Discussion and Conclusion (3+ sentences) Summarize the key takeaways from the project and possibly directions for future investigation.
Statement of Contributions (1-3 sentences) State the breakdown of the workload.