DSC 423: DATA ANALYSIS AND REGRESSION / DSC 323: DATA ANALYSIS & STATISTICAL SOFTWARE II
Final Project | Total Points 100
The purpose of the final project is to demonstrate your ability to apply the knowledge and techniques learned during this course. The final project for this class is a more extensive analysis task, chosen by you from among the topics we discuss. Final projects will include a 2-part final paper.
DELIVERABLES FOR THE FINAL PROJECT:
I. (a) Find Teammates (Week 1-5) [Individual Effort]
Teams will have 5 or 6 members. You can find teammates in class or post the following information under the D2L discussion topic Final Project Looking for Teammates so that students without a team can respond.
Information to include:
1. Full Name
2. Your research areas of interest / types of datasets you would like to analyze [optional]
IMPORTANT: If you do not find a team on your own, you will lose 1 point from your final project.
I. (b) Post Team List (Apr. 30) [Group Effort]
Once you have the required number of students in the group, please post the following information under the discussion topic Project Team List on D2L.
(1) Team name
(2) List of the names in each team (use your actual name as it appears on campusconnect)
(3) On-line or In-class sections
Note: If you are still not part of a team, post your information as a one-member team and I will see which team you can fit into. If your team has fewer than the required number of students, I will have to split it up so that its members can join other teams.
II. Select Dataset and Develop Project Proposal (May 14) [Group Effort] [5 Points]
Datasets / Data Sources:
There are several public repositories available online that include datasets from various industries. The dataset(s) you pick must be publicly available and cannot come from a private source or be sensitive in nature.
The minimal requirement for the dataset is that it contains at least 6 predictors and more than 600 records/observations. If you have taken this class before, you cannot use the dataset you used previously.
Datasets:
(Will need data cleaning, recoding and/or summarization, etc. before you begin your analysis)
KDnuggets is a great website that contains lots of information of interest to data scientists. It also includes a long list of data repositories: http://www.kdnuggets.com/datasets/index.html
Datasets used for data analytics competitions at https://www.kaggle.com/datasets
________________________________________________________________________________________________________
Final Project Instructions Nandhini Gulasingam 1
Extra Credit: Extra credit will be awarded to students based on 2 sets of criteria:
Two extra credit points will be given to students implementing items 1-3.
Three extra credit points will be given to students implementing items 1-3 and either item 4, 5 or 6.
1. Data obtained from either KDnuggets or Kaggle.
2. The dataset should contain at least 10 predictors. To qualify for the extra credit, you have to use these predictors in your initial analysis.
3. Predictors should include at least one qualitative predictor with 3 or more levels, as well as quantitative predictors. To qualify for the extra credit, you have to use these predictors in your analysis.
4. Extensive data preprocessing. For example, combining multiple datasets, removing 1K+ outliers, recoding multiple variables.
5. The dataset contains date/time and/or location-based predictors that are recoded and used in your analysis.
6. The number of observations should be 2,000 or more.
Proposal:
Submit a 1-2 page proposal (it may exceed the page limit) that includes:
1. Project title: be creative, come up with a catchy title (if possible)
2. Teammates: Full names of all teammates as they appear on campusconnect
3. Dataset: This should include
a. dataset name
b. brief description of the dataset
c. # of DV(s) and description of the dependent variables including data type (number, text, etc.)
d. # of IVs and description of the independent variables, # of numeric variables, # of text variables, # of date/time variables, # of location-related variables
e. number of rows/observations and
f. the URL of the site where you got the data
4. Problem description: What you plan to predict, analyze, etc. and why
5. Proposed methodology: The steps you will follow to address what you mentioned in (4) above. (Hint: Make a list of everything we have learnt in class, and put the items in order)
6. References: Use at least one reference per team member. The references cannot be your textbook, class notes or documents posted as additional reading. References are journal articles or research papers that will be helpful in understanding what scholars and industry experts suggest in terms of methodology, variables, or future direction for similar datasets. See the document References_How to cite them.PDF to see how to cite and use references.
Note: Extra Credit is awarded ONLY after you implement it in your code and analysis, and not given at the proposal stage.
Extra Credit: Two extra credit points will be given to those who implement an approach or methodology obtained from the references that was not covered during the course. Even if you implement techniques learnt outside the class, you still have to use at least one of the modeling techniques learnt in the class for your project; otherwise you would lose 98% of your grade for the project, as the outside technique counts ONLY towards the 2 extra credit points.
Reports (Jun. 10) [95 Points]
Each group member should write their own SAS code to solve the problem, but the group will write a single analysis of the results using a word processor. The analysis will be included in two separate reports.
Note: Page length does not matter; if you need more pages to explain what you need to, that is OK.
1. A Non-technical Summary Report (15 Points): [Group Effort] No longer than two typewritten pages, describing the objective or goal and the conclusions of your statistical analysis for a non-technical audience. It should be understandable to a person who does not know regression analysis or statistics. In your workplace, you will often have to present your findings to non-technical people and convince them without using technical jargon. Make sure to include the model statement.
2. A Technical Summary Report (80 Points): [Individual Effort] A 10-15 page technical report per person. The appendix, code and references do not count towards the 10-15 page limit. The report should include all the important outputs in the appendix section. It is intended for a statistically literate audience and must be written in a clear, organized fashion using the correct terminology. It should consist of the following sections:
Abstract [Group Effort] [5 Points]
Give a short summary of the goal, the approach/methodology, and the important findings and recommendations.
Introduction [Individual Effort if goals are different, otherwise Group Effort] [3 Points]
Describe the goal or objective and any hypothesis, any literature review or background research you did using the references, why it is important, the context, the motivation, etc.
Methodology [Individual Effort] [10 Points]
Describe the steps of your approach: specifically, where you obtained the data (cite the exact data source/link), how you pre-processed or cleaned the data (recoding, transformations, interaction variables, etc.), the model approach, the validation method, and any type(s) of analysis you performed.
Analysis, Results and Findings [Individual Effort] [67 Points]
Your analysis should address the following points:
1. The exploratory analysis of the data, including descriptives that may suggest a possible model that is adequate for fitting the data. Do the data show a non-linear relationship? Would a transformation of the response variable and/or the predictors be useful?
2. Try interaction variables.
3. Use linear, logistic, or polynomial regression techniques and/or transformations.
4. Check for collinearity among the independent variables.
5. A variable selection method will enable you to select suitable models and find the set of predictor variables that are most informative for predicting the response variable.
6. You may want to fit a few models (at least 1 final model per teammate) that seem adequate for your data and then select the model among them that provides the best prediction of Y.
7. Analyze the residual plots to look for patterns that might suggest a failure in the assumptions and some inadequacies in the selected model.
8. The existence of outliers and influential points may have dramatic effects on your analysis. Also check whether there are outliers.
9. Can your model be improved? Are you satisfied with the model you have chosen?
10. Use the selected regression model to examine the relationships and associations among the variables in your study and to identify, among the observed independent variables, the strongest predictors for the response variable.
11. Compute two predictions, including the prediction intervals, using the regression model.
12. Apply validation techniques to evaluate the predictive power of your model. Split the original dataset at random into a training set and a test set. The test set should have at least 15 observations in order to compute meaningful validation statistics. Discuss the model performance using the training and testing sets.
Note: This list of techniques is provided only for guidance and is not given in the order of execution. Students should review all the course materials to come up with a list of all techniques learnt in class, and implement the relevant techniques for their dataset in the right sequence.
Hints for the Statistical Analysis: It is possible that you may not find a satisfactory model that adequately fits your data. Sometimes a data set may admit more than one satisfactory answer; sometimes there may be none. If the statistical analysis shows that no regression models are suitable for your data set, mention what approaches you have tried and what was unsatisfactory about them. If there is more than one suitable model, mention the pros and cons and compare their performance in predicting the response variable.
The final aim of any statistical analysis is the understanding of a phenomenon or the investigation of a scientific problem from which your data arise. Remember that the regression function is a mathematical representation of such a problem, and the interpretation of the parameter values will give you insights about the relationships of the variables in the problem.
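The random training/test split and test-set evaluation asked for in the validation step of the analysis can be sketched as follows. This is only an illustration of the logic, written in Python with the standard library; the submitted project code must still be SAS. The function names, the 20% test fraction, the seed, the single-predictor model, and the synthetic data are all assumptions made for this sketch, and only the minimum of 15 test observations comes from the instructions.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=423, min_test=15):
    """Randomly split rows into (train, test); test gets at least min_test rows."""
    if len(rows) <= min_test:
        raise ValueError("not enough rows to hold out a test set of the required size")
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is reproducible
    n_test = max(min_test, round(len(shuffled) * test_fraction))
    n_test = min(n_test, len(shuffled) - 1)  # always leave at least one training row
    return shuffled[n_test:], shuffled[:n_test]


def fit_simple_ols(pairs):
    """Least-squares intercept and slope for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(pairs, intercept, slope):
    """Root mean squared prediction error on the given (x, y) pairs."""
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic data, made up for illustration: y = 3 + 2x plus random noise
rng = random.Random(0)
data = [(x, 3 + 2 * x + rng.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data)
intercept, slope = fit_simple_ols(train)  # fit on the training set only
print(f"train RMSE: {rmse(train, intercept, slope):.3f}")
print(f"test RMSE:  {rmse(test, intercept, slope):.3f}")
```

The model is fitted on the training rows only, and the same error statistic is then reported on both sets. A test error much larger than the training error would suggest that the model does not generalize well.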
IMPORTANT: Each team member should come up with a final model that is distinct from the rest of the team and evaluate the performance metrics using test sets. There will be a 20% reduction in the final report grade if test set performance is not included. There will be a 50% reduction in the final report grade if the SAS code is not included or does not run.
Future Work [Individual Effort if goals are different, otherwise Group Effort] [2 Points]
Are there any additional avenues worth exploring based on what you have discovered so far? Do the current results suggest new directions worth exploring? Explain how.
Appendix [Individual Effort] [2 Points]
All relevant outputs should be included here and cross-referenced in your Analysis, Results and Findings section. The appendix should be the last section of your report.
Code [Individual Effort] [2 Points]
Attach the SAS code from each member to the zip file along with the dataset used. If different datasets were used by each member, provide the code and the dataset labeled with your name as a prefix (e.g. Jake_SAScode_loans.sas and Jake_loans.csv), so that when the code is run, I see the same output as what is provided in the report. If the code is not included or doesn't execute, you will lose 50% of your grade for the report.
IMPORTANT: Code should be provided as a SAS file that doesn't include any syntax errors. Do not copy and paste code into a Word file or submit text files, etc.
References [Individual Effort] [2 Points]
Papers that you read and cited in your paper/final report, and data sources. See the document References_How to cite them.PDF to see how to cite and use references. There should be at least one citation per team member.
Zip File [Group Effort] [2 Points]
Zip the folder with all the data files (raw and cleaned Excel/PDF files, and all corresponding files used), each team member's SAS code, and the 2 reports. Do not provide TAR or RAR files.
Team Contribution (Jun. 10) [Individual Effort] (-2 points if not submitted)
Group members should also submit the team evaluation document via D2L (see TeamEvaluation.docx under the Group Project section under Content). Two points will be deducted if evaluations are not submitted by the due date/time.