The ames_train data set contains approximately 2039 records. See the data description in the file Introduction_to_Ames_Housing_Data. This is a random selection of training data selected from the full dataset. Note, the index numbers have been randomized and the split between train and test is also random so you will not be able to match the test data with sale price values. You are to use OLS (“Linear”) Regression to predict the sale price for homes in the ames_test_sfam dataset by building two models using the ames_train data. Note, the test data set is single family homes, the training data is all homes.
- Read the report template. Your write up in PDF Format (no zip files). Your write up should have five sections. Each section should have enough detail so that I can follow your logic and someone else can replicate your work. (150 Points)
- A file that contains all the python code you used in your analysis. I should be able to run this file and get all the output that you got.
- A csv file, which has the scored records values from ames_test_sfam. There will be only two columns in this file: index and p_saleprice. You will be graded on how your model performs versus my model and those of other students in the class.
- Optional: You can see how well your model performs by submitting your csv file to kaggle at https://www.kaggle.com/t/0415308f8dd54fc4abed54bef75448bf
It is OK to submit to Kaggle many times.
You will have to tell me your alias for Kaggle so I can see the score.
WRITE UP (200 POINTS)
1. First Steps (40 points)
Describe the ames_train data set so that I am convinced you understand it.
Use my shell code as a start to explore the data. Apply your creativity and go from there.
If you know how to do pivot tables in Excel, it is a great tool for Exploratory Data Analysis (EDA).
EDA was well established by John Tukey. He was a great advocate for it and developed much of what we do today.
Knowing your data typically consists of three components: (a) a data survey, (b) a data quality check, and (c) an initial exploratory data analysis.
(a) A Data Survey
- Take a broad overview of the Ames housing data set. Read over the data documentation. What data do you have, and what is it supposed to represent? – In the linear regression component of this course you build linear regression models to predict the value of a property (single family home). Do you have the right data to properly address the problem? Are there observations in the data that should be excluded?
- What kinds of problems can you properly address given the data that you have? In particular if you were to build a regression model with the variable SalePrice as the response variable, what types of properties would you be valuing? Be careful about what you are doing here.
(b) Define the Sample Population
- When building statistical models you have to define the population of interest, and then sample from THAT population. Frequently you will not actively perform the sampling function. Instead, the data will be made available and you will have to sample from it retrospectively, i.e. you will need to carve out the population of interest. In this assignment the objective of is to be able to provide estimates of home values for ‘typical’ homes in Ames, Iowa. You may not be able to define what ‘typical’ is, but can use the data to find out what is atypical. Any values which are not atypical are then considered to be typical.
- Define the appropriate sample population for your statistical problem. Hint: You are building regression models for the response variable SalePrice. Are all properties the same? Would you want to include an apartment building in the same sample as a single family residence? Would you want to include a warehouse or a shopping center in the same sample as a single family residence? Would you want to include condominiums in the same sample as a single family residence? – Define your sample using ‘drop conditions’. Create for the drop conditions and include it in your report so that it is clear to any reader what you are excluding from the data set when defining your sample population.
The definition of your sample data should be clearly noted in your assignment report.
(c) A Data Quality Check
- In practice your data will not be ‘clean’. You will need to examine your data for errors and outliers. Errors will not always show as outliers, and outliers are not necessarily errors.
- If you have a data dictionary that states the set of proper values for each field, then you will want to check your data against the data dictionary.
- If you do not have a data dictionary, then you will need to reason and explore your way to a proper data set.
Example 1: In this project you will be modeling the sales price of housing transactions. It should be obvious that none of these sales prices should be zero or negative. Observations with a zero or negative sales price should logically be considered to be errors.
Example 2: Suppose we had a ‘small’ number of housing transactions with a sale price over one million dollars, should we consider these sales prices to be valid? In this case these values could be valid data points, which would make them outliers, or they could be errors, such as 140,000.00 entered as 1,400,000. In either case they are not relevant data points if the objective is to model the ‘typical’ home price for the area.
2. EDA (30 Points)
Pick ten variables from the data quality check to explore in your initial exploratory data analysis. Perform an initial exploratory data analysis. How do you perform an exploratory data analysis for continuous versus discrete (or categorical) data?
Consider the use of scatterplots, scatterplot smoothers such as LOESS, and boxplots to produce relevant graphics when appropriate.
Note that you are particularly interested in the relationships between the response variable and the predictor variables.
Suggest you split your EDA into two sections in your report – one section for continuous variables and one section for discrete variables.
3. BUILD MODELS (100 Points)
Build at least four different LINEAR REGRESSION models.
The first model should be a simple (single prediction variable) model. Find the best single variable model.
The next model should be a multiple regression model with two predictor variables. Find the best two variable model.
You do not need to build more complex models for this assignment. More complex models will be the topic for hw02.
Show all of your models and the statistical significance of the input variables. Discuss the quality of fit, R squared and adjusted R squared, parsimony and anything else you can think of that might be of value to share.
Discuss the coefficients in the model you select, do they make sense? Are you keeping the model even though it is counter intuitive?
4. SELECT MODELS (20 Points)
Decide on the criteria for selecting the “Best Model”. Will you use a metric such as Adjusted R-Square or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the metrics in a table to display the results.
WRITE MODEL FORMULA (10 Points)
Write a mathematical formula that will show the model you selected. Explain your formula.
Make sure you include this as a section in your report. Do not expect that I will search your report to find it. This step should allow someone else to deploy your model.
The variable with the predicted saleprice should be named:
SCORED DATA FILE (100 POINTS)
Use the python model that you selected. Score the data file ames_test_sfam. Overall scoring for your model is based on providing a prediction for every record in the test data. Make sure you have not deleted any records in the test data and that none of your predictions are out of range. Create a file that has only TWO variables for each record:
The first variable, index, will allow me to match my grading key to your predicted value. If I cannot do this, you won’t get a grade. So please include this value. The second value, p_saleprice is the predicted price for a property per your model.
Your values will be compared against …
- A Perfect Model
- Shell Code Example Model
- Performance of Other Students
- Predict the Average value for everybody (MEAN)
If your model is not better than simply using an AVERAGE value, you will lose points.