STAT0021: Assessment 4 Instructions
Term 1, 2023-24
1 Introduction
Please read and understand these instructions before you begin the assessment.
Assessment 4 will begin with the release of these instructions on the STAT0021 course Moodle page within the “Assessment 4 – Individual Coursework – Term 1” section at 1pm on Wednesday 13th December 2023.
The intention of the assessment is for you to apply the techniques you have learned during the course to a real-world dataset made up of a number of variables (percentage of the population double vaccinated for COVID-19, median household income, median house price, etc.) measured for subregions of London.
A copy of the data to be analysed is available as an Excel spreadsheet on the course Moodle page within the “Assessment 4 – Individual Coursework – Term 1” section.
Assessment 4 makes up 50% of your module mark for STAT0021.
2 Data
The data are real measurements for subregions (Middle Layer Super Output Areas, or MSOAs) of London. In total, there are 11 variables recorded for 982 observations. Vaccination data were recorded in December 2021. Demographic data are accurate as of the 2011 census, but can be treated as contemporaneous with the vaccination data.
Variable name | Description
ID | A unique identifying number assigned to each observation.
VaxPercent | The percentage of the population who have received at least two COVID-19 vaccination doses.
Political | An indicator of the political group which controls the borough in which the subregion is located. 0: Conservative; 1: Labour; 2: Other (Liberal Democrat or no majority party).
PopDensity | The population density (people/km²).
Over65 | The percentage of the population who are aged 65 or over.
Obesity | The percentage of the population who are classified as obese (BMI ≥ 30).
PostALevel | The percentage of the population who have a qualification above A-Level (e.g. a university degree or similar vocational qualification).
Unemployment | The percentage of the population who are unemployed.
HHBenefit | The percentage of the population living in households reliant upon means-tested benefits.
MedHHInc | The median household income.
MedianHP | The median house price.
3 Submission structure
You should structure your analysis and subsequent write-up according to the headings below.
3.1 Exploratory data analysis
The first step in any data analysis is to explore the data to get a sense of what the variables represent and the potential for relationships between them.
Your submission should include three separate, distinct exploratory analyses, each of which contains all of:
A. The results of a single numerical calculation (e.g. a summary statistic or the results of a hypothesis test).
B. A single figure (generally containing a single plot, but potentially containing up to three related plots).
C. A discussion of what your numerical result and figure tell us about London and a justification of why this information is interesting.
Note that:
1. Each of your exploratory analyses will be marked out of 6 marks (for a total of 3×6=18 marks overall for this section of the assessment).
2. Marks will be awarded for the degree of insight shown in each part of the analysis. A numerical result and/or plot which is not discussed will receive a poor mark – a large proportion of the marks will be awarded based upon the degree to which your discussion correctly interprets your results and justifies why they are insightful.
3. Variety across the three analyses will be rewarded. For example, submissions which repeat the same analysis and discussion for three sets of variables fail to show a breadth of understanding and will receive a poor mark.
4. None of your three discussions should include VaxPercent, as this is the focus of the later parts of the assessment.
5. You are free to transform and/or combine variables, and to identify and potentially remove any outliers from the data. Any such decisions should be justified in your discussion.
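As one illustration of note 5, the sketch below flags unusual observations using the common 1.5 × IQR rule. It is only a sketch under stated assumptions: the values and IDs are made up and are not taken from the assessment data, the choice of rule is an assumption rather than a course requirement, and Stata may use a slightly different quartile convention from Python's `statistics.quantiles`.

```python
import statistics

# Hypothetical toy values keyed by observation ID (NOT the assessment data).
toy_values = {1: 2, 2: 4, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 30}

# Lower quartile, median and upper quartile ("inclusive" convention;
# other software may interpolate slightly differently).
q1, median, q3 = statistics.quantiles(toy_values.values(), n=4, method="inclusive")

# Fences for the 1.5 x IQR rule (an assumed rule of thumb, not a formal test).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# IDs of observations falling outside the fences.
flagged_ids = [
    obs_id
    for obs_id, value in toy_values.items()
    if value < lower_fence or value > upper_fence
]

print("Quartiles:", q1, median, q3)
print("Fences:", lower_fence, upper_fence)
print("Flagged IDs:", flagged_ids)
```

A flagged observation is only a candidate outlier. Whether to remove it still needs the justification that note 5 asks for.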
3.2 Simple linear regression
VaxPercent is the focus of this part of the task. How can the other variables be used to explain the variability in VaxPercent via simple linear regression?
Your submission should include:
A. Justification of which variable can be used as a covariate to produce the best simple linear regression model for the outcome VaxPercent.
B. An interpretation of the estimated model coefficients for your best simple linear regression model.
C. Comments on the fit of your best simple linear regression model.
D. A plot of VaxPercent against the covariate in your best simple linear regression model with the accompanying regression line.
Note that:
1. This component of your submission will be marked out of 9 marks.
2. There is not a specific definition of the “best” model. It is likely to be based both upon how well the model fits the data and how well the assumptions underlying simple linear regression are satisfied (quantitative and qualitative evidence). Include in your justification why you would categorise your model as being the best and the steps you took to arrive at this best model.
3. Your model can include a variable which is not present in the original dataset, but which has been obtained via a transformation or combination of variables in the original dataset. You should not bring in external data. You should provide a justification of why any new variable is useful/interesting if you haven’t already given an explanation earlier in your submission.
4. You should support your justification, interpretation and comments with suitable Stata output.
5. If there are any particularly unusual observations identifiable as a result of your analysis, you should mention them using their ID and justify why you do or do not believe them to be outliers. If you believe them to be outliers, then you can exclude them when fitting your model.
Your submission should also include:
E. The lower quartile, median, and upper quartile values for the covariate in your best simple linear regression model.
F. A mathematical equation to indicate how your best simple linear regression model can be used to make predictions of VaxPercent.
G. Predictions of the value of VaxPercent when the covariate in your best simple linear regression model takes its lower quartile, median, and upper quartile values.
Note that:
6. This component of your submission will be marked out of 3 marks.
7. If your best model includes variable x as the covariate, you should use Stata to calculate the lower quartile, median, and upper quartile values of x. Then, calculate the corresponding predicted values of VaxPercent according to your best model.
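The calculation described in items E to G and note 7 can be sketched as follows. This is a minimal illustration, not a model answer: the numbers are invented (not drawn from the assessment data), `x` stands for a hypothetical covariate and `y` for VaxPercent, and the quartile convention used by `statistics.quantiles` may differ slightly from Stata's.

```python
import statistics

# Hypothetical toy data (NOT the assessment data):
# x is some covariate, y plays the role of VaxPercent.
x = [10, 20, 30, 40, 50]
y = [52, 58, 61, 69, 70]

x_bar = statistics.fmean(x)
y_bar = statistics.fmean(y)

# Least-squares estimates: slope = Sxy / Sxx, intercept = y_bar - slope * x_bar.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar


def predict(x_value):
    """Predicted outcome from the fitted line: intercept + slope * x."""
    return intercept + slope * x_value


# R-squared as one quantitative measure of fit.
ss_res = sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Lower quartile, median and upper quartile of the covariate,
# followed by the prediction at each of them.
quartiles = statistics.quantiles(x, n=4, method="inclusive")
predictions = [predict(q) for q in quartiles]

print("slope:", slope, "intercept:", intercept, "R^2:", r_squared)
for q, p in zip(quartiles, predictions):
    print("x =", q, "-> predicted y =", p)
```

The same three steps (fit, find the quartiles of the covariate, substitute them into the fitted equation) apply whichever software produces the estimates.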
3.3 Multiple linear regression
VaxPercent is again the focus of this part of the task. How can the other variables be used to explain the variability in VaxPercent via multiple linear regression?
Your submission should include:
A. Justification of which variables can be used as covariates to produce the best multiple linear regression model for the outcome VaxPercent.
B. An interpretation of the estimated model coefficients for your best multiple linear regression model.
C. Comments on the fit of your best multiple linear regression model.
Note that:
1. This component of the assessment will be marked out of 9 marks.
2. There is not a specific definition of the “best” model. It is likely to be based both upon how well the model fits the data and how well the assumptions underlying multiple linear regression are satisfied (quantitative and qualitative evidence). Include in your justification why you would categorise your model as being the best and the steps you took to arrive at this best model.
3. Your model can include variables which are not present in the original dataset, but which are obtained via a transformation or combination of variables in the original dataset. You should not bring in external data. You should provide a justification of why your new variables are useful/interesting if you haven’t already given an explanation earlier in your submission.
4. You should support your justification, interpretation and comments with suitable Stata output.
5. If there are any particularly unusual observations identifiable as a result of your analysis, you should mention them using their ID and justify why you do or do not believe them to be outliers. If you believe them to be outliers, then you can exclude them when fitting your model.
Your submission should also include:
D. The lower quartile, median, and upper quartile values for each covariate in your best multiple linear regression model.
E. A mathematical equation to indicate how your best multiple linear regression model can be used to make predictions of VaxPercent.
F. Predictions of the value of VaxPercent when the covariates in your best multiple linear regression model jointly take their lower quartile, median, and upper quartile values.
Note that:
6. This component of the assessment will be marked out of 3 marks.
7. If your best model includes variables x1, x2, … as the covariates, you should use Stata to calculate the lower quartile, median, and upper quartile values of x1, x2, … . Then, calculate the corresponding predicted values of VaxPercent according to your best model. That is, you should submit three predicted values: one for when all of your covariates take their lower quartile values, one for when they all take their median values, and one for when they all take their upper quartile values.
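In generic notation, the prediction equation and the three requested predictions can be written as below. The symbols are introduced here for illustration and are not taken from the course materials: the hatted betas denote estimated coefficients, p is the number of covariates in your model, and Q_1, Q_2, Q_3 denote the lower quartile, median, and upper quartile of a covariate.

```latex
\[
\widehat{\text{VaxPercent}}
  = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_p x_p
\]
\[
\widehat{\text{VaxPercent}}^{(k)}
  = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j \, Q_k(x_j),
  \qquad k = 1, 2, 3
\]
```

Each of the three predictions uses the same fitted coefficients; only the covariate values substituted into the equation change.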
3.4 Linear regression with a factor variable as a covariate
Linear regression is useful for understanding how continuous variables influence other continuous variables. There may be occasions when we would like to understand how categorical variables, also referred to as factor variables, influence a continuous variable. Careful consideration of a factor variable can allow for its inclusion as a covariate within a linear regression. While linear regression with a factor variable as a covariate isn’t taught as part of STAT0021, you should be able to extend your knowledge of linear regression from STAT0021 to understand the basics of linear regression with a factor variable as a covariate through a small amount of research.
VaxPercent is the outcome of interest, with the link to Political being the aim of the investigation.
Your submission should include:
A. Stata output including the results of an appropriate test taught as part of STAT0021 to determine whether the mean value of VaxPercent differs according to the levels of Political.
B. An interpretation of those test results.
C. A suitable plot to compare VaxPercent and Political.
D. Plot (or plots) necessary to verify whether the assumptions of your test are satisfied.
Note that:
1. This component of the assessment will be marked out of 3 marks.
Your submission should also include:
E. Stata output including the results of a linear regression model for VaxPercent using Political treated as a factor variable as the covariate.
F. An interpretation of the estimated model coefficients from that linear regression model.
Note that:
2. This component of the assessment will be marked out of 3 marks.
Your submission should also include:
G. A mathematical equation to indicate how this regression model can be used to make predictions of VaxPercent.
H. Predicted values of VaxPercent when Political takes each of its three different levels.
Note that:
3. This component of the assessment will be marked out of 3 marks.
Your submission should also include:
I. Discussion of the benefits of building a linear regression model using Political as a factor variable covariate in contrast to the drawbacks of a linear regression model using Political as a continuous variable covariate when wishing to determine how VaxPercent varies with Political.
Note that:
4. This component of the assessment will be marked out of 3 marks.
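The contrast raised in item I can be illustrated with a small sketch. It relies on one standard fact: with an intercept and one indicator variable per non-baseline level, the least-squares fitted values are the group means. Everything else is an assumption made for illustration: the numbers are invented (not the assessment data), and level 0 is arbitrarily taken as the baseline.

```python
from collections import defaultdict
import statistics

# Hypothetical toy data (NOT the assessment data).
political = [0, 0, 0, 1, 1, 2, 2]
vax = [60, 62, 64, 55, 57, 66, 70]

# --- Political as a factor variable -------------------------------------
# Collect the outcome values for each level and take the group means.
groups = defaultdict(list)
for level, value in zip(political, vax):
    groups[level].append(value)
group_means = {level: statistics.fmean(vals) for level, vals in sorted(groups.items())}

# Intercept = baseline group mean; each other coefficient is the
# difference between that level's mean and the baseline mean.
baseline = 0
factor_intercept = group_means[baseline]
factor_effects = {
    level: mean - factor_intercept
    for level, mean in group_means.items()
    if level != baseline
}
factor_predictions = {
    level: factor_intercept + factor_effects.get(level, 0.0)
    for level in group_means
}

# --- Political as a continuous covariate --------------------------------
# A single slope forces the step from 0 to 1 to equal the step from 1 to 2.
x_bar = statistics.fmean(political)
y_bar = statistics.fmean(vax)
sxx = sum((xi - x_bar) ** 2 for xi in political)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(political, vax))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
continuous_predictions = {level: intercept + slope * level for level in group_means}

print("Factor coding predictions:    ", factor_predictions)
print("Continuous coding predictions:", continuous_predictions)
```

In this toy example the factor coding reproduces the three group means exactly, whereas the continuous coding imposes an ordering and equal spacing on the arbitrary codes 0, 1 and 2.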
3.5 General marks
6 marks are available to submissions which:
A. Are clear, well-written and formatted; with plots and Stata output adequately sized and labelled; and which correctly follow the submission format instructions.
4 Submission details
4.1 Submission format
You should submit a single file, saved as a pdf and named as “Assessment 4 [your student number]”. For example, if your student number is 22000000 then your submission should be a single pdf file named “Assessment 4 22000000”.
Your submission should also include within it your student number, but should not contain your name.
4.2 Submission length
Your submission should be made up of no more than:
• Five A4 pages of discussions which cover all of the requirements outlined in the previous section, with a font size no smaller than 10 points.
• Ten pages of Stata output (as screenshots) and other relevant figures. Each figure should have a number by which it is referred to in your discussions. Figures should be of a suitable size and quality to be easily interpretable.
• One page, if necessary, of references to journal articles, books, websites, AI tools, etc.
Requesting that the discussions and figures be separated in this way may seem unusual, but it is done to stress both that enormous amounts of writing are not expected for this assessment and that carefully chosen figures can be just as useful as (or even more useful than) a greater volume of text. The permitted length is an upper limit, not a guide for how much you are expected to submit. If you can clearly explain your thoughts more concisely, then a shorter submission will not automatically be marked lower.
Any submission which is over the permitted length will suffer a penalty of 10 percentage points, although any such penalty will not reduce a mark below the pass mark of 40%.
4.3 Submission procedure and deadline
You must complete your submission via the “Assessment 4 – Individual Coursework – Term 1” section of the STAT0021 course Moodle page before the deadline of 1pm on Wednesday 17th January 2024.
There are standard non-negotiable penalties for late submissions which you can read about in the UCL Academic Manual. Any extension to the deadline can only be granted where a student has a Summary of Reasonable Adjustments (SoRA) or has successfully claimed extenuating circumstances. Extenuating circumstances are handled by your parent department and not by the teaching department.
4.4 Stata
Throughout the information above on the expected submission structure it is mentioned that you should include supporting evidence from Stata. This is referred to because use of Stata has been taught as part of STAT0021. If you would prefer to make use of other software to perform the analyses, and you believe that you can obtain results just as good as those produced by Stata, then you are free to do so. If you are considering this, you are strongly encouraged to contact the course lecturer at [email protected] to discuss your decision.