Preliminary Information
In both courseworks you will be analysing the same dataset.
Do not model your answer on the workshop material. The objective of the workshops is to introduce
you to different data mining tasks discussed in lectures, and not to give you a roadmap on how to
answer the coursework. Therefore if you simply reproduce the steps in the workshops you are very
likely to make serious mistakes.
In both courseworks you will be assessed on your understanding of the data mining process, your ability
to use correctly the tools that we covered in the course, and the ability to draw correct conclusions
from what you observe. You will not be assessed on your ability to use R or any other software.
Therefore, don't include information about the commands you used, the options you set, how to draw a
figure, etc. You will simply be wasting valuable space.
You are free to use any software to do the coursework. However, you can't use as an excuse the fact that
you couldn't do a particular task because the software you chose doesn't offer a particular capability
which we covered in the workshops.
The page limit for this report is 8 pages, using a typeface of at least 11 points. This limit is strict and it
includes appendices (which I strongly recommend that you don't use). Standard penalties apply for
exceeding this limit.
Please pay particular attention to the disclaimer at the end of the assignment that gives more details
about the assessment of your report.
This is an individual piece of assessment, and you should ensure that your report reflects your own
work exclusively. All reports go through automated software to detect plagiarism from a variety of
sources (including past and current students' reports as well as online resources, conference and journal
publications, etc.). The consequences of plagiarism are very serious.
Description of the Problem and the Data
A bank wants to develop a credit scoring model to classify applications for mortgages. You are provided
with a sample of 2000 observations (past customers). Table 1 provides a description of the variables at your
disposal. The target variable is named Good and indicates whether a customer proved to be a good
customer (Good = 1) or a bad customer (Good = 0). A bad customer is defined as someone who has missed
three or more payments during the first year of the mortgage.
Tasks
Based on the project description and the distribution of the target variable, what are the implications
for building and assessing classification models for this problem? (10 marks)
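To make the role of the target distribution concrete, the following is a minimal sketch that computes class proportions and the accuracy of a rule that always predicts the most frequent class. The labels are made up for illustration and are not the coursework data, and the function names are hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label in the sample."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_proportions(labels).values())


# Made-up target vector (1 = good, 0 = bad); NOT the coursework data.
toy_good = [1] * 9 + [0] * 1
props = class_proportions(toy_good)
baseline = majority_baseline_accuracy(toy_good)
print(props, baseline)
```

When one class dominates, this baseline can reach a high accuracy without identifying a single bad customer, which is one reason overall accuracy on its own may be a poor guide in a problem of this kind.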
Use visualisation tools and appropriate statistical measures covered in the course (i) to perform a
preliminary data analysis (answering questions about data quality such as outliers, missing values, etc.)
and (ii) to quantify how relevant each variable is for the classification problem at hand. (30 marks)
Certain variables in the dataset contain missing values. Is this relevant for your task, and how would
you treat these? Use data analysis tools such as visualisation and statistical measures to obtain insights
about the properties of the missing values and about what would be a sensible way to treat them. You can make use of a
logistic regression classifier to answer this question. (Base your conclusions and recommendations on
properties of the data, and more generally your findings, rather than generic arguments.) (30 marks)
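One possible way to probe the properties of missing values, sketched here under assumptions, is to compare the proportion of bad customers between rows where a variable is missing and rows where it is observed. The values below are made up and are not the coursework data; the function name and the use of None to mark a missing entry are choices made for this illustration only.

```python
def bad_rate_by_missingness(values, good):
    """Compare the share of bad customers (good == 0) between rows where
    a variable is missing (None) and rows where it is observed."""
    groups = {"missing": [], "observed": []}
    for value, label in zip(values, good):
        key = "missing" if value is None else "observed"
        groups[key].append(label)
    rates = {}
    for key, labels in groups.items():
        # The bad rate is undefined for an empty group, so report None.
        rates[key] = (labels.count(0) / len(labels)) if labels else None
    return rates


# Made-up values; None marks a missing entry. NOT the coursework data.
toy_income = [30000, None, 45000, None, 52000, 38000]
toy_good = [1, 0, 1, 0, 1, 0]
rates = bad_rate_by_missingness(toy_income, toy_good)
print(rates)
```

A marked difference between the two rates would suggest that missingness is related to the target, in which case deleting incomplete rows or filling in a single typical value could discard useful information. Whether any such difference is meaningful in a given sample still has to be judged from the data.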
Develop a logistic regression classifier that you think is appropriate for this dataset. Your discussion
needs to show evidence of tackling issues such as the indicative questions listed below (this is an
indicative and not an exhaustive list):
Consider different ways to handle missing values and assess their implications. What seems to be
the better way of handling these and what are the implications? What did you learn through this
process, and how can you relate this to your previous findings?
Which variables are important for this problem and how do your findings compare with the
expectations you formed during the preliminary data analysis?
Explain carefully how you assessed models to evaluate their suitability and how this process led
you to revise / improve your recommendations.
Explain what the final model you develop actually implies for the problem at hand.
(30 marks)
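As one illustration of assessing a classifier beyond overall accuracy, the sketch below counts correct and incorrect classifications at a chosen cut-off and reports the share of each class that is classified correctly. The scores stand in for predicted probabilities of being a good customer; they and the labels are made up, and the function names and cut-off values are assumptions of this example.

```python
def confusion_counts(actual, scores, threshold=0.5):
    """Count TP, FP, TN, FN, treating 'good' (1) as the positive class.
    A case is predicted good when its score is >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(actual, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


def class_rates(counts):
    """Share of good and of bad customers that are classified correctly."""
    good_total = counts["tp"] + counts["fn"]
    bad_total = counts["tn"] + counts["fp"]
    return {
        "good_correct": counts["tp"] / good_total if good_total else None,
        "bad_correct": counts["tn"] / bad_total if bad_total else None,
    }


# Made-up labels and scores; NOT the coursework data.
toy_actual = [1, 1, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.6]
rates_050 = class_rates(confusion_counts(toy_actual, toy_scores, 0.5))
rates_075 = class_rates(confusion_counts(toy_actual, toy_scores, 0.75))
print(rates_050, rates_075)
```

Raising the cut-off in this toy example classifies more bad customers correctly at the expense of good ones. Which trade-off is preferable depends on the relative cost of each type of error for the bank, which the code cannot decide.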
Report Assessment
Your coursework will not be evaluated on the quality of the final logistic regression alone, or on whether you
got a particular answer right. You will be assessed primarily on whether you are able to correctly justify
the steps you took to complete the assignment. In other words, your report needs to document that you are
able to intelligently analyse the provided data, that you draw correct conclusions from your observations, and
that these conclusions lead you either to the next logical step of the data mining process, or to the revision
of decisions made in previous steps of the analysis. (Refer to the flowchart of data mining stages we covered
in the first lectures, and in particular to the feedback loops.)
Therefore, don't simply present the conclusions/results of your analysis and expect to get a high mark.
Reports that don't document the steps followed and the reasons why these were chosen will receive minimal
marks, even if the final answer is sensible. Explain your reasoning clearly and in good English. Don't provide
a list of bullet points, unstructured sentences, etc. Similarly, don't include figures or any other output
from R that you don't comment on or explain in the text. I will not assume that you know how to interpret these
correctly.
Variable                 Description
Good                     Target variable. 1: Good customer; 0: Bad customer
Income                   Annual Gross Income
Amount                   Amount of requested loan
Installment Percentage   Installment as percentage of monthly earnings
Applications             Applications for credit over past year
Loans                    Number of existing loans
Credit Cards             Credit cards currently held
Payments                 Missed or Delayed Payments in last 5 years: None / Delayed / Missed
Age                      (in years)
Marital Status           Married / Single / Divorced
Employment               Other / Self Employment / Part time / Full time
Time at Employment       (in years)
Residential Status       Rent / Own / Other
Time at Address          (in years)
Repayment method         Non-Automated / Automated
Area indicator           Location of branch receiving application
Table 1: Data Description