Semester 2, 2021
Lecture 5: Data quality and pre-processing
Why is pre-processing needed?
Example entries: Date of Birth recorded as "20 years ago" or "13th Feb. 2019"; a name recorded as "Mike___Moore"
Measuring data quality
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable
Consistency: e.g. discrepancies in representation
Timeliness: updated in a timely way
Believability: do I trust the data is true?
Interpretability: how easily can I understand the data?
Inconsistent data
Different naming representations (Melbourne University versus University of Melbourne) or (three versus 3)
Different date formats (3/4/2016 versus 3rd April 2016)
Age=20, Birthdate=1/1/1971
Two students with the same student id
Outliers (e.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999)
No good if it is a list of ages of hospital patients
Might be OK for a list of people's numbers of contacts on LinkedIn, though
Can use automated techniques, but also need domain knowledge
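One possible automated technique, sketched here on the list from the slide, is the interquartile-range (Tukey fence) rule. The rule, the function name `iqr_outliers` and the multiplier `k=1.5` are assumptions chosen for illustration, not something the lecture prescribes. Whether a flagged value is really an error still depends on the domain.

```python
import statistics

# The example values from the slide.
values = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]

def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # "inclusive" treats the data as the whole population of interest.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

flagged = iqr_outliers(values)
print(flagged)  # candidates for review, not automatic deletions
```

The rule only nominates candidates: 999 is implausible as a patient age but could be a legitimate LinkedIn contact count.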
Missing or incomplete data
Lacking feature values
Name=
Age=null
Some types of missing data (Rubin 1976)
Missing completely at random
Missing at random
Missing not at random
Missing completely at random (MCAR)
Missing data are MCAR when the probability of missing data on a variable is unrelated to any other measured variable and is unrelated to the variable with missing values itself.
Missing at random (MAR)
Missing data are MAR when the probability of missing data on a variable is related to other fully measured variables.
Missing not at random (MNAR)
Data are MNAR when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.
For example, data are MNAR when IQ values are missing only for the people with low IQ.
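The three mechanisms can be contrasted with a small simulation. This is a sketch under stated assumptions: the records, the probabilities, the age cut-off of 60 and the IQ cut-off of 90 are all invented for illustration, and age and IQ are generated independently.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical records: age is always observed, IQ may go missing.
people = [(random.randint(18, 80), random.gauss(100, 15)) for _ in range(1000)]

def mcar(age, iq):
    # MCAR: missingness ignores both age and IQ.
    return None if random.random() < 0.3 else iq

def mar(age, iq):
    # MAR: missingness depends only on the fully observed variable (age).
    p = 0.5 if age > 60 else 0.1
    return None if random.random() < p else iq

def mnar(age, iq):
    # MNAR: missingness depends on the IQ value itself (low values vanish).
    return None if iq < 90 else iq

def observed_mean(mechanism):
    kept = [v for v in (mechanism(a, q) for a, q in people) if v is not None]
    return statistics.mean(kept)

true_mean = statistics.mean(q for _, q in people)
mcar_mean = observed_mean(mcar)
mar_mean = observed_mean(mar)
mnar_mean = observed_mean(mnar)
print(true_mean, mcar_mean, mar_mean, mnar_mean)
```

Under MNAR the mean of the observed IQ values is pushed upwards, because only low values are removed. Under MCAR the observed mean stays close to the true mean. In this particular simulation MAR also leaves the mean largely unchanged, but only because age and IQ were generated independently.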
Too much missing data!
Simple Missing-Data Strategies
Simple strategies to retain all data instances/records
Statistical measurement imputation: mean, median, mode
Simple Strategies with sklearn
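A minimal sketch of mean, median and mode imputation with scikit-learn's `SimpleImputer`. The `ages` column is hypothetical data invented for this example, and this is not necessarily the code shown in the lecture.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column of ages; np.nan marks missing entries.
ages = np.array([[20.0], [22.0], [np.nan], [22.0], [30.0], [np.nan]])

imputed = {}
for strategy in ("mean", "median", "most_frequent"):
    # "most_frequent" is scikit-learn's name for mode imputation.
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputed[strategy] = imputer.fit_transform(ages).ravel()

for strategy, column in imputed.items():
    print(strategy, column)
```

Every record is retained: each missing entry is replaced by the statistic computed from the observed entries of the same column.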
Pro: easy to compute and administer; no loss of records
Con: biases other statistical measurements (variance, standard deviation)
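The bias can be seen in a small worked example. The numbers below are made up for illustration: four observed values and two missing ones filled with the observed mean.

```python
import statistics

# Hypothetical feature with four observed values and two missing ones.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 2

# Mean imputation: every missing entry receives the observed mean.
fill = statistics.mean(observed)
after_imputation = observed + [fill] * n_missing

variance_before = statistics.variance(observed)
variance_after = statistics.variance(after_imputation)
print(variance_before, variance_after)
```

The imputed values sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the record count. The sample variance, and with it the standard deviation, shrinks.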
Multivariate Strategies
Multivariate (more than one variable)
1. Logical rules
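One reading of a logical rule is that a missing value can be derived from another variable in the same record. The sketch below fills a missing age from the birthdate. The record layout, the helper `age_on` and the reference date are assumptions made for this example.

```python
from datetime import date

# Assumed reference date for the calculation (hypothetical).
REFERENCE = date(2021, 8, 1)

def age_on(birthdate, today):
    """Whole years between birthdate and today."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)

# Hypothetical records; None marks a missing age.
records = [
    {"birthdate": date(1971, 1, 1), "age": None},
    {"birthdate": date(2001, 9, 15), "age": None},
    {"birthdate": date(1990, 5, 20), "age": 31},
]

for record in records:
    if record["age"] is None:
        # Logical rule: age is determined by birthdate and reference date.
        record["age"] = age_on(record["birthdate"], REFERENCE)

ages = [record["age"] for record in records]
print(ages)
```

The same rule can also be used as a consistency check, for example to flag a record with Age=20 and Birthdate=1/1/1971.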
Disguised missing data
Everyone's birthday is January 1st?
Email address is
Adriaans and Zantinge
Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.
How to handle
Look for unusual or suspicious values in the dataset, using knowledge about the domain
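One way to start looking is to count how often each value occurs and flag any value that is implausibly common. In this sketch the function name `suspicious_values`, the 30% threshold and the sample birthdays are all assumptions. A sensible threshold depends on domain knowledge.

```python
from collections import Counter

def suspicious_values(values, threshold=0.3):
    """Return values whose share of all entries exceeds the threshold."""
    counts = Counter(values)
    total = len(values)
    return [v for v, c in counts.items() if c / total > threshold]

# Hypothetical birthdays as month-day strings; "01-01" is a likely default.
birthdays = ["01-01", "01-01", "01-01", "03-14",
             "07-22", "01-01", "11-05", "01-01"]

flagged = suspicious_values(birthdays)
print(flagged)
```

A flagged value is only a candidate for disguised missing data. It still needs to be checked against what is plausible in the domain.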
Major data preprocessing activities
Source: Data Mining: Concepts and Techniques, Han et al., 2012
Data cleaning process
Many tools exist (Google Refine, Kettle, Talend, ...)
Data scrubbing
Data discrepancy detection
Data auditing
ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
Our emphasis will be to understand some of the methods employed by typical tools
Domain knowledge is important