Problems to Solutions Assessing Feasibility ABT Design Designing & Implementing Features Summary
Fundamentals of Machine Learning for Predictive Data Analytics
Chapter 2: Data to Insights to Decisions
John Kelleher, Brian Mac Namee, and Aoife D'Arcy
Converting Business Problems into Analytics Solutions
Case Study: Motor Insurance Fraud
Assessing Feasibility
Case Study: Motor Insurance Fraud
Designing the Analytics Base Table
Case Study: Motor Insurance Fraud
Designing & Implementing Features
Different Types of Data
Different Types of Features
Handling Time
Legal Issues
Implementing Features
Case Study: Motor Insurance Fraud
Converting Business Problems into Analytics Solutions
Converting a business problem into an analytics solution involves answering the following key questions:
1 What is the business problem?
2 What are the goals that the business wants to achieve?
3 How does the business currently work?
4 In what ways could a predictive analytics model help to address the business problem?
Case Study: Motor Insurance Fraud
In spite of having a fraud investigation team that investigates up to 30% of all claims made, a motor insurance company is still losing too much money due to fraudulent claims.
What predictive analytics solutions could be proposed to help address this business problem?
Potential analytics solutions include:
Claim prediction
Member prediction
Application prediction
Payment prediction
Assessing Feasibility
Evaluating the feasibility of a proposed analytics solution involves considering the following questions:
1 Is the data required by the solution available, or could it be made available?
2 What is the capacity of the business to utilize the insights that the analytics solution will provide?
What are the data and capacity requirements for the proposed Claim Prediction analytics solution for the motor insurance fraud scenario?
Case Study: Motor Insurance Fraud
[Claim prediction]
Data Requirements: A large collection of historical claims marked as fraudulent and non-fraudulent. Also, the details of each claim, the related policy, and the related claimant would need to be available.
Capacity Requirements: The main requirement is that a mechanism could be put in place to inform claims investigators that some claims were prioritized above others. This would also require that information about claims become available in a suitably timely manner so that the claims investigation process would not be delayed by the model.
Designing the Analytics Base Table
The basic structure in which we capture historical datasets is the analytics base table (ABT).
Figure: The general structure of an analytics base table: descriptive features and a target feature.
Figure: The different data sources typically combined to create an analytics base table.
The prediction subject defines the basic level at which predictions are made, and each row in the ABT will represent one instance of the prediction subject; the phrase one-row-per-subject is often used to describe this structure.
Each row in an ABT is composed of a set of descriptive features and a target feature.
Defining features can be difficult!
A good way to define features is to identify the key domain concepts and then to base the features on these concepts.
Figure: The hierarchical relationship between an analytics solution, domain concepts, and descriptive features. (Levels shown: Analytics Solution; Domain Concepts and a Target Concept; Domain Subconcepts; Features and a Target Feature.)
There are a number of general domain concepts that are often useful:
Prediction Subject Details
Demographics
Changes in Usage
Special Usage
Lifecycle Phase
Network Links
Motor Insurance Claim Fraud Prediction
  Policy Details
  Claim Details
  Claimant History
    Claim Types
    Claim Frequency
  Claimant Links
    Links with Other Claims
    Links with Current Claim
  Claimant Demographics
  Fraud Outcome
Figure: Example domain concepts for a motor insurance fraud claim prediction analytics solution.
Designing & Implementing Features
Three key data considerations are particularly important when we are designing features.
Data availability
Timing
Longevity
Different Types of Data
Column values, top to bottom:
ID: 0034; 0175; 0456; 0687; 0982; 1103
NAME: Brian; Mary; Sinead; Paul; Donald; Agnes
DATE OF BIRTH: 22/05/78; 04/06/45; 29/02/82; 11/11/67; 01/12/75; 17/09/76
GENDER and CREDIT RATING: male aa; female c; female b; male b; female aa
COUNTRY: ireland; france; ireland; usa; australia; sweden
SALARY: 67,000; 65,000; 112,000; 34,000; 88,000
Figure: Sample descriptive feature data illustrating numeric, binary, ordinal, interval, categorical, and textual types.
Different Types of Features
The features in an ABT can be of two types:
raw features
derived features
There are a number of common derived feature types:
Aggregates
Flags
Ratios
Mappings
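As a rough illustration of the four derived feature types listed above, the sketch below computes one of each from a small set of raw values. The field names, the postcode-to-region lookup, and the values are all hypothetical; they are not taken from the case study data.

```python
# Hypothetical raw data for one prediction subject (illustrative values only).
monthly_spend = [120.0, 80.0, 0.0, 200.0]
raw = {"postcode": "D04", "credit_limit": 1000.0}

# Assumed lookup table used by the mapping feature.
POSTCODE_TO_REGION = {"D04": "urban", "F92": "rural"}


def derive_features(monthly_spend, raw):
    """Return one derived feature of each common type."""
    # Aggregate: summarise many raw values as a single number.
    total_spend = sum(monthly_spend)
    # Flag: binary indicator that some condition holds.
    any_zero_month = any(s == 0 for s in monthly_spend)
    # Ratio: relationship between two values (guarding against a zero limit).
    limit = raw["credit_limit"]
    spend_to_limit = total_spend / limit if limit else None
    # Mapping: convert a raw value into a coarser set of levels.
    region = POSTCODE_TO_REGION.get(raw["postcode"], "unknown")
    return {
        "TOTAL_SPEND": total_spend,
        "ANY_ZERO_MONTH": any_zero_month,
        "SPEND_TO_LIMIT": spend_to_limit,
        "REGION": region,
    }


features = derive_features(monthly_spend, raw)
print(features)
```

Raw features, by contrast, would simply be copied across from the source record without any of these manipulations.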
Handling Time
Many of the predictive models that we build are propensity models, which inherently have a temporal element.
For propensity modeling, there are two key periods:
the observation period
the outcome period
In some cases the observation and outcome periods are measured over the same time for all prediction subjects.
Figure: Modeling points in time. (a) Observation period and outcome period. (b) Observation and outcome periods for multiple customers (each line represents a customer).
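A minimal sketch of the fixed-period case follows, assuming each customer has a list of dated events and that the same calendar boundaries apply to everyone. The customer IDs, dates, and feature names are made up for illustration. Descriptive features are computed only from the observation period and the target only from the outcome period.

```python
from datetime import date

# Assumed period boundaries, shared by all prediction subjects.
OBS_START = date(2023, 1, 1)
OBS_END = date(2023, 4, 1)    # the observation period is [OBS_START, OBS_END)
OUT_END = date(2023, 7, 1)    # the outcome period is [OBS_END, OUT_END)


def split_periods(event_dates):
    """Split a subject's events into observation-period and outcome-period lists."""
    observation = [d for d in event_dates if OBS_START <= d < OBS_END]
    outcome = [d for d in event_dates if OBS_END <= d < OUT_END]
    return observation, outcome


# Hypothetical event histories, one list per customer.
events_by_customer = {
    "c1": [date(2023, 1, 10), date(2023, 2, 3), date(2023, 4, 20)],
    "c2": [date(2023, 3, 15)],
}

abt_rows = []
for customer_id, dates in events_by_customer.items():
    obs, out = split_periods(dates)
    abt_rows.append({
        "ID": customer_id,
        "NUM_EVENTS_OBS": len(obs),   # descriptive feature (observation period)
        "TARGET": int(len(out) > 0),  # target feature (outcome period)
    })

print(abt_rows)
```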
Often the observation period and outcome period will be measured over different dates for each prediction subject.
Figure: Observation and outcome periods defined by an event rather than by a fixed point in time (each line represents a prediction subject and stars signify events). Panels: (a) Actual, (b) Aligned.
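One way to picture the aligned view is to convert each subject's dates into day offsets relative to that subject's own key event, so that day 0 means the same thing for everyone. The sketch below assumes 30-day observation and outcome windows and uses invented subjects and dates; the function name `align_to_event` is hypothetical.

```python
from datetime import date


def align_to_event(activity_dates, key_event, obs_days=30, out_days=30):
    """Return (observation_offsets, outcome_offsets) in days relative to key_event."""
    offsets = [(d - key_event).days for d in activity_dates]
    # Observation period: the obs_days days strictly before the key event.
    observation = [o for o in offsets if -obs_days <= o < 0]
    # Outcome period: the out_days days starting at the key event.
    outcome = [o for o in offsets if 0 <= o < out_days]
    return observation, outcome


# Two hypothetical subjects whose key events fall on different calendar dates.
s1 = align_to_event(
    [date(2022, 2, 1), date(2022, 2, 20), date(2022, 3, 11)],
    key_event=date(2022, 3, 1),
)
s2 = align_to_event(
    [date(2023, 7, 1), date(2023, 9, 10), date(2023, 9, 20)],
    key_event=date(2023, 9, 15),
)
print(s1)
print(s2)
```

After alignment the two subjects can be compared directly even though their actual dates are more than a year apart.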
In some cases only the descriptive features have a time component to them, and the target feature is time independent.
Figure: Modeling points in time for a scenario with no real outcome period (each line represents a customer, and stars signify events). Panels: (a) Actual, (b) Aligned.
Conversely, the target feature may have a time component and the descriptive features may not.
Figure: Modeling points in time for a scenario with no real observation period (each line represents a customer, and stars signify events). Panels: (a) Actual, (b) Aligned.
Legal Issues
Data analytics practitioners can often be frustrated by legislation that stops them from including features that appear to be particularly well suited to an analytics solution in an ABT.
There are significant differences in legislation in different jurisdictions, but a couple of key relevant principles almost always apply.
1 Anti-discrimination legislation
2 Data protection legislation
Although data protection legislation varies significantly across jurisdictions, there are some common tenets on which there is broad agreement and which affect the design of ABTs:
The collection limitation principle
The purpose specification principle
The use limitation principle
Implementing Features
Whereas a raw feature comes directly from an existing data source, implementing a derived feature requires data from multiple sources to be combined into a set of single feature values.
A few key data manipulation operations are frequently used to calculate derived feature values:
joining data sources
filtering rows in a data source
filtering fields in a data source
deriving new features by combining or transforming existing features
aggregating data sources
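The five operations above can be sketched in plain Python over small in-memory records. The claim and policy records, field names, and values here are invented for illustration; in practice these steps would more likely be carried out in a database or a data manipulation tool.

```python
from collections import defaultdict

# Hypothetical data sources.
claims = [
    {"claim_id": 1, "policy_id": "P1", "amount": 1000.0, "status": "paid", "notes": "n/a"},
    {"claim_id": 2, "policy_id": "P1", "amount": 500.0, "status": "rejected", "notes": "n/a"},
    {"claim_id": 3, "policy_id": "P2", "amount": 750.0, "status": "paid", "notes": "n/a"},
    {"claim_id": 4, "policy_id": "P1", "amount": 250.0, "status": "paid", "notes": "n/a"},
]
policies = {"P1": {"premium": 500.0}, "P2": {"premium": 250.0}}

# 1. Join: attach policy fields to each claim via the shared policy_id key.
joined = [{**c, **policies[c["policy_id"]]} for c in claims]

# 2. Filter rows: keep only the claims that were paid.
paid = [r for r in joined if r["status"] == "paid"]

# 3. Filter fields: keep only the fields needed downstream.
KEEP = ("claim_id", "policy_id", "amount", "premium")
slim = [{k: r[k] for k in KEEP} for r in paid]

# 4. Derive: combine existing fields into a new feature.
for r in slim:
    r["claim_to_premium"] = r["amount"] / r["premium"]

# 5. Aggregate: total paid amount per policy.
total_paid = defaultdict(float)
for r in slim:
    total_paid[r["policy_id"]] += r["amount"]

print(slim)
print(dict(total_paid))
```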
Case Study: Motor Insurance Fraud
What are the observation period and outcome period for the motor insurance claim prediction scenario?
The observation period and outcome period are measured over different dates for each insurance claim, defined relative to the specific date of that claim.
The observation period is the time prior to the claim event, over which the descriptive features capturing the claimant's behavior are calculated.
The outcome period is the time immediately after the claim event, during which it will emerge whether the claim is fraudulent or genuine.
Case Study: Motor Insurance Fraud
What features could you use to capture the Claim Frequency domain concept?
Figure: Example domain concepts for a motor insurance fraud prediction analytics solution.
Motor Insurance Claim Fraud Prediction
  Claimant History
    Claim Frequency
      Number of Claims in Claimant Lifetime (Derived Aggregate)
      Number of Claims by Claimant in Last 3 Months (Derived Aggregate)
      Average Claims Per Year by Claimant (Derived Aggregate)
      Ratio of Avg. Claims Per Year to Number of Claims in Last 12 Months (Derived Ratio)
Figure: A subset of the domain concepts and related features for a motor insurance fraud prediction analytics solution.
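The Claim Frequency features could be computed along the following lines. This is only a sketch under several assumptions that the slides do not specify: the claimant's lifetime is taken to start at a hypothetical `customer_since` date and to count as at least one year, and "3 months" and "12 months" are approximated as 90 and 365 days. Only claims made before the current claim are used, in keeping with the observation period described earlier.

```python
from datetime import date, timedelta


def claim_frequency_features(claim_dates, customer_since, current_claim_date):
    """Compute claim frequency features from claims made before the current claim."""
    # Only claims strictly before the current claim fall in the observation period.
    prior = [d for d in claim_dates if d < current_claim_date]

    # Assumption: 3 and 12 months are approximated as 90 and 365 days.
    cutoff_3m = current_claim_date - timedelta(days=90)
    cutoff_12m = current_claim_date - timedelta(days=365)
    num_claims = len(prior)
    num_claims_3m = sum(1 for d in prior if d >= cutoff_3m)
    num_claims_12m = sum(1 for d in prior if d >= cutoff_12m)

    # Assumption: lifetime starts at customer_since and counts as at least one year.
    years = max((current_claim_date - customer_since).days / 365.25, 1.0)
    avg_per_year = num_claims / years

    # The ratio is left undefined when there were no claims in the last 12 months.
    ratio = avg_per_year / num_claims_12m if num_claims_12m else None

    return {
        "NUM_CLAIMS": num_claims,
        "NUM_CLAIMS_3_MONTHS": num_claims_3m,
        "AVG_CLAIMS_PER_YEAR": avg_per_year,
        "AVG_CLAIMS_RATIO": ratio,
    }


# Hypothetical claimant history.
freq = claim_frequency_features(
    claim_dates=[date(2020, 6, 1), date(2022, 11, 15), date(2023, 2, 20)],
    customer_since=date(2019, 3, 1),
    current_claim_date=date(2023, 3, 1),
)
print(freq)
```

Returning `None` for an undefined ratio is one design choice among several; how such values should be represented in the ABT is a separate decision.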
Case Study: Motor Insurance Fraud
What features could you use to capture the Claim Types domain concept?
Figure: Example domain concepts for a motor insurance fraud prediction analytics solution.
Motor Insurance Claim Fraud Prediction
  Claimant History
    Claim Types
      Number of Soft Tissue Claims (Derived Aggregate)
      Ratio of Soft Tissue Claims to Other Claims (Derived Ratio)
      Unsuccessful Claim Made (Derived Flag)
      Diversity of Claim Types, measured using entropy (Derived Other)
Figure: A subset of the domain concepts and related features for a motor insurance fraud prediction analytics solution.
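A possible implementation of the Claim Types features is sketched below. The slides say only that diversity is "measured using entropy"; the use of Shannon entropy with a base-2 logarithm is an assumption here, as are the record layout and the type labels.

```python
import math
from collections import Counter


def claim_type_features(prior_claims):
    """prior_claims: list of (injury_type, was_successful) tuples for one claimant."""
    types = [injury_type for injury_type, _ in prior_claims]
    n = len(types)

    num_soft = sum(1 for t in types if t == "soft tissue")   # aggregate
    num_other = n - num_soft
    # Ratio is left undefined when the claimant has no non-soft-tissue claims.
    soft_to_other = num_soft / num_other if num_other else None
    unsuccessful = any(not ok for _, ok in prior_claims)      # flag

    # Assumed measure: Shannon entropy (in bits) of the claim type distribution.
    # It is 0 when all claims share one type and grows as types become more mixed.
    counts = Counter(types)
    diversity = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

    return {
        "NUM_SOFT_TISSUE": num_soft,
        "SOFT_TO_OTHER_RATIO": soft_to_other,
        "UNSUCC_CLAIM_MADE": unsuccessful,
        "CLAIM_DIV": diversity,
    }


# Hypothetical claim history for one claimant.
types_features = claim_type_features([
    ("soft tissue", True),
    ("soft tissue", False),
    ("back", True),
    ("broken limb", True),
])
print(types_features)
```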
Case Study: Motor Insurance Fraud
What features could you use to capture the Claim Details domain concept?
Figure: Example domain concepts for a motor insurance fraud prediction analytics solution.
Motor Insurance Claim Fraud Prediction
  Claim Details
    Claim to Premium Paid Ratio (Derived Ratio)
    Injury Type
    Claim Amount
    Accident Region (Derived Mapping)
Figure: A subset of the domain concepts and related features for a motor insurance fraud prediction analytics solution.
Case Study: Motor Insurance Fraud
The following table illustrates the structure of the final ABT that was designed for the motor insurance claims fraud detection solution.
The table contains more descriptive features than the ones we have discussed.
The table also shows the first four instances.
If we examine the table closely, we see a number of strange values (for example, -9,999) and a number of missing values; we will return to these in Chapter 3.
Table: The ABT for the motor insurance claims fraud detection solution.
ID  TYPE  INC.    MARITAL STATUS  NUM. CLMNTS.  INJURY TYPE  HOSPITAL STAY  CLAIM AMT.
1   CI    0                       2             Soft Tissue  No             1,625
2   CI    0                       2             Back         Yes            15,028
3   CI    54,613                  1             Broken Limb  No             -9,999
4   CI    0                       3             Serious      Yes            270,200

ID  TOTAL CLAIMED  NUM. CLAIMS  NUM. CLAIMS 3 MONTHS  AVG. CLAIMS PER YEAR  AVG. CLAIMS RATIO  NUM. SOFT TISSUE  % SOFT TISSUE
1   3,250          2            0                     1                     1                  2                 1
2   60,112         1            0                     1                     1                  0                 0
3   0              0            0                     0                     0                  0                 0
4   0              0            0                     0                     0                  0                 0

ID  UNSUCC. CLAIMS  CLAIM AMT. REC.  CLAIM DIV.  CLAIM TO PREM.  REGION  FRAUD FLAG
1   2               0                0           32.5            MN      1
2   0               15,028           0           57.14           DL      0
3   0               572              0           -89.27          WAT     0
4   0               270,200          0           30.186          DL      0
Summary
Predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization, not an end in themselves.
It is important to fully understand the business problem that a model is being constructed to address; this is the goal behind converting business problems into analytics solutions.
Predictive data analytics models are reliant on the data that is used to build them: the analytics base table (ABT).
The first step in designing an ABT is to decide on the prediction subject.
An effective way to design ABTs is to start by defining a set of domain concepts in collaboration with the business, and then to design features that express these concepts in order to form the actual ABT.
Features (both descriptive and target) are concrete numeric or symbolic representations of domain concepts.
It is useful to distinguish between raw features that come directly from existing data sources and derived features that are constructed by manipulating values from existing data sources.
Common manipulations used in this process include aggregates, flags, ratios, and mappings, although any manipulation is valid.
The techniques described here cover the Business Understanding, Data Understanding, and (partially) Data Preparation phases of the CRISP-DM process.
Figure: A diagram of the CRISP-DM process (phases shown include Business Understanding, Data Understanding, Data Preparation, and Deployment).
Figure: A summary of the tasks in the Business Understanding, Data Understanding, and Data Preparation phases of the CRISP-DM process.
Tasks shown in the figure:
Understand Business Problem
Propose Analytics Solutions
Explore Data (1)
Assess Analytics Solutions
Choose Analytics Solution
Agree on Analytics Goals
Design Domain Concepts
Brainstorm Domain Concepts
Domain Concepts
Explore Data (2)
Design Features
Review Features
Clean & Prepare Data