Information Management
Data Mining
(from the book Data Mining: Concepts and Techniques)
What is data mining?
It is a set of techniques and tools aimed at extracting interesting patterns from data
Data mining is part of KDD (knowledge discovery in databases)
KDD is the process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data
Data warehouses are crucial for data mining purposes
Types of data mining
Concept description
Characterization: concise and succinct representation of a collection of data
Comparison: description comparing two (or more) collections of data
Descriptive data mining
Describes concepts or task-relevant data sets in a concise and informative form
Predictive data mining
Builds a model based on data and on their analysis; the model is then used to predict trends and properties of unknown data
Data mining branches (1)
Aggregation queries are a very simple kind of mining
Classification
Build a model to categorize data in classes
Regression
Build a model to predict the result of a real-valued function
Clustering
Organize data into groups of similar items
Outlier detection
Identify unusual data items
Data mining branches (2)
Trend analysis and forecasting
Identify changes in patterns of data over time
Detect dependencies among data
Identify whether attributes are correlated with each other
Identify which attributes likely occur together
Temporal pattern detection (or time series mining)
Identify common patterns in time series
Data mining: be careful! (1)
Overfitting
Identify spurious patterns: be careful not to take coincidence for causality!
May be due to the analysis of too many attributes or of a limited number of data items
Example: ask 10,000 subjects to predict the color of 10 face-down cards; 10 subjects correctly predict all 10 cards
Conclusion: 1 out of 1,000 subjects has extrasensory perception? NO
Report obvious results that do not derive from data analysis
Example: women are more likely (than men) to have breast cancer
Data mining: be careful! (2)
Confuse correlation and causation
Data mining identifies correlated attributes, but correlation does not always imply
a causality relationship!
Example: overweight people are more likely to drink diet soda
Conclusion: diet soda causes obesity? NO
It is necessary to correctly interpret mining results
Data mining algorithms are not magic
Results must be carefully analyzed to avoid drawing wrong conclusions
Examples of application domains
Market analysis
Targeted marketing, customer profiling
Determining patterns of purchases over time for suggestions
Cross-market analysis
Corporate analysis and risk management
Finance planning and asset evaluation
Resource planning
Competition
Frequent itemset mining
Frequent patterns
Frequent patterns: patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set
Example: a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset
Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data
Application examples: market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Let I = {i1, ..., in} be a collection of one or more data items
Transaction T ⊆ I is a subset of items in I
Itemset X ⊆ I is a subset of items
k-Itemset X: itemset with k items
Absolute support of X: frequency of occurrence of X (number of transactions that include X)
Relative support: fraction of transactions that include X (probability that X appears in a transaction)
Frequent itemset: itemset with support above a threshold
Example transactions (items bought):
Bread, Milk
Bread, Coffee, Milk
Bread, Cocoa, Milk
Cocoa, Coffee, Milk
Bread, Cocoa, Coffee, Milk
[Figure: Venn diagram of customers who buy milk, customers who buy bread, and customers who buy both]
Association rules
Association rule A → B, where A and B are itemsets
Support s:
fraction of transactions including both A and B
probability P(A ∪ B)
Confidence c:
fraction of transactions including A that also include B
probability P(B|A)
Example (over the transactions above): {Bread, Milk} → {Coffee}
s = |{Bread, Coffee, Milk}| / |T| = 2/5 = 0.4
c = |{Bread, Coffee, Milk}| / |{Bread, Milk}| = 2/4 = 0.5
(here |X| denotes the number of transactions that include all the items in X, and |T| the total number of transactions)
Strong rules
Goal: given a set T of transactions, find strong association rules that have
Support ≥ minsup threshold
Confidence ≥ minconf threshold
All the rules below are obtained by partitioning the itemset {Bread, Coffee, Milk} into antecedent and consequent
They have the same support
They have different confidence
{Bread, Milk} → {Coffee}    s = 0.4, c = 0.5
{Bread, Coffee} → {Milk}    s = 0.4, c = 1.0
{Milk, Coffee} → {Bread}    s = 0.4, c = 0.67
{Bread} → {Coffee, Milk}    s = 0.4, c = 0.5
{Coffee} → {Bread, Milk}    s = 0.4, c = 0.67
{Milk} → {Bread, Coffee}    s = 0.4, c = 0.4
Association rule mining
Association rule mining can operate in two steps
1. Find all frequent itemsets
Determines the performance of the approach
2. Generate strong association rules from frequent itemsets
Much less costly
Note: a frequent itemset of length 100 includes 100 frequent 1-itemsets, 100!/(2!·98!) = 4,950 frequent 2-itemsets, and so on; overall, it includes 2^100 − 1 ≈ 1.27 × 10^30 frequent sub-patterns!
Too many!!
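To check the arithmetic, a two-line Python computation of these counts:

    from math import comb
    print(comb(100, 2))   # 4950 frequent 2-itemsets
    print(2**100 - 1)     # 1267650600228229401496703205375 ~ 1.27e30 sub-patterns in total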
Itemset lattice
[Figure: lattice of all itemsets over the items {A, B, C, D, E}, from the single items through the 2-itemsets (AB, AC, ..., DE), 3-itemsets, and 4-itemsets, up to ABCDE]
Closed and maximal itemsets
X is a closed itemset if there exists no proper super-itemset Y with the same absolute support (i.e., frequency) as X
X is a closed frequent itemset if it is a closed itemset with frequency above minfreq
X is a maximal frequent itemset (max-itemset) if it is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent
X is frequent and is not included in another frequent itemset
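A small Python sketch (illustrative names and data, not from the slides) that flags closed and maximal itemsets, given the absolute support of every frequent itemset:

    # freq: dict mapping frozenset -> absolute support, assumed already mined (e.g., by Apriori)
    freq = {
        frozenset("A"): 4, frozenset("B"): 3, frozenset("AB"): 3,
    }

    def is_closed(x, freq):
        # Closed: no proper superset has the same support as x
        return all(not (x < y and s == freq[x]) for y, s in freq.items())

    def is_maximal(x, freq):
        # Maximal: no proper superset is frequent at all
        return all(not (x < y) for y in freq)

    for x in freq:
        print(set(x), freq[x],
              "closed" if is_closed(x, freq) else "",
              "maximal" if is_maximal(x, freq) else "")
    # {A} is closed but not maximal; {B} is neither; {A, B} is both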
Apriori property
Apriori property: all non-empty subsets of a frequent itemset must also be frequent
If I is not frequent, adding item i to I produces an itemset I ∪ {i} that cannot appear more frequently than I
Antimonotonicity: if a set does not pass a test, its supersets will also fail the same test
[Figure: itemset lattice over {A, B, C, D, E}; once an itemset is found not frequent, all of its supersets are pruned from the search space]
Apriori algorithm (1)
Scan all transactions and count the occurrences of each item (1-itemsets in candidate set C1)
Select the 1-itemsets in C1 satisfying minsup threshold, and include them in L1
Compute the set of candidate frequent 2-itemsets C2 as L1 ⋈ L1
Scan all transactions to compute the absolute support of each 2-itemset in C2
Select the 2-itemsets in C2 satisfying the minsup threshold, and include them in L2
Compute the set of candidate frequent 3-itemsets C3 as L2 ⋈ L2
Stop when Ci is empty
Apriori algorithm (2)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

    L1 = {frequent 1-itemsets}
    for (k = 2; Lk-1 != ∅; k++) do begin
        Ck = candidates generated from Lk-1
        for each transaction t in database do
            increment the count of all candidates in Ck that are contained in t
        Lk = candidates in Ck with support higher than minsup
    end
    return ∪k Lk
Generate candidates
To generate candidates Ck, compute a self-join between Lk-1 and Lk-1
Join condition: the first k-2 items should be in common
Items l1 and l2 in Lk-1 are joined if
(l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
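A compact Python sketch of the Apriori loop and of candidate generation with pruning (the function names are illustrative; transactions are assumed to be given as sets of items):

    from itertools import combinations

    def apriori(transactions, minsup):
        # Return all frequent itemsets (as frozensets) with absolute support >= minsup
        counts = {}
        for t in transactions:                       # one scan for the 1-itemsets
            for item in t:
                counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
        Lk = {x for x, c in counts.items() if c >= minsup}
        frequent = set(Lk)
        k = 2
        while Lk:
            # Self-join: union two (k-1)-itemsets that differ in exactly one item,
            # then prune candidates with an infrequent (k-1)-subset (Apriori property)
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
            Ck = {c for c in Ck
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
            # One scan of the transactions to count the surviving candidates
            counts = {c: sum(c <= t for t in transactions) for c in Ck}
            Lk = {c for c, n in counts.items() if n >= minsup}
            frequent |= Lk
            k += 1
        return frequent

    transactions = [frozenset(t) for t in (
        {"Bread", "Milk"}, {"Bread", "Coffee", "Milk"}, {"Bread", "Cocoa", "Milk"},
        {"Cocoa", "Coffee", "Milk"}, {"Bread", "Cocoa", "Coffee", "Milk"})]
    print(apriori(transactions, minsup=3))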
Association rules can be generated as follows:
1. For each frequent itemset l, generate all non-empty proper subsets of l
2. For every non-empty proper subset s of l, generate the rule s → (l − s) if support(l) / support(s) ≥ minconf
Note: minimum support is guaranteed by the fact that l is a frequent itemset
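A short Python sketch of this rule-generation step (illustrative names; it assumes the support dictionary contains every frequent itemset and its subsets, which Apriori guarantees):

    from itertools import combinations

    def generate_rules(frequent, support, minconf):
        # frequent: iterable of frozensets; support: dict frozenset -> absolute support
        # Yields (antecedent, consequent, confidence) for every strong rule
        for l in frequent:
            if len(l) < 2:
                continue
            for r in range(1, len(l)):                  # all non-empty proper subsets of l
                for s in map(frozenset, combinations(l, r)):
                    conf = support[l] / support[s]      # c = support(l) / support(s)
                    if conf >= minconf:
                        yield s, l - s, conf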
Generating association rules: example
Consider frequent itemset X={I1, I2, I5}
Non-empty proper subsets of X are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}
{I1,I2}: generated rule {I1,I2} → {I5} with confidence 2/4
{I1,I5}: generated rule {I1,I5} → {I2} with confidence 2/2
{I2,I5}: generated rule {I2,I5} → {I1} with confidence 2/2
{I1}: generated rule {I1} → {I2,I5} with confidence 2/6
{I2}: generated rule {I2} → {I1,I5} with confidence 2/7
{I5}: generated rule {I5} → {I1,I2} with confidence 2/2
With a minimum confidence threshold of 0.7, only the rules with confidence ≥ 0.7 are output
[The confidences refer to an example transaction database over items I1–I5; e.g., one of its transactions contains I1, I2, I3, I5]
Improving the Efficiency of Apriori
Major computational challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB '95
DB = DB1 ∪ DB2 ∪ ... ∪ DBk
If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, ..., supk(i) < σ|DBk|, then sup(i) < σ|DB| (with σ the relative minsup threshold)

DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
Candidates: a, b, c, d, e
Hash table entries: {ab, ad, ae}, {bd, be, de}, ..., {yz, qs, wt}
Frequent 1-itemsets: a, b, d, e
ab is not a candidate 2-itemset if the count of the bucket {ab, ad, ae} is below the support threshold
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD '95

Sampling for Frequent Patterns
Select a sample of the original database, mine frequent patterns within the sample using Apriori
Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
Example: check abcd instead of ab, ac, ..., etc.
Scan the database again to find missed frequent patterns
H. Toivonen. Sampling large databases for association rules. VLDB '96

DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice and transaction stream; counting of 1-itemsets and 2-itemsets starts at different points of the scan]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD '97

FP-tree (1)
1. Scan the DB (similarly to Apriori) to find frequent 1-itemsets
2. Sort frequent items in frequency descending order (f-list)
3. Scan the DB again, construct the FP-tree
f-list: f, c, a, m, b, p
Transactional database (items bought → frequent items ordered by the f-list):
f, a, c, d, g, i, m, p → f, c, a, m, p
a, b, c, f, l, m, o → f, c, a, b, m
b, f, h, j, o, w → f, b
b, c, k, s, p → c, b, p
a, f, c, e, l, p, m, n → f, c, a, m, p

FP-tree (2)
Root: empty node {}
For each transaction, order its items according to the f-list
Each transaction is a path in the FP-tree
Build a path for each transaction, increasing by one the counter in common sub-paths

FP-tree: example (1)-(6)
[Figures: the FP-tree is built incrementally, one ordered transaction at a time; the final tree has the paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1]
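A minimal Python sketch of the two-scan construction just described (illustrative class and function names; a full implementation would also keep a header table with node-links, used later for mining):

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent, self.count, self.children = item, parent, 0, {}

    def build_fptree(transactions, minsup):
        # Scan 1: frequent 1-itemsets, sorted by descending frequency (f-list)
        freq = Counter(item for t in transactions for item in t)
        flist = [i for i, c in freq.most_common() if c >= minsup]
        rank = {item: r for r, item in enumerate(flist)}
        # Scan 2: insert each transaction, keeping only frequent items ordered by the f-list
        root = FPNode(None, None)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                node = node.children.setdefault(item, FPNode(item, node))
                node.count += 1   # shared prefixes share nodes and only bump their counters
        return root, flist

    transactions = [set(t) for t in ("facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn")]
    root, flist = build_fptree(transactions, minsup=3)
    print(flist)   # items with support >= 3, by descending frequency (ties broken arbitrarily)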
FP-tree: pattern base
The problem of mining frequent patterns in databases is transformed to that of mining the FP-tree
Extract the conditional pattern base of each item by visiting the tree (following the prefix paths ending in that item), e.g.:
m: {fca:2, fcab:1}
b: {fca:1, f:1, c:1}
p: {fcam:2, cb:1}

Conditional FP-tree
For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base
Generate frequent patterns
Example for m: pattern base {fca:2, fcab:1}; b's count is below the threshold, so the m-conditional FP-tree is f:3, c:3, a:3, which generates the patterns fm, fcm, fcam

FP-growth
FP-growth transforms the problem of finding long frequent patterns into recursively searching for shorter ones and concatenating the suffix
It uses the least frequent items as suffixes, offering good selectivity
It reduces the search cost
If the tree does not fit into main memory, partition the database
Efficient and scalable for mining both long and short frequent patterns

Which patterns are interesting?
Minimum support and confidence thresholds help to exclude uninteresting rules
When mining with low support thresholds or for long patterns, we may still end up with uninteresting rules
Is a rule interesting?
Subjective measure: only users can ultimately say if a rule is interesting
Objective measure: may help filtering uninteresting rules based on the statistics behind the data

Misleading association rules
Consider an electronics shop and, in particular, transactions including computer games and videos
Out of 10,000 transactions
6,000 include games
7,500 include videos
4,000 include both games and videos
Rule {games} → {videos} has support = 0.4 and confidence = 0.66
It is considered strong if the threshold is fixed at 0.3
Misleading!! Computer games and videos are negatively associated: the purchase of one decreases the likelihood of purchasing the other

Correlation analysis
The confidence of a rule A → B can be deceiving
It is an estimate of the conditional probability of itemset B given itemset A
It does not measure the real strength of the correlation/implication between A and B
Need to adopt correlation analysis
There exist different correlation measures: lift, chi-square, ...
The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A)P(B); otherwise A and B are dependent and correlated as events
lift(A,B) = P(A ∪ B) / (P(A) P(B))
lift(A,B) < 1: A is negatively correlated with B (occurrence of A likely leads to absence of B)
lift(A,B) > 1: A is positively correlated with B
(occurrence of A likely leads to occurrence of B)
lift(A,B)=1: A and B are independent
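A quick Python sketch of computing lift from raw transaction counts (illustrative helper; the numbers are those of the games/videos example that follows):

    def lift(n_total, n_a, n_b, n_ab):
        # Lift of A -> B from raw counts: P(A and B) / (P(A) * P(B))
        return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

    # 10,000 transactions; 6,000 with games, 7,500 with videos, 4,000 with both
    print(round(lift(10_000, 6_000, 7_500, 4_000), 2))   # 0.89 -> negative correlation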
Lift: example
Out of 10,000 transactions
6,000 include games
7,500 include videos
4,000 include both games and videos
Rule {games} → {videos}: support = 0.4, confidence = 0.66
lift = P({games, videos}) / (P({games}) P({videos})) = 0.4 / (0.6 × 0.75) = 0.89 < 1

Classification

What is classification?
Examples:
Analyze data to learn which loan applicants are risky and which are safe
Analyze breast cancer data to predict which of three specific treatments is better
The classification task is aimed at predicting the class (category or label) of data
It is a prediction problem
Classes are discrete values that are not characterized by an order relationship

Classification: a two-step process
Learning (training) step
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Classification step
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data

Learning step: example
[Figure: training data (name, rank, years, tenured) is fed to a classification algorithm, which produces the classifier (model)]
Learned rule: IF rank = professor OR years > 6 THEN tenured = yes
Classification step: example
[Figure: the classifier is first applied to the testing data to estimate accuracy, and then to unseen data; e.g., the tuple (Jeff, Professor, 4) is classified as tenured = yes, since rank = professor]
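A toy Python sketch of the two-step process using the rule learned in the example (the test tuples and helper names are illustrative, not from the slides):

    # Toy classifier implementing the learned rule: IF rank = professor OR years > 6 THEN tenured = yes
    def classify(rank, years):
        return "yes" if rank == "professor" or years > 6 else "no"

    # Step 1 of classification: estimate accuracy on a labeled test set, kept separate from training data
    test_set = [("Tom", "assistant prof", 2, "no"),
                ("Mary", "associate prof", 7, "yes"),
                ("Anne", "professor", 3, "yes")]
    accuracy = sum(classify(rank, years) == label
                   for _, rank, years, label in test_set) / len(test_set)
    print(f"accuracy = {accuracy:.0%}")

    # Step 2: if the accuracy is acceptable, classify unseen data
    print("Jeff:", classify("professor", 4))   # -> yes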