Chapter 6. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by backpropagation
- Support Vector Machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Prediction
- Accuracy and error measures
- Ensemble methods
- Model selection
- Summary
Classification by Decision Tree Induction (DTI)
- DTI is the learning of decision trees from class-labeled training tuples.
- A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node is the root node.
- Why are DT classifiers so popular?
- The construction of DT classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
- DTs can handle high-dimensional data.
- Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate.
- They have good accuracy.
- They are used in medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Output: A Decision Tree for "buys_computer"

[Figure: decision tree for buys_computer, rooted at a test on age?, with internal nodes student? (branches no/yes) and credit rating? (branches excellent/fair), and leaf nodes labeled yes or no.]
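For readers who want to reproduce a tree like this one, here is a minimal sketch using scikit-learn (a tooling assumption, not part of the slides); the handful of encoded rows is illustrative only, not the textbook's full training set.

```python
# Minimal sketch: fitting a buys_computer-style decision tree with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

# Integer-encode the categorical attributes:
# age: youth=0, middle_aged=1, senior=2; student: no=0, yes=1;
# credit_rating: fair=0, excellent=1
X = [
    [0, 0, 0],  # youth, not a student, fair credit     -> no
    [0, 1, 0],  # youth, student, fair credit           -> yes
    [1, 0, 1],  # middle_aged, not a student, excellent -> yes
    [2, 0, 0],  # senior, not a student, fair credit    -> yes
    [2, 0, 1],  # senior, not a student, excellent      -> no
]
y = ["no", "yes", "yes", "yes", "no"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy ~ information gain
clf.fit(X, y)
print(export_text(clf, feature_names=["age", "student", "credit_rating"]))
```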
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- If attribute A is categorical (discrete-valued), the outcomes of the test at node N correspond directly to the known values of A
- If attribute A is continuous-valued, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, where split_point is the value returned by Attribute_selection_method as part of the splitting criterion
- Examples are partitioned recursively based on the selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
- There are no samples left
Algorithm for Decision Tree Induction: Method
1. Create a node N;
2. If the tuples in D are all of the same class C, then return N as a leaf node labeled with the class C;
3. If attribute_list is empty, then return N as a leaf node labeled with the majority class in D;
4. Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
5. Label node N with splitting_criterion;
6. If splitting_attribute is discrete-valued and multiway splits are allowed, then attribute_list ← attribute_list - splitting_attribute;
7. For each outcome j of splitting_criterion:
8. let Dj be the set of data tuples in D satisfying outcome j;
9. if Dj is empty, then attach a leaf labeled with the majority class in D to node N;
10. else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
11. Return N;
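The skeleton above can be written as a short recursive function. The sketch below is an illustration, not the textbook's implementation: it assumes tuples are represented as (attributes_dict, class_label) pairs, that an attribute-selection function (e.g., one based on information gain) is passed in, and that attributes are discrete-valued with multiway splits.

```python
# A compact Python sketch of Generate_decision_tree for discrete-valued
# attributes with multiway splits.
from collections import Counter

def majority_class(D):
    # D is a list of (attributes_dict, class_label) pairs
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, select_attribute):
    labels = [label for _, label in D]
    if len(set(labels)) == 1:          # all tuples belong to the same class C
        return labels[0]               # leaf labeled with C
    if not attribute_list:             # no attributes left to split on
        return majority_class(D)       # leaf labeled with the majority class
    A = select_attribute(D, attribute_list)        # "best" splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]
    for value in {attrs[A] for attrs, _ in D}:     # one branch per outcome
        Dj = [(attrs, label) for attrs, label in D if attrs[A] == value]
        # Branching only on values observed in D, so the "Dj is empty" case
        # of the pseudocode cannot arise here.
        node["branches"][value] = generate_decision_tree(Dj, remaining,
                                                         select_attribute)
    return node
```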
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.

Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
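As a concrete illustration of these formulas, here is a small Python sketch; the (attributes_dict, class_label) tuple representation is an assumption of this example, not part of the slides.

```python
# Entropy and information gain for a categorical attribute A.
from collections import Counter
from math import log2

def info(labels):
    # Info(D) = -sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(D, A):
    # D is a list of (attributes_dict, class_label) pairs
    labels = [label for _, label in D]
    n = len(labels)
    info_A = 0.0                      # Info_A(D) = sum_j |Dj|/|D| * Info(Dj)
    for value in {attrs[A] for attrs, _ in D}:
        Dj = [label for attrs, label in D if attrs[A] == value]
        info_A += len(Dj) / n * info(Dj)
    return info(labels) - info_A      # Gain(A) = Info(D) - Info_A(D)
```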
The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age (youth, middle_aged, senior), and the tuples are partitioned accordingly.
Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- We must determine the best split point for A
- Sort the values of A in increasing order
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
- The point with the minimum expected information requirement for A is selected as the split point for A
- Split: D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point
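A minimal Python sketch of this midpoint search follows; the helper names and the value/label list representation are assumptions made for illustration.

```python
# Choose the split point for a continuous attribute A by evaluating the
# midpoint between each pair of adjacent sorted values.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # values: numeric values of A; labels: the corresponding class labels
    pairs = sorted(zip(values, labels))
    best_expected, best_split = float("inf"), None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2     # (a_i + a_{i+1}) / 2
        left = [lbl for v, lbl in pairs if v <= split]  # D1: A <= split_point
        right = [lbl for v, lbl in pairs if v > split]  # D2: A > split_point
        # Expected information requirement for this candidate split point
        expected = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if expected < best_expected:
            best_expected, best_split = expected, split
    return best_split
```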
Gini index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the Gini index, gini(D), is defined as

  $$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

  where $p_j$ is the relative frequency of class j in D.
- If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

  $$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

- Reduction in impurity:

  $$\Delta gini(A) = gini(D) - gini_A(D)$$

- The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).
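A small Python sketch of these two measures for a binary split; the label-list inputs are an assumption made for illustration.

```python
# Gini index of a partition and weighted Gini index of a binary split.
from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels1, labels2):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    n = len(labels1) + len(labels2)
    return len(labels1) / n * gini(labels1) + len(labels2) / n * gini(labels2)
```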
Gini Index Example (CART, IBM IntelligentMiner)
- Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":

  $$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  $$gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2)$$

- Similarly, the Gini index values for the remaining income splits are 0.315 ({low, high} and {medium}) and 0.300 ({medium, high} and {low}); the split on {medium, high} is the best for income since its Gini index, 0.300, is the lowest.
- The best split for age is 0.375 ({youth, senior} and {middle_aged}); the attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively.
- The attribute income with splitting subset {medium, high} therefore gives the minimum Gini index overall, with a reduction in impurity of 0.459 - 0.300 = 0.159. This binary split yields the maximum reduction in impurity of the tuples in D and is thus chosen as the splitting criterion (the Gini index selects income rather than age at the root node).
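For completeness, the first value above can be checked directly:

```python
# Verify gini(D) for the 9 "yes" / 5 "no" class distribution.
gini_D = 1 - (9 / 14) ** 2 - (5 / 14) ** 2
print(round(gini_D, 3))   # 0.459
```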
Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence
- C-SEP: performs better than information gain and the Gini index in certain cases
- G-statistic: has a close approximation to the χ² distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
- Multivariate splits (partition based on multiple variable combinations)
- CART: finds multivariate splits based on a linear combination of attributes
- Which attribute selection measure is the best? Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples
- Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree" (a sketch follows the figure below)
- Decision trees suffer from repetition and replication
- Repetition: occurs when an attribute is repeatedly tested along a given branch of the tree (such as "age < 60?" followed by "age < 45?", and so on)
- Replication: duplicate subtrees exist within the tree
[Figure: an unpruned decision tree with internal tests (A1?, A2?, ...) and Class A / Class B leaves, alongside the corresponding pruned decision tree.]
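One way to realize the postpruning idea in practice is cost-complexity pruning. The sketch below uses scikit-learn (a tooling assumption, not part of the slides) and a held-out validation set in the role of the "data different from the training data".

```python
# Hedged sketch: postpruning via scikit-learn's cost-complexity pruning.
# X_train, y_train, X_val, y_val are assumed to be available.
from sklearn.tree import DecisionTreeClassifier

def best_pruned_tree(X_train, y_train, X_val, y_val):
    # Enumerate candidate pruning strengths (a sequence of progressively
    # pruned trees), then keep the tree that scores best on validation data.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)
    best_clf, best_score = None, -1.0
    for alpha in path.ccp_alphas:
        clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)
        if score > best_score:
            best_clf, best_score = clf, score
    return best_clf
```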
Repetition
[Figure: repetition, where an attribute is repeatedly tested along a given branch of the tree, e.g., A1 < 60?, then A1 < 50?, then A1 < 45?, with Class A / Class B leaves.]