Data Mining - Classification by Decision Tree Induction (Study Notes)

Summary: Presentation of classification results, visualization of a decision tree in SGI/MineSet 3.0, RainForest: training set and its AVC-sets, repetition, enhancements to basic decision tree induction, overfitting and tree pruning.


Chapter 6. Classification and Prediction

  • What is classification? What is prediction?
  • Issues regarding classification and prediction
  • Classification by decision tree induction
  • Bayesian classification
  • Rule-based classification
  • Classification by back propagation
  • Support Vector Machines (SVM)
  • Associative classification
  • Lazy learners (or learning from your neighbors)
  • Other classification methods
  • Prediction
  • Accuracy and error measures
  • Ensemble methods
  • Model selection
  • Summary


Classification by Decision Tree Induction (DTI)

  • DTI is the learning of decision trees from class-labeled training tuples.
  • A decision tree is a flowchart-like tree structure in which each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node is the root node.
  • Why are decision tree classifiers so popular?
    • The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
    • Decision trees can handle high-dimensional data.
    • Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate.
    • They have good accuracy.
    • They are used in areas such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.


Output: A Decision Tree for “buys_computer”

[Figure: decision tree with root node age?. The youth branch leads to a student? test (no → no, yes → yes), the middle_aged branch is a leaf labeled yes, and the senior branch leads to a credit rating? test (excellent → no, fair → yes).]
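To make the flowchart-like structure concrete, here is a minimal Python sketch (not part of the original notes; the nested-dict encoding and the classify helper are assumptions) that represents this tree and classifies a tuple by following branches from the root to a leaf:

```python
# Internal nodes are {attribute: {outcome: subtree}}; leaves are class labels.
buys_computer_tree = {
    "age": {
        "youth": {"student": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(tree, tuple_):
    """Follow the branch matching the tuple's attribute value until a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # the attribute tested at this node
        tree = tree[attribute][tuple_[attribute]]
    return tree

print(classify(buys_computer_tree, {"age": "youth", "student": "yes"}))  # yes
```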


Algorithm for Decision Tree Induction

  • Basic algorithm (a greedy algorithm):
    • The tree is constructed in a top-down, recursive, divide-and-conquer manner.
    • At the start, all the training examples are at the root.
    • If attribute A is categorical or discrete-valued, the outcomes of the test at node N correspond directly to the known values of A.
    • If attribute A is continuous-valued, the test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, where split_point is the value returned by Attribute_selection_method as part of the splitting criterion.
    • Examples are partitioned recursively based on the selected attributes.
    • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
  • Conditions for stopping the partitioning:
    • All samples for a given node belong to the same class.
    • There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
    • There are no samples left.

Algorithm for Decision Tree Induction: Method

  1. Create a node N;
  2. If the tuples in D are all of the same class C, then
  3.   return N as a leaf node labeled with the class C;
  4. If attribute_list is empty, then
  5.   return N as a leaf node labeled with the majority class in D;
  6. Apply Attribute_selection_method(D, attribute_list) to find the “best” splitting_criterion;
  7. Label node N with splitting_criterion;
  8. If splitting_attribute is discrete-valued and multiway splits are allowed, then
  9.   attribute_list ← attribute_list − splitting_attribute;
  10. For each outcome j of splitting_criterion:
  11.   let Dj be the set of data tuples in D satisfying outcome j;
  12.   if Dj is empty, then
  13.     attach a leaf labeled with the majority class in D to node N;
  14.   else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
  endfor
  15. Return N;
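The following Python sketch (an illustration only, not the authors' implementation; all function and variable names are assumed) mirrors the Generate_decision_tree pseudocode above, using information gain as the attribute selection method and representing each training tuple as a dict of discrete attribute values plus a class label:

```python
from collections import Counter
from math import log2

def entropy(tuples, label="class"):
    """Expected information Info(D) = -sum_i p_i * log2(p_i)."""
    counts = Counter(t[label] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(tuples, attr, label="class"):
    """Gain(A) = Info(D) - Info_A(D) for a discrete-valued attribute A."""
    partitions = {}
    for t in tuples:
        partitions.setdefault(t[attr], []).append(t)
    total = len(tuples)
    info_a = sum(len(p) / total * entropy(p, label) for p in partitions.values())
    return entropy(tuples, label) - info_a

def generate_decision_tree(tuples, attribute_list, label="class"):
    """Top-down, recursive, divide-and-conquer induction with multiway splits."""
    classes = {t[label] for t in tuples}
    if len(classes) == 1:                       # all tuples have the same class
        return classes.pop()                    # leaf labeled with that class
    majority = Counter(t[label] for t in tuples).most_common(1)[0][0]
    if not attribute_list:                      # no attributes left: majority vote
        return majority
    best = max(attribute_list, key=lambda a: info_gain(tuples, a, label))
    remaining = [a for a in attribute_list if a != best]
    node = {best: {}}                           # internal node labeled with the test
    for outcome in {t[best] for t in tuples}:   # one branch per known value of best
        dj = [t for t in tuples if t[best] == outcome]
        node[best][outcome] = generate_decision_tree(dj, remaining, label)
    return node
```

Because branches here are grown only for outcomes that actually occur in D, the “Dj is empty” case of the pseudocode never arises in this simplified sketch.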


Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

 Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
 Expected information (entropy) needed to classify a tuple in D:
   $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
 Information needed (after using A to split D into v partitions) to classify D:
   $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
 Information gained by branching on attribute A:
   $Gain(A) = Info(D) - Info_A(D)$
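As a worked check (not part of the original slides, but using the buys_computer class distribution given later in these notes: 9 “yes” tuples and 5 “no” tuples), the expected information comes out to roughly 0.940 bits:

```python
from math import log2

# Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) for 9 "yes" / 5 "no" tuples.
p_yes, p_no = 9 / 14, 5 / 14
info_d = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(info_d, 3))  # 0.94
```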

The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age (youth, middle_aged, senior), and the tuples are partitioned accordingly.

Computing Information Gain for Continuous-Valued Attributes

  • Let attribute A be a continuous-valued attribute.
  • We must determine the best split point for A:
    • Sort the values of A in increasing order.
    • Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}.
    • The point with the minimum expected information requirement for A is selected as the split point for A.
  • Split:
    • D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point.
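A minimal Python sketch of this midpoint search (illustrative only; the helper names and the tiny example at the end are assumptions, not taken from the notes):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) over a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Pick the midpoint between adjacent sorted values of A that minimizes
    the expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                  # equal values: no midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= mid]  # D1: A <= split_point
        right = [lab for v, lab in pairs if v > mid]  # D2: A >  split_point
        info_a = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info_a < best_info:
            best_point, best_info = mid, info_a
    return best_point, best_info

# e.g. best_split_point([25, 32, 41, 47], ["no", "no", "yes", "yes"]) -> (36.5, 0.0)
```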


Gini index (CART, IBM IntelligentMiner)

  • If a data set D contains examples from n classes, the gini index gini(D) is defined as
    $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
    where $p_j$ is the relative frequency of class j in D.
  • If a data set D is split on A into two subsets D1 and D2, the gini index $gini_A(D)$ is defined as
    $gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$
  • Reduction in impurity:
    $\Delta gini(A) = gini(D) - gini_A(D)$
  • The attribute that provides the smallest $gini_{split}(D)$ (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
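A short Python sketch of these definitions (illustrative only; the function names and the dict-based tuple representation are assumptions), including the CART-style enumeration of binary subset splits for a discrete attribute:

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the class labels in D."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(tuples, attr, subset, label="class"):
    """gini_A(D) for the binary split D1 = {A in subset}, D2 = the rest."""
    d1 = [t[label] for t in tuples if t[attr] in subset]
    d2 = [t[label] for t in tuples if t[attr] not in subset]
    n = len(tuples)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_binary_split(tuples, attr, label="class"):
    """Enumerate proper, non-empty subsets of attr's values and return the one
    with the smallest gini_A(D), i.e. the largest reduction in impurity."""
    values = sorted({t[attr] for t in tuples})
    candidates = [set(c) for r in range(1, len(values))
                  for c in combinations(values, r)]
    return min(candidates, key=lambda s: gini_split(tuples, attr, s, label))
```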


Gini index (CART, IBM IntelligentMiner): Example

  • Example: D has 9 tuples with buys_computer = “yes” and 5 with buys_computer = “no”, so
    $gini(D) = 1 - \left(\tfrac{9}{14}\right)^2 - \left(\tfrac{5}{14}\right)^2 = 0.459$
  • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
    $gini_{income \in \{low,\, medium\}}(D) = \tfrac{10}{14}\, Gini(D_1) + \tfrac{4}{14}\, Gini(D_2)$
  • Similarly, the Gini index values for the remaining income splits are 0.315 ({low, high} and {medium}) and 0.300 ({medium, high} and {low}); the {medium, high} split is the best for income since its Gini index of 0.300 is the lowest.
  • Likewise, 0.375 ({youth, senior} and {middle_aged}) is the best split for age, and the attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429 respectively.
  • The attribute income with splitting subset {medium, high} therefore gives the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.300 = 0.159. This binary split yields the maximum reduction in impurity of the tuples in D and is therefore chosen as the splitting criterion (so the Gini index selects income rather than age at the root node).
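As a quick arithmetic check (a minimal snippet, not part of the original notes), the overall Gini index for this 9 “yes” / 5 “no” class distribution is indeed about 0.459:

```python
# gini(D) = 1 - (9/14)^2 - (5/14)^2 for 9 "yes" / 5 "no" tuples.
gini_d = 1.0 - (9 / 14) ** 2 - (5 / 14) ** 2
print(round(gini_d, 3))  # 0.459
```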


Other Attribute Selection Measures

  • CHAID: a popular decision tree algorithm; uses a measure based on the χ² test for independence.
  • C-SEP: performs better than information gain and the gini index in certain cases.
  • G-statistic: has a close approximation to the χ² distribution.
  • MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
    • The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree.
  • Multivariate splits (partitioning based on combinations of multiple variables):
    • CART: finds multivariate splits based on a linear combination of attributes.
  • Which attribute selection measure is the best?
    • Most give good results; none is significantly superior to the others.


Overfitting and Tree Pruning

  • Overfitting: an induced tree may overfit the training data.
    • Too many branches, some of which may reflect anomalies due to noise or outliers.
    • Poor accuracy on unseen samples.
  • Two approaches to avoid overfitting (see the scikit-learn sketch after this list):
    • Prepruning: halt tree construction early, i.e., do not split a node if doing so would cause the goodness measure to fall below a threshold.
      • It is difficult to choose an appropriate threshold.
    • Postpruning: remove branches from a “fully grown” tree, producing a sequence of progressively pruned trees.
      • Use a set of data different from the training data to decide which is the “best pruned tree”.
  • Decision trees also suffer from repetition and replication:
    • Repetition: an attribute is repeatedly tested along a given branch of the tree (such as “age < 60?” followed by “age < 45?”, and so on).
    • Replication: duplicate subtrees exist within the tree.
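As an illustration of both strategies (a hedged sketch, not from the original notes; it uses scikit-learn 0.22+ and a synthetic dataset), DecisionTreeClassifier exposes prepruning-style thresholds such as max_depth and min_impurity_decrease, and postpruning-style minimal cost-complexity pruning via ccp_alpha:

```python
# Sketch: prepruning vs. cost-complexity (post-)pruning with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop splitting early via depth / impurity-decrease thresholds.
pre = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow the tree, then prune it back with a cost-complexity penalty.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

print("prepruned accuracy :", pre.score(X_test, y_test))
print("postpruned accuracy:", post.score(X_test, y_test))
```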

[Figure: an unpruned decision tree (repeated tests on A1, A2, ... leading to Class A / Class B leaves) shown alongside the corresponding pruned decision tree.]


Repetition

[Figure: a branch that repeatedly tests the same attribute, e.g. A1 < 60?, then A1 < 50?, then A1 < 45?, ending in Class A / Class B leaves.] Repetition occurs where an attribute is repeatedly tested along a given branch of the tree, e.g. age.