Chapter 6. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by backpropagation
- Support Vector Machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Prediction
- Accuracy and error measures
- Ensemble methods
- Model selection
- Summary
Classification by Decision Tree Induction (DTI)
- DTI is the learning of decision trees from class-labeled training tuples.
- A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node is the root node.
- Why are DT classifiers so popular?
- The construction of DT classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
- DTs can handle high-dimensional data.
- Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate.
- They have good accuracy.
- They are used in medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Output: A Decision Tree for "buys_computer"

[Figure: decision tree for buys_computer, rooted at a test on age?, with internal nodes student? (branches no/yes) and credit rating? (branches excellent/fair), and leaf nodes labeled yes or no.]
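For readers who want to reproduce a tree like this one, here is a minimal sketch using scikit-learn (a tooling assumption, not part of the slides); the handful of encoded rows is illustrative only, not the textbook's full training set.

```python
# Minimal sketch: fitting a buys_computer-style decision tree with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

# Integer-encode the categorical attributes:
# age: youth=0, middle_aged=1, senior=2; student: no=0, yes=1;
# credit_rating: fair=0, excellent=1
X = [
    [0, 0, 0],  # youth, not a student, fair credit     -> no
    [0, 1, 0],  # youth, student, fair credit           -> yes
    [1, 0, 1],  # middle_aged, not a student, excellent -> yes
    [2, 0, 0],  # senior, not a student, fair credit    -> yes
    [2, 0, 1],  # senior, not a student, excellent      -> no
]
y = ["no", "yes", "yes", "yes", "no"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy ~ information gain
clf.fit(X, y)
print(export_text(clf, feature_names=["age", "student", "credit_rating"]))
```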
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- If attribute A is categorical (discrete-valued), the outcomes of the test at node N correspond directly to the known values of A
- If attribute A is continuous-valued, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, where split_point is the value returned by Attribute_selection_method as part of the splitting criterion
- Examples are partitioned recursively based on the selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
- There are no samples left
Algorithm for Decision Tree Induction: Method
1. Create a node N;
2. If the tuples in D are all of the same class C, then return N as a leaf node labeled with the class C;
3. If attribute_list is empty, then return N as a leaf node labeled with the majority class in D;
4. Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
5. Label node N with splitting_criterion;
6. If splitting_attribute is discrete-valued and multiway splits are allowed, then attribute_list ← attribute_list - splitting_attribute;
7. For each outcome j of splitting_criterion:
8. let Dj be the set of data tuples in D satisfying outcome j;
9. if Dj is empty, then attach a leaf labeled with the majority class in D to node N;
10. else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
11. Return N;
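The skeleton above can be written as a short recursive function. The sketch below is an illustration, not the textbook's implementation: it assumes tuples are represented as (attributes_dict, class_label) pairs, that an attribute-selection function (e.g., one based on information gain) is passed in, and that attributes are discrete-valued with multiway splits.

```python
# A compact Python sketch of Generate_decision_tree for discrete-valued
# attributes with multiway splits.
from collections import Counter

def majority_class(D):
    # D is a list of (attributes_dict, class_label) pairs
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, select_attribute):
    labels = [label for _, label in D]
    if len(set(labels)) == 1:          # all tuples belong to the same class C
        return labels[0]               # leaf labeled with C
    if not attribute_list:             # no attributes left to split on
        return majority_class(D)       # leaf labeled with the majority class
    A = select_attribute(D, attribute_list)        # "best" splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]
    for value in {attrs[A] for attrs, _ in D}:     # one branch per outcome
        Dj = [(attrs, label) for attrs, label in D if attrs[A] == value]
        # Branching only on values observed in D, so the "Dj is empty" case
        # of the pseudocode cannot arise here.
        node["branches"][value] = generate_decision_tree(Dj, remaining,
                                                         select_attribute)
    return node
```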
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.

Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
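As a concrete illustration of these formulas, here is a small Python sketch; the (attributes_dict, class_label) tuple representation is an assumption of this example, not part of the slides.

```python
# Entropy and information gain for a categorical attribute A.
from collections import Counter
from math import log2

def info(labels):
    # Info(D) = -sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(D, A):
    # D is a list of (attributes_dict, class_label) pairs
    labels = [label for _, label in D]
    n = len(labels)
    info_A = 0.0                      # Info_A(D) = sum_j |Dj|/|D| * Info(Dj)
    for value in {attrs[A] for attrs, _ in D}:
        Dj = [label for attrs, label in D if attrs[A] == value]
        info_A += len(Dj) / n * info(Dj)
    return info(labels) - info_A      # Gain(A) = Info(D) - Info_A(D)
```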
The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age (youth, middle_aged, senior), and the tuples are partitioned accordingly.
Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- We must determine the best split point for A
- Sort the values of A in increasing order
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
- The point with the minimum expected information requirement for A is selected as the split point for A
- Split: D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point
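A minimal Python sketch of this midpoint search follows; the helper names and the value/label list representation are assumptions made for illustration.

```python
# Choose the split point for a continuous attribute A by evaluating the
# midpoint between each pair of adjacent sorted values.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # values: numeric values of A; labels: the corresponding class labels
    pairs = sorted(zip(values, labels))
    best_expected, best_split = float("inf"), None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2     # (a_i + a_{i+1}) / 2
        left = [lbl for v, lbl in pairs if v <= split]  # D1: A <= split_point
        right = [lbl for v, lbl in pairs if v > split]  # D2: A > split_point
        # Expected information requirement for this candidate split point
        expected = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if expected < best_expected:
            best_expected, best_split = expected, split
    return best_split
```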
Gini index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the Gini index, gini(D), is defined as

  $$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

  where $p_j$ is the relative frequency of class j in D.
- If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

  $$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

- Reduction in impurity:

  $$\Delta gini(A) = gini(D) - gini_A(D)$$

- The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).
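A small Python sketch of these two measures for a binary split; the label-list inputs are an assumption made for illustration.

```python
# Gini index of a partition and weighted Gini index of a binary split.
from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels1, labels2):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    n = len(labels1) + len(labels2)
    return len(labels1) / n * gini(labels1) + len(labels2) / n * gini(labels2)
```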
Gini Index Example (CART, IBM IntelligentMiner)
- Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":

  $$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  $$gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2)$$

- Similarly, the Gini index values for the remaining income splits are 0.315 ({low, high} and {medium}) and 0.300 ({medium, high} and {low}); the split on {medium, high} is the best for income since its Gini index, 0.300, is the lowest.
- The best split for age is 0.375 ({youth, senior} and {middle_aged}); the attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively.
- The attribute income with splitting subset {medium, high} therefore gives the minimum Gini index overall, with a reduction in impurity of 0.459 - 0.300 = 0.159. This binary split yields the maximum reduction in impurity of the tuples in D and is thus chosen as the splitting criterion (the Gini index selects income rather than age at the root node).
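For completeness, the first value above can be checked directly:

```python
# Verify gini(D) for the 9 "yes" / 5 "no" class distribution.
gini_D = 1 - (9 / 14) ** 2 - (5 / 14) ** 2
print(round(gini_D, 3))   # 0.459
```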
Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence
- C-SEP: performs better than information gain and the Gini index in certain cases
- G-statistic: has a close approximation to the χ² distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
- Multivariate splits (partition based on multiple variable combinations)
- CART: finds multivariate splits based on a linear combination of attributes
- Which attribute selection measure is the best? Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples
- Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree" (a sketch follows the figure below)
- Decision trees suffer from repetition and replication
- Repetition: occurs when an attribute is repeatedly tested along a given branch of the tree (such as "age < 60?" followed by "age < 45?", and so on)
- Replication: duplicate subtrees exist within the tree
[Figure: an unpruned decision tree with internal tests (A1?, A2?, ...) and Class A / Class B leaves, alongside the corresponding pruned decision tree.]
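One way to realize the postpruning idea in practice is cost-complexity pruning. The sketch below uses scikit-learn (a tooling assumption, not part of the slides) and a held-out validation set in the role of the "data different from the training data".

```python
# Hedged sketch: postpruning via scikit-learn's cost-complexity pruning.
# X_train, y_train, X_val, y_val are assumed to be available.
from sklearn.tree import DecisionTreeClassifier

def best_pruned_tree(X_train, y_train, X_val, y_val):
    # Enumerate candidate pruning strengths (a sequence of progressively
    # pruned trees), then keep the tree that scores best on validation data.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)
    best_clf, best_score = None, -1.0
    for alpha in path.ccp_alphas:
        clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)
        if score > best_score:
            best_clf, best_score = clf, score
    return best_clf
```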
Repetition
[Figure: repetition, where an attribute is repeatedly tested along a given branch of the tree, e.g., A1 < 60?, then A1 < 50?, then A1 < 45?, with Class A / Class B leaves.]