Data Mining: Concepts and Techniques
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Major Clustering Approaches (I)
- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
Typical Alternatives to Calculate the Distance between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} dis(t_{ip}, t_{jq})$
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \max_{t_{ip} \in K_i,\, t_{jq} \in K_j} dis(t_{ip}, t_{jq})$
- Average: average distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \mathrm{avg}_{t_{ip} \in K_i,\, t_{jq} \in K_j} dis(t_{ip}, t_{jq})$
- Centroid: distance between the centroids of two clusters, i.e., $dis(K_i, K_j) = dis(C_i, C_j)$
- Medoid: distance between the medoids of two clusters, i.e., $dis(K_i, K_j) = dis(M_i, M_j)$, where a medoid is one chosen, centrally located object in the cluster (a small sketch of these measures follows below)
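The following is a minimal sketch of these inter-cluster distance measures for one-dimensional numeric data; it is not from the textbook, and the function names, the use of absolute difference as the point-to-point distance, and the sample clusters are assumptions for illustration.

```python
from itertools import product

def single_link(Ki, Kj):
    # Smallest pairwise distance between an element of Ki and an element of Kj
    return min(abs(p - q) for p, q in product(Ki, Kj))

def complete_link(Ki, Kj):
    # Largest pairwise distance between the two clusters
    return max(abs(p - q) for p, q in product(Ki, Kj))

def average_link(Ki, Kj):
    # Average of all pairwise distances between the two clusters
    return sum(abs(p - q) for p, q in product(Ki, Kj)) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):
    # Distance between the cluster centroids (means)
    return abs(sum(Ki) / len(Ki) - sum(Kj) / len(Kj))

def medoid(K):
    # Medoid: the object with the smallest total distance to the other objects
    return min(K, key=lambda x: sum(abs(x - y) for y in K))

def medoid_link(Ki, Kj):
    # Distance between the cluster medoids
    return abs(medoid(Ki) - medoid(Kj))

Ki, Kj = [2, 3, 4], [10, 11, 12]
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj),
      centroid_link(Ki, Kj), medoid_link(Ki, Kj))
```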
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the “middle” of a cluster, $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
- Radius: square root of the average distance from any point of the cluster to its centroid, $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
- Diameter: square root of the average mean squared distance between all pairs of points in the cluster, $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$ (a small sketch follows below)
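As a minimal sketch (not from the slides), the three quantities above could be computed for a one-dimensional cluster as follows; the function names and the sample cluster are assumptions.

```python
from math import sqrt

def centroid(points):
    # C_m: the mean of the points in the cluster
    return sum(points) / len(points)

def radius(points):
    # R_m: square root of the average squared distance from each point to the centroid
    c = centroid(points)
    return sqrt(sum((t - c) ** 2 for t in points) / len(points))

def diameter(points):
    # D_m: square root of the average squared distance over all pairs of points,
    # i.e., the double sum divided by N(N-1); a point paired with itself adds zero
    n = len(points)
    return sqrt(sum((p - q) ** 2 for p in points for q in points) / (n * (n - 1)))

cluster = [2, 3, 4, 10, 11, 12]
print(centroid(cluster), radius(cluster), diameter(cluster))
```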
Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized, $E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (t_{mi} - C_m)^2$
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion (a small sketch of this criterion follows below)
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen’67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
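A minimal sketch (not from the slides) of the sum-of-squared-error criterion for a given partition of one-dimensional data; the function name and the example partition are assumptions.

```python
def sse(clusters):
    # E = sum over clusters K_m of sum over t in K_m of (t - C_m)^2,
    # where C_m is the centroid (mean) of cluster K_m
    total = 0.0
    for K in clusters:
        c = sum(K) / len(K)
        total += sum((t - c) ** 2 for t in K)
    return total

print(sse([[2, 3, 4], [10, 11, 12, 20, 25, 30]]))
```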
The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps (a sketch follows below):
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no new assignments are made
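Below is a minimal, one-dimensional sketch of these four steps; it is not the textbook's implementation, and the function name, random seeding, and iteration cap are assumptions.

```python
import random

def k_means(points, k, max_iter=100):
    # Steps 1-2: form an initial set of k seed points (here: k random objects)
    means = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Step 2 (repeated): recompute seed points as centroids of the current partition
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        # Step 4: stop when the means, and hence the assignments, no longer change
        if new_means == means:
            break
        means = new_means
    return clusters, means

print(k_means([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))
```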
Comments on the K-Means Method
Variations of the K-Means Method
- A few variants of the k-means method differ in:
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang’98); a small sketch follows after this list
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
  - A mixture of categorical and numerical data: the k-prototype method
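The following is a minimal sketch of the k-modes ideas listed above (mode-based representatives and a mismatch-count dissimilarity for categorical records); it is my own illustration, not Huang's original algorithm, and the function names and toy records are assumptions.

```python
from collections import Counter

def mismatch_dissimilarity(x, y):
    # Dissimilarity for categorical objects: number of attributes on which they disagree
    return sum(1 for a, b in zip(x, y) if a != b)

def cluster_mode(records):
    # Frequency-based representative: the most frequent value of each attribute
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

records = [("red", "small"), ("red", "large"), ("blue", "small")]
print(cluster_mode(records))                           # ('red', 'small')
print(mismatch_dissimilarity(records[0], records[1]))  # 1
```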
K-Means Algorithm
K-Means Example
- Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
- Randomly assign means: m1 = 3, m2 = 4
- K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16
- K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18
- K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6
- K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25
- Stop, as the clusters obtained with these means are unchanged.
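To double-check the trace above, here is a small self-contained script (not from the slides) that runs one-dimensional k-means on the same data with the same initial means of 3 and 4; the variable names are arbitrary.

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [3, 4]  # initial means, as in the example

while True:
    # Assign each value to the cluster whose mean is nearest
    clusters = [[], []]
    for x in data:
        clusters[0 if abs(x - means[0]) <= abs(x - means[1]) else 1].append(x)
    # Recompute the means of the two clusters
    new_means = [sum(c) / len(c) for c in clusters]
    print(clusters, new_means)
    if new_means == means:  # stop when nothing changes
        break
    means = new_means
```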
A Typical K-Medoids Algorithm (PAM)
- Figure (scatter-plot illustration, k = 2): arbitrarily choose k objects as the initial medoids, then assign each remaining object to the nearest medoid (total cost = 20)
- Randomly select a non-medoid object O_random and compute the total cost of swapping it with a medoid (total cost = 26 in the illustration)
- Swap O and O_random if the quality is improved; loop until no change
PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h: if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change (a sketch follows below)
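Below is a minimal, simplified sketch of this swap-based procedure for one-dimensional data; it evaluates each candidate swap by recomputing the full assignment cost rather than using the incremental TC_ih bookkeeping of the original algorithm, and the function names and initial medoid choice are assumptions.

```python
from itertools import product

def total_cost(points, medoids):
    # Cost of a configuration: each object contributes its distance to the nearest medoid
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    # Step 1: select k representative objects arbitrarily (here: the first k objects)
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        # Steps 2-3: for each (selected i, non-selected h) pair, try the swap i -> h
        for i, h in product(list(medoids), points):
            if h in medoids:
                continue
            candidate = [h if m == i else m for m in medoids]
            # Accept the swap when it lowers the total cost (i.e., TC_ih < 0)
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids, improved = candidate, True
    # Finally, assign each non-selected object to the most similar representative object
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: abs(p - m))].append(p)
    return medoids, clusters

print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))
```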
What Is the Problem with PAM?
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets
  - Cost is $O(k(n-k)^2)$ per iteration, where n is the number of data objects and k is the number of clusters
- Remedy: a sampling-based method, CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications) (1990)
- CLARA (Kaufman and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S+
- It draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output (a sketch follows below)
- Strength: deals with larger data sets than PAM
- Weakness:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
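As a minimal sketch of this sampling idea (not the original implementation), the helper below reuses the pam() and total_cost() functions from the PAM sketch above; the function name, the default of 5 samples, and the 40 + 2k sample size (the values commonly cited for CLARA) should all be treated as assumptions.

```python
import random

def clara(points, k, num_samples=5, sample_size=None):
    # Draw several random samples, run PAM on each sample only,
    # and keep the medoids that give the lowest cost over the whole data set.
    if sample_size is None:
        sample_size = min(len(points), 40 + 2 * k)  # assumed classic sample size
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(points, sample_size)
        medoids, _ = pam(sample, k)           # cluster the sample with PAM
        cost = total_cost(points, medoids)    # evaluate the medoids on all objects
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

print(clara([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2, sample_size=6))
```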