Data Mining: Concepts and Techniques
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Major Clustering Approaches (I)
- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
Typical Alternatives to Calculate the Distance between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} dis(t_{ip}, t_{jq})$
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \max_{t_{ip} \in K_i,\, t_{jq} \in K_j} dis(t_{ip}, t_{jq})$
- Average: average distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \mathrm{avg}_{t_{ip} \in K_i,\, t_{jq} \in K_j} dis(t_{ip}, t_{jq})$
- Centroid: distance between the centroids of two clusters, i.e., $dis(K_i, K_j) = dis(C_i, C_j)$
- Medoid: distance between the medoids of two clusters, i.e., $dis(K_i, K_j) = dis(M_i, M_j)$, where a medoid is one chosen, centrally located object in the cluster (a small sketch of these measures follows below)
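The following is a minimal sketch of these inter-cluster distance measures for one-dimensional numeric data; it is not from the textbook, and the function names, the use of absolute difference as the point-to-point distance, and the sample clusters are assumptions for illustration.

```python
from itertools import product

def single_link(Ki, Kj):
    # Smallest pairwise distance between an element of Ki and an element of Kj
    return min(abs(p - q) for p, q in product(Ki, Kj))

def complete_link(Ki, Kj):
    # Largest pairwise distance between the two clusters
    return max(abs(p - q) for p, q in product(Ki, Kj))

def average_link(Ki, Kj):
    # Average of all pairwise distances between the two clusters
    return sum(abs(p - q) for p, q in product(Ki, Kj)) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):
    # Distance between the cluster centroids (means)
    return abs(sum(Ki) / len(Ki) - sum(Kj) / len(Kj))

def medoid(K):
    # Medoid: the object with the smallest total distance to the other objects
    return min(K, key=lambda x: sum(abs(x - y) for y in K))

def medoid_link(Ki, Kj):
    # Distance between the cluster medoids
    return abs(medoid(Ki) - medoid(Kj))

Ki, Kj = [2, 3, 4], [10, 11, 12]
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj),
      centroid_link(Ki, Kj), medoid_link(Ki, Kj))
```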
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the “middle” of a cluster, $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
- Radius: square root of the average distance from any point of the cluster to its centroid, $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
- Diameter: square root of the average mean squared distance between all pairs of points in the cluster, $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$ (a small sketch follows below)
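As a minimal sketch (not from the slides), the three quantities above could be computed for a one-dimensional cluster as follows; the function names and the sample cluster are assumptions.

```python
from math import sqrt

def centroid(points):
    # C_m: the mean of the points in the cluster
    return sum(points) / len(points)

def radius(points):
    # R_m: square root of the average squared distance from each point to the centroid
    c = centroid(points)
    return sqrt(sum((t - c) ** 2 for t in points) / len(points))

def diameter(points):
    # D_m: square root of the average squared distance over all pairs of points,
    # i.e., the double sum divided by N(N-1); a point paired with itself adds zero
    n = len(points)
    return sqrt(sum((p - q) ** 2 for p in points for q in points) / (n * (n - 1)))

cluster = [2, 3, 4, 10, 11, 12]
print(centroid(cluster), radius(cluster), diameter(cluster))
```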
Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized, $E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (t_{mi} - C_m)^2$
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion (a small sketch of this criterion follows below)
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen’67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
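A minimal sketch (not from the slides) of the sum-of-squared-error criterion for a given partition of one-dimensional data; the function name and the example partition are assumptions.

```python
def sse(clusters):
    # E = sum over clusters K_m of sum over t in K_m of (t - C_m)^2,
    # where C_m is the centroid (mean) of cluster K_m
    total = 0.0
    for K in clusters:
        c = sum(K) / len(K)
        total += sum((t - c) ** 2 for t in K)
    return total

print(sse([[2, 3, 4], [10, 11, 12, 20, 25, 30]]))
```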
The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps (a sketch follows below):
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no new assignments are made
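Below is a minimal, one-dimensional sketch of these four steps; it is not the textbook's implementation, and the function name, random seeding, and iteration cap are assumptions.

```python
import random

def k_means(points, k, max_iter=100):
    # Steps 1-2: form an initial set of k seed points (here: k random objects)
    means = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Step 2 (repeated): recompute seed points as centroids of the current partition
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        # Step 4: stop when the means, and hence the assignments, no longer change
        if new_means == means:
            break
        means = new_means
    return clusters, means

print(k_means([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))
```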
Comments on the K-Means Method
Variations of the K-Means Method
- A few variants of the k-means method differ in:
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang’98); a small sketch follows after this list
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
  - A mixture of categorical and numerical data: the k-prototype method
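The following is a minimal sketch of the k-modes ideas listed above (mode-based representatives and a mismatch-count dissimilarity for categorical records); it is my own illustration, not Huang's original algorithm, and the function names and toy records are assumptions.

```python
from collections import Counter

def mismatch_dissimilarity(x, y):
    # Dissimilarity for categorical objects: number of attributes on which they disagree
    return sum(1 for a, b in zip(x, y) if a != b)

def cluster_mode(records):
    # Frequency-based representative: the most frequent value of each attribute
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

records = [("red", "small"), ("red", "large"), ("blue", "small")]
print(cluster_mode(records))                           # ('red', 'small')
print(mismatch_dissimilarity(records[0], records[1]))  # 1
```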
K-Means Algorithm
K-Means Example
- Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
- Randomly assign means: m1 = 3, m2 = 4
- K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16
- K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18
- K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6
- K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25
- Stop, as the clusters obtained with these means are unchanged.
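To double-check the trace above, here is a small self-contained script (not from the slides) that runs one-dimensional k-means on the same data with the same initial means of 3 and 4; the variable names are arbitrary.

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [3, 4]  # initial means, as in the example

while True:
    # Assign each value to the cluster whose mean is nearest
    clusters = [[], []]
    for x in data:
        clusters[0 if abs(x - means[0]) <= abs(x - means[1]) else 1].append(x)
    # Recompute the means of the two clusters
    new_means = [sum(c) / len(c) for c in clusters]
    print(clusters, new_means)
    if new_means == means:  # stop when nothing changes
        break
    means = new_means
```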
A Typical K-Medoids Algorithm (PAM)
- Figure (scatter-plot illustration, k = 2): arbitrarily choose k objects as the initial medoids, then assign each remaining object to the nearest medoid (total cost = 20)
- Randomly select a non-medoid object O_random and compute the total cost of swapping it with a medoid (total cost = 26 in the illustration)
- Swap O and O_random if the quality is improved; loop until no change
PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h: if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change (a sketch follows below)
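Below is a minimal, simplified sketch of this swap-based procedure for one-dimensional data; it evaluates each candidate swap by recomputing the full assignment cost rather than using the incremental TC_ih bookkeeping of the original algorithm, and the function names and initial medoid choice are assumptions.

```python
from itertools import product

def total_cost(points, medoids):
    # Cost of a configuration: each object contributes its distance to the nearest medoid
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    # Step 1: select k representative objects arbitrarily (here: the first k objects)
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        # Steps 2-3: for each (selected i, non-selected h) pair, try the swap i -> h
        for i, h in product(list(medoids), points):
            if h in medoids:
                continue
            candidate = [h if m == i else m for m in medoids]
            # Accept the swap when it lowers the total cost (i.e., TC_ih < 0)
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids, improved = candidate, True
    # Finally, assign each non-selected object to the most similar representative object
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: abs(p - m))].append(p)
    return medoids, clusters

print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))
```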
What Is the Problem with PAM?
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets
  - Cost is $O(k(n-k)^2)$ per iteration, where n is the number of data objects and k is the number of clusters
- Remedy: a sampling-based method, CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications) (1990)
- CLARA (Kaufman and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S+
- It draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output (a sketch follows below)
- Strength: deals with larger data sets than PAM
- Weakness:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
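As a minimal sketch of this sampling idea (not the original implementation), the helper below reuses the pam() and total_cost() functions from the PAM sketch above; the function name, the default of 5 samples, and the 40 + 2k sample size (the values commonly cited for CLARA) should all be treated as assumptions.

```python
import random

def clara(points, k, num_samples=5, sample_size=None):
    # Draw several random samples, run PAM on each sample only,
    # and keep the medoids that give the lowest cost over the whole data set.
    if sample_size is None:
        sample_size = min(len(points), 40 + 2 * k)  # assumed classic sample size
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(points, sample_size)
        medoids, _ = pam(sample, k)           # cluster the sample with PAM
        cost = total_cost(points, medoids)    # evaluate the medoids on all objects
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

print(clara([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2, sample_size=6))
```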