A comprehensive introduction to cluster analysis, a fundamental data mining technique used to group similar data objects together. It covers basic concepts, partitioning methods, hierarchical methods, and evaluation of clustering. The document also includes examples and exercises to illustrate the concepts and methods discussed.
What is Cluster Analysis?
- Cluster analysis groups a set of data objects into clusters so that objects within a cluster are similar to one another and dissimilar to objects in other clusters
- Clustering is a form of unsupervised learning: there are no predefined class labels
Quality: What Is Good Clustering?
- A good clustering method produces clusters with high intra-class similarity (cohesion within clusters) and low inter-class similarity (separation between clusters)
- The quality of a clustering result depends on the similarity measure used, its implementation, and the method's ability to discover some or all of the hidden patterns
Considerations for Cluster Analysis
- Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure: distance-based (e.g., Euclidean, road-network, or vector distance) vs. connectivity-based (e.g., density or contiguity); a small distance sketch follows this list
- Clustering space: full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
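As a minimal illustration of a distance-based similarity measure, here is a sketch (added for this text, not part of the original slides) of Euclidean distance between two numeric objects:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two objects given as coordinate tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))  # 5.0
```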
Requirements and Challenges
- Scalability: clustering all the data instead of only a sample
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering: the user may give inputs on constraints; domain knowledge can be used to determine input parameters
- Interpretability and usability
- Others:
  - Discovery of clusters with arbitrary shape
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - High dimensionality
Major Clustering Approaches (continued)
- Model-based: a model is hypothesized for each cluster, and the method finds the best fit of the data to the given model. Typical methods: EM, SOM, COBWEB
- Frequent pattern-based: based on the analysis of frequent patterns. Typical methods: p-Cluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints. Typical methods: COD (obstacles), constrained clustering
- Link-based clustering: objects are often linked together in various ways, and massive links can be used to cluster objects. Typical methods: SimRank, LinkClus
Distance between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min{ dis(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max{ dis(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg{ dis(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
- Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj), where a medoid is one chosen, centrally located object in the cluster
A sketch computing these measures follows.
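A minimal sketch (added here, not from the original slides) computing these five inter-cluster measures for two toy 2-D clusters, using Euclidean distance between objects:

```python
import numpy as np

def dis(a, b):
    return np.linalg.norm(a - b)  # Euclidean distance between two objects

Ki = np.array([[1.0, 1.0], [2.0, 1.0]])    # toy cluster Ki
Kj = np.array([[5.0, 4.0], [6.0, 5.0]])    # toy cluster Kj

pairwise = [dis(p, q) for p in Ki for q in Kj]
single   = min(pairwise)                   # single link:   closest pair
complete = max(pairwise)                   # complete link: farthest pair
average  = sum(pairwise) / len(pairwise)   # average link:  mean over all pairs
centroid = dis(Ki.mean(axis=0), Kj.mean(axis=0))  # centroid distance

def medoid(K):
    # The object with the smallest total distance to the rest of its cluster
    return min(K, key=lambda p: sum(dis(p, q) for q in K))

medoid_dis = dis(medoid(Ki), medoid(Kj))   # medoid distance
```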
Partitioning Algorithms: Basic Concept
- Partitioning method: partitioning a database D of n objects into a set of k clusters such that the sum of squared distances is minimized, where ci is the centroid or medoid of cluster Ci:

  E = Σ_{i=1}^{k} Σ_{p ∈ Ci} (p − ci)²

- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means: each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
A sketch of the criterion E appears below.
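As a sketch of the criterion E (illustrative, not the textbook's code), assuming points, integer cluster labels, and centroids are given as NumPy arrays:

```python
import numpy as np

def sse(points, labels, centroids):
    # E = sum over clusters i of sum over p in Ci of ||p - ci||^2
    return sum(np.sum((points[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))
```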
An Example of K-Means Clustering
(The original figure steps through these iterations on the initial data set.)
1. Arbitrarily choose k objects as the initial centroids and assign each object to its nearest centroid
2. Compute new centroids for each cluster and update the cluster centroids
3. Reassign each object to its nearest centroid
4. Loop back to step 2 if any assignment changed
A minimal implementation sketch follows.
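A minimal k-means sketch following these steps (an illustrative implementation, assuming NumPy points and that no cluster becomes empty; not the textbook's reference code):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to its nearest centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Compute new centroids (assumes no cluster goes empty)
        new_centroids = np.array([points[labels == i].mean(axis=0)
                                  for i in range(k)])
        if np.allclose(new_centroids, centroids):  # loop until stable
            break
        centroids = new_centroids
    return labels, centroids
```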
Comments on the K-Means Method
- Strength: efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n
  - For comparison, PAM costs O(k(n − k)²)
- Comment: often terminates at a local optimum (a multiple-restart sketch follows this list)
- Weaknesses:
  - Applicable only to objects in a continuous n-dimensional space; use the k-modes method for categorical data (in comparison, k-medoids can be applied to a wide range of data)
  - Need to specify k, the number of clusters, in advance (though there are ways to automatically determine the best k)
  - Sensitive to noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes
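Because k-means often stops at a local optimum, a common remedy is to run it several times from different random initializations and keep the run with the lowest E. A sketch, assuming the illustrative kmeans() and sse() helpers from the earlier snippets are in scope:

```python
import numpy as np

points = np.random.default_rng(1).normal(size=(100, 2))  # toy data
runs = [kmeans(points, k=3, seed=s) for s in range(10)]  # 10 random restarts
labels, centroids = min(runs, key=lambda r: sse(points, r[0], r[1]))
```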