Implementing K-means Clustering - Data Mining - Project | CSC 573 | Study Guides, Projects, Research Computer Science

CSC 573: Data Mining

Programming Project #1: Implementing K-means clustering

Instructor: Ratko Orlandic

Your task is to implement, in two separate programs, a data preprocessing routine and the K-means clustering algorithm. You will test these programs on the data set indicated below. The preprocessing program should perform simple min-max data normalization (see below). In addition to the K-means algorithm, the clustering program must provide a way to evaluate the quality of clustering (see below). You should develop the programs in a programming language of your choice (Java, C, or C++), but the programs should run on Windows. While this text gives you some implementation freedom, you should make every effort to follow this specification exactly.

Data Set

Input to your programs is the attached data file “letters.txt” (included in Project1Files.zip). The “letters.txt” data file has 100 instances (data points) with 17 attributes (dimensions), the 17th being the class dimension (one of the 26 integers 1-26, each denoting a capital letter). I extracted these points from a larger data set with 20,000 points. This is a tab-delimited text file. Note that a row in the data file represents a data point and a column represents an attribute (dimension). More information about the data set is given in the attached text file “letters-facts.names”.
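As an illustration of the file layout described above, the following is a minimal C++ sketch of reading tab-delimited rows whose last field is the class label. The names `DataPoint`, `parseLine`, and `readPoints` are hypothetical, and skipping lines that do not parse is an assumption, not something the specification requires.

```cpp
#include <exception>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One data point: the non-class attribute values plus the integer class label.
struct DataPoint {
    std::vector<double> features;
    int label = 0;
};

// Parse one tab-delimited line; the last field is treated as the class label.
// Returns false if the line has fewer than two fields or a field is not numeric.
bool parseLine(const std::string& line, DataPoint& out) {
    std::vector<double> values;
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, '\t')) {
        try {
            values.push_back(std::stod(field));
        } catch (const std::exception&) {
            return false;  // empty or non-numeric field
        }
    }
    if (values.size() < 2) return false;
    out.label = static_cast<int>(values.back());
    values.pop_back();
    out.features = values;
    return true;
}

// Read every parsable line of a stream into a list of points.
// Assumption: lines that fail to parse (for example, blank lines) are skipped.
std::vector<DataPoint> readPoints(std::istream& in) {
    std::vector<DataPoint> points;
    std::string line;
    while (std::getline(in, line)) {
        DataPoint p;
        if (parseLine(line, p)) points.push_back(p);
    }
    return points;
}
```

Passing a `std::ifstream` opened on “letters.txt” to `readPoints` should yield 100 points with 16 features each if the file matches the description above.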

Data Preprocessing

Your data-preprocessing program “normalize” must take the original data file and transform each non-class coordinate $v_i$ (the value in a dimension $i$ other than the class dimension) of every data point into a real value $v'_i$ between 0 and 1, using the following min-max normalization formula:

$$v'_i = \frac{v_i - \min_i}{\max_i - \min_i}$$

where $\min_i$ and $\max_i$ are the minimum and maximum value, respectively, in the dimension $i$. Note that each normalized value is computed with respect to the minimum and maximum values in the corresponding column (dimension) rather than the row (data point).

Given the original data set “letters.txt”, “normalize” should store normalized data (one point on a single line) in a text file “letters-norm.txt”. The program takes no user-defined parameters.
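The column-wise normalization can be sketched in C++ as follows. This is only an illustration: the function name `minMaxNormalize` is hypothetical, the class label is assumed to be the last column of each row, and mapping a constant column (where the maximum equals the minimum) to 0 is an assumption, since the specification does not cover that case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Min-max normalize every column except the last (class) column, in place.
// Each row is one data point; all rows must have the same number of columns.
void minMaxNormalize(std::vector<std::vector<double>>& rows) {
    if (rows.empty()) return;
    const std::size_t cols = rows[0].size();
    for (std::size_t j = 0; j + 1 < cols; ++j) {  // j + 1 < cols skips the class column
        // Find the minimum and maximum of column j over all data points.
        double lo = rows[0][j];
        double hi = rows[0][j];
        for (const auto& r : rows) {
            assert(r.size() == cols);
            lo = std::min(lo, r[j]);
            hi = std::max(hi, r[j]);
        }
        const double range = hi - lo;
        for (auto& r : rows) {
            // Assumption: a constant column (max == min) is mapped to 0.
            r[j] = (range > 0.0) ? (r[j] - lo) / range : 0.0;
        }
    }
}
```

For example, a column holding 2, 4, and 6 becomes 0, 0.5, and 1, while the class column is left unchanged.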

Data Clustering

Your “cluster” program should take the normalized data set as input and perform the following version of the K-means algorithm:

Obtain K initial cluster centers (see below);
Repeat
    Assign every data point to its nearest cluster center (using Euclidean distance);
    Move each cluster center to the mean (average) of its assigned points;
Until no change (no point is re-assigned).
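The loop above can be sketched in C++ as follows. This is a minimal illustration, not a reference solution: the names `Point`, `squaredDistance`, and `kMeans` are hypothetical, the initial centers are assumed to be supplied by the caller, and leaving an empty cluster's center where it is counts as an assumption as well. The sketch compares squared distances, which selects the same nearest center as Euclidean distance.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        const double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// On entry, centers holds the K initial centers; on return, the final centers.
// Returns the index of the cluster assigned to each point.
std::vector<std::size_t> kMeans(const std::vector<Point>& points,
                                std::vector<Point>& centers) {
    assert(!centers.empty());
    const std::size_t k = centers.size();
    const std::size_t dims = centers[0].size();
    std::vector<std::size_t> assignment(points.size(), k);  // k means "unassigned"
    bool changed = true;
    while (changed) {
        changed = false;
        // Assign every point to its nearest center (ties go to the lowest index).
        for (std::size_t i = 0; i < points.size(); ++i) {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double dist = squaredDistance(points[i], centers[c]);
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            if (best != assignment[i]) { assignment[i] = best; changed = true; }
        }
        // Move each center to the mean of its assigned points.
        std::vector<Point> sums(k, Point(dims, 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < points.size(); ++i) {
            for (std::size_t d = 0; d < dims; ++d) sums[assignment[i]][d] += points[i][d];
            ++counts[assignment[i]];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // Assumption: an empty cluster keeps its center.
            for (std::size_t d = 0; d < dims; ++d) centers[c][d] = sums[c][d] / counts[c];
        }
    }
    return assignment;
}
```

The class column should be stripped from each point before calling `kMeans`, so that the label does not influence the distances.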



Input and Output of the Clustering Program

Attached Files