Implementing K-means Clustering - Data Mining - Project | CSC 573, Study Guides, Projects, Research of Computer Science

Material Type: Project; Class: Data Mining; Subject: Computer Science; University: University of Illinois Springfield; Term: Unknown 1989;

Uploaded on 08/18/2009


CSC 573: Data Mining
Programming Project #1: Implementing K-means clustering
Instructor: Ratko Orlandic

Your task is to implement, in two separate programs, a data preprocessing routine and the K-means clustering algorithm. You will test these programs on the data set indicated below. The preprocessing program should perform simple min-max data normalization (see below). In addition to the K-means algorithm, the clustering program must provide a way to evaluate the quality of clustering (see below). You should develop the programs in a programming language of your choice (Java, C, or C++), but the programs should run on Windows. While this text gives you some implementation freedom, you should make every effort to follow this specification exactly.

Data Set

Input to your programs is the attached data file “letters.txt” (included in Project1Files.zip). The “letters.txt” data file has 100 instances (data points) with 17 attributes (dimensions), the 17th being the class dimension (one of 26 integer numbers 1-26 denoting a capital letter). I have extracted these points from a larger data set with 20,000 points. This is a tab-delimited text file. Note that a row in the data file represents a data point and a column represents an attribute (dimension). More information about the data set is given in the attached text file “letters-facts.names”.
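As an illustration of the file layout described above, the following Java sketch splits each tab-delimited row into its non-class coordinates and its class label. The class name LetterData, the method parse, and the sample rows in main are hypothetical and invented for this example. The sketch also assumes that the file has no header row and that the class is stored as a plain integer in the last column.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the assignment files): parses tab-delimited
// rows into coordinate vectors plus an integer class label from the last column.
class LetterData {
    final double[][] features; // one row per data point, class column excluded
    final int[] labels;        // class dimension of each data point

    LetterData(double[][] features, int[] labels) {
        this.features = features;
        this.labels = labels;
    }

    static LetterData parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        List<Integer> classes = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;       // skip blank lines
            String[] fields = line.trim().split("\t"); // tab-delimited columns
            int d = fields.length - 1;                 // last column is the class
            double[] row = new double[d];
            for (int i = 0; i < d; i++) {
                row[i] = Double.parseDouble(fields[i].trim());
            }
            rows.add(row);
            classes.add(Integer.parseInt(fields[d].trim()));
        }
        int[] labels = new int[classes.size()];
        for (int p = 0; p < labels.length; p++) labels[p] = classes.get(p);
        return new LetterData(rows.toArray(new double[0][]), labels);
    }

    public static void main(String[] args) throws IOException {
        // Pass a file name (e.g. letters.txt) as the first argument to parse a
        // real file; otherwise two made-up sample rows are used.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("2\t8\t3\t20", "5\t12\t3\t9");
        LetterData data = parse(lines);
        System.out.println(data.features.length + " points, "
                + data.features[0].length + " non-class dimensions");
    }
}
```

Keeping the class labels in a separate array makes it easy to leave them out of normalization and clustering while still having them available for evaluation.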

Data Preprocessing

Your data-preprocessing program “normalize” must take the original data file and transform each non-class coordinate $v_i$ (the value of a dimension $i$ other than the class dimension) of every data point into a real value $v'_i$ between 0 and 1, using the following min-max normalization formula:

$$v'_i = \frac{v_i - \min_i}{\max_i - \min_i}$$

where $\min_i$ and $\max_i$ are the minimum and maximum values, respectively, in dimension $i$. Note that each normalized value is computed with respect to the minimum and maximum values in the corresponding column (dimension) rather than the row (data point).

Given the original data set “letters.txt”, “normalize” should store normalized data (one point on a single line) in a text file “letters-norm.txt”. The program takes no user-defined parameters.
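The column-wise normalization described above could be sketched in Java as follows. This is only an illustration: the names MinMaxSketch and normalize are hypothetical, the class column is assumed to have been separated from the coordinates beforehand, and mapping a constant column (where the maximum equals the minimum) to 0 is an assumption, since this text does not say how to handle that case.

```java
import java.util.Arrays;

// A minimal sketch of column-wise min-max normalization. The names used here
// are hypothetical and are not prescribed by the assignment.
class MinMaxSketch {
    // points[p][i] is coordinate i of data point p (class column excluded).
    static double[][] normalize(double[][] points) {
        int n = points.length;
        int d = points[0].length;
        double[] min = new double[d];
        double[] max = new double[d];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        // First pass: minimum and maximum of every column (dimension).
        for (double[] point : points) {
            for (int i = 0; i < d; i++) {
                min[i] = Math.min(min[i], point[i]);
                max[i] = Math.max(max[i], point[i]);
            }
        }
        // Second pass: v'_i = (v_i - min_i) / (max_i - min_i).
        double[][] out = new double[n][d];
        for (int p = 0; p < n; p++) {
            for (int i = 0; i < d; i++) {
                double range = max[i] - min[i];
                // Assumption: a constant column (range 0) is mapped to 0.
                out[p][i] = range == 0 ? 0.0 : (points[p][i] - min[i]) / range;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] demo = { {2, 10}, {4, 10}, {6, 30} };
        for (double[] row : normalize(demo)) {
            System.out.println(Arrays.toString(row));
        }
        // prints [0.0, 0.0], then [0.5, 0.0], then [1.0, 1.0]
    }
}
```

Two passes are needed because the minimum and maximum of a column are known only after all rows have been read.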

Data Clustering

Your “cluster” program should take the normalized data set as input and perform the following version of the K-means algorithm:

Obtain K initial cluster centers (see below);
Repeat
    Assign every data point to its nearest cluster center (using Euclidean distance);
    Move each cluster center to the mean (average) of its assigned points;
Until no change (no point is re-assigned).

Note that the class dimension is ignored by the K-means algorithm. However, the class dimension is used to evaluate the quality of clustering (see below). For the purposes of this program, the K initial cluster centers are the points in the data set whose order numbers are indicated in the specified input file. The input file format is discussed later in this assignment.
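One possible shape for the loop above is sketched below in Java. The names KMeansSketch and cluster are hypothetical. The sketch assumes that the initial centers arrive as an array of 1-based order numbers, that ties between equally near centers go to the center with the lowest index, and that a cluster left without points keeps its previous center. None of these details is fixed by this text. Squared distances are compared, which selects the same nearest center as the Euclidean distance.

```java
import java.util.Arrays;

// A sketch of the K-means variant described above; all names are hypothetical.
class KMeansSketch {
    // points: normalized coordinates without the class column.
    // seeds:  1-based order numbers of the points used as initial centers.
    // Returns, for every point, the index (0..K-1) of its final cluster.
    static int[] cluster(double[][] points, int[] seeds) {
        int n = points.length;
        int d = points[0].length;
        int k = seeds.length;
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = points[seeds[j] - 1].clone(); // 1-based to 0-based
        }
        int[] assign = new int[n];
        Arrays.fill(assign, -1); // no point is assigned yet
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assign every point to its nearest center.
            for (int p = 0; p < n; p++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int j = 0; j < k; j++) {
                    double dist = 0;
                    for (int i = 0; i < d; i++) {
                        double diff = points[p][i] - centers[j][i];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { // strict: ties keep the lower index
                        bestDist = dist;
                        best = j;
                    }
                }
                if (assign[p] != best) {
                    assign[p] = best;
                    changed = true;
                }
            }
            // Move each center to the mean of its assigned points.
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int p = 0; p < n; p++) {
                count[assign[p]]++;
                for (int i = 0; i < d; i++) sum[assign[p]][i] += points[p][i];
            }
            for (int j = 0; j < k; j++) {
                if (count[j] == 0) continue; // assumption: empty cluster keeps its center
                for (int i = 0; i < d; i++) centers[j][i] = sum[j][i] / count[j];
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] demo = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        System.out.println(Arrays.toString(cluster(demo, new int[] {1, 3})));
    }
}
```

The loop stops exactly when a full assignment pass re-assigns no point, which matches the "Until no change" condition.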

To evaluate the quality of clustering, we will use a measure called purity of clusters. After clustering, your program should compute the purity of each cluster and overall purity of clusters. The purity of each cluster $C_j$ is defined as $P_j = L_j / N_j$, where $L_j$ is the number of points belonging to the largest class in $C_j$ (most points in $C_j$ are of this class) and $N_j$ is the total number of points in $C_j$. The overall purity of clusters is the average purity of clusters. The purity of a cluster and the overall purity of clusters must be real numbers between 0 and 1.

A purity of 1 indicates the best possible clustering quality. However, your program will not necessarily achieve an overall purity of 1: on many data sets, the highest purity is not achievable. The purity also varies with the selected number of clusters K.
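The purity computation can be illustrated with the following Java sketch. The names PuritySketch, clusterPurities and overallPurity are hypothetical, and reporting a purity of 0 for a cluster that ends up with no points is an assumption, since the definition above does not cover that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of the purity measure defined above; all names are hypothetical.
class PuritySketch {
    // assign[p]: cluster index (0..k-1) of point p; labels[p]: its class.
    static double[] clusterPurities(int[] assign, int[] labels, int k) {
        int[] size = new int[k];    // N_j: number of points in cluster j
        int[] largest = new int[k]; // L_j: size of the largest class in cluster j
        List<Map<Integer, Integer>> classCounts = new ArrayList<>();
        for (int j = 0; j < k; j++) classCounts.add(new HashMap<>());
        for (int p = 0; p < assign.length; p++) {
            int j = assign[p];
            size[j]++;
            // merge returns the updated count of this class within cluster j
            int count = classCounts.get(j).merge(labels[p], 1, Integer::sum);
            largest[j] = Math.max(largest[j], count);
        }
        double[] purity = new double[k];
        for (int j = 0; j < k; j++) {
            // Assumption: an empty cluster is given purity 0.
            purity[j] = size[j] == 0 ? 0.0 : (double) largest[j] / size[j];
        }
        return purity;
    }

    // Overall purity as defined in this text: the plain average of P_j.
    static double overallPurity(double[] purities) {
        double sum = 0;
        for (double p : purities) sum += p;
        return sum / purities.length;
    }

    public static void main(String[] args) {
        int[] assign = {0, 0, 0, 1, 1};
        int[] labels = {1, 1, 2, 3, 3};
        double[] purities = clusterPurities(assign, labels, 2);
        System.out.println(Arrays.toString(purities));
        System.out.println(overallPurity(purities));
    }
}
```

In the small example in main, the first cluster holds two points of class 1 and one of class 2, so its purity is 2/3, while the second cluster holds only class 3 and has purity 1.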

The overall purity of clusters can be used as a measure of clustering quality only if the data set contains class information. There are ways to evaluate the quality of clustering on data sets without this information (see Equation 7.18 on page 402 of the textbook). However, estimating the quality of clustering is generally a difficult problem.

Input and Output of the Clustering Program

Through a simple interface, the clustering program should first ask the user to specify the names of the input file (e.g., “input.txt”) and the output file (e.g., “output.txt”). The program reads its input parameters from the input file and writes all of its output to the specified output file until it terminates. The initial cluster centers are the points in the data set whose order numbers (numbering begins with 1) are listed in the input file. After performing all clustering runs indicated in the input file, your program can terminate.

I have attached an “input.txt” file, which I will use in testing your program. You can also find attached a file called “inputfile-format.txt” that explains the meaning of individual fields and lines in an input file. Please study this format carefully.

The format of the output file is given in the attached “outputfile-format.txt”. Please follow this format exactly. All messages in response to the errors detected by your program should also be directed to this file. The format of error messages is up to you.

Attached Files

In addition to this assignment text (“Project1Text.pdf”), the “Project1Files.zip” file contains:

  1. “letters.txt” --- a text data file of 100 points with 17 dimensions;
  2. “letters-facts.names” --- a text file explaining the “letters” data;
  3. “input.txt” --- a text file with the input parameters for 8 clustering runs;
  4. “inputfile-format.txt” --- a text file explaining the format of the input file;
  5. “outputfile-format.txt” --- a text file explaining the format of the output file; and
  6. “All.xls” --- a Microsoft Excel file with all of the above files in the Excel format.