

Material Type: Project; Class: Data Mining; Subject: Computer Science; University: University of Illinois Springfield; Term: Unknown 1989;
Typology: Study Guides, Projects, Research
CSC 573: Data Mining
Programming Project #1: Implementing K-means clustering
Instructor: Ratko Orlandic
Your task is to implement, in two separate programs, a data preprocessing routine and the K-means clustering algorithm. You will test these programs on the data set indicated below. The preprocessing program should perform simple min-max data normalization (see below). In addition to the K-means algorithm, the clustering program must provide a way to evaluate the quality of clustering (see below). You should develop the programs in a programming language of your choice (Java, C, or C++), but the programs should run on Windows. While this text gives you some implementation freedom, you should make every effort to follow this specification exactly.
Input to your programs is the attached data file “letters.txt” (included in Project1Files.zip). The “letters.txt” data file has 100 instances (data points) with 17 attributes (dimensions), the 17th being the class dimension (one of 26 integer numbers 1-26 denoting a capital letter). I have extrapolated these points from a larger data set with 20,000 points. This is a tab-delimited text file. Note that a row in the data file represents a data point and a column represents an attribute (dimension). More information about the data set is given in the attached text file “letters-facts.names”.
Your data-preprocessing program “normalize” must take the original data file and transform each non-class coordinate $v_i$ (the value of a dimension $i$ other than the class dimension) of every data point into a real value $v'_i$ between 0 and 1, using the following min-max normalization formula:

$$v'_i = \frac{v_i - \min_i}{\max_i - \min_i}$$

where $\min_i$ and $\max_i$ are the minimum and maximum value, respectively, in the dimension $i$. Note that each normalized value is computed with respect to the minimum and maximum values in the corresponding column (dimension) rather than the row (data point).
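The column-wise normalization described above can be sketched as follows. This is only an illustration, not a required design: the function name `minMaxNormalize` is hypothetical, the class dimension is assumed to be the last column, and mapping a constant column (where the maximum equals the minimum) to 0 is an assumption, since the text does not cover that case.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Min-max normalize every column except the last one (assumed to be the class column).
// Assumption: a constant column (max == min) is mapped to 0.0, because the
// assignment text does not say how to treat a zero range.
std::vector<std::vector<double>> minMaxNormalize(std::vector<std::vector<double>> data) {
    if (data.empty() || data[0].empty()) return data;
    const std::size_t dims = data[0].size() - 1;  // last column holds the class label
    for (std::size_t i = 0; i < dims; ++i) {
        // Find the minimum and maximum of column i over all data points.
        double lo = data[0][i];
        double hi = data[0][i];
        for (const auto& row : data) {
            lo = std::min(lo, row[i]);
            hi = std::max(hi, row[i]);
        }
        const double range = hi - lo;
        // Rescale column i into [0, 1].
        for (auto& row : data) {
            row[i] = (range == 0.0) ? 0.0 : (row[i] - lo) / range;
        }
    }
    return data;
}
```

The minimum and maximum are taken per column, so two passes over each column are needed: one to find the range and one to rescale.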
Given the original data set “letters.txt”, “normalize” should store normalized data (one point on a single line) in a text file “letters-norm.txt”. The program takes no user-defined parameters.
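Reading the tab-delimited input and writing one point per line might look like the sketch below. The helper names `readPoints` and `writePoints` are hypothetical, and writing the output as tab-delimited text is an assumption: the text only requires one point on a single line.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Read numeric rows separated by tabs (or any whitespace), one data point per line.
std::vector<std::vector<double>> readPoints(std::istream& in) {
    std::vector<std::vector<double>> points;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<double> row;
        double value;
        while (fields >> value) row.push_back(value);
        if (!row.empty()) points.push_back(row);  // skip blank lines
    }
    return points;
}

// Write one point per line; tab-delimited output is an assumption of this sketch.
void writePoints(std::ostream& out, const std::vector<std::vector<double>>& points) {
    for (const auto& row : points) {
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (i > 0) out << '\t';
            out << row[i];
        }
        out << '\n';
    }
}
```

Because both helpers take streams, they can be used with a `std::ifstream` opened on “letters.txt” and a `std::ofstream` opened on “letters-norm.txt” (both from `<fstream>`). Note that the default stream formatting prints about six significant digits.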
Your “cluster” program should take the normalized data set as input and perform the following version of the K-means algorithm:
Obtain K initial cluster centers (see below);
Repeat
    Assign every data point to its nearest cluster center (using Euclidean distance);
    Move each cluster center to the mean (average) of its assigned points;
Until no change (no point is re-assigned).
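The loop above could be sketched in C++ as shown below. This is one possible reading rather than a reference solution: the names `Point`, `squaredDistance`, and `kMeans` are hypothetical, the points are assumed to contain only the non-class coordinates, and two behaviors the text leaves open are chosen by assumption (ties go to the lowest-numbered center, and a center that ends up with no points stays where it is).

```cpp
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance. The square root is skipped because it does not
// change which center is nearest.
double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Runs K-means from the given initial centers and returns the cluster index
// of every point. Assumptions: ties go to the lowest-numbered center, and an
// empty cluster keeps its previous center.
std::vector<int> kMeans(const std::vector<Point>& points, std::vector<Point> centers) {
    std::vector<int> assignment(points.size(), -1);
    if (centers.empty()) return assignment;
    bool changed = true;
    while (changed) {
        changed = false;
        // Assignment step: attach each point to its nearest center.
        for (std::size_t p = 0; p < points.size(); ++p) {
            int best = 0;
            for (std::size_t c = 1; c < centers.size(); ++c) {
                if (squaredDistance(points[p], centers[c]) <
                    squaredDistance(points[p], centers[best])) {
                    best = static_cast<int>(c);
                }
            }
            if (assignment[p] != best) {
                assignment[p] = best;
                changed = true;
            }
        }
        // Update step: move each center to the mean of its assigned points.
        std::vector<Point> sums(centers.size(), Point(centers[0].size(), 0.0));
        std::vector<int> counts(centers.size(), 0);
        for (std::size_t p = 0; p < points.size(); ++p) {
            ++counts[assignment[p]];
            for (std::size_t i = 0; i < points[p].size(); ++i) {
                sums[assignment[p]][i] += points[p][i];
            }
        }
        for (std::size_t c = 0; c < centers.size(); ++c) {
            if (counts[c] == 0) continue;  // empty cluster: leave the center in place
            for (std::size_t i = 0; i < centers[c].size(); ++i) {
                centers[c][i] = sums[c][i] / counts[c];
            }
        }
    }
    return assignment;
}
```

The loop stops as soon as a full assignment pass re-assigns no point, which matches the "Until no change" condition.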
Note that the class dimension is ignored by the K-means algorithm. However, the class dimension is used to evaluate the quality of clustering (see below). For the purposes of this program, the K initial cluster centers are the points in the data set whose order numbers are indicated in the specified input file. The input file format is discussed later in this assignment.
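Picking the initial centers by order number while ignoring the class dimension might be sketched like this. The function name `initialCenters` is hypothetical, the class dimension is assumed to be the last column of every row, and throwing an exception on an invalid order number is an assumption about error handling.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copies the points named by 1-based order numbers, dropping the last (class)
// column so that the centers contain only the coordinates K-means works on.
std::vector<std::vector<double>> initialCenters(
        const std::vector<std::vector<double>>& data,
        const std::vector<std::size_t>& orderNumbers) {
    std::vector<std::vector<double>> centers;
    for (std::size_t n : orderNumbers) {
        if (n < 1 || n > data.size()) {
            throw std::out_of_range("order number outside 1..N");
        }
        const std::vector<double>& row = data[n - 1];  // order numbers start at 1
        if (row.empty()) {
            throw std::invalid_argument("data point has no coordinates");
        }
        centers.emplace_back(row.begin(), row.end() - 1);  // omit the class column
    }
    return centers;
}
```

A caller could catch these exceptions and write a message of its own choosing to the output file.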
To evaluate the quality of clustering, we will use a measure called purity of clusters. After clustering, your program should compute the purity of each cluster and the overall purity of clusters. The purity of each cluster $C_j$ is defined as $P_j = L_j / N_j$, where $L_j$ is the number of points belonging to the largest class in $C_j$ (most points in $C_j$ are of this class) and $N_j$ is the total number of points in $C_j$. The overall purity of clusters is the average purity of clusters. The purity of a cluster and the overall purity of clusters must be real numbers between 0 and 1.
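The purity computation can be sketched as below. The names `clusterPurities` and `overallPurity` are hypothetical, and assigning purity 0 to an empty cluster is an assumption, since the definition divides by the cluster size. Following the text, the overall purity is the plain (unweighted) average of the cluster purities, not the size-weighted variant used in some other sources.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// Per-cluster purity: largest class count in the cluster divided by cluster size.
// Assumption: an empty cluster gets purity 0.0, which the text does not define.
std::vector<double> clusterPurities(const std::vector<int>& assignment,
                                    const std::vector<int>& classes, int k) {
    // counts[j][label] = number of points of class `label` in cluster j
    std::vector<std::map<int, int>> counts(k);
    std::vector<int> sizes(k, 0);
    for (std::size_t p = 0; p < assignment.size(); ++p) {
        ++counts[assignment[p]][classes[p]];
        ++sizes[assignment[p]];
    }
    std::vector<double> purities(k, 0.0);
    for (int j = 0; j < k; ++j) {
        int largest = 0;
        for (const auto& entry : counts[j]) largest = std::max(largest, entry.second);
        if (sizes[j] > 0) purities[j] = static_cast<double>(largest) / sizes[j];
    }
    return purities;
}

// Overall purity as defined in the text: the average of the cluster purities.
double overallPurity(const std::vector<double>& purities) {
    if (purities.empty()) return 0.0;
    double sum = 0.0;
    for (double p : purities) sum += p;
    return sum / purities.size();
}
```

For example, a cluster holding two points of class 1 and one point of class 2 has purity 2/3, and a cluster holding only points of class 3 has purity 1.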
A purity of 1 indicates the best possible quality of clusters. However, your program will not necessarily achieve an overall purity of 1; on many data sets, the highest purity is not achievable. The purity also varies depending on the selected number of clusters K.
The overall purity of clusters can be used as the measure of the clustering quality only if the data set contains the class information. There are ways to evaluate the quality of clustering on data sets without this information (see equality 7.18 on page 402 in the textbook). However, estimating the quality of clustering is generally a difficult problem.
Through a simple interface, the clustering program should first ask the user to specify the names of the input file (e.g., “input.txt”) and the output file (e.g., “output.txt”). The program reads its input parameters from the input file and writes all of its output to the specified output file until it terminates. The initial cluster centers are the points in the data set whose order numbers (ordering begins with 1) are indicated in the input file. After performing all clustering runs indicated in the input file, your program can terminate.
I have attached an “input.txt” file, which I will use in testing your program. You can also find attached a file called “inputfile-format.txt” that explains the meaning of individual fields and lines in an input file. Please study this format carefully.
The format of the output file is given in the attached “outputfile-format.txt”. Please follow this format exactly. All messages in response to the errors detected by your program should also be directed to this file. The format of error messages is up to you.
In addition to this assignment text (“Project1Text.pdf”), the “Project1Files.zip” file contains: