
DSC 433/533 – Homework 9
Reading
“Data Mining Techniques” by Berry and Linoff (2nd edition): chapter 11 (pages 349-381). (Note:
skip chapter 10.)
Exercises
Hand in answers to the following questions at the beginning of the FINAL class of week 10. The
questions are based on the IMRB Bath Soap Case (see separate document on the handouts page
of the course website) and the Excel dataset IMRB.xls (available on the data page of the course
website).
1. Run a hierarchical clustering analysis of the 23 variables in columns B to X of the spreadsheet.
In particular:
   - Select XLMiner > Data Reduction and Exploration > Hierarchical Clustering.
   - Make sure “IMRB” is selected as the Worksheet, “Variable names in the first row” is
     checked, and “# Rows in data” is 600. Also, “Data type” should be set to “Raw data.”
   - Select the variables “vol_per_tran” through “prop_other” and click “Next.”
   - Check “Normalize input data,” make sure “Euclidean distance” is selected, select “Ward’s
     method” for the “Clustering method,” and click “Next.”
   - Leave “Draw dendogram” selected, but uncheck “Show cluster membership,” and click
     “Finish.”
You should obtain the following results on the “HC_Output1” worksheet:
Stage  Cluster 1  Cluster 2  Distance      Resulting #   % Increase in
                                           of clusters   distance
1      34         239        0.022031
2      423        497        0.06356
...
596    1          20         609.618371    4
597    5          7          683.815895    3             12%
598    1          5          752.144855    2             10%
599    1          4          1777.175041   1             136%
The final two columns were added manually. For example, going from 4 to 3 clusters, the
distance (between the two clusters joined at each stage) increases from 609.618371 to
683.815895, i.e., by (683.815895/609.618371) – 1 ≈ 12%. There is a similar proportional
increase going from 3 to 2 clusters (10%), but then a much larger increase going from 2
clusters to 1 cluster (136%).
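The percentages in the last column can be checked with a few lines of code. The sketch below uses Python, which is an assumption (the assignment itself only requires Excel and XLMiner); the names `merge_distances` and `pct_increase` are illustrative, and the distances are copied from the last four stages of the output above.

```python
# Merge distances from stages 596-599, keyed by the resulting number of clusters.
merge_distances = {4: 609.618371, 3: 683.815895, 2: 752.144855, 1: 1777.175041}

def pct_increase(previous, current):
    """Proportional increase from one merge distance to the next, in percent."""
    return (current / previous - 1) * 100

# Compare each stage with the stage before it (k + 1 clusters -> k clusters).
increases = {}
for k in (3, 2, 1):
    increases[k] = pct_increase(merge_distances[k + 1], merge_distances[k])
    print(f"{k + 1} -> {k} clusters: {increases[k]:.0f}% increase in distance")
```

The same calculation can be done in the spreadsheet by dividing each distance by the one in the row above it and subtracting 1.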
To hand in: Why does this indicate that it will probably be useful to cluster this dataset into
at least 2 clusters? [Hint: the “distance” gives an indication of the difference between the two
clusters that are being joined at each stage – the larger the distance the more different the
clusters being joined.]
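For readers who want to see what the procedure in question 1 does behind the scenes, the following is a minimal sketch in plain Python, not a reproduction of XLMiner. It makes several assumptions: the data are a small hypothetical set of points (not the IMRB dataset), "normalize" is taken to mean rescaling each variable to mean 0 and standard deviation 1, and the reported merge cost is the increase in within-cluster sum of squares that defines Ward's method, which may be on a different scale from XLMiner's "Distance" column. All function and variable names are illustrative.

```python
from statistics import mean, pstdev

def normalize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * sq_dist

def ward_clustering(rows):
    """Naive agglomerative clustering; returns one (id_1, id_2, cost) per stage."""
    clusters = {i: {"n": 1, "centroid": list(r)} for i, r in enumerate(rows)}
    stages = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Pick the pair of clusters whose merge is cheapest.
        cost, i, j = min(
            (ward_cost(clusters[i], clusters[j]), i, j)
            for pos, i in enumerate(ids)
            for j in ids[pos + 1:]
        )
        a, b = clusters[i], clusters.pop(j)
        total = a["n"] + b["n"]
        a["centroid"] = [
            (a["n"] * x + b["n"] * y) / total
            for x, y in zip(a["centroid"], b["centroid"])
        ]
        a["n"] = total  # the merged cluster keeps the lower id
        stages.append((i, j, cost))
    return stages

# Hypothetical toy data (NOT the IMRB dataset): two 3x3 grids of points far apart.
points = [(x, y) for x in range(3) for y in range(3)]
points += [(x + 20, y + 20) for x in range(3) for y in range(3)]

stages = ward_clustering(normalize(points))
for stage, (i, j, cost) in enumerate(stages[-3:], start=len(stages) - 2):
    print(f"stage {stage}: join {i} and {j}, cost {cost:.4f}")
```

In this toy example the final stage joins the two well-separated groups, so its cost is far larger than the cost at any earlier stage. The search over all pairs at every stage is slow and is only meant for illustration on small data.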
2. The dendrogram on the “HC_Dendogram1” worksheet provides a graphical representation of
the different stages of the hierarchical clustering. For example, at a distance of 752 (scale on
the vertical axis) the cluster consisting of sub-clusters 1, 2, 8, 3, 10, 19, 13, 26, 24, 16, 25, 12,
29, 27, 22, and 23 joined the cluster consisting of sub-clusters 5, 8, 30, 20, 28, 9, 11, 6, 21, 15,
and 17 to form a giant cluster. At the top of the tree you can see that the final join occurred at
a distance of 1777 when the smaller cluster represented at the right of the dendrogram
(consisting of sub-clusters 4, 7, and 14) joined the giant cluster (consisting of everyone else).
To hand in: How many individual cases are in this smaller cluster (consisting of sub-clusters
4, 7, and 14)? [Hint: the cases in each sub-cluster are listed just below the dendrogram.]