DSC 433/533 – Homework 9
Reading
“Data Mining Techniques” by Berry and Linoff (2nd edition): chapter 11 (pages 349-381). (Note: skip chapter 10.)
Exercises
Hand in answers to the following questions at the beginning of the FINAL class of week 10. The
questions are based on the IMRB Bath Soap Case (see separate document on the handouts page
of the course website) and the Excel dataset IMRB.xls (available on the data page of the course
website).
1. Run a hierarchical clustering analysis of the 23 variables in columns B to X of the spreadsheet
   (a Python sketch of the same analysis appears at the end of this question). In particular:
   - Select XLMiner > Data Reduction and Exploration > Hierarchical Clustering.
   - Make sure “IMRB” is selected as the Worksheet, “Variable names in the first row” is checked,
     and “# Rows in data” is 600. Also, “Data type” should be set to “Raw data.”
   - Select the variables “vol_per_tran” through “prop_other” and click “Next”.
   - Check “Normalize input data,” make sure “Euclidean distance” is selected, select “Ward’s
     method” for the “Clustering method,” and click “Next.”
   - Leave “Draw dendogram” selected, but uncheck “Show cluster membership,” and click “Finish.”
You should obtain the following results on the “HC_Output1” worksheet:
Stage   Cluster 1   Cluster 2   Distance       Resulting # of clusters   % Increase in distance
1       34          239         0.022031
2       423         497         0.06356
...
596     1           20          609.618371     4
597     5           7           683.815895     3                         12%
598     1           5           752.144855     2                         10%
599     1           4           1777.175041    1                         136%
The final two columns were added manually. For example, going from 4 to 3 clusters, the
distance (of the two clusters joined at each stage) increases from 609.618371 to 683.815895,
i.e., (683.815895/609.618371) – 1 = 12%. There is a similar proportional increase going from
3 to 2 clusters (10%), but then a much larger increase going from 2 clusters to 1 cluster
(136%).
To hand in: Why does this indicate that it will probably be useful to cluster this dataset into
at least 2 clusters? [Hint: the “distance” gives an indication of the difference between the two
clusters that are being joined at each stage – the larger the distance the more different the
clusters being joined.]
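For reference, here is a minimal Python sketch of the same analysis. It rests on assumptions not in the assignment: the workbook is available as IMRB.xls in the working directory, and pandas/scipy are installed. Scipy's Ward linkage should reproduce the merge order, though its reported distances may be scaled differently from XLMiner's.

    # Minimal sketch of the hierarchical clustering run described above.
    import pandas as pd
    from scipy.cluster.hierarchy import linkage
    from scipy.stats import zscore

    df = pd.read_excel("IMRB.xls", sheet_name="IMRB")     # 600 cases
    X = zscore(df.loc[:, "vol_per_tran":"prop_other"])    # normalize the 23 inputs

    # Euclidean distance with Ward's method, matching the dialog settings.
    Z = linkage(X, method="ward", metric="euclidean")

    # Percentage increase in merge distance over the final stages, mirroring
    # the column added manually to the HC_Output1 table.
    d = Z[-4:, 2]                      # merge distances at stages 596-599
    for k, ratio in zip((3, 2, 1), d[1:] / d[:-1]):
        print(f"{k + 1} -> {k} clusters: {ratio - 1:.0%} increase")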
2. The dendrogram on the “HC_Dendogram1” worksheet provides a graphical representation of
the different stages of the hierarchical clustering (a plotting sketch follows this question).
For example, at a distance of 752 (scale on the vertical axis) the cluster consisting of
sub-clusters 1, 2, 8, 3, 10, 19, 13, 26, 24, 16, 25, 12, 29, 27, 22, and 23 joined the cluster
consisting of sub-clusters 5, 8, 30, 20, 28, 9, 11, 6, 21, 15, and 17 to form a giant cluster.
At the top of the tree you can see that the final join occurred at a distance of 1777 when the
smaller cluster represented at the right of the dendrogram (consisting of sub-clusters 4, 7,
and 14) joined the giant cluster (consisting of everyone else).
To hand in: How many individual cases are in this smaller cluster (consisting of sub-clusters
4, 7, and 14)? [Hint: the cases in each sub-cluster are listed just below the dendrogram.]
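Continuing the sketch above, scipy can draw a comparable 30-leaf dendrogram and tally the size of each sub-cluster. The choice of 30 leaves matches the number of sub-clusters named above; note that scipy's leaf labels will generally not coincide with XLMiner's sub-cluster numbering.

    # Sketch continued: truncated dendrogram plus sub-cluster sizes.
    import collections
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, fcluster

    dendrogram(Z, truncate_mode="lastp", p=30)   # one leaf per sub-cluster
    plt.ylabel("distance")
    plt.show()

    # Cut the tree into 30 sub-clusters and count the cases in each; the
    # "to hand in" answer is the combined size of the three sub-clusters
    # that XLMiner labels 4, 7, and 14.
    labels = fcluster(Z, t=30, criterion="maxclust")
    print(collections.Counter(labels))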
3. The results above suggest that at least 2 useful clusters could be present, while the case
notes: “It is likely that marketing efforts would support 2-5 different promotional approaches.”
For a full analysis, each of the 2/3/4/5-cluster solutions could be investigated, but for this
assignment we’ll just investigate a 3-cluster solution. The hierarchical approach just used can
be useful for identifying possible values for the number (k) of useful clusters. However, it may
be sub-optimal for actually assigning cases to clusters, since once two cases have been joined
together in a cluster, they remain forever joined, even if the cases might fit better with
different clusters that form at a later stage of the clustering process. A better approach for
assigning cases to clusters is k-means clustering, which iteratively assigns cases to a fixed
set of k clusters until each case is assigned to the cluster that is most suitable (i.e.,
contains other cases that are the most similar to it). A scikit-learn sketch of this step
follows this question.
   - Select XLMiner > Data Reduction and Exploration > K-Means Clustering.
   - Make sure “IMRB” is selected as the Worksheet, “First row contains headers” is checked,
     and “# Rows in data” is 600.
   - Select the variables “vol_per_tran” through “prop_other” to be the “Input variables” and
     click “Next”.
   - Check “Normalize input data,” type “3” for the “# Clusters,” leave “# Iterations” at 10,
     leave “Fixed Start” selected, and click “Next.” [“Random starts” gives odd results in
     XLMiner, so don’t use this option.]
   - Leave “Show data summary” and “Show distances from each cluster center” selected, and
     click “Finish.”
You should obtain the following data summary on the “KM_Output1” worksheet:

Cluster     #Obs   Average distance in cluster
Cluster-1   287    3.
Cluster-2   241    4.
Cluster-3   72     2.
Overall     600    3.

To turn in: What is the “average” number of brands purchased by households in each of these
clusters? [Hint: this information is shown in the part of the output labeled “cluster centers”
– it’s not quite the same as the arithmetic average of the variable for each cluster member,
but instead represents a “geometric” center of the cluster with respect to the variable.]
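For comparison, here is a hedged scikit-learn sketch of the same step, assuming the `df` loaded in the earlier snippet. Cluster numbering and exact figures will differ from XLMiner's, which uses its own fixed starting points.

    # Sketch: 3-cluster k-means on the normalized inputs, then the centers
    # mapped back to the original units of each variable.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    inputs = df.loc[:, "vol_per_tran":"prop_other"]
    scaler = StandardScaler().fit(inputs)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(inputs))

    print(pd.Series(km.labels_).value_counts())   # cluster sizes (cf. #Obs above)

    # Analogue of XLMiner's "cluster centers" output; the answer to the
    # "to turn in" question is read from the column holding the brand count.
    centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                           columns=inputs.columns)
    print(centers.round(2))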
4. To profile the clusters (as we began to do in the last question by looking at the cluster
centers for one of the variables), it helps to put the cluster centers on a comparable scale
using, for example, a calculation based on the “one-sample t-statistic.” For each variable,
calculate (value – mean)/(sd/√(cluster size)), where “value” is the cluster center, “mean” is
the average over the whole sample, “sd” is the standard deviation over the whole sample, and
“cluster size” is 287 for cluster-1, 241 for cluster-2, and 72 for cluster-3. For example, the
calculations for the “vol_per_tran” variable are as follows (a pandas sketch of the same
calculation appears below):

     C             D
27   Cluster       vol_per_tran
28   Cluster-1     493.
29   Cluster-2     294.
30   Cluster-3     550.
31   Cluster       vol_per_tran
32   Cluster-1     5.36     =(D28-D$35)/(D$36/SQRT(287))
33   Cluster-2     -7.50    =(D29-D$35)/(D$36/SQRT(241))
34   Cluster-3     4.61     =(D30-D$35)/(D$36/SQRT(72))
35   sample mean   415.05   =AVERAGE(IMRB!B2:B601)
36   sample sd     248.76   =STDEV(IMRB!B2:B601)
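The same calculation can be written in a few lines of pandas. This sketch assumes the `inputs` and `centers` objects from the previous snippet, with the rows of `centers` taken in XLMiner's Cluster-1/2/3 order; with scikit-learn output you would instead pair each row with its own cluster size. Like Excel's STDEV, pandas' std() uses the sample standard deviation.

    # Sketch: (value - mean) / (sd / sqrt(cluster size)) for all variables.
    import numpy as np

    sizes = np.array([287, 241, 72])          # cluster sizes from KM_Output1

    profile = ((centers - inputs.mean())      # deviation from sample mean
               .div(inputs.std(), axis=1)     # ... divided by sample sd
               .mul(np.sqrt(sizes), axis=0))  # ... scaled by sqrt(n) per row
    print(profile["vol_per_tran"])            # cf. 5.36, -7.50, 4.61 above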

To turn in: Briefly outline how this information could be combined with the purchasing behavior characteristics (in question 4) to guide the development of advertising and promotional campaigns.