Homework 9 Questions with Solutions - Information Analysis | DSC 433 | Assignments Humanities

DSC 433/533 – Homework 9

Reading

“Data Mining Techniques” by Berry and Linoff (2nd edition): chapter 11 (pages 349-381). (Note: skip chapter 10.)

Exercises

Hand in answers to the following questions at the beginning of the FINAL class of week 10. The questions are based on the IMRB Bath Soap Case (see separate document on the handouts page of the course website) and the Excel dataset IMRB.xls (available on the data page of the course website).

1. Run a hierarchical clustering analysis of the 23 variables in columns B to X of the spreadsheet. In particular:

• Select XLMiner > Data Reduction and Exploration > Hierarchical Clustering.
• Make sure “IMRB” is selected as the Worksheet, “Variable names in the first row” is checked, and “# Rows in data” is 600. Also, “Data type” should be set to “Raw data.”
• Select the variables “vol_per_tran” through “prop_other” and click “Next.”
• Check “Normalize input data,” make sure “Euclidean distance” is selected, select “Ward’s method” for the “Clustering method,” and click “Next.”
• Leave “Draw dendogram” selected, but uncheck “Show cluster membership,” and click “Finish.”

You should obtain the following results on the “HC_Output1” worksheet:

Stage  Cluster 1  Cluster 2  Distance      Resulting # of clusters  % Increase in distance
1      34         239        0.022031
2      423        497        0.06356
...
596    1          20         609.618371    4
597    5          7          683.815895    3                        12%
598    1          5          752.144855    2                        10%
599    1          4          1777.175041   1                        136%

The final two columns were added manually. For example, going from 4 to 3 clusters, the distance (between the two clusters joined at each stage) increases from 609.618371 to 683.815895, i.e., (683.815895/609.618371) – 1 = 12%. There is a similar proportional increase going from 3 to 2 clusters (10%), but then a much larger increase going from 2 clusters to 1 cluster (136%).

To hand in: Why does this indicate that it will probably be useful to cluster this dataset into at least 2 clusters? [Hint: the “distance” gives an indication of the difference between the two clusters that are being joined at each stage – the larger the distance the more different the clusters being joined.]
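The “% Increase in distance” column in question 1 can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not part of the XLMiner workflow; the names `merge_distance` and `pct_increase` are illustrative, and the distances are copied from the last four stages of the table above.

```python
# Merge distances for the last four stages, copied from the HC_Output1 table.
# Keys are the number of clusters remaining after each merge.
merge_distance = {
    4: 609.618371,
    3: 683.815895,
    2: 752.144855,
    1: 1777.175041,
}


def pct_increase(previous, current):
    """Proportional increase in merge distance, expressed as a percentage."""
    return (current / previous - 1) * 100


# Compare each merge with the one immediately before it.
increases = {
    k: pct_increase(merge_distance[k + 1], merge_distance[k])
    for k in (3, 2, 1)
}

for k, pct in increases.items():
    print(f"{k + 1} -> {k} clusters: {pct:.0f}% increase")
```

Rounded to whole percentages, the three values match the 12%, 10%, and 136% entered manually in the table.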

2. The dendrogram on the “HC_Dendogram1” worksheet provides a graphical representation of the different stages of the hierarchical clustering. For example, at a distance of 752 (scale on the vertical axis) the cluster consisting of sub-clusters 1, 2, 8, 3, 10, 19, 13, 26, 24, 16, 25, 12, 29, 27, 22, and 23 joined the cluster consisting of sub-clusters 5, 8, 30, 20, 28, 9, 11, 6, 21, 15, and 17 to form a giant cluster. At the top of the tree you can see that the final join occurred at a distance of 1777, when the smaller cluster shown at the right of the dendrogram (consisting of sub-clusters 4, 7, and 14) joined the giant cluster (consisting of everyone else).

To hand in: How many individual cases are in this smaller cluster (consisting of sub-clusters 4, 7, and 14)? [Hint: the cases in each sub-cluster are listed just below the dendrogram.]


3. The results above suggest that at least 2 useful clusters could be present, while the case notes: “It is likely that marketing efforts would support 2-5 different promotional approaches.” For a full analysis, each of the 2/3/4/5-cluster solutions could be investigated, but for this assignment we’ll just investigate a 3-cluster solution.

The hierarchical approach just used can be useful for identifying possible values for the number (k) of useful clusters. However, it may be sub-optimal for actually assigning cases to clusters, since once two cases have been joined together in a cluster, they remain joined permanently, even if they might fit better with different clusters that form at a later stage of the clustering process. A better approach for assigning cases to clusters is to use k-means clustering, which iteratively assigns cases to a fixed set of k clusters until each case is assigned to the most suitable cluster (i.e., the one containing the cases most similar to it):

• Select XLMiner > Data Reduction and Exploration > K-Means Clustering.
• Make sure “IMRB” is selected as the Worksheet, “First row contains headers” is checked, and “# Rows in data” is 600.
• Select the variables “vol_per_tran” through “prop_other” to be the “Input variables” and click “Next.”
• Check “Normalize input data,” type “3” for the “# Clusters,” leave “# Iterations” at 10, leave “Fixed Start” selected, and click “Next.” [“Random starts” gives odd results in XLMiner, so don’t use this option.]
• Leave “Show data summary” and “Show distances from each cluster center” selected, and click “Finish.”

You should obtain the following data summary on the “KM_Output1” worksheet:

Cluster    #Obs  Average distance in cluster
Cluster-1  287   3.
Cluster-2  241   4.
Cluster-3  72    2.
Overall    600   3.

To turn in: What is the “average” number of brands purchased by households in each of these clusters? [Hint: this information is shown in the part of the output labeled “cluster centers” – it’s not quite the same as the arithmetic average of the variable for each cluster member, but instead represents a “geometric” center of the cluster with respect to the variable.]
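The iterative assignment that k-means performs can be illustrated with a small Python sketch. This is an assumption-laden illustration, not XLMiner's implementation: the function name `kmeans` is hypothetical, the data are a made-up two-dimensional toy set rather than the IMRB variables, the first k points are used as starting centers (which may not be what XLMiner's “Fixed Start” option does), and input normalization is omitted.

```python
import math


def kmeans(points, k, iterations=10):
    """Minimal k-means sketch; the first k points seed the cluster centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each case goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical toy data (not the IMRB data): two obvious groups in 2-D.
toy = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centers = kmeans(toy, k=2)
print(labels)
print(centers)
```

Unlike the hierarchical procedure, a case here is free to change cluster on any iteration, which is why k-means is preferred for the final assignment.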
4. To profile the clusters (as we began to do in the last question by looking at the cluster centers for one of the variables), it helps to put the cluster centers on a comparable scale using, for example, a calculation based on the “one-sample t-statistic.” For each variable, calculate (value - mean)/(sd/√(cluster-size)), where “value” is the cluster center, “mean” is the average over the whole sample, “sd” is the standard deviation over the whole sample, and “cluster-size” is 287 for cluster-1, 241 for cluster-2, and 72 for cluster-3. For example, the calculations for the “vol_per_tran” variable are as follows:

     C            D
27   Cluster      vol_per_tran
28   Cluster-1    493.
29   Cluster-2    294.
30   Cluster-3    550.
31   Cluster      vol_per_tran
32   Cluster-1    5.36    =(D28-D$35)/(D$36/SQRT(287))
33   Cluster-2    -7.50   =(D29-D$35)/(D$36/SQRT(241))
34   Cluster-3    4.61    =(D30-D$35)/(D$36/SQRT(72))
35   sample mean  415.05  =AVERAGE(IMRB!B2:B601)
36   sample sd    248.76  =STDEV(IMRB!B2:B601)
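The same standardization can be sketched outside Excel. The Python below is a minimal illustration under stated assumptions: the function name `t_like_score` is hypothetical, and because the cluster centers are shown only to their integer part in the worksheet excerpt, the sketch uses 493.0, 294.0, and 550.0 as stand-ins. Its results therefore only approximate the worksheet values of 5.36, -7.50, and 4.61.

```python
import math

# Whole-sample statistics for vol_per_tran, as reported in the worksheet.
SAMPLE_MEAN = 415.05
SAMPLE_SD = 248.76

# Cluster sizes come from the k-means data summary. The centers are ASSUMED:
# only their integer parts are visible in the worksheet excerpt.
clusters = {
    "Cluster-1": {"size": 287, "center": 493.0},
    "Cluster-2": {"size": 241, "center": 294.0},
    "Cluster-3": {"size": 72, "center": 550.0},
}


def t_like_score(center, size, mean=SAMPLE_MEAN, sd=SAMPLE_SD):
    """(value - mean) / (sd / sqrt(cluster-size)), mirroring the Excel formula."""
    return (center - mean) / (sd / math.sqrt(size))


scores = {
    name: t_like_score(info["center"], info["size"])
    for name, info in clusters.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A large positive score marks a cluster whose center lies well above the sample mean for that variable, and a large negative score one that lies well below it, which makes variables measured on different scales easier to compare.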

To turn in: Briefly outline how this information could be combined with the purchasing behavior characteristics (in question 4) to guide the development of advertising and promotional campaigns.
