
Data Mining: Association Analysis - Computing Support, Confidence & Measures, Study notes of Computer Science

An explanation of association analysis in data mining, focusing on computing support, confidence, and various measures for association rules. It includes examples of computing support for itemsets and confidence for association rules using the apriori algorithm.

Typology: Study notes

2020/2021

Uploaded on 01/03/2024

anjali-patel-10 🇮🇳



Association Analysis: Basic Concepts and Algorithms

  1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.

(a) A rule that has high support and high confidence.
Answer: Milk −→ Bread. Such an obvious rule tends to be uninteresting.
(b) A rule that has reasonably high support but low confidence.
Answer: Milk −→ Tuna. While the sales of tuna and milk may be higher than the support threshold, not all transactions that contain milk also contain tuna. Such a low-confidence rule tends to be uninteresting.
(c) A rule that has low support and low confidence.
Answer: Cooking oil −→ Laundry detergent. Such a low-confidence rule tends to be uninteresting.
(d) A rule that has low support and high confidence.
Answer: Vodka −→ Caviar. Such a rule tends to be interesting.

  2. Consider the data set shown in Table 6.1.

(a) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each transaction ID as a market basket. Answer:

72 Chapter 6 Association Analysis

Table 6.1. Example of market basket transactions.

Customer ID   Transaction ID   Items Bought
1             0001             {a, d, e}
1             0024             {a, b, c, e}
2             0012             {a, b, d, e}
2             0031             {a, c, d, e}
3             0015             {b, c, e}
3             0022             {b, d, e}
4             0029             {c, d}
4             0040             {a, b, c}
5             0033             {a, d, e}
5             0038             {a, b, e}

s({e}) = 8/10 = 0.8

s({b, d}) = 2/10 = 0.2

s({b, d, e}) = 2/10 = 0.2

(b) Use the results in part (a) to compute the confidence for the association rules {b, d} −→ {e} and {e} −→ {b, d}. Is confidence a symmetric measure?
Answer:

c(bd −→ e) = 0.2/0.2 = 100%

c(e −→ bd) = 0.2/0.8 = 25%

No, confidence is not a symmetric measure.
(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).
Answer:

s({e}) = 4/5 = 0.8

s({b, d}) = 5/5 = 1

s({b, d, e}) = 4/5 = 0.8
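These counts can be checked with a short script. This is a minimal sketch, not part of the original solution; the names `transactions`, `customers`, `support`, and `confidence` are my own, and the data is transcribed from Table 6.1 (per-transaction baskets, and per-customer unions for the binary treatment in part (c)).

```python
# Support/confidence check for Table 6.1, per transaction and per customer.
transactions = [
    {"a", "d", "e"}, {"a", "b", "c", "e"}, {"a", "b", "d", "e"},
    {"a", "c", "d", "e"}, {"b", "c", "e"}, {"b", "d", "e"},
    {"c", "d"}, {"a", "b", "c"}, {"a", "d", "e"}, {"a", "b", "e"},
]
# Each customer owns two consecutive baskets; union them for the binary view.
customers = [transactions[i] | transactions[i + 1] for i in range(0, 10, 2)]

def support(itemset, baskets):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """c(lhs -> rhs) = s(lhs U rhs) / s(lhs)."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)
```

Running `support({"b", "d"}, customers)`, for instance, reproduces the part (c) figure, since every customer buys both b and d in at least one basket.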


(d) Transitivity: Suppose the confidence of the rules A −→ B and B −→ C are larger than some threshold, minconf. Is it possible that A −→ C has a confidence less than minconf?
Answer: Yes, it depends on the supports of A, B, and C. For example:

s(A) = 90%, s(B) = 70%, s(C) = 60%
s(A,B) = 60%, s(B,C) = 50%, s(A,C) = 20%

Let minconf = 50%. Therefore:

c(A −→ B) = 66% > minconf
c(B −→ C) = 71% > minconf

But c(A −→ C) = 22% < minconf.
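The counterexample above can be verified numerically. A minimal sketch, assuming only the support values quoted in the answer; `conf` is a helper name of my own.

```python
# Verify the transitivity counterexample: c(A->B) and c(B->C) exceed
# minconf = 0.5, yet c(A->C) falls below it.
def conf(s_xy, s_x):
    """Confidence of X -> Y given s(X,Y) and s(X)."""
    return s_xy / s_x

c_ab = conf(0.60, 0.90)  # c(A -> B) ~ 0.67
c_bc = conf(0.50, 0.70)  # c(B -> C) ~ 0.71
c_ac = conf(0.20, 0.90)  # c(A -> C) ~ 0.22
```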

  3. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).

Example: Support, s(X) = σ(X)/|T|, is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.

(a) A characteristic rule is a rule of the form {p} −→ {q1, q2, ..., qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:

ζ({p1, p2, ..., pk}) = min[ c({p1} −→ {p2, p3, ..., pk}), ..., c({pk} −→ {p1, p2, ..., pk−1}) ]

Is ζ monotone, anti-monotone, or non-monotone? Answer: ζ is an anti-monotone measure because

ζ({A1, A2, ..., Ak}) ≥ ζ({A1, A2, ..., Ak, Ak+1})    (6.2)

For example, we can compare the values of ζ for {A, B} and {A, B, C}.

ζ({A, B}) = min[ c(A −→ B), c(B −→ A) ]
          = min[ s(A, B)/s(A), s(A, B)/s(B) ]
          = s(A, B) / max(s(A), s(B))

ζ({A, B, C}) = min[ c(A −→ BC), c(B −→ AC), c(C −→ AB) ]
             = min[ s(A, B, C)/s(A), s(A, B, C)/s(B), s(A, B, C)/s(C) ]
             = s(A, B, C) / max(s(A), s(B), s(C))

Since s(A, B, C) ≤ s(A, B) and max(s(A), s(B), s(C)) ≥ max(s(A), s(B)), it follows that ζ({A, B}) ≥ ζ({A, B, C}).
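The anti-monotone behavior of ζ can be spot-checked on the transactions of Table 6.1. A minimal sketch of my own (not from the text): `zeta` computes the minimum confidence over single-item antecedents, which equals s(X)/max over items p of s({p}).

```python
# Numeric check that zeta({b,d}) >= zeta({b,d,e}) on the Table 6.1 baskets.
transactions = [
    {"a", "d", "e"}, {"a", "b", "c", "e"}, {"a", "b", "d", "e"},
    {"a", "c", "d", "e"}, {"b", "c", "e"}, {"b", "d", "e"},
    {"c", "d"}, {"a", "b", "c"}, {"a", "d", "e"}, {"a", "b", "e"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def zeta(itemset):
    # Minimum over single-item antecedents p of s(itemset) / s({p}).
    items = set(itemset)
    return min(support(items) / support({p}) for p in items)
```

Here zeta({"b","d"}) = 0.2/0.6 and zeta({"b","d","e"}) = 0.2/0.8, so adding the item e does lower ζ, as the proof predicts.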

(b) A discriminant rule is a rule of the form {p1, p2, ..., pn} −→ {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:

η({p1, p2, ..., pk}) = min[ c({p2, p3, ..., pk} −→ {p1}), ..., c({p1, p2, ..., pk−1} −→ {pk}) ]

Is η monotone, anti-monotone, or non-monotone? Answer: η is non-monotone. We can show this by comparing η({A, B}) against η({A, B, C}).

η({A, B}) = min[ c(A −→ B), c(B −→ A) ]
          = min[ s(A, B)/s(A), s(A, B)/s(B) ]
          = s(A, B) / max(s(A), s(B))

η({A, B, C}) = min[ c(AB −→ C), c(AC −→ B), c(BC −→ A) ]
             = min[ s(A, B, C)/s(A, B), s(A, B, C)/s(A, C), s(A, B, C)/s(B, C) ]
             = s(A, B, C) / max(s(A, B), s(A, C), s(B, C))

Since s(A, B, C) ≤ s(A, B) and max(s(A, B), s(A, C), s(B, C)) ≤ max(s(A), s(B)), both the numerator and the denominator decrease, so η({A, B, C}) can be greater than or less than η({A, B}). Hence, the measure is non-monotone.
(c) Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.

Since s(A, B, C) ≤ s(A, B) and min(s(A, B), s(A, C), s(B, C)) ≤ min(s(A), s(B), s(C)) ≤ min(s(A), s(B)), η′({A, B, C}) can be greater than or less than η′({A, B}). Hence, the measure is non-monotone.

  4. Prove Equation 6.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size-k itemset selected for the left-hand side, count the number of ways to choose the remaining d − k items to form the right-hand side of the rule.)
Answer: Suppose there are d items. We first choose k of the items to form the left-hand side of the rule. There are (d choose k) ways of doing this. After selecting the items for the left-hand side, there are (d−k choose i) ways to choose the remaining items to form the right-hand side of the rule, where 1 ≤ i ≤ d − k. Therefore the total number of rules, R, is:

R = Σ_{k=1}^{d} [ (d choose k) × Σ_{i=1}^{d−k} (d−k choose i) ]
  = Σ_{k=1}^{d} (d choose k) [ 2^(d−k) − 1 ]
  = Σ_{k=1}^{d} (d choose k) 2^(d−k) − Σ_{k=1}^{d} (d choose k)
  = Σ_{k=1}^{d} (d choose k) 2^(d−k) − [ 2^d − 1 ],

where Σ_{i=1}^{n} (n choose i) = 2^n − 1.

Since

(1 + x)^d = Σ_{i=1}^{d} (d choose i) x^(d−i) + x^d,

substituting x = 2 leads to:

3^d = Σ_{i=1}^{d} (d choose i) 2^(d−i) + 2^d.

Therefore, the total number of rules is:

R = [ 3^d − 2^d ] − [ 2^d − 1 ] = 3^d − 2^(d+1) + 1.
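The closed form can be sanity-checked by brute-force enumeration. A minimal sketch of my own: count every rule X −→ Y with X and Y nonempty and disjoint over d items, and compare against 3^d − 2^(d+1) + 1.

```python
from itertools import combinations

# Brute-force count of association rules over d items vs. the closed form.
def count_rules(d):
    items = range(d)
    total = 0
    for k in range(1, d + 1):            # size of the left-hand side
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            # any nonempty subset of the remaining items can be the RHS
            total += 2 ** len(rest) - 1
    return total

def closed_form(d):
    return 3 ** d - 2 ** (d + 1) + 1
```

For d = 6 (as in the next exercise) both sides give 602.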


Table 6.2. Market basket transactions.

Transaction ID   Items Bought
1                {Milk, Beer, Diapers}
2                {Bread, Butter, Milk}
3                {Milk, Diapers, Cookies}
4                {Bread, Butter, Cookies}
5                {Beer, Cookies, Diapers}
6                {Milk, Diapers, Bread, Butter}
7                {Bread, Butter, Diapers}
8                {Beer, Diapers}
9                {Milk, Diapers, Bread, Butter}
10               {Beer, Cookies}

  5. Consider the market basket transactions shown in Table 6.2.

(a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
Answer: There are six items in the data set. Therefore the total number of rules is 3^6 − 2^7 + 1 = 602.
(b) What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
Answer: Because the longest transaction contains 4 items, the maximum size of a frequent itemset is 4.
(c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
Answer: (6 choose 3) = 20.

(d) Find an itemset (of size 2 or larger) that has the largest support.
Answer: {Bread, Butter}.
(e) Find a pair of items, a and b, such that the rules {a} −→ {b} and {b} −→ {a} have the same confidence.
Answer: (Beer, Cookies) or (Bread, Butter).
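Parts (d) and (e) can be checked directly against the Table 6.2 transactions. A minimal sketch of my own; note that {a} −→ {b} and {b} −→ {a} share the same numerator s(a, b), so their confidences are equal exactly when s(a) = s(b).

```python
from itertools import combinations

# Table 6.2 transactions.
transactions = [
    {"Milk", "Beer", "Diapers"}, {"Bread", "Butter", "Milk"},
    {"Milk", "Diapers", "Cookies"}, {"Bread", "Butter", "Cookies"},
    {"Beer", "Cookies", "Diapers"}, {"Milk", "Diapers", "Bread", "Butter"},
    {"Bread", "Butter", "Diapers"}, {"Beer", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"}, {"Beer", "Cookies"},
]

def support(itemset):
    """Raw support count of an itemset."""
    return sum(set(itemset) <= t for t in transactions)

items = sorted(set().union(*transactions))
# Part (d): the size-2 itemset with the largest support count.
best = max(combinations(items, 2), key=support)
```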

  6. Consider the following set of frequent 3-itemsets:

{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}.

Assume that there are only five items in the data set.

(a) List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.
Answer: {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}.
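The merge-and-prune steps can be sketched in a few lines. This is my own illustration under the usual lexicographic convention for Fk−1 × F1 merging (extend each frequent 3-itemset only with frequent items larger than its maximum, so each candidate is generated once); the names `generate_candidates` and `prune` are mine.

```python
from itertools import combinations

F3 = [frozenset(s) for s in
      [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4},
       {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}]]
F1 = {1, 2, 3, 4, 5}

def generate_candidates(fk, f1):
    # Extend each frequent k-itemset with every larger frequent item.
    cands = {frozenset(s | {i}) for s in fk for i in f1 if i > max(s)}
    return sorted(cands, key=sorted)

def prune(candidates, fk):
    # Keep a candidate only if every (k-1)-subset is frequent.
    fk = set(fk)
    return [c for c in candidates
            if all(frozenset(sub) in fk
                   for sub in combinations(sorted(c), len(c) - 1))]
```

Under these assumptions, generation yields five candidate 4-itemsets, and pruning leaves only {1, 2, 3, 4} and {1, 2, 3, 5}, since the other three contain an infrequent 3-subset such as {1, 4, 5} or {2, 4, 5}.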


  7. Label each node in the itemset lattice with the following letter(s):
  • N: If the itemset is not considered to be a candidate by the Apriori algorithm, either because it is not generated at all during the candidate generation step, or because it is generated during the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.
  • F: If the candidate itemset is found to be frequent by the Apriori algorithm.
  • I: If the candidate itemset is found to be infrequent after support counting.
Answer: The lattice structure is shown below.

[Itemset lattice over {A, B, C, D, E}, each node labeled F (frequent), I (infrequent after support counting), or N (not considered as a candidate).]

Figure 6.1. Solution.

(b) What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?
Answer: Percentage of frequent itemsets = 16/32 = 50.0% (including the null set).
(c) What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)
Answer:

[Hash tree over candidate 3-itemsets such as {1,2,5}, {4,5,8}, and {2,5,8}; items hash to the three branches by the groups {1,4,7}, {2,5,8}, and {3,6,9}, and the candidates are stored in leaf nodes L1, L2, ...]

Figure 6.2. An example of a hash tree structure.

Pruning ratio is the ratio of N to the total number of itemsets. Since the count of N = 11, the pruning ratio is 11/32 = 34.4%.
(d) What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?
Answer: False alarm rate is the ratio of I to the total number of itemsets. Since the count of I = 5, the false alarm rate is 5/32 = 15.6%.

  8. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 6.2.

(a) Given a transaction that contains items {1, 3, 4, 5, 8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
Answer: The leaf nodes visited are L1, L3, L5, L9, and L11.
(b) Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1, 3, 4, 5, 8}.
Answer: The candidates contained in the transaction are {1, 4, 5}, {1, 5, 8}, and {4, 5, 8}.

  9. Consider the following set of candidate 3-itemsets:

{1, 2, 3}, {1, 2, 6}, {1, 3, 4}, {2, 3, 4}, {2, 4, 5}, {3, 4, 6}, {4, 5, 6}

[Itemset lattice over {a, b, c, d, e}, from the null set down to abcde.]

Figure 6.4. An itemset lattice.

(b) How many leaf nodes are there in the candidate hash tree? How many internal nodes are there?
Answer: There are 5 leaf nodes and 4 internal nodes.
(c) Consider a transaction that contains the following items: {1, 2, 3, 5, 6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction?
Answer: The leaf nodes L1, L2, L3, and L4 will be checked against the transaction. The candidate itemsets contained in the transaction include {1, 2, 3} and {1, 2, 6}.

  11. Given the lattice structure shown in Figure 6.4 and the transactions given in Table 6.3, label each node with the following letter(s):
    • M if the node is a maximal frequent itemset,
    • C if it is a closed frequent itemset,
    • N if it is frequent but neither maximal nor closed, and
    • I if it is infrequent.

Assume that the support threshold is equal to 30%.


Answer: The lattice structure is shown below.

[Itemset lattice over {A, B, C, D, E}, each node labeled M (maximal frequent), C (closed frequent), N (frequent but neither), or I (infrequent).]

Figure 6.5. Solution for Exercise 11.

  12. The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules.

(a) Draw a contingency table for each of the following rules using the trans- actions shown in Table 6.4.

Rules: {b} −→ {c}, {a} −→ {d}, {b} −→ {d}, {e} −→ {c}, {c} −→ {a}.

Answer:

For {b} −→ {c}:
        c    c̄
  b     3    4
  b̄     2    1

For {a} −→ {d}:
        d    d̄
  a     4    1
  ā     5    0

For {b} −→ {d}:
        d    d̄
  b     6    1
  b̄     3    0

For {e} −→ {c}:
        c    c̄
  e     2    4
  ē     3    1

For {c} −→ {a}:
        a    ā
  c     2    3
  c̄     3    2

(b) Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.


Rules      IS       Rank
b −→ c     0.507    3
a −→ d     0.596    2
b −→ d     0.756    1
e −→ c     0.365    5
c −→ a     0.400    4

v. Klosgen(X −→ Y) = √P(X, Y) × (P(Y|X) − P(Y)), where P(Y|X) = P(X, Y)/P(X).
Answer:

Rules      Klosgen   Rank
b −→ c     −0.039    2
a −→ d     −0.063    4
b −→ d     −0.033    1
e −→ c     −0.075    5
c −→ a     −0.045    3

vi. Odds ratio(X −→ Y) = [P(X, Y) P(X̄, Ȳ)] / [P(X, Ȳ) P(X̄, Y)].
Answer:

Rules      Odds Ratio   Rank
b −→ c     0.375        2
a −→ d     0            4
b −→ d     0            4
e −→ c     0.167        3
c −→ a     0.444        1
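These measure values can be recomputed from the 2×2 contingency counts. A minimal sketch of my own, taking (f11, f10, f01, f00) for a rule X −→ Y, with IS = P(X,Y)/√(P(X)P(Y)), Klosgen = √P(X,Y)·(P(Y|X) − P(Y)), and odds ratio = f11·f00/(f10·f01).

```python
import math

# Recompute IS, Klosgen, and odds ratio from a rule's contingency counts.
def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n
    p_x = (f11 + f10) / n
    p_y = (f11 + f01) / n
    is_ = p_xy / math.sqrt(p_x * p_y)
    klosgen = math.sqrt(p_xy) * (p_xy / p_x - p_y)
    odds = (f11 * f00) / (f10 * f01) if f10 * f01 else float("inf")
    return is_, klosgen, odds
```

For example, `measures(3, 4, 2, 1)` for {b} −→ {c} reproduces the tabulated 0.507, −0.039, and 0.375.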

  13. Given the rankings you had obtained in Exercise 12, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?
Answer:
Correlation(Confidence, Support) = 0.97.
Correlation(Confidence, Interest) = 1.
Correlation(Confidence, IS) = 1.
Correlation(Confidence, Klosgen) = 0.7.
Correlation(Confidence, Odds Ratio) = −0.606.
Interest and IS are the most highly correlated with confidence, while odds ratio is the least correlated.
  14. Answer the following questions using the data sets shown in Figure 6.6. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).

(a) Which data set(s) will produce the largest number of frequent itemsets?
Answer: Data set (e), because it generates the longest frequent itemset along with all of its subsets.
(b) Which data set(s) will produce the fewest frequent itemsets?
Answer: Data set (d), which does not produce any frequent itemsets at the 10% support threshold.
(c) Which data set(s) will produce the longest frequent itemset?
Answer: Data set (e).
(d) Which data set(s) will produce frequent itemsets with the highest maximum support?
Answer: Data set (b).
(e) Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?
Answer: Data set (e).

  15. (a) Prove that the φ coefficient is equal to 1 if and only if f11 = f1+ = f+1.
Answer: Instead of proving f11 = f1+ = f+1, we will show that P(A, B) = P(A) = P(B), where P(A, B) = f11/N, P(A) = f1+/N, and P(B) = f+1/N. When the φ coefficient equals 1:

φ = [P(A, B) − P(A)P(B)] / √( P(A)P(B)[1 − P(A)][1 − P(B)] ) = 1

The preceding equation can be simplified as follows:

[P(A, B) − P(A)P(B)]² = P(A)P(B)[1 − P(A)][1 − P(B)]

P(A, B)² − 2 P(A, B)P(A)P(B) = P(A)P(B)[1 − P(A) − P(B)]

P(A, B)² = P(A)P(B)[1 − P(A) − P(B) + 2 P(A, B)]

We may rewrite the equation as a quadratic in P(B):

P(A)P(B)² − P(A)[1 − P(A) + 2 P(A, B)]P(B) + P(A, B)² = 0

The solution to the quadratic equation in P(B) is:

P(B) = [ P(A)β − √( P(A)²β² − 4 P(A)P(A, B)² ) ] / (2 P(A)),

where β = 1 − P(A) + 2 P(A, B). Note that the second solution, in which the second term in the numerator is positive, is not a feasible solution because it corresponds to φ = −1. Furthermore, the solution for P(B) must satisfy the constraint P(B) ≥ P(A, B). It can be shown that:

P(B) − P(A, B) = [ (1 − P(A)) − √( (1 − P(A))² + 4 P(A, B)(1 − P(A))(1 − P(A, B)/P(A)) ) ] / 2

Because the square-root term is at least 1 − P(A), this difference is never positive; together with the constraint, this forces P(B) = P(A, B), which can be achieved by setting P(A, B) = P(A). Hence P(A, B) = P(A) = P(B).

(b) Show that if A and B are independent, then P(A, B) × P(Ā, B̄) = P(A, B̄) × P(Ā, B).
Answer: When A and B are independent, P(A, B) = P(A) × P(B), or equivalently:

P(A, B) − P(A)P(B) = 0
P(A, B) − [P(A, B) + P(A, B̄)][P(A, B) + P(Ā, B)] = 0
P(A, B)[1 − P(A, B) − P(A, B̄) − P(Ā, B)] − P(A, B̄)P(Ā, B) = 0
P(A, B)P(Ā, B̄) − P(A, B̄)P(Ā, B) = 0.

(c) Show that Yule’s Q and Y coefficients

Q = [ f11 f00 − f10 f01 ] / [ f11 f00 + f10 f01 ]

Y = [ √(f11 f00) − √(f10 f01) ] / [ √(f11 f00) + √(f10 f01) ]

are normalized versions of the odds ratio.
Answer: The odds ratio can be written as:

α = (f11 f00) / (f10 f01).

We can express Q and Y in terms of α as follows:

Q = (α − 1) / (α + 1)

Y = (√α − 1) / (√α + 1)


In both cases, Q and Y increase monotonically with α. Furthermore, when α = 0, Q = Y = −1, which represents perfect negative correlation. When α = 1, the condition for attribute independence, Q = Y = 0. Finally, when α = ∞, Q = Y = +1. This suggests that Q and Y are normalized versions of α.
(d) Write a simplified expression for the value of each measure shown in Tables 6.11 and 6.12 when the variables are statistically independent.
Answer:

Measure                  Value under independence
φ-coefficient            0
Odds ratio               1
Kappa κ                  0
Interest                 1
Cosine, IS               √P(A, B)
Piatetsky-Shapiro’s      0
Collective strength      1
Jaccard                  0 ··· 1
Conviction               1
Certainty factor         0
Added value              0

  16. Consider the interestingness measure M = [P(B|A) − P(B)] / [1 − P(B)] for an association rule A −→ B.

(a) What is the range of this measure? When does the measure attain its maximum and minimum values?
Answer: The range of the measure is from 0 to 1. The measure attains its maximum value when P(B|A) = 1 and its minimum value when P(B|A) = P(B).
(b) How does M behave when P(A, B) is increased while P(A) and P(B) remain unchanged?
Answer: The measure can be rewritten as follows:

M = [P(A, B) − P(A)P(B)] / [P(A)(1 − P(B))]

It increases when P(A, B) is increased.
(c) How does M behave when P(A) is increased while P(A, B) and P(B) remain unchanged?
Answer: The measure decreases with increasing P(A).
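The behavior claimed in parts (b) and (c) can be illustrated with the rewritten form of M. A minimal sketch of my own; the probability values below are arbitrary and chosen only to satisfy the stated side conditions.

```python
# M rewritten as (P(A,B) - P(A)P(B)) / (P(A)(1 - P(B))).
def M(p_ab, p_a, p_b):
    return (p_ab - p_a * p_b) / (p_a * (1 - p_b))

# (b) raising P(A,B) with P(A) = 0.5, P(B) = 0.4 fixed raises M
m_base = M(0.25, 0.5, 0.4)
m_more_ab = M(0.30, 0.5, 0.4)
# (c) raising P(A) with P(A,B) = 0.25, P(B) = 0.4 fixed lowers M
m_more_a = M(0.25, 0.6, 0.4)
```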