An explanation of association analysis in data mining, focusing on computing support, confidence, and various measures for association rules. It includes examples of computing support for itemsets and confidence for association rules using the Apriori algorithm.
(a) A rule that has high support and high confidence.
Answer: Milk −→ Bread. Such an obvious rule tends to be uninteresting.
(b) A rule that has reasonably high support but low confidence.
Answer: Milk −→ Tuna. While the sale of tuna and milk may be higher than the support threshold, not all transactions that contain milk also contain tuna. Such a low-confidence rule tends to be uninteresting.
(c) A rule that has low support and low confidence.
Answer: Cooking oil −→ Laundry detergent. Such a low-confidence rule tends to be uninteresting.
(d) A rule that has low support and high confidence.
Answer: Vodka −→ Caviar. Such a rule tends to be interesting.
(a) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each transaction ID as a market basket. Answer:
Table 6.1. Example of market basket transactions.

Customer ID   Transaction ID   Items Bought
1             0001             {a, d, e}
1             0024             {a, b, c, e}
2             0012             {a, b, d, e}
2             0031             {a, c, d, e}
3             0015             {b, c, e}
3             0022             {b, d, e}
4             0029             {c, d}
4             0040             {a, b, c}
5             0033             {a, d, e}
5             0038             {a, b, e}
s({e}) = 8/10 = 0.8
s({b, d}) = 2/10 = 0.2
s({b, d, e}) = 2/10 = 0.2
(b) Use the results in part (a) to compute the confidence for the association rules {b, d} −→ {e} and {e} −→ {b, d}. Is confidence a symmetric measure? Answer:
c(bd −→ e) = s({b, d, e}) / s({b, d}) = 0.2/0.2 = 1.0
c(e −→ bd) = s({b, d, e}) / s({e}) = 0.2/0.8 = 0.25
No, confidence is not a symmetric measure. (c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise). Answer:
s({e}) = 4/5 = 0.8
s({b, d}) = 5/5 = 1.0
s({b, d, e}) = 4/5 = 0.8
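The support and confidence computations in parts (a)–(c) can be checked with a short Python sketch; the transactions are transcribed from Table 6.1, and the helper names (`support`, `confidence`) are my own:

```python
# Transactions from Table 6.1, keyed by (customer ID, transaction ID).
transactions = {
    (1, "0001"): {"a", "d", "e"}, (1, "0024"): {"a", "b", "c", "e"},
    (2, "0012"): {"a", "b", "d", "e"}, (2, "0031"): {"a", "c", "d", "e"},
    (3, "0015"): {"b", "c", "e"}, (3, "0022"): {"b", "d", "e"},
    (4, "0029"): {"c", "d"}, (4, "0040"): {"a", "b", "c"},
    (5, "0033"): {"a", "d", "e"}, (5, "0038"): {"a", "b", "e"},
}

def support(itemset, baskets):
    """Fraction of baskets containing every item of the itemset."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """c(lhs -> rhs) = s(lhs U rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

# (a) Each transaction ID is one basket.
tx = list(transactions.values())
print(support("e", tx), support("bd", tx), support("bde", tx))  # 0.8 0.2 0.2

# (b) Confidence is not symmetric.
print(confidence("bd", "e", tx), confidence("e", "bd", tx))     # 1.0 0.25

# (c) Each customer ID is one basket (union of that customer's transactions).
cust = {}
for (cid, _), items in transactions.items():
    cust.setdefault(cid, set()).update(items)
cust = list(cust.values())
print(support("e", cust), support("bd", cust), support("bde", cust))  # 0.8 1.0 0.8
```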
(d) Transitivity: Suppose the confidence of the rules A −→ B and B −→ C are larger than some threshold, minconf. Is it possible that A −→ C has a confidence less than minconf?
Answer: Yes; it depends on the supports of A, B, and C. For example, suppose:

s(A, B) = 60%   s(A) = 90%
s(A, C) = 20%   s(B) = 70%
s(B, C) = 50%   s(C) = 60%

Let minconf = 50%. Then:

c(A −→ B) = 60/90 ≈ 66.7% > minconf
c(B −→ C) = 50/70 ≈ 71.4% > minconf
but c(A −→ C) = 20/90 ≈ 22.2% < minconf.
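Plugging the counterexample's supports into the confidence definition confirms the claim; a minimal check (numbers from above):

```python
# c(X -> Y) = s(X, Y) / s(X), using the supports from the counterexample.
s = {"AB": 0.60, "A": 0.90, "AC": 0.20, "B": 0.70, "BC": 0.50, "C": 0.60}
minconf = 0.5

c_AB = s["AB"] / s["A"]   # ~0.667
c_BC = s["BC"] / s["B"]   # ~0.714
c_AC = s["AC"] / s["A"]   # ~0.222

# A -> B and B -> C clear the threshold, but A -> C does not.
print(c_AB > minconf, c_BC > minconf, c_AC < minconf)  # True True True
```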
Example: Support, s(X) = σ(X)/|T| (where σ(X) is the number of transactions that contain X and |T| is the total number of transactions), is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.
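The anti-monotone property of support can be verified exhaustively on a small basket collection; a sketch (the toy baskets below are chosen arbitrarily):

```python
from itertools import combinations

baskets = [{"a", "d", "e"}, {"a", "b", "c", "e"}, {"a", "b", "d", "e"},
           {"b", "c", "e"}, {"c", "d"}]
items = sorted(set().union(*baskets))

def support(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

# Every non-empty itemset, as a set.
all_itemsets = [set(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

# s(X) >= s(Y) must hold whenever X is a proper subset of Y.
ok = all(support(X) >= support(Y)
         for X in all_itemsets for Y in all_itemsets if X < Y)
print(ok)  # True
```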
(a) A characteristic rule is a rule of the form {p} −→ {q 1 , q 2 ,... , qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:
ζ({p1, p2, ..., pk}) = min[ c({p1} −→ {p2, p3, ..., pk}), ..., c({pk} −→ {p1, p2, ..., pk−1}) ]
Is ζ monotone, anti-monotone, or non-monotone? Answer: ζ is an anti-monotone measure because
ζ({A 1 , A 2 , · · · , Ak}) ≥ ζ({A 1 , A 2 , · · · , Ak, Ak+1}) (6.2)
For example, we can compare the values of ζ for {A, B} and {A, B, C}.
ζ({A, B}) = min[ c(A −→ B), c(B −→ A) ]
          = min[ s(A, B)/s(A), s(A, B)/s(B) ]
          = s(A, B) / max(s(A), s(B))
ζ({A, B, C}) = min[ c(A −→ BC), c(B −→ AC), c(C −→ AB) ]
             = min[ s(A, B, C)/s(A), s(A, B, C)/s(B), s(A, B, C)/s(C) ]
             = s(A, B, C) / max(s(A), s(B), s(C))
Since s(A, B, C) ≤ s(A, B) and max(s(A), s(B), s(C)) ≥ max(s(A), s(B)), therefore ζ({A, B}) ≥ ζ({A, B, C}).
(b) A discriminant rule is a rule of the form {p 1 , p 2 ,... , pn} −→ {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:
η({p1, p2, ..., pk}) = min[ c({p2, p3, ..., pk} −→ {p1}), ..., c({p1, p2, ..., pk−1} −→ {pk}) ]
Is η monotone, anti-monotone, or non-monotone? Answer: η is non-monotone. We can show this by comparing η({A, B}) against η({A, B, C}).
η({A, B}) = min[ c(A −→ B), c(B −→ A) ]
          = min[ s(A, B)/s(A), s(A, B)/s(B) ]
          = s(A, B) / max(s(A), s(B))
η({A, B, C}) = min[ c(AB −→ C), c(AC −→ B), c(BC −→ A) ]
             = min[ s(A, B, C)/s(A, B), s(A, B, C)/s(A, C), s(A, B, C)/s(B, C) ]
             = s(A, B, C) / max(s(A, B), s(A, C), s(B, C))
Since s(A, B, C) ≤ s(A, B) (the numerator decreases) and max(s(A, B), s(A, C), s(B, C)) ≤ max(s(A), s(B)) (the denominator also decreases), η({A, B, C}) can be either greater than or less than η({A, B}). Hence, the measure is non-monotone. (c) Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.
For the discriminant measure with max, η′({A, B, C}) = s(A, B, C)/min(s(A, B), s(A, C), s(B, C)). Since s(A, B, C) ≤ s(A, B) and min(s(A, B), s(A, C), s(B, C)) ≤ min(s(A), s(B), s(C)) ≤ min(s(A), s(B)), η′({A, B, C}) can be greater than or less than η′({A, B}). Hence, the measure is non-monotone.
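Assuming the supports come from the Table 6.1 transactions, ζ and η can be computed directly; the itemsets {b, e} and {b, d, e} below are arbitrary picks to illustrate the behavior:

```python
baskets = [{"a","d","e"}, {"a","b","c","e"}, {"a","b","d","e"}, {"a","c","d","e"},
           {"b","c","e"}, {"b","d","e"}, {"c","d"}, {"a","b","c"},
           {"a","d","e"}, {"a","b","e"}]

def support(itemset):
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def zeta(itemset):
    """Min confidence over characteristic rules: s(I) / max_p s({p})."""
    return support(itemset) / max(support({p}) for p in itemset)

def eta(itemset):
    """Min confidence over discriminant rules: s(I) / max_p s(I - {p})."""
    return support(itemset) / max(support(set(itemset) - {p}) for p in itemset)

# zeta shrinks when the itemset grows (anti-monotone) ...
print(zeta({"b", "e"}), zeta({"b", "d", "e"}))  # 0.625 0.25
# ... while eta can move either way (non-monotone in general).
print(eta({"b", "e"}), eta({"b", "d", "e"}))    # 0.625 0.4
```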
Suppose there are d items. We first choose k of the items to form the left-hand side of a rule; there are C(d, k) ways of doing this. After selecting the items for the left-hand side, there are C(d − k, i) ways to choose the remaining items to form the right-hand side of the rule, where 1 ≤ i ≤ d − k. Therefore the total number of rules (R) is:

R = Σ_{k=1..d} C(d, k) × [ Σ_{i=1..d−k} C(d − k, i) ]
  = Σ_{k=1..d} C(d, k) × (2^(d−k) − 1)
  = Σ_{k=1..d} C(d, k) 2^(d−k) − Σ_{k=1..d} C(d, k)
  = Σ_{k=1..d} C(d, k) 2^(d−k) − (2^d − 1),

where we have used the identity Σ_{i=1..n} C(n, i) = 2^n − 1.

Since (1 + x)^d = Σ_{i=1..d} C(d, i) x^(d−i) + x^d, substituting x = 2 leads to:

3^d = Σ_{i=1..d} C(d, i) 2^(d−i) + 2^d.

Therefore, the total number of rules is:

R = (3^d − 2^d) − (2^d − 1) = 3^d − 2^(d+1) + 1.
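The closed form can be cross-checked against a brute-force enumeration for small d; a sketch (function names are my own):

```python
from itertools import combinations

def rules_brute_force(d):
    """Count rules X -> Y with X and Y non-empty and disjoint among d items."""
    items = range(d)
    count = 0
    for k in range(1, d):                       # size of the left-hand side
        for lhs in combinations(items, k):
            rest = [x for x in items if x not in lhs]
            for i in range(1, len(rest) + 1):   # size of the right-hand side
                count += sum(1 for _ in combinations(rest, i))
    return count

def rules_closed_form(d):
    return 3**d - 2**(d + 1) + 1

for d in range(1, 8):
    assert rules_brute_force(d) == rules_closed_form(d)
print(rules_closed_form(6))  # 602, matching the six items of Table 6.2
```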
Table 6.2. Market basket transactions.

Transaction ID   Items Bought
1                {Milk, Beer, Diapers}
2                {Bread, Butter, Milk}
3                {Milk, Diapers, Cookies}
4                {Bread, Butter, Cookies}
5                {Beer, Cookies, Diapers}
6                {Milk, Diapers, Bread, Butter}
7                {Bread, Butter, Diapers}
8                {Beer, Diapers}
9                {Milk, Diapers, Bread, Butter}
10               {Beer, Cookies}
(a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
Answer: There are six items in the data set, so the total number of rules is 3^6 − 2^7 + 1 = 602.
(b) What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
Answer: Because the longest transaction contains 4 items, the maximum size of a frequent itemset is 4.
(c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
Answer: C(6, 3) = 20.
(d) Find an itemset (of size 2 or larger) that has the largest support. Answer: {Bread, Butter}. (e) Find a pair of items, a and b, such that the rules {a} −→ {b} and {b} −→ {a} have the same confidence. Answer: (Beer, Cookies) or (Bread, Butter).
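Parts (d) and (e) can be verified by exhaustive search over Table 6.2 (transactions transcribed from the table; helper names are my own):

```python
from itertools import combinations

baskets = [
    {"Milk", "Beer", "Diapers"}, {"Bread", "Butter", "Milk"},
    {"Milk", "Diapers", "Cookies"}, {"Bread", "Butter", "Cookies"},
    {"Beer", "Cookies", "Diapers"}, {"Milk", "Diapers", "Bread", "Butter"},
    {"Bread", "Butter", "Diapers"}, {"Beer", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"}, {"Beer", "Cookies"},
]

def support(itemset):
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

# (d) Itemset of size >= 2 with the largest support.
items = sorted(set().union(*baskets))
candidates = [set(c) for r in range(2, len(items) + 1)
              for c in combinations(items, r)]
best = max(candidates, key=support)
print(sorted(best), support(best))  # ['Bread', 'Butter'] 0.5

# (e) c(a -> b) == c(b -> a) exactly when s({a}) == s({b}) (and s(a, b) > 0).
print(support({"Beer"}) == support({"Cookies"}))  # True
print(support({"Bread"}) == support({"Butter"}))  # True
```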
{ 1 , 2 , 3 }, { 1 , 2 , 4 }, { 1 , 2 , 5 }, { 1 , 3 , 4 }, { 1 , 3 , 5 }, { 2 , 3 , 4 }, { 2 , 3 , 5 }, { 3 , 4 , 5 }.
Assume that there are only five items in the data set.
(a) List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.
Answer: {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}.
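A minimal sketch of the Fk−1 × F1 merge (assuming all five single items are frequent, as the exercise implies):

```python
# Frequent 3-itemsets from the exercise, stored as sorted tuples.
F3 = [(1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 4),
      (1, 3, 5), (2, 3, 4), (2, 3, 5), (3, 4, 5)]
F1 = [1, 2, 3, 4, 5]  # assumed frequent single items

# Extend each frequent 3-itemset with a frequent item larger than its
# last element, so every candidate 4-itemset is generated exactly once.
candidates = sorted(f + (i,) for f in F3 for i in F1 if i > f[-1])
print(candidates)
# [(1, 2, 3, 4), (1, 2, 3, 5), (1, 2, 4, 5), (1, 3, 4, 5), (2, 3, 4, 5)]
```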
the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.
[Figure 6.1. Solution: the itemset lattice over {A, B, C, D, E}; each node is marked F (frequent), I (infrequent candidate), or N (not generated as a candidate or pruned).]
(b) What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)? Answer: Percentage of frequent itemsets = 16/32 = 50.0% (including the null set). (c) What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.) Answer:
[Figure 6.2. An example of a hash tree structure. Items are hashed on 1, 4, 7 / 2, 5, 8 / 3, 6, 9 at successive levels; the leaf nodes L1–L12 hold the candidate 3-itemsets {125}, {127}, {145}, {158}, {168}, {178}, {246}, {258}, {278}, {289}, {346}, {356}, {367}, {379}, {456}, {457}, {458}, {459}, {568}, {678}, {689}, {789}.]
Pruning ratio is the ratio of N to the total number of itemsets. Since the count of N is 11, the pruning ratio is 11/32 = 34.4%. (d) What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)? Answer: The false alarm rate is the ratio of I to the total number of itemsets. Since the count of I is 5, the false alarm rate is 5/32 = 15.6%.
(a) Given a transaction that contains items {1, 3, 4, 5, 8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
Answer: The leaf nodes visited are L1, L3, L5, L9, and L11.
(b) Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1, 3, 4, 5, 8}.
Answer: The candidates contained in the transaction are {1, 4, 5}, {1, 5, 8}, and {4, 5, 8}.
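The hash tree only prunes the search; a brute-force sketch produces the same matches (the candidate list is transcribed from Figure 6.2, so treat it as an assumption):

```python
from itertools import combinations

# Candidate 3-itemsets stored in the hash tree of Figure 6.2 (as transcribed).
candidates = {(1, 2, 5), (1, 2, 7), (1, 4, 5), (1, 5, 8), (1, 6, 8), (1, 7, 8),
              (2, 4, 6), (2, 5, 8), (2, 7, 8), (2, 8, 9), (3, 4, 6), (3, 5, 6),
              (3, 6, 7), (3, 7, 9), (4, 5, 6), (4, 5, 7), (4, 5, 8), (4, 5, 9),
              (5, 6, 8), (6, 7, 8), (6, 8, 9), (7, 8, 9)}

transaction = (1, 3, 4, 5, 8)

# Enumerate every 3-subset of the transaction; keep those that are candidates.
matched = sorted(set(combinations(transaction, 3)) & candidates)
print(matched)  # [(1, 4, 5), (1, 5, 8), (4, 5, 8)]
```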
{ 1 , 2 , 3 }, { 1 , 2 , 6 }, { 1 , 3 , 4 }, { 2 , 3 , 4 }, { 2 , 4 , 5 }, { 3 , 4 , 6 }, { 4 , 5 , 6 }
[Figure 6.4. An itemset lattice over the items {a, b, c, d, e}.]
(b) How many leaf nodes are there in the candidate hash tree? How many internal nodes are there? Answer: There are 5 leaf nodes and 4 internal nodes. (c) Consider a transaction that contains the following items: { 1 , 2 , 3 , 5 , 6 }. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction? Answer: The leaf nodes L1, L2, L3, and L4 will be checked against the transaction. The candidate itemsets contained in the transaction include {1,2,3} and {1,2,6}.
Assume that the support threshold is equal to 30%.
Answer: The lattice structure is shown below.
[Figure 6.5. Solution for Exercise 11: the itemset lattice over {A, B, C, D, E}, with each node marked F (frequent), C (closed frequent), M (maximal frequent), or I (infrequent).]
(a) Draw a contingency table for each of the following rules using the trans- actions shown in Table 6.4.
Rules: {b} −→ {c}, {a} −→ {d}, {b} −→ {d}, {e} −→ {c}, {c} −→ {a}.
Answer:

{b} −→ {c}:
         c    c̄
    b    3    4
    b̄    2    1

{a} −→ {d}:
         d    d̄
    a    4    1
    ā    5    0

{b} −→ {d}:
         d    d̄
    b    6    1
    b̄    3    0

{e} −→ {c}:
         c    c̄
    e    2    4
    ē    3    1

{c} −→ {a}:
         a    ā
    c    2    3
    c̄    3    2

(b) Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.
Rules       IS      Rank
b −→ c     0.507     3
a −→ d     0.596     2
b −→ d     0.756     1
e −→ c     0.365     5
c −→ a     0.400     4

v. Klosgen(X −→ Y) = √P(X, Y) × (P(Y|X) − P(Y)), where P(Y|X) = P(X, Y)/P(X).
Answer:
Rules      Klosgen    Rank
b −→ c     −0.039      2
a −→ d     −0.063      4
b −→ d     −0.033      1
e −→ c     −0.075      5
c −→ a     −0.045      3

vi. Odds ratio(X −→ Y) = [P(X, Y) P(X̄, Ȳ)] / [P(X, Ȳ) P(X̄, Y)].
Answer:
Rules      Odds Ratio   Rank
b −→ c       0.375       2
a −→ d       0           4
b −→ d       0           4
e −→ c       0.167       3
c −→ a       0.444       1
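The measure tables above can be reproduced from the contingency counts in part (a); a sketch with f11, f10, f01, f00 read off each table (N = 10; function names are my own):

```python
from math import sqrt

# (f11, f10, f01, f00) for each rule, read from the part (a) tables; N = 10.
tables = {
    "b->c": (3, 4, 2, 1), "a->d": (4, 1, 5, 0), "b->d": (6, 1, 3, 0),
    "e->c": (2, 4, 3, 1), "c->a": (2, 3, 3, 2),
}
N = 10

def IS(f11, f10, f01, f00):
    # P(X,Y) / sqrt(P(X) P(Y)); the N factors cancel.
    return f11 / sqrt((f11 + f10) * (f11 + f01))

def klosgen(f11, f10, f01, f00):
    p_xy, p_x, p_y = f11 / N, (f11 + f10) / N, (f11 + f01) / N
    return sqrt(p_xy) * (p_xy / p_x - p_y)

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

for rule, f in tables.items():
    print(rule, round(IS(*f), 3), round(klosgen(*f), 3), round(odds_ratio(*f), 3))
# b->c 0.507 -0.039 0.375
# a->d 0.596 -0.063 0.0
# b->d 0.756 -0.033 0.0
# e->c 0.365 -0.075 0.167
# c->a 0.4 -0.045 0.444
```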
items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).
(a) Which data set(s) will produce the greatest number of frequent itemsets?
Answer: Data set (e), because it generates the longest frequent itemset along with all of its subsets.
(b) Which data set(s) will produce the fewest frequent itemsets?
Answer: Data set (d), which does not produce any frequent itemsets at the 10% support threshold.
(c) Which data set(s) will produce the longest frequent itemset?
Answer: Data set (e).
(d) Which data set(s) will produce frequent itemsets with the highest maximum support?
Answer: Data set (b).
(e) Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?
Answer: Data set (e).
φ = [P(A, B) − P(A)P(B)] / √(P(A)P(B)(1 − P(A))(1 − P(B)))

Setting φ = 1 and squaring both sides, the preceding equation can be simplified as follows:

[P(A, B) − P(A)P(B)]² = P(A)P(B)(1 − P(A))(1 − P(B)).

We may rewrite the equation as a quadratic in P(B):

P(A)P(B)² − P(A)β P(B) + P(A, B)² = 0, where β = 1 − P(A) + 2P(A, B).

The solution to the quadratic equation in P(B) is:

P(B) = [P(A)β − √(P(A)²β² − 4P(A)P(A, B)²)] / (2P(A)).

Note that the second solution, in which the sign in front of the square-root term is positive, is not a feasible solution because it corresponds to φ = −1. Furthermore, the solution for P(B) must satisfy the constraint P(B) ≥ P(A, B). It can be shown that

P(B) − P(A, B) = [P(A)(1 − P(A)) − √(P(A)²β² − 4P(A)P(A, B)²)] / (2P(A)) ≤ 0,

since P(A)²β² − 4P(A)P(A, B)² − P(A)²(1 − P(A))² = 4P(A)P(A, B)(1 − P(A))(P(A) − P(A, B)) ≥ 0. Because of the constraint, P(B) = P(A, B), which can be achieved by setting P(A, B) = P(A); that is, φ = 1 requires P(A) = P(B) = P(A, B).
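A quick numerical check of the conclusion (the probability values below are chosen arbitrarily):

```python
from math import sqrt

def phi(p_ab, p_a, p_b):
    return (p_ab - p_a * p_b) / sqrt(p_a * p_b * (1 - p_a) * (1 - p_b))

# phi reaches 1 when P(A) = P(B) = P(A, B) ...
print(round(phi(0.3, 0.3, 0.3), 10))  # 1.0
# ... while breaking the equalities pulls it below 1.
print(phi(0.3, 0.5, 0.3) < 1, phi(0.25, 0.5, 0.5) < 1)  # True True
```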
(b) Show that if A and B are independent, then P(A, B) × P(Ā, B̄) = P(A, B̄) × P(Ā, B).
Answer: When A and B are independent, P(A, B) = P(A) × P(B), or equivalently:

P(A, B) − P(A)P(B) = 0
P(A, B) − [P(A, B) + P(A, B̄)][P(A, B) + P(Ā, B)] = 0
P(A, B)[1 − P(A, B) − P(A, B̄) − P(Ā, B)] − P(A, B̄)P(Ā, B) = 0
P(A, B)P(Ā, B̄) − P(A, B̄)P(Ā, B) = 0.
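A numeric sanity check of the identity (the marginals are chosen arbitrarily):

```python
# Independent A and B with P(A) = 0.4, P(B) = 0.7.
p_a, p_b = 0.4, 0.7
p11 = p_a * p_b              # P(A, B)
p10 = p_a * (1 - p_b)        # P(A, not B)
p01 = (1 - p_a) * p_b        # P(not A, B)
p00 = (1 - p_a) * (1 - p_b)  # P(not A, not B)
print(abs(p11 * p00 - p10 * p01) < 1e-12)  # True
```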
(c) Show that Yule’s Q and Y coefficients

Q = [f11 f00 − f10 f01] / [f11 f00 + f10 f01]
Y = [√(f11 f00) − √(f10 f01)] / [√(f11 f00) + √(f10 f01)]

are normalized versions of the odds ratio.
Answer: The odds ratio can be written as

α = (f11 f00) / (f10 f01).

We can express Q and Y in terms of α as follows:

Q = (α − 1) / (α + 1)
Y = (√α − 1) / (√α + 1)
In both cases, Q and Y increase monotonically with α. Furthermore, when α = 0, Q = Y = −1, representing perfect negative correlation; when α = 1, which is the condition for attribute independence, Q = Y = 0; and when α = ∞, Q = Y = +1. This suggests that Q and Y are normalized versions of α.
(d) Write a simplified expression for the value of each measure shown in Tables 6.11 and 6.12 when the variables are statistically independent.
Answer:
Measure                  Value under independence
φ-coefficient            0
Odds ratio               1
Kappa κ                  0
Interest                 1
Cosine, IS               √(P(A)P(B))
Piatetsky-Shapiro’s      0
Collective strength      1
Jaccard                  0 ··· 1
Conviction               1
Certainty factor         0
Added value              0
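The part (c) relationship between Q, Y, and the odds ratio α can be checked numerically (the 2 × 2 counts below are an arbitrary example):

```python
from math import sqrt

def Q_direct(f11, f10, f01, f00):
    return (f11 * f00 - f10 * f01) / (f11 * f00 + f10 * f01)

def Y_direct(f11, f10, f01, f00):
    num = sqrt(f11 * f00) - sqrt(f10 * f01)
    den = sqrt(f11 * f00) + sqrt(f10 * f01)
    return num / den

def Q_from_alpha(alpha):
    return (alpha - 1) / (alpha + 1)

def Y_from_alpha(alpha):
    return (sqrt(alpha) - 1) / (sqrt(alpha) + 1)

f = (3, 4, 2, 1)                       # arbitrary 2x2 counts
alpha = (f[0] * f[3]) / (f[1] * f[2])  # odds ratio = 3/8
print(abs(Q_direct(*f) - Q_from_alpha(alpha)) < 1e-12)  # True
print(abs(Y_direct(*f) - Y_from_alpha(alpha)) < 1e-12)  # True
print(Q_from_alpha(1), Y_from_alpha(1))  # 0.0 0.0  (independence)
```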
(a) What is the range of this measure? When does the measure attain its maximum and minimum values?
Answer: The range of the measure is from 0 to 1. The measure attains its maximum value when P(B|A) = 1 and its minimum value when P(B|A) = P(B).
(b) How does M behave when P(A, B) is increased while P(A) and P(B) remain unchanged?
Answer: The measure can be rewritten as

M = [P(A, B) − P(A)P(B)] / [P(A)(1 − P(B))],

so it increases when P(A, B) is increased.
(c) How does M behave when P(A) is increased while P(A, B) and P(B) remain unchanged?
Answer: The measure decreases with increasing P(A).
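The behavior in parts (b) and (c) follows directly from the rewritten form; a small check (the probability values are chosen arbitrarily):

```python
def M(p_ab, p_a, p_b):
    """The measure rewritten as (P(A,B) - P(A)P(B)) / (P(A)(1 - P(B)))."""
    return (p_ab - p_a * p_b) / (p_a * (1 - p_b))

# (b) Increasing P(A, B) with P(A), P(B) fixed increases M.
print(M(0.2, 0.5, 0.4) < M(0.3, 0.5, 0.4))  # True
# (c) Increasing P(A) with P(A, B), P(B) fixed decreases M.
print(M(0.3, 0.5, 0.4) > M(0.3, 0.6, 0.4))  # True
```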