
Chapter 4

Association Pattern Mining

This chapter proposes different algorithms for mining association patterns and their
extensions. Section 4.2 describes the proposed algorithms MBA (Matrix Based
Association pattern mining algorithm), CBA (Cluster Based Association pattern
mining algorithm), and Hybrid (Hybrid association pattern mining algorithm) for
mining association patterns. Like most conventional association pattern mining
algorithms, MBA and CBA are based on the level-wise bottom-up approach, while
Hybrid is based on a novel approach. It is observed that algorithm Hybrid is efficient
for mining dense databases. Section 4.3 discusses how these algorithms can be
extended by adding new modules to achieve Generalized association pattern mining,
Incremental association pattern mining, Multi-dimensional association pattern mining
- containing categorical as well as quantitative attributes, and Maxmaxpattern mining.

These algorithms have been implemented and used in the association rule mining
system (AR-Miner) described in Chapter 6 of this thesis.

4.1 A Critical Look at Currently Used Algorithms

Critical analysis of the existing literature has led the author to identify the following
limitations/shortcomings in available algorithms for frequent itemset generation:

• Recently proposed graph based approaches [38,48] for mining association
patterns require the construction of the ‘Association Graph’ [38] and the
‘Frequent Pattern tree’ [48] respectively for every new value of minimum
support, by scanning the entire database. Even when the database is
incremented or mining is performed at a higher concept level, the existing
‘Association Graph’ or ‘Frequent Pattern tree’ has to be reconstructed from
scratch. Construction of the ‘Association Graph’ or ‘Frequent Pattern tree’ is a
time-consuming activity. Further, these approaches do not offer flexibility and
reusability of computation during the mining process.

• Existing association pattern mining algorithms [1,3,4,38,40,50,113] may
generate and test unnecessary, insignificant 2-candidate itemsets due to the
presence of false frequent items (refer to Section 3.3.4). In [50] it is observed
that the execution time of the first two passes required by the Apriori
algorithm [4] is about 62 percent of the total execution time. This strongly
motivates filtering out false frequent items at an early stage, which reduces the
generation and testing of 2-candidate itemsets, C2.

• AIS [1], SETM [40], Apriori [4] and its variants, and the graph based
association pattern mining algorithms [33,38,48] do not have any advance
information regarding the maximum size of the frequent itemsets present in
the database, at the given minimum support, prior to actual mining. Hence,
they continue to generate and test candidate itemsets until the set of candidate
itemsets in a pass becomes empty. Due to this, many insignificant candidate
itemsets are generated and tested. Testing of such insignificant candidate
itemsets may also demand one more database scan for the algorithms of
[1,4,38,40] and one more ‘Frequent Pattern tree’ scan for the algorithm of [48].

• In the literature, algorithms are available for maintaining association rules
when transactions are added to or deleted from the database. However,
algorithms are not available for mining incremental rules arising from the
addition of more items/attributes.

• Algorithms [1,2,3,4,48,50,113] are not suitable or efficient for verification-
driven association rule mining. Further, the problem worsens if the data
changes frequently.

4.2 Proposed Algorithms

4.2.1 Matrix Based Association Pattern Mining Algorithm (MBA)

To address the above-mentioned limitations/shortcomings, a new algorithm (MBA)
is proposed in this section for mining association patterns. MBA uses a complete,
level-wise bottom-up search with a vertical (encoded) data layout and enumerates all
frequent itemsets. It is an iterative algorithm that counts k-itemsets in pass k by using
Property 4.1. In each pass it scans an encoded database of the same size. The
frequent itemsets of a pass are retained for the next higher pass, and the process is
repeated until all frequent itemsets of various lengths have been enumerated.

The algorithm uses the property and lemmas given below, together with Lemma 3.1
given in Chapter 3. The proposed algorithm is based on the Flexible Uniform
Framework (FUF) described in Chapter 3. Phase 1 of FUF, which corresponds to
encoding and counting, has not been shown in the algorithm, for better readability.
Property 4.1:

The support for the itemset (i1, i2, i3, …, ik) is the number of 1s in β1 ∧ β2 ∧ β3 ∧ … ∧ βk,
where the notation ‘∧’ is the logical AND operator.
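
As an illustration, the following Python sketch realises Property 4.1 when each column of the Concentrator is held as an arbitrary-precision integer used as a bit vector (one bit per transaction). The names support and beta are illustrative, not part of the thesis code; the data corresponds to the example database of Table 4.11 later in this chapter.

    from functools import reduce
    from operator import and_

    def support(itemset, beta):
        """Support of an itemset = number of 1s in the AND of its bit vectors.

        beta maps each item to an integer whose bits encode the transactions
        containing it (bit t is set iff the item occurs in transaction t).
        """
        return bin(reduce(and_, (beta[i] for i in itemset))).count("1")

    # Concentrator of Table 4.13 as bit vectors (Tid 100 = bit 0, ..., 400 = bit 3):
    beta = {"A": 0b0101, "B": 0b1110, "C": 0b1101, "D": 0b1110}
    print(support(("B", "C", "D"), beta))  # -> 2, i.e. frequent at min_supp = 50%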

Lemma 4.1:

73
If an itemset is not a frequent itemset, then any superset of this itemset cannot be a
frequent itemset.

Lemma 4.2:

A frequent itemset (i1, i2, i3, …, ik) cannot be extended to form a frequent
itemset (i1, i2, i3, …, ik, x) if the pair (ik, x) ∉ L2.

The formal description of the algorithm is given below.

Algorithm: Matrix Based Association pattern mining algorithm (MBA).

Input:
Data file, D
Minimum support, min_supp

Output:
Frequent itemsets, L

Process:
begin
L1 = φ; //L1 is the set of 1-frequent items
L = φ;
L1 = identify_frequent_item(count_table, min_supp);
if |L1| < 2 then
print “No association pattern exists at given support”
exit(0);
else
ξ = make_concentrator(D, L1);
//Construction of the Concentrator

forall l ∈ L1 do
if l.count < min_supp then
//Detection of false 1-frequent item
L1 = L1\ l ;
//Filtering of false 1-frequent item from L1
endif;
endfor;
if |L1| < 2 then
print “No association pattern exists at given support”
exit(0);
else
γ = predict_size(ξ, min_supp);
//Predicts the size of the maximum length
//frequent itemset which may be present in the
//database at given minimum support
L = L1;
for(k = 2; Lk-1 ≠ φ AND k ≤ γ; k++)
do begin
Ck = gen_pattern(Lk-1, ξ);
//Generates new candidate itemsets of size k, where
//2 ≤ k ≤ γ
forall c ∈ Ck do
//test whether a candidate itemset is frequent
if c.count ≥ min_supp then
Lk = Lk ∪ c;
endif;
endfor;
L = L ∪ Lk;
endfor;

endif;
endif;
end;

function gen_pattern(Lk-1, ξ)

//This function generates the k-candidate itemsets from the given (k-1)-frequent
//itemsets and also calculates the actual support of generated candidate itemsets.

Ck = φ;
for (i = 1; i ≤ |Lk-1| - 1; i++) do begin
for (j = i+1; j ≤ |Lk-1|; j++) do begin
if left(ti, |ti| - 1) == left(tj, |tj| - 1) then
// ti is the ith row of Lk-1
// tj is the jth row of Lk-1
if left(ti, |ti| - 1) = φ then
c.item = ti ∪ tj;
else
c.item = left(ti, |ti| - 1) + right(tj, 1);
endif;
c.count = supp(c.item, ξ);
Ck = Ck ∪ c;
else
exitfor;
endif;
endfor;
endfor;
return Ck;

Brief description of the algorithm is given below:

L1 (the set of frequent items of size 1) and L (the set of frequent itemsets) are
initialized to null. Function identify_frequent_item() is called. This function identifies
the 1-frequent items and assigns them to L1 (this step corresponds to Phase 2 of FUF).
The input parameters to this function are count_table (constructed during Phase 1 of
FUF and not shown in the algorithm) and the user-defined minimum support,
min_supp. A description of this function is given in Section 3.3.2 of Chapter 3. If the
number of 1-frequent items in L1 is less than 2, i.e. |L1| < 2, then no association
pattern exists in the database at the given minimum support and the algorithm
terminates. Otherwise, the function make_concentrator() is invoked (Phase 3 of FUF).
This function constructs the Concentrator for the given database at the user-defined
support. The input parameters of this function are the database (D) and the 1-frequent
items (L1) obtained during Phase 2. Constructing the Concentrator requires one full
scan of the database. A detailed description of this function is given in Section 3.3.3
of Chapter 3. After constructing the Concentrator, false frequent items are detected
and filtered out from L1 (Phase 4 of FUF), as explained in Section 3.3.4 of Chapter 3.
After filtering false frequent items from L1, the size of L1 is tested again. If the size of
L1 becomes less than two then the algorithm terminates, since no association patterns
exist in the data at the given minimum support. Otherwise, function predict_size() is
called (Phase 5 of FUF). The input parameters to this function are the Concentrator ξ
and the minimum support min_supp. It returns γ, an upper limit on the size of the
largest frequent itemset which may be present in the database at the user-defined
minimum support. This function is explained in Section 3.3.5 of Chapter 3.

Next, function gen_pattern() is called to generate the k-candidate itemsets from the
(k-1)-frequent itemsets. It uses Lemmas 4.1 and 4.2 during this process. Generated
candidate itemsets whose (k-1)-subsets are not all frequent are filtered out, and the
remaining itemsets are the k-candidate itemsets, Ck. This function also finds the
actual support of the generated candidate itemsets by using Property 4.1. The
approach used for generating candidate itemsets is similar to [4]; however, it differs
in the support calculation mechanism.
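
For illustration, the following Python sketch shows the kind of prefix join gen_pattern() performs, under the assumption that every frequent (k-1)-itemset is kept as a lexicographically sorted tuple; the name gen_candidates and the in-memory representation are assumptions of this sketch, not the thesis implementation.

    from itertools import combinations

    def gen_candidates(L_prev):
        """Join (k-1)-itemsets sharing a (k-2)-prefix; prune by Lemma 4.1."""
        prev = set(L_prev)
        L_prev = sorted(L_prev)
        C_k = []
        for i in range(len(L_prev) - 1):
            for j in range(i + 1, len(L_prev)):
                a, b = L_prev[i], L_prev[j]
                if a[:-1] != b[:-1]:   # prefixes diverge: no more joins for a
                    break
                cand = a + (b[-1],)
                # every (k-1)-subset of a frequent k-itemset must be frequent
                if all(s in prev for s in combinations(cand, len(cand) - 1)):
                    C_k.append(cand)
        return C_k

    print(gen_candidates([("A", "B"), ("A", "C"), ("B", "C")]))  # [('A', 'B', 'C')]

For 1-itemsets the shared prefix is empty, so all pairs are joined, which matches the left(ti, |ti| - 1) = φ case of the pseudocode.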

Each candidate itemset c ∈ Ck is then tested for frequency. If c.count ≥ min_supp,
then c is a frequent itemset and is appended to Lk, the set of k-frequent itemsets. At
the end of pass k, Lk is appended to L. This process is repeated while Lk ≠ φ
AND k ≤ γ. At the end, L gives the frequent itemsets.
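
Putting the pieces together, a minimal sketch of the MBA main loop, reusing the support() and gen_candidates() helpers and the beta bit vectors from the sketches above, might look as follows; gamma stands for the bound returned by predict_size(), and min_count is min_supp expressed as an absolute count.

    def mba(beta, min_count, gamma):
        """Level-wise bottom-up enumeration, bounded by gamma (sketch only)."""
        level = [(i,) for i in sorted(beta)
                 if support((i,), beta) >= min_count]     # 1-frequent items
        L = list(level)
        k = 2
        while level and k <= gamma:                       # stop at predicted size
            level = [c for c in gen_candidates(level)
                     if support(c, beta) >= min_count]    # Property 4.1
            L.extend(level)
            k += 1
        return L

    # With the Table 4.11 data: singletons, the frequent pairs, and ('B','C','D').
    print(mba(beta, 2, 3))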

4.2.1.1 Performance Evaluation for MBA

To assess the performance of the Matrix Based Association pattern mining algorithm
developed in this research, experiments have been performed on a Pentium-4 based PC
with a CPU clock rate of 1.6 GHz and 128 MB of main memory, running Windows 98.
Experiments have been performed on real-life datasets. The real-life datasets were
obtained from a retail departmental store (M/s NEW CITY SUPERMARKET, Bangalore,
India) and consisted of the details of sales transactions. A transaction contained data
about the items purchased by a customer in one visit to the store. A snapshot of such a
dataset, used for experiments in this research, is shown in Appendix-D.

The following tables show the execution times taken by MBA for different datasets.

Table 4.1: Execution times for different datasets at different
values of minimum support

                              Minimum Support
    Dataset            0.01%       0.51%       1.01%
    T12.I8.D70K          859         389         378
    T12.I11.D70K        3091         820         836

Table 4.2: Execution times for data having 8 frequent items at
minimum support of 0.01% (T12.I8.D70K)

                            Number of Transactions in K
    Algorithm   15   20   25   30   35   40   45   50   55   60   65   70
    MBA        154  168  189  211  232  251  269  289  310  324  347  859

Table 4.3: Execution times for data having 5 frequent items at
minimum support of 0.01% (T12.I5.D70K)

                            Number of Transactions in K
    Algorithm   15   20   25   30   35   40   45   50   55   60   65   70
    MBA         68   74   77   83   90   96   98  106  109  115  119  126

Table 4.4: Execution times at minimum support of 0.01%

                                        Number of Frequent Items
    Transactions   Algorithm     3     4     5     6     7     8    11    17
    70K            MBA          49    76   125   313   620   859  3091  4085

Table 4.5: Execution times in seconds for MBA
(Dense dataset T7.I10.D5K)

                         Minimum Support
    Algorithm    30%     40%     50%     60%     70%     80%
    MBA         1039     547     262     129      63      35

Table 4.1 gives the results obtained during experiments conducted to see the effect of
minimum support on the execution time for a given dataset. The results indicate that
execution time increases with a decrease in minimum support, because at the reduced
minimum support there are more itemsets to test.

Table 4.2 shows that for a given number of frequent items (eight) and a specified
minimum support of 0.01%, the execution time increases continuously as the size of
the dataset (number of transactions) increases. Similar trends at a lower number of
frequent items (five) and the same minimum support of 0.01% are also noticed in
Table 4.3. However, the rate of increase is more rapid for the case given in Table 4.2,
which may be attributed to the increase in the number of frequent items.

Table 4.4 depicts the effect of the number of frequent items on the execution time at a
minimum support of 0.01%. It is noticed that the execution time increases drastically
as the number of frequent items grows beyond five, and continues to behave the same
way as the number of frequent items increases further. This therefore limits the use of
MBA to those datasets in which the number of frequent items is around five at the
given minimum support.

Table 4.5 gives the results obtained during experiments conducted on dense data to
see the effect of minimum support on the execution time. The results indicate that
execution time increases with a decrease in minimum support. Further, the execution
time required, even for data having only a few thousand transactions, is very large
compared to sparse data. This may be attributed to the combinatorial explosion of
itemsets, making the algorithm impracticable for dense data.

MBA, as presented above, scans all the tuples (transactions) of the Concentrator
during each pass. In pass k it also scans the tuples of the Concentrator containing
fewer than k 1-frequent items. Statistically, scanning all such tuples is insignificant.
Thus MBA becomes inefficient when the data to be mined is large and has many
frequent items at the given minimum support: unnecessary scanning of tuples
requires more disk I/O operations and degrades its performance. This calls for a
modification of MBA to handle such cases.

4.2.2 Cluster Based Association Pattern Mining Algorithm (CBA)

Cluster Based Association pattern mining algorithm (CBA) uses a novel approach to
reduce disk I/O operations. It divides the tuples of the Concentrator into γ-1 clusters
at the given minimum support. It may be recalled that γ is the size of the maximum-size
frequent itemset which may be present in the data at the given minimum support;
the value of γ is predicted as in the case of MBA. During a particular pass only
those tuples that seem statistically useful are scanned, as illustrated by the
following example (a code sketch of the clustering follows the example):

• Suppose for a given database, at an assumed minimum support of 15%, it is
predicted that six is the maximum size of a frequent itemset which may be
present in the data in the worst case. Then the tuples of the Concentrator are
divided into 5 clusters (i.e. γ-1). The first four clusters ω2, ω3, ω4, and ω5
contain all those tuples containing 2, 3, 4, and 5 frequent items respectively,
and the last cluster ω6 contains those tuples in which the number of frequent
items is greater than or equal to 6. When frequent itemsets of size 5 are
searched, these itemsets may only be present in clusters ω5 and ω6. Thus, for
this particular case, the transactions available in clusters ω2, ω3, and ω4 are
statistically insignificant and should not be scanned. Therefore, during each
higher pass the number of transactions scanned reduces compared to the
previous pass. This scan-reduction approach reduces the amount of disk I/O
required and makes CBA more efficient than MBA, especially when the
database is large and the number of 1-frequent items is also high.
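
A minimal sketch of this clustering idea, assuming Concentrator rows are 0/1 lists and gamma is the predicted maximum frequent-itemset size, is given below; all names are illustrative.

    from collections import defaultdict

    def make_clusters(rows, gamma):
        """Group rows by their number of 1s; rows with >= gamma go together."""
        omega = defaultdict(list)
        for row in rows:
            ones = sum(row)
            if ones >= 2:                  # rows with < 2 ones support nothing
                omega[min(ones, gamma)].append(row)
        return omega

    def rows_for_pass(omega, k, gamma):
        """Rows worth scanning in pass k: a k-itemset needs >= k ones in a row."""
        return [r for x in range(k, gamma + 1) for r in omega[x]]

    # For the example above (gamma = 6), pass 5 scans only omega[5] and omega[6]:
    rows = [[1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
    omega = make_clusters(rows, 6)
    print(len(rows_for_pass(omega, 5, 6)))  # -> 2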

Like MBA, the Cluster Based Association pattern mining algorithm uses a complete,
level-wise bottom-up search with a vertical (encoded) data layout and enumerates all
frequent itemsets. It is an iterative algorithm that counts itemsets of a specific length
in a given database pass by using Property 4.1. However, in contrast to MBA, in
each pass it scans only those transactions which seem statistically significant. The
frequent itemsets of a pass are retained for the next higher pass, and the process is
repeated until all frequent itemsets of various lengths have been enumerated.

The following lemma, in addition to those used by MBA, underlies CBA.

Lemma 4.4:

A frequent itemset of size k cannot be present in any cluster ωx, where x < (k-1)
and ωx is a cluster containing tuples having x 1-frequent items.

Formal description of the proposed algorithm is given below.

Algorithm: Cluster Based Association pattern mining algorithm (CBA).

Input:
Data file, D
Minimum support, min_supp
Output:
Frequent itemsets, L

Process:
begin
L1 = φ; //L1 is the set of 1-frequent items
L = φ;
L1 = identify_frequent_item(count_table, min_supp);
//Identifies 1-frequent items – Phase 2 of FUF
if |L1| < 2 then
print “No association pattern exists at given support”
exit(0);
else
ξ = make_concentrator(D, L1);
//Constructs the Concentrator – Phase 3 of FUF
//Loop 1
//Detection and filtering of false frequent items – Phase 4 of FUF
forall l ∈ L1 do
if l.count < min_supp then
// Detection of false 1-frequent item
L1 = L1\ l ;
// Filtering of false 1-frequent item from L1
endif;
endfor;
if |L1| < 2 then
print “No association pattern exists at given support”
exit(0);

else //Perform activities corresponds to Phase 5 of FUF
altertable ξ add (hcount) attribute;
k = no_of_attribute(ξ);
//Loop 2
for (i = 1; i ≤ |ξ|; i++) do begin
for (j = 2; j ≤ k; j++) do begin
ti.hcount = ti.hcount + ti,j;
endfor;
endfor;
position = (|ξ|*min_supp)/100;
ξ’ = select all from ξ order by hcount desc;
γ = ξ’[position].hcount;
ξ = ξ’;
// *******Association Pattern Generation
//Loop 3 making of clusters
for (i = 1; i ≤ γ; i++) do begin
ξ(i) = select all from ξ where ξ.hcount ≥ i;
endfor;
//Loop 4 Candidate itemset generation
for (k = 2; Lk-1 ≠ φ AND k ≤ γ; k++) do
begin
Ck = gen_pattern_cluster(Lk-1, ξ(k));
//Generates new candidate itemsets of size k,
//where 2 ≤ k ≤ γ
forall c ∈ Ck do
//Tests whether a candidate itemset is frequent
if c.count ≥ min_supp then
Lk = Lk ∪ c;
endif;

endfor;
L = L ∪ Lk;
endfor;
endif;
endif;
end;

function gen_pattern_cluster(Lk-1, ξ(k))

//This function generates the k-candidate itemsets from the given (k-1)-frequent itemsets
//and also calculates the actual support of generated candidates.

Ck = φ;
//Loop 5
for (i = 1; i ≤ |Lk-1| - 1; i++) do begin
for (j = i+1; j ≤ |Lk-1|; j++) do begin
if left(ti, |ti| - 1) == left(tj, |tj| - 1) then
// ti is the ith row of Lk-1
// tj is the jth row of Lk-1
if left(ti, |ti| - 1) = φ then
c.item = ti ∪ tj;
else
c.item = left(ti, |ti| - 1) + right(tj, 1);
endif;
c.count = supp(c.item, ξ(k));
Ck = Ck ∪ c;
else
exitfor;
endif;

endfor;
endfor;
return Ck;

Brief description of the algorithm is given below.

L1 (set of frequent items of size 1) and L (set of frequent itemsets) are initialized to
null. Function identify_frequent_item() is called, which identifies the 1-frequent items
and assigns them to L1 (Phase 2 of FUF). The input parameters to this function are count_table
constructed during Phase 1 of FUF (not shown in the algorithm) and user defined
minimum support, min_supp. Description of this function has been given in Section
3.3.2 of Chapter 3. If |L1| < 2 then no association pattern exists in the database at
given minimum support and algorithm terminates. Otherwise function
make_concentrator() is invoked (Phase 3 of FUF) which constructs the Concentrator
for given database at the user defined support as explained in Section 3.3.3 of Chapter
3. After constructing the Concentrator, false frequent items are detected and filtered
out from L1 (Phase 4 of FUF), as explained in Section 3.3.4 of Chapter 3. Next, the
activities corresponding to Phase 5 of FUF start and the value of γ is computed.

The tuples of the Concentrator are then divided into γ-1 clusters (Loop 3) in such a
way that each cluster i, where 2 ≤ i < γ, contains only those tuples in which the
number of 1’s is i (containing i 1-frequent items), except the last cluster, which
contains all those tuples for which the number of 1’s is greater than or equal to γ.

Function gen_pattern_cluster(), which is similar to gen_pattern() used in algorithm
MBA, is then called (Loop 4) to generate candidate itemsets of size k, i.e. Ck, by
using the frequent itemsets of the previous pass, i.e. (k-1). Each c ∈ Ck is then tested
(by using Property 4.1) whether it is frequent or not, by searching it in all those
clusters ωi for which i ≥ k-1. In contrast to gen_pattern(), this function scans only
those transactions which are statistically significant (Lemma 4.4). Thus, during each
higher pass the number of scanned transactions reduces drastically. For each itemset,
occurrences are counted, and an itemset for which item.count ≥ min_supp is a
frequent itemset and is appended to Lk, the set of k-frequent itemsets. At the end of
pass k, Lk is appended to L. This process is repeated while Lk ≠ φ AND k ≤ γ. At the
end, L gives the frequent itemsets.

4.2.2.1 Performance Evaluation for CBA

For testing the performance of CBA (Cluster Based Association pattern mining
algorithm), experiments similar to those for MBA were performed. The following
tables show the execution times taken by the algorithm for different datasets.

Table 4.6: Execution times in seconds for CBA

                              Minimum Support
    Dataset            0.01%       0.51%       1.01%
    T12.I8.D70K          265         248         242
    T12.I11.D70K         519         296         283

Table 4.7: Execution times for data having 8 frequent
items at minimum support of 0.01%

                            Number of Transactions in K
    Algorithm   15   20   25   30   35   40   45   50   55   60   65   70
    CBA        122  138  147  162  169  179  193  202  215  229  242  267

Table 4.8: Execution times for data having 5 frequent
items at minimum support of 0.01%

                            Number of Transactions in K
    Algorithm   15   20   25   30   35   40   45   50   55   60   65   70
    CBA         70   79   88   99  102  111  123  133  143  156  174  193
Table 4.9: Execution times at minimum support of 0.01%

                                        Number of Frequent Items
    Transactions   Algorithm     3     4     5     6     7     8    11    17
    70K            CBA          93   124   161   176   217   267   519   713
                   MBA          49    76   125   313   620   859  3091  4085

Table 4.10: Execution times in seconds for T7.I10.D5K (Dense dataset)

                         Minimum Support
    Algorithm    30%     40%     50%     60%     70%     80%
    CBA         1156     587     323     153      80      71

Table 4.6 gives the results obtained during experiments conducted to see the effect of
minimum support on the execution time for a given dataset. The results indicate that
the execution time of CBA, like that of MBA, increases with a decrease in minimum
support; however, the rate of increase with decreasing support is lower than for MBA.
This was as expected, because at a reduced minimum support there are more
candidate itemsets to test. Table 4.6 also indicates that the execution times at supports
of 0.51% and 1.01% were approximately the same; the reason is that the numbers of
candidate itemsets generated and tested at these supports were approximately the
same. Table 4.7 shows that for a given number of frequent items (eight) and a
specified minimum support of 0.01%, the execution time increases continuously as
the size of the dataset (number of transactions) increases. Similar trends at a lower
number of frequent items (five) and the same minimum support of 0.01% are also
noticed in Table 4.8. However, the rate of increase is more rapid for the case given in
Table 4.7, which may be attributed to the increase in the number of frequent items.

Table 4.9 depicts the effect of the number of frequent items on the execution time at
the given minimum support. It is noticed that the execution time increases as the
number of frequent items increases. However, the rate of increase is much lower than
for MBA. Further, it is to be noted (Table 4.9) that when there are more than 5
frequent items, CBA becomes more efficient than MBA, and the performance gap
increases with the number of frequent items in the data at the given minimum
support. As both MBA and CBA use the same candidate itemset generation and
testing (counting) techniques, the improvement in the execution times for CBA is
mainly due to the reduction in scanned database size during the higher passes. It is
noticed that MBA scanned the same size of database during each successive pass,
while CBA reduced the scanned database size progressively during each higher pass.
However, during pass 1 both MBA and CBA scanned the same size of database, and
the time required by MBA is less than that of CBA, as shown in Figure 4.1. This was
expected, because during this pass CBA also performed some computational effort
for making the clusters. The performance gap between MBA and CBA increases with
higher passes.

Table 4.10 gives the results obtained during experiments conducted on dense data to
see the effect of minimum support on the execution time. The results indicate that
execution time increases with a decrease in minimum support. Further, the execution
time required, even for data having only a few thousand transactions, is very large
compared to sparse data at the reduced minimum support. This may be attributed to
the combinatorial explosion of itemsets, making the algorithm impracticable for dense
data.

[Figure: two line charts of time in seconds versus pass number, (a) for T12.I8.D70K
and (b) for T12.I11.D70K, each comparing MBA and CBA]

Figure 4.1: Time taken by MBA and CBA in different passes (min_supp = 0.01)

4.2.3 Hybrid Association Pattern Mining Algorithm (Hybrid)

Traditional association pattern mining algorithms have adopted an Apriori-like
candidate-set generation-and-test approach (the level-wise bottom-up approach).
These algorithms always start by searching for frequent items of size one (known as
1-frequent items) and sequentially reach the frequent itemsets of the largest size
present in the database. This approach works well when the database is not dense.
To mine dense data efficiently, a novel algorithm called Hybrid is presented below.

The Hybrid algorithm is complete and based on a vertical data layout. It uses a hybrid
approach, a combination of the bottom-up and top-down approaches, for searching
frequent itemsets. The initial steps of the algorithm are the same as in the proposed
algorithms MBA and CBA. After computing the value of γ at the given minimum
support, the algorithm starts searching for a frequent itemset of this maximum size,
and the search continues until one frequent itemset of the largest size present in
the data is encountered. This frequent pattern of maximum size is a maxpattern, and
is called the maxmaxpattern (Pmc) in this work. Mining of the maxmaxpattern uses a
level-wise top-down approach. For mining the remaining frequent itemsets of the
database (all these frequent itemsets will contain at least one item from (L1 - Pmc)), a
level-wise bottom-up approach is used. The formal description of the algorithm that
incorporates the above ideas is given below.

Algorithm: Hybrid association pattern mining algorithm (Hybrid).

Input:
Data file, D
Minimum support, min_supp

Output:
Maxpatterns, L

Process:
begin
//Step 1
L1 = φ; //L1 is the set of 1-frequent items
L = φ;
//Step 2
L1 = identify_frequent_item(count_table, min_supp);
//Step 3
if |L1| < 2 then
print “No association pattern exists at given support”
exit(0);
else
ξ = make_concentrator(D, L1);

//Construction of the Concentrator
//Step 4
forall l ∈ L1 do
if l.count < min_supp then
//Detection of false 1-frequent items
L1 = L1\ l ;
//Filtering of false 1-frequent items from L1
endif;
endfor;
//Step 5
if |L1| < 2 then
print “No association pattern exists at given support”
exit(0);
else
γ = predict_size(ξ, min_supp);
//Predicts the size of the maximum frequent
//pattern(s) which may be present in the database
//at the given minimum support

//Step 6
Pmc = combined_pattern(ξ, L1, γ);

//Step 7
Z = L1;
forall e ∈ Pmc do
Z = Z \ e;
endfor;
//Z now holds L1 \ Pmc
L = L ∪ Pmc ∪ L1;
forall e ∈ Pmc do
L1 = L1 \ e;
endfor;

first = true;
//Step 8
for (k = 2; Lk-1 ≠ φ AND k ≤ γ; k++) do begin
if first = true then
Ck = incr_gen_pattern1(Z, ξ, L1);
first = false;
else
Ck = incr_gen_pattern(Lk-1, ξ);
endif;
forall c ∈ Ck do
//Tests whether a candidate itemset is frequent
if c.count ≥ min_supp then
Lk = Lk ∪ c;
endif;
endfor;
L = L ∪ Lk ;
endfor;
//Step 9
L = maxpattern(L);
endif;
endif;
end;

function combined_pattern(ξ, L1, γ)

for (; γ ≥ 2; γ--) do begin


forall s ∈ subset(L1, γ) do
if supp(s, ξ) ≥ min_supp then
return s;
exitfunction;

endif;
endfor;
endfor;

function maxpattern(L);

// Generate maxpatterns from given frequent itemsets

Lmax = φ; // Lmax - set of maxpatterns

L = L ordered by pattern size in descending order;


// Loop 1
forall l ∈L do begin
flag = false;
// Loop 2
forall l ’∈ Lmax do begin
if l ⊆ l ’ then
flag = true;
exitfor;
endif;
endfor;
if flag = false then
Lmax= Lmax∪ l ;
endif;
endfor;
return Lmax;

It is to be noted that incr_gen_pattern1() and incr_gen_pattern() are similar to
gen_pattern() described earlier. A brief description of each step of the algorithm is
given below.

Steps 1-5: These steps of the algorithm are the same as in MBA and CBA and
correspond to Phases 2 to 5 of the proposed Flexible Uniform Framework.

Step 6: During this step function combined_pattern() is called. The input parameters
to the function are the Concentrator (ξ), the 1-frequent item set (L1), and the predicted
size of the largest frequent itemset (γ). The function searches for any one frequent
itemset of maximum size (having the greatest support amongst the frequent itemsets
of this size) present in the Concentrator at the given min_supp, and assigns it to Pmc.
The function works as follows:
The function works as follows:

First, it searches whether any itemset of the predicted size γ is frequent. For this, a
sub-function subset() is called, which generates the patterns of size γ one by one
(from the items of L1); each is then tested for frequency. subset() generates first
those itemsets which have a higher potential to become frequent. As soon as a
frequent pattern of this size is encountered, the function terminates; otherwise the
current value of γ is decremented by 1, i.e. γ = γ - 1. The above steps are then
repeated for this new value of γ, provided the current value of γ is at least 2.
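
A sketch of this top-down search, reusing the bit-vector support() helper from Section 4.2.1 and approximating the ordering heuristic of subset() by trying higher-support items first, might look as follows; it is a sketch under those assumptions, not the thesis implementation.

    from itertools import combinations

    def combined_pattern(beta, min_count, gamma):
        """Return the first frequent itemset found, trying sizes gamma..2."""
        items = sorted(beta, key=lambda i: support((i,), beta), reverse=True)
        for size in range(gamma, 1, -1):        # level-wise top down
            for s in combinations(items, size):
                if support(s, beta) >= min_count:
                    return tuple(sorted(s))     # the maxmaxpattern Pmc
        return None

    print(combined_pattern(beta, 2, 3))  # -> ('B', 'C', 'D') for the Table 4.11 data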

Step 7: During this step the difference of L1 and Pmc is obtained and stored in Z. The
maxmaxpattern (Pmc) and the members of L1 are appended to L. Then the items
which constitute Pmc are deleted from L1.

Step 8: This step generates all the frequent itemsets containing at least one item
from Z (obtained in Step 7) and remaining items from Pmc. For each item i ∈ Z,
2-candidate itemsets are generated by calling function incr_gen_pattern1(). For
instance, if z1 ∈ Z then the 2-candidate itemsets due to z1 can be obtained by making
pairs (z1, e), where e is a 1-frequent item and e ∈ Pmc. Then z1 is appended to Pmc
and deleted from Z, and the same procedure is repeated for the remaining items of Z.
Each 2-candidate itemset is then tested for frequency; if it is frequent, it is appended
to L2. At the end, L2 is appended to L. Afterwards, the frequent itemsets of size
(k-1), where k ≥ 3, are used for making candidate itemsets of size k by calling
function incr_gen_pattern(), whose working is similar to that of gen_pattern()
described earlier. Each candidate itemset (of size k) is tested; if it is frequent, it is
appended to Lk. At the end, Lk is appended to L. This process is repeated till
| Lk | ≥ 2.

Step 9: During this step maxpatterns are computed by calling maxpattern(). Brief
description of the function is given below.

The input parameter to the function is L, and it returns Lmax. First, Lmax is
initialized to null. L is arranged in descending order on the basis of the size of the
frequent patterns, i.e. the number of items in each frequent pattern. Each element
l ∈ L is then picked up (Loop 1) with flag set to false, and tested whether it is a
subset of any pattern l’ ∈ Lmax (Loop 2). If it passes the test then flag is set to true
and Loop 2 is exited. If flag is false then l is appended to Lmax. At the end the
function returns Lmax, the set of maxpatterns.
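
A compact sketch of this subset filtering, assuming frequent itemsets are kept as tuples, is given below; sorting by descending length mirrors the ordering step of the pseudocode.

    def maxpatterns(L):
        """Keep only itemsets not contained in a longer accepted itemset."""
        L_max = []
        for l in sorted(L, key=len, reverse=True):
            if not any(set(l) <= set(m) for m in L_max):
                L_max.append(l)
        return L_max

    freq = [("B", "C", "D"), ("A", "C"), ("B", "C"), ("A",), ("B",), ("C",), ("D",)]
    print(maxpatterns(freq))  # -> [('B', 'C', 'D'), ('A', 'C')]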

To demonstrate the working of the algorithm, an example is outlined below.

Let the item universe be I = {A, B, C, D, E} and the transaction universe be
Tid = {100, 200, 300, 400}.

In Table 4.11, 100, 200, 300, and 400 are the unique identifiers of the four
transactions. Let min_supp = 50% (to be frequent, an itemset must occur in at least 2
transactions).

Table 4.11: A transaction database

Tid Items
100 A, C, E
200 B, D
300 A, B, C, D
400 B, C, D

Each row in the above table can be considered as a transaction. The occurrence of
each item in the database is counted. Table 4.12 shows the count value of each item
present in the database.

Table 4.12: Frequency counts of items

Item Count
A 2
B 3
C 3
D 3
E 1

Now trace the steps of algorithm Hybrid as follows:

1. L1 and L are initialized to null.

2. Function identify_frequent_item() is called. It returns L1 = {A, B, C, D} as the set
of 1-frequent items at the given minimum support of 50%.

3. The size of L1 is tested. Since |L1| ≥ 2, the Concentrator ξ is constructed as
shown in Table 4.13. Note that item ‘E’ is not present in the Concentrator
because it is not frequent.

Table 4.13: Concentrator, ξ

Tid A B C D
100 1 0 1 0
200 0 1 0 1
300 1 1 1 1
400 0 1 1 1
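
For illustration, a Concentrator like Table 4.13 can be built column-wise as the bit vectors used by the support() sketch of Section 4.2.1; the following fragment is a sketch under that assumption, not the thesis implementation of make_concentrator().

    def make_concentrator(transactions, frequent_items):
        """One column (bit vector) per frequent item; bit t <=> transaction t."""
        beta = {item: 0 for item in frequent_items}
        for t, items in enumerate(transactions):
            for item in items:
                if item in beta:          # infrequent items (here 'E') drop out
                    beta[item] |= 1 << t
        return beta

    db = [{"A", "C", "E"}, {"B", "D"}, {"A", "B", "C", "D"}, {"B", "C", "D"}]
    beta = make_concentrator(db, ["A", "B", "C", "D"])
    # beta == {'A': 0b0101, 'B': 0b1110, 'C': 0b1101, 'D': 0b1110}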

4. Now false frequent items are searched for by using Lemma 3.1. Note that no
false frequent item exists in the Concentrator at the given minimum support.

5. Function predict_size() is then called. This function returns γ, the maximum
size of frequent itemset(s) which may be present in the database at the given
minimum support. For the example database the value of γ is three (3) at the
given min_supp of 50%.

6. During this step, one frequent itemset of maximum size present in the database,
for the given value of min_supp, is searched for by calling function
combined_pattern(). This function calls a sub-function subset(), which generates
the subsets of size γ from the 1-frequent set L1, i.e. subsets of size 3 from
{A, B, C, D}. These subsets of L1 are generated and tested one by one. Let the
first pattern generated by subset() be {B, C, D}. This pattern is tested for
frequency. Since the count value of {B, C, D} is 2, which is equal to the
minimum support, {B, C, D} is a maxmaxpattern (Pmc).

7. During this step L1 \ Pmc (i.e. {A, B, C, D} \ {B, C, D}) is computed and the
result is stored in Z, i.e. Z = {A}. Now {B, C, D} (the maxmaxpattern) and {A},
{B}, {C}, and {D} (the members of L1) are appended to L. Then the items B, C,
and D (the constituents of Pmc) are deleted from L1.

8. This step generates all the frequent itemsets having at least one item from Z
(obtained in step 7) and remaining items from Pmc. For the given case, ‘A’ is the
only element present in Z. Candidate itemsets of size 2 are generated by calling
function incr_gen_pattern1(), i.e. C2 = {{A,B}, {A,C}, {A,D}}. These itemsets
are then tested. It is to be noted that {A,C} (having a support of 50%) is the only
frequent itemset; it is assigned to L2. Since the size of L2 is one, the loop
terminates and the contents of L2 are appended to L.

9. During this step maxpatterns are generated from all the frequent itemsets
available in L. The result is stored in L. Thus, L = {BCD, AC}.

4.2.3.1 Performance Evaluation for Hybrid

Table 4.14 shows the execution times taken by the Hybrid algorithm when tested on
dense data (T7.I10.D5K) at different values of minimum support. It is noticed from
the table that the execution time of the Hybrid algorithm decreased with decreasing
minimum support, while the execution times of MBA (Table 4.5) and CBA (Table
4.10) increased with decreasing minimum support. The results of MBA and CBA
have been reproduced in Table 4.14 for comparison.

The reduction in execution time for Hybrid with decreasing minimum support was as
expected, because at a reduced minimum support there are better chances of finding a
maxmaxpattern (Pmc) of larger size; thus fewer candidate itemsets are generated and
tested. This is the reason the Hybrid algorithm took the lead up to a minimum support
value of 50%. Beyond that, the performance of Hybrid degraded, for the same reason,
while MBA and CBA improved. It is also observed that, at a given minimum support,
the Hybrid algorithm was most efficient when the predicted size of the maximum-size
frequent itemset, the size of L1, and the average number of frequent items per
transaction were close to each other.

Table 4.14: Execution times in seconds for MBA, CBA, and Hybrid

                                       Minimum Support
    Dataset           Algorithm    30%     40%     50%     60%     70%     80%
    T7.I10.D5K        MBA         1039     547     262     129      63      35
    (Dense dataset)   CBA         1156     587     323     153      80      71
                      Hybrid        19      76     111     185     270     314

4.3 Extensions of Association Pattern Mining

The algorithms presented in Section 4.2 generate all association patterns satisfying
the given confidence and support requirements. This section presents algorithms,
based on the proposed FUF, which either generate association patterns under other
requirements or extend the basic definition of what an association pattern is.

4.3.1 Mining of Generalized Association Pattern

Items present in the database can be generalized to the required concept level
according to the approach discussed in Section 3.4.2. Mining can then be performed
on these generalized items by using any appropriate algorithm presented in Section
4.2 of this chapter.

4.3.2 Mining Multi-dimensional Database

Business and scientific domains may have multi-dimensional databases with


quantitative and categorical attributes. This sub-section discusses how the algorithms
proposed in Section 4.2 can be extended for mining association patterns in such
databases.

The quantitative association rule problem can be mapped to the Boolean association
rule problem. If all attributes are categorical, or the quantitative attributes have only a
few values, this mapping is straightforward. Conceptually, instead of having just one
field in the table for each attribute, there are as many fields as the number of attribute
values. For instance, the categorical attribute (X, {1, 2, 3, 4}) can be transformed into
the following set of Boolean attributes: {(X_1, {0, 1}), (X_2, {0, 1}), (X_3, {0, 1}),
(X_4, {0, 1})} such that X_i = 0 if X ≠ i and X_i = 1 if X = i. If the domain of values
for a quantitative attribute is large, then the static discretization approach given below
is adopted.
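
A small sketch of these two mappings (Boolean fields for a categorical attribute, and range labels for a quantitative one) is given below; the attribute names and ranges are purely illustrative.

    def booleanize(record, attribute, values):
        """(X, {1,2,3,4}) -> fields X_1..X_4 with X_i = 1 iff X == i."""
        return {f"{attribute}_{v}": int(record[attribute] == v) for v in values}

    def discretize(value, bins):
        """Map a numeric value to the label of the first range containing it."""
        for low, high, label in bins:
            if low <= value < high:
                return label

    age_bins = [(0, 18, "age_0_17"), (18, 40, "age_18_39"), (40, 200, "age_40_plus")]
    rec = {"X": 3, "age": 27}
    print(booleanize(rec, "X", [1, 2, 3, 4]))  # {'X_1': 0, 'X_2': 0, 'X_3': 1, 'X_4': 0}
    print(discretize(rec["age"], age_bins))    # age_18_39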

Prior to actual mining, all quantitative attributes of interest present in the database are
identified and discretized using predefined concept hierarchies, where numeric values
are replaced by ranges. The discretized numeric attributes, with their values, can then
be treated as categorical attributes, where each range is considered a category. On the
resultant data the proposed algorithms for finding association patterns can be applied
to find all frequent predicate sets (rather than frequent itemsets). The problem of
mining a multi-dimensional database containing categorical and/or quantitative
attributes has been decomposed into the steps given below.

1. Identify the quantitative and categorical attributes of interest present in the
database to be mined.

2(a) For each quantitative attribute obtained in step 1, determine the number of
partitions to be made for the domain of the attribute.

For quantitative attributes that are not partitioned into intervals, the values are
mapped to consecutive integers such that the order of the values is preserved.
If a quantitative attribute is partitioned into intervals, the intervals are mapped
to consecutive integers, such that the order of the intervals is preserved.

2(b) For each categorical attribute obtained in step 1, map the values of the
attribute to a set of consecutive integers. In this step categorical attributes may
also be generalized to higher conceptual levels if desired.

3. Find the association patterns by using an algorithm proposed in Section 4.2
(select the algorithm according to the database characteristics).

4. Generate the association rules from the association patterns obtained in step 3.

4.3.3 Mining of Maxmaxpattern

A maxmaxpattern is a maximum-size association pattern present in the database,
having the maximum support amongst the association patterns of the same size at the
given minimum support. Mining such a pattern is an important data mining task in the
business domain. For instance, a mobile service provider may offer 20 different paid
services to its valued customers, and may be interested in knowing which group of
services its customers like most at different values of minimum support. Such a
database is generally dense. The mining of the maxmaxpattern in such a database can
be done by modifying algorithm Hybrid (by deleting Step 7 onwards).

4.3.4 Incremental Association Pattern Mining

Since the mining of association patterns is computationally expensive, an efficient
approach to maintain the frequent itemsets is needed when the database is updated. A
database update occurs on account of the insertion/deletion of transactions or
attributes in the original database. Several algorithms [20,27,28,97] are available in
the literature addressing the first issue. However, the second issue, i.e. the addition of
attributes (deletion of attributes is straightforward), has not been addressed, to the
best of the author's knowledge. The same has been taken up in this work. One simple
but inefficient solution to the problem is to rerun the mining algorithm on the updated
database. In contrast, the proposed approach generates the association patterns that
appear due to the incremental database alone, without recomputing the association
patterns of the original database. A formal presentation of the proposed algorithm,
along with a description of each step, is given below.

Algorithm: Algorithm for mining Incremental Association Patterns (IAP).

Input:
Concentrator of the original database (D), ξ

Incremental data, ∆d
Minimum support, min_supp

Output:
Frequent itemsets due to incremental attributes, L∆d

Process:
begin
//Step 1
L1∆d = φ; //L1∆d is the set of 1-frequent items due to ∆d
L∆d = φ;
//Step 2

L1∆d = identify_frequent_item(incr_count_table, min_supp);
//Step 3
if | L1∆d | < 1 then
print “No association pattern exists due to incremental database”
exit(0);
else
ξ∆d = make_concentrator_incr(∆d, L1∆d);
//Construction of the Concentrator for ∆d
//Step 4
forall l ∈ L1∆d do
if l.count < min_supp then
//Detection of false 1-frequent items
L1∆d = L1∆d \ l ;
//Filtering of false 1-frequent items from L1∆d
endif;
endfor;
//Step 5
if | L1∆d | < 1 then
print “No association pattern exists for given support”
exit(0);
else
ξ’∆d = join(ξ, ξ∆d);
L∆d = L1∆d;
first = true;
//Step 6
for (k = 2; Lk-1∆d ≠ φ; k++) do begin
if first = true then
Ck∆d = incr_gen_pattern1(Lk-1∆d, ξ’∆d, L1);
first = false;
else
Ck∆d = incr_gen_pattern(Lk-1∆d, ξ’∆d);
endif;
forall c ∈ Ck∆d do
// Test whether a candidate itemset is frequent
if c.count ≥ min_supp then
Lk∆d = Lk∆d ∪ c;
endif;
endfor;
L∆d = L∆d ∪ Lk∆d;
endfor;
endif;
endif;
end;

function incr_gen_pattern1(Lk-1∆d, ξ’∆d, L1)

Ck∆d = φ;
for (i = 1; i ≤ |Lk-1∆d|; i++) do begin
for (j = 1; j ≤ |L1|; j++) do begin
c.item = ti + tj;
// ti is the ith row of Lk-1∆d
// tj is the jth row of L1
c.count = supp(c.item, ξ’∆d);
Ck∆d = Ck∆d ∪ c;
endfor;
L1 = L1 ∪ ti;
endfor;
return Ck∆d;

function incr_gen_pattern(Lk-1∆d, ξ’∆d)

Ck∆d = φ;
for (i = 1; i ≤ |Lk-1∆d| - 1; i++) do begin
for (j = i+1; j ≤ |Lk-1∆d|; j++) do begin
if left(ti, |ti| - 1) == left(tj, |tj| - 1) then
// ti is the ith row of Lk-1∆d
// tj is the jth row of Lk-1∆d
c.item = left(ti, |ti| - 1) + right(tj, 1);
c.count = supp(c.item, ξ’∆d);
Ck∆d = Ck∆d ∪ c;
else
exitfor;
endif;
endfor;
endfor;
return Ck∆d;

Brief description of the algorithm is given below.

Step 1: This is the initialization step, where L1∆d (the set of 1-frequent items due to
∆d) and L∆d are initialized to null.

Step 2: During this step function identify_frequent_item() is called to find the
1-frequent items (L1∆d) present in the incremental database (∆d) at the user-defined
minimum support. The input parameters to this function are min_supp and
incr_count_table (the structure of this table is the same as that of count_table
described earlier in Section 3.3.1). The description of this function is given in Section
3.3.2 of Chapter 3.

Step 3: This step finds the size of L1∆d, i.e. the number of 1-frequent items present in
the incremental database. If the size of L1∆d is less than one then no association
pattern exists due to the incremental database at the given support and the algorithm
terminates; otherwise function make_concentrator_incr() is called. The input
parameters to this function are ∆d (the incremental database) and L1∆d (the 1-frequent
items of the incremental database). This function is a slightly modified form of the
function make_concentrator() described in Section 3.3.3 of Chapter 3. The function
reads all transactions of the incremental database and, for each transaction containing
frequent items from L1∆d, puts a 1 in location (ti, j) of ξ∆d for each element j ∈ L1∆d
(ti is the ith transaction, or row, of the Concentrator ξ∆d). Thus, the Concentrator ξ∆d
contains |L1∆d| columns, and each column is a bit vector corresponding to a particular
element of L1∆d. The bit vector associated with item i is denoted βi, and the number
of 1s in bit vector βi by βi(1).

Step 4: During this step false frequent items are detected and filtered from L1∆d. The
approach used for the detection and filtering of false frequent items is the same as
that used in the algorithms MBA and CBA. A description of this step is given in
Section 3.3.4 of Chapter 3.

Step 5: After detecting and filtering false frequent items from L1∆d, the size of L1∆d is
tested again. If the size is less than one, then the algorithm terminates, since no
frequent patterns are present due to the incremental database. Otherwise the
Concentrators ξ and ξ∆d are joined, and the resultant Concentrator is designated ξ’∆d.

Step 6: This step generates all the frequent itemsets containing at least one item from
the incremental attributes, i.e. from L1∆d, and remaining attributes from L1. For each
element i ∈ L1∆d, 2-candidate itemsets are generated by calling function
incr_gen_pattern1(). For instance, if z1 ∈ L1∆d then the 2-candidate itemsets due to z1
can be obtained by making pairs (z1, e), where e is a 1-frequent item and e ∈ L1.
Then z1 is appended to L1, and the same procedure is repeated for the remaining
items of L1∆d. Each 2-candidate itemset is then tested for frequency; if it is frequent,
it is appended to L2∆d. At the end, L2∆d is appended to L∆d. Afterwards, the frequent
itemsets of size (k-1), where k ≥ 3, are used for making candidate itemsets of size k
by calling function incr_gen_pattern(), whose working is similar to that of
gen_pattern() described earlier. Each candidate itemset (of size k) is tested; if it is
frequent, it is appended to Lk∆d. At the end, Lk∆d is appended to L∆d. This process is
repeated till | Lk∆d | ≥ 2. At the end, L∆d contains all the frequent itemsets due to the
incremental attributes.
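
A sketch of this pairing step, under the same bit-vector assumptions as the earlier sketches (support() counts the 1s in the AND of the joined Concentrator's columns), might look as follows; the function name mirrors the pseudocode's incr_gen_pattern1(), but the body is illustrative only.

    def incr_gen_pattern1(new_items, old_items, beta_joined, min_count):
        """Pair each incremental item with the old items, so every generated
        2-itemset contains at least one incremental attribute."""
        old = list(old_items)
        L2 = []
        for z in new_items:
            L2 += [(z, e) for e in old
                   if support((z, e), beta_joined) >= min_count]
            old.append(z)        # later incremental items may also pair with z
        return L2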

4.3.4.1 Performance Evaluation for IAP

For testing the performance of the Incremental Association Pattern mining algorithm
(IAP), seven frequent items of dataset T12.I11.D70K were selected to make the
original data (D). The remaining frequent items (four) of this dataset were used to
make the incremental data (∆d). The execution time required due to the incremental
attributes was then recorded, as shown in Table 4.15.

Table 4.15: Execution times in seconds at minimum support of 0.01%

                                      Number of Incremental Attributes
    Dataset          MBA     CBA        1      2      3      4
    T12.I11.D70K    3091     519       35     66    109    181

The table indicates that the time required for mining frequent itemsets increased with
the increase in the number of frequent items in the incremental data. It is also noted
that the time required for four incremental attributes was only 181 seconds, while
MBA and CBA took 3091 and 519 seconds respectively when executed on the entire
T12.I11.D70K data. A similar experiment was also performed on dataset
T12.I17.D70K. The result of this experiment is summarized in Table 4.16.

Table 4.16: Execution times in seconds at minimum support of 0.01%

                                      Number of Incremental Attributes
    Dataset          MBA     CBA        1      2      3      4
    T12.I17.D70K    4085     713      158    375    637    782

It was also observed that the performance gap increased with an increasing number of
attributes in the original data.

