[Figure 4: running time (log scale) and memory cost of UApriori, UH-Mine and UFP-growth; panels (a)-(h) vary min_esup, panels (i)-(j) vary the number of transactions (K), and panels (k)-(l) vary the Zipf skew.]
UH-Mine increases smoothly. Moreover, as in the discussion of the running time, UFP-growth is the most memory-consuming of the three algorithms.

Scalability. We further analyze the scalability of the three expected support-based algorithms. In Figure 4(i), varying the number of transactions in the dataset from 20k to 320k, we observe that the running time is linear. With the increase of the size of the dataset, the running time of UApriori stays close to that of UH-Mine. This is reasonable because all the items in T25I15D30k have similar distributions; therefore, as transactions are added, the running time of each algorithm increases linearly. Figure 4(j) reports the memory usage of the three algorithms, which likewise grows linearly with the number of transactions. Moreover, the memory usage of UApriori grows more steadily than that of the other two algorithms. This is because UApriori does not need to build a special data structure to store the uncertain database, whereas the other two algorithms must spend extra memory on their data structures.
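For concreteness, the sketch below (ours, not the authors' implementation; the item-to-probability dictionary layout and the function names are assumptions) shows the expected-support computation that all three algorithms share, wrapped in a UApriori-style level-wise loop. Note that the database is scanned in place, which is why UApriori needs no special data structure and its memory grows so steadily.

```python
def expected_support(db, itemset):
    """Expected support under the independent-existence model: the sum,
    over transactions, of the product of the items' probabilities."""
    esup = 0.0
    for txn in db:  # txn is a dict mapping item -> existence probability
        p = 1.0
        for item in itemset:
            p *= txn.get(item, 0.0)
            if p == 0.0:
                break
        esup += p
    return esup

def uapriori(db, min_esup):
    """Level-wise search over expected support. Candidate generation is
    kept naive for brevity; the point is that the database itself is
    scanned in place, with no extra structure built over it."""
    items = {item for txn in db for item in txn}
    frequent = {frozenset([i]) for i in items
                if expected_support(db, [i]) >= min_esup}
    result, k = set(frequent), 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        frequent = {c for c in candidates
                    if expected_support(db, c) >= min_esup}
        result |= frequent
        k += 1
    return result
```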
Effect of the Zipf distribution. To verify the influence of uncertainty under different distributions, Figures 4(k) and 4(l) show the running time and the memory cost of the three algorithms in terms of the skew parameter of the Zipf distribution. We can observe that the running time and the memory cost decrease with the increase of the skew parameter. Specifically, when the skew parameter increases, more items are assigned probabilities close to zero, so there are fewer frequent itemsets, and UH-Mine gradually outperforms the other two algorithms.

Conclusions. To sum up, under the current definition of expected support-based frequent itemset, there is no clear winner among the three expected support-based frequent itemset mining algorithms. Specifically, under the condition of dense datasets and higher min_esup, UApriori spends the least time and memory; otherwise, UH-Mine is the winner. Moreover, UFP-growth is often the slowest algorithm and spends the most memory, since its UFP-tree has only limited shared paths and it therefore wastes time and memory on redundant recursive computation. Finally, the influence of the Zipf distribution is similar to that of a very sparse dataset: under the Zipf distribution, the UH-Mine algorithm usually performs very well.
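As background for the skew parameter used above, the following is one plausible generator (an illustration only; the paper does not specify its Zipf data generator in this section) for an uncertain database whose item popularity follows a Zipf law. A larger skew concentrates the probability mass on a few head items, which is why fewer itemsets stay frequent as the skew grows.

```python
import random

def zipf_weights(num_items: int, skew: float) -> list[float]:
    """Zipf weights: the i-th most popular item gets weight 1 / i**skew."""
    weights = [1.0 / (i ** skew) for i in range(1, num_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def generate_uncertain_db(num_transactions: int, num_items: int,
                          skew: float, avg_len: int = 10,
                          seed: int = 0) -> list[dict[int, float]]:
    """Each transaction maps item -> existence probability. Higher skew
    concentrates the draws on a few items, so tail items rarely appear
    and fewer itemsets become frequent."""
    rng = random.Random(seed)
    weights = zipf_weights(num_items, skew)
    items = list(range(num_items))
    db = []
    for _ in range(num_transactions):
        chosen = rng.choices(items, weights=weights, k=avg_len)
        db.append({item: rng.uniform(0.5, 1.0) for item in set(chosen)})
    return db
```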
In this section, we compare the four exact probabilistic frequent itemset mining algorithms: DPNB, DCNB, DPB and DCB. Firstly, we show the running time and the memory cost in terms of changing min_sup. Then, we present the influence of pft on the running time and the memory cost. Moreover, the scalability of the four algorithms is studied. Finally, we report the influence of the skew of the Zipf distribution as well.

Effect of min_sup. Figures 5(a) and 5(c) show the running time of the four competing algorithms w.r.t. min_sup in the Accident and Kosarak datasets, respectively. With the Chernoff-bound-based pruning, we can see that DCB is always faster than DPB; likewise, without the pruning, DCNB is always faster than DPNB. This is reasonable because the divide-and-conquer-based algorithms (DCB and DCNB) compute the frequent probability of each itemset in O(n log^2 n) time, which is better than the O(n × min_sup) time complexity of the dynamic-programming-based algorithms (DPB and DPNB). Moreover, we can also observe that DPB is faster than DCNB. These results show that most of the infrequent itemsets can be filtered out quickly by the Chernoff-bound-based pruning: only a small number of itemsets need their frequent probabilities computed exactly, and when min_sup is high, most of the infrequent itemsets are already pruned by the Chernoff bound.
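To make the comparison concrete, here is a minimal sketch (our illustration, not the authors' implementation) of the two ways of computing an itemset's frequent probability; `probs` is assumed to hold the itemset's appearance probability in each transaction, and `numpy.convolve` stands in for the FFT-based convolution that yields the divide-and-conquer bound.

```python
import numpy as np

def freq_prob_dp(probs, min_sup):
    """Dynamic-programming style (DPB/DPNB): O(n * min_sup).
    dist[k] = P(support == k) for k < min_sup, while dist[min_sup]
    accumulates P(support >= min_sup). Requires min_sup >= 1."""
    dist = [0.0] * (min_sup + 1)
    dist[0] = 1.0
    for p in probs:
        dist[min_sup] += dist[min_sup - 1] * p
        for k in range(min_sup - 1, 0, -1):
            dist[k] = dist[k] * (1 - p) + dist[k - 1] * p
        dist[0] *= 1 - p
    return dist[min_sup]

def support_pmf_dc(probs):
    """Divide-and-conquer style (DCB/DCNB): split the transactions,
    recurse, and convolve the two halves' support distributions."""
    if len(probs) == 1:
        return np.array([1.0 - probs[0], probs[0]])
    mid = len(probs) // 2
    return np.convolve(support_pmf_dc(probs[:mid]),
                       support_pmf_dc(probs[mid:]))

def freq_prob_dc(probs, min_sup):
    return float(support_pmf_dc(probs)[min_sup:].sum())
```

Both functions return the same value; the pruning variants (DPB and DCB) simply skip this computation for itemsets that the Chernoff bound already rejects.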
[Figure 5: running time (log scale) and memory cost of DPNB, DPB, DCNB and DCB; panels (a)-(d) vary min_sup, panels (e)-(h) vary pft, panels (i)-(j) vary the number of transactions (K), and panels (k)-(l) vary the Zipf skew.]
In addition, according to Figures 5(b) and 5(d), it is very clear that DPB and DPNB require less memory than DCB and DCNB. This is reasonable because both DCB and DCNB trade memory for efficiency through their divide-and-conquer strategy. We can also observe that the memory usage of DCNB changes sharply as min_sup decreases, because there are few frequent itemsets when min_sup is high and most infrequent itemsets are filtered out by the Chernoff-bound-based pruning. In particular, similar observations w.r.t. min_sup hold in both the dense and the sparse datasets, which indicates that the density of the database is not the key factor affecting the running time and the memory usage of the exact probabilistic frequent algorithms.

Effect of pft. Figures 5(e) and 5(g) report the running time w.r.t. pft. We can find that DCB is still the fastest algorithm and DPNB is the slowest one. Different from the results w.r.t. min_sup, DCNB is always faster than DPB when pft varies. Additionally, Figures 5(f) and 5(h) show the memory cost w.r.t. pft. The memory usage of both DPB and DPNB is always significantly smaller than that of DCB and DCNB. Furthermore, varying pft, the changing trends of the running time and the memory cost are quite stable; thus pft does not have a significant impact on either measure, because the frequent probabilities of most itemsets lie far from the threshold, as further explained in the next subsection.
Scalability. Similar to the scalability analysis in Section 4.2, we further use the T25I15D320k dataset to test the scalability of each exact probabilistic frequent itemset mining algorithm. In Figure 5(i), we can find that the running time of all four algorithms grows linearly with the number of transactions. In particular, the trends of DCB and DCNB are smoother than those of DPB and DPNB, because the O(n log^2 n) complexity of computing the frequent probability in DCB and DCNB is better than that of DPB and DPNB. In Figure 5(j), we can observe that the memory cost of the four algorithms also varies linearly w.r.t. the number of transactions.

Effect of the Zipf distribution. Figures 5(k) and 5(l) show the running time and the memory cost of the four exact probabilistic frequent mining algorithms in terms of the skew parameter of the Zipf distribution. We can observe that the running time and the memory cost decrease with the increase of the skew parameter. We can also find that, as the skew parameter varies, the changing trends of the running time and the memory cost are quite stable. Therefore, the skew parameter of the Zipf distribution does not have a significant impact on the running time or the memory cost.
Conclusions. First of all, among the exact probabilistic frequent itemset mining algorithms, DCB is the fastest algorithm in most cases; however, compared to DPB, it has to spend more memory on the divide-and-conquer processing. In addition, the Chernoff-bound-based pruning is the most important tool for speeding up exact probabilistic frequent itemset mining algorithms, since it filters out infrequent itemsets quickly: by the computational analysis above, the divide-and-conquer and dynamic-programming approaches spend O(n log^2 n) and O(n × min_sup) time, respectively, to calculate the frequent probability of each itemset, whereas the Chernoff-bound test for an itemset costs only O(n). Therefore, the pruning substantially reduces the running time.
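The pruning test itself is cheap. Below is a minimal sketch assuming one standard form of the multiplicative Chernoff bound for sums of independent Bernoulli variables; the exact bound used in the DPB and DCB implementations may differ in its constants.

```python
import math

def chernoff_upper_bound(expected_sup, min_sup):
    """Upper bound on P(support >= min_sup) for a sum of independent
    Bernoulli variables with mean expected_sup, via the multiplicative
    Chernoff bound P(X >= (1+d)*mu) <= exp(-d*d*mu / (2 + d)), d > 0.
    (A standard form; the paper's constants may differ.)"""
    mu = expected_sup
    if min_sup <= mu:
        return 1.0  # the bound is uninformative at or below the mean
    d = min_sup / mu - 1.0
    return math.exp(-d * d * mu / (2.0 + d))

def survives_pruning(probs, min_sup, pft):
    """O(n) test: skip the exact frequent-probability computation for an
    itemset whenever the bound already falls below the threshold pft."""
    mu = sum(probs)  # expected support, one pass over the transactions
    return chernoff_upper_bound(mu, min_sup) >= pft
```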
[Figure 6: running time (log scale) and memory cost of DCB, PDUApriori, NDUApriori and NDUH-Mine; panels (a)-(d) vary min_sup, panels (e)-(h) vary pft, panels (i)-(j) vary the number of transactions (K), and panels (k)-(l) vary the Zipf skew.]
This is because the influence of pft is far less than that of min_sup. Tables 8 and 9 show the precisions and the recalls of the two approximation probabilistic frequent algorithms in Accident and Kosarak, respectively. We can find that the precision and the recall are almost 1 in the Accident dataset, which means there are almost no false positives or false negatives. In Kosarak, we also observe a few false positives as pft decreases. In addition, the Normal distribution-based approximation algorithms achieve a better approximation effect than the Poisson distribution-based approximation algorithms. This is because the expectation and the variance of the Poisson distribution are the same, namely λ, whereas the expected support and the variance of an itemset are usually unequal.

Scalability. We further analyze the scalability of the three approximate probabilistic frequent mining algorithms. In Figure 6(i), varying the number of transactions in the dataset from 20k to 320k, we find that the running time is linear. Figure 6(j) reports the memory cost of the three algorithms, which also shows linearity in the number of transactions. Overall, NDUH-Mine performs best.

Effect of the Zipf distribution. Figures 6(k) and 6(l) show the running time and the memory cost of the three approximate algorithms in terms of the skew parameter of the Zipf distribution. We can observe that the running time and the memory cost decrease with the increase of the skew parameter. In particular, when the skew parameter increases, PDUApriori gradually outperforms NDUApriori and NDUH-Mine.

Conclusions. First of all, the approximation probabilistic frequent itemset mining algorithms achieve high-quality approximations when the uncertain database is large enough, due to the requirement of the CLT; in our experiments, the datasets usually include more than 50,000 transactions. These approximation algorithms have almost no false positives or false negatives, which is reasonable because the Lyapunov CLT guarantees the approximation quality. In addition, in terms of efficiency, the approximation probabilistic frequent itemset mining algorithms are much faster than any existing exact probabilistic frequent itemset mining algorithm. Moreover, the Normal distribution-based algorithms are usually faster than the Poisson distribution-based algorithm. Finally, similar to the case of the expected support-based frequent algorithms, NDUApriori is always the fastest algorithm in dense uncertain databases, while NDUH-Mine usually performs best in sparse uncertain databases.
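The point about the variance can be made concrete with a small sketch (hypothetical helper names; `probs` holds the itemset's per-transaction appearance probabilities): the Normal-based test uses the itemset's true mean and variance, while the Poisson-based test is forced to use variance = mean = λ.

```python
import math

def approx_freq_prob_normal(probs, min_sup):
    """Normal (Lyapunov CLT) approximation, as in the NDU* algorithms:
    support ~ N(mu, var) with the itemset's true mean and variance."""
    mu = sum(probs)
    var = sum(p * (1 - p) for p in probs)
    if var == 0.0:
        return 1.0 if mu >= min_sup else 0.0
    z = (min_sup - 0.5 - mu) / math.sqrt(var)   # continuity correction
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # P(support >= min_sup)

def approx_freq_prob_poisson(probs, min_sup):
    """Poisson approximation, as in the PDU* algorithms: both mean and
    variance are forced to lambda = mu. Direct summation is fine for
    moderate lambda; a log-space variant avoids overflow for large ones."""
    lam = sum(probs)
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i)
              for i in range(int(min_sup)))
    return 1.0 - cdf
```

Since sum p(1-p) never exceeds λ, the Poisson model always overstates the spread of the support, which matches the weaker precision observed for the Poisson-based algorithms.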
We summarize the experimental results in Table 10, where 'time(D)' and 'time(S)' mean the time cost in dense and in sparse uncertain databases, respectively, and 'memory(D)' and 'memory(S)' likewise mean the memory cost in dense and in sparse uncertain databases; each entry names the winner in that case.

• As observed in Table 10, under the definition of expected support-based frequent itemset, UApriori is usually the fastest algorithm with lower memory cost when the database is dense and min_esup is high. On the contrary, when the database is sparse or min_esup is low, UH-Mine often outperforms the other algorithms in running time and spends only limited memory. However, UFP-growth is almost always the slowest algorithm, with a high memory cost.

• From Table 10, among the exact probabilistic frequent itemset mining algorithms, the DC algorithm is the fastest in most cases. However, it trades memory for efficiency, because it has to store recursive results for the divide-and-conquer processing. In addition, when the condition is satisfied, the DP algorithm is faster than the DC algorithm.

• Again from Table 10, both PDUApriori and NDUApriori are the winners in running time and memory cost when the database is dense and min_sup is high; otherwise, NDUH-Mine is the winner. The main difference between PDUApriori and NDUApriori is that NDUApriori achieves a better approximation when the database is large enough.

Other than the results described in Table 10, we also find:

• Approximation probabilistic frequent itemset mining algorithms usually achieve a high-quality approximation in most cases. To our surprise, the frequent probabilities of most probabilistic frequent itemsets are often 1 when the uncertain database is large enough, e.g., when the number of transactions exceeds 10,000. This is a reasonable result. On the one hand, the Lyapunov Central Limit Theorem guarantees the high-quality approximation. On the other hand, according to the cumulative distribution function (CDF) of the Poisson distribution, the frequent probability of an itemset can be approximated as 1 - Σ_{i=0}^{min_sup-1} e^{-λ} λ^i / i!, where λ is the expected support of the itemset. When an uncertain database is large enough, the expected support of a probabilistic frequent itemset is usually large, and as a consequence its frequent probability approaches 1 (see the sketch after this list).

• Approximation probabilistic frequent itemset mining algorithms usually far outperform any existing exact probabilistic frequent itemset mining algorithm in efficiency.

• The Chernoff-bound-based pruning is an important tool for improving the efficiency of exact probabilistic frequent itemset mining algorithms, because it can filter out infrequent itemsets quickly and reduce both the running time and the memory cost.

• The result under the definition of probabilistic frequent itemset can be obtained by the existing solutions under the definition of expected support-based frequent itemset, if we compute the variance of an itemset in addition to its expected support.
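To see the "frequent probability tends to 1" effect numerically, here is a small check of the Poisson-based formula from the bullet above, with purely hypothetical database parameters (a 2% per-transaction appearance probability and a 1.5% relative support threshold):

```python
import math

def poisson_freq_prob(lam: float, min_sup: int) -> float:
    """1 - PoissonCDF(min_sup - 1; lam), computed in log space so that
    large expected supports do not overflow. Requires min_sup >= 1."""
    log_term = -lam            # log of the i = 0 term, e^-lam
    cdf = math.exp(log_term)
    for i in range(1, min_sup):
        log_term += math.log(lam) - math.log(i)
        cdf += math.exp(log_term)
    return 1.0 - cdf

for n in (1_000, 10_000, 100_000):  # number of transactions
    lam = 0.02 * n                  # itemset appears w.p. 0.02 per txn
    min_sup = int(0.015 * n)        # relative threshold of 1.5%
    print(n, poisson_freq_prob(lam, min_sup))
```

As the database grows, the printed frequent probability climbs from roughly 0.9 toward 1, matching the observation above.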
In this paper, we conduct a comprehensive experimental study of the frequent itemset mining algorithms over uncertain databases. Since there are two definitions of frequent itemsets over uncertain data, most existing research is categorized into two directions. However, through our exploration, we first clarify that there is a close relationship between the two different definitions of frequent itemsets over uncertain data; therefore, we need not keep the current solutions for the second definition, as they can be replaced with the efficient existing solutions for the first definition. Secondly, we provide baseline implementations of eight existing representative algorithms and test their performance fairly under a uniform measurement. Finally, based on extensive experiments over many different benchmarks, we verify several existing inconsistent conclusions and find some new rules in this area.
This work is supported in part by the Hong Kong RGC GRF Project No. 611411, the National Grand Fundamental Research 973 Program of China under Grants 2012-CB316200 and 2011-CB302200-G, HP IRP Project 2011, Microsoft Research Asia Grant MRA11EG05, the National Natural Science Foundation of China (Grants No. 61025007, 60933001, and 61100024), US NSF Grants DBI-0960443, CNS-1115234, and IIS-0914934, and the Google Mobile 2014 Program.