Abstract—With the explosive growth of data, we have entered the era of big data. In order to sift through masses of information, many data mining algorithms using parallelization are being implemented. Cluster analysis occupies a pivotal position in data mining, and the DBSCAN algorithm is one of the most widely used algorithms for clustering. However, when the existing parallel DBSCAN algorithms create data partitions, the original database is usually divided into several disjoint partitions; with the increase in data dimension, the splitting and consolidation of high-dimensional space will consume a lot of time. To solve this problem, this paper proposes a parallel DBSCAN algorithm (S_DBSCAN) based on Spark, which can quickly realize the partition of the original data and the combination of the clustering results. It is divided into the following steps: 1) partitioning the raw data based on a random sample, 2) computing local DBSCAN algorithms in parallel, and 3) merging the data partitions based on the centroid. Compared with the traditional DBSCAN algorithm, the experimental results show the proposed S_DBSCAN algorithm provides better operating efficiency and scalability.

Keywords—Spark; Data Mining; Parallel Algorithms; DBSCAN Algorithm; Data Partition.

I. INTRODUCTION

With the current innovative information technology, data is showing explosive growth, and many massive-data-based applications have become quite popular. Internationally, Facebook users number about one third of the global population. In China, the 2015 Tmall "Double Eleven" carnival achieved a turnover of nearly 100 billion yuan and created a user transaction volume of up to 85,900 transactions/sec. It seems that the internet industry is facing a huge shift from IT to DT (Data Technology). How to improve the ability to mine knowledge from massive amounts of data has become a problem. In order to find out the differences and connections between data, data mining[1] appears on the horizon as a new discipline, and is applicable in varied industries.

Cluster analysis occupies a pivotal position in data mining, and has received wide attention. Clustering is generally in accordance with a certain similarity measure, so that highly similar data can be gathered together. From the point of view of machine learning, clustering belongs to the unsupervised learning process, and automatically finds relations between data. Cluster analysis plays an important role in many fields including statistics, image processing, pattern recognition, and psychology.

Among clustering algorithms, Density-Based Spatial Clustering of Applications with Noise (DBSCAN)[2] is one of the most widely used. Its popularity stems from the benefits that DBSCAN provides over other clustering algorithms: the ability to discover arbitrarily shaped clusters, robustness in the presence of noise, and simplicity in its input parameters (unlike other clustering algorithms, DBSCAN does not require the user to enter the number of clusters to be found prior to execution). DBSCAN is also insensitive to the order of the points in the dataset. As a result, DBSCAN has achieved great success and has become the most cited clustering method in the relevant scientific literature.

In order to deal with massive data, the literature has proposed multiple approaches for the parallel execution of DBSCAN. Xiaowei Xu proposed PDBSCAN, which uses a distributed R*-tree to improve the efficiency of the algorithm[3]. Ali Patwary proposed an algorithm named PDSDBSCAN using a disjoint-set data structure; it breaks the ordering of data access, resulting in better load balancing[4]. Benjamin Welton proposed a GPU-based parallelization of DBSCAN, significantly improving efficiency through effective data partitioning[5]. MapReduce was introduced in 2004 in a seminal paper published by J. Dean and S. Ghemawat[6]. B. R. Dai and I. C. Lin in [7] and Y. He, et al. in [8] both applied their parallel DBSCAN algorithms based on MapReduce.

While the existing parallel DBSCAN algorithms can overcome some of the problems of traditional DBSCAN algorithms, they also have the following disadvantages:

1) When the existing parallel DBSCAN algorithms make data partitions, the original database is usually divided into several disjoint partitions; with the increase in data dimension, the splitting of high-dimensional space will consume a lot of time.

2) When doing partition merging, we need to do boundary determination 2m times for each partition, where m is the dimension of the data, which will definitely consume a lot of time and result in low efficiency.

In order to solve the problems of existing parallel DBSCAN algorithms and their low efficiency when doing data partitions and partition merges, this paper proposes a parallel DBSCAN algorithm (S_DBSCAN) based on Spark[9][10], which can quickly realize these merges and divisions of clustering results of the original data.

The work of this paper is as follows: describing the S_DBSCAN algorithm that takes full advantage of Apache Spark's parallel capabilities and

This work is supported by the Science and Technology Department of Sichuan Province (No. 2014FZ0087).
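To make the density-based idea concrete, the following is a minimal single-machine sketch of DBSCAN as characterized above (arbitrary cluster shapes, noise labeling, and only two parameters, eps and MinPts). It is an illustrative implementation, not the paper's parallel algorithm; all names are ours.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = lambda i: [j for j in range(len(points))
                           if dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        labels[i] = cluster          # i is a core point: start a cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:   # j is a core point: expand from it
                queue.extend(js)
        cluster += 1
    return labels
```

Note that no cluster count is supplied up front; the number of clusters emerges from the density structure of the data, which is the property the paper highlights.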
Core point: a point p ∈ X is a core point if |Nε(p)| ≥ MinPts.

[Figure 3 fragment: Raw Data → data partition based on a random sample → computing local DBSCAN algorithm in parallel]
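The core-point condition |Nε(p)| ≥ MinPts can be checked directly; a small sketch (illustrative names, Euclidean distance, and p counted in its own neighborhood, as is standard):

```python
def is_core_point(p, data, eps, min_pts):
    """True if p has at least min_pts points (including itself) within eps."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(1 for q in data if dist(p, q) <= eps) >= min_pts
```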
In this section, the paper focuses on the solution of finding density-based clusters from raw data on the Spark platform. In order to solve the problems of the existing parallel DBSCAN algorithms, e.g., low efficiency when doing data partitioning and combination, this paper proposes a parallel DBSCAN algorithm (S_DBSCAN) based on Spark, which can quickly realize the merges and divisions of clustering results from the original data. The S_DBSCAN algorithm is divided into the following steps: partitioning the raw data based on random samples, computing local DBSCAN algorithms in parallel, and merging the data partitions based on the centroid. The S_DBSCAN algorithm's framework is shown in Figure 3.

B. Computing local DBSCAN algorithms in parallel

The input of this stage is the output of the last stage. The purpose of this stage is to compute a local DBSCAN algorithm in parallel, thus creating partial clustering results. This paper requires Local_DBSCAN_Map and Local_DBSCAN_ReduceByKey to generate the partial clusters. The schematic diagram of computing local DBSCAN algorithms in parallel is shown in Figure 4.
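The per-partition stage can be sketched as a parallel map over partitions, each emitting <(partition_id, local_cluster_id), point> pairs so that cluster ids stay globally unique before merging; in Spark this map would be an RDD `mapPartitionsWithIndex` call. This is a simplified stand-in (a chain-distance grouping replaces the full local DBSCAN), and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def local_cluster(args):
    """Cluster one partition; a chain-distance grouping stands in for DBSCAN."""
    pid, points, eps = args
    labels, cluster = {}, 0
    for i in range(len(points)):
        if i in labels:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if j in labels:
                continue
            labels[j] = cluster
            for k in range(len(points)):
                d = sum((a - b) ** 2 for a, b in zip(points[j], points[k])) ** 0.5
                if k not in labels and d <= eps:
                    stack.append(k)
        cluster += 1
    # tag with the partition id so cluster ids (CID) are globally unique
    return [((pid, labels[i]), points[i]) for i in range(len(points))]

def run_local_stage(partitions, eps):
    """Map every partition in parallel; a thread pool plays the role of Spark."""
    tasks = [(pid, pts, eps) for pid, pts in enumerate(partitions)]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(local_cluster, tasks))
    return [pair for part in results for pair in part]
```

The key design point mirrored here is that partitions never communicate during this stage; all reconciliation is deferred to the centroid-based merge.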
Figure 4. The schematic diagram of computing local DBSCAN algorithms in parallel

C. Merging the data partitions based on the centroid

After the above steps, each partition has produced independent clustering results. At this stage, we merge the partial clusters to generate global clustering results. Since the distribution of each partition has similarities, so do the partial clusters. We design a merging strategy based on the centroid. Its general idea is as follows. Firstly, calculate the distance between every two partial clusters in the same partition, then use quick-sort or heap-sort to find the minimum distance d_min. Secondly, sort every d_min to find the minimum value D_min. Thirdly, set the threshold used to merge partial clusters based on D_min. Fourthly, create a centroid_distance matrix and traverse every element in the matrix; if the distance is less than the threshold, add the pair to the merge queue, until every element is visited. We require Combine_ReduceByKey, Combine_Reduce and ReLabel_Map to generate the global clusters. The schematic diagram of merging the data partitions based on the centroid is shown in Figure 5.
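The centroid-based merging idea can be sketched as follows: compute one centroid per partial cluster, compare all centroid pairs against the threshold, and relabel merged groups with a global cluster id (G_CID). This is one simplified reading of the strategy (a union-find replaces the explicit merge queue, with the same effect of joining all threshold-connected clusters); the names and the Euclidean distance are illustrative.

```python
def centroid(points):
    """Mean point of a partial cluster."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def merge_partial_clusters(partial, threshold):
    """partial: {CID: [points]}. Returns {CID: G_CID}, merging clusters
    whose centroids lie closer than the threshold."""
    cids = sorted(partial)
    cents = {cid: centroid(partial[cid]) for cid in cids}
    parent = {cid: cid for cid in cids}

    def find(c):                            # union-find with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(cids):            # traverse the centroid-distance matrix
        for b in cids[i + 1:]:
            d = sum((x - y) ** 2 for x, y in zip(cents[a], cents[b])) ** 0.5
            if d < threshold:
                parent[find(b)] = find(a)   # mark the pair for merging

    roots = {}                              # relabel: one G_CID per merged group
    return {cid: roots.setdefault(find(cid), len(roots)) for cid in cids}
```

Because only centroids (not raw points) are compared, the merge cost grows with the number of partial clusters rather than with the dataset size, which is the efficiency argument behind this stage.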
2) Finding out the minimum distance d_min with quick-sort or heap-sort in the same partition.
3) The output is <S, (list(centroid, CID), d_min)>, and S is a unified identifier.

Step2 (Combine_Reduce):
Input: <S, (list(centroid, CID), d_min)>
Output: <G_CID, list(combine-list)>
Process:
1) Reading the raw data, and saving list(centroid, CID) and d_min for the next stage.
2) Sorting every d_min to find the minimum value D_min, and setting the threshold.
3) Creating the centroid_matrix through list(centroid, CID).
4) Searching the centroid_matrix to find the combine-list; the way of searching is shown in TABLE IV.
5) Getting the combine-list; the output is <G_CID, list(combine-list)>, where G_CID is the global cluster identifier.

Step3 (ReLabel_Map):
Input: <G_CID, list(combine-list)>, <CID, (Flag, Raw_data)>
Output: <G_CID, (Flag, Raw_data)>
Process:
1) Reading the raw data, and getting the value.
2) Searching for CID in the combine-list; if found, updating CID to G_CID, until every CID is updated.
3) The output is <G_CID, (Flag, Raw_data)>.

The process of creating the combine-list is shown in TABLE IV.

TABLE IV. The process of creating combine-list

Input: centroid_matrix, threshold
Output: list1, list2, ..., listn
Process:
Candidate = {CID, Dis}
Mark every element in centroid_matrix as unvisited
while (centroid_matrix != null) {
    list = null
    list.add(CID_i)
    centroid_matrix.Remove(CID_i)
    while (centroid_matrix != null) {
        Candidate = find(list, centroid_matrix)
        if (Candidate.Dis <= threshold) {
            list.add(Candidate.CID)
            centroid_matrix.Remove(Candidate.CID)
        } else break
    } end while
    output(list)
} end while

IV. SIMULATION

In this work, we set up the Spark cluster with similar computer machines. The cluster contains a Master node and several Slave nodes. Each compute node has 4 GB of memory, Gigabit Ethernet, an Intel Pentium Dual-Core E5300 processor clocked at 2.6 GHz, a 500 GB hard drive, and the Ubuntu 14.04 64-bit operating system.

The application data comes from the annual outpatient data of a certain disease, and has 225,833 records and 89 attributes. In this study, 5 attributes about cost structure were selected from the annual outpatient data as test data, to enable S_DBSCAN to solve practical problems. The detail of the test data is shown in TABLE V.

TABLE V. Detail of the test data

Field                             Type    Range
Accounting for inspection costs   Float   0~1
Accounting for drug costs         Float   0~1
Accounting for service costs      Float   0~1
Accounting for medical costs      Float   0~1
Accounting for other costs        Float   0~1

Using the test data, four datasets were created for testing: D1 has 50 thousand records, D2 has 100 thousand records, D3 has 150 thousand records, and D4 has 100 thousand records.

A. Accuracy

In order to find out whether the S_DBSCAN algorithm could obtain the same results as the original DBSCAN algorithm, we use D1 as test data and run the DBSCAN algorithm and the S_DBSCAN algorithm. The DBSCAN algorithm is performed by Weka[13], which is a common data mining tool. S_DBSCAN is performed on five computing nodes, setting three groups of different parameters. The comparative result of the two algorithms is shown in TABLE VI.

TABLE VI. Comparative Result

Rate1    Rate2
3.15%    0.92%
2.91%    1.12%
2.32%    0.98%

Rate1 is defined as follows:

Rate1 = (Noise(S_DBSCAN) - Noise_same(S_DBSCAN, DBSCAN)) / Noise(DBSCAN)    (2)

where Noise(S_DBSCAN) denotes the number of noise points computed by S_DBSCAN, Noise(DBSCAN) indicates the number of noise points computed by DBSCAN, and Noise_same(S_DBSCAN, DBSCAN) is the number of noise points computed identically by the S_DBSCAN and DBSCAN algorithms.

Rate2 is defined as follows:

Rate2 = (Noise(DBSCAN) - Noise_same(S_DBSCAN, DBSCAN)) / Noise(DBSCAN)    (3)

It can be seen from TABLE VI that the S_DBSCAN algorithm has a good clustering result and a lower error rate. The S_DBSCAN algorithm can obtain almost the same results as the traditional DBSCAN algorithm.
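Given the noise sets produced by the two algorithms, the error rates defined above can be computed directly. This sketch assumes Rate1 and Rate2 subtract the shared noise count from each algorithm's noise count and normalize by DBSCAN's noise count (consistent with the small percentages reported); the set-of-record-ids representation and names are illustrative.

```python
def error_rates(noise_s_dbscan, noise_dbscan):
    """Rate1 and Rate2 from sets of noise record ids.

    Rate1: fraction of S_DBSCAN's noise not also flagged by DBSCAN.
    Rate2: fraction of DBSCAN's noise missed by S_DBSCAN.
    Both are normalized by DBSCAN's noise count."""
    same = len(noise_s_dbscan & noise_dbscan)
    rate1 = (len(noise_s_dbscan) - same) / len(noise_dbscan)
    rate2 = (len(noise_dbscan) - same) / len(noise_dbscan)
    return rate1, rate2
```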
B. Running time

Running time is used to determine the speed of execution of the algorithm, and can be used to evaluate whether the S_DBSCAN algorithm improves the operating efficiency. The experiment is based on a conventional stand-alone platform and a distributed Spark cluster. The test data are D1, D2, D3 and D4. Each set of tests is run multiple times, and we take the average running time as the final result. The running time results are shown in Figure 6.

Figure 6. Running time of S_DBSCAN algorithm

As seen from Figure 6, there is a significant difference between running on a single node and on multiple nodes. With the increasing size of the dataset, the difference between the numbers of utilized nodes grows. For the same data set, the running time decreases with the increase of computing nodes, thus reflecting the parallel ability of the S_DBSCAN algorithm.

C. Speedup

Speedup refers to the ratio of single-task processing time as compared to multiple-task processing time for the same task. Speedup is usually used to measure the performance of a parallel system or of a program's parallelization. In order to test the parallel ability of the S_DBSCAN algorithm, we use D1, D2, D3 and D4 to test the S_DBSCAN algorithm. The speedup results of the S_DBSCAN algorithm are shown in Figure 7.

For the same data set, the more the computing nodes, the greater the speedup. For the same number of nodes, the larger the data set, the greater the speedup. This fully reflects the superior efficiency of the S_DBSCAN algorithm when dealing with massive data, as compared to existing parallel DBSCAN algorithms.

V. CONCLUSION

To solve the low efficiency problem in data partitioning and partition merging for the existing parallel DBSCAN algorithms, this paper proposes a parallel DBSCAN algorithm based on Spark named S_DBSCAN, which can quickly realize the merges and divisions of clustering results from the original data. This paper evaluates the S_DBSCAN algorithm by dealing with annual outpatient data. The experimental result shows the proposed S_DBSCAN algorithm can effectively and efficiently generate clusters and identify noise data. In short, the S_DBSCAN algorithm has superior performance when dealing with massive data, as compared to existing parallel DBSCAN algorithms.

REFERENCES

[1] Fayyad U M, Piatetsky-Shapiro G, Smyth P, et al. Advances in knowledge discovery and data mining[J]. 1996.
[2] Ester M, Kriegel H P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]//KDD. 1996, 96(34): 226-231.
[3] Xu X, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases[M]//High Performance Data Mining. Springer US, 1999: 263-290.
[4] Patwary M, Palsetia D, Agrawal A, et al. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure[C]//High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, 2012: 1-11.
[5] Welton B, Samanas E, Miller B P. Mr. Scan: extreme scale density-based clustering using a tree-based network of GPGPU nodes[C]//Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013: 84.
[6] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
[7] Dai B R, Lin I C. Efficient map/reduce-based DBSCAN algorithm with optimized data partition[C]//Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012: 59-66.
[8] He Y, Tan H, Luo W, et al. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce[C]//Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on. IEEE, 2011: 473-480.
[9] Apache Spark[OL]. http://spark.apache.org/, 2014.
[10] Shoro A G, Soomro T R. Big data analysis: Apache Spark perspective[J]. Global Journal of Computer Science and Technology, 2015, 15(1).
[11] Qi Y U, Jie L. Research of cloud storage security technology based on HDFS[J]. Computer Engineering and Design, 2013, 8: 013.
[12] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012: 2-2.
[13] Hall M, Frank E, Holmes G, et al. The WEKA data mining software: an update[J]. ACM SIGKDD Explorations Newsletter, 2009, 11(1): 10-18.