Abstract—With the explosive growth of data, we have entered the era of big data. In order to sift through masses of information, many data mining algorithms using parallelization are being implemented. Cluster analysis occupies a pivotal position in data mining, and the DBSCAN algorithm is one of the most widely used algorithms for clustering. However, when the existing parallel DBSCAN algorithms create data partitions, the original database is usually divided into several disjoint partitions; with the increase in data dimension, the splitting and consolidation of high-dimensional space will consume a lot of time. To solve this problem, this paper proposes a parallel DBSCAN algorithm (S_DBSCAN) based on Spark, which can quickly realize the partition of the original data and the combination of the clustering results. It is divided into the following steps: 1) partitioning the raw data based on a random sample, 2) computing local DBSCAN algorithms in parallel, and 3) merging the data partitions based on the centroid. Compared with the traditional DBSCAN algorithm, the experimental results show the proposed S_DBSCAN algorithm provides better operating efficiency and scalability.

Keywords—Spark; Data Mining; Parallel Algorithms; DBSCAN Algorithm; Data Partition.

I. INTRODUCTION

With the current innovative information technology, data is showing explosive growth, and many massive-data-based applications have become quite popular. Internationally, Facebook users number about one third of the global population. In China, the 2015 Tmall "Double Eleven" carnival achieved a turnover of nearly 100 billion yuan and created a user transaction volume of up to 85,900 transactions/sec. It seems that the internet industry is facing a huge shift from IT to DT (Data Technology). How to improve the ability to mine knowledge from massive amounts of data has become a problem. In order to find out the differences and connections between data, data mining[1] appears on the horizon as a new discipline, and is applicable in varied industries.

Cluster analysis occupies a pivotal position in data mining, and has received wide attention. Clustering is generally in accordance with a certain similarity measure, so that highly similar data can be gathered together. From the point of view of machine learning, clustering belongs to the unsupervised learning process, and automatically finds relations between data. Cluster analysis plays an important role in many fields including statistics, image processing, pattern recognition, and psychology.

Among clustering algorithms, Density-Based Spatial Clustering of Applications with Noise (DBSCAN)[2] is one of the most widely used. Its popularity stems from the benefits that DBSCAN provides over other clustering algorithms: the ability to discover arbitrarily shaped clusters, robustness in the presence of noise, and simplicity in its input parameters (unlike other clustering algorithms, DBSCAN does not require the user to enter the number of clusters to be found prior to execution). DBSCAN is also insensitive to the order of the points in the dataset. As a result, DBSCAN has achieved great success and has become the most cited clustering method in the relevant scientific literature.

In order to deal with massive data, the literature has proposed multiple approaches for the parallel execution of DBSCAN. Xiaowei Xu proposed PDBSCAN, which uses a distributed R*-tree to improve the efficiency of the algorithm[3]. Ali Patwary proposed an algorithm named PDSDBSCAN using a disjoint-set data structure; it breaks the ordering of data access, resulting in better load balancing[4]. Benjamin Welton proposed a GPU-based parallelization of DBSCAN, significantly improving efficiency through effective data partitioning[5]. MapReduce was introduced in 2004 in a seminal paper published by J. Dean and S. Ghemawat[6]. B. R. Dai and I. C. Lin in [7] and Y. He, et al. in [8] both applied their parallel DBSCAN algorithms based on MapReduce.

While the existing parallel DBSCAN algorithms can overcome some of the problems of traditional DBSCAN algorithms, they also have the following disadvantages:

1) When the existing parallel DBSCAN algorithms make data partitions, the original database is usually divided into several disjoint partitions; with the increase in data dimension, the splitting of high-dimensional space will consume a lot of time.

2) When doing partition merging, we need to do boundary determination 2m times for each partition, where m is the dimension of the data, which will definitely consume a lot of time and result in low efficiency.

In order to solve the problems of existing parallel DBSCAN algorithms and their low efficiency when doing data partitions and partition merges, this paper proposes a parallel DBSCAN algorithm (S_DBSCAN) based on Spark[9][10], which can quickly realize these merges and divisions of clustering results of the original data.

The work of this paper is as follows: describing the S_DBSCAN algorithm that takes full advantage of Apache Spark's parallel capabilities and

This work is supported by the Science and Technology Department of Sichuan Province (No. 2014FZ0087).
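To make the density-based idea concrete, the following is a minimal single-machine sketch of DBSCAN as characterized above (arbitrary cluster shapes, noise labeling, and only two parameters, eps and MinPts). It is an illustrative implementation, not the paper's parallel algorithm; all names are ours.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = lambda i: [j for j in range(len(points))
                           if dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        labels[i] = cluster          # i is a core point: start a cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:   # j is a core point: expand from it
                queue.extend(js)
        cluster += 1
    return labels
```

Note that no cluster count is supplied up front; the number of clusters emerges from the density structure of the data, which is the property the paper highlights.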
Core point: a point p ∈ X is a core point if |Nε(p)| ≥ MinPts.

[Figure 3 fragment: Raw Data → data partition based on a random sample → computing local DBSCAN algorithm in parallel]
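The core-point condition |Nε(p)| ≥ MinPts can be checked directly; a small sketch (illustrative names, Euclidean distance, and p counted in its own neighborhood, as is standard):

```python
def is_core_point(p, data, eps, min_pts):
    """True if p has at least min_pts points (including itself) within eps."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(1 for q in data if dist(p, q) <= eps) >= min_pts
```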
In this section, the paper focuses on the solution of finding density-based clusters from raw data on the Spark platform. In order to solve the problems of the existing parallel DBSCAN algorithms, e.g., low efficiency when doing data partitioning and combination, this paper proposes a parallel DBSCAN algorithm (S_DBSCAN) based on Spark, which can quickly realize the merges and divisions of clustering results from the original data. The S_DBSCAN algorithm is divided into the following steps: partitioning the raw data based on random samples, computing local DBSCAN algorithms in parallel, and merging the data partitions based on the centroid. The S_DBSCAN algorithm's framework is shown in Figure 3.

B. Computing local DBSCAN algorithms in parallel

The input of this stage is the output of the last stage. The purpose of this stage is to compute a local DBSCAN algorithm in parallel, thus creating partial clustering results. This paper requires Local_DBSCAN_Map and Local_DBSCAN_ReduceByKey to generate the partial clusters. The schematic diagram of computing local DBSCAN algorithms in parallel is shown in Figure 4.
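The per-partition stage can be sketched as a parallel map over partitions, each emitting <(partition_id, local_cluster_id), point> pairs so that cluster ids stay globally unique before merging; in Spark this map would be an RDD `mapPartitionsWithIndex` call. This is a simplified stand-in (a chain-distance grouping replaces the full local DBSCAN), and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def local_cluster(args):
    """Cluster one partition; a chain-distance grouping stands in for DBSCAN."""
    pid, points, eps = args
    labels, cluster = {}, 0
    for i in range(len(points)):
        if i in labels:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if j in labels:
                continue
            labels[j] = cluster
            for k in range(len(points)):
                d = sum((a - b) ** 2 for a, b in zip(points[j], points[k])) ** 0.5
                if k not in labels and d <= eps:
                    stack.append(k)
        cluster += 1
    # tag with the partition id so cluster ids (CID) are globally unique
    return [((pid, labels[i]), points[i]) for i in range(len(points))]

def run_local_stage(partitions, eps):
    """Map every partition in parallel; a thread pool plays the role of Spark."""
    tasks = [(pid, pts, eps) for pid, pts in enumerate(partitions)]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(local_cluster, tasks))
    return [pair for part in results for pair in part]
```

The key design point mirrored here is that partitions never communicate during this stage; all reconciliation is deferred to the centroid-based merge.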
Figure 4. The schematic diagram of computing local DBSCAN algorithms in parallel

C. Merging the data partitions based on the centroid

After the above steps, each partition has produced independent clustering results. At this stage, we merge the partial clusters to generate global clustering results. Since the distribution of each partition has similarities, so do the partial clusters. We design a merging strategy based on the centroid. Its general idea is as follows. Firstly, calculate the distance between every two partial clusters in the same partition, then use quick-sort or heap-sort to find the minimum distance d_min. Secondly, sort every d_min to find the minimum value D_min. Thirdly, set the threshold used to merge partial clusters based on D_min. Fourthly, create a centroid_distance matrix and traverse every element in the matrix; if the distance is less than the threshold, add the pair to the merge queue, until every element is visited. We require Combine_ReduceByKey, Combine_Reduce and ReLabel_Map to generate the global clusters. The schematic diagram of merging the data partitions based on the centroid is shown in Figure 5.
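The centroid-based merging idea can be sketched as follows: compute one centroid per partial cluster, compare all centroid pairs against the threshold, and relabel merged groups with a global cluster id (G_CID). This is one simplified reading of the strategy (a union-find replaces the explicit merge queue, with the same effect of joining all threshold-connected clusters); the names and the Euclidean distance are illustrative.

```python
def centroid(points):
    """Mean point of a partial cluster."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def merge_partial_clusters(partial, threshold):
    """partial: {CID: [points]}. Returns {CID: G_CID}, merging clusters
    whose centroids lie closer than the threshold."""
    cids = sorted(partial)
    cents = {cid: centroid(partial[cid]) for cid in cids}
    parent = {cid: cid for cid in cids}

    def find(c):                            # union-find with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(cids):            # traverse the centroid-distance matrix
        for b in cids[i + 1:]:
            d = sum((x - y) ** 2 for x, y in zip(cents[a], cents[b])) ** 0.5
            if d < threshold:
                parent[find(b)] = find(a)   # mark the pair for merging

    roots = {}                              # relabel: one G_CID per merged group
    return {cid: roots.setdefault(find(cid), len(roots)) for cid in cids}
```

Because only centroids (not raw points) are compared, the merge cost grows with the number of partial clusters rather than with the dataset size, which is the efficiency argument behind this stage.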
2) Finding out the minimum distance d_min with quick-sort or heap-sort in the same partition.
3) The output is <S, (list(centroid, CID), d_min)>, and S is a unified identifier.

Step2 (Combine_Reduce):
Input: <S, (list(centroid, CID), d_min)>
Output: <G_CID, list(combine-list)>
Process:
1) Reading the raw data, and saving list(centroid, CID) and d_min for the next stage.
2) Sorting every d_min to find the minimum value D_min, and setting the threshold.
3) Creating the centroid_matrix through list(centroid, CID).
4) Searching the centroid_matrix to find the combine-list; the way of searching is shown in TABLE IV.
5) Getting the combine-list; the output is <G_CID, list(combine-list)>, where G_CID is the global cluster identifier.

Step3 (ReLabel_Map):
Input: <G_CID, list(combine-list)>, <CID, (Flag, Raw_data)>
Output: <G_CID, (Flag, Raw_data)>
Process:
1) Reading the raw data, and getting the value.
2) Searching for CID in the combine-list; if found, updating CID to G_CID, until every CID is updated.
3) The output is <G_CID, (Flag, Raw_data)>.

The process of creating the combine-list is shown in TABLE IV.

TABLE IV. The process of creating combine-list

Input: centroid_matrix, threshold
Output: list1, list2, ..., listn
Process:
Candidate = {CID, Dis}
Mark every element in centroid_matrix as unvisited
while (centroid_matrix != null) {
    list = null
    list.add(CID_i)
    centroid_matrix.Remove(CID_i)
    while (centroid_matrix != null) {
        Candidate = find(list, centroid_matrix)
        if (Candidate.Dis <= threshold) {
            list.add(Candidate.CID)
            centroid_matrix.Remove(Candidate.CID)
        } else break
    } end while
    output(list)
} end while

IV. SIMULATION

In this work, we set up the Spark cluster with similar computer machines. The cluster contains a Master node and several Slave nodes. Each compute node has 4 GB of memory, Gigabit Ethernet, an Intel Pentium Dual-Core E5300 processor clocked at 2.6 GHz, a 500 GB hard drive, and the Ubuntu 14.04 64-bit operating system.

The application data comes from the annual outpatient data of a certain disease, and has 225,833 records and 89 attributes. In this study, 5 attributes about cost structure were selected from the annual outpatient data as test data, to enable S_DBSCAN to solve practical problems. The detail of the test data is shown in TABLE V.

TABLE V. Detail of the test data

Field                             Type    Range
Accounting for inspection costs   Float   0~1
Accounting for drug costs         Float   0~1
Accounting for service costs      Float   0~1
Accounting for medical costs      Float   0~1
Accounting for other costs        Float   0~1

Using the test data, four datasets were created for testing: D1 has 50 thousand records, D2 has 100 thousand records, D3 has 150 thousand records, and D4 has 100 thousand records.

A. Accuracy

In order to find out whether the S_DBSCAN algorithm could obtain the same results as the original DBSCAN algorithm, we use D1 as test data and run the DBSCAN algorithm and the S_DBSCAN algorithm. The DBSCAN algorithm is performed by Weka[13], which is a common data mining tool. S_DBSCAN is performed on five computing nodes, setting three groups of different parameters. The comparative result of the two algorithms is shown in TABLE VI.

TABLE VI. Comparative Result

Rate1    Rate2
3.15%    0.92%
2.91%    1.12%
2.32%    0.98%

Rate1 is defined as follows:

Rate1 = (Noise(S_DBSCAN) - Noise_same(S_DBSCAN, DBSCAN)) / Noise(DBSCAN)    (2)

where Noise(S_DBSCAN) denotes the number of noise points computed by S_DBSCAN, Noise(DBSCAN) indicates the number of noise points computed by DBSCAN, and Noise_same(S_DBSCAN, DBSCAN) is the number of noise points computed identically by the S_DBSCAN and DBSCAN algorithms.

Rate2 is defined as follows:

Rate2 = (Noise(DBSCAN) - Noise_same(S_DBSCAN, DBSCAN)) / Noise(DBSCAN)    (3)

It can be seen from TABLE VI that the S_DBSCAN algorithm has a good clustering result and a lower error rate. The S_DBSCAN algorithm can obtain almost the same results as the traditional DBSCAN algorithm.
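Given the noise sets produced by the two algorithms, the error rates defined above can be computed directly. This sketch assumes Rate1 and Rate2 subtract the shared noise count from each algorithm's noise count and normalize by DBSCAN's noise count (consistent with the small percentages reported); the set-of-record-ids representation and names are illustrative.

```python
def error_rates(noise_s_dbscan, noise_dbscan):
    """Rate1 and Rate2 from sets of noise record ids.

    Rate1: fraction of S_DBSCAN's noise not also flagged by DBSCAN.
    Rate2: fraction of DBSCAN's noise missed by S_DBSCAN.
    Both are normalized by DBSCAN's noise count."""
    same = len(noise_s_dbscan & noise_dbscan)
    rate1 = (len(noise_s_dbscan) - same) / len(noise_dbscan)
    rate2 = (len(noise_dbscan) - same) / len(noise_dbscan)
    return rate1, rate2
```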
B. Running time

Running time is used to determine the speed of execution of the algorithm, and can be used to evaluate whether the S_DBSCAN algorithm improves the operating efficiency. The experiment is based on a conventional stand-alone platform and a distributed Spark cluster. The test data are D1, D2, D3 and D4. Each set of tests is run multiple times, and we take the average running time as the final result. The running time results are shown in Figure 6.

Figure 6. Running time of S_DBSCAN algorithm

As seen from Figure 6, there is a significant difference between running on a single node and on multiple nodes. With the increasing size of the dataset, the difference between the numbers of utilized nodes grows. For the same data set, the running time decreases with the increase of computing nodes, thus reflecting the parallel ability of the S_DBSCAN algorithm.

C. Speedup

Speedup refers to the ratio of single-task processing time as compared to multiple-task processing time for the same task. Speedup is usually used to measure the performance of a parallel system or of a program's parallelization. In order to test the parallel ability of the S_DBSCAN algorithm, we use D1, D2, D3 and D4 to test the S_DBSCAN algorithm. The speedup results of the S_DBSCAN algorithm are shown in Figure 7.

For the same data set, the more the computing nodes, the greater the speedup. For the same number of nodes, the larger the data set, the greater the speedup. This fully reflects the superior efficiency of the S_DBSCAN algorithm when dealing with massive data, as compared to existing parallel DBSCAN algorithms.

V. CONCLUSION

To solve the low efficiency problem in data partitioning and partition merging for the existing parallel DBSCAN algorithms, this paper proposes a parallel DBSCAN algorithm based on Spark named S_DBSCAN, which can quickly realize the merges and divisions of clustering results from the original data. This paper evaluates the S_DBSCAN algorithm by dealing with annual outpatient data. The experimental result shows the proposed S_DBSCAN algorithm can effectively and efficiently generate clusters and identify noise data. In short, the S_DBSCAN algorithm has superior performance when dealing with massive data, as compared to existing parallel DBSCAN algorithms.

REFERENCES

[1] Fayyad U M, Piatetsky-Shapiro G, Smyth P, et al. Advances in knowledge discovery and data mining[J]. 1996.
[2] Ester M, Kriegel H P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]//KDD. 1996, 96(34): 226-231.
[3] Xu X, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases[M]//High Performance Data Mining. Springer US, 1999: 263-290.
[4] Patwary M, Palsetia D, Agrawal A, et al. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure[C]//High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, 2012: 1-11.
[5] Welton B, Samanas E, Miller B P. Mr. Scan: extreme scale density-based clustering using a tree-based network of GPGPU nodes[C]//Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013: 84.
[6] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
[7] Dai B R, Lin I C. Efficient map/reduce-based DBSCAN algorithm with optimized data partition[C]//Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012: 59-66.
[8] He Y, Tan H, Luo W, et al. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce[C]//Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on. IEEE, 2011: 473-480.
[9] Apache Spark[OL]. http://spark.apache.org/, 2014.
[10] Shoro A G, Soomro T R. Big data analysis: Apache Spark perspective[J]. Global Journal of Computer Science and Technology, 2015, 15(1).
[11] Qi Y U, Jie L. Research of cloud storage security technology based on HDFS[J]. Computer Engineering and Design, 2013, 8: 013.
[12] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012: 2-2.
[13] Hall M, Frank E, Holmes G, et al. The WEKA data mining software: an update[J]. ACM SIGKDD Explorations Newsletter, 2009, 11(1): 10-18.