Tathagata Das (TD)
Whoami
- Core committer on Apache Spark
- Lead developer on Spark Streaming
- On leave from the PhD program at UC Berkeley
Big Data → Processed Data

Pipeline of nodes
- Each node maintains mutable state
- Each input record updates the state, and new records are sent out

[Diagram: input records flowing through a pipeline of nodes (node 1, node 2, node 3), each holding mutable state]
>Storm
- Replays record if not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!
Spark Streaming

[Diagram: input sources (Kafka, Flume, HDFS, Kinesis, Twitter) feed into Spark Streaming, which writes results out to HDFS, databases, and dashboards]
[Diagram: live data streams → receivers → batches as RDDs → Spark → results as RDDs]
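This batching model maps directly onto the API: you create a StreamingContext with a batch interval, receivers produce input DStreams, and each batch becomes an RDD. A minimal sketch, assuming a raw TCP socket source on a placeholder host and port:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MinimalStream")
    // Batch interval of 1 second: receivers chop the live stream into 1s batches
    val ssc = new StreamingContext(conf, Seconds(1))

    // Raw TCP socket receiver; host and port are placeholders
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch is an RDD; print a few records from every batch
    lines.print()

    ssc.start()            // start the receivers and the processing loop
    ssc.awaitTermination() // run until stopped
  }
}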
[Diagram: an input DStream (tweets) is a sequence of RDDs, one per batch interval (batch @ t, batch @ t+1, batch @ t+2), stored in memory as RDDs]

[Diagram: applying flatMap to the tweets DStream produces a transformed hashTags DStream ([#cat, #dog, ...]); the transformation is applied to the RDD of every batch]
[Diagram: tweets DStream → flatMap → hashTags DStream, with a save operation writing every batch to HDFS]

[Diagram: the same pipeline with a foreach operation, running arbitrary code on every batch]
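In the API, saving every batch corresponds to DStream output operations such as saveAsTextFiles. A short sketch, reusing the tweets DStream from the diagrams; getTags and the HDFS path prefix are placeholders:

val hashTags = tweets.flatMap(status => getTags(status)) // getTags: placeholder helper
// Write each batch of hashtags to HDFS as text files (path prefix is illustrative)
hashTags.saveAsTextFiles("hdfs://namenode:8020/streaming/hashTags")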
Languages
- Scala API
- Java API (uses function objects)
- Python API ...soon
Window-based Transformations

val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Diagram: a sliding window operation over a DStream of data, showing the window length and the sliding interval]
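A windowed countByValue recomputes over the whole window every time it slides. Spark Streaming also provides incremental window operations; the sketch below swaps in reduceByKeyAndWindow, whose inverse-reduce form adds counts entering the window and subtracts counts leaving it. This form requires checkpointing, and the checkpoint directory shown is a placeholder:

ssc.checkpoint("hdfs://namenode:8020/streaming/checkpoints") // placeholder path

// Incrementally maintained windowed count: add new batches, subtract expired ones
val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), Seconds(5))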
tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamFile).filter(...)
})
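Because transform exposes the raw batch RDD, each batch can be joined against static datasets. Below is a hedged expansion of the spam example, assuming tweets come from the Twitter receiver (twitter4j Status objects) and spamFile holds known-spammer user IDs; the path and the join keying are illustrative:

// Static RDD of known spammer IDs, loaded once (path is a placeholder)
val spamFile = ssc.sparkContext
  .textFile("hdfs://namenode:8020/data/spammers.txt")
  .map(id => (id, ()))

// Key each tweet by author ID, join against the spam list, keep non-matches
val cleaned = tweets.transform { tweetsRDD =>
  tweetsRDD
    .map(status => (status.getUser.getId.toString, status))
    .leftOuterJoin(spamFile)
    .filter { case (_, (_, spam)) => spam.isEmpty } // drop tweets from spammers
    .map { case (_, (status, _)) => status }
}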
Apache Spark

[Diagram: the Spark stack, with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) all built on the Apache Spark core]
$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...
object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = KafkaUtils.createStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Performance

[Charts: cluster throughput vs. # nodes in cluster (0 to 100) for Grep (y-axis 0 to 6) and WordCount (y-axis 0 to 3), each plotted at 1 sec and 2 sec batch intervals]
Fault-tolerance
>Batches of input data are replicated in memory for fault-tolerance
>Data lost due to worker failure can be recomputed from the replicated input data

[Diagram: the tweets RDD holds input data replicated in memory; flatMap produces the hashTags RDD, whose lost partitions are recomputed on other workers]
Input Sources
Out of the box, we provide
- Kafka, Flume, Kinesis, raw TCP sockets, HDFS, etc.

Output Sinks
- HDFS, S3, etc. (Hadoop API compatible filesystems)
- Cassandra (using the Spark-Cassandra connector)
- HBase (integrated support coming to Spark soon)
- Directly push the data anywhere (see the sketch below)
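For the last option, the natural hook is foreachRDD, which hands each batch RDD to arbitrary user code. A hedged sketch, where MySinkClient stands in for whatever store's client you actually use; the hostname and port are placeholders:

hashTags.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Hypothetical client; open one connection per partition, not per record
    val client = new MySinkClient("sink-host", 1234)
    try records.foreach(client.send)
    finally client.close()
  }
}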
Conclusion