
Hadoop vs Spark

ADEEL MUSTAFA

Big Data - Apache


Spark and Hadoop are Apache big data processing projects. Both are open source and their features overlap to some extent, which can cause confusion when choosing the right technology.

Hadoop
Distributed architecture for parallel processing of large data sets using HDFS and MapReduce
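The map, shuffle, and reduce phases behind Hadoop's model can be sketched in a few lines of plain Python. This is a toy word count illustrating the programming model only, not Hadoop's actual Java API; function names are illustrative.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data in hdfs", "spark and hadoop process data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["hadoop"])  # 2
```

In a real cluster each phase runs on many nodes, and the shuffle moves intermediate pairs across the network and through disk, which is where the I/O cost discussed below comes from.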

Hadoop - Limitations
Hadoop MapReduce fundamentally works with disk operations and performs many I/O operations during data processing.
This reduces throughput, which can be improved by adding more nodes or by using solid-state drives. But as data grows, adding more nodes increases cost, calling the whole open-source, free-to-use paradigm into question.

Spark
The Spark project was started with this problem in mind. The technology is platform independent and, unlike Hadoop, does not have its own file system, which allows Spark to run on and integrate with any file system. Spark provides high performance by processing data in memory, whereas Hadoop relies on disk operations. Spark not only utilizes physical memory but also has optimized virtual-memory handling in case the processing batch grows in size.
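Spark's in-memory advantage comes largely from caching: a computed dataset can be kept in memory and reused by several downstream stages instead of being recomputed or reread from disk. Here is a toy, in-memory take on that idea; the class and method names loosely mirror Spark's RDD API but are purely illustrative (and unlike Spark's lazy `cache()`, this sketch materializes eagerly for simplicity).

```python
class ToyRDD:
    """Illustrative stand-in for a Spark RDD: lazy transformations plus cache()."""

    def __init__(self, compute):
        self._compute = compute   # thunk that produces the data when asked
        self._cached = None       # filled in by cache()

    def collect(self):
        # Serve from memory if cached, otherwise run the whole lineage.
        return self._cached if self._cached is not None else self._compute()

    def cache(self):
        # Materialize once and keep the result in memory for later stages.
        self._cached = self._compute()
        return self

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

data = ToyRDD(lambda: list(range(10)))
evens = data.filter(lambda x: x % 2 == 0).cache()   # computed once, reused twice
squares = evens.map(lambda x: x * x).collect()
doubles = evens.map(lambda x: x * 2).collect()
print(squares)  # [0, 4, 16, 36, 64]
print(doubles)  # [0, 4, 8, 12, 16]
```

Without the `cache()` call, each of the two downstream stages would re-run the filter, which is the in-memory analogue of MapReduce rereading intermediate data from disk between jobs.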

Benefits of Spark
- Fast data processing
- In-memory architecture for multi-stage jobs
- Caching, streaming and graph support
- Built-in APIs for analytical and machine learning functions
- Faster development due to shorter debug/execution cycles
- Optimized memory utilization
- Virtual memory

Use-Cases of Spark

- Iterative processing
- Machine learning
- Event stream processing
- Real-time data processing
In scenarios where time constraints are critical, MapReduce fails to deliver the required performance. Spark, on the other hand, is the perfect tool for developers designing complex applications. Imagine a driverless-car scenario where decision-making time is critical: Spark provides the required level of performance. There are many cases where clients switched back to conventional data warehousing due to Hadoop's long loading cycles. Spark now provides a solution to this problem.
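Iterative processing, the first use case above, shows why in-memory execution matters: each pass reuses the same dataset, so MapReduce would reread it from disk on every iteration while Spark keeps it cached after the first load. A minimal sketch of such a job in plain Python (the data and the gradient-descent loop are illustrative, not Spark code):

```python
# An iterative job: many passes over the SAME dataset.
# Under MapReduce, every iteration would reread the data from HDFS;
# with Spark, the dataset stays cached in memory after the first load.

points = [(x, 2.0 * x) for x in range(1, 6)]  # samples from y = 2x, held in memory

w = 0.0                      # model weight we want to learn
lr = 0.01                    # learning rate
for _ in range(200):         # 200 full passes over the in-memory data
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    w -= lr * grad

print(round(w, 2))  # converges toward 2.0
```

With 200 passes over even a modest dataset, paying a disk read per iteration dominates the runtime, which is exactly the gap between the two technologies that this section describes.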

Limitations of Spark
- To get optimal performance, the memory in a Spark cluster needs to be as large as the amount of data the job processes.
- Spark is still in its maturity and adoption phase.
Hadoop MapReduce is the best technology for batch-processing applications, while Spark is the best technology for analytic applications.
Choose the right tool for the problem at hand.


Thank You
