
Big Data, MapReduce & Hadoop

By:
Surbhi Vyas (7)
Varsha (8)
Big Data
Big data is a term that describes the large volume
of data, both structured and unstructured, that
inundates (floods) a business on a day-to-day basis.
But it's not the amount of data that's important.
It's what organizations do with the data that
matters. It can be analyzed for insights that lead
to better decisions and strategic business moves.
An example of big data might be petabytes (1,024
terabytes) or exabytes (1,024 petabytes) of data
consisting of billions to trillions of records of
millions of people, all from different sources (e.g.
Web, sales, customer contact center, social media,
mobile data and so on).
Big Data History & Current
Considerations
While the term "big data" is relatively new, the
act of gathering and storing large amounts of
information for eventual analysis is ages old.
Characteristics:
Volume: Organizations collect data from a variety
of sources, including business transactions, social
media and information from sensor or machine-to-
machine data. In the past, storing it would have
been a problem, but new technologies (such as
Hadoop) have eased the burden.
Velocity: Data streams in at an unprecedented
speed and must be dealt with in a timely manner.
Variety: Data comes in all types of formats, from
structured, numeric data in traditional databases
to unstructured text documents, email, video,
audio, stock ticker data and financial transactions.
Big data has increased the demand for
information management specialists to the point
that Software AG, Oracle Corporation, IBM,
Microsoft, SAP, EMC, HP and Dell have spent
more than $15 billion on software firms
specializing in data management and analytics.
In 2010, this industry was worth more than
$100 billion and was growing at almost 10
percent a year.
Who uses big data?

Big data affects organizations across
practically every industry:
Banking
Education
Healthcare
Government
Manufacturing
Retail
Before MapReduce

Large-scale data processing was difficult!
Managing hundreds or thousands of processors
Managing parallelization and distribution
I/O Scheduling
Status and monitoring
Fault/crash tolerance
MapReduce provides all of these, easily!
MapReduce Overview

What is it?
Programming model used by Google
A combination of the Map and Reduce models with
an associated implementation
Used for processing and generating large data sets
MapReduce Overview

How does it solve our previously mentioned problems?


MapReduce is highly scalable and can be used across many
computers.
Many small machines can be used to process jobs that normally
could not be processed by a large machine.
Map Abstraction

Inputs a key/value pair
Key is a reference to the input value
Value is the data set on which to operate
Evaluation
Function defined by user
Applied to every value in the input
Might need to parse the input
Produces a new list of key/value pairs
Can be a different type from the input pair
Map Example
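A minimal sketch of a word-count mapper in Python (illustrative only; the function name map_word_count is an assumption, and Hadoop's native API is actually Java):

# Illustrative word-count mapper (not Hadoop's actual API).
# Input: a (key, value) pair where key is a document name and
# value is the document's text.
# Output: a list of intermediate (word, 1) pairs.

def map_word_count(key, value):
    """Emit (word, 1) for every word in the document text."""
    pairs = []
    for word in value.split():
        pairs.append((word.lower(), 1))
    return pairs

# Example:
# map_word_count("doc1", "the quick brown fox the")
# -> [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1), ("the", 1)]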
Reduce Abstraction

Starts with intermediate key/value pairs
Ends with finalized key/value pairs
Starting pairs are sorted by key
An iterator supplies the values for a given key to
the Reduce function.
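A minimal sketch, in Python, of the sort-and-group step that feeds each key's values to Reduce (illustrative; the helper group_by_key is an assumption, since the real shuffle/sort is handled by the framework):

from itertools import groupby
from operator import itemgetter

# Illustrative shuffle/sort step: take all intermediate (key, value)
# pairs, sort them by key, and hand each key plus an iterator over its
# values to the Reduce function.

def group_by_key(intermediate_pairs):
    """Yield (key, values_iterator) groups, sorted by key."""
    sorted_pairs = sorted(intermediate_pairs, key=itemgetter(0))
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, (value for _, value in group)

# Example:
# [(k, list(vs)) for k, vs in group_by_key([("the", 1), ("fox", 1), ("the", 1)])]
# -> [("fox", [1]), ("the", [1, 1])]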
Reduce Abstraction

Typically a function that:
Starts with a large number of key/value pairs
One key/value for each word in all the files being grepped
(including multiple entries for the same word)
Ends with very few key/value pairs
One key/value for each unique word across all the
files, with the number of instances summed into this
entry
The work is broken up so that a given worker receives all the
input for the same key.
Reduce Example
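A minimal sketch of the matching word-count reducer in Python (illustrative; reduce_word_count is an assumed name, not Hadoop's actual API):

# Illustrative word-count reducer (not Hadoop's actual API).
# Input: a key (a word) and an iterator over all of its intermediate
# values (here, a stream of 1s emitted by the mapper).
# Output: a single (word, total_count) pair.

def reduce_word_count(key, values):
    """Sum the counts for one word."""
    return (key, sum(values))

# Example:
# reduce_word_count("the", iter([1, 1, 1]))
# -> ("the", 3)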
How Map and Reduce Work Together

Map returns information; Reduce accepts that
information and applies a user-defined function to
reduce the amount of data.
MapReduce workflow

Input data is split (Split 0, Split 1, Split 2) and read by Map
workers, which write intermediate results to local disk. Reduce
workers then remotely read and sort that intermediate data and
write the final output files (Output File 0, Output File 1).
Map: extract something you care about from each record.
Reduce: aggregate, summarize, filter, or transform.
Example: Word Count
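A minimal sketch of the whole word count, reusing the mapper, grouping, and reducer sketches above and simulating the job in a single Python process rather than on a cluster (the driver function word_count is an assumption for illustration):

# Illustrative single-process simulation of MapReduce word count.
# A real cluster would distribute map_word_count and reduce_word_count
# across many workers; here the phases are simply chained in memory.

def word_count(documents):
    """documents: dict of {document_name: text}. Returns {word: count}."""
    # Map phase: emit (word, 1) pairs from every document.
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_word_count(name, text))

    # Shuffle/sort + Reduce phase: group by word and sum the counts.
    results = {}
    for word, values in group_by_key(intermediate):
        key, total = reduce_word_count(word, values)
        results[key] = total
    return results

# Example:
# word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
# -> {"be": 2, "do": 1, "not": 1, "or": 1, "to": 3}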
Applications

MapReduce is built on top of GFS, the Google File System.
Input and output files are stored on GFS.
While MapReduce is heavily used within Google, it has also
found use at companies such as Yahoo, Facebook, and
Amazon.
The original implementation was done by Google. It is used
internally for a large number of Google services.
The Apache Hadoop project built a clone to specs defined
by Google. Amazon, in turn, uses Hadoop MapReduce
running on its EC2 (Elastic Compute Cloud) computing-on-demand
service to offer the Amazon Elastic MapReduce service.
Why is this approach better?

Creates an abstraction for dealing with complex
overhead:
The computations are simple; the overhead is messy.
Removing the overhead makes programs much smaller
and thus easier to use.
Less testing is required as well: the MapReduce libraries can
be assumed to work properly, so only user code needs to be
tested.
Division of labor is also handled by the MapReduce
libraries, so programmers only need to focus on the
actual computation.
Hadoop
Hadoop is an open-source software framework for
storing data and running applications on clusters
of commodity hardware.
It provides massive storage for any kind of data,
enormous processing power and the ability to
handle virtually limitless concurrent tasks or
jobs.
Hadoop's strength lies in its ability to scale across
thousands of commodity servers that don't share
memory or disk space.
Hadoop delegates tasks across these servers (called
worker nodes or slave nodes), essentially
harnessing the power of each device and running
tasks in parallel.
This is what allows massive amounts of data to
be analyzed: splitting the tasks across different
locations in this manner allows bigger jobs to be
completed faster.
Hadoop can be thought of as an ecosystem: it's
comprised of many different components that all
work together to create a single platform.
There are two key functional components within
this ecosystem:
The storage of data (Hadoop Distributed File
System, or HDFS)
The framework for running parallel
computations on that data (MapReduce)
Imagine, for example, that an entire
MapReduce job is the equivalent of building a
house.
Each job is broken down into individual tasks
(e.g. lay the foundation, put up drywall) and
assigned to various workers, or mappers
and reducers.
Completing all the tasks results in a single,
combined output: the house is complete.
History of Hadoop

Hadoop was created by Doug Cutting and Mike
Cafarella in 2006. Cutting, who was working at Yahoo! at
the time, named it after his son's toy elephant.
It was originally developed to support distribution for
the Nutch search engine project.
Apache Nutch is a highly extensible and scalable open-
source web crawler software project.
A Web crawler systematically browses the World Wide
Web, typically for the purpose of Web indexing.
As the World Wide Web grew in the late 1990s and
early 2000s, search engines and indexes were
created to help locate relevant information amid
the text-based content.
In the early years, search results were returned
by humans. But as the web grew from dozens to
millions of pages, automation was needed.
Web crawlers were created, many as university-
led research projects, and search engine start-ups
took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search
engine called Nutch, the brainchild of Doug
Cutting and Mike Cafarella.
They wanted to return web search results faster
by distributing data and calculations across
different computers so multiple tasks could be
accomplished simultaneously.
During this time, another search engine project
called Google was in progress. It was based on the
same concept: storing and processing data in a
distributed, automated way so that relevant web
search results could be returned faster.
The Nutch project was divided: the web crawler
portion remained as Nutch, and the distributed
computing and processing portion became Hadoop.
In 2008, Yahoo released Hadoop as an open-source
project.
Today, Hadoop's framework and ecosystem of
technologies are managed and maintained by the non-
profit Apache Software Foundation (ASF), a global
community of software developers and contributors.
Search engines in the 1990s (1996, 1997)

Google search engine (1998, 2017)
Framework of Hadoop

The base Apache Hadoop framework is composed of the
following modules:
Hadoop Common: contains libraries and utilities
needed by other Hadoop modules;
Hadoop Distributed File System (HDFS): a distributed
file system that stores data on commodity machines,
providing very high aggregate bandwidth across the
cluster;
Hadoop YARN: a resource-management platform
responsible for managing computing resources in
clusters and using them for scheduling of users'
applications;
Hadoop MapReduce: an implementation of
the MapReduce programming model for large-scale data
processing.
The Hadoop framework itself is mostly written in the
Java programming language, with some native code
in C and command-line utilities written as shell scripts.
Hadoop Architecture

At its core, Hadoop has two major layers, namely:
(a) Processing/computation layer (MapReduce), and
(b) Storage layer (Hadoop Distributed File System).
The Hadoop framework also includes the following two modules:
Hadoop Common: These are Java libraries and utilities
required by other Hadoop modules.
Hadoop YARN: This is a framework for job scheduling and
cluster resource management.
HOW DOES HADOOP WORK?

It is quite expensive to build bigger servers with heavy configurations
that handle large-scale processing. As an alternative, you can tie
together many commodity single-CPU computers into a single
functional distributed system; practically, the clustered machines can
read the dataset in parallel and provide much higher throughput.
Moreover, it is cheaper than one high-end server. So the first
motivational factor behind using Hadoop is that it runs across clustered,
low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs (a block-splitting sketch follows this list):
Data is initially divided into directories and files. Files are divided into
uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further
processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
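A minimal sketch, in Python, of the block-splitting and replication idea above (a toy illustration; the block-size constant, node names, and helper functions are assumptions, not HDFS's actual implementation):

import itertools

# Toy illustration of splitting a file into fixed-size blocks and
# assigning each block to several nodes (replication). HDFS does this
# internally; this is not its real API.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the preferred block size
REPLICATION = 3                 # assumed replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the file contents into uniform-sized blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(blocks, nodes, replication: int = REPLICATION):
    """Place each block on `replication` nodes in round-robin order."""
    node_cycle = itertools.cycle(nodes)
    placement = {}
    for block_id, _ in enumerate(blocks):
        placement[block_id] = [next(node_cycle) for _ in range(replication)]
    return placement

# Example (hypothetical node names):
# blocks = split_into_blocks(b"x" * (300 * 1024 * 1024))   # -> 3 blocks
# assign_replicas(blocks, ["node1", "node2", "node3", "node4"])
# -> {0: ['node1', 'node2', 'node3'],
#     1: ['node4', 'node1', 'node2'],
#     2: ['node3', 'node4', 'node1']}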
Advantages of Hadoop
Cost effective
Scalable
Flexible
Fast
Resilient to failure
Disadvantages of Hadoop
Security Concerns

Not fit for Small Data

Vulnerable by Nature

Potential Stability Issue

General Limitation
Companies Using Hadoop
Used as a source for reporting and machine learning.

Used to store & process tweets, log files.

Its data flows through clusters.

Used for scaling tests.

Client Projects in finance, telecom & retail.

Used for search optimization & research.

Client Projects

Designing & building web applications.

Log analysis & machine learning.

Social services to structured data storage.
