An Overview
Srinivasa D.K. (srinivasa.kodandarama@aricent.com)
Big Data Architect
PSS Cisco CMS
TABLE OF CONTENTS
1. Big Data
2. Hadoop
3. HDFS Architecture
4. MapReduce
5. YARN
6. Flume
7. Sqoop
8. Hive
9. Pig
10. Oozie
11. Zookeeper
12. NoSQL HBase
13. Further Readings
What is Big Data?
Extremely large data sets that may be analysed computationally to reveal patterns,
trends, and associations, especially relating to human behaviour and interactions.
What Happens on the Internet in a minute?
What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big Data is the term for a collection of data sets so large and
complex that they become difficult to process using on-hand database
management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing,
transfer, analysis and visualization.
Hadoop
What is Hadoop?
Hadoop is a software framework that provides a parallel processing
environment on a distributed file system using commodity hardware.
Hadoop is all about storage (HDFS) and processing (MapReduce).
Why Hadoop?
We are generating data faster than ever (automation, digitalization,
IoT, smartphones, social media, etc.)
This data has many valuable applications (market research, product
recommendation, demand forecasting, fraud detection, etc.)
We must process it to visualize and extract that value for making
better decisions.
Hadoop Pillars
Storage
Provided via HDFS
Distributed (think striping)
Redundant (think mirroring)
Processing
Provided via MapReduce
Distributed / parallel
Both pillars require long-running daemons (YARN)
Coordination and delegation on masters (NameNode)
Raw storage & processing on slaves (DataNode)
Hadoop History
HDFS Architecture
HDFS Architecture
Based on GFS (the Google File System)
HDFS is a distributed filesystem
Master and slave daemons
Files are split into chunks (blocks) and distributed across data nodes (slaves)
Each block is replicated (default 3x) for durability & concurrent access
Write Once Read Many (WORM)
No "appends" or edits (appends were added in later HDFS versions)
Self-healing - can sustain the loss of:
a whole slave node
block(s) of data
HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
HDFS provides high throughput access to application data and is suitable for
applications that have large data sets.
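The splitting and replication described above can be sketched with some simple arithmetic. This is only an illustration: the 128 MB block size is the common Hadoop 2 default, and the round-robin replica placement below is made up for brevity (the real NameNode is rack-aware).

```python
# Sketch: how HDFS conceptually splits a file into replicated blocks.
# Block size, replication factor, and placement are illustrative only.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB (common Hadoop 2 default)
REPLICATION = 3                  # default replication factor

def plan_blocks(file_size, datanodes):
    """Return a list of (block_id, [replica_nodes]) for a file of file_size bytes."""
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    plan = []
    for b in range(n_blocks):
        # Naive round-robin placement; real HDFS also considers racks.
        replicas = [datanodes[(b + i) % len(datanodes)] for i in range(REPLICATION)]
        plan.append((b, replicas))
    return plan

plan = plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(plan))     # 300 MB -> 3 blocks
print(plan[0][1])    # each block lives on 3 distinct DataNodes
```

Losing any single DataNode leaves at least two replicas of every block, which is why the cluster can self-heal by re-replicating from the survivors.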
HDFS Daemons
There are 2 daemons in classical HDFS
NameNode (master)
DataNode (slave)
NameNode
Master node with which clients must initiate
reads/writes
Holds all metadata about a file:
File name, permissions, directory
Which nodes contain which blocks
Metadata stored both in memory & on disk
Very memory hungry
Disk backups of metadata are very important!
If you lose the NameNode, you lose HDFS
No file block data ever flows through the NameNode
DataNode(s)
Slave nodes which perform actual block
storage
Clients write & read blocks directly from the slaves
MapReduce
What is MapReduce?
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed algorithm on
a cluster.
It assumes Moving Computation is Cheaper than Moving Data
MapReduce is more of a framework than a tool. You have to fit your solution into
the framework of map and reduce, which in some situations might be challenging.
The MapReduce framework gives you the ability to process your data with
distributed computing.
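The map and reduce phases can be sketched in a few lines of plain Python. This is an in-memory toy, not the framework itself: the real system runs map tasks on the nodes holding each data block ("moving computation to the data") and performs the shuffle over the network; here the shuffle is simulated with a dict.

```python
# Minimal in-memory sketch of the MapReduce model: word count.
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs: one (word, 1) per word in the input line.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Aggregate all values that were shuffled to the same key.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

shuffled = defaultdict(list)
for line in lines:
    for k, v in map_phase(line):
        shuffled[k].append(v)        # shuffle & sort: group values by key

counts = dict(reduce_phase(k, vs) for k, vs in shuffled.items())
print(counts["the"])  # 3
```

Fitting a problem into this two-phase shape is exactly the constraint the text mentions: everything must be expressible as independent map calls followed by per-key reduces.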
MapReduce Flow
Hadoop 1 Limitations
Lacks support for alternate paradigms and services
Forces everything to look like MapReduce
Iterative applications in MapReduce are ~10x slower
Scalability
Max cluster size ~5,000 nodes
Max concurrent tasks ~40,000
Availability
A JobTracker failure kills queued & running jobs
Hard partition of resources into map and reduce slots
Non-optimal resource utilization
Can't share resources with non-MR applications running on the Hadoop
cluster (e.g. Impala, Giraph, Spark)
Hadoop Next Generation
YARN (Hadoop 2)
Hadoop 2
YARN (Yet Another Resource Negotiator)
YARN Architecture
Hadoop Ecosystem Tools
Hadoop Ecosystem
Flume
Apache Flume is a tool/service/data ingestion mechanism
for collecting, aggregating, and transporting large amounts
of streaming data, such as log data and events, from
various web servers to a centralized data store.
It is a highly reliable, distributed, and configurable tool
that is principally designed to transfer streaming data from
various sources to HDFS.
Enables quick iteration on new collection strategies
Component - Definition/Function
Event - A singular unit of data transported by Flume (typically a single log entry).
Source - The entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
Sink - The entity that delivers the data to its destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS (others: HBase, Solr, ElasticSearch).
Channel - The conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel.
Agent - Any physical Java virtual machine running Flume; a collection of sources, sinks and channels.
Client - The entity that produces and transmits the Event to the Source operating within the Agent.
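A minimal agent definition ties these components together in a Flume properties file. The agent, source, channel, and sink names below (agent1, src1, ch1, sink1), the tailed log path, and the HDFS URL are all made up for illustration:

```properties
# One agent named "agent1" with one source, one channel, one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: tail a log file (exec source)
agent1.sources.src1.type     = exec
agent1.sources.src1.command  = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# Sink: drain the channel into HDFS
agent1.sinks.sink1.type      = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel   = ch1
```

Events flow client → source → channel → sink; swapping the memory channel for a file channel trades throughput for durability.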
Flume (cont.)
Sqoop
Apache Sqoop is a tool designed for efficiently
transferring bulk data between Apache Hadoop
and structured datastores such as relational
databases (RDBMS)
Sqoop imports data from external structured
datastores into HDFS or related systems like
Hive and HBase.
Sqoop can also export data from Hadoop to external
structured datastores such as relational databases
and enterprise data warehouses.
Examples:-
$ sqoop export \
  --connect jdbc:postgresql://hdp-master/sqoop_db \
  --username sqoop_user \
  --password postgres \
  --table cities \
  --export-dir cities

$ sqoop import \
  --connect jdbc:postgresql://hdp-master/sqoop_db \
  --username sqoop_user \
  --password postgres \
  --table cities
Hive
It is a Data Warehousing (SQL) layer on Hadoop
Facilitates data summarization, ad-hoc queries and analysis of large datasets
Query using an SQL-like language called HiveQL; custom mappers and reducers can also be used
Hive tables can be defined directly on HDFS files via SerDes and customized formats
(SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO: it allows Hive to read data from a table and write it back out to HDFS in any
custom format.)
Tables can be partitioned, with data loaded separately into each partition for scale.
Tables can be clustered on certain columns for query performance.
The schema is stored in an RDBMS. Hive has complex column types (map, array, struct) in addition to
atomic types.
User Interface - Hive is a data warehouse infrastructure software that creates interaction
between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI,
the Hive command line, and Hive HD Insight (on Windows Server).
Metastore - Hive chooses respective database servers to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine - HiveQL is similar to SQL for querying schema information in the
Metastore. It is one of the replacements for the traditional approach of writing a
MapReduce program: instead of writing the MapReduce program in Java, we can write a
query for the MapReduce job and process it.
Execution Engine - The conjunction of the HiveQL Process Engine and MapReduce is the
Hive Execution Engine. The execution engine processes the query and generates the same
results as MapReduce. It uses the flavor of MapReduce.
HDFS or HBase - The Hadoop Distributed File System or HBase are the data storage
techniques used to store data in the file system.
Hive (cont.)
Storage Format Example
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
      salary String, destination String)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
HiveQL JOIN
HiveQL GROUP BY
Pig
High-level scripting platform for processing
and analyzing large data sets.
It allows Hadoop users to write complex
MapReduce transformations using a
simple scripting language called Pig Latin
The Pig compiler produces sequences of
MapReduce programs
Developed at Yahoo
Used as a data flow language, particularly
well suited to ETL
Requires nothing besides the Pig
interpreter
Makes it much easier to develop parallel
processing programs than writing MR
code directly
User Defined Functions (UDFs) specify
custom processing in Java, Python, or Ruby
Implement EVAL, AGGREGATE, and FILTER
functions (Grunt is Pig's interactive shell)
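The EVAL-UDF idea above can be shown with a tiny Python function. The function name and behavior are hypothetical; with real Pig you would put it in a .py file, REGISTER that file, and invoke the function from Pig Latin, whereas here it is simply called directly:

```python
# Hypothetical Pig-style EVAL UDF in Python: normalize a name field.
# Shown standalone; in Pig it would be REGISTERed and called per-tuple.
def normalize_name(value):
    if value is None:            # Pig passes None through for null fields
        return None
    return value.strip().title()

print(normalize_name("  siddarth "))  # Siddarth
```

An EVAL function like this runs once per tuple inside the map or reduce tasks that Pig generates, so it must be side-effect free.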
Pig (cont.)
Creating the input file:-
$ vi Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

grunt> student = LOAD 'Student.txt' USING PigStorage(',')
       AS (id:int, name:chararray, city:chararray);
grunt> DUMP student;

As directed in the command, Pig loads the Student.txt file, and the Dump
operator displays the following content:
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs, and in
some cases Hive operates on HDFS in a similar way to Apache Pig. A few
significant points set Apache Pig apart from Hive:
Apache Pig uses a procedural data-flow language (Pig Latin), was
originally developed at Yahoo, and handles structured as well as
semi-structured data.
Hive uses a declarative, SQL-like language (HiveQL), was originally
developed at Facebook, and is mainly used for structured data and reporting.
Zookeeper
Configuration Management
Cluster member nodes bootstrap their configuration from a centralized source in an
unattended way
Easier, simpler deployment/provisioning
Distributed Cluster Management
Node join / leave
Node statuses in real time
Naming service, e.g. DNS
Distributed synchronization - locks, barriers, queues
Leader election in a distributed system
Centralized and highly reliable (simple) data registry
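The leader-election bullet above rests on one simple rule: each candidate creates an ephemeral sequential znode, and whoever holds the lowest sequence number is the leader. The sketch below models only that rule in plain Python (no real ZooKeeper client; the candidate names and sequence numbers are invented):

```python
# Sketch of ZooKeeper-style leader election logic (not a real client).
# Each candidate owns an ephemeral *sequential* node; the candidate with
# the lowest sequence number wins. If the leader dies, its ephemeral node
# vanishes and the next-lowest candidate takes over.
def elect_leader(znodes):
    """znodes maps candidate name -> sequence number of its ephemeral node."""
    return min(znodes, key=znodes.get)

candidates = {"node-a": 7, "node-b": 3, "node-c": 12}
print(elect_leader(candidates))  # node-b holds the lowest sequence number
```

In a real deployment each candidate watches only the node immediately below its own sequence number, which avoids a thundering herd when the leader fails.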
Zookeeper (cont.)
HBase (cont.)
HDFS vs HBase
HDFS: a distributed file system suitable for storing large files.
HBase: a database built on top of HDFS.
Further Readings:
OVERVIEW
Who We Are
We help the world's pioneering companies solve their most important business and
technology innovation challenges, from customer to chip.
With more than 12,000 talented designers and engineers, we work with clients to
anticipate disruption and transform products and services for the digital era.
25 YEARS OF INNOVATION
INNOVATING FOR THE DIGITAL ERA
Our Industries
Media & Entertainment, Consumer Electronics, Industrial,
Software & Internet Services, Telecommunications, Automotive,
Networking, Semiconductor
Ubiquitous Connectivity
New Business Models
Data Explosion
Internet of Things
Copyright 2017 Aricent. All rights reserved.
2007 - Created the world's first WiMAX and small cell base station
Thank You.
www.aricent.com