An Overview
Srinivasa D.K. (srinivasa.kodandarama@aricent.com)
Big Data Architect
PSS Cisco CMS
TABLE OF CONTENTS
1. Big Data
2. Hadoop
3. HDFS Architecture
4. MapReduce
5. YARN
6. Flume
7. Sqoop
8. Hive
9. Pig
10. Oozie
11. Zookeeper
12. NoSQL HBase
13. Further Readings
What is Big Data?
Extremely large data sets that may be analysed computationally to reveal patterns,
trends, and associations, especially relating to human behaviour and interactions.
What Happens on the Internet in a minute?
What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big Data is the term for a collection of data sets so large and
complex that they become difficult to process using on-hand database
management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing,
transfer, analysis and visualization.
Hadoop
What is Hadoop?
Hadoop is a software framework that provides a parallel processing
environment on a distributed file system using commodity hardware.
Hadoop is all about storage (HDFS) and processing (MapReduce).
Why Hadoop?
We are generating data faster than ever (automation, digitalization,
IoT, smartphones, social media, etc.)
This data has many valuable applications (market research, product
recommendation, demand forecasting, fraud detection, etc.)
We must process it to visualize and extract that value for making
better decisions.
Hadoop Pillars
Storage
Provided via HDFS
Distributed (think striping)
Redundant (think mirroring)
Processing
Provided via MapReduce
Distributed / parallel
Both pillars require long-running daemons (YARN)
Coordination and delegation on masters (NameNode)
Raw storage & processing on slaves (DataNode)
Hadoop History
HDFS Architecture
HDFS Architecture
Based on GFS (the Google File System)
HDFS is a distributed filesystem
Master and slave daemons
Files are split into chunks (blocks) and distributed across data nodes (slaves)
Each block is replicated (default 3x) for durability & concurrent access
Write Once Read Many (WORM)
No "appends" or edits (appends were added in later HDFS versions)
Self-healing - can sustain the loss of:
a whole slave node
block(s) of data
HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
HDFS provides high throughput access to application data and is suitable for
applications that have large data sets.
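The splitting and replication described above can be sketched with some simple arithmetic. This is only an illustration: the 128 MB block size is the common Hadoop 2 default, and the round-robin replica placement below is made up for brevity (the real NameNode is rack-aware).

```python
# Sketch: how HDFS conceptually splits a file into replicated blocks.
# Block size, replication factor, and placement are illustrative only.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB (common Hadoop 2 default)
REPLICATION = 3                  # default replication factor

def plan_blocks(file_size, datanodes):
    """Return a list of (block_id, [replica_nodes]) for a file of file_size bytes."""
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    plan = []
    for b in range(n_blocks):
        # Naive round-robin placement; real HDFS also considers racks.
        replicas = [datanodes[(b + i) % len(datanodes)] for i in range(REPLICATION)]
        plan.append((b, replicas))
    return plan

plan = plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(plan))     # 300 MB -> 3 blocks
print(plan[0][1])    # each block lives on 3 distinct DataNodes
```

Losing any single DataNode leaves at least two replicas of every block, which is why the cluster can self-heal by re-replicating from the survivors.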
HDFS Daemons
There are 2 daemons in classical HDFS
NameNode (master)
DataNode (slave)
NameNode
Master node with which clients must initiate
reads/writes
Holds all metadata about a file:
File name, permissions, directory
Which nodes contain which blocks
Metadata stored both in memory & on disk
Very memory hungry
Disk backups of metadata are very important!
If you lose the NameNode, you lose HDFS
No file block data ever flows through the NameNode
DataNode(s)
Slave nodes which perform actual block
storage
Clients write & read blocks directly from the slaves
MapReduce
What is MapReduce?
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed algorithm on
a cluster.
It assumes Moving Computation is Cheaper than Moving Data
MapReduce is more of a framework than a tool. You have to fit your solution into
the framework of map and reduce, which in some situations might be challenging.
The MapReduce framework gives you the ability to process your data with
distributed computing.
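The map and reduce phases can be sketched in a few lines of plain Python. This is an in-memory toy, not the framework itself: the real system runs map tasks on the nodes holding each data block ("moving computation to the data") and performs the shuffle over the network; here the shuffle is simulated with a dict.

```python
# Minimal in-memory sketch of the MapReduce model: word count.
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs: one (word, 1) per word in the input line.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Aggregate all values that were shuffled to the same key.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

shuffled = defaultdict(list)
for line in lines:
    for k, v in map_phase(line):
        shuffled[k].append(v)        # shuffle & sort: group values by key

counts = dict(reduce_phase(k, vs) for k, vs in shuffled.items())
print(counts["the"])  # 3
```

Fitting a problem into this two-phase shape is exactly the constraint the text mentions: everything must be expressible as independent map calls followed by per-key reduces.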
MapReduce Flow
Hadoop 1 Limitations
Lacks support for alternate paradigms and services
Forces everything to look like MapReduce
Iterative applications in MapReduce are ~10x slower
Scalability
Max cluster size ~5,000 nodes
Max concurrent tasks ~40,000
Availability
A JobTracker failure kills queued & running jobs
Hard partition of resources into map and reduce slots
Non-optimal resource utilization
Can't share resources with non-MR applications running on the Hadoop
cluster (e.g. Impala, Giraph, Spark)
Hadoop Next Generation
YARN (Hadoop 2)
Hadoop 2
YARN (Yet Another Resource Negotiator)
YARN Architecture
Hadoop Ecosystem Tools
Hadoop Ecosystem
Flume
Apache Flume is a tool/service/data ingestion mechanism
for collecting, aggregating, and transporting large amounts
of streaming data, such as log data and events, from
various web servers to a centralized data store.
It is a highly reliable, distributed, and configurable tool
that is principally designed to transfer streaming data from
various sources to HDFS.
Enables quick iteration on new collection strategies
Component - Definition/Function
Event - A singular unit of data transported by Flume (typically a single log entry).
Source - The entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
Sink - The entity that delivers the data to its destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS (others: HBase, Solr, ElasticSearch).
Channel - The conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel.
Agent - Any physical Java virtual machine running Flume; a collection of sources, sinks and channels.
Client - The entity that produces and transmits the Event to the Source operating within the Agent.
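A minimal agent definition ties these components together in a Flume properties file. The agent, source, channel, and sink names below (agent1, src1, ch1, sink1), the tailed log path, and the HDFS URL are all made up for illustration:

```properties
# One agent named "agent1" with one source, one channel, one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: tail a log file (exec source)
agent1.sources.src1.type     = exec
agent1.sources.src1.command  = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# Sink: drain the channel into HDFS
agent1.sinks.sink1.type      = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel   = ch1
```

Events flow client → source → channel → sink; swapping the memory channel for a file channel trades throughput for durability.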
Flume (cont.)
Sqoop
Apache Sqoop is a tool designed for efficiently
transferring bulk data between Apache Hadoop
and structured datastores such as relational
databases (RDBMS)
Sqoop imports data from external structured
datastores into HDFS or related systems like
Hive and HBase.
Sqoop can also export data from Hadoop to external
structured datastores such as relational databases
and enterprise data warehouses.
Examples:-
$ sqoop export \
  --connect jdbc:postgresql://hdp-master/sqoop_db \
  --username sqoop_user \
  --password postgres \
  --table cities \
  --export-dir cities

$ sqoop import \
  --connect jdbc:postgresql://hdp-master/sqoop_db \
  --username sqoop_user \
  --password postgres \
  --table cities
Hive
It is a Data Warehousing (SQL) layer on Hadoop
Facilitates data summarization, ad-hoc queries and analysis of large datasets
Query using an SQL-like language called HiveQL; custom mappers and reducers can also be used
Hive tables can be defined directly on HDFS files via SerDes and customized formats
(SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO: it allows Hive to read data from a table and write it back out to HDFS in any
custom format.)
Tables can be partitioned, with data loaded separately into each partition for scale.
Tables can be clustered on certain columns for query performance.
The schema is stored in an RDBMS. Hive has complex column types (map, array, struct) in addition to
atomic types.
User Interface - Hive is a data warehouse infrastructure software that creates interaction
between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI,
the Hive command line, and Hive HD Insight (on Windows Server).
Metastore - Hive chooses respective database servers to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine - HiveQL is similar to SQL for querying schema information in the
Metastore. It is one of the replacements for the traditional approach of writing a
MapReduce program: instead of writing the MapReduce program in Java, we can write a
query for the MapReduce job and process it.
Execution Engine - The conjunction of the HiveQL Process Engine and MapReduce is the
Hive Execution Engine. The execution engine processes the query and generates the same
results as MapReduce. It uses the flavor of MapReduce.
HDFS or HBase - The Hadoop Distributed File System or HBase are the data storage
techniques used to store data in the file system.
Hive (cont.)
Storage Format Example
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
      salary String, destination String)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
HiveQL JOIN
HiveQL GROUP BY
Pig
High-level scripting platform for processing
and analyzing large data sets.
It allows Hadoop users to write complex
MapReduce transformations using a
simple scripting language called Pig Latin
The Pig compiler produces sequences of
MapReduce programs
Developed at Yahoo
Used as a data flow language, particularly
well suited to ETL
Requires nothing besides the Pig
interpreter
Makes it much easier to develop parallel
processing programs than writing MR
code directly
User Defined Functions (UDFs) specify
custom processing in Java, Python, or Ruby
Implement EVAL, AGGREGATE, and FILTER
functions (Grunt is Pig's interactive shell)
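The EVAL-UDF idea above can be shown with a tiny Python function. The function name and behavior are hypothetical; with real Pig you would put it in a .py file, REGISTER that file, and invoke the function from Pig Latin, whereas here it is simply called directly:

```python
# Hypothetical Pig-style EVAL UDF in Python: normalize a name field.
# Shown standalone; in Pig it would be REGISTERed and called per-tuple.
def normalize_name(value):
    if value is None:            # Pig passes None through for null fields
        return None
    return value.strip().title()

print(normalize_name("  siddarth "))  # Siddarth
```

An EVAL function like this runs once per tuple inside the map or reduce tasks that Pig generates, so it must be side-effect free.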
Pig (cont.)
Creating the input file:-
$ vi Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

grunt> student = LOAD 'Student.txt' USING PigStorage(',')
       AS (id:int, name:chararray, city:chararray);
grunt> DUMP student;

As directed in the command, Pig loads the Student.txt file, and the Dump
operator displays the following content:
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs, and in
some cases Hive operates on HDFS in a similar way to Apache Pig. A few
significant points set Apache Pig apart from Hive:
Apache Pig uses a procedural data-flow language (Pig Latin), was
originally developed at Yahoo, and handles structured as well as
semi-structured data.
Hive uses a declarative, SQL-like language (HiveQL), was originally
developed at Facebook, and is mainly used for structured data and reporting.
Zookeeper
Configuration Management
Cluster member nodes bootstrap their configuration from a centralized source in an
unattended way
Easier, simpler deployment/provisioning
Distributed Cluster Management
Node join / leave
Node statuses in real time
Naming service, e.g. DNS
Distributed synchronization - locks, barriers, queues
Leader election in a distributed system
Centralized and highly reliable (simple) data registry
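The leader-election bullet above rests on one simple rule: each candidate creates an ephemeral sequential znode, and whoever holds the lowest sequence number is the leader. The sketch below models only that rule in plain Python (no real ZooKeeper client; the candidate names and sequence numbers are invented):

```python
# Sketch of ZooKeeper-style leader election logic (not a real client).
# Each candidate owns an ephemeral *sequential* node; the candidate with
# the lowest sequence number wins. If the leader dies, its ephemeral node
# vanishes and the next-lowest candidate takes over.
def elect_leader(znodes):
    """znodes maps candidate name -> sequence number of its ephemeral node."""
    return min(znodes, key=znodes.get)

candidates = {"node-a": 7, "node-b": 3, "node-c": 12}
print(elect_leader(candidates))  # node-b holds the lowest sequence number
```

In a real deployment each candidate watches only the node immediately below its own sequence number, which avoids a thundering herd when the leader fails.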
Zookeeper (cont.)
HBase (cont.)
HDFS vs HBase
HDFS: a distributed file system suitable for storing large files.
HBase: a database built on top of HDFS.
Further Readings:
OVERVIEW
Who We Are
We help the world's pioneering companies solve their most important business and
technology innovation challenges, from customer to chip.
With more than 12,000 talented designers and engineers, we work with clients to
anticipate disruption and transform products and services for the digital era.
25 YEARS OF INNOVATION
INNOVATING FOR THE DIGITAL ERA
Our Industries
Media & Entertainment, Consumer Electronics, Industrial,
Software & Internet Services, Telecommunications, Automotive,
Networking, Semiconductor
Ubiquitous Connectivity
New Business Models
Data Explosion
Internet of Things
Copyright 2017 Aricent. All rights reserved.
2007 - Created the world's first WiMAX and small cell base station
Thank You.
www.aricent.com