
Big Data and Hadoop

By
Ujjwal Kumar Gupta

Contents
Why Big Data & Hadoop
Drawbacks of Traditional Database
Hadoop History
What is Hadoop & How it Works
Hadoop Cluster
Hadoop Ecosystem

Course Topics
Week 1
Understanding Big Data
Hadoop Architecture
Week 2
Introduction to Hadoop 2.x
Data Loading Techniques
Hadoop Project Environment
Week 3
Hadoop MapReduce Framework
Programming MapReduce
Hadoop Installation & Cluster Setup
Week 4
Analytics using Pig
Understanding Pig Latin
Week 5
Analytics using Hive
Understanding HiveQL
Sqoop Connectivity
Week 6
NoSQL Databases
Understanding HBase
ZooKeeper
Week 7
Apache Spark Framework
Programming Spark
Week 8
Real-world Datasets and Analysis
Project Discussion

Why Big Data & Hadoop ?


Following are the reasons why Big Data is needed:
90% of the data in the world today has been created in the last
two years alone.
80% of the data is unstructured or exists in widely varying
structures, which are difficult to analyze.
Structured formats have some limitations with respect to handling
large quantities of data.
It is difficult to integrate information distributed across multiple
systems.
Most business users do not know what should be analyzed.
Potentially valuable data is dormant or discarded.
It is too expensive to justify the integration of large volumes of
unstructured data.
A lot of information has a short, useful lifespan.

Why Big Data & Hadoop ?

Drawbacks of Traditional Databases
Expensive - out of reach for small and mid-size companies.
Scalability - expanding the system as data grows is a challenging task.
Time consuming - it takes a lot of time to store and process data.

What is Hadoop
An open-source framework designed for the storage and
processing of large-scale data on clusters of commodity hardware.
Created by Doug Cutting in 2006.
Cutting named the project after his son's toy elephant.

How Hadoop Works


When data is loaded onto the system, it is divided into blocks,
typically 64 MB or 128 MB each.
Processing is divided into two phases:
Map tasks, which run on small portions of the data on the nodes where
that data is stored.
Reduce tasks, which combine the intermediate results to produce
the final output.
A master program allocates work to individual nodes.
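To see this block layout for a file already stored in HDFS, the filesystem checker can
list its blocks and their locations. A quick sketch (the path /user/sample.txt is just an
example):
hadoop fsck /user/sample.txt -files -blocks -locations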

3 Vs of Big Data
Volume, Velocity, and Variety

Big Data Sources


The sources of Big Data are:
web logs;
sensor networks;
social media;
internet text and documents;
internet pages;
search index data;
atmospheric science, astronomy, biochemical and medical
records;
scientific research;
military surveillance; and
photography archives.

Hadoop Cluster


Who Uses Hadoop

Use Cases of Hadoop

Hadoop Commands
1. Print the Hadoop version
hadoop version
2. List the contents of the root directory in HDFS
hadoop fs -ls /
3. Report the amount of space used and available on currently mounted
filesystem
hadoop fs -df hdfs:/
4. Count the number of directories, files and bytes
hadoop fs -count hdfs:/
5. Run a DFS filesystem checking utility
hadoop fsck /

Hadoop Commands
6. Create a new directory
hadoop fs -mkdir /user/
7. Upload a file to HDFS from a local directory
hadoop fs -put data/sample.txt /user/
8. View the contents of a file
hadoop fs -cat /text.txt
9. Delete a file from HDFS
hadoop fs -rm /usr/text.txt
10. Delete a directory
hadoop fs -rm -r /user/

Hadoop Commands
11. Download a file from HDFS to the local system
hadoop fs -get /user/test.txt /home/hadoop/
12. Copy a file from one directory to another
hadoop fs -cp /usr/text.txt /input/
13. Move a file from one directory to another
hadoop fs -mv /text.txt /input/
14. Change the replication factor of a file
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
15. Copy data from one cluster to another
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

MapReduce Overview

A method for distributing computation across multiple nodes.
Each node processes the data that is stored at that node.
Consists of two main phases:
Map
Reduce

MapReduce Features

Automatic parallelization and distribution
Fault tolerance
Provides a clean abstraction for programmers to use

MapReduce Features

The key reason to perform mapping and reducing is to speed up the
execution of a specific process by splitting the process into a
number of tasks, thus enabling parallel work.

[Figure: individual (serial) work vs. parallel work]

Word Count Example

Count the number of words:


This quick brown fox jumps over the lazy dog. A dog is a
man's best friend.

Map execution consists of the following phases:

Map Phase
Reads the assigned input split from HDFS.
Parses the input into records (key/value pairs).
Applies the map function to each record.
Informs the master node of its completion.

Partition Phase
Each mapper must determine which reducer will receive each of its outputs.
For any key, the destination partition is the same.
Number of partitions = number of reducers.

Shuffle Phase
Fetches input data from all map tasks for the portion corresponding to the
reduce task's bucket.

Sort Phase
Merge-sorts all map outputs into a single sorted run.

Reduce Phase
Applies the user-defined reduce function to the merged run.
Arguments: a key and its corresponding list of values.
Writes the output to a file in HDFS.
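To connect these phases to code, below is a minimal word-count sketch written against the
standard Hadoop Java MapReduce API (org.apache.hadoop.mapreduce). It mirrors the bundled
example but is a simplified illustration, not the exact class shipped with Hadoop.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in the assigned input split
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
      }
    }
  }
  // Reduce phase: sum the counts received for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    // Job setup: wire the mapper, combiner, and reducer, then submit and wait
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}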

Running the wordcount example

The Hadoop distribution ships with some sample MapReduce programs, so we can run them
directly to test MapReduce.
1. Copy the examples jar from the installation
/<hadoop install path>/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar
2. Put a text file into HDFS as the input
hadoop fs -put input.txt
3. Run the WordCount program from the examples jar
hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount input.txt output
4. Check the result in the output folder
hadoop fs -cat output/part*

Introduction to Pig
Pig is one of the components of the Hadoop eco-system.
Pig is a high-level data flow scripting language.
Pig runs on the Hadoop clusters.
Pig is an Apache open-source project.
Pig uses HDFS for storing and retrieving data and Hadoop MapReduce for
processing Big Data.

Data Models
As part of its data model, Pig supports four basic types:

Atom
A simple atomic value.
Example: 'Mike'

Tuple
A sequence of fields, each of which can be any of the data types.
Example: ('Mike', 43)

Bag
A collection of tuples of potentially varying structures; can contain duplicates.
Example: {('Mike'), ('Doug', (43, 45))}

Map
An associative array; the key must be a chararray but the value can be any type.
Example: [name#'Mike', phone#'5551212']
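As an illustration, these types can appear together in a Pig Latin schema. A minimal
sketch (the file people.txt and the field names are hypothetical):

-- name is an atom, details a tuple, friends a bag of tuples, contacts a map
people = LOAD 'people.txt'
    AS (name:chararray,
        details:tuple(age:int, city:chararray),
        friends:bag{f:tuple(fname:chararray)},
        contacts:map[]);
-- print the schema Pig inferred from the AS clause
DESCRIBE people;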

Pig Execution Modes

Local mode: Pig depends on the local OS file system.
MapReduce mode: Pig relies on HDFS and runs its jobs on the Hadoop cluster.
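The mode is chosen when starting Pig with the -x flag; a quick sketch:
$ pig -x local        # local mode: reads and writes the local file system
$ pig -x mapreduce    # MapReduce mode (the default): reads and writes HDFS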

Installing Pig
1. Download pig tar file from apache website
2. Unzip the tar file
$ tar xvzf pig-0.15.0.tar.gz
3. Move file to install location
$ sudo mv pig-0.15.0 /usr/local/pig
4. Set path in bashrc
$ sudo gedit ~/.bashrc
Add following lines at the end of file
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
5. Apply the changes to the current shell
$ source ~/.bashrc

Pig Commands

Loading and Storing Methods

Loading reads data from files in HDFS (or the local file system) into Pig relations.
Storing writes data from the Pig engine back out to HDFS.

Filtering and Transforming

Filtering selects the records that satisfy a conditional clause.
Transforming reshapes records to derive the fields of interest.

Grouping and Sorting

Grouping collects records that share a key into groups.
Sorting arranges the data in ascending or descending order.

Combining and Splitting

Combining performs a union of the data held in two or more relations.
Splitting separates a relation into multiple relations based on a logical condition.

A short Pig Latin script touching each of these operations is sketched below.
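A minimal sketch (the file emp.csv and its fields are invented for illustration):

-- loading: read employee records from a hypothetical CSV file in HDFS
emp = LOAD 'emp.csv' USING PigStorage(',')
      AS (id:int, name:chararray, dept:chararray, salary:double);
-- filtering: keep only the well-paid employees
rich = FILTER emp BY salary > 40000.0;
-- grouping: one group per department
by_dept = GROUP rich BY dept;
-- transforming: compute a count per group
counts = FOREACH by_dept GENERATE group AS dept, COUNT(rich) AS n;
-- sorting: largest departments first
ordered = ORDER counts BY n DESC;
-- splitting: route rows to different relations by a condition
SPLIT emp INTO eng IF dept == 'ENG', rest IF dept != 'ENG';
-- combining: union the two halves back together
all_emp = UNION eng, rest;
-- storing: write a result back to HDFS
STORE ordered INTO 'dept_counts' USING PigStorage(',');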

Introduction to Hive
Hive is a data warehouse system for Hadoop that facilitates ad-hoc
queries and the analysis of large data sets stored in Hadoop.
It provides a SQL-like language called HiveQL (HQL). Due to its SQL-like
interface, Hive is a popular choice for Hadoop analytics.
It provides massive scale-out and fault-tolerance capabilities for data
storage and processing on commodity hardware.
Because it relies on MapReduce for execution, Hive is batch-oriented and has high
latency for query execution.

System Architecture and Components of Hive

The diagram on this slide illustrates the architecture of the Hive system and the role of
Hive and Hadoop in the development process.
[Figure: Hive layer - Command Line Interface, Web Interface, JDBC/ODBC clients, Thrift Server,
Driver (Compiler, Optimizer, Executor), Metastore; Hadoop layer - NameNode, DataNodes,
Resource Manager, Node Managers]

Metastore
Metastore is the component that stores the system catalog and metadata about tables,
columns, partitions, and so on. Metadata is stored in a traditional RDBMS format.
Apache Hive uses the Derby database by default. Any JDBC-compliant database, such as MySQL,
can be used for the Metastore.

Metastore Configuration
The key attributes that should be configured for the Hive Metastore are given below:

Parameter: javax.jdo.option.ConnectionURL
Description: JDBC connect string for a JDBC metastore
Example: jdbc:derby://localhost:1527/metastore_db;create=true

Parameter: javax.jdo.option.ConnectionDriverName
Description: JDBC driver class name
Example: org.apache.derby.jdbc.ClientDriver

Parameter: javax.jdo.option.ConnectionUserName
Description: Username for the database
Example: APP

Parameter: javax.jdo.option.ConnectionPassword
Description: Password for the database
Example: mine

Metastore Configuration Template
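For reference, a sketch of how these properties are typically set in hive-site.xml
(the values simply repeat the Derby examples from the table above):

<?xml version="1.0"?>
<configuration>
  <!-- JDBC connect string for the metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  </property>
  <!-- JDBC driver class -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
  </property>
  <!-- database credentials -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>APP</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mine</value>
  </property>
</configuration>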

Driver
Driver is the component that:
manages the lifecycle of a Hive Query Language (HiveQL) statement as
it moves through Hive; and
maintains a session handle and any session statistics.
Query Compiler
Query compiler compiles HiveQL into a Directed Acyclic Graph (DAG) of
MapReduce tasks.
Query optimizer:
consists of a chain of transformations, so that the operator DAG resulting from
one transformation is passed as an input to the next transformation; and
performs tasks such as column pruning, partition pruning, and repartitioning of data.

The execution engine:


executes the tasks produced by the compiler in proper dependency order.
interacts with the underlying Hadoop instance to ensure perfect
synchronization with Hadoop services.

Hive Server
Hive Server provides a Thrift interface and a Java Database Connectivity/Open Database
Connectivity (JDBC/ODBC) server. It enables the integration of Hive with other applications.
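For example, with HiveServer2 a JDBC client such as Beeline can connect along these lines
(the host, port, and user name here are placeholders):
$ beeline -u jdbc:hive2://localhost:10000/default -n hadoop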

Client Components
A developer uses the client components to perform development in Hive. The client
components include the Command Line Interface (CLI), the web user interface (UI), and the
JDBC/ODBC drivers.
[Figure: Hive client components - Command Line Interface, Web Interface, JDBC and ODBC
drivers, Thrift Server]

Hive Tables
Tables in Hive are analogous to tables in relational databases. A Hive table logically
comprises the data being stored and the associated meta data. Each table has a
corresponding directory in HDFS.
Two types of tables in Hive:
Managed Tables - tables managed by Hive
External Tables - tables managed by the user
CREATE TABLE t1(ds string, ctry float, li list<map<string, struct<p1:int,
p2:int>>>);
CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION
'/user/mytables/mydata';

Hive Data Types

Data Types in Hive

Primitive Types
Integers: TINYINT, SMALLINT, INT, and BIGINT
Boolean: BOOLEAN
Floating-point numbers: FLOAT and DOUBLE
String: STRING

Complex Types
Structs: accessed with dot notation, e.g. {a INT; b INT} gives a.b
Maps: accessed by key, e.g. M['group']
Arrays: indexed by position, e.g. for A = ['a', 'b', 'c'], A[1] returns 'b'

User-defined Types
Structures with attributes
Attributes can be of any type
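As an illustration, a sketch of a table declaration using the complex types (the table and
column names are invented):

CREATE TABLE user_profile (
  name     STRING,
  address  STRUCT<street:STRING, city:STRING, zip:INT>,  -- struct
  phones   MAP<STRING, STRING>,                          -- map
  children ARRAY<STRING>                                 -- array
);
-- fields are then accessed as address.city, phones['home'], children[0]
SELECT address.city, phones['home'], children[0] FROM user_profile;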

Installing Hive
1. Download hive tar file from apache website
2. Unzip the tar file
$ tar xvzf apache-hive-2.0.0-bin.tar.gz
3. Move file to install location
$ sudo mv apache-hive-2.0.0-bin /usr/local/hive
4. Set path in bashrc
$ sudo gedit ~/.bashrc
Add following lines at the end of file
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
5. Apply the changes to the current shell
$ source ~/.bashrc
6. Initialize the Metastore database
$ schematool -initSchema -dbType derby

Hive Query Language


1. Create Database
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
2. List All Databases
SHOW DATABASES;
3. Using a Database
USE dbname;
4. Deleting a Database
DROP DATABASE [IF EXISTS] userdb; // used to delete empty database
DROP DATABASE [IF EXISTS] userdb CASCADE; // used to delete database as well as all tables
of that database
5. Displaying Current database in hive shell
$ hive --hiveconf hive.cli.print.current.db=true

Hive Query Language


1. Create Table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note:
TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO
num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]] -- (Note: Available in Hive 0.10.0 and later)
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available
in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and
later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external
tables)

Hive Query Language


1. Create Table Sample
CREATE TABLE IF NOT EXISTS employee
(
eid int, name String,
salary String, destination String
)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
2. List All tables
Show tables;
3. Checking schema of a table
Describe tablename;
Describe extended tablename;
show create table emp;

Hive Query Language


Loading Data into a table
LOAD DATA [LOCAL] INPATH '/home/user/sample.txt'
[OVERWRITE] INTO TABLE employee;

INSERT OVERWRITE TABLE t1 SELECT * FROM t2;
INSERT OVERWRITE TABLE sample1 SELECT * FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

// Multiple-table insert using a single source table
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
GROUP BY year;

Hive Query Language


1. Altering a table
ALTER TABLE employee RENAME TO emp;
ALTER TABLE employee ADD COLUMNS ( dept STRING COMMENT 'Department name');
ALTER TABLE employee CHANGE name ename String;
ALTER TABLE employee CHANGE salary salary Double;
// Hive has no direct DROP COLUMN; remove a column by replacing the column list:
ALTER TABLE employee REPLACE COLUMNS (empid INT, ename String);
2. Deleting a table
DROP TABLE [IF EXISTS] employee;

Hive Query Language


1. Querying records from a table
SELECT * FROM employee WHERE Id=1205;
SELECT * FROM employee WHERE Salary>=40000;
SELECT * FROM <tablename> LIMIT 10;
SELECT * FROM <tablename> WHERE freq > 100 SORT BY freq ASC LIMIT 10;
SELECT 20+30 ADD FROM temp;
SELECT * FROM employee WHERE Salary>40000 AND Dept='TP';
SELECT round(2.6) FROM temp;
SELECT floor(2.6) FROM temp;
SELECT ceil(2.6) FROM temp;
SELECT Id, Name, Dept FROM employee ORDER BY Dept;
SELECT Dept, count(*) FROM employee GROUP BY Dept;

Hive Query Language


1. JOINS in a table
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID =
o.CUSTOMER_ID);
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON
(c.ID = o.CUSTOMER_ID);

Partitions in Hive Table


Partitions are analogous to dense indexes on columns. Following
are the features of partitions:
They contain nested sub-directories in HDFS for each combination of
partition column values.
They allow users to retrieve rows efficiently.
Partitions Types
Static Partition (Default)
Dynamic Partition

Static Partition
Static partitioning is the default type of partitioning in Hive. With static
partitions we have to create every partition of the table manually and load
data into each partition separately.
1. Creating Partition Table
CREATE TABLE foo (id INT, msg STRING)
PARTITIONED BY (dt STRING);
2. Loading Data in Partition table
LOAD DATA LOCAL INPATH '/home/user/sample.txt'
INTO TABLE employee
PARTITION (country = 'US', state = 'CA');
3. Listing partitions of a table
Show partitions emp;

Static Partition
1. Altering Partition Table
ALTER TABLE employee
ADD PARTITION (year=2013)
location '/2012/part2012';
ALTER TABLE employee PARTITION (year=1203)
RENAME TO PARTITION (Yoj=1203);
ALTER TABLE employee DROP [IF EXISTS]
PARTITION (year=1203);
INSERT OVERWRITE TABLE test_part PARTITION(ds='2009-01-01',
hr=12)
SELECT * FROM t ;
2. Querying a Partition table
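A partitioned table is queried like any other table; placing the partition column in the
WHERE clause lets Hive prune to just the matching partition. A sketch using the employee
table and partition from the examples above:
SELECT * FROM employee WHERE year = 2013;
SHOW PARTITIONS employee;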

Dynamic Partition
Dynamic partitioning is used to create partitions in a table automatically; we don't have to
provide a separate file for each partition. Dynamic partitioning is disabled by default, so
to use it we have to enable it first.
Enabling Dynamic Partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=2048;
SET hive.exec.max.dynamic.partitions.pernode=256; // in case of a cluster
SET hive.exec.dynamic.partition.mode=nonstrict;

Loading Data in Dynamic Partition

We can't directly load data into a dynamic partition using the LOAD command; we first have
to load the data into a normal table and then insert from that table into the partitioned
table.
Rules for Dynamic Partition
Values for the partition columns must not be specified in the PARTITION clause.
Partition columns must appear at the end of the SELECT clause, in the same order as they
are specified in the PARTITIONED BY clause.

Dynamic Partition
Loading Data in Dynamic Partition
CREATE TABLE part_u (
id int, name string)
PARTITIONED BY (
year INT, month INT, day INT);
CREATE TABLE users (
id int, name string, dt DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/user/sample.txt'
INTO TABLE users;
INSERT INTO TABLE part_u PARTITION(year,month,day)
SELECT id, name, year(dt), month(dt), day(dt)
FROM users;

Bucketing in Hive Tables

Similar to partitioning, we can divide table data into buckets. Bucketing splits the table
data into a fixed number of roughly equal-sized buckets, specified when the table is created.
Bucketing is used where partitioning falls short: partitions are often unevenly sized, which
can waste storage and skew processing.
Creating Bucketed Table
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS;
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;
Listing Records from a particular bucket
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
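To make sure inserted rows actually land in the declared buckets, older Hive releases
(before 2.x, where bucketing is always enforced) need bucketing enforcement switched on.
A sketch, assuming a plain users table as the source:
SET hive.enforce.bucketing = true;   -- not needed on Hive 2.x and later
INSERT OVERWRITE TABLE bucketed_users
SELECT id, name FROM users;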
