
HDP Operations: Install and Manage

With Apache Ambari


A Hortonworks University
Hadoop Training Course

Copyright 2014, Hortonworks, Inc. All rights reserved.

Title: HDP Operations: Install and Manage with Apache Ambari


Version: GA
Revision: 3
Date: Jul 30, 2014
Copyright 2013-2014 Hortonworks Inc. All rights reserved.

Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
The contents of this course and all its related materials, including lab exercises and files, are Copyright Hortonworks
Inc. 2014.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of
Hortonworks Inc. All rights reserved.


Table of Contents
Table of Contents ...................................................................................................................... 4
Course Introduction.............................................................................................................. 10
Unit 1: Introduction to HDP and Hadoop 2.0 ............................................................... 11
Enterprise Data Trends @ Scale ................................................................................................. 12
What is Big Data? ............................................................................................................................. 13
A Market for Big Data ..................................................................................................................... 14
Most Common New Types of Data.............................................................................................. 15
Moving from Causation to Correlation..................................................................................... 17
What is Hadoop? .............................................................................................................................. 19
What is Hadoop 2.0? ....................................................................................................................... 20
Traditional Systems vs. Hadoop ................................................................................................. 21
Overview of a Hadoop Cluster ..................................................................................................... 22
Who is Hortonworks?..................................................................................................................... 23
The Hortonworks Data Platform ............................................................................................... 24
Use Case: EDW before Hadoop .................................................................................................... 26
Banking Use Case: EDW with HDP ............................................................................................. 27

Unit 2: HDFS Architecture................................................................................................... 29


What is a File System? .................................................................................................................... 30
OS Architecture ................................................................................................................................ 31
HDFS Architecture ........................................................................................................................... 32
Understanding Block Storage...................................................................................................... 34
Demonstration: Understanding Block Storage ..................................................................... 36
The NameNode ................................................................................................................................. 39
The DataNodes.................................................................................................................................. 41
DataNode Failure ............................................................................................................................. 43
HDFS Clients ...................................................................................................................................... 45

Unit 3: Installation Prerequisites and Planning ......................................................... 47


Minimum Hardware Requirements .......................................................................................... 48
Minimum Software Requirements ............................................................................................ 49
A Formidable Starter Cluster ...................................................................................................... 50
Lab 3.1: Setting up the Environment ........................................................................................ 52
Lab 3.2: Install HDP 2.0 Cluster using Ambari ...................................................................... 54

Unit 4: Configuring Hadoop ................................................................................................ 65


Configuration Considerations ..................................................................................................... 66
Deployment Layout ......................................................................................................................... 67
Configuring HDFS............................................................................................................................. 69
What is Ambari ................................................................................................................................. 72
Configuration via Ambari ............................................................................................................. 73
Management ...................................................................................................................................... 74
Monitoring ......................................................................................................................................... 76
REST API.............................................................................................................................................. 77
Lab 4.1: Add a New Node to the Cluster ................................................................................... 79
Lab 4.2: Stopping and Starting HDP Services......................................................................... 81


Lab 4.3: Using HDFS Commands ................................................................................................. 87

Unit 5: Ensuring Data Integrity ......................................................................................... 95


Ensuring Data Integrity ................................................................................................................. 96
Replication Placement ................................................................................................................... 98
Data Integrity - Writing Data .................................................................................................. 100
Data Integrity - Reading Data ................................................................................................. 102
Data Integrity - Block Scanning ................................................................................................ 103
Running a File System Check ..................................................................................................... 105
What Does the File System Check Look For? ....................................................................... 106
hdfs fsck Syntax .............................................................................................................................. 108
Data Integrity - File System Check: Commands & Output ............................................ 110
The dfs Command .......................................................................................................................... 112
NameNode Information ............................................................................................................... 114
Changing the Replication Factor .............................................................................................. 115
Lab 5.1: Verify Data with Block Scanner and fsck .............................................................. 117

Unit 6: HDFS NFS Gateway ................................................................................................ 123

HDFS NFS Gateway Introduction .............................................................................................. 124


NFS Gateway Node ......................................................................................................................... 126
Configuring the HDFS NFS Gateway ........................................................................................ 127
Starting the NFS Gateway Service ............................................................................................ 129
User Authentication ...................................................................................................................... 131
Lab 6.1: Mounting HDFS to a Local File System ................................................................... 133

Unit 7: YARN Architecture and MapReduce ............................................................... 136

What is YARN? ................................................................................................................................. 137


Hadoop as Next-Gen Platform ................................................................................................... 138
Beyond MapReduce ...................................................................................................................... 140
YARN Use-case ................................................................................................................................ 141
YARN Bird's Eye View .................................................................................................. 142
Lifecycle of a YARN Application ................................................................................................ 144
ResourceManager .......................................................................................................................... 146
NodeManager .................................................................................................................................. 147
MapReduce....................................................................................................................................... 148
Demonstration: Understanding MapReduce ....................................................................... 150
Configuring YARN .......................................................................................................................... 152
Configuring MapReduce .............................................................................................................. 154
Tools ................................................................................................................................................... 156
Lab 7.1: Troubleshooting a MapReduce Job......................................................................... 159

Unit 8: Job Schedulers ........................................................................................................ 165

Overview of Job Scheduling ....................................................................................................... 166


The Built-in Schedulers ............................................................................................................... 167
Overview of the Capacity Scheduler ....................................................................................... 168
Configuring the Capacity Scheduler ........................................................................................ 170
Defining Queues ............................................................................................................................. 171
Configuring Capacity Limits ....................................................................................................... 173
Configuring User Limits............................................................................................................... 175
Configuring Permissions ............................................................................................................. 176
Overview of the Fair Scheduler ................................................................................................ 177
Configuration of the Fair Scheduler ........................................................................................ 178


Lab 8.1: Configuring the Capacity Scheduler ....................................................................... 180

Unit 9: Enterprise Data Movement ................................................................................ 185


Enterprise Data Movement ........................................................................................................ 186
Challenges with a Traditional ETL Platform ........................................................................ 188
Hadoop Based ETL Platform ...................................................................................................... 189
Data Ingestion ................................................................................................................................. 190
Hadoop: Data Movement ............................................................................................................. 191
Defining Data Layers .................................................................................................................... 193
Distributed Copy (distcp) Command ...................................................................................... 194
Distcp Options................................................................................................................................. 196
Using distcp...................................................................................................................................... 198
Using distcp for Backups ............................................................................................................. 200
Lab 9.1: Use distcp to Copy Data from a Remote Cluster ................................................. 202

Unit 10: HDFS Web Services ............................................................................................. 204

What is WebHDFS? ........................................................................................................................ 205


Setting up WebHDFS ..................................................................................................................... 207
Using WebHDFS .............................................................................................................................. 209
WebHDFS Authentication ........................................................................................................... 211
Copying Files to HDFS................................................................................................................... 213
Hadoop HDFS over HTTP ............................................................................................................ 218
Who Uses WebHCat REST API?.................................................................................................. 220
Running WebHCat ......................................................................................................................... 223
Using WebHCat ............................................................................................................................... 224
Lab 10.1: Using WebHDFS........................................................................................................... 226

Unit 11: Hive Administration .......................................................................................... 230

Introduction to Hive ..................................................................................................................... 231


Comparing Hive with RDBMS .................................................................................................... 232
Hive MetaStore ............................................................................................................................... 233
HiveServer2 ..................................................................................................................................... 235
Hive Command Line Interface ................................................................................................... 236
Processing Hive SQL Statements .............................................................................................. 238
Hive Data Hierarchical Structures........................................................................................... 239
Hive Tables....................................................................................................................................... 242
Defining a Hive-Managed Table................................................................................................ 243
Defining an External Table ......................................................................................................... 244
Defining a Table LOCATION ....................................................................................................... 244
Loading Data into Hive ................................................................................................................ 245
Performing Queries ...................................................................................................................... 247
Guidelines for Architecting Hive Data.................................................................................... 248
Hive Query Optimizations .......................................................................................................... 249
Hive/MR versus Hive/Tez .......................................................................................................... 250
ORCFile Example ............................................................................................................................ 251
Compression .................................................................................................................................... 252
Hive Security ................................................................................................................................... 254
Lab 11.1: Understanding Hive Tables .................................................................................... 256

Unit 12: Transferring Data with Sqoop ........................................................................ 262


Overview of Sqoop ......................................................................................................................... 263
The Sqoop Import Tool ................................................................................................................ 265


Importing a Table .......................................................................................................................... 267


Importing Specific Columns ....................................................................................................... 269
Importing from a Query .............................................................................................................. 270
The Sqoop Export Tool ................................................................................................................ 272
Exporting to a Table...................................................................................................................... 274
Lab 12.1: Using Sqoop .................................................................................................................. 276

Unit 13: Flume....................................................................................................................... 284

Flume Introduction ....................................................................................................................... 285


Installing Flume ............................................................................................................................. 287
Flume Events ................................................................................................................................... 289
Flume Sources ................................................................................................................................. 290
Flume Channels .............................................................................................................................. 292
Flume Channel Selectors ............................................................................................................. 294
Flume Channel Selector ............................................................................................................... 295
Flume Sinks ...................................................................................................................................... 296
Multiple Sinks ................................................................................................................................. 298
Flume Interceptors ....................................................................................................................... 300
Design Patterns .............................................................................................................................. 302
Configuring Individual Components....................................................................................... 303
Flume Netcat Source Example ................................................................................................... 305
Flume Exec Source Example ...................................................................................................... 307
Flume Configuration ..................................................................................................................... 308
Monitoring Flume .......................................................................................................................... 310
Lab 13.1: Install and Test Flume .............................................................................................. 312

Unit 14: Oozie ........................................................................................................................ 315

Oozie Overview............................................................................................................................... 316


Oozie Components......................................................................................................................... 317
Jobs, Workflows, Coordinators, Bundles............................................................................... 318
Workflow Actions and Decisions ............................................................................................. 320
Oozie Actions ................................................................................................................................... 321
Oozie Job Submission ................................................................................................................... 323
Oozie Server Workflow Coordinator ...................................................................................... 324
Oozie Console .................................................................................................................................. 325
Interfaces to Oozie......................................................................................................................... 326
Oozie Server Configuration ........................................................................................................ 327
Oozie Scripts .................................................................................................................................... 330
The Oozie CLI................................................................................................................................... 332
Using the Oozie CLI........................................................................................................................ 334
Submit Jobs through HTTP ......................................................................................................... 336
Lab 14.1: Running an Oozie Workflow................................................................................... 338

Unit 15: Monitoring HDP2 Services ............................................................................... 344

Ambari ............................................................................................................................................... 345


Monitoring Architecture ............................................................................................................. 347
Monitoring HDP2 Clusters .......................................................................................................... 348
Ambari Web Interface .................................................................................................................. 350
Ambari Web Interface (cont.) ................................................................................................... 352
Ganglia ............................................................................................................................................... 354
Ganglia Monitoring a Hadoop Cluster .................................................................................... 355


Nagios................................................................................................................................................. 357
Nagios UI ........................................................................................................................................... 359
Monitoring JVM Processes .......................................................................................................... 360
Understanding JVM Memory ..................................................................................................... 362
Eclipse Memory Analyzer ........................................................................................................... 364
JVM Memory Heap Dump ............................................................................................................ 366
Java Management Extensions (JMX) ....................................................................................... 368

Unit 16: Commissioning and Decommissioning Nodes .......................................... 370


Architectural Review.................................................................................................................... 371
Decommissioning and Commissioning Nodes .................................................................... 373
Decommissioning Nodes ............................................................................................................. 374
Steps for Decommissioning a DataNode................................................................................ 376
Decommissioning Node States .................................................................................................. 378
Steps for Commissioning a Node .............................................................................................. 379
Balancer ............................................................................................................................................ 381
Balancer Threshold Setting ....................................................................................................... 382
Configuring Balancer Bandwidth............................................................................................. 384
Lab 16.1: Commissioning & Decommissioning DataNodes ............................................ 386

Unit 17: Backup and Recovery ........................................................................................ 392

What should you backup? ........................................................................................................... 393


HDFS Snapshots .............................................................................................................................. 394
HDFS Data - Backups .................................................................................................................... 395
HDFS Data - Automate & Restore .......................................................................................... 396
Hive & Ambari Backup ................................................................................................................. 397
Lab 17.1: Using HDFS Snapshots .............................................................................................. 399

Unit 18: Rack Awareness and Topology ...................................................................... 402

Rack Awareness ............................................................................................................................. 403


YARN Rack Awareness ................................................................................................................. 404
Replica Placement ......................................................................................................................... 405
Rack Topology ................................................................................................................................ 406
Rack Topology Script.................................................................................................................... 408
Configuring the Rack Topology Script.................................................................................... 409
Lab 18.1: Configuring Rack Awareness ................................................................................. 411

Unit 19: NameNode HA ...................................................................................................... 414

NameNode Architecture HDP1 ................................................................................................. 415


NameNode High Availability ...................................................................................................... 416
HDFS HA Components .................................................................................................................. 417
Understanding NameNode HA .................................................................................................. 419
NameNodes in HA .......................................................................................................................... 421
Failover Modes ............................................................................................................................... 423
hdfs haadmin Command ............................................................................................................. 426
Red Hat HA ....................................................................................................................................... 427
VMware HA ...................................................................................................................................... 428
Lab 19.1: Implementing NameNode HA................................................................................. 430

Unit 20: Securing HDP ........................................................................................................ 437

Security Concepts .......................................................................................................................... 438


Kerberos Synopsis ......................................................................................................................... 440


HDP Security Overview................................................................................................................ 442


Securing HDP - Authentication ............................................................................................... 444
Securing HDP - Authorization ................................................................................................... 445
Lab 20.1: Securing a HDP Cluster ............................................................................................. 446

Appendix A: Unit Review Answers ................................................................................ 451


Appendix B: Other Hadoop Tools................................................................................... 456

Data Lifecycle Management ....................................................................................................... 457


Data Lifecycle Management on Hadoop ................................................................................ 458
Falcon Use Cases and Capabilities ........................................................................................... 459
Falcon ................................................................................................................................................. 460
Future: Knox .................................................................................................................................... 461
ZooKeeper Synopsis ..................................................................................................................... 462
Configuring ZooKeeper................................................................................................................ 465
HBase Synopsis ............................................................................................................................... 468
Configuring HBase ......................................................................................................................... 470
HCatalog ............................................................................................................................................ 471
NameNode Architecture HDP1 ................................................................................................. 472
Federating NameNodes ............................................................................................................... 474
NameNode Architecture HDP2 ................................................................................................. 476
Namespace Volume ....................................................................................................................... 477
Benefits of Independent Block Pools ...................................................................................... 478
Namespaces Increase Scalability ............................................................................................. 479
Configuring NameServices for DataNodes ............................................................................ 480
Block Management with Federation ....................................................................................... 482
Federation Configuration Parameters ................................................................................... 484


Course Introduction


Welcome to Hortonworks University

Course Agenda

Introductions

Overview of Hortonworks Certification


Unit 1: Introduction to HDP and Hadoop 2.0
Topics covered:

Enterprise Data Trends @ Scale

What is Big Data?

A Market for Big Data

Most Common New Types of Data

Moving from Causation to Correlation

What is Hadoop?

What is Hadoop 2.0?

Traditional Systems vs. Hadoop

Overview of a Hadoop Cluster

Who is Hortonworks?

The Hortonworks Data Platform

Hadoop Use Case

Lab 1.1: Login to Your Cluster


Enterprise Data Trends @ Scale


Organizations are redefining data strategies due to the requirements of the
evolving Enterprise Data Warehouse (EDW).

Slide graphic: machine data, social media, and VoIP data shown alongside traditional enterprise data.

Enterprise Data Trends @ Scale


The volume of data that is available for analysis is transforming organizations, as well as
the entire IT industry. Everyone is seeing data external to an organization as becoming
just as strategic as internal data. Semi-structured and unstructured data volume is
beginning to dwarf the traditional data in relational databases and data warehouses.


Facebook has a data warehouse of around 50 PB, and it's constantly growing.

Twitter messages are 140 bytes each, generating 8 TB of data per day.

Data is more than doubling every year.

Almost 80% of data will be unstructured data.

Netflix: 75% of streaming video results from recommendations.

Amazon: 35% of product sales come from product recommendations.


What is Big Data?


Big data is high-volume, -velocity and -variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight
and decision-making.
Gartner's Big Data definition is broken into three parts:

Part One: 3Vs: Gartner analyst Doug Laney came up with the famous three Vs
(Volume, Velocity and Variety) in 2001.

Part Two: Cost-Effective, Innovative Forms of Information Processing:


Organizations are looking to access unstructured and semi-structured data and
process that data with traditional structured data to perform comprehensive
analysis.

Part Three: Enhanced Insight and Decision Making - The goal of working with
big data is to increase business value and to respond more quickly and with more
accuracy to meet well-defined business objectives.

Source: FORBES - http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-bigdata-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/



A Market for Big Data


IDC is predicting a big data market that will grow revenue at 31.7 percent a year until it
hits the $23.8 billion mark in 2016. That's a big number for a relatively new market, but
it only tells part of the story of where big data technology will make money. Defining
big data isn't always an easy task, and breaking it out into a group of separate
technologies might not be either. While this report appears to subsume a May 2012
report from IDC predicting an $813 million Hadoop market, it certainly doesn't include
the market for analytics software. In July, IDC predicted that market, which is a critical
piece of the overall big data picture, would hit $51 billion by 2016. (IBM's Steve Mills
said he expects IBM to do $15 billion in analytics revenue itself by 2015.)
Despite challenges, such as the lack of clear big data strategies, security concerns, and
the need for workforce re-skilling, the growth potential of Big Data is unprecedented.
Mind Commerce estimates that global spending on Big Data will grow at a CAGR of 48%
between 2014 and 2019. Big Data revenues will reach $135 Billion by the end of 2019.
This report provides an in-depth assessment of the global Big Data market, including a
study of the business case, application use cases, vendor landscape, value chain analysis,
case studies and a quantitative assessment of the industry from 2013 to 2019.

Source: http://www.researchmoz.us/big-data-market-business-case-market-analysisand-forecasts-2014-2019-report.html

Most Common New Types of Data


1. Sentiment: Understand how your customers feel about your brand and products right now

2. Clickstream: Capture and analyze website visitors' data trails and optimize your website

3. Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines

4. Geographic: Analyze location-based data to manage operations where they occur

5. Server Logs: Research logs to diagnose process failures and prevent security breaches

6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents

+ Keep existing data longer!

Most Common New Types of Data

Sentiment: The most commonly cited source, analyzing language usage, text, and
computational linguistics in an attempt to better analyze subjective
information. Many companies are trying to leverage this data to provide
sentiment trackers, identify influencers, etc.

Clickstream: The trail a user leaves behind as he navigates your website.
Analyze the trail to optimize website design.

Sensor/Machine: These are everywhere: cars, health equipment, smartphones,
etc. Nike put one in shoes. Someone also put one in baby diapers! They call it
proactive maintenance.


Geographic: Location-based data, with a common use being location-based
targeting. This data has much wider application in supply chain
optimization across the manufacturing industry, allowing organizations to
optimize routes, predict inventory levels, etc.

Server logs: This one is not new to the IT world. You often lose precious trails
and information when you simply roll over log files. Today, you should not have
to lose this data; you just save the data in Hadoop!

Text: Text is everywhere. We all love to express ourselves - on every blog, article,
news site, and e-commerce site you visit these days, you will find people putting out
their thoughts. And this is on top of already existing text sources like surveys
and the Web content itself. How do you store, search and analyze all this text
data to glean key insights? Hadoop!


Moving from Causation to Correlation


Big data allows your organization to generate results that are more accurate and that
you can have more confidence in, because there is more detailed data that has been
correlated with other sources to provide more accuracy. In addition, the ability to
reduce business latency (the time between when the data hits the disk and when the
data can be used to make business decisions) can be a critical success factor for an
organization's success.
Microsoft was looking to improve the grammar accuracy for Microsoft Word. The
researchers Michele Banko and Eric Brill found the more data they fed into existing
algorithms, the more accurate the algorithms got. They took four common algorithms
and fed them 10 million, 100 million and then 1 billion records. One algorithm that had
an accuracy of 75% went up to 95% when fed more data. The more data they looked at,
the smarter the algorithms got.
Amazon does translations in over 60 languages and its translations are considered the
best. Amazon has massive amounts of data that they have access to and this data gives
them a strategic advantage if they use it properly. Companies are seeing that the more
data they analyze, the more accurate the results become.


Organizations are also looking at extra data that comes from social media and machine
data and correlating it with their existing traditional data. Correlating data from multiple
sources generates much more accurate analysis results.
Businesses that can use big data to generate more detailed results with a higher degree
of accuracy will be at a competitive advantage. It's about being able to "out-Hadoop"
your competition.

"Data-driven decisions are better decisions; it's as simple as that. Using big
data enables managers to decide on the basis of evidence rather than
intuition. For that reason it has the potential to revolutionize management."
- Harvard Business Review, October 2012

"By 2015, organizations that build a modern information management
system will outperform their peers financially by 20 percent."
- Gartner, Mark Beyer, Information Management in the 21st Century


What is Hadoop?
Hadoop is all about processing and storage. Hadoop is a software framework
environment that provides a parallel processing environment on a distributed file
system using commodity hardware. A Hadoop cluster is made up of master processes
and slave processes spread out across different x86 servers. This framework allows
someone to build a Hadoop cluster that offers high-performance, supercomputer-like
capability.

Wikipedia states: "Apache Hadoop is an open-source software framework
that supports distributed applications, licensed under the Apache v2 license
(public domain)." It enables applications to work with thousands of
computationally independent computers and petabytes of data.


What is Hadoop 2.0?


Hadoop 2.0 refers to the next generation of Hadoop. As expected, the Hadoop
framework has grown to meet the demands of its own popularity and usage, and 2.0
reflects the natural maturing of the open-source project.
The Apache Hadoop project consists of the following modules:


Hadoop Common: the utilities that provide support for the other Hadoop
modules.

HDFS: the Hadoop Distributed File System

YARN: a framework for job scheduling and cluster resource management.

MapReduce: for processing large data sets in a scalable and parallel fashion.


Traditional Systems vs. Hadoop


Slide graphic: traditional systems (Traditional Database, EDW, MPP Analytics, NoSQL) compared with a Hadoop distribution as the scale of storage and processing grows:

schema: Required on write (traditional) vs. Required on read (Hadoop)
speed: Reads are fast vs. Writes are fast
governance: Standards and structured vs. Loosely structured
processing: Limited, no data processing vs. Processing coupled with data
data types: Structured vs. Multi and unstructured
best fit use: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store vs. Data Discovery, Processing unstructured data, Massive Storage/Processing

Traditional Systems vs. Hadoop


Hadoop is not designed to replace existing relational databases or data warehouses.
Relational databases are designed to manage transactions. They contain a lot of
feature/functionality designed around managing transactions. They are based upon
schema-on-write. Organizations have spent years building Enterprise Data Warehouses
(EDW) and reporting systems for their traditional data. The traditional EDWs are not
going anywhere either. EDWs are also based on schema-on-write.
Hadoop is not:

Relational

NoSQL

Real-time

A database

Hadoop is a data platform that complements existing data systems. Hadoop is designed
for schema-on-read and can handle the large data volumes coming from semi-structured
and unstructured data. With the low cost of storage on Hadoop, organizations are
looking at using Hadoop more for archiving.
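To make the schema-on-read idea concrete, here is a minimal Java sketch (not from the course labs) that reads a raw delimited file already sitting in HDFS and applies a simple three-field structure only at read time. The path /data/raw/events.csv and the field layout are hypothetical, and the Hadoop client configuration is assumed to be on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemaOnRead {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            Path raw = new Path("/data/raw/events.csv");   // hypothetical raw file
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(raw)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // The "schema" (three comma-separated fields) is applied
                    // only now, at read time -- the file was stored as-is.
                    String[] fields = line.split(",");
                    System.out.printf("user=%s action=%s ts=%s%n",
                            fields[0], fields[1], fields[2]);
                }
            }
            fs.close();
        }
    }

The same raw file could later be read with a different field layout without rewriting anything in HDFS, which is the practical difference from schema-on-write systems.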

Overview of a Hadoop Cluster


A Hadoop cluster is made up of master and slave servers
Master servers manage the infrastructure
Slave servers contain the distributed data and perform processing
Master Servers (NameNode, ResourceManager, Standby NameNode, HBase Master), for example:

Master Node 1: NameNode, Oozie Server, ZooKeeper
Master Node 2: ResourceManager, Standby NameNode, HBase Master, HiveServer2, ZooKeeper
Management Node: Ambari Server, Ganglia/Nagios, WebHCat Server, JobHistoryServer, ZooKeeper

Slave Servers (NodeManager, DataNode, HBase RegionServer):

DataNode 1 through DataNode n: each runs a DataNode, NodeManager, and HBase RegionServer

Overview of a Hadoop Cluster


A Hadoop cluster consists of the following components:

NameNode: a master server that manages the namespace of HDFS.

DataNodes: slave servers that store blocks of data.

ResourceManager: the master server of the YARN processing framework.

NodeManagers: slave servers of the YARN processing framework.

HBase components: HBase also has a master server and slave servers called
RegionServers.

Some of the components working in the background of a cluster include ZooKeeper,
Ambari, Ganglia, Nagios, JobHistory, HiveServer2, and WebHCat.


Who is Hortonworks?
Slide graphic: the Hortonworks Data Platform (HDP) stack: Operational Services (Ambari, Falcon*, Oozie), Data Services (Flume, HBase, Pig, Sqoop, Hive & HCatalog) for load & extract, the Hadoop Core (MapReduce, Tez, YARN, HDFS, with NFS, WebHDFS and Knox* access), and Platform Services for enterprise readiness (High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots), deployable on OS/VM, cloud, or appliance.

Focus on enterprise distribution of Hadoop

True open source model, no vendor lock-in

Defining Hadoop roadmap

Who is Hortonworks?
Hortonworks develops, distributes and supports Enterprise Apache Hadoop:

Develop: Hortonworks was formed by the key architects, builders, and operators
from Yahoo!. The Hortonworks software engineering team has led the effort to
design and build every major release of Apache Hadoop from 0.1 to the most
current stable release, contributing more than 80% of the code along the way.

Distribute: We provide a 100% Open Source Distribution of Apache Hadoop,
adding the required Operational, Data and Platform services from the open
source community in the Hortonworks Data Platform (HDP).

Support: We provide a range of support options for customers of the
Hortonworks Data Platform and are the leading provider of expert Hadoop
training available today.


HDP: Reliable, Consistent & Current


HDP demonstrates most recent community innovation

Slide graphic: a release timeline showing HDP 1.0 (June 2012), HDP 1.1 (Sept 2012), HDP 1.2 (Feb 2013), HDP 1.3 (May 2013), and HDP 2.0 (Oct 2013), each bundling tested versions of the component projects (Hadoop, Pig, Hive, HCatalog, HBase, Sqoop, Oozie, ZooKeeper, Mahout, Ambari).

The Hortonworks Data Platform


Individuals can download a free release of Hadoop from the Apache Software
Foundation (100% free open source). This gives someone the opportunity to test
different versions of the different frameworks that make up Hadoop. A company looking
to run a production version of Hadoop will want an enterprise version of Hadoop. An
enterprise version of Hadoop has gone through rigorous system, function and
regression testing of the distribution. It's important for an enterprise version of Hadoop
to determine the best combination of all the frameworks. Finding the stable release of
each framework that works best with the other frameworks is critical for
stability. Most organizations will not work with the Apache Hadoop distribution for a
production release for the following two reasons:


It takes a tremendous amount of skill and testing to find the right combination
for all the frameworks.

Other software runs alongside Hadoop. It is hard for a software vendor to
work with customers that have their own unique distribution of the Hadoop
frameworks.


The Hortonworks Data Platform (HDP) is the Hadoop distribution provided by


Hortonworks. HDP is perceived in the industry as the enterprise distribution of
Hadoop. Hortonworks is recognized for their platform expertise around Hadoop as well
as being the organization that is defining the roadmap for Hadoop.
Hortonworks:

Utilizes HDP, a 100% free, open source distribution of Hadoop. Every line of code
generated by Hortonworks is put back into the Apache Software Foundation.

Is considered to be the distribution defining the Hadoop road map due to
Hortonworks creating significantly more lines of open source code for Hadoop
than any other source.

Has developed over 614,041 lines of code compared to the next nearest
distribution vendor with 147,933 lines of code (based on a recent comparison).

Has 21 committers compared to the nearest distribution vendor having 7
committers.

Tests HDP at a much larger scale than any other distribution. HDP is certified and
tested at scale.


Use Case: EDW before HDP


Slide graphic: many sources (unstructured data, log files, exhaust data, social media, sensor/device data, and database data) are pushed through ETL into the EDW and out to data marts (DM).

EDW was occupied with ETL, SLAs Suffer


Schema required for ingest. New source? Adjust ETL + EDW
Data Discarded due to Scale and Cost (Volume, Speed)

Lots of time to understand process, high latency, low value

Use Case: EDW before Hadoop


Before Hadoop, a bank's Enterprise Data Warehouse was ingesting large amounts of
data with the following characteristics:


A schema was required for ingestion. When a new source of data was
introduced, new schemas had to be created (which took up to a month!).

SLAs were suffering because the EDW was busy performing ETL.

Data had to be thrown out after 2 to 5 days because it was not cost-effective to
maintain it.

The bank was missing out on new data sources, and also historical data was lost.


Use Case: EDW with HDP


Slide graphic: the same sources (unstructured data, log files, exhaust data, social media, sensor/device data, and database data) now land on the Big Data Platform for exploration, which feeds the EDW and data marts (DM).

Consolidate data types: structured/poly-structured

Data available with minimal delay (explore), on-demand schema

Active Archive: store unprecedented amounts, enable more complete analysis

Banking Use Case: EDW with HDP


By adding Hadoop, the bank is now benefitting from a list of new capabilities in its EDW:

Data is now available for use with minimal delay, which enables real-time
capture of source data.

They have a new philosophy about data: capture all data first, and then structure
the data as business needs evolve. This makes their systems much more
dynamic.

The bank now stores years' worth of raw transactional data. The data is no
longer archived, it has become ACTIVE!

Data Lineage: The bank stores intermediate stages of their data, enabling a more
powerful analytics platform.

The EDW can focus less on storage and transformation and more on analytics.

Hadoop opens up an opportunity for exploration of data that was never there
before!


Unit 1 Review
1. The core Hadoop frameworks are __________________ and _______________.
2. True or False: Hadoop is equivalent to a NoSQL platform.
3. What is the name of the management interface used for provisioning, managing,
and monitoring Hadoop clusters? _________________
4. What processes might you find running on a Master node of a Hadoop cluster?
_________________________________________________________________


Unit 2: HDFS Architecture


Topics covered:

What is a File System?

OS Architecture

HDFS Architecture

Understanding Block Storage

Demonstration: Understanding Block Storage

The NameNode

The DataNodes

DataNode Failure

HDFS Clients


What is a File System?


Data in Hadoop is stored on a file system referred to as HDFS - the Hadoop Distributed
File System. Within HDFS, data is broken down into chunks and distributed across a
cluster of machines.
Before discussing HDFS, let's take a look at the features of common file systems:


Namespace: Multi-level directory trees and file names.

Metadata: All nodes in a directory tree can have various levels of ownership
(user, group, anonymous), permissions (read, write, execute), last accessed time,
create time, modified time, is-hidden, etc. (A short sketch of reading this
metadata follows this list.)

Journaling: Reliable file systems will maintain a journal of edits in case of
failures, such as power or disk failures. Journals will contain metadata related to
the edit and, in some implementations, the actual data to be flushed to disk.

Storage: Storage in a file system is on a physical or network-attached storage
device. These devices will persist data, which is chunked by blocks.

Tools: All file systems have tools to perform file operations as well as
administrative operations such as troubleshooting and fixing problems.

OS Architecture
A familiar file system architecture:

(Slide: a familiar file system stack. The operating system (OS) provides a Virtual File System layer over a concrete file system such as ext4, ext3, or xfs, which in turn manages the namespace(s), metadata, journaling, tools, and storage on the underlying disk. Note: file systems are components of an OS.)

OS Architecture
Most common file systems are POSIX based. HDFS follows POSIX-like semantics as well, although it relaxes a few POSIX requirements in favor of streaming data access.


HDFS Architecture

(Slide: the NameNode manages the namespace, block map, metadata, and journaling, and provides tools; the DataNodes manage block storage on their local disks. Note: the NameNode and DataNodes are daemon JVMs.)

HDFS Architecture
A Hadoop instance consists of a cluster of HDFS machines, often referred to as the
Hadoop cluster or HDFS cluster. There are two main components of an HDFS cluster:
1. NameNode: The master node of HDFS that manages the data (without
actually storing it) by determining and maintaining how the chunks of data
are distributed across the DataNodes. The NameNode will contain and
manage the namespace, metadata, journaling, and a BlockMap. The
BlockMap is an in-memory map of all the blocks that make up a file and
DataNode locations of those blocks in the HDFS cluster.
2. DataNode: Stores the chunks of data, and is responsible for replicating the
chunks across other DataNodes.


The NameNode and DataNode are daemon processes running in the cluster. Some
important concepts involving the NameNode and DataNodes are:

By default only one NameNode is used in a cluster, which creates a single point
of failure. We will later discuss how to enable HA in Hadoop to mitigate this risk.

Data never resides on or passes through the NameNode. Your big data only
resides on DataNodes.

DataNodes are referred to as slave daemons to the NameNode and are
constantly communicating their state with the NameNode.

The NameNode keeps track of how the data is broken down into chunks on the
DataNodes.

The default chunk size is 128MB (but is configurable).

The default replication factor is 3 (and is also configurable), which means each
chunk of data is replicated across 3 DataNodes.

DataNodes communicate with other DataNodes (through commands from the
NameNode) to achieve data replication.

NOTE: HDFS supports a traditional hierarchical file organization. A user or an
application can create directories and store files inside these directories. It
behaves much like a distributed, POSIX-style file system, although some POSIX
requirements are relaxed in favor of streaming access.


1. The client sends a request to the NameNode to add a file to HDFS.
2. The NameNode gives the client a lease to the file path.
3. For every block, the client will request the NameNode to provide a new blockid and a list of destination DataNodes.
4. The client will write the block directly to the first DataNode in the list.
5. The first DataNode pipelines the replication to the next DataNode in the list.

Understanding Block Storage


Putting a file into HDFS involves the following steps:
1. A client application sends a request to the NameNode that specifies where they
want to put the file in HDFS.
2. The NameNode gives the client a lease to the file path. This lease will be released
if there is a failure, timeout, or will be made permanent if the write is successful
and file handle is closed.
3. For every block that the client needs to write (128MB by default), the client will
make a request to the NameNode for a new blockid and a destination list of
DataNodes for where to write the new block to.
4. Once the client gets the blockid and destination DataNodes it will start flushing
its buffer to the first DataNode in the list.
5. To replicate the block, that first DataNode will open a stream to the next
DataNode in the list and flush its persisted chunk to that DataNode. This block
replication pipeline is established to all the nodes in the list. Thus replication is
very efficient and occurs in parallel.


You can specify the block size for each file using the dfs.blocksize property. If you do not
specify a block size at the file level, the global value of dfs.blocksize defined in hdfs-site.xml will be used.
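As a quick illustration (not a lab step), the global default could be set with an hdfs-site.xml entry like the one sketched below; 134217728 is simply 128MB expressed in bytes, and the value you choose should match your own workload:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128MB in bytes -->
</property>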

IMPORTANT: The data never passes through the NameNode. The client
program that is uploading the data into HDFS performs I/O directly with the
DataNodes. The NameNode only stores the metadata of the file system; it is
not responsible for storing or transferring the data.


Demonstration: Understanding Block Storage

Objective: To understand how data is partitioned into blocks and stored in HDFS.
During this demonstration: Watch as your instructor performs the following steps.

Step 1: Put the File into HDFS


1.1. Change directories to /usr/lib/hadoop (or any folder containing a file larger
than 2MB):
# cd /usr/lib/hadoop

1.2. Try putting the hadoop-common JAR file into HDFS with a block size of 30
bytes:
# hadoop fs -D dfs.blocksize=30 -put hadoop-common-x.jar hadoop-common.jar

1.3. Notice 30 bytes is not a valid blocksize. The blocksize needs to be at least
1048576 according to the dfs.namenode.fs-limits.min-block-size property:
put: Specified block size is less than configured minimum
value (dfs.namenode.fs-limits.min-block-size): 30 < 1048576

1.4. Try the put again, but use a block size of 2,000,000:
# hadoop fs -D dfs.blocksize=2000000 -put hadoop-common-x.jar hadoop-common.jar

1.5. Notice 2,000,000 is not a valid block size because it is not a multiple of 512
(the checksum size).


1.6. Try the put again, but this time use 1,048,576 for the block size:
# hadoop fs -D dfs.blocksize=1048576 -put hadoop-common-x.jar hadoop-common.jar

1.7. This time the put command should have worked. Use ls to verify the file is in
HDFS:
# hadoop fs -ls
...
-rw-r--r--   3 root root    2679929   hadoop-common.jar

Step 2: View the Number of Blocks


2.1. Run the following command to view the number of blocks that were created
for hadoop-common.jar:
# hdfs fsck /user/root/hadoop-common.jar

2.2. Notice there are three blocks. Look for the following line in the output:
Total blocks (validated):      3 (avg. block size 893309 B)

2.3. What is the average block replication for this file? ________________
Step 3: Specify a Replication Factor
3.1. Add another file from /usr/lib/hadoop into HDFS, except this time specify a
different replication factor:
# hadoop fs -D dfs.replication=2 -put hadoop-nfs-x.jar hadoop-nfs.jar

3.2. Run the hdfs fsck command on hadoop-nfs.jar:


# hdfs fsck /user/root/hadoop-nfs.jar

3.3. Verify the average block replication for this file is 2.


Step 4: Find the Actual Blocks
4.1. Run the same fsck command on hadoop-common.jar as before, but add the
files and blocks options:

# hdfs fsck /user/root/hadoop-common.jar -files -blocks

Notice the output contains the block IDs, which are also the names of the block
files stored on the DataNodes.
4.2. Change directories to the following:
# cd /hadoop/hdfs/data/current/BP-xxx/current/finalized/

replacing BP-xxx with the actual folder name.


4.3. Notice the actual blocks appear in this folder. List the contents of the folder
and look for files with a recent timestamp. You may not see any new files. Why
not? ____________________________________________________________
4.4. Try another node in your cluster:
# ssh root@node3
# cd /hadoop/hdfs/data/current/BP-xxx/current/finalized/

You are looking for a subfolder with a recent timestamp. Once you find it, cd into
that folder.
4.5. See if you can find the various blocks for hadoop-common.jar and hadoop-nfs.jar. They will look similar to the following:
-rw-r--r--. 1 hdfs hadoop 1048576 blk_1073742331
-rw-r--r--. 1 hdfs hadoop    8199 blk_1073742331_1507.meta
-rw-r--r--. 1 hdfs hadoop 1048576 blk_1073742332
-rw-r--r--. 1 hdfs hadoop    8199 blk_1073742332_1508.meta
-rw-r--r--. 1 hdfs hadoop  582777 blk_1073742333
-rw-r--r--. 1 hdfs hadoop    4563 blk_1073742333_1509.meta

4.6. How come some of the blocks are exactly 1048576 bytes? ______________
_________________________________________________________________
4.7. What is in the .meta files? _______________________________________


The NameNode
1. When the NameNode starts, it reads the fsimage_N and edits_N files.
2. The transactions in edits_N are merged with fsimage_N.
3. A newly-created fsimage_N+1 is written to disk, and a new, empty edits_N+1 is created. During this process the NameNode is in safemode, a read-only mode. (The fsimage and edits files hold the namespace, journaling, and metadata.)
4. Now a client application can create a new file in HDFS.
5. The NameNode journals that create transaction in the edits_N+1 file.

The NameNode
HDFS has a master/slave architecture. A HDFS cluster consists of a single NameNode,
which is a master server that manages the file system namespace and regulates access
to files by clients.
The NameNode has the following characteristics:

It is the master of the DataNodes.

It executes file system namespace operations such as opening, closing, and
renaming files and directories.

It determines the mapping of blocks to DataNodes.

It maintains the file system namespace.


The NameNode performs these tasks by maintaining two files:

fsimage_N: Contains the entire file system namespace, including the mapping of
blocks to files and file system properties.

edits_N: A transaction log that persistently records every change that occurs to
file system metadata.

When the NameNode starts up, it enters safemode (a read-only mode). It loads the
fsimage_N and edits_N from disk, applies all the transactions from the edits_N to the
in-memory representation of the fsimage_N, and flushes out this new version into a
new fsimage_N on disk.

NOTE: The edits_N file naming actually contains a range of numbers for the
historical events. For example, edits_0008-0012. There is an additional file
named edits_inprogress_<start-of-range> for the current edits.

For example, initially you will have an fsimage_0 file and an edits_inprogress_0 file.
When the merging occurs, the transactions in edits_inprogress_0 are merged with
fsimage_0, and a new fsimage_1 file is created. In addition, a new, empty
edits_inprogress file is created for all future transactions that occur after the creation of
fsimage_1.
This process is called a checkpoint. Once the NameNode has successfully checkpointed,
it will leave safemode, thus enabling writes.
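You can observe this on a running cluster. The sketch below assumes the dfs.namenode.name.dir location used later in this course (/hadoop/hdfs/namenode); the first command reports whether the NameNode is currently in safemode, and the second lists the fsimage and edits files described above:

# su - hdfs
$ hdfs dfsadmin -safemode get
$ ls /hadoop/hdfs/namenode/current/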


The DataNodes
(Slide: each DataNode reports to the NameNode - "I'm still here! This is my latest Blockreport." - and the NameNode issues instructions such as "Replicate block 123 to DataNode 1," which DataNode 4 then carries out from its local storage.)

The DataNodes
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.
The NameNode determines the mapping of blocks to DataNodes. The DataNodes are
responsible for:

Handling read and write requests from application clients.

Performing block creation, deletion, and replication upon instruction from the
NameNode. (The NameNode makes all decisions regarding replication of
blocks.)

Sending heartbeats to the NameNode.

Sending the blocks stored on the DataNode in a Blockreport.

The NameNode periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is
functioning properly. A Blockreport contains a list of all blocks on a DataNode.


DataNodes have the following characteristics:

The DataNode has no knowledge about HDFS files.

It stores each block of HDFS data in a separate file on its local file system.

The DataNode does not create all files in the same local directory. It uses a
discovery technique to determine the optimal number of files per directory and
creates subdirectories appropriately.

When a DataNode starts up, it scans through its local file system, generates a list
of all HDFS data blocks that correspond to each of these local files, and sends this
information to the NameNode (as a Blockreport).
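To see which DataNodes are currently registered with the NameNode, along with their capacity, block counts, and last heartbeat, the dfsadmin report is a convenient check. This is a sketch, not a lab step, and it assumes you run it as the hdfs user:

# su - hdfs
$ hdfs dfsadmin -report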

REFERENCE: For tips on configuring a network for a Hadoop cluster, visit
http://hortonworks.com/kb/best-practices-for-cluster-network-configuration/.

DataNode Failure
(Slide: DataNodes 1, 2, and 4 continue to send their Heartbeat and Blockreport to the NameNode; DataNode 3 does not, so the NameNode concludes: "Sorry, DataNode 3, but I'm going to assume you are dead.")

DataNode Failure
The primary objective of HDFS is to store data reliably even in the presence of failures.
Hadoop is designed to recover gracefully from a disk failure or network failure of a
DataNode using the following guidelines:

If a DataNode fails to send a Heartbeat to the NameNode, that DataNode is
labeled as dead.

Any data that was registered to a dead DataNode is no longer available to HDFS.

The NameNode does not send new I/O requests to a dead DataNode, and its
blocks are replicated to live DataNodes.

DataNode death typically causes the replication factor of some blocks to fall below their
specified value. The NameNode constantly tracks which blocks need to be replicated
and initiates replication whenever necessary.
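A quick way to check whether any blocks are currently under-replicated or corrupt (for example, after a DataNode failure) is a file system check. fsck is covered in detail in Unit 5, so treat this as a sketch only:

# su - hdfs
$ hdfs fsck / | grep -iE 'under-replicated|corrupt'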


NOTE: It is possible that a block of data fetched from a DataNode arrives
corrupted, either from a disk failure or network error. HDFS implements
checksum checking on the contents of HDFS files.
When a client creates an HDFS file, it computes a checksum of each block of
the file and stores these checksums in a separate hidden file in the same
HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from
each DataNode matches the checksum stored in the associated checksum
file. If the checksum verification fails, the client can opt to retrieve that block
from another DataNode that has a replica of that block.


HDFS Clients
Command-line tools
  User commands: fs, archive, distcp, fsck, fetchdt
  Admin commands: dfsadmin, namenode, datanode, balancer, daemonlog, secondarynamenode

WebHDFS
  The NameNode and DataNodes both expose RESTful APIs to perform user operations.

HttpFS
  A REST gateway that supports user operations and is interoperable with WebHDFS.

Hue
  A feature-rich GUI that includes an HDFS file browser, a job browser for MapReduce and YARN, and HBase, Hive, Pig, and Sqoop support.

HDFS Java API
  You can write your own Java-based client applications, as well as YARN applications such as MapReduce.

HDFS Clients
HDFS provides many out of the box methods for clients to interact with the file system.
These include command line, RESTful, and a Java HDFS API. Additionally, HDP provides
Hue, a GUI interface to not only HDFS but also other components in HDP.
We will explore the various types of clients in an upcoming lab.
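As a small illustration of the RESTful option, WebHDFS can be exercised with nothing more than curl. The sketch below assumes node1 is the NameNode host and that WebHDFS is enabled (dfs.webhdfs.enabled=true); op=LISTSTATUS is the WebHDFS equivalent of hadoop fs -ls:

# curl -i "http://node1:50070/webhdfs/v1/user/root?op=LISTSTATUS"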


Unit 2 Review
1. Which component of HDFS is responsible for maintaining the namespace of the
distributed file system? _________________________
2. What is the default file replication factor in HDFS? _________________________
3. True or False: To input a file into HDFS, the client application passes the data to
the NameNode, which then divides the data into blocks and passes the blocks to
the DataNodes. _____________
4. Which property is used to specify the block size of a file stored in HDFS?
__________________________
5. The NameNode maintains the namespace of the file system using which two sets
of files? _______________________________________________________


Unit 3: Installation Prerequisites and Planning
Topics covered:

Minimum Hardware Requirements

Minimum Software Requirements

A Formidable Starter Cluster

Lab 3.1: Setting up the Environment

Lab 3.2: Install HDP 2.0 Cluster using Ambari


Minimum Hardware Requirements


The great benefit of Hadoop is that hardware requirements are very flexible.
Infrastructure choices are in your hands. There are, however, guidelines to help you
make better choices.


Master Nodes: RAID10 is recommended because master nodes need to be more
resilient to failures. RAID10 will allow for multiple disk failures at the same time.

Slave Nodes: JBOD, or Just a Bunch of Disks, is a simple array of disks with no
striping or mirroring.


Minimum Software Requirements


All hosts, both masters and slaves, should have the above installed. Ambari can push the
JDK to all the hosts if you are using Ambari to manage your cluster. At least one of the
master nodes should have Ambari bits downloaded. We will discuss this later in the
course.


A Formidable Starter Cluster


Configuration that will yield 26TB storage:
42U single rack
10 nodes: 2 masters, 8 slaves
Masters (2x):
- 6 - 1 TB drives
- Dual quad core
- 64 GB RAM
- RAID 10
- 2x Gigabit Ethernet
- Redundant power supply
Slaves (8 x 2U):
- Single quad core
- 32 GB RAM
- 2x internal SATA drives for OS
- 6 x 2TB SATA drives as JBOD


We will cover cluster tuning later and answer questions such as:
Is the cluster for a small group? Multi-tenant?
How much storage do you anticipate in the short-term?
How quickly will the data grow?
Do you anticipate compute heavy processing of the data? Compute/memory
heavy algorithms?


Lab 3.1: Setting up the Environment

Objective: To become familiar with preparing an HDP 2.0 installation.


Successful Outcome: You have set up passwordless SSH and configured the
repositories.
Before You Begin: SSH into node1.

Step 1: View the Setup Script


1.1. SSH into node1.
1.2. Change directories to /root/scripts:
# cd /root/scripts

1.3. View the contents of env_setup.sh:


# more env_setup.sh

1.4. This script does a lot. Look over the steps and see if you can follow what is
happening. Some of the highlights of the script include:
- Sets up passwordless SSH amongst your four nodes.
- Installs ntp on each node.
- Configures the repositories for installing HDP locally.
- Disables security and turns off iptables.
Step 2: Run the Setup Script
2.1. Run the setup script using the following command:

# ./env_setup.sh

2.2. The script will take a while to execute. Watch the output and keep an eye out
for any errors. The end of the output will look like:
Installed:
yum-plugin-priorities.noarch 0:1.1.30-14.el6
Complete!

NOTE: If you don't see your command prompt, simply press Enter when the
script is finished.

IMPORTANT: If you find an error, try to determine at which step in the script
it occurred. You may need to manually copy-and-paste the remainder of the
script based on where your error occurred.

RESULT: Your cluster is now ready for HDP 2.0 to be installed using Ambari!


Lab 3.2: Install HDP 2.0 Cluster using Ambari

Objective: To install a Hadoop cluster using Ambari UI.


Successful Outcome: You can see the various HDP 2.0 services running from within
the Ambari UI.
Before You Begin: SSH into node1.

Step 1: Install ambari-server


1.1. From the command line of node1, enter the following command:
# yum -y install ambari-server

1.2. Open the following file using vi:


# vi /var/lib/ambari-server/resources/stacks/HDPLocal/2.0.6/repos/repoinfo.xml

1.3. Change the centos6 configuration to point to the local repositories by changing the baseurl property as shown here:
<os type="centos6">
<repo>
<baseurl>http://node1/hdp/HDP-2.0.6.0-76</baseurl>
<repoid>HDP-2.0.6</repoid>
<reponame>HDP</reponame>
</repo>
</os>

Step 2: Start the ambari-server


2.1. The Ambari Server manages the install process. Run the Ambari Server setup
using the following command:

# ambari-server setup -s -i /usr/jdk-6u31-linux-x64.bin

NOTE: The -s option runs the setup in silent mode, meaning all default
values are accepted at any prompts.

2.2. Now start the Ambari Server:


# ambari-server start

Step 3: Login to Ambari


3.1. Start the Firefox browser from left side-bar and go to the following URL:

http://node1:8080

3.2. Log in to the Ambari server using the default credentials admin/admin:

Step 4: Run the Install Wizard


4.1. At the Welcome page, enter the name horton for your cluster and click the
Next button.

4.2. Select the service stack HDP2.0.6:

Step 5: Enter the Host and SSH Key Details


5.1. Enter node1, node2 and node3 in the list of Target Hosts. (Do not enter
node4; you will add that node to the cluster in a later lab.)


5.2. In the Host Registration Information section, click the Choose File button,
then browse to and select the training-keypair.pem file at Desktop:

5.3. Under Advanced Options, check Use a local software repository:

5.4. Click the Register and Confirm button. Click OK if you are warned about not
using fully qualified domain names.
Step 6: Confirm Hosts
6.1. Wait for some initial verification to occur on your cluster. Once the process is
done, click the Next button to proceed:


NOTE: You may see a confirmation message with a warning. Verify your
nodes are configured correctly before continuing. If the warning is related to
the firewall, you can ignore it.

Step 7: Choose the Services to Install


7.1. Hortonworks Data Platform is made up of a number of components. You are
going to install all the services except HBase on your cluster, so uncheck HBase
and make sure all other services are checked, then click the Next button:


Step 8: Assign Master Nodes


8.1. The Ambari wizard attempts to assign the various master services on
appropriate hosts in your cluster. Carefully choose the following assignments of
the master services!

CAUTION: Make sure to choose the right node for each master service as
specified below. Once the installation starts, you cannot change the
selection!

NameNode: node1
SNameNode: node2
History Server: node2
ResourceManager: node2
Nagios Server: node3
Ganglia Server: node3
HiveServer2: node2


Oozie Server: node2


ZooKeeper: node1
ZooKeeper: node2
ZooKeeper: node3
8.2. Verify your assignments match the following:

8.3. Click the Next button to continue.


Step 9: Assign Slaves and Clients
9.1. Assign all slave and client components to all nodes in the list:

9.2. Click the Next button to continue.


Step 10: Customize Services


10.1. Notice three services require additional configuration: Hive, Oozie and
Nagios. Click on the Hive tab, then enter hive for the Database Password:

10.2. Click on the Oozie tab and enter oozie for its Database Password:

10.3. Click on the Nagios tab. Enter admin for the Nagios Admin password, and
enter your email address in the Hadoop Admin email field:

10.4. Click the Next button to continue.


Step 11: Review the Configuration

11.1. Notice the Review page allows you to review your complete install
configuration. If you're satisfied that everything is correct, click Deploy to start
the installation process. (If you need to go back and make changes, you can use
the Back button.)

Step 12: Wait for HDP to Install


12.1. The installation will begin now. It will take 20-30 minutes to complete,
depending on network speed. You will see progress updates under the Status
column as components are installed, tested, and started:


12.2. You should see the following screen if the installation completes
successfully:

12.3. When the process completes, click Next to get a summary of the installation
process. Check all configured services are on the expected nodes, then click
Complete:


Step 13: View the Ambari Dashboard


13.1. After the install wizard completes, you will be directed to your cluster's
Ambari Dashboard page. Verify the DataNodes Live status shows 3/3:

RESULT: You now have a running 3-node cluster of the Hortonworks Data Platform!


Unit 4: Configuring Hadoop


Topics covered:

Configuration Considerations

Deployment Layout

Configuring HDFS

What is Ambari

Configuration via Ambari

Management

Monitoring

REST API

Lab 4.1: Add a New Node to the Cluster

Lab 4.2: Stopping and Starting HDP Services

Lab 4.3: Using HDFS Commands


Configuration Considerations
There are two ways to configure HDP:
- Manual configuration
- Ambari UI configuration

Currently, it is necessary to know key configurations at the configuration file level.


Ambari is highly capable of managing configurations; however, if Ambari fails, an HDP
administrator still needs to know the key properties required to run a cluster. Knowing
them also provides insight into exactly which configuration files to examine when
configuring the various services. In this unit, we focus on HDFS, YARN, MapReduce,
ZooKeeper, and HBase.


(Slide: the five categories of the deployment layout - Install Bits, Binaries, Configuration, Data, and Runtime.)

Deployment Layout
The HDP deployment layout per machine may vary slightly because not all machines will
have the same components. For example, there's only one Ambari Server per cluster.
However, by using the deployment layout above as a guide, you can quickly find the
configuration, binaries, and repos needed for Ambari to run.
The Deployment Layout can be broken down into five key categories:
1. Install Bits: It is a best practice to setup a local repository of the install bits or
rpm repos. When setting up a local repo, a yum repo is added to
/etc/yum.repos.d/. The rpms are installed on a simple webserver.
2. Binaries: Hadoop executables, libraries, dependencies, template configs, etc. are
located at /usr/lib/ in the appropriate project folder. Files in these directories
should not be modified, especially configuration files. A best practice for
customization of shell scripts is that modifications should be done via wrapper
scripts, such as passing parameters or piping stdout to a log file.
3. Configuration: By convention, Hadoop configurations are under /etc/ under the
appropriate project. This is where configuration changes should be made rather
than in install (binaries) directories.


4. Data: Various Hadoop services require data directories. For example, HDFS
requires space for the NameNode to write its edits log files. And the DataNodes
will write the actual data blocks to the local file system. Throughout the
configuration files, you will find services requiring a directory path to use as
temporary or permanent storage.
5. Runtime: As Hadoop services are running, starting, and stopping, they will be
writing to self-maintenance files such as pid (process id) files, typically to
/var/run/. For example, Hadoop HDFS services will publish pid files to
/var/run/hadoop/hdfs/.
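For example, you could confirm that the NameNode daemon referenced by its pid file is actually running. The exact pid file name can vary by version and by the user running the daemon, so treat the name below as an illustrative assumption:

# cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid
# ps -p $(cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid)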


Configuring HDFS
There are two configurations involved when configuring HDFS. In addition to Hadoop
configuration properties necessary to bring up an HDFS cluster, there are some prerequisites, which we will discuss:
Ports - Firewall considerations:

Service                          Servers            Ports       Protocol   Description
NameNode WebUI                   Master Nodes       50070       http       NameNode WebUI
                                                    50470       https      Secure http
NameNode metadata service        Master Nodes       8020/9000   IPC        File system metadata operations
DataNode                         All Slave Nodes    50075       http       DataNode WebUI
                                                    50475       https      Secure http
                                                    50010                  Data transfer
                                                    50020       IPC        Metadata operations
Secondary (Checkpoint) NameNode                     50090       http       Secondary NameNode WebUI

DNS
Ensure that HDFS hosts are resolvable via DNS. If this is not possible, every host will
need an /etc/hosts file that contains all the hosts in the cluster. The hosts file is a local
hostname-to-IP mapping file.
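If DNS is not available, each node's /etc/hosts might contain entries along these lines (the IP addresses shown are placeholders for illustration only):

192.168.56.101   node1
192.168.56.102   node2
192.168.56.103   node3
192.168.56.104   node4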
core-site.xml: Cluster-wide settings, including NameNode host and port, proxy
user/groups. This file will get distributed to all nodes, but is always changed
uniformly.
hdfs-site.xml: Some settings are cluster-wide, while others are DataNode
specific. For example, dfs.datanode.data.dir can be different between
DataNodes.
NameNode
fs.defaultFS: hdfs://namenodehost:8020
dfs.namenode.name.dir: /hadoop/hdfs/namenode
dfs.replication: The Hadoop default is 3 and should generally be kept at 3. This property can be
overridden by the client per operation if you want to change the replication for a
file. For example, if a file is referenced multiple times in many jobs, it is often a
performance gain to have more replicas of that same file (e.g., when joining with lookup
files).
dfs.replication.max: Maximum replication.
dfs.blocksize: The default block size is 128MB (this property is expressed in bytes). If
your cluster generally has larger datasets and the datasets are not processing
intensive, you can set this to a higher value. However, 128MB is a good default to
keep. Just as with the replication factor, you can change the block size for
each file that you upload into HDFS.
dfs.namenode.stale.datanode.interval: Default, 30000ms. Threshold for
amount of time in milliseconds before the NameNode considers a DataNode to
be stale, at which point the DataNode is moved to the end of the list of available
replica locations.
SecondaryNameNode
dfs.namenode.checkpoint.dir: Directory where the SecondaryNameNode
temporarily stores the images it needs to merge from the NameNode.

dfs.namenode.checkpoint.period: Default 3600. Number of seconds between
two periodic checkpoints.


dfs.namenode.checkpoint.txns: Default 1,000,000 transactions. After this many
transactions on the NameNode, the SecondaryNameNode will create a
checkpoint. This property takes precedence over dfs.namenode.checkpoint.period.
DataNodes
dfs.datanode.address: 0.0.0.0:50010. The DataNode host address.
dfs.datanode.data.dir: DataNode data block directory. If a node has multiple
drives, you can specify a comma-separated list of data directories for each drive.
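Pulling a few of these properties together, a minimal configuration sketch might look like the following. The hostname and paths are examples that match the class cluster, not required values:

core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://node1:8020</value>
</property>

hdfs-site.xml:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>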


What is Ambari
Ambari is a 100% Apache open source operations framework for provisioning,
managing, and monitoring Hadoop clusters. It provides these features through a web
frontend and an extensive REST API.
With Ambari, clusters can be built from the ground up on clean operating system instances.
It does the work of propagating binaries to all the hosts in a cluster, configuring services,
launching them, and monitoring them.


Configuration via Ambari


Using Ambari is very convenient. However, there are some cases where performing
manual changes is necessary, such as differing property values between DataNodes.
When the same property has different values within the cluster, we refer to this as a
heterogeneous configuration.
We will be using Ambari throughout this course to monitor, provision, and manage your
HDP cluster.


Management
Once a cluster is provisioned, services can be managed either as an entire service or at
the granularity of a service's sub-components. For example, an administrator can choose
to start/stop the entire YARN service (ResourceManager + NodeManagers), or just stop a
particular NodeManager on a host.
Configuration of services can also be managed. Properties, credentials, and paths are some
examples of common configurations. Ambari allows custom or advanced properties
to be managed for most services. Once a configuration change has been made, Ambari
will persist the change to its own internal database, a PostgreSQL database by default.


Management Flow
1. Stop service(s): Services are required to be stopped.
2. Edit and save: Once saved, Ambari will validate and persist the new settings in
its database, write the settings to appropriate configuration files on the cluster.
3. Start service(s): Services can now be started.

Ambari uses Puppet, an open source automation system, for orchestrating the starting/stopping of services.

Advanced Configurations
Ambari supports configuring NameNode HA and security. These are advanced features
available under the Admin page. These topics will be covered in later units.


Monitoring
Ambari provides monitoring with the combination of two powerful open source
frameworks: Ganglia and Nagios.
Ganglia
All cluster metrics are gathered by Ganglia agents running on each host and aggregated.
Nagios
Nagios is used to provide alerts, escalation schemes to implement enterprise SLAs, and
reports. With Nagios, alerts via email, SMS, or script execution can be triggered by
events such as a threshold limit being crossed. For example, an administrator may want
to receive an SMS alert if a Hadoop master node's CPU is pegged at 100% for more than
5 minutes. All such thresholds are configurable in Nagios.
Dashboard
Ambari provides a dashboard that gives an administrator a quick view of the overall
health of the entire cluster. There are 20+ widgets that provide quick stats on services.
Widgets can be added or you could write your own widget using the Ambari APIs.


REST API
Ambari uses a REST API. You can write your own automation scripts to perform
extensive operations. The REST API allows you to monitor as well as manage a cluster.


All components of a cluster are Resources that can be added, updated/configured, or
removed. Core resources include:

Clusters: Top-most level resource; a Hadoop cluster.

Services: Hadoop services such as HDFS, YARN, etc.

Components: Individual components of a service, such as the NameNode or ResourceManager.

Hosts: The host machines that participate in a cluster.

Host_Components: Individual resources on a host; often this resource is used to get all services running on a host.

Configurations: Sets of key/value pairs that configure services.

The Ambari REST API is an evolving feature. While most operations will work
as expected, be sure to thoroughly test an operation and validate the expected
results.
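As a simple illustration (assuming the cluster name horton and the default admin/admin credentials used in the labs), you could retrieve the cluster resource and one of its service resources with curl:

# curl -u admin:admin http://node1:8080/api/v1/clusters/horton
# curl -u admin:admin http://node1:8080/api/v1/clusters/horton/services/HDFS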


Lab 4.1: Add a New Node to the Cluster

Objective: Add an additional node to a cluster.


Successful Outcome: node4 will be added to your HDP cluster as a DataNode.
Before You Begin: Your 3-node cluster should have HDP successfully installed.

Step 1: Login to Ambari


1.1. If you are not logged in already, login to the Ambari Dashboard of your
cluster.
Step 2: Run the Add Hosts Wizard
2.1. Click on the Hosts tab of Ambari.
2.2. Click the Add New Hosts button to start the Add Host Wizard:

Step 3: Complete the Add Host Wizard


3.1. You have seen this wizard before when you installed HDP, so we will provide
only a few hints in this lab. Read through the following hints before running the
wizard.
3.2. The hostname of the node you are adding is node4.
3.3. Make sure you check the box for using a local software repository.
3.4. On the Assign Slaves and Clients step, choose only the Client option.


Step 4: Verify the New Host


4.1. Go to the Hosts page of Ambari. You should see all four of your nodes now
listed.

RESULT: You have added a new node to the cluster and Hadoop is installed on it. In a
later lab, you will commission this node as a DataNode.


Lab 4.2: Stopping and Starting HDP Services

Objective: To learn how to start and stop the various HDP services using
either the command line or Ambari.
Successful Outcome: You will have stopped HDP from the command line and
started it again using Ambari.
Before You Begin: Your cluster should be up and running.

Step 1: Stop the HDP Services from the Command Line


1.1. The following table lists all the processes that need to be stopped, in the
proper order for shutting down all HDP services. Do not type these in yet - they
are provided for you in a script:
Nagios (node3):
    service nagios stop

Ganglia (gmetad on node3; gmond on node1, node2, node3, node4):
    service hdp-gmetad stop
    service hdp-gmond stop

Oozie (node2):
    sudo su -l oozie -c "/usr/lib/oozie/bin/oozied.sh stop"

WebHCat (node2):
    su -l hcat -c "/usr/lib/hcatalog/sbin/webhcat_server.sh stop"

Hive (node2):
    ps aux | awk '{print $1,$2}' | grep hive | awk '{print $2}' | xargs kill >/dev/null 2>&1

Zookeeper (node1, node2, node3):
    /usr/lib/zookeeper/bin/zkServer.sh stop

YARN NodeManagers (node1, node2, node3):
    su - yarn -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config /etc/hadoop/conf stop nodemanager'

MapReduce History Server (node2):
    su - mapred -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-mapreduce/sbin/mr-jobhistory-daemon.sh --config /etc/hadoop/conf stop historyserver'

YARN ResourceManager (node2):
    su - yarn -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config /etc/hadoop/conf stop resourcemanager'

HDFS DataNodes (node1, node2, node3):
    su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh stop datanode"

HDFS Secondary NameNode (node2):
    su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh stop secondarynamenode"

HDFS NameNode (node1):
    su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh stop namenode"

1.2. SSH into node1. (Make sure you run this script from node1.)
1.3. Run the following script to shutdown all HDP services on your cluster:
# ~/scripts/shutdown_all_services.sh

1.4. Wait for the script to execute and all the services to stop.
Step 2: View Ambari
2.1. Go to your Ambari Dashboard. Notice the Cluster Status and Metrics on the
Dashboard are mostly n/a:


2.2. Notice that all the Services are down - as shown by the red icon next to each
service name:

2.3. From the Services page, click on each service individually. They should all be
stopped.
Step 3: Stop ambari services
3.1. Run the following script to shutdown all Ambari services on your cluster:
# ~/scripts/stop_ambari.sh


Step 4: Take a backup of your VM.


4.1. Go to VMWare Player/Fusion Menu and shutdown the VM.
4.2. Copy the existing VM folder from current location to a new location.
4.3. Once you are done with backup, go to VMWare Player/Fusion Menu and start
the VM again.

NOTE: It is important to take a backup of the VM now so that you do not
need to start from the beginning in case of a fatal error.

Step 5: Start Ambari services


5.1. Login to node1 again
5.2. Run the following script to start all Ambari services on your cluster:
# ~/scripts/start_ambari.sh

Step 6: Start the HDP Services from Ambari UI


6.1. Go To Ambari UI using Firefox browser
6.2. From the Services page, click the Start All button. Click the OK button to
confirm.
6.3. Wait for the services to start, which can take 10-15 minutes. If you click the
small arrow to the right of Start All Services, you can view the progress on each
node:


6.4. Once all the services are started, click the OK button to close the progress
dialog.
6.5. Verify on the Services page of Ambari that all the HDP services in your cluster
are up and running.

NOTE: The table below shows the proper order for starting HDP services.
These can be executed using the /root/scripts/startup_all_services.sh script
provided in your class cluster.

HDFS NameNode (node1):
    su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh start namenode"

HDFS Secondary NameNode (node2):
    su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh start secondarynamenode"

HDFS DataNodes (node1, node2, node3):
    su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh start datanode"

YARN ResourceManager (node2):
    su - yarn -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config /etc/hadoop/conf start resourcemanager'

MapReduce History Server (node2):
    su - mapred -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-mapreduce/sbin/mr-jobhistory-daemon.sh --config /etc/hadoop/conf start historyserver'

YARN NodeManagers (node1, node2, node3):
    su - yarn -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config /etc/hadoop/conf start nodemanager'

Zookeeper (node1, node2, node3):
    /usr/lib/zookeeper/bin/zkServer.sh start

Hive Metastore (node2):
    su - hive -c 'env HADOOP_HOME=/usr JAVA_HOME=/usr/jdk64/jdk1.6.0_31 /tmp/startMetastore.sh /var/log/hive/hive.out /var/log/hive/hive.log /var/run/hive/hive.pid /etc/hive/conf.server'

HiveServer2 (node2):
    su - hive -c 'env JAVA_HOME=/usr/jdk64/jdk1.6.0_31 /tmp/startHiveserver2.sh /var/log/hive/hiveserver2.out /var/log/hive/hive-server2.log /var/run/hive/hive-server.pid /etc/hive/conf.server'

WebHCat (node2):
    su -l hcat -c "/usr/lib/hcatalog/sbin/webhcat_server.sh start"

Oozie (node2):
    sudo su -l oozie -c "/usr/lib/oozie/bin/oozied.sh start"

Ganglia (gmetad on node3; gmond on node1, node2, node3, node4):
    /etc/init.d/hdp-gmetad start
    /etc/init.d/hdp-gmond start

Nagios (node3):
    service nagios start

Lab 4.3: Using HDFS Commands

Objective: To become familiar with running HDFS commands and learn how to view the HDFS file system.
Successful Outcome: You will have added and deleted several files and folders in
HDFS.
Before You Begin: SSH into node1.

Step 1: View the hadoop fs Command


1.1. From the command line, enter the following command to view the usage of
hadoop fs:
# hadoop fs

Notice the usage contains options for performing file system tasks in HDFS, like
copying files from a local folder into HDFS, retrieving a file from HDFS, copying
and moving files around, and making and removing directories.
1.2. Enter the following command:
# hdfs dfs

Notice you get the same usage list as the hadoop fs command.

NOTE: The hadoop command is a more generic command that has fewer
options than the hdfs command. However, notice hdfs dfs is just an alias for
hadoop fs.


Step 2: Understanding the Default Folders in HDFS


2.1. Enter the following -ls command to view the contents of the user's root
directory in HDFS, which is /user/root:
# hadoop fs -ls

You do not have a /user/root directory yet, so no output is displayed:

ls: `.': No such file or directory

2.2. View the contents of the /user directory in HDFS:


# hadoop fs -ls /user
Found 4 items
drwxrwx---   - ambari-qa hdfs          0 /user/ambari-qa
drwxr-xr-x   - hcat      hdfs          0 /user/hcat
drwx------   - hive      hdfs          0 /user/hive
drwxrwxr-x   - oozie     hdfs          0 /user/oozie

Notice HDFS has four user folders by default: ambari-qa, hcat, hive and oozie.
2.3. Run the -ls command again, but this time specify the root HDFS folder:
# hadoop fs -ls /

The output should look like:


Found 6 items
drwxrwxrwt   - yarn   hdfs          0 2013-08-20 13:59 /app-logs
drwxr-xr-x   - hdfs   hdfs          0 2013-08-20 13:53 /apps
drwxr-xr-x   - mapred hdfs          0 2013-08-20 13:57 /mapred
drwxr-xr-x   - hdfs   hdfs          0 2013-08-20 13:58 /mr-history
drwxrwxrwx   - hdfs   hdfs          0 2013-08-28 22:03 /tmp
drwxr-xr-x   - hdfs   hdfs          0 2013-08-28 22:03 /user

2.4. Which user is the owner of the /user folder? ___________________

IMPORTANT: Notice how adding the / in the -ls command caused the
contents of the root folder to display, but leaving off the / attempted to list
the contents of /user/root. If you do not specify an absolute path, then all
hadoop commands are relative to the user's default home folder.


Step 3: Create a User Directory in HDFS


3.1. Enter the following mkdir command:
# hadoop fs -mkdir /user/root

Notice the root user does not have permission to create this folder.
3.2. Switch to the hdfs user:
# su - hdfs

3.3. Make a new directory in HDFS named /user/root:


$ hadoop fs -mkdir /user/root

3.4. Change the permissions to make root the owner of the directory:
$ hadoop fs -chown root:root /user/root

3.5. Verify the folder was created successfully and root is the owner:
$ hadoop fs -ls /user
...
drwxr-xr-x   - root   root          /user/root

3.6. Switch back to the root user:


$ exit
logout
[root@node1 ~]#

3.7. Now view the contents of /user/root using the following command again:
# hadoop fs -ls

The directory is empty, but notice this time the command worked.
Step 4: Create Directories in HDFS
4.1. Enter the following command to create a directory named test in HDFS:

# hadoop fs -mkdir test

4.2. Verify the folder was created successfully:


# hadoop fs -ls
Found 1 items
drwxr-xr-x   - root   root          test

4.3. Create a couple of subdirectories of test:


# hadoop fs -mkdir test/test1
# hadoop fs -mkdir test/test2
# hadoop fs -mkdir test/test2/test3

4.4. Use the -ls command to view the contents of /user/root:


# hadoop fs -ls

Notice you only see the test directory. To recursively view the contents of a
folder, use: -ls -R:
# hadoop fs -ls -R

The output should look like:


drwxr-xr-x   - root root          0 test
drwxr-xr-x   - root root          0 test/test1
drwxr-xr-x   - root root          0 test/test2
drwxr-xr-x   - root root          0 test/test2/test3

Step 5: Delete a Directory


5.1. Delete the test2 folder (and recursively its subcontents) using the -rm -R
command:
# hadoop fs -rm -R test/test2

5.2. Now run the -ls -R command:


# hadoop fs -ls -R

The directory structure of the output should look like:



.Trash
.Trash/Current
.Trash/Current/user
.Trash/Current/user/root
.Trash/Current/user/root/test
.Trash/Current/user/root/test/test2
.Trash/Current/user/root/test/test2/test3
test
test/test1

NOTE: Notice Hadoop created a .Trash folder for the root user and moved
the deleted content there. The .Trash folder empties automatically after a
configured amount of time.
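Two related details worth knowing (not required for this lab): the trash retention time is controlled by the fs.trash.interval property in core-site.xml, and a delete can bypass the trash entirely with the -skipTrash option. For example, a sketch:

# hadoop fs -rm -R -skipTrash test/test2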

Step 6: Upload a File to HDFS


6.1. Now let's put a file into the test folder. Change directories to
/var/log/hadoop/hdfs:
# cd /var/log/hadoop/hdfs

6.2. Notice this folder contains a file named hdfs-audit.log:


# tail hdfs-audit.log

6.3. Run the following -put command to copy hdfs-audit.log into the test folder in
HDFS:
# hadoop fs -put hdfs-audit.log test/

6.4. Verify the file is in HDFS by listing the contents of test:


# hadoop fs -ls test
Found 2 items
-rw-r--r--   3 root root    3744098 test/hdfs-audit.log
drwxr-xr-x   - root root          0 test/test1

Step 7: Copy a File in HDFS


7.1. Now copy the hdfs-audit.log file in test to another folder in HDFS:
# hadoop fs -cp test/hdfs-audit.log test/test1/copy.log

7.2. Verify the file is in both places by using the -ls -R command on test. The
output should look like the following:
# hadoop fs -ls -R test
-rw-r--r--   3 root root    3744098 test/hdfs-audit.log
drwxr-xr-x   - root root          0 test/test1
-rw-r--r--   3 root root    3744098 test/test1/copy.log

7.3. Now delete the copy.log file using the -rm command:
# hadoop fs -rm test/test1/copy.log

7.4. Verify the copy.log file is in the .Trash folder.


Step 8: View the Contents of a File in HDFS
8.1. You can use the -cat command to view text files in HDFS. Enter the following
command to view the contents of hdfs-audit.log:
# hadoop fs -cat test/hdfs-audit.log

8.2. You can also use the -tail command to view the end of a file:
# hadoop fs -tail test/hdfs-audit.log

Notice the output this time is only the end of hdfs-audit.log (the last kilobyte of the file).
Step 9: Getting a File from HDFS
9.1. See if you can figure out how to use the get command to copy test/hdfs-audit.log into your local /tmp folder.
Step 10: The getmerge Command
10.1. Put the file /var/log/hadoop/hdfs/hadoop-hdfs-namenode-node1.log into
the test folder in HDFS. You should now have two files in test: hdfs-audit.log and
hadoop-hdfs-namenode-node1.log:
# hadoop fs -ls test
Found 3 items
-rw-r--r--   3 root root    1033038 test/hadoop-hdfs-namenode-node1.log
-rw-r--r--   3 root root    3744098 test/hdfs-audit.log
drwxr-xr-x   - root root          0 test/test1


10.2. Run the following getmerge command:


# hadoop fs -getmerge test /tmp/merged.txt

10.3. What did the previous command do? Compare the file size of merged.txt
with the two log files from the test folder.
Step 11: Specify the Block Size of a File
11.1. Change directories to /root/labs:
# cd /root/labs

Notice this folder contains an HBase JAR file that is about 4.7MB.
11.2. Put the HBase JAR file into /user/root in HDFS with the name hbase.jar, and
assign it a blocksize of 1048576 bytes. HINT: The blocksize is defined using the
dfs.blocksize property on the command line.
11.3. Run the following fsck command on hbase.jar:
# hdfs fsck /user/root/hbase.jar

11.4. How many blocks did this file get broken down in to? ________________

RESULT: You should now be comfortable with executing the various HDFS commands,
including creating directories, putting files into HDFS, copying files out of HDFS, and
deleting files and folders.


ANSWERS:
Step 2.4: hdfs
Step 9.1:
# hadoop fs -get test/hdfs-audit.log /tmp

Step 10.3: The two files that were in the test folder in HDFS were merged into a single
file and stored on the local file system.
Step 11.2:
hadoop fs -D dfs.blocksize=1048576 -put hbase-0.94.3bimota-1.2.0.21+HBASE-7644.jar hbase.jar

Step 11.4: The file should be broken down into 5 blocks.


Unit 5: Ensuring Data Integrity


Topics covered:

Ensuring Data Integrity

Replication Placement

Data Integrity Writing Data

Data Integrity Reading Data

Data Integrity Block Scanning

Running a File System Check

What Does the File System Check Look For?

hdfs fsck syntax

Data Integrity File System Check: Commands & Output

Hadoop dfs Command

NameNode Information

Changing the Replication Factor

Lab 5.1: Verify Data with Block Scanner and fsck


Ensuring Data Integrity


HDFS has a simple yet robust architecture that was designed for data reliability in the
face of faults and failures in disks, nodes and networks. HDFS is a successful file system
because it is simple. The simplicity allows HDFS to be fast, relatively easy to administer,
and it avoids the complex locking issues in databases and standard file systems.
DataNodes and disk drives can fail. Along with hardware failures, data can become
corrupted or lost; disk rot can occur, and memory, disk, and network issues can all
contribute to the corruption of a block. We will discuss how HDFS file system checks
and block scanning help make sure any corrupt blocks are replaced.


These features not only maintain the reliability and durability of the data blocks but also
allow for easy administration.

The HDFS client will calculate a checksum for each block and send it to the
DataNode along with the block.

The DataNode stores checksums in a metadata file separate from the block's
data file.

The block as well as the checksum is sent to the client when reading. The client
will validate the checksum and if there is an inconsistency it will inform the
NameNode that the block is corrupt.


Replication Placement
Every file has a block size and replication factor associated with it. All blocks that make
up a file will be the same size except for the last block. The NameNode will make all
decisions regarding block replication for a file in HDFS. DataNodes send block reports to
the NameNode containing a list of all the blocks for a specific DataNode. The
DataNodes are responsible for the creation, deletion and replication of blocks based
upon instructions from NameNode.
Be aware that HDFS block placement does not take into account disk space utilization on
the DataNodes. This ensures that blocks are placed for availability and not just on the
DataNodes with the most free space.
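To see exactly where the replicas of a file's blocks were placed, fsck can print the DataNode (and rack) locations. The sketch below reuses the hadoop-common.jar file uploaded in the Unit 2 demonstration:

# hdfs fsck /user/root/hadoop-common.jar -files -blocks -locations -racks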


Below is an example of how blocks and metadata are laid out in a DataNode directory.
The data blocks are stored as files whose names begin with the blk_ prefix and
contain the raw bytes. The metadata file has the .meta suffix and contains header,
version, and type information. It also contains the checksum data for the blocks.
${dfs.data.dir}/current/VERSION
/blk_<id_1>
/blk_<id_1>.meta
/blk_<id_2>
/blk_<id_2>.meta
/...
/blk_<id_64>
/blk_<id_64>.meta
/subdirectory0/
/subdirectory1/
/...


Data Integrity Writing Data


1. Client to NameNode: I want to write a block of data.
2. NameNode to client: OK, please use DataNodes 1, 4, 12.
3. The client sends the data plus its checksum to DataNode 1.
4. DataNode 1 forwards the data and checksum down the data pipeline to DataNode 4, which verifies the checksum and forwards them to DataNode 12.
5. Each DataNode reports success back up the pipeline.
6. The client receives the final success acknowledgement.

Data Integrity Writing Data


High-performing applications stream data to files, and HDFS does this as well; the HDFS
client caches packets of data in memory. Once that data reaches the HDFS block size,
the client will notify the NameNode. The NameNode responds with information about
the block and the locations of the DataNodes that will hold its replicas. The client will
then stream the packet of data to the first targeted DataNode. Replication is performed in a pipeline
the packet of data to the first targeted DataNode. Replication is performed in a pipeline
fashion; the first DataNode will start writing the block and will then transfer that data to
the second DataNode. The second DataNode will start sending the data to the third
DataNode and so on.
When the blocks in a directory reach a defined limit, which is controlled via
dfs.datanode.numblocks, the DataNode will define a new subdirectory. After defining
the subdirectory it will start placing new data blocks and the corresponding metadata in
that subdirectory. This is performed using a fan-out structure ensuring no single
directory is overloaded with files or becomes too deep.
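One easy way to watch the pipeline's result is to write a file larger than one block and
then ask fsck (covered later in this unit) where the replicas landed. This is a sketch;
the file name and size are illustrative.

# dd if=/dev/urandom of=/tmp/bigfile bs=1M count=200      # create a ~200 MB test file
# hadoop fs -put /tmp/bigfile /user/root/bigfile          # stream it into HDFS through the client
# hdfs fsck /user/root/bigfile -files -blocks -locations  # show each block and the DataNodes holding its replicas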


Checksums
Checksums are generated for each data block and are used to validate the block during
reads. A checksum is created for a set number of bytes of data as defined by
io.bytes.per.checksum. The size of the checksum data is minimal, for instance; a CRC32 checksum is 4 bytes long.

Note: The default number of bytes for the checksum is 512.
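To put the overhead in perspective: with the default 512 bytes per checksum, a 128 MB
block is divided into 134,217,728 / 512 = 262,144 chunks, and 262,144 chunks x 4 bytes per
CRC-32 value is about 1 MB of checksum data per block, well under 1% of the block size.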

Investigating Corrupt Blocks


In the case of corrupt blocks, you may want to verify or recover as much of the data as
possible. Checksum verification can be turned off if you want to look at the corrupt
blocks. To disable the verification (a command sketch follows this list):

When using the HDFS API, call FileSystem.setVerifyChecksum(false) before opening a
file to read.

Use the -ignoreCrc option with the -get or -copyToLocal command to read the data.
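For example, a sketch of reading a suspect file anyway (the path is illustrative):

# hadoop fs -get -ignoreCrc /user/root/suspect_file /tmp/suspect_file   # copy the file out without checksum verification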


Data Integrity - Reading Data

[Figure: the read path. (1) The client asks the NameNode for a portion of file.txt. (2) The
NameNode replies with the block location, e.g. "you'll find it on DataNode 12, block 5."
(3) The client reads the block from that DataNode and verifies its checksum.]

Data Integrity - Reading Data


When a Hadoop client reads a file, it gets a list of the blocks and the locations of the
block replicas from the NameNode. The blocks are sorted based on their distance from the
client, and the client attempts to read from the closest block replica. If that DataNode
is down or the first block replica is corrupt, the client gets the next replica of the
block from a different DataNode.
The client validates checksums to make sure the block is valid: a computed checksum is
compared against the stored checksum. If the client detects a problem with the checksum,
it reads the block from the next replica in the list, on a different DataNode.


Data Integrity - Block Scanning

[Figure: each DataNode performs periodic block scanning and checksum checking (a deep
inspection roughly every 3 weeks), displays bad blocks in its web UI, and reports bad
blocks to the NameNode with its block report; new replicas are then created for the bad
blocks.]

Data Integrity - Block Scanning


The block scanner is a software process that runs over a defined time frame to validate
the integrity of the blocks on a DataNode. The block scanner's role is to check all the
blocks on a DataNode and report the results to the NameNode.
The block scanner will:

Ensure each block matches its stored checksums.

Notify the DataNode with the results for each validated block.

Inform the NameNode if it detects a corrupt block.

NOT fix any corrupt blocks.

The block scanner adjusts its read rate to ensure it completes the block scanning within
the defined time frame. The time frame is defined by the parameter
dfs.datanode.scan.period.hours (the default is 504 hours, or 3 weeks). The DataNode keeps
an in-memory list of the blocks' verification times, which are also stored in a log file.


The Block Scanning Report can be accessed from the DataNode web UI:
http://datanode:50075/blockScannerReport
The period of time between block scanner runs is set with the
dfs.datanode.scan.period.hours property.
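A quick sketch of pulling the report from the command line (the hostname is an example):

$ curl "http://datanode1:50075/blockScannerReport"              # summary of the scan on this DataNode
$ curl "http://datanode1:50075/blockScannerReport?listblocks"   # per-block status and verification times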


Running a File System Check


Block information is gathered on the DataNodes, and heartbeats and block reports are used
to send that information to the NameNode. An HDFS file system check reviews the block
metadata looking for any issues, and reports on the health of HDFS files based on the
status of their blocks.
A Linux file system check can repair files on a local Linux file system. The HDFS file
system check, by contrast, only checks the blocks of HDFS files and reports the results;
the block information held by the NameNode is what drives repairs such as re-replication
of the affected blocks.
The fsck command is run from the command line:
# hdfs fsck [path] [options]

If the fsck command is run with no arguments, it prints usage information. If run with a
path, the command checks all blocks for the files within that path. If / is given as the
path, the entire HDFS file system is checked.
By default, fsck does not check files that are open for write by a Hadoop client; the
-openforwrite option overrides that default.


What Does the File System Check Look For?


The fsck command recursively walks the file system namespace, starting at the given path,
and checks the blocks for all files it finds. It prints a dot for every file it checks. To
check a file, fsck retrieves the metadata for the file's blocks and looks for problems or
inconsistencies.

NOTE: fsck retrieves all of its information from the NameNode; it does not
communicate with any DataNodes to retrieve block data.

A few of the conditions fsck looks for are listed below (a filtering sketch follows the list):

Over-replicated blocks: Blocks that exceed their target replication for the file
they belong to. Normally, over-replication is not a problem, and HDFS will
automatically delete excess replicas.

Under-replicated blocks: Blocks that do not meet their target replication for the
file they belong to. The NameNode will automatically create new replicas of
under-replicated blocks until they meet the target replication. You can get
information about the blocks being replicated (or waiting to be replicated) using
hdfs dfsadmin -metasave.


Mis-replicated blocks: Blocks that do not satisfy the block replica placement
policy. For example, for a replication level of three in a multirack cluster, if all
three replicas of a block are on the same rack, then the block is mis-replicated
because the replicas should be spread across at least two racks for resilience.
The NameNode will automatically re-replicate mis-replicated blocks so that they
satisfy the rack placement policy.

Corrupt blocks: Blocks whose replicas are corrupt.

Missing replicas: Blocks with no replicas anywhere in the cluster.
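A sketch of narrowing a full fsck run down to the problem blocks only; the grep patterns
simply match the strings fsck prints for unhealthy blocks, and the first grep drops the
per-file progress dots:

$ hdfs fsck / | grep -v '^\.*$' | egrep "CORRUPT|MISSING|Target Replicas"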


hdfs fsck Syntax


hdfs fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

Option          Description
<path>          Start checking from this path.
-move           Move corrupted files to /lost+found.
-delete         Delete corrupted files.
-openforwrite   Print out files opened for write.
-files          Print out information on the files being checked.
-blocks         Print out block information.
-locations      Print out the DataNode addresses of blocks.
-racks          Print out rack information for the DataNodes.

hdfs fsck Syntax


The following options may be used to request additional block information:

Option       Description
-files       File name and path, file size in bytes and blocks, and file health status.
-blocks      Block name and length, and the number of replicas of the block.
-locations   DataNode addresses of the blocks (including replicas).
-racks       Rack name added to the location information.
-delete      Remove corrupt files.
-move        Move corrupt files to /lost+found.


Code example:
$ hdfs fsck /user/user-name -files -blocks -locations -racks


Data Integrity - File System Check

[Figure: the fsck client sends a check request to the NameNode and receives the results;
fsck checks the conditions reported in the NameNode metadata.]

$ hdfs fsck /
....................................................................
Status: HEALTHY
 Total size:                  128847681 B
 Total dirs:                  144
 Total files:                 200 (Files currently being written: 3)
 Total blocks (validated):    198 (avg. block size 650745 B) (Total open file blocks (not validated): 3)
 Minimally replicated blocks: 198 (100.0 %)
 Over-replicated blocks:      0 (0.0 %)
 Under-replicated blocks:     1 (0.5050505 %)
 Mis-replicated blocks:       0 (0.0 %)
 Default replication factor:  3
 Average block replication:   2.989899
 Corrupt blocks:              0
 Missing replicas:            7 (1.1824324 %)
 Number of data-nodes:        3
 Number of racks:             1
FSCK ended at Wed Oct 03 12:05:58 EDT 2012 in 44 milliseconds

Data Integrity - File System Check: Commands & Output


$ hdfs fsck /
  All files in the HDFS file system are checked.

$ hadoop fs -du -s -h /
  HDFS storage used, not including the replication factor.

$ hadoop fs -count -q /
  Quota information, which can also show the storage space used.

$ hdfs fsck / -openforwrite
  A list of all files currently open for writing.

$ hdfs fsck / -files -blocks -locations
  Lists all files that are checked and all the blocks for each file, including the
  addresses of the DataNodes containing the blocks.


Make sure to redirect fsck output to a file when working on a large cluster; writing to
STDOUT on a large cluster can be time consuming.
$ hdfs fsck / -files -blocks -locations > myfsck001.log

Look for key patterns in the output of fsck (a command sketch follows this list). Search
for these strings:

Target Replicas is x but found y replica(s).

CORRUPT block

CORRUPT

MISSING

Minimally replicated blocks
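For example, a sketch of scanning a saved report for those strings (the log file name
matches the earlier example):

$ egrep -c "CORRUPT" myfsck001.log                # number of lines mentioning corrupt blocks
$ egrep -c "MISSING" myfsck001.log                # number of lines mentioning missing blocks or replicas
$ egrep "Target Replicas" myfsck001.log | head    # files whose replica count does not match the target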


The dfsadmin and dfs Commands


The hdfs dfsadmin command supports a number of HDFS administration operations:
The -report option provides detailed information on the HDFS environment.
The -refreshNodes option refreshes the list of nodes to decommission, defined in the
file specified by the dfs.hosts.exclude property.

The hdfs dfs command can be used to get a detailed listing of the HDFS namespace:
$ hdfs dfs -ls -R / > mydfslsr001.log

The dfsadmin and dfs Commands


The hdfs dfsadmin command syntax is:
hdfs dfsadmin [-report] [-safemode enter | leave | get | wait]
  [-refreshNodes] [-finalizeUpgrade]
  [-upgradeProgress status | details | force] [-metasave filename]
  [-setQuota <quota> <dirname>...<dirname>]
  [-clrQuota <dirname>...<dirname>] [-help [cmd]]

dfsadmin option: Description

-report: Provides statistics on HDFS.

-safemode: Used to enter or leave Safemode.

-finalizeUpgrade: Removes the cluster backup made during the last upgrade.

-refreshNodes: Updates the DataNodes allowed to connect to the NameNode. Re-reads the
configuration to update the DataNodes defined in the files specified by the dfs.hosts and
dfs.hosts.exclude properties:

Each entry that exists in dfs.hosts.exclude and NOT in dfs.hosts is decommissioned.

Each entry that exists in dfs.hosts and NOT in dfs.hosts.exclude is stopped from
decommissioning if it had previously been marked for decommissioning.

DataNodes in neither dfs.hosts nor dfs.hosts.exclude are decommissioned.

$ hdfs dfsadmin -report > datanodereport001.log
  Returns a list of all the DataNodes in a cluster.

$ hdfs dfsadmin -report > datanodelist102113.log
$ hdfs dfs -ls -R / > namespace102113.log
  Used together, the dfsadmin and dfs commands can be used to compare old and current
  lists of DataNodes and the HDFS namespace.

$ hdfs dfsadmin -safemode enter
  Puts HDFS in Safemode.

$ hdfs dfsadmin -refreshNodes
  Decommissions or recommissions DataNode(s).

$ hdfs dfsadmin -upgradeProgress status
  Determines whether an upgrade is currently in progress.

$ hdfs dfsadmin -finalizeUpgrade
  Makes an upgrade permanent after completing all the work.
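Putting -refreshNodes and -report together, here is a sketch of a typical decommissioning
workflow. The exclude-file path and hostname are examples only; use the file your
dfs.hosts.exclude property actually points to.

# echo "node4.example.com" >> /etc/hadoop/conf/dfs.exclude   # add the DataNode to the exclude file
# hdfs dfsadmin -refreshNodes                                # tell the NameNode to re-read dfs.hosts / dfs.hosts.exclude
# hdfs dfsadmin -report | grep "Decommission Status"         # watch the per-DataNode decommission status change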


NameNode Information
The NameNode information can also be saved to a file using the -metasave option:

347 files and directories, 201 blocks = 548 total
Live Datanodes: 3
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 1
/user/root/.staging/job_201210012351_0005/job.jar: blk_2319384830921372914_1198 (replicas: l: 3 d: 0 c: 0 e: 0) 10.202.29.145:50010 : 10.77.22.74:50010 : 10.34.49.188:50010 :
Metasave: Blocks being replicated: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Metasave: Number of datanodes: 3
10.202.29.145:50010 IN 885570207744(824.75 GB) 124239872(118.48 MB) 0.01% 842086092800(784.25 GB) Wed Oct 03 12:18:41 EDT 2012
10.77.22.74:50010 IN 885570207744(824.75 GB) 132116480(126 MB) 0.01% 842075832320(784.24 GB) Wed Oct 03 12:18:42 EDT 2012
10.34.49.188:50010 IN 885570207744(824.75 GB) 122011648(116.36 MB) 0.01% 842083241984(784.25 GB) Wed Oct 03 12:18:41 EDT 2012

NameNode Information
The -metasave option creates a file (named filename) that is written to
HADOOP_LOG_DIR/hadoop/hdfs on the NameNode's local file system (a command sketch follows
the list) and contains:

Blocks waiting for replication.

Blocks being replicated.

Blocks waiting for deletion.

The list of DataNodes each block is replicated to.

Summary statistics.
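A short sketch of capturing and reading a metasave file. The file name is arbitrary, and
the log directory shown is only an example; check HADOOP_LOG_DIR on your NameNode.

# hdfs dfsadmin -metasave meta-snapshot.log      # the file is written on the NameNode, under its log directory
# less /var/log/hadoop/hdfs/meta-snapshot.log    # review blocks waiting for replication or deletion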


Changing the Replication Factor


The replication factor can be increased or decreased. When the replication factor of a
file is decreased, the NameNode eliminates block replicas to reach the new replication
factor: it instructs a DataNode to remove a block replica during a heartbeat, and the
DataNode removes the replica and frees up the space, although there may be a time delay
before the block is actually deleted.
To set the replication of a file to 2:
$ hdfs dfs -setrep -w 2 /directorypath/file

To set the replication of all of HDFS recursively to 4:

$ hdfs dfs -setrep -R -w 4 /


Unit 5 Review
1. What is the priority of placement of the second block replica during block
replication? ____________________
2. What is the purpose of setting the io.bytes.per.checksum parameter?
_____________________
3. What process uses the dfs.datanode.scan.period.hours parameter?
_____________________
4. List three things the hdfs fsck command looks for.
5. Which option of the hdfs fsck command would you use to list DataNode
addresses for the blocks? _______________________
6. What output value(s) of the hdfs fsck command would you use to determine the
total amount of disk storage, including replication, that a file is taking up?
7. Why would you run the command below?
$ hdfs dfsadmin -report > myreport001.log


Lab 5.1: Verify Data with Block Scanner and fsck

Objective: View the various tools for performing block verification and
the health of files in HDFS.
Successful Outcome: You will see the result of the Block Scanner Report on node1,
and the output of the fsck command.
Before You Begin: SSH into node1.

Step 1: Stop the HDFS Service


Step 2: Configure the Scan Period
2.1. Go to HDFS -> Configs, then expand the Custom hdfs-site.xml section and
click the Add Property... link.
2.2. Add the following property to hdfs-site.xml:
dfs.datanode.scan.period.hours=1
2.3. Restart the HDFS service.


Step 3: View the Block Scanner Report
3.1. Point your web browser to http://node1:50075/blockScannerReport. The
report will look similar to the following:


3.2. How many blocks are on your node1 DataNode? ______________


Step 4: View the Block Details
4.1. Add the listblocks parameter to the blockScannerReport URL:
http://node1:50075/blockScannerReport?listblocks

You should see a list of all blocks on that DataNode and their status:

NOTE: If a block is corrupt, the NameNode is notified and attempts to fix the
issue. The default time period for scanning blocks is every three weeks, so in
a production environment you would not set this interval to 1 hour like you did
in this lab. Use the block scanner report as a quick way to verify the
integrity of the blocks in your cluster.

Step 5: Run the fsck Command on a File


5.1. Put the file /root/data/test_data into the /user/root folder of HDFS:


# hadoop fs -put ~/data/test_data test_data

5.2. Run the fsck command for the file /user/root/test_data:


# hdfs fsck /user/root/test_data -files

5.3. How many blocks did test_data get split into? ____________
5.4. What is the average block replication of test_data? ___________
Step 6: Using fsck Options
6.1. Run the fsck command again, but this time add the -blocks option:
# hdfs fsck /user/root/test_data -files -blocks

6.2. What did the blocks option add to the output? _________________________
6.3. Add the -locations option as well:
# hdfs fsck /user/root/test_data -files -blocks -locations

6.4. What did the locations option add to the output? _______________________
Step 7: Run a File System Check
7.1. You can run fsck on the entire file system. Enter the following command:
# hdfs fsck /

Notice this command fails, because root does not have permission to view all the
files in HDFS.
7.2. Switch to the hdfs user:
# su - hdfs

7.3. Run fsck on the entire file system:


$ hdfs fsck /

7.4. What is the total size of your HDFS? _______________________


7.5. How many directories does your cluster have? ____________


7.6. How many files are on your cluster? _____________
7.7. How many total blocks are on your cluster? ___________
7.8. What is the average block replication of your cluster? ________________
7.9. Switch back to the root user:
[hdfs@node1 ~]$ exit

Step 8: View the Health of your Cluster


8.1. From the Hosts page for node1 in Ambari, stop the DataNode service on
node1.
8.2. Click on the Services page, then HDFS. You should see that one of your
DataNodes is not live.
8.3. Using the Quick Links menu on the HDFS Services page (shown in the
screenshot below), open the NameNode UI:

The NameNode UI opens the dfshealth.jsp page by default.


8.4. Notice you still have 3 Live Nodes and 0 Dead Nodes:

NOTE: It takes 10.5 minutes for a DataNode to be marked as dead in a cluster.


8.5. Click on the Live Nodes link to view the Live DataNodes in your cluster:

Step 9: Take a Break


9.1. ...and wait for your DataNode to be marked as Dead in your cluster!
9.2. Refresh the Live DataNodes page, and you should only see two live
DataNodes:

9.3. Go back to the dfshealth.jsp page and refresh it. Notice you now have 1 Dead
Node and a large number of under-replicated blocks:

9.4. Why does your cluster have so many under-replicated blocks? ___________
_________________________________________________________________
Step 10: Run fsck Again
10.1. Switch to the hdfs user and run fsck on the entire file system:
# su - hdfs
[hdfs@node1 ~]$ hdfs fsck /

Notice you get a long list of every file that contains under-replicated blocks.


10.2. What is the average block replication now on your cluster? _____________
10.3. Compare the value of Missing replicas in the output of fsck with the value of
Number of Under-Replicated Blocks in the NameNode UI.
Step 11: Start the DataNode Again
11.1. Using Ambari, start the DataNode process on node1.
11.2. Refresh the dfshealth.jsp page in the NameNode UI frequently, and you can
watch as the number of under-replicated blocks gradually decreases to 0:

11.3. Run fsck again on your entire file system, and notice everything is back to
normal again.

RESULT: The Block Scanner Report is a quick way to view the status of the blocks on the
DataNodes of your cluster. The fsck tool is a great way to view the health of your file
system and block replication, as is using the NameNode UI.

Unit 6: HDFS NFS Gateway


Topics covered:

HDFS NFS Gateway Introduction

HDFS NFS Gateway

Configuring the HDFS NFS Gateway

Starting the NFS Gateway Service

User Authentication

Lab 6.1: Mounting HDFS to a Local File System


The HDFS NFS Gateway


HDFS access is usually done using the HDFS API or the web API.
The HDFS NFS Gateway allows HDFS to be mounted as part of a client's local file system.
The NFS Gateway is a stateless daemon that translates the NFS protocol into HDFS access
protocols.

[Figure: an NFSv3 client talks to the NFS Gateway, which embeds a DFSClient; the DFSClient
uses the ClientProtocol to communicate with the NameNode and the DataTransferProtocol to
read and write blocks on the DataNodes.]

HDFS NFS Gateway Introduction


Network File System (NFS) is a distributed file system protocol that allows access to
files on a remote computer in a manner similar to how a local file system is accessed.
The DFSClient runs inside the NFS Gateway daemon (nfs3); the DFSClient is therefore part
of the NFS Gateway.
The HDFS NFS Gateway allows HDFS to be accessed using the NFS protocol, supporting more
applications and use cases. All HDFS shell operations are supported: listing files,
copying files, moving files, and creating and removing directories.

NFS Client: The workload is defined by the number of application users doing the
writing and the number of files being loaded concurrently.

DFSClient: Multiple threads are used to process multiple files. The DFSClient
averages roughly 30 MB/s for writes.

NFS Gateway: Multiple NFS Gateways can be created for scalability.


The HDFS NFS Gateway simplifies data ingest for large-scale analytical workloads. Random
writes are not supported. Ways the NFS interface to HDFS can be used include:

Browsing files on HDFS.

Downloading files from HDFS.

Uploading files to HDFS.

Streaming data directly to HDFS.

A few reminders:

HDFS is a write-once file system: existing file contents cannot be overwritten, only
appended.

NFSv3 is a stateless environment.

After an idle period, the file will be closed.

A subsequent write will reopen the file.

NOTE: NFSv4 support, HA, and Kerberos are on the roadmap for the HDFS NFS
Gateway.


NFS Gateway Node


For scalable loading of data into HDFS, you should use WebHDFS or the Java RPC. NFS
supports loading smaller files (< 1 GB at a time) into HDFS through the NFS Gateway.
The NFS Gateway node is a very easy way to load data, though not necessarily the fastest.
As mentioned earlier, you can create multiple HDFS NFS Gateways to increase scalability.
When an application user writes to an HDFS mount point, the Gateway server generates one
DFSClient. If the application user writes a number of files to the HDFS mount point, the
Gateway server generates one DFSClient with multiple threads. When multiple application
users perform loads with multiple files to each mount point, the Gateway server generates
multiple DFSClients with multiple threads each.
The HDFS NFS Gateway was introduced in HDP 1.3. There are some operational limits to be
aware of:

Additional Gateway servers have to be set up, started, and monitored manually.
Ambari management is on the roadmap.

HA is not built into the Gateway servers. If a Gateway server goes down, the
corresponding HDFS client mounts will fail.


Configuring the HDFS NFS Gateway


Edit the hdfs-site.xml file on your NFS gateway machine and modify the following
property. If the export is mounted with access time update allowed, make sure this
property is not disabled in the configuration file.
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>3600000</value>
<description>The access time for HDFS file is precise
upto this value.
The default value is 1 hour. Setting a value
of 0 disables access times for HDFS.
</description>
</property>


Update the following property to hdfs-site.xml. This sets the maximum number of files
being uploaded in parallel.
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>1024</value>
</property>

Add the following property to hdfs-site.xml. The NFS client often reorders writes, so
sequential writes can arrive at the NFS gateway in random order. This directory is used to
temporarily save out-of-order writes before they are written to HDFS. Make sure the
directory has enough space.
<property>
<name>dfs.nfs3.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>

Change the log level in the log4j.properties file to DEBUG to collect more details:
log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG

To get more details of RPC requests, add the following:


log4j.logger.org.apache.hadoop.oncrpc=DEBUG


Starting the NFS Gateway Service


The nfs3 process can be thought of as a combination of the nfsd and mountd daemons. The
HDP 2 release enables file streaming to HDFS through NFS.
Start the portmap daemon included in the NFS gateway package as root:
# hdfs portmap

Start mountd and nfsd, making sure the user starting the Hadoop cluster and the user
starting the NFS gateway are the same:
$ hdfs nfs3

Stop the NFS gateway service.


$ hadoop-daemon.sh stop nfs3
$ hadoop-daemon.sh stop portmap

Make sure NFS gateway services have started properly. Verify mountd, portmapper and
NFS are up and running.


Execute the following command to verify that all the services are up and running:
# rpcinfo -p $nfs_server_ip

Make sure the HDFS namespace is exported and can be mounted by any client:
# showmount -e $nfs_server_ip

Make HDFS available through NFS:
# mount -t nfs -o vers=3,proto=tcp,nolock $server:/ $mount_point
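A minimal end-to-end sketch, assuming the gateway runs on node1 and /hdfs is used as the
client-side mount point (both are examples):

# mkdir -p /hdfs                                          # create the local mount point
# mount -t nfs -o vers=3,proto=tcp,nolock node1:/ /hdfs   # mount the HDFS root exported by the gateway
# df -h /hdfs                                             # the export appears as an NFS file system
# ls -l /hdfs                                             # browse HDFS as if it were local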


User Authentication
The OS login user ID on the NFS client must match the user ID accessing HDFS.
LDAP/NIS should be used to make sure the same user IDs are deployed on the NFS client
and HDFS.

[Figure: a user (e.g. "Gage," UID 350, GID 300) logs in on the NFS client; the NFS Gateway
looks up the same UID/GID when performing that user's HDFS operations against the
NameNode.]

User Authentication
The user authentication method needs to make sure the UID/GID match between the user on
the NFS client and the user running the HDFS operations.
Manual creation of users is not recommended for production environments.


Unit 6 Review
1. What nodes in a Hadoop cluster can the HDFS NFS gateway run on?
2. A _________ needs to be running on the NFS gateway node.
3. What configuration file is modified to configure the HDFS NFS Gateway server?
4. The _____________ must match between the NFS client and HDFS for user
authentication.


Lab 6.1: Mounting HDFS to a Local File System

Objective: To learn how to mount HDFS to a local file system.


Successful Outcome: HDFS will be mounted to /home/hdfs/hdfs-mount.
Before You Begin: SSH into node1.

Step 1: Install NFS


1.1. Execute the following yum command to install NFS on node1:
# yum -y install nfs*

Step 2: Configure NFS


2.1. Within Ambari, stop the HDFS service.
2.2. Go to Services -> HDFS -> Configs, scroll down to the Advanced section and
click on the Advanced heading to expand the section.
2.3. Locate the dfs.namenode.accesstime.precision property and set it to
3600000.
2.4. Scroll down to the "Custom hdfs-site.xml" section.
2.5. Using the "Add Property..." link, add the following new property:
dfs.nfs3.dump.dir=/tmp/.hdfs-nfs

2.6. Click the Save button to save your changes.


2.7. Start the HDFS service back up.
Step 3: Start NFS

3.1. Run the following commands to stop the nfs and rpcbind services. (If they are
not running, the following commands will fail, which is no problem):
# service nfs stop
# service rpcbind stop

3.2. Now start the portmap and NFS gateway services:
# hdfs portmap &
# hdfs nfs3 &

3.3. Verify that the required services are up and running.


# rpcinfo -p node1

You should see output similar to the following:

3.4. Verify that the HDFS namespace is exported and can be mounted by any
client.
# showmount -e node1

You should see output similar to the following:


Step 4: Mount HDFS to the Local File System


4.1. As the root user, create a new directory on your local file system named
/home/hdfs/hdfs-mount.
4.2. Change its ownership to the hdfs user.
4.3. Execute the following command on a single line to mount HDFS to the hdfs-mount directory:
# mount -t nfs -o vers=3,proto=tcp,nolock node1:/
/home/hdfs/hdfs-mount

4.4. Browse HDFS now on your local file system.


# ls -l /home/hdfs/hdfs-mount

4.5. Switch to the hdfs user.


4.6. Copy a local file using the cp command into the hdfs-mount directory, and
then check whether you can see this file using the hadoop fs -ls / command.
4.7. Delete the file from the local file system using the rm command, and verify
the file is no longer in HDFS.
4.8. Try other Linux commands and see whether they work successfully.

RESULT: You have mounted HDFS to a local file system, which can be a convenient skill
to know how to do when working frequently with files in HDFS.


Unit 7: YARN Architecture and MapReduce


Topics covered:

What is YARN?

Hadoop as Next-Gen Platform

Beyond MapReduce

YARN Use Case

YARN Birds Eye View

Lifecycle of a YARN Application

ResourceManager

NodeManager

MapReduce

Demonstration: Understanding MapReduce

Configuring YARN

Configuring MapReduce

Tools

Lab 7.1: Troubleshooting a MapReduce Job


What is YARN?
The goal of an operating system is to enable applications to achieve 100% utilization of
all resources on the physical system while letting every application execute at its
maximum potential. This is what YARN achieves in a Hadoop cluster. The compute resources
managed by YARN in a Hadoop cluster are memory and CPU. A YARN application can request
these resources, and YARN will make them available according to its scheduler policy.
What distinguishes YARN from other distributed compute frameworks is that applications
that run on YARN can be developed rapidly. Many standalone applications have already been
adapted, ranging from batch applications such as MapReduce to real-time, always-on
database applications such as HOYA (HBase on YARN).


Hadoop as Next-Gen Platform

[Figure: Hadoop 1.0 was a single-use system for batch applications: MapReduce handled both
cluster resource management and data processing on top of HDFS (redundant, reliable
storage). Hadoop 2.0 is a multi-purpose platform for batch, interactive, online, and
streaming applications: YARN handles cluster resource management, MapReduce and other
frameworks handle data processing, and HDFS2 provides redundant, reliable storage.]

Hadoop as Next-Gen Platform


What are key areas that needed improvement in Hadoop 1 that were addressed in YARN?

Scalability: It was difficult for a Hadoop 1 cluster to go beyond 4,000 nodes.

Availability: The JobTracker was a single point of failure. If it failed, then ALL
jobs failed.

Hard partition of resources into map and reduce slots: This limitation is a major
factor that causes a cluster's compute resources to be underutilized.

Lack of support for alternate paradigms and services: Legacy Hadoop was meant to
solve batch-processing scenarios, and MapReduce was the only programming paradigm
available.

Hadoop 2 presents five key benefits:

Scale: YARN has been successfully deployed on clusters of 35,000+ nodes.

New programming models & services: Resource management and task management are now
separate concerns. This opens the door to different types of applications;
applications don't necessarily have to be batch jobs. Resources are now generic, so
any kind of application can ask for them.

Improved cluster utilization: In Hadoop 1 MapReduce, each worker node or TaskTracker
had a hardcoded allocation of slots that were a combination of Map and Reduce slots.
Because resources are dynamically allocated per application in YARN, there are no
longer any wasted compute units.

Agility: Applications written on YARN can adapt to a changing cluster in terms of
resource availability, data locality, state persistence for handling failure, etc.

Beyond Java: The types of applications that run on YARN are not limited to Java.
Applications written in any language can run natively, as long as the binaries are
installed on the cluster, all while requesting resources from YARN and utilizing
HDFS.


Beyond MapReduce

Tez: MapReduce as a workflow.
Hoya (HBase on YARN): HBase resource management is now centralized.
Storm: stream processing; an always-running application.

[Figure: applications run natively in Hadoop on top of YARN (cluster resource management)
and HDFS2 (redundant, reliable storage): BATCH (MapReduce), INTERACTIVE (Tez), ONLINE
(HBase), STREAMING (Storm, S4), GRAPH (Giraph), IN-MEMORY (Spark), HPC/MPI (OpenMPI), and
OTHER (Search, Weave).]

Beyond MapReduce
Remember that MapReduce is just one type of application paradigm that can run on YARN.
Applications are continuously being ported to run on YARN to utilize HDFS storage and, in
some cases, YARN's distributed compute framework itself.


YARN Use-case
Two key factors led to dropping an entire datacenter of 10,000 nodes:
1. YARN is a pure resource manager; it does not care about application specifics such as
what type of application is running on the cluster. Resource management is lightweight
once those details are offloaded to other processes. YARN simply knows about resource
availability for each node in the cluster and leases those resources based on its
scheduler policy. The responsibility for how applications use these resources is left to
another type of per-application process called an ApplicationMaster.
2. MapReduce in Hadoop 2 (MRv2) has itself taken advantage of this architecture: each job
has its own ApplicationMaster (we will discuss ApplicationMaster details later on), and
each MRv2 job's resource requests are dynamically sized for its Map and Reduce processes.


YARN Bird's Eye View

ResourceManager (master): application management, scheduling, and security.
NodeManager (worker): provides resources (memory, CPU) to lease; the Container is the unit
of processing; an ApplicationMaster manages each individual job.
Job History Server: job history preserved in HDFS.
Client & Admin Utilities: CLI, REST, JMX.

[Figure: the ResourceManager (ApplicationsManager + Scheduler) coordinates several
NodeManagers; each NodeManager hosts Containers running ApplicationMasters and job tasks
(e.g. Job1 Task1, Job2 Map1, Job2 Reducer), along with free capacity.]

YARN Bird's Eye View


The following core components make up the YARN framework:
ResourceManager
The ResourceManager is the heart of YARN. It is the single entry point for clients to
submit YARN applications. It is responsible for:

Application management: Launching applications, tracking application status, and
leasing cluster resources to applications.

Scheduling: When an application requests resources, resources are made available (or
rejected) based on the ResourceManager's scheduler policy. The default scheduler is
the Capacity Scheduler, where an application is submitted to a pre-configured queue
and each queue has a certain portion of the cluster's resources allocated to it.
Another type of scheduler is the Fair Scheduler. These are discussed in more detail
later.

Security: Intra-communication between YARN components can be secured with
authorization tokens. The ResourceManager contains token managers that are used by
various components:


AMRMTokenSecretManager: Per application; ApplicationMasters use a token when
submitting ResourceManager requests. The manager stores the tokens by ApplicationID.

RMContainerTokenSecretManager: Per container; after a container is leased by the
ResourceManager to the ApplicationMaster, the ApplicationMaster submits a
container-specific token as part of the container launch request to the NodeManager.

RMDelegationTokenSecretManager: Per RM client; an unauthenticated process can be
passed a delegation token by a trusted client and can then communicate with the
ResourceManager.

NodeManagers: The worker nodes in a YARN cluster. They publish resource pools (memory
and CPU) to the ResourceManager, which maintains an aggregate view of these resources.

Containers: A Container is the atomic resource unit in YARN. Containers are sized by
memory and CPU (virtual cores) and are dynamic, so the number of containers that can
run in parallel on a single NodeManager varies depending on what the
ApplicationMaster has requested.

ApplicationMasters: The controllers in a YARN application. An ApplicationMaster's
responsibility is to ensure that the application gets the necessary resources, so it
requests and releases them from the ResourceManager. It is also responsible for
handling container failures, application-level cleanup, and application-level failure
recovery. ApplicationMasters themselves run in a Container.

Client & Admin Utilities: YARN provides both client and admin command-line tools. For
monitoring YARN components, there is a REST API as well as MBeans for the daemon
processes.


Lifecycle of a YARN Application

[Figure: (1) the client submits an application request to the ResourceManager and (2)
receives an ApplicationID; (3) the client submits the Application Submission Context
(user, queue, dependencies, resource requests) plus the Container Launch Context (resource
requirements, launch commands, delegation tokens); (4) the ResourceManager starts the
ApplicationMaster in a Container on a NodeManager; (5) the ApplicationMaster contacts the
ApplicationMasterService to get cluster capabilities; (6) it requests and receives
Containers and sends container launch requests to the NodeManagers, which run the job's
map and reduce tasks.]

Lifecycle of a YARN Application


1. Client submits application request: This is an initial lightweight request for an
ApplicationID.
2. Response with ApplicationID: If the request is successful, the
ApplicationManager will respond with an ApplicationID, which will be used for
the actual application submission to the cluster.
3. Application Submission Context (ASC) + Container Launch Context (CLC): The
client submits the application to the ApplicationManager. Any queue,
dependencies, container launch commands, etc. are sent in the request as well.
4. The ApplicationManager is responsible for finding a Container on a
NodeManager to start the ApplicationMaster.
5. Once the ApplicationMaster has started, it establishes a connection with the
ResourceManager, specifically with a component called the ApplicationMasterService.
It then retrieves the cluster's capabilities (memory and CPU availability).


6. The ApplicationMaster has the option to continue. If it does, it will send a


request for Containers to run the application. In the case of MapReduce, it will
ask for specific NodeManagers because it wants to co-locate processing with
data blocks in HDFS.
7. The ApplicationMaster will receive leases on Containers per the YarnScheduler
policy. An ApplicationMaster can request resources as often as necessary. It will
also have to take care of graceful shutdown of Containers by releasing its lease(s)
to the ResourceManager.
8. Once the ApplicationMaster has determined that the application should end, it
can optionally persist application logs to HDFS, and any other post-process
activity before it self-terminates.
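You can follow this lifecycle from the client side with the yarn command-line tools; a
sketch (the application ID is illustrative):

$ yarn application -list                                    # applications currently accepted or running
$ yarn application -status application_1400000000000_0001   # state, progress, queue, and tracking URL
$ yarn logs -applicationId application_1400000000000_0001   # aggregated container logs (requires log aggregation)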


ResourceManager


NodeManager


MapReduce

[Figure: the Map phase runs Mappers on NodeManager/DataNode machines, the data is shuffled
across the network and sorted, and the Reduce phase runs Reducers, also co-located on
NodeManager/DataNode machines.]

MapReduce
The original use case for Hadoop was distributed batch processing. MapReduce is a
powerful application paradigm for processing massive amounts of data.
Core features of MapReduce are:

Co-locating processing with data blocks: Take the computing to where the data
lives, rather than querying or reading data into a remote application. Would you
rather move hundreds of GB/TB of data around your network, or would you
rather move an application that processes the same data to where the data
actually lives?

Map Phase: This is the initial phase of all MapReduce jobs. This is where raw
data can be read, extracted, transformed, and results written out to HDFS or
moved on to Reducers for aggregate processing, such as a final count, sum, min,
max, etc. The Map phase can also be thought of as the ETL or projection step for
MapReduce.

Reduce Phase: This is the final phase, where data is sorted on a user-defined key and
grouped by that same key. The Reducer has the option to perform an aggregate function
on that data. The Reduce phase can be thought of as the aggregation step.

Data is always moved along the pipeline in MapReduce in the form of key/value
pairs.

A MapReduce job scales to the size of the data. For example, if a dataset in HDFS is 1
terabyte broken into 256 MB blocks, it is possible for 4,096 mappers (1 TB / 256 MB) to
run in parallel, one reading each block, if the cluster has the capacity.


Demonstration: Understanding MapReduce

Objective: To understand how MapReduce works.


During this demonstration: Watch as your instructor performs the following steps.

Step 1: Put the File into HDFS


1.1. Change directories to the labs folder:
# cd /root/labs

1.2. Notice a file named constitution.txt:


# more constitution.txt

1.3. Put the file into HDFS:


# hadoop fs -put constitution.txt constitution.txt

Step 2: Run the WordCount Job


2.1. The following command runs the WordCount job on the constitution.txt and
writes the output to wordcount_output:
# hadoop jar wordcount.jar wordcount.WordCountJob
constitution.txt wordcount_output

2.2. Notice a MapReduce job gets submitted to the cluster. Wait for the job to
complete.


Step 3: View the Results


3.1. View the contents of the wordcount_output folder:
# hadoop fs -ls wordcount_output

You should see a single output file named part-r-00000:


Found 1 items
-rw-r--r-- 1 root hdfs 17031 wordcount_output/part-r-00000

3.2. Why is there one file in this directory? ______________________________


3.3. What does the r in the filename stand for? _________________________
3.4. View the contents of part-r-00000:
# hadoop fs -cat wordcount_output/part-r-00000

3.5. Why are the words sorted alphabetically? _____________________________


3.6. What was the key output by the WordCount reducer? ___________
3.7. What was the value output by the WordCount reducer? _____________
3.8. Based on the output of the reducer, what do you think the mapper output
contained as key/value pairs? _________________________________________
__________________________________________________________________


Configuring YARN
For configuring YARN, there is one core configuration file:
/etc/hadoop/conf/yarn-site.xml

The most important aspect of configuring YARN is how resource allocation works. There are
two types of resources:

Physical: The total physical memory that a container will claim.

Virtual: The total virtual memory that a container will claim. It is usually much
larger than physical memory. You want to keep this higher because, once the containers
are running, a process can often take advantage of virtual memory addressing to give
the application the impression that it has more memory than is physically allocated.

Why does this work? Because the underlying operating system will page out memory that
is not being used to a partition on its local disk known as the swap partition. The
sketch below shows where the related memory settings typically live.
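As a sketch, the properties below are the standard YARN keys that govern how much physical
memory a NodeManager offers, the container size bounds, and the virtual-to-physical
ratio; the grep simply shows what your cluster currently has (the config path assumes an
Ambari-style layout):

$ grep -A 1 -E "yarn.nodemanager.resource.memory-mb|yarn.scheduler.(minimum|maximum)-allocation-mb|yarn.nodemanager.vmem-pmem-ratio" \
      /etc/hadoop/conf/yarn-site.xml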


Ports and firewall considerations:

Service                  Servers            Port    Protocol   Description
ResourceManager WebUI    ResourceManager    8088    http       Web UI for the RM
ResourceManager          ResourceManager    8032    IPC        Application submissions
NodeManager WebUI        NodeManagers       50060   http       Web UI for the NMs

Configuring MapReduce
There are a few additional considerations for properly configuring MapReduce. In
mapred-site.xml, there are two properties that should be tuned together.
MapReduce container size (physical):

mapreduce.map.memory.mb: 1 GB heap + 512 MB for non-heap memory such as permgen.

mapreduce.reduce.memory.mb: 2 GB heap + 512 MB for non-heap memory such as permgen.

Now you need to ensure that the JVM heap is lower than the physical memory allotted to the
container:
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2048m</value>
</property>


Notice the -Xmx (Java heap maximum) of 2048 MB is less than the reduce container
allocation of 2560 MB (2 GB + 512 MB). This gives the container some breathing room to
keep running without hitting out-of-memory issues.


Tools
Starting daemons from the command line:
# yarn resourcemanager
# yarn nodemanager
# yarn proxyserver

Web Application Proxy


It is the responsibility of an ApplicationMaster to provide a web UI and send the link to
that UI to the ResourceManager. The ResourceManager runs as a trusted user, but
ApplicationMasters do not. To warn users that they are visiting an untrusted link, and to
strip out any cookies sent by the user, a proxy server can optionally be started. Note
that the web application proxy runs embedded within the ResourceManager if
yarn.web-proxy.address is not set.


Admin operations

Operation            Description
$ yarn rmadmin       Primarily used to refresh properties such as queues, nodes, and ACLs.
$ yarn application   List or kill applications.
$ yarn node          Print node reports, such as status (rack info, containers, memory, CPU), and view nodes by state (e.g. RUNNING, INITING).
$ yarn logs          Dump container-level logs to stdout.
$ yarn daemonlog     Get or set the log level for live daemons.

REST
YARN and MapReduce applications, as well as the cluster itself, can be monitored via the
REST API. The REST API is available at:
http://<resourcemanager:port>/ws/v1/cluster
http://<node:port>/ws/v1/node
http://<webapplicationproxy:port>/proxy/<appid>/ws/v1/mapreduce

Currently, GET requests are supported for monitoring and gathering metrics. Full REST API
usage with examples can be found at:
http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html
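For example, a sketch of querying the ResourceManager REST API with curl (the hostname is
an example; 8088 is the default ResourceManager web port):

$ curl "http://resourcemanager:8088/ws/v1/cluster/metrics"             # cluster-wide memory, vcore, and application counts
$ curl "http://resourcemanager:8088/ws/v1/cluster/apps?states=RUNNING" # details of currently running applications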

Ambari
The primary management and provisioning of YARN components should be done via
Ambari (if possible). Ambari has extensive UIs to manage your YARN cluster.


Unit 7 Review
1. What are the three main phases of a MapReduce job? _____________
_________________________________________________________
2. What determines the number of Mappers of a MapReduce job? ___________
________________________________________________________________
3. What determines the number of Reducers of a MapReduce job? ___________
________________________________________________________________

4. What are the 3 main components of YARN? ____________________________


________________________________________________________________
5. What happens if a Container fails to complete its task in a YARN application?
________________________________________________________________


Lab 7.1: Troubleshooting a MapReduce Job


Objective: To learn tips on how to troubleshoot issues with a
MapReduce job.
Successful Outcome: You will have run the IndexInverter MapReduce job several
times and viewed its log files.
Before You Begin: SSH into node1.

Step 1: Put the File into HDFS


1.1. Change directories to the labs folder:
# cd /root/labs

1.2. Notice a file named hortonworks.txt. View its contents:


# more hortonworks.txt

This file contains URLs, along with keywords found on the webpages of each URL.

NOTE: The MapReduce job in this lab computes an inverted index, one of
the earliest use cases of Hadoop and MapReduce. A Web crawler scans the
Internet and retrieves URLs along with keywords on each page. The index
inverter job flips this information around, outputting the keywords along
with each web page that contains the keyword.

Step 2: Run the IndexInverterJob


2.1. Enter the following command to run the IndexInverterJob:
# hadoop jar invertedindex.jar inverted.IndexInverterJob
hortonworks.txt index_output

2.2. The job fails. What is the issue? ___________________________________


________________________________________________________________
Step 3: Run the Job Again
3.1. Put the file /root/labs/hortonworks.txt into HDFS:
# hadoop fs -put /root/labs/hortonworks.txt hortonworks.txt

3.2. Run the job again, using the same command as the previous step.
3.3. The job should run successfully this time. How many map tasks were needed
for this job? ________ How many reduce tasks? ___________
3.4. How long (in ms) did it take for all the mappers to run? _________________
3.5. How long (in ms) did it take for all the reducers to run? _________________
3.6. How many bytes did the mappers of this job process? ____________
3.7. How many bytes did the reducers output? ____________
Step 4: View the Output
4.1. Verify the index_output folder was created in HDFS:
# hadoop fs -ls index_output

You should see a single output file named part-r-00000:


4.2. View the contents of part-r-00000:
# hadoop fs -cat index_output/part-r-00000

4.3. What did the reducer use as the key for its output? ________________
4.4. What did the reducer use as the values for its output? _________________
Step 5: Run the Job Again
5.1. Run the IndexInverterJob again with the exact same command.
5.2. The job failed. Why? ____________________________________________
5.3. Delete the index_output folder in HDFS.

5.4. Run the job again, and it should run successfully this time.
Step 6: View the Resource Manager UI
6.1. Point your web browser to Ambari at http://node1:8080.
6.2. From the Dashboard page, select YARN from the left-hand menu.
6.3. Select the Configs tab on the YARN page.
6.4. Which node in your cluster is the Resource Manager running on? __________
6.5. Point your web browser to the Resource Manager UI, which is
http://node2:8088. You should see the All Applications page:

6.6. In the Cluster menu on the left side of the page, click on the various links like
About, Nodes, Applications, and Scheduler. Notice there is a lot of useful
information provided in this UI.

Step 7: Troubleshoot a Problem


7.1. From a previous lab, you should have a file named hbase.jar in /user/root in
HDFS:
# hadoop fs -ls hbase.jar
Found 1 items
-rw-r--r-3 root root

4708852 hbase.jar

If you do not have this file in HDFS, put it there. The file is found in your
/root/labs folder.


7.2. Run the IndexInverterJob using the following command (entered on a single
line):
# hadoop jar invertedindex.jar inverted.IndexInverterJob
hbase.jar index_output2

7.3. Notice exceptions are thrown, and eventually the job will fail. From the
output of the job, how many map tasks were launched? _________ How many
map tasks failed? ________ How many were killed? ________
7.4. The input file, hbase.jar, is split into 5 blocks in HDFS. Why did this
MapReduce job launch 10 map tasks? ___________________________________
Step 8: View the Log Files
8.1. Let's figure out what happened to the IndexInverterJob. View the Job History
page of node2 by pointing your browser to http://node2:19888:

Notice the most recent job at the top of the list has a status of FAILED.
8.2. Click on Job ID of the failed IndexInverterJob. You should see the details page
for this job:


8.3. Notice this page contains useful details about the job, including the average
map and reduce time, and how long it took to execute the entire job.
8.4. Also notice that 5 mappers and 1 reducer were started for this job. In the
screen shot above, 8 map tasks failed, 2 were killed, and 0 were successful. Notice
these numbers are links - click on your number of failed map tasks:

8.5. In the Logs column, click on the logs link of one of the failed map tasks to
view the corresponding log file:

8.6. What happened in this job? Why did the mapper fail? ___________________
RESULT: You have executed a MapReduce job that failed for several different reasons.
Being able to troubleshoot these types of issues is an important and handy skill for any
Hadoop administrator.
ANSWERS:
2.2: The job fails because the input file for the job does not exist in HDFS.
3.3: 1 mapper and 1 reducer, as found in the Job Counters section of the output.
3.4: Look for the counter Total time spent by all maps in occupied slots (ms)


3.5: Similarly, look for Total time spent by all reduces in occupied slots (ms)
3.6: Bytes Read=1126, as found in the File Input Format Counters section.
3.7: Bytes Written=2997, as found in the File Output Format Counters section.
5.2: The output folder of a MapReduce job cannot exist. You should have gotten the
following error message: FileAlreadyExistsException: Output directory
hdfs://node1:8020/user/root/index_output already exists
7.3: You should see 10 launched map tasks. The number of failed and killed tasks will
vary, but expect about 8 failed and 2 killed.
7.4: When a map task fails, the MapReduce framework launches the map task again. A
map task has to fail 2 times (by default) before the entire job fails. The input file was
split into 5 blocks, and each block generated a map task that failed 2 times, so 5x2=10.
8.6: A NullPointerException was thrown on line 32 of IndexInverterJob.java. Useful
information for the Java developer!


Unit 8: Job Schedulers


Topics covered:

Overview of Job Scheduling

The Built-in Schedulers

Overview of the Capacity Scheduler

Configuring the Capacity Scheduler

Defining Queues

Configuring Capacity Limits

Configuring User Limits

Configuring Permissions

Overview of the Fair Scheduler

Configuration of the Fair Scheduler

Multi-Tenancy Limits

Lab 8.1: Configuring the Capacity Scheduler


Overview of Job Scheduling

[Figure: (1) client applications submit jobs to the cluster; (2) a pluggable job scheduler
assigns resources for each job; (3) the jobs execute on the cluster within their assigned
resources.]

Overview of Job Scheduling


YARN jobs are submitted to the ResourceManager, which is responsible for scheduling
jobs. The ResourceManager has a pluggable Scheduler that allows you to configure
scheduling to meet the needs of your organization and your cluster.
The Scheduler is responsible for allocating cluster resources to a job. This unit discusses
how to configure this important aspect of your Hadoop cluster.

NOTE: The Scheduler of the ResourceManager is purely a scheduler; it only


schedules jobs. It is not responsible for monitoring the status or ensuring the
completion of jobs.


The Built-in Schedulers


HDP has two commonly-used built-in Schedulers:

Capacity Scheduler: Schedules jobs based on memory usage, using hierarchical
queues that you configure to meet your organization's requirements.

Fair Scheduler: Schedules jobs so that all jobs get, on average, an equal share of
cluster resources.

We recommend using the Capacity Scheduler, and HDP uses the Capacity Scheduler by
default. We will discuss both of these schedulers in this unit, but our focus will be on
configuring and managing the Capacity Scheduler.


Overview of the Capacity Scheduler

Distributes jobs to queues based on capacity

Capacity determined by percentage of available memory
The total of all of the queue percentages assigned has to add up to 100%

Can set individual user limits per queue

Extra capacity available is given to other queues evenly

(Figure: three queues shown with their configured capacity and their actual usage;
Queue1 configured for 40%, actual 50%; Queue2 configured for 35%, actual 30%;
Queue3 configured for 25%, actual 20%.)

Overview of the Capacity Scheduler


The Capacity Scheduler uses a hierarchical collection of queues. The parent of all queues
is named root, and by default the root queue is allocated 100% of the resources.
You can define as many queues as you like, and distribute the percentage of resources
however you like, as long as the percentages add up to 100. This design of the Capacity
Scheduler allows you to share the resources of your Hadoop cluster according to your
organization's needs and SLAs.
For example:

Queue 1 might represent the Marketing department, which gets 40% of the
resources because it paid for 40% of the cluster from its budget.

Queue 2 might represent the Sales department, and they get 35% of the
resources because of a company SLA.

Queue 3 might represent the Engineering department, and the remaining 25% is
allocated to them until another department comes along and needs to use the
cluster.


The Capacity Scheduler provides elastic resource scheduling, which means that if some
of the resources in the cluster are idle, then one queue can take up more of the cluster
capacity than was minimally allocated to it in the above configuration.
Let's now take a look at how to define and configure queues.


Configuring the Capacity Scheduler

Configuring the Capacity Scheduler


The queues of the Capacity Scheduler are defined and configured in the capacityscheduler.xml file found in the HADOOP_CONF folder. You can edit the XML file
directly, or use Ambari on the Services -> YARN -> Configs page (as shown in the
screenshot above).

Manual editing: If you edit the XML file directly, run the following command to
have the changes take effect:
# yarn rmadmin -refreshQueues


Ambari: If you configure the Capacity Scheduler using Ambari, you will need to
stop the YARN service, make your changes, and then start YARN again.


Defining Queues
yarn.scheduler.capacity.root.queues="Marketing,Sales,Engineering"

yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Sales.capacity=30
yarn.scheduler.capacity.root.Engineering.capacity=20

Defining Queues
To define a child queue of root, use a comma-separated list of queue names for the
yarn.scheduler.capacity.root.queues property. For example:
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>Marketing,Sales,Engineering</value>
</property>


A queue's properties are configured by adding the queue name to the specific property.
For example, the following allocates 50% of the total capacity to the Marketing queue
and 30% to the Sales queue:
<property>
<name>
yarn.scheduler.capacity.root.Marketing.capacity
</name>
<value>50</value>
</property>
<property>
<name>
yarn.scheduler.capacity.root.Sales.capacity
</name>
<value>30</value>
</property>

In Ambari, you set properties using an equals sign. For example:


yarn.scheduler.capacity.root.Engineering.capacity=20
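Putting the pieces together, the complete set of entries for this three-queue example
(a sketch that simply restates the values shown above in key=value form) would be:

yarn.scheduler.capacity.root.queues=Marketing,Sales,Engineering
yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Sales.capacity=30
yarn.scheduler.capacity.root.Engineering.capacity=20

Note that the capacities of the child queues of root still add up to 100.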


Configuring Capacity Limits


yarn.scheduler.capacity.root.queues="Marketing,Marketing-longrunning"
yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Marketing.maximum-capacity=80
yarn.scheduler.capacity.root.Marketing-longrunning.capacity=35
yarn.scheduler.capacity.root.Marketing-longrunning.maximum-capacity=35

Configuring Capacity Limits


With elastic resource scheduling, a queue can use more resources than it is initially
allocated if some of the resources in the cluster are idle. You can control the behavior of
this elasticity by configuring a queue's maximum capacity using the
yarn.scheduler.capacity.<queue-name>.maximum-capacity property.
For example, the following allows the Marketing queue to use up to 80% of the cluster's resources:
yarn.scheduler.capacity.root.Marketing.maximum-capacity=80


A good use case for maximum-capacity is applications that take a long time to run.
You may not want a long-running application to consume a lot of resources, while still providing
a large maximum for applications that you want to run quickly. You could set up
different queues for this behavior:
yarn.scheduler.capacity.root.queues="Marketing,Marketing-longrunning"
yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Marketing.maximum-capacity=80
yarn.scheduler.capacity.root.Marketing-longrunning.capacity=35
yarn.scheduler.capacity.root.Marketing-longrunning.maximum-capacity=35

Given the configuration above, answer the following questions:


1. What is the highest percentage of resources that an application submitted to the
Marketing queue can use? ______________
2. What is the highest percentage of resources that an application submitted to the
Marketing-longrunning queue can use? ______________


Configuring User Limits


You can configure a queue so that users of the queue are guaranteed a certain
minimum percentage of resources. For example, the following property assigns each
user a minimum of 20% of the resources available to the Sales queue:
yarn.scheduler.capacity.root.Sales.minimum-user-limit-percent=20

The maximum user limit is based on the number of users that have actually submitted
jobs at any given time. For example, if two users have submitted jobs, each gets a maximum
of 50%; with three users, each gets a maximum of 33%, and so on.
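As a concrete sketch that combines values already used in this unit (the full root.Sales
queue path is assumed here), the following pair of settings gives the Sales queue 30% of
the cluster while guaranteeing each active user of that queue at least 20% of the queue's
resources:

yarn.scheduler.capacity.root.Sales.capacity=30
yarn.scheduler.capacity.root.Sales.minimum-user-limit-percent=20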
Suppose the Sales queue is configured with a user minimum of 20%. Answer the
following questions:
1. If one user submits two jobs to the Sales queue, then each job will get between
__________ and _________ percent of resource.
2. If 3 different users have submitted one job each to the Sales queue, then each
user will get between _______ and _________ percent of resources.


Configuring Permissions
yarn.scheduler.capacity.root.Engineering.acl_submit_applications="developer,admin,George,Tom"
yarn.scheduler.capacity.root.Engineering.acl_administer_queue="admin,Tom"

Configuring Permissions
Each queue can define an Access Control List (ACL) that authorizes which users and groups
can submit jobs to the queue. For example:
yarn.scheduler.capacity.root.Engineering.acl_submit_applications="developer,admin,George,Tom"

There is also a property for configuring the users and/or groups who can administer a queue:
yarn.scheduler.capacity.root.Engineering.acl_administer_queue="admin,Tom"

NOTE: The acl_submit_applications property includes users and groups


from any parent queue.
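Because ACLs are inherited from the parent queue, a common pattern (shown here only as
a sketch, following the quoted value style of the examples above) is to lock down the root
queue with a single-space value, meaning no one, so that access is granted solely by each
child queue's own ACL:

yarn.scheduler.capacity.root.acl_submit_applications=" "
yarn.scheduler.capacity.root.Engineering.acl_submit_applications="developer,admin,George,Tom"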


Overview of the Fair Scheduler


Fair scheduling is a method of assigning resources to applications such that all apps get,
on average, an equal share of resources over time. The Fair Scheduler is enabled using
the following property in yarn-site.xml:
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

Some of the features of the Fair Scheduler include:

All applications get, on average, an equal share of resources (memory) over time.

Queues can be defined so that resources are shared fairly between these
queues.

Different fairness algorithms can be used, including FIFO and Dominant Resource
Fairness (which uses an algorithm combining memory usage with CPU usage).


Configuration of the Fair Scheduler


Queues are defined in an allocation file specified by the
yarn.scheduler.fair.allocation.file property.
The file looks like:
<?xml version="1.0"?>
<allocations>
  <queue name="myqueue">
    <minResources>10000 mb,0vcores</minResources>
    <maxResources>80000 mb,0vcores</maxResources>
    <maxRunningApps>25</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <queue name="my_subqueue">
      <minResources>5000 mb,0vcores</minResources>
    </queue>
  </queue>
  <user name="Tom">
    <maxRunningApps>10</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
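The location of the allocation file is itself configured in yarn-site.xml through the
yarn.scheduler.fair.allocation.file property mentioned above; for example (the path here
is only an assumption for illustration):

<property>
<name>yarn.scheduler.fair.allocation.file</name>
<value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>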

For details of all configuration options of the Fair Scheduler, view the documentation at
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.


Unit 8 Review
1. What are the two built-in schedulers in Hadoop?
2. Which scheduler does Hortonworks recommend you use in HDP 2.0?
3. When using a Capacity Scheduler, all queues are children of the ___________
queue.
Suppose you have the following properties configured:
yarn.scheduler.capacity.root.queues=A,B
yarn.scheduler.capacity.root.A.capacity=80
yarn.scheduler.capacity.root.B.capacity=20
yarn.scheduler.capacity.root.B.maximum-capacity=100

4. How many queues does this cluster have?

5. What does setting A's capacity to 80 mean?
6. Is it possible that a job submitted to queue B can use 100% of the cluster's
resources?
7. Is it possible that a job submitted to queue A can use 100% of the cluster's
resources?


Lab 8.1: Configuring the Capacity Scheduler


Objective: Learn how to define and configure queues for the Capacity
Scheduler.
Successful Outcome: You will have two new queues configured.
Before You Begin: Your cluster should be up and running.

Step 1: View the Status of the Capacity Scheduler


1.1. Point your web browser to http://node2:8088 to view the UI of the Resource
Manager.
1.2. In the menu on the left side, click on the Scheduler link to view the current
status of the Capacity Scheduler:

1.3. Notice there is one child queue of root defined. What is the name of the
queue? _________________
1.4. Click on the arrow to the left of default to expand and view the settings of
the default queue:


1.5. Notice that since there are no jobs running on your cluster, the status page simply
shows that 0.0% of the default queue is being used right now.
Step 2: View the Settings of the Capacity Scheduler
2.1. Go to the Ambari Dashboard page.
2.2. Click on the YARN link in the list of services, then click on the Configs tab and
scroll down to the Scheduler section:

2.3. Notice this is where you configure the settings for the scheduler of the
Resource Manager. Which type of scheduler is currently being used?
________________________________________
Step 3: Stop YARN
3.1. You cannot configure the Capacity Scheduler using Ambari while the Resource
Manager and Node Manager services are running. While on the YARN Services
page, click the Stop button in the upper-right corner of the page, then OK to
confirm. Wait for the YARN service to stop:


Step 4: Define Custom Queues


4.1. From the Configs tab of the YARN Services page, add two new queues named
A and B to the Capacity Scheduler (and leave the default queue defined also):
yarn.scheduler.capacity.root.queues=default,A,B

4.2. Assign the following capacities to the three queues:


default capacity = 20%
A capacity = 50%
B capacity = 30%
4.3. Assign the following maximum capacities to the three queues:
default maximum capacity = 100%
A maximum capacity = 70%
B maximum capacity = 50%
4.4. Save your changes by clicking the Save button at the bottom of the page. You
should see the following confirmation dialog:

4.5. Click OK to close the dialog window.


Step 5: Start YARN
5.1. Go back to the Summary page of YARN and click the Start button to start the
YARN service again.

Step 6: Verify the Changes


6.1. Go back to the Scheduler web page at http://node2:8088/cluster/scheduler.
6.2. You should see the A and B queues now:

6.3. Expand the A queue and verify its capacity is 50% and its maximum capacity is
70%:

6.4. Similarly, verify B and default are configured correctly.


6.5. Leave this web page open - you are going to view it in the next step.
Step 7: Submit a Job to a Specific Queue
7.1. Let's submit a MapReduce job to each new queue. Open two terminal windows,
SSH into node1 in each, and change directories to the /root/labs folder:
[root@node1 ~]# cd ~/labs

7.2. Make sure the JAVA_HOME environment variable is defined by entering the
following command:
# export JAVA_HOME=/usr/jdk64/jdk1.6.0_31/

7.3. Make sure you have a file in /user/root in HDFS named hbase.jar. If not, put
the HBase jar from the labs folder into HDFS, giving it the name hbase.jar.
7.4. In the first window, submit the test1.pig script to queue A by running the
following command (all on a single line):
# pig -Dmapreduce.job.queuename=A test1.pig 1>pig1.out 2>pig1.err &


7.5. While test1.pig is running, submit the test2.pig script to queue B in the other
terminal window:
# pig -Dmapreduce.job.queuename=B test2.pig 1>pig2.out 2>pig2.err &

7.6. While both jobs are running, refresh the Scheduler status page. (It may take a
minute for both jobs to run long enough to show up in the queues, so refresh the
page often until they do):

7.7. You should see resources being used in both the A and B queues.

RESULT: You just defined two queues for the Capacity Scheduler, configured specific
capacities for each queue, and submitted a job to each queue.

ANSWERS:
1.3: default
2.3: The Capacity Scheduler


Unit 9: Enterprise Data Movement


Topics covered:

Enterprise Data Movement

Challenges with a Traditional ETL Platform

Hadoop Based ETL Platform

Data Ingestion

Hadoop: Reducing Business Latency

Defining Data Layers

Distributed Copy (distcp) Command

distcp Options

Considerations for distcp

Using distcp for Backups

Lab 9.1: Use distcp to Copy Data from a Remote Cluster


Enterprise Data Movement


Organizations are continually looking for ways to reduce and manage costs. At the same
time, data centers are storing more and more data and for longer periods of time than
they ever have in the past. Compared to SAN, HDFS is a very cost effective way of
managing mass amounts of data. HDFS is a file system; therefore any type of data can
be stored in Hadoop.
Similar to any Data Warehouse or Enterprise Data Store (EDS), data architecture is a
critical component within the Hadoop cluster. As organizations move both their current
and historical data from the traditional data warehouses into Hadoop, many factors
must be considered and evaluated and a data strategy must be put into place.
The first step in that process is for organizations to evaluate their current system and
decide their goals for working with the data. Some common scenarios include:

Keeping the current data in the EDW and archiving historical data to Hadoop.

Loading the new semi-structured and unstructured data into Hadoop.

Building data layers and performing data analysis within Hadoop.


Using a hybrid approach incorporating some data layers in Hadoop and the
speed layer in HBase (more discussion on this later in this unit).

Using Hadoop to aggregate and filter the data, then loading the results into
an EDW and/or datamarts.

Storing the data in Hadoop and using high-speed connectors from Teradata,
Oracle, SQL Server, etc., so that the analytics in the EDW can read the data from
the Hadoop cluster.

As the data strategies evolve they create more data movement between the enterprise
data systems. Lifecycle data management has always been a central part of enterprise
data platforms, and the Hadoop and HBase clusters now become a part of that lifecycle.
Falcon is the framework used in Hadoop 2 for lifecycle data management.


Challenges with a Traditional ETL Platform


Data discarded due to cost and/or performance

Incapable of, or highly complex when, dealing with loosely structured data

No visibility into transactional data

EDW used as an ETL tool with 100s of transient staging tables

A lot of time spent understanding source data and defining destination data structures

Doesn't scale linearly

High license costs

High latency between data generation and availability

Challenges with a Traditional ETL Platform


Traditional ETL platforms are not designed to handle big data. The traditional ETL
platforms have the following challenges when trying to work with big data:

They do not work well with semi-structured or unstructured data.

Scalability is very expensive with vendor proprietary solutions and SAN storage.

High cost of storage leads to eliminating an enormous amount of detailed data.

Traditional systems have high licensing costs.

There is very high business latency between data hitting the disks and being able
to make business decisions using the data.


Hadoop Based ETL Platform


Provides data for use with minimum delay and latency

Enables real-time capture of source data

Support for any type of data: structured/unstructured

Store raw transactional data

Store 7+ years of data with no archiving

Data lineage: store intermediate stages of data

Becomes a powerful analytics platform

Linearly scalable on commodity hardware

Data warehouse can focus less on storage & transformation and more on analytics

Massively parallel storage and compute

Hadoop Based ETL Platform


The weaknesses of the traditional platforms are the strengths of the Hadoop platforms.
Traditional ETL platforms are not designed to handle big data.
Hadoop platforms:

Easily work with semi-structured or unstructured data.

Offer relatively inexpensive storage.

Use free open source software and commodity hardware.

Hadoop greatly reduces the business latency between data hitting the disks and being
able to make business decisions using the data.


Data Ingestion
Data Sources/Transports

(Figure: data sources - web logs and clicks; social, graph and feed data; sensors, devices
and RFID; spatial and GPS data; docs, text and XML; audio, video and images; DB data;
events and other data - are extracted and loaded into the Big Data Refinery using
WebHDFS, Sqoop, Flume and 3rd-party tools.)

Data Ingestion
Data ingestion is one of the key components of any data warehouse, enterprise data
store or Hadoop cluster. It is a major effort to design data ingestion strategies for any
enterprise data store. Hadoop Data Refineries and Data Lakes take data ingestion to an
entirely new level of volume, speed and types of data.
Extraction, Transformation and Loading (ETL) has been a standard method for moving
data into enterprise data stores. The reason for the transformation before loading is
that the cost of SAN storage has required data to be transformed by aggregating and
filtering data to reduce the amount of data that will be loaded into an enterprise data
store.
With Extraction, Loading and Transformation (ELT) the data is loaded into Hadoop to a
layer known as the source of truth. This is the raw data. Since Hadoop can store data
much more cost effectively, all of the detailed data gets loaded into Hadoop. The data is
then transformed into different data layers.


Hadoop: Reducing Business Latency


(Figure: data sources such as web logs and clicks, DB data, docs, text and XML, social
graphs and feeds, sensors, devices and RFID, spatial and GPS data, audio, video, images
and other events flow from the systems powering Business Transactions & Interactions
(web, mobile, CRM, ERP, SCM) into the Big Data Refinery. Classic ETL processing (step 1)
feeds Business Intelligence & Analytics; the refinery stores, aggregates, and transforms
multi-structured data to unlock value (step 2); refined data and runtime models are shared
back with the other systems; historical data is retained to unlock additional value (step 5),
feeding dashboards, reports and visualization.)
Hadoop: Data Movement


There are three broad areas of data processing:
1. Business Transactions & Interactions: relational databases.
2. Business Intelligence & Analytics: data warehouses and enterprise data stores.
3. Big Data Refinery: the Hadoop cluster.
Enterprise IT has been connecting systems via classic ETL processing, as illustrated in
Step 1 above, for many years in order to deliver structured and repeatable analysis. In
this step, the business determines the questions to ask and IT collects and structures the
data needed to answer those questions.
The Big Data Refinery is a new system capable of storing, aggregating, and
transforming a wide range of multi-structured raw data sources into usable formats that
help fuel new insights for the business.
platform for unlocking the potential value within data and discovering the business
questions worth answering with this data. A popular example of big data refining is
processing Web logs, clickstreams, social interactions, social feeds, and other user
generated data sources into more accurate assessments of customer churn or more
effective creation of personalized offers.

More interestingly, there are businesses deriving value from processing large video,
audio, and image files. Retail stores, for example, are leveraging in-store video feeds to
help them better understand how customers navigate the aisles as they find and
purchase products. Retailers that provide optimized shopping paths and intelligent
product placement within their stores are able to drive more revenue for the business.
In this case, while the video files may be big in size, the refined output of the analysis is
typically small in size but potentially big in value.
The Big Data Refinery platform provides fertile ground for new types of tools and data
processing workloads to emerge in support of rich multi-level data refinement solutions.
With that as a backdrop, Step 3 takes the model further by showing how the Big Data
Refinery interacts with the systems powering Business Transactions & Interactions and
Business Intelligence & Analytics. Interacting in this way opens up the ability for
businesses to get a richer and more informed 360-degree view of customers, for example.
By directly integrating the Big Data Refinery with existing Business Intelligence &
Analytics solutions that contain much of the transactional information for the business,
companies can enhance their ability to more accurately understand the customer
behaviors that lead to the transactions.
Moreover, systems focused on Business Transactions & Interactions can also benefit
from connecting with the Big Data Refinery. Complex analytics and calculations of key
parameters can be performed in the refinery and flow downstream to fuel runtime
models powering business applications with the goal of more accurately targeting
customers with the best and most relevant offers, for example.
Since the Big Data Refinery is great at retaining large volumes of data for long periods of
time, the model is completed with the feedback loops illustrated in Steps 4 and 5.
Retaining the past 10 years of historical Black Friday retail data, for example, can
benefit the business, especially if it is blended with other data sources such as 10 years
of weather data accessed from a third-party data provider. The opportunities for
creating value from multi-structured data sources available inside and outside the
enterprise are virtually endless if you have a platform that can do it cost effectively and
at scale.


Defining Data Layers


There are lots of different ways of organizing data in an enterprise data
platform that includes Hadoop.

Organize data based on source/derived relationships

Allows for a fault and rebuild process

(Figure: the data layers, from bottom to top. Batch Layer: Extract & Load. Serving Layer:
Standardize, Cleanse, Integrate, Filter, Transform. Speed Layer: Conform, Summarize, Access.)

Defining Data Layers


There are multiple ways of organizing data in an Enterprise Data Warehouse and the
same goes for Hadoop.
One way is the Lambda Architecture, which defines different data layers. A Hadoop
cluster can work by itself or be integrated with HBase and other EDWs and ODSs to build
different data layers that meet the data needs of an organization.
The process of building different data layers is a familiar concept within data
warehousing and analytics. The data layers are built in a Hadoop cluster for the same
reason they have been built in data warehouses for the last 30 years: to facilitate
speed. There are 3 data layers:

Batch Layer: Immutable master data set (the source of truth). Used to create the
pre-computed views for the serving layer.

Serving Layer: Contains pre-computed views.

Speed Layer: Contains additional levels of pre-computed views, structures and


indexes to reduce the latency that exists in the serving layer.


Distributed Copy (distcp) Command


The distcp command makes it easy to copy large volumes of HDFS data in parallel.
Distcp uses the MapReduce framework to support copying files or directories
recursively.
Usage: hadoop distcp [OPTIONS] <SourceURL1> ... <SourceURLn> <DestinationURL>

For example, perform a copy between two Hadoop clusters running the same version of
Hadoop:
$ hadoop distcp hdfs://<SourceURL>:8020/input/data1
hdfs://<DestinationURL>:8020/input/data1

To perform a copy between two Hadoop clusters running a different version of Hadoop,
the older cluster uses the hftp protocol, and the 2.x cluster uses the hdfs protocol:
$ hadoop distcp hftp://<SourceURL>:50070/input/data2
hdfs://<DestinationURL>:8020/input/data2

Perform a copy within the same cluster:



$ hadoop distcp /input/data1 /input/data2 /input/data3

Use the -f option to supply a file containing a list of source paths:

$ hadoop distcp -f /srcfile /input/data3

If a map fails and -i is NOT specified, all the files in the split, not only those that failed,
will be recopied. It also changes the semantics for generating destination paths, so users
should use this option carefully.
The -i flag means ignore failures. This option will keep more accurate statistics about the
copy than the default case. It also preserves logs from failed copies, which can be
valuable for debugging. A failing map will not cause the job to fail before all splits are
attempted.


Options for the distcp command


To display the help text for distcp, type the following command:
$ hadoop distcp

Controllable distcp options:


Defining the log directory location
The maximum number of files or amount of data to copy
SSL configuration for the mappers

Distcp Options
The following list contains things to take into consideration when using the distcp command:

Source paths need to be absolute.

If the destination directory does not exist, it will be created.

The -update option is used to make sure only files that have changed are
copied. Using a checksum (CRC32), it verifies whether the destination file sizes are
different. The -skipcrccheck option can be used to disable the checksum.

Distcp will skip files that already exist in the destination path. Use the
-overwrite option to make sure existing files are overwritten. File sizes are not checked.

The -delete option can be used to delete any files in the destination that are not in
the source.

Use the hftp file system on the source if there are different versions between the
source and destination HDFS clusters.


# hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with hftps://
 -overwrite             Choose to overwrite target files unconditionally, even if they exist.
 -p <arg>               preserve status (rbugp)(replication, block-size, user, group, permission)
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n bytes
 -skipcrccheck          Whether to skip CRC checks between source and target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic commit
 -update                Update target, copying only missing files or directories

There are two strategy options: static (the default) and dynamic. When static is used,
mappers are balanced based on the total size of files copied by each map. The dynamic
approach splits files into chunks and map tasks process a chunk at a time, allowing
faster mappers to consume more file paths than slower ones and thereby speeding up
the overall distcp job.
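For instance (source and destination URLs are placeholders, as in the earlier examples),
the copy strategy and the number of maps can be combined on one command line:

$ hadoop distcp -strategy dynamic -m 20 hdfs://<SourceURL>:8020/input/data1 hdfs://<DestinationURL>:8020/input/data1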


Considerations for distcp


The -m option can be used to select the number of mappers. If there is a large
amount of data, there may be a need to limit the number of mappers.
Minimizing the number of mappers will have each mapper copy more data.
Note: This may be more efficient if there are a lot of mappers or the system is
constrained for resources.

Using distcp
The distcp command starts up containers for running the mappers and generates
I/O based on the volume of data to be copied. Take the resource utilization and the
IOPS generated into account, and schedule large distcp jobs during appropriate times.

Distcp also consumes container resources on the destination cluster, which may
be the same or a different cluster.

Copying data between the two clusters will also generate network traffic
between the data nodes for each cluster. Make sure network resources are not
exceeded between the two clusters.


Using the hdfs:// schema for the source and destination requires the clusters be running
the same version of software. Other protocols that can be used include:

webhdfs://

hftp://

Best practice is to validate the copy between the source and the destination.
Use hadoop fs -ls <path> to confirm ownership, permissions and files.
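As a sketch (hostnames and paths assumed), a copy using the webhdfs:// protocol listed
above, followed by a quick validation of the destination, might look like:

$ hadoop distcp webhdfs://<SourceURL>:50070/input/data2 hdfs://<DestinationURL>:8020/input/data2
$ hadoop fs -ls -R hdfs://<DestinationURL>:8020/input/data2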


Using distcp for Backups


When using HCatalog or Hive metadata, you must also move the SQL files containing
that information.
Using the Lambda architecture as an example, a Hadoop cluster will have multiple data
layers. It may be resource, time or cost prohibitive to back up the entire Hadoop
cluster. In such cases, because the other data layers can be rebuilt from the source, you
can back up only the raw data layer.
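A minimal sketch of such a backup (the cluster host names and paths are assumptions for
illustration), using the -update option described earlier so that repeated runs copy only
changed files:

$ hadoop distcp -update hdfs://prod-nn:8020/data/raw hdfs://backup-nn:8020/data/raw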


Unit 9 Review Questions


1. _____________ is one of the key components of any data warehouse,
enterprise data store or Hadoop cluster.
2. Data is inherently ___________ and _____________.
3. _____________, is a new system capable of storing, aggregating, and
transforming a wide range of multi-structured raw data sources into usable
formats that help fuel new insights for the business.
4. The Lambda Architecture includes what three layers __________________,
__________________ & __________________.
5. When running distcp for two Hadoop clusters running a different version of
Hadoop the source prefix needs to be ___________.
6. The update option checks ____________________ and __________ to see if a
file has changed.


Lab 9.1: Use distcp to Copy Data from a Remote Cluster

Objective: To become familiar with how to copy data from one cluster
to another.
Successful Outcome: Data from a remote cluster is copied to your own cluster.
Before You Begin: For this exercise use node1 as your Remote-Cluster.

Step 1: Access Remote-Cluster


1.1. Verify you can reach the NameNode of the Remote-Cluster by executing the ls
command.
$ hadoop fs -ls hdfs://node1:8020/

You should see the / folder of the Remote-Cluster.


Step 2: View the Remote Folder
2.1. Use the ls command and view the contents of the remote directory that you
are going to copy:
$ hadoop fs -ls hdfs://node1:8020/user/root

Step 3: Copy the Data from Remote-Cluster


3.1. Execute the following command to copy a remote file into distcp_target:
$ hadoop distcp
hdfs://Remote-Cluster:8020/user/root/test_data
distcp_target

3.2. View the contents of distcp_target and verify test_data file copied over to
your cluster:

$ hadoop fs -ls -R distcp_target

3.3. Copy one or many directories/files into distcp_target:


$ hadoop distcp
hdfs://node1:8020/user/root/wordcount
hdfs://node1:8020/user/root/constitution.txt
distcp_target

3.4. View the contents of distcp_target and verify the wordcount &
constitution.txt file copied over to your cluster again.
Step 4: Copy only new/updated files and directories using the -update option.
4.1. Check the timestamps of the files in the /user/root/wordcount directory. Delete the
part-r-00000 file from the wordcount directory.
4.2. Now run following command with -update option.
$ hadoop distcp -update
hdfs://node1:8020/user/root/wordcount
distcp_target/wordcount

4.3. View the contents of distcp_target and compare the timestamps of all the files.
You can see that the timestamp changed only for the part-r-00000 file and the wordcount
folder.
Step 5: Copy data from a Remote-Cluster running different version of Hadoop.
5.1. Execute the following command to copy a remote file into distcp_target.
$ hadoop distcp
hftp://node1:50070/user/root/hbase.jar
distcp_target

5.2. View the contents of distcp_target and verify the hbase.jar file copied over to
your cluster.

RESULT: You have learned the steps to copy data from one cluster to another.


Unit 10: HDFS Web Services


Topics covered:

What is WebHDFS ?

Setting up WebHDFS

Using WebHDFS

WebHDFS Authentication

Copying Files to HDFS

Hadoop HDFS over HTTP

Who Uses WebHCat REST API?

Running WebHCat

Using WebHCat

Lab 10.1: Using WebHDFS


What is WebHDFS?
Hadoop contains native libraries for accessing HDFS from the Hadoop cluster. WebHDFS
provides a full set of HTTP REST APIs to access Hadoop remotely. HDFS commands can
be run from a platform that does not contain Hadoop software.
REST (Representational State Transfer) uses well known HTTP verb commands GET,
POST, PUT and DELETE to perform operations. REST:

Uses HTTP to execute calls between different machines.

Is platform independent, language independent and can be used with firewalls.

Uses URIs (Uniform Resource Identifiers), which identify a web resource using text.

Is a lightweight alternative to using SOAP (Simple Object Access Protocol - XML


based protocol) and RPC (remote procedure calls) to access web resources.


WebHDFS uses REST APIs to perform HDFS user operations including reading files,
writing to files, making directories, changing permissions and renaming. WebHDFS can
be used to copy data between different versions of HDFS.
WebHDFS is built in to HDFS. It runs inside the NameNodes and DataNodes and can
therefore use all HDFS functionality. Because it is a part of HDFS, there are no additional
servers to install. WebHDFS can also be used through a proxy service (HttpFS). In most
cases it uses hdfs:// (the port is optional).
WebHDFS supports the following:

HDFS read and write operations as well as HDFS parameters.

Kerberos (SPNEGO) and delegation tokens for authentication.


Setting up WebHDFS
WebHDFS should be enabled during the Ambari install by selecting the
enable WebHDFS checkbox.

Properties for WebHDFS                        Description
dfs.webhdfs.enabled                           Use to enable WebHDFS.
dfs.web.authentication.kerberos.principal     HTTP Kerberos principal.
dfs.web.authentication.kerberos.keytab        Kerberos keytab file.

Setting up WebHDFS
If manually setting the dfs.webhdfs.enabled property in the hdfs-site.xml file, HDFS
(NameNode and DataNodes) must be restarted for the changes to take effect.
hdfs-site.xml:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>


When using Kerberos to secure the cluster, look at the documentation to get all the details,
but here is a summary.
1. Create an HTTP service user principal:
kadmin: addprinc -randkey HTTP/$<Fully_Qualified_Domain_Name>@$<Realm_Name>.COM

2. Create keytab files for the HTTP principal:
kadmin: xst -norandkey -k /etc/security/spnego.service.keytab HTTP/$<Fully_Qualified_Domain_Name>

3. Verify that the keytab file and the principal are associated with the correct
service:
klist -k -t /etc/security/spnego.service.keytab

4. Add the following properties to the hdfs-site.xml file:


<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/$<Fully_Qualified_Domain_Name>@$<Realm_Name>.CO
M</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/spnego.service.keytab</value>
</property>


Using WebHDFS
The URL syntax to access the REST API of WebHDFS is:
http://hostname:port/webhdfs/v1/<PATH>?op=

To read a file named input/mydata:


$ curl -i -L "http://host:50070/webhdfs/v1/input/mydata?op=OPEN"

To list the contents of a directory named tdata:


$ curl -i "http://host:50070/webhdfs/v1/tdata/?op=LISTSTATUS"

To make a directory named myoutput:


$ curl -i -X PUT "http://host:50070/webhdfs/v1/myoutput?
op=MKDIRS&permission=744"

Using WebHDFS
The REST API uses the prefix "/webhdfs/v1" in the path and appends a query at the end.
HTTP URL format:
http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=...

WebHDFS File system URI :


webhdfs://<HOST>:<HTTP_PORT>/<PATH>

HDFS URI:
hdfs://<HOST>:<RPC_PORT>/<PATH>
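As a quick sketch (host and path assumed; 50070 is the default NameNode HTTP port), the
WebHDFS file system URI can also be used directly with the standard hadoop fs client:

$ hadoop fs -ls webhdfs://<HOST>:50070/input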

cURL and wget can be used to execute WebHDFS commands. cURL has been around a
long time in Unix and Linux environments. It is a popular command-line tool (and
library) because it supports so many protocols (HTTP, HTTPS, FTP, SCP, LDAP, TELNET,
POP3, SMTP, IMAP, and more).


wget is a free software package for retrieving files using HTTP, HTTPS and
FTP.
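For example (host and path assumed), the OPEN operation shown earlier could also be
issued with wget, which follows the redirect to the DataNode automatically:

$ wget "http://host:50070/webhdfs/v1/input/mydata?op=OPEN" -O mydata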

Additional examples:

List the status of a file (use the -v option to display the output in verbose mode to
get more details.)
$ curl -i
"http://host:port/webhdfs/v1/input/mydata?op=GETFILESTATUS"

To write a file into a /input/myfile2 file:


$ curl -v -i -X PUT -L
"http://$<Host_Name>:$<Port>/webhdfs/v1/input/myfile2?op=CREATE" -T mycoolfile


WebHDFS Authentication
Authentication can be controlled through the following commands:

Security off:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?[user.name=<USER>&
]op=..."

Security on and using Kerberos SPNEGO:


curl -i --negotiate -u :
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=..."

Security on with Hadoop delegation token:


curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?delegation=<TOKEN>&
op=..."


Proxy Users
A proxy user P can send a request on behalf of another user U. The username of U must be
specified in the doas query parameter, unless a delegation token is presented in authentication.
In that case, the information of both users P and U must be encoded in the delegation token.
Below is the syntax to use when:

Security is off:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?[user.name=<USER>&]
doas=<USER>&op=..."

When security is on (Kerberos SPNEGO):


curl -i --negotiate -u :
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?doas=<USER>&op=..."

When security is on (Hadoop delegation token):


curl -i
"http://<HOST>:PORT>/webhdfs/v1/<PATH>?delegation=<TOKEN>&o
p=..."


Copying Files to HDFS


Prepare a file for writing to HDFS
Syntax:
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>]
[&permission=<OCTAL>][&buffersize=<INT>]"

Write a file to HDFS


$ curl -v -X PUT "http://ec2-50-17-40-3.compute-1.amazonaws.com:50070/webhdfs/v1/input/mydata?op=CREATE&user.name=jdoe"
$ curl -v -X PUT -T mydata "http://ec2-50-17-40-3.compute-1.amazonaws.com:50075/webhdfs/v1/input/mydata?op=CREATE&user.name=jdoe"

Copying Files to HDFS


Below is an example of copying data into HDFS using WebHDFS. The two-step process is
a temporary workaround for a software library bug.
The output will contain the 307 response code with a server address and port number
for where to write the file.
$ curl -v -X PUT "http://localhost:50070/webhdfs/v1/input/mydata?op=CREATE"

Format for the 307 response code:
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0

$ curl -v -X PUT -T mydata "http://ec2-50-17-40-3.compute-1.amazonaws.com:50075/webhdfs/v1/input/mydata?op=CREATE&user.name=jdoe"
# Verify the file was created.
$ hadoop fs -ls /input


Additional WebHDFS Commands


WebHDFS operations will return a JSON response or a zero-length response. The only
exception is the OPEN command.

HTTP GET:    OPEN, GETFILESTATUS, LISTSTATUS, GETCONTENTSUMMARY, GETFILECHECKSUM, GETHOMEDIRECTORY, GETDELEGATIONTOKEN
HTTP PUT:    CREATE, MKDIRS, RENAME, SETREPLICATION, SETOWNER, SETPERMISSION, SETTIMES, RENEWDELEGATIONTOKEN, CANCELDELEGATIONTOKEN
HTTP POST:   APPEND
HTTP DELETE: DELETE

Additional WebHDFS Commands


Submit an HTTP GET request, automatically following redirects:
curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
[&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]

The request is redirected to a datanode where the file data is to be written:


HTTP/1.1 307 TEMPORARY_REDIRECT

Location:
http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0

Create and Write to a File:


curl -i -X PUT -T <LOCAL_FILE>
"http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...


Append to a File:
curl -i -X POST
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersi
ze=<INT>]"

The request is redirected to a datanode where the file data is to be appended:


HTTP/1.1 307 TEMPORARY_REDIRECT

Location:
http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
Content-Length: 0

Step 2: Submit request to append file:


curl -i -X POST -T <LOCAL_FILE>
"http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...

Make a Directory:
curl -i -X PUT
"http://<HOST>:<PORT>/<PATH>?op=MKDIRS[&permission=<OCTAL>]

Rename a File/Directory:
curl -i -X PUT
"<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PAT
H>

Delete a File/Directory:
curl -i -X DELETE
"http://<host>:<port>/webhdfs/v1/<path>?op=DELETE[&recursiv
e=<true|false>]

Get the Status of a File/Directory:


curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS


List a Directory:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS

Get Directory Content Summary:


curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMAR
Y

Get File Checksum:


curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM

Get Home Directory:


curl -i
"http://<HOST>:<PORT>/webhdfs/v1/?op=GETHOMEDIRECTORY

Set Permission:
curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION
[&permission=<OCTAL>]

Set Owner:
curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER
[&owner=<USER>][&group=<GROUP>]

Set Replication Factor:


curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETREPLICATION
[&replication=<SHORT>]

Set Access or Modification Time:


curl -i -X PUT


"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETTIMES
[&modificationtime=<TIME>][&accesstime=<TIME>]

Get Delegation Token:


curl -i
"http://<HOST>:<PORT>/webhdfs/v1/?op=GETDELEGATIONTOKEN&ren
ewer=<USER>

Renew Delegation Token:


curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/?op=RENEWDELEGATIONTOKEN&t
oken=<TOKEN>

Cancel Delegation Token:


curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/?op=CANCELDELEGATIONTOKEN&
token=<TOKEN>

HTTP Response Codes


Exception                        HTTP Response Code
IllegalArgumentException         400 Bad Request
UnsupportedOperationException    400 Bad Request
SecurityException                401 Unauthorized
IOException                      403 Forbidden
FileNotFoundException            404 Not Found
RuntimeException                 500 Internal Server Error


Hadoop HDFS over HTTP


HttpFS is a separate service from NameNode and must be configured. HttpFS is a Java
application that runs in Tomcat that comes with the HttpFS binary distribution.
Examples:
Create the HDFS /user/tom/bar directory:
$ curl -X POST http://httpfshost:14000/webhdfs/v1/user/tom/bar?op=mkdirs

Display the contents of the HDFS /user/jdoe directory:

$ curl http://<HTTPFSHOST>:14000/webhdfs/v1/user/jdoe?op=list


HttpFS is a full rewrite of Hadoop HDFS proxy. A key difference is HttpFS supports all file
system operations while Hadoop HDFS proxy supports only read operations.
HttpFS also supports:

Hadoop pseudo authentication

Kerberos SPNEGO authentication

Hadoop proxy users

Hadoop HDFS proxy did not support the above authentications.
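As a sketch (host and user assumed), the same WebHDFS-style operations shown earlier in
this unit can be issued against the HttpFS port, here using Hadoop pseudo authentication
via the user.name parameter:

$ curl -i "http://<HTTPFSHOST>:14000/webhdfs/v1/user/jdoe?op=LISTSTATUS&user.name=jdoe"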


Who Uses WebHCat REST API?


WebHCat is designed to connect services. WebHCat was initially called Templeton (the rat
in Charlotte's Web); therefore, you will still see references to Templeton in the
directories, etc.
WebHCat will look for files in the CLASSPATH and then in the TEMPLETON_HOME
environment variable. Key files with WebHCat:


Filename                    Description
webhcat_server.sh           Script used to start and stop the WebHCat server.
webhcat-default.xml         Default configuration variables. Never change this file; changes have no effect. The webhcat-default.xml file is in the WebHCat war file.
webhcat-site.xml            Modify and add variables here to customize WebHCat. The WebHCat server needs to be restarted after configuration changes.
webhcat-log4j.properties    Contains the location of the WebHCat log files.

The WebHCat configuration variables can be found in the HCatalog documentation.


WebHCat can be configured to use Kerberos. Here are a few examples of WebHCat
variables.

Configuration Variable        Description
templeton.port                The HTTP port for the WebHCat server (50111).
templeton.hadoop.config.dir   Path to the Hadoop configuration files: ${env.HADOOP_CONFIG_DIR}
templeton.jar                 Path to the WebHCat jar file: ${env.TEMPLETON_HOME}/share/webhcat/svr/webhcat-0.11.0.jar
templeton.streaming.jar       HDFS path to the Hadoop streaming jar file: hdfs:///user/templeton/hadoop-streaming.jar
templeton.hive.path           Path to the Hive executable: hive-0.11.0.tar.gz/hive-0.11.0/bin/hive
templeton.hive.properties     Properties to use when running Hive.
templeton.zookeeper.hosts     ZooKeeper servers, listed in comma-delimited order (host:port).


Accessing and Securing WebHCat Files


Parameters for making files accessible to WebHCat:

Variable                   Definition                                Default
templeton.pig.archive      Path to the Pig archive.                  hdfs:///apps/webhcat/pig.tar.gz
templeton.pig.path         Path to the Pig executable.               pig.tar.gz/pig/bin/pig
templeton.hive.archive     Path to the Hive archive.                 hdfs:///apps/webhcat/hive.tar.gz
templeton.hive.path        Path to the Hive executable.              hive.tar.gz/hive/bin/hive
templeton.streaming.jar    Path to the Hadoop streaming jar file.    hdfs:///apps/webhcat/hadoop-streaming.jar

Caching and Securing WebHCat Files


The paths shown above are configured in the /etc/hcatalog/conf/webhcat-site.xml file
on the node where WebHCat is installed.


Running WebHCat
Start the server:
$ /usr/lib/hcatalog/sbin/webhcat_server.sh start

Stop the server:


$ /usr/lib/hcatalog/sbin/webhcat_server.sh stop

WebHCat requirements include:


Zookeeper if using the ZooKeeper storage class
A secure cluster will require Kerberos keys and principals

Running WebHCat
Hadoop uses a LocalResource to keep Pig and Hive from having to be installed
everywhere on the cluster. The server will get a copy of the LocalResource when
needed.
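A quick way to confirm the server is up (the hostname is assumed; 50111 is the default
templeton.port listed earlier) is to call the status endpoint of the REST API:

$ curl -s 'http://hostname:50111/templeton/v1/status'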


Using WebHCat
The URL to access the REST API of WebHCat is:
http://hostname:port/templeton/v1/
Here is an example of running a MapReduce job:
# curl -s -d user.name=hadoop_user \
-d jar=wordcount.jar \
-d class=com.hortonworks.WordCount \
-d libjars=transform.jar \
-d arg=wordcount/input \
-d arg=wordcount/output \
'http://host:50111/templeton/v1/mapreduce/jar'

Using WebHCat
WebHCat can execute programs through the Knox Gateway. The URL for accessing the
REST API of WebHCat is: http://hostname:port/templeton/v1/.
Below is an example of WebHCat running a Java MapReduce job. This example
assumes the input and output directories have been set up, as well as the inode being
created for the file.
$ curl -v -i -k -u <USERID>:<PASSWORD> -X POST \
-d jar=/dev/my-examples.jar -d class=wordcount \
-d arg=/dev/input -d arg=/dev/output \
'https://127.0.0.1:8443/gateway/sample/templeton/api/v1
/mapreduce/jar'


Unit 10 Review
1. WebHDFS supports HDFS _____________ and _______________ operations.
2. The _________________ parameter needs to be set to true to enable WebHDFS.
3. WebHDFS can use ________________ and _________________ for
authentication.
4. HttpFS is a _____________________ from the NameNode and must be configured.


Lab 10.1: Using WebHDFS

Objective: To become familiar with the capabilities of WebHDFS and


how to use it.
Successful Outcome: You will have executed several WebHDFS file commands
successfully, including upload, append, list a directory, and
download.
Before You Begin: SSH into node1.

Step 1: List a Directory


1.1. Using curl, view the contents of the /user/root directory in HDFS:
# curl -i
"http://node1:50070/webhdfs/v1/user/root?op=LISTSTATUS"

1.2. You should see a 200 OK response, along with a JSON object containing the
files and directories in your /user/root folder:
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 14 Nov 2013 14:35:42 GMT
Date: Thu, 14 Nov 2013 14:35:42 GMT
Pragma: no-cache
Expires: Thu, 14 Nov 2013 14:35:42 GMT
Date: Thu, 14 Nov 2013 14:35:42 GMT
Pragma: no-cache
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":1732
1,"group":"hadoop","length":0,"modificationTime":1384408800


076,"owner":"root","pathSuffix":".Trash","permission":"700"
,"replication":0,"type":"DIRECTORY"},
{"accessTime":1384219125588,"blockSize":134217728,"children
Num":0,"fileId":17331,"group":"hadoop","length":861,"modifi
cationTime":1384219125967,"owner":"root","pathSuffix":"cons
titution.txt","permission":"644","replication":3,"type":"FI
LE"},
...
]}}

Step 2: Make a New Directory


2.1. Use WebHDFS to make a new subdirectory of /user/root in HDFS named
history:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/history?op=MKDIRS"

If you get an AccessControlException, you need to add the user.name property to


the URL:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/history?op=MKDIRS&
user.name=root"

2.2. Verify the history directory was created successfully:


# hadoop fs -ls

Step 3: Upload a File


3.1. The first step to uploading a file is to create a path on the NameNode. As part
of the REST contract, the NameNode will respond with a 307
TEMPORARY_REDIRECT to the actual DataNode that the blocks should go to:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/history/constituti
on.txt?op=CREATE&blocksize=1048576"

3.2. Use the temporary redirect URL that the NameNode provides in the response
above to submit the file to the DataNode. For example, the command shown here
puts the file onto node4, but you should copy-and-paste the URL from the
response of the previous step:


# curl -i -X PUT -T constitution.txt


"http://node4:50075/webhdfs/v1/user/root/history/constituti
on.txt?op=CREATE&namenoderpcaddress=node1:8020&blocksize=10
48576&overwrite=false&user.name=root"

3.3. Verify the file was uploaded successfully:


# hadoop fs -ls history
Found 1 items
-rwxr-xr-x
3 root hadoop

44841 history/constitution.txt

Step 4: Upload a Large File


4.1. In this step, you will upload a larger file, one that spans multiple blocks. Start
by changing directories to /root/data:
# cd ~/data

You should have a large file in data named test_data.


4.2. Ask the NameNode to create a file named big.txt in /user/root:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/big.txt?op=CREATE&
blocksize=1048576"

4.3. Using the URL provided by the previous command, upload test_data into
HDFS, and then verify the upload worked successfully.
Step 5: Append to an Existing File
5.1. Appending a file is similar to creating a file - it is a two-step process. Using
WebHDFS, append the local file constitution.txt to big.txt in HDFS.


Step 6: Retrieve a File


6.1. Use WebHDFS to retrieve the file constitution.txt.
6.2. Retrieve big.txt from the 1,000,000th byte offset and get 1048576 bytes
(1MB). Pipe the result to a local file named big_partial.txt.
RESULT: You have seen how to use WebHDFS to execute a variety of HDFS commands
over HTTP using RESTful web Services.

SOLUTION to 6.2:
curl -i -L
"http://node1:50070/webhdfs/v1/user/root/big.txt?op=OPEN&of
fset=1000000&length=1048576" > big_partial.txt


Unit 11: Hive Administration


Topics covered:

Introduction to Hive

Comparing Hive with RDBMS

Hive Components

Hive MetaStore

HiveServer2

Hive Command Line Interface

Processing Hive SQL Statements

Defining a Hive-Managed Table

Defining an External Table

Loading Data into Hive

Performing Queries

Guidelines for Architecting Hive Data

ORCFile Example

Hive Tables

Hive Query Optimizations

Hive/MR versus Hive/Tez

ORCFile Example

Compression

Hive Security

Lab 11.1: Understanding Hive Tables


Introduction to Hive
Hive queries are capable of data summarization, ad-hoc querying and analytics of large
volumes of data. Hive is scalable to 100PB+. Apache Hive is the gateway for business
intelligence and visualization tools integrated with Apache Hadoop. Hive supports
databases, tables, SQL language and other foundational constructs for analyzing data.
Hive takes the SQL code, processes it, and converts it into a MapReduce program. The
MapReduce program runs in the YARN framework and generates the results.
Additional Hive capabilities and features:

Allows queries, inserts and appends.

Does not allow updates or deletes.

Data can be separated into partitions and buckets.

Supports cubes, dimensions, and star schemas.


Comparing Hive with RDBMS


Remember, Hive is a data warehouse infrastructure on top of Hadoop. Hive uses
schema-on-read, and data can be stored in different formats such as text, sequence files
and columnar files.
Other comparisons include:

Views in Hive are logical query constructs.

Materialized views are not supported.

Indexes in Hive store their data in separate table constructs.

Bitmap indexes are supported.

User Defined Functions (UDFs) can be used to add additional functionality to


Hive queries.

Hive supports arrays, maps, structs and unions.

SerDes map JSON, XML and other formats natively into Hive.


Hive MetaStore
The Hive MetaStore contains all the metadata definitions for Hive tables
and partitions
The metastore can be local or remote

[Diagram: a local metastore (Driver → Metastore → RDBMS local datastore) and a remote metastore (Driver → Metastore → RDBMS remote datastore), each fronted by HiveServer2]

Hive MetaStore
The Hive metastore stores table definitions and related metadata information. Hive
uses an Object Relational Mapper (ORM) to access relational databases. Valid Hive
metastore databases are: MySQL, PostgreSQL, Oracle and Derby. An embedded
metastore is available, but it should only be used for unit testing.
Below is an example of setting up a local metastore using MySQL as the metastore
repository:

Property                                   Value
javax.jdo.option.ConnectionURL             jdbc:mysql://<HOSTNAME>/<DBNAME>?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName      com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName        <MYSQL_USER>
javax.jdo.option.ConnectionPassword        <MYSQL_PASSWORD>
hive.metastore.local                       true (this is a local store)
hive.metastore.warehouse.dir               <DEFINE_PATH_HIVETABLES>
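In hive-site.xml, the same local metastore settings would look like the sketch below; the hostname, database name and credentials are placeholders, exactly as in the table above:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://<HOSTNAME>/<DBNAME>?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value><MYSQL_USER></value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value><MYSQL_PASSWORD></value>
</property>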

With a remote metastore setup, a Hive client needs to connect to a metastore server
that then communicates to the remote datastore (RDBMS) using the Thrift protocol.
Thrift is an Interface Definition Language (IDL) that defines the specification for the
interface to a software component. Thrift uses Remote Procedure Calls (RPCs) for the
communication between two service endpoints.


HiveServer2
HiveServer2 is a server interface that allows JDBC/ODBC remote clients
to run queries and retrieve the results.

[Diagram: Hive SQL submitted from the CLI, JDBC/ODBC clients and the Web UI flows through HiveServer2, which uses the RDBMS-backed metastore and runs mappers and reducers on the DataNodes]

HiveServer2
HiveServer2 (HS2) is a gateway / JDBC / ODBC endpoint Hive clients can talk to. ODBC
allows Excel and just about any BI tool to use Hive to access Hadoop data.
Configuration parameters for the HiveServer2 are set in the hive-site.xml file.
HiveServer2 supports no authentication (Anonymous), Kerberos, LDAP and custom
authentication. Authentication mode is defined with the hive.server2.authentication
parameter (NONE, KERBEROS, LDAP and CUSTOM). NONE is the default value.
HiveServer2 executes a query as the user who submitted the query by default
(hive.server2.enable.doAs=true). If this parameter is set to false, the query runs as
the same user that the HiveServer2 process runs as.
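For example, the defaults described above would appear in hive-site.xml as follows:

<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>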
There are multiple ways to start the HiveServer2:
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/hive --service hiveserver2


Hive Command Line Interface


The $HIVE_HOME/bin path should be added to the PATH environment variable in Linux.
The hive --help option can provide a listing of hive options.
In the Hive CLI, the set and set -v options display the variables in the hiveconf, hivevar,
system and env namespaces, and all Hadoop properties, respectively.
Example:
hive> set;
hive> set -v;

The -e option can be used to execute a query from the Linux command line. The -S option
runs Hive in silent mode.
$ hive -S -e "select * FROM mycooltab" > /tmp/mytabout

Use the set command to find a property value.


$ hive -S -e "set" | grep auth


Run a script file containing SQL code.


$ hive -f /hscripts/myrockingquery.hql

Bash shell commands can be run from the Hive CLI.


hive> ! whoami;

Hadoop dfs commands can be run from the Hive CLI. Dfs commands can be run
without typing hadoop first.
hive> dfs -ls /user;

Beeline connects to a HiveServer2 instance; the older Hive CLI clients connect to a
HiveServer (HiveServer1) instance.
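For example, a minimal Beeline session might connect like this (node1 and the default HiveServer2 port 10000 are assumptions based on this course's lab environment):

$ beeline
beeline> !connect jdbc:hive2://node1:10000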


Processing Hive SQL Statements


[Diagram: 1. The client (CLI, JDBC/ODBC or Web UI) executes a SQL query; 2. Hive (compiler, optimizer and executor behind HiveServer2) parses and plans the query; 3. The query is converted to MapReduce; 4. The MapReduce job is run by Hadoop on the DataNodes]

Processing Hive SQL Statements


Hive is a system that provides a SQL interface and a relational model on top of Hadoop.
When you use Hive, you expose data within Hadoop as relational tables and you issue
SQL queries to query the data. The underlying data can be structured or unstructured.
Under the covers Hive converts these SQL queries to map/reduce jobs and submits
them to the Hadoop cluster. Using Hive lets you program SQL rather than Java map
reduce and lets you work at a much higher level of abstraction.
The driver handles the parsing, optimization, compilation, and execution. Hive does not
generate Java code for MapReduce. Hive uses Map and Reduce modules (like
interpreters) that are driven by the defined job execution. Hive will communicate with
the YARN Resource Manager for starting the MapReduce job.
Once the Hive query is converted into a MapReduce program, Hadoop uses parallel
processing and distributed storage to generate a result on a highly scalable and
highly available Hadoop infrastructure.


Hive Data Hierarchical Structures


Hive data hierarchical structures.
Databases: In Hive a database is a namespace that separates tables and data structures.
The default database directory is defined by the hive.metastore.warehouse.dir
parameter. A directory is defined for each database. The default location can be
overridden by specifying the path. Comments and additional properties can be added
to the database definition. The USE command defines the current database. All
following commands will execute on the objects in the current database. Database
properties can be modified with the ALTER DATABASE command.
Tables: Schema objects (abstract table definitions mapping relational definition to
underlying data).

Partitions: Can physically separate table data into separate data units.

Buckets (or Clusters): Data in each partition can be sub-partitioned based on a


hash function of some column of the Table.
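As a sketch of how partitions and buckets appear in DDL (the table and column names here are hypothetical):

CREATE TABLE visits_by_state (
  lname STRING,
  fname STRING
)
PARTITIONED BY (state STRING)
CLUSTERED BY (lname) INTO 8 BUCKETS;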


Database commands in Hive:


hive> CREATE DATABASE mydb;
hive> CREATE DATABASE mydb2 LOCATION '/user/george/mydbs';
hive> CREATE DATABASE mydb3 COMMENT 'My supercool db';
hive> CREATE DATABASE mydb4 WITH DBPROPERTIES ('creator' = 'George Trujillo', 'date' = '2013-11-02');
hive> SHOW DATABASES;
hive> DESCRIBE DATABASE mydb3;
hive> DESCRIBE DATABASE EXTENDED mydb4;
hive> USE mydb;
hive> CREATE TABLE mycooltab( id INT, name STRING);
hive> DESCRIBE mycooltab;
hive> DESCRIBE EXTENDED mycooltab;   -- can tell if table is INTERNAL or EXTERNAL
hive> SHOW TABLES;

A Hive CREATE TABLE command can create a Hive and HBase table as well as create a
Hive table that points to an existing HBase table. Hive tables can also point to other
NoSQL database tables.
CREATE EXTERNAL TABLE myhtab(id INT, name STRING) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

Hive has four built-in file formats:

Delimited Text: Excellent for sharing among Pig, Hive, and Linux tools (awk, perl,
python, etc.). Binary file formats are more efficient.

SequenceFile: Binary key-value pairs. Can be compressed at the BLOCK and
RECORD level. Supports splitting on blocks.

ORCFile

RCFile: Stores data in a record columnar format. Allows compression of
individual columns and fast analysis on columns.

Hive uses SerDes to read and write from tables. The SerDe determines the format in
which the records are serialized and deserialized. You can write your own custom SerDe,
or use one of the built-in ones which include:

Avro: Easily converts Avro schema and data types into Hive schema and data
types. Avro understands compression.

Regular Expression

ORC

JSON: JavaScript Object Notation is supported by Hive.

Thrift

NOTE: SerDe stands for serializer/deserializer. A SerDe is used when data


needs to be converted from one format (unstructured) to another format
(record structure).

NOTE: Accumulo is not part of the HDP distribution yet, but it is supported
by Hortonworks.


Hive Tables
Data stored in HDFS is schema-on-read, meaning Hive does not control the data
integrity when it is written. For Hive Managed tables, the table name is the name Hive
will assign to the directory in HDFS. For external tables, the files can be in any folder in
HDFS.
If you drop an external table, it will keep the data in its defined directory. With a Hive
Managed table, if you drop the table, then the data is deleted.
Multiple schemas can be connected to a single directory.


Defining a Hive-Managed Table


A Hive table allows you to add structure to your otherwise unstructured data in HDFS.
Use the CREATE TABLE command to define a Hive table, similar to creating a table in
SQL.
For example, the following HiveQL creates a new Hive-managed table named customer:
CREATE TABLE customer (
customerID INT,
firstName STRING,
lastName STRING,
birthday TIMESTAMP
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

The customer table has four columns.

ROW FORMAT is either DELIMITED or SERDE.

Hive supports the following data types: TINYINT, SMALLINT, INT, BIGINT,
BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, VARCHAR and DATE.

Hive also has four complex data types: ARRAY, MAP, STRUCT and UNIONTYPE.


Defining an External Table


The following CREATE statement creates an external table named salaries:
CREATE EXTERNAL TABLE salaries (
gender string,
age int,
salary double,
zip int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

An external table is just like a Hive-managed table, except that when the table is
dropped, Hive will not delete the underlying /apps/hive/warehouse/salaries folder.

Defining a Table LOCATION


Hive does not have to store the underlying data in /apps/hive/warehouse. Instead, the
files for a Hive table can be stored in a folder anywhere in HDFS by defining the
LOCATION clause. For example:
CREATE EXTERNAL TABLE salaries (
gender string,
age int,
salary double,
zip int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/train/salaries/';

In the table above, the table data for salaries will be whatever is in the
/user/train/salaries directory.

IMPORTANT: The sole difference in behavior between external tables and


Hive-managed tables is when they are dropped. If you drop a Hive-managed
table, then its underlying data is deleted from HDFS. If you drop an external
table, then its underlying data remains in HDFS (even if the LOCATION was
in /apps/hive/warehouse/).


Loading Data into Hive


LOAD DATA LOCAL INPATH '/tmp/customers.csv'
OVERWRITE INTO TABLE customers;
LOAD DATA INPATH '/user/train/customers.csv'
OVERWRITE INTO TABLE customers;
INSERT INTO birthdays
SELECT firstName, lastName, birthday
FROM customers
WHERE birthday IS NOT NULL;

Loading Data into Hive


The data for a Hive table resides in HDFS. To associate data with a table, use the LOAD
DATA command. The data does not actually get loaded into anything, but the data
does get moved:

For Hive-managed tables, the data is moved into a special Hive subfolder of
/apps/hive/warehouse.

For external tables, the data is moved to the folder specified by the LOCATION
clause in the table's definition.

The LOAD DATA command can load files from the local file system (using the LOCAL
qualifier) or files already in HDFS. For example, the following command loads a local file
into a table named customers:
LOAD DATA LOCAL INPATH '/tmp/customers.csv' OVERWRITE INTO
TABLE customers;

The OVERWRITE option deletes any existing data in the table and replaces it with
the new data. If you want to append data to the table's existing contents, simply
leave off the OVERWRITE keyword.


If the data is already in HDFS, then leave off the LOCAL keyword:
LOAD DATA INPATH '/user/train/customers.csv' OVERWRITE INTO
TABLE customers;

In either case above, the file customers.csv is moved either into a subfolder of
/apps/hive/warehouse in HDFS or into the table's LOCATION folder, and the contents of
customers.csv are now associated with the customers table.
You can also insert data into a Hive table that is the result of a query, which is a
common technique in Hive. An example of the syntax is below:
INSERT INTO birthdays SELECT firstName, lastName, birthday
FROM customers WHERE birthday IS NOT NULL;

The birthdays table will contain all customers whose birthday column is not null.


Performing Queries
Let's take a look at some sample queries to demonstrate what HiveQL looks like. The
following SELECT statement selects all records from the customers table:
SELECT * FROM customers;

You can use the familiar WHERE clause to specify which rows to select from a table:
FROM customers SELECT firstName, lastName, address, zip
WHERE orderID > 0 GROUP BY zip;

NOTE: The FROM clause in Hive can appear before or after the SELECT
clause.

One benefit of Hive is its ability to join data in a simple fashion. The JOIN command in
HiveQL is similar to its SQL counterpart. For example, the following statement performs
an inner join on two tables:
SELECT customers.*, orders.* FROM customers JOIN orders ON
(customers.customerID = orders.customerID);

To perform an outer join, use the OUTER keyword:


SELECT customers.*, orders.* FROM customers LEFT OUTER JOIN
orders ON (customers.customerID = orders.customerID);

In the SELECT above, a row will be returned for every customer, even those without
any orders.


Guidelines for Architecting Hive Data


Hadoop is very good at coordinated, sequential scans, but it does not have the concept
of random I/O. Traditional indexes are not very effective in Hive.

Sorting and skipping takes the place of indexing.

A key consideration in Hive is to minimize the amount of data that needs to be


shuffled during the shuffle/sort phase of the MapReduce job.

A best practice is to divide data among different files that can be pruned out,
which is accomplished by using partitions, buckets and skewed tables.

Sort data ahead of time. Sorting data ahead of time simplifies joins and skipping
becomes more effective.


Hive Query Optimizations

[Slide: sample EXPLAIN output showing a query plan with 4 stages and the details of each stage]

Hive Query Optimizations


Hive queries can be optimized. Hive's explain plan can be used to evaluate the
execution plan of a query: put EXPLAIN EXTENDED in front of your query (see the example after the section list below).
Sections:

Abstract syntax tree: You can usually ignore this.

Stage dependencies: Dependencies and # of stages.

Stage plans: Important info on how Hive is running the job.
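For example, a minimal explain plan request might look like the following; the sale table used here is the one defined in the ORCFile example later in this unit:

hive> EXPLAIN EXTENDED SELECT state, COUNT(*) FROM sale GROUP BY state;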


Hive/MR versus Hive/Tez


SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: with Hive on MapReduce, the query above runs as a chain of MapReduce jobs for the joins and the GROUP BY/COUNT/AVERAGE, writing intermediate results to HDFS between jobs; with Hive on Tez, the same query runs as a single job, and Tez avoids the unneeded writes to HDFS]

Hive/MR versus Hive/Tez


As you can see in the diagram above, a Hive query without Tez can consist of multiple
MapReduce jobs. Tez performs a Hive query in a single job, avoiding the intermediate
writes to disk that were a result of the multiple MapReduce jobs.


ORCFile Example
Table: sale

id     timestamp            productsk  storesk  amount  state
10000  2013-06-13T09:03:05  16775      670      $70.50  CA
10001  2013-06-13T09:03:05  10739      359      $52.99  IL
10002  2013-06-13T09:03:06  4671       606      $67.12  MA
10003  2013-06-13T09:03:08  7224       174      $96.85  CA
10004  2013-06-13T09:03:12  9354       123      $67.76  CA
10005  2013-06-13T09:03:18  1192       497      $25.73  IL

CREATE TABLE sale (
id int, timestamp timestamp,
productsk int, storesk int,
amount decimal, state string
) STORED AS orc;

ORCFile Example
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store
Hive data. File formats in Hive are specified at the table level using the STORED AS clause.
For example:
CREATE TABLE tablename (
...
) STORED AS ORC;

You can also modify the file format of an existing table:


ALTER TABLE tablename SET FILEFORMAT ORC;

You can also specify ORC as the default file format of new tables:
SET hive.default.fileformat=Orc

The ORC file format is a part of the Stinger Initiative to improve the performance of Hive
queries, and using ORC files can greatly improve the execution time of your Hive
queries.


Compression
Hive queries will usually become I/O bound before they become CPU bound. Reducing
the amount of data to be read by using compression can improve performance.
Different compression codecs include: Snappy, LZO, Gzip, BZip2, etc.
Get a listing of the compression codecs available in your environment. Compression
options can also be defined in the Hive CLI.
$ hive -e "set io.compression.codecs"
hive> set mapred.output.compression.type=BLOCK;
hive> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set hive.exec.compress.intermediate=true;
hive> set hive.exec.compress.output=true;

The intermediate data generated by the Mappers can be compressed. The


hive.exec.compress.intermediate property needs to be set to true.


Example:
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
</property>

The default compression codec is set with the mapred.map.output.compression.codec


in the $HADOOP_HOME/conf/mapred-site.xml file.
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Specify that the output of the Reducer(s) should be compressed with the
hive.exec.compress.output parameter.
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
</property>

Set the codec to use for the output.


<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>


Hive Security
Usernames can be defined when executing commands. You can specify user.name in a
GET :table command:
$ curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table/my_table?user.name=cole'

Or you can specify user.name in a POST :table command:


$ curl -s -d user.name=cole -d rename=myoldtable_2 'http://localhost:50111/templeton/v1/ddl/database/default/table/mycool_table'


Unit 11 Review
1. The Hive component for storing schema and metadata information is
___________________.

2. A(n) ______________________ table loosely couples the table schema to the


underlying data storage.

3. ___________________ is a server interface that allows JDBC/ODBC remote


clients to run queries and retrieve the results.

4. True or False: Tez improves the performance of any MapReduce job, not just
Hive queries.


Lab 11.1: Understanding Hive Tables

Objective: Understand how Hive table data is stored in HDFS.


Successful Outcome: You will have created a couple of tables in Hive and learned
how data gets associated with a Hive table.
Before you begin: SSH into node1.

Step 1: Review the Data


1.1. As root, change directories to the /root/labs/data folder:
# cd /root/labs/data

1.2. Notice there are 5 part-m-0000x files, which are the result of a MapReduce
job that formatted the data for use with Hive. View the contents of one of these
files:
# more part-m-00000

Notice the data consists of information about visitors to the White House,
including the name, date, person being visited, and a comment section.
Step 2: Define a Hive Table
2.1. In the data folder, there is a text file named wh_visits.hive. View its
contents. Notice it defines a Hive table named wh_visits with a schema that
matches the data in the part-m-0000x files:
# more wh_visits.hive
create table wh_visits (
lname string,
fname string,
time_of_arrival string,
appt_scheduled_time string,
meeting_location string,
info_comment string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' ;

NOTE: You cannot use comment or location as column names because


those are reserved Hive keywords.

2.2. Run the script with the following command:


# hive -f wh_visits.hive

2.3. If successful, you should see OK in the output along with the time it took to
run the query.
Step 3: Verify the Table Creation
3.1. Start the Hive Shell:
# hive
hive>

3.2. From the hive> prompt, enter the show tables command:
hive> show tables;

You should see wh_visits in the list of tables.


3.3. Use the describe command to view the details of wh_visits:
hive> describe wh_visits;
OK
lname                   string    None
fname                   string    None
time_of_arrival         string    None
appt_scheduled_time     string    None
meeting_location        string    None
info_comment            string    None

3.4. Try running a query (even though the table is empty):


select * from wh_visits;

The query should execute fine, but no result appears.


Step 4: View the Hive Folder Structure
4.1. Exit the Hive shell:
hive> exit;
[root@node1 data]#

4.2. View the contents of the Hive warehouse folder:


# hadoop fs -ls /apps/hive/warehouse
Found 1 items
drwxr-xr-x   - root hdfs          0 /apps/hive/warehouse/wh_visits

Notice there is a folder named wh_visits. When did this folder get created?
_________________________________________________________________
4.3. List the contents of the wh_visits folder:
# hadoop fs -ls /apps/hive/warehouse/wh_visits

Notice the folder is empty.


Step 5: Populate the Hive Table
5.1. Run the following command to put the local part-m-00000 file into the
wh_visits folder:
# hadoop fs -put part-m-00000
/apps/hive/warehouse/wh_visits

5.2. From the Hive shell, run the following query:


hive> select * from wh_visits;

This time, you should see a couple thousand rows of data. Notice that by simply
putting a file into the wh_visits folder, the table now contains data.
5.3. Notice no MapReduce job was executed to perform the select * query. Why
not? ___________________________________________________________
Step 6: Drop the Table


6.1. Run the following query, which drops the wh_visits table:
hive> drop table wh_visits;

6.2. Exit the Hive shell and view the contents of the Hive warehouse folder:
# hadoop fs -ls /apps/hive/warehouse/

Notice that not only has the part-m-00000 file been deleted, but also the
wh_visits folder no longer exists!
Step 7: Create the Table Again
7.1. Run wh_visits.hive again to recreate the wh_visits table:
# hive -f wh_visits.hive

Step 8: Use the Hive LOAD DATA Command


8.1. Create a new directory in HDFS named whitehouse:
# hadoop fs -mkdir whitehouse

8.2. Put all 5 part-m files into whitehouse:


# hadoop fs -put part-m-* whitehouse/

8.3. Verify the files are there:


# hadoop fs -ls whitehouse

8.4. From the Hive shell, run the following query:


hive> LOAD DATA INPATH '/user/root/whitehouse/' OVERWRITE
INTO TABLE wh_visits;

8.5. Verify you have now have data in the table:


hive> select * from wh_visits limit 10;

You should see ten rows of visitors, and no MapReduce job is needed.


8.6. Try the following query. Make sure the output looks like first names:
hive> select fname from wh_visits limit 20;

This time a MapReduce job executed. Why? ____________________________


Step 9: View the Folder Structure
9.1. View the contents of the wh_visits folder:
# hadoop fs -ls /apps/hive/warehouse/wh_visits

Notice the five part-m files are located in wh_visits.


9.2. Try viewing the contents of the whitehouse folder:
# hadoop fs -ls whitehouse

Notice the folder is empty. The LOAD DATA command moved the files from their
original HDFS folder into the Hive warehouse folder; it did not copy them.

IMPORTANT: Be careful when you drop a managed table in Hive. Make sure
you either have the data backed up somewhere else, or that you no longer
want the data.

Step 10: Count the Number of Rows in a Table


10.1. Enter the following Hive query, which outputs the number of rows in
wh_visits:
hive> select count(*) from wh_visits;

10.2. How many rows are currently in wh_visits? _____________


Step 11: Define an External Table
11.1. Drop the wh_visits table again:
hive> drop table wh_visits;

11.2. View the contents of external_table.hive in the /root/labs/data folder:



# more external_table.hive
create external table wh_visits (
lname string,
fname string,
time_of_arrival string,
appt_scheduled_time string,
meeting_location string,
info_comment string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/root/whitehouse/' ;

11.3. Create the whitehouse folder in HDFS again, and put the five part-m files
into whitehouse.
11.4. Verify that there is not a subfolder of /apps/hive/warehouse named
wh_visits.
11.5. Run the query in external_table.hive to create the wh_visits table:
# hive -f external_table.hive

11.6. Run a query on wh_visits to verify that the table does actually contain
records.
11.7. Drop wh_visits again, but this time notice that the files in the whitehouse
folder are not deleted.

RESULT: As you just verified, the data for external tables is not deleted when the
corresponding table is dropped. Aside from this behavior, managed tables and external
tables in Hive are essentially the same.


Unit 12: Transferring Data with Sqoop

Topics covered:

Overview of Sqoop

The Sqoop Import Tool

Importing a Table

Importing Specific Columns

Importing from a Query

The Sqoop Export Tool

Exporting a Table

Lab 12.1: Using Sqoop


Overview of Sqoop

[Diagram: 1. The client executes a sqoop command; 2. Sqoop executes the command as a MapReduce job on the Hadoop cluster, using map-only tasks; 3. Plugins provide connectivity to the various data sources (relational databases, enterprise data warehouses, document-based systems)]

Overview of Sqoop
Sqoop is a tool designed to transfer data between Hadoop and external structured
datastores like RDBMS and data warehouses. Using Sqoop, you can provision the data
from an external system into HDFS. Sqoop uses a connector-based architecture that
supports plugins that provide connectivity to additional external systems.
As you can see in the slide, Sqoop uses MapReduce to distribute its work across the
Hadoop cluster:
1. A Sqoop job gets executed using the sqoop command line.
2. Sqoop uses Map tasks (4 by default) to execute the command.
3. Plugins are used to communicate with the outside data source. The schema is
provided by the data source, and Sqoop generates and executes SQL statements
using JDBC or other connectors.

NOTE: Using MapReduce to perform Sqoop commands provides parallel operation


as well as fault tolerance.


HDP provides the following connectors for Sqoop:

Teradata

MySQL

Oracle JDBC connector

Netezza

A Sqoop connector for SQL Server is also available from Microsoft:

SQL Server R2 connector

The Sqoop Import Tool


With Sqoop, you can import data from a relational database system into HDFS:

The input to the import process is a database table.

Sqoop will read the table row-by-row into HDFS. The output of this import
process is a set of files containing a copy of the imported table.

The import process is performed in parallel. For this reason, the output will be in
multiple files.

These files may be delimited text files (for example, with commas or tabs
separating each field), or binary Avro or SequenceFiles containing serialized
record data.

The import command looks like:


sqoop import (generic-args) (import-args)


The import command has the following requirements:

Must specify a connect string using the --connect argument

Credentials can be included in the connect string, or supplied using the --username and
--password arguments

Must specify either a table to import using --table, or the result of a SQL query
using --query


Importing a Table
sqoop import
--connect jdbc:mysql://host/nyse
--table StockPrices
--target-dir /data/stockprice/
--as-textfile

Importing a Table
The following Sqoop command imports a database table named StockPrices into a
folder in HDFS named /data/stockprice:
sqoop import
--connect jdbc:mysql://host/nyse
--table StockPrices
--target-dir /data/stockprice/
--as-textfile

Based on the import command above:

The connect string in this example is for MySQL. The database name is nyse.

The --table argument is the name of the table in the NYSE database.

The --target-dir is where in HDFS the data will be imported.

The default number of map tasks for Sqoop is 4, so the result of this import will
be in 4 files.

The --as-textfile argument imports the data as plain text.


NOTE: You can use --as-avrodatafile to import the data to Avro files, and use
--as-sequencefile to import the data to sequence files.

Other useful import arguments include:

--columns: a comma-separated list of the columns in the table to import (as


opposed to importing all columns, which is the default behavior).

--fields-terminated-by: specify the delimiter. Sqoop uses a comma by default.

--append: the data is appended to an existing dataset in HDFS.

--split-by: the column used to determine how the data is split between mappers.
If you do not specify a split-by column, then the primary key column is used.

-m: the number of map tasks to use.

--query: use instead of --table, the imported data is the resulting records from
the given SQL query.

--compress: enables compression.

NOTE: The import command shown here looks like it is entered over multiple
lines, but you have to enter the entire Sqoop command on a single
command line.

REFERENCE: Visit http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html for a list


of all arguments available for the import command.

Importing Specific Columns


sqoop import
--connect jdbc:mysql://host/nyse
--table StockPrices
--columns StockSymbol,Volume,High,ClosingPrice
--target-dir /data/dailyhighs/
--as-textfile
--split-by StockSymbol
-m 10

Importing Specific Columns


Use the --columns argument to specify which columns from the table to import. For
example:
sqoop import
--connect jdbc:mysql://host/nyse
--table StockPrices
--columns StockSymbol,Volume,High,ClosingPrice
--target-dir /data/dailyhighs/
--as-textfile
--split-by StockSymbol
-m 10

Based on the import command above:

How many columns will be in imported? ______________

How many files will be created in /data/dailyhighs/? ______________

Which column will Sqoop use to split the data up between the mappers?
____________________________


Importing from a Query


sqoop import
--connect jdbc:mysql://host/nyse
--query "SELECT * FROM StockPrices s
WHERE s.Volume >= 1000000
AND \$CONDITIONS"
--target-dir /data/highvolume/
--as-textfile
--split-by StockSymbol

Importing from a Query


Use the --query argument to specify which rows to select from a table. For example:
sqoop import
--connect jdbc:mysql://host/nyse
--query "SELECT * FROM StockPrices s
WHERE s.Volume >= 1000000
AND \$CONDITIONS"
--target-dir /data/highvolume/
--as-textfile
--split-by StockSymbol

Based on the command above:

Only rows whose Volume column is greater than 1,000,000 will be imported.

The $CONDITIONS token must appear somewhere in the WHERE clause of your
SQL query. Sqoop replaces this token with split conditions (based on the --split-by
column) so that the data can be divided between mappers.

If you use --query, then you must also specify a --split-by column or the Sqoop
command will fail to execute.


NOTE: Using --query is limited to simple queries where there are no


ambiguous projections and no OR conditions in the WHERE clause. Use of
complex queries (such as queries that have sub-queries, or joins leading to
ambiguous projections) can lead to unexpected results.

IMPORTANT: You either use --query or --table, but attempting to define


both results in an error.


The Sqoop Export Tool


Sqoop's export process will read a set of delimited text files from HDFS in parallel, parse
them into records, and insert them as new rows in a target database table. The syntax
for the export command is:
sqoop export (generic-args) (export-args)

The Sqoop export tool runs in three modes:


1. Insert Mode: the records being exported are inserted into the table using a SQL
INSERT statement.
2. Update Mode: an UPDATE SQL statement is executed for existing rows, and an
INSERT can be used for new rows.
3. Call Mode: a stored procedure is invoked for each record.
The mode used is determined by the arguments specified:

--table: the table to populate in the database. This table must already exist in the
database. If no --update-key is defined, then the command is executed in Insert
Mode.

--update-key: the primary key column for supporting updates. If you define this
argument, the Update Mode is used and existing rows are updated with the
exported data.

--call: invokes a stored procedure for every record, thereby using Call Mode. If
you define --call, then do not define the --table argument or an error will occur.

The following are sqoop export arguments:

--export-dir: the directory in HDFS that contains the data to export.

--input-fields-terminated-by: the input field delimiter. A comma is the default.

--update-mode: Specify how updates are performed when new rows are found
with non-matching keys in database. Values are updateonly (the default) and
allowinsert.
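As a sketch, an update-mode export into the LogData table shown in the next example might look like the following (entered on a single line; the id key column is hypothetical):

sqoop export
--connect jdbc:mysql://host/nyse
--table LogData
--export-dir /data/logfiles/
--update-key id
--update-mode allowinsert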


Exporting to a Table
sqoop export
--connect jdbc:mysql://host/nyse
--table LogData
--export-dir /data/logfiles/
--input-fields-terminated-by "\t"

Exporting to a Table
The following Sqoop command exports the data in the /data/logfiles/ folder in HDFS to
a table named LogData:
sqoop export
--connect jdbc:mysql://host/nyse
--table LogData
--export-dir /data/logfiles/
--input-fields-terminated-by "\t"

Based on the command above:

The table LogData needs to already exist in the nyse database.

The column values are determined by the delimiter, which is a tab in this
example.

All files in the /data/logfiles/ directory will be exported.

Sqoop will perform this job using 4 mappers, but you can specify the number to
use with the -m argument.


Unit 12 Review
1. What is the default number of map tasks for a Sqoop job? _____________
2. How do you specify a different number of mappers in a Sqoop job?
_________________________________________________
3. What is the purpose of the $CONDITIONS value in the WHERE clause of a Sqoop
query?
__________________________________________________________________


Lab 12.1: Using Sqoop

Objective: Move data between HDFS and a RDBMS.


Successful Outcome: You will have imported data from MySQL into folders in
HDFS, and exported data from HDFS into a MySQL table.
Before You Begin: SSH into node1.

Perform the following steps:


Step 1: Install MySQL
1.1. Run the following commands to install MySQL on node1:
# yum -y install mysql mysql-server

1.2. Start the server with the following command:


# service mysqld start

Step 2: Create a Table in MySQL


2.1. As the root user, change directories to /root/labs:
# cd ~/labs/

2.2. View the contents of salaries.txt:


# cat salaries.txt

The comma-separated fields represent a gender, age, salary and zip code.
2.3. Notice there is a salaries.sql script that defines a new table in MySQL named
salaries. For this script to work, you need to copy salaries.txt into the publicly-available
/tmp folder:

# cp salaries.txt /tmp

2.4. Now run the salaries.sql script using the following command:
# mysql test < salaries.sql

Step 3: View the Table


3.1. To verify the table is populated in MySQL, open the mysql prompt:
# mysql

3.2. Switch to the test database, which is where the salaries table was created:
mysql> use test;

3.3. Run the show tables command and verify salaries is defined:
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| salaries       |
+----------------+
1 row in set (0.00 sec)

3.4. Select 10 items from the table to verify it is populated:


mysql> select * from salaries limit 10;
+--------+------+--------+---------+----+
| gender | age  | salary | zipcode | id |
+--------+------+--------+---------+----+
| F      |   66 |  41000 |   95103 |  1 |
| M      |   40 |  76000 |   95102 |  2 |
| F      |   58 |  95000 |   95103 |  3 |
| F      |   68 |  60000 |   95105 |  4 |
| M      |   85 |  14000 |   95102 |  5 |
| M      |   14 |      0 |   95105 |  6 |
| M      |   52 |   2000 |   94040 |  7 |
| M      |   67 |  99000 |   94040 |  8 |
| F      |   43 |  11000 |   94041 |  9 |
| F      |   37 |  65000 |   94040 | 10 |
+--------+------+--------+---------+----+

Step 4: Grant the Necessary Privileges



4.1. Enter the following command at the mysql prompt to grant access to node2
and node3 to connect to the mysql-server running on node1:
grant all privileges on *.* to 'root'@'%' with grant option;

4.2. Exit the mysql prompt:


mysql> exit

Step 5: Import the Table into HDFS


5.1. Enter the following Sqoop command (all on a single line), which imports the
salaries table in the test database into HDFS:
# sqoop import
--connect jdbc:mysql://node1/test
--table salaries
--username root

5.2. A MapReduce job should start executing, and it may take a couple minutes for
the job to complete.
Step 6: Verify the Import
6.1. View the contents of your HDFS home directory:
# hadoop fs -ls

6.2. You should see a new folder named salaries. View its contents:
# hadoop fs -ls salaries
Found 4 items
-rw-r--r--   1 root hdfs        272 part-m-00000
-rw-r--r--   1 root hdfs        241 part-m-00001
-rw-r--r--   1 root hdfs        238 part-m-00002
-rw-r--r--   1 root hdfs        272 part-m-00003

6.3. Notice there are four new files in the salaries folder named part-m-0000x.
Why are there four of these files?
__________________________________________________________________
6.4. Use the cat command to view the contents of the files. For example:

# hadoop fs -cat salaries/part-m-00000

Notice the contents of these files are the rows from the salaries table in MySQL.
You have now successfully imported data from a MySQL database into HDFS.
Notice you imported the entire table with all of its columns. In the next step, you
will import only specific columns of a table.
Step 7: Specify Columns to Import
7.1. Using the --columns argument, write a Sqoop command that imports the
salary and age columns (in that order) of the salaries table into a directory in
HDFS named salaries2. In addition, set the -m argument to 1 so that the result is a
single file.
7.2. After the import, verify you only have one part-m file in salaries2:
# hadoop fs -ls salaries2
Found 1 items
-rw-r--r--   1 root hdfs        482 salaries2/part-m-00000

7.3. Verify the contents of part-m-00000 are only the 2 columns you specified:
# hadoop fs -cat salaries2/part-m-00000

The last few lines should look like the following:


69000.0,97
91000.0,48
0.0,1
48000.0,45
3000.0,39
14000.0,84

Step 8: Importing from a Query


8.1. Write a Sqoop import command that imports the rows from salaries in
MySQL whose salary column is greater than 90,000.00. Use gender as the --split-by
value, specify only 2 mappers, and import the data into the salaries3 folder in
HDFS.

TIP: The Sqoop command will look similar to the ones you have been using
throughout this lab, except you will use --query instead of --table. Recall


that when you use a --query command you must also define a --split-by
column, or define -m to be 1.
Also, do not forget to add $CONDITIONS to the WHERE clause of your query,
as demonstrated earlier in this Unit.

8.2. To verify the result, view the contents of the files in salaries3. You should
have only two output files.
8.3. View the contents of part-m-00000 and part-m-00001. Notice one file
contains females, and the other file contains males. Why? ______________
______________________________________________________________
8.4. Verify the output files contain only records whose salary is greater than
90,000.00.
Step 9: Put the Export Data into HDFS
9.1. Now let's export data from HDFS to the database. Start by viewing the
contents of the data, which is in a file named salarydata.txt:
# tail salarydata.txt
M,49,29000,95103
M,44,34000,95102
M,99,25000,94041
F,93,96000,95105
F,75,9000,94040
F,14,0,95102
M,68,1000,94040
F,45,78000,94041
M,40,6000,95103
F,82,5000,95050

Notice the records in this file contain 4 values separated by commas, and the
values represent a gender, age, salary and zip code, respectively.
9.2. Create a new directory in HDFS named salarydata.
9.3. Put salarydata.txt into the salarydata directory in HDFS.
Step 10: Create a Table in the Database
10.1. There is a script in the /root/labs folder that creates a table in MySQL that
matches the records in salarydata.txt. View the SQL script:

# more salaries2.sql

10.2. Run this script using the following command:


# mysql test < salaries2.sql

10.3. Verify the table was created successfully in MySQL:


# mysql
mysql> use test;
mysql> describe salaries2;
+---------+------------+------+-----+---------+-------+
| Field   | Type       | Null | Key | Default | Extra |
+---------+------------+------+-----+---------+-------+
| gender  | varchar(1) | YES  |     | NULL    |       |
| age     | int(11)    | YES  |     | NULL    |       |
| salary  | double     | YES  |     | NULL    |       |
| zipcode | int(11)    | YES  |     | NULL    |       |
+---------+------------+------+-----+---------+-------+

10.4. Exit the mysql prompt:


mysql> exit

Step 11: Export the Data


11.1. Run a Sqoop command that exports the salarydata folder in HDFS into the
salaries2 table in MySQL. At the end of the MapReduce output, you should see a
log event stating that 10,000 records were exported.
11.2. Verify it worked by viewing the table's contents from the mysql prompt. The
output should look like the following:
mysql> use test;
mysql> select * from salaries2 limit 10;
+--------+------+--------+---------+
| gender | age  | salary | zipcode |
+--------+------+--------+---------+
| M      |   57 |  39000 |   95050 |
| F      |   63 |  41000 |   95102 |
| M      |   55 |  99000 |   94040 |
| M      |   51 |  58000 |   95102 |
| M      |   75 |  43000 |   95101 |
| M      |   94 |  11000 |   95051 |
| M      |   28 |   6000 |   94041 |
| M      |   14 |      0 |   95102 |
| M      |    3 |      0 |   95101 |
| M      |   25 |  26000 |   94040 |
+--------+------+--------+---------+

RESULT: You have imported the data from MySQL to HDFS using the entire table,
specific columns, and also using the result of a query. You have also exported a folder of
data in HDFS into a table in MySQL.

SOLUTIONS:
Step 7.1 is the following command (entered on a single line):
# sqoop import --connect jdbc:mysql://node1/test
--table salaries
--columns salary,age
-m 1
--target-dir salaries2
--username root

Step 8.1:
sqoop import --connect jdbc:mysql://node1/test
--query "select * from salaries s where s.salary > 90000.00
and \$CONDITIONS"
--split-by gender
-m 2
--target-dir salaries3
--username root


Step 11
sqoop export
--connect jdbc:mysql://node1/test
--table salaries2
--export-dir salarydata
--input-fields-terminated-by ","
--username root

ANSWERS:
Step 6.3: The MapReduce job that executed the Sqoop command used four mappers, so
there are four output files (one from each mapper).
Step 8.3: You used gender as the split-by column, so all records with the same gender
are sent to the same mapper.


Unit 13: Flume


Topics covered:

Flume Introduction

Installing Flume

Flume Events

Flume Sources

Flume Channels

Flume Channel Selectors

Flume Channel Selector

Flume Sinks

Multiple Sinks

Flume Interceptors

Design Patterns

Configuring Individual Components

Flume Netcat Source Example

Flume Exec Source Example

Flume Configuration

Monitoring Flume

Lab 13.1: Install and Test Flume


Flume Introduction
A flume is an artificial channel or stream that
uses water to transport objects down a
channel.
Apache Flume, a data ingestion tool, collects,
aggregates and directs data streams into Hadoop
using the same concepts. Flume works with
different data sources to process and send data to
defined destinations.

[Diagram: Flume Agent = Source → Channel → Sink]

Flume Introduction
A flume is an artificial channel or stream created that uses water to transport objects
down the channel. Flumes were often used by the logging industry to move cut
wooden logs. Apache Flume transfers data from multiple sources into Hadoop via
events instead of wooden logs. It efficiently collects, aggregates, and moves large
amounts of streaming data.
Flume Components

Event: The individual unit of data (such as a log entry) and is made up of
header(s) and a byte-array body.

Source: Defines the type of data stream that is entering Flume. Sources may
either be active (constantly looking for data) or passive (waiting for data to be
passed to them).

Client: Produces and communicates events to the source.

Sink: Delivers the data to its destination. Each sink is defined based on the
destination it will be transferring data into. For example: HDFS, HBase, a local
file.


Channel: The conduit between the source and the sink (destination).

Agent: A JVM process that is a collection of sources, sinks and channels. An
agent requires that at least one source, channel and sink be defined; however, it
may also be configured with multiple sources, channels, and sinks.

Flume Workflow
1. Client transmits event to a source.
2. Source receives event and delivers it to one or more channels.
3. The sink or sinks transfer the data from the channel to the final destination.
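A minimal agent configuration sketch that wires this workflow together is shown below; it uses a netcat source, a memory channel and a logger sink, and the agent and component names (a1, r1, c1, k1) are arbitrary:

# name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source: turns each line received on the port into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# memory channel between the source and the sink
a1.channels.c1.type = memory

# logger sink: logs each event
a1.sinks.k1.type = logger

# wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1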


Installing Flume
Following are the system requirements for running Flume:

Java Runtime Environment: Java 1.6 or later (Java 1.7 Recommended).

Memory: The Flume agent requires an appropriate amount of memory for all
components of the agent.

Disk Space: Flume agent needs permission to access sources and write to
destinations. Make sure channels have sufficient storage.

Permissions: Read/write access to all directories the agents will be using.

Although not required, it is recommended to set your time to UTC versus local
time.

NOTE: The Flume agent heap size can be set with JAVA_OPTS:
JAVA_OPTS= "-Xms100m -Xmx200m"


The key Flume environment and configuration files are:


File Location / Name                      Description
/etc/flume/conf/flume-conf.properties     Java properties file.
/etc/flume/conf/flume-env.sh              Contains environment variables.
/etc/flume/conf/log4j.properties          Contains Java logging properties (such as the log directory).
flume.log.dir=/var/log/flume              Flume log directory.

NOTE: The following templates are available:


/etc/flume/conf/flume-conf.properties.template and
/etc/flume/conf/flume-env.sh.template

Flume can be started using either of the commands below:

$ /etc/rc.d/init.d/flume-ng start
-- OR --
$ service flume-ng start

Use the help option to view the usage list:


$ ./bin/flume-ng help
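A specific agent can also be started directly with the flume-ng agent command; a sketch is shown below, assuming the default configuration directory and properties file and an agent named agent in that file:

$ flume-ng agent --conf /etc/flume/conf \
    --conf-file /etc/flume/conf/flume-conf.properties \
    --name agent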


Flume Events
An event can range from text to images. The key point about events is they need to be
generated from regular streaming data.
An Event is a single unit of data that can be transported by Flume NG (akin to messages
in JMS). Events are generally small (ranging from a few bytes to a few kilobytes) and are
commonly a single record from a larger dataset. Events are made up of headers,
containing the key / value map and a body, storing the arbitrary byte array.
Clients generate data as a stream of events and run in a separate thread. The clients
send data to a source. A log4j appender sends events directly to Flume NG's source or
syslog daemon.


Flume Sources
A Flume source is the data stream from which Flume receives the data. The source can
be pollable or event driven. A spool directory can be set up to look for new files. A suffix
can be added to a file once all of its events have been transmitted.
Property                              Sample Value
agent.sources                         mychannel
agent.sources.channels                mychannel
agent.sources.mychannel.type          spooldir
agent.sources.mychannel.spoolDir      /directorypath
agent.sources.mychannel.fileSuffix    .COMPLETE


Flume Source                  Description
Avro Source                   Listens on an Avro port and receives events from an external Avro client stream.
Exec Source                   Runs a given Unix command at start-up and listens to standard out.
Thrift Source                 RPC source.
NetCat Source                 Listens on a given port and turns each line of text into an event.
Sequence Generator Source     Continuously generates events with a counter that starts at 0 and increments by 1. Useful for testing and debugging.
SpoolDir Source               Processes rotating log files.
JMS Source                    Processes a stream from the Java Messaging Service.
HTTP Source                   Accepts events by HTTP POST and GET (experimentation only).
Syslog Source                 Reads syslog data and generates Flume events. Also: Syslog TCP Source and Syslog UDP Source.
Custom Source                 Implementation of the Source interface.


Flume Channels
The channel is the conduit for events between a source and a sink. The channel dictates
the durability of event delivery between a source and a sink. An event stays in the
channel until the sink successfully sends the data to the defined destination. The source
and the sink run asynchronously in processing events in the channel. Channel
exceptions can be thrown if the ingest rate exceeds the channel's ability to handle that
rate.


Types of channels include:

Memory Channel: Fast but makes no guarantee against data loss.

File Channel: Backed by WAL implementation; fully durable and reliable.

JDBC Channel: Backed by embedded Database; fully durable and reliable.

Memory Channels: Events are stored in an in-memory queue with configurable


max size. They are the fastest but lack durability.

File Channel: Writes and checkpoints files to disk. Slower but durable.

JDBC Channel: Events are stored in a persistent storage that is backed up with a
database. Slower but durable.

Custom Channel: Implementation of the Channel interface.

Examples of defining different types of channels:
agent.channels.mychannel.type = memory
agent.channels.mychannel.type = file
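A slightly fuller sketch of channel configuration is shown below; the channel names (memch, filech), capacity values and directory paths are illustrative assumptions:

# memory channel with an explicit queue size
agent.channels.memch.type = memory
agent.channels.memch.capacity = 10000
agent.channels.memch.transactionCapacity = 100

# file channel with checkpoint and data directories
agent.channels.filech.type = file
agent.channels.filech.checkpointDir = /var/flume/checkpoint
agent.channels.filech.dataDirs = /var/flume/data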


Flume Channel Selectors


A channel selector allows you to go from a single source to multiple channels using a fan
out strategy.
Example:
agent.sources.mychannel.channels = mych1 mych2 mych3
agent.sources.mychannel.selector.type = replicating
agent.sources = mychannel
agent.channels = mych1 mych2 mych3
agent.sources.mychannel.selector.type = multiplexing
agent.sources.mychannel.selector.header = state
agent.sources.mychannel.selector.mapping.src1 = mych1
agent.sources.mychannel.selector.mapping.src2 = mych2
agent.sources.mychannel.selector.mapping.src3 = mych1 mych2

Events can be batched as a transaction and each transaction has a unique id. The
number of events that are processed together as a single transaction determines the
batch size. Each event in a transaction has a unique sequence number.
The durability of transactions is determined by the batch sizes as batch sizes control
throughput.

Flume Channel Selector


Channel Selectors support fanning out of events. The events are either
replicated to all channels or sent to a specific channel.

[Diagram: within an agent, a single source fans out to two channels, each with its own sink]

Flume Channel Selector


A channel selector can be set to replicating (default) or multiplexing. Channel selectors
set to replicating will send all events to multiple channels. Channel selectors set to
multiplexing can replicate or selectively route an event to one or more channels. See
below for examples of setting the channel selector.
Replicating:
agent.sources.mychannel.channels = c1 c2 c3
agent.sources.mychannel.selector.type = replicating

Multiplexing:
agent.sources.mychannel.selector.type = multiplexing
agent.sources.mychannel.selector.header = port


Flume Sinks
Sinks receive events from channels; the events are then written to HDFS
or forwarded to another data source. Supported destinations are shown
below:

HDFS

Avro

Flume Sinks
A sink is the destination for the data stream in Flume. The sink receives events from a
channel and runs in a separate thread. Sinks can support text and sequence files when
writing to HDFS and both file types can be compressed. Below is a list of the different
types of sinks.
Type of Sink        Description
HDFS Sink           Writes events into HDFS. Data streams can be compressed. Creates text or sequence files; sequence files are the default and are able to be split. Supports Kerberos authentication.
Logger Sink         Logs events at the INFO level.
Avro Sink           Used to send to an Avro Source on another server.
Thrift Sink         Flume events are turned into Thrift events and sent to a hostname/port.
IRC Sink            Takes messages from the attached channel and relays them to configured IRC destinations.
File Roll Sink      Stores events on the local file system.
HBase Sink          Writes data to HBase.
Async HBase Sink    Writes data to HBase using an asynchronous model.
Null Sink           Removes all events received from the channel.
Custom Sink         Create a custom implementation of the Sink interface.
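As a sketch, an HDFS sink might be configured as follows; the sink and channel names and the target path are illustrative assumptions:

agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = /flume/events
agent.sinks.hdfssink.hdfs.fileType = DataStream
agent.sinks.hdfssink.hdfs.writeFormat = Text
agent.sinks.hdfssink.channel = mychannel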


Multiple Sinks
Single sink is default behavior.
Multiple sinks can provide:
Failover for Sinks.
Load balancing of Sinks.

[Diagram: a source and channel feeding a sink processor (SP) that manages multiple sinks within the agent]

Multiple Sinks
Sink Processors are a collection of multiple sinks and can be setup for load balancing
over multiple sinks or to achieve failover from one sink to another in case of failure.


There are different types of sink processors; each is described in the table below:

Type of Sink Processor   Description
Default                  Accepts a single sink and does not require the user to create a
                         sink processor. Follows the source-channel-sink pattern.
Failover                 Maintains a list of prioritized sinks. Failed sinks are moved out;
                         if they continue to fail they are retired, and if they later deliver
                         an event successfully they are restored.
Load Balancing           Balances the load across multiple sinks. The load distribution can
                         be round-robin or random selection, with a default of round-robin.
                         Picks the next available sink if the selected sink fails to deliver
                         an event.
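A minimal sketch of how a sink processor could be declared through a sink group, assuming two sinks named sink1 and sink2 already exist; the group name sg1 and the priority values are illustrative.

agent.sinkgroups = sg1
agent.sinkgroups.sg1.sinks = sink1 sink2
# Failover: prefer sink1, fall back to sink2 if it fails
agent.sinkgroups.sg1.processor.type = failover
agent.sinkgroups.sg1.processor.priority.sink1 = 10
agent.sinkgroups.sg1.processor.priority.sink2 = 5
# Alternatively, load balance across the sinks:
# agent.sinkgroups.sg1.processor.type = load_balance
# agent.sinkgroups.sg1.processor.selector = round_robin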

Copyright 2014, Hortonworks, Inc. All rights reserved.

299

Flume Interceptors
Interceptors are set with the interceptors property and have the ability to drop or
modify an event based on how the interceptor is coded. Flume supports chaining
multiple interceptors together and the order of definition sets the order they run in.
agent.sources.mychannel.interceptors = inter1 inter2 inter3

The type property sets the type of interceptor.

Timestamp Interceptor: Inserts a timestamp into the header of events. An existing
timestamp can be overwritten or preserved (see the preserveExisting property below).

agent.sources.mychannel.interceptors = inter1
agent.sources.mychannel.interceptors.inter1.type = timestamp
agent.sources.mychannel.interceptors.inter1.preserveExisting = true

300

Copyright 2014, Hortonworks, Inc. All rights reserved.

Host Interceptor: Inserts the hostname or IP address of the agent into the event
header, adding that header to all events.

agent.sources.mychannel.interceptors = inter1
agent.sources.mychannel.interceptors.inter1.type = host

Static Interceptor: Allows a static key/value header to be added to all events.

UUID Interceptor: Sets a universally unique identifier on all events that pass
through the interceptor.

Regex Filtering Interceptor: Filters events selectively by interpreting the event
body as text and matching the text against a configured regular expression.

Regex Extractor Interceptor: Interprets the event body as text, matches it against a
configured regular expression, extracts an element from the body and places it in the
event header.

Morphline Interceptor: Filters the events through a morphline configuration file
that defines a chain of transformation commands that pipe records from one command
to another.
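For example, a regex filtering interceptor could be sketched as follows; the source name and the pattern are illustrative assumptions.

agent.sources.mychannel.interceptors = inter1
agent.sources.mychannel.interceptors.inter1.type = regex_filter
# keep only events whose body starts with ERROR
agent.sources.mychannel.interceptors.inter1.regex = ^ERROR.*
agent.sources.mychannel.interceptors.inter1.excludeEvents = false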

Copyright 2014, Hortonworks, Inc. All rights reserved.

301

Design Patterns
Multi-Agent Flow
[Diagram: Agent 1 (source -> channel -> Avro sink) sends events over Avro RPC to Agent 2 (Avro source -> channel -> sink).]

Fan In (Consolidation)
[Diagram: several agents, each with its own source, channel and sink, send their events to a single consolidating agent.]

Fan Out
[Diagram: a single source writes to multiple channels, each with its own sink; one of the sinks writes to HDFS.]

Design Patterns
Flume has the flexibility to create complex data workflows. Agents are able to have
multiple sources, channels and sinks. You can also connect multiple agents to each
other.
The Flume topology supports multiple design patterns. A few are shown above:

Multi-Agent Flow

Fan In

Fan Out

For any Flume agent, the source ingests data and sends it to the channel. There can be
multiple sources, channels and sinks in a Flume agent but each sink can only receive
data from a single channel.
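As a sketch of the multi-agent flow pattern (hostnames, ports and component names here are assumptions), the first-tier agent uses an Avro sink that points at the Avro source of a second, consolidating agent:

# First-tier agent a1: forward events to the collector host
a1.sinks.avrosink.type = avro
a1.sinks.avrosink.hostname = collector1
a1.sinks.avrosink.port = 4545
a1.sinks.avrosink.channel = ch1

# Collector agent c1: receive the events over Avro RPC
c1.sources.avrosrc.type = avro
c1.sources.avrosrc.bind = 0.0.0.0
c1.sources.avrosrc.port = 4545
c1.sources.avrosrc.channels = ch1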

302

Copyright 2014, Hortonworks, Inc. All rights reserved.

Configuring Individual Components


Multiple agents can be configured in a single configuration file. The configuration file
contains properties about each source, channel, and sink for the agent(s). How these
areas are configured determines the data flow through the agent.
A prefix name is used to identify an agent. Each source, channel and sink needs to have
a name associated with it. Following the name are the properties that define the
component.

Copyright 2014, Hortonworks, Inc. All rights reserved.

303

Example formats:
<AgentName>.sources = <SourceName>
<AgentName>.sinks = <SinkName>
<AgentName>.channels = <Channel1> <Channel2>

# set channels for a source
<AgentName>.sources.<SourceName>.channels = <Channel1> <Channel2> ...

# set the channel for a sink
<AgentName>.sinks.<SinkName>.channel = <Channel1>

# properties for sources
<AgentName>.sources.<SourceName>.<someProperty> = <someValue>

# properties for channels
<AgentName>.channels.<ChannelName>.<someProperty> = <someValue>

# properties for sinks
<AgentName>.sinks.<SinkName>.<someProperty> = <someValue>

# Agent name examples
agent.sources.mysource.port = 20100    # uses an agent called "agent"
a1.sources.mysource.port = 20100       # uses an agent called "a1"

To start a Flume agent, call the flume-ng shell (located in Flume bin directory) script.
The script sets the agent name, the configuration directory and the configuration
properties file.
$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties

304

Copyright 2014, Hortonworks, Inc. All rights reserved.

Flume Netcat Source Example


The following example shows the options that can be set within the my.conf file. The
agent will receive the text sent from the telnet command.

Copyright 2014, Hortonworks, Inc. All rights reserved.

305

# my.conf file
#Define source name
agent.sources = snet
#Define sink name
agent.sinks = sink1
#Define channel name
agent.channels = chmem
#Set the source
agent.sources.snet.type = netcat
agent.sources.snet.bind = localhost
agent.sources.snet.port = 44444
# Set the sink destination
agent.sinks.sink1.type = logger
#Set channel to type memory
agent.channels.chmem.type = memory
agent.channels.chmem.capacity = 1000
agent.channels.chmem.transactionCapacity = 100
#Set the source with the channel
agent.sources.snet.channels = chmem
#Set the sink with the channel
agent.sinks.sink1.channel = chmem
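To try this configuration, the agent can be started and fed with telnet roughly as follows (assuming my.conf is in the current directory):

$ flume-ng agent -n agent -c conf -f my.conf
# in a second terminal, send some text to the netcat source
$ telnet localhost 44444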

306

Copyright 2014, Hortonworks, Inc. All rights reserved.

Flume Exec Source Example


agent.sources = pstream
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.sources.pstream.channels = memoryChannel
agent.sources.pstream.type = exec
agent.sources.pstream.command = tail -f /etc/passwd
agent.sinks = hdfsSink
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path =
hdfs://hdp/user/root/flumetest
agent.sinks.hdfsSink.hdfs.fileType = SequenceFile
agent.sinks.hdfsSink.hdfs.writeFormat = Text

Copyright 2014, Hortonworks, Inc. All rights reserved.

307

Flume Configuration
# A single-node Flume configuration
# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channelA
# Describe/configure source1
agent1.sources.source1.type = netcat
agent1.sources.source1.bind = localhost
agent1.sources.source1.port = 44444
# Describe sink1
agent1.sinks.sink1.type = logger
# Use a channel which buffers events in memory
agent1.channels.channelA.type = memory
agent1.channels.channelA.capacity = 1000
agent1.channels.channelA.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.source1.channels = channelA
agent1.sinks.sink1.channel = channelA

Flume Configuration
The property "type" needs to be set for each component so that Flume knows what kind
of object it needs to be. Each source, sink and channel type has its own set of
properties required for it to function as intended; all of those need to be set as
needed. In the example below, we have a flow from avro-AppSrv-source to
hdfs-Cluster1-sink through the memory channel channel1.

308

Copyright 2014, Hortonworks, Inc. All rights reserved.

The example below shows configuration of each of those components:


# set channel for sources, sinks, channels
my_agent.sources = avro-AppSrv-source
my_agent.sinks = hdfs-Cluster1-sink
my_agent.channels = channel1
# properties of avro-AppSrv-source
my_agent.sources.avro-AppSrv-source.type = avro
my_agent.sources.avro-AppSrv-source.bind = localhost
my_agent.sources.avro-AppSrv-source.port = 10000
# properties of channel1
my_agent.channels.channel1.type = memory
my_agent.channels.channel1.capacity = 1000
my_agent.channels.channel1.transactionCapacity = 100
# properties of hdfs-Cluster1-sink
my_agent.sinks.hdfs-Cluster1-sink.type = hdfs
my_agent.sinks.hdfs-Cluster1-sink.hdfs.path =
hdfs://namenode/flume/webdata
#...

Copyright 2014, Hortonworks, Inc. All rights reserved.

309

Monitoring Flume
Flume monitoring options can be set in /etc/flume/conf/flume-env.sh (JAVA_OPTS) for
the following:

JMX monitoring
JAVA_OPTS="-Dcom.sun.management.jmxremoteDcom.sun.management.jmxremote.port=4159
-Dcom.sun.management.jmxremote.authenticate=false Dcom.sun.management.jmxremote.ssl=false

Ganglia: Flume metrics can be sent to Ganglia.


JAVA_OPTS="-Dflume.monitoring.type=ganglia Dflume.monitoring.hosts=<ganglia-server>:8660"

310

Nagios: Nagios can be configured to watch the Flume agents. Monitoring for
cpu, memory and disk resources consumed by Flume should be standard. Look
at the Nagios JMX plugin to monitor performance.

Copyright 2014, Hortonworks, Inc. All rights reserved.

Unit 13 Review
1. The basic unit of data for Flume is an ____________________ .
2. Sources can be polled or _________________________.
3. A channel selector can be replicating, multiplexing or _____________________.
4. The Flume component ____________________ allows inspection and
transformation of the data as it flows through the stream.

Copyright 2014, Hortonworks, Inc. All rights reserved.

311

Lab 13.1: Install and Test Flume

Objective: Install, configure and run Flume.


Successful Outcome: A running Flume agent that reads data from a network
connection and writes it to a folder in HDFS.
Before You Begin: SSH into node1.

Perform the following steps:


Step 8: Install Flume
8.1. On node1 as root, install Flume using the following command:
# yum -y install flume

8.2. Verify Flume is installed by viewing the usage of the flume-ng command:
# flume-ng

Step 9: View the Flume Agent Configuration


9.1. Change directories to the /root/labs/flume directory.
9.2. A Flume agent has been written for you in a file named logagent.conf. View
the file:
# less logagent.conf

9.3. Notice the name of the agent defined in this file is called logagent.
9.4. The source of logagent is source1. Based on the source1 configuration, where
is the data coming from for this Flume agent?
__________________________________________________

312

Copyright 2014, Hortonworks, Inc. All rights reserved.

9.5. What type of channel is being used for logagent?


___________________________________
9.6. Where is the sink for this Flume agent?
___________________________________________
Step 10: Start a Flume agent
10.1. Flume requires JAVA_HOME to be defined, so enter the following command:
# export JAVA_HOME=/usr/jdk64/jdk1.6.0_31/

10.2. Start logagent using the following command (all on a single line):
flume-ng agent -n logagent -f logagent.conf
-Dflume.log.dir=/var/log/flume/
-Dflume.log.file=logagent.log &

10.3. View the output of the command. Make sure sink1 and source1 started:
INFO sink.RollingFileSink: RollingFileSink sink1 started.
INFO instrumentation.MonitoredCounterGroup: Monitoried
counter group for type: SOURCE, name: source1, registered
successfully.
INFO instrumentation.MonitoredCounterGroup: Component type:
SOURCE, name: source1 started
INFO source.AvroSource: Avro source source1 started.

10.4. Verify a flume process is running:


# ps -eaf | grep flume

Step 11: Test the Flume Agent


11.1. The sink for logagent is the /user/root/flumedata folder in HDFS. View the
contents of this folder, which should either be empty or does not exist yet:
# hadoop fs -ls flumedata

11.2. Change directories to ~/labs/flume and view the contents of test.log. This
will be the data that you send to the source of logagent.

Copyright 2014, Hortonworks, Inc. All rights reserved.

313

11.3. From the ~/labs/flume folder, run the following command (all on a single
line) which takes the contents of test.log and writes it in the Avro format to port
8888 on node1:
# flume-ng avro-client -H node1 -p 8888 -C
/usr/lib/flume/lib/flume-ng-core-1.4.0.2.0.6.0-76.jar -F
test.log

11.4. Wait for this task to execute. When complete, view the contents of
flumedata in HDFS, which should now contain a new file:
# hadoop fs -ls flumedata
Found 1 items
-rw-r--r-3 root root
739
flumedata/FlumeData.1384193670669

11.5. View the contents of the file in HDFS. It should match the content from
test.log:
# hadoop fs -cat flumedata/FlumeData.1384193670669

Step 12: Stop the Agent


12.1. Determine the process ID of the Flume agent:
# ps -eaf | grep flume

12.2. To kill a Flume agent, simply issue the kill command on the process:
# kill pid

RESULT: You just ran a Flume agent that reads data from a network connection and
streams it into a folder in HDFS.

ANSWERS:
9.4: The source of logagent is a network connection on port 8888 of node1.
9.5: The channel is an in-memory channel of size 100.
9.6: The sink is the /user/root/flumedata folder in HDFS.

314

Copyright 2014, Hortonworks, Inc. All rights reserved.

Unit 14: Oozie


Topics covered:

Oozie Overview

Oozie Components

Jobs, Workflows, Coordinators, Bundles

Workflow Actions and Decisions

Oozie Job Submission

Oozie Server Workflow Coordinator

Oozie Console

Interfaces to Oozie

Oozie Server Configuration Files

Oozie Scripts

The Oozie CLI

Using the Oozie CLI

Submit Jobs through http

Oozie Actions

Oozie Metrics

Lab 14.1: Running an Oozie Workflow

Copyright 2014, Hortonworks, Inc. All rights reserved.

315

Oozie Overview
A workflow is a sequence of actions scheduled for execution. Oozie is the workflow
scheduler for Hadoop that runs as a service on the cluster. Clients submit workflow
definitions for immediate or scheduled execution. Oozie is tightly integrated with
Hadoop.
Oozie actions may include:

Streaming

MapReduce

Pig

Hive

Distcp

Sqoop jobs

316

Copyright 2014, Hortonworks, Inc. All rights reserved.

Oozie Components
Workflow Engine
Runs workflows

Coordinating
Engine

Coordinating Engine Scheduler


Runs workflow jobs based on:

Workflow
Engine

Oozie Server

Predefined schedules (fixed or cron intervals)


Data availability

Oozie Server
JVM runs Coordinating Engine and Workflow Engine

Database

Database
Stores workflow definitions and state information

Oozie Console

Oozie Components
Oozie is a Java web application that runs in a Java servlet container (Tomcat).
Tomcat is a Java web application server that uses a database to store the Oozie
workflow definitions and the state of current workflow instances.
The two main components are the Oozie server and the Oozie client. The server is the
engine that runs the workflows, and the Oozie client launches jobs and communicates
with the Oozie server.
Oozie's metadata database contains the workflow definitions and the current status of
workflows, including state information and workflow instance details (such as states
and variables).

Copyright 2014, Hortonworks, Inc. All rights reserved.

317

Jobs, Workflows, Coordinators, Bundles


Workflows run in a defined order set by a Directed Acyclic Graph (DAG).
A workflow is defined with nodes, including the beginning (start) and end of the workflow.
[Diagram: Start -> Action -> End on the OK path; the action transitions to a Kill node on the ERROR/Fail path.]

Jobs, Workflows, Coordinators, Bundles


An Oozie job is an operation to be performed. There are Oozie workflow and
coordinator jobs.
An Oozie workflow is a collection of jobs or actions to be performed in a defined order
(DAG). Workflows contain control flow (execution path) and action nodes (processing to
be performed). Oozie triggers workflow actions, but Hadoop MapReduce executes
them. Oozie detects completion of tasks through callbacks and polling. When Oozie
starts a task, it provides a unique callback HTTP URL to the task, and the task
notifies that URL when it is complete. If the task fails to invoke the callback URL,
Oozie can poll the task for completion.
Oozie coordinator jobs are triggered by time and data availability. The Coordinator
allows you to model workflow execution triggers in the form of data, time or event
predicates; the workflow job is started once those predicates are satisfied. The Oozie
Coordinator can also manage multiple workflows that depend on the outcome of
preceding workflows, where the output of one workflow becomes the input to the next.
This chain is called a data application pipeline.
An Oozie bundle provides a way to package multiple coordinator and workflow jobs.
The bundle defines a set of coordinator applications (a data pipeline), which allows a
user to start, stop, suspend, resume and rerun the bundle as a group.
318

Copyright 2014, Hortonworks, Inc. All rights reserved.

When an HDFS URI is defined as a data set, Oozie performs an availability check. When
the data dependencies are met, the coordinator's workflow is triggered. Oozie
coordinators also support triggers that fire when HCatalog table partitions become
available, and workflow actions can read data from those partitions. (HCatalog
provides abstract table definitions for the underlying data storage.)
A Directed Acyclic Graph (DAG) is a collection of vertices (nodes/actions) and
directed edges that connect the vertices in order (a directed graph), so there is an
end and the DAG never circles back to the start.
Oozie workflows are defined in an XML process definition language called hPDL. The
XML documents contain the workflow, made up of start, end and fail nodes, as well as
decision control statements containing decision, fork and join nodes.
Workflow actions:

All workflows must have one start and one end node.

Workflow starts with a transition to the start node.

Workflow succeeds when it transitions to the end node.

If the workflow fails, it transitions to a kill node. The workflow reports the error
message specified in the message element in the workflow definition.
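For illustration, a minimal hPDL workflow skeleton might look like the sketch below; the workflow, action and node names are assumptions, and ${nameNode} is assumed to be supplied through job.properties.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="my-action"/>
    <action name="my-action">
        <!-- illustrative HDFS (fs) action; any action type could be used here -->
        <fs>
            <mkdir path="${nameNode}/user/${wf:user()}/sample-output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>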

Copyright 2014, Hortonworks, Inc. All rights reserved.

319

Workflow Actions and Decisions


A workflow is defined with mechanisms to control the workflow execution: decision, fork and join nodes.
[Diagram: Start -> Fork -> two parallel Actions -> Join -> Action -> End; each action can transition to a Kill node on error.]

Workflow Actions and Decisions


The workflow triggers the execution of a computation/processing task.
Action types include:

Streaming, MapReduce, HDFS, Pig, and Sqoop.

Java, shell, email, and distcp.

HDFS file system operations.

Oozie sub-workflow.

Most actions have to wait until the previous action completes. Oozie uses callbacks
and polling to stay in communication with the defined processing.
Computation/processing tasks triggered by an action node are executed by the
MapReduce framework. Most operations are executed asynchronously; however, file
system operations are executed synchronously.

320

Copyright 2014, Hortonworks, Inc. All rights reserved.

Oozie Actions
Shell Action: Oozie will wait for the shell command to complete before going to next
action. The standard output of the shell command can be used to make decisions.
Pig, Hive and MapReduce Actions: For executing Pig and Hive scripts and Java
MapReduce jobs.
Sqoop Action: Oozie will wait for the Sqoop command to complete before going to next
action.
Ssh Action: Runs a remote secure shell command on a remote machine. The workflow
will wait for ssh command to complete. Ssh command is executed in the home
directory of the defined user on the remote host.
Custom Action: Custom actions can be set up to run synchronously or asynchronously.

Copyright 2014, Hortonworks, Inc. All rights reserved.

321

Email Actions: Sent synchronously, an email must contain an address, a subject and a
body. Here is an example of setting the properties for an email action. Examples of
other Oozie actions can be found in the documentation.
<workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:0.1">
...
<action name="an-email">
<email xmlns="uri:oozie:email-action:0.1">
<to>bigkahuna@hwxs.com</to>
<subject>Email notifications for
${wf:id()}</subject>
<body>My cool workflow ${wf:id()} successfully
completed.</body>
</email>
<ok to="mycooljob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>

322

Copyright 2014, Hortonworks, Inc. All rights reserved.

Oozie Job Submission


1. Submit the MapReduce job.
2. Start a map-only process (runs the Oozie script).
3. The action is executed.
4. The Launcher task is terminated.
5. If the workflow contains another action, a new Launcher is submitted.
[Diagram: the Oozie Server submits a Launcher task (a map-only MapReduce job) to the ResourceManager, and the Launcher executes the action.]

Oozie Job Submission


As mentioned in the previous slide, most of the actions are executed asynchronously.
HDFS actions, however, are handled synchronously. Identical workflows can be run
concurrently when properly parameterized.
Oozie can detect completion of computation/processing tasks through callbacks and
polling. When a computation/processing task is started by Oozie, it provides a unique
callback URL to the task. The task should invoke the given URL to notify its completion.
When the task fails to invoke the callback URL for any reason (Transient network failure,
for example), or when the type of task cannot invoke the callback URL upon completion,
Oozie has a mechanism to poll computation/processing tasks for completions. The
default number of retries is three.

NOTE: If the Oozie job consists of multiple actions, then a new Launcher
MapReduce job is executed for each distinct action in the workflow.

Copyright 2014, Hortonworks, Inc. All rights reserved.

323

Oozie Server Workflow Coordinator


The Oozie server can learn of job completion either from a callback or by polling.
[Diagram: the Oozie Server (Coordinator Engine and Workflow Engine) submits a job to the JobTracker along with a callback URL, polls periodically for job status, and receives a callback to the URL on job completion; state is kept in the database.]
Databases supported: Derby (default), MySQL, Oracle, PostgreSQL, and HSQL.

Oozie Server Workflow Coordinator


Workflows can be submitted individually or as part of a Coordinator workflow.
Oozie's coordinating engine:

Allows the user to define workflow execution schedules based on dependencies.

Allows for workflow execution triggers in the form of predicates.

Starts a workflow job after the predicate event is satisfied.

Predicates can reference data, time or a cron-style schedule.

Allows for multiple coordinators to be bundled together.

Many organizations use their enterprise scheduler to call Oozie Workflows. You may
use REST API to call workflows.

Yahoo runs over 700 workflows. They are organized into coordinators and
bundled together.

324

Copyright 2014, Hortonworks, Inc. All rights reserved.

Oozie Console

Use to watch progress of workflows

Cannot submit jobs

Use to see results of workflow execution

Cannot modify job status

Use to get detailed information on job


execution

Oozie Console
The Oozie Web Console provides a UI for viewing and monitoring your Oozie jobs. You
will use the Console in the upcoming lab.

Copyright 2014, Hortonworks, Inc. All rights reserved.

325

Interfaces to Oozie
The Oozie Web Services (WS) API is an HTTP REST JSON API.
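For example, the server status and recent workflow jobs can be queried over HTTP roughly as follows (the hostname and port are assumptions, and the output is abbreviated):

$ curl http://localhost:11000/oozie/v1/admin/status
{"systemMode":"NORMAL"}

$ curl 'http://localhost:11000/oozie/v1/jobs?jobtype=wf&len=5'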

326

Copyright 2014, Hortonworks, Inc. All rights reserved.

Oozie Server Configuration


A proxy user should be set up for the Oozie server process. Set the following two
properties in the Hadoop core-site.xml file, and always restart Hadoop after the
proxy user settings are modified.
<!-- Setting up Oozie proxy user -->
<property>
<name>hadoop.proxyuser.[PROXY_USERNAME].hosts</name>
<value>[OOZIE_SERVER_HOSTNAME]</value>
</property>
<property>
<name>hadoop.proxyuser.[PROXY_USERNAME].groups</name>
<value>[PROXY_USER_GROUPS]</value>
</property>

The oozie-default.xml contains the initial read-only Oozie parameters.

Copyright 2014, Hortonworks, Inc. All rights reserved.

327

Here are the primary Oozie environment variables, which are configured in oozie-env.sh:
Variable Name           Description
CATALINA_OPTS           Java properties for the embedded Tomcat server that runs Oozie.
OOZIE_CONFIG_FILE       Oozie configuration file (oozie-site.xml).
OOZIE_LOGS              Oozie logs directory (logs/).
OOZIE_LOG4J_FILE        Oozie Log4J configuration file (oozie-log4j.properties).
OOZIE_LOG4J_RELOAD      Reload interval for the Log4J configuration file, in seconds (10).
OOZIE_HTTP_PORT         Oozie port number (11000).
OOZIE_ADMIN_PORT        Oozie admin port (11001).
OOZIE_HTTP_HOSTNAME     The Oozie host server name. Find it by running the hostname -f command.
OOZIE_BASE_URL          The base URL for action callback URLs to Oozie.
OOZIE_CHECK_OWNER       When set to TRUE, the Oozie setup/start/run/stop scripts verify that
                        the owner of the Oozie installation directory matches the user
                        executing the script. The default value is undefined.

328

Copyright 2014, Hortonworks, Inc. All rights reserved.

Configure the following when Oozie is using HTTPS (SSL):

Variable Name                 Description
OOZIE_HTTPS_PORT              The port the Oozie server SSL listener runs on (11443).
OOZIE_HTTPS_KEYSTORE_FILE     The keystore certificate file location (${OOZIE_HOME}/.keystore).
OOZIE_HTTPS_KEYSTORE_PASS     The keystore file password (password).

Copyright 2014, Hortonworks, Inc. All rights reserved.

329

Oozie Scripts

oozied.sh start: start the Oozie process as a daemon

oozied.sh run: start Oozie as a foreground process

oozied.sh stop: stop the Oozie process

Run the oozie-setup.sh script to manually configure Oozie with all the components
added to the libext/ directory.
$ bin/oozie-setup.sh prepare-war [-d directory] [-secure]
sharelib create -fs <FS_URI> [-locallib <PATH>]
sharelib upgrade -fs <FS_URI> [-locallib <PATH>]
db create|upgrade|postupgrade -run [-sqlfile <FILE>]

Manually create the Oozie metadata DB:

$ bin/ooziedb.sh create -sqlfile oozie.sql -run
Validate DB Connection.

330

Copyright 2014, Hortonworks, Inc. All rights reserved.

Examples:
Command                                                       Description
bin/oozied.sh start                                           Start Oozie as a daemon process.
bin/oozied.sh run                                             Start Oozie as a foreground process.
bin/oozie admin -oozie http://localhost:11000/oozie -status   Check the status of Oozie. The
                                                              status should be NORMAL.

Copyright 2014, Hortonworks, Inc. All rights reserved.

331

The Oozie CLI


The Oozie Command Line Interface (CLI) can perform pseudo/simple and Kerberos HTTP
SPNEGO authentication when requested by the Oozie server. The -doas option allows a
proxy user to run oozie commands if the proxy user is set.

332

Copyright 2014, Hortonworks, Inc. All rights reserved.

Oozie CLI commands:


Command                           Description
oozie job <OPTIONS>               Execute job operations such as start, kill, dryrun, resume, suspend, etc.
oozie jobs <OPTIONS>              Get the status of jobs.
oozie admin <OPTIONS>             Perform administration operations.
oozie validate <ARGS>             Validate a workflow XML file.
oozie pig <OPTIONS> -X <ARGS>     Submit a Pig job.
oozie hive <OPTIONS> -X <ARGS>    Submit a Hive job.
oozie info <OPTIONS>              Return detailed information about the listed options.
oozie mapreduce <OPTIONS>         Submit a MapReduce job.
oozie sla <OPTIONS>               SLA operations (deprecated as of Oozie 4.0).

Copyright 2014, Hortonworks, Inc. All rights reserved.

333

Using the Oozie CLI


Included in the distribution are a number of Oozie examples, found in the oozie-examples-*.tar.gz file.
Examples:
# Submit a job and leave it in PREP status
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -submit
job: 14-20131025454545-oozie-myjob

# Start a job that is in PREP status
$ oozie job -oozie http://localhost:11000/oozie -start 14-20131025454545-oozie-myjob

# Check the status of a job
$ oozie job -oozie http://localhost:11000/oozie -info 14-20131025454545-oozie-myjob

# Suspend a workflow
$ oozie job -oozie http://localhost:11000/oozie -suspend 14-20131025454545-oozie-myjob

334

Copyright 2014, Hortonworks, Inc. All rights reserved.

# Resume a suspended job
$ oozie job -oozie http://localhost:11000/oozie -resume 14-20131025454545-oozie-myjob

# Kill an Oozie job
$ oozie job -oozie http://localhost:11000/oozie -kill 14-20131025454545-oozie-myjob

# Perform a dryrun of a workflow job
$ oozie job -oozie http://localhost:11000/oozie -dryrun -config job.properties

# Set the Oozie system mode
$ oozie admin -oozie http://localhost:11000/oozie -systemmode [NORMAL|NOWEBSERVICE|SAFEMODE]

# Validate a workflow XML file
$ oozie validate myJob/myworkflow.xml

# Submit a MapReduce job
$ oozie mapreduce -oozie http://localhost:11000/oozie -config job.properties

Copyright 2014, Hortonworks, Inc. All rights reserved.

335

Submit Jobs through HTTP


You can submit an Oozie job over HTTP by specifying the -oozie property:
$ oozie pig -oozie http://host:8080/oozie
-file myscript.pig -config job.properties
-PINPUT=/user/me/in -POUTPUT=/user/me/out
-X -Dmapred.job.queue.name=UserQueue

Behind the scenes, a workflow.xml file is generated dynamically that contains a single
action. The action runs the script specified at the command line, and the job is
created and executed right away.

336

Copyright 2014, Hortonworks, Inc. All rights reserved.

Unit 14 Review
1. There are three types of Oozie jobs. They are _______________________ ,
___________________________and _______________________ jobs.
2. An Oozie __________________ provides a way to package multiple coordinator
and workflow jobs.
3. List three types of Oozie actions: ______________________________________
4. Set Oozie logging information in the ____________________________ file.

Copyright 2014, Hortonworks, Inc. All rights reserved.

337

Lab 14.1: Running an Oozie Workflow


Objective: Deploy and run an Oozie workflow
Successful Outcome: You will run an Oozie job that executes a Pig script and a Hive
script.
Before You Begin: SSH into node2.

Step 1: View the Raw Data


1.1. On node2, change directories to the oozielab folder:
# cd ~/labs/oozielab/

1.2. Unzip the archive in the oozielab folder, which contains a file named
whitehouse_visits.txt that is quite large:
# unzip whitehouse_visits.zip

1.3. View the contents of this file:


# tail whitehouse_visits.txt

This publicly available data contains records of visitors to the White House in
Washington, D.C.
Step 2: Load the Data into HDFS
2.1. Make a new directory in HDFS named whitehouse. (If you already have a
whitehouse folder in HDFS, delete it first):
# hadoop fs -rm -R whitehouse
# hadoop fs -mkdir whitehouse

338

Copyright 2014, Hortonworks, Inc. All rights reserved.

2.2. Use the hadoop fs -put command to copy the whitehouse_visits.txt file into the
whitehouse folder in HDFS, renaming the file visits.txt. (Be sure to enter this
command on a single line):
# hadoop fs -put whitehouse_visits.txt
whitehouse/visits.txt

2.3. Use the ls command to verify the file was uploaded successfully:
# hadoop fs -ls whitehouse
Found 1 items
-rw-r--r-3 root root 183292235 whitehouse/visits.txt

Step 3: Configure Oozie User Permissions


3.1. In Ambari, go to the Services page and Stop HDFS service.
3.2. Go to the HDFS page in Ambari, then scroll down and expand the Custom core-site.xml section.
3.3. The Oozie workflow you defined is going to be executed by the root user, so
root needs permission to communicate with the Oozie server. Add root to the
hadoop.proxyuser.oozie.groups property:

3.4. Click the Add Property... link and add two properties. Assign the
hadoop.proxyuser.root.hosts property to * and also the
hadoop.proxyuser.root.groups:

3.5. Click the Save button to save your changes to the HDFS config.
3.6. Start HDFS service.
Step 4: Deploy the Oozie Workflow
4.1. SSH into node2.
Copyright 2014, Hortonworks, Inc. All rights reserved.

339

4.2. View the file workflow.xml in /root/labs/oozielab.


4.3. How many actions are in this workflow? _____________
4.4. Which action will execute first? _________________
4.5. If the first action is successful, which action will execute next? ____________
4.6. To deploy this workflow, we need a directory in HDFS:
# hadoop fs -mkdir congress

4.7. Put congress_visits.hive and whitehouse.pig from the oozielab folder into
the new congress folder in HDFS.
4.8. Also, put workflow.xml into the congress folder.
4.9. If you look at the Hive action in workflow.xml, you will notice that it
references a file named hive-site.xml within the <job-xml> tag. This file
represents the settings Oozie needs to connect to your Hive instance, and the file
needs to be deployed in HDFS (using a relative path to the workflow directory).
Put hive-site.xml into the congress directory:
# hadoop fs -put /etc/hive/conf/hive-site.xml congress

4.10. Verify you have four files now in your congress folder in HDFS:
# hadoop fs -ls congress
Found 4 items
-rw-r--r--   3 root root    429   congress/congress_visits.hive
-rw-r--r--   3 root root   3509   congress/hive-site.xml
-rw-r--r--   3 root root    580   congress/whitehouse.pig
-rw-r--r--   3 root root   1623   congress/workflow.xml

Step 5: Define the OOZIE_URL Environment Variable


5.1. Although not required, you can simplify oozie commands by defining the
OOZIE_URL environment variable. From the command line, enter the following
command:
# export OOZIE_URL=http://node2:11000/oozie

Step 6: Define the Job Properties


340

Copyright 2014, Hortonworks, Inc. All rights reserved.

6.1. View the contents of job.properties in oozielab.


6.2. Notice the oozie.wf.application.path property points to the congress folder
in HDFS. This property is how you denote which Oozie job is going to execute.
6.3. Make sure the resourceManager and nameNode properties are defined
properly.
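The lab's job.properties file is already provided; as a rough sketch of what such a file can contain (the hostnames and ports below are assumptions, not necessarily the values used in this classroom environment):

nameNode=hdfs://node1:8020
resourceManager=node1:8050
oozie.wf.application.path=${nameNode}/user/root/congress
oozie.use.system.libpath=true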
Step 7: Run the Workflow
7.1. From the oozielab folder, run the workflow with the following command:
# oozie job -config job.properties -run

If successful, the job ID should be displayed at the command prompt.


Step 8: Monitor the Workflow
8.1. Point your Web browser to the Oozie Web Console:
http://node2:11000/

You should see your Oozie job in the list of Workflow Jobs:

8.2. Double-click on the Job Id to view the Job Info page:

Copyright 2014, Hortonworks, Inc. All rights reserved.

341

Notice you can view the status of each Action within the workflow.
Step 9: Verify the Results
9.1. Once the Oozie job is completed successfully, start the Hive Shell.
9.2. Run a select statement on congress_visits and verify the table is populated:
hive> select * from congress_visits;
...
WATERS      MAXINE              12/8/2010 17:00                    POTUS   OEOB   MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WATT        MEL                 12/8/2010 17:00                    POTUS   OEOB   MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WEGNER      DAVID      L        12/8/2010 16:46   12/8/2010 17:00  POTUS   OEOB   MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WILLOUGHBY  JEANNE     P        12/8/2010 17:07   12/8/2010 17:00  POTUS   OEOB   MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WILSON      ROLLIE     E        12/8/2010 16:49   12/8/2010 17:00  POTUS   OEOB   MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
YOUNG       DON                 12/8/2010 17:00                    POTUS   OEOB   MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
MCCONNELL   MITCH               12/14/2010 9:00                    POTUS   WH     MEMBER OF CONGRESS MEETING WITH POTUS.
Time taken: 1.082 seconds, Fetched: 102 row(s)

342

Copyright 2014, Hortonworks, Inc. All rights reserved.

RESULT: You have just executed an Oozie workflow that consists of a Pig script followed
by a Hive script.

ANSWERS:
Step 4.3: Two
Step 4.4: The Pig action named export_congress
Step 4.5: The Hive action named define_congress_table

Copyright 2014, Hortonworks, Inc. All rights reserved.

343

Unit 15: Monitoring HDP2 Services


Topics covered:

Ambari

Monitoring Architecture

Monitoring HDP2 Clusters

Ambari Web Interface

Ambari Services - HDFS

Ganglia

Ganglia Monitoring a Hadoop Cluster

Nagios

Nagios - Ambari Interface

Nagios UI

Monitoring JVM Processes

Understanding JVM memory

Eclipse Memory Analyzer

JVM Memory Heap Dump

Java Management Extensions (JMX)

344

Copyright 2014, Hortonworks, Inc. All rights reserved.

Ambari
The HDP install needs to get software from a YUM repository. A remote yum repository
can be used; however, a local copy of the HDP repository is usually set up so that
hosts inside the firewall can access it. Reference the Hortonworks documentation on
Deploying HDP In Production Data Centers with Firewalls for more information:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP2.0.6.0/bk_reference/content/reference_chap4.html
Database metastores are required for Ambari, Hive and Oozie. MySQL, Oracle or
PostgreSQL are recommended; Derby is the default.
HDP is certified and supported for running on virtual or cloud platforms (VMware
vSphere, Amazon Web Services and Rackspace).
The Hortonworks Sandbox (a pseudo-distributed deployment model) provides VMs for
VMware Fusion, VirtualBox and Hyper-V. Ambari is used to manage the Hadoop cluster
running in the sandbox. (www.hortonworks.com/sandbox)

Copyright 2014, Hortonworks, Inc. All rights reserved.

345

Ambari was first released in HDP 1.2. Ambari 1.4.1 is released with HDP2 and contains
additional functionality:

Ability to deploy and manage the Hadoop 2.0 stack using Ambari.

Capability to enable NameNode HA.

Support for enabling Kerberos based security for Hadoop 2.0 services.

Support to work with SSL enabled Hadoop daemons.

Support to work with web authentication enabled for Hadoop daemons.

Added support for JDK 7.

346

Copyright 2014, Hortonworks, Inc. All rights reserved.

Monitoring Architecture
[Diagram: a management node runs the Ambari Server (with a Postgres database), the Ganglia server (gmetad with RRDtool) and the Nagios server; every cluster node, including the ResourceManager and NameNode nodes, runs a Ganglia monitor (gmond) and an Ambari Agent.]

Monitoring Architecture
Ambari monitors Hadoop services including HDFS, HBase, Pig, Hive, etc. A service can
have multiple components (for example, HDFS has NameNode, Standby NameNode and
DataNode components). The terms node and host are used interchangeably.
The Ambari server has:

An agent interface for communicating with agents.

Agents installed on each host. Each agent sends heartbeats to the Ambari server and
receives commands in response to those heartbeats.

A database repository (usually Postgres or MySQL). The database maintains the Ambari
state in case of an Ambari server crash. Agents keep running during an Ambari server
crash and recover when the Ambari server is restored.

Each host runs a Ganglia monitor (gmond) that collects metrics, which flow to the
Ganglia server and then to the Ambari server.
Ambari Web sessions do not time out, so it is important to log out of the Ambari web
interface when you are done.

Copyright 2014, Hortonworks, Inc. All rights reserved.

347

Monitoring HDP2 Clusters


An HDP2 system can run standalone (in a single JVM), as a pseudo-distributed system
(all daemons on a single node), or as a fully distributed system. A distributed HDP2
system can run on two nodes or on 10,000+ nodes. A system like HDP2 that can scale
needs a monitoring infrastructure that can scale as the Hadoop cluster grows.
Ganglia provides the monitoring capability to generate real time graphs.
Nagios excels at monitoring and sending out alerts.
OpenStack is a cloud-computing project focused on providing Infrastructure as a Service
(IaaS). OpenStack is an open source cloud operating system. OpenStack provides the
capability to create and deliver cloud-computing services on standardized hardware.
Ambari is the Hadoop management interface with OpenStack. The OpenStack Savanna
project provides a way to deploy a Hadoop cluster on top of OpenStack.

348

Copyright 2014, Hortonworks, Inc. All rights reserved.

Advantages of Ambari:

Open source management system for Hadoop.

Also selected as Hadoop Management interface for OpenStack.

While Ambari is not the first management system for Hadoop, Ambari is an
excellent example of the innovation and accelerated development open source
delivers. Ambari has grown significantly in HDP 1.2, 1.3 and HDP2.

Start the Ambari Server on the node where it has been configured.
# ambari-server start

Access Ambari from the Ambari Web interface.


http://<AmbariHOSTSERVER>:8080

Copyright 2014, Hortonworks, Inc. All rights reserved.

349

Ambari Web Interface

[Screenshot: the Ambari Dashboard, with callouts for (1) the navigation header, (2) the dashboard view, (3) widgets and (4) service status, plus the Add Widget and gear controls.]

Ambari Web Interface


The Ambari Web interface is made up of a number of different components:
1. Navigation Header: Dashboard | Heatmaps | Services | Hosts | Admin.
2. The Dashboard View: made up of two different versions, both customizable.
3. Dashboard View, Widget version: shown above.
4. Dashboard View, Service Status: provides an overall view of status by color. A
colored rectangle shows the number of alerts; you can then drill down into more
detail.

350

Solid Green = Up and running

Blinking Green = Starting up

Solid Red = Down

Blinking Red = Stopping

Copyright 2014, Hortonworks, Inc. All rights reserved.

Widgets can be moved around screen (drag and drop). Hovering over a widget will
provide a summary. You can also:

Click on the X in right corner and delete widget.

Click on edit icon to modify an existing Widget.

Click on the gear icon (#5 in slide) and move to Classic Version. The gear allows
you to reset widgets to default and view metrics in Ganglia.

Zoom in for more detailed information.

Click on +Add (#6 in slide) icon to add a widget.

Copyright 2014, Hortonworks, Inc. All rights reserved.

351

Ambari Web Interface (cont.)

Ambari Web Interface (cont.)


Continued explanation of the Ambari Web Interface monitoring tool:
1. Heatmaps: Provide a graphical representation of the overall health with color
indicators. Hovering over a heatmap will display a popup. Select the Metric
dropdown to change the metric type. The default maximums can also be
changed.
2. Services: Provides details on services running in the Hadoop cluster.
2a. Maintenance options: The management header containing
Maintenance | Start | Stop sections is an easy way to start and stop a
service as well as perform smoke tests.
2b. Services Summary: Clicking on summary tab will give an overall
perspective of a specific service.
2c. Services Config: Clicking on the Services Config tab allows updates to
configurations for a specific service. Some services have quick links that drill
into logs, JMX, different UIs.

352

Copyright 2014, Hortonworks, Inc. All rights reserved.

3. Hosts: The Hosts view lets you drill down into a host to get detailed information
on the services running on that host. Actions are available to start, stop and
decommission. Hosts can be added with the +Add Hosts Wizard.
4. Admin: The Admin View supports user management and provides general
information.

User Management: Users can be added, dropped and assigned privileges.


There are two types of users: User and Admin. A User can view metrics. An
Admin has user privileges and can start and stop services, change
configurations, etc.

High Availability: NameNode HA can be set up. This option will start the
NameNode HA Wizard. The Wizard will walk you through defining the
Standby NameNode, and JournalNodes.

Enabling Kerberos Security: Enabling Kerberos Security will walk you


through creating principals and keytabs, etc.

Checking Stack and Component Versions: This screen allows you to see the
Hadoop software stack and the specific version installed.

Checking Service User Accounts and Groups: Display users and groups and
the services they own.

Copyright 2014, Hortonworks, Inc. All rights reserved.

353

Ganglia
Designed for monitoring and collecting large quantities of metrics of
federations of clusters

Ganglia
Ganglia was developed at Berkeley and is a BSD-licensed open source project. Berkeley
is known as a center of grid and high-performance environments. Ganglia was designed
and developed in an environment where large computing environments were the norm.
Ganglia was assumed to be running in extremely scalable environments where minimal
overhead and performance were a fundamental requirement. Ganglia was designed
from the very beginning to scale to cloud-sized networks. Therefore, Ganglia is an ideal
tool for monitoring Hadoop clusters that can grow to 10,000+ nodes per cluster.
Ganglia ships with a large number of metrics that can be accessed with visual graphs.
Ganglia has a plug-in to receive Hadoop metrics and can provide aggregate statistics for
the cluster as a whole. Ganglia also provides real-time graphing capabilities.

NOTE: HDP2 uses Ganglia 3.5 and Gweb 3.5.7.

354

Copyright 2014, Hortonworks, Inc. All rights reserved.

Ganglia Monitoring a Hadoop Cluster

Ganglia Monitors

Drill down to get detail on Ganglia Server

Ganglia Monitoring a Hadoop Cluster


Ganglia has three primary daemons: gmond, gmetad, and gweb.

gmond: The Ganglia Monitoring Daemon runs on each host to be monitored. Gmond
collects run-time statistics and polls metrics according to its own local
configuration file. Gmond can collect metrics on compute resources such as memory,
CPU, storage and networking, and can also collect metrics on active
processes/daemons. Hadoop can publish metrics to Ganglia in formats that Ganglia
understands.

Gmetad: The Ganglia Meta Daemon polls information from the gmond daemons
then collects and aggregates the statistics. RRDtool is a tool that stores metrics in
round robin databases.

Gweb: Ganglia Web is a PHP program that runs in an Apache web server that
provides visualization. The configuration file is conf.php.

Copyright 2014, Hortonworks, Inc. All rights reserved.

355

The Ganglia configuration file (gmetad.conf) is organized into sections that are defined
in curly braces. Section names and attributes are case insensitive. There are two
categories:

Host and cluster configurations.

Metrics collection and scheduling.

Ganglia monitoring is set up in the Hadoop hadoop-metrics.properties file. Examples of
property definitions include metric time periods and ports.
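As a sketch of how Hadoop daemons can be pointed at Ganglia through the metrics2 GangliaSink (the management host name and gmond ports below are assumptions):

# hadoop-metrics2.properties (fragment)
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
namenode.sink.ganglia.servers=mgmtnode:8661
datanode.sink.ganglia.servers=mgmtnode:8660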
Metrics (contexts) are broken into:

yarn: Containers, failures, and containers completed.

hdfs: Data block activity (reads, writes, replicates, verifications, removed).

hbase: Number of regions, memstore sizes, read and write requests, StoreFile index
sizes, block cache hit and miss ratios, and block cache memory available.

jvm: Memory, thread states, garbage collection, and logging events.

rpc: Number of operations, open connections, and processing times.

356

Copyright 2014, Hortonworks, Inc. All rights reserved.

Nagios
The Nagios primary configuration file (nagios.cfg) default location is the /etc/nagios
directory.
Key parameters:
Parameter      Description
log_file       The location of the nagios.log file (/usr/local/nagios/var/nagios.log).
nagios_user    The effective user that the Nagios process runs under (nagios).
nagios_group   The effective group that the Nagios process runs under (nagios).
status_file    /usr/local/nagios/var/status.dat holds the current status and downtime
               information.
temp_path      /tmp is the directory that Nagios uses as a scratch work area.
temp_file      /usr/local/nagios/var/nagios.tmp is used as a temporary file when
               updating status information.

After making any configuration changes, Nagios needs to be restarted.


# service nagios restart

To define a host so it is available to Nagios, define the following values in


/etc/nagios/objects/hadoop-hosts.cfg:
define host {
        use        linux-server
        host_name  xxxx.xxxx.xxxx
        alias      xxxx.xxxx.xxxx
        address    xxx.xxx.xxx.xxx
}

NOTE: HDP2 uses Nagios 3.5.0. Nagios is installed as part of the Ambari
install.

358

Copyright 2014, Hortonworks, Inc. All rights reserved.

Nagios UI

Nagios UI
Nagios can be accessed from the Ambari interface or from the server running Nagios.
Launch the Nagios UI on the server where it is running via http://localhost/nagios.

Copyright 2014, Hortonworks, Inc. All rights reserved.

359

Monitoring JVM Processes


JMX is a Java specification for providing self-describing monitoring and management
capabilities within a Java application. Since all the Hadoop daemons are Java based, it
makes sense for Hadoop to use this capability as one of its mechanisms for delivering
metrics data. All modern JVMs support JMX natively, so there are no additional
components to set up or install for Hadoop to produce metrics via JMX. JMX provides
metrics through components called MBeans. Each MBean can contain attributes and
operations. MBean attributes are used to expose metrics to the outside world.
There are many operational tools (for example, HP OpenView) that can connect to
JMX-compliant applications without having any specialized knowledge about those
applications; this is an example of the self-describing aspect of JMX. JMX metrics are
automatically enabled when the GangliaSink is used for metrics
(hadoop-metrics2.properties). You can also access JMX metrics via an HTTP interface
with http://<daemon-host-name>:<port>/jmx.
For example: http://namenode:50070/jmx.

360

Copyright 2014, Hortonworks, Inc. All rights reserved.

Viewing JVM heap dumps is only needed when a problem has to be examined in great
detail. What is nice about jmap is that it is available if necessary.
If the JVM is running out of memory, you can have a heap dump generated automatically
when an out-of-memory error occurs by setting the -XX:+HeapDumpOnOutOfMemoryError
option.

Copyright 2014, Hortonworks, Inc. All rights reserved.

361

Understanding JVM Memory


The memory for a JVM process is divided into the following areas.

[Diagram: the New (young) generation, containing Eden and two Survivor spaces; the Tenured (old) generation; and PermGen.]

Understanding JVM Memory


Hadoop Administrators need to have a fundamental understanding of JVM memory.
A JVM process is divided into three memory segments called generations. The
generations are divided into young, old, and permanent. The young generation is also
sometimes referred to as the new generation and the old generation is sometimes
referred to as the tenured generation.
During garbage collection a Java object will move from Young memory to Old memory.
The new generation memory area is divided into 3 sub-segments. These are called Eden,
Survivor Space I, and Survivor Space II.

Eden: Where an object is first created.

Survivor I: Object moves into Survivor I from Eden.

Survivor II: Object moves into Survivor II from Survivor I.

An object will move from the Survivor II memory area into the Old memory area.

362

Copyright 2014, Hortonworks, Inc. All rights reserved.

When the Old memory area fills up, a major garbage collection will occur which can
impact performance. This can impact YARN which is running mappers and reducers in
Containers.
The Xms option sets the initial size of the combined Young and Old generation memory
areas, and the Xmx option sets the maximum size of that combined area.
Hints can be provided for the Young and Old memory areas; the exact memory sizes are
determined by the JVM. The Young generation size can be initialized with the
-XX:NewSize argument, and the Old/Young ratio with -XX:NewRatio: a value of 2 makes
the Old generation memory area twice as big as the Young generation.
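For illustration, heap hints for the NameNode could be set in hadoop-env.sh along these lines; the sizes below are assumptions and should be tuned for the cluster.

# hadoop-env.sh (fragment; sizes are illustrative only)
export HADOOP_NAMENODE_OPTS="-Xms1024m -Xmx1024m -XX:NewSize=128m \
  -XX:+HeapDumpOnOutOfMemoryError ${HADOOP_NAMENODE_OPTS}"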

Copyright 2014, Hortonworks, Inc. All rights reserved.

363

Eclipse Memory Analyzer

Eclipse Memory Analyzer


The Eclipse Memory Analyzer (MAT) is a nice interface to use for viewing details on Java
heap dumps. The Eclipse MAT makes it easy to go out and view retained sizes of objects
and issues with garbage collection. A set of Eclipse plug-ins are used to access reference
objects on Java heap dumps. A Java heap dump is a point in time view of a Java object
graph.
[root@hdp2 ~]# jmap -F -dump:live,format=b,file=nnheap.bin
2719
Attaching to process ID 2719, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.45-b08
Dumping heap to nnheap.bin ...
Heap dump file created

364

Copyright 2014, Hortonworks, Inc. All rights reserved.

JVM memory heap dumps can also be viewed with commands. Use jps to get process id
(2719) and jmap to display.
# jps -l
# jmap -histo:live 2719 | head
# jmap -heap 2719
# jstat -gcutil 2719 5000

Note process id 2719 is the NameNode.

# jps
3855 ResourceManager
3096 SecondaryNameNode
3973 NodeManager
2719 NameNode
2645 QuorumPeerMain
2952 DataNode
3292 RunJar
3332 RunJar
4080 JobHistoryServer
4558 AmbariServer
3505 RunJar
4903 RunJar
3238 Jps
3789 Bootstrap

Copyright 2014, Hortonworks, Inc. All rights reserved.

365

JVM Memory Heap Dump

JVM Memory Heap Dump


It is pretty easy to create a JVM memory heap dump. The jmap command needs the
JVM heap dump file to be named and the process id of the JVM to be specified. You can
then click on the JVM heap dump to bring up Eclipse or open the file in Eclipse.
# $JAVA_HOME/bin/jstat -gcutil $(cat
/var/run/hadoop/yarn/yarn-yarn-resourcemanager.pid)

366

Copyright 2014, Hortonworks, Inc. All rights reserved.

[Sample jstat -gcutil output: utilization percentages for the survivor spaces (S0, S1), Eden (E), Old (O) and Permanent (P) areas, followed by the young collection count/time (YGC, YGCT), full collection count/time (FGC, FGCT) and total GC time (GCT).]

Field definitions in the jstat -gcutil command output:

Column   Description
S0       Survivor I memory heap utilization.
S1       Survivor II memory heap utilization.
E        Eden (young) memory heap utilization.
O        Old memory heap utilization.
P        Permanent memory heap utilization.
YGC      The number of young (Eden) space collections.
YGCT     The total time taken for the young (Eden) collections.
FGC      The number of old space collections.
FGCT     The total time taken for the old space collections.
GCT      The total time taken for garbage collection so far.

Copyright 2014, Hortonworks, Inc. All rights reserved.

367

Java Management Extensions (JMX)


DataNodes metrics provide:

Number of bytes written.

Number of blocks replicated.

Number of read requests from clients.
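For example, NameNode MBean attributes can be pulled over the HTTP JMX interface with a query such as the following (the qry value names a standard NameNode bean; the hostname is an assumption):

$ curl 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'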

368

Copyright 2014, Hortonworks, Inc. All rights reserved.

Unit 15 Review
1. The Dashboard View supports two different types of views, they are the
________________ and _________________ views.
2. The Ganglia primary daemons are _____________ , _____________ and
_______________.
3. The main Nagios configuration file is ______________________.
4. Use this Java JDK tool to create a JVM heap dump: ______________________
5. Use this Java tool to access JVM metrics: ___________________________

Copyright 2014, Hortonworks, Inc. All rights reserved.

369

Unit 16: Commissioning and


Decommissioning Nodes
Topics covered:

Decommissioning and Commissioning

Decommissioning Worker Nodes

Steps for Decommissioning a Worker Node

Decommissioning Node States

Steps for Commissioning a DataNode

Balancer

Balancer Threshold Setting

Running Balancer

Configuring Balancer Bandwidth

Lab 16.1: Commissioning DataNodes

370

Copyright 2014, Hortonworks, Inc. All rights reserved.

Architectural Review
Decommissioning/Commissioning nodes need to take the above into consideration.
Daemons and Processes running on a slave server can include additional frameworks.
Usually a slave server in the cluster runs both a DataNode and NodeManager daemon.
If running HBase, the slave server will also run a HBase Region Server. Additional
frameworks such as Accumulo, Storm, etc. will have their own client processes.
The ResourceTrackerServer is responsible for registering new nodes,
decommissioning/commissioning nodes.
NMLiveliness Monitor monitors live and dead nodes.
The NodesListManager manages the collection of valid and excluded nodes. The
NodesListManager reads the following local host configuration files. Lines that begin
with # are comments.

dfs.hosts: Names a file that contains a list of hosts that are permitted to connect
to the NameNode.

dfs.hosts.exclude: Names a file that contains a list of hosts that are not
permitted to connect to the NameNode.

Copyright 2014, Hortonworks, Inc. All rights reserved.

371

yarn.resourcemanager.nodes.include-path: Points to a file that has a list of nodes
accepted by the ResourceManager.

yarn.resourcemanager.nodes.exclude-path: Points to a file with a list of nodes that
are not to be accepted by the ResourceManager.

Run the refreshNodes option for the ResourceManager daemon to recognize the
changes:
# yarn rmadmin -refreshNodes

372

Copyright 2014, Hortonworks, Inc. All rights reserved.

Decommissioning and Commissioning Nodes


Hadoop clusters grow horizontally. Administrators periodically need to be able to add,
decommission and recommission DataNodes (slave servers). Adding a node to a cluster:

Adds more processing capability, because the cluster can run more Containers.

Adds more IOPS and storage capacity.

Makes the Hadoop cluster more resilient to failure and heavy workloads.

Copyright 2014, Hortonworks, Inc. All rights reserved.

373

Decommissioning Nodes
Although HDFS is designed to tolerate DataNode failures, this does not mean you can
just terminate DataNodes en masse with no ill effect. With a replication level of three
for example, the chances are very high that you will lose data by simultaneously
shutting down three DataNodes if they are on different racks. The way to decommission
DataNodes is to inform the NameNode infrastructure of the DataNode(s) to be taken
out of circulation, so that it can replicate the blocks to the rest of HDFS before taking the
node down.
With NodeManagers and Containers, Hadoop is more forgiving. If you shut down a
NodeManager that is running tasks, the ResourceManager will notice the failure and
reschedule the tasks on other nodes in the Cluster.
The decommissioning process is controlled by an exclude file. The exclude file lists the
nodes that are not permitted to connect to the cluster (master daemons).


The rules for whether a NodeManager may connect to the ResourceManager are
simple: a NodeManager may connect only if it appears in the include file and does not
appear in the exclude file. An unspecified or empty include file is taken to mean that all
nodes are in the include file.
For HDFS, the rules are slightly different. If a node appears in both the include and
exclude file, then it may connect, but only to be decommissioned.
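For reference, a host-list file contains one hostname per line. A hedged example of an exclude file (the hostname is hypothetical):

# /etc/hadoop/conf/dfs.exclude - hosts being taken out of service
node1.example.com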


Steps for Decommissioning a Node

1. Add the Node to the NameNode(s) dfs.exclude file.
2. Add the Node to the ResourceManager's exclude file.
3. Have the NameNode(s) & ResourceManager re-read the exclude files:
   $ /usr/lib/hadoop-hdfs/sbin/distribute-exclude.sh <exclude_file>
   $ /usr/lib/hadoop-hdfs/sbin/refresh-namenodes.sh
   $ yarn rmadmin -refreshNodes
4. Check the Cluster Web Console and NameNode Web UI for the Node status.
5. Wait for the Node to be listed as decommissioned.
6. Check the ResourceManager Web UI for the Node status.

Steps for Decommissioning a DataNode


Decommissioning follows a similar process in HDP2 as in HDP1. A node is decommissioned by being placed in the exclude file for the NameNodes and the ResourceManager.
Optionally, remove the Node from the NameNode(s) dfs.include file(s) if permanently removing the node from the cluster. (Also remove the Node from the ResourceManager include file.) Remove the node information from the slaves file as well.
The distribute-exclude.sh script copies the exclude file to all the NameNodes in the HDFS cluster. The refresh-namenodes.sh script refreshes all the NameNodes so they read the new exclude file.
The individual NameNode(s) Web UI can also be accessed for Node status. View the Cluster status to verify the DataNode has been decommissioned.
Check the Node using the Cluster Web Console from any NameNode:
http://<any_nn_host:port>/dfsclusterhealth.jsp

Check the Node's status in the ResourceManager Web UI:
http://<ResourceManager node>:8088

The NameNode Web UI can also be checked for each NameNode:
http://<NameNode node>:50070


Decommissioning Node States

Dead Nodes: The NameNode declares a DataNode dead when a heartbeat has not been received for a period of time. The default is roughly 10 minutes.

The time period is derived from: dfs.namenode.heartbeat.recheck-interval

The node remains listed as dead until it is removed from the dfs.include list AND the dfsadmin command is run to refresh the nodes (hdfs dfsadmin -refreshNodes).
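The decommission state of each DataNode can also be checked from the command line. The excerpt below is illustrative output (the host name and IP address are hypothetical):

$ hdfs dfsadmin -report
...
Name: 172.17.0.2:50010 (node1)
Decommission Status : Decommissioned
...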


Steps for Commissioning a Node

1. Add the new Node to the dfs.include file on the NameNode(s).
2. Add the new Node to the include file on the ResourceManager.
3. Have the NameNode(s) & ResourceManager re-read the include file with the new Node in it:
   $ /usr/lib/hadoop-hdfs/sbin/refresh-namenodes.sh
   $ yarn rmadmin -refreshNodes
4. Start the DataNode and NodeManager daemons.

Steps for Commissioning a Node


Verify the Hadoop software on the Node to be commissioned matches the other nodes. Make sure the DataNode process and the NodeManager daemon point to their NameNodes and ResourceManager respectively.
yarn.resourcemanager.nodes.include-path points to a file that has a list of nodes accepted by the ResourceManager.
Start the DataNode daemon:
su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh start datanode"

Start the NodeManager daemon:
su -l yarn -c "/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh start nodemanager"

Verify the new Worker Node (slave server) is running properly.


Check the new Node using the Cluster Web Console from any NameNode.
http://<any_nn_host:port>/dfsclusterhealth.jsp.


Check that the new Node appears in the ResourceManager Web UI.
http://<ResourceManager node>:8088

The NameNode Web UI can also be accessed for the NameNodes.


http://<NameNode node>:8020

Run the balancer if you want existing blocks to be redistributed onto the new DataNode. This ensures the HDFS cluster is able to leverage the processing and IOPS of the new DataNode.
The Balancer works with the NameNodes in the cluster to balance the cluster. Example of starting it as a daemon:
"$HADOOP_PREFIX"/bin/hadoop-daemon.sh --script "$bin"/hdfs start balancer [-policy <policy>]


Balancer


Balancer Threshold Setting


Running Balancer
Balancer can be run periodically as a batch job
Every 24 hours or weekly, for example

Balancer should be run after new nodes have been added to the cluster
Running the balancer is also useful if a client loads files into HDFS from a computer that is also a DataNode
One replica of each block will be placed on the local DataNode

To run the balancer:

hdfs balancer [-threshold <threshold>] [-policy <policy>]

Balancer runs until there are no blocks to move or until it has lost contact with the NameNode
Can be stopped with Ctrl+C



Configuring Balancer Bandwidth


The balancer should be run in a way that limits its impact on the cluster
You can limit the amount of bandwidth the Balancer can utilize for moving blocks
The bandwidth limit is set in bytes/sec
The default value is 1048576 (1 MB/sec)

To set the allowed bandwidth utilization (in hdfs-site.xml):

dfs.datanode.balance.bandwidthPerSec
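As a hedged illustration (the value shown is an example, not a recommendation), the limit can be raised in hdfs-site.xml or adjusted at runtime with dfsadmin:

<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>10485760</value>  <!-- 10 MB/sec -->
</property>

# Change the limit on the running DataNodes without a restart (bytes/sec):
$ su -l hdfs -c "hdfs dfsadmin -setBalancerBandwidth 10485760"

The dfsadmin setting lasts only until the DataNodes restart; after that the hdfs-site.xml value applies again.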



Unit 16 Review
1. Which property points to the file that contains the list of hosts allowed to
connect to the NameNode? _________________________________
2. Which property points to the file that contains the list of hosts not allowed to
connect to the NameNode? _________________________________
3. The ResourceManager also has include and exclude files. Which two properties
define where these two files are located? ____________________________
_______________________________________________________________
4. The rmadmin option is used to _______________________________________.


Lab 16.1: Commissioning & Decommissioning DataNodes

Objective: Commission a new DataNode to an existing cluster, and also decommission a node.
Successful Outcome: node4 is successfully added to your cluster as a DataNode,
and node1 is successfully decommissioned from the cluster.
Before You Begin: Open the Ambari UI.

Step 1: Commission a New DataNode.


1.1. Go to the Hosts page of Ambari.
1.2. Click on the host for node4.
1.3. Click on the Add Component dropdown menu and select DataNode:

1.4. Click Yes when the confirmation dialog appears.


1.5. Wait for the DataNode component to be installed. When the install is
complete, DataNode should appear in the list of Components on node4:

Step 2: Copy a Reasonably Large Directory to HDFS

2.1. Copy the /root/repo directory to HDFS.

Step 3: Start the DataNode


3.1. Click on the Action menu next to DataNode and choose Start.
3.2. Wait for the DataNode service to start.
Step 4: Restart Nagios (Optional, if there are only 3 DataNodes under HDFS service)
Step 5: Verify the Commissioned Node
5.1. Once the DataNode is started successfully on node4, go back to the Ambari
Dashboard. You should see 4 live DataNodes:


Step 6: Run the Balancer


6.1. Go to the NameNode UI. Notice it also shows 4 live DataNodes.
6.2. Click on Live Nodes link.
6.3. Notice that the new DataNode does not have any data blocks currently.
6.4. On node1, run following command to run the balancer process on your
cluster:
# su -l hdfs -c "hdfs balancer -threshold 1"

6.5. Wait a couple of minutes for the balancer to even out the block storage. You will see output at the command prompt as blocks get moved from one node to another:
INFO balancer.Balancer: 0 over-utilized: []
INFO balancer.Balancer: 1 underutilized: [BalancerDatanode[10.222.133.205:50010, utilization=0.40288804420577484]]
INFO balancer.Balancer: Need to move 173.74 MB to make the cluster balanced.
INFO balancer.Balancer: Decided to move 89.95 MB bytes from 10.170.202.246:50010 to 10.222.133.205:50010
INFO balancer.Balancer: Decided to move 151.79 MB bytes from 10.174.49.252:50010 to 10.222.133.205:50010
INFO balancer.Balancer: Will move 241.74 MB in this iteration
INFO balancer.Balancer: Moving block 1073742573 from 10.174.49.252:50010 to 10.222.133.205:50010 through 10.174.50.60:50010 is succeeded.
INFO balancer.Balancer: Moving block 1073742572 from 10.174.49.252:50010 to 10.222.133.205:50010 through 10.174.50.60:50010 is succeeded.
...

6.6. Refresh the Live Nodes page of the NameNode UI. Your node4 DataNode should now have blocks on it, and the number of blocks will gradually increase as the balancer continues to even out the block storage on your cluster.

NOTE: The balancer app will run for a long time. Just leave the process open
in your terminal window. If you need to perform any future tasks on node1,
just open a new terminal window.


Step 7: Decommission a DataNode


7.1. From the Hosts page of Ambari, click on the host for node1.
7.2. In the Components section, click on the Action menu next to DataNode and
select Decommission:

7.3. Click OK in the confirmation dialog, and wait for the decommissioning task to
complete.

NOTE: There is a small chance that the decommissioning task may fail due to a known bug in Hadoop 2.0 that occurs when the node contains a block belonging to a file whose replication factor is larger than the number of remaining DataNodes. The workaround is to locate and delete any files that have a replication factor larger than 3. See https://issues.apache.org/jira/browse/HDFS-5662 for more details.

Step 8: View the NameNode UI


8.1. Go to the NameNode UI. Notice that Decommissioning Nodes is 1 and Live
Nodes is still 4:


8.2. Click on Decommissioning Nodes and it will show that node1 is undergoing
the decommission process.
8.3. Go to the Live Nodes page of the NameNode UI. You will see that blocks are
gradually being copied from node1 to the other nodes. The Admin State of node1
is going to be either Decommission in Progress or Decommissioned. Refresh the
page until the status is Decommissioned.
8.4. Go back to the NameNode UI page. Notice you have 4 Live Nodes, and 1 of
them is Decommissioned:

Step 9: Stop the DataNode


9.1. Once the decommissioning process is complete, go back to the Hosts page for
node1 in Ambari.
9.2. In the Action menu next to DataNode, select Stop:

9.3. It will take several minutes to stop the DataNode process on node1.
9.4. From the Ambari Dashboard page, you should see 3/4 live DataNodes:


RESULT: You have now seen how to commission a new DataNode, and also how to run
the balancer tool to balance the blocks across a cluster once new DataNodes are
commissioned. You also have decommissioned one of the DataNodes from your cluster.


Unit 17: Backup and Recovery


Topics covered:


What should you backup?

HDFS Snapshots

HDFS Data Backups

HDFS Data Automation & Restore

Hive & Ambari

Lab 17.1: Using HDFS Snapshots


What should you backup?


HDFS data and configurations in your Hadoop cluster are the most important things to back up. Here we'll discuss backing up HDFS data, Hive, and Ambari. It is also important to back up cluster-related configuration files on the OS.


HDFS Snapshots

Create HDFS directory snapshots


Fast operation - only metadata affected
Results in .snapshot/ directory in the HDFS directory
Snapshots are named or default to timestamp
Directories must be made snapshottable
Snapshot Steps:
Allow snapshots on the directory:
hdfs dfsadmin -allowSnapshot foo/bar/
Create a snapshot of the directory and optionally provide a snapshot name:
hdfs dfs -createSnapshot foo/bar/ mysnapshot_today
Verify the snapshot:
hadoop fs -ls foo/bar/.snapshot
Compare two snapshots:
# hdfs snapshotDiff foo/bar firstsnapshot secondsnapshot
Difference between snapshot firstsnapshot and snapshot secondsnapshot under directory /user/root/foo/bar:
M       .
+       ./anotherfile.txt

HDFS Snapshots
Another major feature of Hadoop 2 is HDFS snapshots. Taking a snapshot is fast. As long as snapshotting is enabled on a particular directory, users with write permission to that directory can create as many snapshots as needed and remove them as needed.


HDFS Data - Backups


A typical backup flow, where each step runs only after the previous condition is fulfilled:

1. Perform an HDFS snapshot on the source cluster:
   hdfs dfs -createSnapshot /foo/ snapshot-name

2. distcp the new snapshot to the backup cluster:
   hadoop distcp -update -prbugp -m 16 \
     hdfs://original-host/foo/.snapshot/snapshot-name \
     hdfs://target-host/foo/

3. Snapshot the new data on the backup cluster:
   hdfs dfs -createSnapshot /foo/ snapshot-name

4. On success, run the enterprise retention policy cleanup; on failure, run the failure action.

The whole sequence can be orchestrated by Oozie!

HDFS Data - Backups


A typical strategy for backing up HDFS data to a remote Hadoop cluster is highlighted above. Oozie is the perfect tool to automate the entire process. Remember, with Oozie you can execute distcp actions and run shell scripts. For example, you can write a wrapper script to implement your enterprise's data retention policy, e.g. remove data older than 7 years nightly.
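A minimal sketch of such a wrapper script is shown below. It assumes snapshots are named snap-YYYYMMDD and were taken on the /foo directory; the directory, naming convention, and retention window are illustrative, not part of the course labs.

#!/bin/bash
# Delete snapshots of /foo older than KEEP_DAYS (illustrative retention policy).
DIR=/foo
KEEP_DAYS=30
cutoff=$(date -d "-${KEEP_DAYS} days" +%Y%m%d)

hdfs dfs -ls "${DIR}/.snapshot" | awk '{print $NF}' | awk -F/ '{print $NF}' | \
while read snap; do
  snapdate=${snap#snap-}                       # strip the snap- prefix
  if [[ "$snapdate" =~ ^[0-9]{8}$ ]] && [ "$snapdate" -lt "$cutoff" ]; then
    echo "Deleting snapshot $snap"
    hdfs dfs -deleteSnapshot "$DIR" "$snap"
  fi
done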


HDFS Data Automation & Restore


Whereas Oozie can be used to automate the backup process, oftentimes restoring data is a manual process. If you chose to perform offsite backups, you can restore from an offsite snapshot, or use your production cluster's own snapshots to restore.
Restoring a backup is fairly simple:
1. On your backup cluster, choose which snapshot to restore. We will move it to the target/production cluster last.
2. Now clean up your production cluster by removing or moving the production directory using either the -rm or -mv commands.
   a. The -rm option with -skipTrash will free up space in case your dataset is very large.
3. Move your pristine copy of the data to the target location via distcp, without the -update option. If your pristine copy is already on the target cluster, simply run a -mv command.
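The following is a minimal sketch of that restore, assuming hypothetical host names (backup-host, production-host) and the /foo directory used earlier:

# 1. Remove (or move aside) the damaged production directory.
hadoop fs -rm -r -skipTrash /foo

# 2. Copy the chosen snapshot from the backup cluster back into production.
hadoop distcp -prbugp -m 16 \
  hdfs://backup-host/foo/.snapshot/snapshot-name \
  hdfs://production-host/foo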


Hive & Ambari Backup


Backup Hive Metastore
We've already seen how to back up HDFS data, and Hive data is in HDFS. We also need to back up the Hive Metastore, the database that contains Hive's table definitions.
Since the Metastore is simply a relational database (MySQL, PostgreSQL, or Oracle), simply back up the Metastore database.
The Metastore database name by default is hive.
MySQL Backup
mysqldump hive > hive_backup.sql

PostgreSQL Backup
pg_dump hive > hive_backup.sql


Oracle Backup
[oracle]$ expdp hive/password schemas=hive directory=backups
dumpfile=hive_backup.dmp

Backup Hive data on the same cluster

Oftentimes, it's sufficient to back up HDFS data simply by making a physical copy on the same HDFS cluster. Here is an example of backing up a Hive table within the same HDFS cluster:
hadoop fs -cp /apps/hive/warehouse/mytable /backups/hive/mytable

Backup Hive data to a different cluster

Using the distcp command we saw earlier, you can also back up HDFS data to a different cluster:
hadoop distcp hdfs://sourcenode/apps/hive/warehouse/mytable \
hdfs://destnode/backups/hive/mytable

Backup Ambari
Take a backup of the following Ambari cluster configurations:
1. /etc/ambari-server
2. The ambari database in PostgreSQL
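A minimal sketch of those two backups is shown below, assuming the default embedded PostgreSQL database (database and user both named ambari); adjust for your own database and paths:

# 1. Archive the Ambari Server configuration directory.
tar czf /backups/ambari-server-conf.tar.gz /etc/ambari-server

# 2. Dump the ambari database from PostgreSQL.
pg_dump -U ambari ambari > /backups/ambari_backup.sql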


Lab 17.1: Using HDFS Snapshots

Objective: Understand how snapshots work in Hadoop.


Successful Outcome: The data folder has a snapshot taken of it.
Before You Begin: SSH into node1.

Step 1: Load a File into HDFS


1.1. Start by creating a new directory in HDFS:
# hadoop fs -mkdir data

1.2. Put a file in the new directory:


# hadoop fs -put ~/labs/constitution.txt data/

1.3. Run the fsck command on the file:


# hdfs fsck /user/root/data/constitution.txt -files -blocks
-locations

Select-and-copy the block ID.


1.4. Use the find command to identify the location of the block on your local file system. The command will look like the following, and you need to run it on node2 or node3:
# find / -name "blk_1073743186"

1.5. Which node and folder is the block stored in? ________________________
Step 2: Enable Snapshots
2.1. Now let's enable the /user/root/data directory for taking snapshots:

# su -l hdfs -c "hdfs dfsadmin -allowSnapshot


/user/root/data"

You should see the following confirmation:


Allowing snaphot on /user/root/data succeeded

Step 3: Create a Snapshot


3.1. Now create a snapshot of /user/root/data:
# hdfs dfs -createSnapshot /user/root/data ss01

You should see the following confirmation:


Created snapshot /user/root/data/.snapshot/ss01

3.2. Verify the snapshot was created by viewing the contents of the
data/.snapshot folder:
# hadoop fs -ls -R data/.snapshot
drwxr-xr-x   - root hadoop          0 data/.snapshot/ss01
-rw-r--r--   3 root hadoop      44841 data/.snapshot/ss01/constitution.txt

3.3. Try to delete the data folder:


# hadoop fs -rm -R data

You cannot delete the folder. Why not? _________________________________


Step 4: Delete the File
4.1. Delete the constitution.txt file in data:
# hadoop fs -rm data/constitution.txt

4.2. Use the ls command to verify the file is no longer in the data folder in HDFS.
4.3. Check whether the file still exists in /user/root/data/.snapshot/ss01. It
should still be there.


4.4. Run the same find command again that you ran in the earlier step. Does the
block file still exist on your local file system? _____________________________
Step 5: Recover the File
5.1. Let's copy this file from data/.snapshot/ss01 to the data directory.
# hadoop fs -cp data/.snapshot/ss01/constitution.txt data/

5.2. Run the fsck command again on data/constitution.txt. Notice that the block
and location information have changed for this file.
5.3. Run the find command for the new blocks. Notice the blocks for the
constitution.txt file appear in two locations on your local file system (before
deleting the file and after copying the file).

RESULT: This lab demonstrates how the snapshot process protects blocks from being deleted or edited, so the blocks are always available in case you need to recover your file in the future.

Answers:
Step 1.5: In a subfolder of: /hadoop/hdfs/data/current/
Step 3.3: Once a snapshot exists for a directory, the directory cannot be deleted until the snapshot itself is deleted.
Step 4.4: Yes


Unit 18: Rack Awareness and Topology
Topics covered:

Rack Awareness

YARN Rack Awareness

HDFS Replication

Rack Topology

Rack Topology Script

Configuring the Rack Topology Script

Lab 18.1: Configuring Rack Awareness


Rack Awareness
Rack awareness spreads block replicas across different racks to make sure if a rack
becomes unavailable (power failure, switch failure, etc.) all replicas for a block are not
lost. Rack awareness makes sure that all operations that involve rack placement
understand to spread the blocks across multiple racks. The NameNode makes the
decision where blocks are placed. Examples of block operations that are rack aware
include:

Inserts

Hadoop balancer

Decommissioning a datanode

For rack awareness, each data node is assigned to a rack. Each rack will have a unique
rack id. Rack ids are hierarchical and appear as path names.
If rack awareness is not configured, the entire Hadoop cluster is treated as if it were a single rack. Every DataNode will have a rack id of /default-rack. With the default behavior, data is loaded on a DataNode and then two other DataNodes are selected at random to make sure replicas are spread across multiple DataNodes.


YARN Rack Awareness


YARN is rack aware, similar to HDFS. The NameNode and ResourceManager get rack information about the worker nodes. An administrator-configured topology module exposes a resolve API that maps a node's DNS name or IP address to a rack id.


HDFS Replication
First replica is placed on the same rack as the client, if possible. If that
is not possible, it will be placed randomly.
Second replica is placed on a DataNode on another rack
Third replica is on another DataNode on the second rack

[Diagram: write pipeline across Rack 1 and Rack 2 - data and checksums are streamed to DataNodes on both racks, acknowledgements flow back, and the checksum is verified.]

Replica Placement
Rack awareness places a different priority on each replica. The assumption is that traffic within a rack is faster than traffic across racks.
The first replica is put on the DataNode that is closest to the Hadoop client; this is the rack the client is running on.
The second replica is placed on a different rack for high availability. This makes sure that if a rack fails, a replica of the block still exists.
The third replica is placed on the same rack as the second replica. Once the second replica is on a different rack, high availability has been taken care of, so the goal is simply to get the third replica onto another DataNode of that second rack.


Rack Topology
[Diagram: two-rack deployment - each rack has 2x ToR switches uplinked to aggregation switches and a KVM switch. Rack 1 hosts a staging node, the NameNode, HBase Master, Oozie Server, and DataNodes. Rack 2 hosts a staging node, the Standby NameNode, Secondary NameNode, ResourceManager, a management node (Ambari Server, Ganglia/Nagios, WebHCat Server, JobHistoryServer, HiveServer2), and DataNodes.]
Rack Topology
Rack topologies need to make sure there are no single points of failure.
There are a number of different ways to deploy rack topologies for Hadoop. The Top-of-Rack (ToR) architecture is popular because of its short cable runs and easy replication of rack configurations. As companies build out data centers, they deploy rack servers as the core building block, with ToR switches and cabling within the rack. Pod-based (containerized) modular designs are becoming very popular. A pod is a preconfigured system with compute, network and storage resources. A pod architecture's strength is integration and standardization.
Top-of-Rack does not mean the switches are literally at the top of the rack. The top of the rack is most common because of ease of access and cabling, but the switches can be anywhere in the rack.
This example uses the leaf-spine topology. Each ToR switch is a leaf and each aggregation switch is a spine. Scalability can be increased by designing a dual-tier aggregation layer. ToR switches in a rack can be connected to aggregation switches that provide interconnection to the rest of the data center.
Each rack should have two Top-of-Rack (ToR) Ethernet switches that are bonded. Two switches are used for scalability and availability.

Rack topologies have a number of advantages:

Modular design, making it easy to upgrade and add racks.

Support for moving from 10Gb network cards to 40Gb or higher.

Copper cabling stays entirely within the rack.


Rack Topology Script


Configuring the Rack Topology Script


The rack topology script is set in the core-site.xml file
The topology.script.file.name property sets the rack topology script
The topology script can be written in shell, Python, Java, etc.
Example:
#!/bin/bash
ipaddr=$1
rack=`echo $ipaddr | cut -f1-3 -d '.' `
if [ -z "$rack" ] ; then
echo -n "/default-rack"
else
echo -n "/$rack"
fi

rack-topology.sh

Configuring the Rack Topology Script


Setting the rack topology script:
<property>
<name>topology.script.file.name</name>
<value>/home/hadoop/scripts/rack-topology.sh</value>
</property>
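Before restarting HDFS, the script can be sanity-checked from the command line; the IP address below is just an example input:

$ bash /home/hadoop/scripts/rack-topology.sh 192.168.1.100
/192.168.1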


Unit 18 Review
1. Each rack has a _____________________ path name.
2. The priority of the second replica for rack aware is _______________________.
3. Rack topology is configured in the __________________________________ file.


Lab 18.1: Configuring Rack Awareness

Objective: Configure a Hadoop cluster to be rack-aware.


Successful Outcome: Each node in your cluster will be assigned to a rack.
Before You Begin: SSH into node1.

Step 1: View the Current Rack Awareness


1.1. Check out how many racks are being recognized by the cluster by running the
fsck command:
$ su -l hdfs -c "hdfs fsck / -racks"

Notice you only have one rack in your cluster:

1.2. Switch back to the root user.


Step 2: View the rack-topology Script


2.1. On node1, change directories to ~/labs.


2.2. View the contents of rack-topology.sh.sample, a sample rack topology script
provided for you:
#!/bin/bash
ipaddr=$1
rack=`echo $ipaddr | cut -f1-3 -d '.' `
if [ -z "$rack" ] ; then
echo -n "/default-rack"
else
echo -n "/$rack"
fi

Notice this script calculates the rack name using the IP address of the node. The
first three parts of the IP address become its rack name. For example: if
192.168.1.100 is the IP address, then the rack name would be /192.168.1.
Step 3: Configure the Rack Script
3.1. Copy the script to the /etc/hadoop/conf directory as rack-topology.sh:
# cp rack-topology.sh.sample /etc/hadoop/conf/rack-topology.sh

3.2. Stop HDFS. Edit core-site.xml and add the following properties:
topology.script.file.name=/etc/hadoop/conf/rack-topology.sh
topology.script.number.args=1

3.3. Start HDFS so your changes to core-site.xml take effect.


Step 4: Verify the Rack Awareness
4.1. Run the fsck command once again and this time you will see 4 racks:
# su -l hdfs -c "hdfs fsck / -racks"

4.2. You can also view current topology by using following command:


# su -l hdfs -c "hdfs dfsadmin -printTopology"


Rack: /172.17.0.2
172.17.0.2:50010 (node1)
Rack: /172.17.0.3
172.17.0.3:50010 (node2)
Rack: /172.17.0.4
172.17.0.4:50010 (node3)
Rack: /172.17.0.5
172.17.0.5:50010 (node4)

4.3. Run the fsck command. You should see 4 racks now:

RESULT: The nodes in your cluster are now each assigned to a rack, and the rack
assignment takes place automatically using the rack-topology.sh script. You can write
your own custom script for automatically determining the appropriate rack names for
your cluster nodes.


Unit 19: NameNode HA


Topics covered:

NameNode Architecture HDP1

NameNode High Availability

HDFS HA Components

Understanding NameNode HA

NameNodes in HA

Failover Modes

NameNode Architectures

hdfs haadmin Command

Red Hat HA

VMware HA

Lab 19.1: Configure NameNode High Availability using Ambari


NameNode Architecture HDP1


[Diagram: HDP1 NameNode architecture - a single NameNode (a single point of failure) holds the namespace (fsimage and edits log) and block management, and receives heartbeats from the DataNodes that store the replicated blocks in HDFS distributed storage.]

NameNode Architecture HDP1


The NameNode in HDP1 is a Single Point of Failure (SPOF). Two supported HA options for HDP1 are:

Red Hat HA for physical servers.

VMware HA for virtual servers running vSphere.

Red Hat and VMware HA are solutions that work well but there are reasons customers
want an HA solution built into HDP.
Both Red Hat and VMware HA:

Are crash recovery solutions.

Are 3rd party solutions that must be purchased.

Require expertise outside of Hadoop.


NameNode High Availability


The HDP HA architecture maintains high availability for the NameNode master daemon service. With HA, if the NameNode master daemon fails, the standby NameNode takes over as the active NameNode.
When any dependent service (YARN, HBase, etc.) detects a failure, it will pause, retry and recover once the HA stack resolves the issue. For example, when the Standby NameNode takes over after a failure, running applications recover and run to completion.
Dependent services (like the ResourceManager) automatically detect the failure or failover of the co-dependent component (the NameNode), and these dependent services pause, retry, and recover. For example, the ResourceManager does not launch new jobs or kill jobs that have been waiting on the NameNode.


HDFS HA Components
Hadoop HA clusters use nameservice IDs to identify an HDFS instance that may be made
up of multiple NameNodes. A NameNode ID is added. Each NameNode has a unique ID
in a HA cluster to make sure it is uniquely identified.
DataNodes send block reports and heartbeats to both the Active and Standby NameNodes to maintain consistency.
A ZooKeeper Quorum is used to coordinate data, perform update notifications and
monitor for failure. Each NameNode maintains a persistent session in ZooKeeper.
The ZooKeeper Quorum will:

Notify the Standby NameNode to become active during a failover.

Perform an Active NameNode election, giving the Standby NameNode an


exclusive lock so it can become the Active NameNode.


The ZKFailoverController (ZKFC) monitors and manages the state of the NameNode. The
Active and Standby NameNode will each run a ZKFC.
The ZKFC:

Monitors the health of the NameNode it is monitoring and manages its state of
being healthy or unhealthy.

Manages the NameNode session. A lock is maintained on the Active NameNode.

Performs elections to determine which NameNode should be active and if


necessary will perform a failover.

The Journal Nodes (JNs) make sure that a split-brain scenario (both NN writing at same
time) does not occur. The JNs make sure that only one NameNode can be a writer at a
time.
The Active NameNode writes records to the shared edits log. The Standby NameNode reads the edits log and applies the changes to itself. The Standby NameNode will read all edits before becoming active during a failover.
Currently there can only be one shared directory. The storage needs to support
redundancy to protect the metadata.
The Standby NameNode performs checkpoints. If upgrading from HDP1 to HDP2, the
previous Secondary NameNode can be replaced with the Standby NameNode.
An experimental shared storage solution is BookKeeper. BookKeeper can replicate edit
log entries across multiple storage nodes. The edit log can be striped across the storage
nodes for high performance. Fencing is supported in the protocol. The metadata for BookKeeper is stored in ZooKeeper. In the current HA architecture, a ZooKeeper cluster is already required for the ZKFC; the same cluster can be used for the BookKeeper metadata. Refer to the Apache BookKeeper project documentation for more information.
http://zookeeper.apache.org/bookkeeper/


Understanding NameNode HA
[Diagram: NameNode HA - an Active NameNode and a Standby NameNode, each paired with a ZKFC for automatic failover, coordinate through a ZooKeeper ensemble. The Active NameNode writes namespace edits to a quorum of JournalNodes and the Standby NameNode reads them. DataNodes send heartbeats and block reports to both NameNodes and store the replicated blocks in HDFS distributed storage.]

Understanding NameNode HA
NameNode High Availability (HA) has no external dependency.
NameNode HA has an active NameNode and a standby NameNode running in an active-passive relationship. If the active NameNode goes down, the passive NameNode becomes the active NameNode. If the failed NameNode restarts, it becomes the passive NameNode. The ZooKeeper FailoverController (ZKFC) maintains a lock on the active NameNode for a namespace.
On each platform running a NameNode service there will be an associated ZKFC. The ZKFC communicates with:

The NameNode service it is associated with. The ZKFC monitors the health and manages the HA state of the NameNode.

The ZooKeeper service.

The FailoverController (FC) monitors the health of the NameNode, Operating System (OS) and Hardware (HW). There is an active and a standby FailoverController. Heartbeats occur between the FailoverControllers (active and passive) and the ZooKeeper servers.

Recommendations:

Have three to five ZooKeeper daemons.

It is okay to run ZooKeeper daemons on the NameNode platforms (active and standby).

It is okay to run a ZooKeeper daemon on the ResourceManager platform.

It is recommended to keep the HDFS metadata and the ZooKeeper data on separate disks and controllers.


NameNodes in HA
Start the services in the following order:
1. JournalNodes
2. NameNodes
3. DataNodes
Always start the NameNode then its corresponding ZKFC.
The Active NameNode is determined by which NameNode starts first. If one NameNode is the preferred Active NameNode, always start it first.
The hdfs haadmin command is used to perform a manual failover.


There are two ways of sharing edit logs with NameNode HA:

Quorum-based storage (best practice)

Shared storage using NFS

The active NameNode writes the edits to the edits log. The Standby NameNode reads and applies the edits to maintain a consistent state. The current state is maintained with a quorum of JournalNodes.
Commands and scripts used to manage HA:
Format and initialize the HA state in ZooKeeper:
$ hdfs zkfc -formatZK

start-dfs.sh will start the ZKFC daemon when automatic failover is set up.
To manually start a ZKFC process:
$ hadoop-daemon.sh start zkfc


Failover Modes
The ZooKeeper FailoverController (ZKFC) processes monitor the health of the NameNodes for a namespace. The FailoverController facilitates the failover process and performs a fencing operation to make sure a split-brain scenario cannot occur.
A split-brain scenario occurs if both NameNodes think they are the active NameNode. The fencing operation makes sure one NameNode gets fenced off so it cannot be active. This protects the NameNode metadata so it does not become corrupt by two NameNodes writing at the same time.
The command below can be used to fail over from the active to the passive NameNode:
$ hdfs haadmin -failover <StandbyNN-To-Be> <ActiveNN-To-Be>


NameNode Architectures

HDP2 supports a number of different NameNode architectures, dependent upon the requirements and Service Level Agreements (SLAs):

A single NameNode with a Secondary NameNode is supported. A customer may not have an HA requirement or may have just upgraded to HDP2.

A cluster may not generate the workload or need federated NameNodes, but has HA requirements.

A cluster may need a federated NameNode configuration but not have a requirement for HA.

A cluster may need a federated NameNode configuration and may have a requirement for HA.

A cluster may not have an HA requirement for all federated NameNodes.


Examples of Failover Scenarios:

HDP Master service failure.

HDP Master JVM failure.

Hung HDP Master daemon or hung operating system.

HDP Master operating system failure.

Failures in virtual infrastructure (VMware).

Virtual machine failure.

ESXi host failure.

Failure of the NIC cards on ESXi hosts.

Network failure between ESXi hosts.


hdfs haadmin Command


Running the hdfs haadmin command without any additional arguments will display the
usage information.

To perform failover, be sure to use the hdfs haadmin -failover command.

Options include:

getServiceState: Determines whether the given NameNode is Active or Standby. This option is useful for administration scripts and cron jobs.

checkHealth: Checks the health of the given NameNode by connecting to it. The NameNode is capable of performing some diagnostics on itself, including checking whether internal services are running as expected. This command returns 0 if the NameNode is healthy, non-zero otherwise.
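A short illustration of both options is shown below; the NameNode IDs nn1 and nn2 are assumptions and should match the IDs defined by dfs.ha.namenodes.<nameservice> in your cluster:

$ hdfs haadmin -getServiceState nn1
active
$ hdfs haadmin -getServiceState nn2
standby

# Suitable for a cron-driven health check; exits non-zero if nn1 is unhealthy.
$ hdfs haadmin -checkHealth nn1 || echo "nn1 is unhealthy"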


Red Hat HA
[Diagram: Red Hat HA for the NameNode - two servers, each running a monitoring agent that monitors the NameNode, connected by a heartbeat and protected by power fencing; one server runs the active NameNode, the other a standby, and both access shared NameNode state.]

Red Hat HA
Red Hat Enterprise Linux (RHEL) HA cluster software is separate from the Hadoop
cluster. A power-fencing device is required (deals with split-brain scenario). A floating
IP is required for failover. RHEL HA cluster must be configured for the Hadoop master
servers that have high availability requirements.
Typically, the overall Hadoop cluster must include the following types of machines:

The RHEL HA cluster machines. These machines must host only those master
services that require HA (in this case, the NameNode and the ResourceManager).

Master machines that run other master services such as Hive Server 2, HBase
master, etc.


VMware HA
[Diagram: VMware HA - a vSphere HA cluster managed by VMware vCenter Server, with at least two ESXi hosts on shared storage. The NameNode, ResourceManager, and other master nodes each run in their own VM; the ESXi hosts exchange heartbeats, and on a host failure the affected VMs are restarted on another ESXi host.]

VMware HA
vSphere is VMware's virtualization platform, and VMware vCenter Server is VMware's central point of management.
A vSphere ESXi host can run multiple VMs. A vSphere HA/DRS cluster can be set up with multiple ESXi hosts. The ESXi hosts maintain heartbeats and communication so they understand which VMs are running in the vSphere HA cluster. If an ESXi host fails, HA will automatically start the failed VMs on another ESXi host in the vSphere HA cluster. If a VM fails on an ESXi host, VMware HA can restart the VM on another ESXi host in the HA cluster. The vSphere HA cluster must use shared storage.
A NameNode monitoring agent notifies vSphere if the NameNode daemon fails or becomes unstable. vSphere HA will trigger the NameNode VM to restart on the same ESXi host or on a different ESXi host, depending on the error. A monitoring agent needs to be set up to monitor any other HDP2 master nodes so vSphere HA is aware of the need to start the master node VM again.
It takes about five clicks to set up HA with vSphere, and vSphere will manage the HA environment automatically. When HA is enabled, a Fault Domain Manager (FDM) service is started on the ESXi hosts. The ESXi hosts hold an election and pick a master host. The master host manages the FDM environment.

There are a number of different options for configuring how vSphere HA/DRS high
availability works. It takes a fair about of expertise to setup a virtual HA environment
but once set up it works automatically.
When running a vSphere HA/DRS cluster a number of features of virtualization can be
leveraged.

vMotion: Transparently moves a VM to another ESXi host.

Distributed Resource Scheduling (DRS): Automatically moves a VM


(transparently) to another ESXi host to balance the workload across the vSphere
HA/DRS cluster.

Fault Tolerance: Two VMs across different ESXi hosts can stay synchronized in an active-passive relationship. If the active VM fails, the passive VM takes over. (Only supported with up to four vCPUs in vSphere 5.5.) Fault Tolerance has zero downtime.

vSphere Replication (VR) can perform VM replication across different sites (the hardware does not have to be an exact match between sites).
Site Recovery Manager (SRM) supports automatic failover to another site.
vSphere HA can protect against an ESXi host failure or the failure of applications running on the VM (vSphere Application Failover, in vSphere 5.5). vSphere HA also protects against VM failure, guest OS failure and network failures.

When using VMware HA:

The NameNode must run inside a virtual machine which is hosted on the
vSphere HA cluster.

The ResourceManager must run inside its own virtual machine which is hosted
on the vSphere HA cluster.

The vSphere HA cluster must include a minimum of two ESXi server machines.


Lab 19.1: Implementing NameNode HA

Objective: To configure and verify NameNode High Availability using Ambari.
Successful Outcome: Your cluster will have a Standby NameNode along with an Active NameNode.
Before You Begin: Open the Ambari UI.

Step 1: Stop HBase


1.1. Click on the Admin tab in Ambari, then select High Availability from the menu
on the left side of the page:

1.2. Click the Enable NameNode HA button. Notice on the first step of the wizard
that you get a warning about stopping HBase first:


1.3. Click the X in the upper-right corner to end the wizard.


1.4. Go to Services -> HBase and click the Stop button. Wait for HBase to stop.
Step 2: Enable NameNode HA
2.1. Go back to the Admin page and start the Enable NameNode HA Wizard
again.
2.2. Enter HACluster as the Nameservice ID for your NameNode HA cluster. (The
name must consist of one word and not have any special characters.) Click the
Next button:

2.3. Choose node4 as the Additional NameNode:


2.4. Click the Next button.


2.5. Notice that a Secondary NameNode is not allowed if you have NameNode HA
configured. Ambari takes care of this for you, as you can see on the Review step
of the wizard:

2.6. Click the Next button to continue the wizard.


Step 3: Perform the Manual Steps
3.1. Notice the Enable NameNode HA Wizard is requiring you to perform some
manual steps before you are able to continue. Start by SSH-ing into node1.
3.2. Put the NameNode in Safe Mode:
# sudo su -l hdfs -c 'hdfs dfsadmin -safemode enter'

3.3. Create a Checkpoint:


# sudo su -l hdfs -c 'hdfs dfsadmin -saveNamespace'

3.4. Once Ambari recognizes that your cluster is in Safe Mode and a Checkpoint
has been made, you will be able to click the Next button.
Step 4: Wait for the Configuration
4.1. At this point, Ambari will stop all services, install the necessary components,
and restart the services. Wait for these tasks to complete:


4.2. Once all the tasks are complete, click the Next button.
Step 5: Initialize the JournalNodes
5.1. On node1, enter the command shown in the wizard to initialize the
JournalNodes:
# sudo su -l hdfs -c 'hdfs namenode -initializeSharedEdits'

5.2. Once Ambari determines that the JournalNodes are initialized, you will be
able to click the Next button:

Step 6: Start the Components


6.1. In the next step, Ambari will start ZooKeeper and the NameNode service. Click
Next when it's complete:


Step 7: Initialize NameNode HA Metadata


7.1. On node1, enter the command shown in the wizard to configure the
metadata for automatic failover:
# sudo su -l hdfs -c 'hdfs zkfc -formatZK'

7.2. On node4, run the command to initialize the metadata for the new
NameNode:
# sudo su -l hdfs -c 'hdfs namenode -bootstrapStandby'

7.3. Click the Next button to continue.


Step 8: Wait for the Wizard to Finish
8.1. In this final step, the wizard will start all required services and delete the
Secondary NameNode:

8.2. Click the Done button when all the tasks are complete.

Step 9: Verify the Standby NameNode


9.1. Go to the HDFS service page. You should see an Active NameNode and a
Standby NameNode:

Step 10: Test NameNode HA.


10.1. Go to the Host page of node1 in Ambari.
10.2. Next to the Active NameNode component, select Stop from the Action
menu:

10.3. Go back to the HDFS page in Ambari. Notice the Standby NameNode has
become the Active NameNode:

10.4. Now start the stopped NameNode again, and you will notice that it becomes
a Standby NameNode:


RESULT: You now have NameNode HA configured on your cluster, and you have also
verified that the HA works when one of the NameNodes stops.


Unit 20: Securing HDP


Topics covered:

Security Concepts

Kerberos Synopsis

HDP Security Overview

Securing HDP Authentication

Securing HDP - Authorization

Lab 20.1: Securing a HDP Cluster


Security Concepts
Before implementing security in a Hadoop cluster, it's important to understand basic security concepts and terms.
Principal: A principal is any user or service that is performing an operation in the secured environment. A user principal is an interactive or unattended (system) user that logs into a secured environment and starts to interact with services. A service principal is a service that needs to perform operations in a secured environment.
Authentication: There are many authentication mechanisms available by which
principals can prove their credentials are trusted. Credentials can be username and
password, a key file or certificate of trust, or a combination of usernames and trust files.
The common authentication protocols are Kerberos, Plain Text, X.509, Digest, and many
others. The protocol that is used in Hadoop is Kerberos, an MIT open source project.


Authorization: Once a principal is authenticated, it needs to receive authorization to


perform the operations it wants. Authorizations can be provided by lookups into a file,
database, and directory services such as LDAP. In Hadoop, authorization is provided via
configuration files which will be discussed in detail below.


Kerberos Synopsis
Kerberos is a protocol that aims to provide an authentication and authorization system
to:
Prevent the need for passwords to be transferred over the network.
Still allow for users to enter passwords.
Allow a user to establish an authenticated session without having the need
to re-enter passwords for every operation.
To create that secure communication among its various components, Hadoop uses Kerberos. Kerberos is a third-party authentication mechanism, in which users and the services that users want to access rely on a third party - the Kerberos server - to authenticate each to the other. The Kerberos server itself is known as the Key Distribution Center, or KDC.
At a high level, it has three parts:

A database of the users and services (known as principals) that it knows about, and their respective Kerberos passwords.

An Authentication Server (AS) which performs the initial authentication and issues a Ticket Granting Ticket (TGT).

A Ticket Granting Server (TGS) that issues subsequent service tickets based on the initial TGT.

A user principal requests authentication from the AS. The AS returns a TGT that is
encrypted using the user principal's Kerberos password, which is known only to the user
principal and the AS. The user principal decrypts the TGT locally using its Kerberos
password, and from that point forward, until the ticket expires, the user principal can
use the TGT to get service tickets from the TGS. Service tickets are what allow a principal
to access various services.
Because cluster resources (hosts or services) cannot provide a password each time to
decrypt the TGT, they use a special file, called a keytab, which contains the resource
principal's encrypted credentials.
Kerberos Components

Key Distribution Center (KDC): The trusted source for authentication in a Kerberos-enabled environment.

Kerberos KDC Server: The machine, or server, that serves as the Key Distribution Center.

Kerberos Client: Any machine in the cluster that authenticates against the KDC.

Principal: The unique name of a user or service that authenticates against the KDC.

Keytab: A file that includes one or more principals and their keys.

Realm: The Kerberos network that includes a KDC and a number of Clients.


HDP Security Overview


HDP has a set of pre-defined Kerberos principals:

Service              Component                        Mandatory Principal Name
HDFS                 NameNode                         nn/$FQDN
HDFS                 NameNode HTTP                    HTTP/$FQDN
HDFS                 SecondaryNameNode                nn/$FQDN
HDFS                 SecondaryNameNode HTTP           HTTP/$FQDN
HDFS                 DataNode                         dn/$FQDN
MR2                  History Server                   jhs/$FQDN
MR2                  History Server HTTP              HTTP/$FQDN
YARN                 ResourceManager                  rm/$FQDN
YARN                 NodeManager                      nm/$FQDN
Oozie                Oozie Server                     oozie/$FQDN
Oozie                Oozie HTTP                       HTTP/$FQDN
Hive                 Hive Metastore, HiveServer2      hive/$FQDN
Hive                 WebHCat                          HTTP/$FQDN
HBase                MasterServer                     hbase/$FQDN
HBase                RegionServer                     hbase/$FQDN
ZooKeeper            ZooKeeper                        zookeeper/$FQDN
Nagios Server        Nagios                           nagios/$FQDN
JournalNode Server   JournalNode                      jn/$FQDN  [a]

[a] Only required if you are setting up NameNode HA.

To create the principal for a DataNode service, issue this command:

$ kadmin.local addprinc -randkey dn/$DataNode-Host@EXAMPLE.COM

Once principals are established in the KDC's database, keytab files can be extracted. Recall that a keytab is a key file that identifies a principal. Keytabs need to be installed on each appropriate host, wherever a service principal resides.
To extract a keytab file from an established principal:
$ kadmin.local xst -norandkey -k $keytab_file_name $primary_name/fully.qualified.domain.name@EXAMPLE.COM

$primary_name - the name of the principal
/fully.qualified.domain.name - the category (instance) to which the principal belongs
@EXAMPLE.COM - the Realm


Securing HDP Authentication


The easiest way to enable security in HDP is via Ambari. Your Security lab at the end of this section will walk you through the steps to secure your cluster.

Note:
Once authentication is set up, the next step is to set up mappings of Kerberos principals to local UNIX service accounts.
These mappings live in core-site.xml under the hadoop.security.auth_to_local property.
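A minimal illustration of such a mapping is shown below; it assumes the EXAMPLE.COM realm and the default principal names from the table above, and is not the exact value Ambari generates:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1@$0](nn@EXAMPLE.COM)s/.*/hdfs/
    RULE:[2:$1@$0](dn@EXAMPLE.COM)s/.*/hdfs/
    RULE:[2:$1@$0](rm@EXAMPLE.COM)s/.*/yarn/
    RULE:[2:$1@$0](nm@EXAMPLE.COM)s/.*/yarn/
    DEFAULT
  </value>
</property>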


Securing HDP - Authorization


Access Control Lists (ACLs) can also be set up in a Hadoop cluster. Authorization ACLs
can be set for:
HDFS
Jobs
HBase
Hive
ZooKeeper
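For HDFS, a quick illustration of setting an ACL is shown below; the path and user are hypothetical, and HDFS ACLs must first be enabled with dfs.namenode.acls.enabled=true in hdfs-site.xml:

# Grant the horton user read/execute access to /data without changing group ownership:
$ hdfs dfs -setfacl -m user:horton:r-x /data
# Verify the ACL:
$ hdfs dfs -getfacl /data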


Lab 20.1: Securing a HDP Cluster

Objective: To understand how to configure security for HDP.


Successful Outcome: The horton user will be authenticated and authorized to view
the contents of the /user folder in HDFS.
Before You Begin: SSH into node1.

Step 1: Create a New User


1.1. On node1 as root, create a new user named horton:
# useradd -g hadoop horton

1.2. Switch to the hdfs user and create a new directory in HDFS named
/user/horton.
1.3. Change ownership of /user/horton in HDFS to the horton user.
1.4. Exit out from the hdfs user, and switch to the horton user.
1.5. Check whether you can do a listing of the /user directory successfully:
$ hadoop fs -ls /user

1.6. Exit out from the horton user.

NOTE: The current cluster is not a secure cluster so you can easily do a
listing of the /user directory in HDFS successfully.

Step 2: Install Kerberos



2.1. As the root user on node1, install the following packages:


# yum -y install krb5-server krb5-libs krb5-auth-dialog
krb5-workstation

2.2. Login to node2, node3 and node4 and install the Kerberos client only:
# yum -y install krb5-workstation

Step 3: Configure Kerberos


3.1. On node1, edit the configuration file /etc/krb5.conf and modify the value for
kdc and admin_server to node1:
# vi /etc/krb5.conf
[realms]
EXAMPLE.COM = {
kdc = node1
admin_server = node1
}

3.2. Copy the /etc/krb5.conf file to all the other nodes:


# ~/scripts/distFile.sh /etc/krb5.conf /etc/krb5.conf

3.3. Enter the following command to create a Kerberos database using the
kdb5_util utility:
# kdb5_util create -s

During this step it will ask you to define a master key. Enter 1234 as the key.
Step 4: Start Kerberos
4.1. Start the KDC server by executing following commands:
# /etc/rc.d/init.d/krb5kdc start
# /etc/rc.d/init.d/kadmin start

Step 5: Run the Enable Security Wizard


5.1. In Ambari, go the Admin page and click on the Security link:


5.2. Click the Enable Security button.


5.3. Notice on the Get Started step of the Enable Security Wizard there are four
manual steps that must be completed first. We have already executed the first 2
steps. For the remaining 2 steps, click the Next button.
5.4. The default settings on the Configure Services step of the wizard are fine, so
click Next to continue.
Step 6: Create the Principals and Keytabs
6.1. (NOTE: This step does not work using Safari.) On the Create Principals and
Keytabs step of the wizard, all the required default settings for Kerberos are
shown in a tabular format. Click the Download CSV button at the bottom of the
page and save the file on your local machine:

6.2. Create a new file on node1 named /root/scripts/kerberos.csv and copy-and-paste the contents of the CSV file into kerberos.csv.
6.3. Run the pre-written script to create all required principals and keytabs. It will
ask for the location of the CSV file you created. Provide the full path to the file:
# /root/scripts/create_principals.sh

NOTE: This step will create all required principals and keytab files on all the
nodes. Once you are done with the step, go back to Ambari UI.

Step 7: Finish the Enable Security Wizard


7.1. We have completed all 4 required steps. Now it is time to enable security
through Ambari. Click the Apply button.
7.2. The Save and Apply Configuration step can take 10-15 minutes. When the
task is complete, click the Done button:

Step 8: Verify Security is Enabled


8.1. On node1, switch to the horton user and try to list the contents of /user in
HDFS:
# su - horton
$ hadoop fs -ls /user

The command should throw an error.


Step 9: Configure User Permissions
9.1. As the root user, add horton to the Kerberos database. Start the kadmin
service by typing following:
# kadmin.local

9.2. Type following command to create a horton principal:


kadmin.local: addprinc -randkey horton@EXAMPLE.COM

9.3. Now create a keytab file in the /etc/security/keytabs directory using the
following command:


kadmin.local: xst -norandkey -k /etc/security/keytabs/horton.headless.keytab horton@EXAMPLE.COM
kadmin.local: exit

9.4. Set appropriate permissions for the keytab file for the horton user:
# chown horton:hadoop
/etc/security/keytabs/horton.headless.keytab
# chmod 440 /etc/security/keytabs/horton.headless.keytab

9.5. Switch to the horton user and initialize the keytab file:
# su - horton
$ kinit -kt /etc/security/keytabs/horton.headless.keytab
horton@EXAMPLE.COM

9.6. Now try to list the contents of /user in HDFS again. This time you should be able to view the folder's contents!

RESULT: You have enabled Kerberos security for your HDP cluster.


Appendix A: Unit Review Answers


Unit 1 Review Answers
1. YARN and HDFS
2. False
3. Ambari
4. NameNode and ResourceManager are the primary master processes.

Unit 2 Review Answers


1. NameNode
2. 3
3. False. A file's data in HDFS never passes through the NameNode. Client
applications read and write directly from the DataNodes.
4. dfs.blocksize
5. The fsimage and edits files
Unit 5 Review Answers
1. Availability.
2. Defines the number of bytes for a checksum to be created (default is 512).
3. Block scanner.
4. Under-replicated blocks, Over-replicated blocks, Mis-replicated blocks (on the
same node), corrupt blocks.
5. The location option.
6. The Total Size and the Default Replication Factor.
7. To get a list of all the DataNodes in a cluster.


Unit 6 Review Answers


1. DataNode, NameNode or HDP client machine
2. NFS server
3. hdfs-site.xml
4. UID/GID

Unit 7 Review Answers


1. Map phase, shuffle/sort phase, and reduce phase
2. The number of Mappers is determined by the input splits.
3. You get to choose the number of Reducers.
4. ResourceManager, NodeManager and ApplicationMaster
5. It is up to the ApplicationMaster to request a new Container from the
ResourceManager and attempt the task again.
Unit 8 Review Answers
1. Capacity and Fairness
2. Capacity
3. root
4. 3 (A and B and the root parent queue)
5. The A queue is allocated 80% of the resources of the cluster.
6. Yes, because its maximum-capacity is set to 100%.
7. No, because queue A does not have a maximum-capacity configured, which
means elasticity is disabled for A


Unit 9 Review Answers


1. Data ingestion
2. Time-based and immutable
3. Big Data Refinery
4. The Batch layer, serving layer and speed layer.
5. hftp
6. source and destination (using a checksum CRC32)

Unit 10 Review Answers


1. Read and write operations.
2. dfs.webhdfs.enabled
3. Kerberos SPNEGO and Hadoop delegation tokens
4. Service

Unit 11 Review Answers


1. Metastore
2. External
3. WebHCat
4. True

Unit 12 Review Answers


1. 4 map tasks by default
2. The -m option is for specifying the number of mappers.
3. The $CONDITIONS value is used internally by Sqoop to specify LIMIT and OFFSET
clauses so the data can be split up amongst the map tasks

Unit 13 Review Answers


1. Event
2. Event driven
3. Custom
4. Interceptor

Unit 14 Review Answers


1. workflow, coordinator and bundle jobs
2. Bundle
3. Answers can include: Email, Shell, Pig, MapReduce, Hive, Sqoop, ssh, DistCp or
Custom
4. oozie-log4j.properties

Unit 15 Review Answers


1. Widget and Classic
2. gmond, gmetad and gweb
3. hadoop-cluster.cfg
4. jmap
5. JMX

Unit 16 Review Answers


1. dfs.hosts
2. dfs.hosts.exclude


3. yarn.resourcemanager.nodes.include-path and
yarn.resourcemanager.nodes.exclude-path
4. execute ResourceManager administration operations

Unit 18 Review Answers


1. Hierarchical
2. High Availability
3. core-site.xml


Appendix B: Other Hadoop Tools


Topics covered:

Data Lifecycle Management

Data Lifecycle Management on Hadoop

Falcon Use Cases and Capabilities

Role of Falcon in Hadoop

Knox

ZooKeeper

HBase

HCatalog

NameNode Federation


Data Lifecycle Management


In a world of increased governance and regulation, data lifecycle management is more
important than ever.
The data management lifecycle has several stages that the data goes through:

Discover

Design

Enable

Maintain

Archive


Data Lifecycle Management on Hadoop


Hadoop systems are integral to an organization's data lifecycle management.

Data management needs: data processing, replication, retention, scheduling, reprocessing, and multi-cluster management.

Tools involved: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive, and Pig.

Data Lifecycle Management on Hadoop


The data volumes, ingestion rates, and different types of data in Hadoop add to the
complexity of data lifecycle management for an organization. The individual tools in
Hadoop are excellent at their defined functionality, but an organized way of managing
all of this data still needs to be implemented.


Falcon Use Cases and Capabilities


Falcon
Falcon is a data lifecycle management framework for Apache Hadoop.
Falcon enables users to configure, manage, and orchestrate data motion,
disaster recovery, and data retention workflows in support of business
continuity and data governance use cases.

Falcon
Falcon provides the key services data processing applications need. Falcon manages
workflow and replication.
Falcon's goal is to simplify data management on Hadoop. It achieves this by providing
important data lifecycle management services that any Hadoop application can rely on.
Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a
proven, well-tested and extremely scalable data management system built specifically
for the unique capabilities that Hadoop offers.
Falcon also supports multi-cluster failover.

Falcon is not in the initial GA release of HDP2; it will be included in a following release of HDP2.


Future: Knox

Provide perimeter security.
Support authentication and token verification security scenarios.
Single URL to access multiple Hadoop services.
Enable integration with enterprise and cloud identity management environments.
Supports: WebHDFS, WebHCat, Oozie, HBase, and Hive.
Future: Knox
While not yet part of HDP, Knox is intended to provide perimeter security. It aims to
provide a single point of entry into a Hadoop cluster for a user to access different
services such as HDFS, YARN, Hive, Oozie. Knox can be installed in HDP 2 as an add-on. A
user authenticates once with the Knox service via Kerberos, while Knox itself handles
serving requests for that user inside the cluster.
For more information:
Hortonworks: http://hortonworks.com/hadoop/knox-gateway/
Apache: http://knox.incubator.apache.org/


ZooKeeper Synopsis
ZooKeeper is a service that provides configuration management, naming, distributed
synchronization, and group services. Various Hadoop services rely on ZooKeeper to
operate. In this Unit, we will focus on Administering ZooKeeper.
Centralized service for:
Configuration management: Services such as HBase use ZooKeeper extensively
for configuration management, such as a registry of all HBase nodes and tables.

Naming & group services: ZooKeeper can act as a naming service, similar to
what a DNS would provide. At the application level, you can use ZooKeeper as
a replacement for DNS. For example, if your application needs to resolve a
host name, that information can be maintained and provided by ZooKeeper to
the application.

Distributed synchronization (i.e., distributed transactions): You can implement a
two-phase commit (2PC), a common distributed transaction type, using ZooKeeper.


Components
An ensemble of ZooKeeper hosts: three hosts suffice for most clusters.
Ensembles are configured in odd numbers (3, 5, 7, etc.) because an odd-sized ensemble always has a clear majority and tolerates one more failure than the next-smaller even size: 5 ZooKeeper nodes allow for 2 failures, whereas 4 nodes allow for only 1, and both require the same majority.
The ensemble of hosts works together as a quorum; as long as a majority of them agree on an operation, the operation succeeds.

A leader host within the ensemble: ZooKeeper nodes self-elect a leader.

All operations are stamped with a sequential transaction id called a zxid. The zxid exposes a total ordering of operations.
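For reference, an ensemble is defined in zoo.cfg with one server.N entry per ZooKeeper host. The host names below are placeholders; the peer and leader-election ports (2888 and 3888) match the defaults listed later in this section:
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
Each host also needs a myid file in its data directory containing its own server number (1, 2, or 3).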

ZooKeeper Client
ZooKeeper ships with a command line client that allows you to perform file-system like
operations:
/usr/lib/zookeeper/bin/zkCli.sh

To connect to a ZooKeeper host:


[root@node1 ~]# /usr/lib/zookeeper/bin/zkCli.sh -server
node1:2181


To view available commands:


[zk: node1:2181(CONNECTED) 0] help
ZooKeeper -server host:port cmd args
connect host:port
get path [watch]
ls path [watch]
set path data [version]
rmr path
delquota [-n|-b] path
quit
printwatches on|off
create [-s] [-e] path data acl
stat path [watch]
close
ls2 path [watch]
history
listquota path
setAcl path acl
getAcl path
sync path
redo cmdno
addauth scheme auth
delete path [version]
setquota -n|-b val path

The simple client allows you to create znodes. More complex operations would be
performed programmatically. For more in-depth information and a programmer's guide,
visit:
http://zookeeper.apache.org/doc/r3.4.5/zookeeperProgrammers.html
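As a minimal illustration, the session below creates a znode, lists the root, reads the znode back, and deletes it. The /demo znode and its data are invented for this example; the command syntax matches the help output shown above:
[zk: node1:2181(CONNECTED) 1] create /demo mydata
[zk: node1:2181(CONNECTED) 2] ls /
[zk: node1:2181(CONNECTED) 3] get /demo
[zk: node1:2181(CONNECTED) 4] delete /demo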


Configuring ZooKeeper
To configure ZooKeeper, edit the configuration file, which contains the key
configuration properties mentioned above:
/etc/zookeeper/conf/zoo.cfg

Data & Logs


By default, the data and log directories are the same. However, in production, these
should be separated out. As with any logging strategy, throughput increases and latency
decreases when logs are directed to a log device or drive. Use the dataDir and
dataLogDir properties to separate these out.
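A minimal sketch of that separation in zoo.cfg might look like the following; the directory paths are placeholders and, in production, would point at different physical devices:
dataDir=/hadoop/zookeeper
dataLogDir=/zklogs/zookeeper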
Monitoring
Ambari does a good job of monitoring the liveness of each ZooKeeper host in the cluster.
To perform more detailed monitoring, you can use a set of commands, called the Four
Letter Words. Another way to monitor ZooKeeper is via JMX. Several MBeans are
exposed that allow you to get detailed statistics.


ZooKeeper Commands: The Four Letter Words


Below is an example of the ruok (are you okay?) command:
[root@node1 ~]# echo -n ruok | nc node1 2181
imok
[root@node1 ~]#

Notice that ZooKeeper echoed back imok. A full list of the four letter word commands is
provided below:
conf: Print details about server configuration.
cons: List full connection/session details for all clients connected to this server. Includes information on numbers of packets received/sent, session id, operation latencies, last operation performed, etc.
crst: Reset connection/session statistics for all connections.
dump: Lists the outstanding sessions and ephemeral nodes. This only works on the leader.
envi: Print details about serving environment.
ruok: Tests if the server is running in a non-error state. The server will respond with imok if it is running; otherwise it will not respond at all. A response of "imok" does not necessarily indicate that the server has joined the quorum, just that the server process is active and bound to the specified client port. Use "stat" for details on state with respect to quorum and client connection information.
srst: Reset server statistics.
srvr: Lists full details for the server.
stat: Lists brief details for the server and connected clients.
wchs: Lists brief information on watches for the server.
wchc: Lists detailed information on watches for the server, by session. This outputs a list of sessions (connections) with associated watches (paths). Note: depending on the number of watches, this operation may be expensive (i.e., impact server performance); use it carefully.
wchp: Lists detailed information on watches for the server, by path. This outputs a list of paths (znodes) with associated sessions. Note: depending on the number of watches, this operation may be expensive (i.e., impact server performance); use it carefully.
mntr: Outputs a list of variables that could be used for monitoring the health of the cluster.
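Any of these commands can be sent the same way as the ruok example above, by echoing the four-letter word to the client port. For example, to pull brief server details or the monitoring variables from node1:
# echo -n stat | nc node1 2181
# echo -n mntr | nc node1 2181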


Ports firewall considerations

Service | Servers | Port | Description
ZooKeeper Server | All ZK Nodes | 2888 | Peer-to-peer communication
ZooKeeper Server | All ZK Nodes | 3888 | Peer-to-peer leader election
ZooKeeper Server | All ZK Nodes | 2181 | Clients connect to this port


HBase Synopsis
HBase is a NoSQL database known as the Hadoop Database. Because HDFS is a resilient,
highly scalable distributed file system, HBase capitalizes on these characteristics by
persisting its data directly to HDFS.

HBase Architecture

[Architecture diagram: an HMaster and a three-node ZooKeeper ensemble (ZooKeeper1-3) coordinate multiple RegionServers, each co-located with a DataNode, with all data persisted in HDFS.]

Components
Rowkey: Data is always identified by a rowkey. A rowkey can be thought of as the
primary key found in relational databases: it is a unique key that identifies a row
in HBase. Rowkeys are always sorted lexicographically in ascending order within regions.
Region: A region is a collection of rows that is managed by one of the RegionServers.
RegionServer: The HBase worker node, a Java process that is co-located with the
DataNodes. A RegionServer can load HBase block files into memory for caching and scan
blocks locally, and is thus co-located with the data blocks that make up the regions it
manages.
HMaster: Responsible for HBase maintenance tasks such as load-balancing and
orchestrating recovery when a RegionServer fails. Since clients talk directly to
RegionServers, HBase can continue functioning if the HMaster goes down; however,
the HMaster should be restarted as soon as possible.
ZooKeeper: ZooKeeper handles all of the configuration management. Clients always talk
to ZooKeeper first to find the appropriate RegionServer to talk to.

Since HBase RegionServers have a data block cache, heap sizes for
RegionServers are often very large. It is recommended (resources permitting) to set
the heap for RegionServers to at least 8 GB.
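As a rough sketch, one place such a heap size can be set is hbase-env.sh; the 8 GB values below are only an example, and in an Ambari-managed cluster you would normally make this change through the HBase service configuration rather than by editing the file directly:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8g -Xmx8g"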


Configuring HBase
HBase configuration properties are described above.
Ports Firewall considerations:

Service | Servers | Port | Protocol | Description
HMaster | Masters | 60000 | | RegionServer communication with the HMaster
HMaster Info Web UI | Masters | 60010 | http | HMaster Web UI stats
RegionServer | RegionServers | 60020 | | Client to RegionServer, Master to RegionServer, and RegionServer to RegionServer communications
RegionServer | RegionServers | 60030 | http | RegionServer Web UI stats


HCatalog
HCatalog is Hive's table and storage management layer.

[Diagram: MapReduce, Pig, Hive, and streaming applications access data through HCatalog, which sits above storage formats such as ORC, RC, Text, Sequence, custom formats, and HBase.]

HCatalog
HCatalog allows the creation of schema definitions that will be accessed from
applications. This allows the schema definition to be outside of the application code.
HCatalog is a set of interfaces that provide access to Hive's metastore for different types
of applications.
HCatalog provides:

A shared schema and data type mechanism for Hadoop tools.

A table abstraction so that users need not be concerned with where or how their
data is stored.

Interoperability across data processing tools such as Pig, Map Reduce, and Hive.

A REST interface to allow language independent access to Hive's metadata.

The HCatalog CLI supports all Hive DDL that does not require MapReduce. HCatalog is
used to create, alter, and drop tables, and the HCatalog CLI supports commands like
SHOW TABLES and DESCRIBE TABLE.
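For example, DDL statements can be passed to the HCatalog CLI with its -e option. The table name below is hypothetical:
$ hcat -e "SHOW TABLES;"
$ hcat -e "DESCRIBE mytable;"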


NameNode Architecture HDP1

[Architecture diagram: a single NameNode holds the Namespace (fsimage) and Block Management (edits log) and receives heartbeats from DataNodes 1 through n, each storing replicated blocks in HDFS distributed storage.]

NameNode Architecture HDP1


The NameNode has two main areas:

Namespace: Contains information on directories, files and blocks. In HDP1


there is only one NameNode per cluster, therefore one namespace for the
cluster. The namespace supports the creation, deletion, modification, and
listing of files as well as directory operations.

Block Management: Locations of blocks, manages replicas of blocks, DataNode


cluster membership, heartbeats, etc. Supports the creation, deletion and
modification of block operations.

DataNodes handle all I/O, storage and block management on data node machines (slave
servers). Data blocks are replicated for high availability.
The HDP1 NameNode architecture scales to approximately 5,000 data nodes. One of
the big advantages of a Hadoop platform is the ability to coordinate data from all types
of different sources. Customers want to put more of their data into a central data lake
versus creating lots of Hadoop clusters.


In HDP1, the namespace volume = a single namespace + its block pool, and all tenants
shared that single namespace.
With a single namespace there is no isolation in a multi-tenant environment.
In HDP1, customer clusters can contain 4,500+ nodes, 100+ PB of storage, and 400+ million
files, and they keep growing bigger.

HDFS has over 7 9s of data reliability with less than 0.38 failures across 25
clusters.

One administrator can manage 1,000 to 3,000 nodes in a cluster.

HDFS offers fast repair time for disk failure or node failure. In HDFS, repairs can
occur in minutes versus RAID arrays where fixes can take hours.


Federating NameNodes
Hadoop clusters are increasing in size, workloads and complexity.

A typical large deployment at Yahoo! includes an HDFS cluster with 2,700-4,200
DataNodes and 180 million files and blocks, addressing ~25 PB of storage.

At Facebook, HDFS has around 2,600 nodes and 300 million files and blocks,
addressing up to 60 PB of storage.

The number of files in HDFS is limited by the amount of memory in a single NameNode,
and more RAM in a single machine creates more garbage collection issues.
Multiple NameNodes increase the total memory available, and therefore the number of
files that HDFS can store.

At many companies, the Hadoop cluster is used in a multi-tenant environment where
different business units share a cluster. One large application can impact all of the
tenants when they share a single namespace, as in HDP1.
In HDP1, many organizations run HBase in a separate cluster to make sure SLAs can be
met. In HDP2, HBase can run in its own namespace, eliminating the need for separate
clusters.


In a federated NameNode environment, the NameNodes are independent and do not need to coordinate with each other.
When a NameNode/namespace is deleted, the corresponding block pool at the DataNodes is deleted.
Each namespace volume is upgraded as a unit during a cluster upgrade.


NameNode Architecture HDP2

[Architecture diagram: federated NameNodes 1 through n, each with its own namespace (fsimage and edits log), block pool, and block management, receive heartbeats from the shared DataNodes 1 through n, which store blocks for all namespaces in HDFS distributed storage.]

NameNode Architecture HDP2


In HDP2, a Hadoop cluster can have a single NameNode managing a single namespace.
For scalability, HDP2 can configure a federation of NameNodes, each with its own
namespace. A federation is a group of objects that work together while each acts with
a level of independence. This allows an HDP2 cluster to scale to 10,000 DataNodes.
The NameNodes are federated, so each NameNode is independent; they do not need
to coordinate with each other. DataNodes register with each federated NameNode.
A ClusterID is a new identifier used to identify all nodes in the cluster.
Despite there being separate namespaces and block pools, a federated NameNode is
configured much like a single NameNode with a single namespace. Most changes
are in the DataNodes, configurations, and tools.
DataNodes in a federated cluster:

Register with all the NameNodes.

Send periodic heartbeats and block reports to all the NameNodes.

Send block received/deleted for a block pool to corresponding NameNode.


Namespace Volume
The NameServiceID is an identifier for coordinating a NameNode with its backup,
secondary, or checkpoint nodes. The NameServiceID is used to identify the set of nodes
associated with a namespace in the configuration files. The DataNode configuration
references all of the DataNodes in the cluster; DataNodes store blocks for all of the
namespace volumes, so there is no partitioning.
A federation of NameNodes is a simple design and required minimal changes to existing
NameNode code.
Separating the namespace and block management also allows block storage to become
a separate service. The namespace happens to be one of the applications that uses the
service. This opens up the potential of associating different types of services with block
storage. Examples:

HBase

New block categories can be created in the future to support different types
of garbage collection and optimization for different types of applications.

Foreign namespaces


Benefits of Independent Block Pools


Separating the block pools from namespace volumes allows the potential of block
management to move into separate independent nodes. Separate block pools allow
different types of application implementations to be simplified. Areas such as
distributed caches become easier to implement.
An independent namespace can generate Block IDs for new blocks without the need for
coordination with the other namespaces. The failure of a NameNode does not prevent
other NameNodes from servicing the DataNodes in the cluster.


Namespaces Increase Scalability


The HDP1 NameNode architecture scales to approximately 5,000 data nodes. One of
the big advantages of Hadoop platforms is the ability to coordinate data from all types
of different sources. Customers want to put more of their data into a central data lake
versus creating many Hadoop clusters. HDP2 NameNode architecture can scale to
10,000+ data nodes.
The scalability of multiple NameNodes takes HDP2 past 100,000 concurrent tasks, 200 PB of
storage, and 1+ billion files.


Configuring NameServices for DataNodes


<configuration>
<!-- Define the NameNodes (nameservices) in the cluster -->
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<!-- Server and port for NameNode1 -->
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>nn-host1:rpc-port</value>
</property>
<property>
<name>dfs.namenode.http-address.ns1</name>
<value>nn-host1:http-port</value>
</property>
<!-- Server and port for Secondary NameNode1 -->
<property>
<name>dfs.namenode.secondary.http-address.ns1</name>
<value>snn-host1:http-port</value>
</property>
<!-- Server and port for NameNode2 -->
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>nn-host2:rpc-port</value>
</property>
<property>
<name>dfs.namenode.http-address.ns2</name>
<value>nn-host2:http-port</value>
</property>
<!-- Server and port for Secondary NameNode2 -->
<property>
<name>dfs.namenode.secondary.http-address.ns2</name>
<value>snn-host2:http-port</value>
</property>
</configuration>

Configuring NameServices for DataNodes


There is a single configuration for all nodes in the cluster.
Formatting a NameNode requires specifying the ClusterId. Multiple NameNodes can be
formatted at the same time. ClusterIDs are auto-generated if not provided.
$HADOOP_PREFIX_HOME/bin/hdfs namenode -format [-clusterId
<cluster_id>]

When upgrading and going from a single namespace to multiple namespaces (NameNodes), run:
$HADOOP_PREFIX_HOME/bin/hdfs start namenode --config
$HADOOP_CONF_DIR -upgrade -clusterId <cluster_ID>

A NameNode can be added to an existing HDFS cluster. Update configuration files to


reflect the new NameNode(s). Run the refreshNamenodes command against all of the DataNodes:
$HADOOP_PREFIX_HOME/bin/hdfs dfsadmin -refreshNamenodes
<datanode_host_name>:<datanode_rpc_port>
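For instance, assuming a DataNode named node3 listening on the default DataNode IPC port of 50020, the refresh would look something like:
# hdfs dfsadmin -refreshNamenodes node3:50020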


The HDFS cluster can be started from any node as long as the HDFS configuration
information is available. The startup process starts the NameNodes, and the DataNodes
listed in the slaves file are also started.
$HADOOP_PREFIX_HOME/bin/start-dfs.sh
$HADOOP_PREFIX_HOME/bin/stop-dfs.sh

NameNode Configuration Parameters

dfs.namenode.rpc-address
dfs.namenode.servicerpc-address
dfs.namenode.http-address
dfs.namenode.https-address
dfs.namenode.keytab.file
dfs.namenode.name.dir
dfs.namenode.edits.dir
dfs.namenode.checkpoint.dir
dfs.namenode.checkpoint.edits.dir

Secondary Configuration Parameters

dfs.namenode.secondary.http-address
dfs.secondary.namenode.keytab.file

Backup Configuration Parameters

dfs.namenode.backup.address
dfs.secondary.namenode.keytab.file


Block Management with Federation


The Cluster Web Service can use any of the NameNode servers to monitor the cluster.
The Cluster Web Service displays:

Number of files

Number of blocks

Total storage capacity

Used and available storage information for the HDFS cluster.

You can run the Cluster Web Console from any NameNode:
http://<NameNodeHost>:<port>/dfsclusterhealth.jsp

NameNodes can be added and removed in a Federated cluster without restarting the
cluster.


Run the balancer, choosing either the Datanode or the Blockpool policy:

"$HADOOP_PREFIX"/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script "$bin"/hdfs start balancer [-policy <policy>]
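For example, to balance storage evenly across block pools rather than across DataNodes, you could pass the blockpool policy (the other supported policy, datanode, is the default):
"$HADOOP_PREFIX"/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script "$bin"/hdfs start balancer -policy blockpool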

When decommissioning a DataNode, the exclude file needs to be distributed to all


NameNodes.
"$HADOOP_PREFIX"/bin/distribute-exclude.sh <exclude_file>

To refresh all the NameNodes, run the following script:


"$HADOOP_PREFIX"/bin/refresh-namenodes.sh


Configuration Parameters

Property | Value (examples) | Description
dfs.nameservices | mycoolcluster | Logical name for the nameservice
dfs.ha.namenodes.[nameservice ID] | nn1,nn2 | NameNode IDs
dfs.ha.automatic-failover.enabled | true | Set cluster automatic failover on each NameNode using HA
dfs.ha.namenodes.mycoolcluster | nn1,nn2 | NameNodes in the cluster
ha.zookeeper.quorum | <Host1>:2181,<Host2>:2181,<Host3>:2181 | Define the ZooKeeper quorum
dfs.journalnode.edits.dir | /localdirpath/journalnode/ | Where JournalNodes store state data
dfs.namenode.rpc-address.mycoolcluster.nn1 | <Host1>:8020 | RPC address each NameNode listens on (set per NN)
dfs.namenode.http-address.mycoolcluster.nn1 | <Host1>:50070 | HTTP address each NameNode listens on (set per NN)
dfs.namenode.rpc-address.mycoolcluster.nn2 | <Host2>:8020 | RPC address each NameNode listens on (set per NN)
dfs.namenode.http-address.mycoolcluster.nn2 | <Host2>:50070 | HTTP address each NameNode listens on (set per NN)

Federation Configuration Parameters

Both the hdfs-site.xml and core-site.xml configuration files need to be modified to set
up HDFS HA. Examples:

hdfs-site.xml: dfs.ha.automatic-failover.enabled
core-site.xml: ha.zookeeper.quorum
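As a brief sketch, those two settings might look like the following; the ZooKeeper host names are placeholders:

hdfs-site.xml:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

core-site.xml:
<property>
<name>ha.zookeeper.quorum</name>
<value>node1:2181,node2:2181,node3:2181</value>
</property>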


Additional parameters need to be configured:

dfs.ha.fencing.methods: A set of scripts or Java classes that determine how to fence the Active NameNode during a failover.

sshfence: Uses SSH to connect to the Active NameNode and kill the Active NameNode process. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>

shell: Execute a shell script or command to fence the Active NameNode (a sketch follows).
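A minimal sketch of the shell option is shown below. The script path and its arguments are hypothetical; HDFS substitutes variables such as $target_host and $target_port with the details of the NameNode being fenced:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/fence-namenode.sh $target_host $target_port)</value>
</property>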

