Apache Hadoop

Contents include:
• Hadoop Quick Reference
• Hadoop Quick How-To
• Staying Current
• Hot Tips and more...

By Eugene Ciurana and Masoud Kalali
INTRODUCTION

What Is MapReduce?
MapReduce refers to a framework that runs on a computational cluster to mine large datasets. The name derives from the application of map() and reduce() functions repurposed from functional programming languages.
Implementation patterns

The Map(k1, v1) -> list(k2, v2) function is applied to every item in the split. It produces a list of (k2, v2) pairs for each call. The framework groups all the results with the same key together in a new split.

The Reduce(k2, list(v2)) -> list(v3) function is applied to each intermediate results split to produce a collection of values v3 in the same domain. This collection may have zero or more values. The desired result consists of all the v3 collections, often aggregated into one result file.
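The map, shuffle, and reduce steps described above can be sketched as a plain-Python word-count simulation. This is an illustration of the semantics only; none of these function names are Hadoop APIs:

```python
from itertools import groupby
from operator import itemgetter

# Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line.
def map_fn(_, line):
    return [(word, 1) for word in line.split()]

# Reduce(k2, list(v2)) -> list(v3): sum the counts for one key.
def reduce_fn(key, values):
    return [sum(values)]

def run_job(records):
    # Map phase: apply map_fn to every (k1, v1) record in the split.
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]
    # Shuffle phase: group all (k2, v2) pairs that share the same key.
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in group])
               for k, group in groupby(intermediate, key=itemgetter(0)))
    # Reduce phase: produce the v3 collection for each key.
    return {k: reduce_fn(k, vs) for k, vs in grouped}

result = run_job([(1, "a b a"), (2, "b c")])
# result == {"a": [2], "b": [2], "c": [1]}
```

In a real cluster the shuffle step is what the framework performs between the mapper and reducer nodes; here it is a sort plus groupby on a single machine.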
Hot Tip: http://hadoop.apache.org is the authoritative reference for all things Hadoop.

• ZooKeeper - a distributed application management tool for configuration, event synchronization, naming, and group services used for managing the nodes in a Hadoop computational network.
• Hive - structured data warehousing infrastructure that provides a mechanism for storage, data extraction, transformation, and loading (ETL), and a SQL-like language for querying and analysis.
• HBase - a column-oriented (NoSQL) database designed for real-time storage, retrieval, and search of very large tables (billions of rows/millions of columns) running atop HDFS.

Utilities
• Pig - a set of tools for programmatic flat-file data analysis that provides a programming language, data transformation, and parallelized processing.
• Sqoop - a tool for importing and exporting data stored in relational databases into Hadoop or Hive, and vice versa, using MapReduce tools and standard JDBC drivers.

• JobTracker: dispatches jobs and assigns splits to mappers or reducers as each stage completes
• TaskTracker: executes tasks sent by the JobTracker and reports status
• DataNode: manages HDFS content in the node and updates status to the NameNode

These daemons execute in the three distinct processing layers of a Hadoop cluster: master (Name Node), slaves (Data Nodes), and user applications.

Name Node (Master)
• Manages the file system name space
• Keeps track of job execution
• Manages the cluster
• Replicates data blocks and keeps them evenly distributed
• Manages lists of files, list of blocks in each file, list of blocks per node, and file attributes and other meta-data
• Tracks HDFS file creation and deletion operations in an activity log

Depending on system load, the NameNode and JobTracker daemons may run on separate computers.

Hot Tip: Although there can be two or more Name Nodes in a cluster, Hadoop supports only one Name Node. Secondary nodes, at the time of writing, only log what happened in the primary. The Name Node is a single point of failure that requires manual fail-over!

Data Nodes (Slaves)
• Store blocks of data in their local file system
• Store meta-data for each block
• Serve data and meta-data to the job they execute
• Send periodic status reports to the Name Node
• Send data blocks to other nodes required by the Name Node

Data nodes execute the DataNode and TaskTracker daemons described earlier in this section.

User Applications
• Set Hadoop runtime configuration parameters with semantics that apply to the Name or the Data nodes

A user application may be a stand-alone executable, a script, a web application, or any combination of these. The application is required to implement either the Java or the streaming APIs.

Hadoop Installation

Detailed instructions for this section are available at:
http://hadoop.apache.org/common/docs/current

• Ensure that Java 6 and both ssh and sshd are running in all nodes
• Get the most recent, stable release from http://hadoop.apache.org/common/releases.html
• Decide on local, pseudo-distributed, or distributed mode
• Install the Hadoop distribution on each server
• Set the HADOOP_HOME environment variable to the directory where the distribution is installed
• Add $HADOOP_HOME/bin to PATH

Follow the instructions for local, pseudo-clustered, or clustered configuration from the Hadoop site. All the configuration files are located in the directory $HADOOP_HOME/conf; the minimum configuration requirements for each file are:

• hadoop-env.sh — environmental configuration, JVM configuration, logging, master and slave configuration files
• core-site.xml — site-wide configuration, such as users, groups, sockets
• hdfs-site.xml — HDFS block size, Name and Data node directories
• mapred-site.xml — total MapReduce tasks, JobTracker address
• masters, slaves files — NameNode, JobTracker, DataNodes, and TaskTrackers addresses, as appropriate

Test the Installation

Log on to each server without a passphrase:
ssh servername or ssh localhost

Format a new distributed file system:
hadoop namenode -format

Start the Hadoop daemons:
start-all.sh

Check the logs for errors at $HADOOP_HOME/logs!

The official commands guide is available from:
http://hadoop.apache.org/common/docs/current/commands_manual.html

Usage
hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Generic Options
-conf <config file>                  App configuration file
-D <property=value>                  Set a property
-fs <local|namenode:port>            Specify a namenode
-jt <local|jobtracker:port>          Specify a job tracker; applies only to a job
-files <file1, file2, .., fileN>     Files to copy to the cluster (job only)
-libjars <file1, file2, .., fileN>   .jar files to include in the classpath (job only)
-archives <file1, file2, .., fileN>  Archives to unbundle on the computational nodes (job only)

$HADOOP_HOME/bin/hadoop precedes all commands.
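As an illustration of the minimum configuration files covered in the installation section, a single-node, pseudo-distributed setup might look like the sketch below. The property names are the classic pre-0.21 ones; the hostnames, ports, and replication value are assumptions for a local test box, not required defaults:

```xml
<!-- conf/core-site.xml: site-wide settings (assumed single-node values) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: one block replica, since there is only one Data Node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: JobTracker address -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```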
Administrator Commands

balancer -threshold 50           Cluster balancing at percent of disk capacity
daemonlog -getlevel host name    Fetch http://host/logLevel?log=name
datanode                         Run a new datanode
jobtracker                       Run a new job tracker
namenode -format | -regular | -upgrade | -finalize
                                 Format, start a new instance, upgrade from a previous
                                 version of Hadoop, or remove the previous version's files
                                 and complete the upgrade

HDFS shell commands apply to local or HDFS file systems and take the form:
hadoop dfs -command <args>

An application has one or more mappers and reducers and a configuration container that describes the job, its stages, and intermediate results. Classes are submitted and monitored using the tools described in the previous section.
The implementation itself uses standard Java text manipulation tools; you can use regular expressions, scanners, whatever is necessary.

Hot Tip: There were significant changes to the method signatures in Hadoop 0.18, 0.20, and 0.21 - check the API documentation for your target version.

The Mapper

#!/usr/bin/gawk -f
{
    for (n = 2; n <= NF; n++) {
        # Strip punctuation from the field; targeting $n (rather than $0)
        # avoids re-splitting the record in the middle of the loop
        gsub(/[,:;)(|!\[\]\.\?]|--/, "", $n);
        if (length($n) > 0) printf("%s\t%s\n", $n, $1);
    }
}
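The same mapper can also be written in Python for the streaming API. This sketch (not part of the original card) mirrors the gawk script above: field 1 is treated as the key, and every remaining field is emitted as word<TAB>key after stripping punctuation:

```python
import re
import sys

# Same punctuation class as the gawk mapper's gsub() pattern.
PUNCT = re.compile(r"[,:;)(|!\[\]\.\?]|--")

def map_line(line):
    """Emit one "word\tkey" record per cleaned word in the line."""
    fields = line.split()
    out = []
    for word in fields[1:]:          # field 1 is the key, skip it
        word = PUNCT.sub("", word)   # strip punctuation, like gsub()
        if word:                     # drop fields that were pure punctuation
            out.append("%s\t%s" % (word, fields[0]))
    return out

if __name__ == "__main__":
    # Hadoop Streaming feeds splits on stdin and reads records from stdout.
    for line in sys.stdin:
        for record in map_line(line):
            print(record)
```

Run it under streaming the same way as the gawk version, passing the script with -files and naming it as the -mapper.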
Performance Tradeoff

Hot Tip: The streamed awk invocation and the Java implementation are functionally equivalent, and the awk version is only about 5% slower. This may be a good tradeoff if the scripted version is significantly faster to develop and is continuously maintained.
STAYING CURRENT
DZone, Inc.
140 Preston Executive Dr., Suite 100
Cary, NC 27513
ISBN-13: 978-1-934238-75-2
ISBN-10: 1-934238-75-9