
UNIT IV PROGRAMMING MODEL

Open source grid middleware packages - Globus Toolkit (GT4) Architecture, Configuration - Usage of Globus - Main components and Programming model - Introduction to Hadoop Framework - MapReduce, Input splitting, map and reduce functions, specifying input and output parameters, configuring and running a job - Design of Hadoop file system, HDFS concepts, command line and Java interface, dataflow of File read & File write.

4.1.1 OPEN SOURCE GRID MIDDLEWARE PACKAGES:

The Open Grid Forum and the Object Management Group are two well-established organizations behind the grid standards.

Middleware is the software layer that connects software components. It lies between the operating system and the applications.

Grid middleware is a specially designed layer between hardware and software that enables the sharing of heterogeneous resources and the management of virtual organizations created around the grid.

The popular grid middleware packages are:

1. BOINC - Berkeley Open Infrastructure for Network Computing.

2. UNICORE - Middleware developed by the German grid computing community.

3. Globus (GT4) - A middleware library jointly developed by Argonne National Lab., Univ. of Chicago, and USC Information Science Institute, funded by DARPA, NSF, and NIH.

4. CGSP in ChinaGrid - The CGSP (ChinaGrid Support Platform) is a middleware library developed by 20 top universities in China as part of the ChinaGrid Project.
5. Condor-G - Originally developed at the Univ. of Wisconsin for general
distributed computing, and later extended to Condor-G for grid job
management.

6. Sun Grid Engine (SGE) - Developed by Sun Microsystems for business grid applications. Applied to private grids and local clusters within enterprises or campuses.

7. gLite - Born from the collaborative efforts of more than 80 people in 12 different academic and industrial research centers as part of the EGEE Project, gLite provides a framework for building grid applications that tap into the power of distributed computing and storage resources across the Internet.

4.1.2 GLOBUS TOOLKIT ARCHITECTURE (GT4):

The Globus Toolkit is an open middleware library for the grid computing
communities.
These open source software libraries support many operational grids and their
applications on an international basis.
The toolkit addresses common problems and issues related to grid resource
discovery, management, communication, security, fault detection, and portability.
The software itself provides a variety of components and capabilities.
The library includes a rich set of service implementations.
The implemented software supports grid infrastructure management, provides tools for building new web services in Java, C, and Python, builds a powerful standards-based security infrastructure and client APIs (in different languages), and offers comprehensive command-line programs for accessing various grid services.
The Globus Toolkit was initially motivated by a desire to remove obstacles that
prevent seamless collaboration, and thus sharing of resources and services, in
scientific and engineering applications.
The shared resources can be computers, storage, data, services, networks, science
instruments (e.g., sensors), and so on.
Functionalities of GT4:
Global Resource Allocation Manager (GRAM) - Grid Resource Access and Management (HTTP-based)

Communication (Nexus) - Unicast and multicast communication

Grid Security Infrastructure (GSI) - Authentication and related security services

Monitoring and Discovery Service (MDS) - Distributed access to structure and state information

Health and Status (HBM) - Heartbeat monitoring of system components

Global Access of Secondary Storage (GASS) - Grid access of data in remote secondary storage

Grid File Transfer (GridFTP) - Inter-node fast file transfer

Figure: Globus Toolkit GT4 supports distributed and cluster computing services

The GT4 Library

GT4 offers the middle-level core services in grid applications.
The high-level services and tools, such as MPI, Condor-G, and Nimrod/G, are developed by third parties for general-purpose distributed computing applications.
The local services, such as LSF, TCP, Linux, and Condor, are at the bottom level and are fundamental tools supplied by other developers.
Globus Job Workflow
A typical job execution sequence proceeds as follows: The user delegates his
credentials to a delegation service.
The user submits a job request to GRAM with the delegation identifier as a
parameter. GRAM parses the request, retrieves the user proxy certificate from the
delegation service, and then acts on behalf of the user.
GRAM sends a transfer request to the RFT, which applies GridFTP to bring in the
necessary files.
GRAM invokes a local scheduler via a GRAM adapter and the SEG initiates a set
of user jobs.
The local scheduler reports the job state to the SEG. Once the job is complete,
GRAM uses RFT and GridFTP to stage out the resultant files.

Figure: Globus job workflow among interactive functional modules.

Client-Globus Interactions:

There are strong interactions between provider programs and user code.
GT4 makes heavy use of industry-standard web service protocols and mechanisms
in service description, discovery, access, authentication, authorization, and the
like.
GT4 makes extensive use of Java, C, and Python to write user code.
Web service mechanisms define specific interfaces for grid computing.
Web services provide flexible, extensible, and widely adopted XML-based
interfaces.
These demand computational, communication, data, and storage resources.
We must enable a range of end-user tools that provide the higher-level capabilities
needed in specific user applications.
Developers can use these services and libraries to build simple and complex
systems quickly.

Figure: Client and GT4 server interactions; vertical boxes correspond to service
programs and horizontal boxes represent the user codes.

4.1.3 USAGE OF GLOBUS


Required software
To build the Globus Toolkit from the source installer, first download the source from the download page, and be sure you have all of the following prerequisites installed. The list below shows specific package names (where available) for systems supported by GT 6.0:
Prerequisite: C Compiler
  Reason: Most of the toolkit is written in C, using C99 and POSIX.1 features and libraries.
  RedHat-based systems: gcc
  Debian-based systems: gcc
  Solaris 11: pkg:/developer/gcc-45 or Solaris Studio 12.3
  Mac OS X: XCode

Prerequisite: GNU or BSD sed
  Reason: Standard sed does not support long enough lines to process autoconf-generated scripts and Makefiles.
  RedHat-based systems: sed
  Debian-based systems: sed
  Solaris 11: pkg:/text/gnu-sed
  Mac OS X: (included in OS)

Prerequisite: GNU Make
  Reason: Standard make does not support long enough lines to process autoconf-generated makefiles.
  RedHat-based systems: make
  Debian-based systems: make
  Solaris 11: pkg:/developer/build/gnu-make
  Mac OS X: (included in XCode)

Prerequisite: OpenSSL 0.9.8 or higher
  Reason: GSI security uses OpenSSL's implementation of the SSL protocol and X.509 certificates.
  RedHat-based systems: openssl-devel
  Debian-based systems: libssl-dev
  Solaris 11: pkg:/library/security/openssl
  Mac OS X: (included in base OS)

Prerequisite: Perl 5.10 or higher
  Reason: Parts of GRAM5 are written in Perl, as are many test scripts.
  RedHat-based systems: perl
  Debian-based systems: perl
  Solaris 11: pkg:/runtime/perl-512
  Mac OS X: (included in base OS)

Prerequisite: pkg-config
  Reason: Parts of GRAM5 are written in Perl.
  RedHat-based systems: pkgconfig
  Debian-based systems: pkg-config
  Solaris 11: pkg:/developer/gnome/gettext
  Mac OS X: Download and install from freedesktop.org source packages

The Globus Alliance receives support from government funding agencies. In a time of funding scarcity, these agencies must be able to demonstrate that the scientific community is benefiting from their investment.
To this end, we want to provide generic usage data about such things as the following:
How many people use GridFTP.
How many jobs run using GRAM.

To this end, we have added support to the Globus Toolkit that will allow installations
to send us generic usage statistics.

By participating in this project, you help our funders to justify continuing their
support for the software on which you rely.
4.2.1 INTRODUCTION TO HADOOP FRAMEWORK

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

A Hadoop framework-based application works in an environment that provides distributed storage and computation across clusters of computers.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

Hadoop framework includes following four modules:

Hadoop Common:

These are Java libraries and utilities required by other Hadoop modules.

These libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

Hadoop YARN:

This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS):

A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce:

This is a YARN-based system for parallel processing of large data sets.

The following diagram (Fig. 1) depicts these four components of the Hadoop framework.

Fig. 1: The Hadoop framework

Hadoop Framework Tools:


The Apache Hadoop project develops open-source software for reliable, scalable, distributed
computing, including:
Hadoop Core, our flagship sub-project, provides a distributed file system (HDFS)
and support for the MapReduce distributed computing metaphor.
HBase builds on Hadoop Core to provide a scalable, distributed database.
Pig is a high-level data-flow language and execution framework for parallel
computation. It is built on top of Hadoop Core.
ZooKeeper is a highly available and reliable coordination system. Distributed
applications use ZooKeeper to store and mediate updates for critical shared state.
Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad hoc querying and analysis of datasets.
The Hadoop Core project provides the basic services for building a cloud
computing environment with commodity hardware, and the APIs for developing
software that will run on that cloud.
The two fundamental pieces of Hadoop Core are the MapReduce framework (the cloud computing environment) and the Hadoop Distributed File System (HDFS).
The Hadoop Core MapReduce framework requires a shared file system.
This shared file system does not need to be a system-level file system, as long as
there is a distributed file system plug-in available to the framework.

MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

The term MapReduce actually refers to the following two different tasks that
Hadoop programs perform:

The Map Task: This is the first task, which takes input data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

The Reduce Task: This task takes the output from a map task as input and
combines those data tuples into a smaller set of tuples. The reduce task is
always performed after the map task.
Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.
The master is responsible for resource management, tracking resource consumption/availability and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
The slave TaskTrackers execute the tasks as directed by the master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System
Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata and one or more slave DataNodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of DataNodes.
The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take care of read and write operations with the file system.
They also take care of block creation, deletion and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system.
These shell commands are covered later in this unit along with appropriate examples.
4.3.1 HADOOP DISTRIBUTED FILE SYSTEM OVERVIEW:
Hadoop File System was developed using distributed file system design.
It is run on commodity hardware.
Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easier access.
To store such huge data, the files are stored across multiple machines.
These files are stored in redundant fashion to rescue the system from possible data losses in case of failure.
HDFS also makes applications available for parallel processing.
FEATURES OF HDFS:
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.

HDFS ARCHITECTURE
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.
Namenode (master server) that manages the file system.
Datanodes (Slaves), which manage storage attached to the nodes.
Internally, a file is split into one or more blocks and these blocks are
stored in a set of Datanodes.
NAMENODE:
The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
It is a software that can be run on commodity hardware. The system having
the namenode acts as the master server and it does the following tasks:
Manages the file system namespace.
Regulates clients' access to files.
It also executes file system operations such as renaming, closing, and
opening files and directories.
DATANODE:
The datanode is a commodity hardware having the GNU/Linux operating
system and datanode software.
For every node (commodity hardware/system) in a cluster, there will be a datanode.
These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client
request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
BLOCK
Generally the user data is stored in the files of HDFS.
The file in a file system will be divided into one or more segments and/or
stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.

GOALS OF HDFS:
Fault detection and recovery :
Since HDFS includes a large number of commodity hardware components, failure of components is frequent.
Therefore HDFS should have mechanisms for quick and automatic
fault detection and recovery.
Huge datasets :
HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data :
A requested task can be done efficiently, when the computation takes
place near the data.
Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.

The master node runs a JobTracker instance, which accepts job requests from clients.
TaskTracker instances run on slave nodes.
Each TaskTracker forks a separate Java process for each task instance.

4.3.4 HDFS OPERATIONS:


Starting HDFS
Initially you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system.
The following command will start the namenode as well as the data nodes as
cluster.
$ start-dfs.sh
Listing Files in HDFS
After loading the information in the server, we can find the list of files in a directory and the status of a file using ls.
Given below is the syntax of ls; you can pass a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system.
Follow the steps given below to insert the required file in the Hadoop file
system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file
system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS:
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Shutting Down HDFS:
You can shut down the HDFS by using the following command.
$ stop-dfs.sh
There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated here, although these basic operations will get you started.
Running ./bin/hadoop dfs with no additional arguments will list all the commands that can be run with the FsShell system.
Furthermore, $HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage summary for the operation in question, if you are stuck.
A table of all the operations is shown below.
The following conventions are used for parameters:
"<path>" means any file or directory name.
"<path>..." means one or more file or directory names.
"<file>" means any filename.
"<src>" and "<dest>" are path names in a directed operation.
"<localSrc>" and "<localDest>" are paths as above, but on the local file system.
All other files and path names refer to objects inside HDFS.
1. ls <path>
Lists the contents of the directory specified by path, showing the names, permissions,
owner, size and modification date for each entry.
2. lsr <path>
Behaves like -ls, but recursively displays entries in all subdirectories of path.
3. du <path>
Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.
4. dus <path>
Like -du, but prints a summary of disk usage of all files/directories in the path.
5. mv <src><dest>
Moves the file or directory indicated by src to dest, within HDFS.
6. cp <src> <dest>
Copies the file or directory identified by src to dest, within HDFS.
7. rm <path>
Removes the file or empty directory identified by path.
8. rmr <path>
Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).

9. put <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
10. copyFromLocal <localSrc> <dest>
Identical to -put
11. moveFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
12. get [-crc] <src> <localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
13. getmerge <src> <localDest>
Retrieves all files that match the path src in HDFS, and copies them to a single,
merged file in the local file system identified by localDest.
14. cat <filename>
Displays the contents of filename on stdout.
15. copyToLocal <src> <localDest>
Identical to -get
16. moveToLocal <src> <localDest>
Works like -get, but deletes the HDFS copy on success.
17. mkdir <path>
Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).
18. setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
19. touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already
exists at path, unless the file is already size 0.
20. test -[ezd] <path>
Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
21. stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks, filename, block size, replication, and modification date.
22. tail [-f] <filename>
Shows the last 1KB of file on stdout.
23. chmod [-R] mode,mode,... <path>...
Changes the file permissions associated with one or more objects identified by path....
Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}.
Assumes a if no scope is specified and does not apply an umask.
24. chown [-R] [owner][:[group]] <path>...
Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
25. chgrp [-R] group <path>...
Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.

26. help <cmd-name>
Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
4.4.1 MAPREDUCE MODEL
The model is based on two distinct steps for an application:
Map: An initial ingestion and transformation step, in which individual input records can be
processed in parallel.
Reduce: An aggregation or summarization step, in which all associated records must be
processed together by a single entity.
The core concept of MapReduce in Hadoop is that input may be split into logical
chunks, and each chunk may be initially processed independently, by a map task.
The results of these individual processing chunks can be physically partitioned into
distinct sets, which are then sorted.
Each sorted chunk is passed to a reduce task.
A map task may run on any compute node in the cluster, and multiple map tasks may
be running in parallel across the cluster.
The map task is responsible for transforming the input records into key/value pairs.
The output of all of the maps will be partitioned, and each partition will be sorted.
There will be one partition for each reduce task.
Each partition's sorted keys and the values associated with the keys are then processed by the reduce task.
There may be multiple reduce tasks running in parallel on the cluster.
The application developer needs to provide only four items to the Hadoop framework:
the class that will read the input records and transform them into one key/value pair
per record, a map method, a reduce method, and a class that will transform the
key/value pairs that the reduce method outputs into output records.
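
These four items come together in a job driver that configures and runs the job. The listing below is a minimal sketch using the classic org.apache.hadoop.mapred API (the same API as the IdentityMapper listing later in this unit); the class names WordCountDriver, WordCountMapper, and WordCountReducer and the use of command-line arguments for the paths are illustrative assumptions, not part of Hadoop itself. A word-count version of the mapper and reducer is sketched in the word count example later in this section.

WordCountDriver.java (sketch)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("word count");

    // The class that reads input records and turns them into key/value pairs
    conf.setInputFormat(TextInputFormat.class);
    // The class that turns the reducer's key/value pairs into output records
    conf.setOutputFormat(TextOutputFormat.class);

    // The map and reduce methods (hypothetical classes, sketched later)
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // Types of the keys and values emitted by map and reduce
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Input and output paths are taken from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // submit the job and wait for completion
  }
}

Such a driver would typically be launched with something like hadoop jar wordcount.jar WordCountDriver /user/input /user/output (the jar name and paths are illustrative).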
My first MapReduce application was a specialized web crawler.
This crawler received as input large sets of media URLs that were to have their
content fetched and processed.
The media items were large, and fetching them had a significant cost in time and
resources.
The job had several steps:
1. Ingest the URLs and their associated metadata.
2. Normalize the URLs.
3. Eliminate duplicate URLs.
4. Filter the URLs against a set of exclusion and inclusion filters.
5. Filter the URLs against a do not fetch list.
6. Filter the URLs against a recently seen set.
7. Fetch the URLs.
8. Fingerprint the content items.
9. Update the recently seen set.
10. Prepare the work list for the next application.

Figure: The MapReduce model

Introduction to MapReduce Algorithm

MapReduce is a distributed data processing algorithm, introduced by Google in its MapReduce tech paper.
The MapReduce algorithm is mainly inspired by the functional programming model.
The MapReduce algorithm is mainly useful for processing huge amounts of data in a parallel, reliable and efficient way in cluster environments.
It uses a Divide and Conquer technique to process large amounts of data.
It divides the input task into smaller, manageable sub-tasks (which should be executable independently) and executes them in parallel.

MapReduce Algorithm Steps

MapReduce Algorithm uses the following three main steps:

1. Map Function.

2. Shuffle Function.

3. Reduce Function.

Map Function:

Map Function is the first step in the MapReduce Algorithm.
It takes input tasks (say, DataSets; only one DataSet is shown in the diagram below) and divides them into smaller sub-tasks.
It then performs the required computation on each sub-task in parallel.

This step performs the following two sub-steps:

1. Splitting

2. Mapping

The Splitting step takes the input DataSet from the source and divides it into smaller Sub-DataSets.

The Mapping step takes those smaller Sub-DataSets and performs the required action or computation on each Sub-DataSet.

The output of this Map Function is a set of key and value pairs as <Key, Value> as shown in
the below diagram.
MapReduce First Step Output:

Shuffle Function
It is the second step in the MapReduce Algorithm. The Shuffle Function is also known as the Combine Function.

It performs the following two sub-steps:

1. Merging

2. Sorting

It takes the list of outputs coming from the Map Function and performs these two sub-steps on each and every key-value pair.

The Merging step combines all key-value pairs which have the same key (that is, it groups key-value pairs by comparing keys). This step returns <Key, List<Value>>.
The Sorting step takes the input from the Merging step and sorts all key-value pairs by key. This step also returns <Key, List<Value>> output, but with sorted key-value pairs.

Finally, the Shuffle Function returns a list of <Key, List<Value>> sorted pairs to the next step.

MapReduce Second Step Output:

Reduce Function:
It is the final step in the MapReduce Algorithm. It performs only one step: the Reduce step.

It takes the list of <Key, List<Value>> sorted pairs from the Shuffle Function and performs the reduce operation as shown below.
MapReduce Final Step Output:

The final step output looks like the first step output. However, the final step <Key, Value> pairs are different from the first step <Key, Value> pairs: the final step pairs are computed and sorted.
We can observe the difference between the first step output and the final step output with a simple example. The same steps are discussed with one simple example in the next section.
That is all three steps of the MapReduce Algorithm.

How MapReduce Algorithm Works With Word Count Example

This section shows how the MapReduce Algorithm solves the Word Count problem theoretically.

A sketch of the corresponding Hadoop MapReduce program is given after the step diagrams below.

Problem Statement:
Count the number of occurrences of each word available in a DataSet.

Input DataSet
Our example input DataSet file is shown in the diagram below. For simplicity, we use a small DataSet; real-world applications use very large amounts of data.

Client Required Final Result:

MapReduce Map Function (Split Step)

MapReduce Map Function (Mapping Step)


MapReduce Shuffle Function (Merge Step)
MapReduce Shuffle Function (Sorting Step)

MapReduce Reduce Function (Reduce Step)

Map & Reduce function:

A Simple Map Function: IdentityMapper


The Hadoop framework provides a very simple map function, called
IdentityMapper. It is used in jobs that only need to reduce the input, and not
transform the raw input.
All map functions must implement the Mapper interface, which guarantees that the map function will always be called with a key, a value, an output collector, and a reporter.
The key is an instance of a WritableComparable object and the value is an instance of a Writable object.
IdentityMapper.java
package org.apache.hadoop.mapred.lib;
import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.MapReduceBase;
/** Implements the identity function, mapping inputs directly to outputs. */
public class IdentityMapper<K, V>
extends MapReduceBase implements Mapper<K, V, K, V> {
/** The identify function. Input key/value pair is written directly to
* output.*/
public void map(K key, V val, OutputCollector<K, V> output, Reporter
reporter) throws IOException
{
output.collect(key, val);
}
}
A Simple Reduce Function: IdentityReducer
The Hadoop framework calls the reduce function one time for each unique
key.
The framework provides the key and the set of values that share that key.
IdentityReducer.java
package org.apache.hadoop.mapred.lib;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.MapReduceBase;
/** Performs no reduction, writing all input values directly to the output. */
public class IdentityReducer<K, V>
extends MapReduceBase implements Reducer<K, V, K, V> {
/** Writes all keys and values directly to output. */
public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output,
Reporter reporter) throws IOException
{
while (values.hasNext())
{
output.collect(key, values.next());
}
}
}
If you require the output of your job to be sorted, the reducer function must pass the
key objects to the output.collect() method unchanged.

The reduce phase is, however, free to output any number of records, including zero
records, with the same key and different values.
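
For example, a job that only needs to sort its input by key, without transforming it, can plug both identity classes into its configuration. The sketch below assumes the old JobConf-based API and KeyValueTextInputFormat, which splits each input line into a key and a value at the first tab character; the class name SortOnlyDriver and the use of command-line paths are illustrative.

SortOnlyDriver.java (sketch)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SortOnlyDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SortOnlyDriver.class);
    conf.setJobName("sort only");
    // Keys and values are both Text with KeyValueTextInputFormat
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // Neither function changes the data; the framework's shuffle/sort
    // between map and reduce produces the sorted output.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}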

4.5.1 Data Flow of a File Read:

The client opens the file it wishes to read by calling open () on the FileSystem object,
which for HDFS is an instance of DistributedFileSystem.
DistributedFileSystem calls the namenode, using RPC, to determine the locations of
the blocks for the first few blocks in the file.
The namenode returns the addresses of the datanodes that have a copy of that block.
If the client is itself a datanode, then it will read from the local datanode if it hosts a copy of the block.
The DistributedFileSystem returns an FSDataInputStream to the client for it to read
data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode
and namenode I/O.
Figure: A client reading data from HDFS

The client then calls read () on the stream. DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first
(closest) datanode for the first block in the file.
Data is streamed from the datanode back to the client, which calls read () repeatedly
on the stream. When the end of the block is reached, DFSInputStream will close the
connection to the datanode, then find the best datanode for the next block.
This happens transparently to the client, which from its point of view is just reading a
continuous stream.
Blocks are read in order with the DFSInputStream opening new connections to
datanodes as the client reads through the stream.
It will also call the namenode to retrieve the datanode locations for the next batch of
blocks as needed. When the client has finished reading, it calls close () on the
FSDataInputStream.
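
From the application's point of view, this whole read dataflow sits behind the FileSystem Java interface. The listing below is a minimal sketch, assuming a default Hadoop configuration on the classpath and an existing HDFS file at the illustrative path /user/input/file.txt.

HdfsReadExample.java (sketch)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);         // DistributedFileSystem for an HDFS URI
    Path path = new Path("/user/input/file.txt"); // assumed example path
    FSDataInputStream in = null;
    try {
      in = fs.open(path);                             // open() triggers the namenode RPC described above
      IOUtils.copyBytes(in, System.out, 4096, false); // read() streams data from the datanodes
    } finally {
      IOUtils.closeStream(in);                        // close() on the FSDataInputStream
    }
  }
}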

Figure: Network distance in Hadoop

Data flow of a File write:


The client creates the file by calling create() on DistributedFileSystem.
DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2).
The namenode performs various checks to make sure the file doesn't already exist, and that the client has the right permissions to create the file.
If these checks pass, the namenode makes a record of the new file.

Figure: A client writing data to HDFS

The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode. As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue.
The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; we'll assume the replication level is three, so there are three nodes in the pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline, which
stores the packet and forwards it to the second datanode in the pipeline.
Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline (step 4).
DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue.
A packet is removed from the ack queue only when it has been acknowledged by all
the datanodes in the pipeline (step 5).
If a datanode fails while data is being written to it, then the following actions are
taken, which are transparent to the client writing the data

Figure: A typical replica pipeline

First the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline and the remainder of the block's data is written to the two good datanodes in the pipeline.
The namenode notices that the block is under-replicated, and it arranges for a further
replica to be created on another node.
Subsequent blocks are then treated as normal.
It's possible, but unlikely, that multiple datanodes fail while a block is being written.
As long as dfs.replication.min replicas (default one) are written, the write will
succeed, and the block will be asynchronously replicated across the cluster until its
target replication factor is reached.
When the client has finished writing data, it calls close () on the stream (step 6).
This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete
(step 7).
The namenode already knows which blocks the file is made up of.
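
The same write pipeline is driven from Java through FileSystem.create(). The listing below is a minimal sketch, assuming a default Hadoop configuration on the classpath; the output path is an illustrative example.

HdfsWriteExample.java (sketch)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);           // DistributedFileSystem for HDFS
    Path path = new Path("/user/output/hello.txt"); // assumed example path
    FSDataOutputStream out = null;
    try {
      out = fs.create(path);      // create() makes the namenode RPC (step 2) described above
      out.writeUTF("Hello HDFS"); // data is packetized and pushed down the datanode pipeline
    } finally {
      if (out != null) {
        out.close();              // close() flushes packets and signals completion (steps 6 and 7)
      }
    }
  }
}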
Review questions:

1. What is middleware?
2. Define Grid middleware.
3. What are the middleware packages?
4. What is Condor-G?
5. List the components of Condor-G.
6. Give the advantages of Sun Grid Engine.
7. List down the research grids.
8. Define sun grid engine.
9. What is GT4?
10. List down the GT4 efficiency.
11. List the services of GT4?
12. What is the use of GRAM?
13. What is meant by staging of files?
14. What are the ways for making secure file transfer in globus?
15. What are the objectives of GSI?
16. What are the functions of GSI?
17. What is Replica Location Service (RLS)?
18. What is Data Replication Service (DRS)?
19. What is Reliable File Transfer (RFT)?
20. What is GridFTP?
21. What is MDS?
22. What is meant by MDS?
23. Give the features of index service.
24. What are trigger services?
25. What is aggregator framework?
26. List the aggregator sources.
27. Define Aggregator framework.
28. What is CSF4?
29. What is Hadoop?
30. Give the modules in Hadoop framework.
31. What is Mapreduce?
32. What is Chukwa?
33. What is AVRO?
34. What is zookeeper?
35. What is HIVE?
36. What are the tasks of job monitoring?
37. What is Pig data flow language?
38. What is Sqoop?
39. What is input splitting?
40. What are the operational modes of Hadoop?
41. Give the features of HDFS.
42. List down the parts of grid FTP?
43. What is meant by mutual authentication?
44. What is namenode?
45. What is datanode?
46. Name the various implementations of file systems in Hadoop.
47. What is meant by Reduce function?
48. What is meant by communication integrity?

8 MARKS

49. Explain Condor-G.


50. List some major commands in Hadoop.
51. Describe the salient features of Sun Grid Engine.
52. Describe in detail about Hadoop framework.
53. Describe the file read and write process in Hadoop.

16 MARKS

54. Explain briefly the open source grid middleware packages.


55. Discuss in detail about the hadoop framework with mapreduce and input splitting.
56. Give a brief overview of GT4 architecture.
57. Explain the programming model of GT4.
58. Discuss the Concepts of globus GT4 Architecture.
59. Explain the design of hadoop file system.
60. Elucidate the role of Java in Hadoop.
61. Discuss briefly about the dataflow of file read and write.
62. Explain Mapreduce in detail
