The Open Grid Forum and the Object Management Group are two well-known
organizations behind these standards.
The Globus Toolkit is an open-source middleware library for the grid computing
community.
These open source software libraries support many operational grids and their
applications on an international basis.
The toolkit addresses common problems and issues related to grid resource
discovery, management, communication, security, fault detection, and portability.
The software itself provides a variety of components and capabilities.
The library includes a rich set of service implementations.
The implemented software supports grid infrastructure management, provides
tools for building new web services in Java, C, and Python, builds a powerful
standards-based security infrastructure and client APIs (in different languages),
and offers comprehensive command-line programs for accessing various grid services.
The Globus Toolkit was initially motivated by a desire to remove obstacles that
prevent seamless collaboration, and thus sharing of resources and services, in
scientific and engineering applications.
The shared resources can be computers, storage, data, services, networks, science
instruments (e.g., sensors), and so on.
Functionalities of GT4:
Global Resource Allocation Manager (GRAM) - Grid Resource Access and
Management (HTTP-based)
Communication (Nexus) - Unicast and multicast communication
Figure: Globus Toolkit GT4 supports distributed and cluster computing services
Client-Globus Interactions:
There are strong interactions between provider programs and user code.
GT4 makes heavy use of industry-standard web service protocols and mechanisms
in service description, discovery, access, authentication, authorization, and the
like.
GT4 user code is typically written in Java, C, or Python.
Web service mechanisms define specific interfaces for grid computing.
Web services provide flexible, extensible, and widely adopted XML-based
interfaces.
User applications demand computational, communication, data, and storage resources.
We must enable a range of end-user tools that provide the higher-level capabilities
needed in specific user applications.
Developers can use these services and libraries to build simple and complex
systems quickly.
Figure: Client and GT4 server interactions; vertical boxes correspond to service
programs and horizontal boxes represent user code.
Parts of GRAM5 are written in Perl, and the build depends on external packages
such as pkg-config (from freedesktop.org) and gettext (from developer.gnome.org),
which may need to be downloaded and installed from source packages.
To this end, support has been added to the Globus Toolkit that allows installations
to send generic usage statistics to the Globus project.
By participating in this project, installations help the project's funders justify
continuing their support for the software on which users rely.
4.2.1 INTRODUCTION TO HADOOP FRAMEWORK
Hadoop Architecture
Hadoop Common:
These are Java libraries and utilities required by other Hadoop modules.
These libraries provide file system and OS-level abstractions and contain the
necessary Java files and scripts required to start Hadoop.
Hadoop YARN:
This is a framework for job scheduling and cluster resource management.
Hadoop MapReduce:
This is a YARN-based system for parallel processing of large data sets.
Hadoop Distributed File System (HDFS):
This is a distributed file system that provides high-throughput access to
application data; it is covered in detail in Section 4.3.1.
The following diagram (Fig. 1) depicts these four components of the Hadoop
framework.
Fig. 1: Diagram of the Hadoop framework
MapReduce
The term MapReduce actually refers to the following two different tasks that
Hadoop programs perform:
The Map Task: This is the first task, which takes input data and converts it
into another set of data, where individual elements are broken down into tuples
(key/value pairs).
The Reduce Task: This task takes the output from a map task as input and
combines those data tuples into a smaller set of tuples. The reduce task is
always performed after the map task.
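As a concrete illustration of these two tasks, below is a minimal word-count
sketch written against the standard org.apache.hadoop.mapreduce Java API (the
class name WordCount and the command-line input/output paths are illustrative,
not part of the original text):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: break each input line into (word, 1) key/value pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: combine the list of values for each word into one count.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}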
Typically both the input and the output are stored in a file system. The framework
takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster-node.
The master is responsible for resource management, tracking resource
consumption/availability and scheduling the jobs component tasks on the slaves,
monitoring them and re-executing the failed tasks.
The TaskTracker slaves execute the tasks as directed by the master and provide
task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service,
which means that if the JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System
Hadoop can work directly with any mountable distributed file system, such as Local
FS, HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is
the Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS)
and provides a distributed file system that is designed to run in a reliable,
fault-tolerant manner on large clusters (thousands of computers) of small
commodity machines.
HDFS uses a master/slave architecture in which the master consists of a single
NameNode that manages the file system metadata, and the slaves are one or more
DataNodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored
in a set of DataNodes.
The NameNode determines the mapping of blocks to the DataNodes. The DataNodes
take care of read and write operations within the file system.
They also take care of block creation, deletion, and replication based on
instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is
available to interact with the file system.
These shell commands will be covered in a separate chapter along with appropriate
examples.
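As a brief preview, a few representative commands are shown below (a sketch
assuming a running HDFS installation; the paths are illustrative):

hadoop fs -mkdir /user/hadoop/input                 # create a directory in HDFS
hadoop fs -put localfile.txt /user/hadoop/input      # copy a local file into HDFS
hadoop fs -ls /user/hadoop/input                     # list the directory contents
hadoop fs -cat /user/hadoop/input/localfile.txt      # print the file to stdout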
4.3.1 HADOOP DISTRIBUTED FILE SYSTEM OVERVIEW:
The Hadoop File System was developed using a distributed file system design.
It runs on commodity hardware.
Unlike other distributed systems, HDFS is highly fault-tolerant and designed
using low-cost hardware.
HDFS holds very large amounts of data and provides easy access.
To store such huge data, the files are stored across multiple machines.
These files are stored in redundant fashion to rescue the system from possible
data losses in case of failure.
HDFS also makes applications available for parallel processing.
FEATURES OF HDFS:
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the name node and data node help users to easily
check the status of the cluster.
It provides streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS ARCHITECTURE
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode (master server), which manages the file system.
Datanodes (slaves), which manage storage attached to the nodes.
Internally, a file is split into one or more blocks and these blocks are
stored in a set of Datanodes.
NAMENODE:
The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
It is software that can be run on commodity hardware. The system hosting
the namenode acts as the master server, and it performs the following tasks:
Manages the file system namespace.
Regulates clients' access to files.
It also executes file system operations such as renaming, closing, and
opening files and directories.
DATANODE:
The datanode is commodity hardware running the GNU/Linux operating
system and the datanode software.
For every node (commodity hardware/system) in a cluster, there will be a
datanode.
These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client
request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
BLOCK
Generally, the user data is stored in the files of HDFS.
A file in the file system is divided into one or more segments, which are
stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount
of data that HDFS can read or write is called a block.
The default block size is 64 MB, but it can be increased as needed by changing
the HDFS configuration.
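As an illustrative sketch, the block size could be raised to 128 MB in
hdfs-site.xml (the property is named dfs.block.size in older Hadoop releases and
dfs.blocksize in newer ones; the value is given in bytes):

<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128 MB, overriding the 64 MB default -->
  </property>
</configuration>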
GOALS OF HDFS:
Fault detection and recovery:
Since HDFS includes a large number of commodity hardware components,
failure of components is frequent.
Therefore, HDFS should have mechanisms for quick and automatic
fault detection and recovery.
Huge datasets:
HDFS should have hundreds of nodes per cluster to manage applications
that have huge datasets.
Hardware at data:
A requested task can be done efficiently when the computation takes
place near the data.
Especially where huge datasets are involved, this reduces network
traffic and increases throughput.
The master node runs a JobTracker instance, which accepts job requests from
clients.
TaskTracker instances run on slave nodes.
Each TaskTracker forks a separate Java process for each task instance.
The MapReduce algorithm consists of three functions:
1. Map Function
2. Shuffle Function
3. Reduce Function
Map Function:
It performs the following two sub-steps:
1. Splitting
2. Mapping
The Splitting step takes the input data set from the source and divides it into
smaller sub-data sets.
The Mapping step takes those smaller sub-data sets and performs the required
action or computation on each sub-data set.
The output of the Map Function is a set of key/value pairs of the form
<Key, Value>, as shown below.
MapReduce First Step Output:
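Since the original diagram is not reproduced here, consider a hypothetical
word-count input line such as "apple banana apple"; the Map Function's output
would be:
<apple, 1>, <banana, 1>, <apple, 1>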
Shuffle Function:
It is the second step in the MapReduce algorithm. The Shuffle Function is also
known as the Combine Function.
It performs the following two sub-steps:
1. Merging
2. Sorting
It takes the list of outputs coming from the Map Function and performs these two
sub-steps on each and every key-value pair.
The Merging step combines all key-value pairs that have the same key (that is,
it groups key-value pairs by comparing keys). This step returns <Key, List<Value>>
pairs.
The Sorting step takes the input from the Merging step and sorts all key-value
pairs by their keys.
This step also returns <Key, List<Value>> output, but with sorted key-value pairs.
Finally, the Shuffle Function returns a list of sorted <Key, List<Value>> pairs
to the next step.
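Continuing the hypothetical word-count example, the Shuffle Function would merge
and sort the map output into:
<apple, List<1, 1>>, <banana, List<1>>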
Reduce Function:
It is the final step in the MapReduce algorithm. It performs only one step: the
Reduce step.
It takes the list of sorted <Key, List<Value>> pairs from the Shuffle Function
and performs the reduce operation as shown below.
MapReduce Final Step Output:
The final step's output looks like the first step's output. However, the final
<Key, Value> pairs are different from the first step's <Key, Value> pairs: the
final <Key, Value> pairs are computed and sorted.
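For the hypothetical word-count example, the Reduce Function would sum each value
list and produce:
<apple, 2>, <banana, 1>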
We can observe the difference between the first step's output and the final
step's output with a simple example; the next section discusses the same steps
with one such example.
That covers all three steps of the MapReduce algorithm.
Input Data Set:
Our example input data set file appears in the diagram below. For simplicity, we
use a small data set; real-world applications, however, work with very large
amounts of data.
The reduce phase is, however, free to output any number of records, including zero
records, with the same key and different values.
The client opens the file it wishes to read by calling open() on the FileSystem
object, which for HDFS is an instance of DistributedFileSystem.
DistributedFileSystem calls the namenode, using RPC, to determine the locations of
the blocks for the first few blocks in the file.
The namenode returns the addresses of the datanodes that have a copy of that block.
If the client is itself a datanode, then it will read from the local datanode if
it hosts a copy of the block.
The DistributedFileSystem returns an FSDataInputStream to the client for it to read
data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode
and namenode I/O.
Figure: A client reading data from HDFS
The client then calls read() on the stream. DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first
(closest) datanode for the first block in the file.
Data is streamed from the datanode back to the client, which calls read() repeatedly
on the stream. When the end of the block is reached, DFSInputStream will close the
connection to the datanode, then find the best datanode for the next block.
This happens transparently to the client, which from its point of view is just reading a
continuous stream.
Blocks are read in order with the DFSInputStream opening new connections to
datanodes as the client reads through the stream.
It will also call the namenode to retrieve the datanode locations for the next batch
of blocks as needed. When the client has finished reading, it calls close() on the
FSDataInputStream.
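The read sequence described above can be driven from a few lines of client code.
Below is a minimal sketch assuming the standard Hadoop FileSystem API; the HDFS
file path is taken as a command-line argument, and the class name HdfsCat is
illustrative:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // For an HDFS URI, get() returns a DistributedFileSystem instance.
    FileSystem fs = FileSystem.get(new Configuration());
    InputStream in = null;
    try {
      // open() returns an FSDataInputStream wrapping a DFSInputStream.
      in = fs.open(new Path(args[0]));
      // copyBytes() calls read() repeatedly; blocks are streamed in order
      // from the closest datanode holding each block.
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      // close() ends the read and tears down any datanode connection.
      IOUtils.closeStream(in);
    }
  }
}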
If a datanode fails while data is being written to it, the following actions are
taken. First, the pipeline is closed, and any packets in the ack queue are added to
the front of the data queue so that datanodes that are downstream from the failed
node will not miss any packets. The current block on the good datanodes is given a
new identity, which is communicated to the namenode, so that the partial block on
the failed datanode will be deleted if the failed datanode recovers later on. The
failed datanode is removed from the pipeline, and the remainder of the block's data
is written to the two good datanodes in the pipeline.
The namenode notices that the block is under-replicated, and it arranges for a further
replica to be created on another node.
Subsequent blocks are then treated as normal.
It's possible, but unlikely, that multiple datanodes fail while a block is being
written. As long as dfs.replication.min replicas (default one) are written, the
write will succeed, and the block will be asynchronously replicated across the
cluster until its target replication factor is reached.
When the client has finished writing data, it calls close() on the stream (step 6).
This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete
(step 7).
The namenode already knows which blocks the file is made up of.
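For completeness, here is a minimal client-side sketch of the write path under
the same assumptions as the read example; the local source file and HDFS
destination path are taken as command-line arguments, and the class name HdfsPut
is illustrative:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
    // create() returns an FSDataOutputStream; bytes written to it are queued
    // as packets and pushed through the datanode pipeline.
    FSDataOutputStream out = fs.create(new Path(args[1]));
    try {
      IOUtils.copyBytes(in, out, 4096, false);
    } finally {
      // close() flushes the remaining packets, waits for acknowledgments, and
      // signals the namenode that the file is complete (steps 6 and 7 above).
      out.close();
      in.close();
    }
  }
}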
Review questions:
1. What is middleware?
2. Define Grid middleware.
3. What are the middleware packages?
4. What is Condor-G?
5. List the components of Condor-G.
6. Give the advantages of Sun Grid Engine.
7. List down the research grids.
8. Define Sun Grid Engine.
9. What is GT4?
10. List down the GT4 efficiency.
11. List the services of GT4?
12. What is the use of GRAM?
13. What is meant by staging of files?
14. What are the ways for making secure file transfer in globus?
15. What are the objectives of GSI?
16. What are the functions of GSI?
17. What is Replica Location Service (RLS)?
18. What is Data Replication Service (DRS)?
19. What is Reliable File Transfer (RFT)?
20. What is GridFTP?
21. What is MDS?
22. What is meant by MDS?
23. Give the features of index service.
24. What are trigger services?
25. What is aggregator framework?
26. List the aggregator sources.
27. Define Aggregator framework.
28. What is CSF4?
29. What is Hadoop?
30. Give the modules in Hadoop framework.
31. What is MapReduce?
32. What is Chukwa?
33. What is Avro?
34. What is ZooKeeper?
35. What is Hive?
36. What are the tasks of job monitoring?
37. What is Pig data flow language?
38. What is Sqoop?
39. What is input splitting?
40. What are the operational modes of Hadoop?
41. Give the features of HDFS.
42. List down the parts of GridFTP.
43. What is meant by mutual authentication?
44. What is namenode?
45. What is datanode?
46. Name the various implementations of file systems in Hadoop.
47. What is meant by Reduce function?
48. What is meant by communication integrity?
8 MARKS
16 MARKS