ENGINEERING COLLEGES
2016 – 17 Odd Semester
REFERENCES:
1. Jason Venner, "Pro Hadoop: Build Scalable, Distributed Applications in the
Cloud", Apress, 2009.
2. Tom White, "Hadoop: The Definitive Guide", First Edition, O'Reilly, 2009.
3. Bart Jacob (Editor), "Introduction to Grid Computing", IBM Redbooks,
Vervante, 2005.
4. Ian Foster and Carl Kesselman, "The Grid: Blueprint for a New Computing
Infrastructure", Second Edition, Morgan Kaufmann.
5. Frederic Magoules and Jie Pan, "Introduction to Grid Computing", CRC Press,
2009.
6. Daniel Minoli, "A Networking Approach to Grid Computing", John Wiley
Publication, 2005.
7. Barry Wilkinson, "Grid Computing: Techniques and Applications", Chapman
and Hall/CRC, Taylor and Francis Group, 2010.
UNIT – I
INTRODUCTION
PART – A
The cluster is often a collection of homogeneous compute nodes that are physically
connected in close range to one another.
Benefits of clusters:
Scalable performance
Programmability
Efficient message passing
High system availability
Seamless fault tolerance
Cluster wide job management
Dynamic load balancing
Cloud computing refers to both the applications delivered as services over the internet
and the hardware and system software in the data centers that provide those services.
The characteristics of cloud computing are:
On-demand usage
Ubiquitous access
Multitenancy
Elasticity
Measured usage
Resiliency
Ubiquitous computing refers to computing with pervasive devices at any place and
time using wired or wireless communication.
The Internet of Things (IoT) is a networked connection of everyday objects including
computers, sensors, humans, etc. The IoT is supported by Internet clouds to achieve
ubiquitous computing with any object at any place and time.
A storage area network (SAN) connects servers to network storage such as disk
arrays.
Network attached storage (NAS) connects client hosts directly to the disk arrays.
9. Define SSI.
Greg Pfister has indicated that an ideal cluster should merge multiple system images
into a single-system image (SSI). An SSI is an illusion created by software or
hardware that presents a collection of resources as one integrated, powerful resource.
SSI makes the cluster appear like a single machine to the user. A cluster with multiple
system images is nothing but a collection of independent computers.
10. Define grid computing. What are all the types of grid systems?
Grid computing has attracted global technical communities with the evolution of
Business on-demand computing and Autonomic computing. Grid computing is the
process of coordinated resource sharing and problem solving in dynamic, multi-
institutional virtual organizations.
Type of Grid systems:
Computational and data grids – provide computing utility, data, and information
services through resource sharing and cooperation among participating organizations.
P2P grids – P2P grids are mainly for distributed computing and collaboration
applications that have no fixed structure. P2P grids are unreliable, their resources are
controlled by the users, and their use is limited to a few applications.
A grid is an environment that allows service oriented, flexible and seamless sharing of
heterogeneous network of resources for compute intensive and data intensive tasks
and provides faster throughput and scalability at lower costs. The distinct benefits of
using grids include performance with scalability, resource utilization, management
and reliability and virtualization.
Cloud - A cloud is a pool of virtualized computer resources. A cloud can host a
variety of different workloads, including batchstyle backend jobs and interactive and
user-facing applications.
Grid computing provides a single interface for managing the heterogeneous resources.
It can create a more robust and resilient infrastructure through the use of
decentralization, fail-over and fault tolerance to make the infrastructure better suited
to respond to minor or major disasters.
15. List out the design goals and requirements of HPC and HTC systems.
PART – B
Distributed and cloud computing systems are built over a large number of
autonomous computer nodes. These node machines are interconnected by
SANs, LANs, or WANs in a hierarchical manner.
With today’s networking technology, a few LAN switches can easily connect
hundreds of machines as a working cluster.
A WAN can connect many local clusters to form a very large cluster of
clusters. In this sense, one can build a massive system with millions of
computers connected to edge networks. Massive systems are considered highly
scalable, and can reach web-scale connectivity, either physically or logically.
Massive systems are classified into four groups: clusters, P2P networks,
computing grids, and Internet clouds over huge data centres. In terms of node
number, these four system classes may involve hundreds, thousands, or even
millions of computers as participating nodes. These machines work
collectively, cooperatively, or collaboratively at various levels.
Cluster Architecture
Single-System Image:
In the past 30 years, users have experienced a natural growth path from Internet to
web and grid computing services.
Grid computing is envisioned to allow close interaction among applications running
on distant computers simultaneously.
Computational Grids:
Figure 1.2 - Computational grid or data grid providing computing utility, data,
and information services through resource sharing and cooperation among
participating organizations.
services. Enterprises and consumers form the user base, which then
defines the usage trends and service characteristics.
Grid Families
P2P Systems:
o In a P2P system, every node acts as both a client and a server, providing
part of the system resources.
o Peer machines are simply client computers connected to the Internet.
o All client machines act autonomously to join or leave the system freely.
o This implies that no master-slave relationship exists among the peers.
No central coordination or central database is needed.
o In other words, no peer machine has a global view of the entire P2P
system. The system is self-organizing with distributed control.
o The figure 1.3 given below shows the architecture of a P2P network at
two abstraction levels. Initially, the peers are totally unrelated. Each
peer machine joins or leaves the P2P network voluntarily. Only the
participating peers form the physical network at any time. Unlike the
cluster or grid, a P2P network does not use a dedicated interconnection
network. The physical network is simply an ad hoc network formed at
various Internet domains randomly using the TCP/IP and NAI protocols.
Thus, the physical network varies in size and topology dynamically due
to the free membership in the P2P network.
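As a concrete illustration of this self-organizing behavior, the following minimal sketch (in Python) places peers and data keys on a Chord-style hash ring, so any peer can resolve which peer is responsible for a key without a central database. The ring size, peer addresses, and data key are illustrative assumptions, not part of any particular P2P system.

# A minimal sketch of a Chord-style structured overlay (one of several
# possible P2P designs); node names and the 32-bit ring size are
# illustrative assumptions, not taken from any specific implementation.
import hashlib
from bisect import bisect_right

RING_BITS = 32  # assumed identifier space: 2**32 positions

def ring_id(name: str) -> int:
    """Hash a peer address or data key onto the identifier ring."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) % (2 ** RING_BITS)

class Overlay:
    """Peers join and leave freely; no central database is needed."""
    def __init__(self):
        self.peers = []                     # sorted list of (ring position, address)

    def join(self, address: str):
        self.peers.append((ring_id(address), address))
        self.peers.sort()

    def leave(self, address: str):
        self.peers = [(p, a) for p, a in self.peers if a != address]

    def locate(self, key: str) -> str:
        """Return the peer responsible for a key (its first successor on the ring)."""
        idx = bisect_right(self.peers, (ring_id(key),)) % len(self.peers)
        return self.peers[idx][1]

overlay = Overlay()
for addr in ("peer-a:4000", "peer-b:4000", "peer-c:4000"):
    overlay.join(addr)
print(overlay.locate("dataset-17"))         # any peer can answer the lookup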
Overlay Networks
o There are too many hardware models and architectures to select from;
incompatibility exists between software and the OS; and different
network connections and protocols make it too complex to apply in real
applications.
o System scalability is needed as the workload increases. System scaling
is directly related to performance and bandwidth.
o P2P networks have these properties. Data location is also important; it
affects collective performance.
o Data locality, network proximity, and interoperability are three design
objectives in distributed P2P applications.
o P2P performance is affected by routing efficiency and self-organization
by participating peers.
o Fault tolerance, failure management, and load balancing are other
important issues in using overlay networks. Lack of trust among peers
poses another problem.
o A cloud allows workloads to be deployed and scaled out quickly through rapid
provisioning of virtual or physical machines.
o The cloud supports redundant, self-recovering, highly scalable programming
models that allow workloads to recover from many unavoidable hardware/
software failures.
o Finally, the cloud system should be able to monitor resource use in real time to
enable rebalancing of allocations when needed.
Internet Clouds
Figure 1.4 Virtualized resources from data centers to form an Internet cloud,
provisioned with hardware, software, storage, network, and services for paid
users to run their applications.
The Cloud Landscape
Figure 1.5 - Three cloud service models in a cloud landscape of major providers.
Platform as a Service (PaaS) This model enables the user to deploy user-built
applications onto a virtualized cloud platform. PaaS includes middleware, databases,
development tools, and some runtime support such as Web 2.0 and Java. The platform
includes both hardware and software integrated with specific programming interfaces.
The provider supplies the API and software tools (e.g., Java, Python, Web 2.0, .NET).
The user is freed from managing the cloud infrastructure.
Software as a Service (SaaS) This refers to browser-initiated application software
over thousands of paid cloud customers. The SaaS model applies to business
processes, industry applications, customer relationship management (CRM),
enterprise resource planning (ERP), human resources (HR), and collaborative
applications. On the customer side, there is no upfront investment in servers or
software licensing. On the provider side, costs are rather low, compared with
conventional hosting of user applications.
Internet clouds offer four deployment modes: private, public, managed, and hybrid.
These modes carry different security implications. The different SLAs
imply that the security responsibility is shared among all the cloud providers, the
cloud resource consumers, and the third-party cloud-enabled software providers.
Reasons for adopting the cloud: The following list highlights eight reasons to adopt
the cloud for upgraded Internet applications and web services:
1. Desired location in areas with protected space and higher energy efficiency
2. Sharing of peak-load capacity among a large pool of users, improving overall
utilization
3. Separation of infrastructure maintenance duties from domain-specific
application development
4. Significant reduction in cloud computing cost, compared with traditional
computing paradigms
5. Cloud computing programming and application development
6. Service and data discovery and content/service distribution
7. Privacy, security, copyright, and reliability issues
8. Service agreements, business models, and pricing policies
2. VIRTUAL MACHINE
To build large clusters, grids, and clouds, large amounts of computing, storage, and
networking resources must be accessed in a virtualized manner; those resources have
to be aggregated and, hopefully, offer a single system image.
In particular, a cloud of provisioned resources must rely on virtualization of
processors, memory, and I/O facilities dynamically. The Figure below
illustrates the architectures of three VM configurations.
Figure 1.6 - Three VM architectures in (b), (c), and (d), compared with the
traditional physical machine shown in (a).
Virtual Machines
In Figure 1.6, the host machine is equipped with the physical hardware, as
shown at the bottom of the figure. An example is an x86 architecture
desktop running its installed Windows OS, as shown in part (a) of Figure
1.6. The VM can be provisioned for any hardware system. The VM is built
with virtual resources managed by a guest OS to run a specific application.
Between the VMs and the host platform, one needs to deploy a middleware
layer called a virtual machine monitor (VMM).
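As a small, hedged example of how management software interacts with such a VMM, the sketch below queries a hypervisor through the libvirt Python bindings. It assumes a local QEMU/KVM host and the libvirt-python package; the connection URI is only one possible choice.

# A sketch of querying a hypervisor (VMM) through the libvirt Python
# bindings; assumes a local QEMU/KVM host and that libvirt-python is
# installed. The connection URI is a placeholder for other hypervisors.
import libvirt

conn = libvirt.open("qemu:///system")       # connect to the local VMM
try:
    for dom in conn.listAllDomains():       # every VM the VMM manages
        state, _ = dom.state()
        running = state == libvirt.VIR_DOMAIN_RUNNING
        print(dom.name(), "running" if running else f"state={state}")
finally:
    conn.close()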
Figure 1.6 (b) shows a native VM installed with the use of a VMM called a
hypervisor in privileged mode. For example, the hardware has the x86 architecture
running the Windows system. The guest OS could be a Linux system, and the
hypervisor could be the XEN system.
VM Primitive Operations
Virtual Infrastructures
Physical resources for compute, storage, and networking are mapped to the
needy applications embedded in various VMs at the top.
Hardware and software are then separated.
Virtual infrastructure is what connects resources to distributed applications. It
is a dynamic mapping of system resources to specific applications. The result is
decreased costs and increased efficiency and responsiveness.
Virtualization for server consolidation and containment is a good example of
this.
On top of this base environment one would build a higher level environment
reflecting the special features of the distributed computing environment. This
starts with entity interfaces and inter-entity communication, which rebuild the
top four OSI layers but at the entity and not the bit level.
The Figure below shows the layered architecture for distributed entities used in
web services and grid systems.
Here, one might get several models with, for example, JNDI (Jini and Java
Naming and Directory Interface) illustrating different approaches within the
Java distributed object model.
Figure 1.8- Layered architecture for web services and the grids.
In CORBA and Java, the distributed entities are linked with RPCs, and the
simplest way to build composite applications is to view the entities as objects
and use the traditional ways of linking them together. For Java, this could be as
simple as writing a Java program with method calls replaced by Remote
Method Invocation (RMI), while CORBA supports a similar model with a
syntax reflecting the C++ style of its entity (object) interfaces.
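The sketch below illustrates the same idea in Python rather than Java or CORBA: the standard library's xmlrpc module stands in for RMI, so a method call on a client-side proxy is carried out by a remote object. The host, port, and add() method are illustrative assumptions, not part of the systems named above.

# An analogue of the RMI idea using Python's standard library (xmlrpc):
# a method call on a client-side proxy is executed by a remote object.
# Host, port, and the add() method are illustrative only.
import threading
import time
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def serve():
    server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
    server.register_function(lambda a, b: a + b, "add")   # exposed remote method
    server.serve_forever()

threading.Thread(target=serve, daemon=True).start()
time.sleep(0.5)                            # give the server a moment to start

proxy = ServerProxy("http://localhost:8000")
print(proxy.add(2, 3))                     # looks like a local call, runs remotely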
Allowing the term "grid" to refer to a single service or to represent a collection
of services, here sensors represent entities that output data (as messages), and
grids and clouds represent collections of services that have multiple message-
based inputs and outputs.
subsequently, the knowledge for our daily use. In fact, wisdom or intelligence
is sorted out of large knowledge bases.
Finally, intelligent decisions were made based on both biological and machine
wisdom. Most distributed systems require a web interface or portal.
For raw data collected by a large number of sensors to be transformed into
useful information or knowledge, the data stream may go through a sequence
of compute, storage, filter, and discovery clouds. Finally, the inter-service
messages converge at the portal, which is accessed by all users. Two example
portals are OGFCE and HUBzero, which use both web service (portlet) and
Web 2.0 (gadget) technologies.
Many distributed programming models are also built on top of these basic
constructs.
Grids versus Clouds
The boundary between grids and clouds has been getting blurred in recent years.
For web services, workflow technologies are used to coordinate or orchestrate
services with certain specifications used to define critical business process
models such as two-phase transactions.
The general approaches used in workflow are the BPEL Web Service standard,
Pegasus, Taverna, Kepler, Trident, and Swift.
In all approaches, one is building a collection of services which together tackle
all or part of a distributed computing problem.
In general, a grid system applies static resources, while a cloud emphasizes
elastic resources. For some researchers, the differences between grids and
clouds are limited to dynamic resource allocation based on virtualization
and autonomic computing.
One can build a grid out of multiple clouds. This type of grid can do a better
job than a pure cloud, because it can explicitly support negotiated resource
allocation. Thus one may end up building a system of systems: a
cloud of clouds, a grid of clouds, a cloud of grids, or interclouds, as a basic
SOA architecture.
Figure 1.9 - The evolution of SOA: grids of clouds and grids, where "SS" refers
to a sensor service and "fs" to a filter or transforming service.
4. GRID ARCHITECTURE
Grid Architecture
A new architecture model and technology was developed for the establishment,
management, and cross-organizational resource sharing within a virtual organization.
This new architecture, called grid architecture, identifies the basic components of a
grid system, defines the purpose and functions of such components and indicates how
each of these components interacts with the others. The main focus of the
architecture is on interoperability among resource providers and users in order to
establish the sharing relationships. This interoperability requires common protocols at each layer
of the architecture model, which leads to the definition of a grid protocol architecture
as shown in Figure 1.10 below. This protocol architecture defines common
mechanisms, interfaces, schema, and protocols at each layer, by which users and
resources can negotiate, establish, manage, and share resources.
The Fabric layer defines the resources that can be shared. This could include
computational resources, data storage, networks, catalogs, and other system
resources. These resources can be physical resources or logical resources by
nature. Typical examples of the logical resources found in a Grid Computing
environment are distributed file systems, computer clusters, distributed
computer pools, software applications, and advanced forms of networking
services.
These logical resources are implemented by their own internal protocol (e.g.,
network file systems [NFS] for distributed file systems, and clusters using
logical file systems [LFS]).
These resources then comprise their own network of physical resources.
Figure 1.10 - The layered grid service protocols and their relationship with the
Internet service protocols.
grid fabric model, all grid solutions must provide integration with the local
environment and respective resources specifically engaged by the security
solution mechanisms.
User-based trust relationships: In Grid Computing, establishing an
absolute trust relationship between users and multiple service providers is very
critical. This establishes an environment in which there is no need for
interaction among the providers themselves for a user to access the resources
that each of them provides.
Data security: The data security topic is important in order to provide
data integrity and confidentiality. The data passing through the Grid
Computing solution, no matter what complications may exist, should be made
secure using various cryptographic and data encryption mechanisms. These
mechanisms are well known in the prior technological art, across all global
industries.
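A minimal sketch of one such mechanism is shown below, assuming the third-party Python cryptography package. It encrypts a payload before transfer and verifies it on receipt; in a real grid the key would be distributed by the grid security infrastructure (e.g., GSI/PKI) rather than generated locally as done here.

# A minimal sketch of protecting data in transit with symmetric
# encryption, assuming the third-party "cryptography" package is
# installed. The key is generated locally only for illustration; real
# grids delegate key handling to the grid security infrastructure.
from cryptography.fernet import Fernet

key = Fernet.generate_key()                # in practice, distributed via GSI/PKI
cipher = Fernet(key)

payload = b"experiment results: run 42"
token = cipher.encrypt(payload)            # ciphertext plus integrity tag
print(cipher.decrypt(token) == payload)    # True: confidentiality and integrity hold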
The Resource layer utilizes the communication and security protocols defined
by the networking communications layer, to control the secure negotiation,
initiation, monitoring, metering, accounting, and payment involving the sharing
of operations across individual resources.
The way this works is the Resource layer calls the Fabric layer functions in
order to access and control the multitude of local resources. This layer only
handles the individual resources and, hence, ignores the global state and atomic
actions across the other resource collection, which in the operational context is
the responsibility of the Collective layer.
There are two primary classes of resource layer protocols. These protocols are
key to the operations and integrity of any single resource. These protocols are
as follows:
Information Protocols: These protocols are used to get information
about the structure and the operational state of a single resource, including
These are user applications, which are constructed by utilizing the services
defined at each lower layer. Such an application can directly access the
resource, or can access the resource through the Collective Service interface
APIs (Application Programming Interfaces).
Each layer in the grid architecture provides a set of APIs and SDKs (software
developer kits) for the higher layers of integration. It is up to the application
developers whether they should use the collective services for general-purpose
discovery, and other high-level services across a set of resources, or if they
choose to start directly working with the exposed resources. These user-defined
grid applications are (in most cases) domain specific and provide specific
solutions.
5. GRID STANDARDS
OGSA
The Global Grid Forum has published the Open Grid Service Architecture
(OGSA). To address the requirements of grid computing in an open and
standard way requires a framework for distributed systems that support
integration, virtualization, and management. Such a framework requires a core
set of interfaces, expected behaviors, resource models, and bindings.
OGSA defines requirements for these core capabilities and thus provides
general reference architecture for grid computing environments. It identifies the
components and functions that are useful if not required for a grid environment.
Though it does not go to the level of detail such as defining programmatic
interfaces or other aspects that would guarantee interoperability between
implementations, it can be used to identify the functions that should be
included based on the requirements of the specific target environment.
OGSI
OGSA-DAI
The OGSA-DAI (data access and integration) project is concerned with constructing
middleware to assist with access and integration of data from separate data sources via
the grid.
The project was conceived by the UK Database Task Force and is working
closely with the Global Grid Forum DAIS-WG and the Globus team.
GridFTP
GridFTP can be used to move files (especially large files) across a network
efficiently and reliably. These files may include the executables required for an
application or data to be consumed or returned by an application.
Higher level services, such as data replication services, could be built on top of
GridFTP.
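The sketch below is not GridFTP itself, but it illustrates the parallel-stream idea behind it using plain HTTP range requests in Python. It assumes a server that honors Range headers; the URL, chunk size, and number of streams are illustrative placeholders.

# Not GridFTP, only an illustration of its parallel-stream idea: fetch a
# large file in ranges over several threads. Assumes the server honors
# HTTP Range requests; URL, chunk size, and stream count are placeholders.
import concurrent.futures
import urllib.request

URL = "http://example.org/big-dataset.bin"    # hypothetical source
CHUNK = 8 * 1024 * 1024                       # 8 MB per stream

def fetch_range(start: int, end: int) -> bytes:
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parallel_download(total_size: int, streams: int = 4) -> bytes:
    ranges = [(i, min(i + CHUNK, total_size) - 1)
              for i in range(0, total_size, CHUNK)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=streams) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)                    # reassemble the chunks in order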
WSRF
WSRF (the Web Services Resource Framework) defines conventions for modeling and
accessing stateful resources through Web services. Standards from the Web Services
Interoperability (WS-I) organization can also be applied to, and bring value to, grid
environments.
Working of GPU
Furthermore, each core on a GPU can handle eight threads of instructions. This
translates to having up to 1,024 threads executed concurrently on a single GPU.
This is true massive parallelism, compared to only a few threads that can be
handled by a conventional CPU. The CPU is optimized for low latency using caches,
while the GPU is optimized to deliver much higher throughput with explicit
management of on-chip memory.
Modern GPUs are not restricted to accelerated graphics or video coding. They
are used in HPC systems to power supercomputers with massive parallelism at
multicore and multithreading levels.
GPUs are designed to handle large numbers of floating-point operations in
parallel.
In a way, the GPU offloads the CPU from all data-intensive calculations, not
just those that are related to video processing.
Conventional GPUs are widely used in mobile phones, game consoles,
embedded systems, PCs, and servers. The NVIDIA CUDA Tesla or Fermi is
used in GPU clusters or in HPC systems for parallel processing of massive
floating-point data.
Figure 1.11, given below, shows the interaction between a CPU and GPU in
performing parallel execution of floating-point operations concurrently.
The CPU is the conventional multicore processor with limited parallelism to
exploit.
The GPU has a many-core architecture that has hundreds of simple processing
cores organized as multiprocessors.
Each core can have one or more threads. Essentially, the CPU's floating-point
kernel computation role is largely offloaded to the many-core GPU.
The CPU instructs the GPU to perform massive data processing.
The bandwidth must be matched between the on-board main memory and the
on-chip GPU memory. This process is carried out in NVIDIA’s CUDA
programming using the GeForce 8800 or Tesla and Fermi GPUs.
Figure 1.11 - The use of a GPU along with a CPU for massively parallel
execution in hundreds or thousands of processing cores
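A hedged sketch of this CPU-GPU interaction is given below using the Numba CUDA bindings for Python, which is only one of several possible toolchains. It assumes an NVIDIA GPU with the CUDA toolkit and the numba package installed; the array size and launch dimensions are illustrative.

# A sketch of CPU-to-GPU offloading for a data-parallel kernel using the
# Numba CUDA bindings (assumes an NVIDIA GPU, the CUDA toolkit, and the
# numba package). Sizes and launch dimensions are illustrative only.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)                     # global thread index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

d_a = cuda.to_device(a)                  # host-to-device copy (on-board to on-chip path)
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vector_add[blocks, threads](d_a, d_b, d_out)   # the CPU instructs the GPU

result = d_out.copy_to_host()            # device-to-host copy of the result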
In the future, thousand-core GPUs may appear in Exascale (Eflops or 10^18
flops) systems. This reflects a trend toward building future MPPs with hybrid
architectures of both types of processing chips.
In a DARPA report published in September 2008, four challenges are identified
for exascale computing: (1) energy and power, (2) memory and storage, (3)
concurrency and locality, and (4) system resiliency.
Power Efficiency of the GPU
Bill Dally of Stanford University considers power and massive parallelism as
the major benefits of GPUs over CPUs for the future.
By extrapolating current technology and computer architecture, it was
estimated that 60 Gflops/watt per core is needed to run an exaflops system.
Dally has estimated that the CPU chip consumes about 2 nJ/instruction, while
the GPU chip requires 200 pJ/instruction, which is about one-tenth that of the
CPU.
The CPU is optimized for latency in caches and memory, while the GPU is
optimized for throughput with explicit management of on-chip memory.
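A quick back-of-the-envelope check of these figures (a worked example, not from the source text, and assuming one instruction per flop) is shown below.

# Illustrative arithmetic only: power implied by the quoted per-instruction
# energies at an exascale rate, plus the 60 Gflops/W target budget.
EXA = 1e18                        # target rate: 10**18 operations per second

cpu_energy = 2e-9                 # 2 nJ per instruction (CPU estimate)
gpu_energy = 200e-12              # 200 pJ per instruction (GPU estimate)

print(cpu_energy * EXA / 1e9, "GW at CPU efficiency")      # ~2.0 GW
print(gpu_energy * EXA / 1e6, "MW at GPU efficiency")      # ~200 MW
print(EXA / 60e9 / 1e6, "MW at the 60 Gflops/W target")    # ~16.7 MW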
Services management: users and applications must be able to query the grid in
an effective and efficient manner
More specifically, a grid computing environment can be viewed as a computing setup
constituted by a number of logical hierarchical layers. They include grid fabric
resources, grid security infrastructure, core grid middleware, user-level middleware
and resource aggregators, grid programming environment and tools, and grid
applications. The major constituents of a grid computing system can be grouped into
various categories from different perspectives as follows:
functional view
physical view
service view
Basic constituents of a grid from a functional view are decided depending on the grid
design and its expected use. Some of the functional constituents of a grid are
Security (in the form of grid security infrastructure)
Resource Broker
Scheduler
Data Management
Job and resource management
Resources
A resource is an entity that is to be shared; this includes computers, storage,
data and software. A resource need not be a physical entity. Normally, a grid portal acts
as a user interaction mechanism which is application specific and can take many
forms.
A user-security functional block usually exists in the grid environment and is a
key requirement for grid computing.
In a grid environment, there is a need for mechanisms to provide
authentication, authorization, data confidentiality, data integrity and availability,
particularly from a user’s point of view.
In the case of inter-domain grids, there is also a requirement to support security
across organizational boundaries. This makes a centrally managed security system
impractical.
A large data center may be built with thousands of servers. Smaller data centers
are typically built with hundreds of servers. The cost to build and maintain data
center servers has increased over the years.
High-end switches or routers may be too cost-prohibitive for building data
centers. Thus, using high-bandwidth networks may not fit the economics of
cloud computing.
Given a fixed budget, commodity switches and networks are more
desirable in data centers. Similarly, using commodity x86 servers is more
desired over expensive mainframes.
The software layer handles network traffic balancing, fault tolerance, and
expandability.
Currently, nearly all cloud computing data centers use Ethernet as their
fundamental network technology.
Convergence of Technologies
Essentially, cloud computing is enabled by the convergence of
technologies in four areas:
i. hardware virtualization and multi-core chips,
ii. utility and grid computing,
iii. SOA, Web 2.0, and WS mashups, and
iv. autonomic computing and data center automation.
Hardware virtualization and multicore chips enable the existence of
dynamic configurations in the cloud.
Utility and grid computing technologies lay the necessary foundation for
computing clouds.
Recent advances in SOA, Web 2.0, and mashups of platforms are pushing
the cloud another step forward.
Finally, achievements in autonomic computing and automated data center
operations contribute to the rise of cloud computing.
Science and society face a data deluge. Data comes from sensors, lab
experiments, simulations, individual archives, and the web in all scales and
formats. Preservation, movement, and access of massive data sets require
generic tools supporting high-performance, scalable file systems,
databases, algorithms, workflows, and visualization.
With science becoming data-centric, a new paradigm of scientific
discovery is becoming based on data-intensive technologies.
On January 11, 2007, the Computer Science and Telecommunications Board
(CSTB) recommended fostering tools for data capture, data curation, and
data analysis.
A cycle of interaction exists among four technical areas.
UNIT – II
GRID SERVICES
PART – A
1. Define OGSA.
The Global Grid Forum has published the Open Grid Service Architecture (OGSA).
To address the requirements of grid computing in an open and standard way requires a
framework for distributed systems that support integration, virtualization, and
management. Such a framework requires a core set of interfaces, expected behaviors,
resource models, and bindings. OGSA defines requirements for these core capabilities
and thus provides general reference architecture for grid computing environments. It
identifies the components and functions that are useful if not required for a grid
environment.
The grid infrastructure is mainly concerned with the creation, management and the
application of dynamic coordinated resources and services which are complex. The
introduction of OGSA is to support the creation, maintenance and application of
ensembles of services maintained by virtual organizations.
Identify the use cases that can drive the OGSA platform components.
Identify and define the core OGSA platform components.
Define hosting and platform specific bindings.
Discovery of resources
Instantiating new service
Service level management to meet user expectation
Enabling metering and accounting to quantify resource usage into pricing units
Monitoring resource usage and availability
Managing service policies.
Providing service grouping and aggregation to provide better indexing and
information.
Managing end to end security
Servicing life cycle and change management
Failure provisioning management
Workload management
Load balancing to provide scalable system
6. What is OGSI?
The OGSI specification defines a component model using a Web service as its core
base technology, with WSDL as the service description mechanism and XML as the
message format. There are two dimensions to the stateful nature of a Web
service:
i. A service maintains its state information
ii. The interaction pattern between the client and service can be stateful.
8. What are the software technologies behind the OGSA?
Monadic model
Hierarchical model
Hybrid model
Federation model
10. What are the grid service features that OGSI specification defines?
Statefulness
Stateful interactions
The ability to create new instances
Service lifetime management
Notification of state changes and Grid service groups
Replication strategies determine when and where to create a replica of the data. The
factors to consider include data demand, network conditions, and transfer cost. The
strategies of replication can be classified into two method types: dynamic and static.
Applications in the grid are normally grouped into two categories: computation-
intensive and data-intensive.
Part – B
1. OGSA
OGSA
OGSA architecture is a layered architecture, as shown in Figure 2.1 below, with clear
separation of the functionalities at each layer. The purpose of the OGSA Platform is to
define standard approaches to, and mechanisms for, basic problems that are common
to a wide variety of Grid systems, such as communicating with other services,
establishing identity, negotiating authorization, service discovery, error notification,
and managing service collections.
Goals of OGSA
Identify the use cases that can drive the OGSA platform components
Identify and define the core OGSA platform components
Define hosting and platform-specific bindings
Define resource models and resource profiles with interoperable solutions
Facilitating distributed resource management across heterogeneous platforms
Providing seamless quality of service delivery
Building a common base for autonomic management solutions
Providing common infrastructure building blocks to avoid "stovepipe solution
towers"
Open and published interfaces and messages
Industry-standard integration solutions including Web services
Facilities to accomplish seamless integration with existing IT resources, where
resources become on-demand services/resources
Providing more knowledge-centric and semantic orientation of services
OGSA PLATFORM COMPONENTS: The job of the OGSA is to build on the grid
service specification (Open Grid Service Infrastructure, or OGSI) to define
architectures and standards for a set of "core grid services" that are essential
components to every grid.
A set of core OGSA use cases is developed, which forms a representative
collection from different business models (e.g., business grids and science
grids) and are used for the collection of the OGSA functional requirements.
The basic OGSA architectural organization can be classified into five layers:
o native platform services and transport mechanisms
o OGSA hosting environment
o OGSA transport and security
o OGSA infrastructure (OGSI)
o OGSA basic services (meta-OS and domain services)
Native Platform Services and Transport Mechanisms The native platforms form
the concrete resource-hosting environment. These platforms can host resources
specific to operating systems or hardware components, and the native resource
managers manage them. The transport mechanisms use existing networking services
transport protocols and standards.
Grid applications typically rely on native operating system processes as their hosting
environment, with for example creation of a new service instance involving the
creation of a new process. In such environments, a service itself may be implemented
in a variety of languages such as C, C++, Java, or Fortran.
Core Networking Services Transport and Security: An OGSA standard does not
define the specific networking services transport, nor the security mechanisms in the
specification. Instead, it assumes use of the platform-specific transport and security at
the runtime instance of operation. In other words, these properties are defined as
service binding properties, and they are dynamically bound to the native networking
services transport and security systems at runtime. These binding requirements are
flexible; however, the communities in collaboration with the hosting and platform
capabilities must work together to provide the necessary interoperability aspects.
OGSA Infrastructure:
The grid service specification developed within the OGSI working group has defined
the essential building block for distributed systems. This is defined in terms of Web
service specifications and description mechanisms (i.e., WSDL). This specification
provides a common set of behaviors and interfaces to discover a service, create a
service instance, manage the service lifecycle, and subscribe to and deliver respective
notifications.
OGSA services fall into seven broad areas, defined in terms of capabilities
frequently required in a grid scenario. The Figure 2.2 below shows the OGSA
architecture. These services are summarized as follows:
vi. Information Services provide efficient production of, and access to, information
about the grid and its constituent resources. The term "information" refers to
dynamic data or events used for status monitoring; relatively static data used for
discovery; and any data that is logged. Troubleshooting is just one of the possible
uses for information provided by these services.
3. OGSI
The OGSI specification defines a component model using a Web service as its core
base technology, with WSDL as the service description mechanism and XML as the
message format. Web services in general are dealing with stateless services, and their
client interaction is mostly stateless. On the other hand, grid services are long-
running processes, maintaining the state of the resource being shared, and the clients are
involved in a stateful interaction with the services. There are two dimensions to the
stateful nature of a Web service:
i. A service maintains its state information.
ii. The interaction pattern between the client and the service can be stateful.
The Figure shown introduces a number of concepts surrounding OGSI, and its
relation to Web services. The following list describes points of interest related
to this model. Grid services are layered on top of Web services.
Grid services contain application state factors, and provide concepts for
exposing the state, which is referred to as the service data element.
Both grid services and Web services communicate with their clients by
exchanging XML messages.
Grid services are described using GWSDL, which is an extension of WSDL.
GWSDL provides interface inheritance and open port type for exposing the
service state information referred to as service data. This is similar to interface
properties or attributes commonly found in other distributed description
languages.
The client programming model is the same for both grid service and Web
service. But grid services provide additional message exchange patterns such as
the handle resolution through OGSI port types.
The transport bindings are selected by the runtime. Message encoding and
decoding is done for the specific binding and high-level transport protocol
(SOAP/HTTP).
Web service: A software component identified using a URI, whose public interfaces
and binding are described using XML. These services interact with their clients using
XML message exchanges.
Stateful Web service: A Web service that maintains some state information between
clients' interactions.
Grid service: This is a stateful Web service with a common set of public operations
and state behaviors exposed by the service. These services are created using the
OGSI-defined specification.
Grid service instance: An instance of a grid service created by the hosting container
and identified by a unique URI called grid service handle (GSH).
Service data element: This is publicly accessible state information of a service,
included with the WSDL portType. Service data elements can be treated as interface attributes.
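The following sketch is an illustration only, not the OGSI API: it models a grid-service-like object that keeps state between client interactions, exposes that state as service data, and is addressed by a unique handle (GSH). All class, method, and URI names are hypothetical.

# Illustration only (not the OGSI API): a grid-service-like object that
# keeps state between client interactions, exposes it as "service data",
# and is addressed by a unique handle (GSH). All names are hypothetical.
import uuid

class GridServiceInstance:
    def __init__(self, factory_uri: str):
        self.gsh = f"{factory_uri}/instances/{uuid.uuid4()}"        # unique handle
        self.service_data = {"jobsSubmitted": 0, "state": "idle"}   # exposed state

    def submit_job(self, job_name: str) -> str:
        # A stateful interaction: each call updates the retained state.
        self.service_data["jobsSubmitted"] += 1
        self.service_data["state"] = "running"
        return f"{job_name} accepted by {self.gsh}"

svc = GridServiceInstance("http://grid.example.org/JobFactory")
svc.submit_job("render-frames")
print(svc.gsh, svc.service_data)   # handle plus publicly readable service data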
Technical Details of OGSI Specification: OGSI is based on Web services and it uses
WSDL as a mechanism to describe the public interfaces of the grid service. There are
two core requirements for describing Web services based on the OGSI:
The ability to describe interface inheritance
The ability to describe additional information elements (state
data/attributes/properties) with the interface definitions
Similar to most Web services, OGSI services use WSDL as a service description
mechanism, but the current WSDL 1.1 specification lacks the above two capabilities
in its definition of portType. The WSDL 1.2 working group has agreed to support
these features through portType (now called "interface" in WSDL 1.2) inheritance
and an open content model for portTypes. As an interim solution, OGSI developed a
new schema for portType definition (extended from the normal WSDL 1.1 portType
schema type) under a new namespace definition, GWSDL.
Another important aspect of the OGSI is the naming convention adopted for the
portType operations, and the lack of support for operator overloading.
In these situations, the OGSI follows the same conventions as described in the
suggested WSDL 1.2 specification.
This now becomes rather complex across several different dimensions,
especially in the context of interface inheritance, and the process of
transformation to a single inheritance model as previously described.
In these kinds of situations, the OGSI recommendations have to be adhered to.
The OGSI recommends that if two or more port type operation components
have the same value for their name and target namespace, then the component
model (i.e., the semantic and operation signature) for these operations must be
identical.
Furthermore, if the port type operation components are equivalent, then they
can be considered as candidates to collapse into a single operation.
This data access method is also known as caching, which is often applied to
enhance data efficiency in a grid environment.
By replicating the same data blocks and scattering them in multiple regions of a
grid, users can access the same data with locality of references. Furthermore,
the replicas of the same data set can be a backup for one another.
Some key data will not be lost in case of failures. However, data replication
may demand periodic consistency checks.
The increase in storage requirements and network bandwidth may cause
additional problems.
Replication strategies determine when and where to create a replica of the data.
The factors to consider include data demand, network conditions, and transfer
cost.
The strategies of replication can be classified into two method types: dynamic and
static.
For the static method, the locations and number of replicas are determined in
advance and will not be modified. Although replication operations require little
overhead, static strategies cannot adapt to changes in demand, bandwidth, and
storage availability.
Dynamic strategies can adjust locations and number of data replicas according
to changes in conditions (e.g., user behavior). However, frequent data-moving
operations can result in much more overhead than in static strategies.
The replication strategy must be optimized with respect to the status of data
replicas.
For static replication, optimization is required to determine the location and
number of data replicas.
For dynamic replication, optimization may be determined based on whether the
data replica is being created, deleted, or moved.
The most common replication strategies include preserving locality,
minimizing update costs, and maximizing profits.
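As an illustration of a dynamic strategy, the toy policy below creates a replica at a site when observed demand exceeds a threshold and the estimated transfer cost is acceptable. The thresholds, site names, and cost figures are invented for illustration.

# A toy dynamic replication policy (illustrative thresholds and site
# names only): replicate where demand is high and transfer cost is low,
# reflecting the factors described above.
def plan_replicas(access_counts, transfer_cost, demand_threshold=100, max_cost=10.0):
    """Return the sites that should receive a new replica."""
    targets = []
    for site, demand in access_counts.items():
        if demand >= demand_threshold and transfer_cost[site] <= max_cost:
            targets.append(site)
    return targets

access_counts = {"site-A": 250, "site-B": 40, "site-C": 130}   # observed demand
transfer_cost = {"site-A": 3.5, "site-B": 1.0, "site-C": 12.0}  # estimated cost
print(plan_replicas(access_counts, transfer_cost))   # ['site-A']: high demand, low cost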
In general, there are four access models for organizing a data grid, as shown in
Figure 2.5 below.
Monadic model: This is a centralized data repository model, shown in Figure 2.5 a.
All the data is saved in a central data repository. When users want to access some data
they have to submit requests directly to the central repository. No data is replicated for
preserving data locality. This model is the simplest to implement for a small grid. For
a large grid, this model is not efficient in terms of performance and reliability. Data
replication is permitted in this model only when fault tolerance is demanded.
Hierarchical model: The hierarchical model, shown in Figure 2.5 (b), is suitable for
building a large data grid which has only one large data access directory. The data
may be transferred from the source to a second level center. Then some data in the
regional center is transferred to the third-level center. After being forwarded several
times, specific data objects are accessed directly by users. Generally speaking, a
higher-level data center has a wider coverage area. It provides higher bandwidth for
access than a lower-level data center. PKI security services are easier to implement in
this hierarchical data access model.
Federation model: This data access model shown in Figure 2.5 (c) is better suited for
designing a data grid with multiple sources of data supplies. Sometimes this model is
also known as a mesh model. The data sources are distributed to many different
locations. Although the data is shared, the data items are still owned and controlled by
their original owners. According to predefined access policies, only authenticated
users are authorized to request data from any data source. This mesh model may cost
the most when the number of grid institutions becomes very large.
Hybrid model: This data access model is shown in Figure 2.5 (d). The model
combines the best features of the hierarchical and mesh models. Traditional data
transfer technology, such as FTP, applies for networks with lower bandwidth.
Network links in a data grid often have fairly high bandwidth, and other data transfer
models are exploited by high-speed data transfer tools such as GridFTP developed
with the Globus library. The cost of the hybrid model can be traded off between the
two extreme models for hierarchical and mesh connected grids.
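The sketch below illustrates how a lookup might proceed under the hierarchical model: a request is served from the nearest tier that holds the data and is otherwise forwarded to the next level. Tier names and holdings are hypothetical.

# An illustrative lookup in the hierarchical access model described above:
# try the nearest tier first and forward the request upward until a copy
# is found. Tier names and contents are hypothetical.
TIERS = [
    {"name": "local-site",      "holdings": {"calib-2016"}},
    {"name": "regional-center", "holdings": {"calib-2016", "run-0421"}},
    {"name": "root-repository", "holdings": {"calib-2016", "run-0421", "raw-archive"}},
]

def fetch(dataset: str) -> str:
    for tier in TIERS:                      # lowest (nearest) tier first
        if dataset in tier["holdings"]:
            return f"{dataset} served from {tier['name']}"
    raise KeyError(f"{dataset} not found in any tier")

print(fetch("run-0421"))     # found at the regional center
print(fetch("raw-archive"))  # only the root repository holds it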
SECURITY MODELS
Explain in detail about grid service handle, grid service migration, OGSA
security models.
Figure 2.6 - The OGSA security model implemented at various protection levels
UNIT III
VIRTUALIZATION
Part – A
IaaS
PaaS
SaaS
6. Define VMM.
The hardware-level virtualization inserts a layer between real hardware and
traditional operating systems. This layer is commonly called the Virtual
Machine Monitor (VMM) and it manages the hardware resources of a
computing system. Each time programs access the hardware the VMM captures
the process. In this sense, the VMM acts as a traditional OS.
Data center automation means that huge volumes of hardware, software, and
database resources in these data centers can be allocated dynamically to millions of
Internet users simultaneously, while guaranteeing their performance and QoS.
PART B
Cloud computing supports any IT service that can be consumed as a utility and
delivered through a network, most likely the Internet. Such characterization includes
quite different aspects: infrastructure, development platforms, applications and
services.
It is possible to organize all the concrete realizations of cloud computing into a
layered view covering the entire stack from hardware appliances to software systems.
Cloud resources are harnessed to offer "computing horsepower" required for
providing services. Often, this layer is implemented using a data center in which
hundreds and thousands of nodes are stacked together. Cloud infrastructure can be
heterogeneous in nature because a variety of resources, such as clusters and even
networked PCs, can be used to build it. Moreover, database systems and other storage
services can also be part of the infrastructure.
The physical infrastructure is managed by the core middleware, the objectives
of which are to provide an appropriate runtime environment for applications and to
best utilize resources. At the bottom of the stack, virtualization technologies are used
to guarantee runtime environment customization, application isolation, sandboxing,
and quality of service. Hardware virtualization is most commonly used at this level.
Hypervisors manage the pool of resources and expose the distributed
infrastructure as a collection of virtual machines. By using virtual machine technology
it is possible to finely partition the hardware resources such as CPU and memory and
to virtualize specific devices, thus meeting the requirements of users and applications.
This solution is generally paired with storage and network virtualization strategies,
which allow the infrastructure to be completely virtualized and controlled.
Some IaaS solutions provide both the management layer and the physical
infrastructure; others provide only the management layer (IaaS (M)).
In this second case, the management layer is often integrated with other IaaS
solutions that provide physical infrastructure and adds value to them. IaaS solutions
are suitable for designing the system infrastructure but provide limited services to
build applications. Such service is provided by cloud programming environments and
tools, which form a new layer for offering users a development platform for
applications.
The range of tools includes Web-based interfaces, command-line tools, and
frameworks for concurrent and distributed programming. In this scenario, users
develop their applications specifically for the cloud by using the API exposed at the
user-level middleware. For this reason, this approach is also known as Platform-as-a-
Service (PaaS) because the service offered to the user is a development platform rather
than an infrastructure.
PaaS solutions generally include the infrastructure as well, which is bundled as
part of the service provided to users. In the case of Pure PaaS, only the user-level
middleware is offered, and it has to be complemented with a virtual or physical
infrastructure. The top layer of the reference model depicted in Figure 3.1 contains
services delivered at the application level. These are mostly referred to as
Software-as-a-Service (SaaS).
In most cases these are Web-based applications that rely on the cloud to
provide service to end users. The horsepower of the cloud provided by IaaS and PaaS
solutions allows independent software vendors to deliver their application services
over the Internet. Other applications belonging to this layer are those that strongly
leverage the Internet for their core functionalities that rely on the cloud to sustain a
larger number of users; this is the case of gaming portals and, in general, social
networking websites.
SaaS implementations should feature such behavior automatically, whereas
PaaS and IaaS generally provide this functionality as a part of the API exposed to
users. The reference model also introduces the concept of Everything as a Service
(XaaS).
This is one of the most important elements of cloud computing: Cloud services
from different providers can be combined to provide a completely integrated solution
covering all the computing stack of a system. IaaS providers can offer the bare metal
in terms of virtual machines where PaaS solutions are deployed.
When there is no need for a PaaS layer, it is possible to directly customize the
virtual infrastructure with the software stack needed to run applications. This is the
case of virtual Web farms: a distributed system composed of Web servers, database
servers, and load balancers on top of which prepackaged software is installed to run
Web applications. This possibility has made cloud computing an interesting option for
reducing startups’ capital investment in IT, allowing them to quickly commercialize
their ideas and grow their infrastructure according to their revenues.
Public clouds: Public clouds constitute the first expression of cloud computing. They
are a realization of the canonical view of cloud computing in which the services
offered are made available to anyone, from anywhere, and at any time through the
Internet.
From a structural point of view they are a distributed system, most likely
composed of one or more data centers connected together, on top of which the specific
services offered by the cloud are implemented. Any customer can easily sign in with
the cloud provider, enter her credential and billing details, and use the services
offered.
Public clouds were the first class of cloud that were implemented and offered.
They offer solutions for minimizing IT infrastructure costs and serve as a viable
option for handling peak loads on the local infrastructure. They have become an
interesting option for small enterprises, which are able to start their businesses without
Singapore, and Australia; they allow their customers to choose between three different
regions: us-west-1, us-east-1, or eu-west-1.
Such regions are priced differently and are further divided into availability
zones, which map to specific datacenters. According to the specific class of services
delivered by the cloud, a different software stack is installed to manage the
infrastructure: virtual machine managers, distributed middleware, or distributed
applications.
Private clouds: Public clouds are appealing and provide a viable option to cut IT
costs and reduce capital expenses, but they are not applicable in all scenarios. For
example, a very common critique to the use of cloud computing in its canonical
implementation is the loss of control. In the case of public clouds, the provider is
in control of the infrastructure and, eventually, of the customers’ core logic and
sensitive data. Even though there could be regulatory procedures in place that
guarantee fair management and respect of the customer's privacy, this condition can
still be perceived as a threat or as an unacceptable risk that some organizations are not
willing to take.
In particular, institutions such as government and military agencies will not
consider public clouds as an option for processing or storing their sensitive data. The
risk of a breach in the security infrastructure of the provider could expose such
information to others; this could simply be considered unacceptable.
In other cases, the loss of control of where your virtual IT infrastructure resides
could open the way to other problematic situations. More precisely, the geographical
location of a data center generally determines the regulations that are applied to
management of digital information.
As a result, according to the specific location of data, some sensitive
information can be made accessible to government agencies or even considered
outside the law if processed with specific cryptographic techniques. For example, the
USA PATRIOT Act provides the U.S. government and other agencies with virtually
limitless powers to access information, including that belonging to any company that
stores information in the U.S. territory.
Finally, existing enterprises that have large computing infrastructures or large
installed bases of software do not simply want to switch to public clouds; rather, they
want to use the existing IT resources and optimize their revenue. All these aspects make the use of
a public computing infrastructure not always possible.
More specifically, having an infrastructure able to deliver IT services on
demand can still be a winning solution, even when implemented within the private
premises of an institution. This idea led to the diffusion of private clouds, which are
similar to public clouds, but their resource-provisioning model is limited within the
boundaries of an organization.
Private clouds are virtual distributed systems that rely on a private
infrastructure and provide internal users with dynamic provisioning of computing
resources. Instead of a pay-as-you-go model as in public clouds, there could be other
schemes in place, taking into account the usage of the cloud and proportionally billing
the different departments or sections of an enterprise.
Private clouds have the advantage of keeping the core business operations in-house
by relying on the existing IT infrastructure and reducing the burden of
maintaining it once the cloud has been set up. In this scenario, security concerns are
less critical, since sensitive information does not flow out of the private infrastructure.
Moreover, existing IT resources can be better utilized because the private cloud
can provide services to a different range of users. Another interesting opportunity that
comes with private clouds is the possibility of testing applications and systems at a
comparatively lower price rather than public clouds before deploying them on the
public virtual infrastructure.
A Forrester report on the benefits of delivering in-house cloud computing
solutions for enterprises highlighted some of the key advantages of using a private
cloud computing infrastructure:
Customer information protection: Despite assurances by the public cloud
leaders about security, few provide satisfactory disclosure or have long enough
histories with their cloud offerings to provide warranties about the specific level of
security put in place on their systems. In-house security is easier to maintain and
rely on.
IT capital costs and have just started considering their IT needs (i.e., start-ups), in
most cases the private cloud option prevails because of the existing IT infrastructure.
Private clouds are the perfect solution when it is necessary to keep the
processing of information within an enterprise’s premises or it is necessary to use the
existing hardware and software infrastructure. One of the major drawbacks of private
deployments is the inability to scale on demand and to efficiently address peak loads.
used to perform operations with less stringent constraints but that are still part of the
system workload.
It is a heterogeneous distributed system resulting from a private cloud that
integrates additional services or resources from one or more public clouds. For this
reason they are also called heterogeneous clouds. As depicted in the diagram, dynamic
provisioning is a fundamental component in this scenario.
Hybrid clouds address scalability issues by leveraging external resources for
exceeding capacity demand. These resources or services are temporarily leased for the
time required and then released. This practice is also known as cloud bursting.
Whereas the concept of hybrid cloud is general, it mostly applies to IT
infrastructure rather than software services. Service-oriented computing already
introduces the concept of integration of paid software services with existing
application deployed in the private premises.
Infrastructure management software such as OpenNebula already exposes the
capability of integrating resources from public clouds such as Amazon EC2. In this
case the virtual machine obtained from the public infrastructure is managed as all the
other virtual machine instances maintained locally. What is missing is then an
advanced scheduling engine that’s able to differentiate these resources and provide
smart allocations by taking into account the budget available to extend the existing
infrastructure.
In the case of OpenNebula, advanced schedulers such as Haizea can be
integrated to provide cost-based scheduling. A different approach is taken by
InterGrid. This is essentially a distributed scheduling engine that manages the
allocation of virtual machines in a collection of peer networks.
Dynamic provisioning is most commonly implemented in PaaS solutions that
support hybrid clouds. As previously discussed, one of the fundamental components
of PaaS middleware is the mapping of distributed applications onto the cloud
infrastructure. In this scenario, the role of dynamic provisioning becomes fundamental
to ensuring the execution of applications under the QoS agreed on with the user.
For example, Aneka provides a provisioning service that leverages different
IaaS providers for scaling the existing cloud infrastructure. The provisioning service
cooperates with the scheduler, which is in charge of guaranteeing a specific QoS for
applications. In particular, each user application has a budget attached, and the
scheduler uses that budget to optimize the execution of the application by renting
virtual nodes if needed.
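The budget-driven decision described above can be sketched as follows. This is not the Aneka API; it is only an illustrative outline with hypothetical class and method names.
// Hypothetical sketch of budget-based dynamic provisioning; not the Aneka API.
class BudgetAwareScheduler {
    static class Application {
        double remainingBudget;   // currency units the user is willing to spend
        int pendingTasks;         // tasks still waiting to be executed
    }
    // Decide how many extra virtual nodes to rent from an IaaS provider.
    int nodesToRent(Application app, double costPerNodeHour, int localFreeNodes) {
        if (app.pendingTasks <= localFreeNodes) {
            return 0;                                   // local capacity is enough
        }
        int needed = app.pendingTasks - localFreeNodes;
        int affordable = (int) (app.remainingBudget / costPerNodeHour);
        return Math.min(needed, affordable);            // never exceed the budget
    }
}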
Community clouds: Community clouds are distributed systems created by
integrating the services of different clouds to address the specific needs of an industry,
a community, or a business sector. The National Institute of Standards and
Technology (NIST) characterizes community clouds as follows:
The infrastructure is shared by several organizations and supports a specific
community that has shared concerns (e.g., mission, security requirements, policy, and
compliance considerations). It may be managed by the organizations or a third party
and may exist on premise or off premise.
The users of a specific community cloud fall into a well-identified community,
sharing the same concerns or needs; they can be government bodies, industries, or
even simple users, but all of them focus on the same issues for their interaction with
the cloud. This is a different scenario than public clouds, which serve a multitude of
users with different needs.
Community clouds are also different from private clouds, where the services
are generally delivered within the institution that owns the cloud. From an
architectural point of view, a community cloud is most likely implemented over
multiple administrative domains. This means that different organizations such as
government bodies, private enterprises, research organizations, and even public virtual
infrastructure providers contribute with their resources to build the cloud
infrastructure. Candidate sectors for community clouds are as follows:
Media industry: In the media industry, companies are looking for low-cost,
agile, and simple solutions to improve the efficiency of content production. Most
media productions involve an extended ecosystem of partners. In particular, the
creation of digital content is the outcome of a collaborative process that includes
movement of large data, massive compute-intensive rendering tasks, and complex
workflow executions. Community clouds can provide a shared environment where
services can facilitate business-to-business collaboration and offer the horsepower in
terms of aggregate bandwidth, CPU, and storage required to efficiently support media
production.
Healthcare industry: In the healthcare industry, there are different
scenarios in which community clouds could be of use. In particular, community
clouds can provide a global platform on which to share information and knowledge
without revealing sensitive data maintained within the private infrastructure. The
naturally hybrid deployment model of community clouds can easily support the
storing of patient-related data in a private cloud while using the shared infrastructure
for noncritical services and automating processes within hospitals.
Energy and other core industries: In these sectors, community clouds
can bundle the comprehensive set of solutions that together vertically address
management, deployment, and orchestration of services and operations. Since these
industries involve different providers, vendors, and organizations, a community cloud
can provide the right type of infrastructure to create an open and fair market.
Public sector: Legal and political restrictions in the public sector can
limit the adoption of public cloud offerings. Moreover, governmental processes
involve several institutions and agencies and are aimed at providing strategic solutions
at local, national, and international administrative levels. They involve business-to-
administration, citizen-to-administration, and possibly business-to-business processes.
Some examples include invoice approval, infrastructure planning, and public hearings.
A community cloud can constitute the optimal venue to provide a distributed
environment in which to create a communication platform for performing such
operations.
Scientific research: Science clouds are an interesting example of
community clouds. In this case, the common interest driving different organizations
sharing a large distributed infrastructure is scientific computing.
The benefits of these community clouds are the following:
Openness: By removing the dependency on cloud vendors, community
clouds are open systems in which fair competition between different solutions can
happen.
Explain about the various services provided by cloud computing in detail. (or)
Describe about Everything as a Service in the cloud environment in detail.
i. Infrastructure as a service: Infrastructure- and Hardware-as-a-Service (IaaS/HaaS)
solutions deliver virtual hardware on demand; one or more virtual machines suitably
configured and interconnected define the distributed system on top of which applications are
installed and deployed. Virtual machines also constitute the atomic components that
are deployed and priced according to the specific features of the virtual hardware:
memory, number of processors, and disk storage.
IaaS/HaaS solutions bring all the benefits of hardware virtualization: workload
partitioning, application isolation, sandboxing, and hardware tuning. From the
perspective of the service provider, IaaS/HaaS allows better exploitation of the IT
infrastructure and provides a more secure environment for executing third-party
applications.
From the perspective of the customer it reduces the administration and
maintenance cost as well as the capital costs allocated to purchase hardware. At the
same time, users can take advantage of the full customization offered by virtualization
to deploy their infrastructure in the cloud; in most cases virtual machines come with
only the selected operating system installed, and the system can then be configured
with all the required packages and applications. The user interface to these services is
generally based on Web 2.0 technologies: Web services, RESTful APIs, and mash-ups.
These technologies allow either applications or final users to access the services
exposed by the underlying infrastructure.
In particular, management of the virtual machines is the most important function
performed by this layer. A central role is played by the scheduler, which is in charge
of allocating the execution of virtual machine instances. The scheduler interacts with
the other components that perform a variety of tasks:
The pricing and billing component takes care of the cost of executing
each virtual machine instance and maintains data that will be used to charge the user.
The monitoring component tracks the execution of each virtual machine
instance and maintains data required for reporting and analyzing the performance of
the system.
The reservation component stores the information of all the virtual
machine instances that have been executed or that will be executed in the future.
The QoS/SLA management component maintains a repository of the service-level
agreements attached to each virtual machine instance; together with the monitoring
component, it is used to ensure that a given virtual machine instance is executed with
the desired quality of service. The VM repository
component provides a catalog of virtual machine images that users can use to create
virtual instances. Some implementations also allow users to upload their specific
virtual machine images.
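To make the division of responsibilities concrete, the following is a minimal sketch of how a scheduler might coordinate with the pricing/billing, monitoring, reservation, and VM repository components described above. The interfaces and names are hypothetical and do not belong to any particular IaaS product.
// Hypothetical interfaces illustrating the roles of the IaaS management components.
interface VMRepository    { String findImage(String name); }             // catalog of VM images
interface Reservation     { void record(String instanceId); }            // past and future instances
interface Monitoring      { double cpuHours(String instanceId); }        // execution tracking
interface PricingBilling  { void charge(String instanceId, double cpuHours); }

class Scheduler {
    private final VMRepository repo;
    private final Reservation reservations;
    private final Monitoring monitoring;
    private final PricingBilling billing;

    Scheduler(VMRepository r, Reservation res, Monitoring m, PricingBilling b) {
        repo = r; reservations = res; monitoring = m; billing = b;
    }

    // Allocate the execution of a new VM instance from a named image.
    String launch(String imageName) {
        String image = repo.findImage(imageName);
        String instanceId = image + "-" + System.nanoTime();  // placeholder for real allocation
        reservations.record(instanceId);
        return instanceId;
    }

    // Invoked periodically: charge each instance for the resources it consumed.
    void bill(String instanceId) {
        billing.charge(instanceId, monitoring.cpuHours(instanceId));
    }
}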
with Enomaly, Elastra, Eucalyptus, OpenNebula, and specific IaaS (M) solutions from
VMware, IBM, and Microsoft.
Finally, the reference architecture applies to IaaS implementations that provide
computing resources, especially for the scheduling component. If storage is the main
service provided, it is still possible to distinguish these three layers. The role of
infrastructure management software is not to keep track and manage the execution of
virtual machines but to provide access to large infrastructures and implement storage
virtualization solutions on top of the physical layer.
ii. Platform as a service: Platform-as-a-Service (PaaS) solutions provide a
development and deployment platform for running applications in the cloud. They
constitute the middleware on top of which applications are built.
Application management is the core functionality of the middleware. PaaS
implementations provide applications with a runtime environment and do not expose
any service for managing the underlying infrastructure. They automate the process of
deploying applications to the infrastructure, configuring application components,
provisioning and configuring supporting technologies such as load balancers and
databases, and managing system change based on policies set by the user.
The specific development model decided for applications determines the
interface exposed to the user. Some implementations provide a completely Web-based
interface hosted in the cloud and offering a variety of services. It is possible to find
integrated developed environments based on 4GL and visual programming concepts,
or rapid prototyping environments where applications are built by assembling mash-
ups and user-defined components and successively customized.
Other implementations of the PaaS model provide a complete object model for
representing an application and provide a programming language-based approach.
This approach generally offers more flexibility and opportunities but incurs longer
development cycles. Developers generally have the full power of programming
languages such as Java, .NET, Python, or Ruby, with some restrictions to provide
better scalability and security.
In this case the traditional development environments can be used to design and
develop applications, which are then deployed on the cloud by using the APIs exposed
by the PaaS provider. Specific components can be offered together with the
development libraries for better exploiting the services offered by the PaaS
environment. Sometimes a local runtime environment that simulates the conditions of
the cloud is given to users for testing their applications before deployment. This
environment can be restricted in terms of features, and it is generally not optimized for
scaling.
It is possible to organize the various solutions into three wide categories: PaaS-
I, PaaS-II, and PaaS-III.
The first category identifies PaaS implementations that completely follow the
cloud computing style for application development and deployment. They offer an
integrated development environment hosted within the Web browser where
applications are designed, developed, composed, and deployed. This is the case of
Force.com and LongJump. Both deliver as platforms the combination of middleware
and infrastructure.
In the second class we can list all those solutions that are focused on providing
a scalable infrastructure for Web application, mostly websites. In this case, developers
generally use the providers’ APIs, which are built on top of industrial runtimes, to
develop applications.
Google AppEngine is the most popular product in this category. It provides a scalable
runtime based on the Java and Python programming languages, which have been
modified for providing a secure runtime environment and enriched with additional
APIs and components to support scalability.
AppScale, an open-source implementation of Google AppEngine, provides
interface- compatible middleware that has to be installed on a physical infrastructure.
Joyent Smart Platform provides a similar approach to Google AppEngine. A different
approach is taken by Heroku and Engine Yard, which provide scalability support for
Ruby- and Ruby on Rails-based websites.
The third category consists of all those solutions that provide a cloud
programming platform for any kind of application, not only Web applications. Among
these, the most popular is Microsoft Windows Azure, which provides a
For example, computer users gained access to a much enlarged memory space when
the concept of virtual memory was introduced. Similarly, virtualization techniques can
be applied to enhance the use of compute engines, networks, and storage.
Levels of Virtualization:
A traditional computer runs with a host operating system specially tailored for
its hardware architecture, as shown in Figure 3.6 (a). After virtualization, different
user applications managed by their own operating systems (guest OS) can run on the
same hardware, independent of the host OS.
This is often done by adding additional software, called a virtualization layer as
shown in Figure 3.6 (b). This virtualization layer is known as hypervisor or virtual
machine monitor (VMM) .The VMs are shown in the upper boxes, where applications
run with their own guest OS over the virtualized CPU, memory, and I/O resources.The
main function of the software layer for virtualization is to virtualize the physical
hardware of a host machine into virtual resources to be used by the VMs, exclusively.
The virtualization software creates the abstraction of VMs by interposing a
virtualization layer at various levels of a computer system. Common virtualization
layers include the instruction set architecture(ISA) level, hardware level, operating
system level, library support level, and application level.
Library Support Level: Most applications use APIs exported by user level libraries
rather than using lengthy system calls by the OS. Since most systems provide well-
documented APIs, such an interface becomes another candidate for virtualization.
Virtualization with library interfaces is possible by controlling the
communication link between applications and the rest of a system through API
hooks. The software tool WINE has implemented this approach to support Windows
applications on top of UNIX hosts. Another example is vCUDA, which
allows applications executing within VMs to leverage GPU hardware acceleration.
The Xen Architecture: The core components of a Xen system are the hypervisor,
kernel, and applications. The organization of the three components is important. Like
other virtualization systems, many guest OSes can run on top of the hypervisor.
However, not all guest OSes are created equal, and one in particular controls the
others.
The guest OS, which has control ability, is called Domain 0, and the others are
called Domain U. Domain 0 is a privileged guest OS of Xen. It is first loaded when
Xen boots without any file system drivers being available. Domain 0 is designed to
access hardware directly and manage devices. Therefore, one of the responsibilities of
Domain 0 is to allocate and map hardware resources for the guest domains (the
Domain U domains).
KVM (Kernel-Based Virtual Machine) is part of the Linux kernel; memory management
and scheduling activities are carried out by the existing Linux kernel, and KVM does the
rest, which makes it simpler than a hypervisor that controls the entire machine.
KVM is a hardware-assisted para-virtualization tool, which improves
performance and supports unmodified guest OSes such as Windows, Linux, Solaris,
and other UNIX variants. Unlike the full virtualization architecture which intercepts
and emulates privileged and sensitive instructions at runtime, para-virtualization
handles these instructions at compile time.
The guest OS kernel is modified to replace the privileged and sensitive
instructions with hypercalls to the hypervisor or VMM. Xen assumes such a para-
virtualization architecture. The guest OS running in a guest domain may run at Ring 1
instead of at Ring 0. This implies that the guest OS may not be able to execute some
privileged and sensitive instructions.The privileged instructions are implemented by
hypercalls to the hypervisor. After replacing the instructions with hypercalls, the
modified guest OS emulates the behavior of the original guest OS.
All the privileged and sensitive instructions are trapped in the hypervisor
automatically. This technique removes the difficulty of implementing binary
translation of full virtualization. It also lets the operating system run in VMs without
modification.
Memory Virtualization:
Virtual memory virtualization is similar to the virtual memory support provided
by modern operating systems. In a traditional execution environment, the operating
system maintains mappings of virtual memory to machine memory using page tables,
which is a one-stage mapping from virtual memory to machine memory.
All modern x86 CPUs include a memory management unit (MMU) and a
translation lookaside buffer (TLB) to optimize virtual memory performance.
However, in a virtual execution environment, virtual memory virtualization involves
sharing the physical system memory in RAM and dynamically allocating it to the
physical memory of the VMs. That means a two-stage mapping process should be
maintained by the guest OS and the VMM, respectively: virtual memory to physical
memory and physical memory to machine memory. Furthermore, MMU virtualization
should be supported, which is transparent to the guest OS. The guest OS continues to
control the mapping of virtual addresses to the physical memory addresses of VMs.
But the guest OS cannot directly access the actual machine memory.
The VMM is responsible for mapping the guest physical memory to the actual
machine memory. Figure 3.11 shows the two-level memory mapping procedure.
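A minimal sketch of the two-stage mapping, with hypothetical page tables held in maps: the guest OS maps virtual pages to guest-physical pages, and the VMM maps guest-physical pages to machine pages.
// Illustrative two-stage address mapping; the page numbers and tables are hypothetical.
import java.util.HashMap;
import java.util.Map;

public class TwoStageMapping {
    // Guest OS page table: guest virtual page -> guest "physical" page
    static Map<Long, Long> guestPageTable = new HashMap<>();
    // VMM table: guest "physical" page -> actual machine page
    static Map<Long, Long> vmmPageTable = new HashMap<>();

    static long translate(long guestVirtualPage) {
        Long guestPhysical = guestPageTable.get(guestVirtualPage);  // stage 1: guest OS
        Long machine = vmmPageTable.get(guestPhysical);             // stage 2: VMM
        return machine;
    }

    public static void main(String[] args) {
        guestPageTable.put(0x10L, 0x20L);  // guest maps virtual page 0x10 to guest-physical 0x20
        vmmPageTable.put(0x20L, 0x7fL);    // VMM maps guest-physical 0x20 to machine page 0x7f
        System.out.println(Long.toHexString(translate(0x10L)));     // prints 7f
    }
}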
I/O Virtualization: I/O virtualization involves managing the routing of I/O requests
between virtual devices and the shared physical hardware. There are three ways to
implement I/O virtualization:
Full device emulation
Para virtualization
Direct I/O
Full device emulation is the first approach for I/O virtualization. Generally, this
approach emulates well known, real-world devices. All the functions of a device or
bus infrastructure, such as device enumeration, identification, interrupts, and DMA,
are replicated in software. This software is located in the VMM and acts as a virtual
device. The I/O access requests of the guest OS are trapped in the VMM which
interacts with the I/O devices.
A single hardware device can be shared by multiple VMs that run concurrently.
However, software emulation runs much slower than the hardware it emulates. The
para virtualization method of I/O virtualization is typically used in Xen. It is also
known as the split driver model consisting of a frontend driver and a backend driver.
The frontend driver is running in Domain U and the backend driver is running
in Domain 0. They interact with each other via a block of shared memory. The
frontend driver manages the I/O requests of the guest OSes and the backend driver is
responsible for managing the real I/O devices and multiplexing the I/O data of
different VMs.
Although para I/O-virtualization achieves better device performance than full
device emulation, it comes with a higher CPU overhead.
Figure 3.12 Device emulation for I/O virtualization implemented inside the
middle layer that maps real I/O devices into the virtual devices for the guest
device driver to use.
Virtualization in Multi-Core Processors: Virtualizing a multi-core processor is
relatively more complicated than virtualizing a unicore processor. Though multicore
processors are claimed to have higher performance by integrating multiple processor
cores in a single chip, multi-core virtualization has raised some new challenges to
computer architects, compiler constructors, system designers, and application
programmers.
There are mainly two difficulties: Application programs must be parallelized to
use all cores fully, and software must explicitly assign tasks to the cores, which is a
very complex problem.
The Amazon Elastic Compute Cloud (EC2) is a good example of a web service that
provides elastic computing power in a
cloud. EC2 permits customers to create VMs and to manage user accounts over the
time of their use.
Most virtualization platforms, including XenServer and VMware ESX Server,
support a bridging mode which allows all domains to appear on the network as
individual hosts. By using this mode, VMs can communicate with one another freely
through the virtual network interface card and configure the network automatically.
Physical versus Virtual Clusters: Virtual clusters are built with VMs installed at
distributed servers from one or more physical clusters. The VMs in a virtual cluster
are interconnected logically by a virtual network across several physical networks.
Figure 3.13 illustrates the concepts of virtual clusters and physical clusters. Each
virtual cluster is formed with physical machines or a VM hosted by multiple physical
clusters. The virtual cluster boundaries are shown as distinct boundaries.
Properties: The virtual cluster nodes can be either physical or virtual machines.
Multiple VMs running with different OSes can be deployed on the same physical node.
A VM runs with a guest OS, which is often different from the host OS that manages
the resources of the physical machine on which the VM is implemented.
Figure 3.13 A cloud platform with four virtual clusters over three physical
clusters shaded differently.
The purpose of using VMs is to consolidate multiple functionalities on the
same server. This will greatly enhance server utilization and application flexibility.
VMs can be colonized (replicated) in multiple servers for the purpose of promoting
distributed parallelism, fault tolerance, and disaster recovery.
The size (number of nodes) of a virtual cluster can grow or shrink dynamically,
similar to the way an overlay network varies in size in a peer-to-peer (P2P) network.
The failure of any physical nodes may disable some VMs installed on the failing
nodes. But the failure of VMs will not pull down the host system. Figure shows the
concept of a virtual cluster based on application partitioning or customization.
As a large number of VM images might be present, the most important thing is
to determine how to store those images in the system efficiently. There are common
installations for most users or applications, such as operating systems or user-level
programming libraries. These software packages can be preinstalled as templates
(called template VMs). With these templates, users can build their own software
stacks.
In Figure 3.13, four virtual clusters are created on the right, over the three physical
clusters shown on the left. The physical
machines are also called host systems. In contrast, the VMs are guest systems. The
host and guest systems may run with different operating systems.
Each VM can be installed on a remote server or replicated on multiple servers
belonging to the same or different physical clusters. The boundary of a virtual cluster
can change as VM nodes are added, removed, or migrated dynamically over time.
Data centers have grown rapidly in recent years, and all major IT companies
are pouring their resources into building new data centers. In addition, Google,
Yahoo!, Amazon, Microsoft, HP, Apple, and IBM are all in the game. All these
companies have invested billions of dollars in data-center construction and
automation.
Data-center automation means that huge volumes of hardware, software, and
database resources in these data centers can be allocated dynamically to millions of
Internet users simultaneously, with guaranteed QoS and cost-effectiveness. This
automation process is triggered by the growth of virtualization products and cloud
computing services.
The latest virtualization development highlights high availability (HA), backup
services, workload balancing, and further increases in client bases. IDC projected that
automation, service orientation, policy-based management, and variable costs would
drive growth in the virtualization market.
Virtual Storage Management: The term ―storage virtualization‖ was widely used
before the renaissance of system virtualization. Yet the term has a different meaning
in a system virtualization environment. Previously, storage virtualization was largely
used to describe the aggregation and repartitioning of disks at very coarse time scales
for use by physical machines.
In system virtualization, virtual storage includes the storage managed by
VMMs and guest OSes. Generally, the data stored in this environment can be
classified into two categories: VM images and application data. The VM images are
special to the virtual environment, while application data includes all other data which
is the same as the data in traditional OS environments.
The most important aspects of system virtualization are encapsulation and
isolation. In virtualization environments, a virtualization layer is inserted between the
hardware and traditional operating systems or a traditional operating system is
modified to support virtualization.
This procedure complicates storage operations. On the one hand, storage
management of the guest OS performs as though it is operating in a real hard disk
while the guest OSes cannot access the hard disk directly.
On the other hand, many guest OSes contest the hard disk when many VMs are
running on a single physical machine. Since traditional storage management
techniques do not consider the features of storage in virtualization environments,
Parallax designs a novel architecture in which storage features that have traditionally
been implemented directly on high-end storage arrays and switches are relocated into
a federation of storage VMs.
Cloud OS for Virtualized Data Centers: Data centers must be virtualized to serve as
cloud providers. Table 3.6 summarizes four virtual infrastructure (VI) managers and
OSes. These VI managers and OSes are specially tailored for virtualizing data centers
which often own a large number of servers in clusters. Nimbus, Eucalyptus, and
OpenNebula are all open source software available to the general public. Only
vSphere 4 is a proprietary OS for cloud resource virtualization and management over
data centers.
Figure 3.15 The architecture of livewire for intrusion detection using a dedicated
VM.
Advantages of Cloud Computing:
i. Cost efficiency: Cloud computing is probably the most cost-efficient way to use,
maintain, and upgrade IT infrastructure. If you want to get more technical and
analytical, cloud computing delivers a
better cash flow by eliminating the capital expense (CAPEX) associated with
developing and maintaining the server infrastructure.
ii. Convenience and continuous availability: Public clouds offer services that are
available wherever the end user might be located. This approach enables easy access
to information and accommodates the needs of users in different time zones and
geographic locations. As a side benefit, collaboration booms since it is now easier
than ever to access, view and modify shared documents and files.
Moreover, service uptime is in most cases guaranteed, providing in that way
continuous availability of resources. The various cloud vendors typically use multiple
servers for maximum redundancy. In case of system failure, alternative instances are
automatically spawned on other machines.
iii. Backup and Recovery: The process of backing up and recovering data is
simplified since those now reside on the cloud and not on a physical device. The
various cloud providers offer reliable and flexible backup/recovery solutions. In some
cases, the cloud itself is used solely as a backup repository of the data located in local
computers.
iv. Cloud is environmentally friendly: The cloud is in general more efficient than
the typical IT infrastructure and it takes fewer resources to compute, thus saving
energy. For example, when servers are not used, the infrastructure normally scales
down, freeing up resources and consuming less power. At any moment, only the
resources that are truly needed are consumed by the system.
v. Resiliency and Redundancy: A cloud deployment is usually built on a robust
architecture thus providing resiliency and redundancy to its users. The cloud offers
automatic failover between hardware platforms out of the box, while disaster recovery
services are also often included.
vi. Scalability and Performance: Scalability is a built-in feature for cloud
deployments. Cloud instances are deployed automatically only when needed and as a
result, you pay only for the applications and data storage you need. Hand in hand, also
comes elasticity, since clouds can be scaled to meet your changing IT system
demands.
x. Smaller learning curve: Cloud applications usually entail smaller learning curves
since people are already used to them. Users find it easier to adopt them and come up
to speed much faster. Main examples of this are applications like GMail and Google
Docs.
Disadvantages of Cloud Computing: As made clear from the above, cloud
computing is a tool that offers enormous benefits to its adopters. However, being a
tool, it also comes with its set of problems and inefficiencies.
i. Security and privacy in the Cloud: Security is the biggest concern when it
comes to cloud computing. By leveraging a remote cloud based infrastructure, a
company essentially gives away private data and information, things that might be
sensitive and confidential. It is then up to the cloud service provider to manage,
protect and retain them, thus the provider’s reliability is very critical. A company’s
existence might be put in jeopardy, so all possible alternatives should be explored
before making a decision. On the same note, even end users might feel uncomfortable
surrendering their data to a third party.
Similarly, privacy in the cloud is another huge issue. Companies and users have
to trust their cloud service vendors that they will protect their data from unauthorized
users. The various stories of data loss and password leakage in the media do not
help to reassure some of the most concerned users.
ii. Dependency and vendor lock-in: One of the major disadvantages of cloud
computing is the implicit dependency on the provider. This is what the industry calls
―vendor lock-in‖ since it is difficult, and sometimes impossible, to migrate from a
provider once you have committed to it. If a user wishes to switch to some other
provider, then it can be really painful and cumbersome to transfer huge data from the
old provider to the new one. This is another reason why you should carefully and
thoroughly contemplate all options when picking a vendor.
iii. Technical Difficulties and Downtime: Certainly the smaller business will enjoy
not having to deal with the daily technical issues and will prefer handing those to an
established IT company; however you should keep in mind that all systems might face
dysfunctions from time to time. Outages and downtime are possible even with the best
cloud service providers, as the past has shown.
UNIT-4
PROGRAMMING MODEL
PART – A
Service Functionality | Module Name | Functional Description
Global Resource Allocation Manager | GRAM | Grid Resource Access and Management (HTTP-based)
Communication | Nexus | Unicast and multicast communication
Grid Security Infrastructure | GSI | Authentication and related security services
Monitoring and Discovery Service | MDS | Distributed access to structure and state information
Health and Status | HBM | Heartbeat monitoring of system components
Global Access of Secondary Storage | GASS | Grid access of data in remote secondary storage
Grid File Transfer | GridFTP | Inter-node fast file transfer
Binary package:
Obtain the Globus Toolkit 4 binary package from the Globus site.
Extract the binary package as the Globus user.
Set environmental variables for the Globus location.
Create and change the ownership of the directory for user and group globus.
Configure and install Globus Toolkit 4.
Source package:
Obtain the Globus Toolkit 4 source package from the Globus site.
Extract the source package with the Globus user ID.
Set environmental variables for the Globus location.
Create and change the ownership of the directory for user and group globus.
Configure and install Globus Toolkit 4.
6. What is Hadoop?
Hadoop is the Apache Software Foundation top-level project that holds the
various Hadoop subprojects that graduated from the Apache Incubator. The Hadoop
project provides and supports the development of open source software that supplies a
framework for the development of highly scalable distributed computing applications.
The Hadoop framework handles the processing details, leaving developers free to
focus on application logic.
7. What is MapReduce?
MapReduce is a programming model for processing large datasets: the computation is
expressed as a map function that transforms input records into intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the
same key. The Hadoop framework distributes the map and reduce tasks across the
cluster and handles scheduling and failure recovery.
8. What is HDFS?
HDFS is a file system that is designed for MapReduce jobs that read input in large
chunks, process it, and write potentially large chunks of output. HDFS does not handle
random access particularly well.
9. What is an input split?
For the Hadoop framework to be able to distribute pieces of the job to multiple
machines, it needs to fragment the input into individual pieces, which can in turn be
provided as input to the individual distributed tasks. Each fragment of input is called
an input split. The default rules for how input splits are constructed from the actual
input files are a combination of configuration parameters and the capabilities of the
class that actually reads the input records.
10. What are all the various input formats specified in Hadoop framework?
11. Mention the information that has to be supplied by the user while configuring
the reduce phase.
To configure the reduce phase, the user must supply the framework with five
pieces of information (a configuration sketch follows the list):
The number of reduce tasks; if zero, no reduce phase is run
The class supplying the reduce method
The input key and value types for the reduce task; by default, the same
as the reduce output
The output key and value types for the reduce task
The output file type for the reduce task output
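As a concrete illustration, these five pieces of information map onto the org.apache.hadoop.mapreduce API roughly as follows. This is only a sketch; the base Reducer class shown is a placeholder for the user's own reduce implementation, and the key/value types are illustrative.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReduceConfigurationSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setNumReduceTasks(1);                           // number of reduce tasks; 0 disables the reduce phase
        job.setReducerClass(Reducer.class);                 // class supplying the reduce method (placeholder)
        job.setMapOutputKeyClass(Text.class);               // input key type for the reduce task
        job.setMapOutputValueClass(IntWritable.class);      // input value type for the reduce task
        job.setOutputKeyClass(Text.class);                  // output key type of the reduce task
        job.setOutputValueClass(IntWritable.class);         // output value type of the reduce task
        job.setOutputFormatClass(TextOutputFormat.class);   // output file type for the reduce output
    }
}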
Pig: Pig is a data flow language and execution environment for exploring very
large datasets. Pig runs on HDFS and MapReduce clusters.
Sqoop: A tool for efficiently moving data between relational databases and
HDFS.
PART – B
Write in detail about the configuration and testing of Globus Toolkit GT4 in a
Grid environment.
After the installation of the Globus Toolkit, each element of your grid environment
must be configured.
a. Configuring environmental variables
Before starting the configuration process, it is useful to set up the
GLOBUS_LOCATION environmental variables in either /etc/profile or
(userhome)/.bash_profile. To save time upon subsequent logins from different user
IDs, we specified GLOBUS_LOCATION in /etc/profile.
Also, Globus Toolkit provides shell scripts to set up these environmental variables.
They can be sourced as follows:
source $GLOBUS_LOCATION/etc/globus-user-env.sh (sh)
source $GLOBUS_LOCATION/etc/globus-user-env.csh (csh)
The Globus Toolkit also provides shell scripts for developers to set up Java
CLASSPATH environmental variables. They can be sourced as follows:
source $GLOBUS_LOCATION/etc/globus-devel-env.sh (sh)
source $GLOBUS_LOCATION/etc/globus-devel-env.csh (csh)
The globus-user-env.sh and globus-devel-env.sh scripts are sourced in /etc/profile, so
that all users can use the grid environment.
Example of /etc/profile
export GLOBUS_LOCATION=/usr/local/globus-4.0.0
source $GLOBUS_LOCATION/etc/globus-user-env.sh
source $GLOBUS_LOCATION/etc/globus-devel-env.sh
b. Security set up
Installation of CA packages
To install CA packages:
i.Log in to the CA host as a Globus user.
ii. Invoke the setup-simple-ca script, and answer the prompts as appropriate
See the following Example. This script initializes the files that are necessary for
SimpleCA.
Example Setting up SimpleCA
[globus@ca]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca
WARNING: GPT_LOCATION not set, assuming:
GPT_LOCATION=/usr/local/globus-4.0.0
The CA certificate has an expiration date. Keep in mind that once the CA
certificate has expired, all the certificates signed by that CA become invalid. A CA
should regenerate the CA certificate and start re-issuing ca-setup packages before the
actual CA certificate expires. This can be done by re-running this setup script. Enter
the number of DAYS the CA certificate should last before it expires. [default: 5 years
(1825 days)]: (type the number of days)1825
Setting up security in each grid node: After performing the steps above, a package
file has been created that needs to be used on other nodes, as described in this section.
In order to use certificates from this CA in other grid nodes, you need to copy and
install the CA setup package to each grid node.
i. Log in to a grid node as a Globus user and obtain a CA setup package from the
CA host. Then run the setup commands for configuration (see the following
Example).
Example Set up CA in each grid node
[globus@hosta]$ scp \
globus@ca:/home/globus/.globus/simpleCA/globus_simple_ca_(ca_hash)_setup-0.18.tar.gz .
[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-build \
globus_simple_ca_(ca_hash)_setup-0.18.tar.gz gcc32dbg
[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-postinstall
ii. As the root user, submit the commands in the example to configure the CA
settings in each grid node. This script creates the /etc/grid-security directory. This
directory contains the configuration files for security.
Obtain and sign a host certificate: In order to use some of the services provided by
Globus Toolkit 4, such as GridFTP, you need to have a CA-signed host certificate and
host key in the appropriate directory.
As root user, request a host certificate with the command in the above
Example.
Copy or send the /etc/grid-security/hostcert_request.pem file to the CA host.
In the CA host as a Globus user, sign the host certificate by using the grid-ca-
sign command.
Copy the hostcert.pem back to the /etc/grid-security/ directory in the grid node
.
Obtain and sign a user certificate
In order to use the grid environment, a grid user needs to have a CA signed user
certificate and user key in the user’s directory.
As a user (auser1 in hosta), request a user certificate with the command.
Copy or send the (userhome)/.globus/usercert_request.pem file to the CAhost.
In CA host as a Globus user, sign the user certificate by using the grid-ca-sign
command
Copy the created usercert.pem to the (userhome)/.globus/ directory on the grid
node.
Test the user certificate by typing grid-proxy-init -debug -verify as the user.
With this command, you can see the location of a user certificate and a key,
CA’s certificate directory, a distinguished name for the user, and the expiration
time. After you successfully execute grid-proxy-init, you have been
authenticated and are ready to use the grid environment.
c. Configuration of Java WS Core: The Java WS Core container is installed as a
part of the default Globus Toolkit 4 installation. There are a few things you need to
configure before you start Java WS Core.
Setting up Java WS Core environment The Java WS Core container uses a copy of
the host certificate and a host key. You need to copy and change the owner of those
files before you start the Java WS Core container.
As a root user, copy hostcert.pem and hostkey.pem to containercert.pem and
containerkey.pem in /etc/grid-security/. Then change the owner of the new files to
Globus (see the following Example).
Example Copying host certificate and key to container certificate and key
[root@hosta]# cp hostcert.pem containercert.pem
[root@hosta]# cp hostkey.pem containerkey.pem
[root@hosta]# chown globus.globus containercert.pem containerkey.pem
Verifying the installation and configuration of Java WS Core To verify that the
Java WS Core has been installed successfully and that grid security has been
implemented correctly, complete the following procedure:
As a Globus user, run the following command to start the container:
globus-start-container. If you do not use a secured container, then type the following
command: globus-start-container -nosec.
When the process is complete, a message indicates that the container is open for
Grid services, as shown in the following Example.
Troubleshooting: The following are a few common errors that may occur and what
you might do to correct them. The following message appears during the globus-start-
container command.
Failed to start container: Container failed to initialize [Caused by: Address already
in use]
This is because you have another container or program running. You may need to stop
the container or program in order to make this command work.
The following message appears during the counter-create command. Error: nested
exception is:
GSSException: Defective credential detected [Caused by: Proxy file
(/tmp/x509up_u511) not found.]
This is because you have tried to access a secured container without an activated
proxy certificate. You need to run the grid-proxy-init command in order to make this
command work.
Configuration and testing of GridFTP: You need to configure GridFTP before RFT,
because GridFTP is required by RFT. GridFTP is already installed during the default
installation process. You only need to configure GridFTP as a service daemon so that
you can transfer data between two hosts with GridFTP.
Java WS Core: Java WS Core consists of APIs and tools that implement the
WSRF and WS-Notification standards in Java. These components act as
the base components for various default services that Globus Toolkit 4 supplies. Also,
Java WS Core provides the development base libraries and tools for custom WS-RF
based services.
Reliable File Transfer (RFT) Reliable File Transfer provides a Web service
interface for transfer and deletion of files. RFT receives requests via SOAP messages
over HTTP and utilizes GridFTP. RFT also uses a database to store the list of file
transfers and their states, and is capable of recovering a transfer request that was
interrupted.
Replica Location Service (RLS) The Replica Location Service maintains and
provides access to information about the physical locations of replicated data. This
component can map multiple physical replicas to one single logical file, and enables
data redundancy in a grid environment.
Monitoring and Discovery Services: The Monitoring and Discovery Services (MDS)
are mainly concerned with the collection, distribution, indexing, archiving, and other
processing of information about the state of various resources, services, and
system configurations. The information collected is used to either discover new
services or resources, or to enable monitoring of system status. The GT4 provides a
WS-RF and WS-Notification compliant version of MDS, also known as MDS4. The
resource properties provided by a WS-RF compliant resource can be registered with
MDS4 services for information collection purposes. The GT4 WS-RF compliant
services such as GRAM and RFT provide such properties. Upon GT4 container
startup these services are registered with MDS4 services. MDS4 consists of two
higher-level services, an Index service and a Trigger service, which are based on the
Aggregator Framework.
Index service The Index service is the central component of the GT4 MDS
implementation. Every instance of a GT4 container has a default indexing service.
WS GRAM WS GRAM is the Grid service that provides the remote execution
and status management of jobs. When a job is submitted by a client, the request is sent
to the remote host as a SOAP message, and handled by WS GRAM service located in
the remote host. The WS GRAM service is capable of submitting those requests to
local job schedulers such as Platform LSF or Altair PBS. The WS GRAM service
returns status information of the job using WS-Notification. The WS GRAM service
can collaborate with the RFT service for staging files required by jobs. In order to
enable staging with RFT, valid credentials should be delegated to the RFT service by
the Delegation service.
How will you define the Map and Reduce functions in the Hadoop framework using a
Java program?
The whole data flow is illustrated in the following Figure 4.6. At the bottom of the
diagram is a Unix pipeline, which mimics the whole MapReduce flow.
Java MapReduce Having run through how the MapReduce program works, the next
step is to express it in code. We need three things: a map function, a reduce function,
and some code to run the job. The map function is represented by the Mapper class,
which declares an abstract map() method. The following example shows the
implementation of our map method.
Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable>
{
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+')
{ // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
}
else
{
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]"))
{
context.write(new Text(year), new IntWritable(airTemperature));
}
}}
The Mapper class is a generic type, with four formal type parameters that specify the
input key, input value, output key, and output value types of the map function. For the
present example, the input key is a long integer offset, the input value is a line of text,
the output key is a year, and the output value is an air temperature (an integer).
Rather than use built-in Java types, Hadoop provides its own set of basic types that
are optimized for network serialization. These are found in the org.apache.hadoop.io
package. Here we use LongWritable, which corresponds to a Java Long, Text (like
Java String), and IntWritable (like Java Integer). The map() method is passed a key
and a value. We convert the Text value containing the line of input into a Java String,
then use its substring() method to extract the columns we are interested in. The map()
method also provides an instance of Context to write the output to. In this case, we
write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the
temperature is present and the quality code indicates the temperature reading is OK.
The reduce function is similarly defined using a Reducer, as illustrated in the
following Example.
Reducer for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
A Job object forms the specification of the job. It gives you control over how the job
is run. When we run this job on a Hadoop cluster, we will package the code into a
JAR file (which Hadoop will distribute around the cluster). Rather than explicitly
specify the name of the JAR file, we can pass a class in the Job’s setJarByClass()
method, which Hadoop will use to locate the relevant JAR file by looking for the JAR
file containing this class.
Having constructed a Job object, we specify the input and output paths. An input path
is specified by calling the static addInputPath() method on FileInputFormat, and it can
be a single file, a directory (in which case, the input forms all the files in that
directory), or a file pattern. As the name suggests, addInputPath() can be called more
than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutputPath()
method on FileOutputFormat. It specifies a directory where the output files from the
reducer functions are written. The directory shouldn’t exist before running the job, as
Hadoop will complain and not run the job. This precaution is to prevent data loss. The
input types are controlled via the input format, which we have not explicitly set since
we are using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to run
the job. The waitForCompletion() method on Job submits the job and waits for it to
finish. The method’s boolean argument is a verbose flag, so in this case the job writes
information about its progress to the console. The return value of the
waitForCompletion() method is a boolean indicating success (true) or failure (false),
which we translate into the program’s exit code of 0 or 1.
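Putting the pieces described above together, a driver class along the following lines packages the Mapper and Reducer, sets the input and output paths and output types, and submits the job. This is a sketch consistent with the description; the class name MaxTemperature is assumed.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);                 // locate the JAR containing this class
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file, directory, or pattern
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // verbose flag true prints progress
    }
}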
4. HDFS CONCEPTS
Blocks: A disk has a block size, which is the minimum amount of data that it can read
or write. File systems for a single disk build on this by dealing with data in blocks,
which are an integral multiple of the disk block size. File system blocks are typically a
few kilobytes in size, while disk blocks are normally 512 bytes. This is generally
transparent to the file system user who is simply reading or writing a file—of
whatever length. However, there are tools to perform file system maintenance, such as
df and fsck, that operate on the file system block level.
HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by
default. Like in a file system for a single disk, files in HDFS are broken into block-
sized chunks, which are stored as independent units. Unlike a file system for a single
disk, a file in HDFS that is smaller than a single block does not occupy a full block’s
worth of underlying storage. When unqualified, the term ―block‖ in this book refers to
a block in HDFS.
Furthermore, blocks fit well with replication for providing fault tolerance and
availability. To insure against corrupted blocks and disk and machine failure, each
block is replicated to a small number of physically separate machines (typically three).
If a block becomes unavailable, a copy can be read from another location in a way that
is transparent to the client. A block that is no longer available due to corruption or
machine failure can be replicated from its alternative locations to other live machines
to bring the replication factor back to the normal level.
Similarly, some applications may choose to set a high replication factor for the blocks
in a popular file to spread the read load on the cluster. Like its disk filesystem cousin,
HDFS’s fsck command understands blocks. For example, running:
% hadoop fsck / -files -blocks
It will list the blocks that make up each file in the file system.
Namenodes and Datanodes
HDFS Federation: The namenode keeps a reference to every file and block in the file
system in memory, which means that on very large clusters with many files, memory
becomes the limiting factor for scaling. HDFS Federation, introduced in the 0.23
release series, allows a cluster to scale by adding namenodes, each of which manages
a portion of the filesystem namespace. For example, one namenode might manage all
the files rooted under /user, say, and a second namenode might handle files under
/share. To access a federated HDFS cluster, clients use client-side mount tables to
map file paths to namenodes.
Explain in detail about the Hadoop file system and the command-line interface.
Hadoop has an abstract notion of file system, of which HDFS is just one
implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a
file system in Hadoop, and there are several concrete implementations, which are
described in the following table. Hadoop provides many interfaces to its file systems,
and it generally uses the URI scheme to pick the correct file system instance to
communicate with. For example, the file system shell that we met in the previous
section operates with all Hadoop file systems.
To list the files in the root directory of the local file system, type:
% hadoop fs -ls file:///
Hadoop is written in Java, and all Hadoop file system interactions are mediated
through the Java API. The file system shell, for example, is a Java application that
uses the Java FileSystem class to provide file system operations. The other filesystem
interfaces are discussed briefly in this section. These interfaces are most commonly
used with HDFS, since the other file systems in Hadoop typically have existing tools
to access the underlying file system (FTP clients for FTP, S3 tools for S3, etc.), but
many of them will work with any Hadoop file system.
HTTP
There are two ways of accessing HDFS over HTTP: directly, where the HDFS
daemons serve HTTP requests to clients; and via a proxy (or proxies), which accesses
HDFS on the client’s behalf using the usual DistributedFileSystem API. The original
HDFS proxy (in src/contrib/hdfsproxy) was read-only, and could be accessed by
clients using the HSFTP FileSystem implementation (hsftp URIs).
The two ways are illustrated in the following Figure 4.7.
Figure 4.7 - Accessing HDFS over HTTP directly, and via a bank of HDFS
proxies
From release 0.23, there is a new proxy called HttpFS that has read and write
capabilities, and which exposes the same HTTP interface as WebHDFS, so clients can
access either using webhdfs URIs.
The HTTP REST API that WebHDFS exposes is formally defined in a specification,
so it is likely that over time clients in languages other than Java will be written that
use it directly.
C
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface
(it was written as a C library for accessing HDFS, but despite its name it can be used
to access any Hadoop filesystem). It works using the Java Native Interface (JNI) to
call a Java filesystem client.
FUSE
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user
space to be integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module
allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard
filesystem. You can then use Unix utilities (such as ls and cat) to interact with the
filesystem, as well as POSIX libraries to access the filesystem from any programming
language.
We could also have used a relative path and copied the file to our home directory in
HDFS, which in this case is /user/tom:
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
Let’s copy the file back to the local filesystem and check whether it’s the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
The information returned is very similar to the Unix command ls -l, with a few minor
differences. The first column shows the file mode. The second column is the
replication factor of the file (something a traditional Unix filesystem does not have).
Remember we set the default replication factor in the site-wide configuration to be 1,
which is why we see the same value here. The entry in this column is empty for
directories since the concept of replication does not apply to them—directories are
treated as metadata and stored by the namenode, not the datanodes. The third and
fourth columns show the file owner and group. The fifth column is the size of the file
in bytes, or zero for directories. The sixth and seventh columns are the last modified
date and time. Finally, the eighth column is the absolute name of the file or directory.
There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL
scheme. This is achieved by calling the setURLStreamHandlerFactory method on
URL with an instance of FsUrlStreamHandlerFactory. This method can only be called
once per JVM, so it is typically executed in a static block. This limitation means that if
some other part of your program—perhaps a third-party component outside your
control— sets a URLStreamHandlerFactory, you won’t be able to use this approach
for reading data from Hadoop. The next section discusses an alternative. The
following example shows a program for displaying files from Hadoop file systems on
standard output, like the Unix cat command.
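The example referred to here does not appear in the text; a minimal program consistent with the description, reading a Hadoop URL and copying its contents to standard output, would look like the following sketch.
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // May only be called once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();   // e.g. an hdfs:// URL passed on the command line
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}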
A third overload of FileSystem.get() takes a user name as well and retrieves the
filesystem as the given user. In some cases, you may want to retrieve a local
filesystem instance, in which case you can use the convenience method getLocal():
public static LocalFileSystem getLocal(Configuration conf) throws IOException
With a FileSystem instance in hand, we invoke an open() method to get the input
stream for a file:
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws
IOException
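For illustration, a small program (a sketch, not the original listing) that uses FileSystem.get() and open() to print a file to standard output might look like this:
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));    // returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}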
FSDataInputStream
The open() method on FileSystem actually returns a FSDataInputStream rather than a
standard java.io class. This class is a specialization of java.io.DataInputStream with
support for random access, so you can read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable {
  // implementation elided
}
The Seekable interface permits seeking to a position in the file and a query method for
the current offset from the start of the file (getPos()):
public interface Seekable {
  void seek(long pos) throws IOException;
  long getPos() throws IOException;
}
Calling seek() with a position that is greater than the length of the file will result in an
IOException. Unlike the skip() method of java.io.InputStream that positions the
stream at a point later than the current position, seek() can move to an arbitrary,
absolute position in the file.
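For illustration (assuming the fs and uri variables from the earlier examples), a file can be written to standard output twice by seeking back to the start after the first read:
FSDataInputStream in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
in.seek(0);    // go back to the start of the file and read it again
IOUtils.copyBytes(in, System.out, 4096, false);
IOUtils.closeStream(in);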
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the
method that takes a Path object for the file to be created and returns an output stream
to write to:
public FSDataOutputStream create(Path f) throws IOException
The following example shows how to copy a local file to a Hadoop filesystem. We
illustrate progress by printing a period every time the progress() method is called by
Hadoop, which is after each 64 K packet of data is written to the datanode pipeline.
public class FileCopyWithProgress {
public static void main(String[] args) throws Exception {
String localSrc = args[0];
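The listing is truncated here; a minimal sketch of the complete program, assuming the create() overload that takes a Progressable callback (the details below are illustrative), is:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");  // called each time a packet is written to the pipeline
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}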
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like
FSDataInputStream, has a method for querying the current position in the file:
public class FSDataOutputStream extends DataOutputStream implements Syncable {
  public long getPos() throws IOException {
    // implementation elided
  }
  // implementation elided
}
Directories
FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
This method creates all of the necessary parent directories if they don’t already exist, just like java.io.File’s mkdirs() method. It returns true if the directory (and all necessary parent directories) were successfully created.
Often, you don’t need to explicitly create a directory, since writing a file, by calling
create(), will automatically create any parent directories.
Querying the Filesystem
File metadata: FileStatus
An important feature of any filesystem is the ability to navigate its directory structure
and retrieve information about the files and directories that it stores. The FileStatus
class encapsulates filesystem metadata for files and directories, including file length,
block size, replication, modification time, ownership, and permission information.
The method getFileStatus() on FileSystem provides a way of getting a FileStatus
object for a single file or directory.
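As an illustration (the path and values here are hypothetical), a FileStatus object exposes this metadata through simple accessor methods:
Path file = new Path("/dir/file");
FileStatus stat = fs.getFileStatus(file);
System.out.println(stat.getPath());              // fully qualified path
System.out.println(stat.getLen());               // length in bytes
System.out.println(stat.isDir());                // true for directories
System.out.println(stat.getReplication());       // replication factor
System.out.println(stat.getBlockSize());         // block size in bytes
System.out.println(stat.getModificationTime());  // milliseconds since the epoch
System.out.println(stat.getOwner() + " " + stat.getGroup());
System.out.println(stat.getPermission());        // e.g. rw-r--r--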
Listing files
Finding information on a single file or directory is useful, but you also often need to
be able to list the contents of a directory. That’s what FileSystem’s listStatus()
methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.
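A short sketch of listing a directory (the path is illustrative) can use the FileUtil.stat2Paths() helper to turn the FileStatus array into an array of Path objects:
FileStatus[] status = fs.listStatus(new Path("/user/tom"));
Path[] listedPaths = FileUtil.stat2Paths(status);
for (Path p : listedPaths) {
    System.out.println(p);
}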
File patterns
It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month’s worth of files contained in a number of directories. Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation known as globbing.
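FileSystem provides globStatus() methods to expand such patterns; a minimal illustration (the directory layout is hypothetical) is:
FileStatus[] status = fs.globStatus(new Path("/logs/2016/*/*"));
Path[] matched = FileUtil.stat2Paths(status);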
5) Coherency Model
A coherency model for a file system describes the data visibility of reads and writes
for a file. HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to.
After creating a file, it is visible in the file system namespace, as expected:
Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));
However, any content written to the file is not guaranteed to be visible, even if the
stream is flushed. So the file appears to have a length of zero:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
Once more than a block’s worth of data has been written, the first block will be visible
to new readers. This is true of subsequent blocks, too: it is always the current block
being written that is not visible to other readers.
HDFS provides a method for forcing all buffers to be synchronized to the datanodes
via the sync() method on FSDataOutputStream. After a successful return from sync(),
HDFS guarantees that the data written up to that point in the file is persisted and
visible to all new readers:
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush(); out.sync();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
This behavior is similar to the fsync system call in POSIX that commits buffered data
for a file descriptor. For example, using the standard Java API to write a local file, we
are guaranteed to see the content after flushing the stream and synchronizing:
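For comparison, a minimal sketch of the equivalent local-file code (using the standard java.io API and the same assertThat style as above; not the original listing) is:
File localFile = File.createTempFile("sync", "txt");
FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush();           // flush to the operating system
out.getFD().sync();    // sync the file descriptor to disk, like POSIX fsync
assertThat(localFile.length(), is(((long) "content".length())));
out.close();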
UNIT – V
SECURITY
Part – A
The major authentication methods in the grid include passwords, PKI, and
Kerberos.
Types of authority:
The authority can be classified into three categories:
Attribute authorities issue attribute assertions;
policy authorities issue authorization policies;
identity authorities issue certificates.
5. What is GSI?
GSI (Grid Security Infrastructure) is the security framework of the Globus Toolkit that provides authentication, message protection, and credential delegation for grid users and services.
7. What are all the protection mechanisms that are provided by GSI between
WS-Security and WS-Secure Conversation?
8. What is IAM?
Identity and access management (IAM) is the security and business discipline
that "enables the right individuals to access the right resources at the right times
and for the right reasons."
Data integrity refers to maintaining and assuring the accuracy and consistency
of data over its entire life-cycle, and is a critical aspect to the design,
implementation and usage of any system which stores, processes, or
retrieves data.
10. List out the responsibilities and challenges in managing users in IaaS
services.
User provisioning
Privileged user management
Customer key assignment (assigning IDs and keys)
Developer user management
Part – B
FIGURE 7.1 GSI functional layers at the message and transport levels
TLS (transport-level security) or WS-Security and WS-Secure Conversation
(message-level) are used as message protection mechanisms in combination with
SOAP. X.509 End Entity Certificates or Username and Password are used as
authentication credentials. X.509 Proxy Certificates and WS-Trust are used for
delegation. An Authorization Framework allows for a variety of authorization
schemes, including a "grid-mapfile" ACL, an ACL defined by a service, a custom
authorization handler, and access to an authorization service via the SAML protocol.
In addition, associated security tools provide for the storage of X.509 credentials
(MyProxy and Delegation services), the mapping between GSI and other
authentication mechanisms (e.g., KX509 and PKINIT for Kerberos, MyProxy for one-
time passwords), and maintenance of information used for authorization (VOMS,
GUMS, PERMIS).
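The grid-mapfile ACL mentioned above is simply a list of lines mapping certificate distinguished names to local accounts; the entry below is hypothetical:
"/C=US/O=Example Org/OU=Research/CN=John Doe" jdoe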
Transport-Level Security
Transport-level security entails SOAP messages conveyed
over a network connection protected by TLS. TLS provides for both integrity
protection and privacy (via encryption). Transport-level security is normally used in
conjunction with X.509 credentials for authentication, but can also be used without
such credentials to provide message protection without authentication, often referred
to as "anonymous transport-level security." In this mode of operation, authentication
may be done by username and password in a SOAP message.
(1) a subject name, which identifies the person or object that the certificate represents;
(2) the public key belonging to the subject;
(3) the identity of a CA that has signed the certificate to certify that the public key and
the identity both belong to the subject; and
(4) the digital signature of the named CA.
X.509 provides each entity with a unique
identifier (i.e., a distinguished name) and a method to assert that identifier to another
party through the use of an asymmetric key pair bound to the identifier by the
certificate.
The X.509 certificates used by GSI are conformant to the relevant standards
and conventions. Grid deployments around the world have established their own CAs
based on third-party software to issue the X.509 certificate for use with GSI and the
Globus Toolkit. GSI also supports delegation and single sign on through the use of
standard X.509 proxy certificates. Proxy certificates allow bearers of X.509 to
delegate their privileges temporarily to another entity. For the purposes of
authentication and authorization, GSI treats certificates and proxy certificates
equivalently. Authentication with X.509 credentials can be accomplished either via
TLS, in the case of transport-level security, or via signature as specified by WS-
Security, in the case of message-level security.
Trust Delegation
To reduce or even avoid the number of times the user must enter his passphrase (when several grids are used, or when agents, local or remote, request services on behalf of a user), GSI provides a delegation capability and a delegation service that offers an interface to allow clients to delegate (and renew) X.509 proxy certificates to a service. The interface to this service is based on the WS-
Trust specification. A proxy consists of a new certificate and a private key. The key
pair that is used for the proxy, that is, the public key embedded in the certificate and
the private key, may either be regenerated for each proxy or be obtained by other
means. The new certificate contains the owner’s identity, modified slightly to indicate
that it is a proxy. The new certificate is signed by the owner, rather than a CA.
The traditional model of network zones and tiers has been replaced in public cloud computing with "security groups" or "virtual data centers" that have logical separation between tiers but are less precise and afford less protection than the formerly established model.
For example, the security groups feature in AWS allows your virtual machines to
access each other using a virtual firewall that has the ability to filter traffic based on IP
address, packet types and ports.
Infrastructure Security: The Host Level
Consider the context of cloud services delivery models (Saas, PaaS, IaaS) and
deployment models (public, private and hybrid). The dynamic nature (elasticity) of
cloud computing can bring new operational challenges from a security management
perspective.
SaaS and PaaS Host Security:
CSPs do not share information related to their host platforms, host operating systems, and the processes that are in place to secure the hosts, since hackers can exploit that information when they are trying to intrude into the cloud service. Hence, in the context of SaaS or PaaS cloud services, host security is opaque to customers and the responsibility of securing the hosts is relegated to the CSP.
Virtualization is a key enabling technology that improves host hardware utilization, among other benefits. It is common for CSPs to employ virtualization platforms, including Xen and VMware hypervisors, in their host computing platform architecture.
Both the PaaS and SaaS platforms abstract and hide the host OS from end users with a host abstraction layer. One key difference between PaaS and SaaS is the accessibility of the abstraction layer that hides the OS services the applications consume.
IaaS Host Security:
Unlike PaaS and SaaS, IaaS customers are primarily responsible for securing the
hosts provisioned in the cloud. Given that almost all IaaS services available today
employ virtualization at the host layer, host security in IaaS should be categorized as
follows:
Security must be embedded into the Software Development Life Cycle (SDLC), as depicted in Figure 5.4.
functions, including user and access management as supported by the provider. Extra
attention needs to be paid to the authentication and access control features.
Explain the various aspects of data security and discuss the provider data and
its security.
Data security becomes more important when using cloud computing at all "levels": infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS). Several aspects of data security must be considered, including:
Data-in-transit
Data-at-rest
Processing of data, including multitenancy
Data lineage
Data provenance
Data remanence
Aspects of Data security:
With regard to data-in-transit, the primary risk is in not using a vetted encryption algorithm. It is also important to use a protocol that provides confidentiality as well as integrity (e.g., FTPS, HTTPS). Encrypting data and then transferring it over a non-secured protocol can provide confidentiality, but does not ensure the integrity of the data.
Using encryption to protect data-at-rest might seem obvious, but the reality is not that simple. Encrypting data-at-rest is possible and is strongly suggested; however, data-at-rest used by a cloud-based application is generally not encrypted, because encryption would prevent indexing or searching of that data.
For any application to process data, that data must be unencrypted. A homomorphic encryption scheme allows data to be processed without being decrypted; this is a huge advance in cryptography. Other cryptographic research efforts are underway to limit the amount of data that would need to be decrypted for processing in the cloud, such as predicate encryption. Whether or not the data put into the cloud is encrypted, it is useful, and might be required, to know exactly where and when the data was located within the cloud.
Following the path of data (mapping application data flows or data path visualization) is known as data lineage. Providing data lineage to auditors or management is time consuming, even when the environment is completely under an organization’s control. Trying to provide accurate reporting on data lineage for a public cloud service is really not possible. Even if data lineage can be established in a public cloud, for some customers there is an even more challenging requirement and problem: providing data provenance.
metadata? As your volume of data with a particular provider increases, so does the value of that metadata.
Additionally, your provider collects and must protect a huge amount of security-related data. For example, at the network level your provider should be collecting, monitoring, and protecting firewall, intrusion prevention system (IPS), security incident and event management (SIEM), and router flow data. At the host level your provider should be collecting system log files, and at the application level SaaS providers should be collecting application log data, including authentication and authorization information.
Storage: Three information security concerns are associated with data stored in the cloud: confidentiality, integrity, and availability.
Confidentiality: With regard to the confidentiality of data stored in a public cloud, there are two potential concerns. First, what access control exists to protect the data? Access control consists of both authentication (e.g., username and password) and authorization. The second potential concern is how the data stored in the cloud is actually protected. For all practical purposes, protection of data stored in the cloud involves the use of encryption.
If a CSP does encrypt a customer’s data, the next consideration concerns what
encryption algorithm it uses. Not all encryption algorithms are created equal.
Cryptographically, many algorithms provide insufficient security. Symmetric
encryption involves the use of a single secret key for both the encryption and
decryption of data. Although the example in Figure 5.9 is related to email, the same
concept (i.e., a single shared, secret key) is used in data storage encryption.
Although the example in Figure 5.10 is related to e-mail, the same concept (i.e., a
public key and a private key) is not used in data storage encryption.
The next consideration is what key length is used. With symmetric encryption, a longer key length provides more protection. The key length should be a minimum of 112 bits for Triple DES (Data Encryption Standard) and 128 bits for AES (Advanced Encryption Standard). Another confidentiality consideration for encryption is key management: how are the encryption keys going to be managed, and by whom? Because key management is complex and difficult even for a single customer, it is even more complex and difficult to manage multiple customers’ keys.
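As an illustration of symmetric encryption with a 128-bit AES key, the sketch below uses the standard Java crypto API; it is not from the original text, and the plaintext is a placeholder:
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class SymmetricEncryptionExample {
    public static void main(String[] args) throws Exception {
        // Generate a 128-bit AES key (the minimum length recommended above)
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // Encrypt with the single shared secret key
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] iv = cipher.getIV();
        byte[] ciphertext = cipher.doFinal("customer data".getBytes("UTF-8"));

        // Decrypt with the same key (and the IV used during encryption)
        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] plaintext = cipher.doFinal(ciphertext);
        System.out.println(new String(plaintext, "UTF-8"));
    }
}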
Integrity:
Confidentiality does not imply integrity; data can be encrypted for
confidentiality purposes, and yet you might not have a way to verify the
integrity of that data. Encryption alone is sufficient for confidentiality, but
integrity also requires the use of message authentication codes (MACs).
The simplest way to use MACs on encrypted data is to use a block symmetric
algorithm in cipher block chaining (CBC) mode, and to include a one-way hash
function.
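One common realization of this idea is an HMAC; the sketch below (not from the original text; the key and data are placeholders) computes a message authentication code over stored ciphertext using the standard Java crypto API:
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class MacExample {
    public static void main(String[] args) throws Exception {
        byte[] ciphertext = "encrypted-bytes-would-go-here".getBytes("UTF-8");  // placeholder
        byte[] keyBytes = "0123456789abcdef0123456789abcdef".getBytes("UTF-8"); // demo key only

        // Compute an HMAC-SHA256 tag over the ciphertext and store it alongside the data
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(keyBytes, "HmacSHA256"));
        byte[] tag = mac.doFinal(ciphertext);

        // Later, recompute the tag over the retrieved ciphertext and compare the two
        // values; a mismatch indicates the stored data has been modified.
        System.out.println("MAC is " + tag.length + " bytes long");
    }
}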
Another aspect of data integrity is important, especially with bulk storage using IaaS. What a customer really wants to do is to validate the integrity of its data while that data remains in the cloud, without having to download and re-upload that data. This task is even more difficult. Additionally, that data set is probably dynamic and changing frequently, and those frequent changes obviate the effectiveness of traditional integrity assurance techniques.
Availability:
Assuming that a customer’s data has maintained its confidentiality and integrity,
you must also be concerned about the availability of your data. There are currently
three major threats in this regard:
The first threat to availability is network-based attacks.
The second threat to availability is the CSPs own availability.
Finally, prospective cloud storage customers must be certain to ascertain just
what services their provider is actually offering.
Cloud storage does not automatically mean the stored data is actually backed up. Some cloud storage providers do back up customer data, in addition to providing storage. However, many cloud storage providers do not back up customer data, or do so only as an additional service for an additional cost.
To compensate for the loss of network control and to strengthen risk assurance,
organizations will be forced to rely on other higher-level software controls, such as
application security and user access controls. These controls manifest as strong
authentication, authorization based on role or claims, trusted sources with accurate
attributes, identity federation, single sign-on (SSO), user activity monitoring, and
auditing. In particular, organizations need to pay attention to the identity federation
architecture and processes, as they can strengthen the controls and trust between
organizations and cloud service providers (CSPs).
IAM is a two-way street. CSPs need to support IAM standards and practices, such as federation, so that customers can take advantage of them and extend their IAM practice to maintain compliance with internal policies and standards.
Need for IAM:
Improve operational efficiency
Regulatory compliance management
Some of the cloud use cases that require IAM support from the CSP include:
IT administrators accessing the CSP management console to provision
resources and access for users using a corporate identity.
Developers creating accounts for partner users in a PaaS platform.
End users accessing storage service in the cloud and sharing files and
objects with users, within and outside a domain using access policy
management features.
An application residing in a cloud service provider accessing storage
from another cloud service.
IAM Challenges
One critical challenge of IAM is managing user access to both internally and externally hosted services. Another issue is the turnover of users within the organization; turnover varies by industry and function, and is compounded by business changes such as new product and service releases.
To address these challenges and risks, many companies have sought technology solutions to enable centralized and automated user access management. Many of these initiatives are entered into with high expectations, which is not surprising given that the problem is often large and complex.
IAM Definitions
Basic concept and definitions of IAM functions for any service:
Authentication – the process of verifying the identity of a user or a system. Authentication usually connotes a more robust form of identification. In some use cases, such as service-to-service interaction, authentication involves verifying the network service.
Authorization – the process of determining the privileges the user or system is entitled to once the identity is established. Authorization usually follows the authentication step and is used to determine whether the user or service has the necessary privileges to perform certain operations.
Auditing – the process of review and examination of authentication and authorization records and activities to determine the adequacy of IAM system controls, to verify compliance with established security policies and procedures, to detect breaches in security services, and to recommend any changes that are indicated for countermeasures.
Compliance management: This process implies that access rights and privileges are
monitored and tracked to ensure the security of an enterprise’s resources. The process
also helps auditors verify compliance to various internal access control policies, and
standards that include practices such as segregation of duties, access monitoring,
periodic auditing, and reporting. An example is a user certification process that allows
application owners to certify that only authorized users have the privileges necessary
to access business-sensitive information.
Identity federation management: Federation is the process of managing the trust
relationships established beyond the internal network boundaries or administrative
domain boundaries among distinct organizations. A federation is an association of
organizations that come together to exchange information about their users and
resources to enable collaborations and transactions.
Centralization of authentication (authN) and authorization (authZ): A central
authentication and authorization infrastructure alleviates the need for application
developers to build custom authentication and authorization features into their
applications. Furthermore, it promotes a loose coupling architecture where
applications become agnostic to the authentication methods and policies. This
approach is also called an "externalization of authN and authZ" from applications.
3. How can I provision user accounts with appropriate privileges and manage
entitlements for my users? XACML.
4. How can I authorize cloud service X to access my data in cloud service Y
without disclosing credentials? OAuth.
3. Google sends a redirect to the user’s browser. The redirect URL includes the
encoded SAML authentication request that should be submitted to your organization’s
IdP service.
4. Your IdP decodes the SAML request and extracts the URL for both Google’s
Assertion Consumer Service (ACS) and the user’s destination URL (the Relay State
parameter). Your IdP then authenticates the user. Your IdP could authenticate the user
by either asking for valid login credentials or checking for valid session cookies.
5. Your IdP generates a SAML response that contains the authenticated user’s username. In accordance with the SAML 2.0 specification, this response is digitally signed with the partner’s DSA/RSA private key.
6. Your IdP encodes the SAML response and the Relay State parameter and
returns that information to the user’s browser. Your IdP provides a mechanism so that
the browser can forward that information to Google’s ACS.
7. Google’s ACS verifies the SAML response using your IdP’s public key. If
the response is successfully verified, ACS redirects the user to the destination URL.
8. The user has been redirected to the destination URL and is logged in to Google
Apps.
Figure 5.15 illustrates the interaction among various health care participants with
unique roles (authorization privileges) accessing sensitive patient records stored in a
health care application.
4. After evaluation, the PDP sends the XACML response to the PEP.
5. The PEP fulfills the obligations by enforcing the PDP’s authorization
decision.
Open Authentication (OAuth)
OAuth is an emerging authentication standard that allows consumers to share
their private resources (e.g., photos, videos, contact lists, bank accounts) stored on one
CSP with another CSP without having to disclose the authentication information (e.g.,
username and password). OAuth is an open protocol and it was created with the goal
of enabling authorization via a secure application programming interface (API)—a
simple and standard method for desktop, mobile, and web applications. For
application developers, OAuth is a method for publishing and interacting with
protected data.
Recently, Google released a hybrid version of an OpenID and OAuth protocol
that combines the authorization and authentication flow in fewer steps to enhance
usability. Google’s GData API recently announced support for OAuth. (GData also
supports SAML for browser SSO.) Figure 5.16 illustrates the sequence of interactions
between customer or partner web application, Google services, and end user:
1. Customer web application contacts the Google Authorization service, asking for a request token for one or more Google services.
2. Google verifies that the web application is registered and responds with an
unauthorized request token.
3. The web application directs the end user to a Google authorization page,
referencing the request token.
4. On the Google authorization page, the user is prompted to log into his
account (for verification) and then either grant or deny limited access to his Google
service data by the web application.
5. The user decides whether to grant or deny access to the web application. If
the user denies access, he is directed to a Google page and not back to the web
application.
6. If the user grants access, the Authorization service redirects him back to a page of the web application that was registered with Google. The redirect includes the now-authorized request token.
7. The web application sends a request to the Google Authorization service to
exchange the authorized request token for an access token.
8. Google verifies the request and returns a valid access token.
9. The web application sends a request to the Google service in question. The
request is signed and includes the access token.
10. If the Google service recognizes the token, it supplies the requested data.
The following protocols and specifications are oriented toward consumer cloud
services, and are not relevant from an enterprise cloud computing standpoint.
• Federation or SSO
• Authorization management
• Compliance management
The identity store in the cloud is kept in sync with the corporate directory through a
provider proprietary scheme (e.g., agents running on the customer’s premises
synchronizing a subset of an organization’s identity store to the identity store in the
cloud using SSL VPNs).
Once the IdP is established in the cloud, the organization should work with the CSP to
delegate authentication to the cloud identity service provider. The cloud IdP will
authenticate the cloud users prior to them accessing any cloud services (this is done
via browser SSO techniques that involve standard HTTP redirection techniques).
Here are the specific pros and cons of this approach:
Pros
Delegating certain authentication use cases to the cloud identity management service
hides the complexity of integrating with various CSPs supporting different federation
standards. Another benefit is that there is little need for architectural changes to
support this model. Once identity synchronization between the organization directory
or trusted system of record and the identity service directory in the cloud is set up,
users can sign on to cloud services using corporate identity, credentials (both static
and dynamic), and authentication policies.
Cons
When you rely on a third party for an identity management service, you may have less
visibility into the service, including implementation and architecture details. Hence,
the availability and authentication performance of cloud applications hinges on the
identity management service provider’s SLA, performance management, and
availability. It is important to understand the provider’s service level, architecture,
service redundancy, and performance guarantees of the identity management service
provider.
Another drawback to this approach is that it may not be able to generate custom
reports to meet internal compliance requirements. In addition, identity attribute
management can also become complex when identity attributes are not properly
defined and associated with identities (e.g., definitions of attributes, both mandatory
and optional).
Availability Management
Cloud services are not immune to outages, and the severity and scope of impact
to the customer can vary based on the outage situation. Similar to any internal IT-
supported application, business impact due to a service outage will depend on the
criticality of the cloud application and its relationship to internal business processes.
In the case of business-critical applications where businesses rely on the continuous
availability of service, even a few minutes of service outage can have a serious impact
on your organization’s productivity, revenue, customer satisfaction, and service-level
compliance.
The cloud service resiliency and availability depend on a few factors, including
the CSP’s data center architecture (load balancers, networks, systems), application
architecture, hosting location redundancy, diversity of Internet service providers
(ISPs), and data storage architecture. Following is a list of the major factors:
• SaaS and PaaS application architecture and redundancy.
• Cloud service data center architecture, and network and systems architecture,
including geographically diverse and fault-tolerance architecture.
• Reliability and redundancy of Internet connectivity used by the customer and
the CSP.
• Customer’s ability to respond quickly and fall back on internal applications
and other processes, including manual procedures.
• Customer’s visibility of the fault. In some downtime events, if the impact affects only a small subset of users, it may be difficult to get a full picture of the impact, which can make it harder to troubleshoot the situation.
By virtue of the service delivery and business model, SaaS service providers
are responsible for business continuity, application, and infrastructure security
management processes. This means the tasks your IT organization once handled will
now be handled by the CSP. Some mature organizations that are aligned with industry
standards, such as ITIL, will be faced with new challenges of governance of SaaS
services as they try to map internal service-level categories to a CSP.
For example, if a marketing application is considered critical and has a high
service-level requirement, how can the IT or business unit meet the internal marketing
department’s availability expectation based on the SaaS provider’s SLA? In some
cases, SaaS vendors may not offer SLAs and may simply address service terms via
terms and conditions. For example, Salesforce.com does not offer a standardized SLA
that describes and specifies performance criteria and service commitments. However,
another CRM SaaS provider, NetSuite, offers the following SLA clauses:
Uptime Goal—NetSuite commits to provide 99.5% uptime with respect to the
NetSuite application, excluding regularly scheduled maintenance times.
Scheduled and Unscheduled Maintenance—Regularly scheduled maintenance time
does not count as downtime. Maintenance time is regularly scheduled if it is
communicated at least two full business days in advance of the maintenance time.
Regularly scheduled maintenance time typically is communicated at least a week in
advance, scheduled to occur at night on the weekend, and takes less than 10–15 hours
each quarter.
NetSuite hereby provides notice that every Saturday night 10:00pm–10:20pm Pacific
Time is reserved for routine scheduled maintenance for use as needed.
There is no such thing as a standard SLA among cloud service providers. Uptime
guarantee, service credits, and service exclusions clauses will vary from provider to
provider.
Customer Responsibility
The following options are available to customers to stay informed on the health of
their service:
• Service health dashboard published by the CSP. Usually SaaS providers, such
as Salesforce.com, publish the current state of the service, current outages that may
impact customers, and upcoming scheduled maintenance services on their website (e.g., http://trust.salesforce.com/trust/status/).
• The Cloud Computing Incidents Database (CCID). (This database is generally community-supported, and may not reflect all CSPs and all incidents that have occurred.)
• Customer mailing list that notifies customers of occurring and recently
occurred outages.
• Internal or third-party-based service monitoring tools that periodically check
SaaS provider health and alert customers when service becomes unavailable (e.g.,
Nagios monitoring tool).
• RSS feed hosted at the SaaS service provider.
unavailable. For example, the Google App Engine has a quota system whereby each
App Engine resource is measured against one of two kinds of quotas: a billable quota
or a fixed quota.
Billable quotas are resource maximums set by you, the application’s administrator, to
prevent the cost of the application from exceeding your budget. Every application gets
an amount of each billable quota for free. You can increase billable quotas for your
application by enabling billing, setting a daily budget, and then allocating the budget
to the quotas. You will be charged only for the resources your app actually uses, and
only for the amount of resources used above the free quota thresholds.
Fixed quotas are resource maximums set by the App Engine to ensure the integrity of
the system. These resources describe the boundaries of the architecture, and all
applications are expected to run within the same limits. They ensure that another app
that is consuming too many resources will not affect the performance of your app.
Customer Responsibility
Considering all of the variable parameters in availability management, the PaaS
application customer should carefully analyze the dependencies of the application on
the third-party web services (components) and outline a holistic management strategy
to manage and monitor all the dependencies.
The following considerations are for PaaS customers:
PaaS platform service levels
Customers should carefully review the terms and conditions of the CSP’s
SLAs and understand the availability constraints.
Third-party web services provider service levels
When your PaaS application depends on a third-party service, it is critical to
understand the SLA of that service. For example, your PaaS application may rely
on services such as Google Maps and use the Google Maps API to embed maps in
your own web pages with JavaScript.
consideration all the services that you depend on for your IT and business needs.
Customers are responsible for all aspects of availability management since they are
responsible for provisioning and managing the life cycle of virtual servers.
Managing your IaaS virtual infrastructure in the cloud depends on five factors:
• Availability of a CSP network, host, storage, and support application
infrastructure. This factor depends on the following:
— CSP data center architecture, including a geographically diverse and fault-
tolerance architecture.
— Reliability, diversity, and redundancy of Internet connectivity used by the
customer and the CSP.
— Reliability and redundancy architecture of the hardware and software
components used for delivering compute and storage services.
— Availability management process and procedures, including business
continuity processes established by the CSP.
— Web console or API service availability. The web console and API are
required to manage the life cycle of the virtual servers. When those services become
unavailable, customers are unable to provision, start, stop, and deprovision virtual
servers.
— SLA. Because this factor varies across CSPs, the SLA should be reviewed
and reconciled, including exclusion clauses.
• Availability of your virtual servers and the attached storage (persistent and
ephemeral) for compute services.
• Availability of virtual storage that your users and virtual server depend on for
storage service. This includes both synchronous and asynchronous storage access use
cases. Synchronous storage access use cases demand low data access latency and continuous availability, whereas asynchronous use cases are more tolerant of latency and availability variations.
• Availability of your network connectivity to the Internet or virtual network
connectivity to IaaS services. In some cases, this can involve virtual private network
(VPN) connectivity between your internal private data center and the public IaaS
cloud (e.g., hybrid clouds).
Grid and cloud computing play a vital role in industry in the following areas:
Collaborative engineering on the cloud
Real-time data publishing
Intellectual expertise and optimization services
Automated data analysis services
Seventh Semester
1. Bring out the differences between private cloud and public cloud.
10. Discuss on the application and use of identity and access management
PART B — (5 x 16 = 80 marks)
11. (a) Illustrate the architecture of a virtual machine and brief about its operations.
Or
(b) Write short notes on :
Or
(b) Explain the data intensive grid service models with suitable diagrams.
13. (a) List the cloud deployment models and give a detailed note about each of them.
Or
(b) Give the importance of cloud computing and elaborate the different types of services offered by it.
Or
(b) Give a detailed note on Hadoop framework.
Or
(b) Write in detail about cloud security infrastructure.
Seventh Semester
(Regulations 2013)
(b) Explain how migrations of grid services are handled. (16)
13. (a) Discuss how virtualization is implemented in different layers. (16)
Or
(b) What do you mean by data centre automation using virtualization? (16)
14. (a) Discuss MapReduce with suitable diagrams. (16)
Or
(b) Elaborate HDFS concepts with suitable illustrations. (16)
15. (a) Write a detailed note on identity and access management architecture. (16)
Or
(b) Explain grid security infrastructure. (16)