Data
A collection of facts that is transmitted and stored in electronic form, and processed through software.
Information
Processed data that is presented in a specific context to enable useful interpretation and decision-making.
Information is stored on storage devices that use non-volatile media.
Data Center
A facility that houses IT equipment including compute, storage, and network components, and other supporting
infrastructure for providing centralized data-processing capabilities.
First Platform
Based on mainframes
Applications and databases hosted centrally
Users connect to mainframes through terminals
Second Platform
Based on client-server model
Distributed application architecture
Servers receive and process requests for resources from clients
Users connect through a client program or a web interface
Third Platform
Based on cloud, big data, mobile, and social technologies
Cloud Computing
A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing
resources (e.g., servers, storage, networks, applications, and services) that can be rapidly provisioned and released with
minimal management effort or service provider interaction.
A cloud is a collection of network-accessible hardware and software resources
Cloud Infrastructure
On-demand self-service
Broad Network Access
Resource Pooling
Rapid Elasticity
Measured Service
Infrastructure as a Service
The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing
resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and
applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating
systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host
firewalls).
Platform as a Service
The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired
applications created using programming languages, libraries, services, and tools supported by the provider. The consumer
does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage,
but has control over the deployed applications and possibly configuration settings for the application-hosting environment.
Software as a Service
The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The
applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g.,
web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure
including network, servers, operating systems, storage, or even individual application capabilities, with the possible
exception of limited user-specific application configuration settings.
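The three service model definitions above differ mainly in what the consumer controls. A minimal sketch of that split as a lookup table follows; the dictionary names, the item labels, and the `control_level` helper are illustrative assumptions, transcribed loosely from the definitions rather than taken from any provider's API.

```python
# Items the definitions above say the consumer controls outright.
# (Labels are illustrative wording, not an official taxonomy.)
CONSUMER_CONTROL = {
    "IaaS": {"operating systems", "storage", "deployed applications"},
    "PaaS": {"deployed applications"},
    "SaaS": set(),
}

# Items the definitions describe only as "possibly" or "limited" control.
LIMITED_CONTROL = {
    "IaaS": {"select networking components"},
    "PaaS": {"application-hosting environment settings"},
    "SaaS": {"user-specific application settings"},
}


def control_level(model: str, item: str) -> str:
    """Return 'consumer', 'limited', or 'provider' for an item under a model."""
    if item in CONSUMER_CONTROL[model]:
        return "consumer"
    if item in LIMITED_CONTROL[model]:
        return "limited"
    # Everything else is managed by the provider.
    return "provider"


for model in ("IaaS", "PaaS", "SaaS"):
    print(model, "->", control_level(model, "operating systems"))
```

Only IaaS reports `consumer` for operating systems; under PaaS and SaaS the operating system is part of the provider-managed infrastructure.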
Big Data
Information assets whose high volume, high velocity, and high variety require the use of new technical architectures and
analytical methods to gain insights and derive business value.
Components of a Big Data Analytics Solution
Layers
Computer System
A computing platform (hardware and system software) that runs applications
Compute Virtualization
The technique of abstracting the physical compute hardware from the operating system and applications,
enabling multiple operating systems to run concurrently on a single or clustered physical compute system(s).
Hypervisor
Software that provides a virtualization layer for abstracting compute system hardware, and enables the creation of
multiple virtual machines.
VM Files
Application Virtualization
The technique of decoupling an application from the underlying computing platform (OS and hardware) to enable the
application to be used on a computer system without installation.
Techniques
Application encapsulation
Application is converted into a standalone, self-contained executable package
Application packages may run directly from local drive, USB, or optical disc
Application presentation
Application is hosted and executes remotely, and the application's UI data is transmitted to the client
Locally-installed agent on the client manages the exchange of UI information with the user's remote application session
Application streaming
Application-specific data is transmitted in portions to clients for local execution
Requires locally-installed agent, client software, or web browser plugin
Desktop Virtualization
Technology that decouples the OS, applications, and user state from a physical compute system to create a virtual
desktop environment that can be accessed from any client device.
Storage Virtualization
Abstracts physical storage resources to create virtual storage resources:
Virtual volumes
Virtual disk files
Virtual storage systems
Network Virtualization
Abstracts physical network resources to create virtual network resources:
Virtual switch
Virtual LAN
Virtual SAN
Software-Defined Controller
Discovers underlying resources and provides an aggregated view of resources
Abstracts the underlying hardware resources and pools them
Enables the rapid provisioning of resources based on pre-defined policies
Week 2
Disk service time = seek time + rotational latency + data transfer time
Seek Time
Time taken to position the read/write head
Rotational Latency
The time taken by the platter to rotate and position the data under the R/W head
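A worked example of the disk service time formula above, as a small sketch. It assumes that average rotational latency is the time for half a revolution and that data transfer time is block size divided by a sustained transfer rate. The function names and the sample figures (5 ms seek, 15,000 rpm, 32 KB block, 40 MB/s) are illustrative assumptions, not values from these notes.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in ms, assumed to be half a revolution."""
    return 0.5 * 60_000.0 / rpm


def transfer_time_ms(block_kb: float, rate_mb_per_s: float) -> float:
    """Time in ms to move one block at a sustained rate (1 MB = 1000 KB here)."""
    return block_kb / (rate_mb_per_s * 1000.0) * 1000.0


def disk_service_time_ms(seek_ms: float, rpm: float,
                         block_kb: float, rate_mb_per_s: float) -> float:
    """Disk service time = seek time + rotational latency + data transfer time."""
    return (seek_ms
            + rotational_latency_ms(rpm)
            + transfer_time_ms(block_kb, rate_mb_per_s))


# Illustrative figures only.
service_ms = disk_service_time_ms(seek_ms=5.0, rpm=15_000,
                                  block_kb=32, rate_mb_per_s=40)
# Rough upper bound on I/Os per second if the drive did nothing but this I/O size.
max_iops = 1000.0 / service_ms
print(f"{service_ms:.1f} ms per I/O, about {max_iops:.0f} IOPS")
```

With these numbers the three terms are 5.0 ms, 2.0 ms, and 0.8 ms, so seek time dominates, which is typical for small random I/O on rotating disks.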
RAID
A technique that combines multiple disk drives into a logical unit (RAID set) and provides protection, performance, or
both.
RAID Levels
Commonly used RAID levels are:
RAID 0: Striped set with no fault tolerance
RAID 1: Disk mirroring
RAID 1 + 0: Nested RAID
RAID 3: Striped set with parallel access and dedicated parity disk
RAID 5: Striped set with independent disk access and distributed parity
RAID 6: Striped set with independent disk access and dual distributed parity
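The protection each level provides costs usable capacity. The sketch below estimates usable capacity for the levels listed above, assuming equal-sized drives, two-way mirroring for RAID 1 and RAID 1 + 0, one drive's worth of parity for RAID 3 and RAID 5, and two for RAID 6. The function name, the table names, and the minimum drive counts are assumptions made for illustration.

```python
# Drives' worth of capacity consumed by parity (assumed: single or dual parity).
PARITY_DRIVES = {"3": 1, "5": 1, "6": 2}

# Minimum drive counts assumed for each level in this sketch.
MIN_DRIVES = {"0": 2, "1": 2, "1+0": 4, "3": 3, "5": 3, "6": 4}


def usable_capacity_tb(level: str, drives: int, drive_tb: float) -> float:
    """Usable capacity of a RAID set built from equal-sized drives."""
    if level not in MIN_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < MIN_DRIVES[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DRIVES[level]} drives")
    if level == "0":
        # Striping only: every drive holds data.
        return drives * drive_tb
    if level in ("1", "1+0"):
        # Two-way mirroring: half of the drives hold copies.
        if drives % 2:
            raise ValueError("mirroring needs an even number of drives")
        return (drives // 2) * drive_tb
    # Parity levels: subtract the capacity used for parity.
    return (drives - PARITY_DRIVES[level]) * drive_tb


for level in ("0", "1+0", "5", "6"):
    print(f"RAID {level}: {usable_capacity_tb(level, 8, 2.0):.0f} TB usable of 16 TB raw")
```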
RAID Techniques
Striping
Disk striping is the process of dividing a body of data into blocks and spreading the data blocks across multiple
storage devices, such as hard disks or solid-state drives (SSDs).
Mirroring
Disk mirroring, also known as RAID 1, is the replication of data to two or more disks. Disk mirroring is a good
choice for applications that require high performance and high availability, such as transactional applications, email, and
operating systems.
Parity
In computing, parity (from the Latin paritas, meaning equal or equivalent) is a technique that checks whether
data has been lost or written over when it is moved from one place in storage to another or when it is transmitted
between computers. In a RAID set, parity information computed from the data strips lets the array rebuild the data of a failed drive.
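Striping and parity can be illustrated together. The sketch below assumes parity is computed as a byte-wise XOR across the strips of one stripe, which is the usual approach for single-parity RAID, and shows how a lost strip is rebuilt from the survivors. The helper names and the sample data are hypothetical, and the code handles only one stripe.

```python
from functools import reduce


def stripe(data: bytes, data_disks: int, strip_size: int) -> list:
    """Split one stripe's worth of data into strips, zero-padding the tail."""
    if len(data) > data_disks * strip_size:
        raise ValueError("data does not fit in a single stripe")
    padded = data.ljust(data_disks * strip_size, b"\0")
    return [padded[i * strip_size:(i + 1) * strip_size]
            for i in range(data_disks)]


def xor_strips(strips: list) -> bytes:
    """Byte-wise XOR of equally sized strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Spread the data over four data disks and compute one parity strip.
strips = stripe(b"RAIDDATA", data_disks=4, strip_size=2)
parity = xor_strips(strips)

# Simulate losing disk 2: XOR of the surviving strips and parity restores it.
survivors = [s for i, s in enumerate(strips) if i != 2] + [parity]
rebuilt = xor_strips(survivors)
print(rebuilt == strips[2])
```

XOR works here because XOR-ing a value with itself cancels out, so combining the parity with every surviving strip leaves exactly the missing strip.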
Scale-up can solve a capacity problem without adding infrastructure elements such as network connectivity. However, it
does require additional space, power, and cooling. Scaling up does not add controller capabilities to handle additional host
activities. That means it does not add costs for extra control functions either.
Scale-out storage usually requires additional storage systems (called nodes) to add capacity and performance. In the case of
monolithic storage systems, it scales by adding more functional elements (usually controller cards). One difference
between scaling out and just putting more storage systems on the floor is that scale-out storage continues to be
represented as a single system.
Cache Management: Algorithms
Least recently used (LRU): Discards data that have not been accessed for a long time
Most recently used (MRU): Discards data that have been most recently accessed
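The difference between the two algorithms shows up only at eviction time. Below is a minimal sketch of a fixed-size cache that supports both policies; the `SimpleCache` class and its methods are hypothetical names for illustration, and a real storage system's cache tracks pages in a more elaborate way.

```python
from collections import OrderedDict


class SimpleCache:
    """Fixed-size cache with an LRU or MRU eviction policy (illustrative only)."""

    def __init__(self, capacity: int, policy: str = "LRU"):
        if policy not in ("LRU", "MRU"):
            raise ValueError("policy must be 'LRU' or 'MRU'")
        self.capacity = capacity
        self.policy = policy
        # Ordered from least recently accessed (first) to most recent (last).
        self._pages = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a miss; a hit refreshes recency."""
        if key not in self._pages:
            return None
        self._pages.move_to_end(key)
        return self._pages[key]

    def put(self, key, value):
        """Insert or update a page; return the evicted key, if any."""
        if key in self._pages:
            self._pages[key] = value
            self._pages.move_to_end(key)
            return None
        evicted = None
        if len(self._pages) >= self.capacity:
            # LRU removes the oldest entry; MRU removes the newest one.
            evicted, _ = self._pages.popitem(last=(self.policy == "MRU"))
        self._pages[key] = value
        return evicted


for policy in ("LRU", "MRU"):
    cache = SimpleCache(capacity=2, policy=policy)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")  # "a" is now the most recently accessed page
    print(policy, "evicts", cache.put("c", 3))
```

With the same access pattern, LRU discards `b` (untouched the longest) while MRU discards `a` (touched most recently).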
Storage Provisioning
The process of assigning storage resources to a compute system based on capacity, availability, and performance
requirements.
MetaLUN
A method to expand LUNs that require additional capacity or performance.
LUN Masking
A process that provides data access control by defining which LUNs a computer system can access.
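LUN masking amounts to a lookup table consulted before a compute system is shown a LUN. The sketch below models it as a mapping from an initiator identifier to the set of LUNs that initiator may see. The `MaskingTable` class, its method names, and the identifiers used are hypothetical; real arrays expose masking through their own management interfaces.

```python
class MaskingTable:
    """Maps each initiator (compute system) to the LUN IDs it may access."""

    def __init__(self):
        self._visible = {}  # initiator ID -> set of LUN IDs

    def grant(self, initiator: str, lun_id: int) -> None:
        """Make a LUN visible to an initiator."""
        self._visible.setdefault(initiator, set()).add(lun_id)

    def revoke(self, initiator: str, lun_id: int) -> None:
        """Hide a LUN from an initiator; a no-op if it was never granted."""
        self._visible.get(initiator, set()).discard(lun_id)

    def can_access(self, initiator: str, lun_id: int) -> bool:
        """Unknown initiators see nothing by default."""
        return lun_id in self._visible.get(initiator, set())

    def visible_luns(self, initiator: str) -> list:
        return sorted(self._visible.get(initiator, set()))


# Hypothetical initiator names, for illustration only.
table = MaskingTable()
table.grant("host-finance", 0)
table.grant("host-finance", 1)
table.grant("host-sales", 2)

print(table.visible_luns("host-finance"))
print(table.can_access("host-sales", 0))
```

The default-deny behavior is the point of masking: a compute system that has not been granted a LUN cannot access it, even though it shares the same storage system.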
Storage Tiering
A technique of establishing a hierarchy of storage types and identifying the candidate data to relocate to the appropriate
storage type to meet service level requirements at a minimal cost.
LUN and Sub-LUN Tiering
LUN tiering: Moves an entire LUN from one tier to another
Does not give effective cost and performance benefits
Sub-LUN tiering: A LUN is broken down into smaller segments and tiered at that level
Provides effective cost and performance benefits
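A small sketch of why tiering at the segment level pays off: only the busiest segments of a LUN are placed on the fast tier, and the rest stay on cheaper storage. The function name, the tier labels `ssd` and `hdd`, and the I/O counts are assumptions for illustration; actual arrays use their own activity statistics and relocation schedules.

```python
from typing import Dict


def place_segments(io_counts: Dict[int, int], fast_tier_slots: int) -> Dict[int, str]:
    """Assign the most active segments to the fast tier, the rest to the slow tier.

    io_counts maps a segment number to its recent I/O count.
    Ties are broken by the lower segment number.
    """
    ranked = sorted(io_counts, key=lambda seg: (-io_counts[seg], seg))
    hot = set(ranked[:fast_tier_slots])
    return {seg: ("ssd" if seg in hot else "hdd") for seg in io_counts}


# One LUN split into four segments; only two fit on the fast tier.
io_counts = {0: 900, 1: 15, 2: 640, 3: 3}
placement = place_segments(io_counts, fast_tier_slots=2)
print(placement)
```

Under LUN tiering the whole LUN, including the nearly idle segments 1 and 3, would have to sit on one tier; here only segments 0 and 2 consume fast-tier capacity.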
Cache Tiering
Enables creation of a large capacity secondary cache using SSDs
Enables tiering between DRAM cache and SSDs (secondary cache)
Most reads are served directly from high performance tiered cache
Benefits
Enhances performance during peak workload
Non-disruptive and transparent to applications
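The read path through a tiered cache can be sketched as a lookup that tries DRAM first, then the SSD secondary cache, and only then the disk. The promotion behavior shown (copying a block into the faster tiers after a miss) and the absence of any eviction are simplifying assumptions, and `read_block` is a hypothetical helper, not an array API.

```python
def read_block(block_id, dram: dict, ssd: dict, disk: dict):
    """Return (data, source), trying DRAM, then SSD cache, then disk."""
    if block_id in dram:
        return dram[block_id], "dram"
    if block_id in ssd:
        data, source = ssd[block_id], "ssd"
    else:
        data, source = disk[block_id], "disk"
        # Assumed policy: blocks read from disk are also kept in the SSD cache.
        ssd[block_id] = data
    # Assumed policy: any block just read is promoted into DRAM.
    dram[block_id] = data
    return data, source


disk = {7: b"payload"}
dram, ssd = {}, {}

print(read_block(7, dram, ssd, disk)[1])  # first read comes from disk
dram.clear()                              # simulate DRAM pressure evicting the block
print(read_block(7, dram, ssd, disk)[1])  # served from the SSD secondary cache
print(read_block(7, dram, ssd, disk)[1])  # served from DRAM
```

The second read is where the SSD tier helps: the block has left DRAM but is still served from cache rather than from disk.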
What is NAS
An IP-based, dedicated, high-performance file sharing and storage device.
Scale-up NAS
Provides capability to scale capacity and performance of a single NAS system
NAS systems have a fixed capacity ceiling
Performance may degrade after reaching the capacity limit
Scale-out NAS
Pools multiple nodes in a cluster to work as a single NAS device
Scales performance and/or capacity nondisruptively
Creates a single file system that runs on all nodes in the cluster
Clients, connected to any node, can access the entire file system
File system grows dynamically as nodes are added
HDFS (Hadoop Distributed File System) is a distributed file system that provides high-performance access to data across
Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of big data and
supporting big data analytics applications.
File-level Storage Tiering
Moves files from higher tier to lower tier
Storage tiers are defined based on cost, performance, and availability parameters
Uses a policy engine to determine which files must move to the lower tier
The predominant use of file tiering is archival
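A policy engine of this kind can be as simple as a rule applied to file metadata. The sketch below selects files that have not been accessed for a set number of days as candidates for the lower tier. The `FileRecord` type, the 90-day threshold, the tier name, and the sample paths are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    path: str
    last_access: datetime
    tier: str = "primary"


def files_to_archive(files, now: datetime, max_idle_days: int = 90):
    """Return primary-tier files whose last access is older than the threshold."""
    cutoff = now - timedelta(days=max_idle_days)
    return [f for f in files if f.tier == "primary" and f.last_access < cutoff]


# Hypothetical file metadata with a fixed "now" so the result is repeatable.
now = datetime(2024, 6, 1)
files = [
    FileRecord("/projects/q1-report.docx", datetime(2024, 1, 10)),
    FileRecord("/projects/budget.xlsx", datetime(2024, 5, 28)),
    FileRecord("/archive/old-logs.tar", datetime(2023, 3, 1), tier="archive"),
]

for record in files_to_archive(files, now):
    print("move to lower tier:", record.path)
```

Only the report qualifies: the spreadsheet was accessed recently, and the log bundle already sits on the archive tier.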
Week 3
Software-based Object Storage
Object storage software is installed on any compatible hardware
Provides the flexibility to reuse the existing infrastructure (compute and storage) and to use commodity hardware
Object-based software can also be installed on virtual machines