Open Source SOA in the Cloud: Data Analytics in the Cloud
Tom Plunkett TomPlunkett@vt.edu
Michael Sick michael.sick@serenesoftware.com

SOA World 2009


Overview

Introductions
• Who are we?
• Baselines & definitions
• Targeted Use Cases

Opportunity
• Technical convergence & opportunities
• Commercial opportunities & drivers

Technology & Standards
• State of current technology
• Commercial & FOSS solutions
• Hadoop Focus

Challenges
• Challenges to Meet Target Use Cases
• Economic challenges & the role of “free”
• Wide scale challenges in Cloud and data analytics

Questions
• Questions
• Contacts

This work is licensed under a Creative Commons Attribution 3.0 United States License. Tom Plunkett & Michael Sick
Data Analytics in the Cloud: Introductions

Tom Plunkett

Extensive Federal Government Experience

IBM Certified SOA Solution Designer

Patents

Teaches OOP and Java for Virginia Tech

Michael Sick

Commercial & Federal Enterprise Architect

Owner: Serene Software Inc. – EA Services Firm

Clients include: BAE, USAF, Raytheon, BearingPoint, McGraw-Hill, Sun Microsystems, Badcock Furniture

Fascinated by technology, 15 years running

Serene Software

• Serene is a boutique consulting company focusing on delivery of Enterprise Architecture services and solutions
• Service Areas
– IT Governance
– IT Strategy
– IT Cost Containment
– Service Oriented Architectures (SOA)
– IT Solution Selection
– IT Audit & Analysis
• Experience includes: BAE, USAF, Raytheon, BearingPoint,
McGraw-Hill, Sun Microsystems, Badcock Furniture, …
• Founded in 2003 (privately held, no debt) and
headquartered in Jacksonville, FL

Draft NIST Definition of Cloud Computing

A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction

Essential Characteristics
• On-demand self-service
• Ubiquitous network access
• Location independent resource pooling
• Rapid elasticity
• Measured Service

Delivery Models
• Cloud Software as a Service (SaaS)
• Cloud Platform as a Service (PaaS)
• Cloud Infrastructure as a Service (IaaS)

Deployment Models
• Private cloud
• Community cloud
• Public cloud
• Hybrid cloud

Source: Draft NIST Definition of Cloud Computing, 06/2009

OSI Open Source Definition

Free Redistribution

Source Code

Derived Works

Integrity of The Author's Source Code

No Discrimination Against Persons or Groups

No Discrimination Against Fields of Endeavor

Distribution of License

License Must Not Be Specific to a Product

License Must Not Restrict Other Software

License Must Be Technology-Neutral


Source: http://www.opensource.org/docs/osd

The Open Group SOA Definition

Service-Oriented Architecture (SOA) is an architectural style that supports service orientation

Service orientation is a way of thinking in terms of services and service-based development and the outcomes of services

Source: http://www.opengroup.org/projects/soa/doc.tpl?gdid=10632

Data Clouds & Data Grids – What's the difference?

The terms Data Clouds & Data Grids are often used interchangeably; we make the following distinctions

Data Grids
• Grid computing system optimized to share large amounts of distributed data
• Focus on technical capabilities
• Often combined with computational grid computing systems
• Data often moved to compute grid for use
• Often oriented towards highly structured scientific data computing applications

Data Clouds
• Focuses on perception of infinite storage, computing capacity
• Focus on cost, virtualization & flexible capacity
• Enables scale-up/scale-down economics
• Data moved rarely, locality is a key feature
• Clouds thus far focusing on column oriented, massively scalable data stores

Sources: Wikipedia & [Grossman 1]
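
The locality point above (data is rarely moved, so work goes to the data) can be illustrated with a small sketch. This is only a toy scheduler under stated assumptions, not how any particular cloud or grid product assigns work: the function name `schedule_local`, the block names, and the node names are all hypothetical.

```python
def schedule_local(block_locations, node_load=None):
    """Assign each block's task to the least-loaded node that already stores it.

    block_locations maps a block name to the list of nodes holding a copy.
    No data is moved: a task only ever runs where its block already lives.
    """
    load = dict(node_load or {})
    plan = {}
    for block, nodes in sorted(block_locations.items()):
        # Prefer the node with the fewest assigned tasks; break ties by name.
        chosen = min(nodes, key=lambda n: (load.get(n, 0), n))
        plan[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return plan


# Hypothetical cluster layout used only for illustration.
locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-a"],
    "block-3": ["node-b", "node-c"],
}
plan = schedule_local(locations)
```

In a grid-style design the blocks would instead be copied to whichever compute node is free; here the plan is constrained to nodes that already hold the data.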

Definition: Mashups

Web available resource that combines data/functions from two or more external resources

Idea of mashup efforts is to reduce the cost of producing and consuming resources

Integration should be fast, easy

Often focuses on widely available formats/protocols like RSS or Atom over HTTP
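
As a minimal sketch of the definition above, the following combines items from two RSS feeds into one list. It assumes the feeds have already been fetched over HTTP; the two inline documents, their titles and links, and the helper names `read_items` and `mashup` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two tiny RSS 2.0 documents stand in for responses fetched over HTTP.
# Feed names, titles, and links are hypothetical.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Storage update</title><link>http://example.com/a/1</link></item>
<item><title>Pricing update</title><link>http://example.com/a/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Cluster report</title><link>http://example.org/b/1</link></item>
</channel></rss>"""


def read_items(rss_text):
    """Return (source, title, link) tuples for each <item> in an RSS document."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title")
    return [
        (source, item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]


def mashup(*feeds):
    """Combine items from several feeds into one list sorted by title."""
    combined = [entry for feed in feeds for entry in read_items(feed)]
    return sorted(combined, key=lambda entry: entry[1])


merged = mashup(FEED_A, FEED_B)
```

The combined list is itself a new resource that could be republished as a feed or rendered in a UI, which is the low-cost integration the slide describes.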

Data Analytics in the Cloud: Opportunities

Use Case: Cloud Data Analytical Tools for Intelligence Community Field Analyst

Problem Statement: Analytical tools are obsolete on deployment; field analysts need timely, configurable data analytics. How does cloud-based DA meet the needs of IC analysts?

Customer Problem
• Traditional business intelligence tools require years to develop
• Field analysts confront situations which are rapidly changing
• Petabytes of data require analysis

Cloud Analytical Tools Solution
• Recomposable Cloud Computing Data Analytical Tools
– Apache Hadoop
– Mashups
– Service-Oriented Architecture

Customer Value
• Enabling field analysts to quickly build the analytical tool they need to analyze petabytes of data

Why the “Buzzword” Soup? Convergence of Capabilities

Convergence of capabilities (diagram labels): Free Open Source Software (FOSS), Virtualization, Cloud Computing, Data Analytics, SaaS, Mashups

New opportunities in breadth and depth of DA services
• Big Data: Cloud disk and data storage engines make petabyte environments available to new clients
• Value Based Billing: Heavy use of FOSS in the cloud reduces costs directly & indirectly
• Capacity Scaling: Scaling up/down of capacity in pay-go fashion makes DA available to wider audience
• Composable UIs: Capability to assemble DA results into various interfaces

Early Data Analytic Cloud Consumers/Providers

Cloud DA Opportunities: Profile Types and Example Companies

Internet Scale Service Providers
• Big Internet Companies: Yahoo, Amazon – can build DA on inf.
• SaaS Companies: Force.com – DA & Warehousing to SBA’s
• Social Platforms: Facebook – sell DA access to anon. user info

Large data-centric Traditional Co’s
• Insurers: BCBS – private clouds across consortium
• Healthcare & Biotech: Kaiser Permanente – common DA services
• Rating Agencies: S & P – open DA cloud to customers

Government Organizations
• Intelligence Community: CIA – private org-wide Cloud
• Defense Managed Services: DISA – offer DA to .mil clients
• Healthcare: SSA – offer DA to fraud prevention analysts

DAaaS Providers
• DAaaS Infrastructure: Cloudera – managed Hadoop instances
• SMB DAaaS Provider: ?? – managed DAaaS, simplified, low cost

Data Analytics in the Cloud: Technology & Standards

Google MapReduce

Algorithm for computing distributed problems using a divide and conquer approach with a cluster of nodes

The master node Maps the input into smaller sub-problems and distributes the work to the cluster. A worker node may in turn map its work out to a further cluster of nodes. The worker nodes then process the smaller problems and return the answers to the master node

The master node then Reduces the set of answers into the answer to the original problem
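
The map and reduce steps described above can be sketched on a single machine with a word count. This is a simplified illustration, not Google's or Hadoop's implementation: threads stand in for worker nodes, and the function names (`split_input`, `map_chunk`, `reduce_pairs`, `word_count`) and sample input are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, parts):
    """Master step: divide the input into smaller sub-problems."""
    return [lines[i::parts] for i in range(parts)]


def map_chunk(chunk):
    """Worker step: emit (word, 1) pairs for one sub-problem."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_pairs(mapped_chunks):
    """Master step: combine the workers' answers into the final answer."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


def word_count(lines, workers=3):
    chunks = split_input(lines, workers)
    # Threads stand in for cluster nodes in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_pairs(mapped)


counts = word_count(["the cloud stores data", "the grid moves data", "data data"])
```

Each chunk is processed independently, so adding workers changes only how the input is split, not the reduce logic.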

Apache Hadoop

Open Source implementation of the MapReduce algorithms

Hadoop can store and process petabytes of data

Subprojects include HBase, Chukwa, Hive, Pig, and ZooKeeper

Yahoo (more than 100,000 CPUs in >25,000 computers running Hadoop) and other companies make extensive use of Hadoop

As-Is Hadoop Simplified Reference Architecture

[Architecture diagram. Labels: Apache Hadoop, Chukwa, HBase, Zookeeper, Pig, Hive, Structured Data, Unstructured Data, ETL, Business Intelligence]

Apache Hadoop Sub-projects

Hadoop Sub-projects: Capabilities and Example Companies

Chukwa (example company: Yahoo)
• Data collection system for monitoring and analyzing large distributed systems

HBase (example company: Yahoo)
• Similar to Google’s BigTable
• Distributed database for structured data
• Multi-dimensional sorted map

Hive (example company: Facebook)
• Data warehouse infrastructure for large datasets
• Hive QL query language

Pig (example company: Yahoo)
• High-level language for data analysis
• Compiler for Map-Reduce programs

Zookeeper (example company: Yahoo)
• Configuration, Naming, Distributed Synchronization, and group services
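
The "multi-dimensional sorted map" listed for HBase can be pictured as a map keyed by row, column, and timestamp and kept in sorted order. The sketch below is a toy in-memory model of that idea only; it is not the HBase API, and the class name `SortedCellMap`, its methods, and the sample rows and columns are hypothetical.

```python
import bisect


class SortedCellMap:
    """Toy in-memory map keyed by (row, column, timestamp), kept sorted."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for a cell, or None if absent."""
        start = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if start < len(self._keys) and self._keys[start][:2] == (row, column):
            return self._values[self._keys[start]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        start = bisect.bisect_left(self._keys, (start_row,))
        for key in self._keys[start:]:
            if key[0] >= stop_row:
                break
            row, column, neg_ts = key
            yield row, column, -neg_ts, self._values[key]


table = SortedCellMap()
table.put("row2", "info:name", 1, "beta")
table.put("row1", "info:name", 1, "alpha")
table.put("row1", "info:name", 2, "alpha-v2")
latest = table.get("row1", "info:name")
row1_cells = list(table.scan("row1", "row2"))
```

Because keys are sorted, a range scan over adjacent rows is a contiguous walk, which is what makes row-range queries cheap in this style of store.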

Data Analytics in the Cloud: Challenges

To-Be Simplified Hadoop Architecture

[Architecture diagram. Labels: REST API, SOAP API, Query Language, Algorithm Library, Business Intelligence, ETL, HBase, Pig, Hive, Chukwa, Zookeeper, Apache Hadoop, Structured Data, Unstructured Data]

Key Challenges

Emerging Challenges

Infrastructure
• Hardware: Speed of rack interconnects, multi-core
• Parallelization: Core platform, Data Analytic Components
• Node Affinity: Make use of super nodes, XML i/o, en/de-crypt

Adoption
• Cost: “brutally efficient” pricing, FOSS advantages
• Cost Models: Accurate, open models of CapEx, OpEx costs
• Migration Pain: Full warehouse migration, ETL

Administration
• Ease of Admin.: Parallel current RDBMS, Warehouse admin
• Debugging: Distributed debugging, integration w/ Provider
• Flexible Provisioning: Multi-level provisioning – co., dept, individual
• System Reporting: Reporting, audit trails, view to DA system

Input & Analysis
• ETL Integration: Interface, metadata optimized for ETL loading
• Intuitive APIs: Declarative & programmatic cross language
• Product Integration: BI, Applications (SAP, Oracle Financial, Lawson)

Output
• Data Visualization: Viewing & drill down of very large data sets
• Intuitive APIs: Declarative & programmatic cross language
• Mashups/Dynamics: Easy discovery of data & functions & workflows

Solutions: Projected & In-Progress

Emerging Challenges

Infrastructure
• Hardware: Interconnect $$ dropping, hardware maturing
• Parallelization: Platforms advance, market for components
• Node Affinity: Discovery of capability, affinity into Hadoop, …

Adoption
• Cost: FOSS’s game to lose, small diff * a lot = a lot
• Cost Models: Industry standard ROI/IRR models for CC
• Migration Pain: Migration toolkits for traditional DW products

Administration
• Ease of Admin.: Integrated & extended admin packages
• Debugging: Commercial distributed debugging
• Flexible Provisioning: Multi-level provisioning – co., dept, individual
• System Reporting: Reporting, audit trails, view to DA system

Input & Analysis
• ETL Integration: ETL interface, support of popular packages
• Intuitive APIs: SQL like interface in core, language bindings
• Product Integration: 3rd party adaptors, IWay et al

Output
• Data Visualization: Modeling, meta-data, traceability, and new UIs
• Intuitive APIs: SQL like interface in core, language bindings
• Mashups/Dynamics: Generic datatypes, discovery services
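
The "small diff * a lot = a lot" remark under Cost can be made concrete with a line of arithmetic. The figures below are purely hypothetical and are not taken from the presentation; the function name `annual_cost_gap` is likewise an illustration.

```python
def annual_cost_gap(price_diff_per_node_hour, nodes, hours_per_year=24 * 365):
    """Yearly cost difference produced by a small per-node-hour price gap."""
    return price_diff_per_node_hour * nodes * hours_per_year


# Hypothetical figures: a 0.02 price gap per node-hour across 1,000 nodes
# running all year (8,760 hours).
gap = annual_cost_gap(0.02, 1000)
```

With these assumed inputs the gap works out to about 175,200 currency units per year, which is why a small per-unit saving matters at cloud scale.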

Data Analytics in the Cloud: Questions

Questions & Contact Information

Michael A. Sick, Principal Architect / Partner
888.777.1847
michael.sick@serenesoftware.com

Tom Plunkett, Cloud Computing Architect
888.777.1847
TomPlunkett@vt.edu

Address (both)
Serene Software
116 19th Ave. North, Suite 503
Jacksonville Beach, FL
URL: www.serenesoftware.com
