
WHITE PAPER

ETL 2.0
Data Integration Comes of Age
Robin Bloor, Ph.D.
Rebecca Jozwiak
Executive Summary
In this white paper, we examine the evolution of ETL, concluding that a new generation of
ETL products, which we have called ETL 2.0, is putting a much-needed emphasis on the
transformation aspect of ETL. The following bullet points summarize the contents of the
paper.
• Data movement has proliferated wildly since the advent of the data warehouse,
necessitating the growth of a market for ETL products that help to automate such
transfers.
• Few data centers have used ETL consistently; many data transfer programs are hand
coded or implemented using SQL utilities. As a consequence, the ETL environment is
usually fragmented and poorly managed.
• Databases and data stores, in combination with data transfer activities, can be viewed
as providing a data services layer to the organization. Ultimately, the goal of such data
services is to provide any data needed by authorized IT and business users when they
want it and in the form that they need it.
• The capabilities of the first generation of ETL products are now being stressed by:
- The growth of new applications, particularly BI applications.
- The growth of data volumes, the increasing variety of data, and the need for
speed.
- The increasing need to analyze very large pools of data, often including historical
and social network data.
- High-availability (24/7) requirements that have closed batch windows in which
ETL programs could run.
- Rapid changes in technology.
• With respect to technology changes, we note the emergence of a whole new generation of
databases that are purpose-designed to exploit current computer hardware, both to
achieve better performance and scale and to manage very large collections of data.
• Similarly, we believe the second generation of ETL products will be capable of better
performance and scalability, and will be better able to process very large volumes of
data.
• We characterize the second generation of ETL products as having the following
qualities:
- Improved connectivity
- Versatility of extracts, transformations, and loads
- Breadth of application
- Usability and collaboration
- Economy of resource usage
- Self-optimization
By leveraging an ETL tool that is versatile in both connectivity and scalability, businesses can
overcome the challenges of large data volumes and improve the overall performance of data flows.
The versatility of second-generation ETL tools additionally allows for a wide variety of
applications that address business needs, however complex. These products will improve the
time to value for many applications that depend on data flows and provide a framework that
fosters collaboration among developers, analysts, and business users. By virtue of software
efficiency, these tools will require fewer hardware resources than previous tools, and because
transformations are processed in memory, they will eliminate the need for workarounds,
scheduling, and constant tuning.
In summary, it is our view that ETL tools with such capabilities will become increasingly strategic
because of their critical role in the provision of data services to applications and business
users, and that their inherently low development and maintenance costs can help businesses realize
a significantly lower overall total cost of ownership (TCO).

The Big Data Landscape
The vision of a single database that could serve the needs of a whole organization was laid to
rest long ago. That discarded ideal was superseded by a pragmatic acceptance that the data
resources of an organization will involve many data stores threaded together by ad hoc data
flows carrying data from one place to another. The corporate data resource is fragmented, and
there is scant hope that this state of affairs will change any time soon.
It is worth reflecting on why this is the case.
As the use of database technology grew, it soon became clear that the typical workload of
transactional systems was incompatible with the heavy query workloads that provided data
to reporting applications. This gave rise to the idea of replicating the data from operational
systems into a single large database, a data warehouse, that could serve all reporting
requirements.
Initially data migration from operational systems to data warehouses was served by
programmers writing individual programs to feed data to the data warehouse. This task was
time-consuming, carried with it an ongoing maintenance cost, and could be better automated
through the use of purpose-built tools. Such tools quickly emerged and were called extract,
transform, and load (ETL) tools.
Over time, the wide proliferation of business intelligence (BI) applications drove the
increasing creation of data marts, subsets of data focused on a single business area or
function. This, in turn, meant more work for ETL tools.
Eventually, the speed of this data transfer process came into question. The time required to
move data from production systems to the data warehouse and then on to data marts was too
long for some business needs. Therefore, organizations were forced to implement suboptimal
workarounds to achieve the performance needed to support the business.
Meanwhile, data continued to grow exponentially. While Moore's Law increases computer
power by a factor of 10 roughly every six years or so, big databases seemed to grow by 1,000
times in size during that period. That's Moore's Law cubed (10³ = 1,000). That's Big Data, and that's
mainly what is prompting the latest revolution. There is no doubt that data integration has
become increasingly complex and costly. Organizations can no longer rely solely on hardware
and inefficient workarounds to overcome the Big Data challenges ahead. Clearly, a new
approach is needed.
The Stressing of ETL 1.0
Aside from the growth in the number of business applications and the perennial growth in
data volumes, there are four distinct factors that have placed increasing demands on ETL
tools since they were originally introduced.
Timing Constraints
Fifteen years ago interactive applications rarely ran for more than 12 hours. This left ample
time for ETL tools to feed data to a data warehouse, for reporting tools to use the operational
data directly, and for database backups. This convenient slack period was generally referred
to as the batch window. ETL transfers tended to run on a weekly or even a monthly basis.
But the batch windows gradually began to close or vanish. Data marts proliferated as a much
wider variety of BI and analytics applications emerged. Eventually, the demand for data
warehouse updates shifted from nightly to hourly to near-real time as the need for timely
information grew. ETL tools had to try to accommodate this new reality.
Technology Shifts
Computer technology gets faster all the time, but what Moore's Law provides, data growth
takes away. To complicate the situation, hardware does not improve in a uniform way. By
2003, after several years of increasing CPU clock speed as the means of accelerating processing
power, Intel and AMD began to produce multicore CPUs, packaging more than one processor
on each chip. Most databases, because they had been built for high scalability and
performance, soon benefited from these additional resources. However, few ETL tools were
designed to exploit multiple processors and parallelize workloads. They were behind the
curve.
The Advent of Big Data
There have always been massive amounts of data that would be useful to analyze. What is
relatively new is the increasing volume, complexity, and velocity of that data, now referred to as
Big Data.
In general, the term Big Data refers to collections of data measured in tens or hundreds of
terabytes that require significant hardware and software resources in order to be analyzed.
Large web businesses have such data. So do telecom companies, financial sector companies,
and utility companies of various kinds. And now that there are databases that cater to such
data, many other businesses are discovering areas of opportunity where they can accumulate
and analyze large volumes of data as well.
In practice, for many organizations Big Data means digging into previously archived or
historical data. Similarly, large and small businesses are also discovering the need to store and
analyze data at a more granular level. Either way, the elusive heaps of data that were
previously considered inaccessible are fast becoming viable data assets. For ETL, Big Data
translates into new and larger workloads.
Cloud Computing, Mobile Computing
Cloud computing adds complexity to the equation by extending the data center through the
Internet, providing not only additional data sources and feeds, but also cloud-based
applications like Salesforce.com, which may pose additional integration challenges.
Moreover, cloud environments will most likely suffer from relatively low connection speeds
and, possibly, data traffic limitations. Similarly, mobile computing adds new and different
applications, many of which demand a very specific data service. The dramatic adoption of
smart phones and other mobile devices ultimately augments the creation and velocity of data,
two of the key characteristics of Big Data.
The Limitations of ETL 1.0
ETL evolved from a point-to-point data
integration capability to a fundamental
component of the entire corporate data
infrastructure. We illustrate this in a simple
way in Figure 1, which notionally partitions
IT into a data services layer that manages
and provides data and an applications layer
that uses the data.
However, reality is far more complex than
shown in the illustration. There is a variety
of ways that applications can be connected
to data. Applications vary in size and can
reside on almost any device from a server to
a mobile phone. Data flows can weave a
complex web. Nevertheless, as the diagram
suggests, data management software and
ETL are complementary components that
combine to deliver a data service. As such,
they need to work hand in hand.
The problem with most ETL products, those
we think of as ETL 1.0, is that they were
never designed for such a role. They were
designed to make it easy for IT users and
programmers to specify data flows and to carry out some simple data transformations in
flight so that data arrived in the right format. They included a scheduling capability so that
they would fire off at the right time, and they usually included a good set of connectors to
provide access to a wide variety of databases and data stores. They were very effective for
specifying and scheduling point-to-point data flows.
What many of them lacked, however, was a sophisticated software architecture. They weren't
designed to efficiently handle complex data transformations in flight. Indeed, the "T" in ETL
was largely absent. They weren't designed to use resources economically. They weren't
designed for scalability or high-speed data transfers. They weren't designed to handle ever-
increasing data volumes. In summary, they were not designed to globally manage data flows
in a data services layer.
As data volumes increased, so did the challenge of accessing that data. In many situations,
ETL tools simply were not fast enough or capable enough. Consequently, data transformation
activity was often delegated to the database, with database administrators (DBAs) trying to
manage performance through constant tuning. Developers resorted to hand coding or using
ETL tools just for scheduling. This inevitably led to spaghetti architectures, longer
development cycles, and higher total cost of ownership. Strategic business objectives were not
being met.
[Figure 1. Applications and Data Services: a notional diagram in which an application layer of BI applications and other applications draws on a data services layer made up of DBMSs, file stores, and the ETL flows that connect them.]
Increasingly, companies find themselves in this situation. For instance, a leading
telecommunications company spent over $15 million on additional database capacity
just to get a 10% improvement in overall performance. More importantly, 80% of its
database capacity was consumed by data transformations as opposed to analytics.
The Nature of ETL 2.0
Having described the failings and limitations of ETL 1.0,
we can now describe what we believe to be the
characteristics of ETL 2.0. Just as database technology is
evolving to leverage Big Data, we should expect ETL
products to be either re-engineered or to be superseded.
ETL products are (or should be) complementary to
databases and data stores, together delivering a data
services layer that can provide a comprehensive data
service to the business. Unlike ETL 1.0, this new
approach would reduce the complexity, the cost, and the
time to value of data integration.
Figure 2 lists what we believe to be the qualities of an ETL 2.0
product; we describe them in detail below, in the order in which
they are listed.
Versatility of Connectivity, Extract, and Load
ETL has always been about connectivity to some degree, with ETL tools providing as many
connections as possible to the wide variety of databases, data stores, and applications that
pervade the data center. As new databases and data stores emerge, ETL products need to
accommodate them, and this includes the ability to connect to sources of unstructured data in
addition to databases. It also means connecting to cloud data sources as well as those in the
data center. Where ETL tools fail to provide a connection, hand coding, with all its painful
overhead, becomes necessary.
Extracting data can be achieved in a variety of ways. The ETL product can simply use an SQL
interface to a database to extract data, for example, but this is likely to be inefficient and it
presents an extra workload to the database. Alternatively, it can make use of database log files
or it can access the raw disk directly. ETL products need to provide such options.
The same goes for loading data. The ETL tool may load the data into staging tables within the
database in a convenient form or may simply deposit the data as a file for the database to load
at its leisure. Ideally, ETL tools would be able to present data in a form that allows for the
fastest ingest of data by the target database without violating constraints defined within the
database's schema.
Products that qualify as ETL 2.0 need to have as many extract and load options as possible to
ensure the overall performance of any given data flow, while placing the least possible
overhead on data sources and targets.
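By way of illustration only (not drawn from any particular product), the Python sketch below contrasts a row-at-a-time load with a staged, file-based load committed as one batch, using the standard sqlite3 module as a stand-in for the source and target databases; the table and column names are hypothetical. In practice, the second path would map to a target database's native bulk loader, which is typically far faster than individual inserts.

import csv
import sqlite3
import tempfile

# Hypothetical source and target; a real deployment would connect to the
# actual databases and prefer each target's native bulk-load facility.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(i, i * 1.5) for i in range(1000)])
target.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# Option A: row-at-a-time transfer (simple, but chatty and slow at scale).
for row in source.execute("SELECT id, amount FROM orders"):
    target.execute("INSERT INTO orders VALUES (?, ?)", row)
target.execute("DELETE FROM orders")   # reset so option B loads the same rows

# Option B: stage the extract to a flat file, then load it in one batched
# transaction, approximating "deposit a file for the database to load".
with tempfile.NamedTemporaryFile("w+", newline="", suffix=".csv") as staged:
    csv.writer(staged).writerows(source.execute("SELECT id, amount FROM orders"))
    staged.seek(0)
    rows = ((int(r[0]), float(r[1])) for r in csv.reader(staged))
    with target:   # commit the batched insert as a single transaction
        target.executemany("INSERT INTO orders VALUES (?, ?)", rows)
print(target.execute("SELECT COUNT(*) FROM orders").fetchone())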
Versatility of connectivity is also about leveraging and extending the capabilities of the
existing data integration environment, a concept commonly known as data integration
acceleration. This includes the ability to seamlessly accelerate existing data integration
deployments without the need to rip and replace, as well as to leverage and accelerate
emerging technologies like Hadoop.

[Figure 2. The Nature of ETL 2.0. ETL 2.0 qualities: versatility of connectivity, extract, and load; versatility of transformations and scalability; breadth of application; usability and collaboration; economy of resource usage; self-optimization.]
Versatility of Transformations and Scalability
All ETL products provide some transformations, but few are versatile. Useful transformations
may involve translating data formats and coded values between the data sources and the
target (if they are, or need to be, different). They may involve deriving calculated values,
sorting data, aggregating data, or joining data. They may involve transposing data (from
columns to rows) or transposing single columns into multiple columns. They may involve
performing look-ups and substituting looked-up values for the original values,
applying validations (and rejecting records that fail), and more. If the ETL tool cannot perform
such transformations, they will have to be hand coded elsewhere, in the database or in an
application.
It is extremely useful if transformations can draw data from multiple sources and data joins
can be performed between such sources in flight, eliminating the need for costly and
complex staging. Ideally, an ETL 2.0 product will be rich in transformation options, since its
role is to eliminate the need to hand code such data transformations elsewhere.
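To make the discussion concrete, here is a minimal sketch of our own (not tied to any product) of such in-flight transformations expressed as chained Python generator stages: validation with rejection, look-up substitution, a derived value, and an aggregation. The field names, the lookup table, and the validation rule are all hypothetical.

from collections import defaultdict

COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}   # hypothetical lookup

def validate(rows, rejects):
    # Reject rows that fail basic rules; rejected records are kept for review.
    for row in rows:
        if row["amount"] >= 0 and row["country"] in COUNTRY_NAMES:
            yield row
        else:
            rejects.append(row)

def enrich(rows):
    for row in rows:
        row["country"] = COUNTRY_NAMES[row["country"]]       # look-up substitution
        row["amount_usd"] = row["amount"] * row["fx_rate"]   # derived value
        yield row

def aggregate(rows):
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount_usd"]          # in-flight aggregation
    return dict(totals)

extracted = [
    {"country": "US", "amount": 100.0, "fx_rate": 1.0},
    {"country": "DE", "amount": 80.0, "fx_rate": 1.1},
    {"country": "XX", "amount": -5.0, "fx_rate": 1.0},       # fails validation
]
rejected = []
print(aggregate(enrich(validate(extracted, rejected))), rejected)

Because the stages are chained, rows flow through the transformations as they are read, rather than being staged to disk between steps.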
Currently, ETL workloads beyond the multi-terabyte level are uncommon, although in the future
they may be seen more frequently. Consequently, scalability needs to be inherent in the
ETL 2.0 product architecture so that it can optimally and efficiently transfer and transform
multiple terabytes of data when provided with sufficient hardware resources.
Breadth of Application
At its most basic, an ETL tool transfers data from a source to a target. It is more complex if
there are multiple sources and multiple targets. For example, ETL may supplement or replace
data replication carried out by a database, which can mean the ETL tool needs to deliver data
to multiple locations. This type of complexity can jeopardize speed and performance. An ETL
2.0 product must be able to transfer data swiftly and deftly, regardless of the number of sources
and targets.
A store-and-forward mode of use is important. The availability of data from data sources may
not exactly coincide with the availability of the target to ingest data, so the ETL tool needs to
gather the data from data sources, carry out whatever transformations are necessary, and then
store the data until the target database is ready to receive it.
Change data capture, whereby the ETL tool transfers only the data that has changed in the
source to the target, is a critically important option. It can reduce the ETL workload
significantly and improve timing dramatically, while keeping source and target databases
in sync.
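One common, simple way to implement change data capture is a high-watermark query against a last-modified timestamp; reading the database's transaction log is the other frequent approach and places even less load on source tables. The sketch below, using sqlite3 as a stand-in with hypothetical table and column names, illustrates the high-watermark variant.

import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "2023-01-01T10:00:00"),
     (2, "Globex", "2023-01-02T09:30:00")],
)

last_watermark = "2023-01-01T12:00:00"   # persisted from the previous run

# Transfer only rows modified since the last successful run.
changed = source.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ? "
    "ORDER BY updated_at",
    (last_watermark,),
).fetchall()

if changed:
    last_watermark = changed[-1][2]       # advance the watermark for next time
print(changed, last_watermark)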
The ability to stream data, so that the ingest process begins immediately when the data arrives
at the target, is another capability that reduces the overall time of a data transfer and achieves near-
real-time data movement. Such data transfers are often small batches of data
transferred frequently. For similar small amounts of data, it is important that real-time
interfaces, such as web services, MQ/JMS, and HTTP, are supported. This is also likely to be
important for mobile data services.
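As one hedged illustration of near-real-time movement, the sketch below micro-batches records from an incoming feed and hands each small batch to a loader as soon as either a size or a time threshold is reached. The feed and the load_batch() function are stand-ins of ours; a real deployment would read from a queue (MQ/JMS), a web service, or an HTTP endpoint and write to the target database.

import time

def micro_batches(records, max_rows=100, max_wait_seconds=1.0):
    # Group incoming records into small batches, flushing on size or time.
    batch, deadline = [], time.monotonic() + max_wait_seconds
    for record in records:
        batch.append(record)
        if len(batch) >= max_rows or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + max_wait_seconds
    if batch:
        yield batch                          # flush whatever remains at the end

def load_batch(batch):
    print(f"loading {len(batch)} records")   # placeholder for the real load step

incoming = ({"event_id": i} for i in range(250))   # simulated real-time feed
for batch in micro_batches(incoming):
    load_batch(batch)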
The ability to work within, or connect to, the cloud is swiftly becoming a necessity. This
requires not only supporting data transfer to and from common software-as-a-service (SaaS)
providers, such as Salesforce.com or Netsuite, but also accommodating the technical,
contractual, or cost constraints imposed by any cloud service.
An ETL 2.0 tool should be able to deliver on all these possibilities.
Usability and Collaboration
An ETL 2.0 tool should be easy to use for both the IT developer and the business user. As a
matter of course, the ETL tool should log and report on all its activity, including any
exceptions that occur in any of its activities. Such information must be easily available to
anyone who needs to analyze ETL activity for any purpose.
Developers must be able to define complex data transfers involving many transformations
and rules, specifying the usage mode and scheduling the data transfer in a codeless manner.
Business users should be able to take advantage of the power of the ETL environment with a
self-service interface based on their role and technical proficiency.
Today's business users are more technology-savvy than ever, and as such, they have the
potential to bring more to the table than a request for a report. An ETL tool should enable and
foster collaboration between business users, analysts, and developers by providing a
framework that automatically adapts to each user's role. When the business user has a clear
understanding of the data life cycle, and developers and analysts have a clear understanding
of the business goals and objectives, a greater level of alignment can be achieved.
In addition to bridging the proverbial gap between IT and the business, this type of ETL
approach can result in faster time to production and, ultimately, increased business agility
and lower costs. By eliminating the typical back-and-forth discussions, a collaborative effort
during the planning stages can have a significant impact on the efficiency of the environment
in which the ETL tool is leveraged.
Economy of Resource Usage
At a hardware level, the ETL 2.0 tool must identify available resources (CPU power, memory,
disk, and network bandwidth) and take advantage of them economically.
Specifically, it should be capable of data compression, which alleviates disk and network I/O
and is particularly important for cloud environments, and of parallel operation, both for
speed and for resource efficiency.
With any ETL operation, I/O is almost always one of the biggest bottlenecks. The ETL tool
should be able to dynamically understand and adapt to the file system and I/O bandwidth to
ensure optimized operation. The ETL tool also needs to clean up after itself, freeing up
computing resources, including disk space (eliminating all temporary files) and memory, as
soon as it no longer requires them.
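A minimal sketch of two of these behaviors, compressing staged data to cut disk and network I/O and removing temporary files as soon as they are no longer needed, appears below; the file layout and the "load" step are hypothetical placeholders of ours.

import gzip
import os
import tempfile

rows = [f"{i},value-{i}\n" for i in range(10000)]

fd, staged_path = tempfile.mkstemp(suffix=".csv.gz")
os.close(fd)
try:
    with gzip.open(staged_path, "wt") as staged:   # compressed staging file
        staged.writelines(rows)
    with gzip.open(staged_path, "rt") as staged:
        loaded = sum(1 for _ in staged)            # placeholder for the load step
    print(f"loaded {loaded} rows from {os.path.getsize(staged_path)} bytes staged")
finally:
    os.remove(staged_path)                         # free disk space immediately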
Self-Optimization
In our view, ETL 2.0 products should require very little tuning. The number of man-hours an
organization can spend on the constant tuning of databases and data flows hinders business
agility and eats up resources at an alarming rate. ETL tuning requires time and specific skills.
Even when it is effective, the gains are usually marginal and may evaporate when data
volumes increase or minor changes to requirements are implemented. Tuning is an expensive
and perpetual activity that doesn't solve the problem; it just defers it.
ETL 2.0 products will optimize data transfer speeds in line with performance goals, all but
eliminating manual tuning. They will embody an optimization capability that is aware of the
computer resources available and is able to optimize its own operations in real time without
the need for human intervention beyond setting very basic parameters. The optimization
capability will need to consider all the ETL
activities (extracts, transforms, and loads)
automatically, adjusting data processing
algorithms to optimize the data transfer
activity irrespective of how complex it is.
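As a speculative sketch of what such self-optimization might look like at its simplest, the code below adjusts its own batch size from observed throughput rather than relying on a manually tuned setting. The load step and the thresholds are hypothetical placeholders, and a real product would optimize far more than batch size.

import time

def load(batch):
    # Stand-in for a real load step; latency here simply scales with batch size.
    time.sleep(0.0001 * len(batch))

def adaptive_load(rows, batch_size=500):
    i = 0
    while i < len(rows):
        batch = rows[i : i + batch_size]
        start = time.monotonic()
        load(batch)
        rows_per_second = len(batch) / max(time.monotonic() - start, 1e-9)
        # Grow the batch while throughput holds up; shrink it when it drops.
        if rows_per_second > 8000:
            batch_size = min(batch_size * 2, 10000)
        elif rows_per_second < 2000:
            batch_size = max(batch_size // 2, 50)
        i += len(batch)
    return batch_size

print("settled batch size:", adaptive_load(list(range(20000))))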
The Benefits of ETL 2.0
It is clear that the modern business and
computing environment demands much more
from ETL and data integration tools than they
were designed to deliver. So it makes sense to
discuss how much that matters.
To most end users, ETL tools are little known
and largely invisible until they perform badly
or fail. As far as the business user is concerned,
there is useful data, and they need access to it
in a convenient form when and for whatever
reason they need it. Their needs are simple to
articulate but not so easy to satisfy.
Existing ETL products face many challenges,
as summarized in Figure 3. First and foremost,
they need to deliver a first-class data service to
business users by ensuring, where possible,
that the whole data services layer delivers data
to those users when and how they want it. The
ETL products need to accommodate timing
imperatives, perennial growth in applications
and data volumes, technology changes, Big Data, cloud computing, and mobile computing.
And, ultimately, they need to deliver business benefit.
The business benefits of effective ETL have two aspects: those that affect the operations of the
business directly and those that impact the efficient management and operation of IT
resources.
[Figure 3. The ETL Challenges: timing imperatives, more applications and data, technology changes, Big Data, cloud computing, and mobile apps all bear on ETL's central task of delivering any data to authorized users when they want it and how they want it.]
The Operations of the Business
The growth in the business use of data is unlikely to stall any time soon. At the leading edge
of this is the explosion of Big Data, cloud computing, and mobile BI, which are in their
infancy, but they won't be for long. A product page on Facebook, for example, can record
certain behaviors of the users who "like" that page. Drawing on such information and
matching it, perhaps, with specific information drawn from Twitter, the company can tailor its
marketing message to specific categories of customers and potential customers. Such
information is easy enough to gather, but not so easy to integrate with corporate data. The
business needs to be able to exploit any opportunity it identifies in these and related areas as
quickly as possible; continued competitiveness and revenue opportunities depend on it.
Assuming data can be integrated, the information needs to be delivered to the entire business
through an accurate and nimble data service, tailored to the various uses for that data.
Delivering new data services quickly and effectively means providing user self-service where
that is feasible and desirable, and enabling the fast development of new data services by IT
where that is necessary.
The ability to identify opportunities from the various sources of data and deliver the
information with agility is essential to continued competitiveness and revenue growth.
If this can be achieved, then ETL and the associated databases and data stores that provide
information services are doing their job.
The Operations of IT
Even when an effective ETL service is provided, its delivery may be costly. The expense of
ETL is best viewed from a total cost of ownership perspective. A primary problem that
eventually emerges from the deployment of outdated ETL products is entropy, the gradual
deterioration of the data services layer, which results in escalating costs.
In reality, the software license fees for the ETL tools are likely to be a very small percentage of
the cost of ownership. The major benefits of ETL 2.0 for the operations of IT will come from:
• Low development costs: New data transfers can be built with very little effort.
• Low maintenance effort: The manual effort of maintaining data transfers will be low
when changes to requirements emerge.
• Tunability/optimization: There will be little or no effort associated with ensuring
adequate performance.
• Economy of resource usage: They will require fewer hardware resources than previous
ETL products for any given workload.
• Fast development and user self-service: They will reduce the time to value for many
applications that depend on data flows.
• Scalability: There will be no significant limits to moving data around, since every
variety of data transfer will be possible and data volume growth will not require
exponential management overhead.
• Manageability: Finally, all ETL activities will be visible and managed collectively
rather than on a case-by-case basis. The major win comes from being able to plan for
the future, avoiding unexpected costs and provisioning resources as the company
needs them.
Clearly, ETL 2.0 benefits will differ from business to business. A capable ETL product will help
organizations remain competitive and relevant in the marketplace. Those organizations that
are under pressure from data growth or a highly fragmented ETL environment will see results
immediately by putting their house in order. For example, a Fortune 500 company has been
able to reduce its annual costs by more than $1 million by deploying an ETL 2.0 environment
with a well-planned, scalable data flow architecture. It replaced most of its existing data
transfer programs, eliminating all hand coding and significantly reducing its tuning and
maintenance activity. Similarly, businesses that are pioneering mobile computing and Big
Data may also see more gains than others. As a rough rule of thumb, the more data transfers
that are done, the more immediate the benefits of ETL 2.0 will be.
Conclusions
The first generation of ETL products, ETL 1.0, is becoming increasingly expensive and
difficult to deploy and maintain.
The details are different for each IT environment, but the same characteristics emerge. The
resource management costs of ETL escalate. The amount of IT effort to sustain ETL increases,
and the manageability of the whole environment deteriorates. What began as a series of
point-to-point deployments becomes an ad hoc spaghetti architecture. The environment
becomes saturated with a disparate set of transformations: some of them using the ETL tool
itself, some in the database, and some hand coded.
What's left is a data services layer that is impossible to manage, reuse, or govern. The IT
department is faced with a choice between failing to deliver an adequate service to the business
and paying a high price in order to do so. Such a situation is not sustainable in the long run.
As we've discovered, to meet today's business needs, a new approach to data integration is a
necessity. We call this approach ETL 2.0, and it is key to helping organizations remain
competitive in the marketplace.
The characteristics of ETL 2.0 include:
• Connectivity and versatility of extract and load
• Versatility of transformations and scalability
• Breadth of application
• Usability and collaboration
• Economy of resource usage
• Self-optimization
ETL products that provide the full range of capabilities described in this paper will almost
certainly have a significant impact on both organizations and the data integration industry as
a whole.
The benefits of ETL 2.0 are threefold: the business receives the data service it needs to remain
competitive and achieve strategic objectives; the ETL environment does not suffer from
entropy and can quickly scale to accommodate new demands for information; and, most
importantly, the total cost of owning, deploying, and maintaining the ETL environment is
significantly lower than that of its predecessor. A capable ETL product will reduce TCO
simply by removing the need for additional personnel and hardware, but one that delivers
really well will further increase ROI by providing businesses with the data they need to make
game-changing decisions precisely when it is needed, enabling organizations to maximize the
opportunities of Big Data.
About The Bloor Group
The Bloor Group is a consulting, research, and technology analysis firm that focuses on open
research and the use of modern media to gather knowledge and disseminate it to IT users.
Visit both www.TheBloorGroup.com and www.TheVirtualCircle.com for more information.
The Bloor Group is the sole copyright holder of this publication.
PO Box 200638, Austin, TX 78720 | Tel: 512-524-3689
www.TheVirtualCircle.com
www.BloorGroup.com
