Data Warehousing in a Flat World: Trends for 2006
The coming year will show what data warehousing looks like now that the world is flat.
Data warehousing trends in a flat world will be driven by open source platforms for data
management, offshore everything and the commoditization of infrastructure through
relatively low cost servers (which, if they were any less expensive, would be
disposable).
Saying the world is flat is a metaphor that points to the leveling effect created by the
digitalization, globalization and commoditization of information technology. An
increasingly level playing field exists between the Midwestern United States and
Bangalore, India, or Shanghai, China. An abundance of fiber-optic bandwidth, along with
the Internet, open source, outsourcing, offshoring and information infusion into business
processes means that productivity-enhancing innovations will depend on ever-expanding
communities of collaboration, communication and cooperation. But if the world is flat, it
is still not completely flat.
Friction is required in order for trends to gain traction and move forward. Bumps in the
level playing field of data warehousing in a flat world are presented by complex and
heterogeneous master data, data quality issues and the opacity of distributed
information. In most cases, forward motion will not be linear and may even double back
on itself in circular and difficult ways. Three paradoxes will characterize the dynamics
around data warehousing trends in the year ahead:

- Location persistence amid transparency in one of the most powerful trends, service-oriented architecture (SOA);
- The proprietary data warehousing appliance in a world of increasingly commoditized infrastructure; and
- Thin slices of information in a tidal wave of data.

The paradox of data persistence amid the location transparency of SOA is the first bump
in the road to data warehousing in a flat world.
The beauty of SOA for data warehousing is that it offers location transparency combined
with the action-at-a-distance characteristic of Web-centric computing. SOA is one of the
best approaches to come along since the Web was invented. It enables
enterprises to make the Web useful, reusable and manageable for business purposes,
consistent with the basic first principles of tight cohesion, loose coupling and
object-oriented design. In short, it enables in a practical way one of the holy
grails of business computing - information as a service.
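
To make the idea concrete, here is a minimal sketch (in Python, with hypothetical class, method and table names) of what "information as a service" means: consumers ask a service interface for an answer and never learn whether the data is served from a persistent warehouse or assembled on the fly from remote sources.

```python
from abc import ABC, abstractmethod

class CustomerInfoService(ABC):
    """Consumers call this interface; the location of the data stays hidden."""

    @abstractmethod
    def get_customer_profile(self, customer_id: str) -> dict: ...

class WarehouseBackedService(CustomerInfoService):
    """One possible provider: reads from a persistent data warehouse table."""

    def __init__(self, connection):
        self.conn = connection   # any DB-API style connection

    def get_customer_profile(self, customer_id: str) -> dict:
        row = self.conn.execute(
            "SELECT name, segment, lifetime_value FROM dim_customer WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        return {"name": row[0], "segment": row[1], "lifetime_value": row[2]}

class FederatedService(CustomerInfoService):
    """Another provider: assembles the same answer from remote services on the fly."""

    def __init__(self, crm_client, billing_client):
        self.crm, self.billing = crm_client, billing_client

    def get_customer_profile(self, customer_id: str) -> dict:
        profile = dict(self.crm.fetch(customer_id))          # e.g., a web service call
        profile["lifetime_value"] = self.billing.total_revenue(customer_id)
        return profile
```

Either provider satisfies the same contract, which is exactly the location transparency described above; the open question, taken up next, is whether the federated variant can deliver warehouse-class performance.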

At the same time, the challenges SOA presents to traditional data warehousing should
not be underestimated. It is the exact architectural opposite of traditional data
warehousing, especially if the latter is a large, centralized, persistent data store. SOA
wants the underlying data to be transparent and location independent. But as powerful as
computers have become, there is still room for doubt about whether they are powerful
enough to perform really big joins on the fly without a performance penalty and without
regard for data movement. This is why the "virtual data warehouse" remains an illusion,
with minimal justification and minimal adoption in the enterprise. In building a complete
architecture - for example, in the form of a computing grid - a tension exists between
being realistic about performance and the need to abstract away from location in order to
do what SOA does best - provide information as a service. In short, the race between
growing volumes of complex data and computing power is expected to continue even as
implementations of SOA for data warehousing go forward.
This is one area where the SOA approach will learn a few lessons from the proponents of
extract, transform and load (ETL) tools. Data transformation is now a service, too. ETL
technology has demonstrated significant ingenuity and innovation in building metadata
adapters, connectors and interfaces to a wide variety of data sources and targets on a
truly dizzying variety of platforms. When this is combined with the concurrent
development of on-the-fly data integration technology able to juxtapose, combine and
compare unstructured and semistructured information interactively, albeit with
constraints, then the stage is set for a breakthrough in squeezing latency out of the
information supply chain and delivering business answers to those who need them in
time to act on the recommendations.
The standalone ETL tool is being upgraded to a data integration hub that includes
ETL-like processing for large batch volumes and message-based integration for
time-sensitive updates. A data integration hub is an ideal point in the data warehousing
architecture to check on (and improve) data quality and rationalize heterogeneous
master data to a conforming paradigm.
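
As a rough illustration of the hub pattern (a toy sketch only; the record type, quality rule and transformation are invented for the example), the same apply logic can serve both a bulk batch load and a single time-sensitive message, with data quality checked before anything reaches the warehouse:

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class CustomerRecord:
    customer_id: str
    name: str
    email: str

def passes_quality_check(rec: CustomerRecord) -> bool:
    """Toy data quality rule: require an id and a plausible email address."""
    return bool(rec.customer_id) and "@" in rec.email

class IntegrationHub:
    """Toy hub with two inbound paths: bulk batch loads and single message updates."""

    def __init__(self) -> None:
        self.warehouse: dict = {}    # stand-in for warehouse tables
        self.rejects: list = []      # quarantined records for data stewards

    def load_batch(self, records: Iterable[CustomerRecord]) -> None:
        """ETL-like path: transform and load a large batch in one pass."""
        for rec in records:
            self._apply(rec)

    def apply_message(self, rec: CustomerRecord) -> None:
        """Message path: apply one update with low latency."""
        self._apply(rec)

    def _apply(self, rec: CustomerRecord) -> None:
        # Quality is checked at the hub, before anything reaches the warehouse.
        if passes_quality_check(rec):
            rec.email = rec.email.strip().lower()    # a simple transformation
            self.warehouse[rec.customer_id] = rec
        else:
            self.rejects.append(rec)
```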
Wherever you have data, you have master data. The care and management of that data
is how the information system comes to represent the market context in which the
business operates. Master data is one of the ways to sett the standard for defining data
and information quality. If the master data is out of line, so is the quality of the
information. The ERP revolution raised the hope of finally consolidating master data
around a single transactional system of record. But these hopes were disappointed as
proliferating instances of ERP applications were supplemented with customer
relationship management (CRM), supply chain management (SCM) and analytic
applications (data marts) corresponding to each. Proliferating silos and data marts were
the result. In short, the single version of the truth and its representation of the system
of record continues to be a point on the horizon toward which our system development
efforts converge, but which we never seem to be able to reach. If it is supposed to be a
master file, then why are there so many of them? We are chasing a moving target. In
the year ahead, the IT function will regroup around master data management,
acknowledge that large data warehouses are common and differentiate itself through the
ability to
perform near real-time and real-time updates. Going forward, the critical path to
enterprise data warehousing will lie through the design and implementation of
consistent and unified representations (masters) of customers, products and whatever
other master data entities are needed to run your business.
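
Here is a minimal sketch of what rationalizing heterogeneous master data can look like in practice (the field names, match key and survivorship rules are invented for illustration): records from two sources are matched on a normalized key and merged into one conformed customer master.

```python
def normalize_name(name: str) -> str:
    """Normalize case and whitespace so 'Acme  Corp ' and 'acme corp' match."""
    return " ".join(name.strip().upper().split())

def conform_customers(erp_rows: list, crm_rows: list) -> dict:
    """Merge ERP and CRM customer records that share a name + postal code match key."""
    golden: dict = {}
    for source, rows in (("erp", erp_rows), ("crm", crm_rows)):
        for row in rows:
            key = (normalize_name(row["name"]), row.get("postal_code", ""))
            record = golden.setdefault(key, {"sources": []})
            record["sources"].append(source)
            record["name"] = normalize_name(row["name"])
            # Survivorship rules: prefer the ERP billing address and the CRM email.
            if source == "erp":
                record["billing_address"] = row.get("address")
            else:
                record["email"] = row.get("email")
    return golden

# Example: the same customer appears in both systems and collapses to one master row.
erp = [{"name": "Acme Corp", "postal_code": "60601", "address": "1 W Wacker Dr"}]
crm = [{"name": "acme corp ", "postal_code": "60601", "email": "buyer@acme.example"}]
print(len(conform_customers(erp, crm)))   # -> 1
```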
A tipping point has been reached, and going forward, you will need only one kind of
database to run both the transactional and BI parts of enterprise systems. There will still
be different instances due to performance requirements to support diverging
transactional and BI workloads, but they will both operate with the same database.
Proprietary systems that operate with special purpose technology stacks and databases
are out. Open systems - including de facto standards such as IBM DB2, Oracle and
Microsoft SQL Server - are in. Open source databases will remain outside the
mainstream due to a lack of features, functions and experience, but will exert a
remorseless flattening influence on the major players through downward pressure on prices.
This is completely consistent with the trend toward data warehousing appliances that
has emerged over the past two years and will continue to gain traction in 2006. Most
firms do not have the in-house expertise to balance computing power, disk I/O and
network capacity in a labor-intensive iterative process of data warehouse system
configuration. Preconfigured data warehousing appliances, predefined quasi-appliances
and balanced configuration systems will gain even more market traction, reaching $2.5
billion within eighteen months (or about 20 percent of the overall DW market). However, the
majority of those dollars will go to large, established, late-arriving major innovators, not
the original upstart, proprietary ones. They will operate with a standard relational
database.
The paradox of the data warehousing appliance - a proprietary and special purpose
solution assembled out of low-cost, commodity components - will ultimately define the
outer boundary of the appliance market as enterprise data marts. There is no reason
why four-way Dell Intel servers should cost three to five times as much when overlaid by
a proprietary parallel database as they do when purchased retail. They will not.
Discounting will reach the point of no return under the pressure of an increasing
coefficient of flatness dictated by open source, commodity infrastructure and
competition defined in such terms. Meanwhile, data marts, no matter how big, rarely
grow up to be data warehouses. The appliance phenomenon will itself be flattened, and
it will merge with and be subsumed by enterprise data warehousing within three years,
but only because the major players will have succeeded in co-opting the technology by
then.
The third paradox is that of thin slicing. One of the keys in a coherent data warehousing
design is deciding on the proper level of granularity. With the collection of point-of-sale
records, individual transactions and now RFID tags, the tendency in BI has been shifting
to finer and finer granularity. The relevant customer, inventory or service processes are
put under an increasingly fine-grained microscope. The idea is that if you have the right
thin slice - the transaction that shows the customer is about to churn - then you can
make the right offer and keep the customer. But all those thin slices add up to a
veritable mountain of data. It is true that the outlier in the data mining algorithm has a
good chance of being a fraudulent claim or other interesting anomaly, but accumulating
all the detailed data to find the trend against which the outlier is an outlier results in an
explosion of data points and volume.
The paradox of thin slicing is that it leads to an explosion of data. The veteran
salesperson knows immediately whether the prospect will buy or is lying about his
intentions and gains enormous bandwidth by not wasting time on those who will not.
But when an ordinary analyst tries to reverse engineer the veteran's method, an
explosion of data results. The devil is in the details, and the details are numerous. The
flicker of contempt in the client's expression shows the relationship with the bank (or
the mobile phone company) is in trouble, but to get at that expression you have to code
every millisecond in a 10-minute transcript, and that is 600,000 data points. The blink of
an eye is indeed a short piece of data, and you only need one to make the proper
inference. But how do you know which one? It turns out that there are a lot of blinks.
The advantage is to the one who first develops the smart methods in predictive
analytics to identify the right blink. With large data warehouses of clean, consistent,
rationalized data becoming increasingly common, the competitive advantage shifts to
those firms able to mine that data for predictive analytics about customers, product
demand and market dynamics.
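
As a toy illustration of the point (the scores and threshold are invented; real predictive models are far more involved), finding "the right blink" amounts to letting an algorithm score every fine-grained observation and surface only the few that stand out:

```python
import statistics

def find_the_right_blink(frame_scores: list, threshold: float = 3.0) -> list:
    """Flag frames whose score deviates sharply from the baseline, so an analyst
    inspects a handful of moments instead of all 600,000 of them."""
    mean = statistics.fmean(frame_scores)
    stdev = statistics.pstdev(frame_scores) or 1.0   # guard against zero spread
    return [i for i, s in enumerate(frame_scores) if abs(s - mean) / stdev > threshold]

# A coded 10-minute interaction at one score per millisecond: 600,000 data points,
# of which exactly one brief flicker matters.
scores = [0.1] * 600_000
scores[123_456] = 5.0
print(find_the_right_blink(scores))   # -> [123456]
```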
Regardless of filtering or predictive functions, a key challenge of data warehousing is to
get the data out in a timely way. Many enterprises have demonstrated the ability to
build really big data warehouses - to get the data in. Super large, multiterabyte data
warehouses are now common. More of a challenge is to update this information and get
access to it in a low latency, on-time way - to get the data out. This happens in a
conforming and performing way much less frequently than the press and vendor hype
might suggest. If you had spent 10 million dollars on a proprietary system over the past
five years that was underperforming, would you want to see it written up in the press
against your name? Of course not.
Better to build another data mart. Or is it? It's a bit of a dirty little secret that the result
of this failure to master latency on the part of proprietary databases is the proliferation
of data marts around some of the supposedly centralized, high-performance data
warehouses. Going forward, the advantage - and a key differentiator among data
warehousing competitors - will be to those enterprises that are able to perform
sustained real-time update of the data warehouse, gaining access to low latency data in
a timely way. An obvious corollary of this principle will be the value of (and trend to)
data mart consolidation.
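
The mechanics can be as simple as a micro-batch loop (a conceptual sketch only; the queue here stands in for whatever change capture feed an enterprise actually has): small sets of changes are applied continuously rather than held for a nightly load window.

```python
import queue
import time

def near_real_time_update(changes: queue.Queue, warehouse: dict,
                          pause_seconds: float = 1.0, cycles: int = 3) -> None:
    """Drain whatever changes have arrived, apply them to the target, pause, repeat."""
    for _ in range(cycles):
        batch = []
        while not changes.empty():
            batch.append(changes.get())
        for change in batch:                         # apply each change to the warehouse
            warehouse[change["customer_id"]] = change
        time.sleep(pause_seconds)

# Example: two updates arrive in flight and become visible within seconds, not overnight.
inbound: queue.Queue = queue.Queue()
inbound.put({"customer_id": "C1", "status": "churn risk"})
inbound.put({"customer_id": "C2", "status": "active"})
target: dict = {}
near_real_time_update(inbound, target)
print(target["C1"]["status"])                        # -> churn risk
```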
One trend that will be strictly limited in its traction deserves honorable mention. The
much heralded convergence of structured data and unstructured content will continue
to hang fire (not happen) due to immaturity of the technology, applications and
business case. Until XML is driven into the database and becomes as easy to use and
ubiquitous as SQL, managing the content for business intelligence advantage will be a
nonstarter. Metadata is making progress in enabling intelligent information integration,
but there is still a long way to go to render semantics sufficiently transparent to scale up
to hundreds of systems.
No forecast is complete without commenting on the next high concept, the grid. The
grid will make plodding progress in distributed industries with islands of automation
over a course of several years. One of the industries is health care. Why health care?
The use case is that it requires a highly distributed, heterogeneous architecture and
presents a compelling business scenario. The emergence of a health care computing
grid in the U.S. is a possibility over the next three to five years. A looming catalyst is
the formulation, by major employers such as the federal government, the companies
comprising the Technology CEO Council and other major stakeholders, of an employee
medical record as a single version of the truth about individual physical well-being.
The next requirement will be to put this on a virtual private health care network - a grid,
if you will - that enables shared computing resources and communications to reduce
medical errors, duplicate clinical testing, inconsistent diagnoses and redundant storage
of the same information.
Such a health care provider computing grid puts this discussion right back where it
started, using commodity components to flatten the inefficiencies in the information
supply chain between participants in the digital economy. Suffice it to say that grid
computing is different from linking clusters of servers, though that is part of it, and
relies on still-emerging standards to manage platform and computing heterogeneity
along with advanced scheduling, workload management, security and fault tolerance.
Much more work than can be accomplished in this short article will be needed before
this computing grand challenge is engaged and tamed.
For those still sufficiently challenged by mundane, large-scale commercial computing,
the top issue will be to use data warehousing systems to optimize operational
(transactional) ones. You know the forecast; now source it. You know the top customer
issues; now craft a promotion and communicate it in a timely way to take advantage of
the narrow window for action. Top companies are doing this today - but not many.
Everyone is working hard, but most are still not working smart, taking advantage of the
breakthroughs in processing power and software to design businesses and business
processes that are as agile and responsive as the demands coming at them.
Innovations in business processes as well as data warehousing will enable enterprises to
connect the dots between the two realms. On the business side, sales and marketing
will connect the dots between the business question, "Which customers are leaving and
why?" and the BI available from the data warehouse. Finance will connect the dots
between the question, "Which clients, products and categories are the profit winners
and which are the profit losers?" and the consistent, unified view of customer and
product master data in the warehouse. Operations will connect the dots between
questions about supplier and procurement efficiency, stock outages, capital risks and
reserves, dynamic pricing and the aggregations of transactional data in the warehouse.
The result will be an even smarter enterprise working smarter.
