Professional Documents
Culture Documents
2016
By Philip Russom
Sponsored by:
NOVEMBER 2016
T DW I CHECK L IS T RE P OR T
By Philip Russom
TABLE OF CONTENTS
2 F OREWORD
2 NUMBER ONE
Deploy a modern data warehouse to integrate and
leverage traditional enterprise data and new big data
3 NUMBER TWO
Make sense of multistructured data for new and
unique business insights
4 NUMBER THREE
Implement advanced forms of analytics to enable
discovery analytics for big data
5 NUMBER FOUR
Empower the business to operate in near real time
by delivering data faster
6 NUMBER FIVE
Integrate multiple platforms into a unified data
warehouse architecture
7 NUMBER SIX
Include cloud-based platforms as you diversify your
modern data warehouse environment
7 NUMBER SEVEN
Demand high performance and scalability of all
components of a modern data warehouse
8 ABOUT OUR SPONSOR
9 ABOUT THE AUTHOR
9 ABOUT TDWI RESEARCH
9 ABOUT TDWI CHECKLIST REPORTS
Note: This report is an update of TDWI Checklist Report: The Modern Data Warehouse,
published in 2013.
2016 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or
in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org.
Product and company names mentioned herein may be trademarks and/or registered trademarks of
their respective companies.
FOREWORD
In recent surveys by TDWI Research, roughly half of respondents report
that they will replace their primary data warehouse (DW) platform
and/or analytics tools within three years. Ripping out and replacing a
DW or analytics platform is expensive for IT budgets and intrusive for
business users. This raises the question: What circumstances would
lead so many people down such a dramatic path?
The answer is that many organizations need a more modern DW
platform to address a number of new and future business and
technology requirements. Most of the new requirements relate to big
data and advanced analytics, so the data warehouse of the future
must support these in multiple ways. For example, a leading goal
of the modern data warehouse is to enable more and bigger data
management solutions and analytics applications, which in turn
help the organization automate more business processes, operate
closer to real time, and learn valuable new facts about business
operations, customers, and products through analytics.
User organizations facing new and future requirements for big data,
analytics, and real-time operation need to start planning today
for the data warehouse of the future. Weve identified seven key
recommendations to guide your selection of vendor products and
solution design:
1. D
eploy a modern data warehouse to integrate and leverage
traditional enterprise data and new big data
2. M
ake sense of multistructured data for new and unique
business insights
3. I mplement advanced forms of analytics to enable discovery
analytics for big data
4. E mpower the business to operate in near real time by delivering
data faster
5. I ntegrate multiple platforms into a unified data warehouse
architecture
6. I nclude cloud-based platforms as you diversify your modern
data warehouse environment to embrace on-premises, cloud,
and hybrid deployments
7. D emand high performance and scalability of all components of
a data warehouse
This TDWI Checklist Report drills into each of the seven
recommendations, listing and discussing many of the new vendor
and open source product types, functionality, and user best
practices that will be common in the near future, along with the
business case and technology strengths of each.
NUMBER ONE
Enabling Technologies
No matter where you start, the long-term trend in big data analytics
is to collect and analyze all data regardless of size, type, model,
source, or vintage. Organizations that have matured into managing
tdwi.org
NUMBER TWO
and analyzing such a wide range of data typically do it with two types
of platforms: relational database management systems (RDBMSs) and
big data tools such as Hadoop, Spark, and other data lake platforms
including those in the cloud.
On the one hand, RDBMSs and SQL are more critical to DW and
analytics than ever. On the other hand, much of the valuable big data
thats emerging is not relational or even structured. Thats where a
big data platform such as Hadoop Distributed File System (HDFS)
complements an RDBMS by managing a massive and scalable data
lake, operational data store (ODS), and/or data staging area that feeds
an RDBMS. Furthermore, when combined with big data processing
engines such as MapReduce, Hive, Spark, and HBase or cloud-based
data lake platforms, a data lake can be a powerful platform for nonSQL algorithmic analytics as well as large-scale reporting or ETL
functions.
The strengths of Hadoop and other data lake platforms lie in areas
where most DWs are weak, such as unstructured data, very large data
sets, non-SQL algorithmic analytics, and handling file-based data.
Conversely, the limitations of data lake platforms are mostly met by
mature functionality available today from a wide range of RDBMS
types, including OLTP and columnar databases, DW appliances, and
new cloud-based warehouses.
In that light, Hadoop and other data lake platforms are complementary
to RDBMS-based DWs despite a bit of overlap. More to the point, the
two together empower user organizations to perform a wider range of
analytics with a wider range of data types, unprecedented scale, and
favorable economics. For these reasons, most modern DWs will be built
on a foundation consisting of both relational and data lake or big data
technologies in the near future.
In addition to very large data sets, big data can also be an eclectic
mix of structured (relational data), unstructured (human language
text), semistructured (RFID, XML), and streaming data (from machines,
sensors, Web applications, and social media). The term multistructured
data refers to data sets or environments that include a mix of these
types and structures.
Multistructured big data types have compelling value
propositions. Human language text drawn from your website, call
center application, and social media can be processed by tools for text
mining or analytics to create a sentiment analysis, which gives sales
and marketing valuable insights into what customers think of your firm
and its products or services. Organizations with an active supply chain
can analyze semistructured data exchanged among partners (in XML,
JSON, RFID, or CSV formats, for example) to understand which are the
most profitable and which supplies are of the highest quality.
Enabling Technologies
Hadoop and other big data or data lake platforms empower
organizations to finally crack the nut of multistructured data.
For years, most business intelligence (BI) and DW professionals
have known theres business value in processing multistructured
data, but few have done anything of consequence due to a lack of
credible support in data platforms. The situation is improving now
that big data technologies have proved their ability with processing
multistructured data. As a data-type-agnostic file system, HDFS
manages the full range of file-based data thats structured,
unstructured, semistructured, or a mix of these. The introduction of
Hadoop and other big data platforms into the modern DW provides
low-cost and scalable mechanisms for capturing, managing, and
analyzing diverse multistructured data for business value.
A data lake enables analytics with multistructured data. The
data lake is an emerging design pattern that organizes complex,
high-volume data collections by integrating and managing data in
its original raw state. Data analysts/scientists can then repurpose
detailed source data repeatedly at scale as new requirements arise for
new analytics applications.
Natural language processing (NLP) is key to getting business
value from multistructured data. NLP takes many forms, including
tools or services for text mining, text analytics, sentiment analysis,
fact clustering, and search.
tdwi.org
NUMBER THREE
Enabling Technologies
RDBMS optimized for SQL-based analytics. TDWI surveys show that
SQL-based analytics is the most common form of advanced analytics,
more popular than analytics based on data mining, statistics, artificial
intelligence, or NLP. This is natural because most DW professionals
know SQL well and have many BI and analytics tools that are
compatible with standard SQL.
In SQL-based analytics, a data analyst/scientist or similar user starts
with ad hoc queries and builds them up through numerous iterations
until the resulting data set yields the desired epiphany. With each
iteration, the SQL code gets longer and more complex, which is why
SQL-based analytics is best done with tools and RDBMSs built and
optimized for highly complex SQL.
RDBMS integrated with a big data platform. SQL-based analytics
assumes relational technology even when applied to data managed in
Hadoop or Spark. This is another reason the modern DW is built with
both relational and big data technologies. RDBMSs and relational tools
provide SQL support, productive development tools, and optimization
for standard SQL thats far more high performance, feature rich, and
user friendly than whats available from big data technologies such
as Spark, MapReduce, Hive, and cloud-based analytics solutions.
Additionally, Hadoop and other big data platforms provide a massive
data store with support for multistructured data types.
Big data platforms for a wide range of advanced analytics. SQL
aside, a growing number of analytics tools based on mining, statistics,
graph, predictive methods, and NLP are available as modules that run
against a data lake or other big data repository, usually through layers
above storage such as MapReduce, Hive, Spark, HBase, and cloudbased analytics solutions. These algorithmic approaches complement
SQLs set-based approach.
Self-service for end users. Accessing big data from data lakes
through Excel (if you have the proper interface) provides the ease
of use that empowers a wide range of users to do their own data
exploration, discovery analytics, and data visualization.
tdwi.org
NUMBER FOUR
Sometimes the term real time literally means that data is fetched,
processed, and delivered in seconds or milliseconds after the data is
created or altered. However, most so-called real-time data operations
take minutes or hours and are more aptly called near real time.
Thats fine because near real time is an improvement over the usual
24-hour data-refresh cycle, and few business processes can react in
seconds anyway.
Business managers need near-real-time reports and analyses for timesensitive business processes. This is already the case with established
BI practices, such as operational BI and metrics-driven performance
management. Both utilize fresh data fetched and presented in reports
and dashboards to end users in real time or near to it. This enables
managers to make tactical and operational decisions based on the
freshest information.
The practices of operational BI and performance management have
brought operational reports from overnight refreshes to near-realtime refreshes multiple times during the business day. Managers
now need analytics to likewise evolve from its current offline latency
toward real-time analytics, sometimes called operational analytics.
A few organizations need to capture streaming big data and analyze
the stream in real time or close to it. Examples include applications
for financial trading systems, business activity monitoring, utility
grid monitoring, e-commerce product recommendations, and facility
monitoring and surveillance.
Enabling Technologies
The modern data warehouse must support a number of capabilities
that fetch, process, and deliver data in real time or close to it.
SQL queries that return results in seconds. One of the reasons that
refreshing management dashboards in near real time has become a
technical reality is that recent versions of RDBMSs can execute report
queries in subsecond time frames. For the same performance with
the complex queries of SQL-based analytics, look for RDBMSs that are
purpose-built for DW and analytics.
Early ingestion into a data lake. Most data lakes are optimized for
early ingestion. Thats where incoming source data is landed as soon
as possible in the lake, with little or no improvement beforehand,
so that data is available immediately for exploration, reporting, and
analytics. This is not true real time, but its a near-time best practice
that suffices for business processes that benefit from data thats a
few hours old, such as business monitoring, data exploration, and
fuzzy metrics for performance management.
tdwi.org
NUMBER FIVE
Enabling Technologies
Traditional Data
CRM, SFA, ERP
Financials, billing, operations,
call center, supply chain mgt
Ingestion
Hadoop
Data
Landing
Massive Store of
Raw Source Data
(Data Lake)
Data
Staging
ETL/ELT
Set-based
Analytics
Stream Capture
Algorithmic
Analytics
Event
Processing
Analytic
Archive
New Data
Machine Data, Mobile Data,
Web Data, Social Media
Core Warehouse
Dimensions, cubes, subject
areas, time series, metrics...
Data for reports, dashboards
Specialized DBMSs
Based on clouds, appliances,
columns, graph, sandboxes,
and other specialized analytics
DIVERSE PLATFORMS: Web, Client/Server, Clusters, Racks, Grids, Clouds, Hybrid Combinations
tdwi.org
NUMBER SIX
Enabling Technologies
For scalability on demand, consider an elastic cloud. Many DW
platforms, Hadoop clusters, and analytics tools are now available
on public clouds in a platform-as-a-service licensing model. Startup costs are low and time-to-use is short with a cloud-based
DW or data lake solution (and both tend to be highly available).
Nevertheless, the leading benefit is that a truly elastic cloud
automatically provides and recovers server resources for optimal
speed and scale as demanding data processing and analytics
workloads ramp up and subside. For the data professional, this
greatly minimizes capacity planning and performance tweaks.
Demand DW platforms that are optimized specifically for
clouds. Enterprise software from many vendors has been ported
to clouds in recent years. Ports vary in terms of how the software
leverages a cloudor doesnt. For the best performance, scale,
and functionality, demand cloud-based DW platforms that have
a deep port, defined as full support for automated elasticity and
all on-premises platform functions available on clouds (not just a
subset). Even better, look into the new relational DW platforms or
data lake solutions built specifically for clouds.
NUMBER SEVEN
Enabling Technologies
A massively parallel processing (MPP) computing architecture can
give added scale-out capabilities that symmetrical multiprocessing
(SMP) does not. The 2013 TDWI big data management survey showed
(for the first time in a survey) that more users manage big data with
an MPP RDBMS than with an SMP. Although SMP is still the preferred
architecture for operational and transactional applications, MPP has
speed and scalability advantages for the large data set operations
typical of big data analytics and data warehousing in general.
(Continues)
tdwi.org
(Continued)
Hadoop and Spark have a strong track record for scaling to
massive data volumes. Hadoop and Spark clusters are known to
scale out to hundreds of nodes that scale up to handle hundreds
of terabytes of file-based data, all at a price thats a fraction of
doing the same with an RDBMS. Additionally, cloud-based data lake
solutions are now available, purpose-built to take advantage of the
elasticity of the cloud with very large amounts of data.
Look for platforms that need little or no query optimization or
data remodeling. Given the high performance of modern relational
and NoSQL platforms, there is little need now to remodel data or
optimize queries for the sake of performance. Therefore, data can be
left in its original schema with all the rich details of source data, as
seen in emerging practices for the data lake. Performance is so good
that many of the transformations previously performed a priori in ETL
are now executed ad hoc in SQL-based queries and equivalent runtime
processing (such as Java code running in MapReduce atop HDFS).
With efforts for query optimization and data modelling greatly
reduced within the modern DW, data management personnel are
offloaded so they can apply more of their time to designing and
building more new solutions. Likewise, analytics personnel and
business users are empowered to explore new big data and conduct
analyses at an unprecedented level of self-service.
www.microsoft.com
Microsoft is one of the leading providers in delivering the modern
data warehouse of the future, and Microsoft has been recognized by
third-party analysts such as Gartner, Forrester, and TDWI as a leader
in the data warehousing and big data space. Beyond the ubiquitous
SQL server software technologies that most organizations leverage,
Microsoft has added many enabling technologies both on premises
and in the cloud to modernize the traditional, relational data
warehouse to become the modern data warehouse. This includes
RDBMS technologies such as Azure SQL Data Warehouse, a scaleout data warehouse solution that independently scales compute and
storage that can be provisioned in three to five minutes as well as
the Analytics Platform System, a turnkey data warehouse appliance
for high performance and scale on premises. It also includes big
data technologies such as Azure HDInsight, Microsofts managed
Hadoop and Spark cluster solution, and Azure Data Lake, which
includes a serverless large-scale data processing system that
enhances the productivity of developers to write big data jobs with
an unlimited cloud analytics store.
To find out more, go to:
www.microsoft.com/en-us/cloud-platform/data-warehouse-big-data
tdwi.org
tdwi.org