
G00227024

Top 10 Technology Trends Impacting Information Infrastructure, 2012


Published: 23 December 2011
Analyst(s): Regina Casonato, Andrew White, Mark A. Beyer, Merv Adrian, Ted Friedman, Mike Blechar, Whit Andrews

As evidenced in Gartner's 2011 Hype Cycles, significant innovation continues in the field of information management (IM) technologies. The key factors driving this innovation are the explosion in the volume, velocity and variety of information, and the huge amount of value and potential liability locked inside all this ungoverned information.

But the growth in information volume, velocity, variety and complexity, and the new information use cases, make IM far more difficult than it has been in the past. In addition to the new internal and external sources of information, practically all information assets must be available for delivery through varied, multiple, concurrent and, in a growing number of instances, real-time channels and mobile devices. All this demands that information be shareable and reusable across multiple delivery contexts and use cases.

The 10 technology trends analyzed in this document can play different roles in modernizing IM. Some help structure information, some help organize and mine information, some integrate and share information across applications and repositories, some store information, and some govern information.

Key Findings

Some IM technologies will enable enterprises to reduce risks, improve quality and avoid certain costs. Others will transform how they use information as an asset. The top 10 technology trends identified here will help enterprises cope with vastly increased volumes of information, as well as its increased velocity and variety and the proliferation of use cases for information. Some address the diversity of information, some tackle the volume of information, and some help manage information as a strategic asset and improve pattern recognition and discovery. However, modernizing current information infrastructure and management capabilities requires more than just technology: it requires the right IT, people, roles and processes. Use of some of these technologies requires new skills, as well as dedicated analysts to interpret the results.

Recommendations

Determine, for your current IM projects and those planned in the short term, whether the main emerging trends in technology have been factored into your strategic planning process. Identify where these 10 technology trends are in Gartner's 2011 Hype Cycles for master data management (MDM), content management, data management and enterprise information management (EIM). Anticipate their impact and plan adoption accordingly. Some of these technologies are in an early stage of enterprise adoption and come with a risk of failure.

Table of Contents
Analysis
    Overview
    Organize Information
    Describe Information
    Integrate Information
    Share Information
    Integrate/Store Information
    Govern Information
Recommended Reading

List of Tables
Table 1. Top 10 Technology Trends Likely to Impact Information Management in 2012

List of Figures
Figure 1. Common Features and Use Cases for Application Integration and Data Integration

Analysis
Overview
This document updates Gartner's analysis of the top 10 technology trends in IM (see Table 1). We view these technologies as strategic because they directly address some of the challenges faced by enterprises, as set out in "Information Management in the 21st Century." This document should be read by CIOs, information architects, information managers, enterprise architects and anyone working in a team evaluating or implementing an enterprise content management system, a business intelligence (BI) application, an MDM technology, a document-centric collaboration initiative, an e-discovery effort, an information access project or a data warehouse (DW) system.
Table 1. Top 10 Technology Trends Likely to Impact Information Management in 2012

Organize: Content Analytics
Describe: Enterprise Taxonomy and Ontology Management; Big Data and Extreme Information Processing and Management
Integrate: Enterprisewide Metadata Repositories; Data/Application Integration Convergence
Share: Information Semantic Services; Logical Data Warehouse
Integrate/Store: Hadoop MapReduce Implementations
Govern: Multiple Domain Master Data Management; Data Stewardship Applications

Source: Gartner (December 2011)

Organize Information
Content Analytics

Analysis by Whit Andrews

Content analytics describes a family of technologies that process content, and the behavior of users consuming content, to derive answers to specific questions. Content types include text of all kinds, such as documents, blogs, news sites, customer conversations (both audio and text) and social network discussions. Analytic approaches include text analytics, rich media and speech analytics, and behavioral analytics.

Applications of the technology are broad; it can serve to aid in sentiment analysis, reputation management, trend analysis, affinity, recommendations, face recognition, industry-focused analytics (such as voice of the customer) to analyze call center data, fraud detection for insurance companies, crime detection to support law enforcement activities, competitive intelligence, and understanding consumers' reactions to a new product.

Multicomponent functions are constructed by serializing simpler functions: the output of one analysis is passed as input to the next. Because virtually all content analytics applications use proprietary APIs to integrate functions, today there is no easy way to construct analyses from applications created by different vendors. In the future, the Unstructured Information Management Architecture (UIMA), governed by the Organization for the Advancement of Structured Information Standards (OASIS), may serve this purpose. Such a standard for unstructured data would serve a similar purpose to Structured Query Language (SQL) for structured data.

Due to the explosion of social networking analyses, particularly sentiment analysis, use of both general- and special-purpose content analytics applications continues to grow, as stand-alone applications, as extensions to search and content management applications, and as more specifically targeted business applications. The greatest growth comes from generally available content resources, such as contact center records and postsale service accounts, leading to uptake, especially in CRM. Also, open-source intelligence is seeking to use content analytics for more effective understanding of public and semipublic sentiment.
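To illustrate the serialization of simpler analytic functions described above, the sketch below chains a language-detection step, a sentiment-scoring step and an aggregation step in plain Python. The function names and the keyword-based scoring are hypothetical stand-ins for illustration only, not any vendor's API; as noted above, real content analytics products expose proprietary interfaces.

```python
from collections import Counter

# Hypothetical, simplified analytic steps; each consumes the output of the previous one.

def detect_language(record):
    """Tag each record with a (crudely) detected language."""
    record["lang"] = "en" if all(ord(c) < 128 for c in record["text"]) else "other"
    return record

def score_sentiment(record):
    """Attach a naive keyword sentiment score, using the language tag from the previous step."""
    if record.get("lang") != "en":
        record["sentiment"] = 0
        return record
    positive, negative = {"great", "love", "fast"}, {"broken", "slow", "refund"}
    words = [w.strip(".,!?") for w in record["text"].lower().split()]
    record["sentiment"] = sum(w in positive for w in words) - sum(w in negative for w in words)
    return record

def aggregate(records):
    """Summarize sentiment per product mentioned in the stream."""
    totals = Counter()
    for r in records:
        totals[r["product"]] += r.get("sentiment", 0)
    return totals

def pipeline(records, steps):
    """Serialize per-record analytic functions: the output of one is the input to the next."""
    for step in steps:
        records = [step(r) for r in records]
    return records

if __name__ == "__main__":
    feed = [
        {"product": "X1", "text": "Love the X1, great battery"},
        {"product": "X1", "text": "X1 arrived broken, want a refund"},
    ]
    enriched = pipeline(feed, [detect_language, score_sentiment])
    print(aggregate(enriched))   # Counter({'X1': 0})
```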

Recommendations:

- Enterprises should understand the broad range of functions that content analytics can be used for. It can identify high-priority clients, product problems, and customer sentiment and service problems; analyze competitors' activities and consumers' responses to a new product; support security and law enforcement operations by analyzing photographs; and detect fraud by analyzing complex behavioral patterns. Increasingly, it replaces difficult and time-consuming human analyses with automation, often making previously impossible tasks tractable. Complex results are often represented as visualizations (possibly graphical but also textual), making them easier for people to understand.
- Enterprises should employ content analytics to replace or augment time-consuming and complex human analyses.
- Firms should identify the analytics that are most able to simplify and demystify complex business processes.
- Users should identify vendors with specific products that meet their requirements and should review case studies to understand how other users have exploited these technologies.

Describe Information
Enterprise Taxonomy and Ontology Management

Analysis by Mark Beyer

At its core, the ontological method seeks to determine a universal commonality (think: a single representation to start with) that permits the sharing of concepts across semantic barriers (for example, can many people look at something from different viewpoints and realize they are looking at the same thing?). Taxonomic design is the process of defining categories and the relationships between them, that is, of determining what features make one thing different from another. In the relational world, these categories and subcategories are expressed as entities and attributes. The techniques for ontology and taxonomy management are still at an early stage of development, and evolving business needs continue to shape the features and functionality of tools.


When analyzing multiple taxonomies in relation to one another, it is important to recognize that multiple metadata descriptions can exist simultaneously for each information asset. In other words, different people and systems attach seemingly different meanings to the same entity. The various metadata sources include business process modeling, federation, extraction, transformation and loading, metadata repository technologies and many others. When rationalizing more than one taxonomy, it is useful to remember that taxonomy design and ontological development occur simultaneously and implicitly.

Data integration, at its core, is the process of rationalizing one taxonomy and its incumbent ontology to a second, disparate taxonomy with a similar ontology (the ontology of each must have some similarity, or else there is little reason to put them together). In other words, it is the reconciliation of meaning: the taxonomies have underlying commonalities that create shared meaning. So to answer the question "Should we reconcile these taxonomies?," ask whether they are describing more or less the same thing. Reconciling two taxonomies to each other is "point to point" integration. An alternative to point-to-point data integration is to take two or more data taxonomies and create a series of rationalization rules (coded as transformations) that move them to a neutrally designed taxonomy. To do so, there must be some unifying set of principles that brings the datasets together: these are the ontology rules of the new, neutrally designed taxonomy. The process of resolving multiple ontologies and their explicit or implicit rules with multiple taxonomy definitions holds enormous potential for organizations, in that this resolution can enable them to leverage data from a much wider array of sources: opening up the possibilities of cloud-based data, enabling rapid integration of new supplier and partner data, and leveraging market data.

Gartner sees a growing need for these capabilities, based on the following evidence. Throughout 2011, the business demand for taxonomy and ontology management began to emerge from its embryonic state. Organizations have realized that sharing information requires ontological continuity. Continuity is not the same as consistency: continuity merely requires that I can follow a chain of reasoning from point A through points B, C and D and arrive at point E. In a business requirement, this means visibility into an audit trail that tracks information as it is rationalized from one representation to another. In other words, "Why are we saying that corporations are people? And why are hours of the day considered an asset again?"

Also in 2011, many organizations began to pursue information taxonomy management through the use of canonical models. Canonical models have significant benefit, but they represent a declared style of ontology/taxonomy pairing. With the rise of NoSQL deployments, social media data and the huge influx of operational technology (machine data), it is more important than ever for management practices to recognize that an information concept can belong to more than one taxonomy/ontology pair. Focusing on a single model will not yield good results if enforcement is the goal. If, however, the objective is learning, and the goal is testing and retesting architectures, then using a canonical model as a learning tool will provide valuable and insightful learning experiences.
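A minimal sketch of the neutral-taxonomy approach described above: two source taxonomies (here, hypothetical product category schemes from an ERP system and a webshop) are mapped through rationalization rules into a neutrally designed taxonomy, rather than point to point. The category names, source labels and rules are invented for illustration.

```python
# Hypothetical rationalization rules: each maps a source taxonomy term
# to a term in a neutrally designed taxonomy shared by both systems.
RULES = {
    "erp": {"NOTEBOOK": "laptop", "CELL": "mobile_phone", "ACCY": "accessory"},
    "web": {"Laptops & Ultrabooks": "laptop", "Phones": "mobile_phone",
            "Cables": "accessory"},
}

def rationalize(source, term):
    """Transform a source-specific category into the neutral taxonomy.

    A KeyError here signals that the taxonomies lack the ontological
    commonality needed to integrate them.
    """
    return RULES[source][term]

def integrate(records):
    """Re-key records from disparate taxonomies under the neutral categories."""
    merged = {}
    for source, term, value in records:
        merged.setdefault(rationalize(source, term), []).append(value)
    return merged

if __name__ == "__main__":
    rows = [("erp", "NOTEBOOK", "SKU-1"),
            ("web", "Laptops & Ultrabooks", "SKU-9"),
            ("web", "Phones", "SKU-4")]
    print(integrate(rows))
    # {'laptop': ['SKU-1', 'SKU-9'], 'mobile_phone': ['SKU-4']}
```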
Recommendations: Information architects vexed by taxonomy and ontology issues should:


- Identify information-related semantic tools that resolve differing taxonomy and ontology pairs. These include federation tools with advanced modeling capabilities and domain analytics, which populate extensible metadata repositories.
- Initiate text mining and analytics with a practice to align relational data and content analysis metadata.
- Acclimatize designated business personnel to their role in creating information assets and to the importance of metadata management as a precursor to introducing enterprise metadata taxonomy and ontology management practices.
- Utilize industry-related canonical models as candidates for leading this discussion.

Big Data and Extreme Information Processing and Management

Analysis by Merv Adrian, Mark Beyer

According to Gartner research, information quantification includes issues arising from the volume and wide variety of information assets and the velocity of their arrival (which covers both rapid record creation and highly variable rates of data creation, including suddenly lower velocity, which forces the issue of scaling "down"). Within each variety of data types there are significant variations and standards, creating additional complexity, including complexity in the processing of these various resources. "Big data" is the term adopted by the market to describe extreme information management and processing issues that exceed the capability of traditional information technology, along one or multiple dimensions, to support the use of the information assets.

Throughout 2011 and into 2012, discussion about "big data" has focused primarily on the quantification issues associated with the extremely large datasets generated from technology practices such as social media, operational technology, Internet logging and streaming sources. A wide array of hardware and software solutions has emerged to address these issues. At this point, "big data" or, more broadly, extreme information processing and management, has become a disruptive transformation that presents new business opportunities. The larger context of "big data" challenges existing practices of selecting which data to integrate with the proposition that all information can be integrated, and that technology should be developed to support this.

This breaching of traditional boundaries will occur extremely fast, because the many sources of new information assets are increasing geometrically (for example, desktops became notebooks and now tablets; portable data is everywhere and in multiple context formats), which is causing exponential increases in data volumes. Additionally, the information assets span the entire spectrum of the information content continuum, from fully undetermined structure ("content") to fully documented and traditionally accessed structures ("structured"). As a result, organizations will seek to address the full spectrum of extreme information management issues, and will seek this as differentiation from their competitors, so they can become leaders in their markets in the next two to five years. "Big data" is thus a current issue (focused on volume, and including the velocity and variety of data or, together, V3) that highlights a much larger extreme information management topic demanding almost immediate solutions.


Vendors are almost universally claiming that they have a big data strategy or solution. However, Gartner clients have made it clear that big data must include large volumes processed in streams and batch (not just MapReduce), plus an extensible services framework that can deploy processing to the data or bring data to the process, and that spans more than one variety of asset type (for example, not just tabular, or just streams, or just text). Partial solutions are acceptable, but they should be evaluated for what they do, not for the additional claims. Gartner clients are making a very large number of inquiries into this topic, which is evidence of growing hype in the market. Importantly, the different aspects and types of "big data" have been around for more than a decade; it is only recent market hype around legitimate new techniques and solutions that has created this heightened demand.

Enterprises should understand the many "big data" use cases. "Big data," and addressing all the extreme aspects of 21st-century information management, permits greater analysis of all available data, detecting even the smallest details of the information corpus, a precursor to effective Pattern-Based Strategies and the new type of applications they enable. "Big data" has many use cases. In the case of complex event processing, queries are complex with many different feeds, the volume may or may not be high, the velocity will vary from high to low, and so on. Volume analytics can leverage technology such as the Apache Hadoop project, for example. In addition to MapReduce approaches that access data in external Hadoop Distributed File System (HDFS) files, BI use cases can utilize MapReduce in-database (for example, Aster Data and Greenplum), as a service call managed by the database management system (IBM InfoSphere BigInsights, for example), or externally through third-party software (such as Apache Hadoop distributions from Cloudera or MapR).

Recommendations:

- Identify a volume analytics use case in which to implement a pilot effort for distributed processing, such as MapReduce.
- Enterprises already using portals as a business delivery channel should leverage the opportunity to combine geospatial, demographic, economic and engagement preference data in analyzing their operations, and/or to leverage this type of data in developing new process models. For example, supply chain situations include location tracking through route and time, which can be combined with business process tracking. Life sciences generate enormous volumes of data in clinical trials, genomic research and environmental analysis as contributing factors to health conditions.
- CIOs and IT leaders should investigate the opportunities and understand the challenges of extreme information management. Gartner estimates that organizations that have introduced the full spectrum of extreme information management issues into their information management strategies by 2015 will begin to outperform their unprepared competitors within their industry sectors by 20% in every available financial metric.

Integrate Information
Enterprisewide Metadata Repositories

Analysis by Michael Blechar


In "Gartner Clarifies the Definition of Metadata," Gartner defines metadata as "information that describes various facets of an information asset to improve its usability throughout its life cycle." Generally speaking, the more valuable the information asset, the more critical managing the metadata about it becomes, because the contextual definition of metadata provides understanding that unlocks the value of data. Examples of metadata are abstracted levels of information about the characteristics of an information asset, such as its name, location, perceived importance, quality or value to the organization, as well as its relationship to other information assets. Metadata can be stored as artifacts in "metadata repositories" in the form of digital data about information assets that the enterprise wants to manage. Metadata repositories are used to document and manage metadata (in terms of governance, compliance, security and collaborative sharing), and to perform analysis (such as change impact analysis and gap analysis) using the metadata. Repositories can also be used to publish reusable assets (such as application and data services) and browse metadata during life cycle activities (design, testing, release management and so on). As a result, metadata management and repositories are a key component of the implementation of the Gartner Information Capabilities Framework (see "Defining the Scope of Metadata Management for the Information Capabilities Framework" for more details). In "The Eight Common Sources of Metadata" we have explored a range of solutions to meet "enterprisewide" metadata management needs. These include several categories of metadata repositories, such as those used in support of tool suites (tool suite repositories), project-level initiatives and programs (community-based repositories), and those used to federate and consolidate metadata from multiple sources (enterprise repositories) to manage metadata in a more "enterprisewide" fashion. Most Global 1000 enterprises own one or more repositories that purport to address the need for enterprisewide metadata management. However, few organizations are using their solutions effectively in an "enterprisewide manner" to either federate their metadata across tool suites or community-based repositories, or to consolidate the metadata into one main enterprise repository (see "Six Common Approaches to Metadata Federation and Consolidation"). This is due to the difficulties related to integrating metadata with an enterprisewide breadth of scope. The growth in volume, velocity, variety and complexity of information, and the new use cases with insatiable demand for real-time access to socially mediated and context-aware information, are dramatically increasing the need for improved metadata management. And these in turn also raise new enterprisewide governance, risk and compliance needs which require more integrated (federated or consolidated) repository solutions. Unfortunately, there are no easy solutions to managing metadata on an enterprisewide basis, so, not surprisingly, there is no ideal solution in terms of the metadata repository market to meet the need at this point in time. 
We are seeing more and more organizations, even those that already own enterprise repositories, acquiring several other "best-of-breed" repositories, each focused on different communities of users in projects and programs involving data warehousing, MDM, business process modeling and analysis, service-oriented architecture (SOA) and data integration, to name just a few types of communities. In each case, these community-focused repositories have shown benefits in improved quality and productivity through an improved understanding of the artifacts, the impact queries and the reuse of assets, such as data and process artifacts, services and components. This has resulted in the "subsetting" of what was once the enterprise repository market into smaller "communities of interest," using solutions that are less expensive and easier to manage. However, attempting to federate metadata across multiple repositories to provide an "enterprisewide view of metadata" is no simple task.

Because of these metadata federation issues, enterprise repository vendors are now marketing their solutions "a piece at a time": in other words, starting with just the subset required for the needs of the first community (for example, data warehousing) and then adding other components for the needs of the second, third and other communities (such as MDM or SOA). This has the advantage of allowing each community to work within its own area of focus without having to worry about the issue of federating multiple community repositories when other communities are added. Or, in some cases, organizations are starting with the best-of-breed community repositories and, instead of federating across them, are passing subsets of metadata from each of the community repositories to an enterprise repository for consolidation and easier reporting. However, even in these cases, organizations are not reverting to the use of enterprise repositories on the scale seen in the 1970s.

While the need for enterprisewide metadata management is not being met by current technologies, we include this trend in the top 10 for 2012 due to organizations' increased focus on using data to improve business operations. This requires greater information and metadata management, and demands that enterprises broaden their understanding and skills in enterprisewide metadata management, starting with federation.

Recommendations: By providing understanding, governance, change impact analysis and improved levels of reuse of information assets, the impact of metadata repositories on the business can be significant.

- Define metadata in a way that demonstrates the value of information to the organization.
- Develop a metadata management strategy that incrementally improves the standardization and sharing of metadata.
- Do not underestimate the importance of metadata management to the success of the enterprise's ability to create value from information assets, nor the challenges in successfully managing the metadata.
- Identify the uses of the term "metadata" throughout the organization to help determine the level of effort necessary to implement metadata management.
- Form a cross-organizational team to reach agreement on the fundamental purpose of metadata and on what information asset is being described.
- Identify the scope of your metadata management initiatives based on business opportunities and threats and the ability to govern the metadata you capture. Be selective in which sources you use, since (like the data being described) metadata can be redundant, inconsistent and erroneous, depending on the chosen source.


- Start by evaluating the metadata management tools you currently have, including their federation/integration capabilities.
- When looking to acquire a metadata repository or evaluate the metadata management capabilities of a given technology, use the analytic hierarchy process (AHP) described in "Decision Framework for Evaluating Metadata Repositories" (or something similar), which includes a decision framework for defining and weighting criteria in areas such as tool functionality, vendor execution, service and support, vendor vision and cost (a simplified weighted-scoring sketch follows this list).
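The sketch below shows a simplified weighted-criteria scoring of candidate repositories in the spirit of the decision framework cited in the last recommendation; a full AHP exercise would derive the weights from pairwise comparisons, which is omitted here. The criteria, weights, candidate names and scores are all invented for illustration.

```python
# Illustrative criteria and weights (weights sum to 1.0); a full AHP would
# derive these from pairwise comparisons rather than assert them directly.
WEIGHTS = {"functionality": 0.35, "vendor_execution": 0.25,
           "service_support": 0.15, "vendor_vision": 0.15, "cost": 0.10}

# Hypothetical 1-5 scores for two candidate repositories.
CANDIDATES = {
    "Repository A": {"functionality": 4, "vendor_execution": 3,
                     "service_support": 4, "vendor_vision": 3, "cost": 2},
    "Repository B": {"functionality": 3, "vendor_execution": 4,
                     "service_support": 3, "vendor_vision": 4, "cost": 4},
}

def weighted_score(scores):
    """Combine per-criterion scores into one figure using the agreed weights."""
    return sum(WEIGHTS[criterion] * value for criterion, value in scores.items())

if __name__ == "__main__":
    # Rank candidates by their weighted score, highest first.
    for name, scores in sorted(CANDIDATES.items(),
                               key=lambda kv: weighted_score(kv[1]), reverse=True):
        print(f"{name}: {weighted_score(scores):.2f}")
```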

Data/Application Integration Convergence

Analysis by Ted Friedman and Jess Thompson

Application integration enables applications that were designed independently to work together. In the mid-1990s most IT organizations thought of application integration as creating interfaces that moved data between applications (data consistency integration) and that executed applications to complete processes (multistep process integration). Since then, organizations have extended the styles of integration to include composite applications, which use existing assets to eliminate the redundant development of business logic and the creation of duplicate data.

Data integration enables data structures that were designed independently to be leveraged together. The discipline of data integration comprises the practices, architectural techniques and tools for achieving consistent access to, and delivery of, data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes. As such, data integration capabilities are at the heart of the information infrastructure and will power the alignment and delivery of data in support of various use cases, such as BI/analytics, MDM and more.

In most companies, different teams within the IT organization handle application integration and data integration. As IT staff become experienced in those disciplines, they develop different approaches. Application integration staff tend to focus on making independently designed applications work together, while the data management and integration staff focus on enabling data structures that are designed independently to be leveraged together, such as aggregating data. Yet these different approaches aim to support the same business strategy and goals.

Organizations can use both application integration techniques and data integration techniques to solve many of the same problems. The choice of one approach versus the other is often made tactically, driven from one point of view or the other. But application integration and data integration are better together. The challenge goes beyond simply deploying both types of technology. Instead, organizations need to consider both approaches when looking at a particular integration problem, and then choose the most appropriate solution for that problem, which may include both application integration and data integration techniques.

The differences in functionality between the two disciplines and their respective technology classes mean that each is best suited to certain problem types (see Figure 1). The common problem types can often be readily solved with either discipline and related technologies. The problem types indicated as distinct represent sweet spots best addressed by one or the other. It is critical to note that there is significant (and growing) overlap across the problem space. It may not be the "best fit" or most effective choice, but it is entirely possible to apply one discipline and technology class to virtually any problem type. However, doing so can dramatically increase deployment cost and risk: substantial augmentation (in the form of custom coding, manual processes and additional administrative overhead) is likely to be required to completely meet requirements, or certain requirements may go unmet.

An increasing number of organizations, as seen in an upward trend for Gartner client inquiries on this topic, are recognizing the value of managing integration in a consistent way across both disciplines. This means that they are considering both application integration and data integration techniques and tools as possible solutions for a given integration problem, and then choosing the "best fit" approach based on the problem characteristics. Further, they are seeking ways in which both disciplines and the associated tools can be used together to address problems more completely and effectively than with a single approach to integration.
Figure 1. Common Features and Use Cases for Application Integration and Data Integration

Features:
- Application integration: monitoring, quality of service, community management, message-based transformation
- Common: communications, transformation, adapters, security, governance, management and administration
- Data integration: modeling/metadata, data profiling/quality, management and administration, set-based transformations, data federation/virtualization

Problem types and use cases:
- Application integration: composite apps, multistep processes
- Common: data consistency, SOA support, B2B, master data management
- Data integration: data preparation for BI/analytics, data migration/consolidation, HA/DR (replication)

BI = business intelligence, DR = disaster recovery, HA = high availability, SOA = service-oriented architecture

Source: Gartner (December 2011)


Recommendations:

- Lay out a vision for bringing together your application integration and data integration technologies.
- Identify your IT teams with responsibility for application integration and data integration.
- Inventory the technologies supporting your application integration and data integration projects.
- Establish centers of excellence to develop, plan and execute the integration of your application integration and data integration technologies.
- Actively apply reference architectures that include opportunities to use application integration and data integration capabilities together. Establish and begin building best practices that use those reference architectures.

Share Information
Information Semantic Services

Analysis by Mark Beyer

Information semantic services are a layer of services that provides a specific entry or "gate" into information management functions or capabilities. Information semantics follow styles or approaches, and those approaches support specific assumptions about how an application interfaces with the data it uses. For greater detail on this approach, see "Information Management in the 21st Century Is About All Kinds of Semantics." Categories of information semantic services include the following (a schematic sketch follows the list):

- Dedicated: Applications assume they have priority and the authority to manage the underlying information (legacy applications, for example).
- Registry: Applications assume that the authority for management is derived from governance processes in other data handlers (for example, reading from other applications, MDM or ERP).
- Consolidation: Applications assume the authority to collect and collate information in a physical repository (for example, data integration and DW).
- External: Applications read data from another single source of information governance, usually external to the organization, but it can be external to the application (for example, reading from an ERP application).
- Auditing: Applications read metadata or data as created in a governance model (for example, taxonomy/ontology) and use comparative analysis (BI, for example).
- Orchestration: Applications use a convergence of multiple information assets, including reformatting or enrichment as needed (for example, complex-event processing).
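To make a few of the categories above more concrete, the sketch below expresses them as a small Python interface: each style answers the same read request, but under different assumptions about where authority for the data lives. The class and method names are illustrative only; they are not part of any Gartner framework or vendor product, and only three of the six styles are shown.

```python
from abc import ABC, abstractmethod

class InformationSemanticService(ABC):
    """Common entry point ('gate') into information management capabilities."""
    @abstractmethod
    def read(self, key): ...

class DedicatedService(InformationSemanticService):
    """Assumes priority and authority over its own underlying store (legacy style)."""
    def __init__(self, local_store):
        self.local_store = local_store
    def read(self, key):
        return self.local_store[key]

class RegistryService(InformationSemanticService):
    """Derives authority from other data handlers and only points at them."""
    def __init__(self, registry):
        self.registry = registry            # key -> authoritative service
    def read(self, key):
        return self.registry[key].read(key)

class ConsolidationService(InformationSemanticService):
    """Collects and collates data into its own physical repository (DW style)."""
    def __init__(self, sources):
        self.repository = {}
        for source in sources:              # simplified 'load' step
            self.repository.update(source)
    def read(self, key):
        return self.repository[key]

if __name__ == "__main__":
    erp = DedicatedService({"customer:42": {"name": "Acme"}})
    registry = RegistryService({"customer:42": erp})
    warehouse = ConsolidationService([{"customer:42": {"name": "Acme", "segment": "B2B"}}])
    print(registry.read("customer:42"), warehouse.read("customer:42"))
```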

A modern IM approach will continue a slow evolution toward information governance practices and will become increasingly independent of application design. At the same time, services enablement will become the standard, even in packaged applications. Packaged applications will evolve to become composite services networks, and custom applications will want to utilize information services in common with those packaged applications.

A key component of this approach is to negotiate how different applications will interact with the information assets in the organization. Much of this interaction will be based on semantics that interpret application needs for writing, reading and presenting data, so that applications can leverage finer-grained and easily recomposed data architectures that address describing, organizing, integrating, sharing and governing information through a common framework and shared governance model. However, while service architectures are increasingly important, they are not the only option. Enterprise information architecture is responsible for determining if and when service approaches should be used in contrast to the other options that are available (such as single-purpose applications with dedicated repositories). Gartner's view is that consistent semantic styles will emerge to enable widely varied applications to efficiently reuse a comprehensive information architecture, instead of having to write dedicated application-to-data interfaces (see "Information Management in the 21st Century Is About All Kinds of Semantics").

Recommendations: Organizations should focus on delivering a semantic layer to support the use of granular data services. They should also:

- Develop an enterprise information architecture approach that includes a focused strategy for addressing high-value information reuse, and which includes all of the aspects of the information capabilities framework, through the following steps.
- Create a candidate pool of high-reuse information assets. Identify which systems in the organization populate and manage data as the originating source of information that is then shared across multiple business processes. Then further qualify this list to identify those systems that are targeted for use in planned future deployments. These are "deep waterfalls" of data management, with plans for them to get even "deeper."
- Develop a "skeletal" business process model that focuses on locations where information usage crosses from one business process to another. Following the guidance of the business, determine whether these "crossover" points are also key performance metrics.
- Quantify and qualify the definitional, format and governance differences of the data in the different processes to determine the level of work needed to develop a reusable data architecture at only those crossover points where there is also a candidate key metric input.
- Identify which information assets move most frequently from one business process to another and the systems involved. Determine the level of change to the data as it moves through the infrastructure. If change is minimal when audited back to the source system, do not revise those systems or the data governance, as that is evidence of an accepted governance and management strategy. For those cases wherein the data is augmented or values are changed, substituted or overwritten, the enterprise architecture team needs to lead in creating an independent model and follow a significantly more orchestration/implementation-style approach.


The Logical Data Warehouse

Analysis by Mark Beyer

DW architecture is undergoing an important evolution, compared with the relative stasis of the previous 25 years. While the term "data warehouse" was coined around 1989, the architectural style predated the term (at American Airlines, Frito-Lay and Coca-Cola). At its core, a DW is a negotiated, consistent logical model that is populated using predefined transformation processes. Over the years, the various options (centralized enterprise DW, federated marts, a hub-and-spoke array of a central warehouse with dependent marts, and the virtual warehouse) have all served to emphasize certain aspects of the service expectations for a DW. The common thread running through all these styles is that they were repository-oriented. This, however, is changing: the DW is evolving from competing repository concepts into a fully enabled data management and information-processing platform. This new warehouse forces a complete rethink of how data is manipulated, and of where in the architecture each type of processing occurs to support transformation and integration. It also introduces a governance model that is only loosely coupled with data models and file structures, as opposed to the very tight, physical orientation previously used. This new type of warehouse, the Logical Data Warehouse (LDW), is an information management and access engine that takes an architectural approach which de-emphasizes repositories in favor of new guidelines:

- The LDW follows a semantic directive to orchestrate the consolidation and sharing of information assets, as opposed to one that focuses exclusively on storing integrated datasets.
- The LDW is highly dependent upon the introduction of information semantic services (see the Information Semantic Services section above). The semantics are described by governance rules from data creation and use-case business processes in a data management layer, instead of via a negotiated, static transformation process located within individual tools or platforms.
- Integration leverages both steady-state data assets in repositories and services in a flexible, audited model, via the best optimization and comprehension solution available.

To develop an LDW, enterprises should start by introducing the LDW concepts used for dynamic consolidation, integration and implementation (see "Does the 21st-Century "Big Data" Warehouse Mean the End of the Enterprise Data Warehouse?").

The data integration process can be broken into sourcing, collation, data quality, formatting and domain governance segments, based on information availability and governance rules. For example, the sourcing/extraction process can be a registry semantic layer using "describe" verbs that tell the service "where" the data is. If data for "person" is located in documents, clickstreams and enterprise systems, one service can use textual analysis and search for documents, another service can use MapReduce to read massive volumes of tags in "clicks," and a traditional native driver access approach can pull data from the enterprise system database. A data quality process can then verify the work done by each service and undertake an enrichment and/or value substitution process, before prepping the data for delivery. If the data is dynamic and constantly changing, the data integration process can deliver a virtual data object, but if the data is already validated by an MDM process and fairly static, it can be loaded into a table or file. A final service can determine the appropriate load or access format and put the data into that format.
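A schematic sketch of the sourcing pattern just described: a registry tells the integration layer where data about "person" lives, a per-source access service is chosen (standing in for text search over documents, a MapReduce-style scan of clickstream tags, and a native query against the enterprise database), a quality step enriches the records, and a delivery step decides between a virtual object and a loaded table. Every name and function here is a placeholder; a real implementation would call search engines, Hadoop jobs and database drivers.

```python
# Hypothetical registry: 'describe' metadata telling the service where data lives.
REGISTRY = {"person": [("documents", "hr-fileshare"),
                       ("clickstream", "web-logs"),
                       ("enterprise", "crm-db")]}

# Placeholder access services keyed by source type (stand-ins for text search,
# a MapReduce scan and a native DBMS driver, respectively).
ACCESS_SERVICES = {
    "documents":   lambda loc: [{"person_id": 1, "source": loc, "mention": "resume"}],
    "clickstream": lambda loc: [{"person_id": 1, "source": loc, "clicks": 37}],
    "enterprise":  lambda loc: [{"person_id": 1, "source": loc, "name": "A. Smith"}],
}

def quality_check(record):
    """Simplified enrichment/substitution step before delivery."""
    record.setdefault("validated", record["source"] == "crm-db")
    return record

def integrate(subject, volatile=True):
    """Assemble records for a subject and pick the delivery format."""
    records = [quality_check(r)
               for kind, location in REGISTRY[subject]
               for r in ACCESS_SERVICES[kind](location)]
    # Dynamic data is delivered as a virtual object; static, validated data
    # would instead be loaded into a table or file.
    return {"delivery": "virtual_object" if volatile else "load_to_table",
            "records": records}

if __name__ == "__main__":
    print(integrate("person"))
```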

In relation to latency issues, you are no longer bound by load restrictions. It is possible to indicate in a metadata layer that there are different requirements for different analytic end-use cases. For example, one department may require higher-quality data but tolerate higher-latency delivery (it would get data from fully validated tables), while another department might be prepared to risk inconsistencies in data but require low latency (it would get a combined registry delivery of yesterday's data in the tables with today's data from the online transaction processing system: "dirty but fast"). Or, instead of this fixed approach, you could have a service that negotiates whether the quality SLA is being met for each of the departments and switches between strategies dynamically. For example, the department requiring low latency might receive data from the warehouse repository in the morning, after the previous night's load had brought everything up to date, but in the afternoon it might receive a composite view. And, instead of switching at a predetermined time of day, the switch would be based on how far out of synchronization the two sources are, based on record counts and data quality ratings.

A dynamic service that determines when to write summary or aggregate data is generally faster than one that performs a query-time summary of detailed rows. It could even switch on the basis of CPU and storage utilization/performance audits, and change its approach throughout the day. It could also switch dynamically between approaches on the basis of system audits that determine whether more memory should be added for caching, or even whether it is worthwhile to perform caching at all.

Adding external data, based on services written to read and analyze those data sources, also becomes easier. For example, adding operational technology data, such as the millions of records generated each day by RFID-enabled supply chain management tracking systems or utility smart grid meters, requires massive data integration processing in a procedural manner when using traditional warehouses. But by developing two or three variants of the same MapReduce function, the LDW can orchestrate the preferred approach for different analyst audiences and leave the data in the source or the historian software (see "Historian Software and the Smart Grid: Questions and Misconceptions").
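The dynamic SLA negotiation described above reduces to a small decision function: given how far the warehouse lags the operational source and the consuming department's latency and quality tolerances, choose between the validated warehouse tables and a composite (registry) view. The thresholds, ratings and department names below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class DeliverySLA:
    max_latency_minutes: int      # how stale the data may be
    requires_validated: bool      # whether only quality-checked data is acceptable

def choose_strategy(sla, warehouse_lag_minutes, source_quality_rating):
    """Pick a delivery strategy per request instead of fixing it at design time."""
    if warehouse_lag_minutes <= sla.max_latency_minutes:
        # The last load is fresh enough: serve the validated warehouse tables.
        return "warehouse_repository"
    if sla.requires_validated or source_quality_rating < 0.8:
        # Too stale, but quality matters more than latency: stay on the tables.
        return "warehouse_repository"
    # Latency matters more than consistency: combine tables with live OLTP data.
    return "composite_view"

if __name__ == "__main__":
    finance = DeliverySLA(max_latency_minutes=24 * 60, requires_validated=True)
    operations = DeliverySLA(max_latency_minutes=60, requires_validated=False)
    # Morning: last night's load is 30 minutes old; afternoon: 500 minutes old.
    for lag in (30, 500):
        print(lag, choose_strategy(finance, lag, 0.9),
              choose_strategy(operations, lag, 0.9))
```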

Note that with the LDW approach, the differing styles of support, such as federation, data repositories, messaging and reductions, are not mutually exclusive. They are just manifestations of data delivery. The focus is on getting the data first, then figuring out the delivery approach that best achieves the SLA with the querying application. The transformations occur in separate services.


Recommendations:

- Start your evolution toward an LDW by identifying data assets that are not easily addressed by traditional data integration approaches and/or easily supported by a "single version of the truth."
- Consider all technology options for data access and do not focus only on consolidated repositories. This is especially relevant to "big data" issues.
- Identify pilot projects in which to use LDW concepts by focusing on highly volatile and significantly interdependent business processes.
- Use an LDW to create a single, logically consistent information resource independent of any semantic layer that is specific to an analytic platform. The LDW should manage reused semantics and reused data.

Integrate/Store Information
Hadoop MapReduce Implementations

Analysis by Merv Adrian and Donald Feinberg

MapReduce is a programming model for the processing and storage of both structured and semistructured data using parallel programming techniques. Although there are many implementations of MapReduce, Hadoop MapReduce is the most widely used. MapReduce and the HDFS are subprojects of Hadoop Common in the Apache project hierarchy, and as of November 2011 are commonly referred to as Apache Hadoop. The programming model originated in the paper "MapReduce: Simplified Data Processing on Large Clusters," presented by Google researchers at the Sixth Symposium on Operating System Design and Implementation in 2004; the open-source Hadoop system is based on that work.

The business value of Hadoop MapReduce lies in the performance improvement achieved by enabling the analysis of many complex mixed data types stored in a database management system (DBMS), HDFS or HBase (another Apache project: a distributed, versioned, column-oriented store modeled after Google's BigTable), in a parallel environment composed of inexpensive servers processing large amounts of data. The results of MapReduce programs can then be loaded into a DW for further processing, or used directly from within the programs. MapReduce's other key use is as a data integration tool to increase performance by reducing very large amounts of data to just what is needed in the DW. Due to its parallel nature, MapReduce increases the speed of extraction and transformation of the data.

In 2008, MapReduce was included in both Aster Data (acquired by Teradata) and Greenplum (acquired by EMC) as an internal function within the DBMS; bidirectional connectors were included in Vertica (acquired by HP) in 2009. In addition to the open-source Apache Software Foundation components, there are other stand-alone implementations available. The pioneering distribution, Cloudera Distribution for Hadoop (CDH), has been joined by Hortonworks Data Platform (HDP), from the company formed as a spinout from Yahoo with added funding from Benchmark Capital. Distributions add value by organizing multiple components into a single download and preintegrated install unit, and typically add numerous additional components that extend core Hadoop. As of November 2011, the Apache Software Foundation lists 23 organizations that offer products that include Apache Hadoop, commercial support, and/or tools and utilities related to Hadoop (see http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support). Commercially, some of the uses of MapReduce are:

- To process large sets of data, such as clickstream data from the Web (hence the development by Google).
- To process complex or mixed data types.
- As part of a data integration infrastructure, to process large amounts of data and then load the data into a DW for further processing.

Many MapReduce applications are also performed on small amounts of data, especially when the data contains complex mixed data types or involves complex multipass computations that include advanced analytics, such as pattern searching, clustering analysis or building "influencer networks."

MapReduce implementations represent a much wider class of processing concepts than this, however. The idea of a centralized repository for all enterprise data is becoming unmanageable, given the volume and complexity we have and the available hardware technology (see "Does the 21st-Century "Big Data" Warehouse Mean the End of the Enterprise Data Warehouse?"). A new architecture is evolving that leverages new software and new, less costly and better-performing hardware to create information assets on demand that may be either declared or inferred (sourced dynamically, based on context), or both. The processing involved treats search, mashups, metadata, integration and repositories as equal contributors. This requires a new type of processing in which massively parallel capabilities must be leveraged. The main reason is that the volume of data, the widely variant nature of its form/format (which comprises "extreme data") and content, and the extremely wide variations in use cases make centralized repositories of "all things informative" unrealistic. Hadoop MapReduce represents the first popular implementation of this new class of combined processing and storage.

Here is an example of how it works. We have four hours of stock ticker data. For two hours, nothing meaningful happens. The Mapper puts two hours of data together and determines, based on a statistical algorithm, that there is nothing of value in this period. It then determines that the data for the next 25 minutes is important and groups that data into smaller clusters (each, for example, of 30 seconds). Then, after 25 minutes, the data returns to the same fluctuation pattern as before, but with a much higher average selling price, prompting the creation of a new cluster, different from that for the first two hours. Now Reduce works on the clusters, creating one two-hour-long record, multiple 30-second records and a 95-minute record. If the data actually contained billions of records for 400 stocks, the system would distribute portions of the records across a massively parallel processing complex. By the time MapReduce has finished its first run, it has created many different intervals and values. MapReduce then runs again, reconfiguring the clustered data across the system and finding better affinities. It continues to repeat until it finds the answer that best aligns the results with the original statistical algorithm. In other words, the amount of data that can be processed is no longer limited by a DBMS running on a dedicated database. Instead, MapReduce can scale across all storage and all available servers, outside the DBMS, and then feed answers in the form of records to the DBMS for final delivery.
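A much-simplified, local sketch of the map/reduce pattern behind the ticker example: the map step buckets each tick into a fixed 15-minute interval, and the reduce step collapses each bucket into a count and average price. This is plain Python simulating the two phases, not the Hadoop API; the record layout, interval size and values are invented, and the adaptive re-clustering described above is omitted. On a cluster, the equivalent mapper and reducer would run across many nodes with the framework performing the shuffle/sort between them.

```python
from itertools import groupby

INTERVAL_SECONDS = 15 * 60   # assumed fixed bucket size (the text's clusters are adaptive)

def mapper(lines):
    """Emit ((symbol, interval_start), price) for each 'epoch,symbol,price' record."""
    for line in lines:
        epoch, symbol, price = line.strip().split(",")
        bucket = int(epoch) // INTERVAL_SECONDS * INTERVAL_SECONDS
        yield (symbol, bucket), float(price)

def reducer(pairs):
    """Collapse each (symbol, interval) group into a count and average price."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        prices = [price for _, price in group]
        yield key, {"trades": len(prices), "avg_price": sum(prices) / len(prices)}

if __name__ == "__main__":
    # Local stand-in for the shuffle/sort a Hadoop cluster would perform.
    ticks = ["1000,ACME,10.0", "1300,ACME,10.2", "2000,ACME,14.0"]
    for key, summary in reducer(mapper(ticks)):
        print(key, summary)
```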

Recommendations:

- Organizations with highly technical developers or skilled statisticians should consider using MapReduce to perform complex analytics "outside" the DW on external data sources.
- Organizations with large amounts of historical data stored outside their DW should consider using MapReduce as an alternative to their current data integration tools, as the parallel processing will deliver better performance. Today, this is one of the primary uses of MapReduce implementations. Note that since the data loaded into a MapReduce process can come from many different sources, there is none of the implied data quality or referential integrity found in some DBMSs; this would have to come from the source systems creating the data.
- Organizations should carefully evaluate the use of MapReduce inside a DBMS, due to the volume of data that would be loaded into the data warehouse. This could result in very large database sizes, especially with complex, mixed data types, which would have a detrimental effect on performance and maintenance and deliver no clear benefit.

Govern Information
Multiple Domain Master Data Management

Analysis by Andrew White and John Radcliffe

Gartner defines master data as "the consistent and uniform set of identifiers and attributes that describe the core entities of the enterprise, and are used across multiple business processes." Core entities include parties, places and things. Groupings of master data include organizational hierarchies, sales territories, product roll-ups, pricing lists, customer segmentations and preferred suppliers. The different master data domains group entities of similar complexity into provinces:

The "party" province:

Customer: including consumers, business customers and channel/trading partners, prospects and citizens. Supplier: including vendors and suppliers. Human capital: including employees and contractors.

The "thing" province:


Product/service: including products, parts, assets, tools, services, accounts and policies. Sell side, or customer-facing.

Page 18 of 23

Gartner, Inc. | G00227024

Buy side, or supplier-facing. Financial: financial charts of account (also some similarity with hierarchies/relationships). Location: including locations, offices, regional alignments and geographies. Hierarchies, or relationships.

Reference data: sets of permissible values in look-up tables, often used in defining master data (for example, unit of measure, currency conversion, calendar).

The scope of a true enterprise MDM strategy will naturally need to support multiple master data domains (described above), but the number, type and characteristics of those domains will differ depending on the organization's structure, use cases for MDM, implementation style and industry sector (see "The Five Vectors of Complexity That Define Your MDM Strategy"). Since interest in MDM started, most organizations have acquired MDM technology on the basis of individual MDM initiatives that mapped to specific master data domains. Now they are increasingly interested in mapping out how they will support multiple MDM data domains over time. Many are hoping to enable the revenue generation, service enhancement, time-to-market reduction, cost optimization, growth enablement, and risk and regulatory compliance business benefits typically associated with MDM initiatives, while minimizing the proliferation of different MDM vendors and products, with its potential costs and complexities.

In the long term we expect truly multidomain MDM technology to emerge, but this market dynamic is still only just emerging. Multidomain MDM will differ from multiple domain MDM: multidomain MDM solutions are purpose-built solutions targeted at addressing the multidomain technology requirements of an MDM program, whereas multiple domain MDM represents multiple instances of single-domain-centric or use-case-centric MDM technologies, each possibly from a different vendor. Some vendors have evolved with a deep focus on a single domain, such as MDM of Product Data or MDM of Customer Data. Some vendors remain so focused; others have expanded their focus (and offerings) toward a multiple domain offering using multiple products, sometimes oriented around specific industries. If the vendor offers such capability in one technology solution, we call this a multidomain MDM technology solution. If the vendor offers such capability with multiple specific technology solutions, we call this a multiple domain MDM implementation (of single-domain-oriented technology solutions). We introduce this terminology to make the vendor perspective clear to users. However, end-user organizations use the term "multidomain MDM" and may not delineate what type of technology approach they seek; and vendors exploit this by using the term "multidomain MDM," often independently of their technology approach. This creates potential confusion in the market and can create a mismatch in expectations between user and vendor.

Recommendations:

Start creating a holistic, business-driven MDM vision and strategy to meet the MDM needs of your organization. Ensure that you address governance, organizational, process and metrics issues, and that you create a technical reference architecture for MDM. Be aware that for larger firms, and/or those with complex information environments, it is likely that two or more MDM systems will be needed to meet all MDM requirements. Also, ensure that your MDM initiative is aligned with the objectives of your organization's EIM program. Without an overarching initiative like EIM, even if MDM is the only initiative adopted, it is very possible that informational and organizational silos will not be broken down completely, because the information infrastructure will be stretched in several, possibly competing, ways.

While keeping the long-term vision in mind, approach individual steps of the MDM journey based on your business priorities and the maturity of the MDM technology available. There is experience and proven MDM technology for managing customer and product data in operational use cases, and for developing experience with other data domains, such as supplier, employee and location. There is also experience and proven technology in niche analytical MDM products for managing financial and other types of data. For the next three years, be prepared to engage multiple domain MDM vendors, or at least use multiple MDM products from the same vendor, to meet all your MDM requirements.

Build an MDM road map, working with your business stakeholders and your enterprise architecture team, to demonstrate how the different facets of MDM will be addressed over time. Align with, or leverage, investment in resources associated with a BI or IM competency center. MDM will require more human resources to start with, and fewer once things are properly under way. When assessing your human resources requirements, take account of the need to rationalize multiple MDM initiatives in your organization over the long term. In summary:

- Organizations should take a long-term strategic view of MDM, but balance it with a pragmatic outlook that delivers business value in stages and leverages technology for multiple master data domains.
- Develop an MDM strategy based on the "vectors of MDM complexity" requirements for your organization, and evaluate and select MDM vendors and products against this backdrop.
- Organizations looking for multidomain MDM technology need to be realistic about MDM vendors' marketing claims.
- MDM vendors need to improve their ability to provide in-depth support for multiple master data domains. This may involve providing more breadth and depth in a single MDM product, or tighter integration and consistency of usage across multiple MDM products.
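To make the multidomain versus multiple domain distinction concrete, the following minimal sketch in Python is purely illustrative and is not drawn from any vendor's product; the class, domain and vendor names are assumptions introduced for illustration. It contrasts one purpose-built hub that registers many domains under a single solution with a multiple domain approach that deploys a separate single-domain hub, possibly from a different vendor, for each domain.

```python
# Illustrative sketch only: contrasts a multidomain MDM hub with multiple
# single-domain hubs. All class, domain and vendor names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A master data domain (e.g., customer, product) and its key attributes."""
    name: str
    key_attributes: list[str]


@dataclass
class MultidomainHub:
    """One purpose-built solution: all domains share one model, one match/merge
    configuration and one stewardship workflow."""
    domains: dict[str, Domain] = field(default_factory=dict)

    def register_domain(self, domain: Domain) -> None:
        self.domains[domain.name] = domain


@dataclass
class SingleDomainHub:
    """One hub per domain, possibly from a different vendor; integration and
    consistent governance across hubs is left to the buying organization."""
    domain: Domain
    vendor: str


if __name__ == "__main__":
    customer = Domain("customer", ["name", "address", "tax_id"])
    product = Domain("product", ["sku", "description", "gtin"])

    # Multidomain MDM: one technology solution, many domains.
    multidomain = MultidomainHub()
    multidomain.register_domain(customer)
    multidomain.register_domain(product)

    # Multiple domain MDM: several single-domain solutions side by side.
    multiple = [SingleDomainHub(customer, "Vendor A"),
                SingleDomainHub(product, "Vendor B")]

    print(f"Multidomain hub manages {len(multidomain.domains)} domains in one solution")
    print(f"Multiple domain approach uses {len(multiple)} separate hubs")
```

The design point the sketch highlights is where the integration and governance burden sits: inside the one product in the first case, and with the buying organization in the second.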

Data Stewardship Applications

Analysis by Andrew White and John Radcliffe

Governance of data is a people- and process-oriented discipline that forms a key part of any EIM program. Governance includes the specification of decision rights and an accountability framework to encourage desirable behaviors in the valuation, creation, storage, use, archiving and deletion of information enterprisewide. Technology plays a key role in the enforcement of the policies and decision rights that governance produces. Data stewardship applications specifically targeted at structured data are now emerging.

MDM is the first specific EIM program that really formalized the technology needed to support the role of the master data (or, more generally, data) steward. Before the emergence of MDM, individual data quality, data profiling and other tools produced some analysis and data that helped with stewardship activities, but often informally, one task at a time, and not oriented at the specific role of stewardship. MDM solutions have attempted to bring some of this capability together, but since 2008 a new class of technology has formed that sits alongside the established MDM tools. The early MDM solutions focused their stewardship capabilities only on the data in their own hubs; these offerings can now span "hubs of hubs" and any other stores of structured data. Overall, MDM's maturity is driving much of the hype around this technology, even though these tools could be (and are being) used by some organizations to govern reference data and other forms of enterprise information.

The maturation of this toolset will take time, and it will continue to evolve through different groupings of tools: initially, management processes supported by business rule engines, workflow, dashboards presenting analytics and to-do lists, more general master data reporting, metadata management of master data across application data stores, data life cycle visibility, MDM hubs, and other visibility and management tools that express and operationalize governance. In the future, we expect this technology to be applied to the governance of other data, such as content, digital assets and records. The emergence of composite, even packaged, applications for stewardship will soon challenge established MDM implementation efforts, because each new MDM effort brings its own governance requirements, which vary greatly. In the long term, mature IT organizations will formally integrate business process management tools with MDM tools to express the governance routines, along with business activity monitoring and corporate performance management. Many self-named MDM solutions offer parts of this technology scope, although often not as an integrated environment for use by the master data steward, nor one that can be deployed across all other master data stores.

Recommendations:

- Recognize that the governance of data is a core component of any EIM discipline.
- Be explicit about establishing the necessary stewardship function in the business (not IT), and support it with governance overseers and the set of tools and technologies needed to operationalize the day-to-day work of a business (data) steward.
- Don't assume that your MDM vendor automatically provides all the tools and solutions necessary to support data stewardship. Evaluate carefully where you find differences between MDM vendor offerings and newly emerging data stewardship solutions.
- Select vendors with proven experience that aligns with your immediate priority, and whose vision aligns with where your information governance programs will take you.
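To illustrate the capability grouping described above (rule engines, workflow and steward to-do lists), the following minimal sketch in Python shows the core loop such tools automate: governance rules are evaluated against master data records, and violations become tasks in a steward's work queue, which a dashboard or workflow engine would then surface and track. The rule names, record fields and task structure are hypothetical assumptions for illustration, not taken from any vendor's product.

```python
# Minimal, purely illustrative sketch of the rule-to-task loop a data
# stewardship application automates. All rule names, record fields and
# task structures are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class StewardshipTask:
    """A to-do item surfaced to the business data steward."""
    record_id: str
    rule: str
    detail: str


# A governance "rule" here is simply a name, a predicate and a description.
Rule = tuple[str, Callable[[dict], bool], str]

RULES: list[Rule] = [
    ("missing_tax_id", lambda r: not r.get("tax_id"),
     "Customer record has no tax ID; steward must source or confirm one."),
    ("suspect_duplicate", lambda r: r.get("duplicate_score", 0.0) > 0.9,
     "Probable duplicate of another golden record; review and merge or reject."),
]


def evaluate(records: list[dict]) -> list[StewardshipTask]:
    """Apply each rule to each record and emit tasks for the steward's work queue."""
    tasks = []
    for record in records:
        for name, predicate, detail in RULES:
            if predicate(record):
                tasks.append(StewardshipTask(record["id"], name, detail))
    return tasks


if __name__ == "__main__":
    sample = [
        {"id": "C-001", "tax_id": "", "duplicate_score": 0.2},
        {"id": "C-002", "tax_id": "12-3456789", "duplicate_score": 0.95},
    ]
    for task in evaluate(sample):
        print(f"[{task.rule}] {task.record_id}: {task.detail}")
```

In a packaged stewardship application, the equivalent of the RULES list would typically be maintained by the governance organization through a rule engine, and the resulting tasks would feed workflow, dashboards and exception reporting rather than console output.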

Recommended Reading
Some documents may not be available as part of your current Gartner subscription.

"Hype Cycle for Analytic Applications, 2011"
"Hype Cycle for Data Management, 2011"
"Hype Cycle for Enterprise Information Management, 2011"
"Hype Cycle for Master Data Management, 2011"
"Information Management in the 21st Century"
"Information Management in the 21st Century Is About All Kinds of Semantics"
"The Eight Common Sources of Metadata"
"Does the 21st-Century "Big Data" Warehouse Mean the End of the Enterprise Data Warehouse?"
"The Five Vectors of Complexity That Define Your MDM Strategy"
"A View of Master Data Management Vendors' Experience In Handling Multiple Master Data Domains"


