
Data Profiling

A Quick Primer on the What and the Why of Data Integration

AUTHORS
Shankar Ganesh R
Senior Technical Architect
Architecture and Technology
Services
HCL Technologies, Chennai

Sathish Kumar Srinivasan
Enterprise Data Architect
Architecture and Technology
Services
HCL Technologies, Chennai

Subramanyam B S
Lead Researcher
ATS-Technical Research
HCL Technologies, Chennai

Table of Contents

Introduction
The Need for Data Profiling
    Structure Discovery
        Validation with Metadata
        Pattern Matching
        Basic Statistics
    Data Discovery
        Standardization
        Frequency Counts and Outliers
        Business Rule Validation
    Relationship Discovery
Data Profiling: Typical Opportunity Areas
Data Profiling Tools
Data Profiling in Action: the Banking Sector
Conclusion


Title: Data Profiling: A Quick Primer on the What and the Why of Data Integration

© 2008, HCL Technologies Ltd.

August 2008


Introduction

In today's economic environment, businesses are facing increasing pressure to reduce costs. In an effort to remain competitive, companies are looking at all kinds of solutions: Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Supply Chain Management (SCM), Stock Control, Logistics, and Business Intelligence (BI), to name just a few. However, for any solution to deliver value, the data it depends on needs to be accurate, complete, and consistent. In the Global Data Management Survey conducted by PricewaterhouseCoopers, data is considered the most important asset fundamental to an organization's success.1

"83 per cent of organizations suffer from problems caused by poor data quality. Just 12 per cent of data integration projects are completed within their target budgets."

Despite its importance, most companies do not have detailed information about their data. As a result, the decision to proceed with a solution like ERP, CRM, BI or SCM is fraught with the risk of implementation delays, cost overruns or a lower than expected return on investment. The Data Warehousing Institute (TDWI) reports that 83 per cent of organizations suffer from problems caused by poor data quality.2 A Standish Group report indicates that 88 per cent of data integration projects will fail, or overrun their target budgets by 66 per cent.3

Companies are also not completely sure of their data quality: in another TDWI survey, half of the respondents said the quality of their data is excellent or good, while 44 per cent of respondents said that in reality the quality of their data is worse than everyone thinks.4 Rather than go by the perception of the individuals managing the data, companies need to resort to a data profiling exercise.

1 PricewaterhouseCoopers, Global Data Management Survey, 2001, p. 18, http://sirnet.metamatrix.se/material/SIRNET_10/survey_01.pdf [accessed June 2008]
2 TDWI: BI/DW Education Survey Finds 83 Percent of Organizations Suffering from Poor Master Data, http://vendors.ittoolbox.com/profiles/tdwi-dw-professional-education/news/survey-finds-83-percent-of-organizations-suffering-from-poor-master-data-23 [accessed June 2008]
3 Trillium Software, Data Integration and Data Quality Management, 2004, p. 5, http://www.trilliumsoftware.com/site/content/resources/library/pdf_detail.asp?id=49&pdfRecorded=1&type= [accessed June 2008]
4 Ibid, p. 9

This paper examines the reasons for and the process of data profiling. It
also takes a look at data profiling opportunities.

The Need for Data Profiling

A company's database contains information that touches most aspects of its business activity: market data, customer information, accounting information, production details, sales records, billing details, collection details, personnel records, salary records, and so on. This data is utilized by the company for various business decisions, and it is therefore imperative that the data in the database be consistent, accurate and reliable. Figure 1 shows the factors affecting data quality and the effects of poor data quality.

"Poor quality data increases costs, results in time delays and loss of business."

Factors Affecting Data Quality

1. Inadequately articulated requirements
2. Improper data creation process
3. Invalid data structures
4. Duplicate data
5. Redundant data
6. Missing values
7. Incorrect data lengths
8. Data imported from the database of an acquired company
9. Data imported from databases that belong to business partners
10. Unusual values
11. Poor acceptance testing

Effects of Poor Data Quality

1. Increased transaction rework costs
2. Increased costs incurred in implementing new systems
3. Time delays in delivering data to decision makers
4. Business and opportunity costs of lost customers through poor service
5. Costs of lost production through supply chain problems

Figure 1: Causes and Effects of Poor Data Quality



Since the costs of poor data quality are high, companies are increasingly profiling data to check its quality and suitability for business. Data profiling uses analytical techniques to discover the true content, structure, and quality of data.5 It is different from data analysis in that it derives information about the data, and not business information from the data. The purpose of data profiling is to

- Locate instances of inaccurate data
- Determine invalid values, structural violations, and data rule violations
- Find the data characteristics that are useful to a business analyst to determine if the data matches the business intent

Typically, data profiling is carried out before data integration is performed, or before business-critical software systems are launched. However, data profiling should also be carried out on critical data at regular intervals, to ensure the continuing accuracy of information.

Data Profiling

The data profiling process comprises structure discovery, data discovery and relationship discovery, as shown in Figure 2, and is undertaken before any data-driven initiatives are executed. Data profiling is performed using a tool that

- Automates the discovery process
- Helps uncover the characteristics of the data
- Helps uncover the relationships between data sources

5 See http://etutorials.org/Misc/data+quality/Preface/ [accessed June 2008]

Figure 2: Data Profiling Workflow

Structure Discovery

Structure problems are caused by data inconsistencies. Some problems


are also caused by legacy data sources that are still in use or have been
migrated to a new application.

Structure discovery is the process of examining complete columns or


tables of data, and determining whether the data in those columns or
tables is consistent with the expectations for that data. There are three
common structure discovery techniques:

- Validation with metadata
- Pattern matching
- Use of basic statistics

Validation with Metadata

Metadata is defined by Wikipedia as "data about data"6 and describes the


data that is in a table or column. Metadata contains information that
indicates the data type and field length. It also indicates if a field can be
missing or null or if it should be unique. Most data has some associated
metadata or a description of the characteristics of the data.
6 See http://en.wikipedia.org/wiki/Metadata [accessed June 2008]

Metadata validation analyzes the data and indicates, for example, whether
or not the field length is appropriate and if there are fields with missing
values. Validation also helps determine if the data collected is as per the
original plan, or if there are deviations.
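As an illustration, metadata validation can be sketched in a few lines of Python. The field specifications and the sample record below are illustrative assumptions, not the output of any particular profiling tool:

```python
# Sketch of metadata validation: check records against declared
# field specifications (type, maximum length, nullability).
# The spec format and sample data here are illustrative assumptions.

FIELD_SPECS = {
    "customer_id": {"type": int, "nullable": False},
    "name":        {"type": str, "max_length": 40, "nullable": False},
    "phone":       {"type": str, "max_length": 15, "nullable": True},
}

def validate_record(record, specs=FIELD_SPECS):
    """Return a list of metadata violations found in one record."""
    violations = []
    for field, spec in specs.items():
        value = record.get(field)
        if value is None:
            if not spec.get("nullable", True):
                violations.append(f"{field}: missing value")
            continue
        if not isinstance(value, spec["type"]):
            violations.append(f"{field}: expected {spec['type'].__name__}")
        elif spec["type"] is str and len(value) > spec.get("max_length", 10**6):
            violations.append(f"{field}: exceeds length {spec['max_length']}")
    return violations

record = {"customer_id": 101, "name": "A" * 50, "phone": None}
print(validate_record(record))  # the name exceeds its declared length
```

In practice, a profiling tool would read these specifications from the database catalogue rather than from a hand-written dictionary.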

Pattern Matching

Pattern matching determines if the data values in a field are consistent


across the data source and whether or not the information is in the
expected format.7 Pattern matching also checks for other format-specific
information about the data such as type and length.8
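A minimal pattern-matching sketch in Python, using the telephone-number formats described in the footnote; the regular expression and the sample column values are illustrative assumptions:

```python
import re

# Sketch of pattern matching for the telephone-number formats in the
# footnote: "(+NN) nnnnnnnnnn", "(0) nnnnnnnnnn", or a bare ten-digit
# "nnnnnnnnnn". The sample column values are illustrative.

PHONE_PATTERN = re.compile(r"^(\(\+\d{2}\) |\(0\) )?\d{10}$")

def pattern_report(values):
    """Count how many values match the expected pattern; list failures."""
    matched = [v for v in values if PHONE_PATTERN.match(v)]
    failed = [v for v in values if not PHONE_PATTERN.match(v)]
    return {"matched": len(matched), "failed": failed}

phones = ["(+91) 9840012345", "(0) 9840012345", "9840012345", "98400-12345"]
print(pattern_report(phones))  # the hyphenated value fails the pattern
```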

Basic Statistics

Basic statistics provide a snapshot of an entire data field by presenting


statistical information such as minimum and maximum values, mean,
median, mode, and standard deviation, to highlight aberrations from
normal values.9
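Such a statistical snapshot is straightforward to sketch in Python; the order quantities below are illustrative, with the 10000-unit value mirroring the abnormal order in the footnote's example:

```python
import statistics

# Sketch of a basic-statistics profile for one numeric column.
# The order quantities are illustrative assumptions.

orders = [500, 620, 750, 800, 900, 1000, 10000]

profile = {
    "min": min(orders),
    "max": max(orders),
    "mean": round(statistics.mean(orders), 2),
    "median": statistics.median(orders),
    "stdev": round(statistics.stdev(orders), 2),
}
print(profile)  # the maximum of 10000 stands out against the other values
```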

Data Discovery

The second step in the data profiling process is data discovery. Data discovery examines the problem areas indicated by structure discovery by examining individual data elements. Data discovery techniques use

- Matching technology to uncover non-standard data
- Frequency counts and outlier detection to find data elements that don't make sense

Standardization

Data in an organization comes from different sources: consumers, different departments and partners. Standardization helps discover inconsistencies in the data and then provides a solution to address and fix
7 For example, a valid mobile telephone number in India could be entered in the database in the format (+NN) nnnnnnnnnn, (0) nnnnnnnnnn, or nnnnnnnnnn, where NN is the numeric code for the country, and n is a digit between 0 and 9. If a phone number is entered in a different format, the pattern report will indicate that the telephone number did not match a valid telephone number pattern.
8 Dorr, Brett and Herbert, Pat, Data Profiling: Designing the Blueprint for Improved Data Quality, p. 4, http://www2.sas.com/proceedings/sugi30/102-30.pdf [accessed June 2008]
9 For example, if customer orders range between 500 and 1000 units, an order of 10000 units would be considered abnormal and validated prior to its being entered into the system.

the inconsistency. For example, HCL, HCLT, HCL Technologies, and HCL Tech all represent the same organization.10 Any report that is generated must account for the way the company is represented, to avoid missing important data points that can affect the output of future processes.
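A lookup-table sketch of this kind of standardization, using the name variants mentioned above; real profiling tools apply richer fuzzy-matching rules, so this is an illustrative simplification:

```python
# Sketch of standardization: map the name variants mentioned in the
# text (HCL, HCLT, HCL Technologies, HCL Tech) to one canonical form.
# A lookup table is an illustrative simplification of what profiling
# tools do with fuzzy-matching rules.

CANONICAL = {
    "HCL": "HCL Technologies",
    "HCLT": "HCL Technologies",
    "HCL TECH": "HCL Technologies",
    "HCL TECHNOLOGIES": "HCL Technologies",
}

def standardize(name):
    """Return the canonical organization name for a raw value."""
    return CANONICAL.get(name.strip().upper(), name.strip())

raw = ["HCL", "hclt", "HCL Tech", "HCL Technologies"]
print([standardize(n) for n in raw])  # all four resolve to one name
```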

Frequency Counts and Outliers

Frequency count looks at how values are related according to data occurrences. An outlier is an observation that is numerically distant from the rest of the data.11 Outlier detection examines the data values that are remarkably different from other values.12 In essence, these techniques eliminate the need to validate the entire data set by highlighting the data values that need further investigation.
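Both techniques can be sketched together in Python; the two-standard-deviation threshold and the sample values are illustrative assumptions:

```python
from collections import Counter
import statistics

# Sketch of frequency counting plus simple outlier detection.
# Values more than two standard deviations from the mean are flagged;
# the threshold and sample data are illustrative assumptions.

values = [500, 520, 480, 510, 500, 490, 10000]

freq = Counter(values)            # frequency count per distinct value
mean = statistics.mean(values)
stdev = statistics.stdev(values)
outliers = [v for v in values if abs(v - mean) > 2 * stdev]

print(freq.most_common(2))        # the most frequent values
print(outliers)                   # only the distant value is flagged
```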

Business Rule Validation

A business rule defines or constrains one aspect of your business and is intended to influence the behavior of your business.13 Data profiling software does not include business rules, since business rules are specific to each organization. However, a robust data profiling process must be able to build, store, and validate against an organization's unique business rules.
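As a sketch, here is a validation of one such organization-specific rule, the "closed credit accounts must have a zero balance" rule that appears later in Table 2; the account records are illustrative assumptions:

```python
# Sketch of validating one organization-specific business rule:
# "closed credit accounts must have a zero balance".
# The account records are illustrative assumptions.

def violates_closed_account_rule(account):
    """True when an account is flagged closed but still carries a balance."""
    return account["closed"] and account["balance"] > 0

accounts = [
    {"id": "A1", "closed": True,  "balance": 0.0},
    {"id": "A2", "closed": True,  "balance": 120.5},   # violation
    {"id": "A3", "closed": False, "balance": 300.0},
]

bad = [a["id"] for a in accounts if violates_closed_account_rule(a)]
print(bad)  # the accounts that break the rule
```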

Relationship Discovery

Relationship discovery is the third part of the data profiling process and provides information about the ways in which data records inter-relate. These data records can be multiple records in the same data file, records across data files, or records across databases.14

Relationship discovery

- Determines key relationships by using metadata, if available
- Checks the relationships for the provision of a unique primary key or a foreign key
- Inspects the records that prevent a key from being unique
- Identifies outstanding records that do not adhere to the relationship
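A minimal Python sketch of the last two checks: finding primary-key values that are not unique, and orphan records that do not adhere to a customer-order relationship. The tables are illustrative assumptions:

```python
# Sketch of relationship discovery: detect non-unique primary-key
# values and order records whose customer_id has no matching parent.
# The customer and order tables are illustrative assumptions.

customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 2}]
orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 3}]   # 3 has no parent record

customer_keys = [c["customer_id"] for c in customers]
duplicate_keys = {k for k in customer_keys if customer_keys.count(k) > 1}
orphans = [o["order_id"] for o in orders
           if o["customer_id"] not in customer_keys]

print("duplicate primary keys:", duplicate_keys)
print("orphan orders:", orphans)
```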

10 Ibid 6
11 See http://en.wikipedia.org/wiki/Outlier [accessed June 2008]
12 Outlier Detection, http://www.dataflux.com/technology/methodology/data-profiling/outlier-detection.asp [accessed June 2008]
13 Business Rules, http://www.agilemodeling.com/artifacts/businessRule.htm [accessed June 2008]
14 Ibid 8

Data Profiling: Typical Opportunity Areas

Many companies still do not have a single view of the customer. Having a single view enables a company to

- Obtain a precise understanding of all the business that the company is conducting with customers, across multiple units and product lines
- Identify cross-selling opportunities
- Create targeted marketing campaigns

Some examples of data profiling are given below.

"Data profiling provides a single view of the customer. It helps understand the gamut of company-customer transactions, identify cross-selling opportunities and then aids in creating targeted marketing campaigns."

Example 1

In supply chain management, supply chains are dependent upon effective procurement processes and accurate procurement information. A single database that contains details about suppliers and the items that they sell increases efficiency.

Data profiling is useful in integrating supplier details and information about the items that they sell, to help improve immediate efficiencies and to facilitate the consolidation and integration of different processes and systems.

Example 2

Data repetition in data sources is common. For example, in banking, insurance or retail, an account holder's name can be recorded as FirstName MiddleName LastName; FirstName M Lastname; F M LastName; and so on.

Data profiling traces and removes such repetitions to improve data quality and enhance business intelligence, thereby enabling better customer experience and profitability.

Example 3

Databases provide assorted customer-related information, such as the types of products sold, product profitability and customer profitability. Critical business decisions depend on the accuracy of information in these databases.

For example, a credit card company or a telecom company can use data profiling to create customer profiles. These customer profiles could help the company customize products for specific individuals or groups. Information in the customer profile about an individual's payment behavior enables the company to monitor its overall risk portfolio and enhance an individual's credit limit.

Data Profiling Tools

Data profiling is generally done by using specific software tools designed


for the purpose rather than using statistical tools. Table 1 compares
statistical tools with data profiling tools and illustrates the advantages of
using data profiling tools.

Statistical Tools | Data Profiling Tools
Must formulate a large number of queries and/or reports in order to test rules against the data | Addresses all the stages of data profiling
Execution is slow since rules are executed serially | Processes a large amount of data in a short period of time
Cannot discover rules, and users do not understand the actual structure or content of the data without discovery | Includes discovery processes
Use of validation processes alone will result in not discovering issues | Includes automatic discovery and validation processes

Table 1: Statistical Tools vs. Data Profiling Tools

An effective data profiling tool addresses the following three phases:15

- Initial profiling and data assessment
- Integration of profiling into automated processes
- Passing profiling results to data quality and data integration processes

The data profiling solution16 should also aid in constructing data correction,
validation, and verification routines directly from the profiling reports.

15 Ibid 8, p. 9
16 Some well-known data profiling tools are Trillium Software from Harte-Hanks, DataFlux from DataFlux Corporation, Data Insight XI from Business Objects, Information Analyzer from IBM, and Data Explorer from Informatica. For more information, visit http://mediaproducts.gartner.com/reprints/businessobjects/149359.html [accessed August 2008]

Data Profiling in Action: the Banking Sector

One of the largest banks in Singapore, with an asset base of about US$93 billion, needed to compute the amount of capital it required to guard against financial and operational risks (Basel 2 norms).

Banks make their capital adequacy calculation on information such as

- Elapsed days since the last scheduled payment on a loan by a customer
- The customer's payment behavior, which is based on information in his/her account

A bank's capital adequacy assessment should tally with the financial data submitted to regulatory bodies.

Our client realized that the accuracy of its projections was dependent on the quality of its source data, and therefore decided to get its source data profiled.

Solution architects from HCL carried out a two-stage data profiling process to determine data quality.

I. Analysis Stage

- Determine if individual values are valid values for a column
- Analyze column values to discover problems with uniqueness rules and unexpected frequencies of specific values
- Analyze structure rules that govern functional dependencies, primary keys, foreign keys, synonyms and duplicate columns

II. Validation Stage

Validate data rules to ensure that they hold true for

- A row of data
- All rows for a single business object
- A collection of a business object
- Different types of business objects

As shown in Table 2, HCL architects ran a number of quality checks in the data profiling process.

Quality Check | Example
Domain checking | Gender fields should have a value of either M or F
Range checking | For age, the value should be less than 125 and greater than 0
Referential integrity | If an order shows that a customer bought product X, then make sure there actually is a product named X
Basic statistics, frequencies, ranges and outliers | If a company has products that cost between $100 and $1,000, flag any that fall outside this range
Uniqueness and missing value validation | If a code is supposed to be unique, make sure it is not being reused
Key identification | If there is a defined primary key/foreign key relationship across tables, validate it by looking for records that do not have their corresponding related record
Data rule compliance | If closed credit accounts must have a zero balance, make sure there are no records where the closed account flag is true and the account balance is greater than zero
Basic statistics about the data | Minimum value; maximum value; mean; mode; standard deviation; minimum field length; maximum field length; occurrences of null values in key-defined fields; frequency distribution, including candidate columns for multi-value compression; invalid data formats

Table 2: Data Profiling Quality Checks
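Several of the checks in Table 2 (domain, range, and uniqueness) can be combined into a single profiling pass, sketched below; the field names and sample rows are illustrative assumptions:

```python
# Sketch combining three of the checks from Table 2 (domain, range,
# uniqueness) into one profiling pass over a row set.
# The field names and sample data are illustrative assumptions.

def profile_rows(rows):
    """Return (row_index, issue) pairs for every quality-check failure."""
    issues = []
    seen_codes = set()
    for i, row in enumerate(rows):
        if row["gender"] not in ("M", "F"):        # domain check
            issues.append((i, "gender outside domain"))
        if not (0 < row["age"] < 125):             # range check
            issues.append((i, "age out of range"))
        if row["code"] in seen_codes:              # uniqueness check
            issues.append((i, "duplicate code"))
        seen_codes.add(row["code"])
    return issues

rows = [
    {"gender": "M", "age": 34,  "code": "C1"},
    {"gender": "X", "age": 34,  "code": "C2"},
    {"gender": "F", "age": 130, "code": "C1"},
]
print(profile_rows(rows))
```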

HCL developed a proof-of-concept to evaluate the advantages of using in-house tools vis-à-vis third-party data profiling tools, before settling on Teradata Profiler.17

17 Teradata is a registered trademark of Teradata Corporation.

HCL's approach to the data profiling process comprised the following:

- Create a data profiling plan for each source system
- Formulate the business rules for
  - Checking data quality
  - Handling exceptional records
- Implement the data profiling plan
Figure 3 shows the data profiling process using Teradata Profiler.

"The increased reliability of source data brought about by data profiling paves the way for implementing projects such as MDM and SOA."

[Figure 3 shows the data profiling flow: source data in a staging area, a DataStage extract, Teradata load utilities feeding the enterprise data warehouse, Teradata Profiler applying the business rules, and corrective-action reports going back to the source systems.]

Figure 3: Data Profiling Flow

Data profiling enabled our client to gain a competitive advantage by

- Being among the first banks to implement Basel 2 norms
- Generating a trusted, accurate, reliable and standardized customer and banking data source
- Reducing business risks
- Paving the way for the successful implementation of Master Data Management (MDM) and SOA projects
- Improving the Bank's ability to satisfy compliance requirements

Conclusion

Databases in most companies have evolved in an ad-hoc manner, which has resulted in information silos. Companies, therefore, do not have a unified view of their customers, resulting in missed business opportunities or increased cost of operations. Data integration addresses those issues, but poses data verification challenges, since the source data are in diverse databases. Most data integration and migration projects overshoot their time and cost estimates because of the effort expended to understand the source data.

"Data quality management initiatives pay for themselves in commercial terms... some respondents report up to ten times payback on the investment involved." (PricewaterhouseCoopers, Global Data Management Survey, 2004, p. 25)

Data profiling automates the identification of problematic source data, inconsistencies, redundancies, and inaccuracies. Data profiling also provides a factual foundation, based on which data can be cleansed and then consolidated before integration. Some of the benefits of the integrated, accurate and validated data that is the outcome of data profiling are:

- Enhanced accuracy of accounts receivable, resulting in increased debt collection
- Better customer service
- Cross-selling
- Focused brand marketing campaigns
- Reduced operational costs
- Fraud detection
- Compliance with regulations

Since these are tangible benefits to a company, irrespective of the sector in which it operates, data profiling is relevant to companies and consultants alike.