AUTHORS
Shankar Ganesh R
Senior Technical Architect
Architecture and Technology
Services
HCL Technologies, Chennai
Subramanyam B S
Lead Researcher
ATS-Technical Research
HCL Technologies, Chennai
Data Profiling
Table of Contents
Introduction
The Need for Data Profiling
Title: Data Profiling A Quick Primer on the What and the Why of Data
Integration
August, 2008
Introduction
Companies are also not completely sure of their data quality. In another
TDWI survey, half of the respondents said the quality of their data is
excellent or good, while 44 per cent said that, in reality, the quality of
their data is worse than everyone thinks.4 Rather than go by the
perception of the individuals managing the data, companies need to
resort to a data profiling exercise.
1
Price Waterhouse Coopers, P.18, Global Data Management Survey, 2001, http://sirnet.metamatrix.se/material/SIRNET_10/survey_01.pdf [June 2008] (date on which the site was accessed)
2
TDWI: BI/DW Education Survey Finds 83 Percent of Organizations Suffering from Poor Master Data, http://vendors.ittoolbox.com/profiles/tdwi-dw-professional-education/news/survey-finds-83-percent-of-organizations-suffering-from-poor-master-data-23 [June 2008]
3
Trillium Software, 2004, P.5, Data Integration and Data Quality Management, http://www.trilliumsoftware.com/site/content/resources/library/pdf_detail.asp?id=49&pdfRecorded=1&type=
4
Ibid, Page 9
This paper examines the reasons for and the process of data profiling. It
also takes a look at data profiling opportunities.
Since the costs of poor data quality are high, companies are increasingly
profiling data to check its quality and suitability for business. Data
profiling uses analytical techniques to discover the true content,
structure, and quality of data.5 It is different from data analysis in that it
derives information about the data, and not business information from the
data. The purpose of data profiling is to
5
See: http://etutorials.org/Misc/data+quality/Preface/ [June 2008]
Structure Discovery
6
See http://en.wikipedia.org/wiki/Metadata [June 2008]
Metadata validation analyzes the data and indicates, for example, whether
the field length is appropriate and whether there are fields with missing
values. Validation also helps determine whether the data collected
conforms to the original plan, or whether there are deviations.
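The two checks named above, field length and missing values, can be
sketched in a few lines of Python. This is a minimal illustration, not the
method of any particular profiling tool; the function name validate_field,
the sample column and the declared length of 10 characters are assumptions
made for the example.

```python
from typing import Optional


def validate_field(values: list[Optional[str]], max_length: int) -> dict[str, int]:
    """Count missing values and values longer than the declared field length."""
    # A value is treated as missing if it is None or contains only whitespace.
    missing = sum(1 for v in values if v is None or v.strip() == "")
    # A value is too long if it exceeds the length declared in the metadata.
    too_long = sum(1 for v in values if v is not None and len(v) > max_length)
    return {"rows": len(values), "missing": missing, "too_long": too_long}


# Hypothetical sample column whose metadata declares a length of 10 characters.
city_column = ["Chennai", "", None, "Thiruvananthapuram", "Singapore"]
report = validate_field(city_column, max_length=10)
print(report)
```

A report of this kind only flags candidate problems; deciding whether the
declared length or the data is wrong remains a manual step.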
Pattern Matching
Basic Statistics
Data Discovery
The second step in the data profiling process is data discovery. Data
discovery investigates the problem areas indicated by structure
discovery by examining individual data elements. Data discovery
techniques use
Standardization
7
For example, a valid mobile telephone number in India could be entered in the database in the format (+NN) nnnnnnnnnn, (0) nnnnnnnnnn, or nnnnnnnnnn, where NN is the numeric code for the country and n is a digit between 0 and 9. If a phone number is entered in a different format, the pattern report will indicate that the telephone number did not match a valid telephone number pattern.
8
Dorr, Brett and Herbert, Pat, P. 4, Data Profiling: Designing the Blueprint for Improved Data
9
For example, if customer orders range between 500 and 1000 units, an order of 10000 units would be considered abnormal and validated prior to its being entered into the system.
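The checks described in footnotes 7 and 9 can be sketched as follows. The
regular expression encodes the three phone formats listed in footnote 7,
with NN assumed to be exactly two digits; the 500 to 1000 range comes from
footnote 9. The function names and sample values are hypothetical and are
not taken from any profiling product.

```python
import re

# Formats from footnote 7: (+NN) nnnnnnnnnn, (0) nnnnnnnnnn, nnnnnnnnnn.
# Assumption: NN is exactly two digits and a single space follows the prefix.
PHONE_PATTERN = re.compile(r"(?:\(\+\d{2}\) |\(0\) )?\d{10}")


def matches_phone_pattern(value: str) -> bool:
    """Return True if the whole value matches one of the accepted formats."""
    return PHONE_PATTERN.fullmatch(value) is not None


def in_expected_range(quantity: int, low: int = 500, high: int = 1000) -> bool:
    """Return True if the quantity lies within the expected range, inclusive."""
    return low <= quantity <= high


# Hypothetical sample values.
phones = ["(+91) 9876543210", "(0) 9876543210", "9876543210", "98765-43210"]
pattern_report = {p: matches_phone_pattern(p) for p in phones}

orders = [750, 10000]
range_report = {q: in_expected_range(q) for q in orders}

print(pattern_report)
print(range_report)
```

Values that fail either check are candidates for review, not necessarily
errors: an order of 10000 units may be genuine and simply unusual.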
the inconsistency. For example, HCL, HCLT, HCL Technologies, and HCL
Tech all represent the same organization.10 Any report that is generated
must account for the different ways the company is represented, to avoid
missing important data points that can affect the output of future
processes.
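One simple way to handle such variants is a lookup table that maps every
known spelling to a single standard form. The sketch below uses the four
spellings from the example above; the alias table and the function name
standardize are assumptions for illustration, and a real table would be
built from the values that profiling actually finds.

```python
# Hypothetical alias table mapping known variants to one standard name.
ALIASES = {
    "hcl": "HCL Technologies",
    "hclt": "HCL Technologies",
    "hcl tech": "HCL Technologies",
    "hcl technologies": "HCL Technologies",
}


def standardize(name: str) -> str:
    """Map a known variant to its standard form; leave unknown names unchanged."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name)


raw = ["HCL", "HCLT", "HCL Technologies", "HCL  Tech", "Teradata"]
standardized = [standardize(n) for n in raw]
print(standardized)
```

Names that are not in the table pass through unchanged, so unrecognized
variants remain visible for a later profiling pass.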
Relationship Discovery
Relationship discovery is the third part of the data profiling process and
provides information about the ways in which data records inter-relate.
These data records can be multiple records in the same data file, records
across data files or records across databases.14
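As an illustration of relating records across data files, the sketch
below looks for records in one file whose key has no counterpart in
another. This is only one possible form of relationship check, assumed
here for the example; the record layouts, the key name customer_id and
the function name find_orphans are hypothetical.

```python
from typing import Any


def find_orphans(
    child_rows: list[dict[str, Any]],
    parent_rows: list[dict[str, Any]],
    key: str,
) -> list[dict[str, Any]]:
    """Return child records whose key value does not appear in any parent record."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


# Hypothetical records from two separate data files.
customers = [
    {"customer_id": 1, "name": "A"},
    {"customer_id": 2, "name": "B"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

orphans = find_orphans(orders, customers, "customer_id")
print(orphans)
```

Here order 11 refers to a customer that does not exist in the customer
file, which is the kind of broken relationship such a check would report.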
Relationship discovery
10
Ibid 6
11
See http://en.wikipedia.org/wiki/Outlier [June 2008]
12
Outlier Detection http://www.dataflux.com/technology/methodology/data-
13
Business Rules, http://www.agilemodeling.com/artifacts/businessRule.htm [June 2008]
14
Ibid 8
Many companies still do not have a single view of the customer. Having a
single view enables a company to
then aids in creating targeted marketing campaigns.
Data repetition in data sources is common. For example, in banking,
insurance or retail, an account holder's name can be recorded as
FirstName MiddleName LastName; FirstName M LastName; F M LastName and so
on.
Data profiling traces and removes such repetitions to improve data quality
and enhance business intelligence and thereby enable better customer
experience and profitability.
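A minimal sketch of tracing such repetitions is to reduce each name to a
coarse matching key and group the records that share it. The key used
here (first initial plus last name) is an assumption chosen to cover the
three name formats mentioned above; the sample names and function names
are hypothetical. The key is deliberately loose, so the groups it
produces are candidates for review, not confirmed duplicates.

```python
from collections import defaultdict


def match_key(full_name: str) -> tuple[str, str]:
    """Reduce a name to (first initial, last name), ignoring case and periods."""
    parts = full_name.replace(".", " ").split()
    if not parts:
        return ("", "")
    return (parts[0][0].lower(), parts[-1].lower())


def group_variants(names: list[str]) -> dict[tuple[str, str], list[str]]:
    """Group names by matching key and keep only groups with more than one entry."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical account holder names.
names = ["Ravi Kumar Sharma", "Ravi K Sharma", "R K Sharma", "Anita Rao"]
duplicates = group_variants(names)
print(duplicates)
```

Two different people can share a first initial and a last name, so in
practice the key would be combined with other fields such as date of
birth or address before any records are merged.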
Example 3
databases.
For example, a credit card company or a telecom company can use data
profiling to create customer profiles. These customer profiles could help
the company customize products for specific individuals or groups.
Information in the customer profile about an individual's payment
behavior enables the company to monitor its overall risk portfolio and
enhance an individual's credit limit.
The data profiling solution16 should also aid in constructing data correction,
validation, and verification routines directly from the profiling reports.
15
P.9, Ibid 8
16
Some well known data profiling tools are Trillium Software from Harte-Hanks, DataFlux from DataFlux Corporation, Data Insight XI from Business Objects, Information Analyzer from IBM and Data Explorer from Informatica. For more information you can visit http://mediaproducts.gartner.com/reprints/businessobjects/149359.html [August 2008]
Case Study
One of the largest banks in Singapore, with an asset base of about US$93
billion, needed to compute the amount of capital it required to guard
against financial and operational risks (Basel 2 norms).
Elapsed days since the last scheduled payment on a loan by a customer
Customer's payment behavior, which is based on information in his/her account
A bank's capital adequacy assessment should tally with the financial data
submitted to regulatory bodies.
Our client realized that the accuracy of its projections was dependent on
the quality of its source data, and therefore decided to get its source data
profiled.
Solution architects from HCL carried out a two-stage data profiling process
to determine data quality.
I Analysis Stage
II Validation Stage
17
Teradata is a registered trademark of Teradata Corporation
The increased reliability of source data brought about by data profiling
paves the way for implementing projects such as MDM and SOA
Figure 3: Data Profiling Flow (diagram labels: Source, Data Staging Area, DataStage, Extract, Load Utilities, Teradata, Enterprise Data Warehouse, Teradata Profiler, Business Rules, Report, Corrective Action (Source Systems))
Conclusion
themselves in commercial terms. some respondents report up to ten times
payback on the investment involved P. 25, Price Waterhouse
Data profiling automates the identification of problematic source data,
inconsistencies, redundancies, and inaccuracies. Data profiling also
provides a factual foundation, based on which data can be cleansed and
then consolidated before integration. Some of the benefits of integrated,
accurate and validated data, which is the outcome of data profiling, are:
Enhanced accuracy of account receivables resulting in increased debt collection
Better customer service