
- Vijay Nagappa, NetApp Project

Contents

Definition
Types of Data Profiling
Reasons for Profiling Data
Data Profiling Roles
Key Steps Followed During Data Profiling
Profiling of Data Sources
The Iterative Profiling Process
Data Profiling Strategies and Techniques
Steps to Understand the Data Through Data Profiling
Benefits of Data Profiling
Data Analysis
Role of DP in Data Quality Management
Metrics Specifics
Data Profiling Tools
Best Practices

What is Data Profiling?


Definitions: Data profiling refers to analytical techniques used to examine existing data for completeness and accuracy. Data profiling is the first step toward improving data quality.

Data profiling, also sometimes referred to as data discovery, is the process of statistically examining data sources, such as a database or data residing in any file structure, primarily to identify problem-prone areas in data organization and to plan a revamp of that organization. The data profiling process is a stepping stone toward improved data quality: it deciphers and validates data patterns and formats, identifies and validates redundant data across data sources, and finally readies the data sources for the development of integrated, data-driven, enterprise-level applications. Profiling techniques for data completeness indicate whether all records that should be present are present, whether any fields are blank, and whether duplicate records exist. Profiling techniques for data accuracy show whether the values contained in the fields are valid.
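The completeness checks described above (blank fields, duplicate records) can be sketched in a few lines of Python. The field names and records below are invented for illustration, not taken from any real source system.

```python
# Minimal completeness-profiling sketch: count blank required fields
# and exact-duplicate records in a list of dicts.
def profile_completeness(records, required_fields):
    blanks = {f: 0 for f in required_fields}
    seen, duplicates = set(), 0
    for rec in records:
        for f in required_fields:
            if not str(rec.get(f, "")).strip():
                blanks[f] += 1          # blank or missing required field
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates += 1             # record seen before, verbatim
        seen.add(key)
    return {"blank_counts": blanks, "duplicate_records": duplicates}

customers = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "", "email": "bob@example.com"},         # blank name
    {"name": "Ann Lee", "email": "ann@example.com"},  # duplicate
]
print(profile_completeness(customers, ["name", "email"]))
```

A real profiling tool would add near-duplicate matching and per-field completeness percentages; this sketch only shows the basic idea.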

Types of Data Profiling


Data Profiling can refer to Data Quality Profiling or Database Profiling.

1) Data Quality Profiling is the process of analyzing a database in relation to a Data Quality Domain, to identify and prioritize data quality problems. The results can include:

Summaries (with counts and percentages) describing:
Completeness of datasets and data records
Problems organized by importance
The distribution of problems in a dataset

Details:
Lists of missing data records
Data problems in existing records

Data quality profiling can be useful when planning and managing data cleanup projects.

Types of Data Profiling


2) Database Profiling is the process of analyzing a database to determine its structure and internal relationships:
The tables used, their keys, and number of rows
The columns used and the number of rows with a value
Relationships between tables
Columns copied or derived from other columns

Database Profiling can also include analysis of:
Tables and columns used by different applications
How tables and columns are populated and changed
The importance of different tables and columns

Database profiling can be useful when planning and managing data conversion and data cleanup projects. It can also be an initial step in defining a Data Quality Domain, which is used in Data Quality Profiling.
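A very small taste of database profiling (tables, row counts, populated values per column) can be sketched with Python's built-in sqlite3 module. The table and data here are invented for illustration; a real exercise would run against the actual source database.

```python
import sqlite3

# Build a tiny in-memory database to profile.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
    INSERT INTO customers VALUES
        (1, 'Ann', 'NC'), (2, 'Bob', NULL), (3, 'Cid', 'CA');
""")

profile = {}
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    columns = {}
    for _, col, *_ in conn.execute(f"PRAGMA table_info({table})"):
        # COUNT(col) skips NULLs, giving "rows with a value" per column.
        populated = conn.execute(
            f"SELECT COUNT({col}) FROM {table}").fetchone()[0]
        columns[col] = populated
    profile[table] = {"rows": rows, "columns": columns}

print(profile)
```

The output shows that the `state` column is populated in only 2 of 3 rows, exactly the kind of structural fact database profiling is meant to surface.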

Why Do People Profile?


People may want to profile for several reasons, including:

Assessing risks: Can the data support the new initiative?
Planning projects: What are realistic timelines, and what data, systems, and resources will the project involve?
Scoping projects: Which data and systems will be included, based on priority, quality, and level of effort required?
Assessing data quality: How accurate, consistent, and complete is the data within a single system?
Designing new systems: What should the target structures look like? What mappings or transformations need to occur?
Checking/monitoring data: Does the data continue to meet business requirements after systems have gone live and changes and additions occur?

Who should be Profiling the Data?


Data profiling is primarily considered part of IT projects, but the most successful efforts involve a blend of IT resources and business users of the data.

I) IT system owners, developers, and project managers: analyze and understand issues of data structure:
a) How complete is the data?
b) How consistent are the formats?
c) Are key fields unique?
d) Is referential integrity enforced?

II) Business users and subject matter experts: people who understand the data content: what the data means, how it is applied in existing business processes, what data is required for new processes, and what data is inaccurate or out of context.

III) Data stewards: people who understand corporate standards and enterprise data requirements as a whole. They can contribute both to the requirements for specific projects and to those of the corporation.

Key Steps followed during Data profiling


1) Use analytical and statistical tools to outline the quality of data structure and organization by determining frequencies and ranges of key data elements within data sources.
2) Apply numerical analysis techniques to determine the scope of numeric data within data sources.
3) Identify multiple coding schemes and different spellings used in the data content.
4) Identify data patterns and formats, noting variations in the data types and formats used within data sources.
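Step 3 above, spotting multiple coding schemes, often comes down to simple frequency analysis. A minimal sketch, with made-up gender codes, might look like this:

```python
from collections import Counter

# Raw values as found in a hypothetical source column.
gender_values = ["M", "F", "Male", "male", "F", "M", "FEMALE", "M"]

# Frequency counts after a trivial normalization (trim + uppercase).
freq = Counter(v.strip().upper() for v in gender_values)
print(freq)

# Comparing raw spellings against normalized codes hints at
# inconsistent coding schemes in the column.
raw_distinct = len(set(gender_values))
print(raw_distinct, "raw spellings,", len(freq), "after uppercasing")
```

Here five raw spellings collapse to four normalized codes, and "M"/"MALE" plus "F"/"FEMALE" would still need a business rule to reconcile, which is exactly the discussion profiling is meant to trigger.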

Key Steps followed during Data profiling


5) Identify duplicated data content, such as names, addresses, or other pertinent information.
6) Decipher and validate redundant data within the data sources.
7) Note primary and foreign key relationships and study their impact on data organization and retrieval.
8) Run validation trials by applying specific business rules to data records across the data sources.
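Step 5, duplicate detection on fields like names, can be sketched by normalizing case, punctuation, and word order before grouping. The names below are invented; real matching software uses far more sophisticated fuzzy techniques.

```python
import re
from collections import defaultdict

def normalize(name):
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).split()

names = ["John Smith", "SMITH, JOHN", "Jane Doe", "john  smith"]

# Group names whose normalized, sorted word sets match.
groups = defaultdict(list)
for n in names:
    groups[tuple(sorted(normalize(n)))].append(n)

duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)
```

All three variants of "John Smith" collapse into one group, flagging them as likely duplicates for review.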

Data Profiling Techniques


The techniques for profiling are either manual or automated via a profiling tool:
I) Manual techniques involve people sifting through the data to assess its condition, query by query. Manual profiling is appropriate for small data sets from a single source, with fewer than 50 fields, where the data is relatively simple.
II) Automated techniques use software tools to collect summary statistics and analyses. These tools are most appropriate for projects with hundreds of thousands of records, many fields, multiple sources, and questionable documentation and metadata. Sophisticated data profiling technology was built to handle complex problems, especially for high-profile and mission-critical projects.


Key Points in Profiling of Data Sources


At a minimum, all data sources should be profiled for the following:

Value completeness: Is there a value in the field?
Historical completeness: Is the historical data complete? Missing data can corrupt results, yet might not be noticed at a summary level.
Data format: Are phone numbers actually phone numbers? Are email addresses properly formed? Do postal codes have the correct structure?
Cross-column consistency: Even if all the values in the columns are correct, it's important to profile across columns. Do product codes match the product categories? Are geographical tags consistent? Do the same products appear under different categories depending on transaction date?
Value outliers: By looking for extremely high or low values, "killer rows" can be identified. One row with a value that is orders of magnitude off will skew averages and totals, and surprisingly might not be noticed through the normal course of report validation.
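The data-format checks above can be sketched with regular expressions. The patterns below are deliberately simplified assumptions (real phone, email, and postal validation is much messier), and the sample rows are invented.

```python
import re

# Simplified format validators, one per profiled field.
FORMATS = {
    "phone": re.compile(r"^\(\d{3}\)\d{3}-\d{4}$"),    # (999)999-9999 only
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip":   re.compile(r"^\d{5}(-\d{4})?$"),          # US ZIP / ZIP+4
}

def format_failures(rows):
    """Count, per field, how many rows fail the expected format."""
    failures = {field: 0 for field in FORMATS}
    for row in rows:
        for field, pattern in FORMATS.items():
            if not pattern.match(row[field]):
                failures[field] += 1
    return failures

rows = [
    {"phone": "(919)674-9999", "email": "a@b.com",      "zip": "27513"},
    {"phone": "9196749988",    "email": "not-an-email", "zip": "275"},
]
print(format_failures(rows))
```

In practice the failure counts, not just pass/fail flags, are what matter: one malformed phone in a million rows is noise, ten percent is a systemic entry problem.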

Data Profiling Challenge


> Metadata does not exist
> Metadata is incomplete
> Metadata is inaccurate
> Metadata is difficult to gather
The quality of metadata is generally much worse than the quality of the data.

> Data deviates from expectations
> Data has inaccuracies
> Data is inconsistent with metadata
The quality of the data is generally unknown.


The Iterative Profiling Process


There is an iterative approach to profiling within each of the analysis steps below:
Running the analysis within the appropriate vendor tool
Analyzing the results of each analysis
Verifying the results with the source system SME
Documenting the results (both in deliverables and within the profiling tools)
Planning further analysis based on results

The data investigation process should verify what the source system owners say about the data against the actual data, and vice versa. For example, a source system owner may say that all customer names must have a full first name and full surname. However, when this rule is checked against the data, it shows that 10% of the records have only a first initial. This must be discussed with the source system owner. This type of anomaly may be explained by a business rule that was applied to new data but not to historical data. Further analysis is performed in this case to verify that all anomalous records were created before the rule took effect. Data re-engineering also follows an iterative process to standardize, correct, match, and enrich data.
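The full-name rule check in the example above can be sketched as a simple rule-violation scan. The names and the regex are invented for illustration; in the anecdote, 1 in 10 records carries only a first initial.

```python
import re

# Stated business rule: two words, each at least two letters.
FULL_NAME = re.compile(r"^[A-Za-z]{2,}\s+[A-Za-z]{2,}$")

names = ["Alice Johnson", "Bob Smith", "Carol White", "Dan Jones",
         "Evan Brown", "Frank Green", "Grace Hill", "Henry Ford",
         "Ivy Stone", "J Moore"]

violations = [n for n in names if not FULL_NAME.match(n)]
rate = len(violations) / len(names)
print(f"{rate:.0%} of records violate the full-name rule: {violations}")
```

The output, a 10% violation rate, is exactly the kind of fact that goes back to the source system owner for discussion rather than being silently "fixed".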

Examples of Data Profiling

Distinct lengths of string values in a column and the percentage of rows in the table that each length represents.

Example: Profile of a column of US State codes, which should be two characters, shows values longer than 2 characters.

Percentage of null values in a column.

Example: Profile of a Zip Code/Postal code column shows a high percentage of missing codes.

Percentage of regular expressions that occur in a column.

Example: A pattern profile of a phone number column shows numbers entered in three different formats: (919)674-9999, [919]6749988, and 9199018888.

Minimum, maximum, average, and standard deviation for numeric columns; and minimum and maximum for date/time columns.

Example: Profile for an Employee birth date column shows the maximum value is in the future.
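The pattern-profile example above (three phone formats in one column) is typically computed by mapping every digit to "9" and every letter to "A". A minimal sketch, with invented phone values:

```python
from collections import Counter

def pattern(value):
    """Map digits to '9' and letters to 'A', keeping other characters."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

phones = ["(919)674-9999", "[919]6749988", "9199018888", "(704)555-0101"]
profile = Counter(pattern(p) for p in phones)
print(profile)
```

Three distinct patterns emerge, immediately exposing the inconsistent entry formats without anyone reading individual rows.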


Examples of Data Profiling

Distinct values in a column and percentage of rows in the table that each value represents.

Example: A profile of a U.S. State column contains more than 50 distinct values.
Example: Profile shows duplicate values in a potential key column.

Candidate key column for a selected table.

Dependency of values in one column to values in another column or columns.

Example: Profile shows that two or more values in the State field have the same value in the Zip Code field.

Value inclusion between two or more columns.

Example: Some values in the ProductID column of a Sales table have no corresponding value in the ProductID column of the Products table.
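The value-inclusion example above is a referential integrity check, which in its simplest form is a set difference. The IDs below are invented:

```python
# ProductIDs referenced by a hypothetical Sales table vs. those
# actually defined in the Products table.
sales_product_ids = {101, 102, 103, 999}
products_product_ids = {101, 102, 103, 104}

# Orphans: sales rows pointing at products that do not exist.
orphans = sales_product_ids - products_product_ids
print("Sales ProductIDs with no matching product:", sorted(orphans))
```

In SQL the same check is usually a LEFT JOIN with an IS NULL filter on the reference side; the set-difference form is just easier to show in a few lines.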


A Typical example

While profiling fields such as date of birth in an insurance project, some odd trends were found: people tended to be born on Jan. 1, Feb. 2, March 3, April 4, or May 5. Why were those particular dates coming up? It turned out that date of birth was a required field on an insurance application that most people applying for automobile insurance didn't feel the need to provide. The data entry clerks, however, were paid based on how many applications they could enter per hour, so when they came across a required field that had no value, they simply made a date up: they basically picked the applicant's birth year but used Jan. 1 or Feb. 2. The result was a batch of bogus dates entered into the system. Each value was accurate in isolation: Jan. 1, 1970, is legitimately somebody's birthday, but it was not the birthday of the customer associated with the insurance transaction.
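This kind of anomaly can be caught by comparing the frequency of each month/day combination against a rough uniform baseline. The birthdays below are fabricated to mimic the anecdote:

```python
from collections import Counter
from datetime import date

# Five suspicious Jan. 1 records among ten total.
birthdays = [date(1970, 1, 1)] * 5 + [
    date(1983, 4, 17), date(1975, 9, 3), date(1990, 12, 21),
    date(1968, 6, 30), date(1979, 2, 14),
]

by_month_day = Counter((d.month, d.day) for d in birthdays)
expected_share = 1 / 365                     # crude uniform baseline
top_day, top_count = by_month_day.most_common(1)[0]
share = top_count / len(birthdays)
print(f"{top_day} holds {share:.0%} of records vs ~{expected_share:.2%} expected")
```

A single calendar day holding 50% of the records, against an expected share of well under 1%, is the statistical smoking gun for fabricated entries.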


Data Profiling Strategies


Strategies based on the results: Once the results of the profiling are available, it is critical to be realistic about what the analysis tells you. Faced with data quality issues, we have a number of options:
1) Reduce scope to avoid data that simply is not viable. It may turn out that key information advertised during the project justification phase is simply not viable; cut any losses if required.
2) Isolate problem areas and mark them. In some cases, it's not necessary to abandon bad data completely; it might be possible to isolate bad records and ensure that they are clearly identified. If users can work with subsets of the data after certain analysis, it might still be of value to run certain analyses on a portion of it. For example, if sales data from a subsidiary is missing certain fields, at least the full analysis can be run on the portion of the sales not affected by the problem. Reports need to clearly state "Excluding XYZ".


Data Profiling Strategies


3) Cleanse issue areas. It may be possible to cleanse data with issues early. For example:
a) Remove duplicates from master files, such as the customer list, using matching software.
b) Identify missing information and launch data collection and/or entry sub-projects to populate the key tables.
c) Use external data sources to enhance existing or sparse data sets. For example, an external service might be able to populate missing longitude and latitude information from street addresses to enable map-based visualizations.
4) Revise your project budget if needed.
a) If you established a project budget before undertaking your data quality investigation, as a project manager you need to look yourself honestly in the mirror and answer the question: "Based on what I now know, is the approved budget still viable?"
b) Ideally, you should not set the detailed budget until the data quality investigation is done. Without understanding the raw materials available, it is difficult to design the data warehouse or accurately estimate the effort involved.
c) Profile your data early and often. Data quality is often overlooked, yet it is a key aspect of data warehouse project success.


Data Mining

It is part of the data profiling process, used to dig deeper into the data in certain profiling areas. Data mining is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming this data into information. Data mining is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.


Benefits of Data Profiling


1) A big benefit of data profiling that perhaps should be left unstated (at least while justifying the data warehouse to executives) is that it makes the implementation team look like they know what they are doing. By correctly anticipating the difficult data quality issues of a project up front, the team avoids the embarrassing and career-shortening surprise of discovering BIG problems near the end of a project.
2) Standard data profiling automatically compiles statistics and other summary information about the data records.
3) It includes analysis by field for minimum and maximum values, frequency counts, data types, patterns/formats, and conformity to expected values.
4) It improves data quality, shortens the implementation cycle of major projects, and improves users' understanding of the data.
5) Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling.
6) Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.

Benefits of Data Profiling


7) Some of the other benefits gained by data profiling procedures are:
a) Identifies data entry errors
b) Quantifies data corruption
c) Optimizes the data management environment
d) Enables enterprise-level data
e) Enables a metadata repository
8) Saves time otherwise spent re-doing errors:
a) One must work diligently with the data provider on how to go about the campaign, and genuinely interrogate the information prior to making use of it.
b) One needs to work patiently with the business and plan every detail in order to escape the endless loop caused by undiscovered errors in the data.
9) Raises the quality of the database:
a) Understanding and fully grasping the data eliminates the probability of having to fix unanticipated issues, which is among the causes of project delays and even failure.
b) Data profiling increases quality, which yields superior outcomes for your marketing campaign.

Benefits of Data Profiling


10) Maintains accuracy: If a corporation periodically profiles its data, it ensures that the database has no missing data and deletes any duplicate data, which gives a higher chance of producing better outcomes for the organization.
11) Identifies inconsistent business information and data:
a) Reviews and methods produced through the database give businesses a greater probability of closing a sale.
b) The reviews generated serve as concrete evidence that helps you determine the essential areas to work on.
12) Identifies data entry errors: Ideally, your systems were designed with checks and edits to protect against data entry errors. In reality, there ends up being a tradeoff between quality control and processing throughput, and every system lets some data errors in. Profiler Suitcase saves you money by detecting errors before they cost you dollars and customers.
13) Avoids data migration rework: Critical human resources can be targeted at resolving data anomalies rather than at manual investigation or costly rework. Significant savings result when anomalies are addressed up front with risk management strategies, rather than in endless cycles of migrate, fix fallout, and reattempt migration.

Benefits of Data Profiling


14) Quantifies data corruption: By quantifying degrees of corruption, a business can make a knowledgeable decision about where corrective efforts will be most effective.
15) Optimizes the data management environment: Identifying and removing unnecessary, redundant, and obsolete data streamlines the technical environment. Disk space is freed up, resulting in direct savings in both environment and administration costs.
16) Increases the efficacy of major data tools: Profiling results can be used to increase the effectiveness of source-to-target transformation mappings in ETL tools. A business intelligence tool can use derived quantitative measurements of the structure within data content to create statistical models of the attribute domain.
17) Enables enterprise-level data: When used with other processes to evaluate applications, KnowledgeDriver provides significant input for defining an enterprise view of data.
18) Enables a metadata repository: Metadata is a natural by-product of profiling.

Facts: Impact of Data Profiling on Business

According to a recent survey, 21% of senior IT executives believe that poor data quality costs their company between $10 and $100 million per year, and less than 15% believe their data is high quality. Data comes in through all sorts of processes: manual data entry, batch feeds from third parties, e-commerce, and the odd quick patch. It is little wonder that inconsistencies creep in, and these problems can have a significant impact on a business, its credibility, and its bottom line. Current data quality problems cost U.S. businesses more than $600 billion per year. With automation of traditional analysis techniques, it is not uncommon to see analysis time cut by 90% while still providing a better understanding.


How does Data Profiling promote better Data Quality?

Delivering better data quality relies first and foremost on understanding the data you manage and the rules that govern it. Profiling data provides both the framework and roadmap to improved data quality, smoother running systems, more efficient business processes, and ultimately, the performance of the enterprise itself. Compared to manual analysis techniques, Data Profiling technology can significantly improve the organization's ability to meet the challenge of managing data quality.


Traditional vs. Data Profiling

Who does Data Profiling benefit and how?


As a mature technology, it has been amply demonstrated that a Data Profiling-led approach can deliver tangible value across the business when applied to the challenges of data analysis and quality management. Adoption is straightforward, and the potential return on investment is very significant. At the enterprise level, its ability to raise and maintain the quality of corporate information promotes competitive advantage and cuts costs.

I. Data Analysts and Data Managers
a) Step improvements in analysis performance through automation: do more in less time
b) Significant increase in the achievable breadth and depth of analysis scope
c) Far clearer understanding of data content and business rules
d) Facilitated communication between analysts, business users, and quality managers

II. Project Managers
a) Visibility of all data quality issues and their current statuses
b) Condensed and achievable delivery timescales
c) Reduced risk of project delays and budget over-runs due to unexpected data quality issues

Who does Data Profiling benefit, and how?


III. System Owners
a) Sustainable data quality
b) Diminished operational costs
c) Fewer disruptions and less manual intervention
d) Ability to deliver a better-value service

IV. Data Owners/Stewards
a) Framework for effective delivery of a data quality strategy
b) Ability to meet data quality responsibilities
c) Greater confidence in data assets

V. Executive Management
a) Effective business processes based on accurate, complete, and trustworthy information
b) Better guarantee of a return on investment from corporate systems and data assets
c) Reliable information supports better strategic and tactical business decisions
d) Increased profitability through improved efficiency and customer management

Typical Data Profiling Process in a Project


Typical DP Process in a Project


This approach consists of five steps or phases:
1) Prepare for the project
2) Prepare for the analysis
3) Extract and format the data
4) Sampling
a) Load a sample of the data
b) Analyze the sample
c) Adjust the extracts and formats of the data
d) Produce deliverables
e) Delete the samples
5) Analysis
a) Load the data
b) Perform the analysis
c) Produce deliverables

Data Analysis to be followed in a Project


Identify your data's current state and determine data quality issues to develop standards
Identify the reusability of the existing data
Manage possible risk early when integrating data with other applications
Resolve missing or erroneous values
Discover formats and patterns in your data
Identify cleansing issues to maintain the integrity of the data
Reveal hidden business rules
Identify appropriate data values and define transformations so as to maintain data validity
Report on column minimums, maximums, averages, mean, median, mode, variance, covariance, standard deviation, and outliers
Measure business rule compliance across data sets
Report results in various formats, including PDF, HTML, XML, and CSV
Provide point-in-time data profiling history
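The column statistics named in the list above (minimum, maximum, mean, median, mode, standard deviation, outliers) can be sketched with Python's standard-library statistics module. The values below are invented, including one "killer row" of the kind discussed earlier.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 10, 12, 500]   # 500 is a "killer row"

profile = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.stdev(values),
}

# Flag values more than 3 median-absolute-deviations from the median,
# a robust outlier rule that a single extreme value cannot distort.
mad = statistics.median(abs(v - profile["median"]) for v in values)
outliers = [v for v in values if abs(v - profile["median"]) > 3 * mad] if mad else []
print(profile, "outliers:", outliers)
```

Note how the mean (about 66) is wrecked by the single outlier while the median (12) is not, which is why robust statistics matter in profiling reports.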


Role of DP in Data Quality Management


Collection of quality facts: Data profiling uses analytical techniques to discover the true content, structure, and quality of data. The data profiling exercise itself should also follow a specific method to be most effective. Ideally, a bottom-up approach should be used: the exercise starts at the most atomic level of the data, and problems found at the lower levels feed into the analysis at the higher levels. Ideally, analysts correct data inaccuracies at each level before moving to a higher level, which makes the data profiling of the higher level more successful. It is important to note that data profiling does not find all inaccurate data; it can only find violations of a specific set of predefined rules. It is therefore crucial to start with proper technical and business metadata definitions, as discussed in the previous process step. Data profiling will produce facts about data inaccuracies and, as such, will generate metrics based on those facts.


Key Considerations for Selecting a Data Profiling Tool


1. Who is profiling: business users, IT, or both 2. Common environment to communicate, review, and interpret results 3. Complexity of analysis, number of sources 4. Security of data 5. Ongoing support and monitoring

Data Profiling Tools


DataFlux
Trillium
InfoSphere Information Analyzer
Data Quality Explorer
IBM ProfileStage
Oracle Warehouse Builder


IBM InfoSphere Information Analyzer

It helps in continuously managing and monitoring data quality.

Key features are as below: IBM InfoSphere Information Analyzer helps quickly & easily understand data by offering data quality assessment, flexible data rules design & analysis, and quality monitoring capabilities. These insights help derive more information from enterprise data to accelerate information-centric projects.

Deep profiling capabilities - provide a comprehensive understanding of data at the column, key, source, and cross domain levels. Multi-level rules analysis (by rule, by record, by pattern) unique to the data quality space - provides the ability to evaluate, analyze, and address multiple data issues by record rather than in isolation.


IBM InfoSphere Information Analyzer

Native parallel execution for enterprise scalability - enables high performance against massive volumes of data. Supports Data Governance initiatives through auditing, tracking and monitoring of data quality conditions over time. Enhanced data classification capabilities help focus attention on common personal identification information to build a foundation for Data Governance. Proactively identify data quality issues, find patterns and set up baselines for implementing quality monitoring efforts and tracking data quality improvements.


IBM ProfileStage

ProfileStage is a profiling tool used to investigate data sources to see inherent structures, frequencies of phrases, data types, and so on. In addition, based on the real data rather than the metadata, it can suggest a data model for the union of your data sources; this data model would be in third normal form (3NF).

QualityStage is now embedded in Information Server and provides functionality for fuzzy-matching records and for standardizing record fields based on predefined rules.

AuditStage is now part of Information Analyzer. This part of IA can, based on predefined rules, expose exceptions in your data from the required format, contents, and relationships.


Best Practices

Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The first step, and also a critical dependency, is to clearly identify the appropriate person to provide the source data and to serve as the go-to resource for follow-up questions. Once you receive the source data extracts, you're ready to prepare the data for profiling. As a tip, loading data extracts into a database structure will allow you to freely write SQL to query the data while also having the flexibility to use a profiling tool if needed.

When creating or updating a data profile, start with basic column-level analysis:

1) Distinct count and percent:
a) Analyzing the number of distinct values within each column will help identify possible unique keys within the source data.
b) Identification of natural keys is a fundamental requirement for database and ETL architecture.
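A minimal distinct-count sketch, flagging columns where every value is unique as candidate keys. The column names and rows are invented for illustration:

```python
# Tiny sample table as a list of dicts.
rows = [
    {"emp_id": 1, "dept": "HR",  "email": "a@x.com"},
    {"emp_id": 2, "dept": "ENG", "email": "b@x.com"},
    {"emp_id": 3, "dept": "ENG", "email": "c@x.com"},
]
total = len(rows)

distinct_counts = {}
for col in rows[0]:
    distinct_counts[col] = len({r[col] for r in rows})
    pct = distinct_counts[col] / total
    # 100% distinct suggests a possible natural key.
    flag = " <- candidate key" if distinct_counts[col] == total else ""
    print(f"{col}: {distinct_counts[col]}/{total} distinct ({pct:.0%}){flag}")
```

Both `emp_id` and `email` come out 100% distinct here; which of them is the true natural key is a business decision the profile merely informs.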

Best Practices

2) Zero, blank, and NULL percent:
a) Analyzing each column for missing or unknown data helps you identify potential data issues.
b) This information will help database and ETL architects set up appropriate default values, or allow NULLs on the target database columns where an unknown or untouched (i.e., NULL) data element is an acceptable business case.

3) Minimum, maximum, and average string length:
a) Analyzing string lengths of the source data is a valuable step in selecting the most appropriate data types and sizes in the target database (especially true for large and highly accessed tables, where performance is a top consideration).
b) Reducing column widths to be just large enough to meet current and future requirements will improve query performance by minimizing table scan time.
c) If the respective field is part of an index, keeping the data types in check will also minimize index size, overhead, and scan times.

4) Numerical and date range analysis:
a) Gathering information on minimum and maximum numerical and date values helps database architects identify appropriate data types that balance storage and performance requirements.
b) If your profile shows a numerical field does not require decimal precision, consider using an integer data type because of its relatively small size.
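Practices 2 and 3 above, null percent and string-length analysis, can be combined in one small pass over a column. The sample values are invented; the over-long "TEXAS" in a two-character state-code column is the kind of red flag this surfaces.

```python
# Hypothetical state-code column, including a NULL, a blank, and a
# value that exceeds the expected two-character width.
values = ["NC", "CA", None, "TEXAS", "", "WA"]

non_null = [v for v in values if v not in (None, "")]
null_pct = (len(values) - len(non_null)) / len(values)
lengths = [len(v) for v in non_null]

print(f"null/blank: {null_pct:.0%}, "
      f"len min={min(lengths)} max={max(lengths)} "
      f"avg={sum(lengths)/len(lengths):.2f}")
```

A max length of 5 tells the architect either that the target column cannot be CHAR(2) or that the source value needs cleansing before migration.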

