Contents
- Definition
- Types of Data Profiling
- Reasons for Profiling Data
- Data Profiling Roles
- Key Steps Followed During Data Profiling
- Profiling of Data Sources
- The Iterative Profiling Process
- Data Profiling Strategies and Techniques
- Steps to Understand the Data through Data Profiling
- Benefits of Data Profiling
- Data Analysis
- Role of DP in Data Quality Management
- Metrics
- Specifics
- Data Profiling Tools
- Best Practices
Data profiling, also sometimes referred to as data discovery, is the process of statistically examining data sources, such as a database or data residing in any file structure, primarily to identify problem-prone areas in the data organization and to plan its revamp. The data profiling process is a stepping stone toward improved data quality: it deciphers and validates data patterns and formats, identifies and validates redundant data across data sources, and finally readies the data sources for the development of integrated, data-driven, enterprise-level applications. Profiling techniques for data completeness indicate whether all records that should be present are present, whether any fields are blank, and whether duplicate records exist. Profiling techniques for data accuracy show whether the values contained in the fields are valid.
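As a minimal illustration of these completeness and accuracy checks, the sketch below (Python with pandas; the table, columns, and reference list are invented for the example) counts blank fields and duplicate records and validates field values:

    import pandas as pd

    # Hypothetical customer extract; in practice this comes from the source system.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "state": ["NC", "XX", "XX", None],
    })

    # Completeness: blank fields per column, and fully duplicated records
    print(df.isna().sum())
    print(df.duplicated().sum(), "duplicate records")

    # Accuracy: do field values fall in the valid domain? (assumed reference list)
    valid_states = {"NC", "SC", "VA"}
    invalid = ~df["state"].isin(valid_states)   # nulls also count as invalid here
    print(df[invalid])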
- Completeness of datasets and data records
- Problems organized by importance
- The distribution of problems in a dataset
- Lists of missing data records
- Data problems in existing records
Details
Data quality profiling can be useful when planning and managing data cleanup projects.
Planning projects: What are realistic timelines, and what data, systems, and resources will the project involve?
Scoping projects: Which data and systems will be included, based on priority, quality, and the level of effort required?
Assessing data quality: How accurate, consistent, and complete is the data within a single system?
Designing new systems: What should the target structures look like? What mappings or transformations need to occur?
Checking/monitoring data: Does the data continue to meet business requirements after systems have gone live and changes and additions occur?
1) Use analytical and statistical tools to outline the quality of the data structure and organization by determining frequencies and ranges of key data elements within the data sources (see the sketch after this list).
2) Apply numerical analysis techniques to determine the scope of numeric data within the data sources.
3) Identify multiple coding schemes and different spellings used in the data content.
4) Identify data patterns and data formats, noting the variation in the data types and formats used within the data sources.
5) Identify duplication in the data content, such as in names, addresses, or other pertinent information.
6) Decipher and validate redundant data within the data sources.
7) Note primary and foreign key relationships and study their impact on data organization and retrieval.
8) Run validation trials by applying specific business rules to data records across the data sources.
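A minimal sketch of steps 1 and 4, assuming a small pandas DataFrame with hypothetical phone and amount columns:

    import pandas as pd

    df = pd.DataFrame({
        "phone": ["(919)674-9999", "9199018888", "(919)674-1234"],
        "amount": [10.5, 99.0, 1200.0],
    })

    # Step 1: frequencies and ranges of key data elements
    print(df["phone"].value_counts())
    print(df["amount"].agg(["min", "max"]))

    # Step 4: collapse each value to a character-class mask (digit -> 9,
    # letter -> A) so the distinct patterns/formats in use become visible
    mask = (df["phone"]
            .str.replace(r"\d", "9", regex=True)
            .str.replace(r"[A-Za-z]", "A", regex=True))
    print(mask.value_counts())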
- Historical completeness: Is the historical data complete? Missing data can corrupt results, yet might not be noticed at a summary level.
- Data format: Are phone numbers actually phone numbers? Are email addresses properly formed? Do postal codes have the correct structure?
- Cross-column consistency: Even if all the values in individual columns are correct, it's important to profile across columns. Do product codes match the product categories? Are geographical tags consistent? Do the same products appear under different categories depending on the transaction date?
- Value outliers: Looking for extremely high or low values identifies "killer rows". One row with a value that is orders of magnitude off will skew averages and totals, and surprisingly might not be noticed through the normal course of report validation.
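The following sketch illustrates the format, cross-column, and outlier checks above (Python with pandas; all column names and data are invented for the example):

    import pandas as pd

    df = pd.DataFrame({
        "email":    ["a@b.com", "not-an-email", "c@d.org", "d@e.com", "e@f.com", "f@g.com"],
        "category": ["Bikes", "Bikes", "Helmets", "Bikes", "Bikes", "Bikes"],
        "product":  ["BK-100", "BK-100", "BK-100", "HL-200", "HL-200", "BK-100"],
        "total":    [19.99, 24.99, 21.50, 18.75, 23.10, 2499000.0],
    })

    # Data format: flag malformed email addresses with a simple pattern
    bad_email = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    print(df[bad_email])

    # Cross-column consistency: products appearing under more than one category
    print(df.groupby("product")["category"].nunique().loc[lambda s: s > 1])

    # Value outliers: a "killer row" beyond the upper interquartile-range fence
    q1, q3 = df["total"].quantile([0.25, 0.75])
    print(df[df["total"] > q3 + 1.5 * (q3 - q1)])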
The quality of the data is generally unknown:
> Data deviates from expectations
> Data has inaccuracies
> Data is inconsistent with its metadata
Column length distribution: distinct lengths of string values in a column and the percentage of rows in the table that each length represents.
Example: A profile of a column of U.S. state codes, which should be two characters, shows values longer than two characters.
Column null ratio: the percentage of missing values in a column.
Example: A profile of a Zip Code/postal code column shows a high percentage of missing codes.
Column pattern: the set of patterns (regular expressions) that cover the values in a string column.
Example: A pattern profile of a phone number column shows numbers entered in three different formats: (919)674-9999, [919]6749988, and 9199018888.
Column statistics: minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for date/time columns.
Example: Profile for an Employee birth date column shows the maximum value is in the future.
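A rough sketch of the length-distribution, null-ratio, and column-statistics profiles above, using pandas on made-up data:

    import pandas as pd

    df = pd.DataFrame({
        "state": ["NC", "NC", "North Carolina", None],
        "birth_date": pd.to_datetime(
            ["1970-01-01", "1985-06-15", "2042-03-09", "1990-12-31"]),
    })

    # Length distribution: distinct string lengths and the percent of rows each represents
    print(df["state"].str.len().value_counts(normalize=True) * 100)

    # Null ratio: percent of missing values in the column
    print(df["state"].isna().mean() * 100, "% null")

    # Column statistics: min/max on a date column; a birth date in the future is suspect
    print(df["birth_date"].agg(["min", "max"]))
    print((df["birth_date"] > pd.Timestamp.now()).sum(), "birth dates in the future")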
Column value distribution: distinct values in a column and the percentage of rows in the table that each value represents.
Example: A profile of a U.S. state column shows more than 50 distinct values.
Candidate key: whether a column or set of columns is unique across the rows in the table.
Example: A profile shows duplicate values in a potential key column.
Functional dependency: the extent to which the values in one column depend on the values in another column.
Example: A profile shows that two or more values in the State field have the same value in the Zip Code field.
Value inclusion: the extent to which the values in one column are also present in a column of another table.
Example: Some values in the ProductID column of a Sales table have no corresponding value in the ProductID column of the Products table.
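The sketch below illustrates the candidate key, functional dependency, and value inclusion checks on two hypothetical tables:

    import pandas as pd

    products = pd.DataFrame({"ProductID": [1, 2, 3]})
    sales = pd.DataFrame({
        "ProductID": [1, 2, 9, 9],
        "State":     ["NC", "SC", "SC", "SC"],
        "ZipCode":   ["27513", "27513", "29401", "29401"],
    })

    # Candidate key: duplicate values in a potential key column
    print(sales["ProductID"].duplicated().sum(), "duplicate key values")

    # Functional dependency: Zip Codes shared by two or more State values
    print(sales.groupby("ZipCode")["State"].nunique().loc[lambda s: s > 1])

    # Value inclusion: Sales.ProductID values with no match in Products.ProductID
    print(sales[~sales["ProductID"].isin(products["ProductID"])])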
A Typical example
While profiling fields such as date of birth in an insurance project, some odd trends were found: people tended to be born on Jan. 1, Feb. 2, March 3, April 4, or May 5. Why were those particular dates coming up? It turned out that date of birth was a required field on an insurance application that most people applying for automobile insurance didn't feel the need to provide. The data entry clerks, however, were paid by how many applications they could enter per hour, so when they came across a required field with no value, they simply made a date up, typically picking a plausible birth year but using Jan. 1 or Feb. 2. The result was a set of bogus dates in the system. Each data value was accurate in isolation: Jan. 1, 1970 is legitimately somebody's birthday, but it is not the birthday of the customer associated with the insurance transaction.
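A simple frequency profile over the month and day of each birth date is one way such a pattern can be surfaced; a minimal sketch with made-up data:

    import pandas as pd

    # Made-up policy extract: several clerks defaulted required birth dates to Jan. 1
    dob = pd.to_datetime(pd.Series([
        "1970-01-01", "1968-01-01", "1971-01-01", "1980-02-02",
        "1975-06-18", "1982-09-30", "1990-11-07",
    ]))

    # Under roughly uniform birthdays, each (month, day) pair should hold about
    # 1/365 of the records; dates holding many times that share stand out
    share = dob.dt.strftime("%m-%d").value_counts(normalize=True)
    print(share[share > 10 / 365])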
Data Mining
Data mining is part of the data profiling process, used to dig deeper into the data in certain profiling areas. It is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.
A big benefit of data profiling that perhaps should be left unstated (at least while justifying the data warehouse to executives) is that data profiling makes the implementation team look like they know what they are doing. By correctly anticipating the difficult data quality issues of a project up front, the team avoids the embarrassing and career-shortening surprise of discovering BIG problems near the end of a project. Standard data profiling automatically compiles statistics and other summary information about the data records: analysis by field of minimum and maximum values, frequency counts, data types, patterns/formats, and conformity to expected values. It improves data quality, shortens the implementation cycle of major projects, and improves users' understanding of the data. Discovering business knowledge embedded in the data itself is another significant benefit, and data profiling is one of the most effective technologies for improving data accuracy in corporate databases.
1) Some of the other benefits gained by data profiling procedures are:
a) Identifies data entry errors
b) Quantifies data corruption
c) Optimizes the data management environment
d) Enables enterprise-level data
e) Enables a metadata repository
2) Saves time when fixing errors:
a) Work carefully with the data provider on how to approach the campaign, and genuinely interrogate the information before making use of it.
b) Work patiently with the business and plan every detail, to escape the endless loop of rework caused by undiscovered errors in the data.
3) Raises the quality of the database:
a) Fully understanding the data eliminates the risk of having to fix unanticipated issues, one of the factors behind project delays and even failure.
b) Data profiling increases quality, which yields better outcomes for a marketing campaign.
4) Maintains accuracy: If a corporation periodically profiles its data, it ensures the database has no missing data and deletes any duplicate data, which gives a higher chance of better outcomes for the organization.
5) Identifies inconsistent business information and data:
a) Reviews and methods produced from the database give businesses a greater probability of closing a sale.
b) The resulting reports serve as concrete evidence for determining the essential areas to work on.
6) Identifies data entry errors: Ideally, systems are designed with checks and edits to protect against data entry errors. In reality there is a tradeoff between quality control and processing throughput, and every system lets data errors in. Profiler Suitcase saves money by detecting errors before they cost dollars and customers.
7) Avoids data migration rework: Critical human resources can be targeted at resolving data anomalies rather than at manual investigation or costly rework. Significant savings result when anomalies are addressed up front with risk-management strategies, rather than in endless cycles of migrate, fix the fallout, and reattempt the migration.
8) Quantifies data corruption: By quantifying the degree of corruption, a business can make a knowledgeable decision about where corrective efforts will be most effective.
9) Optimizes the data management environment: Identifying and removing unnecessary, redundant, and obsolete data streamlines the technical environment. Disk space is freed up, resulting in direct savings in both environment and administration costs.
10) Increases the efficacy of major data tools: Profiling results can be used to increase the effectiveness of source-to-target transformation mappings in ETL tools, and a business intelligence tool can use derived quantitative measurements of the structure within the data content to create statistical models of the attribute domain.
11) Enables enterprise-level data: When used with other processes to evaluate applications, KnowledgeDriver provides significant input for defining an enterprise view of data.
12) Enables a metadata repository: Metadata is a natural by-product of profiling.
13) According to a recent survey, 21% of senior IT executives believe that poor data quality costs their company between $10 million and $100 million per year, and fewer than 15% believe their data is of high quality.
14) Data comes in through all sorts of processes: manual data entry, batch feeds from third parties, e-commerce, and the odd quick patch. It is little wonder that inconsistencies creep in, and these problems can have a significant impact on a business, its credibility, and its bottom line.
15) Current data quality problems cost U.S. businesses more than $600 billion per year.
16) With automation of traditional analysis techniques, it is not uncommon to see analysis time cut by 90% while still gaining a better understanding of the data.
Delivering better data quality relies first and foremost on understanding the data you manage and the rules that govern it. Profiling data provides both the framework and the roadmap to improved data quality, smoother-running systems, more efficient business processes, and ultimately better performance of the enterprise itself. Compared to manual analysis techniques, data profiling technology can significantly improve an organization's ability to meet the challenge of managing data quality.
I. Data Analysts and Data Managers
a) Step improvements in analysis performance through automation: do more in less time
b) Significant increase in the achievable breadth and depth of analysis scope
c) Far clearer understanding of data content and business rules
d) Facilitated communication between analysts, business users, and quality managers
II. Project Managers
a) Visibility of all data quality issues and their current statuses
b) Condensed and achievable delivery timescales
c) Reduced risk of project delays and budget overruns due to unexpected data quality issues
III. System Owners
a) Sustainable data quality
b) Diminished operational costs
c) Fewer disruptions and less manual intervention
d) Ability to deliver a better-value service
IV. Data Owners/Stewards
a) Framework for effective delivery of a data quality strategy
b) Ability to meet data quality responsibilities
c) Greater confidence in data assets
V. Executive Management
a) Effective business processes based on accurate, complete, and trustworthy information
b) Better guarantee of a return on investment from corporate systems and data assets
c) Reliable information supports better strategic and tactical business decisions
d) Increased profitability through improved efficiency and customer management
1. Load a sample of the data
2. Analyze the sample
3. Adjust the extracts and formats of the data
4. Produce deliverables
5. Delete the samples
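A rough sketch of this loop (the source data, sample size, and file name are stand-ins for the example):

    import numpy as np
    import pandas as pd

    # Stand-in for a large source extract; in practice, pulled from the source system
    source = pd.DataFrame({"amount": np.random.default_rng(0).normal(100, 15, 100_000)})

    # 1. Load a sample of the data
    sample = source.sample(n=10_000, random_state=0)

    # 2. Analyze the sample (3. adjust extracts and formats as the findings dictate)
    report = sample.describe()

    # 4. Produce deliverables
    report.to_csv("profiling_report.csv")

    # 5. Delete the sample once the deliverables are produced
    del sample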
Analysis
- Identify your data's current state and determine data quality issues in order to develop standards
- Identify the reusability of the existing data
- Manage, early on, the possible risks of integrating data with other applications
- Resolve missing or erroneous values
- Discover formats and patterns in your data
- Identify cleansing issues to maintain the integrity of the data
- Reveal hidden business rules
- Identify appropriate data values and define transformations so as to maintain data validity
- Report on column minimums, maximums, averages, mean, median, mode, variance, covariance, standard deviation, and outliers (see the sketch after this list)
- Measure business rule compliance across data sets
- Report results in various formats, including PDF, HTML, XML, and CSV
- Provide point-in-time data profiling history
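As an illustration of the column reporting and outlier measures listed above, a minimal pandas sketch (column names and data invented):

    import pandas as pd

    df = pd.DataFrame({
        "qty":   [1, 2, 2, 3, 4, 40],
        "price": [9.5, 9.9, 10.1, 10.4, 10.0, 9.8],
    })

    # Column report: min, max, mean, median, variance, standard deviation, and mode
    print(df["qty"].agg(["min", "max", "mean", "median", "var", "std"]))
    print(df["qty"].mode())

    # Covariance across columns
    print(df.cov())

    # Outliers by the interquartile-range rule
    q1, q3 = df["qty"].quantile([0.25, 0.75])
    iqr = q3 - q1
    print(df[(df["qty"] < q1 - 1.5 * iqr) | (df["qty"] > q3 + 1.5 * iqr)])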
- DataFlux
- Trillium
- InfoSphere Information Analyzer
- Data Quality Explorer
- IBM ProfileStage
- Oracle Warehouse Builder
Key features are as follows: IBM InfoSphere Information Analyzer helps you quickly and easily understand data by offering data quality assessment, flexible data rule design and analysis, and quality monitoring capabilities. These insights help derive more information from enterprise data and accelerate information-centric projects.
Deep profiling capabilities provide a comprehensive understanding of data at the column, key, source, and cross-domain levels. Multi-level rules analysis (by rule, by record, by pattern), unique in the data quality space, provides the ability to evaluate, analyze, and address multiple data issues by record rather than in isolation.
Native parallel execution for enterprise scalability enables high performance against massive volumes of data. The tool supports data governance initiatives through auditing, tracking, and monitoring of data quality conditions over time. Enhanced data classification capabilities help focus attention on common personal identification information to build a foundation for data governance, and data quality issues can be identified proactively by finding patterns and setting up baselines for quality monitoring efforts and for tracking data quality improvements.
ProfileStage is a profiling tool for investigating data sources to see inherent structures, frequencies of phrases, data types, and so on. In addition, based on the real data rather than the metadata, it can suggest a data model for the union of your data sources; this data model would be in third normal form (3NF). QualityStage is now embedded in Information Server and provides functionality for fuzzy-matching records and for standardizing record fields based on predefined rules. AuditStage is now part of Information Analyzer; based on predefined rules, this part of IA can expose exceptions in your data with respect to the required formats, contents, and relationships.
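QualityStage's matching rules are its own; purely as an illustration of the general idea of fuzzy-matching candidate duplicate records, the following sketch uses Python's standard difflib:

    from difflib import SequenceMatcher

    # Two candidate duplicate records (made up); score their similarity in [0, 1]
    a = "Jon Smith, 12 Oak Street"
    b = "John Smith, 12 Oak St."
    score = SequenceMatcher(None, a, b).ratio()
    if score > 0.8:
        print(f"probable match (similarity {score:.2f})")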
Best Practices
Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The first step, and also a critical dependency, is to clearly identify the appropriate person to provide the source data and to serve as the go-to resource for follow-up questions. Once you receive the source data extracts, you're ready to prepare the data for profiling. As a tip, loading data extracts into a database structure will allow you to freely write SQL to query the data while also having the flexibility to use a profiling tool if needed.
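For example, a received extract can be loaded into an in-memory SQLite database and profiled with ad hoc SQL (the table and column here are hypothetical):

    import sqlite3
    import pandas as pd

    # Stand-in for a received source extract
    extract = pd.DataFrame({"state": ["NC", "NC", "XX", None]})

    # Load the extract into a database structure, then query it freely with SQL
    con = sqlite3.connect(":memory:")
    extract.to_sql("customers", con, index=False)
    print(con.execute(
        "SELECT state, COUNT(*) FROM customers GROUP BY state ORDER BY 2 DESC"
    ).fetchall())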
When creating or updating a data profile, start with basic column-level analysis:
1) Distinct count and percent:
a) Analyzing the number of distinct values within each column will help identify possible unique keys within the source data (see the sketch below).
b) Identification of natural keys is a fundamental requirement for database and ETL architecture.
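A minimal sketch of this distinct count and percent analysis (hypothetical columns; a distinct percentage of 100 flags a candidate key):

    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1001, 1002, 1003, 1004],
        "customer": ["A", "B", "A", "C"],
    })

    # Distinct count and percent per column; 100% distinct suggests a candidate key
    for col in df.columns:
        distinct = df[col].nunique()
        print(col, distinct, f"{100 * distinct / len(df):.0f}% distinct")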