List of Contents
1. Introduction
2. Data & Information
3. What is data / information quality?
4. Dimensions of data quality
5. Different approaches to data quality
   5.1. Data Profiling
   5.2. Cleaning and Conforming
6. Data quality analysis on the Northwind database
7. Summary
8. Conclusion
9. References
1 Introduction
Most business owners now prefer to use a data warehouse for their business. A data warehouse is a convenient way to manage the business as a whole: it lets them observe the overall condition of the business, supports decision making, and makes it easier to predict the business's future. In practice, management is not interested in looking at every individual activity; they are more interested in reports and summaries, which are different calculations over the data in the database. The data in the database must therefore be of good quality, because it directly affects the business's decision making. Otherwise many problems arise, such as increased user complaints or a wrong business direction. The Kimball book gives three important reasons why executives are concerned about data quality. First, "if I could see the data, then I could manage my business better." Second, most data sources are distributed, so integrating disparate data sources is required. Third, a sharply increased number of complaints indicates a lack of quality data.
Data quality is an essential characteristic that determines the reliability of data for making decisions [3].
Data are of high quality if "they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer.
From these different definitions of data quality we can say that data should be reliable, should represent the real world, and should serve the purpose of decision making (see also IBM [3]).
Data quality dimensions concern properties such as accuracy, availability, and completeness.

[Figure: "Data Quality Dimension" surrounded by its dimensions — Accuracy, Timeliness, Availability, Relevance, Completeness, Processability, Conformance, Credibility, Consistency]
Accuracy: Accurate data correctly represents the real-world object, situation, or event it describes.
Availability: Available data remains accessible over a long period without problems.
Example: Suppose our source data comes from a URL. When the URL is not available, a 404 Not Found error is shown instead of the data.
Completeness: Complete data contains all the data items or data points necessary to support the application for which it is intended.
Conformance: Conformant data follows an agreed set of rules for capturing and describing the data.
Consistency: Consistent data does not violate the rules of its own database.
Example: A character value cannot be inserted into a column of integer data type.
Processability: Processable data given as input is understandable by the machine or software that consumes it.
Relevance: Relevant data contains the information necessary to support the application.
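As a sketch of how dimensions such as completeness and consistency can be checked programmatically, consider the following Python fragment. The field names and rules here are illustrative assumptions, not part of any schema discussed in this report:

```python
# Illustrative completeness and consistency checks on a single record.
# Field names and rules are assumptions for demonstration only.

def check_record(record):
    """Return a list of data quality violations for one record."""
    violations = []
    # Completeness: required fields must be present and non-empty.
    for field in ("order_id", "order_date"):
        if not record.get(field):
            violations.append(f"{field}: missing value (completeness)")
    # Consistency: order_id must conform to the column's integer type.
    if record.get("order_id") and not str(record["order_id"]).isdigit():
        violations.append("order_id: not an integer (consistency)")
    return violations

print(check_record({"order_id": "10248", "order_date": "2015-01-01"}))  # []
print(check_record({"order_id": "ABC", "order_date": ""}))
```

The second call reports two violations: a missing order date (completeness) and a non-integer order ID (consistency).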
Data Profiling: According to Kimball, data profiling is the technical analysis of data to describe its content, consistency, and structure [5].
Data profiling plays two roles, one strategic and one tactical. Once a data source is identified, a data profiling assessment determines its suitability for the data warehouse and supports a go/no-go decision. Data profiling is a critical stage when initiating any database project that incorporates source data from external systems. Allocating sufficient time and analysis to the data profiling assessment gives the designer a better solution and reduces project risk by identifying potential data problems early.
Best practices for data profiling [6]
Distinct count and percent: Analyzing the distinct values in each column helps to identify the unique values within the source data. Identifying unique keys is a fundamental requirement for database and ETL architecture; in particular, inserting or updating specific rows in the database requires these unique keys.
Example columns from the source data:
Order ID, Order Date, Shipped Date, Ship Via, Ship Name, Ship City, Ship Region, Ship Country
Customer ID, Customer Name, Address, City, Region, Postal Code
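The distinct-count step can be sketched in a few lines of Python (the report's own analysis uses SAS; the sample rows below are invented for illustration):

```python
# Distinct count and percent for each column of a small sample,
# mirroring the "distinct count and percent" profiling step.
# Sample data is made up for demonstration.

rows = [
    {"OrderID": 10248, "ShipCity": "Dhaka"},
    {"OrderID": 10249, "ShipCity": "Cologne"},
    {"OrderID": 10250, "ShipCity": "Dhaka"},
]

for column in rows[0]:
    values = [r[column] for r in rows]
    distinct = len(set(values))
    pct = 100.0 * distinct / len(values)
    print(f"{column}: {distinct} distinct of {len(values)} ({pct:.0f}%)")
```

A column whose distinct count equals the row count (100%) is a candidate unique key for the ETL architecture.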
Zero, blank, null percent: Analyzing zero, blank, and null values helps to identify potential data issues. This information helps the database or ETL architect to set appropriate default values, or to allow nulls in a target database column where data is unknown.
Field          Zero   Blank   Null   Percent
Order ID       0      0       0      100%
Order Date     500    200     30     30%
Shipped Date   50     20      40     40%
Ship Via       40     100     10     35%
Ship Name      20     400     30     22%
Ship City      20     400     12     25%
Ship Country   400    200     40     20%
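A minimal sketch of this zero/blank/null profiling in Python (the sample column values below are invented, not taken from the table above):

```python
# Zero, blank, and null counts for one column, plus the percentage
# of populated values. Sample values are invented for illustration.

def profile_missing(values):
    zero = sum(1 for v in values if v == 0)
    blank = sum(1 for v in values if v == "")
    null = sum(1 for v in values if v is None)
    populated = len(values) - zero - blank - null
    return {"zero": zero, "blank": blank, "null": null,
            "populated_pct": round(100.0 * populated / len(values))}

print(profile_missing([0, "", None, 5, 7, "", 0]))
# {'zero': 2, 'blank': 2, 'null': 1, 'populated_pct': 29}
```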
Minimum, maximum string length and type: Analyzing the string lengths of the source data helps to set the length and type of the target database columns. This is very important for large databases: it saves space and increases query performance by minimizing table scan time. If a field is part of an index, keeping its data type in check helps to minimize index size, overhead, and scan times.
Field          Minimum   Maximum   Type
Order ID       6         8         Integer
Order Date     10        16        Date
Shipped Date   10        16        Date
Ship Via       3         15        Varchar
Ship Name      2         14        Varchar
Ship City      4         15        Varchar
Ship Country   3         11        Varchar
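Computing the minimum and maximum string length of a column is straightforward; a short Python sketch (sample values are invented):

```python
# Minimum/maximum string length per column, used to choose sensible
# target column sizes. Sample data is illustrative only.

def length_range(values):
    """Return (min, max) string length, ignoring nulls."""
    lengths = [len(str(v)) for v in values if v is not None]
    return min(lengths), max(lengths)

cities = ["Dhaka", "Cologne", "Koeln"]
print(length_range(cities))  # (5, 7)
```

With these numbers the designer can size a Varchar column to the observed maximum rather than an arbitrary default.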
Numerical and date range analysis: This analysis helps to characterize the numerical and date values. For example, if we only need integer values but declare the column with a decimal precision, it takes more space than an integer type. Likewise, after observing the date values we can decide which format is appropriate for the database.
Field          Data 1       Data 2           Data 3
Order ID       123456       123457           123458
Order Date     01.01.2015   2015.01.02       03.01.2015
Shipped Date   03.01.2015   04.01.2015       05.01.2015
Ship Via       Air          Bus
Ship Name      XXX          XXX
Ship City      Dhaka        Köln             Cologne
Ship Country   Bangladesh   Germany          Deutschland

Note: the Order Date values use different formats, and Köln/Cologne and Germany/Deutschland have the same meaning but different spellings, which causes confusion.
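The mixed date formats above can be detected automatically. A small Python sketch (the candidate format list is an assumption for this example):

```python
# Detecting mixed date formats, as in the Order Date column above.
# Tries a list of candidate formats and reports which one matched.
from datetime import datetime

CANDIDATE_FORMATS = ["%d.%m.%Y", "%Y.%m.%d"]  # assumed candidates

def detect_format(value):
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            pass
    return None

dates = ["01.01.2015", "2015.01.02", "03.01.2015"]
formats = {d: detect_format(d) for d in dates}
print(formats)
# The column is inconsistent if more than one format occurs:
print(len(set(formats.values())) > 1)  # True
```

Note that a value like 03.01.2015 is still ambiguous (day.month vs. month.day), so format detection only narrows the problem; the target format must be agreed explicitly.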
Pattern analysis: Checking the pattern of the data confirms that each data field is formatted correctly.

Field          Data 1             Data 2
Customer ID    123456             123457
Customer Name  Md. Aminul         Mohammad Islam
Address        Fuldaer str        Oranienstr
Mobile No      017564879954       +4914756214789
E-mail         Aminul@yahoo.com   holy@gmail.com
Website        www.aminul.com     Go.com

Note: the customer names mix first name / last name conventions, and the mobile numbers and websites use different formats.
Kimball describes nine things that help to address data quality.
Data cleansing system: The ETL data cleansing system fixes dirty data, while the data warehouse continues to provide an accurate picture of the data captured by the organization's production systems. The goal is to develop an ETL system that is capable of correcting, rejecting, or loading data, using easy-to-understand structures, rules, and standardization.
Quality screens: Quality screens are the heart of the ETL system. A quality screen is a test run against the data. If the test passes, nothing happens; if the test detects bad data, a record is kept in the error event schema. There are three types of quality screens:
Column screen: This test operates within a single column, for example checking whether the column contains wrong values or null values, or whether a value fails the required format.
Structure screen: This test checks the relationships of data among columns, for example primary keys, foreign keys, and one-to-many relationships between fields in two columns.
Business rule screen: These tests implement more complex rules that do not fit the column or structure screens, for example: shipment date < delivery date.
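The three screen types can be sketched as simple functions over rows. This is a minimal Python illustration in the spirit of Kimball's quality screens, not his implementation; the row layout and rules are assumptions:

```python
# Minimal column screen and business rule screen sketches.
# Row layout and rule choices are illustrative assumptions.
from datetime import date

def column_screen_not_null(rows, column):
    """Column screen: indexes of rows where the column is null."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]

def business_rule_screen(rows):
    """Business rule screen: order date must precede shipment date."""
    return [i for i, r in enumerate(rows)
            if r["order_date"] >= r["shipment_date"]]

rows = [
    {"order_date": date(2015, 1, 1), "shipment_date": date(2015, 1, 3),
     "ship_city": "Dhaka"},
    {"order_date": date(2015, 1, 5), "shipment_date": date(2015, 1, 4),
     "ship_city": None},
]
print(column_screen_not_null(rows, "ship_city"))  # [1]
print(business_rule_screen(rows))                 # [1]
```

Each failing row index would then be written to the error event schema rather than silently dropped.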
Error event schema: The error event schema is a centralized schema whose purpose is to record every error that occurs in the database, together with its date and time. By reviewing the recorded errors it becomes possible to improve data quality over time.
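A minimal sketch of such a centralized error table, using Python's built-in sqlite3; the table and column names are assumptions, not Kimball's exact schema:

```python
# Sketch of a centralized error event table: every screen failure is
# recorded with a timestamp so data quality can be tracked over time.
# Table and column names are illustrative assumptions.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE error_event (
    event_time  TEXT,
    screen_name TEXT,
    table_name  TEXT,
    row_key     TEXT,
    message     TEXT)""")

def record_error(screen, table, row_key, message):
    conn.execute("INSERT INTO error_event VALUES (?, ?, ?, ?, ?)",
                 (datetime.now().isoformat(), screen, table, row_key, message))

record_error("column_screen", "Orders", "10248", "Ship City is null")
count = conn.execute("SELECT COUNT(*) FROM error_event").fetchone()[0]
print(count)  # 1
```

Reporting over this table (errors per screen, per source table, per day) is what makes quality improvement measurable.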
Address:
Full Name:
House No.:
Country:
Region:
City:
Street:
Postal code:
Contact (optional): input the person's name
Order, shipment:
Order ID and shipment ID are generated automatically.
When an order or shipment is created, the date and time are added automatically.
Order date < shipment date.
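The order/shipment rules above can be sketched as follows; the starting ID value and function names are illustrative assumptions:

```python
# Sketch of the rules above: IDs and timestamps are generated
# automatically, and the order date must precede the shipment date.
# Names and the starting ID are illustrative assumptions.
from datetime import datetime
from itertools import count

_order_ids = count(10248)  # auto-generated order IDs

def create_order(order_date=None):
    """Order ID and creation timestamp are added automatically."""
    return {"order_id": next(_order_ids),
            "order_date": order_date or datetime.now()}

def validate_shipment(order, shipment_date):
    """Enforce the rule: order date < shipment date."""
    if not order["order_date"] < shipment_date:
        raise ValueError("order date must precede shipment date")
    return {"order_id": order["order_id"], "shipment_date": shipment_date}

o = create_order(datetime(2015, 1, 1))
s = validate_shipment(o, datetime(2015, 1, 3))
print(o["order_id"], s["shipment_date"].date())  # 10248 2015-01-03
```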
SAS code:
proc sql;
   /* Count the missing (null) values in the Leader column */
   select NMISS(Leader)
   from West3.DIM_EMPLOYEE;
quit;
Results:
SAS code:
proc sql;
   /* Replace null Leader values with 0 */
   select COALESCE(Leader, 0)
   from West3.DIM_EMPLOYEE;
quit;
Results:
Here the null values are replaced with 0, because the column's data type is integer.
Here the total row count is 832 while the distinct count is 830. That means two values are duplicated. Let's find the duplicate values.
proc sql;
   /* Order numbers that appear more than once */
   select Bestell_Nr
   from WEST3.BESTELLUNGEN
   group by Bestell_Nr
   having count(Bestell_Nr) > 1;
quit;
Result:
7 Summary
8 Conclusion
In the end, I want to say that data quality is a continual process. But if we analyze the data before building the warehouse and apply the techniques described here, it is possible to minimize data errors and increase data quality.
References
1. Data vs Information. Available from: http://www.diffen.com/difference/Data_vs_Information [Accessed: 20th December 2014]
2. Rouse, M. (2015) Data Quality. [Online] Available from: http://searchdatamanagement.techtarget.com/definition/data-quality [Accessed: 11th February 2015]
3. IBM, Data Quality. Available from: http://www-01.ibm.com/software/data/quality/ [Accessed: 21st December 2014]
4. McGilvray, D., Granite Falls Consulting, Inc. Excerpted from Executing Data Quality Projects. Available from: http://www.gfalls.com/storage/book/individual-downloads-quick-ref/10steps_DQDimen.pdf [Accessed: 25th December 2015]
5. Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., & Becker, B. (2008). The Data Warehouse Lifecycle Toolkit, 2nd Edition. Wiley Publishing.
6. TDWI (3 February 2010), The Necessity of Data Profiling. Available from: http://tdwi.org/Articles/2010/02/03/Data-Profiling-Value.aspx?Page=1 [Accessed: 27th December 2015]
7. Dekkers, M., Loutas, N., De Keyzer, M., & Goedertier, S. Open Data & Metadata Quality.
8. Geiger, J. G., Intelligent Solutions, Inc., Boulder. Data Quality Management: The Most Critical Initiative You Can Implement.
9. Wikipedia, Data Quality. Available from: http://en.wikipedia.org/wiki/Data_quality