including marketing segmentation, inventory management,
A data warehouse implementation includes the conversion of data from numerous
source systems into a common format. Since the data from the various
departments is standardized, each department will produce results that are in
line with all the other departments. So you can have more confidence in the
accuracy of your data. And accurate data is the basis for strong business
decisions.
d) A Data Warehouse Provides Historical Intelligence
A data warehouse stores large amounts of historical data so you can analyze
different time periods and trends in order to make future predictions. Such
data typically cannot be stored in a transactional database or used to
generate reports from a transactional system.
e) A Data Warehouse Generates a High ROI
ROI: return on investment.
Finally, the pièce de résistance: return on investment. Companies that have
implemented data warehouses and complementary BI systems have generated more
revenue and saved more money than companies that haven't invested in BI
systems and data warehouses.
W. H. Inmon
1) In the W. H. Inmon approach, we first define the data warehouse and then
define the data marts.
4. What are the characteristics of a data warehouse?
1) Time variant: All the data in a data warehouse is identified with a
particular time period. The data in the DWH must be historical.
Ex: Day, Week, Month etc.
2) Non-volatile: Data in the DWH is never overwritten or deleted once
committed. The data is static and read-only, so we can't change it. This data
is used for future reporting.
3) Subject oriented: The data is collected from different subject areas. The
collected data must be business oriented.
Ex: Sales, accounts, HR etc.
4) Integrated: The DWH contains data gathered from across the organization and
merged into a coherent whole.
5. What is the difference between ODS and DWH?
ODS (Operational Data Store)
1) It is designed to support operational data monitoring.
2) Data is volatile.
DWH (Data Warehouse)
2) Data is non-volatile.
1) Slowly changing dimension: It changes slowly with respect to time.
Ex: Customer address etc.
It is of three types:
1) Type 1: In case of type 1 we do not maintain any history. We maintain only
the current record by updating the existing record. Technically, type 1 is
insert else update.
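In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moves from Illinois to California, the existing record is
overwritten:

Customer Key   Name        State
1001           Christina   California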
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension
problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to
trace back in history. For example, in this case, the company would not be
able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not
necessary for the data warehouse to keep track of historical changes.
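The "insert else update" behaviour of Type 1 can be sketched in a few lines of Python. This is only an illustration: the dimension is modelled as an in-memory dictionary, and the function name `apply_scd_type1` is hypothetical, not part of any tool.

```python
def apply_scd_type1(dimension, customer_key, name, state):
    """Insert the customer if it is new, otherwise overwrite the existing record."""
    # Type 1 keeps no history: the old values are simply replaced.
    dimension[customer_key] = {"name": name, "state": state}
    return dimension


customers = {}
apply_scd_type1(customers, 1001, "Christina", "Illinois")    # insert
apply_scd_type1(customers, 1001, "Christina", "California")  # update in place
print(customers)
```

After the second call only the California record remains, which is why the Illinois history cannot be traced.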
2) Type 2:
In case of type 2 we maintain historical data.
Whenever there is a change in address, we insert the new address into the
address table and end-date the existing record.
To maintain the versioning in SCD type 2 we use a) start date and end date,
b) a flag, or c) an address key or version number.
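In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moves from Illinois to California, the new information is
added as a new row:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California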
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the
number of rows for the table is very high to start with, storage and
performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary
for the data warehouse to track historical changes.
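A minimal sketch of Type 2 versioning with start date, end date and a current flag might look like this. The function name, the column names and the 2000-01-01 start date are assumptions made for illustration. Matching on the name is a simplification; a real load would match on the business key.

```python
from datetime import date


def apply_scd_type2(rows, next_key, name, new_state, change_date):
    """End-date the current row for `name` and append a new current row."""
    for row in rows:
        if row["name"] == name and row["is_current"]:
            if row["state"] == new_state:
                return rows  # nothing changed, keep the current version
            row["end_date"] = change_date   # close the old version
            row["is_current"] = False
    rows.append({
        "customer_key": next_key, "name": name, "state": new_state,
        "start_date": change_date, "end_date": None, "is_current": True,
    })
    return rows


rows = [{"customer_key": 1001, "name": "Christina", "state": "Illinois",
         "start_date": date(2000, 1, 1), "end_date": None, "is_current": True}]
apply_scd_type2(rows, 1005, "Christina", "California", date(2003, 1, 15))
for row in rows:
    print(row)
```

Both rows are kept, so the table grows with every change, which is the storage concern noted above.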
3) Type 3: In case of SCD type 3 we maintain the current value and the
previous (original) value in separate columns of the same record. It means one
customer can have at most two values stored for the attribute.
In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moves from Illinois to California, the table becomes:

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is
updated.
- This allows us to keep some part of history.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed
more than once. For example, if Christina later moves to Texas on
December 15, 2003, the California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is
necessary for the data warehouse to track historical changes, and when
such changes will only occur a finite number of times.
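The Type 3 behaviour, including the loss of the intermediate value, can be sketched as follows. The function name and column names are hypothetical and mirror the example table.

```python
def apply_scd_type3(row, new_state, effective_date):
    """Overwrite only the current-value column; the original column is untouched."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date
    return row


row = {"customer_key": 1001, "name": "Christina",
       "original_state": "Illinois", "current_state": "Illinois",
       "effective_date": None}
apply_scd_type3(row, "California", "15-JAN-2003")
apply_scd_type3(row, "Texas", "15-DEC-2003")  # California is overwritten here
print(row)
```

After the second move the row holds Illinois and Texas only, so the California period is lost.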
9. What is Conformed dimension? Give example?
A dimension which is used by two or more fact tables is called a conformed
dimension.
Ex: Time, Product, Country etc.
10.What is Degenerated Dimension? Give example?
By nature it is a dimension, but it is located in the fact table. A
degenerate dimension contains only a key and no attributes.
Ex: Bill number, Receipt number etc.
11.What is Junk Dimension? Give example?
A junk dimension is a collection of random transactional codes, flags and/or
text attributes that are unrelated to any particular dimension.
Ex: Yes or No, Male or Female etc.
12.What is Dirty Dimension? Give example?
If a record occurs more than once in a table, differing only in a non-key
attribute, then such a dimension is called a dirty dimension.
A "dirty dimension" is one in which data quality cannot be guaranteed. For
example, in most banks, account-oriented source applications contain data
about the same customer multiple times. Many banks attempt to derive a
"customer" by matching names and addresses across account applications, but
this process results in more than one entry for each bank customer. Similarly,
different attributes must be held for each of a bank's heterogeneous products.
Attributes that are meaningful for a loan, such as term, credit risk
assessment, and collateral, have no meaning for savings, checking, or
investment products.
13.What is Role play dimension?
A dimension that plays multiple roles in the fact table, such as order date,
sales date, shipment date, invoice date etc., is called a role-playing
dimension.
Ex: Time dimension etc.
14.What is lately arrived Dimension?
In data warehousing we sometimes receive the fact data first and the
dimension data later. This kind of dimension is called a lately arrived
(late arriving) dimension.
15.What is rapidly changing dimension?
It is a dimension that changes rapidly with respect to time.
Ex: Share price etc.
16.What are different types of slowly changing dimensions?
Refer Q8.
17.What is the difference between type1, type2 and type3?
Refer Q8
18.How do we maintain the versioning in type2 dimension?
Refer Q8.
19.What are different types of Facts? Give Examples?
There are three types:
1) Additive: These are facts that can be summed across all the dimensions in
the fact table.
Ex: To find yearly sales, we can sum the quarterly sales.
2) Semi-additive: These are facts that can be summed across some of the
dimensions but not across others.
Ex: A bank account balance can be summed across accounts, but not across
time. We can't find the monthly balance by adding up the daily balances.
3) Non-additive: These are facts that can't be summed across any dimension
present in the fact table.
Ex: Temperature, ratios etc.
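The three kinds of fact can be illustrated with a small Python sketch. All figures and names below are made up for the example.

```python
# Additive: quarterly sales can be summed to get yearly sales.
quarterly_sales = {"Q1": 100, "Q2": 120, "Q3": 90, "Q4": 140}
yearly_sales = sum(quarterly_sales.values())

# Semi-additive: end-of-day balances per account.
daily_balances = {
    "2024-01-30": {"A": 500, "B": 200},
    "2024-01-31": {"A": 450, "B": 300},
}
# Summing across accounts on one day is meaningful ...
total_on_31 = sum(daily_balances["2024-01-31"].values())
# ... but across time we take the closing balance instead of summing days.
latest_day = max(daily_balances)
closing_balance_a = daily_balances[latest_day]["A"]

# Non-additive: a ratio must be recomputed from its components, not summed.
revenue = [100, 300]
profit = [20, 120]
overall_margin = sum(profit) / sum(revenue)

print(yearly_sales, total_on_31, closing_balance_a, overall_margin)
```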
20.What is fact less fact table? Give the example?
A factless fact table captures the many-to-many relationship between
dimensions but does not contain any measures.
Common examples of fact less fact tables include:
No attribute is specified.
At this level, the data modeller attempts to describe the data in as much
detail as possible, without regard to how they will be physically implemented
in the database.
Physical Data Model
Features of physical data model include:
At this level, the data modeller will specify how the logical data model will
be realized in the database schema.
A star schema optimizes performance by keeping queries simple and providing
fast response time. All the information about each level is stored in one
row.
Snowflake schemas normalize dimensions to eliminate redundancy. The result is
more complex queries and reduced query performance.
It is called a snowflake schema because the diagram resembles a snowflake.
Full load
- Loading all the data from the source into the DWH for the first time is
called a full load.
- We prefer a full load when loading data for the first time.
Incremental load
- To load only the changed data into the DWH we use an incremental load.
- From the second time onwards we perform an incremental load into the DWH.
- By performing an incremental load we load fewer records, which improves
performance.
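The difference between the two loads can be sketched as below. This assumes each source row carries an `updated_at` value that can be compared with the time of the last load. That column, the function names and the sample rows are all hypothetical.

```python
def full_load(source_rows):
    """First load: copy every source row into the warehouse."""
    return {row["id"]: row for row in source_rows}


def incremental_load(warehouse, source_rows, last_loaded_at):
    """Later loads: apply only the rows changed since the previous load."""
    changed = [r for r in source_rows if r["updated_at"] > last_loaded_at]
    for r in changed:
        warehouse[r["id"]] = r
    return len(changed)


source = [{"id": 1, "amount": 10, "updated_at": 1},
          {"id": 2, "amount": 20, "updated_at": 2}]
warehouse = full_load(source)

# The source changes after the first load: one update and one new row.
source[1] = {"id": 2, "amount": 25, "updated_at": 5}
source.append({"id": 3, "amount": 30, "updated_at": 6})
loaded = incremental_load(warehouse, source, last_loaded_at=2)
print(loaded, warehouse)
```

Only two of the three source rows are touched by the second load, which is where the performance gain comes from.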
We can create indexes in the staging area so that our source qualifier
performs best.
Weeding out unnecessary or unwanted things (characters and spaces etc.) from
incoming data to make it more meaningful and informative
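The weeding-out step described above could be sketched like this. Which characters count as "unwanted" depends on the data. The rule used here (keep only letters, digits and single spaces) and the function name `cleanse` are assumptions for illustration.

```python
import re


def cleanse(value):
    """Remove unwanted characters and collapse extra spaces."""
    value = re.sub(r"[^A-Za-z0-9 ]", "", value)  # drop anything but letters, digits, spaces
    return re.sub(r"\s+", " ", value).strip()    # collapse runs of spaces, trim the ends


print(cleanse("  Chris#tina   Smith!! "))
```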
31.What is the difference between business key and Surrogate key?
Business key
- A business key or natural key is an index that identifies the uniqueness of
a row based on columns that exist naturally in a table according to business
rules.
- Business keys are a good way of avoiding duplicate records, so in practice
I typically add the business key as an alternate key in a dimension.
- Ex: To manage patient information using a business key, we add the
patient_id and patient_code, which are different for each hospital.
Surrogate key
- It is a system-generated sequence number used for maintaining uniqueness,
that is, as the primary key.
- It is an alternative to the business key.
- Surrogate keys are more efficient than business keys, so I would always
recommend joining a fact table to a dimension table using the surrogate key.
- In case of SCD Type 2, we use the surrogate key to maintain the version
number.
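As a rough sketch of the distinction, the following code assigns a system-generated surrogate key to every row while the business key (here a hospital code plus `patient_id`) repeats across versions. The class name, column names and sample values are hypothetical.

```python
from itertools import count


class PatientDimension:
    """Toy dimension that hands out sequential surrogate keys."""

    def __init__(self):
        self._next_key = count(start=1)  # system-generated sequence
        self.rows = []

    def add_version(self, hospital, patient_id, city):
        surrogate_key = next(self._next_key)
        self.rows.append({"patient_key": surrogate_key, "hospital": hospital,
                          "patient_id": patient_id, "city": city})
        return surrogate_key


dim = PatientDimension()
k1 = dim.add_version("H1", "P-17", "Chicago")
k2 = dim.add_version("H1", "P-17", "Los Angeles")  # same patient, new version
k3 = dim.add_version("H2", "P-17", "Dallas")       # same id at another hospital
print(k1, k2, k3)
```

The two versions of the same patient share a business key but get different surrogate keys, which is what lets an SCD Type 2 dimension keep both.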
34.What is granularity?
The granularity is the lowest level of information stored in the fact table.
The depth of the data level is known as granularity. In a date dimension, the
level of granularity could be year, quarter, month, period, week or day.
The process of determining granularity consists of the following two steps:
- Determining the dimensions that are to be included
- Determining the location to place the hierarchy of each dimension of information
35.What is the difference between view and materialized view?
View
- A view is a logical object.
- It does not store any data.
- It is used for security purposes and projection.
- We can perform DML operations on simple views.
- A view fetches the data from the base table by running the base query.
- There is no refresh clause for views.
Materialized view
- It is a physical object.
- It stores summary or aggregated data.
- It is used for reporting purposes.
- We can't perform DML operations on it.
- It has fewer records than the base tables, so report performance improves.
- We can refresh a materialized view with the refresh clause.
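The contrast can be illustrated with Python's built-in sqlite3 module. SQLite has views but no materialized views, so this sketch emulates one with a summary table that is rebuilt by a hypothetical `refresh_snapshot` helper. Table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("East", 100), ("East", 50)])

# A view stores only the query; it runs against the base table every time.
conn.execute("CREATE VIEW v_sales AS "
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")


def refresh_snapshot(conn):
    """Emulated materialized view: store the aggregated result physically."""
    conn.execute("DROP TABLE IF EXISTS mv_sales")
    conn.execute("CREATE TABLE mv_sales AS "
                 "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")


refresh_snapshot(conn)
conn.execute("INSERT INTO sales VALUES ('East', 25)")

view_total = conn.execute("SELECT total FROM v_sales").fetchone()[0]
stale_total = conn.execute("SELECT total FROM mv_sales").fetchone()[0]
refresh_snapshot(conn)
fresh_total = conn.execute("SELECT total FROM mv_sales").fetchone()[0]
print(view_total, stale_total, fresh_total)
```

The view reflects the new row immediately, while the stored summary stays at the old total until it is refreshed.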
My understanding is that the dimensions should be extracted first and then the
facts. That way the foreign keys will still be honoured in the staging area.
41.What happens if we try to load the fact tables before loading the dimension
tables?
If we try to load the fact tables before loading the dimension tables, the
data will be rejected, because the foreign keys in the fact rows refer to
dimension rows that do not exist yet.
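The rejection can be reproduced with Python's built-in sqlite3 module, assuming the fact table declares a foreign key to the dimension and foreign key enforcement is switched on. The table names and the `load_fact` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_sales ("
             "customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key), "
             "amount INTEGER)")


def load_fact(conn, customer_key, amount):
    """Try to insert one fact row; return False if the database rejects it."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (customer_key, amount))
        return True
    except sqlite3.IntegrityError:
        return False


rejected_first = not load_fact(conn, 1001, 250)   # dimension row missing
conn.execute("INSERT INTO dim_customer VALUES (1001, 'Christina')")
accepted_after = load_fact(conn, 1001, 250)       # dimension row now exists
print(rejected_first, accepted_after)
```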
42.What is a Schema?
A graphical representation of the data structure is called a schema.