
What is a Staging area?

A staging area is an intermediate storage area between the sources of information and the data warehouse (DW)
or data mart. It is a back-room area of the data warehouse.

The staging area exists to be a separate “back room” or “engine room” of the warehouse, where the data can
be transformed, corrected and prepared for the warehouse. It should be accessible only to the data stewards,
ETL developers working on the data, or administrators monitoring or managing the ETL processes.

A typical ETL development environment has three distinct areas:

 Several source systems which provide data. These can include databases, files or spreadsheets.
 A single “staging area”, which may use one or more database schemas or file systems (depending upon
warehouse load volumes).
 One or more “visible” data marts or a single “warehouse presentation area” where data is made visible to
end-user queries. This is what is called the data warehouse. If a data warehouse has a single subject area
(schema), then data warehouse and data mart are synonyms. If it has multiple subject areas (schemas), a
data mart can be created for each schema, depending on the business requirements.

Staging Area
The “staging area” is the middle bit.
It is a database in which data originating from different source databases (for example DB2, Oracle or
Teradata) can be standardized. Tables can also be joined there. If there is only one source system, the
staging area can be used or eliminated depending on the quality of the data and the transformations
required. Reporting always occurs on the warehouse, never on the staging area.
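
As a minimal sketch of what “standardizing” heterogeneous sources can mean, the Python below maps two extracts with different column names and date formats onto one common staging layout. The source rows, column names and the functions `standardize_db2` and `standardize_oracle` are hypothetical and chosen only for illustration; a real staging job would read from the actual source systems.

```python
from datetime import date, datetime

# Hypothetical raw extracts: each source has its own column names and date format.
DB2_ROWS = [{"CUST_ID": "17", "CUST_NM": "  Acme Ltd ", "CREATED": "2024-03-01"}]
ORACLE_ROWS = [{"customer_no": 42, "name": "globex", "created_on": "01/03/2024"}]


def standardize_db2(row: dict) -> dict:
    """Map a (hypothetical) DB2 extract row onto the common staging layout."""
    return {
        "customer_id": int(row["CUST_ID"]),
        "customer_name": row["CUST_NM"].strip().upper(),
        "created": date.fromisoformat(row["CREATED"]),
        "source_system": "db2",
    }


def standardize_oracle(row: dict) -> dict:
    """Map a (hypothetical) Oracle extract row onto the same layout."""
    return {
        "customer_id": int(row["customer_no"]),
        "customer_name": row["name"].strip().upper(),
        # This source is assumed to deliver dates as day/month/year.
        "created": datetime.strptime(row["created_on"], "%d/%m/%Y").date(),
        "source_system": "oracle",
    }


# After standardization, rows from both sources share one shape and can be
# joined, de-duplicated or checked together in the staging area.
staged = [standardize_db2(r) for r in DB2_ROWS] + [
    standardize_oracle(r) for r in ORACLE_ROWS
]
```

Keeping a `source_system` column is a design choice in this sketch: it lets later steps trace each staged row back to where it came from.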

Key Features

 One or more database schemas or file systems used to “stage” data extracted from the source OLTP
systems before it is loaded to the “warehouse”, where it is visible to end users.
 Data in the staging area is not visible to end users for queries, reports or analysis of any kind. It does not hold
aggregated data ready for querying.
 The loadtimestamp column in staging makes it possible to determine the status of a record, i.e. new, updated
or deleted. The table may also include a flag reporting this status.
 In most cases the staging area is larger than, or else equal in size to, the “presentation area” (DW) itself.
 Although some “stage” data (e.g. the last sequence loaded) may be backed up, much of the staging area data is
automatically replaced during the ETL load processes in the case of a full refresh, i.e. truncate and reload
from source, and hence backup effort can be avoided. The presentation area, however, may need backup
in many cases.
 It may include some metadata, which may be used by analysts or operators monitoring the state of the
previous loads (e.g. audit information, summary totals of rows loaded, etc.).
 An error log file holds details of “rejected” entries: data which has failed quality tests and may need
correction and re-submission to the ETL process.
 It is likely to have few indexes (compared to the “presentation area”) and to hold data in a fairly normalized
form. The presentation area (DW) is by comparison likely to be more highly indexed (mainly bitmap indexes),
with highly denormalized tables (the dimension tables, at least).
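
Several of the features above (full refresh by truncate and reload, a loadtimestamp column, an error log for rejected entries, and audit totals) can be sketched together. The example below uses Python's built-in `sqlite3` module with an in-memory database. The table names `stg_customer`, `stg_error_log` and `stg_load_audit`, the function `full_refresh` and the quality rule are assumptions made for illustration, not part of any particular ETL tool, and the error log is kept as a table here rather than a file.

```python
import sqlite3
from datetime import datetime, timezone


def full_refresh(conn, source_rows):
    """Truncate and reload stg_customer; return (rows_loaded, rows_rejected)."""
    load_ts = datetime.now(timezone.utc).isoformat()
    loaded = rejected = 0
    with conn:  # one transaction: committed on success, rolled back on error
        # SQLite has no TRUNCATE statement; an unqualified DELETE empties the table.
        conn.execute("DELETE FROM stg_customer")
        for row in source_rows:
            cust_id, name = row
            # Hypothetical quality test: both id and a non-blank name are required.
            if cust_id is None or not name or not name.strip():
                conn.execute(
                    "INSERT INTO stg_error_log (raw_record, reason, loadtimestamp) "
                    "VALUES (?, ?, ?)",
                    (repr(row), "missing id or name", load_ts),
                )
                rejected += 1
                continue
            conn.execute(
                "INSERT INTO stg_customer (customer_id, customer_name, loadtimestamp) "
                "VALUES (?, ?, ?)",
                (cust_id, name.strip(), load_ts),
            )
            loaded += 1
        # Audit metadata for operators monitoring the state of previous loads.
        conn.execute(
            "INSERT INTO stg_load_audit (loadtimestamp, rows_loaded, rows_rejected) "
            "VALUES (?, ?, ?)",
            (load_ts, loaded, rejected),
        )
    return loaded, rejected


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_customer (customer_id INTEGER, customer_name TEXT, loadtimestamp TEXT);
    CREATE TABLE stg_error_log (raw_record TEXT, reason TEXT, loadtimestamp TEXT);
    CREATE TABLE stg_load_audit (loadtimestamp TEXT, rows_loaded INTEGER, rows_rejected INTEGER);
    """
)

# One valid row and two rows that fail the quality test.
first_run = full_refresh(conn, [(1, "Acme"), (None, "x"), (2, "  ")])
```

In this sketch the staged rows are replaced on every run, while the error log and audit tables accumulate across runs, which is why only the latter would be candidates for backup.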

Need for a Staging Area

How you perform ETL staging depends on a number of things:


 Your storage availability.
 Your architecture.
 Your data volumes. If your data volumes are large, you will probably want the flexibility to
continue processing a set of data without having completed the previous step on all of your data.
 Your data sources. If your source data resides on heterogeneous source files/databases, you may be forced
to finish at least some of your extracts before you can proceed with the consolidation, integration, and
standardization steps.
 Your data quality. If your data is dirty, you may have to read additional source files/databases to perform your
edit checks and your data correction process.

A staging area also brings the following benefits:

 It minimizes the time ETL processes spend "inside" each source system, thereby minimizing any contention
and/or negative performance impact on the source system’s resources.
 It provides a snapshot of what was loaded in each ETL "run".
 In many cases, data quality assessment and data cleansing steps also take place in the staging area.
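
Because each ETL run leaves a snapshot in staging, two consecutive snapshots can be compared to decide whether a record is new, updated or deleted. The sketch below is one simple way to do this, assuming each snapshot fits in memory as a dictionary keyed by business key; the function name `classify_changes` and the sample data are hypothetical.

```python
def classify_changes(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by business key.

    Returns a dict mapping key -> "new", "updated" or "deleted".
    Keys whose rows are identical in both snapshots are omitted.
    """
    status = {}
    for key, row in current.items():
        if key not in previous:
            status[key] = "new"
        elif row != previous[key]:
            status[key] = "updated"
    # Keys present in the previous snapshot but missing now were deleted at source.
    for key in previous.keys() - current.keys():
        status[key] = "deleted"
    return status


# Hypothetical snapshots from two consecutive ETL runs.
previous_run = {1: ("Acme", "London"), 2: ("Globex", "Paris"), 3: ("Initech", "Austin")}
current_run = {1: ("Acme", "London"), 2: ("Globex", "Berlin"), 4: ("Umbrella", "Rome")}

changes = classify_changes(previous_run, current_run)
```

For large volumes the same comparison would more likely be done set-based inside the staging database, but the logic is the same.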

Designing the Staging Area


Regardless of the persistence of the data in the staging area, you must adhere to some basic rules when the
staging area is designed and deployed:

 The data-staging area must be owned by the ETL team.
 Business users are not allowed in the staging area for any reason.
 Reports cannot access data from the staging area.
 Only ETL processes can write to and read from the staging area.
