
What is a Data Warehouse?

And Why Are So Many Schools Setting Them Up?


Richard Goerwitz

What Is a Data Warehouse?

Nobody can agree, so I'm not actually going to define a DW. Don't feel cheated, though. By the end of this talk, you'll:
- Understand key concepts that underlie all warehouse implementations (talk the talk)
- Understand the various components out of which DW architects construct real-world data warehouses
- Understand what a data warehouse project looks like

Why Are Schools Setting Up Data Warehouses?

A data warehouse makes it easier to:
- Optimize classroom and computer lab usage
- Refine admissions ratings systems
- Forecast future demand for courses and majors
- Tie private spreadsheet data into central repositories
- Correlate admissions and IR data with outcomes such as:
  - GPAs
  - Placement rates
  - Happiness, as measured by alumni surveys
- Notify advisors when extra help may be needed, based on:
  - Admissions data (student vitals; SAT, etc.)
  - Special events: an A student suddenly gets a C in his/her major
  - Slower trends: a student's GPA falls for more than 2 semesters/terms

Better information = better decisions:
- Better admission decisions
- Better retention rates
- More effective fund raising, etc.

(Many other examples could be given!)

Talking The Talk

To think and communicate usefully about data warehouses you'll need to understand a set of common terms and concepts:
- OLTP
- ODS
- OLAP, ROLAP, MOLAP
- ETL
- Star schema
- Conformed dimension
- Data mart
- Cube
- Metadata

Even if you're not an IT person, pay heed:
- You'll have to communicate with IT people
- More importantly: evidence shows that IT will only build a successful warehouse if you are intimately involved!

OLTP

OLTP = online transaction processing: the process of moving data around to handle day-to-day affairs, such as:
- Scheduling classes
- Registering students
- Tracking benefits
- Recording payments, etc.

Systems supporting this kind of activity are called transactional systems
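
To make this concrete, here is a minimal sketch of the kind of work a transactional system does; the table and column names are hypothetical:

```sql
-- A hypothetical registration transaction: short, write-heavy,
-- and concerned only with the current state of the data.
BEGIN;

INSERT INTO enrollment (student_id, section_id, enrolled_on)
VALUES (10442, 2087, CURRENT_DATE);

UPDATE section
SET    seats_remaining = seats_remaining - 1
WHERE  section_id = 2087;

COMMIT;
```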

Transactional Systems

Transactional systems are optimized primarily for the here and now. They:
- Can support many simultaneous users
- Can support heavy read/write access
- Allow for constant change
- Are big, ugly, and often don't give people the data they want

Transactional systems don't record all previous data states. Lots of data gets thrown away or archived, e.g.:
- Admissions data
- Enrollment data
- Asset-tracking data (How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?)

As a result, a lot of data ends up in shadow databases, and some ends up locked away in private spreadsheets.

Simple Transactional Database

Map of the Microsoft Windows Update Service (WUS) back-end database, diagrammed using Sybase PowerDesigner:
- Each green box is a database table
- Arrows are joins or foreign keys
- This is simple for an OLTP back end

More Complex Example

Recruitment Plus back-end database, used by many admissions offices. Note again:
- Green boxes are tables
- Lines are foreign-key relationships
- Purple boxes are views

Considerable expertise is required to report off this database! Imagine what it's like for even more complex systems:
- Colleague
- SCT Banner (over 4,000 tables)

The Reporting Problem


Often we require OLTP data as a snapshot, in a spreadsheet or report. Reports require querying back-end OLTP support databases. But OLTP databases are often very complex, and typically:
- Contain many, often obscure, tables
- Utilize cryptic, unintuitive field/column names
- Don't store all necessary historical data

As a result, reporting becomes a problem. It:
- Requires special expertise
- May require modifications to production OLTP systems
- Becomes harder and harder for staff to keep up with!
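
A taste of the problem, using made-up but typical OLTP-style names: even a simple headcount question turns into a multi-join puzzle over cryptic tables.

```sql
-- Hypothetical OLTP-style query: cryptic names, many joins,
-- just to count current enrollments by department.
SELECT d.dept_cd, COUNT(*) AS n
FROM   stu_mst     s
JOIN   stu_enr_dtl e ON e.stu_id  = s.stu_id
JOIN   crs_sec     c ON c.sec_id  = e.sec_id
JOIN   dept_ref    d ON d.dept_cd = c.dept_cd
WHERE  e.enr_stat_cd = 'A'
GROUP  BY d.dept_cd;
```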

Workarounds

Ways of working around the reporting problem include:

1. Have OLTP system vendors do the work
   - Provide canned reports
   - Write reporting GUIs for their products
2. Hire more specialists
   - To create simplified views of OLTP data
   - To write reports and create snapshots
3. Periodically copy data from OLTP systems to a place where:
   - The data is easier to understand
   - The data is optimized for reporting
   - The data is easily pluggable into reporting tools

ODS

ODS = operational data store. ODSs were an early workaround to the reporting problem. To create an ODS, you:
- Build a separate/simplified version of an OLTP system
- Periodically copy data into it from the live OLTP system
- Hook it to operational reporting tools

An ODS can be an integration point or real-time reporting database for an operational system. It's not enough for full enterprise-level, cross-database analytical processing.
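
A minimal sketch of the "periodically copy" step, assuming a hypothetical simplified ODS table named ods_enrollment and reusing the made-up OLTP names from the earlier example:

```sql
-- Hypothetical nightly ODS refresh: wipe the simplified copy
-- and reload it from the live OLTP tables.
TRUNCATE TABLE ods_enrollment;

INSERT INTO ods_enrollment (student_id, student_name, course_code, term)
SELECT s.stu_id, s.full_nm, c.crs_cd, c.term_cd
FROM   stu_mst     s
JOIN   stu_enr_dtl e ON e.stu_id = s.stu_id
JOIN   crs_sec     c ON c.sec_id = e.sec_id;
```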

OLAP

OLAP = online analytical processing: the process of creating and summarizing historical, multidimensional data, to:
- Help users understand the data better
- Provide a basis for informed decisions
- Allow users to manipulate and explore data themselves, easily and intuitively

OLAP is more than just reporting; reporting is just one (static) product of OLAP.

OLAP Support Databases


OLAP systems require support databases. These databases typically:
- Support fewer simultaneous users than OLTP back ends
- Are structured simply, i.e., denormalized
- Can grow large
- Hold snapshots of data in OLTP systems
- Provide history/time depth to our analyses
- Are optimized for read (not write) access
- Are updated via periodic batch (e.g., nightly) ETL processes

ETL Processes

ETL = extract, transform, load. ETL processes:
- Extract data from various sources
- Transform and clean the data from those sources
- Load the data into databases used for analysis and reporting

ETL processes are coded in various ways:
- By hand in SQL, UniBASIC, etc.
- Using more general programming languages
- In semi-automated fashion, using specialized ETL tools like Cognos Decision Stream

Most institutions do hand ETL; but note well, hand ETL:
- Is slow
- Requires specialized knowledge
- Becomes extremely difficult to maintain as code accumulates and databases/personnel change!
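
As a taste of hand ETL, here is a minimal sketch in SQL, reusing the hypothetical OLTP and warehouse table names from earlier examples; a real job would also handle errors, logging, and incremental loads:

```sql
-- Hypothetical hand-coded ETL step: extract from an OLTP source,
-- transform (map cryptic codes to warehouse keys, keep clean rows),
-- and load a fact table in the reporting database.
INSERT INTO fact_enrollment (student_key, course_key, term_key, credit_hours)
SELECT ds.student_key,
       dc.course_key,
       dt.term_key,
       e.cr_hrs
FROM   stu_enr_dtl e                                -- extract: raw OLTP rows
JOIN   dim_student ds ON ds.student_id = e.stu_id   -- transform: map to
JOIN   dim_course  dc ON dc.course_cd  = e.crs_cd   -- warehouse keys
JOIN   dim_term    dt ON dt.term_cd    = e.term_cd
WHERE  e.enr_stat_cd = 'A';                         -- clean: active rows only
```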

Where Does the Data Go?

What sort of database do the ETL processes dump data into? Typically, very simple table structures. These table structures are:
- Denormalized
- Minimally branched/hierarchized
- Structured into star schemas

So What Are Star Schemas?

Star schemas are collections of data arranged into star-like patterns:
- They have fact tables in the middle, which contain amounts and measures (like counts, dollar amounts, GPAs)
- Dimension tables sit around the outside, and contain labels and classifications (like names, geocodes, majors)
- For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term)

Star schemas should:
- Have descriptive column/field labels
- Be easy for users to understand
- Perform well on queries
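
To make this concrete, here is a minimal star schema sketched in SQL DDL; all table and column names are hypothetical:

```sql
-- Dimension tables: labels and classifications around the outside.
CREATE TABLE dim_student (
    student_key  INTEGER PRIMARY KEY,
    student_name VARCHAR(100),
    major        VARCHAR(50),
    home_state   CHAR(2)
);

CREATE TABLE dim_term (
    term_key  INTEGER PRIMARY KEY,
    term_name VARCHAR(20),   -- e.g., 'Fall 2006'
    year      INTEGER
);

-- Fact table in the middle: measures, keyed to the dimensions.
CREATE TABLE fact_gpa (
    student_key  INTEGER REFERENCES dim_student (student_key),
    term_key     INTEGER REFERENCES dim_term (term_key),
    gpa          DECIMAL(4,3),
    credit_hours DECIMAL(5,2)
);
```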

A Very Simple Star Schema


Data center UPS power output:
- Dimensions: phase, time, date
- Facts: volts, amps, etc.

A More Complex Star Schema

Freshman survey data (HERI/CIRP):
- Dimensions:
  - Questions
  - Survey years
  - Data about test takers
- Facts:
  - Answer (text)
  - Answer (raw)
  - Count (1)

Oops: not a star, but snowflaked! The answers should have been placed in their own dimension (creating a factless fact table). I'll demo a better version of this star later!

Data Marts

One definition: a data mart is one or more star schemas that present data on a single business process or a related set of business processes.

Data marts should not be built in isolation. They need to be connected via dimension tables that are:
- The same, or subsets of each other
- Hierarchized the same way internally

So, e.g., if I construct data marts for:
- GPA trends, student major trends, enrollments
- Freshman survey data, senior survey data, etc.

I connect these marts via a conformed student dimension. This:
- Makes correlation of data across star schemas intuitive
- Makes it easier for OLAP tools to use the data
- Allows nonspecialists to do much of the work
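
A sketch of why conformed dimensions matter: because two hypothetical fact tables share the same dim_student (names continue the hypothetical star above, plus made-up CIRP tables), correlating across marts is a straightforward join.

```sql
-- Correlate two marts through the conformed student dimension:
-- average GPA grouped by students' answer to one survey question.
SELECT a.answer_text,
       AVG(g.gpa) AS avg_gpa
FROM   fact_gpa         g
JOIN   dim_student      s ON s.student_key = g.student_key
JOIN   fact_cirp_answer f ON f.student_key = s.student_key
JOIN   dim_cirp_answer  a ON a.answer_key  = f.answer_key
WHERE  f.question_code = 'HPW_STUDYING'   -- hypothetical question code
GROUP  BY a.answer_text;
```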

Simple Data Mart Example


UPS data mart with four stars:
- Battery star (by battery): run-time, % charged, current
- Input star (by phase): voltage, current
- Output star (by phase): voltage, current
- Sensor star (by sensor): temperature, humidity

Note the conformed date and time dimensions!

CIRP Star/Data Mart


CIRP freshman survey data, corrected from a previous slide:
- Note the CirpAnswer dimension
- Note the student dimension (ties in with other marts)

CIRP Mart in Cognos BI 8

ROLAP, MOLAP

ROLAP = OLAP via direct relational query:
- E.g., against a (materialized) view
- Or against star schemas in a warehouse

MOLAP = OLAP via a multidimensional database (MDB):
- An MDB is a special kind of database
- It treats data kind of like a big, fast spreadsheet
- MDBs typically draw data in from a data warehouse
- MDBs are built to work best with star schemas
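
A sketch of ROLAP in practice: the OLAP tool issues ordinary SQL against the star, such as this roll-up of average GPA by major and term (names continue the hypothetical star schema above).

```sql
-- ROLAP-style query: slice and roll up directly against the star schema.
SELECT s.major,
       t.term_name,
       AVG(g.gpa) AS avg_gpa,
       COUNT(*)   AS n_students
FROM   fact_gpa    g
JOIN   dim_student s ON s.student_key = g.student_key
JOIN   dim_term    t ON t.term_key    = g.term_key
GROUP  BY s.major, t.term_name
ORDER  BY s.major, t.term_name;
```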

Data Cubes

The term data cube means different things to different people. Various definitions:
1. A star schema
2. Any DB view used for reporting
3. A three-dimensional array in an MDB
4. Any multidimensional MDB array (really a hypercube)

Which definition do you suppose is technically correct?

Metadata

Metadata = data about data. In a data warehousing context it can mean many things:
- Information on data in source OLTP systems
- Information on ETL jobs and what they do to the data
- Information on data in marts/star schemas
- Documentation in OLAP tools on the data they manipulate
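
Even the relational database itself exposes metadata. A minimal illustration, using the standard information_schema catalog (supported by most major relational databases) and the hypothetical fact table from earlier:

```sql
-- Data about data: list the columns of the fact table,
-- straight from the database's own metadata catalog.
SELECT column_name, data_type
FROM   information_schema.columns
WHERE  table_name = 'fact_gpa'
ORDER  BY ordinal_position;
```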

Many institutions make metadata available via data malls or warehouse portals, e.g.:
- University of New Mexico
- UC Davis
- Rensselaer Polytechnic Institute
- University of Illinois

Good ETL tools automate the setup of malls/portals!

The Data Warehouse


OK, now we're experts in terms like OLTP, OLAP, star schema, metadata, etc. Let's use some of these terms to describe how a DW works. A data warehouse:
- Provides ample metadata (data about the data)
- Utilizes easy-to-understand column/field names
- Feeds multidimensional databases (MDBs)
- Is updated via periodic (mainly nightly) ETL jobs
- Presents data in a simplified, denormalized form
- Utilizes star-like fact/dimension table schemas
- Encompasses multiple, smaller data marts
- Supports OLAP tools (Access/Excel, Safari, Cognos BI)
- Derives data from (multiple) back-end OLTP systems
- Houses historical data, and can grow very big

A Data Warehouse is Not

Vendor and consultant proclamations aside, a data warehouse is not:
- A project with a specific end date
- A product you buy from a vendor:
  - Like an ODS (such as SCT's)
  - A canned warehouse supplied by iStrategy
  - Cognos ReportNet
- A database schema or instance:
  - Like Oracle
  - SQL Server
- A cut-down version of your live transactional database

Kimball & Caserta's Definition

According to Ralph Kimball and Joe Caserta, a data warehouse is:

"A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making."

Another definition: the union of all the enterprise's data marts.

Aside: the Kimball model is not without its critics, e.g., Bill Inmon.

Example Data Warehouse (1)

This one is RPI's. It has five parts:
- Sources
- ETL stuff
- DW proper
- Cubes, etc.
- OLAP apps

Example Data Warehouse (2)

Caltech's DW. Five parts:
- Source systems
- ETL processes
- Data marts
- FM/metadata
- Reporting and analysis tools

Note: they're also customers of Cognos!

So Where is Colorado College?

Phil Goldstein (Educause Center for Applied Research fellow) identifies the major deployment levels:
- Level 1: Transactional systems only
- Level 2a: ODS or single data mart; no ETL
- Level 2: ODS or single data mart with ETL tools
- Level 3a: Warehouse or multiple marts; no ETL; OLAP
- Level 3b: Warehouse or multiple marts; ETL; OLAP
- Level 3: Enterprise-wide warehouse or multiple marts; ETL tools; OLAP tools

Goldstein's study was just released in late 2005. It's very good, based on real survey data. Which level is Colorado College at?

Implementing a Data Warehouse

In many organizations IT people want to huddle and work out a warehousing plan, but in fact:
- The purpose of a DW is decision support
- The primary audience of a DW is therefore College decision makers
- It is College decision makers, therefore, who must determine:
  - Scope
  - Priority
  - Resources

Decision makers can't make these determinations without an understanding of data warehouses. It is therefore imperative that key decision makers first be educated about data warehouses. Once this occurs, it is possible to:
- Elicit requirements (a critical step that's often skipped)
- Determine priorities/scope
- Formulate a budget
- Create a plan and timeline, with real milestones and deliverables!

Is This Really a Good Plan?

Sure, according to Phil Goldstein (Educause Center for Applied Research). He has conducted extensive surveys on academic analytics (= business intelligence for higher ed). His four recommendations for improving analytics:
1. Key decision makers must lead the way
2. Technologists must collaborate:
   - Must collect requirements
   - Must form strong partnerships with functional sponsors
3. IT must build the needed infrastructure:
   - Carleton violated this rule with Cognos BI
   - As we discovered, without an ETL/warehouse infrastructure, success with OLAP is elusive
4. Staff must train and develop deep analysis skills

Goldstein's findings closely mirror the advice of industry heavyweights Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite, etc.

Isn't a DW a Huge Undertaking?

Sure, it can be huge. But don't hold on too tightly to the big-sounding word "warehouse." Luminaries like Ralph Kimball have shown that a data warehouse can be built incrementally:
- You can start with just a few data marts
- Targeted consulting help will ensure proper, extensible architecture and tool selection

What Takes Up the Most Time?

You may be surprised to learn which DW step takes the most time. Try guessing:
- Hardware
- Physical database setup
- Database design
- ETL
- OLAP setup

[Bar chart comparing time spent on hardware, physical database, schemas, ETL, and OLAP tools; the ETL bar towers over the rest.]

According to Kimball & Caserta, ETL will eat up 70% of the time. Other analysts give estimates ranging from 50% to 80%. ETL is the most often underestimated part of the warehouse project!

Eight-Month Initial Deployment


Step                              Duration
Begin educating decision makers   21 days
Collect requirements              14 days
Decide general DW design          7 days
Determine budget                  3 days
Identify project roles            1 day
Eval/choose ETL tool              21 days
Eval/choose physical DB           14 days
Spec/order, configure server      20 days
Secure, configure network         1 day
Deploy physical target DB         4 days
Learn/deploy ETL tool             28 days
Choose/set up modeling tool       21 days
Design initial data mart          7 days
Design ETL processes              28 days
Hook up OLAP tools                7 days
Publicize, train, train           21 days

Conclusion

Information is held in transactional systems. But transactional systems are complex:
- They don't talk to each other well; each is a silo
- They require specially trained people to report off of

For normal people to explore institutional data, data in transactional systems needs to be:
- Renormalized as star schemas
- Moved to a system optimized for analysis
- Merged into a unified whole in a data warehouse

Yes, IT people must build the infrastructure. But IT people aren't the main customers; this process must be led by customers.

So who are the customers?
- Admissions officers trying to make good admission decisions
- Student counselors trying to find/help students at risk
- Development officers raising funds that support the College
- Alumni affairs people trying to manage volunteers
- Faculty deans trying to right-size departments
- IT people managing software/hardware assets, etc.
