Datawarehouse Naming Standards
INTRODUCTION
PROJECTION STANDARDS
CONSTRAINT STANDARD
ATTRIBUTE STANDARD
TIME STANDARD
PLATFORM CONSTRAINTS
Introduction
Licence
As these are generic software documentation standards, they will be covered by the 'Creative
Commons Zero v1.0 Universal' CC0 licence.
Warranty
The author does not make any warranty, express or implied, that any statements in this document
are free of error, are consistent with a particular standard of merchantability, or will meet the
requirements for any particular application or environment. They should not be relied on for solving
a problem whose incorrect solution could result in injury or loss of property. If you do use this
material in such a manner, it is at your own risk. The author disclaims all liability for direct or
consequential damage resulting from its use.
This is really a sample document, illustrating particular choices. As each DW environment is
different, these choices will also differ. There is no universal standard for naming conventions.
Purpose
This Naming standards document defines the standards to follow for schemas, subject areas,
projections, attributes, constraints, boilerplate columns, renaming, and time. This document is
needed to make it easier for Business users to understand column and table names, and reduce the
cost of egregious renaming.
Audience
The primary audience of this document is any staff who use the DW, or who do design, development
or maintenance on the DW. It will also be useful to Business Intelligence end users.
Assumptions
It is assumed that the naming conventions will support a variety of DW design patterns such as Data
Vault, Kimball, and source standards.
Approach
The naming standard will be based on common DW design patterns and other essential design
decisions.
Related Documents
There are many documents describing the DW design methodologies such as Kimball, Inmon, etc.
Some are below.
Ralph Kimball, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling, Wiley, 2002.
Definitions
Term Source Definition
Aggregate Kimball An aggregate is a summary table using group by, often based on fact
and/or dimension tables.
Data Model Chris Date The data model must represent demonstrably true statements about
the business area. That is, the entities must represent things that mean
something to the business, and the relationships between entities
must represent meaningful links.
Data Vault Linstedt Data Vault is a database method that is designed to provide long-term
historical storage of data coming in from multiple operational systems.
It provides a DW pattern that supports the Inmon goals of Subject-
orientation, non-volatility, integration and Time-variance.
Disemvowel DB This is a function that removes all vowels from a text string. E.g. A
sentence such as: “The quick brown fox jumps over the lazy dog”
would, after being disemvowelled, become: “Th qck brwn fx jmps vr th
lzy dg”.
Hub Linstedt A hub represents a core business concept such as Customer, Vendor,
Sale or Product. The hub table is formed around the business key, as
well as the source system keys.
Logical name Chris Date These are full words from a standard business vocabulary. This
contrasts with physical names that are often abbreviated due to name
length limitations, etc.
Metadata DB Metadata is "data about data". While not often used in reporting,
these tables are important in DW standards, and for generating and
describing DW components.
Non-volatile Inmon Non-volatile design is essential. Non-volatility literally means that once
a row is written, it is never modified. This is necessary to preserve
incremental net change history. This, in turn, is required to represent
data as of any point in time. When a data row is updated, the past
information is destroyed. A fact or total that included the unmodified
data can never be recreated.
Object DB For the purposes of the DW, an object is a projection over a table or a
table join. They are often used in applications to layer and rename
physical column names to attribute names. Consequently, the end user
is more familiar with the attribute name, rather than the column
name.
Pivot DB A pivot table is the transformation of an EAV table into columnar form.
RDF W3C The Resource Description Framework (RDF) is a family of World Wide
Web Consortium (W3C) specifications originally designed as a
metadata data model. It has come to be used as a general method for
conceptual description of information and/or knowledge management,
which is implemented in web resources, and other formats. An RDF
triple is a statement about resources in the form of a subject-predicate-object expression. RDF is a
more general form of the original Entity-Attribute-Value (EAV) design pattern, where the subject is
an entity, the predicate is an attribute and the object is a value. A set of such RDF
triples forms an RDF graph.
Satellite Linstedt A satellite contains the descriptive information (context) for a business
key.
Time Variant Inmon Time variance calls for storage of multiple copies of the underlying
detail in aggregations of differing periodicity and/or time frames. There
may be detail for seven years along with weekly, monthly and
quarterly aggregates of differing duration. This is critical for
maintaining the consistency of reported summaries over time.
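The Pivot definition above (transforming an EAV table into columnar form) can be illustrated with a minimal sketch. The function name and the sample customer data are hypothetical, and this is only one possible shape for the result.

```python
from collections import defaultdict

def pivot_eav(rows):
    """Pivot EAV (Entity-Attribute-Value) triples into columnar records.

    rows: iterable of (entity, attribute, value) triples.
    Returns a dict mapping each entity to a {attribute: value} record,
    i.e. one "row" per entity with attributes as columns.
    """
    records = defaultdict(dict)
    for entity, attribute, value in rows:
        records[entity][attribute] = value
    return dict(records)

# Hypothetical EAV rows for two customer entities.
eav = [
    (1, "NAME", "Acme"),
    (1, "CITY", "Sydney"),
    (2, "NAME", "Globex"),
]
print(pivot_eav(eav))
# {1: {'NAME': 'Acme', 'CITY': 'Sydney'}, 2: {'NAME': 'Globex'}}
```

In a real DW this pivot would be done in SQL over the RDF/EAV tables, but the transformation is the same: each distinct attribute becomes a column, and missing attributes simply stay absent (or become NULL).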
Tags
Business Intelligence ; Data Mapping ; Metadata ; Standards ; Data Transformation ; Data
Warehouse ; Database ; Fact / Dimension ; Data Load ; Data Model ; Data Vault ; Database Design ;
Extract Load Transform - ELT ; Extract Transform Load - ETL ; Inmon ; Kimball ; Massive Parallel
Processing - MPP ; Netezza ; Oracle ; Data Integration ; Data Lineage ; Data Traceability ; Time
Variant ; Metadata Glossary ; Hub / Satellite ; Projection ;
Layer Standard
These are all defined as schemas in the DW. This assumes a simple 4 layer DW design. The final
layers start with A*, so they will appear first in any schema listing.

Gold Layer (Business) - AU_GOLD_DECISION: This is a schema for tables and views that are available
to Business Decision Makers. It represents the final state of the transformed data, suitably cleansed
for the primary customer of the DW, the business decision maker.

Silver Layer (Business) - AG_SILVER_INTEGRATION: This is a schema that enables integration across
multiple source schemas. It represents Lead data transformed into more valuable Silver data.

Lead Layer (Source) - PB_<Source System Code>_<DB Name/Path>_<Schema Name/File>_<Load Type>:
This contains the raw loaded source data. The primary need is for a unique definition of the data,
and this naming pattern supports it. The name defines the important characteristics of the load. The
first code identifies the source system. The next code identifies the DB name for databases, the
address for web sources, or the directory path for file sources. The DB name must always be based
on the ultimate production database name, not the development name. In the case of directory or
path names, the protocol prefix must be stripped off, and illegal characters like ":/" must be
substituted with underscores. The next code identifies the schema name for databases, the page for
web sources, or the file for file based sources. The final code identifies the load type. This more
precise method of identifying sources enables a much larger number of different sources to be
identified, and allows multiple loads from the same source to be managed so that they do not
impact on each other. This approach is also more self-documenting, with a clearer data lineage.
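The Lead layer naming pattern above can be sketched in a few lines. The function name and the sample source values are hypothetical, and the exact character-substitution rules are an assumption based on the ":/" example in the standard.

```python
import re

def lead_schema_name(source_system, db_or_path, schema_or_file, load_type):
    """Compose a Lead layer schema name of the form:
    PB_<SourceSystemCode>_<DBName/Path>_<SchemaName/File>_<LoadType>.

    Strips any protocol prefix (e.g. http://) and substitutes illegal
    characters such as ':' and '/' with underscores, per the standard.
    """
    def clean(part):
        part = re.sub(r"^[a-z]+://", "", part, flags=re.IGNORECASE)  # drop protocol prefix
        part = re.sub(r"[^A-Za-z0-9]+", "_", part)                   # illegal chars -> underscore
        return part.strip("_").upper()

    parts = [clean(p) for p in (source_system, db_or_path, schema_or_file, load_type)]
    return "PB_" + "_".join(parts)

# Hypothetical ERP source loaded by CDC, and a web source loaded manually.
print(lead_schema_name("ERP", "PRODDB", "SALES", "CDC"))
# PB_ERP_PRODDB_SALES_CDC
print(lead_schema_name("WEB", "http://example.com", "page", "MNL"))
# PB_WEB_EXAMPLE_COM_PAGE_MNL
```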
Load Type
This is needed to distinguish how the data has been loaded. If data is brought in by separate ETL or
CDC processes, it will be subject to different time variance and recovery processes. This is even an
issue within CDC, where data can be either refreshed or mirrored. If the target is not distinct, then it
may be impossible to properly reason about its state, leading to incorrect replication and, ultimately,
erroneous business decisions.
CDC - Change Data Capture: This is the default process for loading data.
ELT - Extract Load Transform: A common pattern in DW is to load the data before transformation.
This results in faster loads in an MPP appliance.
ETL - Extract Transform Load: ETL is used to load data into the DW. The transformation occurs in a
separate ETL tool, a common anti-pattern that leads to poor single threaded ETL performance.
MNL - Manually loaded: This is practical for data that does not need to be loaded more than once,
or only very infrequently, which is generally invariant data.
REF - Reference: Used mainly for schema analysis, and possibly prototyping, but not to be used for
reporting. A throw away design area.
Just as there can be alternate means of loading, there can also be alternate source types from the
same source. In almost all cases, there is not a strict isomorphism between source and target, and
the source type plays an important part in determining the amount of change that occurs between
the source and target. In other words, the different source types have different constraints that
dictate how well the data is loaded. If the source type is also a file type, then it should conform to
standard file extensions. The source type may also be appended to the name, but it is not
mandatory.
HPR - Hyperion
NTZ - Netezza
ODS - Operational Data Store: Many source systems currently provide an ODS layer. While well
meaning, this anti-pattern often creates a reconciliation burden.
ORA - Oracle
Metadata Management
Note that some metadata can be placed in other layers. However, if the metadata is shared across
schemas or layers, then it is best to place it in a separate schema. Lineage can be considered the
column to column mappings. Traceability can be considered the actual row value to row value
mappings.
Projection Standards
A projection can be thought of as a subset of columns. These are a series of codes or types that can
be used as part of a projection name. These codes help to make the purpose of a projection more
transparent, through consistent usage of the name patterns. The next chapter will give examples of
how they can be combined to create projection names.
A, ACT - Active (current) - Source: A projection of the logical active columns for an active table,
filtered to show current data only, excluding deleted rows. An active column has a number of
distinct values greater than zero. An active table has at least one row. There is some redundancy
here, as this is also defined by the column filter type; however, it is retained as it is now standard
practice.

B, BRD - Bridge, Helper - Kimball: A join over tables that can be used as a bridge in a Kimball
schema.

C, CUB - Cube - Kimball: A projection that is a join of a Kimball Fact table and related Kimball
Dimension tables. This is useful when the users do not know how to join tables. It does not use
GROUP BY, so it is different to an Aggregate.

F, FCT - Fact - Kimball: A join over tables that can be used as a fact in a Kimball schema. Typical
columns are numeric, representing either continuous data or countable data. This is often based on
a single table.

H, HUB - Hub - Data Vault: A join over tables that can be used as a hub in a Data Vault schema.

I, INV - Invariant - Inmon: Something that is always true. For example, system setup data does not
change over the life of a system, so this can be considered invariant. This data is not mutable; that
is, it cannot be updated, except for error correction.

K, KEY - Key - Source: This represents cleansed Source tables that have had a surrogate and/or a
distribution key added. A surrogate key is needed for Fact/Dimension joins. A distribution key is
critical for adequate Netezza performance. Note that these tables would still retain their source
names.

L, LNK - Link - Data Vault: A join over tables that can be used as a link in a Data Vault schema.

M, MTD - Metadata - Metadata: A join over tables that can be used for metadata. Examples include
A&A, Glossary, Data Quality, Data Lineage, Data Traceability, Invariants, etc.

N, NUB - Nub - Source: This represents Source tables that have been cleansed, but without column
name changes. For example, cleansing can discard boilerplate columns, de-duplicate rows, add
default values (e.g. N/A for nulls), convert types (e.g. Text -> Dates), fix column lengths, etc.

O, OTH - Other - Source: This represents some source system based column and/or table grouping
which is not defined as a table/view in the source RDBMS. For example, this could represent a
manually maintained subject area grouping, or some set of data in the application layer.

P, PVT - Pivot (from RDF/EAV) - Source: A join over RDF/EAV tables that pivots or flattens the EAV
data into multiple tables in columnar form. They can then be joined to other tables.

R, RDF - Resource Description Framework - Metadata: A join over source tables used to describe
source data that is in RDF form. Typically, 4 tables are needed: Resource (Entity), which defines
tables and primary keys; Predicate (Relation), which defines relationships; ResourceType, which
defines the type of each attribute (e.g. date, char, etc.); and RDF (Value), which contains the RDF
triples. This data can be used to pivot into standard tables.

S, STL - Satellite - Data Vault: A join over tables that can be used as a satellite in a Data Vault
schema.

T, TMV - Time Variant (Historical) - Inmon: A projection of the logical active columns for an active
table, which shows current and historical data, and excludes deleted rows. An active column has a
number of distinct values greater than zero. This is only required for source tables; the non-source
DW Design Patterns all support Time Variance. This data is mutable; that is, it changes over time,
and a new row is created whenever the source data changes. There is some redundancy here, as
this is also defined by the column filter type; however, it is retained as it is now standard practice.

U, UCN - Unconditional (Audit) - Source: A projection of all columns for all tables, without filtering
the column set or the row set. Therefore, this will show both current and historical data, including
deleted rows. There is some redundancy here, as this is also defined by the column filter type;
however, it is retained as it is now standard practice.
This hierarchy shows what standards the logical projection types belong to. The projection designer
needs to ensure that the projection conforms to these projection types. For example, if the
projection is a fact table, it must begin with F_*, and no other prefix.
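As a rough sketch, this conformance check can be automated. The mapping below is taken from the projection type codes above; the function name is hypothetical, and only the prefix convention is checked.

```python
# Projection type codes (short and long forms) from the table above.
PROJECTION_PREFIXES = {
    "A": "Active", "ACT": "Active",
    "B": "Bridge", "BRD": "Bridge",
    "C": "Cube", "CUB": "Cube",
    "F": "Fact", "FCT": "Fact",
    "H": "Hub", "HUB": "Hub",
    "I": "Invariant", "INV": "Invariant",
    "K": "Key", "KEY": "Key",
    "L": "Link", "LNK": "Link",
    "M": "Metadata", "MTD": "Metadata",
    "N": "Nub", "NUB": "Nub",
    "O": "Other", "OTH": "Other",
    "P": "Pivot", "PVT": "Pivot",
    "R": "RDF", "RDF": "RDF",
    "S": "Satellite", "STL": "Satellite",
    "T": "Time Variant", "TMV": "Time Variant",
    "U": "Unconditional", "UCN": "Unconditional",
}

def projection_type(name):
    """Return the projection type implied by the name's prefix, or None.

    For example, a fact projection must begin with F_* (or FCT_*),
    and no other prefix, per the standard.
    """
    prefix = name.split("_", 1)[0].upper()
    return PROJECTION_PREFIXES.get(prefix)

print(projection_type("F_SALES"))       # Fact
print(projection_type("TMV_CUSTOMER"))  # Time Variant
```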
[Diagram: Logical Projection type hierarchy, grouping the types A Active, B Bridge, D Dimension,
E Ephemeral, F Fact, G Aggregate, H Hub, M Metadata, N Nub, O Other, P Pivot, S Satellite,
T Time Variant, U Unconditional and Z Work.]
G - Package - Oracle
L - DB Link - Oracle
Q - Queue - Oracle
R - Synonym - Oracle, Netezza: While a synonym can point to any other physical type, it still needs
to be managed as a particular type. Use *_A, if *_R is too restrictive.
T - Table - Oracle, Netezza: This covers sub-types such as temporary and external.
U - Type - Oracle
X - XML - Oracle
Z - Trigger - Oracle
The hierarchy below shows which environments physical projection types belong in.
[Diagram: Physical Projection type hierarchy, with A Abstract at the top, split into Shared Types and
Oracle Only Types. The types shown are F Function, G Package, L DB Link, M Materialized view,
P Procedure, Q Queue, R Synonym, S Sequence, T Table, U Type, V View, X XML and Z Trigger.]
Constraint Standard
Constraints will be implemented as a single character suffix. Note that these are RDBMS specific.
Constraint Code
Suffix - Constraint Type - Oracle/Netezza - Description
_C - Check - Oracle
Attribute Standard
Attribute Data Type (ADT)
These standards only apply to new attributes created for the DW. All source column names and data
types should remain unchanged. All non-source columns MUST have an attribute data type. The DB
data type can be varied where it makes sense. Some choices are provided. For example, an
enumeration can be INTEGER or VARCHAR. However, a date must be DATE. Similarly, length can also
be varied where it makes sense.
_B, _BL - Binary Large Object - BLOB / BINARY VARYING: Any Binary Large Object such as an image,
audio or video. For example, well-known binary (WKB), which is used to define geometric objects
in binary.

_Y, _YN - Boolean (aka, flag or indicator) - VARCHAR(1 CHAR) 'Y' | 'N', VARCHAR(5 CHAR) 'True' |
'False', BIT 0 | 1, NUMBER(1, 0) 0 | 1 / CHAR(1), CHAR(5), BYTEINT: An attribute type with only two
possible values: yes or no. It must be expressed as a question with a clear true or false value, e.g.
is_deleted_YN.
The longer attribute type (e.g. _AMT) may be used if there are fewer than 27 characters in the
name. Otherwise, the shorter attribute type must be used, even if that means truncating the name
to 28 characters.
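This suffix rule can be sketched as a small helper. The function name is hypothetical, and the exact thresholds are an assumption folded into a single max_len parameter, since the rule states both a 27-character test and a 28-character truncation limit.

```python
def apply_adt_suffix(name, short_suffix, long_suffix, max_len=28):
    """Apply an attribute data type suffix per the rule above: use the
    long suffix (e.g. _AMT) when the base name is short enough,
    otherwise fall back to the short suffix (e.g. _A), truncating so
    that the whole name fits within max_len characters.
    """
    if len(name) + len(long_suffix) <= max_len:
        return name + long_suffix
    keep = max_len - len(short_suffix)       # room left for the base name
    return name[:keep] + short_suffix

print(apply_adt_suffix("TOTAL_COST", "_A", "_AMT"))  # TOTAL_COST_AMT
```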
Oracle stores NUMBER values in a variable-length format, where the number of bytes required is:
ROUND((length(p) + s) / 2) + 1
where s equals zero if the number is positive and s equals 1 if the number is negative. Therefore, the
large number sizes will not impact storage, as they are not fixed.
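The formula can be checked with a small helper. The function name is hypothetical; note that Oracle's ROUND rounds halves away from zero, so the sketch uses an explicit half-up rounding rather than Python's round(), which rounds halves to even.

```python
import math

def oracle_number_bytes(precision, negative=False):
    """Storage in bytes for an Oracle NUMBER, per the formula above:
    ROUND((length(p) + s) / 2) + 1, where length(p) is the number of
    significant decimal digits and s is 1 for negative values, else 0.
    Half-up rounding mimics Oracle's ROUND for .5 cases.
    """
    s = 1 if negative else 0
    return math.floor((precision + s) / 2 + 0.5) + 1

print(oracle_number_bytes(10))  # 6 bytes for a 10-digit positive number
```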
SOURCE_CHANGED_ON_TS The date and time when the record was last modified in the source
system.
SOURCE_CREATED_ON_TS The date and time when the record was initially created in the
source system.
EFFECTIVE_AT_TS This column stores the date and time at which the record
represents a true value. A value is either assigned by the Unified
Data Store or extracted from the source. Note that this column is
not needed on most tables. It will only be on event type tables, and
certain kinds of fact tables.
EFFECTIVE_FROM_TS This column stores the date and time from which the record
represents a true value. A value is either assigned by the Unified
Data Store or extracted from the source. This column will be on all
SCD2 tables.
EFFECTIVE_TO_TS This column stores the date and time after which the record does
not represent a true value. The EFFECTIVE_TO_TS of the previous
row MUST = EFFECTIVE_FROM_TS of the next row – 1 TIME UNIT. A
value is either assigned by the Unified Data Store or extracted from
the source. This column will be on all SCD2 tables.
IS_DELETED_YN This boolean indicates the deletion status of the record in the
source system. A value of Y indicates the record is deleted from
the source system and logically deleted from the source aligned
layer. A value of N indicates that the record is active.
PROCESS_INSERT_ID System field. This column is the unique identifier for the specific
ETL batch process used to insert this data row. Both
PROCESS_INSERT_ID and PROCESS_UPDATE_ID are necessary in
order to be able to easily back out incorrectly loaded data.
PROCESS_UPDATE_ID This column is the unique identifier for the specific process used to
update this row. Both PROCESS_INSERT_ID and
PROCESS_UPDATE_ID are necessary in order to be able to easily
back out incorrectly loaded data.
<ForeignKeyName>_DISTRIBUTION_ID, <ForeignKeyName>_DSTR_I ForeignKeyName is the name of
the Primary Key on the table that is being used for distribution over the Netezza nodes.
TRANSACTION_TYPE_C This code indicates the kind of load transaction used for this
record. There are 4 possible values: I (Insert), U (Update), D
(Delete), R (Refresh).
A flag IS_CURRENT_YN can be added to views, but it will not be needed on the physical tables, as
this can be derived from the high date on EFFECTIVE_TO_TS.
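The EFFECTIVE_FROM_TS/EFFECTIVE_TO_TS contract and the IS_CURRENT_YN derivation can be sketched as below. The high-date value and function names are assumptions; the one-second quantum follows the Time Standard later in this document.

```python
from datetime import datetime, timedelta

HIGH_DATE = datetime(9999, 12, 31, 23, 59, 59)  # assumed high-date convention
TIME_QUANTUM = timedelta(seconds=1)             # DW time quantum (see Time Standard)

def is_current(effective_to_ts):
    """Derive the IS_CURRENT_YN flag from the high date, as described above."""
    return "Y" if effective_to_ts == HIGH_DATE else "N"

def check_scd2_contiguity(rows):
    """Check that each row's EFFECTIVE_TO_TS equals the next row's
    EFFECTIVE_FROM_TS minus one time quantum, per the standard.
    rows: list of (effective_from_ts, effective_to_ts) tuples, ordered by time.
    """
    for (_, prev_to), (next_from, _) in zip(rows, rows[1:]):
        if prev_to != next_from - TIME_QUANTUM:
            return False
    return True

# Hypothetical SCD2 history for one business key.
history = [
    (datetime(2020, 1, 1), datetime(2020, 6, 30, 23, 59, 59)),
    (datetime(2020, 7, 1), HIGH_DATE),
]
print(check_scd2_contiguity(history))  # True
print(is_current(history[-1][1]))      # Y
```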
The primary key standard is needed to support the automatic key matching in Tableau and other
tools.
Foreign Key Name - <Logical Table name>_I: The only exception is when a child table has two
relationships to the parent table. In this case, the most important relationship should have the
same name. All other keys will have a prefix.

Foreign Unique Column Name - <Logical TableName> Unique <n> ID: This acts like a "foreign key"
column that uses the nth unique key of a parent table. N ranges from 1 to 9. These key types must
also match easily in Tableau.

Unique Column Name - <Logical TableName> Unique <n> ID: This is the nth unique key for this
table. N ranges from 1 to 9. This kind of column can also be called an Alternate Primary Key.
These are global rules for all names, whether they are used for schemas, projections, attributes, etc.
For Oracle, MaxNameLength = 30, and Netezza, MaxNameLength = 128. Restrict the length to
(MaxNameLength - 4) chars, so that 2 chars can be prefixed and 2 chars can be suffixed e.g. F_*_T
(for fact table).
1. If the name length is 26 or less, then use the name as is, after taking out illegal characters
e.g. "Unique GNAFs incl Proposed" -> "Unique GNAFs incl Proposed" or
UNIQUE_GNAFS_INCL_PROPOSED
3. Else de-duplicate, disemvowel, truncate all words to a max of 4 letters and truncate whole
name <= 26 e.g. "ERP Actual Burdened Total Cost Amount Total" -> "ERP Actl Brdn Ttl Cst Amn" or
ERP_ACTL_BRDN_TTL_CST_AMN
De-duplication removes repeated consonants, e.g. Access -> ACCSS -> ACS.
Truncation means removing all but the first 4 letters, e.g. Burdened -> BRDND -> BRDN.
However, only use this approach when the name is approaching the Max Name Length (e.g. 26 for
Oracle). For example, if the Logical name is Appointment, turning this into APNT may create
confusion, as this could also be Apparent, or many other words starting with Ap*n*t. Use this
approach wisely.
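The shortening rules above can be sketched as follows. This is only a rough implementation: the order of de-duplication versus disemvowelling, and the keeping of each word's leading letter (so that ERP stays ERP and Actual becomes ACTL rather than CTL), are assumptions inferred from the worked examples; the pure Disemvowel definition earlier removes all vowels.

```python
import re

VOWELS = set("AEIOU")

def disemvowel(word):
    """Remove vowels from an uppercase word, keeping the leading letter
    so that e.g. ERP and ACTUAL -> ACTL stay recognisable (assumption)."""
    return word[0] + "".join(c for c in word[1:] if c not in VOWELS)

def dedupe(word):
    """De-duplicate repeated consonants, e.g. ACCSS -> ACS."""
    return re.sub(r"([B-DF-HJ-NP-TV-Z])\1+", r"\1", word)

def shorten(name, max_len=26):
    """Shorten a logical name per the rules above: uppercase and replace
    illegal characters with underscores; if the result is still too long,
    de-duplicate, disemvowel and truncate each word to 4 letters, then
    truncate the whole name to max_len characters."""
    words = re.findall(r"[A-Za-z0-9]+", name.upper())
    full = "_".join(words)
    if len(full) <= max_len:
        return full
    words = [disemvowel(dedupe(w))[:4] for w in words]
    return "_".join(words)[:max_len]

print(shorten("Unique GNAFs incl Proposed"))  # UNIQUE_GNAFS_INCL_PROPOSED
```

Note that the long worked example in the text ("ERP Actl Brdn Ttl Cst Amn") is hand-shortened and is not exactly reproduced by any simple mechanical rule; this sketch gets close (ERP_ACTL_BRDN_TTL_CST_...) but the final judgement call stays with the designer, as the text advises.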
Time Standard
Time Quantum or Time Delta
The DW time quantum will be a second. That is, this will be the smallest unit of time in the data.
The low date will be the start time for Epoch or UNIX time; that is, 1 January 1970 (UTC), or
01-01-1970 00:00:00. However, in general, this low date should not be used. Instead, the low date of a row
should be the date that the row was first created in the source system. Only use low date when a
date is mandatory, and the data is missing from the source system.
Time Zone
This is EST or Australian Eastern Standard Time.
Each row is effective from its EFFECTIVE_FROM_TS to its EFFECTIVE_TO_TS, which must equal the
next EFFECTIVE_FROM_TS – 1 TIME QUANTUM. This is a fundamental standard that must be
adhered to.
Platform Constraints
Summary
Most of these standards are dictated by platform constraints. In many cases this is obvious: the set
of Physical Projection Types is clearly dictated by the Oracle and Netezza DB engines. However,
there are some other standards that are more subtle. These are listed below.
Upper Case only in Target - Datastage: This ETL tool does not support mapping to case sensitive
object names, so physical tables loaded by Datastage will be in upper case. Note that this does not
apply to the CDC tool.
Column count limit - Oracle: Oracle has a 1,000 column limit for tables and views.
Column size limit - Oracle: Oracle has a 4K column size limit for VARCHAR.
Row size limit - Netezza: Netezza has a 65,535 byte total row size limit. This is the sum of the
column lengths in a row.
Column count limit - Netezza: Netezza has a 1,600 column limit for tables and views.
Oracle
The following list of rules applies to both quoted and nonquoted Schema Object Name identifiers
unless otherwise indicated:
2. Nonquoted identifiers cannot be Oracle Database reserved words. Quoted identifiers can be
reserved words, although this is not recommended.
3. Oracle has many pseudo-reserved words with special meanings, such as DIMENSION,
SEGMENT, ALLOCATE, DISABLE, and so forth. These words are not reserved, but because Oracle uses
them internally in specific ways, using them may lead to unpredictable results.
5. Nonquoted identifiers must begin with an alphabetic character from your database character set.
Quoted identifiers can begin with any character.
6. Nonquoted identifiers can contain only alphanumeric characters from your database character set
and the underscore (_), dollar sign ($), and pound sign (#). Oracle strongly discourages you from
using $ and # in nonquoted identifiers.
7. Within a namespace, no two objects can have the same name. The following schema objects
share one namespace: Tables, Views, Sequences, Private synonyms, Stand-alone procedures, Stand-
alone stored functions, Packages, Materialized views and User-defined types. Each of the following
schema objects has its own namespace: Indexes, Constraints, Clusters, Database triggers, Private
database links, Dimensions. Because tables and views are in the same namespace, a table and a view
in the same schema cannot have the same name. However, tables and indexes are in different
namespaces. Therefore, a table and an index in the same schema can have the same name.
Each of the following nonschema objects also has its own namespace: User roles, Public synonyms,
Public database links, Tablespaces, Profiles, Parameter files (PFILEs) and server parameter files
(SPFILEs). Because the objects in these namespaces are not contained in schemas, these namespaces
span the entire database.
8. Nonquoted identifiers are not case sensitive. Oracle interprets them as uppercase. Quoted
identifiers are case sensitive.
9. Columns in the same table or view cannot have the same name. However, columns in different
tables or views can have the same name.
10. Procedures or functions contained in the same package can have the same name, if their
arguments differ in number or datatype. Creating multiple procedures or functions with the same
name in the same package with different arguments is called overloading the procedure or
function.
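The nonquoted-identifier rules above can be sketched as a simple validator. The function name is hypothetical, the reserved-word set is a tiny illustrative sample of Oracle's full list, and the character class assumes the ASCII subset of the database character set.

```python
import re

# A few of Oracle's many reserved words, for illustration only.
RESERVED = {"SELECT", "TABLE", "WHERE", "GROUP", "ORDER"}

def valid_nonquoted_identifier(name, max_len=30):
    """Check a nonquoted Oracle identifier against the rules above:
    begins with a letter; contains only alphanumerics, _, $ and #
    ($ and # being discouraged); is not a reserved word; and fits
    the 30-character limit that this naming standard assumes."""
    if len(name) == 0 or len(name) > max_len:
        return False
    if name.upper() in RESERVED:
        return False
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_$#]*", name) is not None

print(valid_nonquoted_identifier("F_SALES_T"))  # True
print(valid_nonquoted_identifier("_SALES"))     # False: must begin with a letter
```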
Netezza
Netezza objects include tables, views, and columns. Follow these naming conventions:
2. A name must begin with a letter (A through Z), a diacritic mark, or a non-Latin character (200-377
octal).
3. A name cannot begin with an underscore (_). Leading underscores are reserved for system
objects.
4. Names without quotes are not case sensitive. For example, CUSTOMER and Customer are the
same, but object names are converted to lowercase when they are stored in the Netezza database.
However, if a name is enclosed in quotation marks, then it is case sensitive.
For optimal Netezza performance, distribution keys should be converted to integer data types.