
DataStage

Enterprise Edition
Proposed Course Agenda

Day 1: Review of EE Concepts, Sequential Access, Best Practices, DBMS as Source
Day 2: EE Architecture, Transforming Data, DBMS as Target, Sorting Data
Day 3: Combining Data, Configuration Files, Extending EE, Meta Data in EE
Day 4: Job Sequencing, Testing and Debugging
The Course Material

Course Manual

Exercise Files and Exercise Guide

Online Help
Using the Course Material

Suggestions for learning


Take notes
Review previous material
Practice
Learn from errors
Intro
Part 1

Introduction to DataStage EE
What is DataStage?

Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments
DataStage Server and Clients
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage

Define global and project properties in Administrator
Import meta data into Manager
Build job in Designer
Compile the job in Designer
Validate, run, and monitor in Director
DataStage Projects
Quiz: True or False?

DataStage Designer is used to build and compile your ETL jobs
Manager is used to execute your jobs after you build them
Director is used to execute your jobs after you build them
Administrator is used to set global and project properties
Intro
Part 2

Configuring Projects
Module Objectives
After this module you will be able to:
Explain how to create and delete projects
Set project properties in Administrator
Set EE global properties in Administrator
Project Properties

Projects can be created and deleted in Administrator
Project properties and defaults are set in Administrator
Setting Project Properties

To set project properties, log onto Administrator, select your project, and then click Properties
Licensing Tab
Projects General Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
Intro
Part 3

Managing Meta Data


Module Objectives
After this module you will be able to:
Describe the DataStage Manager components and
functionality
Import and export DataStage objects
Import metadata for a sequential file
What Is Metadata?

(Diagram: data flows from Source through Transform to Target; the meta data describing both sides is stored in the Meta Data Repository.)
DataStage Manager
Manager Contents

Metadata describing sources and targets: table definitions
DataStage objects: jobs, routines, table definitions, etc.
Import and Export

Any object in Manager can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers
Export Procedure

In Manager, click Export > DataStage Components
Select DataStage objects for export
Specify the type of export: DSX or XML
Specify the file path on the client machine
Quiz: True or False?

You can export DataStage objects such as jobs, but you can't export metadata, such as the field definitions of a sequential file.
Quiz: True or False?

The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
Exporting DataStage Objects
Exporting DataStage Objects
Import Procedure

In Manager, click Import > DataStage Components
Select DataStage objects for import
Importing DataStage Objects
Import Options
Exercise

Import DataStage Component (table definition)


Metadata Import

Import format and column definitions from sequential files
Import relational table column definitions
Imported as table definitions
Table definitions can be loaded into job stages
Sequential File Import Procedure

In Manager, click Import > Table Definitions > Sequential File Definitions
Select the directory containing the sequential file and then the file
Select the Manager category
Examine the format and column definitions and edit if necessary
Manager Table Definition
Importing Sequential Metadata
Intro
Part 4

Designing and Documenting Jobs


Module Objectives
After this module you will be able to:
Describe what a DataStage job is
List the steps involved in creating a job
Describe links and stages
Identify the different types of stages
Design a simple extraction and load job
Compile your job
Create parameters to make your job flexible
Document your job
What Is a Job?

Executable DataStage program
Created in DataStage Designer, but can use components from Manager
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)
Job Development Overview

In Manager, import metadata defining sources and targets
In Designer, add stages defining data extractions and loads
Add Transformers and other stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
Designer Work Area
Designer Toolbar

Provides quick access to the main functions of Designer

Show/hide metadata markers

Job properties
Compile
Tools Palette
Adding Stages and Links

Stages can be dragged from the tools palette or from the stage type branch of the repository view
Links can be drawn from the tools palette or by right-clicking and dragging from one stage to another
Sequential File Stage

Used to extract data from, or load data to, a sequential file
Specify the full path to the file
Specify a file format: fixed width or delimited
Specify column definitions
Specify the write action
Job Creation Example Sequence

Brief walkthrough of procedure


Presumes meta data already loaded in repository
Designer - Create New Job
Drag Stages and Links Using the Palette
Assign Meta Data
Editing a Sequential Source Stage
Editing a Sequential Target
Transformer Stage

Used to define constraints, derivations, and column mappings
A column mapping maps an input column to an output column
In this module we will just define column mappings (no derivations)
Transformer Stage Elements
Create Column Mappings
Creating Stage Variables
Result
Adding Job Parameters

Makes the job more flexible


Parameters can be:
Used in constraints and derivations
Used in directory and file names

Parameter values are determined at run time
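For illustration (the parameter name and file name here are made up, not from the course exercises), a parameter is referenced in a stage property using the #...# notation:

#SourceDir#/customers.txt

At run time, Director prompts for (or defaults) the value of SourceDir, and the stage reads the resolved path.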


Adding Job Documentation

Job Properties
Short and long descriptions
Shows in Manager

Annotation stage
Is a stage on the tool palette
Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with
Documentation
Compiling a Job
Errors or Successful Message
Intro
Part 5

Running Jobs
Module Objectives
After this module you will be able to:
Validate your job
Use DataStage Director to run your job
Set run options
Monitor your job's progress
View job log messages
Prerequisite to Job Execution

A successful compile result from Designer


DataStage Director

Can schedule, validate, and run jobs
Can be invoked from DataStage Manager or Designer
Tools > Run Director
Running Your Job
Run Options: Parameters and Limits
Director Log View
Message Details are Available
Other Director Functions

Schedule job to run on a particular date/time


Clear job log
Set Director options
Row limits
Abort after x warnings
Module 1

DSEE: DataStage EE Review
Ascential's Enterprise Data Integration Platform

(Diagram: a Command & Control layer sits over three phases - DISCOVER (Data Profiling: gather relevant information for enterprise applications), PREPARE (Data Quality: cleanse, correct, and match input data), and TRANSFORM (Extract, Transform, Load: standardize and enrich data and load to targets) - moving data from ANY SOURCE to ANY TARGET (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server applications, Web services, BI/Analytics, data warehouse, other apps), all built on Parallel Execution and Meta Data Management.)

Course Objectives

You will learn to:


Build DataStage EE jobs using complex logic
Utilize parallel processing techniques to increase job performance
Build custom stages based on application needs

Course emphasis is:


Advanced usage of DataStage EE
Application job development
Best practices techniques
Course Agenda

Day 1: Review of EE Concepts, Sequential Access, Standards, DBMS Access
Day 2: EE Architecture, Transforming Data, Sorting Data
Day 3: Combining Data, Configuration Files
Day 4: Extending EE, Meta Data Usage, Job Control, Testing
Module Objectives

Provide a background for completing work in the DSEE course
Tasks:
Review concepts covered in the DSEE Essentials course

Skip this module if you recently completed the DataStage EE Essentials modules
Review Topics

DataStage architecture
DataStage client review
Administrator
Manager
Designer
Director

Parallel processing paradigm


DataStage Enterprise Edition (DSEE)
Client-Server Architecture
(Diagram: the DataStage clients - Designer, Director, Manager, Administrator - run on Microsoft Windows NT/2000/XP and connect to the Server and Repository, which run on Microsoft Windows NT or UNIX. The engine provides Command & Control, Parallel Execution, and Meta Data Management, moving data from ANY SOURCE to ANY TARGET (CRM, ERP, SCM, RDBMS, BI/Analytics, real-time, client-server, Web services, data warehouse, other apps) through Discover, Prepare (extract, cleanse), Transform (integrate), and Extend phases.)

Process Flow

Administrator - add/delete projects, set defaults
Manager - import meta data, back up projects
Designer - assemble jobs, compile, and execute
Director - execute jobs, examine job run logs
Administrator - Licensing and Timeout
Administrator - Project Creation/Removal

Functions specific to a project
Administrator Project Properties

RCP for parallel jobs should be enabled

Variables for parallel processing
Administrator - Environment Variables

Variables are category specific
OSH is what is run by the EE Framework
DataStage Manager
Export Objects to MetaStage

Push meta data to MetaStage
Designer Workspace

Can execute the job from Designer
DataStage Generated OSH

The EE Framework runs OSH
Director Executing Jobs

Messages from the previous run are shown in a different color
Stages

Can now customize the Designer's palette
Select desired stages and drag to Favorites
Popular Developer Stages

Row Generator
Peek
Row Generator

Can build test data
Edit row in the Columns tab
Repeatable property
Peek

Displays field values
Will be displayed in the job log or sent to a file
Skip records option
Can control the number of records to be displayed
Can be used as a stub stage for iterative development (more later)
Why EE is so Effective

Parallel processing paradigm
More hardware, faster processing
Level of parallelization is determined by a configuration file read at runtime
Emphasis on memory
Data is read into memory and lookups are performed like a hash table
Parallel Processing Systems

DataStage EE enables parallel processing: executing your application on multiple CPUs simultaneously
If you add more resources (CPUs, RAM, and disks), you increase system performance

(Diagram: example system containing 6 CPUs (or processing nodes) and disks, numbered 1-6.)
Scalable Systems: Examples

Three main types of scalable systems:
Symmetric Multiprocessors (SMP): shared memory and disk
Clusters: UNIX systems connected via networks
MPP: Massively Parallel Processing
SMP: Shared Everything

Multiple CPUs with a single operating system
Programs communicate using shared memory
All CPUs share system resources (OS, memory with a single linear address space, disks, I/O)

When used with Enterprise Edition:
Data transport uses shared memory
Simplified startup

Enterprise Edition treats NUMA (Non-Uniform Memory Access) as plain SMP
Traditional Batch Processing

(Diagram: operational and archived source data pass through Transform, Clean, and Load steps into the data warehouse, writing to and reading from disk between every step.)
Traditional approach to batch processing:
Write to disk and read from disk before each processing operation
Sub-optimal utilization of resources
a 10 GB stream leads to 70 GB of I/O
processing resources can sit idle during I/O
Very complex to manage (lots and lots of small jobs)
Becomes impractical with big data volumes
disk I/O consumes the processing
terabytes of disk required for temporary staging
Pipeline Multiprocessing

Data Pipelining
Transform, clean and load processes are executing simultaneously on the same processor
rows are moving forward through the flow

(Diagram: operational and archived source data flow through Transform, Clean, and Load running simultaneously into the data warehouse, with no disk in between.)

Start a downstream process while an upstream process is still running.
This eliminates intermediate storing to disk, which is critical for big data.
This also keeps the processors busy.
Still has limits on scalability
Think of a conveyor belt moving the rows from process to process!
Partition Parallelism

Data Partitioning
Break up big data into partitions
Run one partition on each processor
4X faster on 4 processors
With data big enough: 100X faster on 100 processors
This is exactly how parallel databases work!
Data partitioning requires applying the same transform to all partitions:
Aaron Abbott and Zygmund Zorn undergo the same transform

(Diagram: source data is split by key range - A-F, G-M, N-T, U-Z - across Nodes 1-4, each running its own Transform.)
Combining Parallelism Types

Putting It All Together: Parallel Dataflow

(Diagram: source data is partitioned across nodes, then Transform, Clean, and Load run as a pipeline on each partition into the data warehouse - pipelining and partitioning combined.)
Repartitioning

Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly

(Diagram: source data is partitioned (A-F, G-M, N-T, U-Z) into Transform, repartitioned on the fly into Clean, and repartitioned again into Load, which feeds the data warehouse - for example partitioning by customer last name, then by customer zip code, then by credit card number - combining pipelining and partitioning.)

Without landing to disk!

EE Program Elements

Dataset: uniform set of rows in the Framework's internal representation
- Three flavors:
1. file sets (*.fs): stored on multiple Unix files as flat files
2. persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet Stage
3. virtual (*.v): links, in Framework format, NOT stored on disk
- The Framework processes only datasets; hence the possible need for import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set

Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file)
- All the partitions of a dataset follow the same schema: that of the dataset
DataStage EE Architecture

DataStage provides the data integration platform; the Orchestrate Framework provides application scalability.

(Diagram: an Orchestrate program is a sequential dataflow - flat files and relational data are imported, cleaned (Clean 1, Clean 2), merged, and analyzed. The Orchestrate Application Framework and Runtime System adds centralized error handling and event logging, configuration files, performance visualization, parallel access to data in RDBMS and in files, parallel pipelining, inter-node communications, and parallelization of operations.)
DataStage Enterprise Edition:


Best-of-breed scalable data integration platform
No limitations on data volumes or throughput
Introduction to DataStage EE

DSEE:
Automatically scales to fit the machine
Handles data flow among multiple CPUs and disks

With DSEE you can:


Create applications for SMPs, clusters and MPPs
Enterprise Edition is architecture-neutral
Access relational databases in parallel
Execute external applications in parallel
Store data across multiple disks and nodes
Job Design vs. Execution

The developer assembles the data flow using the Designer and gets parallel access, propagation, transformation, and load.
The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file.
No need to modify or recompile the design.
Partitioners and Collectors

Partitioners distribute rows into partitions
implement data-partition parallelism
Collectors = inverse partitioners
Live on input links of stages running:
in parallel (partitioners)
sequentially (collectors)
Use a choice of methods
Example Partitioning Icons

partitioner
Exercise

Complete exercises 1-1, 1-2, and 1-3


Module 2

DSEE Sequential Access


Module Objectives

You will learn to:


Import sequential files into the EE Framework
Utilize parallel processing techniques to increase sequential file access performance
Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages
Manage partitioned data stored by the Framework
Types of Sequential Data Stages

Sequential
Fixed or variable length
File Set
Lookup File Set
Data Set
Sequential Stage Introduction

The EE Framework processes only datasets
For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations; this is performed by import and export OSH operators generated by Sequential or FileSet stages
During import or export, DataStage performs format translations into, or out of, the EE internal format
Data is described to the Framework in a schema (see the example below)
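A minimal schema sketch for a comma-delimited flat file (the field names and the exact record-level property list are illustrative, not taken from the course data):

record {final_delim=end, delim=',', quote=double}
(
    custid: int32;
    custname: string[max=30];
    orderdate: date;
)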
How the Sequential Stage Works

Generates Import/Export operators, depending on whether the stage is a source or a target
Performs direct C++ file I/O streams
Using the Sequential File Stage

Both import and export of general files (text, binary) are performed by the Sequential File Stage.

Importing/Exporting Data

Data import: converts external data into the EE internal format
Data export: converts data from the EE internal format back out
Working With Flat Files

Sequential File Stage


Normally will execute in sequential mode
Can be parallel if reading multiple files (file pattern option)
Can use multiple readers within a node
DSEE needs to know
How file is divided into rows
How row is divided into columns
Processes Needed to Import Data

Recordization
Divides input stream into records
Set on the format tab

Columnization
Divides the record into columns
Default set on the format tab but can be overridden on
the columns tab
Can be incomplete if using a schema, or not even specified in the stage if using RCP
File Format Example

Record delimiter: nl
Field delimiter: comma

Field, Field, Field, Last field nl       (Final Delimiter = end)
Field, Field, Field, Last field, nl      (Final Delimiter = comma)
Sequential File Stage

To set the properties, use the stage editor
Pages (General, Input/Output)
Tabs (Format, Columns)

Sequential stage link rules:
One input link
One output link (except for reject link definition)
One reject link
Will reject any records not matching the meta data in the column definitions
Job Design Using Sequential Stages

Stage categories
General Tab Sequential Source

Multiple output links
Show records
Properties Multiple Files

Click to add more files having the same meta data.
Properties - Multiple Readers

The multiple readers option allows you to set the number of readers
Format Tab

File into records
Record into columns
Read Methods
Reject Link

Reject mode = Output
Source:
All records not matching the meta data (the column definitions)
Target:
All records that are rejected for any reason
Meta data - one column, data type = raw
File Set Stage

Can read or write file sets
Files suffixed by .fs
File set consists of:
1. Descriptor file - contains location of raw data files plus meta data
2. Individual raw data files
Can be processed in parallel

File Set Stage Example

Descriptor file
File Set Usage

Why use a file set?
2 GB limit on some file systems
Need to distribute data among nodes to prevent overruns
If used in parallel, runs faster than a sequential file
Lookup File Set Stage

Can create lookup file sets


Usually used in conjunction with Lookup stages
Lookup File Set > Properties

Key column specified
Key column dropped in descriptor file
Data Set

Operating system (Framework) file
Suffixed by .ds
Referred to by a control file
Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in a set of linked jobs
Persistent Datasets

Accessed from/to disk with the DataSet Stage.
Two parts:
Descriptor file (e.g. input.ds): contains metadata and data location, but NOT the data itself, e.g.
record (
    partno: int32;
    description: string;
)
Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel
node1:/local/disk1/
node2:/local/disk2/
Quiz!

True or False?
Everything that has been data-partitioned must be collected in the same job.
Data Set Stage

Is the data partitioned?


Engine Data Translation

Occurs on import
From sequential files or file sets
From RDBMS

Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS

The engine is most efficient when processing internally formatted records (i.e. data contained in datasets)
Managing DataSets

GUI (Manager, Designer, Director): Tools > Data Set Management
Alternative methods:
orchadmin
Unix command-line utility
List records
Remove data sets (will remove all components)
dsrecords
Lists the number of records in a dataset
Data Set Management

Display data

Schema
Data Set Management From Unix

Alternative method of managing file sets and data sets
dsrecords
Gives a record count
Unix command-line utility
$ dsrecords ds_name
e.g. $ dsrecords myDS.ds
156999 records
orchadmin
Manages EE persistent data sets
Unix command-line utility
e.g. $ orchadmin rm myDataSet.ds
Exercise

Complete exercises 2-1, 2-2, 2-3, and 2-4.


Module 3

Standards and Techniques


Objectives

Establish standard techniques for DSEE development
Will cover:
Job documentation
Naming conventions for jobs, links, and stages
Iterative job design
Useful stages for job development
Using configuration files for development
Using environmental variables
Job parameters
Job Presentation

Document using the annotation stage
Job Properties Documentation

Organize jobs into categories

The description shows in DS Manager and MetaStage
Naming conventions

Stages named after the:
Data they access
Function they perform
DO NOT leave defaulted stage names like Sequential_File_0
Links named for the data they carry
DO NOT leave defaulted link names like DSLink3
Stage and Link Names

Stages and links renamed to the data they handle
Create Reusable Job Components

Use Enterprise Edition shared containers when feasible

Container
Use Iterative Job Design

Use a Copy or Peek stage as a stub
Test the job in phases - small first, then increasing in complexity
Use the Peek stage to examine records
Copy or Peek Stage Stub

Copy stage
Transformer Stage
Techniques
Suggestions:
Always include a reject link.
Always test for null values before using a column in a function.
Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.
Be aware of column and stage variable data types.
Often the user does not pay attention to the stage variable type.
Avoid type conversions.
Try to maintain the data type as imported.
The Copy Stage

With 1 link in, 1 link out:
the Copy Stage is the ultimate "no-op" (place-holder):
Partitioners
Sort / Remove Duplicates
Rename, Drop column
can be inserted on:
input link (Partitioning page): Partitioners, Sort, Remove Duplicates
output link (Mapping page): Rename, Drop

Sometimes replaces the Transformer:
Rename
Drop
Implicit type conversions
Link constraints - break up the schema
Developing Jobs

1. Keep it simple
Jobs with many stages are hard to debug and maintain.

2. Start small and build to the final solution
Use view data, copy, and peek.
Start from the source and work out.
Develop with a 1-node configuration file.

3. Solve the business problem before the performance problem
Don't worry too much about partitioning until the sequential flow works as expected.

4. If you have to write to disk, use a persistent data set.
Final Result
Good Things to Have in each Job

Use job parameters
Some helpful environment variables to add to job parameters:
$APT_DUMP_SCORE
Report OSH to the message log
$APT_CONFIG_FILE
Establishes runtime parameters to the EE engine, e.g. degree of parallelization (see the example below)
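Outside the GUI, the same variables can simply be exported in the shell before a run (the configuration file path below is hypothetical; any true/1 value enables the dump score report):

$ export APT_CONFIG_FILE=/opt/dsadm/configs/4node.apt
$ export APT_DUMP_SCORE=1

Adding them as job parameters instead lets you override the values from the Director run dialog.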
Setting Job Parameters

Click to add environment variables
DUMP SCORE Output

Setting APT_DUMP_SCORE yields:

Double-click the message to see the partitioner and collector
Mapping: node --> partition
Use Multiple Configuration Files

Make a set for 1X, 2X, ...
Use different ones for test versus production
Include as a parameter in each job
Exercise

Complete exercise 3-1


Module 4

DBMS Access
Objectives

Understand how DSEE reads and writes records to an RDBMS
Understand how to handle nulls on a DBMS lookup
Utilize this knowledge to:
Read and write database tables
Use database tables to lookup data
Use null handling options to clean data
Parallel Database Connectivity

Traditional Client-Server:
Each client application has only one connection to the parallel RDBMS
Only the RDBMS is running in parallel
Suitable only for small data volumes

Enterprise Edition:
The parallel server runs the APPLICATIONS (e.g. sort, load)
The application has parallel connections to the RDBMS
Suitable for large data volumes
Higher levels of integration possible
RDBMS Access
Supported Databases

Enterprise Edition provides high-performance, scalable interfaces for:
DB2
Informix
Oracle
Teradata
RDBMS Access

Automatically converts RDBMS table layouts to/from Enterprise Edition table definitions
RDBMS nulls converted to/from nullable field values
Support for standard SQL syntax for specifying:
field list for SELECT statement
filter for WHERE clause
Can write an explicit SQL query to access RDBMS
EE supplies additional information in the SQL query
RDBMS Stages

DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
RDBMS Usage

As a source
Extract data from table (stream link)
Extract as table, generated SQL, or user-defined SQL
User-defined can perform joins, access views
Lookup (reference link)
Normal lookup is memory-based (all table data read into memory)
Can perform one lookup at a time in DBMS (sparse option)
Continue/drop/fail options

As a target
Inserts
Upserts (Inserts and updates)
Loader
RDBMS Source Stream Link

Stream link
DBMS Source - User-defined SQL

Columns in the SQL statement must match the meta data in the Columns tab
Exercise

User-defined SQL
Exercise 4-1
DBMS Source Reference Link

Reject link
Lookup Reject Link

The Output option automatically creates the reject link
Null Handling

Must handle the null condition if the lookup record is not found and the continue option is chosen
Can be done in a Transformer stage
Lookup Stage Mapping

Link name
Lookup Stage Properties

Reference link

Must have the same column name in the input and reference links.
You will get the result of the lookup in the output column.
DBMS as a Target
DBMS As Target

Write Methods
Delete
Load
Upsert
Write (DB2)

Write mode for load method


Truncate
Create
Replace
Append
Target Properties

Generated code can be copied

Upsert mode determines options
Checking for Nulls

Use a Transformer stage to test for fields with null values (use the IsNull function)
In the Transformer, you can reject the row or load a default value (see the sketch below)
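A sketch of such a derivation (the link and column names are illustrative): if the lookup link is named lkpCust and the continue option left the looked-up column null, the output derivation could be

If IsNull(lkpCust.CustName) Then 'UNKNOWN' Else lkpCust.CustName

while a constraint such as IsNull(lkpCust.CustName) on a separate output link would instead divert the unmatched rows.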
Exercise

Complete exercise 4-2


Module 5

Platform Architecture
Objectives

Understand how the Enterprise Edition Framework processes data
You will be able to:
Read and understand OSH
Perform troubleshooting
Concepts

The Enterprise Edition Platform:
Script language - OSH (generated by the DataStage Parallel Canvas, and run by DataStage Director)
Communication - conductor, section leaders, players
Configuration files (only one active at a time; describes the hardware)
Meta data - schemas/tables
Schema propagation - RCP
EE extensibility - Buildop, Wrapper
Datasets (data in the Framework's internal representation)
DS-EE Stage Elements

EE Stages involve a series of processing steps

(Diagram: an EE stage consists of a partitioner, an input interface, the business logic, and an output interface. The business logic is a piece of application logic running against individual records, parallel or sequential. Example input data set schema: prov_num:int16; member_num:int8; custid:int32 - with a matching output data set schema.)
DSEE Stage Execution

Dual parallelism eliminates bottlenecks!
EE delivers parallelism in two ways:
Pipeline (producer/consumer)
Partition
Block buffering between components:
Eliminates the need for program load balancing
Maintains orderly data flow

Execution mode (sequential/parallel) is controlled by the Stage
default = parallel for most Ascential-supplied Stages
The developer can override the default mode
A parallel Stage inserts the default partitioner (Auto) on its input links
A sequential Stage inserts the default collector (Auto) on its input links
The developer can override the default:
execution mode (parallel/sequential) on the Stage > Advanced tab
choice of partitioner/collector on the Input > Partitioning tab
How Parallel Is It?

Degree of parallelism is determined by the configuration file
Total number of logical nodes in the default pool, or a subset if using "constraints"
Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage
OSH

DataStage EE GUI generates OSH scripts


Ability to view OSH turned on in Administrator
OSH can be viewed in Designer using job properties

The Framework executes OSH


What is OSH?
Orchestrate shell
Has a UNIX command-line interface
OSH Script

An osh script is a quoted string which specifies:
The operators and connections of a single Orchestrate step
In its simplest form, it is:
osh "op < in.ds > out.ds"
Where:
op is an Orchestrate operator
in.ds is the input data set
out.ds is the output data set
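As a concrete (made-up) instance using the copy operator and arbitrary dataset names:

$ osh "copy < customers_in.ds > customers_out.ds"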
OSH Operators

An OSH operator is an instance of a C++ class inheriting from APT_Operator
Developers can create new operators
Examples of existing operators:
Import
Export
RemoveDups
Enable Visible OSH in Administrator

Will be enabled for all projects
View OSH in Designer

Operator

Schema
OSH Practice

Exercise 5-1: instructor demo (optional)


Elements of a Framework Program
Operators
Datasets: set of rows processed by Framework
Orchestrate data sets:
persistent (terminal) *.ds, and
virtual (internal) *.v.
Also: flat file sets *.fs

Schema: data description (metadata) for datasets and links.


Datasets

Consist of partitioned data and schema
Can be persistent (*.ds) or virtual (*.v, link)
Overcome the 2 GB file limit

What you program (GUI): operator A writing to x.ds
What gets generated (OSH): $ osh "operator_A > x.ds"
What gets processed: operator A runs on Node 1 through Node 4, each partition writing its own data files of x.ds
Multiple files per partition; each file up to 2 GBytes (or larger)
Computing Architectures: Definition

Uniprocessor (dedicated disk):
single CPU with its own memory and disk
PC, workstation, single-processor server

SMP system - Symmetric Multiprocessor (shared memory and disk):
2 to 64 processors sharing memory and disks
IBM, Sun, HP, Compaq
majority of installations

Clusters and MPP systems (shared nothing):
2 to hundreds of processors, each node with its own CPU, memory, and disk
each node is a uniprocessor or SMP
MPP: IBM and NCR Teradata
Job Execution: Orchestrate

(Diagram: a Conductor process on the conductor node controls a Section Leader (SL) on each processing node, which in turn controls Player (P) processes.)

Conductor - the initial DS/EE process
Step composer
Creates Section Leader processes (one per node)
Consolidates messages and outputs them
Manages orderly shutdown

Section Leader
Forks Player processes (one per Stage)
Manages up/down communication

Players
The actual processes associated with Stages
Combined players: one process only
Send stderr to the SL
Establish connections to other players for data flow
Clean up upon completion

Communication:
- SMP: shared memory
- MPP: TCP
Working with Configuration Files

You can easily switch between config files:
'1-node' file - for sequential execution, lighter reports; handy for testing
'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
'BigN-nodes' file - aims at full data-partitioned parallelism
Only one file is active while a step is running
The Framework queries (first) the environment variable $APT_CONFIG_FILE
The number of nodes declared in the config file need not match the number of CPUs
The same configuration file can be used on development and target machines (see the example below)
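Switching is then just a matter of pointing $APT_CONFIG_FILE at a different file (the paths are illustrative):

$ export APT_CONFIG_FILE=/opt/dsadm/configs/1node.apt    # testing: sequential, lighter reports
$ export APT_CONFIG_FILE=/opt/dsadm/configs/4node.apt    # production: data-partitioned parallelism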
Scheduling
Nodes, Processes, and CPUs
DS/EE does not:
know how many CPUs are available
schedule

Definitions:
Nodes = number of logical nodes declared in the config file
Ops = number of operators (approx. the number of blue boxes in the visual output)
Processes = number of Unix processes
CPUs = number of available CPUs

Who knows what?
User: knows Nodes; does not know CPUs
Orchestrate: knows Nodes and Ops; creates Nodes * Ops processes; does not know CPUs
O/S: knows the processes and the CPUs

Who does what?
DS/EE creates (Nodes * Ops) Unix processes
The O/S schedules these processes on the CPUs
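As a worked instance (the numbers are illustrative): with a 4-node configuration file and a score containing 6 operators, DS/EE starts roughly 4 x 6 = 24 player processes, plus one section leader per node and the conductor; the operating system then schedules those processes onto however many CPUs actually exist.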
Configuring DSEE Node Pools
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
Configuring DSEE Disk Pools
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
Re-Partitioning

Parallel-to-parallel flow may incur reshuffling:
Records may jump between nodes

(Diagram: a partitioner placed between two parallel stages moves records between node 1 and node 2.)
Partitioning Methods

Auto
Hash
Entire
Range
Range Map
Collectors

Collectors combine partitions of a dataset into a single input stream to a sequential Stage
... data partitions

collector

sequential Stage
Collectors do NOT synchronize data
Partitioning and Repartitioning Are
Visible On Job Design
Partitioning and Collecting Icons

Partitioner Collector
Setting a Node Constraint in the GUI
Reading Messages in Director

Set APT_DUMP_SCORE to true
Can be specified as a job parameter
Messages are sent to the Director log
If set, the parallel job will produce a report showing the operators, processes, and datasets in the running job
Messages With APT_DUMP_SCORE = True
Exercise

Complete exercise 5-2


Module 6

Transforming Data
Module Objectives

Understand the ways DataStage allows you to transform data
Use this understanding to:
Create column derivations using user-defined code or system functions
Filter records based on business criteria
Control data flow based on data conditions
Transformed Data

Transformed data:
An outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields
May be comprised of system variables
Frequently uses functions performed on something (i.e. incoming columns)
Functions are divided into categories, e.g.:
Date and time
Mathematical
Logical
Null handling
More
Stages Review

Stages that can transform data


Transformer
Parallel
Basic (from Parallel palette)
Aggregator (discussed in later module)

Sample stages that do not transform data


Sequential
FileSet
DataSet
DBMS
Transformer Stage Functions

Control data flow


Create derivations
Flow Control

Separate records flow down links based on data conditions specified in Transformer stage constraints
The Transformer stage can filter records
Other stages can filter records but do not exhibit advanced flow control:
Sequential can send bad records down a reject link
Lookup can reject records based on lookup failure
Filter can select records based on data values
Rejecting Data

Reject option on the Sequential stage
Data does not agree with the meta data
Output consists of one column with binary data type

Reject links (from the Lookup stage) result from the drop option of the If Not Found property
Lookup failed
All columns go down the reject link (no column mapping option)

Reject constraints are controlled from the constraint editor of the Transformer
Can control column mapping
Use the Other/Log checkbox
Rejecting Data Example

Constraint: Other/log option
Property: If Not Found = Reject
Property: Reject Mode = Output
Transformer Stage Properties
Transformer Stage Variables

First of the transformer stage entities to execute
Execute in order from top to bottom
Can write a program by using one stage variable to point to the results of a previous stage variable
Multi-purpose:
Counters
Hold values from previous rows to make comparisons
Hold derivations to be used in multiple field derivations
Can be used to control execution of constraints
(See the sketch below)
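A common sketch (the link, column, and variable names are illustrative) uses two stage variables, evaluated top to bottom, to detect a key change on sorted input:

svIsNewKey:  If lnkIn.CustID <> svPrevKey Then 1 Else 0
svPrevKey:   lnkIn.CustID

Because svIsNewKey is evaluated first, it still sees the previous row's value in svPrevKey; constraints and derivations can then test svIsNewKey.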
Stage Variables

Show/Hide button
Transforming Data

Derivations
Using expressions
Using functions
Date/time

Transformer Stage Issues

Sometimes data requires sorting before the Transformer stage, e.g. when using a stage variable as an accumulator and needing to break on a change of column value
Checking for nulls
Checking for Nulls

Nulls can get introduced into the dataflow because of failed lookups and the way in which you choose to handle this condition
Can be handled in constraints, derivations, stage variables, or a combination of these
Transformer - Handling Rejects

Constraint rejects: the row goes to the reject link when all constraint expressions are false and "reject row" is checked
Transformer: Execution Order

Derivations in stage variables are executed first

Constraints are executed before derivations

Column derivations in earlier links are executed before later links

Derivations in higher columns are executed before lower columns


Parallel Palette - Two Transformers

Two transformers appear under Processing on the parallel palette: Transformer and Basic Transformer

Transformer:
Is the non-Universe transformer
Has a specific set of functions
No DS routines available

Basic Transformer:
Makes server-style transforms available on the parallel palette
Can use DS routines

Program in Basic for both transformers
Transformer Functions From
Derivation Editor

Date & Time


Logical
Null Handling
Number
String
Type Conversion
Exercise

Complete exercises 6-1, 6-2, and 6-3


Module 7

Sorting Data
Objectives

Understand DataStage EE sorting options
Use this understanding to create a sorted list of data to enable functionality within a Transformer stage
Sorting Data

Important because:
Some stages require sorted input
Some stages may run faster, e.g. Aggregator
Can be performed:
As an option within stages (use the Input > Partitioning tab and set partitioning to anything other than Auto)
As a separate stage (more complex sorts)
Sorting Alternatives

Alternative representation of same flow:


Sort Option on Stage Link
Sort Stage
Sort Utility

DataStage - the default
UNIX
Sort Stage - Outputs

Specifies how the output is derived


Sort Specification Options

Input link property:
Limited functionality
Max memory per partition is 20 MB, then spills to scratch

Sort Stage:
Tunable to use more memory before spilling to scratch
Note: spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
Removing Duplicates

Can be done by Sort stage


Use unique option

OR

Remove Duplicates stage


Has more sophisticated ways to remove duplicates
Exercise

Complete exercise 7-1


Module 8

Combining Data
Objectives

Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages
Use this understanding to create jobs that will:
Combine data from separate input streams
Aggregate data to form summary totals
Combining Data

There are two ways to combine data:

Horizontally:
Several input links; one output link (+ optional rejects) made of columns from different input links. E.g.:
Joins
Lookup
Merge

Vertically:
One input link; one output link with columns combining values from all input rows. E.g.:
Aggregator
Join, Lookup & Merge Stages

These "three Stages" combine two or more input


links according to values of user-designated "key"
column(s).

They differ mainly in:


Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)
Not all Links are Created Equal

Enterprise Edition distinguishes between:
- The Primary Input (Framework port 0)
- Secondary inputs - in some cases "Reference" (other ports)

Naming convention:
Joins: primary input (port 0) = Left; secondary input(s) (ports 1, ...) = Right
Lookup: primary input = Source; secondary input(s) = LU Table(s)
Merge: primary input = Master; secondary input(s) = Update(s)

Tip: check the "Input Ordering" tab to make sure the intended Primary is listed first
Join Stage Editor

Link order is immaterial for Inner and Full Outer joins (but VERY important for Left/Right Outer, and for Lookup and Merge)

One of four variants:
Inner
Left Outer
Right Outer
Full Outer

Several key columns are allowed
1. The Join Stage

Four types:
Inner
Left Outer
Right Outer
Full Outer

2 sorted input links, 1 output link
"left outer" on primary input, "right outer" on secondary input
Pre-sort makes joins "lightweight": few rows need to be in RAM
2. The Lookup Stage

Combines:
one source link with
one or more duplicate-free table links

no pre-sort necessary
allows multiple keys per LUT
flexible exception handling for source input rows with no match

(Diagram: a source input and one or more lookup tables (LUTs) enter the Lookup stage on ports 0, 1, 2, ...; output and reject links leave it.)
The Lookup Stage

Lookup tables should be small enough to fit into physical memory (otherwise there is a performance hit due to paging)
On an MPP you should partition the lookup tables using the entire partitioning method, or partition them the same way you partition the source link
On an SMP, no physical duplication of a lookup table occurs
The Lookup Stage

Lookup File Set
Like a persistent data set, only it contains metadata about the key
Useful for staging lookup tables

RDBMS lookup:
NORMAL - loads the table into an in-memory hash table first
SPARSE - issues a select for each row; might become a performance bottleneck
3. The Merge Stage
Combines:
one sorted, duplicate-free master (primary) link with
one or more sorted update (secondary) links
Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup)

Follows the Master-Update model:
A master row and one or more update rows are merged if they have the same value in the user-specified key column(s)
If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored)
Unmatched ("bad") master rows can be either kept or dropped
Unmatched ("bad") update rows in an input link can be captured in a "reject" link
Matched update rows are consumed
The Merge Stage

Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates can be captured

(Diagram: one master and one or more update links enter the Merge stage on ports 0, 1, 2; an output link and per-update reject links leave it.)
Synopsis: Joins, Lookup, & Merge

Model: Joins = RDBMS-style relational; Lookup = Source - in-RAM LU Table; Merge = Master - Update(s)
Memory usage: Joins = light; Lookup = heavy; Merge = light
Number and names of inputs: Joins = exactly 2 (1 left, 1 right); Lookup = 1 Source, N LU Tables; Merge = 1 Master, N Update(s)
Mandatory input sort: Joins = both inputs; Lookup = no; Merge = all inputs
Duplicates in primary input: Joins = OK (x-product); Lookup = OK; Merge = Warning!
Duplicates in secondary input(s): Joins = OK (x-product); Lookup = Warning!; Merge = OK only when N = 1
Options on unmatched primary: Joins = NONE; Lookup = [fail] | continue | drop | reject; Merge = [keep] | drop
Options on unmatched secondary: Joins = NONE; Lookup = NONE; Merge = capture in reject set(s)
On match, secondary entries are: Joins = reusable; Lookup = reusable; Merge = consumed
Number of outputs: Joins = 1; Lookup = 1 out (1 reject); Merge = 1 out (N rejects)
Captured in reject set(s): Joins = nothing (N/A); Lookup = unmatched primary entries; Merge = unmatched secondary entries
The Aggregator Stage

Purpose: perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions, e.g.:
count (nulls/non-nulls)
sum
max/min/range
The grouping method (hash table or pre-sort) is a performance issue
Grouping Methods

Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
doesn't require sorted data
good when the number of unique groups is small. The running tally for each group's aggregate calculations needs to fit easily into memory. Requires about 1 KB of RAM per group.
Example: average family income by state requires about 0.05 MB of RAM (50 groups x ~1 KB)
Sort: results for only a single aggregation group are kept in memory; when a new group is seen (the key value changes), the current group is written out
requires input sorted by the grouping keys
can handle unlimited numbers of groups
Example: average daily balance by credit card
Aggregator Functions

Sum
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation
Aggregator Properties
Aggregation Types

Aggregation types
Containers

Two varieties
Local
Shared

Local
Simplifies a large, complex diagram
Shared
Creates reusable object that many jobs can include
Creating a Container

Create a job
Select (loop) portions to containerize
Edit > Construct container > local or shared
Using a Container

Select as though it were a stage


Exercise

Complete exercise 8-1


Module 9

Configuration Files
Objectives

Understand how DataStage EE uses configuration files to determine parallel behavior
Use this understanding to:
Build an EE configuration file for a computer system
Change node configurations to support adding resources to processes that need them
Create a job that will change resource allocations at the stage level
Configuration File Concepts

Determines the processing nodes and the disk space connected to each node
When the system changes, you need only change the configuration file; no need to recompile jobs
When a DataStage job runs, the platform reads the configuration file
The platform automatically scales the application to fit the system
Processing Nodes Are

Locations on which the framework runs applications
A logical rather than physical construct
Do not necessarily correspond to the number of CPUs in your system
Typically one node for two CPUs
Can define one processing node for multiple physical nodes, or multiple processing nodes for one physical node
Optimizing Parallelism

Degree of parallelism is determined by the number of nodes defined
Parallelism should be optimized, not maximized
Increasing parallelism distributes the work load but also increases Framework overhead
Hardware influences the degree of parallelism possible
System hardware partially determines configuration
More Factors to Consider

Communication amongst operators:
Should be optimized by your configuration
Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or a high-speed link
SMP - leave some processors for the operating system
Desirable to equalize partitioning of data
Use an experimental approach:
Start with small data sets
Try different parallelism while scaling up data set sizes
Factors Affecting Optimal Degree of
Parallelism
CPU-intensive applications:
Benefit from the greatest possible parallelism
Disk-intensive applications:
Number of logical nodes equals the number of disk spindles being accessed
Configuration File

Text file containing string data that is passed to the Framework
Sits on the server side
Can be displayed and edited
Name and location are found in the environment variable APT_CONFIG_FILE
Components:
Node
Fast name
Pools
Resource
Node Options

Node name - name of a processing node used by EE
Typically the network name
Use the command uname -n to obtain the network name (see the example after this list)
Fastname
Name of the node as referred to by the fastest network in the system
Operators use the physical node name to open connections
NOTE: for SMP, all CPUs share a single connection to the network
Pools
Names of pools to which this node is assigned
Used to logically group nodes
Can also be used to group resources
Resource
Disk
Scratchdisk
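For example, run on each machine (the name returned below is made up):

$ uname -n
s1

The value returned is what goes into that node's fastname entry, as in the sample files that follow.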
Sample Configuration File

{
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets"
{pools "" }
resource scratchdisk
"/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}
}
Disk Pools

Disk pools allocate storage
pool "bigdata"
By default, EE uses the default pool, specified by ""
Sorting Requirements

Resource pools can also be specified for sorting:
The Sort stage looks first for scratch disk resources in a "sort" pool, and then in the default disk pool
Another Configuration File Example

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}
Resource Types

Disk
Scratchdisk
DB2
Oracle
Saswork
Sortwork
Can exist in a pool
Groups resources together
Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type


Building a Configuration File

Scoping the hardware:
Is the hardware configuration SMP, cluster, or MPP?
Define each node structure (an SMP would be a single node):
Number of CPUs
CPU speed
Available memory
Available page/swap space
Connectivity (network/back-panel speed)

Is the machine dedicated to EE? If not, what other applications are running on it?
Get a breakdown of the resource usage (vmstat, mpstat, iostat) - see the sketch after this list
Are there other configuration restrictions? E.g. the DB only runs on certain nodes and ETL cannot run on them?
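A rough sketch of gathering that breakdown with the standard Unix monitors (interval and count arguments vary by platform):

$ vmstat 5 12     # CPU, run queue, and memory, sampled every 5 seconds
$ mpstat 5 12     # per-processor utilization
$ iostat 5 12     # disk throughput per device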
Exercise

Complete exercise 9-1 and 9-2


Module 10

Extending DataStage EE
Objectives

Understand the methods by which you can add functionality to EE
Use this understanding to:
Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages
Build a DataStage EE job that uses the new stage
EE Extensibility Overview

Sometimes it will be to your advantage to leverage EE's extensibility. This extensibility includes:

Wrappers
Buildops
Custom Stages
When To Leverage EE Extensibility

Types of situations:
Complex business logic, not easily accomplished using standard EE stages
Reuse of existing C, C++, Java, COBOL, etc.
Wrappers vs. Buildop vs. Custom

Wrappers: good if you cannot or do not want to modify the application and performance is not critical
Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces
Custom (C++ coding using the framework API): good if you need custom coding and need dynamic input and output interfaces
Building Wrapped Stages

You can wrapper a legacy executable:
Binary
Unix command
Shell script
and turn it into an Enterprise Edition stage capable, among other things, of parallel execution
As long as the legacy executable is:
amenable to data-partition parallelism
no dependencies between rows
pipe-safe
can read rows sequentially
no random access to data
Wrappers (Cont'd)

Wrappers are treated as a black box
EE has no knowledge of the contents
EE has no means of managing anything that occurs inside the wrapper
EE only knows how to export data to and import data from the wrapper
The user must know at design time the intended behavior of the wrapper and its schema interface
If the wrappered application needs to see all records prior to processing, it cannot run in parallel.
LS Example

Can this command be wrappered?
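As a reminder of what the command itself produces (the directory and file names below are invented), ls writes one name per line to standard output, which maps naturally onto a single string output column:

$ ls /data/source_files
customers.txt
orders.txt
products.txt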


Creating a Wrapper

To create the ls stage

Used in this job ---


Wrapper Starting Point

Creating Wrapped Stages


From Manager:
Right-click on Stage Type
> New Parallel Stage > Wrapped

We will "wrapper" an existing Unix executable - the ls command
Wrapper - General Page

Name of stage

Unix command to be wrapped


The "Creator" Page

Conscientiously maintaining the Creator page for all your wrapped stages
will eventually earn you the thanks of others.
Wrapper Properties Page

If your stage will have properties, complete the Properties page

This will be the name of the property as it appears in your stage
Wrapper - Wrapped Page

Interfaces - input and output columns: these should first be entered into the table definitions meta data (DS Manager); let's do that now.
Interface schemas

Layout interfaces describe what columns the stage:
Needs for its inputs (if any)
Creates for its outputs (if any)
Should be created as tables with columns in Manager
Column Definition for Wrapper
Interface
How Does the Wrapping Work?

Define the schemas for export and import
The schemas become the interface schemas of the operator and allow for by-name column access

(Diagram: the input schema is exported to the UNIX executable via stdin or a named pipe; the executable's stdout, or a named pipe, is imported back using the output schema.)

QUIZ: Why does export precede import?
Update the Wrapper Interfaces

This wrapper will have no input interface, i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
Resulting Job

Wrapped stage
Job Run

Show file from Designer palette


Wrapper Story: Cobol Application

Hardware Environment:
IBM SP2, 2 nodes with 4 CPUs per node.
Software:
DB2/EEE, COBOL, EE
Original COBOL Application:
Extracted a source table, performed a lookup against a table in DB2, and loaded the results to a target table.
4 hours 20 minutes sequential execution
Enterprise Edition Solution:
Used EE to perform Parallel DB2 Extracts and Loads
Used EE to execute COBOL application in Parallel
EE Framework handled data transfer between
DB2/EEE and COBOL application
30 minutes 8-way parallel execution
Buildops

Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (like the wrapper)
Reasons to use Buildop include:
Speed / performance
Complex business logic that cannot be easily represented using existing stages
Lookups across a range of values
Surrogate key generation
Rolling aggregates
Build once and reuse everywhere within the project, no shared container necessary
Can combine functionality from different stages into one
BuildOps

The DataStage programmer encapsulates the business logic
The Enterprise Edition interface called buildop automatically performs the tedious, error-prone tasks: invoking the needed header files and building the necessary plumbing for a correct and efficient parallel execution
Exploits the extensibility of the EE Framework
BuildOp Process Overview

From Manager (or Designer):
Repository pane:
Right-click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}

"Build" stages from within Enterprise Edition
"Wrap" existing Unix executables
General Page

Identical to Wrappers, except: under the Build tab, your program!
Logic Tab for Business Logic

Enter business C/C++ logic and arithmetic in four pages under the Logic tab

The main code section goes in the Per-Record page - it will be applied to all rows

NOTE: code will need to be ANSI C/C++ compliant. If the code does not compile outside of EE, it won't compile within EE either!
Code Sections under Logic Tab

Temporary variables are declared [and initialized] here
Logic here is executed once BEFORE processing the FIRST row
Logic here is executed once AFTER processing the LAST row
I/O and Transfer

Under the Interface tab: Input, Output & Transfer pages

Output page (first line: output 0):
Write row
In-Repository Table Definition
Optional renaming of the output port from the default "out0"
'False' setting, so as not to interfere with Transfer
Input page: 'Auto Read', Read next row

I/O and Transfer

Transfer page (first line: transfer of index 0):
Transfer all columns from input to output
If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written
BuildOp Simple Example

Example - sumNoTransfer
Add input columns "a" and "b"; ignore other columns that might be present in the input
Produce a new "sum" column
Do not transfer input columns

(Diagram: input schema a:int32; b:int32 flows into sumNoTransfer, which outputs sum:int32.)

No Transfer

From Peek:

NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"

Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred

Compare with transfer ON


Transfer

TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"

Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
Columns vs.
Temporary C++ Variables

Columns:
DS-EE type
Defined in Table Definitions
Value refreshed from row to row

Temporary C++ variables:
C/C++ type
Need declaration (in the Definitions or Pre-Loop page)
Value persistent throughout the "loop" over rows, unless modified in code
Exercise

Complete exercise 10-1 and 10-2


Exercise

Complete exercises 10-3 and 10-4


Custom Stage

Reasons for a custom stage:
Add an EE operator not already in DataStage EE
Build your own operator and add it to DataStage EE

Uses the EE API
Use a Custom Stage to add the new operator to the EE canvas
Custom Stage

DataStage Manager > select the Stage Types branch > right-click

Custom Stage

Number of input and output links allowed
Name of the Orchestrate operator to be used
Custom Stage Properties Tab
The Result
Module 11

Meta Data in DataStage EE


Objectives

Understand how EE uses meta data, particularly schemas and runtime column propagation
Use this understanding to:
Build schema definition files to be invoked in DataStage jobs
Use RCP to manage meta data usage in EE jobs
Establishing Meta Data

Data definitions
Recordization and columnization
Fields have properties that can be set at the individual field level
Data types in the GUI are translated to types used by EE
Described as properties on the Format/Columns tabs (Outputs or Inputs pages), OR
Using a schema file (can be full or partial)

Schemas
Can be imported into Manager
Can be pointed to by some job stages (e.g. Sequential)
Data Formatting Record Level

Format tab
Meta data described on a record basis
Record level properties
Data Formatting Column Level

Defaults for all columns


Column Overrides

Edit row from within the columns tab


Set individual column properties
Extended Column Properties

Field and string settings
Extended Properties String Type

Note the ability to convert ASCII to EBCDIC


Editing Columns

Properties depend on the data type
Schema

An alternative way to specify column definitions for data used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage repository
Creating a Schema

Using a text editor
Follow the correct syntax for definitions
OR
Import from an existing data set or file set
In DataStage Manager: Import > Table Definitions > Orchestrate Schema Definitions
Select the checkbox for a file with .fs or .ds
Importing a Schema

The schema location can be on the server or the local workstation
Data Types

Date
Decimal
Floating point
Integer
String
Time
Timestamp
Vector
Subrecord
Raw
Tagged
Runtime Column Propagation

DataStage EE is flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).
RCP is always on at runtime.
Design-time and compile-time column mapping enforcement applies when RCP is off.
RCP is off by default.
Enable first at the project level (Administrator project properties)
Enable at the job level (job properties General tab)
Enable at the Stage (link Output Columns tab)
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level

Go to the output link's Columns tab
For the Transformer, you can find the output link's Columns tab by first going to stage properties
Using RCP with Sequential Stages

To utilize runtime column propagation in the Sequential stage you must use the "use schema" option
Stages with this restriction:
Sequential
File Set
External Source
External Target
Runtime Column Propagation

When RCP is Disabled:
DataStage Designer will enforce Stage Input Column to Output Column mappings.
At job compile time, modify operators are inserted on output links in the generated OSH.
Runtime Column Propagation

When RCP is Enabled:
DataStage Designer will not enforce mapping rules.
No modify operator is inserted at compile time.
Danger of a runtime error if incoming column names do not match outgoing column names (column names on links are case sensitive).
Exercise

Complete exercises 11-1 and 11-2


Module 12

Job Control Using the Job Sequencer
Objectives

Understand how the DataStage Job Sequencer works
Use this understanding to build a control job to run a sequence of DataStage jobs
Job Control Options

Manually write job control
Code is generated in BASIC
Use the Job Control tab on the job properties page
Generates BASIC code which you can modify

Job Sequencer
Build a controlling job much the same way you build other jobs
Comprised of stages and links
No BASIC coding
Job Sequencer

Build like a regular job
Type: Job Sequence
Has stages and links
A Job Activity stage represents a DataStage job
Links represent passing control

Stages
Example

The Job Activity stage contains conditional triggers
Job Activity Properties

Job to be executed - select from the dropdown
Job parameters to be passed
Job Activity Trigger

Trigger appears as a link in the diagram


Custom options let you define the code
Options

Use the custom option for conditionals
Execute if the job run was successful or had warnings only
Can add a Wait For File activity to control execution
Add an Execute Command stage to drop real tables and rename new tables to current tables
Job Activity With Multiple Links

Different links having different triggers
Sequencer Stage

Build a job sequencer to control jobs for the collections application

Can be set to "all" or "any"
Notification Stage

Notification
Notification Activity
Sample DataStage log from Mail Notification
Sample DataStage log from Mail Notification
Notification Activity Message

E-Mail Message
Exercise

Complete exercise 12-1


Module 13

Testing and Debugging


Objectives

Understand the spectrum of tools to perform testing and debugging
Use this understanding to troubleshoot a DataStage job
Environment Variables: Parallel
Environment Variables: Stage Specific
Environment Variables: Compiler
The Director

Typical Job Log Messages:

Environment variables
Configuration File information
Framework Info/Warning/Error messages
Output from the Peek Stage
Additional info with "Reporting" environment variables
Tracing/Debug output
Must compile job in trace mode
Adds overhead
Job-Level Environment Variables

Job Properties, from the menu bar of Designer

Director will prompt you before each run
Troubleshooting
If you get an error during compile, check the following:
Compilation problems:
If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
If there are Buildop errors, try buildop from the command line
Some stages may not support RCP - this can cause a column mismatch
Use the Show Error and More buttons
Examine the generated OSH
Check environment variable settings

Very little integrity checking is done during compile; you should run Validate from Director.

Highlights the source of the error


Generating Test Data

The Row Generator stage can be used
Column definitions
Data-type dependent

Row Generator plus lookup stages provides a good way to create robust test data from pattern files
