TCS, Kolkata
Confidential
Ab Initio
Ab Initio >
DAY 1
INTRODUCTION
Product Constituents
Co>Operating System
On a typical installation, the Co>Operating System is installed on a Unix or Windows NT server, while the GDE is installed on a Pentium PC.
Product Architecture
[Diagram: the GDE alongside Host Machine 1 and Host Machine 2. On each host, user programs and the Co>Operating System, with the Ab Initio built-in component programs (partitions, transforms, etc.), sit on top of the native operating system (Unix, Windows NT).]
Co>Operating System
Layered on top of the operating system
Unites a network of computing resources (CPUs, storage disks, programs, datasets) into a data-processing system with scalable performance
The Co>Operating System runs on:
IBM: AIX 4.1, 4.2, 4.3
Sun: Solaris 2.5.1, 2.6, 2.7
HP
Sequent
Alpha
Pyramid
Intel: Windows NT 4.0
The GDE
is the GUI for building applications in Ab Initio
can talk to the Co>Operating System using several protocols, such as Telnet, Rexec and FTP
and the Co>Operating System have different release mechanisms, so the Co>Operating System can be upgraded without changing the GDE release
Note: During deployment, the GDE sets AB_COMPATIBILITY to the Co>Operating System version number, so a change in the Co>Operating System release may require a re-deployment.
A Graph
is the logical modular unit of an application.
consists of several components that form the building blocks of an Ab Initio application
A Component
is a program that does a specific type of job and can be controlled
by its parameter settings.
A Component Organizer
groups all components under different categories.
A Sample Graph
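[Diagram: a sample graph with its parts labelled. Datasets: Customers, Good Customers, Other Customers. Components: Score, Select. Ports: out*, deselect*. Flows connect the datasets and components. The labels L1 and L1* appear on the datasets and components.]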
A Sample Graph
Expression Metadata
Record format metadata
Ports
Layout
Files
Formats
Components
Flows
Layouts
Building with mp job
Building with mp run
Setup Command
Ab Initio Host (AIH) file
Builds up the environment to run an Ab Initio application.
Graph
End Script
Local to the Graph
Runtime Environment
A sample graph
DML
Ab Initio stores metadata in the form of record formats.
Metadata can be embedded within a component or can be
stored external to the graph in a file with a .dml extension.
XFR
Data can be transformed with the help of transform functions.
Transform functions can be embedded within a component or
can be stored external to the graph in a file with a .xfr
extension.
[Diagram sequence: the runtime environment. Labels shown: CLIENT (GDE), HOST, Agent, PROCESSING NODES. The first frame also shows CPU1, CPU2, Memory, Disk and BUS.]
Ab Initio >
DAY 2
Field Names
Data Types
0345John Smith
0212Sam Spade
0322Elvis Jones
0492Sue West
0221William Black
DML BLOCK
record
  decimal(4) id;
  string(6) first_name;
  string(6) last_name;
end
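As a rough illustration of how a fixed-width record format like the one above carves a line of data into fields, here is a small Python sketch. Python is used only for illustration; this is not how the Co>Operating System reads data. The names `FIELDS` and `parse_fixed` are hypothetical, and the sample line is padded to the declared widths (4 + 6 + 6 characters), which is an assumption about the original data.

```python
# Hypothetical mirror of the record format: (field name, width in characters).
FIELDS = [("id", 4), ("first_name", 6), ("last_name", 6)]

def parse_fixed(line):
    """Slice a fixed-width line into a dict of field name -> text."""
    record, pos = {}, 0
    for name, width in FIELDS:
        # Take exactly `width` characters, then drop the padding on the right.
        record[name] = line[pos:pos + width].rstrip()
        pos += width
    return record

sample = "0345John  Smith "  # padded to 4 + 6 + 6 characters (assumed)
parsed = parse_fixed(sample)
```

Each field is found purely by position, which is why every record in a fixed-width file must have the same length.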
DML Syntax
Record types begin with record and end with end
Fields are declared as: data_type(len) field_name;
Field names consist of letters (a-z, A-Z), digits (0-9) and underscores (_), and are case sensitive
Keywords (reserved words) include record, end and date
Data Types
String
Decimal
Integer
Storing Data in binary form
0212,05-07-03, 950.00Sam,Spade
0322,17-01-00, 890.50Elvis,Jones
0492,25-12-02,1000.00Sue,West
0221,28-02-03, 500.00William,Black
record
  decimal(",") id;
  date("DD-MM-YY")(",") join_date;
  decimal(7,2) salary_per_day;  /* Precision & Scale */
  string(",") first_name;
  string("\n") last_name;
end
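To make the mix of delimited and fixed-length fields concrete, the following Python sketch parses one of the sample lines above. It is only an illustration under stated assumptions: `parse_delimited` is a hypothetical helper, the salary is assumed to occupy exactly 7 characters as it does in the sample data, and two-digit years are interpreted by Python's own `%y` rule.

```python
from datetime import datetime
from decimal import Decimal

def parse_delimited(line):
    """Parse one line: id, join_date, salary_per_day, first_name, last_name."""
    id_text, rest = line.split(",", 1)          # id runs up to the first comma
    date_text, rest = rest.split(",", 1)        # join_date runs up to the next comma
    salary_text, rest = rest[:7], rest[7:]      # salary: fixed 7 characters (assumed)
    first_name, last_name = rest.split(",", 1)  # first_name up to comma, last_name up to newline
    return {
        "id": int(id_text),
        "join_date": datetime.strptime(date_text, "%d-%m-%y").date(),
        "salary_per_day": Decimal(salary_text.strip()),
        "first_name": first_name,
        "last_name": last_name.rstrip("\n"),
    }

row = parse_delimited("0212,05-07-03, 950.00Sam,Spade\n")
```

Note that no delimiter separates the salary from the first name; the fixed length of the salary field is what tells the reader where the next field starts.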
NULL in Ab Initio
Table Data
Header
Body
NULL Body
Trailer
NULL
Data Transformation
Input record: 0345,090297John,Smith;
Input record format:
record
  decimal(7) id;
  date("MMDDYY") join_date;
  string(",") first_name;
  string(";") last_name;
end
Reformat operations: Drop (first_name), Reorder, id+1000000
Output record format:
record
  decimal(7) id;
  string(8) last_name;
  date("DD-MM-YY")(",") join_date;
end
Output record: 1000345,Smith 1997-09-02
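The three operations in this example (drop a field, reorder the fields, compute a new id) can be sketched in Python as a function from an input record to an output record. This is an illustrative sketch only: the `reformat` function and the dictionary representation are hypothetical, and the date is passed through unchanged rather than re-rendered.

```python
def reformat(in_rec):
    """Hypothetical transform: drop first_name, reorder, and offset the id."""
    return {
        "id": in_rec["id"] + 1000000,      # id + 1000000
        "last_name": in_rec["last_name"],  # moved ahead of join_date
        "join_date": in_rec["join_date"],  # carried over as-is in this sketch
    }

out_rec = reformat(
    {"id": 345, "join_date": "090297", "first_name": "John", "last_name": "Smith"}
)
```

Because `first_name` is never assigned in the output, it is simply dropped.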
Assignments:
output-record.field :: expression;
Basic Components
Filter by Expression
Reformat
Redefine Format
Sort
Join
Replicate
Dedup
Aggregate
Rollup
Scan
Filter by Expression
1. Reads a record from the input port
2. Evaluates the select_expr
3. If the result is true, writes the record to the out port
4. If the result is false, writes the record to the deselect port
[Flowchart: input port -> is expr true? Yes -> out port; No -> deselect port]
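The routing steps of Filter by Expression can be sketched in a few lines of Python. This is an illustration, not the component's implementation; the function name, the sample records and the score threshold are all hypothetical.

```python
def filter_by_expression(records, select_expr):
    """Route each record to `out` when select_expr is true, else to `deselect`."""
    out, deselect = [], []
    for rec in records:
        (out if select_expr(rec) else deselect).append(rec)
    return out, deselect

# Hypothetical data: split customers on an assumed score threshold.
customers = [{"name": "Smith", "score": 720}, {"name": "Jones", "score": 540}]
good, other = filter_by_expression(customers, lambda r: r["score"] >= 600)
```

Every input record ends up on exactly one of the two ports, so no record is lost.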
Diagnostic Ports
REJECT: input records that caused an error
ERROR: the associated error message
LOG: logging records
Reformat
1. Reads a record from the input port
2. Passes the record as an argument to the transform function (xfr)
Instrumentation Parameters
Limit: number of errors to tolerate
Ramp: scale of errors to tolerate per input record
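One way to read how Limit and Ramp combine is as a single threshold that grows with the number of records processed. The Python sketch below assumes the rule "tolerated rejects = limit + ramp x records processed so far"; that formula is an assumption stated for illustration and should be checked against the product documentation. The function name is hypothetical.

```python
def exceeds_tolerance(rejects, records_processed, limit, ramp):
    """True when rejects pass the assumed threshold limit + ramp * records_processed."""
    return rejects > limit + ramp * records_processed

# With limit=5 and ramp=0.5, 10 processed records allow up to 10 rejects (assumed rule).
within = exceeds_tolerance(rejects=10, records_processed=10, limit=5, ramp=0.5)
beyond = exceeds_tolerance(rejects=11, records_processed=10, limit=5, ramp=0.5)
```

Under this reading, a ramp of 0 gives a fixed error budget, while a non-zero ramp lets the budget scale with the input volume.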
Sort
Keys
A key identifies a field or set of fields to organize a dataset
Single Field: employee_number
Multiple field or Composite key: (last_name; first_name)
Modifiers: employee_number descending
Sort Component
Reads records from input port, sorts them by key, writes result to output port
Parameters
Key
Max-core
Join
1. Reads records from multiple input ports
PORTS               PARAMETERS
in                  count
out                 key
unused              override key
reject (optional)   transform
error (optional)    limit
log (optional)      ramp
Join
Join Types
Inner
Outer
Explicit
Join Methods
Merge Join: uses sorted inputs
Hash Join: uses in-memory hash tables to group the input
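The hash join idea, building an in-memory table on one input and probing it with the other, can be sketched in Python as follows. This is a minimal inner-join illustration, not Ab Initio's implementation; `hash_join` and the sample records are hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Inner join: build an in-memory hash table on `right`, then probe it with `left`."""
    table = defaultdict(list)
    for rec in right:
        table[rec[key]].append(rec)
    joined = []
    for rec in left:
        # Unmatched left records are skipped, which makes this an inner join.
        for match in table.get(rec[key], []):
            joined.append({**match, **rec})
    return joined

customers = [{"id": 1, "name": "Smith"}, {"id": 2, "name": "Jones"}]
orders = [{"id": 1, "amount": 10}, {"id": 1, "amount": 5}, {"id": 3, "amount": 7}]
result = hash_join(orders, customers, "id")
```

Neither input needs to be sorted, but the input loaded into the table must fit in memory; a merge join trades that memory for the cost of sorting both inputs.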
Priority Assignment
Multistage Transform
Aggregate/Rollup/Scan
Generates summary records for groups of input records
Components contd..
Name: Description
Normalize
Denormalize Sorted: Consolidates groups of related data records into a single output record with a vector field for each group. Requires grouped input.
Validate Records
Check Order
Compare Records
Generate Records: Generates a specified number of data records with fields of specified lengths and types.
Gather Logs: Collects the output from the log ports of components for analysis of a graph after execution.
Sample: Selects a specified number of data records at random from one or multiple input flows.
ROLLUP
ROLLUP EXAMPLE:
Input:
customer_id   dt           amount
C002142       1994.03.23    52.20
C002142       1994.06.22    22.25
C003213       1993.02.12    47.95
C003213       1994.11.05   221.24
C003213       1995.12.11    17.42
C004221       1994.08.15    25.25
C008231       1993.10.22   122.00
C008231       1995.12.10    52.10
Output:
customer_id   total_amount
C002142        74.45
C003213       286.61
C004221        25.25
C008231       174.10
ROLLUP
type temporary_type =
record
  decimal(8.2) total_amount;
end;
temp :: initialize(in) =
begin
  temp.total_amount :: 0;
end;
temp :: rollup(temp, in) =
begin
  temp.total_amount :: temp.total_amount + in.amount;
end;
out :: finalize(temp, in) =
begin
  out.customer_id :: in.customer_id;
  out.total_amount :: temp.total_amount;
end;
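To see the initialize / rollup / finalize stages at work on the example data, here is a Python sketch that sums the amount per customer_id. It is an analogy rather than a translation of the DML: the `rollup` function is hypothetical, and it assumes the input is already grouped by customer_id, as in the example table.

```python
from decimal import Decimal
from itertools import groupby

purchases = [
    ("C002142", "1994.03.23", Decimal("52.20")),
    ("C002142", "1994.06.22", Decimal("22.25")),
    ("C003213", "1993.02.12", Decimal("47.95")),
    ("C003213", "1994.11.05", Decimal("221.24")),
    ("C003213", "1995.12.11", Decimal("17.42")),
    ("C004221", "1994.08.15", Decimal("25.25")),
    ("C008231", "1993.10.22", Decimal("122.00")),
    ("C008231", "1995.12.10", Decimal("52.10")),
]

def rollup(records):
    """Sum amount per customer_id; input must already be grouped by customer_id."""
    totals = []
    for customer_id, group in groupby(records, key=lambda r: r[0]):
        total = Decimal("0")                 # initialize: once per key group
        for _, _, amount in group:
            total += amount                  # rollup: once per input record
        totals.append((customer_id, total))  # finalize: one output per key group
    return totals

totals = rollup(purchases)
```

The result has one record per customer, matching the totals in the example: 74.45, 286.61, 25.25 and 174.10.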
SCAN
type temporary_type =
record
  decimal(8.2) amount_to_date;
end;
temp :: initialize(in) =
begin
  temp.amount_to_date :: 0;
end;
temp :: scan(temp, in) =
begin
  temp.amount_to_date :: temp.amount_to_date + in.amount;
end;
out :: finalize(temp, in) =
begin
  out.customer_id :: in.customer_id;
  out.dt :: in.dt;
  out.amount_to_date :: temp.amount_to_date;
end;
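Unlike a rollup, a scan produces one output record per input record, carrying a running total within each key group. The Python sketch below illustrates that difference; the `scan` function is hypothetical, the sample rows are taken from the rollup example data, and the input is assumed to be grouped by customer_id.

```python
from decimal import Decimal
from itertools import groupby

def scan(records):
    """Emit one output per input with a running amount_to_date per customer_id."""
    out = []
    for customer_id, group in groupby(records, key=lambda r: r[0]):
        amount_to_date = Decimal("0")  # reset at the start of each key group
        for _, dt, amount in group:
            amount_to_date += amount
            out.append((customer_id, dt, amount_to_date))
    return out

rows = scan([
    ("C002142", "1994.03.23", Decimal("52.20")),
    ("C002142", "1994.06.22", Decimal("22.25")),
    ("C004221", "1994.08.15", Decimal("25.25")),
])
```

The last record of each group carries the same value a rollup would have produced for that group.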
AGGREGATE
Suppose the input record format is:
record
  string(20) customer_id;
  decimal(8.2) purchase_amount;
end;
Suppose the output record format is:
record
  string(20) customer_id;
  decimal(8.2) total;
end;
You can sum the purchases with the following transform function:
out :: agg(temp, in) =
begin
  out.total :1: temp.total + in.purchase_amount;
  out.total :: in.purchase_amount;
  out.customer_id :: in.customer_id;
end;
Ab Initio >
DAY 3
Built-in Functions
Ab Initio built-in functions are DML expressions that
can manipulate strings, dates, and numbers
access system properties
Function categories
Date functions: now(), today(), date_to_int(), ...
Inquiry and error functions: is_defined(), is_valid(), force_error(), ...
Lookup functions: lookup(), lookup_local(), ...
Math functions: ceiling(), floor(), ...
Miscellaneous functions: decimal_round(), hash_value(), ...
String functions: string_substring(), is_blank(), ...
Function
Example
re_get_match()
re_index()
re_replace()
re_replace_first()
Database Components
db_config_utility : Generate interface file to the database
Input Table
unloads data records from a database into an Ab Initio graph
Source : DB table or SQL statement to SELECT from table
Output Table
loads data records into a database
Destination : DB table or SQL statement to INSERT into table
Update Table
executes UPDATE or INSERT statements in embedded SQL format to modify a
DB table
Truncate Table
deletes all the rows in a specified DB table
Run SQL
executes SQL statements in a DB
Parallelism
Parallel Runtime Environment
Where some or all of the components of an application (datasets and processing modules) are replicated into a number of partitions, each spawning a process.
Inherent in Ab Initio
Data Parallelism
Component Parallelism
When different instances of the same component run on separate data sets
Sorting Customers
Sorting Transactions
Pipeline Parallelism
When multiple components run on the same data set
Processing Record 99
Data Parallelism
When data is divided into segments or partitions and processes run simultaneously on each
partition
[Diagram: a multifile and its partitions, shown in Global View and in Expanded View]
Multifiles
A global view of a set of ordinary files called partitions, usually located on different disks or systems
Ab Initio provides shell-level utilities called m_ commands for handling multifiles (copy, delete, move, etc.)
Multifiles reside in multidirectories
Each is represented using URL notation with mfile as the protocol part:
mfile://pluto.us.com/usr/ed/mfs1/new.dat
A Multidirectory
A directory spanning partitions on different hosts
mfile://host1/u/jo/mfs/mydir
Control Partition (<.mdir>): //host1/u1/jo/mfs
Data Partition on Host1: //host1/vol4/pA/mydir
Data Partition on Host2: //host2/vol3/pB/mydir
Data Partition on Host3: //host3/vol7/pC/mydir
A Multifile
A file spanning partitions on different hosts
mfile://host1/u/jo/mfs/mydir/myfile.dat
Control Partition: //host1/u1/jo/mfs/mydir/myfile.dat
Data Partition on Host1: //host1/vol4/pA/mydir/myfile.dat
Data Partition on Host2: //host2/vol3/pB/mydir/myfile.dat
Data Partition on Host3: //host3/vol7/pC/mydir/myfile.dat
Agent Nodes
A multidirectory
A multifile
Control file
Ab Initio >
DAY 4
Partition by Roundrobin
Writes records to the flow partitions in
round-robin way, with block-size
records going into one partition before
moving on to the next
Input flow: A B C D E F C D B G B A A D F E A D
Partition0: A D C G A E
Partition1: B E D B D A
Partition2: C F B A F D
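The round-robin rule can be sketched in Python as below. This is an illustration, not the component's code; `partition_round_robin` is a hypothetical name, and a block size of 1 is assumed because it reproduces the three partitions shown in the slide's example.

```python
def partition_round_robin(records, n_partitions, block_size=1):
    """Deal records to partitions in turn, block_size records at a time."""
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        # Block number (i // block_size) decides which partition is next in turn.
        partitions[(i // block_size) % n_partitions].append(rec)
    return partitions

flow = "A B C D E F C D B G B A A D F E A D".split()
parts = partition_round_robin(flow, 3)
```

The partition a record lands in depends only on its position in the flow, never on its content, which is why round-robin balances well but gives no key-based grouping.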
Partition by Key
Partition by Percentage
distributes a specified percentage of the total number of input data
records to each output flow
Pct port
Broadcast
Acts like a partitioning component when the layout changes
Method       Key-Based   Balancing              Uses
Roundrobin   No          Good                   Record-independent parallelism
Hash         Yes         Good                   Key-dependent parallelism
Function     Yes                                Application specific
Range        Yes         Depends on splitters   Key-dependent parallelism, Global Ordering
Load-level   No          Depends on load        Record-independent parallelism
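[Diagram: partitioning by key into partitions p0 to p3. An input with the keys A, B, C and D evenly represented gives balanced partitions (A A A, B B B, C C C, D D D). An input dominated by the keys A and D gives skewed partitions.]
Partitions skewed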
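Key-based partitioning and the skew it can produce are easy to demonstrate with a short Python sketch. The sketch is illustrative only: `partition_by_key` is a hypothetical helper, `zlib.crc32` stands in for whatever hash function the real component uses, and the input is a made-up flow dominated by the keys A and D.

```python
import zlib

def partition_by_key(records, key, n_partitions):
    """Send every record with the same key value to the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        # A deterministic hash of the key picks the partition (stand-in hash).
        index = zlib.crc32(str(key(rec)).encode()) % n_partitions
        partitions[index].append(rec)
    return partitions

skewed = list("AACDADDDBDBA")  # hypothetical input: A and D dominate
parts = partition_by_key(skewed, key=lambda rec: rec, n_partitions=4)
sizes = [len(p) for p in parts]
```

Because all records sharing a key must go to one partition, the partition holding D receives at least five of the twelve records however the hash behaves. Balance under key-based partitioning therefore depends on the distribution of key values in the data.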
Departitioning Components
Data can be de-partitioned using
Gather
Concatenate
Merge
Interleave
[Diagram: departitioning shown in Global View and in Expanded View]
Departitioning Components
Gather
Reads data records from the flows connected to the input port
Concatenate
Concatenate appends multiple flow partitions of data records one after another
Merge
Combines data records from multiple flow partitions that have been sorted on
a key
Maintains the sort order
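The difference between Concatenate and Merge can be shown with two small Python functions. These are sketches under simple assumptions (partitions are in-memory lists, and for `merge` each partition is already sorted); the function names are hypothetical and the standard-library `heapq.merge` stands in for the component's merge logic.

```python
import heapq
from itertools import chain

def concatenate(partitions):
    """Append the partitions one after another, in partition order."""
    return list(chain.from_iterable(partitions))

def merge(partitions, key=None):
    """Combine partitions that are each sorted, keeping the overall sort order."""
    return list(heapq.merge(*partitions, key=key))

sorted_parts = [[1, 4, 9], [2, 3, 10], [5]]
appended = concatenate(sorted_parts)
ordered = merge(sorted_parts)
```

Concatenate keeps each partition's internal order but not a global sort, whereas Merge yields a fully sorted serial flow as long as every input partition was sorted on the same key.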
Departitioning Components
Method        Key-Based   Ordering                           Uses
Concatenate   No          Global
Interleave    No          Inverse of Round Robin partition
Merge         Yes         Sorted                             Creating ordered serial flows
Gather        No          Arbitrary                          Unordered departitioning
Lookup File
Serial or Multifiles
Held in main memory
Searching and retrieval are key-based and faster than for files stored on disk
Associates key values with corresponding data values to index records and retrieve them
Lookup parameters
Key
Record Format
Lookup File
Storage Methods
Serial lookup : lookup()
whole file replicated to each partition
Lookup Functions
Name                   Arguments    Purpose
lookup()                            Returns a data record from a Lookup File that matches the values of the expression argument
lookup_count()         - do -
lookup_next()          File Label
lookup_local()
lookup_count_local()   - do -
lookup_next_local()    File Label
NOTE: Data needs to be partitioned on same key before using lookup local functions
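The idea behind a lookup file, an in-memory index from key values to records, can be sketched in Python as follows. This is an analogy only: the helper names echo the DML function names but their signatures here are hypothetical, and the sample records are made up.

```python
def build_lookup(records, key):
    """Index records by key in memory, keeping every record per key value."""
    index = {}
    for rec in records:
        index.setdefault(rec[key], []).append(rec)
    return index

def lookup(index, value):
    """Return the first matching record, or None when the key is absent."""
    matches = index.get(value)
    return matches[0] if matches else None

def lookup_count(index, value):
    """Return how many records match the key value."""
    return len(index.get(value, []))

# Hypothetical reference data keyed on customer_id.
customers = build_lookup(
    [{"customer_id": "C002142", "name": "Smith"},
     {"customer_id": "C003213", "name": "Jones"}],
    key="customer_id",
)
found = lookup(customers, "C003213")
missing = lookup(customers, "C999999")
```

Since the whole index lives in memory, retrieval is a single key probe, but the memory cost grows with the size of the lookup data.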
Deadlock
A deadlock happens when the graph stops progressing because of a mutual dependency of data among components
Identified when the record counts do NOT change over a period of time
Phasing
Checkpointing
Flow Buffering
Join
Concatenate
Compare records
Merge
Interleave
Compare Checksums
Blocking on read
Blocking on write
Ab Initio >
DAY 5
Progressing to the Next Level:
Performance
Temporary files
Phases
Checkpoints
Sort, Merge etc.
Buffered Flows
Factors: CPU
CPU: Number of processes
Processes are faster as long as resources are available
Overdriving machine
Resource intensive applications
Symptoms
Increased processes
Increased system calls
Increased paging of memory
Strange error messages
Factors: Memory
Memory: Consumers
Lookup Tables
Serial lookup: lookup()
whole table replicated into each partition
In-memory components
Rollup, scan, normalize
(temporary data) x (number of key groups)
Join
non-driving input(s) loaded into memory
Memory: max-core
Memory allocated by a component to hold data
Exceeding max-core
Data spills to disk once the allocated memory is used up
Graph aborts if memory is not available
Performance bottleneck: slower execution
Issues
Changing data volume over time
Factors: Phases
Purpose: controlling the number of simultaneous processes
Performance Enhancements:
CPU usage: optimum resource utilization; limit the number of active components and thus the number of processes.
Memory usage:
Allocate max-core across components in phase
Place components which need large max-core in separate phases
Factors: Phases
Performance Enhancements
Communication bottleneck : ALL-to-ALL flows
N-way to N-way: Uses N network resources
Limit the number of All-to-All flows by using phases
Safety cushion: <=4 per phase
Alternative: Use Two-Stage-Communication
Uses 2Nx(N) channels of communication
applicable if depth is >= 30
Factors: Phases
Rule of thumb: Placement of phases
DO NOT put phases after Replicates, across all-to-all flows, after temp
files, after sorts
Data on all flows crossing the current phase boundary is written to
disk, so place phases to minimize writing of data to disks
Factors: Checkpoint
Purpose:
Provide same functionality as phase
Additional: Provide restart capability
Vertex: component
Port: of component
Performance: In a nutshell ..
Avoid Sorts
Use Lookups
Use In-memory Join/Rollup
Assign Driving Port of Join correctly
Allocate memory correctly
Phasing
THANK YOU