General objects
Data Quality Stages
Database connectors
Development and Debug stages
File stages
Processing stages
Real Time stages
Restructure Stages
Sequence activities
Please refer to the list below for a description of the stages used in DataStage and QualityStage.
All stages are classified by importance and frequency of use in real-life deployments
(and on certification exams). The most widely used stages are marked bold or have
a link to a subpage with a detailed description and examples.
General elements
Link indicates a flow of data. There are three main types of links in DataStage: stream,
reference and lookup.
Row generator produces a set of test data which fits the specified metadata (values can be
random or cycled through a specified list). Useful for testing and development.
Column generator adds one or more columns to the incoming flow and generates test
data for those columns.
Peek stage prints record column values to the job log, which can be viewed in Director. It
can have a single input link and multiple output links.
Sample stage samples an input data set. Operates in two modes: percent mode and period
mode.
Head selects the first N rows from each partition of an input data set and copies them to
an output data set.
Tail is similar to the Head stage. It selects the last N rows from each partition.
Write Range Map writes a data set in a form usable by the range partitioning method.
Processing stages
Aggregator joins data vertically by grouping the incoming data stream and calculating
summaries (sum, count, min, max, variance, etc.) for each group. The data can be
grouped using two methods: hash table or pre-sort.
Copy - copies input data (a single stream) to one or more output data flows
FTP stage uses FTP protocol to transfer data to a remote machine
Filter filters out records that do not meet specified requirements.
Funnel combines multiple streams into one.
Join combines two or more inputs according to values of key column(s). Similar in
concept to a relational DBMS SQL join (with the ability to perform inner, left, right and full
outer joins). It can have one left and multiple right inputs (all need to be sorted) and
produces a single output stream (no reject link).
Lookup combines two or more inputs according to values of key column(s). The Lookup
stage can have one source and multiple lookup tables. Records don't need to be sorted; the
stage produces a single output stream and a reject link.
Merge combines one master input with multiple update inputs according to values of
key column(s). All inputs need to be sorted, and unmatched secondary entries can be
captured in multiple reject links.
Modify stage alters the record schema of its input data set. Useful for renaming columns,
non-default data type conversions and null handling.
Remove Duplicates stage needs a single sorted data set as input. It removes all duplicate
records according to a specification and writes to a single output.
Slowly Changing Dimension automates the process of updating dimension tables where
the data changes over time. It supports SCD type 1 and SCD type 2.
Sort sorts an input data set by one or more key columns.
Transformer stage handles extracted data, performs data validation, conversions and
lookups.
Change Capture - captures the before and after state of two input data sets and outputs a
single data set whose records represent the changes made.
Change Apply - applies the change operations to a before data set to compute an after
data set. It takes its input from a Change Capture stage.
Difference stage performs a record-by-record comparison of two input data sets and
outputs a single data set whose records represent the difference between them. Similar to
the Change Capture stage.
Checksum - generates a checksum from the specified columns in a row and adds it to the
stream. Used to determine whether there are differences between records.
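The checksum idea can be sketched in Python (a hypothetical illustration, not DataStage code; the column list, separator and MD5 choice are assumptions):

```python
import hashlib

def add_checksum(row, columns):
    # Concatenate the chosen column values and append a checksum column.
    payload = "|".join(str(row[c]) for c in columns)
    return {**row, "checksum": hashlib.md5(payload.encode()).hexdigest()}

old = add_checksum({"id": 1, "name": "sam"}, ["id", "name"])
new = add_checksum({"id": 1, "name": "samuel"}, ["id", "name"])
# Differing checksums indicate the record changed between the two versions.
print(old["checksum"] != new["checksum"])  # True
```

Comparing only the checksum column is cheaper than comparing every column of every record.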
Compare performs a column-by-column comparison of records in two presorted input
data sets. It can have two input links and one output link.
Encode encodes data with an encoding command, such as gzip.
Decode decodes a data set previously encoded with the Encode Stage.
External Filter permits specifying an operating system command that acts as a filter on
the processed data.
Generic stage allows users to call an OSH operator from within a DataStage job with
options as required.
Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row
to a single column in multiple output rows. Pivoting data results in a data set
with fewer columns but more rows.
Surrogate Key Generator generates a surrogate key for a column and manages the key
source.
Switch stage assigns each input row to an output link based on the value of a selector
field. It provides a concept similar to the switch statement in most programming
languages.
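The routing logic can be sketched in Python (a hypothetical illustration; the selector field and case values are invented):

```python
def switch_stage(rows, selector, links):
    # Route each row to the output link whose case value matches the selector field.
    outputs = {case: [] for case in links}
    for row in rows:
        case = row[selector]
        if case in outputs:
            outputs[case].append(row)  # rows with unmatched cases are dropped here
    return outputs

out = switch_stage([{"dept": 10}, {"dept": 20}, {"dept": 10}], "dept", [10, 20])
print(len(out[10]), len(out[20]))  # 2 1
```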
Compress - packs a data set using the GZIP utility (or the compress command on
Linux/UNIX).
Expand extracts a previously compressed data set back into raw binary data.
File stage types
Sequential file is used to read data from or write data to one or more flat (sequential)
files.
Data Set stage allows users to read data from or write data to a dataset. Datasets are
operating system files, each of which has a control file (.ds extension by default) and one
or more data files (unreadable by other applications).
File Set stage allows users to read data from or write data to a fileset. Filesets are
operating system files, each of which has a control file (.fs extension) and data files.
Unlike datasets, filesets preserve formatting and are readable by other applications.
Complex flat file allows reading from complex file structures on a mainframe machine,
such as MVS data sets, header and trailer structured files, files that contain multiple
record types, QSAM and VSAM files.
External Source - permits reading data that is output from multiple source programs.
External Target - permits writing data to one or more programs.
Lookup File Set is similar to the File Set stage. It is a partitioned hashed file which can be
used for lookups.
Database stages
Oracle Enterprise allows reading data from and writing data to an Oracle database
(database versions 9.x to 10g are supported).
ODBC Enterprise permits reading data from and writing data to a database defined as an
ODBC source. In most cases it is used for processing data from or to Microsoft Access
databases and Microsoft Excel spreadsheets.
DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.
Teradata permits reading data from and writing data to a Teradata data warehouse.
Three Teradata stages are available: Teradata connector, Teradata Enterprise and
Teradata Multiload
SQLServer Enterprise permits reading data from and writing data to Microsoft SQL
Server 2005 and 2008 databases.
Sybase permits reading data from and writing data to Sybase databases.
Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL
Server. The Stored Procedure stage can be used as a source (returns a rowset), as a target
(pass a row to a stored procedure to write) or a transform (to invoke procedure processing
within the database).
MS OLEDB helps retrieve information from any type of information repository, such as a
relational source, an ISAM file, a personal database, or a spreadsheet.
Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or
writing to a number of different supported relational DB engines using native interfaces,
such as Oracle, Microsoft SQL Server, DB2, Informix and Sybase.
Informix (CLI or Load)
DB2 UDB (API or Load)
Classic federation
RedBrick Load
Netezza Enterprise
iWay Enterprise
XML Input stage makes it possible to transform hierarchical XML data into flat relational
data sets.
XML Output writes tabular data (relational tables, sequential files or any DataStage data
streams) to XML structures.
XML Transformer converts XML documents using an XSLT stylesheet
Websphere MQ stages provide a collection of connectivity options to access IBM
WebSphere MQ enterprise messaging systems. There are two MQ stage types available
in DataStage and QualityStage: WebSphere MQ connector and WebSphere MQ plug-in
stage.
Web services client
Web services transformer
Java Client stage can be used as a source stage, as a target and as a lookup. The Java
package consists of three public classes: com.ascentialsoftware.jds.Column,
com.ascentialsoftware.jds.Row, com.ascentialsoftware.jds.Stage
Java transformer stage supports three links: input, output and reject.
WISD Input - Information Services Input stage
WISD Output - Information Services Output stage
Restructure stages
Column Export stage exports data from a number of columns of different data types into a
single column of data type ustring, string, or binary. It can have one input link, one output
link and a reject link.
Column Import is complementary to the Column Export stage. It is typically used to divide
data arriving in a single column into multiple columns.
Combine records stage combines rows which have identical keys, into vectors of
subrecords.
Make subrecord combines specified input vectors into a vector of subrecords whose
columns have the same names and data types as the original vectors.
Make vector joins specified input columns into a vector of columns
Promote subrecord - promotes input subrecord columns to top-level columns
Split subrecord - separates an input subrecord field into a set of top-level vector columns
Split vector promotes the elements of a fixed-length vector to a set of top-level columns
Data quality (QualityStage) stages
Investigate stage analyzes data content of specified columns of each record from the
source file. Provides character and word investigation methods.
Match Frequency stage takes input from a file, database or processing stages and
generates a frequency distribution report.
MNS - multinational address standardization.
QualityStage Legacy
Reference Match
Standardize
Survive
Unduplicate Match
WAVES - worldwide address verification and enhancement system.
Sequence activity stage types
=====================================================================
Configuration file:
The DataStage configuration file is a master control file (a text file which sits on the
server side) for jobs which describes the parallel system resources and architecture. The
configuration file provides the hardware configuration for supporting architectures
such as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster or
MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage understands the
architecture of the system through this file.
This is one of the biggest strengths of DataStage. If you change your processing
configuration, or change servers or platform, you never have to worry about it affecting your
jobs, since all jobs depend on this configuration file for execution. DataStage jobs determine
which node to run a process on, where to store temporary data and where to store data set
data, based on the entries provided in the configuration file. A default configuration file is
created whenever the server is installed.
Configuration files have the extension ".apt". The main benefit of the configuration
file is that it separates the software and hardware configuration from the job design: hardware
and software resources can change without changing a job design. DataStage jobs can point to
different configuration files by using job parameters, which means that a job can utilize different
hardware architectures without being recompiled.
The configuration file contains the different processing nodes and also specifies the disk
space provided for each processing node. These are logical processing nodes, so having more
than one CPU does not mean the nodes in your configuration file correspond to those CPUs;
it is possible to have more than one logical node on a single physical node. However, you
should be wise in configuring the number of logical nodes on a single physical node.
Increasing nodes increases the degree of parallelism, but it does not necessarily mean better
performance, because it results in more processes. Unless your underlying system has the
capability to handle these loads, you will end up with a very inefficient configuration on
your hands.
1. APT_CONFIG_FILE is the environment variable with which DataStage determines the
configuration file to be used (one can have many configuration files for a project). In fact,
this is what is generally used in production. However, if this environment variable is not
defined, how does DataStage determine which file to use?
1. If the APT_CONFIG_FILE environment variable is not defined, then DataStage looks for the
default configuration file (config.apt) in the following paths:
1. Current working directory.
2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of
DataStage installation.
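The lookup order above can be sketched in Python (a hypothetical illustration; `resolve_config_file` and its argument are invented names, not a real DataStage API):

```python
import os

def resolve_config_file(install_dir):
    # 1) honor APT_CONFIG_FILE if set;
    # 2) otherwise look for config.apt in the current working directory;
    # 3) otherwise fall back to INSTALL_DIR/etc/config.apt.
    explicit = os.environ.get("APT_CONFIG_FILE")
    if explicit:
        return explicit
    for candidate in (os.path.join(os.getcwd(), "config.apt"),
                      os.path.join(install_dir, "etc", "config.apt")):
        if os.path.isfile(candidate):
            return candidate
    return None
```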
3. What are the different options a logical node can have in the configuration file?
1. fastname – The fastname is the physical node name that stages use to open connections for high
volume data transfers. The value of this option is usually the network name. Typically, you can
get this name by using the Unix command 'uname -n'.
2. pools – Names of the pools to which the node is assigned. Based on the characteristics of the
processing nodes you can group nodes into sets of pools.
1. A pool can be associated with many nodes and a node can be part of many pools.
2. A node belongs to the default pool unless you explicitly specify a pools list for it and omit the
default pool name ("") from the list.
3. A parallel job or specific stage in the parallel job can be constrained to run on a pool (set of
processing nodes).
1. If both the job and a stage within the job are constrained to run on specific processing nodes,
then the stage runs on the nodes that are common to both the stage and the job.
3. resource – resource resource_type "location" [{pools "disk_pool_name"}] | resource
resource_type "value". The resource_type can be canonicalhostname (the quoted Ethernet
name of a node in a cluster that is not connected to the conductor node by the high-speed
network), disk (a directory for reading/writing persistent data), scratchdisk (the quoted absolute
path name of a directory on a file system where intermediate data will be temporarily stored;
it is local to the processing node), or RDBMS-specific resources (e.g. DB2, INFORMIX,
ORACLE, etc.)
4. How does DataStage decide on which processing node a stage should run?
1. If a job or stage is not constrained to run on specific nodes, then the parallel engine executes a
parallel stage on all nodes defined in the default node pool (the default behavior).
2. If the node is constrained then the constrained processing nodes are chosen while executing the
parallel stage.
In DataStage, the degree of parallelism, the resources being used, etc. are all determined
at run time, based entirely on the configuration provided in the APT configuration file.
A default configuration file is available whenever the server is installed. You can typically
find it under the <>\IBM\InformationServer\Server\Configurations folder with the name
default.apt. Bear in mind that you will have to optimise this configuration for your server
based on your resources.
Now let's try our hand at interpreting a configuration file. Let's try the sample below.
{
node "node1"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node2"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node3"
{
fastname "SVR2"
pools "" "sort"
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
}
This is a 3-node configuration file. Let's go through the basic entries and what they represent.
Fastname – This refers to the node name on a fast network. From this we can imply that nodes
node1 and node2 are on the same physical node. Node3, however, is on a different physical
node (identified by SVR2). So in node1 and node2 all the resources are shared: the disk and
scratch disk specified are actually shared between those two logical nodes. Node3, on the
other hand, has its own disk and scratch disk space.
Pools – Pools allow us to associate different processing nodes based on their functions and
characteristics. If you see an entry like "node0", or reserved node pools like "sort", "db2",
etc., it means that the node is part of the specified pool. A node is by default associated
with the default pool, which is indicated by "". If you look at node3, you can see that it is
also associated with the sort pool, which ensures that the Sort stage will run only on nodes
that are part of the sort pool.
Resource disk – Specifies the location on your server where the processing node
will write all the data set files. As you might know, when DataStage creates a dataset, the file
you see does not contain the actual data; the dataset file points to the place where the actual
data is stored, and that location is what is specified on this line.
Resource scratchdisk – The location of temporary files created during DataStage processes,
such as lookups and sorts, is specified here. If the node is part of the sort pool, then the scratch
disk can also be made part of the sort scratch disk pool, which ensures that the temporary files
created during sorts are stored only in this location. If such a pool is not specified, then
DataStage determines whether any scratch disk resources belong to the default scratch disk pool
on the nodes that the sort is specified to run on, and uses that space if so.
Below is the sample diagram for 1 node and 4 node resource allocation:
SAMPLE CONFIGURATION FILES
A basic configuration file for a single-machine, two-node server (2 CPUs) is shown below. The
file defines 2 nodes (node1 and node2) on a single dev server (an IP address might be provided
instead of a hostname) with 3 disk resources (d1 and d2 for the data, and Scratch as scratch
space).
{
node "node1"
{
fastname "dev"
pools ""
resource disk "/IIS/Config/d1" { }
resource disk "/IIS/Config/d2" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}
node "node2"
{
fastname "dev"
pools ""
resource disk "/IIS/Config/d1" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}
}
A sample configuration file for a cluster or grid computing on 4 machines is shown below.
The configuration defines 4 nodes (node[1-4]), node pools (n[1-4] and s[1-4]), resource pools
"bigdata" and "sort", and a temporary space.
{
node "node1"
{
fastname "dev1"
pools "" "n1" "s1" "sort"
resource disk "/IIS/Config1/d1" {}
resource disk "/IIS/Config1/d2" {pools "bigdata"}
resource scratchdisk "/IIS/Config1/Scratch" {pools "sort"}
}
node "node2"
{
fastname "dev2"
pools "" "n2" "s2"
resource disk "/IIS/Config2/d1" {}
resource disk "/IIS/Config2/d2" {pools "bigdata"}
resource scratchdisk "/IIS/Config2/Scratch" {}
}
node "node3"
{
fastname "dev3"
pools "" "n3" "s3"
resource disk "/IIS/Config3/d1" {}
resource scratchdisk "/IIS/Config3/Scratch" {}
}
node "node4"
{
fastname "dev4"
pools "n4" "s4"
resource disk "/IIS/Config4/d1" {}
resource scratchdisk "/IIS/Config4/Scratch" {}
}
}
Resource disk: here a disk path is defined; the data files of a dataset are stored on the
resource disk.
Resource scratchdisk: here, too, a path to a folder is defined; this path is used by parallel job
stages for buffering data while the parallel job runs.
=====================================================================
Sequential File stage:
The Sequential File stage is a file stage. It allows you to read data from or write
data to one or more flat files, as shown in the figure below.
In order to read a sequential file, DataStage needs to know about the format of the file.
If you are reading a delimited file, you need to specify the delimiter in the Format tab.
Source:
Important Options:
First Line is Column Names: if set to true, the first line of a file contains column names on
writing and is ignored on reading.
Keep File Partitions: set to true to partition the read data set according to the organization of
the input file(s).
Reject Mode: Continue to simply discard any rejected rows; Fail to stop if any row is rejected;
Output to send rejected rows down a reject link.
For fixed-width files, however, you can configure the stage to behave differently:
* You can specify that single files can be read by multiple nodes. This can improve performance on
cluster systems.
* You can specify that a number of readers run on a single node. This means, for example, that a
single file can be partitioned as it is read.
These two options are mutually exclusive.
Scenario 1:
Scenario 2:
Once we set Read From Multiple Nodes = Yes, the stage by default executes in parallel mode.
If you run the job with the above configuration, it will abort with the following fatal error:
sff_SourceFile: The multinode option requires fixed length records. (That means you can use
this option to read fixed-width files only.)
To fix the issue, go to the Format tab and add additional parameters as shown below.
Scenario 3: read a delimited file by adding Number of Readers Per Node instead of the
multinode option to improve read performance; once we add this option, the Sequential File
stage will execute in its default parallel mode.
If we are reading from and writing to fixed-width files, it is always good practice to add the
APT_STRING_PADCHAR DataStage environment variable and assign 0x20 as its value; the stage
will then pad with spaces, otherwise DataStage will pad with the null character (the DataStage
default padding character).
Always keep Reject Mode = Fail to make sure the DataStage job fails if we get a bad format
from source systems.
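The padding behaviour can be illustrated with a small Python sketch (hypothetical, not DataStage code; 0x20 is the ASCII space character):

```python
PAD = chr(0x20)  # APT_STRING_PADCHAR = 0x20 pads with spaces

def to_fixed_width(value, width, pad=PAD):
    # Truncate to the column width, then pad on the right, as a fixed-width
    # writer would for a string column.
    return str(value)[:width].ljust(width, pad)

# Two columns: name (8 chars) and dept (4 chars).
record = to_fixed_width("sam", 8) + to_fixed_width("10", 4)
print(repr(record))  # 'sam     10  '
```

With a null padding character instead, downstream readers that expect spaces would see garbage in the unused positions.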
Scenario: in a parallel job, after reading a sequential file, 2 more sequential files should be
created: one with duplicate records and the other without duplicates.
File 1 records separated by space: 1 1 2 2
File 2 records separated by space: 3 4 5 6
How will you do it?
Solution 1:
1. Introduce a Sort stage right after the sequential file.
2. Select the Key Change Column property in the Sort stage; it assigns 1 to the first row of
each group and 0 to the remaining duplicates (or vice versa, as you wish).
3. Put a Filter or Transformer stage next to it, and now you have uniques on one link and
duplicates on the other.
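Solution 1's logic can be sketched in Python (a hypothetical illustration of sort + key change + filter, not DataStage code):

```python
def split_duplicates(rows, key):
    # Sort by key, then flag the first row of each group (key change = 1)
    # and route it to the uniques link; the rest (flag 0) go to duplicates.
    rows = sorted(rows, key=lambda r: r[key])
    uniques, duplicates, prev = [], [], object()
    for row in rows:
        (uniques if row[key] != prev else duplicates).append(row)
        prev = row[key]
    return uniques, duplicates

u, d = split_duplicates([{"id": 1}, {"id": 2}, {"id": 1}], "id")
print(len(u), len(d))  # 2 1
```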
Solution 2: first take the source file and connect it to a Copy stage. One link goes to an
Aggregator stage and another link goes to a Lookup (or Join) stage. In the Aggregator stage,
use the count function to calculate how many times the values repeat in the key column.
The Aggregator output is connected to a Filter stage, where we filter on cnt=1 (cnt is the new
column counting rows per key).
The filter output is connected to the Lookup stage as the reference, with LOOKUP
FAILURE = REJECT.
Then place two output links on the Lookup: one collects the non-repeated values, and the
reject link collects the repeated values.
Possible solution: a Change Capture stage. I am going to use source A and reference B, both
connected to a Change Capture stage. The Change Capture stage is connected to a Filter stage
and then to targets X, Y and Z. In the Filter stage: change code = 2 goes to X [1,2,3,4,5],
change code = 0 goes to Y [6,7,8,9,10], and change code = 1 goes to Z
[11,12,13,14,15].
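The change codes can be illustrated with a Python sketch (hypothetical: it compares by key only, while the real stage compares whole records and also emits code 3 for edits):

```python
def change_capture(before, after):
    # Simplified Change Capture over key sets.
    # Codes: 0 = copy (in both), 1 = insert (only in after), 2 = delete (only in before).
    before_keys, after_keys = set(before), set(after)
    return {
        2: sorted(before_keys - after_keys),  # -> target X
        0: sorted(before_keys & after_keys),  # -> target Y
        1: sorted(after_keys - before_keys),  # -> target Z
    }

out = change_capture(range(1, 11), range(6, 16))
print(out[2], out[0], out[1])
# [1, 2, 3, 4, 5] [6, 7, 8, 9, 10] [11, 12, 13, 14, 15]
```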
Solution 2:
Create one px job.
src file= seq1 (1,2,3,4,5,6,7,8,9,10)
1st lkp = seq2 (6,7,8,9,10,11,12,13,14,15)
o/p - matching recs - o/p 1 (6,7,8,9,10)
not-matching records - o/p 2 (1,2,3,4,5)
2nd lkp:
src file - o/p 1 (6,7,8,9,10)
lkp file - seq 2 (6,7,8,9,10,11,12,13,14,15)
not matching recs - o/p 3 (11,12,13,14,15)
Dataset :
Inside an InfoSphere DataStage parallel job, data is moved around in data sets. These
carry metadata with them, both column definitions and information about the configuration that
was in effect when the data set was created. If, for example, you have a stage which limits
execution to a subset of available nodes, and the data set was created by a stage using all nodes,
InfoSphere DataStage can detect that the data will need repartitioning.
If required, data sets can be landed as persistent data sets, represented by a Data Set stage.
This is the most efficient way of moving data between linked jobs. Persistent data sets are stored
in a series of files linked by a control file (note that you should not attempt to manipulate these
files using UNIX tools such as rm or mv; always use the tools provided with InfoSphere
DataStage).
There are two groups of Datasets: persistent and virtual.
The first type, persistent Datasets, are marked with the *.ds extension, while the *.v extension
is reserved for the second type, virtual datasets. (It's important to mention that no *.v files may
be visible in the Unix file system, since they exist only virtually, in RAM. The *.v extension
itself is characteristic strictly of OSH, the Orchestrate scripting language.)
Further differences are much more significant. Primarily, persistent Datasets are stored
in Unix files using an internal DataStage EE format, while virtual Datasets are never stored on
disk: they exist within links, in EE format, but in RAM. Finally,
persistent Datasets are readable and rewritable with the Data Set stage, while virtual
Datasets can only be passed through in memory.
A data set comprises a descriptor file and a number of other files that are added as the data set
grows. These files are stored on multiple disks in your system. A data set is organized in terms
of partitions and segments.
Each partition of a data set is stored on a single processing node. Each data segment contains all
the records written by a single job. So a segment can contain files from many partitions, and a
partition has files from many segments.
Firstly, as a single Dataset contains multiple records, it is obvious that all of them must undergo
the same processes and modifications; in a word, all of them must go through the same
successive stages.
Secondly, it should be expected that different Datasets usually have different schemas, and
therefore they cannot be treated in the same way.
Alias names for Datasets are:
1) Orchestrate file
2) Operating system file
In the descriptor file, we can see the schema details and the address of the data.
In the data file, we can see the data in native format.
The control and header files reside in the operating system.
Sort Stage:
When we set the relevant option to True, the Sort stage creates group IDs. Data is divided
into groups based on the key column, and the stage outputs 1 for the first row of every group
and 0 for the remaining rows in each group.
Whether to use the Key Change Column or the Cluster Key Change Column depends on the
data we are getting:
If the incoming data is not sorted, set Create Key Change Column to True to create the
group IDs.
If the incoming data is already sorted, set Create Cluster Key Change Column to True instead.
Aggregator Stage:
The Aggregator stage is a processing stage used for grouping and summary operations. By
default the Aggregator stage executes in parallel mode in parallel jobs.
Note: in a parallel environment, the way we partition data before grouping and
summarizing affects the results. If you partition data using the round-robin method,
records with the same key values will be distributed across different partitions, and that
will give incorrect results.
Aggregation Method:
1) Hash: the hash aggregator maintains a results entry in memory for every group, so the input
does not need to be sorted, but memory use grows with the number of groups.
2) Sort: sort mode requires the input data set to have been partitioned and sorted with all of the
grouping keys specified as hashing and sorting keys. Unlike the hash aggregator, the sort
aggregator requires presorted data, but only maintains the calculations for the current group
in memory.
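The sort method's memory behaviour can be sketched in Python (a hypothetical illustration: only the current group's running count and sum are held in memory):

```python
def sort_aggregate(sorted_rows, key, value):
    # Input must be presorted on the key. When the key changes, the finished
    # group's (key, count, sum) is emitted and the running totals reset.
    results, current, count, total = [], object(), 0, 0
    for row in sorted_rows:
        if row[key] != current:
            if count:
                results.append((current, count, total))
            current, count, total = row[key], 0, 0
        count += 1
        total += row[value]
    if count:
        results.append((current, count, total))
    return results

rows = [{"dept": 10, "sal": 100}, {"dept": 10, "sal": 200}, {"dept": 20, "sal": 50}]
print(sort_aggregate(rows, "dept", "sal"))  # [(10, 2, 300), (20, 1, 50)]
```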
By default the Aggregator stage's calculation output column is of double data type; if you want
decimal output, add the property shown in the figure below.
If you are using a single key column for grouping, there is no need to sort or hash-partition
the incoming data.
table_a
dno,name
10,siva
10,ram
10,sam
20,tom
30,emy
20,tiny
40,remo
Requirement: records whose dno value repeats multiple times should go to one target, and
records whose dno appears only once should go to another target. This can be done with an
Aggregator stage (count by dno) followed by a Filter stage.
We can use the Aggregator stage to find the number of people in each department.
e_id,e_name,dept_no
1,sam,10
2,tom,20
3,pinky,10
4,lin,20
5,jim,10
6,emy,30
7,pom,10
8,jem,20
9,vin,30
10,den,20
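For this sample, the per-department counts the Aggregator would produce can be checked with a short Python sketch (a hypothetical illustration, not DataStage code):

```python
from collections import Counter

rows = [(1, "sam", 10), (2, "tom", 20), (3, "pinky", 10), (4, "lin", 20),
        (5, "jim", 10), (6, "emy", 30), (7, "pom", 10), (8, "jem", 20),
        (9, "vin", 30), (10, "den", 20)]
# Group by dept_no and count rows per group.
counts = Counter(dept for _, _, dept in rows)
print(sorted(counts.items()))  # [(10, 4), (20, 4), (30, 2)]
```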
Seq.-------Agg.Stage--------Seq.File
100,sam,clerck,2000,10
200,tom,salesman,1200,20
300,lin,driver,1600,20
400,tim,manager,2500,10
500,zim,pa,2200,10
600,eli,clerck,2300,20
Here our requirement is to find the maximum salary for each department number.
According to this sample data, we have two departments.
Take a Sequential File stage to read the data, an Aggregator stage for the calculation, and a
Sequential File stage to load the target.
Seq.File--------Aggregator-----------Seq.File
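The expected Aggregator output for this sample can be checked with a Python sketch (a hypothetical illustration of a max-per-group aggregation):

```python
rows = [(100, "sam", "clerck", 2000, 10), (200, "tom", "salesman", 1200, 20),
        (300, "lin", "driver", 1600, 20), (400, "tim", "manager", 2500, 10),
        (500, "zim", "pa", 2200, 10), (600, "eli", "clerck", 2300, 20)]
# Group by dept_no, keeping the maximum salary seen in each group.
max_sal = {}
for _, _, _, sal, dept in rows:
    max_sal[dept] = max(sal, max_sal.get(dept, sal))
print(sorted(max_sal.items()))  # [(10, 2500), (20, 2300)]
```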
Reader comment (Ram R):
Hi,
I tried this one and have some questions.
If we have a data as below
table_a
dno,name
10,siva
10,ram
10,sam
20,tom
30,emy
20,tiny
40,remo
And we need to get the same multiple times records into the one target.
And single records not repeated with respected to dno need to come to one target.
My question: I placed 2 sequential files, one with count > 1 and the other with count <= 1.
The outputs were:
dno count
10 3
20 2
and
dno count
40 1
30 1
Join Stage:
If we have three tables to join and we don't have the same key column in all the tables to
join on, we can use multiple Join stages to join the tables.
soft_com_1
e_id,e_name,e_job,dept_no
001,james,developer,10
002,merlin,tester,20
003,jonathan,developer,10
004,morgan,tester,20
005,mary,tester,20
soft_com_2
dept_no,d_name,loc_id
10,developer,200
20,tester,300
soft_com_3
loc_id,add_1,add_2
200,melbourne,victoria
300,brisbane,queensland
To join tables using the Join stage, we need common key columns in those tables. But
sometimes we get data without a common key column. In that case we can use a Column
Generator to create a common column in both tables.
Now go to the Join stage and select the key column we created (you can give it any name;
based on the business requirement, give it an understandable name).
Table1
e_id,e_name,e_loc
100,andi,chicago
200,borny,Indiana
300,Tommy,NewYork
Table2
Bizno,Job
20,clerk
30,salesman
xyz1 (Table 1 )
e_id,e_name,e_add
1,tim,la
2,sam,wsn
3,kim,mex
4,lin,ind
5,elina,chc
xyz2 (Table 2 )
e_id,address
1,los angeles
2,washington
3,mexico
4,indiana
5,chicago
e_id, e_name,address
1,tim,los angeles
2,sam,washington
3,kim,mexico
4,lin,indiana
5,elina,chicago
In Out put Column Drag and Drop Required Columns to go to output file and click ok.
Compile and Run the Job . You will get the Required Output in the Target File.
Inner Join:
What happens if we have duplicates in the left table on the key field? We get all matching
records, including all matching duplicates. Here is the table representation of the join.
LeftOuter Join:
All the records from the left table plus all matching records; where no match exists in
the right table, the right-side columns are populated with nulls.
RightOuter Join:
All the records from the right table plus all matching records.
Full Outer Join:
All the records from both tables; wherever a match is missing on either side, the missing
columns are populated with nulls.
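The four join types, including how duplicate keys multiply matches, can be sketched with a small nested-loop join in Python (the `left`/`right` sample data below is hypothetical, not from the tables above):

```python
def join(left, right, key, how="inner"):
    """Nested-loop join on one key column. Duplicate keys on either
    side multiply the matches, just as in the Join stage."""
    out, matched_right = [], set()
    for l in left:
        hit = False
        for i, r in enumerate(right):
            if l[key] == r[key]:
                out.append({**l, **r})
                hit = True
                matched_right.add(i)
        if not hit and how in ("left", "full"):
            # unmatched left row: right-side columns become nulls
            out.append({**l, **{k: None for k in right[0] if k != key}})
    if how in ("right", "full"):
        for i, r in enumerate(right):
            if i not in matched_right:
                out.append({**{k: None for k in left[0] if k != key}, **r})
    return out

left = [{"k": 1, "a": "x"}, {"k": 1, "a": "y"}, {"k": 2, "a": "z"}]
right = [{"k": 1, "b": "p"}, {"k": 3, "b": "q"}]
inner = join(left, right, "k")                # both duplicate k=1 rows match
left_outer = join(left, right, "k", "left")   # adds the unmatched k=2 row
full_outer = join(left, right, "k", "full")   # also adds the unmatched k=3 row
```

Here `inner` has 2 rows, `left_outer` 3, and `full_outer` 4, which is exactly the duplicate-multiplication and null-padding behaviour described above.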
Lookup Stage:
The Lookup stage is most appropriate when the reference data for all lookup stages in a job
is small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or
MERGE stage should be used.
Lookup stages do not require data on the input link or reference links to be sorted. Be aware,
though, that large in-memory lookup tables will degrade performance because of their paging
requirements. Each record of the output data set contains columns from a source record plus columns
from all the corresponding lookup records where corresponding source and lookup records have the
same value for the lookup key columns. The lookup key columns do not have to have the same
names in the primary and the reference links.
The optional reject link carries source records that do not have a corresponding entry in the
input lookup tables.
You can also perform a range lookup, which compares the value of a source column to a range of
values between two lookup table columns. If the source column value falls within the required range,
a row is passed to the output link. Alternatively, you can compare the value of a lookup column to a
range of values between two source columns. Range lookups must be based on column values, not
constant values. Multiple ranges are supported.
There are some special partitioning considerations for Lookup stages. You need to ensure that the
data being looked up in the lookup table is in the same partition as the input data referencing it. One
way of doing this is to partition the lookup tables using the Entire method.
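The in-memory lookup with a reject link can be sketched as a dictionary keyed on the lookup column (the sample rows below are hypothetical; note that a duplicate key in the reference simply overwrites the earlier entry here, analogous to the "Ignoring duplicate entry" warning):

```python
def lookup(source, reference, key):
    """Build the reference data as an in-memory table keyed on the
    lookup column; unmatched source rows go down the reject link."""
    ref = {r[key]: r for r in reference}
    output, reject = [], []
    for row in source:
        match = ref.get(row[key])
        if match is None:
            reject.append(row)          # no corresponding reference entry
        else:
            output.append({**row, **match})
    return output, reject

src = [{"cust": 1, "name": "UMA"}, {"cust": 9, "name": "NEW"}]
ref = [{"cust": 1, "city": "CYPRESS"}]
out, rej = lookup(src, ref, "cust")
```

`out` carries the one matched row enriched with the reference columns; `rej` carries the source row with no corresponding entry.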
Scenario1: Continue
Choose entire partition on the reference link
Scenario2:Fail
Job aborted with the following error:
Scenario 3: Drop
Scenario4:Reject
If we select reject as lookup failure condition then we need to add reject link otherwise we get
compilation error.
Range Lookup:
Business scenario: we have input data with customer id, customer name, and transaction date.
We have a customer dimension table with customer address information. A customer can have
multiple records with different start and end dates, and we want to select the record where
the incoming transaction date falls between the start and end dates of the customer record
in the dim table.
Ex Input Data:
1 UMA 2011-03-01
1 UMA 2010-05-01
Ex Dim Data:
stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table
Range lookup is used to check a record against a range of values in another table.
For example, suppose we have a list of employees with salaries from $1500 to $3000.
Open the lookup file and select e_sal in the first table data.
Click Ok.
Then drag and drop the required columns into the output and click Ok.
Then compile and run the job; you will get the required output.
Note: remember that we go for a lookup only when the reference data is small. With big
reference data there is a performance issue (the I/O work increases) and sometimes the
job will abort.
Normal lookup: all the reference table data is stored in a buffer for cross-checking
with the primary table data.
Sparse lookup: each record of the primary table is cross-checked directly against the
reference table. These lookup types arise only when the reference table is in a database,
so depending on the size of the reference table we decide which type of lookup to implement.
Merge Stage:
The Merge stage is a processing stage. It can have any number of input links, a single output
link, and the same number of reject links as there are update input links.(according to DS
documentation)
The Merge stage combines a master dataset with one or more update datasets based on the key
columns. The output record contains all the columns from the master record plus any additional
columns from each update record that are required.
A master record and an update record are merged only if both have the same key column values.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures
that rows with the same key column values are located in the same partition and will be processed by
the same node. It also minimizes memory requirements because fewer rows need to be in memory at
any one time.
As part of preprocessing your data for the Merge stage, you should also remove duplicate
records from the master data set. If you have more than one update data set, you must remove
duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject
links. You can route update link rows that fail to match a master row down a reject link that is specific
for that link. You must have the same number of reject links as you have update links. The Link
Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject
links. You can also specify whether to drop unmatched master rows, or output them on the output
data link.
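The master-plus-updates behaviour, including the per-update-link rejects, can be sketched in Python (hypothetical sample rows, one update link):

```python
def merge_stage(master, updates, key):
    """Master rows pick up the extra columns from a matching update row;
    update rows with no matching master go to that link's reject list."""
    out = [dict(m) for m in master]          # copy so the input stays intact
    index = {m[key]: m for m in out}
    rejects = []
    for u in updates:
        m = index.get(u[key])
        if m is None:
            rejects.append(u)                # unmatched update row
        else:
            m.update({k: v for k, v in u.items() if k != key})
    return out, rejects

master = [{"id": 1, "name": "UMA"}, {"id": 2, "name": "POOJITHA"}]
updates = [{"id": 1, "city": "CYPRESS"}, {"id": 3, "city": "ORANGE"}]
merged, rejected = merge_stage(master, updates, "id")
```

The id=1 master row gains the `city` column, the id=2 master row passes through unchanged, and the id=3 update row lands on the reject list.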
Example :
Master
dataset:
CUSTOMER_ID CUSTOMER_NAME
1 UMA
2 POOJITHA
Update
dataset1
1 CYPRESS 90630 M
2 CYPRESS 90630 F
Output:
Unmatched Masters Mode:Keep means that unmatched rows (those without any updates) from the
master link are output; Drop means that unmatched rows are dropped instead.
Warn On Reject Updates:True to generate a warning when bad records from any update links are
rejected.
Warn On Unmatched Masters:True to generate a warning when there are unmatched rows from the
master link.
Scenario 2:
stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.
Scenario 3: Drop unmatched master records and capture reject records from updateds1.
Update
Dataset2
CUSTOMER_ID CITIZENSHIP
1 INDIAN
2 AMERICAN
We still have a duplicate row in the master dataset; if you compile the job with the above
design, you will get a compilation error like the one below.
If you look at the above figure you can see 2 rows in the output, because we have a matching
row for customer_id = 2 in updateds2.
Scenario 5: add a duplicate row for customer_id=1 in the updateds1 dataset.
No change in the results; the Merge stage automatically dropped the duplicate row.
Scenario 6: modify the duplicate row for customer_id=1 in updateds1 with zipcode 90630
instead of 90620.
=====================================================================================
Filter Stage:
The Filter stage is a processing stage used to filter data based on filter conditions.
Scenario 1: Check for empty values in the customer name field. We are reading from a
sequential file, and hence we should check for an empty value instead of null.
Scenario 2: Comparing incoming fields: check whether the transaction date falls between
strt_dt and end_dt, and filter those records.
Input Data:
Partition data based on CUSTOMER_ID to make sure all rows with same key values process on the
same node.
Output :
Reject :
e_id,e_name,e_sal
1,sam,2000
2,ram,2200
3,pollard,1800
4,ponting,2200
5,sachin,2200
Seq.File---------Filter------------DatasetFile
e_sal=2200
Click Ok
Go to Target Dataset file and give some name to the file and that's it
( You can get the link order number in link ordering Option)
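The Filter stage's where clause (`e_sal=2200`) splits the rows exactly like this minimal Python sketch of the sample data above:

```python
# Sample rows from above: (e_id, e_name, e_sal)
rows = [(1, "sam", 2000), (2, "ram", 2200), (3, "pollard", 1800),
        (4, "ponting", 2200), (5, "sachin", 2200)]

# Where clause e_sal = 2200: matches go to the output link;
# everything else can be captured on a reject link
matched = [r for r in rows if r[2] == 2200]
rejected = [r for r in rows if r[2] != 2200]
```

Here `matched` carries ram, ponting and sachin, while sam and pollard fall through to the reject link.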
Copy Stage is a processing stage that has one input and 'n' number of outputs. The
Copy stage is used to send one source dataset out as multiple copies, which can then be used for
multiple purposes. The records we send through the Copy stage are copied without any modifications,
and we can also do the following:
a) The column order can be altered.
b) Columns can be dropped.
c) Column names can be changed.
In the Copy stage we have an option called Force. It is False by default; if set to True, it
specifies that DataStage should not try to optimize the job by removing a copy operation where
there is one input and one output.
================================================================================
Funnel Stage:
The Funnel stage is used to combine multiple input datasets into a single output dataset.
This stage can have any number of input links and a single output link.
It operates in 3 modes:
Continuous Funnel combines the records as they arrive, in no guaranteed order;
Sort Funnel combines the input records in the order defined by one or more key fields;
Sequence copies all records from the first input data set to the output data set, then all the records
from the second input data set, etc.
Sort Funnel requires that the data be sorted and partitioned by the same key columns as those
used by the funnel operation.
Hash partitioning guarantees that all records with the same key column values are located in the
same partition and are processed on the same node.
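The three modes above can be sketched with Python's standard library (two hypothetical input links of already-sorted integers):

```python
from heapq import merge as sort_merge
from itertools import chain

link1 = [1, 4, 7]   # each input link's rows (already sorted for Sort Funnel)
link2 = [2, 3, 9]

# Sequence mode: all of link1, then all of link2
sequence = list(chain(link1, link2))

# Sort Funnel: merge the pre-sorted inputs by key, keeping global order
sort_funnel = list(sort_merge(link1, link2))

# Continuous Funnel would interleave rows as they happen to arrive,
# so its output order is not deterministic.
```

`sequence` yields `[1, 4, 7, 2, 3, 9]` while `sort_funnel` yields `[1, 2, 3, 4, 7, 9]`, illustrating why Sort Funnel needs its inputs pre-sorted on the key.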
1)Continuous funnel:
Go to the properties of the funnel stage page and set Funnel Type to continuous funnel.
2)Sequence:
Note:In order to use sequence funnel you need to specify which order the input links you
need to process and also make sure the stage runs in sequential mode.
Usually we use sequence funnel when we create a file with header,detail and trailer records.
3)Sort Funnel:
Note: If you are running your sort funnel stage in parallel, you should be aware of the
various
considerations about sorting data and partitions
Sometimes we get data in multiple files that belongs to the same bank's customer information.
In that case we need to funnel the files to get the data from multiple files into a single file (table).
xyzbank1
e_id,e_name,e_loc
111,tom,sydney
222,renu,melbourne
333,james,canberra
444,merlin,melbourne
xyzbank2
e_id,e_name,e_loc
555,flower,perth
666,paul,goldenbeach
777,raun,Aucland
888,ten,kiwi
For Funnel take the Job design as
Column Generator :
Column Generator is a development/debug stage that is used to generate columns with mock data.
Seq.File--------------Col.Gen------------------Ds
xyzbank
e_id,e_name,e_loc
555,flower,perth
666,paul,goldencopy
777,james,aucland
888,cheffler,kiwi
In order to generate a column (for example, unique_id), go to the Columns tab, enter the
column name, and change the data type for unique_id under SQL type if needed.
Surrogate Key Stage:
A natural key may be an alphanumeric composite key, but a surrogate key is a simple integer.
The Surrogate Key stage is used to generate key columns, for which characteristics can be
specified. It generates sequential, incremental, and unique integers from a provided start
point. It can have a single input and a single output link.
Duplicate business-key values are allowed under a surrogate key, which cannot happen with a
primary key.
Using a surrogate key we can continue the sequence across jobs: if a job aborted after n
records were loaded, the surrogate key can continue the sequence from n+1.
A surrogate key is a unique primary key that is not derived from the data that it represents, therefore
changes to the data will not change the primary key. In a star schema database, surrogate keys are
used to join a fact table to a dimension table.
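The "resume from n+1 after an abort" behaviour comes from the persisted key source. Here is a minimal Python sketch of that idea, using a hypothetical `skey.state` file standing in for the key-source flat file:

```python
import os

STATE_FILE = "skey.state"   # hypothetical stand-in for the key-source flat file

def next_keys(n):
    """Hand out n sequential surrogate keys, resuming from the last
    highest value recorded in the state file."""
    last = int(open(STATE_FILE).read()) if os.path.exists(STATE_FILE) else 0
    keys = list(range(last + 1, last + n + 1))
    with open(STATE_FILE, "w") as f:
        f.write(str(keys[-1]))      # persist the highest value handed out
    return keys

# start fresh for the demo
if os.path.exists(STATE_FILE):
    os.remove(STATE_FILE)

batch1 = next_keys(3)   # first run loads 3 records
batch2 = next_keys(2)   # a later run resumes at n+1
```

The second call picks up where the first left off, which is exactly what the stage's key-source file makes possible across job runs.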
Double click on the surrogate key stage and click on properties tab.
Properties:
Key Source Action = create
Source Type : FlatFile or Database sequence(in this case we are using FlatFile)
If you want to check the content, change View State File = YES and check the job log for details.
If you try to create the same file again, the job will abort with the following error.
To update the state file, add a surrogate key stage to the job with a single input link from
another stage.
1) Open the surrogate key stage editor and go to the properties tab.
If the state file exists we can update it; otherwise we can create it and then update it.
We are using the SkeyValue parameter to update the state file using a transformer stage.
Now we have created the state file and will generate keys using it.
Click on the surrogate key stage, go to the Properties tab, and type a name for the surrogate
key column in the Generated Output Column Name property.
Go to output and define the mapping like below.
We are using a Row Generator with 10 rows, hence when we run the job we see 10 skey values in the output.
I have updated the stat file with 100 and below is the output.
If you want to generate the key values from the beginning, you can use the following property
in the surrogate key stage.
a. If the key source is a flat file, specify how keys are generated:
o To generate keys in sequence from the highest value that was last used, set the Generate Key from
Last Highest Value property to Yes. Any gaps in the key range are ignored.
o To specify a value to initialize the key source, add the File Initial Value property to the Options group,
and specify the start value for key generation.
o To control the block size for key ranges, add the File Block Size property to the Options group, set this
property to User specified, and specify a value for the block size.
b. If there is no input link, add the Number of Records property to the Options group, and specify how
many records to generate.
=====================================================================================
SCD :
SCDs are dimensions that contain data that changes slowly, rather than on a regular
schedule. There are three main types:
Type-1 SCD
Type-2 SCD
Type-3 SCD
Type-1 SCD: In the Type-1 SCD methodology, the older data (records) is overwritten
with the new data (records), and therefore no historical information is maintained.
This is used for correcting the spellings of names, and for small updates to
customers.
Type-2 SCD: In the Type-2 SCD methodology, the complete historical information is
tracked by creating multiple records for a given natural key (primary key).
Here we use different kinds of options in order to track the historical data of
customers, such as:
a) Active flag
b) Date functions
c) Version numbers
d) Surrogate keys
Type-3 SCD: The Type-3 SCD maintains partial historical
information.
HOW TO USE TYPE-2 SCD IN DATASTAGE?
SCD stands for Slowly Changing Dimensions: dimensions that contain data that changes slowly,
rather than on a regular schedule.
Type-2 SCD:-- The Type-2 methodology tracks complete historical information by creating
multiple records for a given natural key in the dimension tables, with separate surrogate keys
and/or different version numbers.
We get unlimited history preservation, as a new record is inserted each time a change is made.
For example, you may have a customer dimension in a retail domain. Let say the customer is in
India and every month he does some shopping. Now creating the sales report for the customers is
easy. Now assume that the customer is transferred to United States and he does shopping there.
How to record such a change in your customer dimension?
You could sum or average the sales done by the customer, but then you won't get an exact
comparison of the sales. As the customer's salary increased after the transfer, he/she might do
more shopping in the United States compared to India; if you simply sum the total sales, the
customer's sales might look stronger than they really are. You could instead create a second
customer record and treat the transferred customer as a new customer. However, this creates
problems too.
Handling these issues involves the SCD management methodologies referred to as Type 1 to
Type 3. The different types of slowly changing dimensions are explained in detail below.
SCD Type 1: SCD type 1 methodology is used when there is no need to store historical data in the
dimension table. This method overwrites the old data in the dimension table with the new data. It is
used to correct data errors in the dimension.
surrogate_key customer_id customer_name Location
------------------------------------------------
1 1 Marspton Illions
Here the customer name is misspelt. It should be Marston instead of Marspton. If you use type1
method, it just simply overwrites the data. The data in the updated table will be.
surrogate_key customer_id customer_name Location
------------------------------------------------
1 1 Marston Illions
The advantage of type1 is ease of maintenance and less space occupied. The disadvantage is that
there is no historical data kept in the data warehouse.
SCD Type 3: In the type 3 method, only the current status and the previous status of the row
are maintained in the table. To track these changes, two separate columns are created in the
table. The customer dimension table in the type 3 method will look as
surrogate_key customer_id customer_name Current_Location previous_location
--------------------------------------------------------------------------
1 1 Marston Illions NULL
Let say, the customer moves from Illions to Seattle and the updated table will look as
surrogate_key customer_id customer_name Current_Location previous_location
--------------------------------------------------------------------------
1 1 Marston Seattle Illions
Now again if the customer moves from seattle to NewYork, then the updated table will be
surrogate_key customer_id customer_name Current_Location previous_location
--------------------------------------------------------------------------
1 1 Marston NewYork Seattle
The type 3 method will have limited history and it depends on the number of columns you create.
SCD Type 2: SCD type 2 stores the entire history of the data in the dimension table. With type 2
we can store unlimited history in the dimension table. In type 2, you can store the data in three
different ways. They are:
Versioning
Flagging
Effective Date
SCD Type 2 Versioning: In the versioning method, a sequence number is used to represent the
change. The latest sequence number always represents the current row, and the previous sequence
numbers represent the past data.
As an example, let’s use the same example of customer who changes the location. Initially the
customer is in Illions location and the data in dimension table will look as.
surrogate_key customer_id customer_name Location Version
--------------------------------------------------------
1 1 Marston Illions 1
The customer moves from Illions to Seattle and the version number will be incremented. The
dimension table will look as
surrogate_key customer_id customer_name Location Version
--------------------------------------------------------
1 1 Marston Illions 1
2 1 Marston Seattle 2
Now again if the customer is moved to another location, a new record will be inserted into the
dimension table with the next version number.
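The insert-with-incremented-version step can be sketched as a small Python function operating on the example dimension rows above:

```python
def scd2_versioning(dim, change, key="customer_id"):
    """Insert a new row for the changed natural key with version + 1;
    earlier versions are left untouched, preserving full history."""
    latest = max((r for r in dim if r[key] == change[key]),
                 key=lambda r: r["version"], default=None)
    new = dict(latest) if latest else {key: change[key]}
    new.update(change)                  # apply the changed attributes
    new["version"] = latest["version"] + 1 if latest else 1
    return dim + [new]

dim = [{"customer_id": 1, "name": "Marston",
        "location": "Illions", "version": 1}]
dim = scd2_versioning(dim, {"customer_id": 1, "location": "Seattle"})
```

After the move from Illions to Seattle, the dimension holds both rows, and version 2 is the current record.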
SCD Type 2 Flagging: In flagging method, a flag column is created in the dimension table. The
current record will have the flag value as 1 and the previous records will have the flag as 0.
Now for the first time, the customer dimension will look as.
surrogate_key customer_id customer_name Location Flag
--------------------------------------------------------
1 1 Marston Illions 1
Now when the customer moves to a new location, the old records will be updated with flag value as
0 and the latest record will have the flag value as 1.
surrogate_key customer_id customer_name Location Flag
--------------------------------------------------------
1 1 Marston Illions 0
2 1 Marston Seattle 1
SCD Type 2 Effective Date: In Effective Date method, the period of the change is tracked using the
start_date and end_date columns in the dimension table.
surrogate_key customer_id customer_name Location Start_Date End_Date
-------------------------------------------------------------------------
The NULL in the End_Date indicates the current version of the data and the remaining records
indicate the past data.
Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An
additional dimension record is created and the segmenting between the old record values and the new
(current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact
table usually stores dimension key and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of
many possible designs which can implement this dimension.
For this example, we will use a table with customer data (its name is D_CUSTOMER_SCD2) which
has the following structure and data:
D_CUSTOMER dimension table before loading
The most important facts and stages of the CUST_SCD2 job processing:
• The dimension table with customers is refreshed daily and one of the data sources is a text file. For
the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in the database and it
is the only record with changed data. It has the following structure and data:
SCD 2 - Customers file extract:
• There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text
file.
• A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to
separate columns.
SCD 2 lookup transformer
• A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes
through only records that differ.
SCD 2 check discrepancies transformer
• A T003 transformer handles the UPDATE and INSERT actions for a record. The old record is updated
with the current indicator flag set to no, and the new record is inserted with the current indicator
flag set to yes, the record version increased by 1, and the current date.
SCD 2 insert-update record transformer
• ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and
the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed
where part of an SQL statement.
• ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and
the key column is CUST_ID.
D_CUSTOMER dimension table after Datawarehouse refresh
===============================================================
Pivot_Enterprise_Stage:
The Pivot Enterprise stage is a processing stage which pivots data horizontally or vertically
depending upon the requirements. There are two types:
1. Horizontal
2. Vertical
A horizontal pivot maps a set of input columns onto multiple output rows; a vertical pivot is
the exact opposite, mapping a set of input rows onto multiple output columns.
Let's try to understand it one by one with the following example.
Select ‘Horizontal’ for Pivot Type from drop-down menu under ‘Properties’ tab for horizontal Pivot
operation.
Step 3: Click on the 'Pivot Properties' tab, where we need to tick the 'Pivot Index' check box.
After this, a column named 'Pivot_Index' will appear under the 'Name' column; also declare a new
column named 'Color' as shown below.
Step 4: Now we have to mention columns to be pivoted under ‘Derivation’ against column ‘Color’.
Double click on it. Following Window will pop up.
Select columns to be pivoted from ‘Available column’ pane as shown. Click ‘OK’.
Step 5: Under ‘Output’ tab, only map pivoted column as shown.
Configure output stage. Give the file path. See below image for reference.
Step 6: Compile and run the job. Let's see what happens to the output.
This is how we can set multiple input columns to the single column (As here for colors).
Vertical Pivot Operation:
Here, we are going to use ‘Pivot Enterprise’ stage to vertically pivot data. We are going to set multiple
input rows to a single row. The main advantage of this stage is we can use aggregation functions like
avg, sum, min, max, first, last etc. for pivoted column. Let’s see how it works.
Consider an output data of Horizontal Operation as input data for the Pivot Enterprise stage. Here, we
will be adding one extra column for aggregation function as shown in below table.
Step 2: Open Pivot Enterprise stage and select Pivot type as vertical under properties tab.
Step 3: Under the Pivot Properties tab we need at least one pivot column and one group-by
column. Here, we declared Product as the group-by column, and Color and Prize as pivot columns.
Let's see how to use 'Aggregation functions' in the next step.
Step 4: On clicking 'Aggregation functions required for this column' for a particular column,
the following window will pop up, in which we can select whichever functions are required for
that column. Here we are using the 'min', 'max' and 'average' functions, with proper precision
and scale, for the Prize column as shown.
Step 5: Now we just have to do mapping under output tab as shown below.
Step 6: Compile and run the job. Let's see what the output will be.
Output :
Let me first tell you that a Pivot stage only CONVERTS COLUMNS INTO ROWS and
nothing else. Some DS professionals refer to this as NORMALIZATION. Another fact
about the Pivot stage is that it's irreplaceable, i.e. no other stage has this functionality
of converting columns into rows! So that makes it unique, doesn't it!
Let's cover how exactly it does it....
For example, let's take a file with the following fields: Item, Quantity1, Quantity2,
Quantity3....
Item~Quantity1~Quantity2~Quantity3
ABC~100~1000~10000
DEF~200~2000~20000
GHI~300~3000~30000
Basically you would use a Pivot stage when you need to convert those 3 Quantity fields
into a single field which contains a unique Quantity value per row, i.e. you would need
the following output:
Item~Quantity
ABC~100
ABC~1000
ABC~10000
DEF~200
DEF~2000
DEF~20000
GHI~300
GHI~3000
GHI~30000
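The columns-into-rows conversion above is a classic "melt"; as a minimal Python sketch of what the Pivot stage does with this data:

```python
rows = [
    ("ABC", 100, 1000, 10000),
    ("DEF", 200, 2000, 20000),
    ("GHI", 300, 3000, 30000),
]

# Each Quantity column becomes its own (Item, Quantity) output row
pivoted = [(item, qty) for item, *qtys in rows for qty in qtys]
```

The 3 input rows with 3 Quantity columns each become 9 output rows, starting ("ABC", 100), ("ABC", 1000), ("ABC", 10000).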
Unlike other stages, a pivot stage doesn't use the generic GUI stage page. It has a
stage page of its own. And by default the Output columns page would not have
any fields. Hence, you need to manually type in the fields. In this case just type in
the 2 field names : Item and Quantity. However manual typing of the columns
becomes a tedious process when the number of fields is more. In this case you can
use the Metadata Save - Load feature. Go to the input columns tab of the pivot stage,
save the table definitions and load them in the output columns tab. This is the way
I use it!!!
Now you have the following fields in the Output Columns tab: Item and
Quantity. Here comes the tricky part, i.e. you need to specify the DERIVATION.
In case the field names of the Output Columns tab are the same as the Input tab, you
need not specify any derivation, i.e. in this case, for the Item field, you need not
specify any derivation. But if the Output Columns tab has new field names, you
need to specify a derivation or you would get a RUN-TIME error for free....
For our example, you need to type the derivation for the Quantity field as
Quantity1, Quantity2, Quantity3.
Just attach another file stage and view your output! So, objective met!
Sequence_Activities :
In this article I will explain how to use DataStage looping activities in a sequencer.
I have a requirement where I need to pass a file id as a parameter, reading it from a file. In
future the file ids will increase, so I won't have to add a job or change the sequencer if I
take advantage of DataStage looping.
1|200
2|300
3|400
I need to read the above file and pass the second field as a parameter to the job. I have created
one parallel job with pFileID as a parameter.
Step 1: Count the number of lines in the file so that we can set the upper limit in the DataStage
Start Loop activity.
End
Now we use the StartLoop.$Counter variable to get the file id, using a combination of the grep
and awk commands.
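The grep/awk extraction — take line number `$Counter` and return its second pipe-delimited field — can be sketched in Python (the `file_ids.txt` name below is hypothetical, standing in for the parameter file):

```python
def file_id_for_iteration(path, counter):
    """Return the second pipe-delimited field of line number `counter`
    (1-based, matching StartLoop.$Counter), or None past end of file."""
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if line_no == counter:
                return line.rstrip("\n").split("|")[1]
    return None

# demo input matching the sample file above (hypothetical file name)
with open("file_ids.txt", "w") as f:
    f.write("1|200\n2|300\n3|400\n")

second_id = file_id_for_iteration("file_ids.txt", 2)
```

For iteration 2 this returns "300", which would be passed to the job as pFileID.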
===============================================================
TRANSFORMER STAGE TO FILTER THE DATA :
If our requirement is to filter the data department-wise from the file below:
samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,production,40
In the Transformer stage, just drag and drop the data to the target tables.
Shared Container :
Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000
rows but after running the job only 5,000 rows have been loaded into the target table and the
job aborts, how can we sort out the problem? If the job sequencer synchronizes or controls the
4 jobs but job 1 has a problem, you should go to the Director and check what type of problem is
showing: a data type problem, a warning message, a job fail, or a job abort. A job fail usually
means a data type problem or a missing column action. So you should go to the Run
window -> Click -> Tracing -> Performance, or in your target table -> General -> Action ->
select one of these two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue.
First check how much data has already loaded, then select the On Skip option and Continue; for
the remaining data not yet loaded, select On Fail, Continue. Run the job again and you will
get a success message.
----------------------------------------------------------------------------------------------------------
Question: I want to process 3 files sequentially, one by one. How can I do that? While
processing, it should fetch the files automatically.
Ans: If the metadata for all the files is the same, then create a job having the file name as a
parameter, then use the same job in a routine and call the job with different file names... or
you can create a sequencer to use the job.
---------------------------------------------------------------------------------------------------------------------
Parameterize the file name.
Build the job using that parameter
Build job sequencer which will call this job and will accept the parameter for file name.
Write a UNIX shell script which will call the job sequencer three times by passing different file
each time.
RE: What happens if RCP is disabled?
In such a case OSH has to perform import and export every time the job runs, and the job's
processing time is also increased...
--------------------------------------------------------------------------------------------------------------------
Runtime column propagation (RCP): If RCP is enabled for any job and specifically for those
stages whose output connects to the shared container input then meta data will be propagated at
run time so there is no need to map it at design time.
If RCP is disabled for the job in such case OSH has to perform Import and export every time
when the job runs and the processing time job is also increased.
Then you have to manually enter all the column descriptions in each stage. (RCP = Runtime
Column Propagation.)
Question:
Source: Target
Requirement:
The field should be right-justified and zero-filled; take the last 18 characters.
Solution:
Right("0000000000":Trim(Lnk_Xfm_Trans.link),18)
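The DataStage BASIC solution pads with ten zeros, which is only enough when the trimmed value has at least eight characters; a Python sketch of the same idea, padding with a full eighteen zeros to be safe:

```python
def zero_fill_18(value: str) -> str:
    # Pad the trimmed value with leading zeros, then keep the last
    # 18 characters, mirroring Right("0...0" : Trim(link), 18)
    return ("0" * 18 + value.strip())[-18:]
```

For example, `zero_fill_18(" 12345 ")` yields an 18-character string of thirteen zeros followed by 12345, while an input longer than 18 characters is truncated to its last 18.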
Scenario 1:
We have two datasets with 4 columns each, with different names. We should create a dataset with
4 columns: 3 from one dataset, and one column with the record count of the other dataset.
We can use an aggregator with a dummy column to get the count from one dataset, then do a
lookup from the other dataset and map it to the third dataset.
Something similar to the below design:
Scenario 2:
Following is the existing job design. But the requirement changed: the head and trailer datasets
should be populated even if no detail records are present in the source file. The job below
doesn't do that.
Hence the above job was changed to meet the new requirement: we used a Row Generator with a
Copy stage, giving a default value (zero) for the count column coming from the Row Generator.
If there are no detail records, it will pick the record count from the Row Generator.
We have a source which is a sequential file with a header and footer. How can we remove the header
and footer while reading this file using the Sequential File stage of DataStage?
Sol: run this command (for example in the before-job subroutine), then use the new file in the
Sequential File stage: sed '1d;$d' file_name > new_file_name
If I have a source like COL1 = (A, A, B) and a target like (COL1, COL2) = (A,1), (A,2), (B,1),
how do I achieve this output using stage variables in the Transformer stage?
Question:
Source:
COMPANY LOCATION
IBM HYD
TCS BAN
IBM CHE
HCL HYD
TCS CHE
IBM BAN
HCL BAN
HCL CHE
Target (each company with its locations and the location count):
TCS HYD 3
BAN
CHE
IBM HYD 3
BAN
CHE
HCL HYD 3
BAN
CHE
2) Input is like this:
no,char
1,a
2,b
3,a
4,b
5,a
6,a
7,b
8,a
But the output should number each duplicate occurrence per value, in this form:
output:
no,char,Count
"1","a","1"
"6","a","2"
"5","a","3"
"8","a","4"
"3","a","5"
"2","b","1"
"7","b","2"
"4","b","3"
3) Input is like this:
file1
10
20
10
10
20
30
Output is like (unique values to file2, duplicates to file3):
file2: 10, 20, 30
file3: 10, 10, 20
4) Input is like:
file1
10
20
10
10
20
30
Output is like (multiple occurrences in one file and single occurrences in another):
file2: 10, 10, 10, 20, 20
file3: 30
Or, keeping only the distinct values:
file2: 10, 20
file3: 30
Another case: output is like (odd-numbered records to file2, even-numbered to file3):
file2(odd) file3(even)
1 2
3 4
5 6
7 8
9 10
8) How to find the first and last salary in each dept without using the Aggregator stage?
9) How many ways are there to perform the remove-duplicates function without using the
Remove Duplicates stage?
Scenario:
Target (as above):
TCS HYD 3
BAN
CHE
IBM HYD 3
BAN
CHE
HCL HYD 3
BAN
CHE
Solution:
SeqFile ......> Sort ......> Transformer ......> RemoveDuplicates ..........> Dataset
Sort stage:
Key = Company
Sort order = Asc
Create key change column = True
Transformer stage:
Create a stage variable Company1:
Company1 = If (in.keychange = 1) Then in.Location Else Company1 : ',' : in.Location
Drag and drop in the derivations:
Company ....................> Company
Company1 ...................> Location
RemoveDuplicates stage:
Key = Company
Duplicates To Retain = Last
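The same sort/accumulate/keep-last logic can be sketched in awk (sample data written to a hypothetical comp.txt; the file name and rows are made up for illustration):

```shell
# sample COMPANY LOCATION rows (illustrative)
printf 'IBM HYD\nTCS BAN\nIBM CHE\nTCS HYD\n' > comp.txt

# sort on the company key, then concatenate locations per key and count them,
# printing one line per company: the RemoveDuplicates "retain last" effect
sort comp.txt | awk '
  $1 != prev { if (NR > 1) print prev, list, n; prev = $1; list = $2; n = 1; next }
  { list = list "," $2; n++ }
  END { if (NR > 0) print prev, list, n }'
```

Each output line is "company location-list count", mirroring the concatenated Company1 column.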
11) The input is:
Shirt|red|blue|green
Pant|pink|red|blue
Output should be:
Shirt:red
Shirt:blue
Shirt:green
Pant:pink
Pant:red
Pant:blue
Solution:
This is the reverse of the Pivot stage. Use:
seq ------ sort ------ transformer ---- removedup ----- transformer ---- target
In the Sort stage set "create key change column" to true.
In the first Transformer create a stage variable: if keychange = 1 then the column value, else
stagevar : ':' : column value.
In the RemoveDuplicates stage set "duplicates to retain" to Last.
In the final Transformer use the Field function to separate the columns.
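Outside DataStage, this vertical pivot can be sketched in a single awk line (sample record inline; the '|' delimiter comes from the scenario input):

```shell
# emit one "name:value" row per delimited value after the first field
echo 'Shirt|red|blue|green' | awk -F'|' '{for (i = 2; i <= NF; i++) print $1 ":" $i}'
```

Running it over both input rows produces the six output rows shown in the scenario.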
Similar scenario:
source
col1 col3
1 samsung
1 nokia
1 ericsson
2 iphone
2 motorola
3 lava
3 blackberry
3 reliance
Expected output:
col1 col2 col3 col4
1 samsung nokia ericsson
2 iphone motorola
3 lava blackberry reliance
You can get it by using: Sort stage --- Transformer stage --- RemoveDuplicates stage ---
Transformer stage --- target
First, read and load the data from your source file (for example a Sequential File stage).
In the Sort stage select Create Key Change Column = True (to generate group IDs).
Go to the Transformer stage and create a stage variable: right-click in the stage-variables
area, go to Properties, and name it as you wish (for example temp).
This stage variable accumulates the values you want in the required column as a
comma-delimited string.
On the RemoveDuplicates stage the key is col1; set Duplicates To Retain to Last.
In the final Transformer drop col3 and define three columns col2, col3, col4:
in the col2 derivation give Field(InputColumn, ",", 1),
in the col3 derivation give Field(InputColumn, ",", 2), and
in the col4 derivation give Field(InputColumn, ",", 3).
Scenario:
12) Consider the following employees data as source:
employee_id, salary
-------------------
10, 1000
20, 2000
30, 3000
40, 5000
Create a job to find the sum of the salaries of all employees, and this sum should repeat on all
the rows.
Scenario:
To output 4, send the records that are common to both 1 and 2.
To output 3, send the records that are only in 1 but not in 2.
To output 5, send the records that are only in 2 but not in 1.
Solution:
src1 -----> copy1 ------> ----------------------------------> output_3 (only left table)
                          Join (inner type) ----> output_4
src2 -----> copy2 ------> ----------------------------------> output_5 (only right table)
Scenario:
Create a job to find the sum of salaries of all employees and this sum should repeat for all
the rows.
Take Source ---> Transformer (add a new column on both output links and assign it the
value 1) ---> 1) Aggregator (group by the new column to get the sum),
2) Lookup/Join (join on the new column) ---> target.
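The aggregate-then-rejoin pattern can be sketched in awk (the scenario's sample rows are inlined; buffering rows stands in for the lookup back onto the total):

```shell
# first pass accumulates the salary total while buffering rows,
# END appends the total to every buffered row (the "repeat on all rows" effect)
printf '10,1000\n20,2000\n30,3000\n40,5000\n' | awk -F',' '
  { row[NR] = $0; total += $2 }
  END { for (i = 1; i <= NR; i++) print row[i] "," total }'
```

Every output row carries the same grand total, just as the join on the constant column would produce.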
Scenario:
sno,sname,mark1,mark2,mark3
1,rajesh,70,68,79
2,mamatha,39,45,78
3,anjali,67,39,78
4,pavani,89,56,45
5,indu,56,67,78
Output is (append the delimiter count of each record):
sno,sname,mark1,mark2,mark3,delimitercount
1,rajesh,70,68,79,4
2,mamatha,39,45,78,4
3,anjali,67,39,78,4
4,pavani,89,56,45,4
5,indu,56,67,78,4
seq ---> transformer ---> seq
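The delimiter count per record is simply the field count minus one; a quick awk sketch (sample record inline):

```shell
# with a comma field separator, NF-1 is the number of delimiters in the record
echo '1,rajesh,70,68,79' | awk -F',' '{print $0 "," (NF-1)}'
```

Applied to each row of the scenario this appends the ",4" shown in the expected output.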
Scenario: count the vowels in each name.
sname total_vowels_count
Allen 2
Scott 1
Ward 1
Transformer stage derivation:
total_vowels_count = Count(DSLink3.last_name,"a") + Count(DSLink3.last_name,"e") +
Count(DSLink3.last_name,"i") + Count(DSLink3.last_name,"o") + Count(DSLink3.last_name,"u")
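The same count can be sketched in awk, where gsub returns the number of substitutions made (sample names from the scenario; the character class also covers uppercase vowels, which the DataStage derivation above does not):

```shell
# copy the name into s so gsub does not destroy it, then count vowel deletions
printf 'Allen\nScott\nWard\n' | awk '{s = $1; n = gsub(/[aeiouAEIOU]/, "", s); print $1, n}'
```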
Scenario:
1) Daily we are getting some huge files, and the metadata of all the files is the same. How can we
load them into the target table?
Use a file pattern in the Sequential File stage.
2) One column has 10 records; at run time we have to send the 5th and 6th records to the target.
How can we do this?
This can be done by using a UNIX command in the Sequential File stage's filter option.
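One such filter command is a sed range print (seq here just generates ten sample records):

```shell
# pass through only records 5 and 6 of the stream
seq 1 10 | sed -n '5,6p'
```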
DaysSinceFromDate(CurrentDate(), DSLink3.date_18) <= 548 OR
DaysSinceFromDate(CurrentDate(), DSLink3.date_18) <= 546
where date_18 is the column holding the date that needs to be within 18 months; 548 is the
number of days in 18 months, and for a period spanning a leap year it is 546 (verify these
numbers for your case).
The Compile option only checks mandatory requirements such as link requirements and stage
options; it does not check whether the database connections are valid.
Validate is equivalent to running the job except for the extraction/loading of data; that is, the
Validate option tests database connectivity by making connections to the databases.
1. Sort and partition the input data of the Transformer on the key(s) which define the duplicate.
2. Define two stage variables, say StgVarPrevKeyCol (same data type as KeyCol) and StgVarCntr as
Integer with default value 0, where KeyCol is the input column which defines the duplicate:
StgVarCntr = If DSLinknn.KeyCol = StgVarPrevKeyCol Then StgVarCntr + 1 Else 1
StgVarPrevKeyCol = DSLinknn.KeyCol
(StgVarCntr must be defined above StgVarPrevKeyCol so that it compares against the previous row's key.)
3. Now in the constraint: filtering rows where StgVarCntr = 1 gives the unique records, and filtering
StgVarCntr > 1 gives the duplicate records.
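The stage-variable counter can be mimicked in the shell (sample "no,char" rows from the earlier scenario; sort stands in for the partition-and-sort step):

```shell
# sort on the key field, then reset the counter whenever the key changes,
# exactly like StgVarCntr/StgVarPrevKeyCol in the Transformer
printf '1,a\n2,b\n3,a\n4,b\n5,a\n6,a\n7,b\n8,a\n' |
sort -t',' -k2 |
awk -F',' '{cnt = ($2 == prev) ? cnt + 1 : 1; prev = $2; print $0 "," cnt}'
```

Rows with cnt = 1 are the uniques; rows with cnt > 1 are the duplicates.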
My source is like:
Sr_no, Name
10,a
10,b
20,c
30,d
30,e
40,f
source --> aggregator --> transformer --> target
Perform a count in the Aggregator, then take two output links in the Transformer: filter
count > 1 for one link and count = 1 for the second link.
Scenario:
In my input source I have N records; split them across three outputs.
source ---> transformer ---> target
In the Transformer use these conditions in the constraints:
mod(empno,3) = 1
mod(empno,3) = 2
mod(empno,3) = 0
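The same three-way mod split can be sketched in awk (seq generates sample empno values; the part_N.txt file names are illustrative):

```shell
# route each record to one of three files by empno mod 3,
# mirroring the three Transformer constraints
seq 1 9 | awk '{print > ("part_" ($1 % 3) ".txt")}'
```

After running, part_1.txt holds 1, 4, 7; part_2.txt holds 2, 5, 8; part_0.txt holds 3, 6, 9.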
Scenario:
The input is:
colA
a_b_c
x_F_I
DE_GH_IF
We have to split it into three columns.
Transformer: create 3 columns with these derivations:
col1 = Field(colA,'_',1)
col2 = Field(colA,'_',2)
col3 = Field(colA,'_',3)
The Field function divides the column based on the delimiter;
if the data in the column is like A,B,C then:
Field(col,',',1) gives A
Field(col,',',2) gives B
Field(col,',',3) gives C
From the comments: another way to find the duplicate values is to use a Sorter stage (with a
key-change column) before the Transformer.
====================================================================
Scenarios_Unix :
8) Below is a demo for "A" and "65".
ASCII value of a character: it can be done in 2 ways:
1. printf "%d" "'A"
2. echo "A" | tr -d "\n" | od -An -t dC
Character value from an ASCII code: awk -v char=65 'BEGIN { printf "%c\n", char; exit }'
———————————————————————————————————
9) Input file:
crmplp1 cmis461 No Online
cmis462 No Offline
crmplp2 cmis462 No Online
cmis463 No Offline
crmplp3 cmis463 No Online
cmis461 No Offline
Output –>crmplp1 cmis461 No Online cmis462 No Offline
crmplp2 cmis462 No Online cmis463 No Offline
Command:
awk 'NR%2?ORS=FS:ORS=RS' file
———————————————————————————————————
10) A variable can be used in awk:
awk -F"$c" -v var="$c" '{print $1 var $2}' filename
———————————————————————————————————
11) Search for a pattern and use a special character in the sed command:
sed -e '/COMAttachJob/s#")#.":JobID)#g' input_file
———————————————————————————————————
12) Get the content between two patterns:
sed -n '/CREATE TABLE table/,/MONITORING/p' table_Script.sql
———————————————————————————————————
13) Print debugging script output in a log file; add the following commands in the script:
exec 1>> logfilename
exec 2>> logfilename
———————————————————————————————————
14) Check a SQL connection:
#!/bin/sh
ID=abc
PASSWD=avd
DB=sdf
exit | sqlplus -s -l $ID/$PASSWD@$DB
echo variable:$?
exit | sqlplus -s -L avd/df@dfg > /dev/null
echo variable_crr: $?
———————————————————————————————————
15) Trim the spaces using sed command
17)
How to get a files from different servers to one server in datastage by using unix command?
scp test.ksh dsadm@10.87.130.111:/home/dsadm/sys/
============================================================================
8. How to replace the n-th line in a file with a new line in Unix?
sed -i'' '10 d' filename # d stands for delete
sed -i'' '10 i new inserted line' filename # i stands for insert
9. How to check if the last command was successful in Unix?
echo $?
11. How will you find which operating system your system is running on in UNIX?
uname -a
16. How do you rename the files in a directory with _new as suffix?
ls -lrt | grep '^-' | awk '{print "mv "$9" "$9"_new"}' | sh
17. Write a command to convert a string from lower case to upper case?
echo "apple" | tr [a-z] [A-Z]
18. Write a command to convert a string to Initcap.
echo apple | awk '{print toupper(substr($1,1,1)) tolower(substr($1,2))}'
19. Write a command to redirect the output of date command to multiple files?
The tee command writes the output to multiple files and also displays the output on the terminal.
date | tee -a file1 file2 file3
20. How do you list the hidden files in current directory?
ls -a | grep '^\.'
21. List out some of the Hot Keys available in bash shell?
Ctrl+l - Clears the Screen.
Ctrl+r - Does a search in previously given commands in shell.
Ctrl+u - Clears the typing before the hotkey.
Ctrl+a - Places cursor at the beginning of the command at shell.
Ctrl+e - Places cursor at the end of the command at shell.
Ctrl+d - Kills the shell.
Ctrl+z - Places the currently running process into background.
22. How do you make an existing file empty?
cat /dev/null > filename
23. How do you remove the first number on 10th line in file?
sed '10 s/[0-9][0-9]*//' < filename
24. What is the difference between join -v and join -a?
join -v : outputs only the unpairable (non-matched) lines.
join -a : in addition to the matched lines, outputs the unpairable lines from the specified file.
25. How do you display from the 5th character to the end of the line from a file?
cut -c 5- filename
26. Display all the files in current directory sorted by size?
ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'
Write a command to search for the file 'map' in the current directory?
find -name map -type f
How to display the first 10 characters from each line of a file?
cut -c -10 filename
Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename
How to print the file names in a directory that has the word "term"?
grep -l term *
The '-l' option make the grep command to print only the filename without printing the content of
the file. As soon as the grep command finds the pattern in a file, it prints the pattern and stops
searching other lines in the file.
How to run awk command specified in a file?
awk -f filename
How do you display the calendar for the month march in the year 1985?
The cal command can be used to display the current month calendar. You can pass the month
and year as arguments to display the required year, month combination calendar.
cal 03 1985
This will display the calendar for the March month and year 1985.
Write a command to find the total number of lines in a file?
wc -l filename
Other ways to print the total number of lines are
awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename
awk 'END{print NR}' filename
How to duplicate empty lines in a file?
sed '/^$/ p' < filename
Explain iostat, vmstat and netstat?
Iostat: reports on terminal, disk and tape I/O activity.
Vmstat: reports on virtual memory statistics for processes, disk, tape and CPU activity.
Netstat: reports on the contents of network data structures.
27. How do you write the contents of 3 files into a single file?
cat file1 file2 file3 > file
28. How to display the fields in a text file in reverse order?
awk 'BEGIN {ORS=""} { for(i=NF;i>0;i--) print $i," "; print "\n"}' filename
29. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | grep '^-'| awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'
30. Write a command to print the lines which end with the word "end"?
grep 'end$' filename
The '$' symbol specifies the grep command to search for the pattern at the end of the line.
31. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename
The '-w' option makes the grep command to search for exact whole words. If the specified
pattern is found in a string, then it is not considered as a whole word. For example: In the string
"mikejulymak", the pattern "july" is found. However "july" is not a whole word in that string.
32. How to remove the first 10 lines from a file?
sed '1,10 d' < filename
33. Write a command to duplicate each line in a file?
sed 'p' < filename
34. How to extract the username from the 'who am i' command?
who am i | cut -f1 -d' '
35. Write a command to list the files in '/usr' directory that start with 'ch' and then display the
number of lines in each file?
wc -l /usr/ch*
Another way is
find /usr -name 'ch*' -type f -exec wc -l {} \;
36. How to remove blank lines in a file ?
grep -v '^$' filename > new_filename
37. How to display the processes that were run by your user name ?
ps -aef | grep <user_name>
38. Write a command to display all the files recursively with path under current directory?
find . -depth -print
39. Display zero byte size files in the current directory?
find -size 0 -type f
40. Write a command to display the third and fifth character from each line of a file?
cut -c 3,5 filename
41. Write a command to print the fields from 10th to the end of the line. The fields in the line are
delimited by a comma?
cut -d',' -f10- filename
42. How to replace the word "Gun" with "Pen" in the first 100 lines of a file?
sed '1,100 s/Gun/Pen/' < filename
43. Write a Unix command to display the lines in a file that do not contain the word "RAM"?
grep -v RAM filename
The '-v' option tells the grep to print the lines that do not contain the specified pattern.
44 How to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'
45. Write a command to display the files in the directory by file size?
ls -l | grep '^-' |sort -nr -k 5
46. How to find out the usage of the CPU by the processes?
The top utility can be used to display the CPU usage by the processes.
47. Write a command to remove the prefix of the string ending with '/'.
The basename utility deletes any prefix ending in /. The usage is mentioned below:
basename /usr/local/bin/file
This will display only file
48. How to display zero byte size files?
ls -l | grep '^-' | awk '{ if ($5 == 0) print $9 }'
49. How to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename
50. How to remove all the occurrences of the word "jhon" except the first one in a line with in
the entire file?
sed 's/jhon//2g' < filename
51. How to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename
52. How to list the files that are accessed 5 days ago in the current directory?
find -atime 5 -type f
53. How to list the files that were modified 5 days ago in the current directory?
find -mtime 5 -type f
54. How to list the files whose status is changed 5 days ago in the current directory?
find -ctime 5 -type f
55. How to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename
56. Write a command to find the number of files in a directory.
ls -l|grep '^-'|wc -l
57. Write a command to display your name 100 times.
The Yes utility can be used to repeatedly output a line with the specified string or 'y'.
yes <your_name> | head -100
58. Write a command to display the first 10 characters from each line of a file?
cut -c -10 filename
59. The fields in each line are delimited by comma. Write a command to display the third field
from each line of a file?
cut -d',' -f3 filename
60. Write a command to print the fields from 10 to 20 from each line of a file?
cut -d',' -f10-20 filename
61. Write a command to print the first 5 fields from each line?
cut -d',' -f-5 filename
62. By default the cut command displays the entire line if there is no delimiter in it. Which cut
option is used to suppress these kinds of lines?
The -s option is used to suppress the lines that do not contain the delimiter.
63. Write a command to replace the word "bad" with "good" in file?
sed s/bad/good/ < filename
64. Write a command to replace the word "bad" with "good" globally in a file?
sed s/bad/good/g < filename
65. Write a command to replace the word "apple" with "(apple)" in a file?
sed s/apple/(&)/ < filename
66. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename
67. Write a command to display the characters from 10 to 20 from each line of a file?
cut -c 10-20 filename
68. Write a command to print the lines that have the pattern "july" in all the files in a particular
directory?
grep july *
This will print all the lines in all files that contain the word “july” along with the file name. If
any of the files contain words like "JULY" or "July", the above command would not print those
lines.
69. Write a command to print the lines that has the word "july" in all the files in a directory and
also suppress the filename in the output.
grep -h july *
70. Write a command to print the lines that has the word "july" while ignoring the case.
grep -i july *
The option i make the grep command to treat the pattern as case insensitive.
71. When you use a single file as input to the grep command to search for a pattern, it won't print
the filename in the output. Now write a grep command to print the filename in the output without
using the '-H' option.
grep pattern filename /dev/null
The /dev/null or null device is special file that discards the data written to it. So, the /dev/null is
always an empty file.
Another way to print the filename is using the '-H' option. The grep command for this is
grep -H pattern filename
72. Write a command to print the file names in a directory that does not contain the word "july"?
grep -L july *
The '-L' option makes the grep command to print the filenames that do not contain the specified
pattern.
73. Write a command to print the line numbers along with the line that has the word "july"?
grep -n july filename
The '-n' option is used to print the line numbers in a file. The line numbers start from 1
74. Write a command to print the lines that starts with the word "start"?
grep '^start' filename
The '^' symbol specifies the grep command to search for the pattern at the start of the line.
75. In the text file, some lines are delimited by colon and some are delimited by space. Write a
command to print the third field of each line.
awk '{ if( $0 ~ /:/ ) { FS=":"; } else { FS =" "; } print $3 }' filename
76. Write a command to print the line number before each line?
awk '{print NR, $0}' filename
77. Write a command to print the second and third line of a file without using NR.
awk 'BEGIN {RS="";FS="\n"} {print $2,$3}' filename
78. How to create an alias for the complex command and remove the alias?
The alias utility is used to create the alias for a command. The below command creates alias for
ps -aef command.
alias pg='ps -aef'
If you use pg, it will work the same way as ps -aef.
To remove the alias simply use the unalias command as
unalias pg
79. Write a command to display today's date in the format 'yyyy-mm-dd'?
The date command can be used to display today's date with time:
date '+%Y-%m-%d'
------------------------------------------------------------------------------------------------------
REF_PERIOD
PERIOD_NAME
ACCOUNT_VALUE
CDR_CODE
PRODUCT
PROJECT
SEGMENT_CODE
PARTNER
ORIGIN
BILLING_ACCRUAL
Output:
REF_PERIOD PERIOD_NAME ACCOUNT_VALUE CDR_CODE PRODUCT PROJECT
SEGMENT_CODE PARTNER ORIGIN BILLING_ACCRUAL
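Such a column-to-row transpose is a one-liner with paste (a three-field sample is inlined here instead of the full column list):

```shell
# -s serializes all input lines into one line, -d' ' joins them with spaces
printf 'REF_PERIOD\nPERIOD_NAME\nACCOUNT_VALUE\n' | paste -s -d' ' -
```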
(Note on awk: the ~ operator compares a value against a regular expression; if it matches, the
default action, i.e. printing the whole line, is performed.)
For example: the input file contains a single column with 84 rows; the output should be that data
converted to 12 columns (12 columns * 7 rows) with field separator ';'. A reconstruction of the
garbled script:
#!/bin/sh
cols=12
fs=';'
awk -v c="$cols" -v fs="$fs" '{printf("%s%s", $0, NR % c ? fs : "\n")}' input_file
output:
201-2011.csv
=====================================================================
Sql queries:
1. Query to display the middle records, dropping the first 5 and last 5 records in the emp table:
select * from emp where rownum <= (select count(*) - 5 from emp)
minus
select * from emp where rownum <= 5;
SELECT SUBPRODUCT_UID
,SUBPRODUCT_PROVIDER_UID
,SUBPRODUCT_TYPE_UID
,DESCRIPTION
,EXTERNAL_ID
,OPTION_ID
,NEGOTIABLE_OFFER_IND
,UPDATED_BY
,UPDATED_ON
,CREATED_ON
,CREATED_BY FROM schemaname.SUBPRODUCT
=====================================================================
What are the key differences in snowflake and star schema? Where should they be
applied?
The Star schema vs Snowflake schema comparison brings forth four fundamental differences to
the fore:
1. Data optimization:
The snowflake model uses normalized data, i.e. the data is organized inside the database in order
to eliminate redundancy, which helps to reduce the amount of data. The hierarchy of the business
and its dimensions is preserved in the data model through referential integrity.
The star schema provides fast responses to queries and forms the ideal source for cube structures.
4. ETL:
The snowflake model loads the data marts, and hence the ETL job is more complex in design and
cannot be parallelized, as the dependency model restricts it.
The star model loads the dimension tables without dependencies between dimensions, and hence
the ETL job is simpler and can achieve higher parallelism.
This brings us to the end of the star schema vs snowflake schema debate. But where exactly do
these approaches make sense?
g) Column mismatch
Solution: resolved by changing the extended column (under the metadata of the Transformer) to
Unicode.
Syntax error: Error in "group" operator: Error in output redirection: Error in output
parameters: Error in modify adapter: Error in binding: Could not find type: "subrec", line 35
Solution: this is an issue with the level number of the columns being added in the
Transformer. Their level number was blank, while the columns taken from the CFF
file had it as 02. Adding the level number made the job work.
Out_Trailer: When checking operator: When binding output schema variable "outRec":
When binding output interface field "STDCA_TRLR_REC_CNT" to field
"STDCA_TRLR_REC_CNT": Implicit conversion from source type "dfloat" to result
type "decimal[10,0]": Possible range/precision limitation.
CE_Trailer: When checking operator: When binding output interface field "Data" to field
"Data": Implicit conversion from source type "string" to result type "string[max=500]":
Possible truncation of variable length string.
Implicit conversion from source type "dfloat" to result type "decimal[10,0]": Possible
range/precision limitation.
Problem(Abstract)
Jobs that process a large amount of data in a column can abort with this error:
the record is too big to fit in a block; the length requested is: xxxx, the max block length
is: xxxx.
Resolving the problem
To fix this error you need to increase the block size to accommodate the record size:
1. Log into Designer and open the job.
2. Open the job properties--> parameters-->add environment variable and select:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
3. You can set this up to 256MB but you really shouldn't need to go over 1MB.
NOTE: value is in KB
. While connecting via "Remote Desktop": the terminal server has exceeded the maximum
number of allowed connections.
SOL: In a command prompt, type: mstsc /v:<ip address of server> /admin
http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.m
essages.sql.doc%2Fdoc%2Fmsql20521n.html
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log
ADMU1211I: To obtain a full trace of the failure, use the -trace option.
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log
SOL: Wasadmin and XMeta passwords needs to be reset and commands are below..
-password Wasadmin0708
SOL: Most of the time "The specified field: XXXXXX does not exist in the view
adapted schema" occurs when we miss a field to map. Every stage has an output
tab if used in the middle of the job. Make sure you have mapped every single field
required for the next stage.
Sometimes even after mapping the fields this error can occur; one of the reasons
could be that the view adapter has not linked the input and output fields. In this
case the required field mapping should be dropped and recreated.
Just to give an insight: the view adapter is an operator responsible for
mapping the input and output fields. DataStage creates an instance of
APT_ViewAdapter which translates the components of the operator input interface
schema to matching components of the interface schema. So if the interface schema does not
have the same columns as the operator input interface schema, this error will be
reported.
1)When we use same partitioning in datastage transformer stage we get the following
warning in 7.5.2 version.
This is known issue and you can safely demote that warning into informational by adding
this warning to Project specific message handler.
2) Warning: A sequential operator cannot preserve the partitioning of input data set on
input port 0
Resolution: Clear the preserve partition flag before Sequential file stages.
3) DataStage parallel job fails with: fork() failed, Resource temporarily unavailable.
On AIX, check the maxuproc setting and increase it if you plan
to run multiple jobs at the same time.
Resolution: use the Modify stage explicitly convert the data type before sending to
aggregator stage.
5) Warning: A user defined sort operator does not satisfy the requirements.
Resolution: check the order of the sorting columns and make sure to use the same order when
using a Join stage after a Sort to join two inputs.
Resolution: check for the correct date format or decimal format, and also for null values in the
date or decimal fields, before passing them to the DataStage StringToDate,
DateToString, DecimalToString or StringToDecimal functions.
Resolution: sort the data before sending it to the Join stage, and check the order of the sorting
keys and join keys to make sure both are in the same order.
Resolution: if you are using Join, Diff, Merge or Comp stages, make sure both links have
different column names other than the key columns.
Resolution: if you are reading from an Oracle database, or in any processing stage where an
incoming column is defined as nullable, and you define the metadata in DataStage as non-
nullable, then you will get the above issue. If you want to convert a nullable field to
non-nullable, make sure you apply the available null functions in DataStage or in the extract
query.
SOL: SyncProject cmd that is installed with DataStage 8.5 can be run to analyze and
recover projects
5. Failed to authenticate the current user against the selected Domain: Could not connect
to server.
Server is down
SOL: Update the host file on client system so that the server hostname can be resolved
from client.
Make sure the WebSphere TCP/IP ports are opened by the firewall.
Make sure the WebSphere application server is running. (OR)
6. The connection was refused or the RPC daemon is not running (81016)
RC: The dsrpcd process must be running in order to be able to log in to DataStage.
If you restart DataStage but the socket used by the dsrpcd (default is 31538) was busy,
the dsrpcd will fail to start. The socket may be held by dsapi_slave processes that were
still running or recently killed when DataStage was restarted.
SOL: Run "ps -ef | grep dsrpcd" to confirm the dsrpcd process is not running.
Run "ps -ef | grep dsapi_slave" to check if any dsapi_slave processes exist. If so, kill
them.
Run "netstat -a | grep dsrpc" to see if any processes have sockets that are
ESTABLISHED, FIN_WAIT, or CLOSE_WAIT. These will prevent the dsrpcd from
starting. The sockets with status FIN_WAIT or CLOSE_WAIT will eventually time out
and disappear, allowing you to restart DataStage.
Then Restart DSEngine. (if above doesn’t work) Needs to reboot the system.
b) In director client, Project tab à Print à select print to file option and save it in local
directory.
8. “Run time error ’457′. This Key is already associated with an element of this
collection.”
c) Click on Command
d) Issue the command ds.tools
Kill -9 process_id
SOL: cd /opt/ibm/InformationServer/server/DSEngine
(Without local entry, Job monitor will be unable to use the ports correctly)
12. SQL0752N. Connect to a database is not permitted within logical unit of work
CONNECT type 1 settings is in use.
SOL:
cd /opt/ibm/InformationServer/Server/DSEngine/bin
Press ALT+SPACE
Now, double click again and try whether properties window appears.
Go to /opt/ibm/InformationServer/server/Projects/proj_name/
ls RT_SCT* then
rm -f RT_SCTEMP
10. While attempting to compile job, “failed to invoke GenRunTime using Phantom
process helper”
[ODBC SOURCES]
<local uv>
DBMSTYPE = UNIVERSE
Network = TCP/IP
Service = uvserver
Host = 127.0.0.1
ps -ef | grep slave
Ask the application team to close the active or stale sessions running from application’s
user.
If they have closed the sessions, but sessions are still there, then kill those sessions.
Check the output of the below command before stopping DataStage services.
If any processes are ESTABLISHED, check that no job, stale, active, or osh sessions are
running.
If any processes are in CLOSE_WAIT, then wait for some time; those processes will time out and disappear.
cd $DSHOME
. ./dsenv
cd $DSHOME/bin
Wait for 10 to 15 min for shared memory to be released by process holding them.
If it asks for the dsadm password while firing the command, then enable
impersonation through the root user:
${DSHOME}/scripts/DSEnable_impersonation.sh
Equ DSJS.RUNNING To 0 This is the only status that means the job is actually running
Equ DSJS.RUNOK To 1 Job finished a normal run with no warnings
Equ DSJS.RUNWARN To 2 Job finished a normal run with warnings
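When polling job status from a script (for example after calling dsjob), these codes can be decoded with a small helper. This is an illustrative sketch, not a DataStage API; it covers only the three codes listed above:

```shell
# Map a DSJS status code to its name (codes 0-2 taken from the list above).
ds_status_name() {
  case "$1" in
    0) echo "RUNNING" ;;   # DSJS.RUNNING - job is actually running
    1) echo "RUNOK"   ;;   # DSJS.RUNOK   - normal run, no warnings
    2) echo "RUNWARN" ;;   # DSJS.RUNWARN - normal run, with warnings
    *) echo "UNKNOWN" ;;
  esac
}
ds_status_name 1   # prints RUNOK
```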
This warning is seen when multiple records with the same key column value are
present in the reference table from which the lookup is done. By default, Lookup
fetches the first record it finds as a match and throws the warning, since it
doesn’t know which value is the correct one to return from the reference.
To solve this problem you can either select one of the reference links from the “Multiple
rows returned from link” dropdown in the Lookup constraints (in this case Lookup
will return multiple rows for each row that is matched), or use some method to
eradicate duplicate rows with the same key columns, according to the business
requirements.
^M is the DOS line break character, which shows up in UNIX files when they are uploaded from a Windows
file system in ASCII format.
:%s/(ctrl-v)(ctrl-m)//g
Important!! Press the (Ctrl-v)(Ctrl-m) combination to enter the ^M character; don’t type the literal “^” and “M”.
Also,
Your substitution command may catch more ^M characters than necessary; your file may contain a valid ^M in the
middle of a line of code, for example. Use the following command instead to remove only those at the very
end of lines:
:%s/(ctrl-v)(ctrl-m)*$//g
Using sed:
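A minimal sketch of the same cleanup with sed (the filenames are examples only; the \r escape assumes GNU sed — on older seds you may need to embed a literal ^M with Ctrl-v Ctrl-m as in the vi command above):

```shell
# Create a sample DOS-format file (CRLF line endings), then strip the
# trailing ^M from every line with sed.
printf 'line one\r\nline two\r\n' > dos_file.txt
sed 's/\r$//' dos_file.txt > unix_file.txt
```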
We are going to read the above data from a sequential file and transform it to look like this
xy FGH Sam
xy FGH Dean
xy FGH Winchester
In the adjacent image you can see a new box called Loop Condition. This is where we are going to
control the loop variables.
The Loop While constraint is used to implement functionality similar to a “WHILE” statement in
programming. So, just like a while statement, it needs a condition to identify how many times
the loop is supposed to be executed.
To achieve this @ITERATION system variable was introduced. In our example we need to loop the
data 3 times to get the column data onto subsequent rows.
Now all we have to do is map this Loop variable LoopName to our output column Name
Let’s map the output to a Sequential File stage and see if the output is as desired.
After running the job, we did a view data on the output stage and here is the data as desired.
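Outside DataStage, the same row-to-rows expansion can be sketched with awk (an illustrative stand-in for the transformer loop; the field layout is assumed from the sample above):

```shell
# One input record carries three names; the awk loop emits one output row
# per name, mimicking a transformer Loop Condition of @ITERATION <= 3.
echo 'xy FGH Sam Dean Winchester' |
awk '{ for (i = 3; i <= NF; i++) print $1, $2, $i }'
# Emits: xy FGH Sam / xy FGH Dean / xy FGH Winchester
```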
Making some tweaks to the above design, we can implement more complex logic, such as the aggregation described next.
Input: Below is the sample data of three students, their marks in two subjects, the
corresponding grades and the dates on which they were graded.
Output: Our requirement is to sum the marks obtained by each student in a subject and display it in
the output.
Step 1: Once we have read the data from the source, we have to sort it on our key field. In our
example the key field is the student name.
Once the data is sorted, we have to implement the looping function in the transformer to calculate the
aggregate value.
o SaveInputRecord(): This function saves the entire record in cache and returns the number of
records that are currently stored in cache
o LastRowInGroup(input-column): When an input key column is passed to this function, it will return
1 when the last row for that column is found; in all other cases it will return 0
To give an example, lets say our input is
Student Code
ABC 1
ABC 2
ABC 3
DEF 2
o For the first two records the function will return 0, but for the last record (ABC,3) it will return 1,
indicating that it is the last record for the group where the student name is “ABC”
o GetSavedInputRecord(): This function returns the record that was stored in cache by the function
SaveInputRecord()
Back to the task at hand, we need 7 stage variables to perform the aggregation operation
successfully.
1. LoopNumber: Holds the value of number of records stored in cache for a student
2. LoopBreak: This is to identify the last record for a particular student
3. SumSub1: This variable will hold the final sum of marks for each student in subject 1
4. IntermediateSumSub1: This variable will hold the sum of marks until the final record is evaluated
for a student (subject 1)
5. SumSub2: Similar to SumSub1 (for subject 2)
6. IntermediateSumSub2: Similar to IntermediateSumSub1 (for subject 2)
7. LoopBreakNum: Holds the value for the number of times the loop has to run
Below is the screenshot of the stage variables
We also need to define the Loop Variables so that the loop will execute for a student until his final
record is identified
When the first record comes to the stage variables, it is saved in the cache using the function
SaveInputRecord() in the first stage variable, LoopNumber.
The second stage variable checks if it is the last record for this particular student; if it is, it stores 1,
else 0.
The third, SumSub1, is executed only if the record is the last record.
The fourth, IntermediateSumSub1, is executed when the input record is not the last record, thereby
storing the intermediate sum of the subject for a student.
The seventh, LoopBreakNum, will have 1 as its first value; if the second record belongs to the same
student, it will change to 2, and so on.
The loop variable will be executed until the final record for a student is identified and the
GetSavedInputRecord() function will make sure the current record is processed before the next record
is brought for processing.
What the above logic does is for each and every record it will send the sum of marks scored by each
student to the output. But our requirement is to have only one record per student in the output.
So we simply add a remove duplicates stage and add the student name as a primary key
Run the job and the output will be according to our initial expectation
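The same grouped-sum result can be sketched outside DataStage with awk (illustrative only; the comma-separated layout name,subject,marks is an assumed stand-in for the sample input):

```shell
# Sum marks per (student, subject) group, emitting one row per group --
# the same result the looping transformer plus Remove Duplicates produces.
printf '%s\n' 'Sam,Maths,50' 'Sam,Maths,30' 'Dean,Maths,70' |
awk -F, '{ sum[$1 FS $2] += $3 } END { for (k in sum) print k FS sum[k] }'
# One output row per student/subject, e.g. "Sam,Maths,80"
```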
In a star schema, you would collapse those into a single "store" dimension. In a snowflake, you would
keep them apart with the store connecting to the fact.
Second Answer: First of all, some definitions are in order. In a star schema, dimensions
that reflect a hierarchy are flattened into a single table. For example, a star schema
Geography Dimension would have columns like country, state/province, city and
postal code. In the source system, this hierarchy would probably be normalized into
multiple tables with one-to-many relationships.
A snowflake schema does not flatten a hierarchy dimension into a single table. It would,
instead, have two or more tables with a one-to-many relationship. This is a more
normalized structure. For example, one table may have state/province and country columns
and a second table would have city and postal code. The table with city and postal code
would have a many-to-one relationship to the table with the state/province columns.
There are some good reasons for snowflake dimension tables. One example is a company
that has many types of products. Some products have a few attributes, others have
many. The products are very different from each other. The thing to do here is to create a
core Product dimension that has common attributes for all the products such as product
type, manufacturer, brand, product group, etc. Create a separate sub-dimension table for
each distinct group of products where each group shares common attributes. The sub-
product tables must contain a foreign key of the core Product dimension table.
One of the criticisms of using snowflake dimensions is that it is difficult for some of the
multidimensional front-end presentation tools to generate a query on a snowflake
dimension. However, you can create a view for each combination of the core product/sub-
product dimension tables and give the view a suitably descriptive name (Frozen Food
Product, Hardware Product, etc.) and then these tools will have no problem.
25 Check the write cache of the hash file. If the same hash file is used for lookup as well
as for the target, disable this option.
26 If the hash file is used only for lookup, then enable “Preload to memory”; this will
improve performance. Also check the order of execution of the routines.
27 Don't use more than 7 lookups in the same transformer; introduce new transformers if it
exceeds 7 lookups.
28 Use Preload to memory option in the hash file output.
29 Use Write to cache in the hash file input.
30 Write into the error tables only after all the transformer stages.
31 Reduce the width of the input record - remove the columns that you would not use.
32 Cache the hash files you are reading from and writing into. Make sure your cache is big
enough to hold the hash files.
33 Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
34 Ideally, if the amount of data to be processed is small, configuration files with a smaller
number of nodes should be used, while if the data volume is larger, configuration files with a
larger number of nodes should be used.
35 Partitioning should be set in such a way so as to have balanced data flow i.e. nearly
equal partitioning of data should occur and data skew should be minimized.
36 In DataStage Jobs where high volume of data is processed, virtual memory settings for
the job should be optimized. Jobs often abort in cases where a single lookup has
multiple reference links. This happens due to low temp memory space. In such jobs
$APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should
be set to sufficiently large values.
37 Sequential files should be used in the following conditions: when we are reading a flat file
(fixed width or delimited) from a UNIX environment which is FTPed from some external
system, or when some UNIX operation has to be done on the file.
38 Don’t use sequential files for intermediate storage between jobs. This causes
performance overhead, as data conversion is needed before writing to and reading from
a UNIX file.
39 In order to have faster reading from the Stage the number of readers per node can be
increased (default value is one).
40 Usage of Dataset results in a good performance in a set of linked jobs. They help in
achieving end-to-end parallelism by writing data in partitioned form and maintaining the
sort order.
41 The Lookup Stage is faster when the data volume is small. If the reference data volume is
large, usage of the Lookup Stage should be avoided, as all reference data is pulled into local
memory.
42 Sparse lookup type should be chosen only if primary input data volume is small.
43 Join should be used when the data volume is high. It is a good alternative to the lookup
stage and should be used when handling huge volumes of data.
44 Even though data can be sorted on a link, the Sort Stage should be used when the data to
be sorted is huge. When we sort data on a link (sort/unique option), once the data size is
beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore,
if the volume of data is large, an explicit Sort Stage should be used instead of a sort on the
link. The Sort Stage gives an option to increase the buffer memory used for sorting; this
means lower I/O and better performance.
45 It is also advisable to reduce the number of transformers in a Job by combining the logic
into a single transformer rather than having multiple transformers.
46 The presence of a Funnel Stage reduces the performance of a job. It can increase the time
taken by a job by around 30% (based on observations). When a Funnel Stage is to be used in
a large job, it is better to isolate it into its own job: write the outputs to Datasets and funnel
them in a new job.
47 Funnel Stage should be run in “continuous” mode, without hindrance.
48 A single job should not be overloaded with Stages. Each extra Stage put in a Job
corresponds to lesser number of resources available for every Stage, which directly
affects the Jobs Performance. If possible, big jobs having large number of Stages should
be logically split into smaller units.
49 Unnecessary column propagation should not be done. As far as possible, RCP (Runtime
Column Propagation) should be disabled in the jobs
50 A most often neglected option is “don’t sort if previously sorted” in the Sort Stage; set this
option to “true”. This improves Sort Stage performance a great deal.
51 In Transformer Stage “Preserve Sort Order” can be used to maintain sort order of the
data and reduce sorting in the job
52 Reduce the number of Stage variables used.
53 The Copy stage should be used instead of a Transformer for simple operations
54 The “upsert” works well if the data is sorted on the primary key column of the table
which is being loaded.
55 Don’t read from a Sequential File using SAME partitioning
56 By using the Hashed File stage we can improve performance. In the Hashed File stage
we can define the read cache size and write cache size; the default size is 128 MB.
57 We can also improve performance on active-to-active links by enabling the row
buffer; the default row buffer size is 128 KB.
===================================================================================
If our requirement is to filter the data department wise from the file below
samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,production,40
In the Transformer Stage, just drag and drop the data to the target tables and click OK.
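As a quick illustration outside DataStage, the department-wise filter can be sketched with awk (the filename samp_tabl and column layout come from the sample above):

```shell
# Recreate part of the sample file, then keep only department 10 -- the
# same constraint a transformer output link (or Filter stage) would apply.
printf '%s\n' '1,sam,clerck,10' '2,tom,developer,20' \
  '3,jim,clerck,10' '4,don,tester,30' > samp_tabl
awk -F, '$4 == 10' samp_tabl
# Prints the rows whose fourth field (department) is 10
```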
================================================================================