
All Datastage Stages

Datastage parallel stages groups


DataStage and QualityStage stages are grouped into the following logical sections:

 General objects
 Data Quality Stages
 Database connectors
 Development and Debug stages
 File stages
 Processing stages
 Real Time stages
 Restructure Stages
 Sequence activities

Please refer to the list below for a description of the stages used in DataStage and QualityStage.
We classified all stages in order of importance and frequency of use in real-life deployments
(and also on certification exams). The most widely used stages are marked in bold, or a
link to a subpage with a detailed description and examples is available.

DataStage and QualityStage parallel stages and activities

General elements

 Link indicates a flow of the data. There are three main types of links in Datastage: stream,
reference and lookup.

 Container (can be private or shared) - the main purpose of containers is to visually
simplify a complex Datastage job design and keep the design easy to understand.
 Annotation is used for adding floating datastage job notes and descriptions on a job
canvas. Annotations provide a great way to document the ETL process and help
understand what a given job does.
 Description Annotation shows the contents of a job description field. One description
annotation is allowed in a datastage job.

Debug and development stages

 Row generator produces a set of test data which fits the specified metadata (can be
random or cycled through a specified list of values). Useful for testing and development.
 Column generator adds one or more column to the incoming flow and generates test
data for this column.
 Peek stage prints record column values to the job log which can be viewed in Director. It
can have a single input link and multiple output links.
 Sample stage samples an input data set. Operates in two modes: percent mode and period
mode.
 Head selects the first N rows from each partition of an input data set and copies them to
an output data set.
 Tail is similar to the Head stage. It selects the last N rows from each partition.
 Write Range Map writes a data set in a form usable by the range partitioning method.
Processing stages

 Aggregator joins data vertically by grouping incoming data stream and calculating
summaries (sum, count, min, max, variance, etc.) for each group. The data can be
grouped using two methods: hash table or pre-sort.
 Copy - copies input data (a single stream) to one or more output data flows
 FTP stage uses FTP protocol to transfer data to a remote machine
 Filter filters out records that do not meet specified requirements.
 Funnel combines multiple streams into one.
 Join combines two or more inputs according to values of a key column(s). Similar
concept to a relational DBMS SQL join (ability to perform inner, left, right and full outer
joins). Can have 1 left and multiple right inputs (all need to be sorted) and produces a
single output stream (no reject link).
 Lookup combines two or more inputs according to values of a key column(s). The Lookup
stage can have 1 source and multiple lookup tables. Records don't need to be sorted; it
produces a single output stream and a reject link.
 Merge combines one master input with multiple update inputs according to values of a
key column(s). All inputs need to be sorted, and unmatched secondary entries can be
captured in multiple reject links.
 Modify stage alters the record schema of its input dataset. Useful for renaming columns,
non-default data type conversions and null handling
 Remove duplicates stage needs a single sorted data set as input. It removes all duplicate
records according to a specification and writes to a single output
 Slowly Changing Dimension automates the process of updating dimension tables, where
the data changes over time. It supports SCD type 1 and SCD type 2.
 Sort sorts input rows by specified key columns.
 Transformer stage handles extracted data, performs data validation, conversions and
lookups.
 Change Capture - captures before and after state of two input data sets and outputs a
single data set whose records represent the changes made.
 Change Apply - applies the change operations to a before data set to compute an after
data set. It gets data from a Change Capture stage
 Difference stage performs a record-by-record comparison of two input data sets and
outputs a single data set whose records represent the difference between them. Similar to the
Change Capture stage.
 Checksum - generates checksum from the specified columns in a row and adds it to the
stream. Used to determine if there are differences between records.
 Compare performs a column-by-column comparison of records in two presorted input
data sets. It can have two input links and one output link.
 Encode encodes data with an encoding command, such as gzip.
 Decode decodes a data set previously encoded with the Encode Stage.
 External Filter permits specifying an operating system command that acts as a filter on
the processed data
 Generic stage allows users to call an OSH operator from within DataStage stage with
options as required.
 Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row
to a single column in multiple output rows. Pivoting data results in a dataset with fewer
columns but more rows.
 Surrogate Key Generator generates surrogate key for a column and manages the key
source.
 Switch stage assigns each input row to an output link based on the value of a selector
field. Provides a similar concept to the switch statement in most programming
languages.
 Compress - packs a data set using a GZIP utility (or compress command on
LINUX/UNIX)
 Expand extracts a previously compressed data set back into raw binary data.
File stage types

 Sequential file is used to read data from or write data to one or more flat (sequential)
files.
 Data Set stage allows users to read data from or write data to a dataset. Datasets are
operating system files, each of which has a control file (.ds extension by default) and one
or more data files (unreadable by other applications).
 File Set stage allows users to read data from or write data to a fileset. Filesets are
operating system files, each of which has a control file (.fs extension) and data files.
Unlike datasets, filesets preserve formatting and are readable by other applications.
 Complex flat file allows reading from complex file structures on a mainframe machine,
such as MVS data sets, header and trailer structured files, files that contain multiple
record types, QSAM and VSAM files.
 External Source - permits reading data that is output from multiple source programs.
 External Target - permits writing data to one or more programs.
 Lookup File Set is similar to the FileSet stage. It is a partitioned hashed file which can be
used for lookups.

Database stages
 Oracle Enterprise allows reading data from and writing data to an Oracle database
(database versions from 9.x to 10g are supported).
 ODBC Enterprise permits reading data from and writing data to a database defined as an
ODBC source. In most cases it is used for processing data from or to Microsoft Access
databases and Microsoft Excel spreadsheets.
 DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.
 Teradata permits reading data from and writing data to a Teradata data warehouse.
Three Teradata stages are available: Teradata connector, Teradata Enterprise and
Teradata Multiload
 SQLServer Enterprise permits reading data from and writing data to Microsoft SQL
Server 2005 and 2008 databases.
 Sybase permits reading data from and writing data to Sybase databases.
 Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL
Server. The Stored Procedure stage can be used as a source (returns a rowset), as a target
(pass a row to a stored procedure to write) or a transform (to invoke procedure processing
within the database).
 MS OLEDB helps retrieve information from any type of information repository, such as a
relational source, an ISAM file, a personal database, or a spreadsheet.
 Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or
writing to a number of different supported relational DB engines using native interfaces,
such as Oracle, Microsoft SQL Server, DB2, Informix and Sybase.
 Informix (CLI or Load)
 DB2 UDB (API or Load)
 Classic federation
 RedBrick Load
 Netezza Enterprise
 iWay Enterprise

Real Time stages

 XML Input stage makes it possible to transform hierarchical XML data to flat relational
data sets
 XML Output writes tabular data (relational tables, sequential files or any datastage data
streams) to XML structures
 XML Transformer converts XML documents using an XSLT stylesheet
 Websphere MQ stages provide a collection of connectivity options to access IBM
WebSphere MQ enterprise messaging systems. There are two MQ stage types available
in DataStage and QualityStage: WebSphere MQ connector and WebSphere MQ plug-in
stage.
 Web services client
 Web services transformer
 Java client stage can be used as a source stage, as a target and as a lookup. The java
package consists of three public classes: com.ascentialsoftware.jds.Column,
com.ascentialsoftware.jds.Row, com.ascentialsoftware.jds.Stage
 Java transformer stage supports three links: input, output and reject.
 WISD Input - Information Services Input stage
 WISD Output - Information Services Output stage

Restructure stages

 Column export stage exports data from a number of columns of different data types into a
single column of data type ustring, string, or binary. It can have one input link, one output
link and a reject link.
 Column Import is complementary to the Column Export stage. Typically used to divide data
arriving in a single column into multiple columns.
 Combine records stage combines rows which have identical keys, into vectors of
subrecords.
 Make subrecord combines specified input vectors into a vector of subrecords whose
columns have the same names and data types as the original vectors.
 Make vector joins specified input columns into a vector of columns
 Promote subrecord - promotes input subrecord columns to top-level columns
 Split subrecord - separates an input subrecord field into a set of top-level vector columns
 Split vector promotes the elements of a fixed-length vector to a set of top-level columns
Data quality QualityStage stages

 Investigate stage analyzes data content of specified columns of each record from the
source file. Provides character and word investigation methods.
 Match frequency stage takes input from a file, database or processing stages and
generates a frequency distribution report.
 MNS - multinational address standardization.
 QualityStage Legacy
 Reference Match
 Standardize
 Survive
 Unduplicate Match
 WAVES - worldwide address verification and enhancement system.
Sequence activity stage types

 Job Activity specifies a Datastage server or parallel job to execute.


 Notification Activity - used for sending emails to user defined recipients from within
Datastage
 Sequencer used for synchronization of a control flow of multiple activities in a job
sequence.
 Terminator Activity permits shutting down the whole sequence once a certain situation
occurs.
 Wait for file Activity - waits for a specific file to appear or disappear and launches the
processing.
 EndLoop Activity
 Exception Handler
 Execute Command
 Nested Condition
 Routine Activity
 StartLoop Activity
 UserVariables Activity

=====================================================================

Configuration file:

The Datastage configuration file is a master control file (a text file which sits on the
server side) for jobs; it describes the parallel system resources and architecture. The
configuration file provides the hardware configuration for supporting such architectures
as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster or
MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage understands the
architecture of the system through this file.
This is one of the biggest strengths of Datastage. If you change your processing
configuration, or change servers or platform, you never have to worry about it affecting
your jobs, since all jobs depend on this configuration file for execution. Datastage
jobs determine which node to run a process on, where to store temporary data and where to store
dataset data based on the entries provided in the configuration file. There is a default
configuration file available whenever the server is installed.
The configuration files have the extension ".apt". The main benefit of having the configuration
file is the separation of software and hardware configuration from job design. It allows changing
hardware and software resources without changing the job design. Datastage jobs can point to
different configuration files by using job parameters, which means that a job can utilize different
hardware architectures without being recompiled.
The configuration file lists the processing nodes, which are logical processing nodes, and also
specifies the disk space provided for each processing node. So if you have more than one CPU,
this does not mean the nodes in your configuration file correspond to these CPUs; it is possible to
have more than one logical node on a single physical node. However, you should be wise in
configuring the number of logical nodes on a single physical node. Increasing the number of nodes
increases the degree of parallelism, but it does not necessarily mean better performance, because it
also results in a larger number of processes. If your underlying system does not have the capability
to handle these loads, you will end up with a very inefficient configuration on your hands.

1. APT_CONFIG_FILE is the environment variable DataStage uses to determine which configuration
file to use (one can have many configuration files for a project). In fact, this is what is generally used
in production. However, if this environment variable is not defined, how does DataStage determine
which file to use?
If the APT_CONFIG_FILE environment variable is not defined then DataStage looks for the default
configuration file (config.apt) in the following locations:
1. The current working directory.
2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of the
DataStage installation.

2. Define Node in configuration file


A node is a logical processing unit. Each node in a configuration file is distinguished by a virtual
name and defines the number and speed of CPUs, memory availability, page and swap space,
network connectivity details, etc.

3. What are the different options a logical node can have in the configuration file?
1. fastname – The fastname is the physical node name that stages use to open connections for high
volume data transfers. The attribute of this option is often the network name. Typically, you can
get this name by using the Unix command 'uname -n'.
2. pools – Names of the pools to which the node is assigned. Based on the characteristics of the
processing nodes you can group nodes into sets of pools.
1. A pool can be associated with many nodes and a node can be part of many pools.
2. A node belongs to the default pool unless you explicitly specify a pools list for it and omit the
default pool name ("") from the list.
3. A parallel job or a specific stage in the parallel job can be constrained to run on a pool (a set of
processing nodes).
1. If both the job and a stage within the job are constrained to run on specific processing nodes,
the stage will run on the nodes that are common to both the stage and the job.
3. resource – resource resource_type "location" [{pools "disk_pool_name"}] | resource
resource_type "value". resource_type can be canonicalhostname (the quoted Ethernet name of a
node in a cluster that is not connected to the conductor node by the high-speed network), disk
(a directory for reading/writing persistent data), scratchdisk (the quoted absolute path name of a
directory on a file system where intermediate data will be temporarily stored; it is local to the
processing node), or RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE, etc.).
4. How does DataStage decide which processing node a stage should run on?
1. If a job or stage is not constrained to run on specific nodes, the parallel engine executes a
parallel stage on all nodes defined in the default node pool (default behavior).
2. If the node is constrained, then the constrained processing nodes are chosen when executing the
parallel stage.

In Datastage, the degree of parallelism, the resources being used, etc. are all determined
at run time, based entirely on the configuration provided in the APT configuration file.
There is a default configuration file available whenever the server is installed; you can typically
find it under the <>\IBM\InformationServer\Server\Configurations folder with the name default.apt.
Bear in mind that you will have to optimise these configurations for your server based on your
resources.


Now let's try our hand at interpreting a configuration file, using the sample below.

{
node "node1"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node2"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node3"
{
fastname "SVR2"
pools "" "sort"
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
}

This is a 3-node configuration file. Let's go through the basic entries and what they represent.

Fastname – This refers to the node name on a fast network. From this we can infer that nodes
node1 and node2 are on the same physical node. However, if we look at node3 we can see that it is
on a different physical node (identified by SVR2). So basically in node1 and node2 all the
resources are shared; the disk and scratch disk specified are actually shared between
those two logical nodes. Node3, on the other hand, has its own disk and scratch disk space.

Pools – Pools allow us to associate different processing nodes based on their functions and
characteristics. If you see an entry like "node0" or other reserved node pools like
"sort", "db2", etc., it means that the node is part of the specified pool. A node is by
default associated with the default pool, which is indicated by "". Now if you look at node3 you
can see that this node is also associated with the sort pool. This ensures that the Sort stage will
run only on nodes that are part of the sort pool.

Resource disk – This specifies the location on your server where the processing node
will write all the data set files. As you might know, when Datastage creates a dataset, the file you
see does not contain the actual data; the dataset file points to the place where the actual
data is stored, and that location is what is specified on this line.
Resource scratchdisk – The location of temporary files created during Datastage processes, such as
lookups and sorts, is specified here. If the node is part of the sort pool then the scratch disk
can also be made part of the sort scratch disk pool, which ensures that the temporary files created
during sorting are stored only in this location. If such a pool is not specified, then Datastage
determines whether there are any scratch disk resources that belong to the default scratch disk pool
on the nodes that the sort is specified to run on, and if so, that space is used.

SAMPLE CONFIGURATION FILES

Configuration file for a simple SMP

A basic configuration file for a single-machine, two-node server (2 CPUs) is shown below. The
file defines 2 nodes (node1 and node2) on a single dev server (an IP address might be provided
instead of a hostname) with 3 disk resources (d1 and d2 for data, and Scratch as scratch
space).

The configuration file is shown below:

node "node1"
{ fastname "dev"
pool ""
resource disk "/IIS/Config/d1" { }
resource disk "/IIS/Config/d2" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}

node "node2"
{
fastname "dev"
pool ""
resource disk "/IIS/Config/d1" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}

Configuration file for a cluster / MPP / grid

A sample configuration file for a cluster or grid computing environment on 4 machines is shown
below. The configuration defines 4 nodes (node[1-4]), node pools (n[1-4] and s[1-4]), resource
pools bigdata and sort, and temporary space.
node "node1"
{
fastname "dev1"
pool "" "n1" "s1" "sort"
resource disk "/IIS/Config1/d1" {}
resource disk "/IIS/Config1/d2" {"bigdata"}
resource scratchdisk "/IIS/Config1/Scratch" {"sort"}
}

node "node2"
{
fastname "dev2"
pool "" "n2" "s2"
resource disk "/IIS/Config2/d1" {}
resource disk "/IIS/Config2/d2" {"bigdata"}
resource scratchdisk "/IIS/Config2/Scratch" {}
}

node "node3"
{
fastname "dev3"
pool "" "n3" "s3"
resource disk "/IIS/Config3/d1" {}
resource scratchdisk "/IIS/Config3/Scratch" {}
}

node "node4"
{
fastname "dev4"
pool "n4" "s4"
resource disk "/IIS/Config4/d1" {}
resource scratchdisk "/IIS/Config4/Scratch" {}
}

Resource disk: Here a disk path is defined; the data files of the dataset are stored on the resource
disk.

Resource scratchdisk: Here a folder path is also defined; this path is used by the parallel job
stages for buffering data when the parallel job runs.

=====================================================================

Sequential_Stage :
Sequential File:
The Sequential File stage is a file stage. It allows you to read data from or write
data to one or more flat (sequential) files.

The stage executes in parallel mode by default if reading multiple files, but executes
sequentially if it is only reading one file.

In order to read a sequential file, Datastage needs to know about the format of the file.

If you are reading a delimited file you need to specify the delimiter on the Format tab.

Reading a Fixed-Width File:

Double click on the Sequential File stage and go to the Properties tab.

Source:

File: Give the file name, including the path.

Read Method: Whether to specify filenames explicitly or use a file pattern.

Important Options:

First Line is Column Names: If set to True, the first line of a file contains column names on writing and
is ignored on reading.

Keep File Partitions: Set to True to partition the read data set according to the organization of the input
file(s).

Reject Mode: Continue to simply discard any rejected rows; Fail to stop if any row is rejected; Output
to send rejected rows down a reject link.

For fixed-width files, however, you can configure the stage to behave differently:
* You can specify that single files can be read by multiple nodes. This can improve performance on
cluster systems.
* You can specify that a number of readers run on a single node. This means, for example, that a
single file can be partitioned as it is read.
These two options are mutually exclusive.

Scenario 1:

Reading file sequentially.

Scenario 2:

Read From Multiple Nodes = Yes

Once we add Read From Multiple Nodes = Yes, the stage by default executes in parallel mode.
If you run the job with the above configuration it will abort with the following fatal error:

sff_SourceFile: The multinode option requires fixed length records. (That means you can use this
option to read fixed-width files only.)

In order to fix the above issue, go to the Format tab and add additional parameters as shown below.

Now the job finishes successfully; see the Datastage monitor below for the performance improvement
compared with reading from a single node.

Scenario 3: Read a delimited file by adding the Number of Readers Per Node option instead of the
multinode option to improve read performance; once we add this option the Sequential File stage will
execute in parallel mode by default.
If we are reading from and writing to fixed-width files, it is always good practice to add the
APT_STRING_PADCHAR Datastage environment variable and assign 0x20 as the default value; it will
then pad with spaces, otherwise Datastage will pad with the null value (the Datastage default padding
character).

Always keep Reject Mode = Fail to make sure the Datastage job fails if we get data in an unexpected
format from source systems.

Sequential File Best Performance Settings/Tips

Important scenarios using the Sequential File stage:

Sequential file with duplicate records

Splitting input files into three different files using lookup

Sequential file with Duplicate Records:


A sequential file has 8 records with one column; below are the values in the column, separated by
spaces:
1 1 2 2 3 4 5 6

In a parallel job, after reading the sequential file, 2 more sequential files should be created, one with
duplicate records and the other without duplicates.
File 1 records separated by space: 1 1 2 2
File 2 records separated by space: 3 4 5 6
How will you do it?

Sol1:
1. Introduce a Sort stage right after the sequential file.
2. Select the Key Change Column property in the Sort stage; you can treat 1 as unique (first of its group)
and 0 as duplicate, or vice versa, as you wish.
3. Put a Filter or Transformer next to it and now you have unique records on one link and duplicates on
the other link.

Sol2 (should be verified, though):

First of all, take the source file and connect it to a Copy stage. Then one link is connected to the
Aggregator stage and another link is connected to the Lookup stage (or Join stage). In the Aggregator
stage, using the count function, calculate how many times the values repeat in the key column.

After calculating that, the Aggregator is connected to a Filter stage where we filter on cnt=1 (cnt is the
new count column).
Then the output from the Filter is connected to the Lookup stage as the reference. In the Lookup stage
set LOOKUP FAILURE = REJECT.

Then place two output links for the Lookup: one collects the non-repeated values and the other collects
the repeated values on the reject link.
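To make the logic of Sol2 concrete, here is a minimal Python sketch of the same idea (count per key, then route the original rows depending on whether their key count is greater than 1). It only illustrates the semantics and is not DataStage code; the values are the sample data from the question above.

from collections import Counter

rows = [1, 1, 2, 2, 3, 4, 5, 6]          # values of the single key column

# "Aggregator" step: count how many times each key value occurs
cnt = Counter(rows)

# "Filter + Lookup" step: route each original row by its key count
duplicates = [r for r in rows if cnt[r] > 1]    # goes to file 1
uniques    = [r for r in rows if cnt[r] == 1]   # goes to file 2

print(duplicates)   # [1, 1, 2, 2]
print(uniques)      # [3, 4, 5, 6]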

Splitting input files into three different files using lookup:


Input file A contains
1
2
3
4
5
6
7
8
9
10

input file B contains


6
7
8
9
10
11
12
13
14
15

Output file X contains


1
2
3
4
5

Output file y contains


6
7
8
9
10

Output file z contains


11
12
13
14
15

Possible solution:

Use the Change Capture stage. I am going to use A as the source (before) and B as the reference
(after); both of them are connected to the Change Capture stage. From the Change Capture stage the
output is connected to a Filter stage and then to the targets X, Y and Z. In the Filter stage:
change_code = 2 goes to X [1,2,3,4,5], change_code = 0 goes to Y [6,7,8,9,10], and
change_code = 1 goes to Z [11,12,13,14,15].

Solution 2:
Create one PX job.
src file = seq1 (1,2,3,4,5,6,7,8,9,10)
1st lkp: lkp file = seq2 (6,7,8,9,10,11,12,13,14,15)
o/p - matching recs - o/p 1 (6,7,8,9,10)
non-matching records - o/p 2 (1,2,3,4,5)
2nd lkp:
src file - seq2 (6,7,8,9,10,11,12,13,14,15)
lkp file - o/p 1 (6,7,8,9,10)
non-matching recs - o/p 3 (11,12,13,14,15)
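As a quick illustration of what both solutions compute, here is a minimal Python sketch of the three-way split (records only in A, records in both, records only in B). The lists mirror the sample files above; this is not DataStage code.

A = list(range(1, 11))     # input file A: 1..10
B = list(range(6, 16))     # input file B: 6..15

set_a, set_b = set(A), set(B)

x = [r for r in A if r not in set_b]   # only in A  -> output X: 1..5
y = [r for r in A if r in set_b]       # in both    -> output Y: 6..10
z = [r for r in B if r not in set_a]   # only in B  -> output Z: 11..15

print(x, y, z)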
Dataset :

Inside a InfoSphere DataStage parallel job, data is moved around in data sets. These
carry meta data with them, both column definitions and information about the configuration that
was in effect when the data set was created. If for example, you have a stage which limits
execution to a subset of available nodes, and the data set was created by a stage using all nodes,
InfoSphere DataStage can detect that the data will need repartitioning.

If required, data sets can be landed as persistent data sets, represented by a Data Set stage.
This is the most efficient way of moving data between linked jobs. Persistent data sets are stored
in a series of files linked by a control file (note that you should not attempt to manipulate these
files using UNIX tools such as rm or mv; always use the tools provided with InfoSphere
DataStage).
There are two groups of Datasets - persistent and virtual.
The first type, persistent Datasets, are marked with the *.ds extension, while for the second type,
virtual Datasets, the *.v extension is reserved. (It is important to mention that no *.v files may be
visible in the Unix file system, as they exist only virtually, residing in RAM. The *.v extension itself
is characteristic strictly of OSH, the Orchestrate scripting language.)
Further differences are much more significant. Primarily, persistent Datasets are stored
in Unix files using the internal Datastage EE format, while virtual Datasets are never stored on
disk - they exist within links, in EE format, but in RAM. Finally, persistent Datasets are readable
and rewritable with the Data Set stage, while virtual Datasets can only be passed through in memory.

A data set comprises a descriptor file and a number of other files that are added as the data set
grows. These files are stored on multiple disks in your system. A data set is organized in terms
of partitions and segments.

Each partition of a data set is stored on a single processing node. Each data segment contains all
the records written by a single job. So a segment can contain files from many partitions, and a
partition has files from many segments.

Firstly, as a single Dataset contains multiple records, it is obvious that all of them must undergo
the same processes and modifications; in a word, all of them must go through the same
successive stages.
Secondly, it should be expected that different Datasets usually have different schemas, and therefore
they cannot be treated in the same way.
Alias names for Datasets are:

1) Orchestrate File
2) Operating System file

A Dataset consists of multiple files:

a) Descriptor file
b) Data file
c) Control file
d) Header files

In the descriptor file, we can see the schema details and the address of the data.
In the data file, we can see the data in native format.
The control and header files reside in the operating system.

Starting a Dataset Manager:


Choose Tools ► Data Set Management, a Browse Files dialog box appears:
1. Navigate to the directory containing the data set you want to manage. By convention, data set
files have the suffix .ds.
2. Select the data set you want to manage and click OK. The Data Set Viewer appears. From here
you can copy or delete the chosen data set. You can also view its schema (column definitions)
or the data it contains.

Transformer Stage :

Various functionalities of the Transformer stage:

 Generating a surrogate key using the Transformer
 Transformer stage using StripWhiteSpaces
 Transformer stage to filter the data
 Transformer stage using the PadString function
 Concatenating data using the Transformer stage
 Field function in the Transformer stage
 Transformer stage with a simple example
 Transformer stage for department-wise data
 How to convert rows into columns in Datastage
 Sort stage and Transformer stage with a sample data example
 Field function in the Transformer stage with an example
 Right and Left functions in the Transformer stage with an example

Some other important functions:

 How to perform aggregation using a Transformer
 Date and time string functions
 Null handling functions
 Vector function - Transformer
 Type conversion functions - Transformer
 How to convert a single row into multiple rows
Data Stage Transformer Usage Guidelines


=========================================================================================

Sort Stage:

SORT STAGE PROPERTIES:


SORT STAGE WITH TWO KEY VALUES

HOW TO CREATE GROUP ID IN SORT STAGE IN DATASTAGE

Group IDs can be created in two different ways:

a) Key Change Column

b) Cluster Key Change Column

Both options are used to create group IDs.

When we select either option and set it to True, it creates group IDs group-wise. The data is divided

into groups based on the key column, and the stage gives 1 for the first row of every group and 0 for

the rest of the rows in each group.

Whether to use Key Change Column or Cluster Key Change Column depends on the data we are

getting from the source:

If the data we are getting is not sorted, we use Key Change Column to create group IDs.

If the data we are getting is already sorted, we use Cluster Key Change Column to create group IDs.


Open the Sort stage properties and select the key column.

If the incoming data is not sorted, set Key Change Column to True,

then drag and drop the columns to the output.

Group IDs will be generated as 0s and 1s, group-wise.

If your data is already sorted, set Cluster Key Change Column to True instead

(don't select Key Change Column)

and follow the same process as above.
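For intuition, here is a minimal Python sketch of the key-change logic described above: after sorting by the key, the first row of each group gets 1 and the remaining rows get 0. It is only an illustration of the idea, not DataStage code; the sample rows are made up.

rows = [("10", "siva"), ("10", "ram"), ("20", "tom"), ("20", "tiny"), ("30", "emy")]

rows.sort(key=lambda r: r[0])          # sort by the key column (dno)

prev_key = None
for dno, name in rows:
    key_change = 1 if dno != prev_key else 0   # 1 = first row of its group, 0 = repeat
    prev_key = dno
    print(dno, name, key_change)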

Aggregator_Stage :

The Aggregator Stage:

The Aggregator stage is a processing stage in Datastage used for grouping and summary operations.
By default the Aggregator stage executes in parallel mode in parallel jobs.

Note: In a parallel environment, the way that we partition data before grouping and
summarizing will affect the results. If you partition data using the round-robin method,
records with the same key values will be distributed across different partitions, and that will give
incorrect results.

Aggregation Method:

The Aggregator stage has two different aggregation methods.


1) Hash: Use hash mode for a relatively small number of groups; generally, fewer than about 1000
groups per megabyte of memory.

2) Sort: Sort mode requires the input data set to have been partition-sorted with all of the grouping
keys specified as hashing and sorting keys. Unlike the hash aggregator, the sort aggregator requires
presorted data, but it only maintains the calculations for the current group in memory.

Aggregation Data Type:

By default the Aggregator stage calculation output column is of the double data type; if you want
decimal output, set the corresponding decimal output property on the stage.

If you are using a single key column for the grouping keys then there is no need to sort or hash
partition the incoming data.

AGGREGATOR STAGE AND FILTER STAGE WITH EXAMPLE


If we have a data as below

table_a
dno,name
10,siva
10,ram
10,sam
20,tom
30,emy
20,tiny
40,remo

We need to get the records that occur multiple times into one target,
and the records that are not repeated (with respect to dno) into another target.

Take the job design as:

Read and load the data in the sequential file.

In the Aggregator stage select Group = dno,

Aggregator Type = Count Rows,

Count Output Column = dno_count (user defined).

In the output, drag and drop the required columns, then click OK.

In the Filter stage:

----- first where clause: dno_count > 1, output link = 0

----- second where clause: dno_count <= 1, output link = 1

Drag and drop the outputs to the two targets. Give the target file names and compile and run the job.
You will get the required data in the targets.

AGGREGATOR STAGE TO FIND NUMBER OF PEOPLE GROUP WISE

We can use the Aggregator stage to find the number of people in each department.

For example, if we have the data as below

e_id,e_name,dept_no
1,sam,10
2,tom,20
3,pinky,10
4,lin,20
5,jim,10
6,emy,30
7,pom,10
8,jem,20
9,vin,30
10,den,20

Take Job Design as below

Seq.-------Agg.Stage--------Seq.File

Read and load the data in source file.

Go to Aggregator Stage and Select Group as Dept_No

and Aggregator type = Count Rows

Count Output Column = Count ( This is User Determined)

Click Ok ( Give File name at the target as your wish )

Compile and Run the Job

AGGREGATOR STAGE WITH REAL TIME SCENARIO EXAMPLE

The Aggregator stage works on groups.

It is used for calculations and counting.
It supports 1 input and 1 output.

Example for Aggregator stage:

Input Table to Read

e_id, e_name, e_job,e_sal,deptno

100,sam,clerck,2000,10
200,tom,salesman,1200,20
300,lin,driver,1600,20
400,tim,manager,2500,10
500,zim,pa,2200,10
600,eli,clerck,2300,20

Here our requirement is to find the maximum salary from each dept. number.
According to this sample data, we have two departments.

Take Sequential File to read the data and take Aggregator for calculations.
And Take sequential file to load into the target.

That is we can take like this

Seq.File--------Aggregator-----------Seq.File

Read the data in the Seq. File

and in the Aggregator stage --- in Properties --- select Group = DeptNo

and select e_sal as the column for the calculation,

because we want to calculate the maximum salary based on the dept. group.

Select output file name in second sequential file.

Now compile And run.

It will work fine.
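The same grouping logic, expressed as a minimal Python sketch (group by deptno and take the maximum e_sal). The rows are the sample data listed above; this only illustrates what the Aggregator computes, it is not the stage itself.

rows = [
    (100, "sam", "clerck", 2000, 10),
    (200, "tom", "salesman", 1200, 20),
    (300, "lin", "driver", 1600, 20),
    (400, "tim", "manager", 2500, 10),
    (500, "zim", "pa", 2200, 10),
    (600, "eli", "clerck", 2300, 20),
]

max_sal = {}                      # deptno -> maximum salary seen so far
for e_id, e_name, e_job, e_sal, deptno in rows:
    if deptno not in max_sal or e_sal > max_sal[deptno]:
        max_sal[deptno] = e_sal

print(max_sal)                    # {10: 2500, 20: 2300}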

A reader question on this scenario:

With the aggregator-and-filter design above and two sequential file targets (one for count > 1 and one
for count <= 1), the first target contains only the grouped counts:

dno count
10 3
20 2

and the second target contains:

dno count
40 1
30 1

Instead, the desired output is the original rows:
dno name
10 siva
10 ram
10 sam
20 tom
20 tiny

The second output file should be:

dno name
30 emy
40 remo

To get the original rows rather than the counts, look the filtered counts up against the source data
again (as in Sol2 of the sequential-file duplicate-records scenario above): rows whose dno has a count
greater than 1 go to the first target, and the remaining rows go to the second.

Join Stage:

MULTIPLE JOIN STAGES TO JOIN THREE TABLES:

If we have three tables to join but do not have a common key column in all the tables, we cannot
join the tables using a single Join stage.

In this case we can use multiple Join stages to join the tables.

You can take sample data as below

soft_com_1
e_id,e_name,e_job,dept_no
001,james,developer,10
002,merlin,tester,20
003,jonathan,developer,10
004,morgan,tester,20
005,mary,tester,20

soft_com_2
dept_no,d_name,loc_id
10,developer,200
20,tester,300

soft_com_3
loc_id,add_1,add_2
200,melbourne,victoria
300,brisbane,queensland

Take Job Design as below

Read and load the data in three sequential files.

In first Join stage ,

Go to Properties ----Select Key column as Deptno

and you can select Join type = Inner

Drag and drop the required columns in Output

Click Ok

In Second Join Stage

Go to Properties ---- Select Key column as loc_id

and you can select Join type = Inner

Drag and Drop the required columns in the output

Click ok

Give file name to the Target file, That's it

Compile and Run the Job



JOIN STAGE WITHOUT COMMON KEY COLUMN:

If we would like to join tables using the Join stage, we need to have common key columns in those
tables. But sometimes we get data without a common key column. In that case we can use a Column
Generator to create a common column in both tables.

Read and load the data in Seq. Files.

Go to the Column Generator to create the column and sample data. In its properties, select the name
of the column to create, and drag and drop the columns into the target.

Now go to the Join stage and select the key column which we have created (you can give it any name;
based on the business requirement you can give it an understandable name).

In the output, drag and drop all the required columns.

Give the file name to the target file, then compile and run the job.

Sample Tables You can take as below

Table1

e_id,e_name,e_loc
100,andi,chicago
200,borny,Indiana
300,Tommy,NewYork

Table2

Bizno,Job
20,clerk
30,salesman
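A minimal Python sketch of the dummy-key idea described above: generate the same constant column in both inputs and join on it, which effectively pairs every row of Table1 with every row of Table2. The column name dummy_key is hypothetical; in DataStage it would be created by the Column Generator.

table1 = [(100, "andi", "chicago"), (200, "borny", "Indiana"), (300, "Tommy", "NewYork")]
table2 = [(20, "clerk"), (30, "salesman")]

# Column Generator step: add the same constant key to every row of both inputs
left  = [row + ("dummy_key",) for row in table1]
right = [row + ("dummy_key",) for row in table2]

# Join step: match on the generated key (every left row matches every right row)
joined = [l[:-1] + r[:-1] for l in left for r in right if l[-1] == r[-1]]

for row in joined:
    print(row)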

INNER JOIN IN JOIN STAGE WITH EXAMPLE:


If we have a Source data as below

xyz1 (Table 1 )

e_id,e_name,e_add
1,tim,la
2,sam,wsn
3,kim,mex
4,lin,ind
5,elina,chc

xyz2 (Table 2 )

e_id,address
1,los angeles
2,washington
3,mexico
4,indiana
5,chicago

We need the output as a

e_id, e_name,address
1,tim,los angeles
2,sam,washington
3,kim,mexico
4,lin,indiana
5,elina,chicago

Take job design as below


Read and load both the source tables in seq. files

And go to Join stage properties

Select Key column as e_id

JOIN Type = Inner

In the output columns, drag and drop the required columns to go to the output file and click OK.

Give file name for Target dataset and then

Compile and Run the Job . You will get the Required Output in the Target File.

Join stage and its types explained:

Inner Join:
What if we have duplicates in the left table on the key field? What will happen?

We will get all matching records, including all matching duplicates. Here is the table
representation of the join.
Left Outer Join:

All the records from the left table plus all matching records. If a match does not exist in the right
table, the right-side columns are populated with nulls.

Right Outer Join:

All the records from the right table plus all matching records.
Full Outer Join:

All records and all matching records:
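To make the four behaviours concrete, here is a minimal Python sketch over two small keyed lists (including a duplicate key on the left). The data is made up for illustration; it is not the DataStage implementation.

left  = [(1, "a1"), (1, "a2"), (2, "b")]       # duplicate key 1 on the left
right = [(1, "X"), (3, "Z")]

right_keys = {k for k, _ in right}
left_keys  = {k for k, _ in left}

inner = [(k, lv, rv) for k, lv in left for rk, rv in right if k == rk]
left_outer  = inner + [(k, lv, None) for k, lv in left  if k not in right_keys]
right_outer = inner + [(k, None, rv) for k, rv in right if k not in left_keys]
full_outer  = left_outer + [(k, None, rv) for k, rv in right if k not in left_keys]

print(inner)        # both left duplicates of key 1 match -> (1, 'a1', 'X'), (1, 'a2', 'X')
print(left_outer)   # adds (2, 'b', None)
print(right_outer)  # adds (3, None, 'Z')
print(full_outer)   # adds both unmatched rows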


=====================================================================

Lookup_Stage :

Lookup Stage:

The Lookup stage is most appropriate when the reference data for all lookup stages in a job
is small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or
MERGE stage should be used.

Lookup stages do not require data on the input link or reference links to be sorted. Be aware,
though, that large in-memory lookup tables will degrade performance because of their paging
requirements. Each record of the output data set contains columns from a source record plus columns
from all the corresponding lookup records where corresponding source and lookup records have the
same value for the lookup key columns. The lookup key columns do not have to have the same
names in the primary and the reference links.

The optional reject link carries source records that do not have a corresponding entry in the
input lookup tables.

You can also perform a range lookup, which compares the value of a source column to a range of
values between two lookup table columns. If the source column value falls within the required range,
a row is passed to the output link. Alternatively, you can compare the value of a lookup column to a
range of values between two source columns. Range lookups must be based on column values, not
constant values. Multiple ranges are supported.
There are some special partitioning considerations for Lookup stages. You need to ensure that the
data being looked up in the lookup table is in the same partition as the input data referencing it. One
way of doing this is to partition the lookup tables using the Entire method.

Lookup stage configuration: Equal lookup


You can specify what action to perform if the lookup fails.

Scenario 1: Continue
Choose Entire partitioning on the reference link.
Scenario 2: Fail
The job aborts with the following error:

stg_Lkp,0: Failed a key lookup for record 2 Key Values: CUSTOMER_ID: 3

Scenario 3: Drop
Scenario 4: Reject
If we select Reject as the lookup failure condition then we need to add a reject link, otherwise we get
a compilation error.
Range Lookup:

Business scenario: we have input data with customer id, customer name and transaction date. We
have a customer dimension table with customer address information. A customer can have multiple
records with different start and end dates, and we want to select the record where the incoming
transaction date falls between the start and end date of the customer in the dimension table.

Example input data:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT
1 UMA 2011-03-01
1 UMA 2010-05-01

Example dimension data:

CUSTOMER_ID CITY ZIP_CODE START_DT END_DT
1 BUENA PARK 90620 2010-01-01 2010-12-31
1 CYPRESS 90630 2011-01-01 2011-04-30

Expected output:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT CITY ZIP_CODE
1 UMA 2011-03-01 CYPRESS 90630
1 UMA 2010-05-01 BUENA PARK 90620
Configure the lookup stage as shown below. Double click on the Lnk_input.TRANSACTION_DT
column (specifying the condition on the input link). You need to specify Return Multiple Rows from
the reference link, otherwise you will get the following warning in the job log: even though we have
two distinct rows based on the customer_id, start_dt and end_dt columns, Datastage considers them
duplicate rows based on the customer_id key only.

stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table

Compile and Run the job:

Scenario 2:Specify range on reference link:


This concludes lookup stage configuration for different scenarios.
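Here is a minimal Python sketch of the range-lookup semantics used above: for each input row, pick the dimension row with the same customer_id whose start and end dates bracket the transaction date. The values are the sample data from the tables above; this only illustrates the behaviour, it is not DataStage code.

from datetime import date

inputs = [(1, "UMA", date(2011, 3, 1)), (1, "UMA", date(2010, 5, 1))]
dim = [
    (1, "BUENA PARK", "90620", date(2010, 1, 1), date(2010, 12, 31)),
    (1, "CYPRESS",    "90630", date(2011, 1, 1), date(2011, 4, 30)),
]

for cust_id, name, txn_dt in inputs:
    for d_id, city, zip_code, start_dt, end_dt in dim:
        # range condition: same key and start_dt <= transaction date <= end_dt
        if d_id == cust_id and start_dt <= txn_dt <= end_dt:
            print(cust_id, name, txn_dt, city, zip_code)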

RANGE LOOKUP WITH EXAMPLE IN DATASTAGE:

Range lookup is used to check which range from another table a record's value falls into.

For example, suppose we have a list of employees getting salaries from $1500 to $3000.

If we would like to check which salary range each employee falls into,

we can do it by using a range lookup.

For Example if we have the following sample data.

xyzcomp ( Table Name )


e_id,e_name,e_sal
100,james,2000
200,sammy,1600
300,williams,1900
400,robin,1700
500,ponting,2200
600,flower,1800
700,mary,2100

lsal is nothing but low salary

hsal is nothing but high salary

(these are the range columns in the reference table)

Now read and load the data in the Sequential Files.

Open the Lookup stage --- select e_sal in the first table's data,

open the key expression and

select e_sal >= lsal and e_sal <= hsal there.

Click OK.

Then drag and drop the required columns into the output and click OK.

Give a file name to the target file.

Then compile and run the job. That's it, you will get the required output.

Why is Entire partitioning used in the Lookup stage?


Entire partitioning puts all of the reference data on every node, so while matching (in the lookup) all
reference data is present on each node.
Sorting is not required for a lookup. If we do not use Entire partitioning, the reference data is split
across the nodes, and each primary record then needs to be checked against all nodes to find a
matching reference record, which causes a performance issue. If we use Entire in the lookup, it is
enough for a primary record to look in a single node: if a match is found, the record goes to the
target, otherwise it moves to reject, drop, etc. (based on the requirement), with no need to check
another node. In this case, if we are running the job on 4 nodes, then 4 records are processed at a
time.

Note: Please remember we go for a lookup only when we have small reference data. If we use it for
big data, there is a performance issue (I/O will increase) and sometimes the job will even abort.

Difference between normal and sparse lookup?

Normal lookup: all the reference table data is stored in a memory buffer for cross-checking with the
primary table data.
Sparse lookup: each record of the primary table is cross-checked directly against the reference table
data in the database. These types of lookups arise only if the reference table is in a database, so
depending on the size of the reference table we set the type of lookup to implement.

During a lookup, what happens if we have duplicates in the reference table/file?
=====================================================================================

Merge_Stage :

Merge Stage:

The Merge stage is a processing stage. It can have any number of input links, a single output
link, and the same number of reject links as there are update input links (according to the DS
documentation).
The Merge stage combines a master dataset with one or more update datasets based on the key
columns. The output record contains all the columns from the master record plus any additional
columns from each update record that are required.

A master record and update record will be merged only if both have same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures
that rows with the same key column values are located in the same partition and will be processed by
the same node. It also minimizes memory requirements because fewer rows need to be in memory at
any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate
records from the master data set. If you have more than one update data set, you must remove
duplicate records from the update data sets as well.

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject
links. You can route update link rows that fail to match a master row down a reject link that is specific
for that link. You must have the same number of reject links as you have update links. The Link
Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject
links. You can also specify whether to drop unmatched master rows, or output them on the output
data link.

Example :

Master dataset:

CUSTOMER_ID CUSTOMER_NAME
1 UMA
2 POOJITHA

Update dataset 1:

CUSTOMER_ID CITY ZIP_CODE SEX
1 CYPRESS 90630 M
2 CYPRESS 90630 F

Output:

CUSTOMER_ID CUSTOMER_NAME CITY ZIP_CODE SEX
1 UMA CYPRESS 90630 M
2 POOJITHA CYPRESS 90630 F
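Before going through the stage configuration, here is a minimal Python sketch of the merge behaviour illustrated by this example: a sorted master input combined with an update input on CUSTOMER_ID, unmatched update rows routed to a reject list, and an option to keep or drop unmatched masters. It only illustrates the semantics; it is not the stage implementation.

master = {1: {"CUSTOMER_NAME": "UMA"}, 2: {"CUSTOMER_NAME": "POOJITHA"}}
update1 = {1: {"CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
           2: {"CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"}}

keep_unmatched_masters = True      # Unmatched Masters Mode = Keep
output = []

for cust_id, row in master.items():
    if cust_id in update1:
        output.append({"CUSTOMER_ID": cust_id, **row, **update1[cust_id]})
    elif keep_unmatched_masters:
        output.append({"CUSTOMER_ID": cust_id, **row})

# update rows with no master go to the reject link for that update input
rejects = [{"CUSTOMER_ID": k, **v} for k, v in update1.items() if k not in master]

print(output)
print(rejects)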


Merge stage configuration steps:
Options:

Unmatched Masters Mode: Keep means that unmatched rows (those without any updates) from the
master link are output; Drop means that unmatched rows are dropped instead.

Warn On Reject Updates: True to generate a warning when bad records from any update links are
rejected.

Warn On Unmatched Masters: True to generate a warning when there are unmatched rows from the
master link.

Partitioning: Hash on both the master input and the update input, as shown below:
Compile and run the job:

Scenario 2:

Remove a record from update dataset 1 and check the output.


Check for the Datastage warnings in the job log, as we have selected Warn On Unmatched Masters =
True:

stg_merge,0: Master record (0) has no updates.

stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.

Scenario 3: Drop the unmatched master record and capture reject records from update dataset 1.

Scenario 4: Insert a duplicate record with the same customer id in the master dataset and check the
results.

Looking at the output, it is clear that the Merge stage automatically dropped the duplicate record
from the master dataset.

Scenario 5: Add a new update dataset (update dataset 2) which contains the following data.

Update dataset 2:

CUSTOMER_ID CITIZENSHIP
1 INDIAN
2 AMERICAN

We still have a duplicate row in the master dataset; if you compile the job with the above design you
will get a compilation error like the one below.

If you look at the above figure you can see 2 rows in the output, because we have a matching row for
customer_id = 2 in update dataset 2.
Scenario 6: Add a duplicate row for customer_id = 1 in the update dataset 1.

Now we have a duplicate record both in the master dataset and in update dataset 1. Run the job and
check the results and warnings in the job log.

There is no change in the results; the Merge stage automatically dropped the duplicate row.

Scenario 7: Modify the duplicate row for customer_id = 1 in update dataset 1 with zip code 90630
instead of 90620.

Run the job and check the output results.


Running the same job multiple times shows that the Merge stage takes the first record coming as
input from update dataset 1 and drops the subsequent records with the same customer id.

This post covered most of the merge scenarios.

=====================================================================================

Filter_Stage :

Filter Stage:

The Filter stage is a processing stage used to filter data based on a filter condition.

The Filter stage is configured by creating an expression in the where clause.

Scenario 1: Check for empty values in the customer name field. We are reading from a sequential
file, and hence we should check for an empty value instead of null.
Scenario 2: Comparing incoming fields - check whether the transaction date falls between STR_DT
and END_DT and filter those records.

Input Data:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT STR_DT END_DT

1 UMA 1/1/2010 5/20/2010 12/20/2010

1 UMA 5/28/2011 5/20/2010 12/20/2010


Output:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT STR_DT END_DT

1 UMA 5/28/2011 5/20/2010 12/20/2010


Reject:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT STR_DT END_DT

1 UMA 1/1/2010 5/20/2010 12/20/2010

Partition data based on CUSTOMER_ID to make sure all rows with same key values process on the
same node.

Condition: Where TRANSACTION_DT Between STR_DT And END_DT


Actual Output:

Actual Reject Data:

Scenario 3: Evaluating input column data

ex: Where CUSTOMER_NAME='UMA' AND CUSTOMER_ID='1'

Output :
Reject :

This covers most filter stage scenarios.

FILTER STAGE WITH REAL TIME EXAMPLE:


The Filter stage is used to write conditions on columns.

We can write conditions on any number of columns.

For example, if you have data like the following:

e_id,e_name,e_sal

1,sam,2000
2,ram,2200
3,pollard,1800
4,ponting,2200
5,sachin,2200

If we need to find who is getting a salary of 2200

(in real time there will be thousands of records at the source),


we can take a Sequential File to read the data and a Filter stage for writing the conditions.

And Dataset file to load the data into the Target.

Design as follows: ---

Seq.File---------Filter------------DatasetFile

Open Sequential File And

Read the data.

In filter stage -- Properties -- Write Condition in Where clause as

e_sal=2200

Go to Output -- Drag and Drop

Click Ok

Go to Target Dataset file and give some name to the file and that's it

Compile and Run

You will get the required output in Target file.

If you are trying to write conditions on multiple columns,

write each condition in the where clause

and give it an output link (the link order number), for example: 1.

Then write another condition and select output link = 0.

(You can see the link order numbers in the Link Ordering option.)

Design as follows: ----

Compile and run.

You will get the data in both the targets.
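A minimal Python sketch of the where-clause routing described above: each condition sends matching rows to its own output link. The rows are the sample data above; the second condition (e_sal < 2000) is made up for illustration, and this is not DataStage code.

rows = [(1, "sam", 2000), (2, "ram", 2200), (3, "pollard", 1800),
        (4, "ponting", 2200), (5, "sachin", 2200)]

link0 = [r for r in rows if r[2] == 2200]   # where clause: e_sal = 2200  -> output link 0
link1 = [r for r in rows if r[2] < 2000]    # where clause: e_sal < 2000  -> output link 1

print(link0)   # ram, ponting and sachin rows
print(link1)   # pollard row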


Copy Stage :
COPY STAGE:

The Copy stage is one of the processing stages; it has one input and 'n' number of outputs. The
Copy stage is used to send one source of data to multiple output copies, which can then be used for
multiple purposes. The records which we send through the Copy stage can be copied without any
modification, and we can also do the following:
a) The column order can be altered.
b) Columns can be dropped.
c) We can change the column names.

In the Copy stage, we have an option called Force. It is False by default; if we set it to True, it is
used to specify that Datastage should not try to optimize the job by removing a copy operation where
there is one input and one output.
================================================================================

Funnel_Stage :

Funnel Stage:

The Funnel stage is used to combine multiple input datasets into a single output dataset. This stage
can have any number of input links and a single output link.
It operates in 3 modes:

Continuous Funnel combines records as they arrive (i.e. no particular order);

Sort Funnel combines the input records in the order defined by one or more key fields;

Sequence copies all records from the first input data set to the output data set, then all the records
from the second input data set, etc.

Note:Metadata for all inputs must be identical.

Sort Funnel requires the data to be sorted and partitioned by the same key columns as are to be used
by the funnel operation.

Hash partitioning guarantees that all records with the same key column values are located in the same
partition and are processed on the same node.
1) Continuous Funnel:

Go to the properties of the Funnel stage and set Funnel Type to Continuous Funnel.
2) Sequence:
Note: In order to use the Sequence funnel you need to specify the order in which the input links
are processed, and also make sure the stage runs in sequential mode.

Usually we use the Sequence funnel when we create a file with header, detail and trailer records.

3) Sort Funnel:
Note: If you are running your Sort Funnel stage in parallel, you should be aware of the various
considerations about sorting data and partitions.

That's all about Funnel stage usage in Datastage.

FUNNEL STAGE WITH REAL TIME EXAMPLE

Sometimes we get data in multiple files that all belong to the same bank's customer information.

In that case we need to funnel the files to get the data from the multiple files into a single file (table).

For example, if we have the data in two files as below:

xyzbank1
e_id,e_name,e_loc
111,tom,sydney
222,renu,melboourne
333,james,canberra
444,merlin,melbourne

xyzbank2
e_id,e_name,e_loc
555,flower,perth
666,paul,goldenbeach
777,raun,Aucland
888,ten,kiwi
For the Funnel, take the job design as:

Read and load the data with two Sequential File stages.

Go to the Funnel stage Properties and

select Funnel Type = Continuous Funnel

(or any other mode according to your requirement).

Go to Output and drag and drop the columns

(remember the source column structures should be the same), then click OK.

Give a file name for the target dataset, then

compile and run the job.
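
For illustration only, here is a rough shell analogue of two of the funnel modes, assuming xyzbank1.txt and xyzbank2.txt share the same layout and have no header rows:

cat xyzbank1.txt xyzbank2.txt > sequence_funnel.txt                 # like Sequence: all of file 1 first, then file 2
sort -m -t',' -k1,1 xyzbank1.txt xyzbank2.txt > sort_funnel.txt     # like Sort Funnel: merges inputs already sorted on e_id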

Column Generator :
The Column Generator is a development/debug stage that is used to generate one or more columns

with sample data based on a user-defined data type.

Take Job Design as

Seq.File--------------Col.Gen------------------Ds

Take the source data as:

xyzbank
e_id,e_name,e_loc
555,flower,perth
666,paul,goldencopy
777,james,aucland
888,cheffler,kiwi
In order to generate a column (for example, unique_id):

First read and load the data in the Sequential File stage.

Go to the Column Generator stage -- Properties -- select Column Method = Explicit.

In Column To Generate, give the column name (for example, unique_id).

In Output, drag and drop.

Go to the Columns tab, write the column name, and you can change the SQL type for unique_id and

give a suitable length.

Then compile and run.
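
As a quick command-line illustration of the same idea (an extra unique_id value generated for each incoming row), assuming the records above are in xyzbank.txt with no header:

awk -F',' -v OFS=',' '{ print NR, $0 }' xyzbank.txt > with_unique_id.txt   # NR plays the role of the generated unique_id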


================================================================================

Surrogate_Key_Stage :

Surrogate Key Importance:


SURROGATE KEY IN DATASTAGE:
A surrogate key is a unique identification key. It is an alternative to the natural key.

A natural key may be an alphanumeric composite key, but a surrogate key is

always a single numeric key.

The Surrogate Key stage is used to generate key columns, for which characteristics can be

specified. It generates sequential, incremental and unique integers from a

provided start point. It can have a single input and a single output link.

WHAT IS THE IMPORTANCE OF A SURROGATE KEY?


A surrogate key is a primary key for a dimension table (a surrogate key is an alternative to the natural primary
key). The main benefit of using a surrogate key is that it is not affected by the changes going on in the source
database.

Also, with a surrogate key, duplicate natural-key values can be stored (for example, multiple versions of a record),
which cannot happen with the natural primary key alone.

By using a surrogate key we can continue the sequence for any job: if a job aborted after
n records were loaded, the surrogate key lets you continue the sequence from n+1.

Surrogate Key Generator:


The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and
maintains the key source.

A surrogate key is a unique primary key that is not derived from the data that it represents, therefore
changes to the data will not change the primary key. In a star schema database, surrogate keys are
used to join a fact table to a dimension table.

The Surrogate Key Generator stage can be used to:

 Create or delete the key source before other jobs run


 Update a state file with a range of key values
 Generate surrogate key columns and pass them to the next stage in the job
 View the contents of the state file
Generated keys are 64-bit integers, and the key source can be a state file or a database sequence.

Creating the key source:


Drag the surrogate key stage from palette to parallel job canvas with no input and output links.

Double click on the surrogate key stage and click on properties tab.

Properties:
Key Source Action = create

Source Type : FlatFile or Database sequence(in this case we are using FlatFile)

When you run the job it will create an empty file.

If you want to check the contents, change View State File = YES and check the job log for details.

skey_genstage,0: State file /tmp/skeycutomerdim.stat is empty.

If you try to create the same file again, the job will abort with the following error:

skey_genstage,0: Unable to create state file /tmp/skeycutomerdim.stat: File exists.

Deleting the key source:

Updating the state file:

To update the state file, add a Surrogate Key stage to the job with a single input link from another stage.

We use this process to update the state file if it is corrupted or deleted.

1) Open the Surrogate Key stage editor and go to the Properties tab.
If the state file exists we can update it; otherwise we can create it and then update it.

We are using a SkeyValue parameter to update the state file via a Transformer stage.

Generating Surrogate Keys:

Now that we have created the state file, we will generate keys using it.

Click on the Surrogate Key stage, go to Properties, and type a name for the surrogate key
column in the Generated Output Column Name property.
Go to Output and define the mapping like below.

The Row Generator is producing 10 rows, hence when we run the job we see 10 surrogate key values in the output.

I have updated the state file with 100 and below is the output.
If you want to generate the key values from the beginning, you can use the following properties in the Surrogate
Key stage.

a. If the key source is a flat file, specify how keys are generated:
o To generate keys in sequence from the highest value that was last used, set the Generate Key from
Last Highest Value property to Yes. Any gaps in the key range are ignored.
o To specify a value to initialize the key source, add the File Initial Value property to the Options group,
and specify the start value for key generation.
o To control the block size for key ranges, add the File Block Size property to the Options group, set this
property to User Specified, and specify a value for the block size.
b. If there is no input link, add the Number of Records property to the Options group, and specify how
many records to generate.
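
A minimal shell sketch of the state-file idea (not the real Surrogate Key Generator internals): read the last used key, continue the sequence from it, and write the new high value back. The file names new_rows.txt and keyed_rows.txt are assumptions; the state file path mirrors the example above.

STATE=/tmp/skeycutomerdim.stat
last=$(cat "$STATE" 2>/dev/null); last=${last:-0}          # last key handed out (0 if the state file is empty)
awk -v start="$last" '{ print start + NR "," $0 }' new_rows.txt > keyed_rows.txt
echo $(( last + $(wc -l < new_rows.txt) )) > "$STATE"      # remember the new highest value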
=====================================================================================

SCD :

WHAT IS SCD IN DATASTAGE ? TYPES OF SCD IN DATASTAGE?

SCDs are nothing but Slowly Changing Dimensions.

SCDs are dimensions whose data changes slowly, rather than on a regular schedule or time period.

SCDs are mainly implemented in three types.

They are
Type-1 SCD

Type-2 SCD

Type-3 SCD

Type-1 SCD: In the Type-1 SCD methodology, the older data (records) is overwritten with the new data

(records), and therefore no historical information is maintained.

This is used for correcting the spelling of names and for small updates to

customer data.

Type-2 SCD: In the Type-2 SCD methodology, the complete historical information is tracked

by creating multiple records for a given natural key (primary key) in the dimension

table, with separate surrogate keys or different version numbers. We have unlimited

historical data preservation, as a new record is inserted each time a change is made.

Here we use different kinds of options to track the historical data of

customers, such as

a) Active flag

b) Date columns

c) Version numbers

d) Surrogate keys

We use these to track all the historical data of the customer.

Depending on the input, we use the appropriate option for tracking.

Type-3 SCD: The Type-3 SCD maintains only partial historical

information.
HOW TO USE TYPE -2 SCD IN DATASTAGE?
SCDs are nothing but Slowly Changing Dimensions.

Slowly Changing Dimensions are dimensions whose data changes slowly, rather than on a regular,
scheduled basis.

The most common Slowly Changing Dimensions are of three types.


They are Type-1, Type-2 and Type-3 SCDs.

Type-2 SCD: The Type-2 methodology tracks the complete historical information by creating
multiple records for a given natural key in the dimension table, with separate surrogate keys or
different version numbers.
We have unlimited history preservation, as a new record is inserted each time a change is
made.

SLOWLY CHANGING DIMENSIONS (SCD) - TYPES | DATA WAREHOUSE


Slowly Changing Dimensions: Slowly changing dimensions are the dimensions in which the data
changes slowly, rather than changing regularly on a time basis.

For example, you may have a customer dimension in a retail domain. Let's say the customer is in
India and does some shopping every month. Creating the sales report for this customer is easy.
Now assume that the customer is transferred to the United States and does his shopping there.
How do you record such a change in your customer dimension?

You could sum or average the sales done by the customer, but then you won't get an exact
comparison of the sales before and after the move: as the customer's salary increased after the
transfer, he or she might do more shopping in the United States than in India, so a simple total
would be misleading. You could instead create a second customer record and treat the transferred
customer as a new customer, but that creates problems too.

Handling these issues involves the SCD management methodologies referred to as Type 1 to
Type 3. The different types of slowly changing dimensions are explained in detail below.

SCD Type 1: SCD type 1 methodology is used when there is no need to store historical data in the
dimension table. This method overwrites the old data in the dimension table with the new data. It is
used to correct data errors in the dimension.

As an example, I have a customer table with the below data.

surrogate_key customer_id customer_name Location

------------------------------------------------
1 1 Marspton Illions

Here the customer name is misspelt. It should be Marston instead of Marspton. If you use type1
method, it just simply overwrites the data. The data in the updated table will be.

surrogate_key customer_id customer_name Location

------------------------------------------------

1 1 Marston Illions

The advantage of type1 is ease of maintenance and less space occupied. The disadvantage is that
there is no historical data kept in the data warehouse.

SCD Type 3: In type 3 method, only the current status and previous status of the row is maintained
in the table. To track these changes two separate columns are created in the table. The customer
dimension table in the type 3 method will look as

surrogate_key customer_id customer_name Current_Location previous_location

--------------------------------------------------------------------------

1 1 Marston Illions NULL

Let say, the customer moves from Illions to Seattle and the updated table will look as

surrogate_key customer_id customer_name Current_Location previous_location

--------------------------------------------------------------------------

1 1 Marston Seattle Illions

Now again if the customer moves from seattle to NewYork, then the updated table will be
surrogate_key customer_id customer_name Current_Location previous_location

--------------------------------------------------------------------------

1 1 Marston NewYork Seattle

The type 3 method will have limited history and it depends on the number of columns you create.

SCD Type 2: SCD type 2 stores the entire history of the data in the dimension table. With type 2 we
can store unlimited history in the dimension table. In type 2, you can store the data in three different
ways. They are

 Versioning
 Flagging
 Effective Date

SCD Type 2 Versioning: In versioning method, a sequence number is used to represent the
change. The latest sequence number always represents the current row and the previous sequence
numbers represents the past data.

As an example, let’s use the same example of customer who changes the location. Initially the
customer is in Illions location and the data in dimension table will look as.

surrogate_key customer_id customer_name Location Version

--------------------------------------------------------

1 1 Marston Illions 1

The customer moves from Illions to Seattle and the version number will be incremented. The
dimension table will look as

surrogate_key customer_id customer_name Location Version

--------------------------------------------------------

1 1 Marston Illions 1

2 1 Marston Seattle 2
Now again if the customer is moved to another location, a new record will be inserted into the
dimension table with the next version number.

SCD Type 2 Flagging: In flagging method, a flag column is created in the dimension table. The
current record will have the flag value as 1 and the previous records will have the flag as 0.

Now for the first time, the customer dimension will look as.

surrogate_key customer_id customer_name Location flag

--------------------------------------------------------

1 1 Marston Illions 1

Now when the customer moves to a new location, the old records will be updated with flag value as
0 and the latest record will have the flag value as 1.

surrogate_key customer_id customer_name Location flag

--------------------------------------------------------

1 1 Marston Illions 0

2 1 Marston Seattle 1

SCD Type 2 Effective Date: In Effective Date method, the period of the change is tracked using the
start_date and end_date columns in the dimension table.

surrogate_key customer_id customer_name Location Start_date End_date

-------------------------------------------------------------------------

1 1 Marston Illions 01-Mar-2010 20-Feb-2011

2 1 Marston Seattle 21-Feb-2011 NULL

The NULL in the End_Date indicates the current version of the data and the remaining records
indicate the past data.
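
As a hedged illustration of the effective-date logic (outside DataStage), here is an awk sketch that closes the current row for customer 1 and inserts the new Seattle row. The file name dim.csv and its comma-separated layout (surrogate_key,customer_id,customer_name,location,start_date,end_date) are assumptions made for the example.

awk -F',' -v OFS=',' -v cust=1 -v newloc="Seattle" \
    -v endold="20-Feb-2011" -v startnew="21-Feb-2011" '
  $2 == cust && $6 == "NULL" { name = $3; $6 = endold }           # close the current version of the customer
  { print; if ($1 + 0 > maxkey + 0) maxkey = $1 }                 # pass every row through, track the highest key
  END { print maxkey + 1, cust, name, newloc, startnew, "NULL" }  # append the new current version
' dim.csv > dim_new.csv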

SCD-2 Implementation in Datastage:

Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An
additional dimension record is created and the segmenting between the old record values and the new
(current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact
table usually stores dimension key and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of
many possible designs which can implement this dimension.
For this example, we will use a table with customer data (its name is D_CUSTOMER_SCD2) which
has the following structure and data:
D_CUSTOMER dimension table before loading

Datastage SCD2 job design

The most important facts and stages of the CUST_SCD2 job processing:
• The dimension table with customers is refreshed daily and one of the data sources is a text file. For
the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in the database and it
is the only record with changed data. It has the following structure and data:
SCD 2 - Customers file extract:

• There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text
file.
• A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to
separate columns.
SCD 2 lookup transformer

• A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes
through only records that differ.
SCD 2 check discrepancies transformer

• A T003 transformer handles the UPDATE and INSERT actions of a record. The old record is updated
with the current indicator flag set to no, and the new record is inserted with the current indicator flag set to
yes, the record version increased by 1, and the current date.
SCD 2 insert-update record transformer

• ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and
the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed
where part of an SQL statement.
• ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and
the key column is CUST_ID.
D_CUSTOMER dimension table after Datawarehouse refresh

===============================================================

Pivot_Enterprise_Stage:
The Pivot Enterprise stage is a processing stage that pivots data horizontally or vertically, depending
on the requirement. There are two types:
1. Horizontal
2. Vertical

A horizontal pivot maps a set of input columns to multiple output rows, which is exactly the opposite of a
vertical pivot, which maps multiple input rows to a set of output columns.
Let's try to understand each one with the following example.

1. Horizontal Pivot Operation.

Consider following Table.

Product Type Color_1 Color_2 Color_3


Pen Yellow Blue Green
Dress Pink Yellow Purple
Step 1: Design Your Job Structure Like below.

Configure above table with input sequential stage ‘se_product_clr_det’.


Step 2: Let’s configure ‘Pivot enterprise stage’. Double click on it. Following window will pop up.

Select ‘Horizontal’ for Pivot Type from drop-down menu under ‘Properties’ tab for horizontal Pivot
operation.
Step 3: Click on the ‘Pivot Properties’ tab, where we need to check the box against ‘Pivot Index’. A
column named ‘Pivot_Index’ will then appear under the ‘Name’ column; also declare a new column
named ‘Color’ as shown below.

Step 4: Now we have to mention columns to be pivoted under ‘Derivation’ against column ‘Color’.
Double click on it. Following Window will pop up.

Select columns to be pivoted from ‘Available column’ pane as shown. Click ‘OK’.
Step 5: Under ‘Output’ tab, only map pivoted column as shown.
Configure output stage. Give the file path. See below image for reference.

Step 6: Compile and run the job and check the output.
This is how we can map multiple input columns to a single column (here, the colors).
Vertical Pivot Operation:
Here, we are going to use ‘Pivot Enterprise’ stage to vertically pivot data. We are going to set multiple
input rows to a single row. The main advantage of this stage is we can use aggregation functions like
avg, sum, min, max, first, last etc. for pivoted column. Let’s see how it works.
Consider an output data of Horizontal Operation as input data for the Pivot Enterprise stage. Here, we
will be adding one extra column for aggregation function as shown in below table.

Product Color Prize


Pen Yellow 38
Pen Blue 43
Pen Green 25
Dress Pink 1000
Dress Yellow 695
Dress purple 738
Let’s study for vertical pivot operation step by step.
Step 1: Design your job structure like below. Configure above table data with input sequential file
‘se_product_det’.

Step 2: Open Pivot Enterprise stage and select Pivot type as vertical under properties tab.
Step 3: Under the Pivot Properties tab we need at least one pivot column and one group-by column. Here, we
declared Product as the group-by column and Color and Prize as pivot columns. Let's see how to use
the aggregation functions in the next step.

Step 4: On clicking ‘Aggregation functions required for this column’ for a particular column, the following
window will pop up, in which we can select whichever functions are required for that column.
Here we are using the ‘min’, ‘max’ and ‘average’ functions with proper precision and scale for the Prize column
as shown.
Step 5: Now we just have to do mapping under output tab as shown below.

Step 6: Compile and run the job and check the output.
Output :

One more approach:

Many people have the following misconceptions about Pivot stage.


1) It converts rows into columns
2) By using a pivot stage, we can convert 10 rows into 100 columns and 100 columns
into 10 rows
3) You can add more points here!!

Let me first tell you that this Pivot stage only CONVERTS COLUMNS INTO ROWS and
nothing else. Some DataStage professionals refer to this as NORMALIZATION. Another fact
about the Pivot stage is that no other single stage has this exact functionality
of converting columns into rows, so that makes it unique, doesn't it?
Let's cover how exactly it does it....

For example, lets take a file with the following fields: Item, Quantity1, Quantity2,
Quantity3....
Item~Quantity1~Quantity2~Quantity3
ABC~100~1000~10000
DEF~200~2000~20000
GHI~300~3000~30000

Basically you would use a Pivot stage when you need to convert those 3 Quantity fields
into a single field which contains a unique Quantity value per row, i.e. you would need
the following output

Item~Quantity
ABC~100
ABC~1000
ABC~10000
DEF~200
DEF~2000
DEF~20000
GHI~300
GHI~3000
GHI~30000

How to achieve the above in Datastage???


In this case our source would be a flat file. Read it using any file stage of your
choice: Sequential file stage, File set stage or Dataset stage. Specify 4 columns in
the Output column derivation tab.
Now connect a Pivot stage from the tool palette to the above output link and
create an output link for the Pivot stage itself (for enabling the Output tab for the
Pivot stage).

Unlike other stages, a pivot stage doesn't use the generic GUI stage page. It has a
stage page of its own. And by default the Output columns page would not have
any fields. Hence, you need to manually type in the fields. In this case just type in
the 2 field names : Item and Quantity. However manual typing of the columns
becomes a tedious process when the number of fields is more. In this case you can
use the Metadata Save - Load feature. Go the input columns tab of the pivot stage,
save the table definitions and load them in the output columns tab. This is the way
I use it!!!

Now, you have the following fields in the Output Column's tab...Item and
Quantity....Here comes the tricky part i.e you need to specify the DERIVATION
....In case the field names of Output columns tab are same as the Input tab, you
need not specify any derivation i.e in this case for the Item field, you need not
specify any derivation. But if the Output columns tab has new field names, you
need to specify Derivation or you would get a RUN-TIME error for free....

For our example, you need to type the Derivation for the Quantity field as

Column name Derivation


Item Item (or you can leave this blank)
Quantity Quantity1, Quantity2, Quantity3.

Just attach another file stage and view your output!!! So, objective met!!!
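
For comparison, the same columns-to-rows pivot can be sketched on the command line (illustration only), assuming the '~'-delimited sample above is stored in items.txt with its header on line 1:

awk -F'~' -v OFS='~' '
  NR == 1 { print "Item", "Quantity"; next }     # write the new two-column header
  { for (i = 2; i <= NF; i++) print $1, $i }     # one output row per Quantity column
' items.txt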

Sequence_Activities :

In this article I will explain how to use DataStage looping activities in a sequence job.

I have a requirement where I need to pass a file id as a parameter, reading it from a file. In the future the
number of file ids will increase, so I will not have to add jobs or change the sequence if I take advantage of
DataStage looping.

Contents in the File:

1|200

2|300

3|400

I need to read the above file and pass the second field as a parameter to the job. I have created one parallel
job with pFileID as a parameter.

Step 1: Count the number of lines in the file so that we can set the upper limit in the DataStage Start
Loop activity.

sample routine to count lines in a file:

Argument : FileName(Including path)

Deffun DSRMessage(A1, A2, A3) Calling "*DataStage*DSR_MESSAGE"


Equate RoutineName To "CountLines"

Command = "wc -l":" ":FileName:" | awk '{print $1}'"

Call DSLogInfo("Executing Command To Get the Record Count ",Command)


* call support routine that executes a Shell command.
Call DSExecute("UNIX", Command, Output, SystemReturnCode)

* Log any and all output as an Information type log message,


* unless system's return code indicated that an error occurred,
* when we log a slightly different Warning type message.
vOutput = Convert(Char(254),"",Output)
If (SystemReturnCode = 0) And (Num(vOutput) = 1) Then
   Call DSLogInfo("Command Executed Successfully ",Command)
   Output = Convert(Char(254),"",Output)
   Call DSLogInfo("Here is the Record Count In ":FileName:" = ":Output,Output)
   Ans = Output
   *GoTo NormalExit
End Else
   Call DSLogInfo("Error when executing command ",Command)
   Call DSLogFatal(Output, RoutineName)
   Ans = 1

End
Now we use the StartLoop.$Counter variable to get the file id, using a combination of the grep and awk
commands, as sketched below.

For each iteration it will get one file id.
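
A small shell sketch of what that lookup does on each pass of the loop (the file name ids.txt stands in for the parameter file shown above):

total=$(awk 'END { print NR }' ids.txt)
counter=1
while [ "$counter" -le "$total" ]; do
    file_id=$(awk -F'|' -v n="$counter" 'NR == n { print $2 }' ids.txt)   # second field of line $counter
    echo "iteration $counter -> pFileID=$file_id"                         # in the sequence this value is passed to the parallel job
    counter=$(( counter + 1 ))
done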

Finally the seq job looks like below.

I hope every one likes this post.

===============================================================
TRANSFORMER STAGE TO FILTER THE DATA :

TRANSFORMER STAGE TO FILTER THE DATA

Take Job Design as below

If our requirement is to filter the data department wise from the file below

samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,production,40

And our requirement is to get the target data as below

In Target1 we need employees of departments 10 and 40.

In Target2 we need employees of department 30.

In Target3 we need employees of departments 20 and 40.

Read and Load the data in Source file

In Transformer Stage just Drag and Drop the data to the target tables.

Write the expressions in the constraints as below:

dept_no=10 or dept_no=40 for Target1

dept_no=30 for Target2

dept_no=20 or dept_no=40 for Target3


Click ok

Give file name at the target file and

Compile and Run the Job to get the Output
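
A rough awk equivalent of the three constraints, just to show the routing logic (samp_tabl is assumed to be comma-delimited with dept_no as the 4th field):

awk -F',' '
  $4 == 10 || $4 == 40 { print > "target1.txt" }   # Target1: departments 10 and 40
  $4 == 30             { print > "target2.txt" }   # Target2: department 30
  $4 == 20 || $4 == 40 { print > "target3.txt" }   # Target3: departments 20 and 40 (dept 40 lands in two targets)
' samp_tabl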

Shared Container :

Suppose a sequence controls 4 jobs (job 1, job 2, job 3, job 4). Job 1 has 10,000
rows, but after running the job only 5,000 rows have been loaded into the target table, the rest are not loaded,
and the job aborts. How can we sort out the problem? If the job sequence synchronizes or controls the 4 jobs
but job 1 has a problem, you should go to the Director and
check what type of problem is showing: a data type problem, a warning message, a job failure or a job
abort. If the job fails it usually means a data type problem or a missing column action. So you should go to the
Run window -> Click -> Tracing -> Performance, or in your target stage -> General -> Action -> select one of
these two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue.
First check how many rows were already loaded, then select the On Skip option and Continue; for the
remaining data that was not loaded, select On Fail, Continue. Run the job again and
you should definitely get a success message.
----------------------------------------------------------------------------------------------------------

Question: I want to process 3 files sequentially, one by one. How can I do that, so that while processing
it fetches the files automatically?
Ans: If the metadata for all the files is the same, then create a job having the file name as a parameter, then
use the same job in a routine and call the job with a different file name each time, or you can create a sequence
job to run it.
---------------------------------------------------------------------------------------------------------------------
-----------------
Parameterize the file name.
Build the job using that parameter
Build job sequencer which will call this job and will accept the parameter for file name.
Write a UNIX shell script which will call the job sequencer three times by passing different file
each time.
RE: What happens if RCP is disabled?
In that case OSH has to perform an import and export every time the job runs, and the job's
processing time also increases.
--------------------------------------------------------------------------------------------------------------------
Runtime column propagation (RCP): If RCP is enabled for a job, and specifically for those
stages whose output connects to the shared container input, then metadata will be propagated at
run time, so there is no need to map it at design time.
If RCP is disabled for the job, OSH has to perform an import and export every time
the job runs, and the job's processing time also increases.
Then you have to manually enter all the column descriptions in each stage. RCP = Runtime column
propagation.

Question:
Source: Target

Eno Ename Eno Ename


1 a,b 1 a
2 c,d 2 b
3 e,f 3 c

Difference Between Join,Lookup and Merge :


Datastage Scenarios and solutions :

Field mapping using Transformer stage:

Requirement:
The field should be right-justified and zero-filled; take the last 18 characters.

Solution:
Right("0000000000":Trim(Lnk_Xfm_Trans.link),18)

Scenario 1:

We have two datasets with 4 columns each, with different names. We have to create a dataset with 4
columns: 3 from one dataset and one column with the record count of the other dataset.

We can use an Aggregator with a dummy column to get the count from one dataset, then do a lookup
from the other dataset and map it to the third dataset.
Something similar to the below design:

Scenario 2:
Following is the existing job design. The requirement changed: the header and trailer datasets
should be populated even if no detail records are present in the source file. The job below does not do
that, so the job was changed to meet the following requirement:

A Row Generator is used with a Copy stage, with a default value (zero) given for the count column coming in
from the Row Generator. If there are no detail records, the job picks up the record count from the Row
Generator.

We have a source which is a sequential file with header and footer. How to remove the header
and footer while reading this file using sequential file stage of Datastage?
Sol: Run the command sed '1d;$d' file_name > new_file_name (for example via the ExecSH before-job
subroutine), then use the new file in the Sequential File stage.

If I have a source like COL1 = A, A, B and a target like (COL1, COL2) = (A,1), (A,2), (B,1),
how do I achieve this output using a stage variable in the Transformer stage?

If keyChange = 1 Then 1 Else StageVariable + 1


source has 2 fields like

COMPANY LOCATION
IBM HYD
TCS BAN
IBM CHE
HCL HYD
TCS CHE
IBM BAN
HCL BAN
HCL CHE

LIKE THIS.......

THEN THE OUTPUT LOOKS LIKE THIS....

Company loc count

TCS HYD 3
BAN
CHE
IBM HYD 3
BAN
CHE
HCL HYD 3
BAN
CHE
2)input is like this:
no,char
1,a
2,b
3,a
4,b
5,a
6,a
7,b
8,a

But the output is in this form, with row numbering of duplicate occurrences:
output:

no,char,Count
"1","a","1"
"6","a","2"
"5","a","3"
"8","a","4"
"3","a","5"
"2","b","1"
"7","b","2"
"4","b","3"
3)Input is like this:
file1
10
20
10
10
20
30

Output is like:
file2 file3(duplicates)
10 10
20 10
30 20

4)Input is like:
file1
10
20
10
10
20
30

Output is like Multiple occurrences in one file and single occurrences in one file:
file2 file3
10 30
10
10
20
20

5)Input is like this:


file1
10
20
10
10
20
30

Output is like:
file2 file3
10 30
20

6)Input is like this:


file1
1
2
3
4
5
6
7
8
9
10

Output is like:
file2(odd) file3(even)
1 2
3 4
5 6
7 8
9 10

7) How to calculate Sum(sal), Avg(sal), Min(sal), Max(sal) without


using the Aggregator stage?

8) How to find out the first sal and last sal in each dept without using the Aggregator stage?

9) How many ways are there to perform the remove-duplicates function without using the
Remove Duplicates stage?

Scenario:

source has 2 fields like


COMPANY LOCATION
IBM HYD
TCS BAN
IBM CHE
HCL HYD
TCS CHE
IBM BAN
HCL BAN
HCL CHE

LIKE THIS.......

THEN THE OUTPUT LOOKS LIKE THIS....

Company loc count

TCS HYD 3
BAN
CHE
IBM HYD 3
BAN
CHE
HCL HYD 3
BAN
CHE

Solution:

Seqfile ......> Sort ......> Transformer ......> RemoveDuplicates ......> Dataset

Sort:
Key = Company
Sort order = Asc
Create Key Change Column = True

Transformer:
Create a stage variable Company1 with the derivation:
Company1 = If (in.keychange = 1) Then in.Location Else Company1 : ',' : in.Location
Drag and drop in the derivations:
Company .................... Company
Company1 ................... Location

RemoveDup:
Key = Company
Duplicates To Retain = Last

11)The input is
Shirt|red|blue|green
Pant|pink|red|blue
Output should be,

Shirt:red
Shirt:blue
Shirt:green
pant:pink
pant:red
pant:blue

Solution:
This is the reverse of the Pivot stage approach. Use:
seq ------ sort ------ transformer ---- removeduplicates ----- transformer ---- target
In the Sort stage set Create Key Change Column = True.
In the first Transformer create a stage variable: if the key change column = 1 then the current value, else
stagevariable : ':' : current value.
In the RemoveDuplicates stage set Duplicates To Retain = Last.
In the final Transformer use the Field function to separate the columns.

Similar scenario:
source
col1 col3
1 samsung
1 nokia
1 ercisson
2 iphone
2 motrolla
3 lava
3 blackberry
3 reliance

Expected Output
col 1 col2 col3 col4
1 samsung nokia ercission
2 iphone motrolla
3 lava blackberry reliance

You can get it by using Sort stage --- Transformer stage --- RemoveDuplicates ---
Transformer --tgt

Ok

First Read and Load the data into your source file( For Example Sequential File )

And in Sort stage select key change column = True ( To Generate Group ids)

Go to Transformer stage

Create one stage variable.

You can do this by right click in stage variable go to properties and name it as your wish
( For example temp)

and in expression write as below

if keychange column =1 then column name else temp:',':column name

This column name is the one you want in the required column with delimited commas.

On remove duplicates stage key is col1 and set option duplicates retain to--> Last.
In the final Transformer drop col3 and define 3 new columns: col2, col3 and col4.
In the col2 derivation give Field(InputColumn,",",1),
in the col3 derivation give Field(InputColumn,",",2), and
in the col4 derivation give Field(InputColumn,",",3).
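
The same rows-to-columns idea can be checked quickly with awk (illustrative sketch; src.txt is an assumed file holding the comma-delimited col1,col3 pairs sorted on col1):

awk -F',' '
  { line[$1] = ($1 in line) ? line[$1] "," $2 : $2 }    # append each value to its group, comma separated
  END { for (k in line) print k "," line[k] }            # note: awk does not guarantee the output order here
' src.txt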

Scenario:
12)Consider the following employees data as source?
employee_id, salary
-------------------
10, 1000
20, 2000
30, 3000
40, 5000

Create a job to find the sum of salaries of all employees and this sum should repeat for all
the rows.

The output should look like as

employee_id, salary, salary_sum


-------------------------------
10, 1000, 11000
20, 2000, 11000
30, 3000, 11000
40, 5000, 11000

Scenario:

I have two source tables/files numbered 1 and 2.


In the the target, there are three output tables/files, numbered 3,4 and 5.

The scenario is that:

the records which are common to both 1 and 2 should go to output 4;

the records which are only in 1 but not in 2 should go to output 3;

the records which are only in 2 but not in 1 should go to output 5.

sltn:
src1 -----> copy1 ------> output 3 (records only in source 1)
            copy1 + copy2 ---> Join (inner type) ----> output 4 (common records)
src2 -----> copy2 ------> output 5 (records only in source 2)
Consider the following employees data as source?
employee_id, salary
-------------------
10, 1000
20, 2000
30, 3000
40, 5000

Scenario:

Create a job to find the sum of salaries of all employees and this sum should repeat for all
the rows.

The output should look like as

employee_id, salary, salary_sum


-------------------------------
10, 1000, 11000
20, 2000, 11000
30, 3000, 11000
40, 5000, 11000
sltn:

Take Source ---> Transformer (add a new column on both output links and assign it the value 1) --->
1) Aggregator (group by the new column)
2) Lookup/Join (join on the new column) --------> tgt
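
A two-pass awk sketch of the same idea (the grand total appended to every row); emp.txt is an assumed comma-delimited file of employee_id,salary rows with no header:

awk -F',' -v OFS=',' '
  NR == FNR { total += $2; next }   # first pass: accumulate the sum of salaries
  { print $1, $2, total }           # second pass: repeat the total on every row
' emp.txt emp.txt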

Scenario:

sno,sname,mark1,mark2,mark3
1,rajesh,70,68,79
2,mamatha,39,45,78
3,anjali,67,39,78
4,pavani,89,56,45
5,indu,56,67,78

out put is
sno,snmae,mark1,mark2,mark3,delimetercount
1,rajesh,70,68,79,4
2,mamatha,39,45,78,4
3,anjali,67,39,78,4
4,pavani,89,56,45,4
5,indu,56,67,78,4

seq--->trans--->seq

Create one stage variable, for example delimiter,


and put the derivation on the stage variable as DSLink4.sno : "," : DSLink4.sname : "," : DSLink4.mark1
: "," : DSLink4.mark2 : "," : DSLink4.mark3.
Then do the mapping and create one more output column, count, of integer type,

and put the derivation on the count column as Count(delimiter, ",").

scenario:
sname total_vowels_count
Allen 2
Scott 1
Ward 1
Under Transformer Stage Description:

total_Vowels_Count=Count(DSLink3.last_name,"a")+Count(DSLink3.last_name,"e")+Count
(DSLink3.last_name,"i")+Count(DSLink3.last_name,"o")+Count(DSLink3.last_name,"u").
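
A quick, case-insensitive awk version of the same vowel count (names.txt is an assumed file with one name per line):

awk '{ n = tolower($1); print $1, gsub(/[aeiou]/, "", n) }' names.txt   # gsub returns the number of vowels it removed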

Scenario:

1) Every day we are getting some huge files, and the metadata of all the files is the same. How can we load
them all into the target table?
Use File Pattern in the Sequential File stage.

2) A column has 10 records, and at run time we have to send only the 5th and 6th records to the target.
How can we send them?
This can be done by using a UNIX command in the Sequential File stage Filter option.

How can we get the last 18 months of data in the Transformer stage?


Use transformer stage after input seq file and try this one as constraint in transformer
stage :

DaysSinceFromDate(CurrentDate(), DSLink3.date_18)<=548 OR
DaysSinceFromDate(CurrentDate(), DSLink3.date_18)<=546

where date_18 is the column holding the date that needs to be less than or equal to
18 months old, and 548 is the number of days in 18 months (546 for a leap year; you need to
verify these numbers).

What is the difference between Force Compile and Compile?

What is the difference between Compile and Validate?

The Compile option only checks all mandatory requirements such as link requirements and stage
options; it will not check whether the database connections are valid.
Validate is equivalent to running a job except for the extraction/loading of data; that is, the
Validate option tests database connectivity by making connections to the databases.

How to Find Out Duplicate Values Using a Transformer?


You can capture the duplicate records based on keys using Transformer stage variables.

1. Sort and partition the input data of the transformer on the key(s) which defines the duplicate.
2. Define two stage variables, let's say StgVarPrevKeyCol(data type same as KeyCol) and StgVarCntr as Integer with
default value 0
where KeyCol is your input column which defines the duplicate.

Expression for StgVarCntr(1st stg var-- maintain order):

If DSLinknn.KeyCol = StgVarPrevKeyCol Then StgVarCntr + 1 Else 1

Expression for StgVarPrevKeyCol(2nd stg var):

DSLinknn.KeyCol

3. Now in constrain, if you filter rows where StgVarCntr = 1 will give you the unique records and if you filter
StgVarCntr > 1 will give you duplicate records.
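
The same stage-variable logic, sketched in awk for comparison (input assumed comma-delimited and already sorted on the key in field 1):

awk -F',' '
  { cnt = ($1 == prev) ? cnt + 1 : 1; prev = $1 }   # counter restarts on every new key value
  cnt == 1 { print > "first_of_each_key.txt" }      # StgVarCntr = 1: unique / first occurrence
  cnt  > 1 { print > "duplicates.txt" }             # StgVarCntr > 1: duplicate occurrences
' sorted_input.txt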

My source is Like
Sr_no, Name
10,a
10,b
20,c
30,d
30,e
40,f

My target Should Like:

Target 1: (only unique records, i.e. those which occur only once)


20,c
40,f

Target 2: (records which occur more than once)


10,a
10,b
30,d
30,e

How to do this in DataStage....


**************

Use the Aggregator and Transformer stages:

source --> aggregator --> transformer --> target
Perform a count in the Aggregator, then take two output links in the Transformer: filter the data with
count > 1 to one link and count = 1 to the second link.
Scenario:
In my input source I have N records.

In the output I have 3 targets.

I want the output such that the 1st record goes to the 1st target,

the 2nd record goes to the 2nd target,

the 3rd record goes to the 3rd target, and then

the 4th record goes to the 1st target again, and so on,

without using partitioning techniques.

*****************
source ---> transformer ----> targets
In the Transformer use the following conditions in the constraints:
mod(empno,3)=1
mod(empno,3)=2
mod(empno,3)=0
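
The same round-robin routing sketched in awk, using the row number in place of empno (input.txt is an assumed input file):

awk '
  NR % 3 == 1 { print > "target1.txt" }
  NR % 3 == 2 { print > "target2.txt" }
  NR % 3 == 0 { print > "target3.txt" }
' input.txt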
Scenario:
I have input like:
colA
a_b_c
x_F_I
DE_GH_IF

We have to make it:

col1 col2 col3


a    b    c
x    F    I
DE   GH   IF

*********************

Transformer
create 3 columns with derivation
col1 Field(colA,'_',1)
col2 Field(colA,'_',2)
col3 Field(colA,'_',3)

**************
Field function divides the column based on the delimeter,
if the data in the col is like A,B,C
then
Field(col,',',1) gives A
Field(col,',',2) gives B
Field(col,',',3) gives C
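
The same Field() splits expressed with awk, for a quick check on the command line (colA.txt is an assumed file with one value like a_b_c per line):

awk -F'_' '{ print $1, $2, $3 }' colA.txt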



How to find out duplicate values using a Transformer?

Another way to find the duplicate values is to use a Sort stage before the Transformer.

In the Sorter, set Create Cluster Key Change Column = TRUE


on the key,
then in the Transformer filter the output on the basis of the value of the cluster key change column, which can
be held in a stage variable.

====================================================================

Scenarios_Unix :

1) Convert single column to single row:


Input: filename : try
REF_PERIOD
PERIOD_NAME
ACCOUNT_VALUE
CDR_CODE
PRODUCT
PROJECT
SEGMENT_CODE
PARTNER
ORIGIN
BILLING_ACCRUAL
Output:
REF_PERIOD PERIOD_NAME ACCOUNT_VALUE CDR_CODE PRODUCT PROJECT
SEGMENT_CODE PARTNER ORIGIN BILLING_ACCRUAL

Command: cat try | awk '{printf "%s ",$1}'

2) Print the list of employees in Technology department :


Now the department name is available as the fourth field, so we need to check whether $4 matches the
string "Technology"; if yes, print the line.
Command: $ awk '$4 ~/Technology/' employee.txt
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
500 Randy DBA Technology $6,000
The ~ operator compares against regular expressions. If it matches, the default action, i.e. printing the
whole line, is performed.
3) Convert single column to multiple columns:
For example, if the input file contains a single column with 84 rows, the output should be that single column of
data converted to multiples of 12 columns, i.e. 12 columns * 7 rows, with a field separator (fs=';').
Script:
#!/bin/sh

rows=`cat input_file | wc -l`

cols=12

fs=';'

awk -v r=$rows -v c=$cols -v t=$fs '

NR<r*c{printf("%s",NR%c?$0 t:$0"\n");next}{print}

END{if(NR%c&&NR<r*c){print ""}}' input_file > output_file

4) Last field print:


input:
a=/Data/Files/201-2011.csv
output:
201-2011.csv
Command: echo $a | awk -F/ '{print $NF}'

5) Count no. of fields in file:


file1: a, b, c, d, 1, 2, man, fruit
Command: cat file1 | awk 'BEGIN{FS=","};{print NF}'
and you will get the output as: 8

6) Find ip address in unix server:


Command: grep -i your_hostname /etc/hosts

7) Replace the word corresponding to search pattern:


>cat file

the black cat was chased by the brown dog.

the black cat was not chased by the brown dog.

>sed -e '/not/s/black/white/g' file

the black cat was chased by the brown dog.

the white cat was not chased by the brown dog.

8) Below I have shown the demo for "A" and "65".
Ascii value of a character: it can be done in 2 ways:
1. printf "%d" "'A"
2. echo "A" | tr -d "\n" | od -An -t dC
Character value from an Ascii code: awk -v char=65 'BEGIN { printf "%c\n", char; exit }'
---------------------------------------------------------------------------------------
9) Input file:
crmplp1 cmis461 No Online
cmis462 No Offline
crmplp2 cmis462 No Online
cmis463 No Offline
crmplp3 cmis463 No Online
cmis461 No Offline
Output --> crmplp1 cmis461 No Online cmis462 No Offline
crmplp2 cmis462 No Online cmis463 No Offline
Command:
awk 'NR%2?ORS=FS:ORS=RS' file
---------------------------------------------------------------------------------------
10) Variables can be used in awk:
awk -F"$c" -v var="$c" '{print $1var$2}' filename
---------------------------------------------------------------------------------------
11) Search for a pattern and use special characters in a sed command:
sed -e '/COMAttachJob/s#")#.":JobID)#g' input_file
---------------------------------------------------------------------------------------
12) Get the content between two patterns:
sed -n '/CREATE TABLE table/,/MONITORING/p' table_Script.sql
---------------------------------------------------------------------------------------
13) Print debugging script output in a log file. Add the following commands in the script:
exec 1>> logfilename
exec 2>> logfilename
---------------------------------------------------------------------------------------
14) Check a SQL connection:
#!/bin/sh
ID=abc
PASSWD=avd
DB=sdf
exit | sqlplus -s -l $ID/$PASSWD@$DB
echo variable:$?
exit | sqlplus -s -L avd/df@dfg > /dev/null
echo variable_crr: $?
---------------------------------------------------------------------------------------
15) Trim the spaces using the sed command:

echo "$var" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'


Another option is:
Code:
var=$(echo "$var" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
echo "Start $var End"
---------------------------------------------------------------------------------------
16) How to add single quotes in a statement using awk:
Input:
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00
Command:
cat command.txt | sed -e 's/[[:space:]]/ /g' | awk -F' ' '{print "\x27"$1,$2,$3"\x27","\x27"$4,$5"\x27"}'
Output:
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'

17)
How to get files from different servers onto one server in DataStage by using a UNIX command?
scp test.ksh dsadm@10.87.130.111:/home/dsadm/sys/

============================================================================

Unix Interview Questions :


1. How to display the 10th line of a file?
head -10 filename | tail -1
2. How to remove the header from a file?
sed -i '1 d' filename

3. How to remove the footer from a file?

sed -i '$ d' filename

4. Write a command to find the length of a line in a file?


The below command can be used to get a line from a file.
sed -n '<n> p' filename
We will see how to find the length of 10th line in a file
sed -n '10 p' filename|wc -c

5. How to get the nth word of a line in Unix?


cut -f<n> -d' '

6. How to reverse a string in unix?


echo "java" | rev

7. How to get the last word from a line in Unix file?


echo "unix is good" | rev | cut -f1 -d' ' | rev

8. How to replace the n-th line in a file with a new line in Unix?
sed -i'' '10 d' filename # d stands for delete
sed -i'' '10 i new inserted line' filename # i stands for insert
9. How to check if the last command was successful in Unix?
echo $?

10. Write command to list all the links from a directory?


ls -lrt | grep "^l"

11. How will you find which operating system your system is running on in UNIX?
uname -a

12. Create a read-only file in your home directory?


touch file; chmod 400 file

13. How do you see command line history in UNIX?


The 'history' command can be used to get the list of commands that we are executed.

14. How to display the first 20 lines of a file?


By default, the head command displays the first 10 lines from a file. If we change the option of
head, then we can display as many lines as we want.
head -20 filename
An alternative solution is using the sed command
sed '21,$ d' filename
The d option here deletes the lines from 21 to the end of the file

15. Write a command to print the last line of a file?


The tail command can be used to display the last lines from a file.
tail -1 filename
Alternative solutions are:
sed -n '$ p' filename
awk 'END{print $0}' filename

16. How do you rename the files in a directory with _new as suffix?
ls -lrt|grep '^-'| awk '{print "mv "$9" "$9".new"}' | sh
17. Write a command to convert a string from lower case to upper case?
echo "apple" | tr [a-z] [A-Z]
18. Write a command to convert a string to Initcap.
echo apple | awk '{print toupper(substr($1,1,1)) tolower(substr($1,2))}'
19. Write a command to redirect the output of date command to multiple files?
The tee command writes the output to multiple files and also displays the output on the terminal.
date | tee -a file1 file2 file3
20. How do you list the hidden files in current directory?
ls -a | grep '^\.'
21. List out some of the Hot Keys available in bash shell?
 Ctrl+l - Clears the Screen.
 Ctrl+r - Does a search in previously given commands in shell.
 Ctrl+u - Clears the typing before the hotkey.
 Ctrl+a - Places cursor at the beginning of the command at shell.
 Ctrl+e - Places cursor at the end of the command at shell.
 Ctrl+d - Kills the shell.
 Ctrl+z - Places the currently running process into background.
22. How do you make an existing file empty?
cat /dev/null > filename
23. How do you remove the first number on 10th line in file?
sed '10 s/[0-9][0-9]*//' < filename
24. What is the difference between join -v and join -a?
join -v : outputs only matched lines between two files.
join -a : In addition to the matched lines, this will output unmatched lines also.
25. How do you display from the 5th character to the end of the line from a file?
cut -c 5- filename
26. Display all the files in current directory sorted by size?
ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'
Write a command to search for the file 'map' in the current directory?
find -name map -type f
How to display the first 10 characters from each line of a file?
cut -c -10 filename
Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename
How to print the file names in a directory that has the word "term"?
grep -l term *
The '-l' option make the grep command to print only the filename without printing the content of
the file. As soon as the grep command finds the pattern in a file, it prints the pattern and stops
searching other lines in the file.
How to run awk command specified in a file?
awk -f filename
How do you display the calendar for the month march in the year 1985?
The cal command can be used to display the current month calendar. You can pass the month
and year as arguments to display the required year, month combination calendar.
cal 03 1985
This will display the calendar for the March month and year 1985.
Write a command to find the total number of lines in a file?
wc -l filename
Other ways to print the total number of lines are
awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename
awk 'END{print NR}' filename
How to duplicate empty lines in a file?
sed '/^$/ p' < filename
Explain iostat, vmstat and netstat?
 Iostat: reports on terminal, disk and tape I/O activity.
 Vmstat: reports on virtual memory statistics for processes, disk, tape and CPU activity.
 Netstat: reports on the contents of network data structures.
27. How do you write the contents of 3 files into a single file?
cat file1 file2 file3 > file
28. How to display the fields in a text file in reverse order?
awk 'BEGIN {ORS=""} { for(i=NF;i>0;i--) print $i," "; print "\n"}' filename
29. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | grep '^-'| awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'
30. Write a command to print the lines which end with the word "end"?
grep 'end$' filename
The '$' symbol specifies the grep command to search for the pattern at the end of the line.
31. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename
The '-w' option makes the grep command to search for exact whole words. If the specified
pattern is found in a string, then it is not considered as a whole word. For example: In the string
"mikejulymak", the pattern "july" is found. However "july" is not a whole word in that string.
32. How to remove the first 10 lines from a file?
sed '1,10 d' < filename
33. Write a command to duplicate each line in a file?
sed 'p' < filename
34. How to extract the username from the 'who am i' command?
who am i | cut -f1 -d' '
35. Write a command to list the files in '/usr' directory that start with 'ch' and then display the
number of lines in each file?
wc -l /usr/ch*
Another way is
find /usr -name 'ch*' -type f -exec wc -l {} \;
36. How to remove blank lines in a file ?
grep -v '^$' filename > new_filename
37. How to display the processes that were run by your user name ?
ps -aef | grep <user_name>
38. Write a command to display all the files recursively with path under current directory?
find . -depth -print
39. Display zero byte size files in the current directory?
find -size 0 -type f
40. Write a command to display the third and fifth character from each line of a file?
cut -c 3,5 filename
41. Write a command to print the fields from 10th to the end of the line. The fields in the line are
delimited by a comma?
cut -d',' -f10- filename
42 How to replace the word "Gun" with "Pen" in the first 100 lines of a file?
sed '1,100 s/Gun/Pen/' < filename
43. Write a Unix command to display the lines in a file that do not contain the word "RAM"?
grep -v RAM filename
The '-v' option tells the grep to print the lines that do not contain the specified pattern.
44 How to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'
45. Write a command to display the files in the directory by file size?
ls -l | grep '^-' |sort -nr -k 5
46. How to find out the usage of the CPU by the processes?
The top utility can be used to display the CPU usage by the processes.
47. Write a command to remove the prefix of the string ending with '/'.
The basename utility deletes any prefix ending in /. The usage is mentioned below:
basename /usr/local/bin/file
This will display only file
48. How to display zero byte size files?
ls -l | grep '^-' | awk '/^-/ {if ($5 !=0 ) print $9 }'
49. How to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename
50. How to remove all the occurrences of the word "jhon" except the first one in a line, within
the entire file?
sed 's/jhon//2g' < filename
51. How to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename
52. How to list the files that are accessed 5 days ago in the current directory?
find -atime 5 -type f
53. How to list the files that were modified 5 days ago in the current directory?
find -mtime 5 -type f
54. How to list the files whose status is changed 5 days ago in the current directory?
find -ctime 5 -type f
55. How to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename
56. Write a command to find the number of files in a directory.
ls -l|grep '^-'|wc -l
57. Write a command to display your name 100 times.
The Yes utility can be used to repeatedly output a line with the specified string or 'y'.
yes <your_name> | head -100
58. Write a command to display the first 10 characters from each line of a file?
cut -c -10 filename
59. The fields in each line are delimited by comma. Write a command to display third field from
each line of a file?
cut -d',' -f3 filename
60. Write a command to print the fields from 10 to 20 from each line of a file?
cut -d',' -f10-20 filename
61. Write a command to print the first 5 fields from each line?
cut -d',' -f-5 filename
62. By default the cut command displays the entire line if there is no delimiter in it. Which cut
option is used to suppress these kinds of lines?
The -s option is used to suppress the lines that do not contain the delimiter.
63. Write a command to replace the word "bad" with "good" in file?
sed s/bad/good/ < filename
64. Write a command to replace the word "bad" with "good" globally in a file?
sed s/bad/good/g < filename
65. Write a command to replace the word "apple" with "(apple)" in a file?
sed s/apple/(&)/ < filename
66. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename
67. Write a command to display the characters from 10 to 20 from each line of a file?
cut -c 10-20 filename
68. Write a command to print the lines that has the the pattern "july" in all the files in a particular
directory?
grep july *
This will print all the lines in all files that contain the word "july" along with the file name. If
any of the files contain words like "JULY" or "July", the above command would not print those
lines.
69. Write a command to print the lines that has the word "july" in all the files in a directory and
also suppress the filename in the output.
grep -h july *
70. Write a command to print the lines that have the word "july" while ignoring the case.
grep -i july *
The -i option makes grep treat the pattern as case insensitive.
71. When you use a single file as input to the grep command to search for a pattern, it won't print
the filename in the output. Now write a grep command to print the filename in the output without
using the '-H' option.
grep pattern filename /dev/null
The /dev/null (null device) is a special file that discards the data written to it, so /dev/null is
always an empty file.
Another way to print the filename is using the '-H' option. The grep command for this is
grep -H pattern filename
72. Write a command to print the file names in a directory that does not contain the word "july"?
grep -L july *
The '-L' option makes grep print the filenames that do not contain the specified
pattern.
73. Write a command to print the line numbers along with the lines that have the word "july"?
grep -n july filename
The '-n' option is used to print the line numbers in a file. The line numbers start from 1.
74. Write a command to print the lines that start with the word "start"?
grep '^start' filename
The '^' symbol anchors the pattern to the start of the line.
75. In the text file, some lines are delimited by colon and some are delimited by space. Write a
command to print the third field of each line.
awk '{ if ($0 ~ /:/) { FS=":" } else { FS=" " } $0=$0; print $3 }' filename
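An alternative sketch that avoids relying on FS is to split each line explicitly into an array:
awk '{ if (index($0, ":")) split($0, f, ":"); else split($0, f, " "); print f[3] }' filename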
76. Write a command to print the line number before each line?
awk '{print NR, $0}' filename
77. Write a command to print the second and third line of a file without using NR.
awk 'BEGIN {RS="";FS="\n"} {print $2,$3}' filename
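If awk is not required, a simpler sketch that also avoids NR is:
sed -n '2,3p' filename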
78. How to create an alias for a complex command and remove the alias?
The alias utility is used to create an alias for a command. The command below creates an alias for
the ps -aef command.
alias pg='ps -aef'
If you use pg, it will work the same way as ps -aef.
To remove the alias simply use the unalias command as
unalias pg
79. Write a command to display today's date in the format of 'yyyy-mm-dd'?
The date command can be used to display today's date with time
date '+%Y-%m-%d'
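For example, to also include the time:
date '+%Y-%m-%d %H:%M:%S'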
------------------------------------------------------------------------------------------------------

1) Convert single column to single row:


Input: filename : try

REF_PERIOD
PERIOD_NAME
ACCOUNT_VALUE
CDR_CODE
PRODUCT
PROJECT
SEGMENT_CODE
PARTNER
ORIGIN
BILLING_ACCRUAL

Output:
REF_PERIOD PERIOD_NAME ACCOUNT_VALUE CDR_CODE PRODUCT PROJECT
SEGMENT_CODE PARTNER ORIGIN BILLING_ACCRUAL

Command: cat try | awk '{printf "%s ",$1}'
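A couple of alternative sketches that need no awk (the exact trailing space/newline may differ slightly):
paste -s -d' ' try
tr '\n' ' ' < try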

2) Print the list of employees in Technology department :


Now the department name is available as the fourth field, so we need to check whether $4 matches the
string "Technology"; if yes, print the line.

Command: $ awk '$4 ~ /Technology/' employee.txt


200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
500 Randy DBA Technology $6,000

The ~ operator compares a field against a regular expression. If it matches, the default action, i.e. printing the
whole line, is performed.

3) Convert single column to multiple column :

For example: if the input file contains a single column with 84 rows, then the output should be that single-column
data converted to 12 columns, i.e. 12 columns * 7 rows, with field separator ';'.

Script:
#!/bin/sh

rows=`cat input_file | wc -l`

cols=12

fs=';'

awk -v r=$rows -v c=$cols -v t="$fs" '

NR<r*c { printf("%s%s", $0, NR%c ? t : "\n"); next } { print }

END { if (NR%c && NR<r*c) { print "" } }' input_file > output_file

4) Last field print:


input:
a=/Data/Files/201-2011.csv

output:
201-2011.csv

Command: echo $a | awk -F/ '{print $NF}'
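Since the value is a path, basename gives the same result here:
basename "$a"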


5) Count no. of fields in file:
file1: a, b, c, d, 1, 2, man, fruit

Command: cat file1 | awk 'BEGIN{FS=","};{print NF}'

and you will get the output as: 8
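The same count can also be taken without cat, reading only the first line (a sketch):
awk -F',' '{print NF; exit}' file1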

6) Find the IP address of a Unix server:


Command: grep -i your_hostname /etc/hosts

7) Replace the word corresponding to search pattern:


>cat file

the black cat was chased by the brown dog.

the black cat was not chased by the brown dog.

>sed -e '/not/s/black/white/g' file

the black cat was chased by the brown dog.

the white cat was not chased by the brown dog.

8) ASCII value of a character, and character from an ASCII value. The demo below uses "A" and 65.

ASCII value of a character - it can be done in 2 ways:

1. printf "%d" "'A"

2. echo "A" | tr -d "\n" | od -An -t dC

Character value from ASCII: awk -v char=65 'BEGIN { printf "%c\n", char; exit }'

———————————————————————————————————
9) Input file:
crmplp1 cmis461 No Online
cmis462 No Offline
crmplp2 cmis462 No Online
cmis463 No Offline
crmplp3 cmis463 No Online
cmis461 No Offline

Output:
crmplp1 cmis461 No Online cmis462 No Offline

crmplp2 cmis462 No Online cmis463 No Offline

crmplp3 cmis463 No Online cmis461 No Offline

Command:
awk 'NR%2?ORS=FS:ORS=RS' file
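A rough equivalent sketch using paste, which joins every two input lines with a space:
paste -d' ' - - < file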

———————————————————————————————————

10) A shell variable can be used in awk:

awk -F"$c" -v var="$c" '{print $1var$2}' filename
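For example (a sketch, where the delimiter held in c is assumed to be a comma purely for illustration):
c=','
echo 'abc,def' | awk -F"$c" -v var="$c" '{print $1 var $2}'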

———————————————————————————————————

11) Search pattern and use special character in sed command:

sed -e '/COMAttachJob/s#")#.":JobID)#g' input_file

———————————————————————————————————

12) Get the content between two patterns:

sed -n '/CREATE TABLE table/,/MONITORING/p' table_Script.sql

———————————————————————————————————

13) Print debugging script output in a log file - add the following commands in the script:

exec 1>> logfilename
exec 2>> logfilename

———————————————————————————————————

14) Check SQL connection:

#!/bin/sh
ID=abc
PASSWD=avd
DB=sdf
exit | sqlplus -s -l $ID/$PASSWD@$DB
echo variable:$?
exit | sqlplus -s -L avd/df@dfg > /dev/null
echo variable_crr: $?

———————————————————————————————————
15) Trim the spaces using sed command

echo "$var" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'

Another option is:
Code:
var=$(echo "$var" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
echo "Start $var End"

———————————————————————————————————
16) How to add a single quote in a statement using awk:

Input:
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00

Command:
cat command.txt | sed -e 's/[[:space:]]/ /g' | awk -F' ' '{print "\x27"$1,$2,$3"\x27","\x27"$4,$5"\x27"}'

Output:
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'

=====================================================================

Monday, March 18, 2013

Sql queries :
1.Query to display the middle records, dropping the first 5 and last 5 records of the emp table
select * from emp where rownum<=(select count(*)-5 from emp) minus select * from emp where
rownum<=5;

2.Query to display first N records


select * from(select * from emp order by rowid) where rownum<=&n;

3.Query to display odd records only?


Q). select * from emp where (rowid,1) in (select rowid,mod (rownum,2) from emp);

4.Query to display even records only?


Q.) select * from emp where (rowid,0) in (select rowid,mod (rownum,2) from emp);

5.How to display duplicate rows in a table?


Q). select * from emp where deptno=any
(select deptno from emp having count(deptno)>1 group by deptno);

6.Query to display 3rd highest and 3rd lowest salary?


Q). select * from emp e1 where 3=(select count(distinct sal) from emp e2 where e1.sal<=e2.sal)
union
select * from emp e3 where 3=(select count(distinct sal) from emp e4 where e3.sal>=e4.sal);

7.Query to display Nth record from the table?


Q). select * from emp where rownum<=&n minus select * from emp where rownum<&n;

8.Query to display the records from M to N;


Q.) select ename from emp group by rownum,ename having rownum>1 and rownum<6;
select deptno,ename,sal from emp where rowid in(select rowid from emp
where rownum<=7 minus select rowid from emp where rownum<4);
select * from emp where rownum<=7 minus select * from emp where rownum<5;

9.Query to delete the duplicate records?


Q). delete from dup where rowid not in(select max(rowid)from dup group by eno);

10.Query to display the duplicate records?


Q). select * from dup where rowid not in(select max(rowid)from dup group by eno);

11.Query for joining two tables(OUTER JOIN)?


Q). select e.ename,d.deptno from emp e,dept d where e.deptno(+)=d.deptno order by e.deptno;
select empno,ename,sal,dept.* from emp full outer join dept on emp.deptno=dept.deptno;
Right Outer Join:
select empno,ename,sal,dept.* from emp right outer join dept on emp.deptno=dept.deptno;
Left Outer Join:
select empno,ename,sal,dept.* from emp left outer join dept on emp.deptno=dept.deptno

12.Query for joining table it self(SELF JOIN)?


Q). select e.ename "employee name",e1.ename "manager name" from emp e,emp e1 where
e.mgr=e1.empno;

13.Query for combining two tables(INNER JOIN)?


select emp.empno,emp.ename,dept.deptno from emp,dept where emp.deptno=dept.deptno;
By using aliases:
select e.empno,e.ename,d.deptno from emp e,dept d where e.deptno=d.deptno;
select empno,ename,sal,dept.* from emp join dept on emp.deptno=dept.deptno;

14.Find the Nth highest/lowest employee salary?


for maximum:
select * from emp where sal in(select min(sal)from
(select sal from emp group by sal order by sal desc)
where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b
where a.sal<=b.sal);
for minimum:
select * from emp where sal in(select max(sal) from(select sal from emp group by sal order by
sal asc) where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal>=b.sal)

15.Find the lowest 5 employee salaries?


Q). select * from (select * from emp order by sal asc) where rownum<6;
Find the top 5 employee salaries queries
select * from (select * from emp order by sal desc) where rownum<6;

16.Find lowest salary queries


select * from emp where sal=(select min(sal) from emp);

17.Find highest salary queries


select * from emp where sal=(select max(sal) from emp);

Sample Sql Queries :

Simple select command:

SELECT SUBPRODUCT_UID
,SUBPRODUCT_PROVIDER_UID
,SUBPRODUCT_TYPE_UID
,DESCRIPTION
,EXTERNAL_ID
,OPTION_ID
,NEGOTIABLE_OFFER_IND
,UPDATED_BY
,UPDATED_ON
,CREATED_ON
,CREATED_BY FROM schemaname.SUBPRODUCT

With Inner Join:

SELECT eft.AMOUNT AS AMOUNT,


ceft.MERCHANT_ID AS MERCHANT_ID,
ca.ACCOUNT_NUMBER AS ACCOUNT_NUMBER,
bf.MBNA_CREDIT_CARD_NUMBER AS MBNA_CREDIT_CARD_NUMBER,
ceft.CUSTOMER_FIRST_NAME AS CUSTOMER_FIRST_NAME,
ceft.CUSTOMER_LAST_NAME AS CUSTOMER_LAST_NAME,
btr.TRACE_ID AS TRACE_ID,
ROWNUM
FROM schemaname.bt_fulfillment bf

INNER JOIN schemaname.balance_transfer_request btr


ON btr.bt_fulfillment_uid = bf.bt_fulfillment_uid

INNER JOIN schemaname.electronic_funds_transfer eft


ON eft.bt_fulfillment_uid = bf.bt_fulfillment_uid

INNER JOIN schemaname.creditor_eft ceft


ON ceft.ELECTRONIC_FUNDS_TRANSFER_UID =
eft.ELECTRONIC_FUNDS_TRANSFER_UID

INNER JOIN schemaname.credit_account ca


ON ca.ELECTRONIC_FUNDS_TRANSFER_UID =
ceft.ELECTRONIC_FUNDS_TRANSFER_UID

WHERE ((btr.TYPE ='CREATE_CREDIT' AND btr.STATUS ='PENDING')


OR (btr.TYPE ='RETRY_CREDIT' AND btr.STATUS ='PENDING'))
AND btr.RELEASE_DATE < CURRENT_TIMESTAMP

=====================================================================

Star schema vs. snowflake schema: Which is better?

What are the key differences in snowflake and star schema? Where should they be
applied?
The Star schema vs Snowflake schema comparison brings four fundamental differences to
the fore:
1. Data optimization:
Snowflake model uses normalized data, i.e. the data is organized inside the database in order to
eliminate redundancy and thus helps to reduce the amount of data. The hierarchy of the business
and its dimensions are preserved in the data model through referential integrity.

Figure 1 – Snowflake model


Star model on the other hand uses de-normalized data. In the star model, dimensions directly
refer to fact table and business hierarchy is not implemented via referential integrity between
dimensions.

Figure 2 – Star model


2. Business model:
Primary key is a single unique key (data attribute) that is selected for a particular data. In the
previous ‘advertiser’ example, the Advertiser_ID will be the primary key (business key) of a
dimension table. The foreign key (referential attribute) is just a field in one table that matches a
primary key of another dimension table. In our example, the Advertiser_ID could be a foreign
key in Account_dimension.
In the snowflake model, the business hierarchy of data model is represented in a primary key –
Foreign key relationship between the various dimension tables.
In the star model, the fact table holds foreign keys to all the required dimension tables.
3. Performance:
The third differentiator in this Star schema vs Snowflake schema face off is the performance of
these models. The Snowflake model has a higher number of joins between the dimension tables and
then the fact table, and hence the performance is slower. For instance, if you want to know
the Advertiser details, this model will ask for a lot of information such as the Advertiser Name,
ID and address, for which the advertiser and account tables need to be joined with each other and
then joined with the fact table.
The Star model on the other hand has fewer joins between the dimension tables and the fact table.
In this model, if you need information on the advertiser you will just have to join the Advertiser
dimension table with the fact table.

4. ETL
The Snowflake model loads the data marts, and hence the ETL job is more complex in design and
cannot easily be parallelized, as the dependency model restricts it.
The Star model loads the dimension tables without dependency between dimensions, and hence the
ETL job is simpler and can achieve higher parallelism.
This brings us to the end of the Star schema vs Snowflake schema debate. But where exactly do
these approaches make sense?

Where do the two methods fit in?


With the snowflake model, dimension analysis is easier. For example, ‘how many accounts or
campaigns are online for a given Advertiser?’
The star schema model is useful for Metrics analysis, such as – ‘What is the revenue for a given
customer?’

Datastage Errors and Resolution :


You may get many errors in datastage while compiling the jobs or running the jobs.

Some of the errors are as follows

a) Source file not found.

This occurs if you are trying to read a file that does not exist with that name.

b) Sometimes you may get fatal errors.

c) Data type mismatches.


This occurs when data type mismatches exist in the jobs.

d) Field Size errors.

e) Metadata mismatch

f) Data type size between source and target different

g) Column Mismatch

h) Process time out.

This error can occur when the server is busy.

Some of the errors in detail:


ds_Trailer_Rec: When checking operator: When binding output schema variable
"outRec": When binding output interface field "TrailerDetailRecCount" to field
"TrailerDetailRecCount": Implicit conversion from source type "ustring" to result type
"string[max=255]": Possible truncation of variable length ustring when converting to
string using codepage ISO-8859-1.

Solution: I resolved it by changing the Extended attribute of the column, in the transformer metadata, to
Unicode.

When checking operator: A sequential operator cannot preserve the partitioning


of the parallel data set on input port 0.

Solution: I resolved it by setting preserve partitioning to 'Clear' under the transformer
properties.

Syntax error: Error in "group" operator: Error in output redirection: Error in output
parameters: Error in modify adapter: Error in binding: Could not find type: "subrec", line
35

Solution: It is an issue with the level number of the columns that were being added in the
transformer. Their level number was blank while the columns that were being taken from the
CFF file had it as 02. Adding the level number made the job work.

Out_Trailer: When checking operator: When binding output schema variable "outRec":
When binding output interface field "STDCA_TRLR_REC_CNT" to field
"STDCA_TRLR_REC_CNT": Implicit conversion from source type "dfloat" to result
type "decimal[10,0]": Possible range/precision limitation.

CE_Trailer: When checking operator: When binding output interface field "Data" to field
"Data": Implicit conversion from source type "string" to result type "string[max=500]":
Possible truncation of variable length string.

Implicit conversion from source type "dfloat" to result type "decimal[10,0]": Possible
range/precision limitation.

Solution: Used the transformer function 'DFloatToDecimal', as the target field is Decimal. By
default the output from the Aggregator stage is dfloat (double); converting it with the
above function resolved the warning.
When binding output schema variable "outputData": When binding output interface field
"RecordCount" to field "RecordCount": Implicit conversion from source type
"string[max=255]" to result type "int16": Converting string to number.

Problem(Abstract)
Jobs that process a large amount of data in a column can abort with this error:
the record is too big to fit in a block; the length requested is: xxxx, the max block length
is: xxxx.
Resolving the problem
To fix this error you need to increase the block size to accommodate the record size:
1. Log into Designer and open the job.
2. Open the job properties--> parameters-->add environment variable and select:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
3. You can set this up to 256MB but you really shouldn't need to go over 1MB.
NOTE: value is in KB

For example to set the value to 1MB:


APT_DEFAULT_TRANSPORT_BLOCK_SIZE=1048576

The default for this value is 128kb.

When setting APT_DEFAULT_TRANSPORT_BLOCK_SIZE you want to use the


smallest possible value since this value will be used for all links in the job.

For example if your job fails with APT_DEFAULT_TRANSPORT_BLOCK_SIZE set to


1 MB and succeeds at 4 MB, you would want to do further testing to see what is the
smallest value between 1 MB and 4 MB that will allow the job to run, and use that value.
Using 4 MB could cause the job to use more memory than needed since all the links
would use a 4 MB transport block size.

NOTE: If this error appears for a dataset use


APT_PHYSICAL_DATASET_BLOCK_SIZE.

1. While connecting via "Remote Desktop": Terminal server has exceeded the maximum
number of allowed connections
SOL: In Command Prompt, type mstsc /v:<ip address of server> /admin

OR mstsc /v:<ip address> /console

2. SQL20521N. Error occurred processing a conditional compilation directive near


string. Reason code=rc.
Following link has issue description:

http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.m
essages.sql.doc%2Fdoc%2Fmsql20521n.html

3. SK_RETAILER_GROUP_BRDIGE,1: runLocally() did not reach EOF on its input


data set 0.

SOL: The warning will disappear after regenerating the SK file.

4. While connecting to Datastage client, there is no response, and while restarting


websphere services, following errors occurred

[root@poluloro01 bin]# ./stopServer.sh server1 -user wasadmin -password


Wasadmin0708

ADMU0116I: Tool information is being logged in file

/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log

ADMU0128I: Starting tool with the default profile

ADMU3100I: Reading configuration for server: server1

ADMU0111E: Program exiting with error: javax.management.JMRuntimeException:

ADMN0022E: Access is denied for the stop operation on Server MBean

because of insufficient or empty credentials.


ADMU4113E: Verify that username and password information is on the command line

(-username and -password) or in the <conntype>.client.props file.

ADMU1211I: To obtain a full trace of the failure, use the -trace option.

ADMU0211I: Error details may be seen in the file:

/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log

SOL: Wasadmin and XMeta passwords needs to be reset and commands are below..

[root@poluloro01 bin]# cd /opt/ibm/InformationServer/ASBServer/bin/

[root@poluloro01 bin]# ./AppServerAdmin.sh -was -user wasadmin

-password Wasadmin0708

Info WAS instance /Node:poluloro01/Server:server1/ updated with new user information

Info MetadataServer daemon script updated with new user information

[root@poluloro01 bin]# ./AppServerAdmin.sh -was -user xmeta -password Xmeta0708

Info WAS instance /Node:poluloro01/Server:server1/ updated with new user information

Info MetadataServer daemon script updated with new user information

5. “The specified field doesn’t exist in view adapted schema”

SOL: Most of the time "The specified field: XXXXXX does not exist in the view
adapted schema" occurs when we have missed a field to map. Every stage has an output
tab if it is used in the middle of the job. Make sure you have mapped every single field
required for the next stage.
Sometimes even after mapping the fields this error can occur, and one of the reasons
could be that the view adapter has not linked the input and output fields. In this
case the required field mapping should be dropped and recreated.

Just to give an insight on this, the view adapter is an operator which is responsible for
mapping the input and output fields. Hence DataStage creates an instance of
APT_ViewAdapter which translates the components of the operator input interface
schema to matching components of the interface schema. So if the interface schema does not
have the same columns as the operator input interface schema, then this error will be
reported.

1)When we use same partitioning in datastage transformer stage we get the following
warning in 7.5.2 version.

TFCP000043 2 3 input_tfm: Input dataset 0 has a partitioning method other


than entire specified; disabling memory sharing.

This is a known issue and you can safely demote that warning to informational by adding
it to the project-specific message handler.

2) Warning: A sequential operator cannot preserve the partitioning of input data set on
input port 0

Resolution: Clear the preserve partition flag before Sequential file stages.

3)DataStage parallel job fails with fork() failed, Resource temporarily unavailable

On AIX, execute the following command to check the maxuproc setting, and increase it if you plan
to run multiple jobs at the same time.

lsattr -E -l sys0 | grep maxuproc


maxuproc 1024 Maximum number of PROCESSES allowed per
user True
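If the limit needs to be raised, on AIX it can usually be changed with chdev (a sketch; run as root and choose a value appropriate for your system):

chdev -l sys0 -a maxuproc=2048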

4)TFIP000000 3 Agg_stg: When checking operator: When binding input


interface field “CUST_ACT_NBR” to field “CUST_ACT_NBR”: Implicit conversion
from source type “string[5]” to result type “dfloat”: Converting string to number.

Resolution: use the Modify stage to explicitly convert the data type before sending it to the
Aggregator stage.

5)Warning: A user defined sort operator does not satisfy the requirements.

Resolution: check the order of the sorting columns and make sure the same order is used when
a Join stage is used after the sort to join the two inputs.

6)TFTM000000 2 3 Stg_tfm_header,1: Conversion error calling conversion


routine timestamp_from_string data may have been lost

TFTM000000 1 xfmJournals,1: Conversion error calling conversion routine


decimal_from_string data may have been lost

Resolution: check for the correct date format or decimal format, and also for null values in the
date or decimal fields, before passing them to the DataStage StringToDate,
DateToString, DecimalToString or StringToDecimal functions.

7)TOSO000119 2 3 Join_sort: When checking operator: Data claims to already


be sorted on the specified keys the ‘sorted’ option can be used to confirm this. Data will
be resorted as necessary. Performance may improve if this sort is removed from the flow

Resolution: Sort the data before sending to join stage and check for the order of sorting
keys and join keys and make sure both are in the same order.

8)TFOR000000 2 1 Join_Outer: When checking operator: Dropping


component “CUST_NBR” because of a prior component with the same name.

Resolution: If you are using Join, Diff, Merge or Compare stages, make sure both links have
different column names other than the key columns.

9)TFIP000022 1 oci_oracle_source: When checking operator: When binding


output interface field “MEMBER_NAME” to field “MEMBER_NAME”: Converting a
nullable source to a non-nullable result;

Resolution: If you are reading from an Oracle database, or in any processing stage where the
incoming column is defined as nullable, and you define the metadata in DataStage as non-
nullable, then you will get the above issue. If you want to convert a nullable field to
non-nullable, make sure you apply the available null-handling functions in DataStage or in the extract
query.

DATASTAGE COMMON ERRORS/WARNINGS AND SOLUTIONS – 2


1. No jobs or logs showing in IBM DataStage Director Client, however jobs are still
accessible from the Designer Client.

SOL: SyncProject cmd that is installed with DataStage 8.5 can be run to analyze and
recover projects

SyncProject -ISFile islogin -project dstage3 dstage5 -Fix

2. CASHOUT_DTL: Invalid property value /Connection/Database


(CC_StringProperty::getValue, file CC_StringProperty.cpp, line 104)

SOL: Change the Data Connection properties manually in the produced

DB2 Connector stage.

A patch fix is available for this issue JR35643

3. Import .dsx file from command line

SOL: DSXImportService -ISFile dataconnection -DSProject dstage -DSXFile


c:\export\oldproject.dsx

4. Generate Surrogate Key without Surrogate Key Stage

SOL: @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1

Use above Formula in Transformer stage to generate a surrogate key.

5. Failed to authenticate the current user against the selected Domain: Could not connect
to server.

RC: Client has invalid entry in host file

Server listening port might be blocked by a firewall

Server is down

SOL: Update the host file on client system so that the server hostname can be resolved
from client.

Make sure the WebSphere TCP/IP ports are opened by the firewall.
Make sure the WebSphere application server is running. (OR)

Restart Websphere services.

6. The connection was refused or the RPC daemon is not running (81016)

RC: The dsrpcd process must be running in order to be able to log in to DataStage.

If you restart DataStage, but the socket used by the dsrpcd (default is 31538) was busy,
the dsrpcd will fail to start. The socket may be held by dsapi_slave processes that were
still running or recently killed when DataStage was restarted.

SOL: Run "ps -ef | grep dsrpcd" to confirm the dsrpcd process is not running.

Run "ps -ef | grep dsapi_slave" to check if any dsapi_slave processes exist. If so, kill
them.

Run "netstat -a | grep dsrpc" to see if any processes have sockets that are
ESTABLISHED, FIN_WAIT, or CLOSE_WAIT. These will prevent the dsrpcd from
starting. The sockets with status FIN_WAIT or CLOSE_WAIT will eventually time out
and disappear, allowing you to restart DataStage.

Then Restart DSEngine. (if above doesn’t work) Needs to reboot the system.

7. To save Datastage logs in notepad or readable format

SOL: a) /opt/ibm/InformationServer/server/DSEngine (go to this directory)

./bin/dsjob -logdetail project_name job_name >/home/dsadm/log.txt

b) In the Director client, Project tab -> Print -> select the 'print to file' option and save it in a local
directory.

8. “Run time error ’457′. This Key is already associated with an element of this
collection.”

SOL: Needs to rebuild repository objects.

a) Login to the Administrator client

b) Select the project

c) Click on Command
d) Issue the command ds.tools

e) Select option ‘2’

f) Keep clicking next until it finishes.

g) All objects will be updated.

9. To stop the datastage jobs in linux level

SOL: ps -ef | grep dsadm

To check the process id and phantom jobs

kill -9 process_id

10. To run datastage jobs from command line

SOL: cd /opt/ibm/InformationServer/server/DSEngine

./dsjob -server $server_nm -user $user_nm -password $pwd -run $project_nm


$job_nm

11. Failed to connect to JobMonApp on port 13401.

SOL: needs to restart jobmoninit script (in


/opt/ibm/InformationServer/Server/PXEngine/Java)

Type sh jobmoninit start $APT_ORCHHOME

Add 127.0.0.1 local host in /etc/hosts file

(Without local entry, Job monitor will be unable to use the ports correctly)

12. SQL0752N. Connect to a database is not permitted within logical unit of work
CONNECT type 1 settings is in use.

SOL: COMMIT or ROLLBACK statement before requesting connection to another


database.

1. While running ./NodeAgents.sh start command… getting the following error:


“LoggingAgent.sh process stopped unexpectedly”
SOL: needs to kill LoggingAgentSocketImpl

ps -ef | grep LoggingAgentSocketImpl (OR)

ps -ef | grep Agent (to check the process id of the above)

2. Warning: A sequential operator cannot preserve the partitioning of input data set on
input port 0

SOL: Clear the preserve partition flag before Sequential file stages.

3. Warning: A user defined sort operator does not satisfy the requirements.

SOL: Check the order of the sorting columns and make sure the same order is used when
a Join stage is used after the sort to join the two inputs.

4. Conversion error calling conversion routine timestamp_from_string data may have


been lost. xfmJournals,1: Conversion error calling conversion routine
decimal_from_string data may have been lost

SOL: check for the correct date format or decimal format, and also for null values in the
date or decimal fields, before passing them to the DataStage StringToDate,
DateToString, DecimalToString or StringToDecimal functions.

5. To display all the jobs in command line

SOL:

cd /opt/ibm/InformationServer/Server/DSEngine/bin

./dsjob -ljobs <project_name>

6. “Error trying to query dsadm[]. There might be an issue in database server”

SOL: Check XMETA connectivity.

db2 connect to xmeta (A connection to or activation of database “xmeta” cannot be made


because of BACKUP pending)

7. “DSR_ADMIN: Unable to find the new project location”

SOL: Template.ini file might be missing in /opt/ibm/InformationServer/Server.


Copy the file from another server.

8. “Designer LOCKS UP while trying to open any stage”

SOL: Double click on the stage that locks up datastage

Press ALT+SPACE

Windows menu will popup and select Restore

It will show your properties window now

Click on “X” to close this window.

Now, double click again and try whether properties window appears.

9. “Error Setting up internal communications (fifo RT_SCTEMP/job_name.fifo)

SOL: Remove the locks and try to run (OR)

Restart DSEngine and try to run (OR)

Go to /opt/ibm/InformationServer/server/Projects/proj_name/

ls RT_SCT* then

rm –f RT_SCTEMP

then try to restart it.

10. While attempting to compile job, “failed to invoke GenRunTime using Phantom
process helper”

RC: /tmp space might be full

Job status is incorrect

Format problems with projects uvodbc.config file

SOL: a) clean up /tmp directory

b) DS Director -> Job -> Clear Status File


c) confirm uvodbc.config has the following entry/format:

[ODBC SOURCES]

<local uv>

DBMSTYPE = UNIVERSE

Network = TCP/IP

Service = uvserver

Host = 127.0.0.1

ERROR:Phantom error in jobs

Resolution – Datastage Services have to be started

So follow the following steps.

Login to server through putty using dsadm user.

Check whether active or stale sessions are there.

ps -ef | grep slave

Ask the application team to close the active or stale sessions running from application’s
user.

If they have closed the sessions, but sessions are still there, then kill those sessions.

Make sure no jobs are running

If any, ask the application team to stop the job


ps -ef | grep dsd.run

Check for output for below command before stopping Datastage services.

netstat -a | grep dsrpc

If any processes are in established, check any job or stale or active or osh sessions are not
running.

If any processes are in close_wait, then wait for some time, those processes

will not be visible.

Stop the Datastage services.

cd $DSHOME

./dsenv

cd $DSHOME/bin

./uv -admin -stop

Check whether Datastage services are stopped.

netstat -a | grep dsrpc

No output should come for above command.

Wait for 10 to 15 min for shared memory to be released by process holding them.

Start the Datastage services.

./uv -admin -start

If it asks for the dsadm password while running the command, then enable
impersonation through the root user:

${DSHOME}/scripts/DSEnable_impersonation.sh

Friday, December 6, 2013


InfoSphere DataStage Jobstatus returned Codes from dsjob

Equ DSJS.RUNNING To 0 This is the only status that means the job is actually running
Equ DSJS.RUNOK To 1 Job finished a normal run with no warnings
Equ DSJS.RUNWARN To 2 Job finished a normal run with warnings

Equ DSJS.RUNFAILED To 3 Job finished a normal run with a fatal error
Equ DSJS.QUEUED To 4 Job queued waiting for resource allocation
Equ DSJS.VALOK To 11 Job finished a validation run with no warnings
Equ DSJS.VALWARN To 12 Job finished a validation run with warnings
Equ DSJS.VALFAILED To 13 Job failed a validation run
Equ DSJS.RESET To 21 Job finished a reset run
Equ DSJS.CRASHED To 96 Job has crashed
Equ DSJS.STOPPED To 97 Job was stopped by operator intervention (can't tell run type)
Equ DSJS.NOTRUNNABLE To 98 Job has not been compiled
Equ DSJS.NOTRUNNING To 99 Any other status
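A minimal sketch of using these codes from the shell (assuming dsjob is on the PATH; myproject and myjob are placeholder names, and the exact -jobinfo output format can vary by version):

jobstatus=`dsjob -jobinfo myproject myjob | grep 'Job Status' | sed 's/.*(\([0-9]*\)).*/\1/'`
case "$jobstatus" in
0) echo "job is running" ;;
1|2) echo "job finished OK (possibly with warnings)" ;;
3|96) echo "job failed or crashed" ;;
*) echo "other status: $jobstatus" ;;
esac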

Thursday, October 17, 2013


Warning : Ignoring duplicate entry at table record no further warnings will be
issued for this table

This warning is seen when there are multiple records with the same key
column is present in the reference table from which lookup is done. Lookup,
by default, will fetch the first record which it gets as match and will throw the
warning
since it doesn’t know which value is the correct one to be returned from the
reference.
To solve this problem you can either select one of the reference links from the "Multiple
rows returned from link" dropdown in the Lookup constraints; in this case Lookup
will return multiple rows for each row that is matched.

Or else use some method to remove the duplicate rows with the same key
columns, according to the business requirements.

Monday, June 24, 2013


How to replace ^M character in VI editor/sed?

^M is the DOS line-break character, which shows up in UNIX files when they are uploaded from a Windows file system in

ASCII format.

To remove this, open your file in vi editor and type

:%s/(ctrl-v)(ctrl-m)//g

and press Enter key.

Important!! - press the (Ctrl-v)(Ctrl-m) combination to enter the ^M character; don't type "^" and "M" literally.

If anything goes wrong exit with q!.

Also,

Your substitution command may catch more ^M than necessary. Your file may contain valid ^M in the

middle of a line of code for example. Use the following command instead to remove only those at the very

end of lines:

:%s/(ctrl-v)(ctrl-m)*$//g

Using sed:

sed -e "s/^M//g" old_file_name > new_file_name
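Another common sketch is to strip the carriage returns with tr:

tr -d '\r' < old_file_name > new_file_name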


Thursday, June 6, 2013


How to convert a single row into multiple rows ?

Below is a screenshot of our input data

City State Name1 Name2 Name3

xy FGH Sam Dean Winchester

We are going to read the above data from a sequential file and transform it to look like this

City State Name

xy FGH Sam

xy FGH Dean

xy FGH Winchester

So lets get to the job design

Step 1: Read the input data

Step 2: Logic for Looping in Transformer

In the adjacent image you can see a new box called Loop Condition. This where we are going to
control the loop variables.

Below is the screenshot when we expand the Loop Condition box

The Loop While constraint is used to implement functionality similar to the "WHILE" statement in
programming. So, similar to a while statement, we need to have a condition to identify how many times
the loop is supposed to be executed.

To achieve this, the @ITERATION system variable was introduced. In our example we need to loop the
data 3 times to get the column data onto subsequent rows.

So let's have @ITERATION <= 3

Now create a new Loop variable with the name LoopName

The derivation for this loop variable should be

If @ITERATION=1 Then DSLink2.Name1 Else If @ITERATION=2 Then DSLink2.Name2 Else


DSLink2.Name3

Below is a screenshot illustrating the same

Now all we have to do is map this Loop variable LoopName to our output column Name.
Let's map the output to a sequential file stage and see if the output is as desired.

After running the job, we did a view data on the output stage and here is the data as desired.

Making some tweaks to the above design we can implement things like

 Adding new rows to existing rows


 Splitting data in a single column to multiple rows and many more such stuff..

How to perform aggregation using a Transformer

Input: Below is the sample data of three students, their marks in two subjects, the
corresponding grades and the dates on which they were graded.
Output: Our requirement is to sum the marks obtained by each student in a subject and display it in
the output.

Step 1: Once we have read the data from the source we have to sort data on our key field. In our
example the key field is the student name

Once the data is sorted we have to implement the looping function in transformer to calculate the
aggregate value

Before we get into the details, we need to know a couple of functions

o SaveInputRecord(): This function saves the entire record in cache and returns the number of
records that are currently stored in cache
o LastRowInGroup(input-column): When an input key column is passed to this function it will return
1 when the last row for that column's value is found, and in all other cases it will return 0
To give an example, lets say our input is

Student Code

ABC 1

ABC 2

ABC 3

DEF 2

o For the first two records the function will return 0 but for the last record ABC,3 it will return 1
indicating that it is the last record for the group where student name is “ABC”

o GetSavedInputRecord(): This function returns the record that was stored in cache by the function
SaveInputRecord()
Back to the task at hand, we need 7 stage variables to perform the aggregation operation
successfully.

1. LoopNumber: Holds the value of number of records stored in cache for a student
2. LoopBreak: This is to identify the last record for a particular student
3. SumSub1: This variable will hold the final sum of marks for each student in subject 1
4. IntermediateSumSub1: This variable will hold the sum of marks until the final record is evaluated
for a student (subject 1)
5. SumSub2: Similar to SumSub1 (for subject 2)
6. IntermediateSumSub2: Similar to IntermediateSumSub1 (for subject 2)
7. LoopBreakNum: Holds the value for the number of times the loop has to run
Below is the screenshot of the stage variables

We also need to define the Loop Variables so that the loop will execute for a student until his final
record is identified

To explain the above use of variables -

When the first record comes to the stage variables, it is saved in the cache using the function
SaveInputRecord() in the first stage variable, LoopNumber

The second stage variable checks if it is the last record for this particular student, if it is it stores 1
else 0

The third SumSub1 is executed only if the record is the last record

The fourth, IntermediateSumSub1, is executed when the input record is not the last record, thereby
storing the intermediate sum of the subject for a student

Fifth and sixth are the same as 3 and 4 stage variables

The seventh will have 1 as its first value, and if the second record fetched is also for the same student it will
change to 2, and so on

The loop variable will be executed until the final record for a student is identified and the
GetSavedInputRecord() function will make sure the current record is processed before the next record
is brought for processing.

What the above logic does is for each and every record it will send the sum of marks scored by each
student to the output. But our requirement is to have only one record per student in the output.

So we simply add a remove duplicates stage and add the student name as a primary key

Run the job and the output will be according to our initial expectation

We have successfully implemented AGGREGATION using TRANSFORMER Stage


Thursday, May 23, 2013


Star vs Snowflake Schemas
First Answer: My personal opinion is to use the star by default, but if the product you are using for
the business community prefers a snowflake, then I would snowflake it. The major difference between
snowflake and star is that a snowflake will have multiple tables for a "dimension" and a star a
single table. For example, your company structure might be

Corporate -> Region -> Department -> Store

In a star schema, you would collapse those into a single "store" dimension. In a snowflake, you would
keep them apart with the store connecting to the fact.

Second Answer: First of all, some definitions are in order. In a star schema, dimensions
that reflect a hierarchy are flattened into a single table. For example, a star schema
Geography Dimension would have columns like country, state/province, city, state and
postal code. In the source system, this hierarchy would probably be normalized with
multiple tables with one-to-many relationships.

A snowflake schema does not flatten a hierarchy dimension into a single table. It would,
instead, have two or more tables with a one-to-many relationship. This is a more
normalized structure. For example, one table may have state/province and country columns
and a second table would have city and postal code. The table with city and postal code
would have a many-to-one relationship to the table with the state/province columns.

There are some good reasons for snowflake dimension tables. One example is a company
that has many types of products. Some products have a few attributes, others have many,
many. The products are very different from each other. The thing to do here is to create a
core Product dimension that has common attributes for all the products such as product
type, manufacturer, brand, product group, etc. Create a separate sub-dimension table for
each distinct group of products where each group shares common attributes. The sub-
product tables must contain a foreign key of the core Product dimension table.

One of the criticisms of using snowflake dimensions is that it is difficult for some of the
multidimensional front-end presentation tools to generate a query on a snowflake
dimension. However, you can create a view for each combination of the core product/sub-
product dimension tables and give the view a suitably descriptive name (Frozen Food
Product, Hardware Product, etc.) and then these tools will have no problem.


Tuesday, May 14, 2013


Performance Tuning in Datastage
1 Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server
using Hash/Sequential files for optimum performance
2 Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for
faster inserts, updates and selects.
3 Tuned the 'Project Tunables' in Administrator for better performance
4 Used sorted data for Aggregator
5 Sorted the data as much as possible in the DB and reduced the use of DS-Sort for
better performance of jobs
6 Removed the data not used from the source as early as possible in the job
7 Worked with the DB admin to create appropriate indexes on tables for better performance of
DS queries
8 Converted some of the complex joins/business in DS to Stored Procedures on DS for
faster execution of the jobs.
9 If an input file has an excessive number of rows and can be split-up then use standard
logic to run jobs in parallel.
10 Before writing a routine or a transform, make sure that the required functionality is not
already available in one of the standard routines supplied in the sdk or ds utilities
categories. Constraints are generally CPU intensive and take a significant amount of time
to process. This may be the case if the constraint calls routines or external macros, but if
it is inline code then the overhead will be minimal.
11 Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate
the unnecessary records even getting in before joins are made.
12 Tuning should occur on a job-by-job basis.
13 Use the power of DBMS.
14 Try not to use a sort stage when you can use an ORDER BY clause in the database.
15 Using a constraint to filter a record set is much slower than performing a SELECT …
WHERE….
16 Make every attempt to use the bulk loader for your particular database. Bulk loaders are
generally faster than using ODBC or OLE.
17 Minimize the usage of Transformer (instead use Copy, Modify, Filter, Row Generator)
18 Use SQL Code while extracting the data
19 Handle the nulls
20 Minimize the warnings
21 Reduce the number of lookups in a job design
22 Try not to use more than 20 stages in a job
23 Use IPC stage between two passive stages; it reduces processing time
24 Drop indexes before data loading and recreate after loading data into tables

25 Check the write cache of the Hash file. If the same hash file is used for lookup as well
as target, disable this option.
26 If the hash file is used only for lookup then enable Preload to memory . This will
improve the performance. Also check the order of execution of the routines.
27 Don't use more than 7 lookups in the same transformer; introduce new transformers if it
exceeds 7 lookups.
28 Use Preload to memory option in the hash file output.
29 Use Write to cache in the hash file input.
30 Write into the error tables only after all the transformer stages.
31 Reduce the width of the input record - remove the columns that you would not use.
32 Cache the hash files you are reading from and writing into. Make sure your cache is big
enough to hold the hash files.
33 Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
34 Ideally, if the amount of data to be processed is small, configuration files with a smaller
number of nodes should be used, while if the data volume is larger, configuration files with a
larger number of nodes should be used.
35 Partitioning should be set in such a way so as to have balanced data flow i.e. nearly
equal partitioning of data should occur and data skew should be minimized.
36 In DataStage Jobs where high volume of data is processed, virtual memory settings for
the job should be optimized. Jobs often abort in cases where a single lookup has
multiple reference links. This happens due to low temp memory space. In such jobs
$APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should
be set to sufficiently large values.
37 Sequential files should be used in the following conditions: when we are reading a flat file
(fixed width or delimited) from a UNIX environment which is FTPed from some external
system
38 When some UNIX operation has to be done on the file. Don't use sequential files for
intermediate storage between jobs. It causes performance overhead, as it needs to do
data conversion before writing to and reading from a UNIX file
39 In order to have faster reading from the Stage the number of readers per node can be
increased (default value is one).
40 Usage of Dataset results in a good performance in a set of linked jobs. They help in
achieving end-to-end parallelism by writing data in partitioned form and maintaining the
sort order.
41 Look up Stage is faster when the data volume is less. If the reference data volume is
more, usage of Lookup Stage should be avoided as all reference data is pulled in to local
memory
42 Sparse lookup type should be chosen only if primary input data volume is small.
43 Join should be used when the data volume is high. It is a good alternative to the lookup
stage and should be used when handling huge volumes of data.
44 Even though data can be sorted on a link, the Sort Stage is used when the data to be sorted
is huge. When we sort data on a link (sort / unique option), once the data size is beyond
the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if
the volume of data is large, an explicit Sort Stage should be used instead of sort on link. The Sort
Stage gives an option for increasing the buffer memory used for sorting; this would mean
lower I/O and better performance.
45 It is also advisable to reduce the number of transformers in a Job by combining the logic
into a single transformer rather than having multiple transformers.
46 Presence of a Funnel Stage reduces the performance of a job. It would increase the time
taken by job by 30% (observations). When a Funnel Stage is to be used in a large job it is
better to isolate itself to one job. Write the output to Datasets and funnel them in new
job.
47 Funnel Stage should be run in “continuous” mode, without hindrance.
48 A single job should not be overloaded with Stages. Each extra Stage put in a Job
corresponds to lesser number of resources available for every Stage, which directly
affects the Jobs Performance. If possible, big jobs having large number of Stages should
be logically split into smaller units.
49 Unnecessary column propagation should not be done. As far as possible, RCP (Runtime
Column Propagation) should be disabled in the jobs
50 Most often neglected option is “don’t sort if previously sorted” in sort Stage, set this
option to “true”. This improves the Sort Stage performance a great deal
51 In Transformer Stage “Preserve Sort Order” can be used to maintain sort order of the
data and reduce sorting in the job
52 Reduce the number of Stage variables used.
53 The Copy stage should be used instead of a Transformer for simple operations
54 The “upsert” works well if the data is sorted on the primary key column of the table
which is being loaded.
55 Don’t read from a Sequential File using SAME partitioning
56 By using the hashfile stage we can improve the performance.
In the case of the hashfile stage we can define the read cache size
& write cache size; the default size is 128 MB.
57 By using active-to-active links we can also
improve the performance.
Here we can improve the performance by enabling the row
buffer; the default row buffer size is 128 KB.

===================================================================================

TRANSFORMER STAGE TO FILTER THE DATA

Take Job Design as below

If our requirement is to filter the data department wise from the file below

samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,production,40

And our requirement is to get the target data as below

In Target1 we need 10th & 40th dept employees.

In Target2 we need 30th dept employees.


In Target3 we need 20th & 40th dept employees.

Read and Load the data in Source file

In Transformer Stage just Drag and Drop the data to the target tables.

Write expression in constraints as below

dept_no=10 or dept_no=40 for table 1

dept_no=30 for table 2

dept_no=20 or dept_no=40 for table 3

Click ok

Give file name at the target file and

Compile and Run the Job to get the Output.

================================================================================
