
DataStage command line integration

Leveraging one of the most used inter-process communication mechanisms in Linux and UNIX
Alberto Ortiz (aortizg@mx1.ibm.com)
Software IT Architect
IBM

04 April 2013

Develop IBM InfoSphere DataStage jobs that can be called from a command line or shell
script using UNIX pipes for more compact and efficient integration. This technique has the
potential of saving storage space by bypassing the landing of intermediate files that would
otherwise feed an ETL job. It also reduces overall execution time and lets you share the power
of DataStage jobs through remote execution.

Integration scenario
DataStage jobs are usually run to process data in batches, which are scheduled to run at specific
intervals. When there is no specific schedule to follow, the DataStage operator can start the job
manually through the DataStage and QualityStage Director client, or at the command line. If the
job is run at the command line, you would most likely do it as follows.
dsjob -run -param input_file=/path/to/in_file -param output_file=/path/to/out_file dstage1 job1

A diagram representing this command is shown in Figure 1.


Figure 1. Invoking a DataStage job

In normal circumstances, the in_file and out_file are stored in a file system on the machine where
DataStage runs. But in Linux or UNIX, input and output can be piped through a series of commands.
For example, when the output requires sorting, you can do the following: command | sort | uniq
> /path/to/out_file. In this case, Figure 2 shows the flow of data, where the output of one
command becomes the input of the next, and the final output is landed in the file system.

Figure 2. Typical UNIX pipe usage
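For example (with purely illustrative file names), the same pattern can count the distinct ERROR lines of a log file without ever landing the intermediate, unsorted data:

grep ERROR /path/to/app.log | sort | uniq -c > /path/to/error_counts.txt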

Assuming the intermediate processes produce many millions of lines, you are potentially avoiding
landing the intermediate files, thus saving space in the file system and the time it takes to write
those files. Unlike many programs and commands executed in UNIX, DataStage jobs do not take
standard input through a pipe. This article describes a method, and shows the script, to make that
happen, as well as its practical uses.
If the job is to accept standard input and produce standard output like a regular UNIX command,
it has to be called through a wrapper script as follows: command1 | piped_ds_job.sh |
command2 > /path/to/out_file.
Or you may have to send the output to a file, such as the following: command1 |
piped_ds_job.sh > /path/to/out_file.
The diagram in Figure 3 shows you how the script should be structured.

Figure 3. Wrapper script for a DataStage job

The script will have to convert standard input into a named pipe, and also convert the output file of
the DataStage job into standard output. In the next sections, you will learn how to accomplish this.
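To make the mechanism concrete before building the real script, the following is a minimal sketch of the named-pipe idea. The pipe names and the placeholder consumer_program are hypothetical; the actual wrapper is developed in the listings that follow.

mkfifo /tmp/in.fifo /tmp/out.fifo               # create the two named pipes
consumer_program /tmp/in.fifo /tmp/out.fifo &   # placeholder for any program reading in.fifo and writing out.fifo
cat /tmp/out.fifo &                             # forward the output pipe to this script's standard output
cat > /tmp/in.fifo                              # stream this script's standard input into the input pipe
wait                                            # let the background cat drain the output pipe
rm /tmp/in.fifo /tmp/out.fifo                   # clean up the pipe files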

Developing the DataStage job


The DataStage job does not require any special treatment. For this example, you will create a job
that sorts a file and that, when run normally, takes at least two parameters: the input file and the
output file. The job could have more parameters if its function required them, but for this exercise
it is better to keep it simple.
The job is shown in Figure 4.

Figure 4. Simple sort DataStage job

The DSX for this job is available in the downloads section of this article. The job simply takes a text
file, treats the full line as a single column, sorts it, and writes the result to the output file.
Additionally, the job must allow multiple-instance execution. It should read each input line with
no separator and no quotes, and the output file will have the same characteristics.
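As an illustration (made-up data), an input file in that format and the sorted output produced by the job could look like the following.

input file:
victory
alpha
delta

output file:
alpha
delta
victory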


Writing and using the wrapper script


The wrapper script will contain the code required to create temporary files for the named pipes,
and to build the command line that invokes the DataStage job (dsjob). Specifically, the script has
to perform the following steps.
1. Direct the standard input (that is, the output of the command piping into the script) to a named pipe.
2. Make the output of the job be written to another named pipe, which is then streamed to the standard output of the process so the next command can read the output from a pipe as well.
3. Invoke the DataStage job, specifying the input file and output file parameters using the file names of the named pipes created earlier.
4. Clean up the temporary files created for the named pipes.
Now begin writing the wrapper script. The first group of commands prepares the environment,
sourcing the dsenv file from the installation directory and setting some variables. You can use
the process ID (pid) as the identifier to create the temporary file names in a temporary directory,
as shown in Listing 1.

Listing 1. Preparing the DataStage environment


#!/bin/bash
dshome=`cat /.dshome`
. $dshome/dsenv
export PATH=$PATH:$DSHOME/bin
pid=$$
fifodir=/data/datastage/tmp
infname=$fifodir/infname.$pid
outfname=$fifodir/outfname.$pid

You can now proceed with the FIFO creation and the dsjob execution. At this point, the job will wait
until the pipe starts receiving input. The code warns you if the DataStage job invocation has thrown
an error, as shown in Listing 2.

Listing 2. Creating the named pipes and invoking the job


mkfifo $infname
mkfifo $outfname
dsjob -run -param inputFile=$infname \
  -param outputFile=$outfname dstage1 ds_sort.$pid 2> /dev/null &

if [ $? -ne 0 ]; then
  echo "error calling DataStage job."
  rm $infname
  rm $outfname
  exit 1
fi
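If you want to verify from another terminal that the job instance has started and is waiting on the input pipe, you can query its status with dsjob; here ds_sort.<pid> stands for whichever process ID the wrapper used:

dsjob -jobinfo dstage1 ds_sort.<pid>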

At the end of the dsjob command, you see an ampersand, which is necessary because the job
waits for the input named pipe to send data, and the data will only be streamed a few lines further
down in the script. The following code prepares the output to be sent to standard output via a
simple cat command. As you can see, the cat command and the rm command are within
parentheses, meaning that those two commands are invoked in a sub-shell that is sent to the
background (specified by the ampersand at the end of the line), as shown in Listing 3.

Listing 3. Handling the input and output named pipes


(cat $outfname; rm $outfname) &
if [ -z "$1" ]; then
  cat > $infname
else
  cat $1 > $infname
fi
rm $infname

The rm inside the sub-shell is necessary so that, when the job has finished writing the output, the
temporary named pipe file is removed. The code that follows it tests whether the script was called
with a file name as a parameter, or whether it is receiving the data from a pipe. After the input
stream (file or pipe) has been sent to the input named pipe, the script finishes and removes that
file as well.
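In other words, the wrapper accepts its input either as a file argument or through a pipe; the two invocations below are equivalent (paths are placeholders):

piped_ds_job.sh /path/to/in_file > /path/to/out_file
cat /path/to/in_file | piped_ds_job.sh > /path/to/out_file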
You can name the script piped_ds_job.sh and execute it as mentioned previously: command1 |
piped_ds_job.sh > /path/to/out_file. The fact that the script can receive its input via an
anonymous pipe allows the uses shown in Listing 4.

Listing 4. Wrapper script uses


command1 | piped_ds_job.sh | command2
zcat compressedfile.gz | piped_ds_job.sh > /path/to/out_file
zcat compressedfile.gz | ssh dsadm@victory.ibm.com piped_ds_job.sh | command2

The last sample, which uses SSH, assumes that you are executing from another machine, and
therefore the DataStage job is effectively being used as a service. It is also a representative
example of how you can bypass the file transfer (and, in this case, the decompression) on the
DataStage server.
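For reference, here is the complete wrapper script assembled from Listings 1 through 3. It is a sketch only: the temporary directory, the project name dstage1, and the invocation ds_sort.$pid are the example values used in this article, and the final wait is a small addition to make sure the background cat finishes before the script exits.

#!/bin/bash
# piped_ds_job.sh - run a DataStage job as a UNIX filter (assembled from Listings 1-3)
dshome=`cat /.dshome`
. $dshome/dsenv
export PATH=$PATH:$DSHOME/bin
pid=$$
fifodir=/data/datastage/tmp
infname=$fifodir/infname.$pid
outfname=$fifodir/outfname.$pid

mkfifo $infname
mkfifo $outfname
dsjob -run -param inputFile=$infname \
  -param outputFile=$outfname dstage1 ds_sort.$pid 2> /dev/null &
if [ $? -ne 0 ]; then
  echo "error calling DataStage job."
  rm $infname
  rm $outfname
  exit 1
fi

(cat $outfname; rm $outfname) &
if [ -z "$1" ]; then
  cat > $infname          # no argument: read from standard input
else
  cat $1 > $infname       # file argument: read from the named file
fi
rm $infname
wait                      # addition: let the background cat finish flushing the output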

Conclusion
The mechanism described in this article allows for more flexible DataStage job invocation at the
command line and in shell scripting. The wrapper script explained here can easily be customized
to make it more general and flexible. The technique is a simple one that can be quickly
implemented for existing jobs and can turn them into services through remote execution via SSH.
The benefits of avoiding landing data in a regular file are most notable when file sizes are on the
order of tens of millions of rows, but even if your data is not that large, the integration use case is
very valuable.


Downloads
Description                               Name                 Size
Sample DataStage job and wrapper script   job_and_script.zip   10KB




About the author


Alberto Ortiz
Alberto Ortiz is a Software IT Architect for IBM Mexico, where he designs and deploys
cross-brand software solutions. Since joining IBM in 2009, he has participated in project
implementations of varying sizes and complexities, from tuning an extract, transform,
and load (ETL) platform for a bank's data warehouse, to an IBM Identity Insight
deployment with 400+ million records for a federal government police intelligence
system. Before joining IBM, Alberto worked in several technical roles on projects for
local and international clients in industries such as telecommunications, manufacturing,
finance, and government. Alberto holds a B.Sc. in Computer Systems Engineering from
Universidad de las Americas Puebla in Mexico and is currently studying for an MBA at
Instituto Tecnologico y de Estudios Superiores de Monterrey (ITESM).
Copyright IBM Corporation 2013 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)
