
Parallel Architecture, DataStage v8 Configuration, Metadata

Parallel processing = executing your application on multiple CPUs

Parallel processing environments


The environment in which you run your parallel jobs is defined by your system's architecture and hardware resources. All parallel processing environments fall into one of two categories:
1. SMP (symmetric multiprocessing), in which some hardware resources may be shared among processors. The processors communicate via shared memory and have a single operating system.
2. Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system, and communicate via a high-speed network.

Pipeline Parallelism
1. Extract, Transform and Load processes execute simultaneously
2. The downstream process starts while the upstream process is running, like a conveyor belt moving rows from process to process
3. Advantages: reduces disk usage for staging areas and keeps processors busy
4. Still has limits on scalability

Partition Parallelism
1. Divide the incoming stream of data into subsets known as partitions to be processed separately
2. Each partition is processed in the same way
3. Facilitates near-linear scalability. However, the data needs to be evenly distributed across the partitions; otherwise the benefits of partitioning are reduced

Within parallel jobs, pipelining, partitioning and repartitioning are automatic. The job developer only identifies:
1. Sequential or Parallel mode (by stage)
2. Partitioning method
3. Collection method
4. Configuration file

Configuration File
One of the great strengths of WebSphere DataStage Enterprise Edition is that, when designing parallel jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design.

WebSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.

The WebSphere DataStage Designer provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools > Configurations; the Configurations dialog box appears. You specify which configuration will be used by setting the $APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it on a project-wide level from the WebSphere DataStage Administrator or for individual jobs from the Job Properties dialog.

Configuration files are text files containing string data. The general form of a configuration file is as follows:

{ node "n1" { fastname "s1" pool "" "n1" "s1" "app2" "sort" resource disk "/orch/n1/d1" {} resource disk "/orch/n1/d2" {"bigdata"} resource scratchdisk "/temp" {"sort"} } } Node names Each node you define is followed by its name enclosed in quotation marks, for example: node "orch0" For a single CPU node or workstation, the nodes name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a nodes network name:
$ uname -n

On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.

Fastname
Syntax: fastname "name"
This option takes as its quoted attribute the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fastname is the physical node name that stages use to open connections for high-volume data transfers. The attribute of this option is often the network name. For an SMP, all CPUs share a single connection to the network, and this setting is the same for all parallel engine processing nodes defined for an SMP. Typically, this is the principal node name, as returned by the UNIX command uname -n.

Node pools and the default node pool
Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O. These nodes can be grouped into an I/O node pool. The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.

node "node1" {
  fastname "node1_css"
  pools "" "node1" "node1_css" "pool4"
  resource disk "/orch/s0" {}
  resource scratchdisk "/scratch" {}
}

A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.

Once you have defined a node pool, you can constrain a parallel stage or parallel job to run only on that pool, that is, only on the processing nodes belonging to it. If you constrain both a stage and a job, the stage runs only on the nodes that appear in both pools. Nodes or resources that name a pool declare their membership in that pool. We suggest that when you initially configure your system you place all nodes in pools that are named after the node's name and fast name. Additionally, include the default node pool in this pool, as in the following example:

node "n1" {
  fastname "nfast"
  pools "" "n1" "nfast"
}

By default, the parallel engine executes a parallel stage on all nodes defined in the default node pool. You can constrain the processing nodes used by the parallel engine either by removing node descriptions from the configuration file or by constraining a job or stage to a particular node pool.

Disk and scratch disk pools and their defaults
When you define a processing node, you can specify the options resource disk and resource scratchdisk. They indicate the directories of file systems available to the node. You can also group disks and scratch disks in pools. Pools reserve storage for a particular use, such as holding very large data sets. Pools defined by disk and scratchdisk are not combined; therefore, two pools that have the same name and belong to both resource disk and resource scratchdisk define two separate pools. A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and { } (the empty pool list). For example, the following code configures the disks for node1:

node "n1" {
  resource disk "/orch/s0" {pools "" "pool1"}
  resource disk "/orch/s1" {pools "" "pool1"}
  resource disk "/orch/s2" { } /* empty pool list */
  resource disk "/orch/s3" {pools "pool2"}
  resource scratchdisk "/scratch" {pools "" "scratch_pool1"}
}

In this example:
1. The first two disks are assigned to the default pool.
2. The first two disks are assigned to pool1.
3. The third disk is also assigned to the default pool, indicated by { }.
4. The fourth disk is assigned to pool2 and is not assigned to the default pool.
5. The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.

Buffer scratch disk pools
Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node. The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering.

Here is an example configuration file that defines a buffer scratch disk pool:

{
  node "node1" {
    fastname "node1_css"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
  node "node2" {
    fastname "node2_css"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
}

In this example, each processing node has a single scratch disk resource in the buffer pool, so buffering will use /scratch0 but not /scratch1. However, if /scratch0 were not in the buffer pool, both /scratch0 and /scratch1 would be used because both would then be in the default pool.
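As noted earlier, the configuration file in use is selected through the $APT_CONFIG_FILE environment variable. A minimal sketch of overriding it for a single session is shown below; the file path is illustrative only and will differ on your installation:

$ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt
$ cat $APT_CONFIG_FILE

The same variable can be set project-wide in Administrator or per job in the Job Properties dialog.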

Partitioning
The aim of most partitioning operations is to end up with a set of partitions that are as close to equal in size as possible, ensuring an even load across your processors. When performing some operations, however, you will need to take control of partitioning to ensure that you get consistent results. A good example of this would be where you are using an Aggregator stage to summarize your data. To get the answers you want (and need), you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.

Round robin partitioner


The first record goes to the first processing node, the second to the second processing node, and so on. When WebSphere DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when WebSphere DataStage initially partitions data.

Random partitioner
Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. Random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.

Entire partitioner
Every instance of a stage on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution, but every instance of the operator needs access to the entire input data set. You are most likely to use this partitioning method with stages that create lookup tables from their input.

Same partitioner
The stage using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method. This is normally the method WebSphere DataStage uses when passing data between stages in your job.

Hash partitioner
Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. The hash partitioner examines one or more fields of each input record (the hash key fields). Records with the same values for all hash key fields are assigned to the same processing node. This method is useful for ensuring that related records are in the same partition, which may be a prerequisite for a processing operation. For example, for a remove duplicates operation, you can hash partition records so that records with the same partitioning key values are on the same node. You can then sort the records on each node using the hash key fields as sorting key fields, then remove duplicates, again using the same keys. Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found. Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes.

Modulus partitioner
Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation. In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field. The partition number of each record is calculated as follows:
partition_number = fieldname mod number_of_partitions

where fieldname is a numeric field of the input data set and number_of_partitions is the number of processing nodes on which the partitioner executes. If a partitioner is executed on three processing nodes, it has three partitions.
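For example, assuming a three-node configuration (number_of_partitions = 3) and a key column named bucket, a record with bucket = 7 is assigned to partition 7 mod 3 = 1, while a record with bucket = 9 goes to partition 0. The column name and values here are purely illustrative.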

Range partitioner
Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition. A range partitioner divides a data set into approximately equal-sized partitions based on one or more partitioning keys. In order to use a range partitioner, you have to make a range map. You can do this using the Write Range Map stage. The range partitioner guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size, so all nodes perform an equal amount of work when processing the data set. Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and round robin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However, these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.

DB2 partitioner
Partitions an input data set in the same way that DB2 would partition it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table would entail no network activity.

Auto partitioner
The most common method you will see on the WebSphere DataStage stages is Auto. This just means that you are leaving it to WebSphere DataStage to determine the best partitioning method to use depending on the type of stage, and what the previous stage in the job has done. Typically WebSphere DataStage would use round robin when initially partitioning data, and same for the intermediate stages of a job.

Collecting
Collecting is the process of joining the multiple partitions of a single data set back together again into a single partition. There may be a stage in your job that you want to run sequentially rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set. Note that collecting methods are mostly non-deterministic. That is, if you run the same job twice with the same data, you are unlikely to get data collected in the same order each time. If order matters, you need to use the sorted merge collection method.

Round robin collector


Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, starts over. After reaching the final record in any partition, skips that partition in the remaining rounds.

Ordered collector
Reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both the records in each partition and the partitions themselves are ordered. This may be useful as a preprocessing action before exporting a sorted data set to a single data file.

Sorted merge collector


Reads records in an order based on one or more columns of the record. The columns used to define record order are called collecting keys. Typically, you use the sorted merge collector with a partition-sorted data set (as created by a Sort stage). In this case, you specify as the collecting key fields those fields you specified as sorting key fields to the Sort stage. The data type of a collecting key can be any type except raw, subrec, tagged, or vector.

Auto collector

The most common method you will see on the parallel stages is Auto. This normally means that WebSphere DataStage will eagerly read any row from any input partition as it becomes available, but if it detects that, for example, the data needs sorting as it is collected, it will do that. This is the fastest collecting method.

Preserve partitioning flag


A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the preserve partitioning flag for its output link. Note, however, that the next stage may ignore this request. In most cases you are best leaving the preserve partitioning flag in its default state. The exception to this is where preserving existing partitioning is important. The flag will not prevent repartitioning, but it will warn you that it has happened when you run the job. If the Preserve Partitioning flag is cleared, this means that the current stage doesn't care what the next stage in the job does about partitioning. On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before is used, and so on, until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has a flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.

Parallel Job Score


At runtime, the job SCORE can be examined to identify:
1. Number of UNIX processes generated for a given job and $APT_CONFIG_FILE
2. Operator combination
3. Partitioning methods between operators
4. Framework-inserted components, including sorts, partitioners, and buffer operators

Set $APT_DUMP_SCORE=1 to output the score to the DataStage job log. For each job run, two separate score dumps are written to the log:
1. The first score is actually from the license operator
2. The second score entry is the actual job score

Job scores are divided into two sections:
1. Datasets - partitioning and collecting
2. Operators - node/operator mapping

Example score dump


The following score dump shows a flow with a single data set, which has a hash partitioner, partitioning on key a. It shows three operators: generator, tsort, and peek. Tsort and peek are combined, indicating that they have been optimized into the same process. All the operators in this flow are running on one node.
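The dump itself is not reproduced here, but a score of this kind has roughly the following shape. This is a hedged sketch only: the node name, operator numbering, partitioner notation and exact formatting are assumptions and vary by release and configuration.

main_program: This step has 1 dataset:
ds0: {op0[1p] (parallel generator)
      eOther(APT_HashPartitioner { key={ value=a }})#>eCollectAny
      op1[1p] (parallel APT_CombinedOperatorController:(tsort)(peek))}
It has 2 operators:
op0[1p] {(parallel generator)
    on nodes (
      node1[op0,p0]
    )}
op1[1p] {(parallel APT_CombinedOperatorController:
      (tsort)
      (peek)
    ) on nodes (
      node1[op1,p0]
    )}
It runs 2 processes on 1 node.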

The DataStage Parallel Framework implements a producer-consumer data flow model: upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets). The partitioning method is associated with the producer; the collector method is associated with the consumer. eCollectAny is specified for parallel consumers, although no collection occurs!

The producer and consumer are separated by the following indicators:
-> Sequential to Sequential
<> Sequential to Parallel
=> Parallel to Parallel (SAME)
#> Parallel to Parallel (not SAME)
>> Parallel to Sequential
>  No producer or no consumer

The score may also include [pp] notation when the Preserve Partitioning flag is set.

At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
1. Use the same partitioning method. Repartitioning prevents operator combination between the corresponding producer and consumer stages. Implicit repartitioning (e.g., sequential operators, node maps) also prevents combination.
2. Are combinable. This is set automatically within the stage/operator definition, or set within DataStage Designer: Advanced stage properties.

The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage:
1. LUTCreateImpl - reads the reference data into memory
2. LUTProcessImpl - performs the actual lookup processing once the reference data has been loaded
At runtime, each internal component is assigned to operators independently.

Job Compilation
1. Operators. These underlie the stages in a WebSphere DataStage job. A single stage may correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition or collect or sort data on the input link to a stage. At compilation, WebSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
2. OSH. This is the scripting language used internally by the WebSphere DataStage parallel engine.
3. Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).

The DataStage Designer client generates all code. It validates link requirements, mandatory stage options, transformer logic, etc., and then:
1. Generates the OSH representation of the job data flow and stages. GUI stages are representations of Framework operators. Stages in parallel shared containers are statically inserted in the job flow. Each server shared container becomes a dsjobsh operator.
2. Generates transform code for each parallel Transformer. This is compiled on the DataStage server into C++ and then into corresponding native operators. To improve compilation times, previously compiled Transformers that have not been modified are not recompiled. Force Compile recompiles all Transformers (use after client upgrades).
3. Buildop stages must be compiled manually within the GUI or using the buildop UNIX command line.

Viewing of generated OSH is enabled in DataStage Administrator. OSH is visible in:
1. Job Properties
2. Job run log
3. View Data
4. Table Definitions

Generated OSH Primer


Designer inserts comment blocks to assist in understanding the generated OSH. Note that operator order within the generated OSH is the order in which stages were added to the job canvas. OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition:
1. operator name
2. operator options (use -name value format)
3. schema (for generator, import, export)
4. inputs
5. outputs

The following data sources are supported as input/output:
1. Virtual data set (name.v)
2. Persistent data set (name.ds or [ds] name)
3. File sets (name.fs or [fs] name)
4. External files (name or [file] name)

Every operator's inputs and outputs are numbered sequentially starting from 0. For example:
op1 0> dst
op1 1< src
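To make the pieces above concrete, the fragment below sketches roughly how a generated OSH step for a simple generator-to-peek flow might be laid out. It is an illustrative sketch only, not verbatim Designer output; the comment blocks, stage identifiers, schema and record count are assumptions.

#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record ( a:int32; )
-records 10
## Outputs
0> 'Row_Generator_0:lnk_gen.v'
;
#### STAGE: Peek_1
## Operator
peek
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
;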

Terminology

Framework term              DataStage term
schema                      table definition
property                    format
type                        SQL type + length [and scale]
virtual dataset             link
record / field              row / column
operator                    stage
step, flow, OSH command     job
Framework                   DS engine

The GUI uses both terminologies. Log messages (info, warnings, errors) use the Framework terms.

Example Stage / Operator Mapping


Within Designer, stages represent operators, but there is not always a 1:1 correspondence:
- Sequential File - Source: import; Target: export
- Data Set: copy
- Sort (DataStage): tsort
- Aggregator: group
- Row Generator, Column Generator, Surrogate Key Generator: generator
- Oracle - Source: oraread; Sparse Lookup: oralookup; Target Load: orawrite; Target Upsert: oraupsert
- Lookup File Set - Target: lookup -createOnly

Runtime Architecture
Generated OSH and the configuration file are used to compose a job SCORE, similar to the way an RDBMS builds a query optimization plan. The score:
1. Identifies the degree of parallelism and node assignment for each operator
2. Inserts sorts and partitioners as needed to ensure correct results
3. Defines the connection topology (datasets) between adjacent operators
4. Inserts buffer operators to prevent deadlocks (e.g., fork-joins)
5. Defines the number of actual UNIX processes; where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements
6. Is used to fork UNIX processes with communication interconnects for data, messages, and control. Setting $APT_PM_SHOW_PIDS shows UNIX process IDs in the DataStage log.

It is only after these steps that processing begins. This is the startup overhead of an Enterprise Edition job. Job processing ends when:
- The last row (end of data) is processed by the final operator in the flow, or
- A fatal error is encountered by any operator, or
- The job is halted (SIGINT) by DataStage Job Control or human intervention (e.g., DataStage Director STOP)

Job Execution: The Orchestra

Conductor - the initial Framework process
- Composes the score
- Creates Section Leader processes (one per node)
- Consolidates messages to the DataStage log
- Manages orderly shutdown

Section Leader (one per node)
- Forks Player processes (one per stage)
- Manages up/down communication

Players - the actual processes associated with stages
- Combined players: one process only
- Send stderr and stdout to the Section Leader
- Establish connections to other players for data flow
- Clean up upon completion

Default communication:
- SMP: shared memory
- MPP: shared memory (within a hardware node) and TCP (across hardware nodes)

Introduction
What is IBM WebSphere DataStage?
1. Design jobs for ETL
2. Ideal tool for data integration projects
3. Import, export, create and manage metadata for use within jobs
4. Schedule, run and monitor jobs, all within DataStage
5. Administer your DataStage development and execution environments
6. Create batch (controlling) jobs

What are the components/applications in the IBM Information Server Suite?
1. DataStage
2. QualityStage
3. Metadata Server, consisting of Metadata Access Services and Metadata Analysis Services
4. Repository, which is DB2 by default
5. Business Glossary
6. Federation Server
7. Information Services Director
8. Information Analyzer
9. Information Server console

Explain the DataStage architecture.
The DataStage client components are:
Administrator - administers DataStage projects and conducts housekeeping on the server
Designer - creates DataStage jobs that are compiled into executable programs
Director - used to run and monitor DataStage jobs
The Repository is used to store DataStage objects. The Repository, which is DB2 by default, is shared by other applications in the Suite.

What are the uses of DataStage Administrator?
The Administrator is used to add and delete projects, and to set project properties. The Administrator also provides a command line interface to the DataStage repository.
Use the Administrator Project Properties window to:
1. Enable job administration in Director, enable runtime column propagation, set auto-purging options, protect the project and set environment variables on the General tab
2. Set user and group privileges on the Permissions tab
3. Enable or disable server-side tracing on the Tracing tab
4. Specify a username and password for scheduling jobs on the Schedule tab
5. Specify parallel job defaults on the Parallel tab
6. Specify job sequencer defaults on the Sequencer tab

Explain the DataStage development workflow.
1. Define project properties - Administrator
2. Open (attach to) your project
3. Import metadata that defines the format of data stores your jobs will read from or write to
4. Design the job - Designer
5. Compile and debug the job - Designer

6. Run and monitor the job - Director

What is the DataStage project repository?
All your work is stored in a DataStage project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata on your server. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them. Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from editing the same DataStage object (job, table definition, etc.) at the same time.

What are the different types of DataStage jobs?
Parallel jobs:
1. Executed by the DataStage parallel engine
2. Built-in functionality for pipeline and partition parallelism
3. Compiled into OSH (Orchestrate Scripting Language)
4. OSH executes operators (executes C++ class instances)
Server jobs:
1. Executed by the DataStage server engine
2. Compiled into BASIC
Job sequencers:
1. Master jobs that kick off jobs and other activities
2. Can kick off server or parallel jobs
3. Executed by the DataStage server engine

What are the design elements of parallel jobs?
Stages - implemented as OSH operators
Passive stages (E and L of ETL) - read/write data, e.g., Sequential File, DB2, Oracle, Peek stages
Active stages (T of ETL) - transform/filter/aggregate/generate/split/merge data, e.g., Transformer, Aggregator, Join, Sort stages
Links - pipes through which the data moves from stage to stage

What are the different types of parallelism?
Pipeline parallelism:
1. Transform, clean, load processes execute simultaneously
2. The downstream process starts while the upstream process is running
3. Reduces disk usage for staging areas
4. Keeps processors busy
5. Still has limits on scalability
Partition parallelism:
1. Divide the incoming stream of data into subsets (partitions) to be processed by the same operator
2. The operation is performed on each partition of data separately and in parallel
3. Facilitates near-linear scalability provided the data is evenly distributed
4. If the data is evenly distributed, the data will be processed n times faster on n nodes

Installation and Deployment


What gets deployed as part of an Information Server domain?
1. Metadata Server, hosted by an IBM WebSphere Application Server instance
2. One or more DataStage servers
3. One DB2 UDB instance containing the repository database
Additional server applications:
1. Business Glossary
2. Federation Server
3. Information Analyzer
4. Information Services Director
5. Rational Data Architect

What are the Information Server clients?
1. Administration console
2. Reporting console
3. DataStage clients - Administrator, Designer, Director

What are the different types of Information Server deployment?
1. Everything on one machine - all the applications in the domain are deployed on one machine
2. The domain is split between two machines - the DataStage server on one machine; the Metadata Server and DB2 repository on another
3. The domain is split between three machines - the DataStage server, Metadata Server and DB2 repository on three different machines
Additional DataStage servers can be part of this domain, but they would have to be separate from one another. There is also the possibility of additional DataStage player-node machines connected to the DataStage server machine using a high-speed network.

Which components should be running if the Application Server (hosting the Metadata Server) and the DataStage server are on different machines?
1. The Application Server
2. The ASB agent

Administering DataStage

Explain user and group management.
Suite authorization can be provided to users or groups. Users that are members of a group acquire the authorizations of the group. Authorizations are provided in the form of roles:
1. Suite roles
a. Administrator - performs user and group management tasks. Includes all the privileges of the Suite User role
b. User - create views of scheduled tasks and logged messages. Create and run reports

2. Suite component roles
a. DataStage Administrator - full permission to work in DataStage Administrator, Designer and Director
b. DataStage User - permissions are assigned within DataStage: Developer, Operator, Super Operator and Production Manager. A DataStage user cannot delete projects and cannot set permissions.

A user ID that is assigned Suite roles can immediately log onto the Information Server console. What about a user ID that is assigned a DataStage Suite component role? If the user ID is assigned the DataStage Administrator role, then the user will immediately acquire DataStage Administrator permission for all projects. If the user ID is assigned the DataStage User role, one more step is required: a DataStage administrator must assign a corresponding role to that user ID on the Permissions tab. When Suite users or groups have been assigned the DataStage Administrator role they automatically appear on the Permissions tab. Suite users or groups that have a DataStage User role need to be manually added.

Explain DataStage credential mapping.
All the Suite users without their own DataStage credentials will be mapped to this user ID and password. Here the username and password are demohawk/demohawk. demohawk is assumed to be a valid user on the DataStage server machine and has file permissions on the DataStage engine and project directories. Suite users can also be mapped individually to specific users. Note that demohawk need not be a Suite administrator or user.

What information is required to log in to DataStage Administrator?
Domain - host name and port number of the application server. Recall that multiple DataStage servers can exist in a domain, although they must be on different machines.
DataStage server - the server that has the DataStage projects you want to administer.

Explain the DataStage roles.
1. DataStage Developer - full access to all areas of a DataStage project
2. DataStage Operator - run and manage released DataStage jobs
3. DataStage Super Operator - can open the Designer and view the repository in read-only mode
4. DataStage Production Manager - create and manipulate protected projects

DataStage Designer

Explain import and export and their corresponding procedures.
Import and export are used for:
1. Backing up jobs and projects
2. Maintaining different versions of a job or project
3. Moving DataStage objects from one project to another
4. Sharing jobs and projects between developers

Export --> DataStage components

By default, objects are exported to a text file in a specific format; the default extension is .dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server. Objects can also be exported from the list of found objects using the search functionality.

Import --> DataStage components
Use Import all to begin the import process. Use Import selected to import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning. For large imports you may want to disable "Perform impact analysis"; this adds overhead to the import process.

Import --> Table Definitions
A table definition describes the columns and format of files and tables. Table definitions can be imported for:
1. Sequential files
2. Relational tables
3. COBOL files
4. XML
5. ODBC data sources, etc.
Table definitions can be loaded into job stages that access data with the same format. In this sense the metadata is reusable.

Creating Parallel Jobs

What is a parallel job?
A parallel job is an executable DataStage program created in DataStage Designer using components from the repository. It compiles into Orchestrate Scripting Language (OSH) and object code (from generated C++).
DataStage jobs are:
1. Designed and built in Designer
2. Scheduled, invoked and monitored in Director
3. Executed under the control of DataStage
Use the import process in Designer to import metadata defining sources and targets.

What are the benefits of renaming links and stages?
1. Documentation
2. Clarity
3. Fewer development errors

Explain the Row Generator stage.
1. Produces mock data
2. No input link; single output link
3. On the Properties tab, specify the number of rows
4. On the Columns tab, load or specify column definitions

You have a cluster of nodes available to run DataStage jobs. The network configuration between the servers is a private network with a 1 GB connection between each node. The public name is on a 100 MB network, which is what each hostname is identified with. In order to use the private network for communications between each node you need to use an alias for each node in the cluster. The Information Server Engine node (conductor node) is where the DataStage job starts. Which environment variable must be used to identify the hostname for the Engine node?
A. $APT_SERVER_ENGINE
B. $APT_ENGINE_NODE
C. $APT_PM_CONDUCTOR_HOSTNAME
D. $APT_PM_NETWORK_NAME
Answer: C

Which three privileges must the user possess when running a parallel job? (Choose three.)
A. read access to APT_ORCHHOME
B. execute permissions on local copies of programs and scripts
C. read/write permissions to the UNIX /etc directory
D. read/write permissions to APT_ORCHHOME
E. read/write access to disk and scratch disk resources
Answer: A,B,E

Which two tasks will create DataStage projects? (Choose two.)
A. Export and import a DataStage project from DataStage Manager.
B. Add new projects from DataStage Administrator.
C. Install the DataStage engine.
D. Copy a project in DataStage Administrator.
Answer: B,C

Which three defaults are set in DataStage Administrator? (Choose three.)
A. default prompting options, such as Autosave job before compile
B. default SMTP mail server name
C. project level default for Runtime Column Propagation
D. project level defaults for environment variables
E. project level default for Auto-purge of job log entries
Answer: C,D,E

Which two must be specified to manage Runtime Column Propagation? (Choose two.)
A. enabled in DataStage Administrator
B. attached to a table definition in DataStage Manager
C. enabled at the stage level
D. enabled with environmental parameters set at runtime
Answer: A,C

You are reading customer data using a Sequential File stage and transforming it using the Transformer stage. The Transformer is used to cleanse the data by trimming spaces from character fields in the input. The cleansed data is to be written to a target DB2 table. Which partitioning method would yield optimal performance without violating the business requirements?
A. Hash on the customer ID field
B. Round Robin
C. Random
D. Entire

Answer: B

A job contains a Sort stage that sorts a large volume of data across a cluster of servers. The customer has requested that this sorting be done on a subset of servers identified in the configuration file to minimize impact on database nodes. Which two steps will accomplish this? (Choose two.)
A. Create a sort scratch disk pool with a subset of nodes in the parallel configuration file.
B. Set the execution mode of the Sort stage to sequential.
C. Specify the appropriate node constraint within the Sort stage.
D. Define a non-default node pool with a subset of nodes in the parallel configuration file.
Answer: C,D

You have a compiled job and parallel configuration file. Which three methods can be used to determine the number of nodes actually used to run the job in parallel? (Choose three.)
A. within DataStage Designer, generate report and retain intermediate XML
B. within DataStage Designer, show performance statistics
C. within DataStage Director, examine log entry for parallel configuration file
D. within DataStage Director, examine log entry for parallel job score
E. within DataStage Director, open a new DataStage Job Monitor
Answer: C,D,E

Which environment variable, when set to true, causes a report to be produced which shows the operators, processes and data sets in the job?
A. APT_DUMP_SCORE
B. APT_JOB_REPORT
C. APT_MONITOR_SIZE
D. APT_RECORD_COUNTS
Answer: A

A job reads from a dataset using a DataSet stage. This data goes to a Transformer stage and then is written to a sequential file using a Sequential File stage. The default configuration file has 3 nodes. The job creating the dataset and the current job both use the default configuration file. How many instances of the Transformer run in parallel?
A. 3
B. 1
C. 7
D. 9
Answer: A

Your job reads from a file using a Sequential File stage running sequentially. The DataStage server is running on a single SMP system. One of the columns contains a product ID. In a Lookup stage following the Sequential File stage, you decide to look up the product description from a reference table. Which two partition settings would correctly find matching product descriptions? (Choose two.)
A. Hash algorithm, specifying the product ID field as the key, on both the link coming from the Sequential File stage and the link coming from the reference table.
B. Round Robin on both the link coming from the Sequential File stage and the link coming from the reference table.
C. Round Robin on the link coming from the Sequential File stage and Entire on the link coming from the reference table.
D. Entire on the link coming from the Sequential File stage and Hash, specifying the product ID field as the key, on the link coming from the reference table.
Answer: A,C

A job design consists of an input fileset followed by a Peek stage, followed by a Filter stage, followed by an output fileset. The environment variable APT_DISABLE_COMBINATION is set to true, and the job executes on an SMP using a configuration file with 8 nodes defined. Assume also that the input dataset was created with the same 8 node configuration file. Approximately how many data processing processes will this job create?
A. 32
B. 8
C. 16
D. 1
Answer: A

Which two statements are true of the column data types used in Orchestrate schemas? (Choose two.)
A. Orchestrate schema column data types are the same as those used in DataStage stages.
B. Examples of Orchestrate schema column data types are varchar and integer.
C. Examples of Orchestrate schema column data types are int32 and string[max=30].
D. OSH import operators are needed to convert data read from sequential files into schema types.
Answer: C,D

You have set the "Preserve Partitioning" flag for a Sort stage to request that the next stage preserves whatever partitioning it has implemented. Which statement describes what will happen next?
A. The job will compile but will abort when run.
B. The job will not compile.
C. The next stage can ignore this request but a warning is logged when the job is run depending on the stage type that ignores the flag.
D. The next stage disables the partition options that are normally available in the Partitioning tab.
Answer: C

What is the purpose of the uv command in a UNIX DataStage server?
A. Cleanup resources from a failed DataStage job.
B. Start and stop the DataStage engine.
C. Provide read access to a DataStage EE configuration file.
D. Report DataStage client connections.
Answer: B

Which two statements regarding the usage of data types in the parallel engine are correct? (Choose two.)
A. The best way to import RDBMS data types is using the ODBC importer.
B. The parallel engine will use its interpretation of the Oracle meta data (e.g., exact data types) based on interrogation of Oracle, overriding what you may have specified in the Columns tabs.
C. The best way to import RDBMS data types is using the Import Orchestrate Schema Definitions using orchdbutil.
D. The parallel engine and server engine have exactly the same data types so there is no conversion cost overhead from moving data between the engines.
Answer: B,C

Which two describe a DataStage EE installation in a clustered environment? (Choose two.)
A. The C++ compiler must be installed on all cluster nodes.
B. Transform operators must be copied to all nodes of the cluster.
C. The DataStage parallel engine must be installed or accessible in the same directory on all machines in the cluster.
D. A remote shell must be configured to support communication between the conductor and section leader nodes.
Answer: C,D

Which partitioning method would yield the most even distribution of data without duplication?
A. Entire
B. Round Robin
C. Hash
D. Random
Answer: B

Which three accurately describe the differences between a DataStage server root installation and a non-root installation? (Choose three.)
A. A non-root installation enables auto-start on reboot.
B. A root installation must specify the user "dsadm" as the DataStage administrative user.
C. A non-root installation inherits the permissions of the user who starts the DataStage services.
D. A root installation will start DataStage services in impersonation mode.
E. A root installation enables auto-start on reboot.
Answer: C,D,E

Your job reads from a file using a Sequential File stage running sequentially. You are using a Transformer following the Sequential File stage to format the data in some of the columns. Which partitioning algorithm would yield optimized performance?
A. Hash
B. Random
C. Round Robin
D. Entire
Answer: C

Which three UNIX kernel parameters have minimum requirements for DataStage installations? (Choose three.)
A. MAXUPROC - maximum number of processes per user
B. NOFILES - number of open files
C. MAXPERM - disk cache threshold
D. NOPROC - no process limit
E. SHMMAX - maximum shared memory segment size
Answer: A,B,E

Which partitioning method requires specifying a key?
A. Random
B. DB2
C. Entire
D. Modulus
Answer: D

When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator
Answer: A

Which statement is true when Runtime Column Propagation (RCP) is enabled?
A. DataStage Manager does not import meta data.
B. DataStage Director does not supply row counts in the job log.
C. DataStage Designer does not enforce mapping rules.
D. DataStage Administrator does not allow default settings for environment variables.
Answer: C

Persistent Storage
Sequential file stage
The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link. The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file). For fixed-width files, however, you can configure the stage to behave differently:

1. You can specify that a single file can be read by multiple nodes. This can improve performance on cluster systems.
2. You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).
(These two options are mutually exclusive.)

File This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file.
You can specify multiple files by repeating the File property

File pattern Specifies a group of files to import. Specify a file containing a list of files or a job parameter representing the file. The file could also be any valid shell expression, in Bourne shell syntax, that generates a list of file names.

Read method This property specifies whether you are reading from a specific file or files or using a file pattern to
select files (e.g., *.txt).

Missing file mode Specifies the action to take if one of your File properties has specified a file that does not exist.
Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a node name prefix of *: in which case it is OK. The default is Depends.

Keep file partitions Set this to True to partition the imported data set according to the organization of the input
file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject mode Allows you to specify behavior if a read record does not match the expected schema (record does not
match the metadata defined in column definition). Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Report progress Choose Yes or No to enable or disable reporting. By default the stage displays a progress report
at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.

Number of readers per node This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. Specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file.

This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP system.

Read from multiple nodes This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Number of readers per node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system. WebSphere DataStage knows the number of nodes available, and using the fixed-length record size and the actual size of the file to be read, allocates the reader on each node a separate region within the file to process. The regions will be of roughly equal size. Note that sequential row order cannot be maintained when reading a file in parallel.

File update mode This property defines how the specified file or files are updated. The same method applies to all
files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime. By default this property is set to Overwrite.

Using RCP With Sequential Stages Runtime column propagation (RCP) allows WebSphere DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask WebSphere DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, and so WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are:
1. Sequential File
2. File Set
3. External Source
4. External Target
5. Column Import
6. Column Export
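As an illustration, a schema file of the kind referenced by the Schema File property is a plain text file using the record syntax; the sketch below is an assumed example only (the field names and types are invented for illustration):

record (
  customer_id: int32;
  customer_name: string[max=30];
  balance: nullable decimal[10,2];
)

Each stage in the job that should propagate these columns would point at the same schema file.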

Improving Sequential File Performance


If the source file is fixed width, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained. If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage.

On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in kilobytes, with a default of 128 (128 KB). Increasing this may improve performance. Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations. Some disk arrays have read-ahead caches that are only effective when data is read repeatedly in like-sized chunks; setting APT_CONSISTENT_BUFFERIO_SIZE=N forces stages to read data in chunks which are size N or a multiple of N.
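A minimal sketch of how these variables might be set for a session is shown below; the values are illustrative only and should be tuned to the actual hardware:

$ export APT_IMPORT_BUFFER_SIZE=1024
$ export APT_EXPORT_BUFFER_SIZE=1024
$ export APT_CONSISTENT_BUFFERIO_SIZE=1048576

In practice these are usually defined as project-level or job-level environment variables in Administrator or Job Properties rather than in a shell.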

Partitioning Sequential File Reads


Care must be taken to choose the appropriate partitioning method for a Sequential File read. Don't read from a Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned). When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file's data is read into a separate partition. It is important to use round robin partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

Sequential File (Export) Buffering


By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty associated with the increased I/O.
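For example, a near-real-time job that must make every row visible on disk as soon as it is written might set the following (the value is illustrative):

$ export APT_EXPORT_FLUSH_COUNT=1

Batch jobs normally leave this variable unset so that writes remain fully buffered.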

Reading from and Writing to Fixed-Length Files


Particular attention must be paid when processing fixed-length fields using the Sequential File stage: If the incoming columns are variable-length data types (e.g., Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property. If a field is nullable, you must define the null field value and length in the Nullable section of the column properties. Double-click on the column number in the grid dialog to set these properties.

Data set stage


The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode. What is a data set? Parallel jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. The descriptor file for a data set contains the following information:

1. Data set header information.
2. Creation time and date of the data set.
3. The schema (metadata) of the data set.
4. A copy of the configuration file used when the data set was created.
Data Sets are the structured internal representation of data within the Parallel Framework. They consist of:
1. Framework schema (format = name, type, nullability)
2. Data records (the data itself)
3. Partitions (the subset of rows held by each node)
Virtual Data Sets exist in memory and correspond to links in DataStage Designer. Persistent Data Sets are stored on disk and consist of a descriptor file (metadata, configuration file, data file locations, flags) and multiple data files, one per node, stored in the disk resource file systems (for example node1:/local/disk1/, node2:/local/disk2/). There is no Data Set operator; the Designer GUI inserts a copy operator.
When to use persistent Data Sets: when writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints). They are stored in the native internal format (no conversion overhead), retain data partitioning and sort order (end-to-end parallelism across jobs), and give maximum performance through parallel I/O.
Why Data Sets are not intended for long-term or archive storage: the internal format is subject to change with new DataStage releases, they require access to named resources (node names, file system paths, etc.), and the binary format is platform-specific.
For fail-over scenarios, servers should be able to cross-mount file systems. You can read a data set as long as your current $APT_CONFIG_FILE defines the same node names (fastnames may differ). The orchadmin -x option lets you recover data from a data set if the node names are no longer available.

Data Set Management


1. Viewing the schema

Click the Schema icon from the tool bar to view the record schema of the current data set. This is presented in text form in the Record Schema window.

2. Viewing the data

Click the Data icon from the tool bar to view the data held by the current data set. This opens the Data Viewer Options dialog box, which allows you to select a subset of the data to view. Rows to display. Specify the number of rows of data you want the data browser to display. Skip count. Skip the specified number of rows before viewing data. Period. Display every Pth record, where P is the period; you can start after records have been skipped by using the Skip property. P must be greater than or equal to 1. Partitions. Choose between viewing the data in all partitions or the data in the partition selected from the drop-down list. Click OK to view the selected data; the Data Viewer window appears.
3. Copying data sets

Click the Copy icon on the tool bar to copy the selected data set. The Copy data set dialog box appears, allowing you to specify a path where the new data set will be stored. The new data set will have the same record schema, number of partitions and contents as the original data set. Note: You cannot use the UNIX cp command to copy a data set because WebSphere DataStage represents a single data set with multiple files.
4. Deleting data sets

Click the Delete icon on the tool bar to delete the current data set. You will be asked to confirm the deletion. Note: You cannot use the UNIX rm command to delete a data set because WebSphere DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.

Orchadmin Commands
Orchadmin is a command line utility provided by DataStage for managing persistent data sets. The general callable format is:
$orchadmin <command> [options] [descriptor file]
Before using orchadmin, you should make sure that either the working directory or $APT_ORCHHOME/etc contains the file config.apt, or that the environment variable $APT_CONFIG_FILE is defined for your session. The commands available with orchadmin are:

1. CHECK: $orchadmin check
Validates the configuration file contents: the accessibility of all nodes defined in the configuration file, scratch disk definitions, and so on. Throws an error when the configuration file is not found or not defined properly.

2. COPY: $orchadmin copy <source.ds> <destination.ds>
Makes a complete copy of the source data set under a new destination descriptor file name. Please note that:
a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name; the data is not copied.
b. The new data set is arranged according to the configuration file currently in use, not according to the old configuration file that was in use when the source was created.

3. DELETE: $orchadmin <delete | del | rm> [-f | -x] descriptorfiles
The UNIX rm utility cannot be used to delete data sets. The orchadmin delete (or rm) command should be used to delete one or more persistent data sets. The -f option forces the delete: if some nodes are not accessible, -f deletes the data set partitions on the accessible nodes and leaves the partitions on inaccessible nodes as orphans. The -x option uses the current configuration file when deleting, rather than the one stored in the data set.

4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds
This is the single most important command. Without any options it lists the number of partitions, number of segments, valid segments, and the preserve-partitioning flag of the persistent data set. Options:
-c : Print the configuration file stored in the data set, if any.
-p : List partition-level information.
-f : List file-level information in each partition.
-e : List segment-level information.
-s : List the metadata (schema) of the data set.
-v : List all segments, valid or otherwise.
-l : Long listing, equivalent to -f -p -s -v -e.

5. DUMP: $orchadmin dump [options] descriptorfile.ds
The dump command is used to dump (extract) the records from the data set. Without any options it lists all records, from the first record of the first partition to the last record of the last partition. Options:
-delim <string> : Use the given string as the field delimiter instead of a space.
-field <name> : List only the given field instead of all fields.
-name : Precede each value with the field name and a colon.
-n numrecs : List only the given number of records per partition.
-p N : List every Nth record from each partition, starting from the first record.
-skip N : Skip the first N records of each partition.
-x : Use the current system configuration file rather than the one stored in the data set.

6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds
Without options, deletes all the data (that is, the segments) from the data set. Options:
-f : Force truncate. Truncate accessible segments and leave the inaccessible ones.
-x : Use the current system configuration file rather than the one stored in the data set.
-n N : Leave the first N segments in each partition and truncate the remaining ones.

7. HELP: $orchadmin -help OR $orchadmin <command> -help
Displays the help manual for orchadmin or for an individual orchadmin command.
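As a sketch, a typical inspection session might look like the following; the descriptor path and configuration file location are hypothetical:

export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt
orchadmin check                                  # validate the configuration file
orchadmin describe -p -s /data/ds/customers.ds   # partition-level info plus record schema
orchadmin dump -name -n 5 /data/ds/customers.ds  # first 5 records per partition, values labelled
orchadmin rm /data/ds/customers.ds               # remove descriptor and data files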

File set stage


The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link. It executes in parallel mode only. What is a file set? WebSphere DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns. The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on: 1. The number of processing nodes in the default node pool 2. The number of disks in the export or default disk pool connected to each processing node in the default node pool 3. The size of the partitions of the data set. The File Set stage enables you to create and write to file sets, and to read data back from file sets. Unlike data sets, file sets carry formatting information that describes the format of the files to be read or written. File sets are similar to data sets in that they are partitioned and implemented with a header file plus data files. They differ from data sets in that their data files are text files, and hence readable by other applications, whereas the data files of data sets are stored in the native internal format and are readable only by DataStage.

Lookup file set stage


The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating Lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.

When performing lookups, Lookup File Set stages are used with Lookup stages. When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup.
http://www.dsxchange.com/viewtopic.php?t=113394 A Hashed File is only available in server jobs. It uses a hashing algorithm (without building an index) to determine the location of keys within its structure. It is not amenable to parallelism. The contents of a hashed file may be cached in memory when using the Hashed File stage to service a reference input link. New rows to be written to a hashed file may first be written to a memory cache, then flushed to disk. All writes to a hashed file using an existing key overwrite the previous row. Duplicate key values are not permitted. A Lookup File Set is only available in parallel jobs. It uses an index (based on a hash table) to determine the location of keys within its structure. It is a parallel structure; it has its records spread over the processing nodes specified when it was created. The records in the Lookup File Set are loaded into a virtual Data Set before use, and the index is also loaded into memory. Duplicate key values are (optionally) permitted. If the option is not selected, duplicates are rejected when writing to the Lookup File Set.
http://www.dsxchange.com/viewtopic.php?t=93287 I did testing on a Windows machine processing 100,000 primary rows against 100,000 lookup rows with a 1 to 1 match. Two key fields of char 255 and two non key fields also of char 255. I deliberately chose fat key fields. The dataset as a lookup took 2-3 minutes. The fileset as a lookup took about 40 seconds. Ran it a few times with the same results. One interesting result was memory utilisation: the fileset was consistently lighter than the dataset, by as much as 30% on RAM memory. This may be due to the keep/drop key field option of the fileset stage. If you set keep to false the key fields in the fileset are not loaded into memory as they are not required on the output side of the lookup. I am guessing that the fileset version was moving and storing 510 char less for each lookup than the dataset version. In a normal lookup these key fields travel up the reference link and back down it again; in a lookup fileset they only travel up. When I switch the same job onto an AIX box with several gig of RAM I get 7 seconds for the dataset and 4 for the fileset. With an increase to 500,000 rows I get 23 seconds for the dataset and 7 seconds for the fileset. This difference may not be so apparent if your key fields are shorter. The major drawback of a lookup fileset is that it doesn't have the Append option of a dataset; you can only overwrite it.

Creating a lookup file set


1. In the Input Link Properties Tab: Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to specify multiple key columns. You must specify the key when you create the file set; you cannot specify it when performing the lookup. Specify the name of the Lookup File Set. Specify a lookup range, or accept the default setting of No. Set Allow Duplicates, or accept the default setting of False. 2. Ensure column meta data has been specified for the lookup file set.

Looking up a lookup file set

1. In the Output Link Properties Tab specify the name of the lookup file set being used in the lookup. 2. Ensure column meta data has been specified for the lookup file set. By default the stage will write to the file set in entire mode. The complete data set is written to each partition. If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method.

Complex Flat File stage


The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but you cannot use the same stage to do both. As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one or more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read data from files that contain multiple record types. The source data can contain one or more of the following clauses: 1. GROUP 2. REDEFINES 3. OCCURS 4. OCCURS DEPENDING ON CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the stage to run sequentially if it is reading only one file with a single reader. As a target, the CFF stage can have a single input link and a single reject link. You can write data to one or more complex flat files. You cannot write to MVS data sets or to files that contain multiple record types.

Editing a Complex Flat File stage as a source


To edit a CFF stage as a source, you must provide details about the file that the stage will read, create record definitions for the data, define the column metadata, specify record ID constraints, and select output columns. To edit a CFF stage as a source: 1. Open the CFF stage editor 2. On the Stage page, specify information about the stage data: a. On the File Options tab, provide details about the file that the stage will read. b. On the Record Options tab, describe the format of the data in the file. c. If the stage is reading a file that contains multiple record types, on the Records tab, create record definitions for the data. d. On the Records tab, create or load column definitions for the data. e. If the stage is reading a file that contains multiple record types, on the Records ID tab, define the record ID constraint for each record. f. Optional: On the Advanced tab, change the processing settings. 3. On the Output page, specify how to read data from the source file: a. On the Selection tab, select one or more columns for each output link. b. Optional: On the Constraint tab, define a constraint to filter the rows on each output link. c. Optional: On the Advanced tab, change the buffering settings. 4. Click OK to save your changes and to close the CFF stage editor.

Creating record definitions


If you are reading data from a file that contains multiple record types, you must create a separate record definition for each type. COBOL copybooks with multiple record types can be imported as a COBOL file definition (e.g. Insurance.cfd). Each record type is stored as a separate DataStage table definition (e.g. if Insurance.cfd has 3 record types for Client, Policy and Coverage, there will be 3 table definitions, one for each record type). To create record definitions: 1. Click the Records tab on the Stage page. 2. Clear the Single record check box. 3. Right-click the default record definition RECORD_1 and select Rename Current Record. 4. Type a new name for the default record definition.

5. Add another record by clicking one of the buttons at the bottom of the records list. Each button offers a different insertion point. A new record is created with the default name of NEWRECORD. 6. Double-click NEWRECORD to rename it. 7. Repeat steps 3 and 4 for each new record that you need to create. 8. Right-click the master record in the list and select Toggle Master Record. Only one master record is permitted.

Column definitions
You must define columns to specify what data the CFF stage will read or write. If the stage will read data from a file that contains multiple record types, you must first create record definitions on the Records tab. If the source file contains only one record type, or if the stage will write data to a target file, then the columns belong to the default record called RECORD_1. You can load column definitions from a table in the repository, or you can type column definitions into the columns grid. You can also define columns by dragging a table definition from the Repository window to the CFF stage icon on the Designer canvas.

Loading columns
The fastest way to define column metadata is to load columns from a table definition in the repository. To load columns: 1. Click the Records tab on the Stage page. 2. Click Load to open the Table Definitions window. This window displays all of the repository objects that are in the current project. 3. Select a table definition in the repository tree and click OK. 4. Select the columns to load in the Select Columns From Table window and click OK. 5. If flattening is an option for any arrays in the column structure, specify how to handle array data in the Complex File Load Option window.

Typing columns
You can also define column metadata by typing column definitions in the columns grid. To type columns: 1. Click the Records tab on the Stage page. 2. In the Level number field of the grid, specify the COBOL level number where the data is defined. If you do not specify a level number, a default value of 05 is used. 3. In the Column name field, type the name of the column. 4. In the Native type field, select the native data type. 5. In the Length field, specify the data precision. 6. In the Scale field, specify the data scale factor. 7. Optional: In the Description field, type a description of the column.

Defining record ID constraints


If you are using the CFF stage to read data from a file that contains multiple record types, you must specify a record ID constraint to identify the format of each record. Columns that are identified in the record ID clause must be in the same physical storage location across records. The constraint must be a simple equality expression, where a column equals a value. To define a record ID constraint: 1. Click the Records ID tab on the Stage page. 2. Select a record from the Records list.

3. Select the record ID column from the Column list. This list displays all columns from the selected record,
except the first OCCURS DEPENDING ON (ODO) column and any columns that follow it.

4. Select the = operator from the Op list. 5. Type the identifying value for the record ID column in the Value field. Character values must be enclosed in
single quotation marks.
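For example (the column and record names here are hypothetical), a copybook whose Client and Policy records carry a one-character type indicator might use the constraints REC_TYPE = 'C' for the Client record and REC_TYPE = 'P' for the Policy record.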

Selecting output columns


By selecting output columns, you specify which columns from the source file the CFF stage should pass to the output links. You can select columns from multiple record types to output from the stage. If you do not select columns to output on each link, the CFF stage automatically propagates all of the stage columns except group columns to each empty output link when you click OK to exit the stage. To select output columns: 1. Click the Selection tab on the Output page. 2. If you have multiple output links, select the link that you want from the Output name list.

Defining output link constraints


By defining a constraint, you can filter the data on each output link from the CFF stage. You can set the output link constraint to match the record ID constraint for each selected output record by clicking Default on the Constraint tab on the Output page. The Default button is available only when the constraint grid is empty. To define an output link constraint: 1. Click the Constraint tab on the Output page. 2. In the ( field of the grid, select an opening parenthesis if needed. You can use parentheses to specify the order in evaluating a complex constraint expression. 3. In the Column field, select a column or job parameter. (Group columns cannot be used in constraint expressions and are not displayed.) 4. In the Op field, select an operator or a logical function. 5. In the Column/Value field, select a column or job parameter, or double-click in the cell to type a value. Enclose character values in single quotation marks. 6. In the ) field, select a closing parenthesis if needed. 7. If you are building a complex expression, in the Logical field, select AND or OR to continue the expression in the next row. 8. Click Verify. If errors are found, you must either correct the expression, click Clear All to start over, or cancel. You cannot save an incorrect constraint.

Editing a Complex Flat File stage as a target


To edit a CFF stage as a target, you must provide details about the file that the stage will write, define the record format of the data, and define the column metadata. To edit a CFF stage as a target: 1. Open the CFF stage editor. 2. On the Stage page, specify information about the stage data: a. On the File Options tab, provide details about the file that the stage will write. b. On the Record Options tab, describe the format of the data in the file. c. On the Records tab, create or load column definitions for the data d. Optional: On the Advanced tab, change the processing settings. 3. Optional: On the Input page, specify how to write data to the target file: a. On the Advanced tab, change the buffering settings.

b. On the Partitioning tab, change the partitioning settings.

4. Click OK to save your changes and to close the CFF stage editor.

Reject links
The CFF stage can have a single reject link, whether you use the stage as a source or a target. For CFF source stages, reject links are supported only if the source file contains a single record type without any OCCURS DEPENDING ON (ODO) columns. For CFF target stages, reject links are supported only if the target file does not contain ODO columns. You cannot change the selection properties of a reject link. The Selection tab for a reject link is blank. You cannot edit the column definitions for a reject link. For writing files, the reject link uses the input link column definitions. For reading files, the reject link uses a single column named rejected that contains raw data for the columns that were rejected after reading because they did not match the schema.

FTP Enterprise Stage


The FTP Enterprise stage transfers multiple files in parallel. These are sets of files that are transferred from one or more FTP servers into WebSphere DataStage or from WebSphere DataStage to one or more FTP servers. The source or target for the file is identified by a URI (Universal Resource Identifier). The FTP Enterprise stage invokes an FTP client program and transfers files to or from a remote host using the FTP Protocol. URI Is a pathname connecting the Stage to a target file on a remote host. It has the Open dependent property. You can repeat this property to specify multiple URIs. You can specify an absolute or a relative pathname. Open command Is required if you perform any operation besides navigating to the directory where the file exists. There can be multiple Open commands. This is a dependent property of URI. ftp command Is an optional command that you can specify if you do not want to use the default ftp command. For example, you could specify /opt/gnu/bin/wuftp. You can enter the path of the command (on the server) directly in this field. You can also specify a job parameter if you want to be able to specify the ftp command at run time. User Name Specify the user name for the transfer. You can enter it directly in this field, or you can specify a job parameter if you want to be able to specify the user name at run time. You can specify multiple user names. User1 corresponds to URI1 and so on. When the number of users is less than the number of URIs, the last user name is set for remaining URIs. If no User Name is specified, the FTP Enterprise Stage tries to use .netrc file in the home directory. Password Enter the password in this field. You can also specify a job parameter if you want to be able to specify the password at run time. Specify a password for each user name. Password1 corresponds to URI1. When the number of passwords is less than the numbers of URIs, the last password is set for the remaining URIs. Transfer Protocol Select the type of FTP service to transfer files between computers. You can choose either FTP or Secure FTP (SFTP). 1. FTP Select this option if you want to transfer files using the standard FTP protocol. This is a nonsecure protocol. By default FTP enterprise stage uses this protocol to transfer files.

2. Secure FTP (SFTP) Select this option if you want to transfer files between computers in a secured channel.
Secure FTP (SFTP) uses the SSH (Secured Shell) protected channel for data transfer between computers over a nonsecure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure the SSH connection without any pass phrase for RSA authentication. Force Parallelism You can set either Yes or No. In general, the FTP Enterprise stage tries to start as many processes as needed to transfer the n files in parallel. However, you can force the parallel transfer of data by specifying this property to yes. This allows m number of processes at a time where m is the number specified in WebSphere DataStage configuration file. If m is less than n, then the stage waits to transfer the first m files and then start the next m until n files are transferred. When you set Force Parallelism to Yes, you should only give one URI. Overwrite Set this option to have any existing files overwritten by this transfer. Restartable Mode When you specify a restartable mode of Restartable transfer, WebSphere DataStage creates a directory for recording information about the transfer in a restart directory. If the transfer fails, you can run an identical job with the restartable mode property set to Restart transfer, which will reattempt the transfer. If the transfer repeatedly fails, you can run an identical job with the restartable mode option set to Abandon transfer, which will delete the restart directory Restartable mode has the following dependent properties: 1. Job Id Identifies a restartable transfer job. This is used to name the restart directory. 2. Checkpoint directory Optionally specifies a checkpoint directory to contain restart directories. If you do not specify this, the current working directory is used. For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint the files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100. Schema file Contains a schema for storing data. Setting this option overrides any settings on the Columns tab. You can enter the path name of a schema file, or specify a job parameter, so the schema file name can be specified at run time. Transfer Type Select a data transfer type to transfer files between computers. You can select either the Binary or ASCII mode of data transfer. The default data transfer mode is binary. When reading a delimited Sequential File, you are instructed to interpret two contiguous field delimiters as NULL for the corresponding field regardless of data type. Which three actions must you take? (Choose three.) A. Set the data type to Varchar. B. Set the field to nullable. C. Set the "NULL Field Value" to two field delimiters (e.g., "||" for pipes). D. Set the "NULL Field Value" to ''. E. Set the environment variable $APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL. Answer: B,D,E $APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL - When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.
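A minimal sketch of the environment-variable part of that answer, assuming the usual convention that simply defining the variable is enough to enable the behavior:

export APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL=1   # permit zero-length null_field values on fixed-length fields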

Which two attributes are found in a Data Set descriptor file? (Choose two.) A. A copy of the job score. B. The schema of the Data Set. C. A copy of the partitioned data. D. A copy of the configuration file used when Data Set was created. Answer: B,D When importing a COBOL file definition, which two are required? (Choose two.) A. The file you are importing is accessible from your client workstation. B. The file you are importing contains level 01 items. C. The column definitions are in a COBOL copybook file and not, for example, in a COBOL source file. D. The file does not contain any OCCURS DEPENDING ON clauses. Answer: A,B Which three features of datasets make them suitable for job restart points? (Choose three.) A. They are indexed for fast data access. B. They are partitioned. C. They use datatypes that are in the parallel engine internal format. D. They are persistent. E. They are compressed to minimize storage space. Answer: B,C,D Which statement describes a process for capturing a COBOL copybook from a z/OS system? A. FTP the COBOL copybook to the server platform in text mode and capture the metadata through Manager. B. Select the COBOL copybook using the Browse button and capture the COBOL copybook with Manager. C. FTP the COBOL copybook to the client workstation in text mode and capture the copybook with Manager. D. FTP the COBOL copybook to the client workstation in binary and capture the metadata through Manager. Answer: C The high performance ETL server on which DataStage EE is installed is networked with several other servers in the IT department with a very high bandwidth switch. A list of seven files (all of which contain records with the same record layout) must be retrieved from three of the other servers using FTP. Given the high bandwidth network and high performance ETL server, which approach will retrieve and process all seven files in the minimal amount of time? A. In a single job, use seven separate FTP Enterprise stages the output links of which lead to a single Sort Funnel stage, then process the records without landing to disk. B. Setup a sequence of seven separate DataStage EE jobs, each of which retrieves a single file and appends to a common dataset, then process the resulting dataset in an eighth DataStage EE job. C. Use three FTP Plug-in stages (one for each machine) to retrieve the seven files and store them to a single file on the fourth server, then use the FTP Enterprise stage to retrieve the single file and process the records without landing to disk. D. Use a single FTP Enterprise stage and specify seven URI properties, one for each file, then process the records without landing to disk. Answer: D An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage? A. No special settings are required. XML Input stage automatically detects the repetition element from the XPath expression. B. Set the "Key" property for the column on the output link to "Yes".

C. Check the "Repetition Element Required" box on the output link tab. D. Set the "Nullable" property for the column on the output link to "Yes". Answer: B Using FTP, a file is transferred from an MVS system to a LINUX system in binary transfer mode. Which data conversion must be used to read a packed decimal field in the file? A. treat the field as a packed decimal B. packed decimal fields are not supported C. treat the field as ASCII D. treat the field as EBCDIC Answer: A When a sequential file is read using a Sequential File stage, the parallel engine inserts an operator to convert the data to the internal format. Which operator is inserted? A. import operator B. copy operator C. tsort operator D. export operator Answer: A Which type of file is both partitioned and readable by external applications? A. fileset B. Lookup fileset C. dataset D. sequential file Answer: A Which two statements are true about XML Meta Data Importer? (Choose two.) A. XML Meta Data Importer is capable of reporting syntax and semantic errors from an XML file. B. XPATH expressions that are created during XML metadata import cannot be modified. C. XML Meta Data Importer can import Table Definitions from only XML documents. D. XPATH expressions that are created during XML metadata import are used by XML Input stage and XML Output stage. Answer: A,D Which two statements are correct about XML stages and their usage? (Choose two.) A. XML Input stage converts XML data to tabular format. B. XML Output stage converts tabular data to XML hierarchical structure. C. XML Output stage uses XSLT stylesheet for XML to tabular transformations. D. XML Transformer stage converts XML data to tabular format. Answer: A,B Which "Reject Mode" option in the Sequential File stage will write records to a reject link? A. Output B. Fail C. Drop D. Continue Answer: A

A single sequential file exists on a single node. To read this sequential file in parallel, what should be done? A. Set the Execution mode to "Parallel". B. A sequential file cannot be read in parallel using the Sequential File stage. C. Select "File Pattern" as the Read Method. D. Set the "Number of Readers Per Node" optional property to a value greater than 1. Answer: D When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted? A. export operator B. copy operator C. import operator D. tsort operator Answer: A A bank receives daily credit score updates from a credit agency in the form of a fixed width flat file. The monthly_income column is an unsigned nullable integer (int32) whose width is specified as 10, and null values are represented as spaces. Which Sequential File property will properly import any nulls in the monthly_income column of the input file? A. Set the record level fill char property to the space character (' '). B. Set the null field value property to a single space (' '). C. Set the C_format property to '"%d. 10"'. D. Set the null field value property to ten spaces (' '). Answer: D An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage? A. Set the "Nullable" property for the column on the output link to "Yes". B. Set the "Key" property for the column on the output link to "Yes". C. Check the "Repetition Element Required" box on the output link tab. D. No special settings are required. XML Input stage automatically detects the repetition element from the XPath expression. Answer: B During a sequential file read, you experience an error with the data. What is a valid technique for identifying the column causing the difficulty? A. Set the "data format" option to text on the Record Options tab. B. Enable tracing in the DataStage Administrator Tracing panel. C. Enable the "print field" option at the Record Options tab. D. Set the APT_IMPORT_DEBUG environmental variable. Answer: C On which two does the number of data files created by a fileset depend ? (Choose two.) A. the size of the partitions of the dataset B. the number of CPUs C. the schema of the file D. the number of processing nodes in the default node pool Answer: A,D What are two ways to delete a persistent parallel dataset? (Choose two.) A. standard UNIX command rm B. orchadmin command rm

C. delete the dataset Table Definition in DataStage Manager D. delete the dataset in Data Set Manager Answer: B,D A parts supplier has a single fixed width sequential file. Reading the file has been slow, so the supplier would like to try to read it in parallel. If the job executes using a configuration file consisting of four nodes, which two Sequential File stage settings will cause the DataStage parallel engine to read the file using four parallel readers? (Choose two.) (Note: Assume the file path and name is /data/parts_input.txt.) A. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 2. B. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the read from multiple nodes option to yes. C. Set read method to file pattern, and set the file pattern property to '/data/(@PART_COUNT)parts_input.txt'. D. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 4. Answer: B,D

Data Transformation
Transformer Stage
Transformer stages can have a single input and any number of outputs. They can also have a reject link that takes any rows which have not been written to any of the output links by reason of a write failure or expression evaluation failure. In order to write efficient Transformer stage derivations, it is useful to understand what items get evaluated and when. The evaluation sequence is as follows:
Evaluate each stage variable initial value
For each input row to process:
  Evaluate each stage variable derivation value, unless the derivation is empty
  For each output link:
    Evaluate each column derivation value
    Write the output record
  Next output link
Next input row
The stage variables and the columns within a link are evaluated in the order in which they are displayed on the parallel job canvas. Similarly, the output links are also evaluated in the order in which they are displayed. System variables: WebSphere DataStage provides a set of variables containing useful system information that you can access from an output derivation or constraint.

1. @FALSE The value is replaced with 0.
2. @TRUE The value is replaced with 1.
3. @INROWNUM Input row counter.
4. @OUTROWNUM Output row counter (per link).
5. @NUMPARTITIONS The total number of partitions for the stage.
6. @PARTITIONNUM The partition number for the particular instance.

Triggers tab The Triggers tab allows you to choose routines to be executed at specific execution points as the transformer stage runs in a job. The execution point is per-instance, i.e., if a job has two transformer stage instances running in parallel, the routine will be called twice, once for each instance. The available execution points are Before-stage and After-stage. At this release, the only available built-in routine is SetCustomSummaryInfo. You can also define custom routines to be executed; to do this you define a C function, make it available in UNIX shared library, and then define a Parallel routine which calls it (see WebSphere DataStage Designer Client Guide for details on defining a Parallel Routine). Note that the function should not return a value.

A constraint otherwise link can be defined by:
1. Clicking on the Otherwise/Log field so a tick appears and leaving the Constraint fields blank. This will catch any rows that have failed to meet constraints on all the previous output links.
2. Setting the constraint to OTHERWISE. This will be set whenever a row is rejected on a link because the row fails to match a constraint. OTHERWISE is cleared by any output link that accepts the row.
3. The otherwise link must occur after the output links in link order so it will catch rows that have failed to meet the constraints of all the output links. If it is not last, rows which satisfy a constraint on a later link may be sent down the otherwise link as well as down that later link.
4. Clicking on the Otherwise/Log field so a tick appears and defining a Constraint. This will result in the number of rows written to that link (i.e. rows which satisfy the constraint) being recorded in the job log as a warning message.
Note: You can also specify a reject link which will catch rows that have not been written on any output links due to a write error or null expression error. Define this outside the Transformer stage by adding a link and using the shortcut menu to convert it to a reject link.
Conditionally Aborting a Job: Use the Abort After Rows setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job. You can specify an abort condition for any output link. The abort occurs after the specified number of rows occurs in one of the partitions. When the Abort After Rows threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or un-flushed file buffers.
Functions and Operators: the concatenation operator is ":" and the substring operator is Input_String[starting position, length]. String functions include Len(<string>), Trim(<string>) and UpCase/DownCase(<string>). Null handling functions include IsNull, IsNotNull, NullToValue, NullToZero and SetNull(). Type conversions include StringToTimestamp and StringToDecimal. (A brief derivation sketch follows the list below.)
Using Transformer stages: In general, it is good practice not to use more Transformer stages than you have to. You should especially avoid using multiple Transformer stages where the logic can be combined into a single stage. It is often better to use other stage types for certain types of operation:
1. Use a Copy stage rather than a Transformer for simple operations such as: providing a job design placeholder on the canvas (provided you do not set the Force property to True on the Copy stage, the copy will be optimized out of the job at run time); renaming columns; dropping columns; implicit type conversions. Note that, if runtime column propagation is disabled, you can also use output mapping on a stage to rename, drop, or convert columns on a stage that has both inputs and outputs.
2. Use the Modify stage for explicit type conversion and null handling.
3. Where complex, reusable logic is required, or where existing Transformer-stage based job flows do not meet performance requirements, consider building your own custom stage.

4. Use a BASIC Transformer stage where you want to take advantage of user-defined functions and routines.
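As an illustration of the functions and operators listed above (link and column names are hypothetical, and this is only a sketch of typical derivation syntax), an output column derivation might read:

If IsNull(lnk_in.cust_name) Then 'UNKNOWN' Else UpCase(Trim(lnk_in.cust_name))

and a numeric column could be protected with NullToZero(lnk_in.credit_limit) before being used in arithmetic.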

SCD Stage
The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements. SCD stages support both SCD Type 1 and SCD Type 2 processing: 1. SCD Type 1 Overwrites an attribute in a dimension table. 2. SCD Type 2 Adds a new row to a dimension table. Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table. All of the columns that are needed to create a new dimension row must be present in the source data.

Purpose codes in a Slowly Changing Dimension stage


Purpose codes are an attribute of dimension columns in SCD stages. Purpose codes are used to build the lookup table, to detect dimension changes, and to update the dimension table.
Building the lookup table: The SCD stage uses purpose codes to determine how to build the lookup table for the dimension lookup. If a dimension has only Type 1 columns, the stage builds the lookup table by using all dimension rows. If any Type 2 columns exist, the stage builds the lookup table by using only the current rows. If a dimension has a Current Indicator column, the stage uses the derivation value of this column on the Dim Update tab to identify the current rows of the dimension table. If a dimension does not have a Current Indicator column, then the stage uses the Expiration Date column and its derivation value to identify the current rows. Any dimension columns that are not needed are not used. This technique minimizes the amount of memory that is required by the lookup table.
Detecting dimension changes: Purpose codes are also used to detect dimension changes. The SCD stage compares Type 1 and Type 2 column values to source column values to determine whether to update an existing row, insert a new row, or expire a row in the dimension table.
Updating the dimension table: Purpose codes are part of the column metadata that the SCD stage propagates to the dimension update link. You can send this column metadata to a database stage in the same job, or you can save the metadata on the Columns tab and load it into a database stage in a different job. When the database stage uses the auto-generated SQL option to perform inserts and updates, it uses the purpose codes to generate the correct SQL statements.

Selecting purpose codes


Purpose codes specify how the SCD stage should process dimension data. Purpose codes apply to columns on the dimension reference link and on the dimension update link. Select purpose codes according to the type of columns in a dimension:

1. If a dimension contains a Type 2 column, you must select a Current Indicator column, an Expiration Date
column, or both. An Effective Date column is optional. You cannot assign Type 2 and Current Indicator to the same column. 2. If a dimension contains only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or SK Chain columns are allowed.

Purpose code definitions


The SCD stage provides nine purpose codes to support dimension processing.

1. (blank) The column has no SCD purpose. This purpose code is the default.
2. Surrogate Key The column is a surrogate key that is used to identify dimension records.
3. Business Key The column is a business key that is typically used in the lookup condition.
4. Type 1 The column is an SCD Type 1 field. SCD Type 1 column values are always current. When changes occur, the SCD stage overwrites existing values in the dimension table.
5. Type 2 The column is an SCD Type 2 field. SCD Type 2 column values represent a point in time. When changes occur, the SCD stage creates a new dimension row.
6. Current Indicator (Type 2) The column is the current record indicator for SCD Type 2 processing. Only one Current Indicator column is allowed.
7. Effective Date (Type 2) The column is the effective date for SCD Type 2 processing. Only one Effective Date column is allowed.
8. Expiration Date (Type 2) The column is the expiration date for SCD Type 2 processing. An Expiration Date column is required if there is no Current Indicator column; otherwise it is optional.
9. SK Chain The column is used to link a record to the previous record or the next record by using the value of the Surrogate Key column. Only one Surrogate Key column can exist if you have an SK Chain column.

Surrogate keys in a Slowly Changing Dimension stage


Surrogate keys are used to join a dimension table to a fact table in a star schema database. When the SCD stage performs a dimension lookup, it retrieves the value of the existing surrogate key if a matching record is found. If a match is not found, the stage obtains a new surrogate key value by using the derivation of the Surrogate Key column on the Dim Update tab. If you want the SCD stage to generate new surrogate keys by using a key source that you created with a Surrogate Key Generator stage, you must use the NextSurrogateKey function to derive the Surrogate Key column. If you want to use your own method to handle surrogate keys, you should derive the Surrogate Key column from a source column. You can replace the dimension information in the source data stream with the surrogate key value by mapping the Surrogate Key column to the output link.

Specifying information about a key source


If you created a key source with a Surrogate Key Generator stage, you must specify how the SCD stage should use the source to generate surrogate keys. The key source can be a flat file or a database sequence. The key source must exist before the job runs. If the key source is a flat file, the file must be accessible from all nodes that run the SCD stage.

To use the key source: 1. On the Input page, select the reference link in the Input name field.

2. Click the Surrogate Key tab. 3. In the Source type field, select the source type. 4. In the Source name field, type the name of the key source, or click the arrow button to browse for a file or to insert a job parameter. If the source is a flat file, type the name and fully qualified path of the state file, such as C:/SKG/ProdDim. If the source is a database sequence, type the name of the sequence, such as PRODUCT_KEY_SEQ. 5. Provide additional information about the key source according to the type: If the source is a flat file, specify information in the Flat File area. If the source is a database sequence, specify information in the DB sequence area. Calls to the key source are made by the NextSurrogateKey function. On the Dim Update tab, create a derivation that uses the NextSurrogateKey function for the column that has a purpose code of Surrogate Key. The NextSurrogateKey function returns the value of the next surrogate key when the SCD stage creates a new dimension row. A DataStage job contains a parallel Transformer with a single input link and a single output link. The Transformer has a constraint that should produce 1000 records, however only 900 came out through the output link. What should be done to identify the missing records? A. Turn trace on using DataStage Administrator. B. Add a Reject link to the Transformer stage. C. Scan generated osh script for possible errors. D. Remove the constraint on the output link. Answer: B Which three actions are performed using stage variables in a parallel Transformer stage? (Choose three.) A. A function can be executed once per record. B. A function can be executed once per run. C. Identify the first row of an input group. D. Identify the last row of an input group. E. Lookup up a value from a reference dataset. Answer: A,B,C Which two system variables must be used in a parallel Transformer derivation to generate a unique sequence of integers across partitions? (Choose two.) A. @PARTITIONNUM B. @INROWNUM C. @DATE D. @NUMPARTITIONS Answer: A,D What would require creating a new parallel Custom stage rather than a new parallel BuildOp stage? A. A Custom stage can be created with properties. BuildOp stages cannot be created with properties. B. In a Custom stage, the number of input links does not have to be fixed, but can vary, for example from one to two. BuildOp stages require a fixed number of input links. C. Creating a Custom stage requires knowledge of C/C++. You do not need knowledge of C/C++ to create a BuildOp stage. D. Custom stages can be created for parallel execution. BuildOp stages can only be built to run sequentially. Answer: B

Your input rows contain customer data from a variety of locations. You want to select just those rows from a specified location based on a parameter value. You are trying to decide whether to use a Transformer or a Filter stage to accomplish this. Which statement is true? A. The Transformer stage will yield better performance because the Filter stage Where clause is interpreted at runtime. B. You cannot use a Filter stage because you cannot use parameters in a Filter stage Where clause. C. The Filter stage will yield better performance because it has less overhead than a Transformer stage. D. You cannot use the Transformer stage because you cannot use parameters in a Transformer stage constraint. Answer: A In a Transformer you add a new column to an output link named JobName that is to contain the name of the job that is running. What can be used to derive values for this column? A. a DataStage function B. a link variable C. a system variable D. a DataStage macro Answer: D Which statement describes how to add functionality to the Transformer stage? A. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of a function written and compiled in C++. B. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of an external program. C. Create a new server routine in the Routines category that specifies the name and category of a function written in DataStage Basic. D. Edit the C++ code generated by the Transformer stage. Answer: A Which three statements about the Enterprise Edition parallel Transformer stage are correct? (Choose three.) A. The Transformer allows you to copy columns. B. The Transformer allows you to do lookups. C. The Transformer allows you to apply transforms using routines. D. The Transformer stage automatically applies 'NullToValue' function to all non-nullable output columns. E. The Transformer allows you to do data type conversions. Answer: A,C,E Which two stages allow field names to be specified using job parameters? (Choose two.) A. Transformer stage B. Funnel stage C. Modify stage D. Filter stage Answer: C,D The parallel dataset input into a Transformer stage contains null values. What should you do to properly handle these null values? A. Convert null values to a valid values in a stage variable. B. Convert null values to a valid value in the output column derivation. C. Null values are automatically converted to blanks and zero, depending on the target data type. D. Trap the null values in a link constraint to avoid derivations. Answer: A

Which two would require the use of a Transformer stage instead of a Copy stage? (Choose two.) A. Drop a column. B. Send the input data to multiple output streams. C. Trim spaces from a character field. D. Select certain output rows based on a condition. Answer: C,D In which situation should a BASIC Transformer stage be used in a DataStage EE job? A. in a job containing complex routines migrated from DataStage Server Edition B. in a job requiring lookups to hashed files C. in a large-volume job flow D. in a job requiring complex, reusable logic Answer: A You have three output links coming out of a Transformer. Two of them (A and B) have constraints you have defined. The third you want to be an Otherwise link that is to contain all of the rows that do not satisfy the constraints of A and B. This Otherwise link must work correctly even if the A and B constraints are modified. Which two are required? (Choose two.) A. The Otherwise link must be first in the link ordering. B. A constraint must be coded for the Otherwise link. C. The Otherwise link must be last in the link ordering. D. The Otherwise check box must be checked. Answer: C,D Which two statements are true about DataStage Parallel Buildop stages? (Choose two.) A. Unlike standard DataStage stages they do not have properties. B. They are coded using C/C++. C. They are coded using DataStage Basic. D. Table Definitions are used to define the input and output interfaces of the BuildOp. Answer: B,D

Job Control and Run time Management


Message Handlers
When you run a parallel job, any error messages and warnings are written to an error log and can be viewed from the Director. You can choose to handle specified errors in a different way by creating one or more message handlers. A message handler defines rules about how to handle messages generated when a parallel job is running. You can, for example, use one to specify that certain types of message should not be written to the log. You can edit message handlers in the DataStage Manager or in the DataStage Director. The recommended way to create them is by using the Add rule to message handler feature in the Director. You can specify message handler use at different levels: Project Level. You define a project level message handler in the DataStage Administrator, and this applies to all parallel jobs within the specified project. Job Level. From the Designer and Manager you can specify that any existing handler should apply to a specific job. When you compile the job, the handler is included in the job executable as a local handler (and so can be exported to other systems if required). You can also add rules to handlers when you run a job from the Director (regardless of whether it currently has a local handler included). This is useful, for example, where a job is generating a message for every row it is processing. You can suppress that particular message. When the job runs it will look in the local handler (if one exists) for each message to see if any rules exist for that message type. If a particular message is not handled locally, it will look to the projectwide handler for rules. If there are none there, it writes the message to the job log. Note that message handlers do not deal with fatal error messages, these will always be written to the job log. You cannot add message rules to jobs from an earlier release of DataStage without first re-running those jobs.

Adding Rules to Message Handlers


You can add rules to message handlers 'on the fly' from within the Director. Using this method, you can add rules to handlers that are local to the current job, to the project default handler, or to any previously defined handler. To add rules in this way, highlight the message you want to add a rule about in the job log and choose Add rule to message handler... from the job log shortcut menu or from the Job menu on the menu bar. The Add rule to message handler dialog box appears.

To add a rule:
1. Choose an option to specify which handler you want to add the new rule to. Choose between the local runtime handler for the currently selected job, the project-level message handler, or a specific message handler. If you want to edit a specific message handler, select the handler from the Message Handler drop-down list. Choose (New) to create a new message handler.
2. Choose an Action from the drop-down list. Choose from:
Suppress from log. The message is not written to the job's log as it runs.
Promote to Warning. Promote an informational message to a warning message.
Demote to Informational. Demote a warning message to become an informational one.
The Message ID, Message type and Example of message text fields are all filled in from the log entry you have currently selected. You cannot edit these.

3. Click Add Rule to add the new message rule to the chosen handler.

Managing Message Handlers


To open the Message Handler Manager, choose Tools Message Handlers (you can also open the manager from the Add rule to message handler dialog box). The Edit Message Handlers dialog box appears.

Message Handler File Format


A message handler is a plain text file and has the suffix .msh. It is stored in the folder $DSHOME/../DataStage/MsgHandlers. The following is an example message file:
TUTL 000031  1  1  The open file limit is 100; raising to 1024
TFSC 000001  1  2  APT configuration file
TFSC 000043  2  3  Attempt to Cleanup after ABORT raised in stage
Each line in the file represents a message rule, and comprises four tab-separated fields:
- Message ID. Case-specific string uniquely identifying the message
- Type. 1 for Info, 2 for Warn
- Action. 1 = Suppress, 2 = Promote, 3 = Demote
- Message. Example text of the message

Identify the use of dsjob command line utility


You can start, stop, validate, and reset jobs using the -run option.

Running a job
dsjob -run [ -mode [ NORMAL | RESET | VALIDATE ] ] [ -param name=value ] [ -warn n ] [ -rows n ] [ -wait ] [ -stop ] [ -jobstatus ] [ -userstatus ] [ -local ] [ -opmetadata [TRUE | FALSE] ] [-disableprjhandler] [-disablejobhandler] [-useid] project job|job_id
-mode specifies the type of job run. NORMAL starts a job run, RESET resets the job and VALIDATE validates the job. If -mode is not specified, a normal job run is started.
-param specifies a parameter value to pass to the job. The value is in the format name=value, where name is the parameter name, and value is the value to be set. If you use this to pass a value of an environment variable for a job (as you may do for parallel jobs), you need to quote the environment variable and its value, for example -param '$APT_CONFIG_FILE=chris.apt'; otherwise the current value of the environment variable will be used.
-warn n sets warning limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITWARN specified as the LimitType parameter).
-rows n sets row limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITROWS specified as the LimitType parameter).
-wait waits for the job to complete (equivalent to the DSWaitForJob function).
-stop terminates a running job (equivalent to the DSStopJob function).
-jobstatus waits for the job to complete, then returns an exit code derived from the job status.
-userstatus waits for the job to complete, then returns an exit code derived from the user status if that status is defined. The user status is a string, and it is converted to an integer exit code. The exit code 0 indicates that the job completed without an error, but that the user status string could not be converted. If a job returns a negative user status value, it is interpreted as an error.

-local use this when running a DataStage job from within a shell script on a UNIX server. Provided the script is run in the project directory, the job will pick up the settings for any environment variables set in the script and any setting specific to the user environment.
-opmetadata use this to have the job generate operational meta data as it runs. If MetaStage, or the Process Meta Data MetaBroker, is not installed on the machine, then the option has no effect. If you specify TRUE, operational meta data is generated, whatever the default setting for the project. If you specify FALSE, the job will not generate operational meta data, whatever the default setting for the project.
-disableprjhandler use this to disable any error message handler that has been set on a project-wide basis.
-disablejobhandler use this to disable any error message handler that has been set for this job.
-useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To run a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.
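As a sketch of a typical invocation (the project name, job name, warning limit and configuration file path are illustrative assumptions, not values from this document), the following starts a normal run, passes a configuration file through an environment variable and waits for completion:
dsjob -run -mode NORMAL -param '$APT_CONFIG_FILE=/configs/4node.apt' -warn 50 -wait -jobstatus dstage1 LoadCustomers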

Stopping a job
You can stop a job using the -stop option.
dsjob -stop [-useid] project job|job_id
-stop terminates a running job (equivalent to the DSStopJob function).
-useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To stop a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.

Listing Projects
The following syntax displays a list of all known projects on the server:
dsjob -lprojects
This syntax is equivalent to the DSGetProjectList function.

Listing Jobs
The following syntax displays a list of all jobs in the specified project:
dsjob -ljobs project
project is the name of the project containing the jobs to list.
This syntax is equivalent to the DSGetProjectInfo function.

Listing Stages
The following syntax displays a list of all stages in a job:
dsjob -lstages [-useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function with DSJ_STAGELIST specified as the InfoType parameter.

Listing Links
The following syntax displays a list of all the links to or from a stage:
dsjob -llinks [-useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function with DSJ_LINKLIST specified as the InfoType parameter.

Listing Parameters
The following syntax displays a list of all the parameters in a job and their values:
dsjob -lparams [-useid] project job|job_id
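As a sketch (the project and job names are hypothetical), the listing options are often used together when exploring a project from the command line:
dsjob -ljobs dstage1
dsjob -lstages dstage1 LoadCustomers
dsjob -lparams dstage1 LoadCustomers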

Listing Invocations
The following syntax displays a list of the invocations of a job:
dsjob -linvocations

Setting an Alias for a Job


The dsjob command can be used to specify your own ID for a DataStage job. Other commands can then use that alias to refer to the job.
dsjob -jobid [my_ID] project job
my_ID is the alias you want to set for the job. If you omit my_ID, the command returns the current alias for the specified job. An alias must be unique within the project; if the alias already exists, an error message is displayed.
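For example (the alias, project and job names below are illustrative assumptions), an alias can be set once and then used with -useid in later commands:
dsjob -jobid nightly_load dstage1 LoadCustomers
dsjob -run -useid dstage1 nightly_load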

Displaying Job Information


The following syntax displays the available information about a specified job:
dsjob -jobinfo [-useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function.

Displaying Stage Information


The following syntax displays all the available information about a stage:
dsjob -stageinfo [-useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function.

Displaying Link Information


The following syntax displays information about a specified link to or from a stage:
dsjob -linkinfo [-useid] project job|job_id stage link
This syntax is equivalent to the DSGetLinkInfo function.

Displaying Parameter Information


This syntax displays information about the specified parameter:
dsjob -paraminfo [-useid] project job|job_id param
The following information is displayed:
The parameter type
The parameter value
Help text for the parameter that was provided by the job's designer
Whether the value should be prompted for
The default value that was specified by the job's designer
Any list of values
The list of values provided by the job's designer
This syntax is equivalent to the DSGetParamInfo function.

Adding a Log Entry


The following syntax adds an entry to the specified log file. The text for the entry is taken from standard input to the terminal, ending with Ctrl-D.
dsjob -log [ -info | -warn ] [-useid] project job|job_id
-info specifies an information message. This is the default if no log entry type is specified.
-warn specifies a warning message.
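Because the entry text is read from standard input, it should also be possible to pipe it in from a script rather than typing it at the terminal. A minimal sketch (project and job names are hypothetical):
echo "Nightly extract file received" | dsjob -log -info dstage1 LoadCustomers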

Displaying a Short Log Entry


The following syntax displays a summary of entries in a job log file:
dsjob -logsum [-type type] [-max n] [-useid] project job|job_id
-type type specifies the type of log entry to retrieve. If -type type is not specified, all the entries are retrieved. type can be one of the following options:

INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED All control logs.
RESET Job reset.
BATCH Batch control.
ANY All entries of any type. This is the default if type is not specified.
-max n limits the number of entries retrieved to n.
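For example, to see only the most recent warnings for a job (project and job names are illustrative):
dsjob -logsum -type WARNING -max 20 dstage1 LoadCustomers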

Displaying a Specific Log Entry


The following syntax displays the specified entry in a job log file:
dsjob -logdetail [-useid] project job|job_id entry
entry is the event number assigned to the entry. The first entry in the file is 0.
This syntax is equivalent to the DSGetLogEntry function.

Identifying the Newest Entry


The following syntax displays the ID of the newest log entry of the specified type:
dsjob -lognewest [-useid] project job|job_id type
INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED Job started.
RESET Job reset.
BATCH Batch control.
This syntax is equivalent to the DSGetNewestLogId function.

Importing Job Executables


The dsjob command can be used to import job executables from a DSX file into a specified project. Note that this command is only available on UNIX servers.
dsjob -import project DSXfilename [-OVERWRITE] [-JOB[S] jobname ] | [-LIST]
project is the project to import into.
DSXfilename is the DSX file containing the job executables.
-OVERWRITE specifies that any existing jobs in the project with the same name will be overwritten.
-JOB[S] jobname specifies that one or more named job executables should be imported (otherwise all the executables in the DSX file are imported).
-LIST causes DataStage to list the executables in a DSX file rather than import them.
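For example, first listing the contents of a DSX file and then importing one executable (the file path, project and job names are illustrative assumptions):
dsjob -import dstage1 /tmp/deploy/jobs.dsx -LIST
dsjob -import dstage1 /tmp/deploy/jobs.dsx -OVERWRITE -JOB LoadCustomers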

Generating a Report
The dsjob command can be used to generate an XML format report containing job, stage, and link information.
dsjob -report [-useid] project job|job_id [report_type]
report_type is one of the following:
BASIC Text string containing start/end time, time elapsed and status of job.
DETAIL As basic report, but also contains information about individual stages and links within the job.
LIST Text string containing full XML report.
By default the generated XML will not contain a <?xml-stylesheet?> processing instruction. If a stylesheet is required, specify a ReportLevel of 2 and append the name of the required stylesheet URL, i.e., 2:styleSheetURL. This inserts a processing instruction into the generated XML of the form:

<?xml-stylesheet type="text/xsl" href="styleSheetURL"?>
The generated report is written to stdout. This syntax is equivalent to the DSMakeJobReport function.
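Since the report goes to stdout, it can simply be redirected to a file. A sketch (project, job and output path are hypothetical):
dsjob -report dstage1 LoadCustomers DETAIL > /tmp/LoadCustomers_report.xml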

Job Sequence
What is a Job Sequence?
1. A master controlling job that controls the execution set of subordinate jobs
2. Passes values to subordinate job parameters
3. Controls the order of execution (links)
4. Specifies conditions under which the subordinate jobs get executed (triggers)
5. Specifies complex flow of control (loops, All/Any sequencer, wait for file)
6. Performs system activities (email, execute system commands and executables)
7. Can include restart checkpoints

What are the Job Sequence stages?
1. Run stages: Job Activity (run a job), Execute Command/Routine Activity (run a system command), Notification Activity (send an email)
2. Flow control stages: Sequencer (go All/Any), Wait for file (go when file exists/doesn't exist), Loop (Start Loop and End Loop), Nested Condition (go if condition satisfied)
3. Error handling: Exception Handler, Terminator
4. Variables: User Variables

What are the compilation options in Job Sequence properties?
1. Add checkpoints so sequence is restartable on failure - restart functionality
2. Automatically handle activities that fail - Exception stage to handle aborts
3. Log warnings after activities that finish with status other than OK
4. Log report messages after each run

What are the inputs for the Job Activity stage?
1. Job name (select from list)
2. Execution Action (select from list)
3. Parameters
4. Do not checkpoint run (select/unselect checkbox)

What are the Job Activity Execution Actions?
1. Run
2. Reset if required, then run
3. Validate

What are the different types of triggers for a Job Activity?
OK (Conditional)
Failed (Conditional)
Warning (Conditional)
Custom (Conditional)
UserStatus (Conditional)
Unconditional
Otherwise
Custom Trigger example: Job_1.$JobStatus=DSJS.RUNOK or Job_1.$JobStatus=DSJS.RUNWARN

What are the inputs for the Execute Command stage?
1. Command
2. Parameters
3. Do not checkpoint run (select/unselect checkbox)

What are the inputs for the Notification stage?
1. SMTP mail server name
2. Sender's email address
3. Recipient's email address
4. Email subject
5. Attachment
6. Email body
7. Include job status in email (select/unselect checkbox)
8. Do not checkpoint run (select/unselect checkbox)

What are the inputs for the Wait for file stage?
1. Filename
2. Wait for file to appear / Wait for file to disappear (select one of the two options)
3. Timeout length (disabled if the Do not timeout option is selected)
4. Do not timeout
5. Do not checkpoint run

Explain the Nested Condition stage.
The Nested Condition stage is used to branch out to other activities based on trigger conditions.

Explain the Loop stage.
The Loop stage is made up of Start Loop and End Loop. The Start Loop connects to one of the Run activities (preferably Job Activity). This Activity stage connects to the End Loop. The End Loop connects to the Start Loop activity by means of a reference link. The two types of looping are:
1. Numeric (For counter n to n Step n)
2. List (For each thing in list)

Explain error handling and restartability.
Error handling is enabled using the Automatically handle activities that fail option. Control is passed to the Exception stage when an Activity fails. Restartability is enabled using the Add checkpoints so sequence is restartable on failure option. If a sequence fails, then when the sequence is re-run, activities that completed successfully in the prior run are skipped over (unless the Do not checkpoint run option was set for an activity).

Which three are valid ways within a Job Sequence to pass parameters to Activity stages? (Choose three.)
A. ExecCommand Activity stage
B. UserVariables Activity stage
C. Sequencer Activity stage
D. Routine Activity stage
E. Nested Condition Activity stage
Answer: A,B,D

Which three are valid trigger expressions in a stage in a Job Sequence? (Choose three.)
A. Equality(Conditional)
B. Unconditional
C. ReturnValue(Conditional)
D. Difference(Conditional)
E. Custom(Conditional)
Answer: B,C,E

A client requires that any job that aborts in a Job Sequence halt processing. Which three activities would provide this capability? (Choose three.)
A. Nested Condition Activity
B. Exception Handler
C. Sequencer Activity
D. Sendmail Activity
E. Job trigger
Answer: A,B,E

Which command can be used to execute DataStage jobs from a UNIX shell script?
A. dsjob
B. DSRunJob
C. osh
D. DSExecute
Answer: A

Which three are the critical stages that would be necessary to build a Job Sequence that: picks up data from a file that will arrive in a directory overnight, launches a job once the file has arrived, and sends an email to the administrator upon successful completion of the flow? (Choose three.)
A. Sequencer
B. Notification Activity
C. Wait For File Activity
D. Job Activity
E. Terminator Activity
Answer: B,C,D

Which two statements describe functionality that is available using the dsjob command? (Choose two.)
A. dsjob can be used to get a report containing job, stage, and link information.
B. dsjob can be used to add a log entry for a specified job.
C. dsjob can be used to compile a job.
D. dsjob can be used to export job executables.
Answer: A,B

Other Topics
Environment Variables
APT_BUFFER_FREE_RUN - This environment variable is available in the DataStage Administrator, under the Parallel category. It specifies how much of the available in-memory buffer to consume before the buffer resists. This is expressed as a decimal representing the percentage of Maximum memory buffer size (for example, 0.5 is 50%). When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.
APT_BUFFER_MAXIMUM_MEMORY - Sets the default value of Maximum memory buffer size. The default value is 3145728 (3 MB). Specifies the maximum amount of virtual memory, in bytes, used per buffer.
APT_BUFFER_MAXIMUM_TIMEOUT - DataStage buffering is self-tuning, which can theoretically lead to long delays between retries. This environment variable specifies the maximum wait before a retry in seconds, and is by default set to 1.
APT_BUFFERING_POLICY - This environment variable is available in the DataStage Administrator, under the Parallel category. Controls the buffering policy for all virtual data sets in all steps. The variable has the following settings:
AUTOMATIC_BUFFERING (default). Buffer a data set only if necessary to prevent a data flow deadlock.
FORCE_BUFFERING. Unconditionally buffer all virtual data sets. Note that this can slow down processing considerably.
NO_BUFFERING. Do not buffer data sets. This setting can cause data flow deadlock if used inappropriately.
APT_DECIMAL_INTERM_PRECISION - Specifies the default maximum precision value for any decimal intermediate variables required in calculations. Default value is 38.
APT_DECIMAL_INTERM_SCALE - Specifies the default scale value for any decimal intermediate variables required in calculations. Default value is 10.
APT_CONFIG_FILE - Sets the path name of the configuration file. (You may want to include this as a job parameter, so that you can specify the configuration file at job run time.)
APT_DISABLE_COMBINATION - Globally disables operator combining. Operator combining is DataStage's default behavior, in which two or more (in fact any number of) operators within a step are combined into one process where possible. You may need to disable combining to facilitate debugging. Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory. It also disables internal optimizations for job efficiency and run times.
APT_EXECUTION_MODE - By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode:
ONE_PROCESS one-process mode
MANY_PROCESS many-process mode
NO_SERIALIZE many-process mode, without serialization

APT_ORCHHOME - Must be set by all DataStage Enterprise Edition users to point to the top-level directory of the DataStage Enterprise Edition installation.
APT_STARTUP_SCRIPT - As part of running an application, DataStage creates a remote shell on all DataStage processing nodes on which the job runs. By default, the remote shell is given the same environment as the shell from which DataStage is invoked. However, you can write an optional startup shell script to modify the shell configuration of one or more processing nodes. If a startup script exists, DataStage runs it on remote shells before running your application. APT_STARTUP_SCRIPT specifies the script to be run. If it is not defined, DataStage searches ./startup.apt, $APT_ORCHHOME/etc/startup.apt and $APT_ORCHHOME/etc/startup, in that order. APT_NO_STARTUP_SCRIPT disables running the startup script.
APT_NO_STARTUP_SCRIPT - Prevents DataStage from executing a startup script. By default, this variable is not set, and DataStage runs the startup script. If this variable is set, DataStage ignores the startup script. This may be useful when debugging a startup script. See also APT_STARTUP_SCRIPT.
APT_STARTUP_STATUS - Set this to cause messages to be generated as parallel job startup moves from phase to phase. This can be useful as a diagnostic if parallel job startup is failing.
APT_MONITOR_SIZE - This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000 records.
APT_MONITOR_TIME - This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum time interval in seconds for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.
APT_NO_JOBMON - Turns off job monitoring entirely.
APT_PM_NO_SHARED_MEMORY - By default, shared memory is used for local connections. If this variable is set, named pipes rather than shared memory are used for local connections. If both APT_PM_NO_NAMED_PIPES and APT_PM_NO_SHARED_MEMORY are set, then TCP sockets are used for local connections.
APT_PM_NO_NAMED_PIPES - Specifies not to use named pipes for local connections. Named pipes will still be used in other areas of DataStage, including subprocs and setting up of the shared memory transport protocol in the process manager.
APT_RECORD_COUNTS - Causes DataStage to print, for each operator Player, the number of records consumed by getRecord() and produced by putRecord(). Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.
APT_NO_PART_INSERTION - DataStage automatically inserts partition components in your application to optimize the performance of the stages in your job. Set this variable to prevent this automatic insertion.
APT_NO_SORT_INSERTION - DataStage automatically inserts sort components in your job to optimize the performance of the operators in your data flow. Set this variable to prevent this automatic insertion.

APT_SORT_INSERTION_CHECK_ONLY - When sorts are inserted automatically by DataStage, if this is set, the sorts will just check that the order is correct; they won't actually sort. This is a better alternative to turning partition and sort insertion off using APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.
APT_DUMP_SCORE - Configures DataStage to print a report showing the operators, processes, and data sets in a running job.
APT_PM_PLAYER_MEMORY - Setting this variable causes each player process to report the process heap memory allocation in the job log when returning.
APT_PM_PLAYER_TIMING - Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.
OSH_DUMP - If set, it causes DataStage to put a verbose description of a job in the job log before attempting to execute it.
OSH_ECHO - If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.
OSH_EXPLAIN - If set, it causes DataStage to place a terse description of the job in the job log before attempting to run it.
OSH_PRINT_SCHEMAS - If set, it causes DataStage to print the record schema of all data sets and the interface schema of all operators in the job log.
APT_STRING_PADCHAR - Overrides the pad character of 0x0 (ASCII null), used by default when DataStage extends, or pads, a string field to a fixed length.
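One way to exercise several of these variables together is to export them in the shell before invoking dsjob. The sketch below is illustrative only; the configuration file path, project and job names are assumptions, not defaults from this document:
export APT_CONFIG_FILE=/configs/4node.apt
export APT_DUMP_SCORE=1
export APT_RECORD_COUNTS=1
export APT_PM_PLAYER_TIMING=1
dsjob -run -mode NORMAL dstage1 LoadCustomers
These variables can also be set at project level from the DataStage Administrator or added as job parameters, as noted above for APT_CONFIG_FILE.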

XML Stages - XML Meta Data Importer


The XML Meta Data Importer window has the following panes:
Tree View, which depicts the hierarchical structure in the XML source. This pane is the main view. It is always present and cannot be hidden or docked.
Source, which contains the original XML schema or XML document, in read-only mode. To compare the tree view with the XML source, you can dock this pane next to the tree view.
Node Properties, which describes XML and XPath information of the selected element.
Table Definition, which maps elements that you select in the Tree View.
Parser Output, which presents XML syntax and semantic errors.
The following illustration shows all XML Meta Data Importer panes except Parser Output.

XML Meta Data Importer reports any syntax and semantic errors when you open a source file. In the following example, the Parser Output pane indicates that at least one quote is missing from line 3.

To highlight the error in the Source pane, double-click the error in the Parser Output pane. After correcting the error outside of the XML Meta Data Importer, you can load the revised source file. To reload the file, choose File Refresh. You can process an XML schema file (.xsd) or an XML document (.xml). The file can be located on your file system or accessed with a URL.

Processing XML Documents


The XML Meta Data Importer retains namespaces and considers every node in an XML hierarchy to be fully qualified with a namespace prefix. The form is: prefix:nodename. This approach applies to documents in which the prefixes are included or unspecified. When prefixes are unspecified, XML Meta Data Importer generates prefixes using the pattern ns#, where # is a sequence number.

Example
The following input does not include a namespace prefix. Input
<Person xmlns="mynamespace"> <firstName>John</firstName> </Person>

Output
<ns1:Person xmlns:ns1="mynamespace"> <ns1:firstName>John</ns1:firstName> </ns1:Person>

Processing XML Schemas


The XML Meta Data Importer processes namespaces in XML schemas according to three rules: General Import By Reference Target Namespace Unspecified

General Rule
In general, the XML Meta Data Importer assigns the prefix defns to the target namespace. For example:
<xsd:schema targetNamespace="mynamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The firstName node generates the following XPath expression:
/defns:Person/defns:firstName
where defns=mynamespace

Import By Reference Rule


If the schema imports by reference other schemas with different target namespaces, the XML Meta Data Importer assigns a prefix in the form ns# to each of them. To enable this processing, the dependent schema must specify elementFormDefault="qualified". If this is omitted, the elements are considered as belonging to the caller's target namespace.
Example
The following example imports by reference the schema mysecondschema.
<xsd:schema targetNamespace="demonamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:other="othernamespace">
<xsd:import namespace="othernamespace" schemaLocation="mysecondschema.xsd"/>
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="address" type="other:Address" minOccurs="1" maxOccurs="1" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The schema mysecondschema contains the following statements:
<xsd:schema targetNamespace="othernamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="street" minOccurs="1" maxOccurs="1" />
<xsd:element name="city" minOccurs="1" maxOccurs="1" />

<xsd:element name="state" minOccurs="1" maxOccurs="1" /> <xsd:element name="zip" minOccurs="1" maxOccurs="1" /> </xsd:sequence> </xsd:complexType> </xsd:schema> The street node generates the following XPath expression: /defns:Person/defns:address/ns2:street where defns=demonamespace and ns2=othernamespace

The Target Namespace Unspecified Rule


When the target namespace is unspecified, XML Meta Data Importer omits the prefix defns from XPath expressions. For example:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

The firstName tree node generates the following XPath expression: /Person/firstName

Mapping Nodes from an XML Schema


You can individually choose elements and attributes, or select all leaf nodes except empty ones in one step.
Choosing Individual Items
Select the box that is next to the item that you want to map. In the following example, there are elements and text nodes. The three TEXT nodes are selected.

If you select an element box, you get all the sub nodes and the actual content of the element. Your selection is reflected in the Table Definition pane:

An asterisk appears after the title Table Definition when you modify the table definition. It disappears when you save the information.

Selecting All Nodes


You can simplify selecting all leaf nodes by using the Auto-check command. This command checks leaf nodes. XML Meta Data Importer ignores leaf nodes in the following circumstances:
Nodes are empty.
In a branch in which a node represents a reference to an element or a type defined elsewhere, such as an included schema. To avoid recursive looping, which may be deep in the sub-schema, the node is not expanded. You may manually expand the reference branch down to a specific level, and run the Auto-check command on the top branch node. This action selects all nodes in the branch.
Node represents a detected recursion. This happens with a schema that has the following form: parent = person, children child = person. You may manually expand the recursive branch and run the Auto-check command to select all nodes in the branch.
To run Auto-check: Choose Edit Auto-check. The nodes appear in the Table Definition pane.
The default table definition name depends on the XML source name:
Source file (UNC name) - original file name without extension
URL - the value New
XML document - original XML document filename
XML schema - original XML schema filename

Xml Input Stage


XML Input stage is used to transform hierarchical XML data to flat relational tables. XML Input stage supports a single input link and one or more output links.
XML Input performs two XML validations when the server job runs:
Checks for well-formed XML.
Optionally checks that elements and attributes conform to any XML schema that is referenced in the document. You control this option.
The XML parser reports three types of conditions: fatal, error, and warning. Fatal errors are thrown when the XML is not well-formed. Non-fatal errors are thrown when the XML violates a validity constraint; for example, the root element in the document is not found in the validating XML schema. Warnings may be thrown when the schema has duplicate definitions.
XML Input supports one Reject link, which can store rejection messages and rejected rows.

Writing Rejection Messages to the Link


To write rejection messages to a Reject link: 1. Add a column on the Reject link. 2. Using the General page of the Output Link properties, identify the column as the target for rejection messages.

Writing Rejected Rows to the Link


To write rejected rows to a Reject link: Add a column on the Reject link that has the same name as the column on the input link that contains or references the XML document. This is a pass-through operation. Column names for this operation are case-sensitive. Pass-through is available for any input column.

Controlling Output Rows


To populate the columns of an output row, XML Input uses XPath expressions that are specified on the output link. XPath expressions locate elements, attributes, and text nodes.

Controlling the Number of Output Rows


You must designate one column on the output link as the repetition element. A repetition element consists of an XPath expression. For each occurrence of the repetition element, XML Input always generates a row. By varying the repetition element and using a related option, you can control the number of output rows.

Identifying the Repetition Element


To identify the repetition element, set the Key property to Yes on the output link.
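As a sketch of how these pieces fit together (the column name and XPath expressions are illustrative assumptions based on the Person document shown earlier, not settings from this document): an output link could define a column firstName whose Description property holds the XPath expression /ns1:Person/ns1:firstName/text(), and that column could be marked with Key = Yes. The XPath of the key column then acts as the repetition element, so XML Input generates one output row for each matching occurrence in the incoming document.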

Transformation Settings
These properties control the values that can be shared by multiple output links of the XML Input stage. They fall into these categories:
Requiring the repetition element
Processing NULLs and empty values
Processing namespaces
Formatting extracted XML fragments
To use these values with a specific output link, select the Inherit Stage properties box on the Transformation Settings tab of the output link.

Xml Output Stage


XML Output stage is used to transform tabular data, such as relational tables and sequential files, to XML hierarchical structures. XML Output stage supports a single input link and zero or one output links.

XML Output requires XPath expressions to transform tabular data to XML. A table definition stores the XPath expressions. Using the Description property on the Columns pages within the stage, you record or maintain the XPath expressions.

Aggregating Input Rows on Output


You have several options for aggregating input rows on output:
Aggregate all rows in a single output row. This is the default option.
Generate one output row per input row. This is the Single row option.
Trigger a new output row when the value of an input column changes.
Trigger a new output row when the value of a pass-through column changes. A pass-through column is an output column that has no XPath expression in the Description property and whose name exactly matches the name of an input column.
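As an illustration (the element and column names are assumptions, not taken from this document): input columns ID and Name could carry the XPath expressions /Customers/Customer/ID/text() and /Customers/Customer/Name/text() in their Description properties. With the default option, all input rows are aggregated into one output row containing a single Customers document; with the Single row option, each input row produces its own output row containing a small XML fragment.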

Job Management and Deployment


Quick Find
1. Name to find

2. Types to find
3. Include descriptions (if checked, the text in short and long descriptions will be searched)

Advanced Find Filtering options


1. Type - type of object (Job, Table Definition, etc.)
2. Creation - date range
3. Last Modification - date range
4. Where used
5. Dependencies of
6. Options - case sensitivity and Search within last result set

Impact Analysis
Right-click over a stage or table definition:
1. Select Find where table definitions used
2. Select Find where table definitions used (deep) - deep includes additional object types
Displays a list of objects using the table definition
1. Select Find dependencies
2. Select Find dependencies (deep) - deep includes additional object types
Displays a list of objects dependent on the one selected
Graphical functionality:
1. Display the dependency path
2. Collapse selected objects
3. Move the graphical object
4. Bird's-eye view

Comparison
1. Cross project compare
2. Compare against
The two objects that can be compared are:
1. Jobs
2. Table Definitions

Aggregator Stage
1. Grouping Keys
2. Aggregations
Aggregation Type - Count Rows, Calculation, Re-Calculation
Aggregation Type - Count Rows: Count Output Column - name of the output column that contains the number of records based on grouping keys
Aggregation Type - Calculation, Re-Calculation: Column for Calculation - input column to be selected for calculation
Options:
Allow Null Output - True means that NULL is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. False means 0 is output when all input values for the calculation column are NULL.
Method - Hash (hash table) or Sort (pre-sort). The default method is Hash.
Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory. Sort mode requires the input data set to have been partition-sorted with all of the grouping keys specified as hashing and sorting keys.
Use the Hash method for inputs with a limited number of distinct groups:
1. Uses 2K of memory per group
2. Calculations are made for all groups and stored in memory (a hash table structure, hence the name)
3. Incoming data does not need to be pre-sorted
4. Results are output after all rows have been read
5. Useful when the number of unique groups is small
Use the Sort method with a large (or unknown) number of distinct key column values:
1. Requires inputs pre-sorted on key columns (does not perform the sort! Expects the sort)
2. Results are output after each group
3. Can handle an unlimited number of groups
Sort Aggregator - one of the lightweight stages that minimize memory usage by requiring data in key column sort order.
Lightweight stages that minimize memory usage by requiring data in key column sort order:
1. Join
2. Merge
3. Sort Aggregator

Sort Stage
DataStage Designer provides two methods for parallel (group) sorting:
1. Sort stage - parallel execution
2. Sort on a link when the partitioning is not Auto - identified by the Sort icon
Both methods use the same tsort operator. Sorting on a link provides easier job maintenance (fewer stages on the job canvas) but fewer options.

The Sort stage offers more options than a link sort. The Sort Utility should be DataStage, as it is faster than the UNIX sort. Stable sort preserves the order of non-key columns within each sort group but is slightly slower than a non-stable sort. Stable sort is enabled by default on Sort stages but not on link sorts. If disabled, no prior ordering of records is guaranteed to be preserved by the sorting operation.
Sort Key Modes:
1. Don't Sort (Previously Sorted) means that input records are already sorted by this key. The Sort stage will then sort on secondary keys, if any.
2. Don't Sort (Previously Grouped) means that input records are already grouped by that key but not sorted.
3. Sort - sort by this key.
Advantages of Don't Sort (Previously Sorted):
1. Uses significantly less memory/disk
2. Sort is now on previously sorted key column groups, not the entire data set
3. Outputs rows after each group
DataStage provides two methods for generating a sequentially (totally) sorted result:
1. Sort stage - Sequential execution mode
2. Sort Merge Collector
In general a parallel Sort plus a Sort Merge Collector will be faster than a sequential Sort. By default the parallel framework will insert tsort operators as necessary to ensure correct results, but by setting $APT_SORT_INSERTION_CHECK_ONLY we can force the inserted tsort operator to verify that the data is sorted instead of actually performing the sort operation.
By default each tsort operator (Sort stage, link sort and inserted sort) uses 20MB per partition as an internal memory buffer, but the Sort stage provides the "Restrict Memory Usage" option:
1. Increasing this value can improve performance if the entire (or group) data can fit into memory
2. Decreasing this value may hurt performance, but will use less memory
When the memory buffer is filled, sort uses temporary disk space in the following order:
1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory
Removing duplicates can be done by the Sort stage:
Use the unique option
No choice on which duplicate to keep
Stable sort always retains the first row in the group
Non-stable sort is indeterminate
Or by the Remove Duplicates stage:
Can choose to retain first or last
