Velocity v8

Best Practices

B2B Data Exchange
    B2B Data Transformation Installation (for Unix)
    B2B Data Transformation Installation (for Windows)
    Deployment of B2B Data Transformation Services
    Establishing a B2B Data Transformation Development Architecture
    Testing B2B Data Transformation Services

Configuration Management and Security
    Configuring Security
    Data Analyzer Security
    Database Sizing
    Deployment Groups
    Migration Procedures - PowerCenter
    Migration Procedures - PowerExchange
    Running Sessions in Recovery Mode
    Using PowerCenter Labels

Data Analyzer Configuration
    Deploying Data Analyzer Objects
    Installing Data Analyzer

Data Connectivity
    Data Connectivity using PowerCenter Connect for BW Integration Server
    Data Connectivity using PowerExchange for WebSphere MQ
    Data Connectivity using PowerExchange for SAP NetWeaver
    Data Connectivity using PowerExchange for Web Services

Data Migration
    Data Migration Principles
    Data Migration Project Challenges
    Data Migration Velocity Approach

Data Quality and Profiling
    Build Data Audit/Balancing Processes
    Continuing Nature of Data Quality
    Data Cleansing
    Data Profiling
    Data Quality Mapping Rules
    Data Quality Project Estimation and Scheduling Factors
    Developing the Data Quality Business Case
    Effective Data Matching Techniques
    Effective Data Standardizing Techniques
    Integrating Data Quality Plans with PowerCenter
    Managing Internal and External Reference Data
    Real-Time Matching Using PowerCenter
    Testing Data Quality Plans
    Tuning Data Quality Plans
    Using Data Explorer for Data Discovery and Analysis
    Working with Pre-Built Plans in Data Cleanse and Match

Development Techniques
    Designing Data Integration Architectures
    Development FAQs
    Event Based Scheduling
    Key Management in Data Warehousing Solutions
    Mapping Auto-Generation
    Mapping Design
    Mapping SDK
    Mapping Templates
    Naming Conventions
    Naming Conventions - B2B Data Transformation
    Naming Conventions - Data Quality
    Performing Incremental Loads
    Real-Time Integration with PowerCenter
    Session and Data Partitioning
    Using Parameters, Variables and Parameter Files
    Using PowerCenter with UDB
    Using Shortcut Keys in PowerCenter Designer
    Working with JAVA Transformation Object

Error Handling
    Error Handling Process
    Error Handling Strategies - Data Warehousing
    Error Handling Strategies - General
    Error Handling Techniques - PowerCenter Mappings
    Error Handling Techniques - PowerCenter Workflows and Data Analyzer

Integration Competency Centers and Enterprise Architecture
    Business Case Development
    Canonical Data Modeling
    Chargeback Accounting
    Engagement Services Management
    Information Architecture
    People Resource Management
    Planning the ICC Implementation
    Proposal Writing
    Selecting the Right ICC Model

Metadata and Object Management
    Creating Inventories of Reusable Objects & Mappings
    Metadata Reporting and Sharing
    Repository Tables & Metadata Management
    Using Metadata Extensions
    Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

Metadata Manager Configuration
    Configuring Standard Metadata Resources
    Custom XConnect Implementation
    Customizing the Metadata Manager Interface
    Estimating Metadata Manager Volume Requirements
    Metadata Manager Business Glossary
    Metadata Manager Load Validation
    Metadata Manager Migration Procedures
    Metadata Manager Repository Administration
    Upgrading Metadata Manager

Operations
    Daily Operations
    Data Integration Load Traceability
    Disaster Recovery Planning with PowerCenter HA Option
    High Availability
    Load Validation
    Repository Administration
    Third Party Scheduler
    Updating Repository Statistics

Performance and Tuning
    Determining Bottlenecks
    Performance Tuning Databases (Oracle)
    Performance Tuning Databases (SQL Server)
    Performance Tuning Databases (Teradata)
    Performance Tuning in a Real-Time Environment
    Performance Tuning UNIX Systems
    Performance Tuning Windows 2000/2003 Systems
    Recommended Performance Tuning Procedures
    Tuning and Configuring Data Analyzer and Data Analyzer Reports
    Tuning Mappings for Better Performance
    Tuning Sessions for Better Performance
    Tuning SQL Overrides and Environment for Better Performance
    Using Metadata Manager Console to Tune the XConnects

PowerCenter Configuration
    Advanced Client Configuration Options
    Advanced Server Configuration Options
    Causes and Analysis of UNIX Core Files
    Domain Configuration
    Managing Repository Size
    Organizing and Maintaining Parameter Files & Variables
    Platform Sizing
    PowerCenter Admin Console
    PowerCenter Enterprise Grid Option
    Understanding and Setting UNIX Resources for PowerCenter Installations

PowerExchange Configuration
    PowerExchange for Oracle CDC
    PowerExchange for SQL Server CDC
    PowerExchange Installation (for AS/400)
    PowerExchange Installation (for Mainframe)

Project Management
    Assessing the Business Case
    Defining and Prioritizing Requirements
    Developing a Work Breakdown Structure (WBS)
    Developing and Maintaining the Project Plan
    Developing the Business Case
    Managing the Project Lifecycle
    Using Interviews to Determine Corporate Data Integration Requirements

Upgrades
    Upgrading Data Analyzer
    Upgrading PowerCenter
    Upgrading PowerExchange

B2B Data Transformation Installation (for Unix)

Challenge


Install and configure B2B Data Transformation on new or existing hardware, either in conjunction with PowerCenter or co-existing with other host applications on the same application server.

Note: B2B Data Transformation (B2BDT) was formerly called Complex Data Exchange (CDE). All references to CDE in this document now refer to B2BDT.

Description
Consider the following questions when determining what type of hardware to use for B2BDT:

If the hardware already exists:
1. Are the processor and operating system supported by B2BDT?
2. Are the necessary operating system patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the B2BDT application?
5. Will B2BDT share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:
1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the complex data transformation requirements for B2BDT. The hardware requirements for the B2BDT environment depend upon the data volumes, number of concurrent users, application server and operating system used, among other factors. For exact sizing recommendations, contact Informatica Professional Services for a B2BDT Sizing and Baseline Architecture engagement.
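To help answer the sizing questions above on an existing UNIX machine, a few standard operating system commands are usually enough. The following is a hedged sketch; the commands are generic OS utilities (not part of B2BDT), the Linux variants are shown as the example, and /opt is only the default install parent used later in this chapter:

uname -a                            # operating system, version and architecture
grep -c ^processor /proc/cpuinfo    # number of logical CPUs (Linux; prtconf on AIX or Solaris)
grep MemTotal /proc/meminfo         # total physical memory (Linux)
df -h /opt                          # free space under the default install parent directory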

Planning for B2BDT Installation


There are several variations on the hosting environment from which B2BDT services will be called. This has implications on how B2BDT is installed and configured.

Host Software Environment


The most common configurations are:
- B2BDT to be used in conjunction with PowerCenter
- B2BDT as a stand-alone configuration
- B2BDT in conjunction with non-PowerCenter integration using an adapter for other middleware software such as WebMethods

In addition, B2BDT 4.4 included a mechanism for exposing B2BDT services through web services so that they could be called from applications capable of calling web services. Depending on what host options are chosen, installation may vary.

Installation of B2BDT for a PowerCenter Host Environment



Be sure to have the necessary licenses and the additional plug-in required for the PowerCenter integration to work. Refer to the appropriate installation guide or contact Informatica support for details on installing B2BDT in PowerCenter environments.

Installation of B2BDT for a Standalone Environment


When using B2BDT services in a standalone environment, it is expected that one of the invocation methods (e.g., web services, .NET, Java APIs, command line or CGI) will be used to invoke B2BDT services. Consult the accompanying B2BDT documentation for use in these environments.

Non-PowerCenter Middleware Platform Integration


Be sure to plan for additional agents to be installed. Refer to the appropriate installation guide or contact Informatica support for details for installing B2BDT in environments other than PowerCenter.

Other Decision Points


Where will the B2BDT service repository be located? The choices for the location of the service repository are i) a path on the local file system or ii) a shared network drive. The justification for using a shared network drive is typically to simplify service deployment when two separate B2BDT servers need to share the same repository. While the use of a shared repository is convenient for a multi-server production environment, it is not advisable for development because multiple development teams could potentially overwrite the same project files. When a repository is shared between multiple machines and a service is deployed via B2BDT Studio, the Service Refresh Interval setting controls how quickly other running installations of B2BDT detect the newly deployed service.

What are the multi-user considerations? If multiple users share a machine (but not at the same time), the environment variable IFConfigLocation4 can be used to point each user to a different configuration file.
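As a hedged illustration of the multi-user option just described (only the IFConfigLocation4 variable name comes from the product documentation; the file path below is hypothetical), each user can point B2BDT at a private configuration file in a login script:

# Per-user B2BDT configuration file (sh/ksh/bash syntax); the path is an example only
IFConfigLocation4=$HOME/b2bdt/CMConfig.xml
export IFConfigLocation4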

Security Considerations
As the B2BDT repository, workspace and logging locations are directory-based, all directories to be used should be granted read and write permissions for the user identity under which the B2BDT service will run. The identity associated with the caller of the B2BDT services will also need permission to execute the files installed in the B2BDT binary directory. Special consideration should be given to environments such as web services, where the user identity under which the B2BDT service runs may be different from the interactive user or from the user associated with the calling application.
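The sketch below shows one way to grant these permissions, assuming the default UNIX paths used later in this chapter and a hypothetical service account named b2bdt; adjust the account and paths to the actual environment:

# Hypothetical service account "b2bdt"; paths are the chapter's default locations
chown -R b2bdt /opt/Informatica/ComplexDataExchange/ServiceDB /opt/Informatica/ComplexDataExchange/CMReports
chmod -R u+rw /opt/Informatica/ComplexDataExchange/ServiceDB /opt/Informatica/ComplexDataExchange/CMReports
chmod -R u+rx /opt/Informatica/ComplexDataExchange/bin        # callers must be able to execute the binaries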

Log File and Tracing Locations


Log files and tracing options should be configured for appropriate recycling policies. The calling application must have permissions to read, write and delete files to the path that is set for storing these files.

B2BDT Pre-install Checklist


B2BDT has client and server components. Only the server (or engine) component is installed on UNIX platforms. The client or development studio is only supported on the Windows platform. Reviewing the environment and recording the information in a detailed checklist facilitates the B2BDT install.

Minimum System Requirements


Verify that the minimum requirements for Operating System, Disk Space, Processor Speed and RAM are met and record them in the checklist. Verify the following:

- B2BDT requires a Sun Java 2 Runtime Environment (version 1.5.x or above). B2BDT is bundled with the appropriate JRE version; the installer can be pointed to an existing JRE, or a JRE can be downloaded from Sun.
    - If the server platform is AIX, Solaris or Linux, JRE version 1.5 or higher is installed and configured.
    - If the server platform is HP-UX, JRE version 1.5 or higher and the Java -AA add-on are installed and configured.
- A login account and directory have been created for the installation.
- Confirm that the profile file is not write-protected; the setup program needs to update the profile:
    - ~/.profile if you use the sh, ksh, or bash shell
    - ~/.cshrc or ~/.tcshrc if you use the csh or tcsh shell
- 500MB or more of temporary workspace is available.
- Data and stack size:
    - If the server platform is Linux, the data and stack sizes are not limited.
    - If the server platform is AIX, the data size is not limited.
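As a hedged illustration of the data and stack size checks above, ulimit (a standard shell built-in, also referenced later in this chapter) can be used; note that raising hard limits may require root access or an entry in the system's limits configuration:

ulimit -a             # show all current limits for this shell
ulimit -d unlimited   # remove the data segment size limit (sh/ksh/bash)
ulimit -s unlimited   # remove the stack size limit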

PowerCenter Integration Requirements


Complete a separate checklist for integration if you plan to integrate B2BDT with PowerCenter. For an existing PowerCenter installation, the B2BDT client will need to be installed on at least one PC on which the PowerCenter client resides. Also, B2BDT components will need to be installed on the PowerCenter server. If utilizing an existing PowerCenter installation, determine the following:

- Which version of PowerCenter is being used (8.x required)?
- Is the PowerCenter version 32-bit or 64-bit?
- Are the PowerCenter client tools installed on the client PC?
- Is the PowerCenter server installed on the server?

For new PowerCenter installations, the PowerCenter Pre-Install Checklist needs to be completed. Keep in mind that the same hardware will be utilized for both PowerCenter and B2BDT.

Non-PowerCenter Integration Requirements


In addition to general B2BDT requirements, non-PowerCenter agents require that additional components are installed:

- B2BDT Agent for BizTalk - requires that Microsoft BizTalk Server (version 2004 or 2006) is installed on the same computer as B2BDT. If B2BDT Studio is installed on the same computer as BizTalk Server 2004, the Microsoft SP2 service pack for BizTalk Server must be installed.
- B2BDT Translator for Oracle BPEL - requires that BPEL 10.1.2 or above is installed.
- B2BDT Agent for WebMethods - requires that WebMethods 6.5 or above is installed.
- B2BDT Agent for WebSphere Business Integration Message Broker - requires that WBIMB 5.0 with CSD06 (or WBIMB 6.0) is installed. Also ensure that the platform supports both the B2BDT Engine and WBIMB.

A valid license key is needed to run a B2BDT project and must be installed before B2BDT services will run on the computer. Contact Informatica support to obtain a B2BDT license file (B2BDTLicense.cfg). B2BDT Studio can be used without installing a license file.


B2BDT Installation and Configuration


The B2BDT installation process involves two main components - the B2BDT development workbench (Studio) and the B2BDT Server, which is an application deployed on a server. The installation tips apply to UNIX environments. This section should be used as a supplement to the B2B Data Transformation Installation Guide. Before installing B2BDT, complete the following steps:
- Verify that the hardware meets the minimum system requirements for B2BDT.
- Ensure that the combination of hardware and operating system is supported by B2BDT.
- Ensure that sufficient space has been allocated to the B2BDT ServiceDB.
- Apply all necessary patches to the operating system.
- Ensure that the B2BDT license file has been obtained from technical support.
- Be sure to have administrative privileges for the installation user id.
- For *nix systems, ensure that read, write and execute privileges have been granted on the installation directory.

Adhere to the following sequence of steps to successfully install B2BDT:
1. Complete the B2BDT pre-install checklist and obtain valid license keys.
2. Install the B2BDT development workbench (Studio) on the Windows platform.
3. Install the B2BDT server on a server machine. When used in conjunction with PowerCenter, the server component must be installed on the same physical machine where PowerCenter resides.
4. Install the necessary client agents when used in conjunction with WebSphere, WebMethods and BizTalk.

B2BDT Install Components


- B2B Data Transformation Studio
- B2B Data Transformation Engine
- Processors
- Optional agents
- Optional libraries

The table below provides descriptions of each component:

Component: Engine
Applicable Platform: Both UNIX and Windows
Description: The runtime module that executes B2BDT data transformations. This module is required in all B2BDT installations.

Component: Studio
Applicable Platform: Windows only
Description: The design and configuration environment for creating and deploying data transformations. B2BDT Studio is hosted within Eclipse on Windows platforms. The Eclipse setup is included in the B2BDT installation package.

Component: Document Processors
Applicable Platform: Both UNIX and Windows
Description: A set of components that perform global processing operations on documents, such as transforming their file formats. All the document processors run on Windows platforms, and most of them run on UNIX-type platforms.

Component: Libraries
Applicable Platform: Windows only (see description)
Description: Libraries of predefined B2BDT data transformations, which can be used with industry messaging standards such as EDI, ACORD, HL7, HIPAA, and SWIFT. Each library contains parsers, serializers, and XSD schemas for the appropriate messaging standard. The libraries can be installed on Windows platforms. B2BDT Studio can be used to import the library components to projects, and deploy the projects to Windows or UNIX-type platforms.

Component: Documentation
Applicable Platform: Windows only
Description: An online help library, containing all the B2BDT documentation. A PDF version of the documentation is available for the UNIX platform.

Install the B2BDT Engine

Step 1:


Run the UNIX installation file from the software folder on the installation CD and follow the prompts. Follow the wizard to complete the install.

TIP: During the installation a language must be selected. If there are plans to change the language at a later point in time in the Configuration Editor, Informatica recommends that a non-English language is chosen for the initial setup. If English is selected and then later changed to another language, some of the services that are required for other languages might not be installed.

B2BDT supports all of the major UNIX-type systems (e.g., Sun Solaris, IBM AIX, Linux and HP-UX). On UNIX-type operating systems, the installed components are the B2BDT Engine and the document processors.

Note: On UNIX-type operating systems, do not limit the data size and the stack size. To determine whether there is currently a limitation, run the following command:

For AIX, HP, and Solaris: ulimit -a
For Linux: limit

If very large documents are processed using B2BDT, try adjusting system parameters such as the memory size and the file size.

There are two install modes possible under UNIX: Graphical Interface and Console Mode. The default installation path is /opt/Informatica/ComplexDataExchange.

The default Service Repository Path is <INSTALL_DIR>/ServiceDB. This is the storage location for data transformations that are deployed as B2BDT services.

The default Log Path is <INSTALL_DIR>/CMReports. The Log Path is the location where the B2BDT Engine stores its log files; the log path is also known as the reports path.

The repository location, JRE path and Log Path can be changed subsequent to the installation using environment variables.

Step 2:
Install the license file. Verify the validity of the license file with the following command:

CM_console v

The system displays information such as the location and validity of the license file (sample output shown below):

$ ./bin/CM_console v
Version: 4.4.0 (Build:186)
Syntax version: 4.00.10
Components: Engine Processors
Configuration file: /websrvr/informatica/ComplexDataExchange/CMConfig.xml
Package identifier: IF_AIX_OS64_pSeries_C64
License information:
License-file path: /websrvr/informatica/ComplexDataExchange/CDELicense.cfg
Expiration date: 21/02/08 (dd/mm/yyyy)
Maximum CPUs: 1
Maximum services: 1
Licensed components: Excel,Pdf,Word,Afp,Ppt

Step 3:
Load the Environment Variables. When the setup is complete, configure the system to load the B2BDT environment variables. The B2BDT setup assigns several environment variables that point to the installation directory and to other locations that the system needs. On UNIX-type platforms, the system must be configured to load the environment variables; B2BDT cannot run until this is done. The B2BDT setup creates an environment variables file, which can be loaded in either of the following ways:

Manually from the command line. In lieu of loading the environment variables automatically, they can be loaded manually from the command line. This must be done upon each log in before using B2BDT.
For the sh, ksh, or bash shell, the command is: . /<INSTALL_DIR>/setEnv.sh
For the csh or tcsh shell, the command is: source /<INSTALL_DIR>/setEnv.csh
Substitute the installation path for <INSTALL_DIR> as necessary.

Automatically, by inserting the appropriate command in the profile or in a script file. To configure the system to load the environment variables file automatically upon log in:
For the sh, ksh, or bash shell, insert the following line in the profile file: . /<INSTALL_DIR>/setEnv.sh
For the csh or tcsh shell, insert the following line in the login file: source /<INSTALL_DIR>/setEnv.csh

On UNIX-type platforms, B2BDT uses the following environment variables:

Environment Variable: PATH
Required/Optional: Required
Purpose: The environment variables file adds <INSTALL_DIR>/bin to the path. Note: In rare instances, the B2BDT Java document processors require that the JRE be added to the path.

Environment Variable: On AIX: LIBPATH; on Solaris and Linux: LD_LIBRARY_PATH; on HP-UX: SHLIB_PATH and LD_LIBRARY_PATH
Required/Optional: Required
Purpose: The environment variables file adds the installation directory (<INSTALL_DIR>) to the library path. It also adds the JVM directory of the JRE and its parent directory to the path, for example, <INSTALL_DIR>/jre1.4/lib/sparc/server and <INSTALL_DIR>/jre1.4/lib/sparc. This value can be edited to use another compatible JRE.

Environment Variable: CLASSPATH
Required/Optional: Required
Purpose: The environment variables file adds <INSTALL_DIR>/api/lib/CM_JavaAPI.jar to the Java class path.

Environment Variable: IFCONTENTMASTER_HOME
Required/Optional: Required
Purpose: The environment variables file creates this variable, which points to the B2BDT installation directory (<INSTALL_DIR>).

Environment Variable: IFConfigLocation4
Required/Optional: Optional
Purpose: The path of the B2BDT configuration file. This allows for multiple configurations.
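After sourcing the environment file, a quick sanity check (a hedged sketch using only variables named in the table above) confirms that the settings took effect in the current shell:

. /<INSTALL_DIR>/setEnv.sh                       # sh, ksh or bash
echo $IFCONTENTMASTER_HOME                       # should print the B2BDT installation directory
echo $PATH | grep "$IFCONTENTMASTER_HOME/bin"    # confirms the bin directory is on the PATH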

The following is an example of an environment variables file (setEnv.csh) on an AIX system. The variable names and values differ slightly on other UNIX-type operating systems.

## B2B Data Transformation Environment settings
setenv IFCMPath /opt/Informatica/ComplexDataExchange
setenv CMJAVA_PATH /opt/Informatica/ComplexDataExchange/jre1.4/jre/bin/classic:/opt/Informatica/ComplexDataExchange/jre1.4/jre/bin

# Prepend B2B Data Transformation to the PATH
if ( ! $?PATH ) then
    setenv PATH ""
endif
setenv PATH "${IFCMPath}/bin:${PATH}"

# Add CM & java path to LIBPATH
if ( ! $?LIBPATH ) then
    setenv LIBPATH ""
endif
setenv LIBPATH "${IFCMPath}/bin:${CMJAVA_PATH}:${LIBPATH}"

# Update IFCONTENTMASTER_HOME.
setenv IFCONTENTMASTER_HOME "${IFCMPath}"

# Prepend CM path to CLASSPATH
if ( ! $?CLASSPATH ) then
    setenv CLASSPATH ""
endif
setenv CLASSPATH "${IFCMPath}/api/lib/CM_JavaAPI.jar:.:${CLASSPATH}"
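The installer also generates a Bourne-shell version of this file (setEnv.sh) for sh, ksh and bash users. Its exact contents are not reproduced in this document; the following is a minimal sketch, assuming an AIX system and the same variables as the csh example above:

# Illustrative sketch of a setEnv.sh equivalent (not the installer-generated file)
IFCMPath=/opt/Informatica/ComplexDataExchange
CMJAVA_PATH=$IFCMPath/jre1.4/jre/bin/classic:$IFCMPath/jre1.4/jre/bin

PATH="$IFCMPath/bin:$PATH"                          # prepend B2BDT to the PATH
LIBPATH="$IFCMPath/bin:$CMJAVA_PATH:${LIBPATH:-}"   # library path variable on AIX
IFCONTENTMASTER_HOME="$IFCMPath"                    # B2BDT installation directory
CLASSPATH="$IFCMPath/api/lib/CM_JavaAPI.jar:.:${CLASSPATH:-}"

export PATH LIBPATH IFCONTENTMASTER_HOME CLASSPATH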

Step 4:
Configuration settings

Directory Locations

During the B2BDT setup, prompts were completed for the directory locations of the B2BDT repository, log files and JRE. If necessary, alter these locations by editing the following parameters:

Parameter: CM Configuration/Directory services/File system/Base Path
Explanation: The B2BDT repository location, where B2BDT services are stored.

Parameter: CM Configuration/CM Engine/JVM Location
Explanation: On UNIX, this parameter is not available in the Configuration Editor on UNIX-type platforms. For more information about setting the JRE on UNIX, see UNIX Environment Variable Reference.

Parameter: CM Configuration/General/Reports directory
Explanation: The log path, also called the reports path, where B2BDT saves event logs and certain other types of reports.

Parameters: CM Configuration/CM Engine/Invocation and CM Configuration/CM Engine/CM Server
Explanation: These settings control whether the B2BDT Engine runs in-process or out-of-process.

B2BDT has a Configuration Editor for editing the parameters of a B2BDT installation. To open the Configuration Editor on UNIX in graphical mode, enter the following command:

<INSTALL_DIR>/CMConfig

Note: The Configuration Editor is not supported in UNIX console mode.

Some of the Configuration Editor settings are available for all B2BDT installations. Some additional settings vary depending on the B2BDT version and on the optional components that have been installed. The Configuration Editor saves the configuration in an XML file. By default, the file is <INSTALL_DIR>/CMConfig.xml.

Note: Before editing the configuration, save a backup copy of CMConfig.xml. In the event of a problem the backup can be restored. The file <INSTALL_DIR>/CMConfig.bak is a backup of the original <INSTALL_DIR>/CMConfig.xml which the setup program created when B2BDT was installed. Restoring CMConfig.bak reverts B2BDT to its original configuration.

OS environment variables are used to set aspects of the system such as the Java classpath to be used, the location of the configuration file for a specific user, the home location for the installed B2BDT instance, library paths, etc. The following table lists some typical configuration items and where they are set:

Memory for Studio - B2BDT Configuration application
JVM / JRE usage - B2BDT Configuration application
Tuning parameters (threads, timeouts, etc.) - B2BDT Configuration application
User specific settings - Use an environment variable to point to a different configuration file
Memory for runtime - B2BDT Configuration application
Workspace location - B2BDT Configuration application (B2BDT 4.3), B2BDT Studio (B2BDT 4.4)
Event generation - Set in project properties
Repository location - B2BDT Configuration application
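One hedged way to take the backup recommended above before editing the configuration (using the <INSTALL_DIR> placeholder as elsewhere in this chapter):

cd <INSTALL_DIR>
cp -p CMConfig.xml CMConfig.xml.$(date +%Y%m%d)   # timestamped backup alongside the original
# edit the configuration with the Configuration Editor, then if needed
# roll back by copying the saved file (or CMConfig.bak) over CMConfig.xml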

In-Process or Out-of-Process Invocation

Out-of-process invocation requires the use of the B2BDT Server application (which is already installed by the install process). The distinction is that running under server mode causes transformations to potentially run slower, but errors are isolated from the calling process. For web services, the use of Server mode is sometimes recommended because the lifetime of the host process then becomes independent of the lifetime of the process space allocated to run the web service. For example, IIS can run web services in a mode where a process dies or is recycled after a call to a web service. For B2BDT, the first call after a process startup can take up to 3 seconds (subsequent calls are usually milliseconds), hence it is not optimal to start a host process on each invocation. Running in server mode keeps process lifetimes independent.

TIP: B2BDT Studio or the CM_console command always runs data transformations in-process.

Running out-of-process has the following advantages:


- Allows 64-bit processes to activate 32-bit versions of the B2BDT Engine.
- An Engine failure is less likely to disrupt the calling application.
- Helps prevent binary collisions with other modules that run in the process of the calling application.

In-process invocation has the following advantage:


- Faster performance than out-of-process.

Thread pool settings



The thread pool controls the maximum number of Engine threads that can run client requests concurrently per process. If the number of client requests exceeds the number of available threads, the Server queues the requests until a thread is available. The default setting is 4. Some recommendations are summarized in the table below. Actual needs vary depending upon requirements. Best practices and additional recommendations are part of Jumpstart and Base Line Architecture engagements. Contact an Informatica representative for additional information.

Step 5:
Configure ODBC connectivity. Note: This step is only needed if the ODBC database support features of B2BDT will be used. In such a case, an ODBC driver may need to be configured.

Step 6:
Test the installation to confirm that B2BDT operates properly.

Note: Tests are available to verify the engine and document processor installation. Refer to the directory <INSTALL_DIR>/setupTests for the B2BDT test projects testCME and testCMDP. Sample output would be similar to the following:

cd $IFCONTENTMASTER_HOME
cp -R setupTests/testCME ServiceDB/
CM_console testCME
<Result>Test Succeeded</Result>
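The document-processor test project named above, testCMDP, is not shown in the sample; assuming it follows the same pattern as testCME, the corresponding check would be:

cd $IFCONTENTMASTER_HOME
cp -R setupTests/testCMDP ServiceDB/
CM_console testCMDP
# expected output: <Result>Test Succeeded</Result>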

B2BDT Integration with PowerCenter


B2BDT does support using the runtime as a server process to be invoked from PowerCenter on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a server process in developer terminology, it does not provide full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features. Part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.


Installation of B2BDT for PowerCenter is a straightforward process. All required plugins needed to develop and run B2BDT transformations are installed as part of a PowerCenter installation. However, certain B2BDT versions and B2BDT plugins need to be registered after the install process. Refer to the PowerCenter Installation Guide for details.

Note: A PowerCenter UDO Transformation can only be created if the UDO plug-in is successfully registered in the repository. If the UDO Option is installed correctly then UDO Transformations can be created in the PowerCenter Designer.

Note: ODBC drivers provided for PowerCenter are not automatically usable from B2BDT as licensing terms prohibit this in some cases. Contact an Informatica support representative for further details.

Additional Details for PowerCenter Integration

- B2BDT is a Custom Transformation object within PowerCenter.
- INFA passes data via memory buffers to the B2BDT engine and retrieves that output via buffers.
- The B2BDT engine runs IN-PROCESS with the PowerCenter engine.
- The Custom Transformation object for Informatica can be dragged and dropped inside a PowerCenter mapping.

When using a B2BDT transformation, PowerCenter does NOT process the input files directly, but instead takes a path and filename (from a text file). The engine then processes the data through the B2BDT parser defined within the mapping. After this, the data is returned to the PowerCenter B2BDT transformation for processing by other Informatica transformation objects.

TIP: Verify that the Source filename is the name of the text file in which both the file path and the file name are present. It cannot be the actual file being parsed by PowerCenter and B2BDT. This is the direct versus indirect sourcing of the file.
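A hedged illustration of this indirect-file setup (all file names below are hypothetical): the PowerCenter source is a small text file that lists the path of the real document, and the B2BDT parser receives the listed file.

# Build the indirect source file that the PowerCenter session reads
cat > /data/inbound/filelist.txt <<EOF
/data/inbound/po_20080601.edi
EOF
# The session's Source filename points at filelist.txt, not at po_20080601.edi;
# the B2BDT transformation parses the document named inside the list file.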

Useful Tips and Tricks


Version Compatibility?
- Ensure that the version of B2BDT is compatible with PowerCenter; otherwise many issues can manifest in different forms.
- In general, B2BDT 4.4 is compatible with PowerCenter 8.5 and with 8.1.1 SP4 (and SP4 only), B2BDT 4.0.6 is compatible with 8.1.1, and B2BDT 3.2 is compatible with PowerCenter 7.x. For more information refer to the Product Availability Matrix.

Service Deployment?
- Ensure that services are deployed on the remote machine where PowerCenter is installed.
- Services deployed from Studio show up as a dropdown list in the PowerCenter Designer B2BDT transformation.


Note: The services listed are only the ones deployed on the local machine; the Designer does not display services deployed on remote machines. Because locally deployed services can easily be mistaken for remote ones, manually ensure that the services on the local and remote machines are in sync.

- After making certain that the services are deployed on the remote server that also has PowerCenter installed, the B2BDT service can be specified on the Metadata Extensions tab of the UDO (8.1.1) or B2BDT (8.5) transformation.
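To check what is actually deployed on the server side, a hedged sketch (using the default UNIX repository path from this chapter) is simply to list the server's service repository and compare it with the local Studio ServiceDB on the workstation:

# Each sub-directory of ServiceDB on the PowerCenter server is one deployed service
ls /opt/Informatica/ComplexDataExchange/ServiceDB
# Compare this list with the local Windows ServiceDB used by the Designer dropdown.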

Last updated: 02-Jun-08 16:23


B2B Data Transformation Installation (for Windows)

Challenge


Installing and configuring B2B Data Transformation (B2BDT) on new or existing hardware, either in conjunction with PowerCenter or co-existing with other host applications on the same server.

Note: B2B Data Transformation was formerly called Complex Data Exchange (CDE). Any references to PowerExchange Complex Data Exchange in this document now refer to B2B Data Transformation (B2BDT).

Description
Consider the following questions when determining what type of hardware to use for B2BDT:

If the hardware already exists:
1. Are the processor and operating system supported by B2BDT?
2. Are the necessary operating system patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the B2BDT application?
5. Will B2BDT share the machine with other applications? If yes, what are the CPU and memory requirements of other applications?

If the hardware does not already exist:
1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the complex data transformation requirements for B2BDT. Among other factors, the hardware requirements for the B2BDT environment depend upon the data volumes, the number of concurrent users and the application server and operating system used. For exact sizing recommendations, contact Informatica Professional Services for a B2BDT Sizing and Baseline Architecture engagement.

Planning for the B2BDT Installation


There are several variations of the hosting environment from which B2BDT services will be invoked. This has implications on how B2BDT is installed and configured.

Host Software Environment


The most common configurations are:
1. B2BDT to be used in conjunction with PowerCenter
2. B2BDT as a stand-alone configuration
3. B2BDT in conjunction with a non-PowerCenter integration using an adapter for other middleware software such as WebMethods or Oracle BPEL

B2BDT 4.4 includes a mechanism for exposing B2BDT services through web services so that they can be called from applications capable of calling web services. Depending on what host options are chosen, installation options may vary.

Installation of B2BDT for a PowerCenter Host Environment


Be sure to have the necessary licenses and the additional plug-in required for the PowerCenter integration to work. Refer to the appropriate installation guide or contact Informatica support for details on the installation of B2BDT in PowerCenter environments.

Installation of B2BDT for a Standalone Environment


When using B2BDT services in a standalone environment, it is expected that one of the invocation methods (e.g., Web Services, .Net, Java APIs, Command Line or CGI) will be used to invoke B2BDT services. Consult accompanying B2BDT documentation for use in these environments.

Non-PowerCenter Middleware Platform Integration


Be sure to plan for additional agents to be installed. Refer to the appropriate installation guide or contact Informatica support for details for installing B2BDT in environments other than PowerCenter.

Other Decision Points


Where will the B2BDT service repository be located? The choices for the location of the service repository are i) a path on the local file system or ii) a shared network drive. The justification for using a shared network drive is typically to simplify service deployment when two separate B2BDT servers need to share the same repository. While the use of a shared repository is convenient for a multi-server production environment, it is not advisable for development because multiple development teams could potentially overwrite the same project files. When a repository is shared between multiple machines and a service is deployed via B2BDT Studio, the Service Refresh Interval setting controls how quickly other running installations of B2BDT detect the newly deployed service.

What are the multi-user considerations? If multiple users share a machine (but not at the same time), the environment variable IFConfigLocation4 can be used to point each user to a different configuration file.

Security Considerations
As the B2BDT repository, workspace and logging locations are directory-based, all directories to be used should be granted read and write permissions for the user identity under which the B2BDT service will run. The identity associated with the caller of the B2BDT services will also need permission to execute the files installed in the B2BDT binary directory. Special consideration should be given to environments such as web services, where the user identity under which the B2BDT service runs may be different from the interactive user or from the user associated with the calling application.

Log File and Tracing Locations


Log files and tracing options should be configured for appropriate recycling policies. The calling application must have permissions to read, write and delete files to the path that is set for storing these files.

B2BDT Pre-install Checklist


It is best to review the environment and record the information in a detailed checklist to facilitate the B2BDT install.

Minimum System Requirements


Verify that the minimum requirements for the Operating System, Disk Space, Processor Speed and RAM are met and record them in the checklist.

- B2BDT Studio requires Microsoft .NET Framework, version 2.0. If this version is not already installed, the installer will prompt for and install the framework automatically.
- B2BDT requires a Sun Java 2 Runtime Environment, version 1.5.x or above.
- B2BDT is bundled with the appropriate JRE version. The installer can be pointed to an existing JRE or a JRE can be downloaded from Sun.
- To install the optional B2BDT libraries, reserve additional space (refer to the documentation for additional information).

PowerCenter Integration Requirements


Complete the checklist for integration if B2BDT will be integrated with PowerCenter. For an existing PowerCenter installation, the B2BDT client needs to be installed on at least one PC on which the PowerCenter client resides. B2BDT components also need to be installed on the PowerCenter server. If utilizing an existing PowerCenter installation ensure the following:
- Which version of PowerCenter is being used (8.x required)?
- Is the PowerCenter version 32-bit or 64-bit?
- Are the PowerCenter client tools installed on the client PC?
- Is the PowerCenter server installed on the server?

For new PowerCenter installations, the PowerCenter Pre-Install Checklist should be completed. Keep in mind that the same hardware will be utilized for PowerCenter and B2BDT.

For Windows Server, verify the following:
- The login account used for the installation has local administrator rights.
- 500MB or more of temporary workspace is available.
- The Java 2 Runtime Environment version 1.5 or higher is installed and configured.
- Microsoft .NET Framework, version 2.0 is installed.

Non-PowerCenter Integration Requirements


In addition to the general B2BDT requirements, the non-PowerCenter agents require that additional components are installed:

- B2BDT Agent for BizTalk - requires that Microsoft BizTalk Server (version 2004 or 2006) is installed on the same computer as B2BDT. If B2BDT Studio is installed on the same computer as BizTalk Server 2004, the Microsoft SP2 service pack for BizTalk Server must be installed.
- B2BDT Translator for Oracle BPEL - requires that BPEL 10.1.2 or above is installed.
- B2BDT Agent for WebMethods - requires that WebMethods 6.5 or above is installed.
- B2BDT Agent for WebSphere Business Integration Message Broker - requires that WBIMB 5.0 with CSD06 (or WBIMB 6.0) is installed. Also ensure that the Windows platform supports both the B2BDT Engine and WBIMB.

A valid license key is needed to run a B2BDT project and must be installed before B2BDT services will run on the computer. Contact Informatica support to obtain a B2BDT license file (B2BDTLicense.cfg). B2BDT Studio can be used without installing a license file.

B2BDT Installation and Configuration


The B2BDT installation consists of two main components - the B2BDT development workbench (Studio) and the B2BDT Server (which is an application deployed on a server). The installation tips apply to Windows environments. This section should be used as a supplement to the B2B Data Transformation Installation Guide. Before installing B2BDT complete the following steps:
- Verify that the hardware meets the minimum system requirements for B2BDT.
- Ensure that the combination of hardware and operating system is supported by B2BDT.
- Ensure that sufficient space has been allocated for the B2BDT ServiceDB.
- Ensure that all necessary patches have been applied to the operating system.
- Ensure that the B2BDT license file has been obtained from technical support.
- Be sure to have administrative privileges for the installation user id.

Adhere to the following sequence of steps to successfully install B2BDT:
1. Complete the B2BDT pre-install checklist and obtain valid license keys.
2. Install the B2BDT development workbench (Studio) on the Windows platform.
3. Install the B2BDT server on a server machine. When used in conjunction with PowerCenter, the server component must be installed on the same physical machine where PowerCenter resides.
4. Install the necessary client agents when used in conjunction with WebSphere, WebMethods and BizTalk.

In addition to the standard B2BDT components that are installed by default, additional libraries can be installed. Refer to the B2BDT documentation for detailed information on these library components.

B2BDT Install Components


The install package includes the following components:

- B2B Data Transformation Studio
- B2B Data Transformation Engine
- Document Processors
- Documentation
- Optional agents
- Optional libraries

The table below provides descriptions of each component:

Component: Engine
Description: The runtime module that executes B2BDT data transformations. This module is required in all B2BDT installations.

Component: Studio
Description: The design and configuration environment for creating and deploying data transformations. B2BDT Studio is hosted within Eclipse on Windows platforms. The Eclipse setup is included in the B2BDT installation package.

Component: Document Processors
Description: A set of components that perform global processing operations on documents, such as transforming their file formats. All the document processors run on Windows platforms, and most of them run on UNIX-type platforms.

Component: Optional Libraries
Description: Libraries of predefined B2BDT data transformations, which can be used with industry messaging standards such as EDI, ACORD, HL7, HIPAA, and SWIFT. Each library contains parsers, serializers, and XSD schemas for the appropriate messaging standard. The libraries can be installed on Windows platforms. B2BDT Studio can be used to import the library components to projects in order to deploy the projects to Windows or UNIX-type platforms.

Component: Documentation
Description: An online help library, containing all the B2BDT documentation.

Install the B2BDT Studio and Engine


Step 1:
Run the Windows installation file from the software folder on the installation CD and follow the prompts. Follow the wizard to complete the install.


TIP During the installation a language must be selected. If there are plans to change the language at a later point in time in the Configuration Editor, Informatica recommends that a non-English language is chosen for the initial setup. If English is selected and then later changed to another language some of the services that are required for other languages might not be installed.

- The default installation path is C:\Informatica\ComplexDataExchange.
- The default Service Repository Path is <INSTALL_DIR>/ServiceDB. This is the storage location for data transformations that are deployed as B2BDT services.
- The default Log Path is <INSTALL_DIR>/CMReports. The Log Path is the location where the B2BDT Engine stores its log files. The log path is also known as the reports path.


The repository location, JRE path and Log path can be changed subsequent to the installation using environment variables.


Step 2:
Install the license file. Verify the validity of the license file with the following command:

CM_console v

The system displays information such as the location and validity of the license file.

Step 3:
Configure the Environment Variables. The B2BDT setup assigns several environment variables which point to the installation directory and to other locations that the system needs. On Windows, the B2BDT setup creates or modifies the following environment variables:

Environment Variable: PATH
Required/Optional: Required
Purpose: The setup adds <INSTALL_DIR>\bin to the path. Note: In rare instances, the B2BDT Java document processors require that the JRE be added to the path.

Environment Variable: CLASSPATH
Required/Optional: Required
Purpose: The setup adds <INSTALL_DIR>\api\lib\CM_JavaAPI.jar to the Java class path.

Environment Variable: IFCONTENTMASTER_HOME
Required/Optional: Required
Purpose: The setup creates this environment variable, which points to the B2BDT installation directory (<INSTALL_DIR>).

Environment Variable: IFConfigLocation4
Required/Optional: Optional
Purpose: The path of the B2BDT configuration file.

Step 4:
Configuration settings. The configuration application allows for the setting of properties such as JVM parameters, thread pool settings, memory available to the Studio environment and many others. Consult the administrator's guide for a full list of settings and their effects. Properties set using the B2BDT configuration application affect both the operation of the standalone B2BDT runtime environment and the behavior of the B2BDT Studio environment. To open the Configuration Editor in Windows, from the Start menu choose Informatica > B2BDT > Configuration.


Some of the Configuration Editor settings are available for all B2BDT installations. Additional settings vary depending on the B2BDT version and the optional components installed. The Configuration Editor saves the configuration in an XML file. By default, the file is <INSTALL_DIR>/CMConfig.xml.

The B2BDT Studio environment should be installed on each developer's machine or environment. While advances in virtualization technologies, and technologies such as Windows Remote Desktop connections, theoretically allow multiple users to share the same B2BDT installation, the B2BDT Studio environment does not implement mechanisms (such as file locking during the authoring of transformations) that are needed to prevent multiple users from overwriting each other's work.

An environment variable called IFConfigLocation4 can be defined. The value of the variable must be the path of a valid configuration file (e.g., c:\MyIFConfigLocation4\CMConfig1.xml). For example, if two users want to run the B2BDT Engine with different configurations on the same platform, store their respective configuration files in their home directories. Both files must have the name CMConfig.xml. Alternatively, store a CMConfig.xml file in the home directory of one of the users; the other user will use the default configuration file (e.g., <INSTALL_DIR>/CMConfig.xml).

TIP: Always save a backup copy of CMConfig.xml prior to editing. In the event of a problem the last known backup can be restored. The file <INSTALL_DIR>/CMConfig.bak is a backup of the original <INSTALL_DIR>/CMConfig.xml which the setup program created when B2BDT was installed. Restoring CMConfig.bak reverts B2BDT to its original configuration.

OS environment variables are used to set aspects of the system such as the Java classpath to be used, location of the configuration file for a specific user, home location for the installed B2BDT instance to be used, library paths, etc.
The following table lists some typical configuration items and where they are set:

Memory for Studio - B2BDT Configuration application
JVM / JRE usage - B2BDT Configuration application
Tuning parameters (threads, timeouts, etc.) - B2BDT Configuration application
User specific settings - Use an environment variable to point to a different configuration file
Memory for runtime - B2BDT Configuration application
Workspace location - B2BDT Configuration application (B2BDT 4.3), B2BDT Studio (B2BDT 4.4)
Event generation - Set in project properties
Repository location - B2BDT Configuration application

In-Process or Out-of-Process Invocation

Out-of-process invocation requires the use of the B2BDT Server application (which is already installed by the install process). The distinction is that running under server mode causes transformations to potentially run slower, but errors are isolated from the calling process. For web services, the use of Server mode is sometimes recommended because the lifetime of the host process then becomes independent of the lifetime of the process space allocated to run the web service. For example, IIS can run web services in a mode where a process dies or is recycled after a call to a web service. For B2BDT, the first call after a process startup can take up to 3 seconds (subsequent calls are usually milliseconds), hence it is not optimal to start a host process on each invocation. Running in server mode keeps process lifetimes independent.

TIP: B2BDT Studio or the CM_console command always runs data transformations in-process.

Running out-of-process has the following advantages:


- Allows 64-bit processes to activate 32-bit versions of the B2BDT Engine.
- An Engine failure is less likely to disrupt the calling application.
- Helps prevent binary collisions with other modules that run in the process of the calling application.

In-process invocation has the following advantage:

- Faster performance than out-of-process.

Thread pool settings

The thread pool controls the maximum number of Engine threads that can run client requests concurrently per process. If the number of client requests exceeds the number of available threads, the Server queues the requests until a thread is available. The default setting is 4.

Some recommendations are summarized in the table below. Actual needs vary depending upon requirements. Best practices and additional recommendations are part of Jumpstart and Base Line Architecture engagements. Contact an Informatica representative for additional information.

Key Setting: Eclipse settings
Parameter: Memory available to Studio
Suggestion: By default, Eclipse allocates up to 256MB to the Java VM. Set -vmargs -Xmx512M to allocate 512MB.

Key Setting: Log file locations
Suggestion: Location security needs to match the identity of the B2BDT engine.

Key Setting: ServiceDB
Suggestion: Need to have read permissions for the ServiceDB locations.

Key Setting: Preprocessor buffer sizes
Suggestion: Change if running out of memory during source file processing.

Key Setting: Service Refresh Interval


Step 5:
Configure ODBC connectivity. Note: this step is only needed if the ODBC database support features of B2BDT will be used. In such case, an ODBC driver may need to be configured.

Step 6:
Test the installation to confirm that B2BDT operates properly.

Note: Tests are available to verify the engine and document processor installation. Refer to the directory <INSTALL_DIR>\setupTests for the B2BDT test projects testCME and testCMDP.


B2BDT Integration With PowerCenter


B2BDT does support using the runtime as a server process to be invoked from PowerCenter on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a server process in developer terminology, it does not provide full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features. Part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.


Installation of B2BDT for PowerCenter is a straightforward process. All required plugins needed to develop and run B2BDT transformations are installed as part of a PowerCenter installation. However, certain B2BDT versions and B2BDT plugins need to be registered after the install process. Refer to the PowerCenter Installation Guide for details. The repository option copies the B2BDT plug-ins to the Plugin directory. Register the B2BDT plug-ins in the PowerCenter repository.

PowerCenter 7.1.x
Register the UDT.xml plug-in in the PowerCenter Repository Server installation Plugin directory. The B2BDT plug-in will appear under the repository in the Repository Server Administration Console.

PowerCenter 8.1.x
Register the pmudt.xml plug-in in the Plugin directory of the PowerCenter Services installation. When the B2BDT plug-in is successfully registered in PowerCenter 8.1 it will appear in the Administration Console as follows:


Note: A PowerCenter UDO Transformation can only be created if the UDO plug-in is successfully registered in the repository. If the UDO Option is installed correctly then UDO Transformations can be created in the PowerCenter Designer.

Note: ODBC drivers provided for PowerCenter are not automatically usable from B2BDT as licensing terms prohibit this in some cases. Contact an Informatica support representative for further details.

Additional Details for PowerCenter Integration

B2BDT is a Custom Transformation object within PowerCenter


- INFA passes data via memory buffers to the B2BDT engine and retrieves that output via buffers.
- The B2BDT engine runs IN-PROCESS with the PowerCenter engine.
- The Custom Transformation object for Informatica can be dragged and dropped inside a PowerCenter mapping.


When using a B2BDT transformation, PowerCenter does NOT process the input files directly, but instead takes a path and filename (from a text file). The engine then processes the data through the B2BDT parser defined within the mapping. After this, the data is returned to the PowerCenter B2BDT transformation for processing by other Informatica transformation objects.

TIP: Verify that the Source filename is the name of the text file in which both the file path and the file name are present. It cannot be the actual file being parsed by PowerCenter and B2BDT. This is the direct versus indirect sourcing of the file.


Useful Tips and Tricks


Can I use an existing Eclipse install with B2BDT?
- Yes, but make sure it is compatible with the version of the B2BDT installation; check the product compatibility matrix for additional information. B2BDT can be made to work with a different version of Eclipse, however this is not guaranteed.

Is there a silent install available for B2BDT on Windows?


- As of B2BDT 4.4 there is no silent install mode, but one is likely to be available in a future release.

Version Compatibility?
- Ensure that the version of B2BDT is compatible with PowerCenter; otherwise many issues can manifest in different forms.
- In general, B2BDT 4.4 is compatible with PowerCenter 8.5 and with 8.1.1 SP4 (and SP4 only), B2BDT 4.0.6 is compatible with 8.1.1, and B2BDT 3.2 is compatible with PowerCenter 7.x. For more information refer to the Product Availability Matrix.

Service Deployment?
- Ensure that services are deployed on the remote machine where PowerCenter is installed.
- Services deployed from Studio show up as a dropdown list in the PowerCenter Designer B2BDT transformation.

Note: The services listed are only the ones deployed on the local machine; the Designer does not display services deployed on remote machines. Because locally deployed services can easily be mistaken for remote ones, manually ensure that the services on the local and remote machines are in sync.

- After making certain that the services are deployed on the remote server that also has PowerCenter installed, the B2BDT service can be specified on the Metadata Extensions tab of the UDO (8.1.1) or B2BDT (8.5) transformation.

Common Installation Troubleshooting Tips

Problem
Problem Description: The following error occurs when opening B2BDT Studio:
There was a problem running ContentMaster studio, Please make sure /CMConfig.XML is a valid configuration file (Error code=2)

Solution: To resolve this issue, edit the CMConfig.xml file and add the following section after </CMAgents> and before <CMDocumentProcessors version="4.0.6.61"/>:

    <CMStudio version="4.0.6.61">
      <Eclipse>
        <Path>C:/Program Files/Itemfield/ContentMaster4/eclipse</Path>
        <Workspace>C:\Documents and Settings\kjatin.INFORMATICA\My Documents\Itemfield\ContentMaster\4.0\workspace</Workspace>
      </Eclipse>
    </CMStudio>

Note: Modify the path names as necessary to match the installation settings.

Problem
Problem Description: Content Master Studio fails to open with the following error:
Failed to Initialize CM engine! CM license is limited to 1 CPU, and is not compatible with this machine's hardware. Please contact support.

Cause: The Content Master license covers fewer CPUs than the machine has. Incorrect information about the number of CPUs was entered during registration, so the license provided is for a machine with fewer CPUs.

Solution: Perform the registration again, enter the correct number of CPUs, and send the new registration.txt to Informatica Support to obtain a new license. When the new license is received, replace the existing one in the Content Master installation directory.


Problem
Problem Description: When launching the Designer after installing the Unstructured Data Option (UDO), the following error is displayed:
Failed to load DLL: pmudtclient.dll for Plug-in: PC_UDT

Cause: This error occurs when Content Master has not been installed along with the PowerCenter UDO.

Solution: To resolve this issue, install Content Master.

Last updated: 31-May-08 19:00


Deployment of B2B Data Transformation Services

Challenge


Outline the steps and strategies for deploying B2B Data Transformation services.

Description
Deployment is the process by which a data transformation is made available as a service that is accessible to the B2B Data Transformation runtime engine. When a project is published to a specific transformation service, a directory is created whose name corresponds to the published service name in the B2B Data Transformation Service DB, which forms a runtime repository of services. A CMW file corresponding to the service name is created in the same directory. The deployed service is stored in the Data Transformation service repository.

On Windows platforms, the default repository location is: c:\Program Files\Informatica\ComplexDataExchange\ServiceDB
On UNIX platforms, the default location is: /opt/Informatica/ComplexDataExchange/ServiceDB

Basics of B2B Data Transformation Service Deployment


When running in the B2B Data Transformation Studio environment, developers can test a service directly without deployment. However, in order to test integration with the host environment, platform agents or external code, it is necessary to deploy the service. Deploying the transformation service copies the service with its current settings to the B2B Data Transformation service repository (also known as the Service DB folder). Deploying a service also sets the entry point for the transformation service.

Note: The location of the service repository is set using the B2B Data Transformation configuration utility.

If changes are made to the project options or to the starting point of the service, it is necessary to redeploy the service in order for the changes to take effect. When the service is deployed, all service script files, schemas, sample data and other project artifacts are deployed to the service repository as specified by the B2B Data Transformation configuration options in effect in the Studio environment from which the service is being deployed. A transformation service can be deployed multiple times under different service names, with the same or different options for each deployed service. While Informatica recommends deploying only one service from each B2B Data Transformation project for production, it is useful to deploy a transformation service under different names when testing different option combinations.

Deployment for Test Purposes


It is important to finish configuring and testing a data transformation before deploying it as a B2B Data Transformation service. Deploying the service allows the B2B Data Transformation runtime engine to access and run the project. When running in the B2B Data Transformation Studio environment, developers can test the service

directly without deployment. However, in order to test integration with the host environment, platform agents or external code, it is necessary to deploy the service.

Initial Production Deployment of B2B Data Transformation Services


Deploying services in the production environment allows applications to run the transformation services on live data. B2B Data Transformation services can be deployed from the B2B Data Transformation Studio environment computer to a remote computer such as a production server. The remote computer can be a Windows or UNIX-type platform where B2B Data Transformation Engine is installed. A service can be deployed to a remote computer either by a) deploying it directly to the remote computer, or b) deploying the service locally and then copying the service to the remote computer.

To deploy a service to a remote computer:
1. Deploy the service on the development computer.
2. Copy the deployed project directory from the B2B Data Transformation repository on the development computer to the repository on the remote computer.
3. If you have added any custom components or files to the B2B Data Transformation autoInclude\user directory, copy them to the autoInclude\user directory on the remote computer.

Alternatively, if the development computer can access the remote file system, you can change the B2B Data Transformation repository to the remote location and deploy directly to the remote computer.
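As an illustration of steps 2 and 3, the following shell sketch copies a deployed service and any autoInclude\user customizations to a remote UNIX server. The service name, host name and paths are assumptions; adjust them to the actual installation, and use an equivalent copy mechanism on Windows.

    #!/bin/sh
    # Sketch only: push one deployed service from development to a remote engine.
    SERVICE=OrderParser                                   # hypothetical service name
    CDE_HOME=/opt/Informatica/ComplexDataExchange
    REMOTE=produser@prodserver                            # hypothetical remote host

    # Step 2: copy the deployed service directory into the remote repository
    scp -r "$CDE_HOME/ServiceDB/$SERVICE" "$REMOTE:$CDE_HOME/ServiceDB/"

    # Step 3: copy any custom components added to autoInclude/user
    scp -r "$CDE_HOME/autoInclude/user/." "$REMOTE:$CDE_HOME/autoInclude/user/"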

Deployment of Production Updates to B2B Data Transformation Services


B2B Data Transformation Studio cannot open a deployed project that is located in the repository. If you need to edit the data transformation, modify the original project and redeploy it.

To edit and redeploy a project:
1. Open the development copy of the project in B2B Data Transformation Studio. Edit and test it as required.
2. Redeploy the service to the same location, under the same service name. You are prompted to overwrite the previously deployed version.

Redeploying overwrites the complete service folder, including any output files or other files that you have stored in it. There is no versioning available in B2B Data Transformation. If previous versions of the deployed services are required, make a copy of the current service in a separate location (not in the Service DB directory) or use a commercial or open source backup solution.

Renaming the service folder is also possible, in which case the project name has to be renamed as well. This is not a recommended practice for backing up services or for deploying a service multiple times; it is preferred to use the Studio environment to deploy a service multiple times, as behaviors may change in future versions. For backup, there are many commercial and open source backup solutions available, and to quickly retain a copy of a service, the service should be copied to a directory outside of the Service DB folder.

Important: No more than one deployed service can have the same service and project name. Project files contain configuration properties and indicate the transformation startup component. Having multiple services with identical project file names, even if the service names are different, will cause service execution to fail.

Simple Service Deployment


There are two ways to deploy a service: deploy it directly as a service from within Data Transformation Studio, or deploy the service locally and copy the service folder to the appropriate ServiceDB.

Single Service Deployment from Within the B2B Data Transformation Studio Environment

1. In the B2B Data Transformation Explorer, select the project to be deployed.

2. On the B2B Data Transformation menu, click Project > Deploy.


3. The Deploy Service window displays the service details. Edit the information as required. Click the Deploy button.


4. Click OK.

5. At the lower right of the B2B Data Transformation Studio window, display the Repository view. The view lists the service that you have deployed, along with any other B2B Data Transformation services that have been deployed on the computer.


Single Service Deployment Via File Movement

Alternatively, the service folder can be copied directly into the ServiceDB folder.

To check if the service deployed is valid, run CM_Console in the command line.

Alternatively, the cAPITest.exe utility can be used to test the deployed service.

The B2B Data Transformation Engine determines whether any services have been revised by periodically examining the timestamp of a file called update.txt. By default, the timestamp is examined every thirty seconds. The update.txt file exists in the repository root directory, which is, by default, the ServiceDB directory. The content of the file can be empty. If this is the first time a service is deployed to the remote repository, update.txt might not exist. If the file is missing, copy it from the local repository. If update.txt exists, update its timestamp as follows:
- On Windows: open update.txt in Notepad and save it.
- On UNIX: open a command prompt, change to the repository directory, and enter the following command: touch update.txt

You can change the interval used to check for service updates by modifying the Service refresh interval in the B2B Data Transformation configuration editor.
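A small sketch of the file-movement deployment follows, assuming a UNIX engine, the default repository path and a hypothetical service name and staging location.

    #!/bin/sh
    # Sketch only: copy a locally deployed service folder into the ServiceDB
    # and signal the engine to refresh its list of services.
    SERVICE=OrderParser                                   # hypothetical service name
    SERVICE_DB=/opt/Informatica/ComplexDataExchange/ServiceDB

    cp -r "/tmp/staging/$SERVICE" "$SERVICE_DB/"
    # Create update.txt if it does not yet exist, or refresh its timestamp;
    # the engine re-reads the repository within the refresh interval (30s default).
    touch "$SERVICE_DB/update.txt"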


Multi-Service Deployment
When a solution involves the use of multiple services, these may be authored as multiple independent B2B Data Transformation projects, or as a single B2B Data Transformation project with multiple entry points to be deployed as multiple services under different names. For complex solutions, we recommend the use of multiple separate projects for independent services, reserving the use of multiple runnable components within the same project for test utilities and troubleshooting items.

While it is possible to deploy the set of services that make up a multi-service solution into production from the Studio environment, we recommend deploying these services to a test environment where the solution can be verified before deploying into production. In this way, mismatches between different versions of the solution's transformation services can be avoided. In particular, when dependencies occur between services due to the use of B2B Data Transformation features such as TransformByService, or due to interdependencies in the calling system, it is necessary to avoid deploying mismatching versions of transformation services and to deploy services into production as a group. Simple batch files or shell scripts can be created to deploy the services as a group from the test environment to the production environment, and commercial enterprise system administration and deployment software will usually allow creation of a deployment package to facilitate scheduled unattended deployment and monitoring of deployment operations.

As a best practice, creating a dependency matrix for each project to be deployed allows developers to identify the services required by each project and the services that are commonly accessed by the majority of projects. This allows for better deployment strategies and helps to keep track of impacted services should changes be made to them.
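The following shell sketch illustrates a simple group deployment script of the kind described above. The service names, host and repository paths are assumptions.

    #!/bin/sh
    # Sketch only: deploy a set of interdependent services as one unit
    # from the test repository to the production repository.
    SERVICES="OrderParser OrderMapper AckSerializer"      # hypothetical service set
    TEST_DB=/opt/Informatica/ComplexDataExchange/ServiceDB
    PROD=produser@prodserver
    PROD_DB=/opt/Informatica/ComplexDataExchange/ServiceDB

    for SVC in $SERVICES; do
        scp -r "$TEST_DB/$SVC" "$PROD:$PROD_DB/" || exit 1   # stop on any failure
    done
    # Tell the production engine that the repository has changed.
    ssh "$PROD" "touch $PROD_DB/update.txt"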

Deploying for Full Uptime Systems


B2B Data Transformation has the ability to integrate into various applications, allowing it to become a full uptime system. An integration component, called the B2B Data Transformation Agent, runs a B2B Data Transformation service that performs the data transformation. An integration system's capabilities are enhanced by supporting the

conversion of many document formats that it does not natively support. Deploying services for full uptime systems follows the same process as for standalone B2B Data Transformation services. However, it is important to make sure that the user accounts used by the calling application have the necessary permissions to execute the B2B Data Transformation service and to write to the locations configured for storing error logs. After deploying the service, it may be necessary to stop and restart the workflow invoking the service. Make sure that the update.txt timestamp is updated; B2B Data Transformation Engine determines whether any services have been revised by periodically examining the timestamp of update.txt. By default, the timestamp is examined every thirty seconds.

Multiple Server Deployment


For enhanced performance, you can install B2B Data Transformation on multiple Windows or UNIX servers. The following discussion assumes that you use a load balancing module to connect to multiple, identically configured servers. The servers should share the same B2B Data Transformation services. There are two ways to implement a multiple server deployment:

- Shared file system: store a single copy of the B2B Data Transformation repository on a shared disk. Configure all the servers to access the shared repository.
- Replicated file system: configure each server with its own B2B Data Transformation repository. Use an automatic file deployment tool to mirror the B2B Data Transformation repository from a source location to the individual servers.

If the second approach is adopted, you must replicate or touch the file update.txt, which exists in the repository directory. The timestamp of this file notifies B2B Data Transformation Engine when the last service update was performed.
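For the replicated file system approach, a mirroring script along the following lines can be scheduled. The host names and repository path are assumptions, and rsync is only one example of an automatic file deployment tool.

    #!/bin/sh
    # Sketch only: mirror the source repository to each identically
    # configured server, then refresh update.txt on each of them.
    SOURCE_DB=/opt/Informatica/ComplexDataExchange/ServiceDB
    SERVERS="cdx01 cdx02 cdx03"                           # hypothetical server names

    for HOST in $SERVERS; do
        rsync -a --delete "$SOURCE_DB/" "$HOST:$SOURCE_DB/"
        ssh "$HOST" "touch $SOURCE_DB/update.txt"
    done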

Designing B2B Data Transformation Services for Deployment

Identifying Versions Currently Deployed

Whenever a service is deployed through B2B Data Transformation Studio, the user is prompted to set the following options:

- Service Name: The name of the service. By default, this is the project name. To ensure cross-platform compatibility, the name must contain only English letters (A-Z, a-z), numerals (0-9), spaces, and the following symbols: %&+-=@_{}. B2B Data Transformation creates a folder having the service name in the repository location.
- Label: A version identifier. The default value is a time stamp indicating when the service was deployed.
- Startup Component: The runnable component that the service should start.
- Author: The person who developed the project.
- Description: A description of the service.

Although version tracking is not available in the current version of B2B Data Transformation, deployment does take the service deployment timestamps into account. The deployment options are stored in a log file called deploy.log, which keeps a history of all deployment options set through B2B Data Transformation Studio. The option settings entered in the Deploy Service window are appended to the log file.

Deploying services to different servers through file copying or FTP does not update the deployment log file. It has to be updated manually if this additional information is required.

Security and User Permissions


User permissions are required by users who install and use B2B Data Transformation Studio and Engine. Depending on the B2B Data Transformation application the organization runs, and the host environment used to invoke the services, additional permissions might be required. To configure data transformations in B2B Data Transformation Studio, users must have the following permissions:

- Read and write permission for the Eclipse workspace location
- Read and execute permission for the B2B Data Transformation installation directory and for all its subdirectories
- Read and write permission for the B2B Data Transformation repository, where the services are deployed
- Read and write permissions for the log application

For applications running B2B Data Transformation Engine, a user account with the following permissions is required:

- Read and execute permission for the B2B Data Transformation installation directory and for its subdirectories
- Read permission for the B2B Data Transformation repository
- Read and write permission for the B2B Data Transformation log path, or for any other location where B2B Data Transformation applications are configured to store error logs

Aside from user permissions, it is important to identify the user types that will be assigned to work with B2B Data Transformation. In a Windows setup, an administrator or limited user can be registered in the Windows Control Panel. Windows users who have administrative privileges can perform all B2B Data Transformation operations. Limited users, however, do not have write permissions for the B2B Data Transformation program directory and are NOT allowed to perform the following:

- Install or uninstall the B2B Data Transformation software
- Install a B2B Data Transformation license file
- Deploy services to the default B2B Data Transformation repository
- Add custom components such as document processors or transformers
- Change the setting values in the Configuration Editor

Backup Requirements
It is necessary to make regular backups of several B2B Data Transformation directories and files. In the production environment where B2B Data Transformation runs, it is important to back up three locations: the configuration file, the service repository, and the autoInclude\user directory. For development environments, we recommend using a commercial or open source source-control system such as Subversion to manage backup and versioning of the B2B Data Transformation Studio workspaces of the developers in the organization; in addition, back up the same locations listed above for the production environment. If you use identical configurations on multiple servers, back up only a single copy of these items. In the event of a server failure, B2B Data Transformation can be re-installed in the same location as on the failed server and the backup restored.
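A minimal backup sketch for the three production locations is shown below; the installation and backup paths are assumptions and should be adjusted to the actual environment.

    #!/bin/sh
    # Sketch only: archive the configuration file, the service repository
    # and the autoInclude/user directory into a dated backup file.
    CDE_HOME=/opt/Informatica/ComplexDataExchange
    BACKUP_DIR=/backup/b2bdt                              # hypothetical backup location
    STAMP=$(date +%Y%m%d)

    tar -czf "$BACKUP_DIR/b2bdt_backup_$STAMP.tar.gz" \
        "$CDE_HOME/CMConfig.xml" \
        "$CDE_HOME/ServiceDB" \
        "$CDE_HOME/autoInclude/user"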

Failure Handling
If a B2B Data Transformation service fails to execute successfully, it returns a failure status to the calling application. It is the responsibility of the calling application to handle the error. For example, the application can transmit failed input data to a failure queue. The application can package related inputs in a transaction to ensure that important data is not lost.


In the event of a failure, the B2B Data Transformation Engine will generate an event log if event logging has been enabled for the project. To view the contents of the event file, drag the *.cme file into the events pane in B2B Data Transformation Studio. The method used to invoke a B2B Data Transformation service affects how and whether events are generated. The following list compares the effect of each invocation method on the generation of events:

- CM_Console: A service deployed with events will produce events; a service deployed without events will not produce events.
- Java API: The service runs without events. In case of error, the service is rerun with events.
- C# / .Net: Same as the Java API.
- Agents: No events unless an error occurs, irrespective of how the service was deployed. The same behavior is used for PowerCenter.

While the event log provides a simple mechanism for error handling, it also has a high cost in resources such as memory and disk space for storing the event logs. For anything other than the simplest of projects, it is recommended to design an error handling mechanism into your transformations and calling logic to handle errors and provide the appropriate alerting needed when errors occur. In many production scenarios, the event log will need to be switched off for optimal performance and resource usage.

Updating Deployed Services


B2B Data Transformation Studio cannot directly update a deployed project in the transformation service repository. To perform updates on the data transformation, the modifications must be made to the original transformation project and the project then needs to be redeployed.

Note: A different project can be used, which may be deployed under the existing service name, so technically it does not have to be exactly the original project.

If it is required to track all deployed versions of the data transformation, make a copy of the current service in a separate location, or alternatively consider the use of a source control system such as Subversion. Redeploying overwrites the complete service folder, including any output files or other files that you have stored in it. It is important to test the deployed service following any modifications. While the Studio environment will catch some errors and block deployment if the transformation is invalid, some types of runtime errors cannot be caught by the Studio environment prior to deployment.

Upgrading B2B Data Transformation Software (Studio and Runtime Environment)


When upgrading from a previous B2B Data Transformation release, existing projects and deployed services can also be upgraded to the current release. The upgrade of projects from B2B Data Transformation version 3.1 or higher is

automatic. Individual projects can be opened or imported in B2B Data Transformation Studio, with the developer prompted to upgrade the project if necessary. Test the project and confirm that it runs correctly once the upgrade is complete, then deploy the service to the production environment.

Another way to upgrade services is through the syntax conversion tool that comes with B2B Data Transformation. It allows multiple projects and services to be upgraded quickly, in an automated operation. It is also used to upgrade global TGP script files, which are stored in the B2B Data Transformation autoInclude\user directory. The syntax conversion tool supports upgrades of projects or services from version 3.1 and higher on Windows, and from release 4 on UNIX-type platforms. Before the upgrade, the tool creates an automatic backup of your existing projects and files. It creates a log file and reports any upgrade errors that it detects. In case of an error, restore the backup, correct the problem, and run the tool again.

It is necessary to organize the projects before running the tool. The tool operates on projects or services that are stored in a single parent directory. It can operate on:
- A B2B Data Transformation Studio version 4 workspace
- A B2B Data Transformation repository
- Any other directory that contains B2B Data Transformation Studio projects or services

Within the parent directory, the projects must be at the top level of nesting, for example:
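A hypothetical layout meeting this requirement might look as follows (the directory and project names are examples only):

    C:\CDE_Upgrade\                parent directory passed to the tool
        OrdersParser\              project directory at the top level of nesting
            OrdersParser.cmw
            ...
        InvoiceMapper\
            InvoiceMapper.cmw
            ...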

If the projects are not currently stored in a single parent directory, re-organize them before running the tool. Alternatively, the tool can be run separately on the individual parent directories. To run the syntax conversion tool on Windows, go to the B2B Data Transformation folder in the Start menu and click Syntax Conversion Tool. The tool is a window with several tabs, where the upgrade settings can be configured.


After the service upgrade is complete, change the repository location to the new location using the Configuration Editor. Test the projects and services to confirm that they work correctly and that their behavior has not changed. On UNIX platforms, run the command <INSTALL_DIR>/bin/CM_DBConverter.sh; only 4.x is supported.

Optionally, you can run the syntax conversion tool from the command line, without displaying the graphical user interface. In an open console, change to the B2B Data Transformation bin directory and run the following command:
- On Windows: CM_DBConverter.bat <switches>
- On UNIX: CM_DBConverter.sh console <switches>

Following each switch, leave a space and type the value. If a path contains spaces, you must enclose it in quotation marks. The <switches> are listed below.

- -v (Required): Version from which you are upgrading (3 or 4). On UNIX, only 4 is supported.
- -s (Required): Path of the source directory containing projects or services.
- -d (Optional): Path of the target directory. If you omit this switch, the tool overwrites the existing directory.
- -si (Optional): Path of the source autoInclude\user directory. If you omit this switch, the tool does not upgrade global TGP files.
- -di (Optional): Path of the target autoInclude\user directory. If you omit this switch, the tool overwrites the existing directory.
- -l (Optional): Path of the upgrade log file. The default is <INSTALL_DIR>\SyntaxConversionLog.txt.
- -b (Optional): Path of the backup directory, where the tool backs up the original projects or services prior to the upgrade. The default is the value of the -s switch concatenated with the suffix _OLD_Backup.
- -e (Optional): Path of the error directory, where the tool stores any projects or services that it cannot upgrade due to an error. The default is the value of the -s switch concatenated with the suffix _OLD_Failure.
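For example, a command-line run on UNIX might look like the following; the directory paths are illustrative only.

    # Upgrade version 4 projects under /opt/cde_projects, writing the converted
    # projects to a separate target directory and an explicit log file.
    cd /opt/Informatica/ComplexDataExchange/bin
    ./CM_DBConverter.sh console -v 4 \
        -s /opt/cde_projects \
        -d /opt/cde_projects_44 \
        -l /tmp/SyntaxConversionLog.txt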

Last updated: 29-May-08 16:47


Establishing a B2B Data Transformation Development Architecture

Challenge


Establish a development architecture that ensures support for team development of B2B Data Transformation solutions; establishes strategies for common development tasks such as error handling and the styles of B2B Data Transformation service authoring; and plans for the subsequent clean migration of solutions between development, test, quality assurance (QA) and production environments that can scale to handle additional users and applications as the business and development needs evolve.

Description
In this Best Practice the term development architecture means establishing a development environment and establishing strategies for error handling, version control, naming conventions, mechanisms for integration with the host environment and other aspects of developing B2B Data Transformation services not specific to a particular solution. Planning for the migration of the completed solution is closely related to the development architecture. This can include transfer of finished and work-in-progress solutions between different members of the same team, between different teams such as development, QA and production teams, and between development, test and production environments.

Deciding how to structure the development environment for one or more projects depends upon several factors. These include technical factors, such as choices for hosting software and host environments, and organizational factors regarding the project team makeup and interaction with operations, support and external test organizations.

Technical factors:
- What host environment is used to invoke the B2B Data Transformation services?
- What are the OS platform(s) for development, test and production?
- What software versions are being used for both B2B Data Transformation and for the host environment software?
- How much memory is available on the development, test and production platforms?
- Are there previous versions of the B2B Data Transformation software in use?
- The use of shared technical artifacts such as XML schemas shared between projects, services, applications and developers.
- What environments are expected to be used for development, test and production? (Typically development is performed on Windows; test and production may be AIX, Solaris, Linux, etc.)

Organizational Factors:
- How do development, test, production and operations teams interact?
- Do individual developers work on more than one application at a time?
- Are the developers focused on a single project, application or project component?
- How are transformations in progress shared between developers?
- What source code control system, if any, is used by the developers?
- Are development machines shared between developers, either through sequential use of a physical machine, through the use of virtual machines or through technologies such as Remote Desktop Access?
- How are different versions of a solution, application or project managed?
- What is the current stage of the project life cycle? For example, has the service being modified already been deployed to production?
- Do developers maintain or create B2B Data Transformation services for multiple versions of the B2B Data Transformation products?

Each of these factors plays a role in determining the most appropriate development environment for a B2B Data Transformation project. In some cases, it may be necessary to create different approaches for different development groups according to their needs. B2B Data Transformation, together with the B2BDT Studio environment, offers flexible development configuration options that can be adapted to fit the needs of each project or application development team. This Best Practice is intended to help the development team decide what techniques are most appropriate for the project. The following sections discuss the various options that are available, based on the environment and architecture selected.

Terminology
B2B Data Transformation (abbreviated as B2BDT) is used as a generic term for the parsing, transformation and serialization technologies provided in Informatica's B2B Data Exchange products. These technologies have been made available through the Unstructured Data Option for PowerCenter, and as standalone products known respectively as B2B Data Transformation, PowerExchange for Complex Data and, formerly, ItemField ContentMaster.

The B2B Data Transformation development environment uses the concepts of workspaces, projects and services to organize its transformation services. The overall business solution or solutions may impose additional structure requirements, such as organizing B2BDT services into logical divisions (solutions, applications, projects and business services) corresponding to the needs of the business. There may be multiple B2BDT services corresponding to these logical solution elements. We use the term B2BDT service to refer to a single Complex Data Exchange transformation service, and B2BDT project to refer to the B2B Data Transformation project construct as exposed within the B2BDT Studio environment.

Throughout this document we use the term developers to refer to team members who create B2BDT services, irrespective of their actual roles in the organization. Actual roles may include business analysts, technical staff in project or application development teams, members of test and QA organizations, or members of IT support and helpdesk operations who create new B2BDT transformations or maintain existing ones.

Fundamental Aspects of B2BDT Transformation Development


There are a number of fundamental concepts and aspects to development of B2BDT transformations that affect design of the development architecture and distinguish B2BDT development architecture from other development architectures.

B2BDT is an Embedded Platform


When B2BDT transformations are placed into production, the runtime is typically used in conjunction with other enterprise application or middleware platforms. The B2BDT runtime is typically invoked from other platform software (such as PowerCenter, BizTalk, WebMethods or other EAI or application server software) through the use of integration platform adapters, custom code or some other means. While it is also possible to invoke B2BDT services from a command line utility (CM_Console) without requiring the use of additional platform software, this is mainly provided for quick testing and troubleshooting purposes. CM_Console does not provide access to all available system memory or scale across multiple CPUs. Specifically, restrictions on the CM_Console application include always running the B2BDT transformation engine in-process and use of the local directory for event output. B2BDT does support using the runtime as a server process to be invoked from other software on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a server process in developer terminology, it does not provide full server administration or

monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features, and part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.

B2BDT Deployment with PowerCenter

While the B2BDT runtime is usually deployed on the same machine as the host EAI environment, it is possible to locate the B2BDT services (stored in a file-based repository) on the same machine or a remote machine. It is also possible to deploy B2BDT services exposed as a set of web services; in this case the hosting web/application server forms the server platform that provides these server software services. The web service platform in turn invokes the B2BDT runtime either in-process with the web service stack or as a separate server process on the same machine.

Note: Modern application servers often support mechanisms for process, application and thread pooling, which blurs the distinction between the effects of in-process vs. server invocation modes. In-process invocation can be thought of as running the B2BDT transformation engine as a shared library within the calling process.

B2BDT Deployed as Web Services Used With Web Application


Sample Data for Parse by Example and Visual Feedback


During the process of authoring a B2BDT transformation, sample data may be used to perform actual authoring through drag-and-drop and other UI metaphors. Sample data is also used to provide visual confirmation at transformation design time of what elements in the data are being recognized, mapped and omitted. Without sample data, there is no way to verify correctness of a transformation or get feedback on the progress of a transformation during authoring. For these reasons, establishing a set of sample data for use during authoring is an important part of planning for the development of B2BDT transformations. Sample data to be used for authoring purposes should be representative of actual data used during production transformations but sized to avoid excessive memory requirements on the studio environment. While the studio environment does not impose specific limits on data to be processed, the cumulative effects of using document preprocessors within the studio environment in conjunction with use of the B2BDT event reporting can impose excessive memory requirements.

Eclipse-Based Service Authoring Environment


The B2BDT authoring environment, B2BDT Studio, is based on the widely supported Eclipse platform. This has two implications:
1. Many switches, configuration options, techniques and methods of operation that affect the Eclipse environment are also available in B2BDT Studio. These include settings for memory usage, the version of the JVM used by the studio environment, etc.
2. Eclipse plug-ins that support additional behaviors and/or integration of other applications, such as source code control software, can be used with the B2BDT Studio environment. While the additional features offered by these plug-ins may not be available in the B2BDT authoring perspective, by switching perspectives B2BDT developers can often take advantage of the features and extensions provided by these plug-ins.

Note: An Eclipse perspective is a task-oriented arrangement of views, menu options, commands, etc. For example, while using the B2BDT authoring perspective, features for creation of Java programs or source control will not be visible, but they may be accessed by changing perspectives. Some features of other perspectives may be incompatible with use of the B2BDT authoring perspective.

Service Authoring Environment Only Supported on Windows OS Variants


While B2BDT services may be deployed and placed into production on many environments, such as a variety of Linux implementations, AIX, Solaris and Windows Server OS variants, the B2BDT Studio environment used to author B2BDT services only runs on Windows OS variants such as Windows 2000 and Windows XP. There are a number of features in B2BDT that may only run on Windows, and some custom components, such as custom COM-based actions or transformations, are Windows-specific as well. This means it is possible to create a transformation within the Studio environment that will only run on the development environment and may not be deployed into production on a non-Windows platform.

File System Based Repository for Authoring and Development


B2BDT uses a file system based repository for runtime deployment of B2BDT services and a similar file-based workspace model for the physical layout of services. This means that mechanisms for sharing source artifacts (such as schemas and test data), projects and scripts, and deployed solutions must be created using processes and tools external to B2BDT. These might include source control systems for sharing transformation sources, third-party application deployment software, and processes implemented either manually or through scripting environments for management of shared artifacts, deployment of solutions, etc.
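As an example of such an external process, the sketch below shares a B2BDT project through Subversion. The repository URL and project name are assumptions, and the commands are run from the B2BDT Studio workspace directory.

    # First developer places the project under version control:
    svn import OrdersParser http://svnserver/repos/b2bdt/OrdersParser \
        -m "Initial import of B2BDT project"

    # Other developers check the project out into their own workspaces and then
    # add it to B2BDT Studio using the normal project import mechanism:
    svn checkout http://svnserver/repos/b2bdt/OrdersParser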

Support for Learn-by-Example Authoring Techniques


Authoring of a B2BDT solution may optionally use supplied sample data to determine how to extract or parse data from a representative source data construct. Under this mechanism, a transformation developer may elect to let the B2BDT runtime system decide how to extract or parse data from a sample data input. When this mechanism is used, the sample data itself becomes a source artifact for the transformation, and changes to the sample data can affect how the system determines the extraction of appropriate data.

Use of Learn by Example in B2BDT


When using learn-by-example transformations, the source data used as an example must be deployed with the B2BDT project as part of the production B2BDT service. In many cases it is recommended to use the learn-by-example mechanism as a starting point only, and to use specific (non-learn-by-example) transformation mechanisms for data transformation in systems requiring a high degree of fine control over the transformation process. If learn-by-example mechanisms are employed, changes to the sample data should be treated as requiring the same degree of test verification as changes to the transformation scripts.

Support for Specification Driven Transformation Authoring Techniques


As B2BDT transformations are also represented as a series of text files, it is possible to parse a specification (in a Microsoft Word, Microsoft Excel, Adobe PDF or other format document) to determine how to generate a transformation. Under this style of development, the transformation developer parses one or more specifications, rather than the actual source data, and generates one or more B2BDT transformations as output. This can be used instead of, or in addition to, standard transformation authoring techniques. Many of the Informatica-supplied B2BDT libraries are built in this fashion.

Note: Typically at least one transformation will be created manually in order to get an approximation of the desired target transformation.


In these cases, specifications should be treated as source artifacts and changes to specifications should be verified and tested (in conjunction with the spec driven transformation services) in the same manner as changes to the transformations.

B2B Data Transformation Project Structure


The B2B Data Transformation Studio environment provides the user interface for the development of B2B Data Transformation services. It is based on the open source Eclipse environment and inherits many of its characteristics regarding project organization. From a solution designer's viewpoint, B2B Data Transformation solutions are organized as one or more B2B Data Transformation projects in the B2BDT Studio workspace.

Studio Environment Indicating Main Views


The B2BDT workspace defines the overall set of transformation projects that may be available to a developer working in a single studio session. Developers may have multiple workspaces, but only one workspace is active within the studio environment at any one time. All artifacts such as scripts, project files and other project elements are stored in the file system as text files and can be versioned using traditional version control systems.

Each B2BDT project can be used to publish one or more B2BDT services. Typically a single project is only used to publish a single primary service, although it may be desirable to publish debug or troubleshooting variants of a project under different service names.

Note: The same B2BDT project can be published multiple times specifying different entry points or configuration parameters.

The syntax displayed in the studio environment differs from the text representation of the script files, such as TGP files, which make up the B2B Data Transformation project. This is discussed further when reviewing considerations for multi-person team development.

From a physical disk storage viewpoint, the workspace is a designated file system location where B2BDT Studio stores a set of B2BDT projects. By default, there is a single B2B Data Transformation workspace, which is located in the directory My Documents\Informatica\ComplexDataExchange\4.0\workspace. All projects in the current B2B Data Transformation Studio workspace are displayed in the Explorer view.


Note: It is possible to have other workspaces for Java projects, etc. These are not visible in the Complex Data Authoring perspective in B2B Data Transformation Studio.

Optionally, it is possible to create more than one workspace. For example, a solution designer might have multiple workspaces for different sets of B2B Data Transformation projects.

TIP: For B2BDT Studio 4.3 and earlier releases, use the B2BDT Studio\Eclipse\Workspace setting in the B2BDT configuration editor to change the workspace. In B2BDT 4.4, you may change the workspace by using the File | Switch Workspace menu option.

Each B2B Data Transformation project holds the business rules and operations for one or more transformation services. Once completed, or while under development, the project may be published to the B2B Data Transformation repository to produce a deployable transformation service. During the publication of a transformation service, an entry point to the service is identified and a named transformation service is produced that specifies a particular transformation project along with a well-known entry point where initial execution of the transformation service will take place. It is possible to publish the same project multiple times with different names, identifying a different entry point on each deployment, or even to publish the same project multiple times with the same entry point under different names.

Published B2B Data Transformation services are placed in the runtime repository of services. In B2B Data Transformation, this takes the form of a file system directory (typically c:\Program Files\Informatica\ComplexDataExchange\ServiceDB) known as the service DB. This may be located on a local or network-accessible file system. Once the development version of a transformation service has been published, it may then be copied from the service database location by copying the corresponding named directory from the service DB location. This service directory can then be deployed by copying it to the service DB directory on a production machine.

File System View of Workspace


The workspace is organized as a set of sub-directories, with one sub-directory representing each project. A specially designated directory named .metadata is used to hold metadata about the current workspace. Each subdirectory is named with the project name for that project.

Workspace Layout


Behind the scenes, B2B Data Transformation (by default) creates a new project in a directory corresponding to the project name, rooted in the Eclipse workspace. (In B2BDT 4.3, this can be overridden at project creation time to create projects outside of the workspace, while in B2BDT 4.4, the studio environment determines whether it needs to copy a project into the workspace. If the path specified for the imported project is already within a workspace, B2BDT simply adds the project to the list of available projects in the workspace.) A .cmw file with the same primary project name is also created within the project directory; the .cmw file defines what schemas, scripts and other artifacts make up the project. When a project is published to a specific transformation service, a directory is created whose name corresponds to the published service name in the B2B Data Transformation Service DB, which forms a runtime repository of services. A CMW file corresponding to the service name is created in the same directory. Creating a new project while in the studio environment causes changes to be made to the .metadata directory in order for the project to be discoverable in the B2BDT Studio environment.

File System View of Service DB


The service database is organized as a set of sub-directories under the service database root directory, with one subdirectory representing each deployed service. When a service is deployed, the service is copied along with the settings in effect at the time of deployment. Subsequent changes to the source project will not affect deployed services, unless a project is redeployed under the same service name. It is possible to deploy the same B2BDT project under multiple different service names.

TIP: If a project contains a sample data file with the extension .cmw, it can cause the B2BDT runtime to detect an error with that deployed service. This can prevent all services from being detected by the runtime. If a sample data file would otherwise have the extension .cmw, use a different extension for the sample data and adjust scripts accordingly. This scenario commonly occurs with specification-driven transformations.


Solution Organization
The organizational structure of B2B Data Transformation solutions is summarized below; each element is listed with its parent.

- Service Repository. Parent: none. This is the top-level organization structure for published Complex Data services.
- Published Complex Data Service. Parent: Repository. There may be multiple projects in a repository.
- Project. Parent: Workspace. There may be multiple projects in a studio workspace.
- TGP Script. Parent: Project.
- XML Schema. Parent: Project.
- Parser, Mapper, Serializer. Parent: TGP Script. However, naming is global to a project and not qualified by the TGP script name.
- Global Variables, Actions, Markers. Parent: TGP Script. However, naming is global to a project and not qualified by the TGP script name.

Planning for B2BDT Development


While the overall solution life cycle may encompass project management, analysis, architecture, design, implementation, test, deployment and operation following a methodology such as Informatica's Velocity methodology, from a development architecture perspective we are mainly concerned with facilitating the actual implementation of transformations and the subsequent test and deployment of those transformations.

Pre-Implementation Development Environment Requirements


During the analysis phase we are mainly concerned with identifying the business needs for data transformations, the related data characteristics (data format, size, volume, frequency, performance constraints) and any existing constraints on target host environments (if candidate target host environments have already been identified). Due to the nature of how B2BDT transformations are built, with their utilization of sample data during the authoring process, we also need to plan for obtaining data and schema samples as part of the requirements gathering and architecture phases. Other considerations include identification of any security constraints on the use or storage of data, identification of the need to split data, and any sizing and scaling of the eventual system, which will depend to a large extent on the volume of data, performance constraints, responsiveness targets, etc. For example, HIPAA specifications include privacy restrictions and constraints in addition to defining message and transaction formats.

Pre-Development Checklist
While many of these requirements address the solution or solutions as a whole rather than the development environment specifically, there are a number of criteria that have a direct impact on the development architecture:

- What sample data will be used in creation of the B2BDT services? The size of the sample data used to create the B2BDT services will determine some of the memory requirements for development environments.
- Will specific preprocessors be required for B2BDT transformation authoring? Some preprocessors, such as the Excel or Word preprocessors, require additional software such as Microsoft Office to be deployed to the development environments. In some cases, custom preprocessors and/or transformers may need to be created to facilitate authoring of transformation solutions.
- Are there specific libraries being used, such as the B2BDT Accord, EDI or HIPAA libraries? Use of specific libraries will have an impact on how transformations are created and on the specific licenses required for development usage of the Complex Data Transformation tools.
- Are custom components being created, such as custom actions, transformers or preprocessors, that will be shared among developers of B2BDT transformations? In many cases, these custom components will need to be deployed to each B2BDT Studio environment, and a process needs to be defined for handling updates and distribution of these components.
- Are there any privacy or security concerns? Will data need to be encrypted/decrypted? Will cleansed data be needed for use with learn-by-example based transformations?
- How will the B2BDT runtime be invoked? Via a platform adapter, custom code, command line, web services, HTTP, etc.? Each of these communication mechanisms may impose specific development requirements with regard to testing of work in progress, licensing of additional B2BDT components, performance implications and design choices.
- Will data splitting be needed? Depending on the choice of 32-bit vs. 64-bit B2B Data Transformation runtimes, and on both the host software platform and the underlying OS and hardware platform, data may need to be split through the use of B2BDT streaming capabilities, custom transformations or preprocessors.
- How are B2BDT transformations created? What artifacts affect their creation? What is the impact of changes to specifications, schemas, sample data, etc.? In some cases, such as spec-driven transformation, changes to specifications go beyond design change requests and may require rerunning the transformations that produce other executable artifacts, documentation, test scripts, etc.

Establishing the Development Environment


B2B Data Transformation services are defined and designed in the B2B Data Transformation Studio environment. The B2B Data Transformation Studio application is typically installed on the developers' local machines and allows the visual definition of transformations, the usage of libraries and the use of import processes to build one or more B2B Data Transformation services. All extensions used during authoring, such as custom transformations, preprocessors and actions, must be installed in each B2BDT Studio installation. While preprocessors are provided with the studio environment to support manipulation of file types such as Excel, Word and PDF files within the studio environment, for some formats it may be necessary to create custom preprocessors to optimize usage of source data within the B2BDT Studio environment.

Note: In some cases, additional optional studio features may need to be licensed in order to access the necessary preprocessors and/or libraries.

During transformation authoring, B2BDT services are organized as a set of B2BDT projects within a B2BDT workspace. Each B2BDT project consists of a set of transformation scripts, XML schema definitions, and sample data used in authoring and/or at runtime of the transformation. B2BDT projects and workspaces use file system based artifacts for all aspects of the definition of a B2BDT project. Due to the use of file-based artifacts for all B2BDT transformation components, traditional source code control systems may be used to share work in progress.

Development Environment Checklist


Many of the implementation issues will be specific to the particular solution. However, there are a number of common issues for most B2BDT development projects:

- What is the host environment, and what tools are required to develop and test against that environment? While B2BDT Studio is a Windows-only environment, additional consideration may need to be given to the ultimate host environment regarding what tools and procedures are required to deploy the overall solution and troubleshoot it on the host environment.
- What is the communication mechanism with the host environment? How does the host environment invoke B2BDT transformations? Is it required for work-in-progress testing, or can the invocation method be simulated through the use of command line tools, scripts or other means?
- What are the security needs during development? Deployment? Test? How will they affect the development architecture?
- What are the memory and resource constraints for the development, test and production environments?
- What other platform tools are needed during development?
- What naming conventions should be used?
- How will work be shared between developers? How will different versions of transformations be handled?
- Where or how are intermediate XML schemas defined and disseminated? Are they specific to individual services? Shared between services? Externally defined, either by other project teams or by external standards bodies?
- What is the folder and workspace layout for B2BDT projects?
q q q q q q

Supporting Multiple Users


The B2BDT Studio environment is intended to be installed on each developer's own machine or environment. While advances in virtualization technologies and technologies such as Windows remote desktop connections theoretically allow multiple users to share the same B2BDT installation, the B2BDT Studio environment does not implement mechanisms, such as file locking during authoring of transformations, that are needed to protect multiple users from overwriting each other's work.

TIP: For PowerCenter users, it is important to note that B2BDT does not implement a server-based repository environment for work in progress, and other mechanisms are needed to support sharing of work in progress. The service database may be shared between different production instances of B2BDT by locating it on a shared file system mechanism such as a network file share or SAN. The B2BDT development environment should be installed on each B2BDT transformation author's private machine.

The B2BDT Studio environment does support multiple users using the same development environment. However, each user should be assigned a separate workspace. As the workspace, along with many other default B2BDT configuration parameters, is stored in the configuration file, the environment needs to be configured to support multiple configuration files, with one assigned to each user.


Creating Multiple Configurations

To create multiple configurations, you can edit and copy the default configuration file.

1. Make a backup copy of the default configuration file, <INSTALL_DIR>/CMConfig.xml. At the end of the procedure, you must restore the backup to the original CMConfig.xml location.
2. Use the Configuration Editor to edit the original copy of CMConfig.xml. Save your changes.
3. Copy the edited CMConfig.xml to another location or another filename.
4. Repeat steps 2 and 3, creating additional versions of the configuration file. In this way, you can define as many configurations as you need.
5. Restore the backup that you created in step 1. This ensures that the default configuration remains as before.

Selecting the Configuration at Runtime

You can set the configuration file that B2B Data Transformation Engine should use in any of the following ways:
1. Define an environment variable called IFConfigLocation4. The value of the variable must be the path of a valid configuration file, for example: c:\MyIFConfigLocation4\CMConfig1.xml
2. On Unix only: store the configuration file under the name CMConfig.xml in the user's home directory.
3. Use the default configuration file, <INSTALL_DIR>/CMConfig.xml.
When B2B Data Transformation Engine starts, it searches these locations in sequence and uses the first configuration file that it finds.

Example 1
Suppose you want to run two applications that run B2B Data Transformation Engine with different configuration files. Each application should set the value of IFConfigLocation4 before it starts B2B Data Transformation Engine.

Example 2
Two users want to run B2B Data Transformation Engine with different configurations on the same Unix-type platform. Store their respective configuration files in their home directories; both files must have the name CMConfig.xml. Alternatively, store a CMConfig.xml file in the home directory of one of the users; the other user then uses the default configuration file, <INSTALL_DIR>/CMConfig.xml.

Multiple JREs
On Windows platforms, the JVM Location parameter of the configuration file defines the JRE that B2B Data Transformation should use. By using multiple configuration files, you can switch JREs. On Unix-type systems, the configuration file does not contain a JVM Location parameter; to switch JREs, you must load a different environment-variable file.

Running Multiple Configurations Concurrently
B2B Data Transformation Engine loads the configuration file and the environment variables when it starts. After it starts, changing the configuration file or the environment variables has no effect. This means that two applications can use different configurations concurrently. Each application uses the configuration that was in effect when its instance of B2B Data Transformation Engine started.
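As a minimal illustration of Example 1, the hedged Java sketch below launches two separate engine invocations, each with a different IFConfigLocation4 value set in its process environment before the child process starts. Only the IFConfigLocation4 variable name and the CMConfig file names come from this document; the CM_console arguments (service name and input file) are illustrative placeholders, not the documented command line syntax.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConfigPerInvocation {

    // Launch one command line invocation of the engine with a specific
    // configuration file. The CM_console arguments are illustrative
    // placeholders; consult the product documentation for the exact syntax.
    static int runWithConfig(String configPath, String... cmConsoleArgs)
            throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        command.add("CM_console");
        command.addAll(Arrays.asList(cmConsoleArgs));

        ProcessBuilder pb = new ProcessBuilder(command);
        // The engine reads IFConfigLocation4 when it starts, so the variable
        // must be set in the child process environment before launch.
        pb.environment().put("IFConfigLocation4", configPath);
        pb.inheritIO(); // show the child process output in the parent console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Two invocations, each using its own copy of the configuration file.
        runWithConfig("c:\\MyIFConfigLocation4\\CMConfig1.xml", "MyServiceA", "inputA.txt");
        runWithConfig("c:\\MyIFConfigLocation4\\CMConfig2.xml", "MyServiceB", "inputB.txt");
    }
}
```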

While this theoretically allows Windows based sharing mechanisms such as Remote Desktop Connection to be used to share the same installation of B2BDT, it is important to specify a different workspace for each user due to the possibility of files being overwritten by different users.

As a best practice, it is recommended that each user of B2BDT Studio is provided with a separate installation of the B2BDT Studio environment on a dedicated machine. Sharing of work in progress should be accomplished through the use of a source control system rather than multiple users using the same workspace simultaneously. In this manner, each transformation author's environment is kept separate while allowing multiple authors to create transformations and share them between their environments.

Using Source Code Control for Development


As B2BDT transformations are all defined as text based artifacts (scripts, XML schema definitions, project files, etc.), B2BDT transformation authoring lends itself to good integration with traditional source code control systems. There are a number of suitable source code control systems on the market, and open-source environments such as CVSNT and Subversion both have Eclipse plug-ins available that simplify the process. While source code control is a good mechanism for sharing work between multiple transformation authors, it also serves as a good mechanism for reverting to previous versions of a code base, keeping track of milestones and other change control aspects of a project. Hence it should be considered for all but the most trivial of B2B Data Transformation projects, irrespective of the number of transformation authors.

What should be placed under source code control?
All project files that make up a transformation should be checked in when a transformation project is checked in. These include sample data files, TGP script files, B2BDT project files (ending with the extension .CMW), and XML schema definition files (ending with the extension .XSD). During test execution of transformations, the B2BDT engine and Studio environment generate a results subdirectory in the project source directory. The files contained in this directory include temporary files generated during project execution under the Studio environment (typically output.xml) and the Studio events file (ending in .CME). These should not be checked in and should be treated as temporary files. When a service is deployed, a deploy.log file is generated in the project directory. While it may seem desirable to keep track of the deployment information, a different deployment log file will be generated on each user's machine, so it should not be checked in either.

What are the effects of different authoring changes?
The following describes the file system changes that occur when different actions are taken:

- Creating a new B2BDT project: a new B2BDT project directory is created in the workspace
- Importing an XML schema: the schema and its dependencies are copied to the B2BDT project directory
- Adding a new script: a new TGP file in the B2BDT project directory; modifications to the CMW file


- Adding a new test data file: files are copied to the B2BDT project directory
- Running a transformation within the Studio environment: changes to the results directory; new entries in the events .CME file; changes to the CMW file
- Modifications to the project preferences: modifications to the B2BDT project file
- Modifications to the Studio preferences: modifications to the metadata directory in the workspace


Special Considerations for Spec Driven Transformation


Spec driven transformation is the use of a B2BDT transformation to generate a different B2BDT transformation based on one or more inputs that form the specification for the transformation. Because a B2BDT transformation is itself a set of text files, it is possible to automate the generation of B2BDT transformations, with or without subsequent user modification. Specifications may include Excel files that define mappings between source and target data formats, PDF files published by a standards body, or a variety of custom specification formats. As the specification itself becomes part of what determines the transformation scripts generated, specifications should be placed under source code control. In some cases, the time taken to generate the transformation may be too great to regenerate the transformations on every team merge event, or it may be necessary to preserve the generated transformation for compliance with auditing procedures. In these cases, the generated transformations should be placed under source code control as well.

Sharing B2BDT Services


In addition to defining production transformations, B2BDT supports the creation of shared components, such as library transformations that may be shared using the "autoinclude" mechanism. B2BDT also supports the creation of custom transformers, preprocessors and other components that may be shared across users and B2BDT projects. These should all be placed under source code control and must also be deployed to production B2BDT environments if used by production code. Note: For PowerCenter users, these can be thought of as the B2B Data Transformation equivalent of Mapplets and Worklets and offer many of the same advantages.

Sharing Metadata Between B2BDT Projects


Many B2BDT solutions are composed of multiple transformation projects. Often there is shared metadata, such as XML schema definitions, and other shared artifacts.


When an XML schema is added to a project, local copies of the schema, along with any included schemas, are placed in the project directory. If one or more schemas are used in multiple projects, they must be copied to each project when a change occurs to the schema. One recommendation for sharing schemas is to place the schemas and other artifacts into a dummy project; when a schema changes, transformation authors sync that project and copy the schemas from the dummy project to each of the other projects. This copy step can be added to synchronization scripts. In these cases, the local copy of the shared schema should not be placed under source control.
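A minimal sketch of such a synchronization step is shown below in Java. The workspace path, the dummy project name (SharedSchemas) and the consuming project names are assumptions for illustration only; real scripts would use whatever workspace layout the team has agreed on.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

/**
 * Minimal sketch of a schema synchronization step: copies every .xsd file
 * from a shared "dummy" schema project into a list of consuming projects.
 * Directory names and workspace layout are illustrative assumptions.
 */
public class SyncSharedSchemas {

    public static void main(String[] args) throws IOException {
        Path workspace = Paths.get("C:/B2BDT/workspace");          // assumed workspace root
        Path sharedProject = workspace.resolve("SharedSchemas");   // assumed dummy project
        List<String> consumers = List.of("OrdersParser", "InvoicesMapper"); // assumed projects

        try (DirectoryStream<Path> schemas =
                 Files.newDirectoryStream(sharedProject, "*.xsd")) {
            for (Path schema : schemas) {
                for (String project : consumers) {
                    Path target = workspace.resolve(project).resolve(schema.getFileName());
                    // Overwrite the local copy so each project picks up the latest schema.
                    Files.copy(schema, target, StandardCopyOption.REPLACE_EXISTING);
                    System.out.println("Copied " + schema.getFileName() + " to " + project);
                }
            }
        }
    }
}
```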

Using Multiple Workspaces


A typical B2BDT solution may be comprised of multiple B2BDT transformations, shared components, schemas and other artifacts. These transformations and other components may all be part of the same logical aspect of a B2B Data Transformation solution, or may form separate logical aspects of the solution. Some B2B Data Transformation solutions result in the production of hundreds of transformation services, parsers and B2BDT components. When the B2BDT Studio environment is launched, it attempts to load all transformation projects into memory. While B2BDT Studio allows closing a project to conserve memory and system resources (by right clicking on the project and selecting the close option), large numbers of B2BDT projects can make use of the Studio environment unwieldy.

Closing a Project Within B2BDT Studio

Reopen Project Using Open Option


There may also be a need (due to complexity, security or other reasons) to separate work between different developers so that only some projects need to be opened within a given developer's workspace. For these reasons (number of transformations, separation of logical aspects of the solution, enforcement of change control), it may be appropriate to use separate workspaces to partition the projects.

Staging Development Environments


When there are multiple developers on a project, and/or large numbers of transformations, it is recommended to have a staging development environment where all transformations are assembled prior to deployment to test environments. While it is possible to have each developer transfer their work to the staging development environment directly, it is recommended that the staging development environment is synchronized from the source code control system. This enforces good check-in practices, as only those transformations that have been checked in will be propagated to the staging development environment. It is also possible to require that each developer publishes their working services to a local service DB on their machine and uses source code control to check in their published services. If this approach is chosen, it should be considered in addition to, not instead of, using source code control to manage work in progress. In Agile development methodologies, one of the core concepts is always having a working build available at any time. By using source code control to manage working copies of deployed services, it is possible to enforce this concept. When the target platform is a non-Windows platform, it is also necessary to consider where the version of the services for non-Windows platforms should be assembled. For example, you can assemble the version of the B2BDT solution for non-Windows platforms on the staging development machine and either transfer the transformation services to the QA environment manually or use additional check-in/check-out procedures to perform the transfer.

Synchronization of Changes from Source Code Control System


If a synchronization operation in a source code control system adds an additional project to the workspace, it is necessary to use the file import command in the B2BDT studio environment to import the project into the project workspace. If a change occurs to a schema while the studio environment is open, it is sometimes necessary to switch to the schema view in the studio environment to detect the schema change.

Best Practices for Multi-Developer or Large Transformation Solutions


- DO install a separate instance of the B2BDT Studio environment on each author's machine
- DO use a source code control system to synchronize and share work in progress
- DO consider using a dummy project to share common metadata and artifacts such as XML schemas
- DON'T rely on Remote Desktop Connection to share simultaneous usage of B2BDT Studio for the same workspace
- DO use a separate workspace location for each user on the same machine
- DO place shared components under version control
- DO define scripts to aid with synchronization of changes to shared resources such as schemas
- DO consider use of a staging development environment for projects with a large number of transformations, multiple transformation authors or non-Windows target platforms
- DO consider having an identical folder structure if each developer has a dedicated machine

Configuring the B2BDT Environment


B2BDT supports configuration through a number of means. These include the B2BDT Configuration application (which modifies the CMConfig.xml configuration file), the setting of global properties in the B2BDT Studio preferences, the setting of project specific properties on a B2BDT project, and the use of platform environment variables. The B2BDT Configuration application allows the setting of global B2BDT properties through a GUI based application. Changing property settings through the Configuration application causes changes to be made to the CMConfig.xml file (once saved).

B2BDT Configuration Application

The Configuration application allows the setting of properties such as JVM parameters, thread pool settings, memory available to the Studio environment and many other settings. Consult the administrator's guide for a full list of settings and their effects. Properties set using the B2BDT Configuration application affect both the operation of the standalone B2BDT runtime environment and the behavior of the B2BDT Studio environment.

Within the B2BDT Studio environment, properties may be changed for the Studio environment as a whole and on a project specific basis.

B2BDT Studio Preferences

The B2BDT Studio preferences allow customization of properties that affect all B2BDT projects, such as which events are generated for troubleshooting, logging settings, auto save settings and other B2BDT Studio settings.

B2BDT Project Properties


Project properties may be set in the B2BDT Studio environment specific to a B2BDT project. These include settings such as the encoding being used, namespaces used for XML schemas, control over the XML generation, control over the output from a project and other project specific settings. Finally, OS environment variables are used to set aspects of the system such as the Java classpath to be used, the location of the configuration file for a specific user, the home location for the installed B2BDT instance to be used, library paths, etc. The following lists some typical configuration items and where they are set:

- Memory for Studio: B2BDT Configuration application
- JVM / JRE usage: B2BDT Configuration application
- Tuning parameters (threads, timeouts, etc.): B2BDT Configuration application
- User specific settings: use an environment variable to point to a different configuration file
- Memory for runtime: B2BDT Configuration application


- Transformation encoding, output and event generation settings: project properties
- Workspace location: B2BDT Configuration application (B2BDT 4.3, formerly known as PowerExchange for Complex Data); B2BDT Studio (B2BDT 4.4)
- Event generation: set in project properties
- Repository location: B2BDT Configuration application

Development Configuration Settings


The following settings need to be set up for correct operation of the development environment:
- Java home directory: set using the CMConfiguration | General | Java setting in the Configuration editor
- Java maximum heap size: set using the CMConfiguration | General | Java setting in the Configuration editor
- Repository location: needed to deploy projects from within Studio
- JVM path for use with the Studio environment
- B2BDT Studio Eclipse command line parameters: used to set the memory available in the Studio environment. Use -Xmx<nn>m to set the maximum allocation pool to a size of nn MB; use -Xms<nn>m to set the initial allocation pool to a size of nn MB.
- Control over project output: by default, automatic output is enabled. This needs to be switched off for most production quality transformations.
- Use of event files: disable for production
- Use of working encoding

For most development scenarios, a minimum of 2GB memory is recommended for authoring environments.

Development Security Considerations


The user under which the Studio environment is running needs write access to the directories where logging occurs and where event files are placed, read and write access to the workspace locations, and read and execute access to JVMs and any tools used in the operation of preprocessors. The B2BDT transformation author needs read and execute permissions for the B2BDT install directory and all of its subdirectories. In some circumstances, the user under which a transformation is run differs from the logged in user. This is especially true when running under the control of an application integration platform such as BizTalk, or under a web services host environment.
Note: Under IIS, the default user identity for a web service is the local ASPNET user. This can be configured in the AppPool settings, in the .Net configuration settings and in the web service configuration files.

Best Practices - Workspace Organization


As B2B Data Transformation Studio loads all projects in the current workspace into the Studio environment, keeping all projects under design in a single workspace leads to both excessive memory usage and logical clutter between transformations belonging to different, possibly unrelated, solutions.
Note: B2B Data Transformation Studio allows for the closing of projects to reduce memory consumption. While this aids with memory consumption, it does not address the logical organization aspects of using separate workspaces.

TIP
Right click on the B2BDT project node in the explorer view to open or close a B2BDT project in the workspace. Closing a project reduces the memory requirements in the Studio environment.

Separate Workspaces for Separate Solutions


For distinct logical solutions, it is recommended to use separate workspaces to organize B2BDT projects relating to separate solutions. The B2B Data Transformation Studio configuration editor may be used to set the current workspace:

Separate Transformation Projects for Each Distinct Service


From a logical organization perspective, it is easier to manage Complex Data solutions if only one primary service is published from each project. Secondary services from the same project should be reserved for the publication of test or troubleshooting variations of the same primary service. The one exception to this should be where multiple services are substantially the same with the same transformation code but with minor differences to inputs. One alternative to publishing multiple services from the same project is to publish a shared service which is then called by the other services in order to perform the common transformation routines. For ease of maintenance, it is often desirable to name the project after the primary service which it publishes. While these do not have to be the same, it is a useful convention and simplifies the management of projects.

Implementing B2BDT Transformations



There are a number of considerations to be taken into account when looking at the actual implementation of B2BDT transformation services.
- Naming standards for B2BDT components
- Determining need and planning for data splitting
- How will B2BDT be invoked at runtime
- Patterns of data input and output
- Error handling strategies
- Initial deployment of B2BDT transformations
- Testing of B2BDT transformations

Naming Standards for Development


While naming standards for B2BDT development are the subject of a separate best practice, the key points can be summarized as follows:

- B2BDT service names must be unique.
- B2BDT project names must be unique.
- Avoid the use of file system names for B2BDT artifacts. For example, do not use names such as CON: as they may conflict with file system names.
- Avoid the use of names inconsistent with programming models. Consider that the B2BDT service name or service parameter may need to be passed as a web service parameter, or may drive the naming of an identifier in Java, C, C# or C++.
- Avoid names that are invalid as command line parameters. As authors may need to use command line tools to test the service, use names that may be passed as unadorned command line arguments; don't use spaces, ">", etc.
- Only expose one key B2BDT service per project. Only expose additional services for debug and troubleshooting purposes.

Data Splitting
There are a number of factors influencing whether source data should be split, how it can be split, or indeed whether a splitting strategy is necessary at all. First, let's consider when data may need to be split. For many systems, the fundamental characteristic to consider is the size of the inbound data. For many EAI platforms, files or blobs in excess of 10 MB can pose problems. For example, PowerCenter, Process Server and BizTalk impose limits on how much XML can be processed. This depends on what operations are needed on the XML files (do they need to be parsed or are they just passed as files), the version of the platform software (64-bit vs. 32-bit) and other factors. A midrange B2BDT system can typically handle hundreds of megabytes of data in cases where other systems may only handle 10 MB. But there are additional considerations to take into account:
- Converting flat file or binary data can result in XML roughly five times the size of the source
- Excel files larger than 10 MB can result in very large XML files, depending on the choice of document processor in B2BDT
- B2BDT generates very large event files for file sources such as Excel files

In general, files of less than 10 MB in size can be processed in B2BDT without splitting.

When the 64-bit version of B2BDT is used, a much greater volume of data can be handled without splitting. For example, an existing solution at one customer handles 1.6 GB of XML input data on a dual processor machine with 16 GB of RAM (using x86 based 64-bit RHEL); average processing time was 20 minutes per file. 32-bit Windows environments are often limited to 3 GB of memory (2 GB available to applications), so this can limit what may be processed. In development environments, much less memory will be available to process the file (especially when event generation is turned on). It is common practice to use much smaller files as data samples when operating in the Studio environment, especially for files that require large amounts of memory to preprocess. For Excel files, sample files of 2 MB or less are recommended, depending on file contents. B2BDT provides a built-in streaming mechanism which supports splitting of files (although it does not support splitting of Excel files in the current release). Considerations for splitting using the streaming capabilities include:
- Is there a natural boundary to split on? For example, EDI functional groups, transactions and other constructs can be used to provide a natural splitting boundary. Batch files composed of multiple distinct files also provide natural splitting boundaries.

In general, a file cannot be split using the streaming mechanism if a custom document preprocessor is required to process the file.

In some cases, disabling the event generation mechanism will alleviate the need for splitting.
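To illustrate what splitting on a natural boundary can look like at the file level, the hedged Java sketch below splits a batch file composed of multiple distinct messages into one file per message. It is a host-side illustration only, not the B2BDT streaming mechanism itself, and the boundary marker shown is an assumed convention; a real solution would split on a boundary appropriate to the data, such as an EDI functional group.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.*;

/**
 * Hedged sketch: splits a batch file into one chunk file per message so each
 * chunk can be processed separately. The boundary marker is an illustrative
 * assumption, not part of any standard.
 */
public class BatchSplitter {

    public static int split(Path batchFile, Path outputDir, String boundary) throws IOException {
        Files.createDirectories(outputDir);
        int chunk = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(batchFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(boundary) || writer == null) {
                    if (writer != null) writer.close();
                    chunk++;
                    writer = Files.newBufferedWriter(outputDir.resolve("chunk_" + chunk + ".txt"));
                    if (line.startsWith(boundary)) continue; // drop the marker line itself
                }
                writer.write(line);
                writer.newLine();
            }
        } finally {
            if (writer != null) writer.close();
        }
        return chunk;
    }

    public static void main(String[] args) throws IOException {
        int chunks = split(Paths.get("inbound/batch.txt"), Paths.get("inbound/chunks"), "---MESSAGE---");
        System.out.println("Wrote " + chunks + " chunks");
    }
}
```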

How Will B2BDT Be Invoked at Run Time?


B2BDT supports a variety of mechanisms for invocation, each with its own considerations:

- Command line: command line tools are intended mainly for troubleshooting and testing. Use of command line tools does not span multiple CPU cores for transformations and always generates the event file in the current directory.
- HTTP (via CGI): supports exposing a B2BDT transformation via a web server.
- Web services: B2BDT services may be hosted in a J2EE based web service environment. Service assets in progress will support hosting of B2BDT services as IIS based web services.
- APIs (C++, C, Java, .Net): offer great flexibility. The calling program needs to organize parallel calls to B2BDT to optimize throughput.


- EAI agents: agents exist for BizTalk, WebMethods and many other platforms.
- PowerCenter: through the use of UDO, B2BDT services may be included as a transformation within a PowerCenter workflow.

In addition, B2BDT supports two modes of activation: server and in-process operation.

In process:
- The B2BDT call runs in the process space of the caller.
- Can result in excessive initialization costs, as each call may incur initialization overhead, especially with a custom code client.
- A fault in the B2BDT service may result in a failure in the caller.
- In measurements for a custom BizTalk based system (not via the standard agent), the initial call took 3 seconds and subsequent calls 0.1 second; if the process was not kept alive, the initial 3 second hit was incurred multiple times.
- Not supported for some APIs.

Server:
- A B2BDT service call results in a call into another process.
- Slower overall communication, but can avoid the initial startup overhead as the process possibly remains alive between invocations.
- In practice, web service invocation is sped up by use of server invocation.
- No effect for Studio or command line invocation.
- Can allow a 64-bit process to activate a 32-bit B2BDT runtime, or vice versa.

Patterns of Data Input and Output


There are a number of patterns of inputs and outputs used commonly in B2BDT transformations:


- Direct data: the data to be transformed is passed directly to the transformation and the output data is returned directly. Under this mechanism, the output data format needs to allow for returning errors, or errors need to be returned through well known error file locations or some other pre-agreed mechanism.
- Indirect via file: the transformation receives a string that designates a file to process, and the transformation reads the real data from that file. A slightly more complex version of this may include passing input, output and error file paths as semicolon-delimited strings or some similar mechanism.
- Indirect via digest or envelope file: the data passed to the transformation specifies a wide range of parameters as a single file, in a similar manner to a SOAP envelope. This digest file could contain many input file paths, output file paths, parameters to services, error handling arguments, performance characteristics, etc. The processing of the digest file becomes much more complex, but it is essential when many input files must be processed. It avoids much of the overhead of the host system having to load the data files into memory. However, transaction semantics offered by host systems cannot be utilized in these scenarios. This also offers a great means for implementing custom error handling strategies.
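Where the digest or envelope pattern is used, the host side and the transformation must agree on the digest format. The hedged Java sketch below reads input file paths from a hypothetical digest XML file; the element names (request, input, output) are assumptions for illustration and are not a B2BDT-defined schema.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/**
 * Reads a hypothetical digest file of the form:
 *   <request>
 *     <input>C:/data/in/file1.txt</input>
 *     <input>C:/data/in/file2.txt</input>
 *     <output>C:/data/out/</output>
 *   </request>
 * The element names are illustrative assumptions, not a B2BDT format.
 */
public class DigestReader {

    public static List<String> readInputPaths(File digestFile) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().parse(digestFile);
        NodeList inputs = doc.getElementsByTagName("input");
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < inputs.getLength(); i++) {
            paths.add(inputs.item(i).getTextContent().trim());
        }
        return paths;
    }

    public static void main(String[] args) throws Exception {
        for (String path : readInputPaths(new File("request.xml"))) {
            System.out.println("Input to process: " + path);
        }
    }
}
```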

Error Handling Strategies


B2BDT offers the following error handling features:


- B2BDT event log: a B2BDT specific event generation mechanism where each event corresponds to an action taken by a transformation, such as recognizing a particular lexical sequence. It is useful in troubleshooting work in progress, but event files can grow very large, hence it is not recommended for production systems. It is distinct from the event system offered by other B2BDT products and from the OS based event system. Custom events can be generated within transformation scripts. Event based failures are reported as exceptions or other errors in the calling environment.
- B2BDT trace files: trace files are controlled by the B2BDT Configuration application. Automated strategies may be applied for the recycling of trace files.
- Custom error information: at the simplest level, custom errors can be generated as B2BDT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted. Other alternatives include the generation of custom error files, integration with OS event tracking mechanisms and integration with 3rd party management platform software. Integration with OS eventing or 3rd party platform software requires custom extensions to B2BDT.

Overall, the B2BDT event mechanism is the simplest to implement. But for large or high volume production systems, the event mechanism can create very large event files, and it offers no integration with popular enterprise software administration platforms. It is recommended that B2BDT events are used for troubleshooting purposes during development only. In some cases, performance constraints may determine the error handling strategy. For example, updating an external event system may cause performance bottlenecks, or producing a formatted error report can be time consuming. In some cases operator interaction may be required, which could potentially block a B2BDT transformation from completing. Finally, it is worth looking at whether some part of the error handling can be offloaded outside of B2BDT to avoid performance bottlenecks. When using custom error schemes, consider the following (a sketch of one such scheme follows the list):
- Multiple invocations of the same transformation may execute in parallel
- Don't hardwire error file paths
- Don't assume a single error output file

- Avoid use of the B2BDT event log in production, especially when processing Excel files
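As one way to respect these constraints, the hedged Java sketch below writes a custom error report to a uniquely named file per invocation rather than to a hardwired path. The directory layout, file naming scheme and record format are illustrative assumptions, not a B2BDT-defined mechanism.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.time.Instant;
import java.util.List;
import java.util.UUID;

/**
 * Hedged sketch of a custom error reporting scheme for transformations that
 * may run in parallel: each invocation writes its own uniquely named error
 * file instead of sharing a hardwired path.
 */
public class ErrorReporter {

    private final Path errorDir;

    public ErrorReporter(Path errorDir) throws IOException {
        this.errorDir = errorDir;
        Files.createDirectories(errorDir);
    }

    public Path report(String serviceName, List<String> errors) throws IOException {
        // A unique file name per invocation avoids collisions between parallel runs.
        String fileName = serviceName + "_" + UUID.randomUUID() + ".err";
        Path errorFile = errorDir.resolve(fileName);
        StringBuilder body = new StringBuilder();
        body.append("service=").append(serviceName)
            .append(" time=").append(Instant.now()).append(System.lineSeparator());
        for (String error : errors) {
            body.append(error).append(System.lineSeparator());
        }
        return Files.write(errorFile, body.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        ErrorReporter reporter = new ErrorReporter(Paths.get("errors"));
        Path written = reporter.report("OrdersParser",
                List.of("Missing mandatory segment at line 42"));
        System.out.println("Error report written to " + written);
    }
}
```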

Effects of the API or invocation method on event generation:

- CM_Console: a service deployed with events will produce events; a service deployed without events will not produce events.
- Java API: the service runs without events. In case of error, the service is rerun with events.
- C# / .Net: same as Java.
- Agents: no events unless an error occurs, irrespective of how the service was deployed. The same behavior is used for PowerCenter.

Testing
A full test of B2BDT services is covered by a separate best practice document. For simple cases and as a first step in most B2BDT transformation development projects, the B2BDT development environment offers a number of features that can be used to verify the correctness of B2BDT transformations. Initial testing of many transformations can be accomplished using these features alone.
1. The B2BDT Studio environment provides visual feedback on which components of the input data are recognized by a parser. This can be viewed in the data browser window of a B2BDT project; the Studio environment will automatically mark up the first set of occurrences of patterns matched and literals found. Through the use of a simple menu option, all recognized occurrences of matched data can be marked up within the B2BDT Studio authoring environment.
2. The B2BDT Studio environment exposes a structured event log mechanism that allows developers to browse the flow of a transformation, which can be used to verify the execution of various components of a transformation.
3. The B2BDT Studio environment supports the specification of additional sources on which to perform a transformation, in order to verify the transformation execution against a set of sample or test data inputs. This is accomplished inside the Studio design environment by simply setting the sources to extract property to point to the test data, either as specific files or as a directory search for data files matching a file pattern. The unit test can also be automated using the command line API.
Results of executed transformations can be previewed in the Studio environment, along with events generated during the transformation. In many production scenarios, the B2BDT transformation is called from an overall workflow process (EAI, ETL, MSMQ, etc.), and this integrated environment is what is typically reflected in a lab environment (Dev/Test/QA).

Deployment
Published B2BDT services are stored in a B2BDT repository, which is a designated file system location where the B2BDT runtime looks for services when requested to invoke a transformation service. This may be a shared file system location, such as a network share or SAN based mechanism, facilitating the sharing of services between multiple production servers. A B2BDT project may be published within the B2BDT Studio environment to deploy a single B2BDT service to the B2BDT repository. A project can be used to deploy multiple B2BDT services by setting different options such as the transformation entry point (the same identical service can even be deployed under multiple B2BDT service names). At the simplest level, a B2BDT transformation may be deployed through one of two options.

Direct
The transformation deployment target directory is set using the CMConfiguration Editor. If the CM repository is set to a location, such as a network share, which is referenced by a production or QA environment, publishing the service will have the effect of making it available directly to the QA or production environment.
Note: The refresh interval B2BDT configuration setting determines how often a runtime instance checks the file system for updated services.

Indirect
The B2BDT transformation deployment target directory is set (via the CM repository configuration setting) to a developer specific directory. This directory is subsequently copied to the QA/production environment using mechanisms outside of the B2BDT Studio environment (a simple copy, or source management environments such as CVS, SourceSafe, etc.). Staging environments may be employed where it is necessary to assemble multiple dependent services prior to deployment to a test environment. The section on source code control covered a number of strategies for deployment of services using version control. Other alternatives include the use of custom scripts and setup creation tools (such as InstallShield).

Configuration Settings Affecting Deployment
The following configuration setting affects how soon a newly deployed service is detected: the service refresh interval.

Further Considerations for Deployment
More detailed descriptions of deployment scenarios will be provided in a separate best practice. Some of the considerations to be taken into account include:
Last updated: 30-May-08 19:24


Testing B2B Data Transformation Services

Challenge


Establish a testing process that ensures support for team development of B2B Data Transformation (B2BDT) solutions, strategies for verification of scaling and performance requirements, testing for transformation correctness and overall unit and system test procedures as business and development needs evolve.

Description
When testing B2B Data Transformation services, the goal to keep in mind throughout the process is achieving the ability to test transformations for measurable correctness, performance and scalability. The testing process is broken into three main functions which are addressed through the test variants. The testing process scenarios addressed in this document include finding bugs/defects, achieving the ability to test and ensure functional compliance with desired specifications and ensuring compliance with industry standards/certifications. The success of the testing process should be based on a standard of measurable milestones that provide an assessment of overall transformation completion.

Finding Defects
The first topic to address within the QA process is the ability to find defects within the transformation and to test it against specifications for compliance. This process has a number of options available; choose the best method to fulfill testing requirements based upon time and resource constraints. In the testing process, the QA cycle refers to the ability to find, fix or defer errors and retest them until the error count reaches zero (or a specified target). To ensure compliance with defined specifications during the QA process, test basic functionality and ensure that outlying transformation cases behave as defined. For these types of tests, ensure that failure cases fail as expected in addition to ensuring that the transformation succeeds as expected.

Ensuring Compliance
Another integral part of the testing process with B2B Data Transformations is the validation of transformations against industry standards such as HIPAA. In order to test standardized output there needs to be a validation of well formed inputs and outputs such as HIPAA levels 1-6 and testing against a publicly available data set. An optimally tested solution can be ensured through use of 3rd party verification software, validation support in the B2B Data Transformation libraries that verify data compliance or through B2BDT transformations created in the course of a project specifically for test purposes.

Performance
Performance and stress testing are additional components used within the testing methodology for B2BDT transformations. To effectively test performance, compare the effects of different configurations on the Informatica server. To achieve this, compare the effects of configuration parameters based on server and machine configurations. Based on data sizes and the complexity of transformations, optimize server configurations for best and worst case scenarios. One way to track benchmarking results is to create a reference spreadsheet that defines the amount of time needed for each source file to process through the transformation based upon file size.

Setting Measurable Milestones


In order to track the progress of testing transformations it is best to set milestones to gauge the overall efficiency of the development and QA processes. Best practices include tracking failure rates for different builds. This builds a picture of pass/failure rate over time which can be used to determine expected delays and to gauge achievements in development over time.

Testing Practices - The Basics


This section focuses on the initial testing of a B2B Data Transformation. For simple cases and as a first step in most transformation development projects, the Studio development environment offers a number of features that can be used to verify the correctness of B2B Data Transformations. The initial testing of many transformations can be accomplished using these features alone. It is useful to create small sample data files that are representative of the actual data to ensure quick load times and responsiveness.
1. The B2B Data Transformation Studio environment provides visual feedback on which components of the input data are recognized by a parser. This can be viewed in the data browser window of a B2B Data Transformation project. The Studio environment will automatically mark up the first set of occurrences of patterns matched and literals found. Through the use of the mark all menu option or button, all recognized occurrences of matched data can be marked up within the Studio authoring environment. This provides for a quick verification of correct operations. As shown in the figure below, the color coding indicates which data was matched.

2. The Studio environment exposes a structured event log mechanism that allows developers to browse the flow of a transformation, which can then be used to verify the execution of various components of a transformation. Reviewing the event log after running the transformation often provides an indication of the cause of an error.


3. Viewing the results file provides a quick indication of which data was matched. By default it contains parsed XML data. Through the use of DumpValues statements and WriteValue statements in the transformation, the contents of the results files can be customized.

4. The Studio environment supports the specification of additional sources to perform a transformation on in order to verify the transformation's execution against a set of sample or test data inputs. This is accomplished inside the Studio Design environment by simply setting the sources to extract property to point to the test data, either as specific files or as a directory search for data files matching a file pattern. The unit test can also be automated using the command line API. Results of transformations executed can be previewed in the Studio environment, along with events generated during the transformation. When running through the initial test process, the Studio environment provides a basic indication about the overall integrity of the transformation. These tests allow for simple functional checks to see whether the transformation failed or not and if the correct output was produced. The events navigation pane provides a visual description of transformation processing. An illustration of the events view log within Studio is shown below.


In the navigation pane, blue flags depict warnings that can be tested for functional requirements whereas red flags indicate fatal errors. Event logs are available when running a transformation from the Studio environment. Once a service has been deployed (with event output turned on) event logs are written to the directory from which CM_Console is run (when testing a service with CM_Console). When invoking a service with other invocation mechanisms the following rules apply for event log generation.

Effects of API on Event Generation


- CM_Console: a service deployed with events enabled will produce events; a service deployed without events enabled will not produce events.
- Java API: the service runs without events. In case of error, the service is rerun automatically with events enabled.
- C# / .Net: same as Java.
- Agents: no events unless an error occurs, irrespective of how the service was deployed. The same behavior is used for PowerCenter.


To view the error logs, use the Studio event pane to scan through the log for specific events. To view an external event file (usually named events.cme), drag and drop the file from Windows Explorer into the B2BDT Studio events pane. It is also possible to create a B2BDT transformation to look for specific information within the event file.

Other Troubleshooting Output


B2B Data Transformation services can be configured to produce trace files that can be examined for troubleshooting purposes. Trace file generation is controlled by the B2BDT configuration application. Automated strategies may be applied for the recycling of trace files. For other forms of troubleshooting output the following options are available:
- Simple (non dynamic) custom errors can be generated as B2BDT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted.
- A transformation could be used to keep track of errors in the implementation of the transformation and output these to a custom error file.
- Through the use of the external code integration APIs for Java, COM (and .Net), integration with OS event tracking mechanisms and integration with 3rd party management platform software are possible through custom actions and custom data transformations.

Other Test Methods


Additional checks that can be performed include a comparison with well known input and expected output, the use of validation tools and transformations, as well as the use of reverse transformations and spot checks to verify expected data subsets. The sections below provide information on how each of these different testing options works, along with descriptions of their overall efficiencies and deficiencies for the QA process.

Comparing Inputs and Outputs


For many transformations, comparing the data output from known good input data with expected output data generated through other means provides a valuable mechanism for testing the correctness of a transformation. However, this process requires that adequate sample input data is available as well as examples of output data for these inputs. While in some cases simple binary comparison between the generated output and the correct output is sufficient, it may be necessary to use 3rd party tools to perform comparison where the output is XML or where the order of output can vary. Another test that is valid for some transformations is to test if the output data contains a subset of the expected data. This is useful if only part of the expected output is known. Comparison techniques may need to ignore time and date stamp data in files unless they are expected to be the same in the output. If no comparison tools are available due to the complexity of the data, it is also possible to create a B2BDT service that performs the comparison and writes the results of the comparison to the results file or to a specific output file.
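As a minimal illustration of such a comparison, the hedged Java sketch below compares a generated output file against a baseline line by line, skipping lines that contain a timestamp in both files. The timestamp pattern and file locations are assumptions; adjust them to the actual output format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hedged sketch: line-by-line comparison of generated output against a
 * baseline, ignoring lines that carry a timestamp. The ISO-like date pattern
 * is an assumption; adjust it to the actual output.
 */
public class BaselineCompare {

    private static final Pattern TIMESTAMP = Pattern.compile("\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}");

    public static boolean matchesBaseline(Path generated, Path baseline) throws IOException {
        List<String> gen = Files.readAllLines(generated);
        List<String> base = Files.readAllLines(baseline);
        if (gen.size() != base.size()) {
            return false;
        }
        for (int i = 0; i < gen.size(); i++) {
            String g = gen.get(i);
            String b = base.get(i);
            // Skip comparison for lines containing timestamps in both files.
            if (TIMESTAMP.matcher(g).find() && TIMESTAMP.matcher(b).find()) {
                continue;
            }
            if (!g.equals(b)) {
                System.out.println("Mismatch at line " + (i + 1));
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        boolean ok = matchesBaseline(Paths.get("results/output.xml"), Paths.get("baseline/output.xml"));
        System.out.println(ok ? "Output matches baseline" : "Output differs from baseline");
    }
}
```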


In the event that there is no sample data output available, one solution is to run well known good data through the transformation and create a set of baseline outputs. These should be verified for correctness either through manual examination or another method. This baseline output data can subsequently be used for comparison techniques and for the creation of further variations of expected output data. While this does not verify the correctness of the initial execution of the data transformation, the saved baseline output data can be used to verify that expected behavior has not been broken by maintenance changes. Tools that can be used for the comparison of inputs and outputs include 3rd party software applications such as KDiff3, an open source comparison tool. This application is good for the comparison of XML as well as text files (see an example in the figure below).

Validation Transformations
For some types of data, validation software is available commercially or may already exist in an organization. In the absence of available commercial or in-house validation software Informatica recommends creating B2BDT services that provide validation of the data. The developers assigned to create the validation transformations should be different from those that created the original transformations. A strict no code sharing rule should be enforced to ensure that the validation is not simply a copy of the transformation.

Reverse Transformations
Another option for testing practices is to use reverse transformations; that is, a transformation that performs a reverse transformation on the output, which would recreate the input data. This could then be used as the basis for comparison techniques. Running the output data from B2B Data Transformations through an independently created reverse transformation is optimal. The reason for creating the reverse transformation independently is that an auto generated reverse transformation has a tendency to propagate additional bugs. Partial or full compares of the input against the output of the reverse transformation can be performed using this strategy. While this allows for testing of functional compliance, the downside is the high time cost of fully implementing the reverse transformations independently rather than auto generating them.

Spot Checking
In some cases it may not be feasible to perform a full comparison test on outputs. Creating a set of spot check transformations provides some measure of quality assurance. The basic concept is that one or more transformations are created that perform spot checks on the output data using B2BDT services. As new issues arise in QA, enhance the spot checks to detect new problems and to look for common mistakes in the output. As time progresses, this should grow into an enhanced library of checks. Programmatic checks can be embedded within the transformation itself, such as inserting actions to self test output using the AddEventAction feature. If the B2B Data Transformation service is being called through an API, exceptions within the calling code can be checked for as well; this is a subset of spot checking which can assist within the testing process. An error tracking layer can also be applied to the XML output, and through the use of programmatic checks all errors associated with the transformation can be written to the output XML. The figure below illustrates how to embed programmatic checks within the transformation.

In the example above, flags are set and error codes are assigned to the specific XML error fields that were defined in the XML schema definition earlier. In the event that the ensure condition fails, the error flags are set and reported to the output XML stream.
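Where the service is invoked through an API, a similar spot check can be performed from the host side by scanning the output XML for the error fields written by the error tracking layer. In the hedged Java sketch below, the errorCode element name and the output file location are assumptions for illustration, not B2BDT-defined fields.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/**
 * Hedged sketch of a host-side spot check: scans the transformation output
 * for error fields populated by the error tracking layer. The element name
 * "errorCode" is a hypothetical convention, not a B2BDT-defined field.
 */
public class OutputSpotCheck {

    public static int countErrors(File outputXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().parse(outputXml);
        NodeList errorCodes = doc.getElementsByTagName("errorCode");
        int errors = 0;
        for (int i = 0; i < errorCodes.getLength(); i++) {
            String code = errorCodes.item(i).getTextContent().trim();
            if (!code.isEmpty() && !"0".equals(code)) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) throws Exception {
        int errors = countErrors(new File("results/output.xml"));
        if (errors > 0) {
            System.out.println("Spot check failed: " + errors + " error code(s) reported");
        } else {
            System.out.println("Spot check passed");
        }
    }
}
```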

Unit Testing
The concept behind unit testing is to avoid using a traditional QA cycle to find many basic defects in the transformation in order to reduce the cost in time and effort. Unit tests are sets of small tests that are run by the developer of a transformation before signing off on code changes. Unit tests optimally should be created and maintained by the developer and should be used for regression control and functionality testing. Unit tests are often used with a test-first development methodology. It is important to note that unit tests are not a replacement for full QA processes but provide a way for developers to quickly verify that functionality has not been broken by changes. Unit tests may be programmatic or manual tests, although implementing unit tests as a programmatic set of tests necessitates running of the unit test cases after every change.
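A hedged sketch of one such programmatic unit test is shown below as a JUnit 5 test. It assumes that the service has already been run against a sample input (for example via the command line API) and that the resulting output file and a previously verified baseline exist at the paths shown; the paths, file names and service name are assumptions for illustration.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.junit.jupiter.api.Test;

/**
 * Hedged sketch of a developer-run regression test: the output of a prior
 * service execution is compared against a verified baseline file.
 */
class OrdersParserRegressionTest {

    @Test
    void outputMatchesVerifiedBaseline() throws IOException {
        Path generated = Paths.get("results/orders_sample_output.xml");
        Path baseline = Paths.get("baseline/orders_sample_output.xml");

        String generatedXml = Files.readString(generated);
        String baselineXml = Files.readString(baseline);

        // A simple equality check; real tests may need to normalize
        // timestamps or element ordering before comparing.
        assertEquals(baselineXml, generatedXml);
    }
}
```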

Testing Transformations Integrated with PowerCenter


When testing B2B Data Transformations using PowerCenter, it is best to initially test the transformation using the aforementioned test processes before utilizing the transformation within the mapping. However, using B2B Data Transformations with PowerCenter has its advantages as data output within a PC mapping can actually be visualized as it comes out of each transformation during the debugging process. When using a combination of PC with B2B Data Transformations, write the output to a flat file to allow for quick spot check testing practices.

Design Practices to Facilitate Testing

Use of Indirect Pattern for Parameters


When initiating the testing process for B2B Data Transformations, one way to facilitate testing is through the use of an indirect pattern for parameters. This is similar to referencing the source input in a parameter file for testing purposes. In this instance, the input to the transformation service is set as a request file specified by a host location. This request file has the flexibility to indicate where to read the input and where to place the output, and to report on the status of executing transformations. This can be done through an XML file input which can be managed by the local administrator. This method can result in the reduction of the host environment footprint. Staging areas for inputs and outputs can be created, which provide a way to easily track completed transformations. During the mapping process, the request file is processed to determine the actual data to be mapped along with the target locations, etc. When these have been read, control is passed to the transformation which performs the actual mapping. The figures below demonstrate this strategy.


In the mapper illustrated above, the main service input and output data takes the form of references (provided as individual service parameters or combined into a single XML block) which refer to the real input and output data located by paths to specific files and/or collections of files designated by a path to an accessible directory. Alternately, a collection of files may be referred to using a sequence of individual paths. However, the latter approach does limit the parallel operation of some of the transformation.
Last updated: 30-May-08 23:55


Configuring Security

Challenge


Configuring a PowerCenter security scheme to prevent unauthorized access to folders, sources and targets, design objects, run-time objects, global objects, security administration, domain administration, tools access, and data in order to ensure system integrity and data confidentiality.

Description
Security is an often overlooked area within the Informatica ETL domain. However, without paying close attention to domain security, one ignores a crucial component of ETL code management. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end-user access requirements. Knowledge of PowerCenter's security functionality and facilities is also a prerequisite to security design. Implement security with the goals of easy maintenance and scalability. When establishing domain security, keep it simple. Although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter environment involves the following basic principles:

- Create users and groups
- Define access requirements
- Grant privileges, roles and permissions

Before implementing security measures, ask and answer the following questions:

- Who will administer the domain? How many projects need to be administered? Will the administrator be able to manage security for all PowerCenter projects or just a select few?
- How many environments will be supported in the domain?
- Who needs access to the domain objects (e.g., repository service, reporting service, etc.)? What do they need the ability to do?
- How will the metadata be organized in the repository? How many folders will be required?
- Where can we limit repository service privileges by granting folder permissions instead?
- Who will need Administrator or Super User-type access?

After you evaluate the needs of the users, you can create appropriate user groups and assign repository service privileges and folder permissions. In most implementations, the administrator takes care of maintaining the repository. Limit the number of administrator accounts for PowerCenter. While this concept is important in a development/unit test environment, it is critical for protecting the production environment.

Domain Repository Overview


All of the PowerCenter Advanced Edition applications are centrally administered through the administration console and the settings are stored in the domain repository. User and group information, permissions and role definitions for domain objects are managed through the administration console and are stored in the domain repository.


Although privileges and roles are assigned to users and groups centrally from the administration console, they are also stored in each application repository. The domain synchronizes this information to each application repository periodically (when an assignment is made). Individual application object permissions are also managed and stored within each application repository.

PowerCenter Repository Security Overview


A security system needs to properly control access to all sources, targets, mappings, reusable transformations, tasks, and workflows in both the test and production repositories. A successful security model needs to support all groups in the project lifecycle and also consider the repository structure. Informatica offers multiple layers of security, which enables you to customize the security within your data warehouse environment.

Metadata-level security controls access to PowerCenter repositories, which contain objects grouped by folders. Access to metadata is determined by the privileges granted to the user or to a group of users and the access permissions granted on each folder. Some privileges do not apply by folder, as they are granted by privilege alone (i.e., repository-level tasks).

Just beyond PowerCenter authentication is the connection to the repository database. All client connectivity to the repository is handled by the PowerCenter Repository Service over a TCP/IP connection. The particular database account and password is specified at installation and during the configuration of the Repository Service. Developers need not have knowledge of this database account and password; they should only use their individual repository user ids and passwords. This information should be restricted to the administrator.

Other forms of security available in PowerCenter include permissions for connections. Connections include database, FTP, and external loader connections. These permissions are useful when you want to limit access to schemas in a relational database and can be set up in the Workflow Manager when source and target connections are defined.

Occasionally, you may want to restrict changes to source and target definitions in the repository. A common way to approach this security issue is to use shared folders, which are owned by an Administrator or Super User. Granting read access to developers on these folders allows them to create read-only copies in their work folders.

PowerCenter Security Architecture


The following diagram, Informatica PowerCenter Security, depicts PowerCenter security, including access to the repository, Repository Service, Integration Service and the command-line utilities pmrep and pmcmd. As shown in the diagram, the repository service is the central component for repository metadata security. It sits between the PowerCenter repository and all client applications, including GUI tools, command line tools, and the Integration Service. Each application must be authenticated against metadata stored in several tables within the repository. Each Repository Service manages a single repository database where all security data is stored as part of its metadata; this is a second layer of security. Only the Repository Service has access to this database; it authenticates all client applications against this metadata.


Repository Service Security


Connection to the PowerCenter repository database is one level of security. The Repository Service uses native drivers to communicate with the repository database. PowerCenter Client tools and the Integration Service communicate with the Repository Service over TCP/IP. When a client application connects to the repository, it connects directly to the Repository Service process. You can configure a Repository Service to run on multiple machines, or nodes, in the domain. Each instance running on a node is called a Repository Service process. This process accesses the database tables and performs most repository-related tasks.

When the Repository Service is installed, the database connection information is entered for the metadata repository. At this time you need to know the database user id and password to access the metadata repository. The database user id must be able to read and write to all tables in the database. As a developer creates, modifies, and executes mappings and sessions, the metadata in the repository is continuously updated. Actual database security should be controlled by the DBA responsible for that database, in conjunction with the PowerCenter Repository Administrator. After the Repository Service is installed and started, all subsequent client connectivity is automatic; the database id and password are transparent at this point.

Integration Service Security


Like the Repository Service, the Integration Service communicates with the metadata repository when it executes workflows or when users are using Workflow Monitor. During configuration of the Integration Service, the repository database is identified with the appropriate user id and password. Connectivity to the repository is made using native
drivers supplied by Informatica. Certain permissions are also required to use the pmrep and pmcmd command line utilities.

Encrypting Repository Passwords


You can encrypt passwords and create an environment variable to use with pmcmd and pmrep. For example, you can encrypt the repository and database passwords for pmrep to maintain security when using pmrep in scripts. In addition, you can create an environment variable to store the encrypted password. Use the following steps as a guideline to use an encrypted password as an environment variable:

1. Use the command line program pmpasswd to encrypt the repository password.
2. Configure the password environment variable to set the encrypted value.

To configure a password as an environment variable on UNIX:

1. At the command line, type:
   pmpasswd <repository password>
   pmpasswd returns the encrypted password.
2. In a UNIX C shell environment, type:
   setenv <Password_Environment_Variable> <encrypted password>
   In a UNIX Bourne shell environment, type:
   <Password_Environment_Variable>=<encrypted password>
   export <Password_Environment_Variable>
   You can assign the environment variable any valid UNIX name.

To configure a password as an environment variable on Windows:

1. At the command line, type:
   pmpasswd <repository password>
   pmpasswd returns the encrypted password.
2. Enter the password environment variable in the Variable field. Enter the encrypted password in the Value field.
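The minimal Bourne shell sketch below pulls these steps together in a script. The variable name REP_PASSWD and the parsing of the pmpasswd output are assumptions for illustration only; the exact output format of pmpasswd varies by PowerCenter version.

#!/bin/sh
# Encrypt the repository password passed as the first script argument.
# pmpasswd prints the encrypted string; taking the last field of the line
# that mentions "encrypted" is an assumption that may need adjusting.
ENCRYPTED=`pmpasswd "$1" | grep -i "encrypted" | awk '{print $NF}'`

# Store it under an environment variable of your choosing.
# REP_PASSWD is only an example name, not an Informatica default.
REP_PASSWD="$ENCRYPTED"
export REP_PASSWD
echo "Encrypted repository password exported as REP_PASSWD"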

Setting the Repository User Name


For pmcmd and pmrep, you can create an environment variable to store the repository user name.

To configure a user name as an environment variable on UNIX:

1. In a UNIX C shell environment, type:
   setenv <User_Name_Environment_Variable> <user name>
2. In a UNIX Bourne shell environment, type:
   <User_Name_Environment_Variable>=<user name>
   export <User_Name_Environment_Variable>
   You can assign the environment variable any valid UNIX name.

To configure a user name as an environment variable on Windows:

1. Enter the user name environment variable in the Variable field.
2. Enter the repository user name in the Value field.
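The sketch below shows how these environment variables might then be referenced from a script that starts a workflow with pmcmd. The service, domain, folder, and workflow names are placeholders, and the -uv/-pv options (user name and password environment variables) should be verified against the Command Reference for your PowerCenter version.

# Assumes REP_PASSWD was exported as described in the previous section.
# Int_Svc_Dev, Domain_Dev, MARKETING_DEV, and wf_load_customers are placeholders.
REP_USER=etl_batch_user
export REP_USER
pmcmd startworkflow -sv Int_Svc_Dev -d Domain_Dev \
      -uv REP_USER -pv REP_PASSWD \
      -f MARKETING_DEV -wait wf_load_customers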

Connection Object Permissions


Within Workflow Manager, you can grant read, write, and execute permissions to groups and/or users for all types of connection objects. This controls who can create, view, change, and execute workflow tasks that use those specific connections, providing another level of security for these global repository objects. Users with Use Workflow Manager permission can create and modify connection objects. Connection objects allow the PowerCenter server to read and write to source and target databases. Any database the server can access requires a connection definition. As shown below, connection information is stored in the repository. Users executing workflows need execution permission on all connections used by the workflow. The PowerCenter server looks up the connection information in the repository, and verifies permission for the required action. If permissions are properly granted, the server reads and writes to the defined databases, as specified by the workflow.


Users
Users are the fundamental objects of security in a PowerCenter environment. Each individual logging into the PowerCenter domain or its services should have a unique user account. Informatica does not recommend creating shared accounts; unique accounts should be created for each user. Each domain user needs a user name and password, provided by the Informatica Administrator, to access the domain. Users are created and managed through the administration console. Users should change their passwords from the default immediately after receiving the initial user id from the Administrator. When you create a PowerCenter repository, the repository automatically creates two default repository users within the domain:
- Administrator - The default password for Administrator is Administrator.
- Database user - The username and password used when you created the repository.

These default users are in the Administrators user group, with full privileges within the repository. They cannot be deleted from the repository, nor have their group affiliation changed. To administer repository users, you must have one of the following privileges:
- Administer Repository
- Super User

LDAP (Lightweight Directory Access Protocol)


In addition to default domain user authentication, LDAP can be used to authenticate users. Using LDAP authentication, the domain maintains an association between the domain user and the external login name. When a user logs into the domain services, the security module authenticates the user name and password against the external directory. The domain maintains a status for each user. Users can be enabled or disabled by modifying this status. Prior to implementing LDAP, the administrator must know:
- Domain username and password
- An administrator or superuser user name and password for the domain
- An external login name and password

To configure LDAP, follow these steps:

1. Edit ldap_authen.xml and modify the following attributes:
   - NAME: the .dll that implements the authentication
   - OSTYPE: the host operating system
2. Register ldap_authen.xml in the Domain Administration Console.
3. In the Domain Administration Console, configure the authentication module.

Privileges
Eight categories of privileges have been defined. Depending on the category, each privilege controls various actions for a particular object type. The categories are:

- Folders -- Create, Copy, Manage Versions
- Sources & Targets -- Edit, Create and Delete, Manage Versions
- Design Objects -- Edit, Create and Delete, Manage Versions
- Run-time Objects -- Edit, Create and Delete, Manage Versions, Monitor, Manage Execution
- Global Objects (Queries, Labels, Connections, Deployment Groups) -- Create
- Security Administration -- Manage, Grant Privileges and Permissions
- Domain Administration (Nodes, Grids, Services) -- Execute, Manage, Manage Execution
- Tools Access -- Designer, Workflow Manager, Workflow Monitor, Administration Console, Repository Manager

Assigning Privileges
A user must have permissions to grant privileges and roles (as well as administration console privileges in the domain) in order to assign privileges. The user must also have permission for the service to which the privileges apply. Only a user who has permissions to the domain can assign privileges in the domain. For PowerCenter, only a user who has permissions to the repository service can assign privileges for that repository service. For Metadata Manager and Data Analyzer, only a user who has permissions to the corresponding metadata or reporting service can assign privileges in that application. Privileges are assigned per repository or application instance. For example, you can assign a user create, edit, and delete privilege for runtime and design objects in a development repository but not in the production repository.

Roles
A user needs to have privileges to manage users, groups, and roles (and administration console privileges in the domain) in order to define custom roles. Once roles are defined they can be assigned to users or groups for specific services. Just like privileges, roles are assigned per repository or application instance. For example, the developer role (with its associated privileges) can be assigned to a user only in the development repository, but not in the test or production repository.

A user must have permissions to grant privileges and roles (as well as administration console privileges in the domain) in order to assign roles. The user must also have permission for the services to which the roles are to be applied. Only a user who has permissions to the domain can assign roles in the domain. For PowerCenter, only a user who has permissions to the repository service can assign roles for that repository service. For Metadata Manager and Data Analyzer, only a user who has permissions to the corresponding metadata or reporting service can assign roles in that application.

Domain Administrator Role


The domain administrator role is essentially a super-user role, not only for the domain itself but also for all of the services and applications in the domain. This role has permissions to all objects in the domain (including the domain itself) and all available privileges in the domain. As a result, it has the privileges to manage users, groups, and roles, as well as to assign privileges and roles. Because it has these privileges and permissions for all objects in the domain, this role can grant itself the administrator role on all services and therefore become the super-user for all services in the domain. The domain administrator role also has implicit privileges that include:


- Configuring a node as a gateway node
- Creating, editing, and deleting the domain
- Configuring SMTP
- Configuring service levels in the domain
- Shutting down the domain
- Receiving domain alerts
- Exporting and truncating domain logs
- Configuring restart of service processes

Audit Trails
You can track changes to Repository users, groups, privileges, and permissions by selecting the SecurityAuditTrail configuration option in the Repository Service properties in the PowerCenter Administration Console. When you enable the audit trail, the Repository Service logs security changes to the Repository Service log. The audit trail logs the following operations:
- Changing the owner, owner's group, or permissions for a folder.
- Changing the password of another user.
- Adding or removing a user.
- Adding or removing a group.
- Adding or removing users from a group.
- Changing global object permissions.
- Adding or removing user and group privileges.

Sample Security Implementation


The following steps provide an example of how to establish users, groups, permissions, and privileges in your environment. Again, the requirements of your projects and production systems should dictate how security is established.

1. Identify users and the environments they will support (e.g., Development, UAT, QA, Production, Production Support, etc.).
2. Identify the PowerCenter repositories in your environment (this may be similar to the basic groups listed in Step 1; for example, Development, UAT, QA, Production, etc.).
3. Identify which users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter repository.
5. Assign users to groups.
6. Define privileges for each group.

The following table provides an example of groups and privileges that may exist in the PowerCenter repository. This example assumes one PowerCenter project with three environments co-existing in one PowerCenter repository.

GROUP NAME         | FOLDER                                                        | FOLDER PERMISSIONS   | PRIVILEGES
ADMINISTRATORS     | All                                                           | All                  | Super User (all privileges)
DEVELOPERS         | Individual development folder; integrated development folder | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
DEVELOPERS         | UAT                                                           | Read                 | Use Designer, Browse Repository, Use Workflow Manager
UAT                | UAT working folder                                            | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
UAT                | Production                                                    | Read                 | Use Designer, Browse Repository, Use Workflow Manager
OPERATIONS         | Production                                                    | Read, Execute        | Browse Repository, Workflow Operator
PRODUCTION SUPPORT | Production maintenance folders                                | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
PRODUCTION SUPPORT | Production                                                    | Read                 | Browse Repository

Informatica PowerCenter Security Administration


As mentioned earlier, one individual should be identified as the Informatica Administrator. This individual is responsible for a number of tasks in the Informatica environment, including security. To summarize, here are the security-related tasks an administrator is responsible for:
- Creating user accounts.
- Defining and creating groups.
- Defining and granting permissions.
- Defining and granting privileges and roles.
- Enforcing changes in passwords.
- Controlling requests for changes in privileges.
- Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
- Working with the operations group to ensure tight security in the production environment.

Summary of Recommendations
When implementing your security model, keep the following recommendations in mind:
Last updated: 04-Jun-08 15:34


Data Analyzer Security

Challenge


Using Data Analyzer's sophisticated security architecture to establish a robust security system that safeguards valuable business information across a range of technologies and security models. Ensuring that Data Analyzer security provides appropriate mechanisms to support and augment the security infrastructure of a Business Intelligence environment at every level.

Description
Four main architectural layers must be completely secure: user layer, transmission layer, application layer and data layer. Users must be authenticated and authorized to access data. Data Analyzer integrates seamlessly with the following LDAP-compliant directory servers:

- SunOne/iPlanet Directory Server 4.1
- Sun Java System Directory Server 5.2
- Novell eDirectory Server 8.7
- IBM SecureWay Directory 3.2
- IBM SecureWay Directory 4.1
- IBM Tivoli Directory Server 5.2
- Microsoft Active Directory 2000
- Microsoft Active Directory 2003

In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing authentication and access control for the various web applications in the organization.

Transmission Layer
The data transmission must be secure and hacker-proof. Data Analyzer supports the standard security protocol Secure Sockets Layer (SSL) to provide a secure environment.

Application Layer
Only appropriate application functionality should be provided to users with associated privileges. Data Analyzer provides three basic types of application-level security:
- Report, Folder and Dashboard Security. Restricts access for users or groups to specific reports, folders, and/or dashboards.
- Column-level Security. Restricts users and groups to particular metric and attribute columns.
- Row-level Security. Restricts users to specific attribute values within an attribute column of a table.

Components for Managing Application Layer Security


Data Analyzer users can perform a variety of tasks based on the privileges that you grant them. Data Analyzer provides the following components for managing application layer security:
- Roles. A role can consist of one or more privileges. You can use system roles or create custom roles. You can grant roles to groups and/or individual users. When you edit a custom role, all groups and users with the role automatically inherit the change.
- Groups. A group can consist of users and/or groups. You can assign one or more roles to a group. Groups are created to organize logical sets of users and roles. After you create groups, you can assign users to the groups. You can also assign groups to other groups to organize privileges for related users. When you edit a group, all users and groups within the edited group inherit the change.
- Users. A user has a user name and password. Each person accessing Data Analyzer must have a unique user name. To set the tasks a user can perform, you can assign roles to the user or assign the user to a group with predefined roles.

Types of Roles
- System roles - Data Analyzer provides a set of roles when the repository is created. Each role has sets of privileges assigned to it.
- Custom roles - The end user can create and assign privileges to these roles.

Managing Groups
Groups allow you to classify users according to a particular function. You may organize users into groups based on their departments or management level. When you assign roles to a group, you grant the same privileges to all members of the group. When you change the roles assigned to a group, all users in the group inherit the changes. If a user belongs to more than one group, the user has the privileges from all groups. To organize related users into related groups, you can create group hierarchies. With hierarchical groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you edit a group, all subgroups contained within it inherit the changes. For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead group, you create a Manager group with a custom role Manage Data Analyzer. Because the Manager group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer role privileges.

Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the object.


Preventing Data Analyzer from Updating Group Information


If you use Windows Domain or LDAP authentication, you typically modify the users or groups in Data Analyzer. However, some organizations keep only user accounts in the Windows Domain or LDAP directory service, but set up groups in Data Analyzer to organize the Data Analyzer users. Data Analyzer provides a way for you to keep user accounts in the authentication server and still keep the groups in Data Analyzer.

Ordinarily, when Data Analyzer synchronizes the repository with the Windows Domain or LDAP directory service, it updates the users and groups in the repository and deletes users and groups that are not found in the Windows Domain or LDAP directory service. To prevent Data Analyzer from deleting or updating groups in the repository, you can set a property in the web.xml file so that Data Analyzer updates only user accounts, not groups. You can then create and manage groups in Data Analyzer for users in the Windows Domain or LDAP directory service.

The web.xml file is stored in the Data Analyzer EAR file. To access the files in the Data Analyzer EAR file, use the EAR Repackager utility provided with Data Analyzer.

Note: Be sure to back up the web.xml file before you modify it.

To prevent Data Analyzer from updating group information in the repository:

1. In the directory where you extracted the Data Analyzer EAR file, locate the web.xml file in the following directory: /custom/properties
2. Open the web.xml file with a text editor and locate the line containing the following property: enableGroupSynchronization. The enableGroupSynchronization property determines whether Data Analyzer updates the groups in the repository.
3. To prevent Data Analyzer from updating group information in the Data Analyzer repository, change the value of the enableGroupSynchronization property to false:

   <init-param>
     <param-name>
       InfSchedulerStartup.com.informatica.ias.scheduler.enableGroupSynchronization
     </param-name>
     <param-value>false</param-value>
   </init-param>

   When the value of the enableGroupSynchronization property is false, Data Analyzer does not synchronize the groups in the repository with the groups in the Windows Domain or LDAP directory service.
4. Save the web.xml file and add it back to the Data Analyzer EAR file.
5. Restart Data Analyzer.

When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer updates only the user accounts in Data Analyzer the next time it synchronizes with the Windows Domain or LDAP authentication server. You must create and manage groups, and assign users to groups, in Data Analyzer.

Managing Users
Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a user must have the appropriate privileges. You can assign privileges to a user with roles or groups.

Data Analyzer creates a System Administrator user account when you create the repository. The default user name for the System Administrator user account is admin. The system daemon, ias_scheduler/padaemon, runs the updates for all time-based schedules. System daemons must have a unique user name and password in order to perform Data Analyzer system functions and tasks. You can change the password for a system daemon, but you cannot change the system daemon user name via the GUI. Data Analyzer permanently assigns the daemon role to system daemons. You cannot assign new roles to system daemons or assign them to groups.

To change the password for a system daemon, complete the following steps:

1. Change the password in the Administration tab in Data Analyzer.
2. Change the password in the web.xml file in the Data Analyzer folder.
3. Restart Data Analyzer.

Access LDAP Directory Contacts



To access contacts in the LDAP directory service, you can add the LDAP server on the LDAP Settings page. After you set up the connection to the LDAP directory service, users can email reports and shared documents to LDAP directory contacts. When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property. In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished name entries define the type of information that is stored in the LDAP directory. If you do not know the value for BaseDN, contact your LDAP system administrator.

Customizing User Access


You can customize Data Analyzer user access with the following security options:
- Access permissions. Restrict user and/or group access to folders, reports, dashboards, attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access to a particular folder or object in the repository.
- Data restrictions. Restrict user and/or group access to information in fact and dimension tables and operational schemas. Use data restrictions to prevent certain users or groups from accessing specific values when they create reports.
- Password restrictions. Restrict users from changing their passwords. Use password restrictions when you do not want users to alter their passwords.

When you create an object in the repository, every user has default read and write permissions for that object. By customizing access permissions for an object, you determine which users and/or groups can read, write, delete, or change access permissions for that object. When you set data restrictions, you determine which users and groups can view particular attribute values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to that user.

Types of Access Permissions


Access permissions determine the tasks that you can perform for a specific repository object. When you set access permissions, you determine which users and groups have access to the folders and repository objects. You can assign the following types of access permissions to repository objects:
- Read. Allows you to view a folder or object.
- Write. Allows you to edit an object. Also allows you to create and edit folders and objects within a folder.
- Delete. Allows you to delete a folder or an object from the repository.
- Change permission. Allows you to change the access permissions on a folder or object.

By default, Data Analyzer grants read and write access permissions to every user in the repository. You can use the General Permissions area to modify default access permissions for an object, or turn off default access permissions.


Data Restrictions
You can restrict access to data based on the values of related attributes. Data restrictions are set to keep sensitive data from appearing in reports. For example, you may want to restrict data related to the performance of a new store from outside vendors. You can set a data restriction that excludes the store ID from their reports. You can set data restrictions using one of the following methods:
- Set data restrictions by object. Restrict access to attribute values in a fact table, operational schema, real-time connector, and real-time message stream. You can apply the data restriction to users and groups in the repository. Use this method to apply the same data restrictions to more than one user or group.
- Set data restrictions for one user at a time. Edit a user account or group to restrict user or group access to specified data. You can set one or more data restrictions for each user or group. Use this method to set custom data restrictions for different users or groups.

Types of Data Restrictions


You can set two kinds of data restrictions:
- Inclusive. Use the IN option to allow users to access data related to the attributes you select. For example, to allow users to view only data from the year 2001, create an IN 2001 rule.
- Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes you select. For example, to allow users to view all data except from the year 2001, create a NOT IN 2001 rule.

Restricting Data Access by User or Group


You can edit a user or group profile to restrict the data the user or group can access in reports. When you edit a user profile, you can set data restrictions for any schema in the repository, including operational schemas and fact tables. You can set a data restriction to limit user or group access to data in a single schema based on the attributes you select. If the attributes apply to more than one schema in the repository, you can also restrict the user or group access from related data across all schemas in the repository. For example, you may have a Sales fact table and Salary fact table. Both tables use the Region attribute. You can set one data restriction that applies to both the Sales and Salary fact tables based on the region you select. To set data restrictions for a user or group, you need the following role or privilege:
- System Administrator role
- Access Management privilege

When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the data restrictions for the report owner. However, if the reports have consumer-based security, the Data Analyzer Server creates a separate report for each unique security profile.

The following information applies only to the steps required to change the admin user on WebLogic.

To change the Data Analyzer system administrator username on WebLogic 8.1 (DA 8.1)
- Repository authentication. You must use the Update System Accounts utility to change the system administrator account name in the repository.
- LDAP or Windows Domain authentication. Set up the new system administrator account in the Windows Domain or LDAP directory service. Then use the Update System Accounts utility to change the system administrator account name in the repository.

To change the Data Analyzer default users from admin, ias_scheduler/padaemon


1. Back up the repository.
2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib
3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class.
4. Extract the file InfChangeSystemUserNames.class into a temporary directory (example: d:\temp).
5. This extracts the file as 'd:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class'.
6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp\Repository Utils\Refresh\:

REM To change the system user name and password
REM *******************************************
REM Change the BEA home here
REM ************************
set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
set WL_HOME=E:\bea\wlserver6.1
set CLASSPATH=%WL_HOME%\sql
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
REM Change the DB information here and also
REM the user -Dias_scheduler and -Dadmin to values of your choice
REM *************************************************************
%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
REM END OF BATCH FILE


7. Make changes in the batch file as directed in the remarks (REM lines).
8. Save the file, open a command prompt window, and navigate to D:\Temp\Repository Utils\Refresh\.
9. At the prompt, type change_sys_user.bat and press Enter. The users "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin", respectively.
10. Modify web.xml and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias\WEB-INF) by replacing ias_scheduler with pa_scheduler.
11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml. This file is in the iasEjb.jar file located in the directory .\bea\wlserver6.1\config\informatica\applications\. To edit the file, make a copy of iasEjb.jar:

   mkdir \tmp
   cd \tmp
   jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
   cd META-INF
   Update META-INF/weblogic-ejb-jar.xml: replace ias_scheduler with pa_scheduler
   cd \
   jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .

   Note: There is a trailing period at the end of the last command.
12. Restart the server.

Last updated: 04-Jun-08 15:51


Database Sizing

Challenge


Database sizing involves estimating the types and sizes of the components of a data architecture. This is important for determining the optimal configuration for the database servers in order to support the operational workloads. Individuals involved in a sizing exercise may be data architects, database administrators, and/or business analysts.

Description
The first step in database sizing is to review system requirements to define such things as:
- Expected data architecture elements (will there be staging areas? operational data stores? centralized data warehouse and/or master data? data marts?). Each additional database element requires more space. This is even more true in situations where data is being replicated across multiple systems, such as a data warehouse maintaining an operational data store as well. The same data in the ODS will be present in the warehouse as well, albeit in a different format.
- Expected source data volume. It is useful to analyze how each row in the source system translates into the target system. In most situations the row count in the target system can be calculated by following the data flows from the source to the target. For example, say a sales order table is being built by denormalizing a source table. The source table holds sales data for 12 months in a single row (one column for each month). Each row in the source translates to 12 rows in the target. So a source table with one million rows ends up as a 12 million row table.
- Data granularity and periodicity. Granularity refers to the lowest level of information that is going to be stored in a fact table. Granularity affects the size of a database to a great extent, especially for aggregate tables. The level at which a table has been aggregated increases or decreases a table's row count. For example, a sales order fact table's size is likely to be greatly affected by whether the table is being aggregated at a monthly level or at a quarterly level. The granularity of fact tables is determined by the dimensions linked to that table. The number of dimensions that are connected to the fact tables affects the granularity of the table and hence the size of the table.
- Load frequency and method (full refresh? incremental updates?). Load frequency affects the space requirements for the staging areas. A load plan that updates a target less frequently is likely to load more data at one go; therefore, more space is required by the staging areas. A full refresh requires more space for the same reason.
- Estimated growth rates over time and retained history.

Determining Growth Projections


One way to estimate projections of data growth over time is to use scenario analysis. As an example, for scenario analysis of a sales tracking data mart you can use the number of sales transactions to be stored as the basis for the sizing estimate. In the first year, 10 million sales transactions are expected; this equates to 10 million fact-table records. Next, use the sales growth forecasts for the upcoming years for database growth calculations. That is, an annual sales growth rate of 10 percent translates into 11 million fact table records for the next year. At the end of five years, the fact table is likely to contain about 60 million records. You may want to calculate other estimates based on five-percent annual sales growth (case 1) and 20-percent annual sales growth (case 2). Multiple projections for best and worst case scenarios can be very helpful.
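A quick way to test these scenarios is to script the compounding arithmetic. The sketch below is a simple POSIX shell/awk calculation using the figures from the example above (10 million first-year rows over five years); the three growth rates stand in for the worst, expected, and best cases and can be changed freely.

# Projects cumulative fact-table rows over 5 years for several growth rates.
for rate in 0.05 0.10 0.20; do
  awk -v r="$rate" 'BEGIN {
    rows = 10000000            # first-year fact rows (from the example)
    total = 0
    for (y = 1; y <= 5; y++) {
      total += rows
      rows *= (1 + r)          # compound the annual growth
    }
    printf "growth %.0f%%: ~%.1f million rows after 5 years\n", r*100, total/1000000
  }'
done

For a 10 percent rate this prints roughly 61 million rows, consistent with the estimate of about 60 million fact-table records after five years.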

Oracle Table Space Prediction Model


Oracle (10g and onwards) provides a mechanism to predict the growth of a database. This feature can be useful in predicting table space requirements. Oracle incorporates a table space prediction model in the database engine that provides projected statistics for space used by a table. The following Oracle 10g query returns projected space usage statistics:

SELECT *
  FROM TABLE(DBMS_SPACE.object_growth_trend('schema','tablename','TABLE'))
 ORDER BY timepoint;

The results of this query are shown below:

TIMEPOINT                      SPACE_USAGE SPACE_ALLOC QUALITY
------------------------------ ----------- ----------- ------------
11-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
12-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
14-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
15-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
16-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED

The QUALITY column indicates the quality of the output as follows:


- GOOD - The data for the timepoint relates to data within the AWR repository with a timestamp within 10 percent of the interval.
- INTERPOLATED - The data for this timepoint did not meet the GOOD criteria but was based on data gathered before and after the timepoint.
- PROJECTED - The timepoint is in the future, so the data is estimated based on previous growth statistics.

Baseline Volumetric
Next, use the physical data models for the sources and the target architecture to develop a baseline sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various database structures such as tables, indexes, sort space, data files, log files, and database cache.

Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical data model, along with field data types and field sizes. Various database products use different storage methods for data types. For this reason, be sure to use the database manuals to determine the size of each data type. Add up the field sizes to determine row size. Then use the data volume projections to determine the number of rows, and multiply by the row size to estimate the table size. The default estimate for index size is to assume the same size as the table size. Also estimate the temporary space for sort operations. For data warehouse applications where summarizations are common, plan on large temporary spaces. The temporary space can be as much as 1.5 times larger than the largest table in the database.

Another approach that is sometimes useful is to load the data architecture with representative data and determine the resulting database sizes. This test load can be a fraction of the actual data and is used only to gather basic sizing statistics. You then need to apply growth projections to these statistics. For example, after loading ten thousand sample records to the fact table, you determine the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60 million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB * (60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.

Guesstimating
When there is not enough information to calculate an estimate as described above, use educated guesses and rules of thumb to develop as reasonable an estimate as possible.
- If you don't have the source data model, use what you do know of the source data to estimate average field size and average number of fields in a row to determine table size. Based on your understanding of transaction volume over time, determine your growth metrics for each type of data and calculate out your source data volume (SDV) from table size and growth metrics.
- If your target data architecture is not completed so that you can determine table sizes, base your estimates on multiples of the SDV:
  - If it includes staging areas: add another SDV for any source subject area that you will stage, multiplied by the number of loads you'll retain in staging.
  - If you intend to consolidate data into an operational data store, add the SDV multiplied by the number of loads to be retained in the ODS for historical purposes (e.g., keeping one year's worth of monthly loads = 12 x SDV).
  - Data warehouse architectures are based on the periodicity and granularity of the warehouse; this may be another SDV + (.3n x SDV, where n = number of time periods loaded in the warehouse over time).
  - If your data architecture includes aggregates, add a percentage of the warehouse volumetrics based on how much of the warehouse data will be aggregated and to what level (e.g., if the rollup level represents 10 percent of the dimensions at the detail level, use 10 percent).
  - Similarly, for data marts add a percentage of the data warehouse based on how much of the warehouse data is moved into the data mart.
- Be sure to consider the growth projections over time and the history to be retained in all of your calculations.

And finally, remember that there is always much more data than you expect so you may want to add a reasonable fudge-factor to the calculations for a margin of safety.
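As a rough back-of-the-envelope sketch of these rules of thumb, the awk calculation below combines the SDV multiples described above. Every multiplier (retained loads, periods, aggregate and mart percentages, the safety margin) is an illustrative assumption and should be replaced with your own figures.

# Back-of-the-envelope storage estimate built from a source data volume (SDV).
# All multipliers below are illustrative assumptions, not fixed rules.
awk 'BEGIN {
  sdv        = 50                    # source data volume in GB (example figure)
  staging    = sdv * 3               # e.g., retain 3 loads in staging
  ods        = sdv * 12              # e.g., 12 monthly loads kept in the ODS
  warehouse  = sdv + 0.3 * 12 * sdv  # SDV + (.3n x SDV) with n = 12 periods
  aggregates = warehouse * 0.10      # e.g., rollups cover ~10% of the detail
  marts      = warehouse * 0.25      # e.g., marts hold ~25% of the warehouse
  total      = staging + ods + warehouse + aggregates + marts
  printf "Estimated storage: %.0f GB (with a 20%% safety margin: %.0f GB)\n", total, total * 1.2
}'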

Last updated: 19-Jul-07 14:14


Deployment Groups

Challenge


Selectively migrating objects from one repository folder to another requires a versatile and flexible mechanism that can overcome such limitations as confinement to a single source folder.

Description
Regulations such as Sarbanes-Oxley (SOX) and HIPAA require tracking, monitoring, and reporting of changes in information technology systems. Automation of change control processes using deployment groups and pmrep commands provide organizations with a means to comply with regulations for configuration management of software artifacts in a PowerCenter repository. Deployment Groups are containers that hold references to objects that need to be migrated. This includes objects such as mappings, mapplets, reusable transformations, sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the repository folders). Deployment groups are faster and more flexible than folder moves for incremental changes. In addition, they allow for migration rollbacks if necessary. Migrating a deployment group involves moving objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. When copying a deployment group, individual objects to be copied can be selected as opposed to the entire contents of a folder. There are two types of deployment groups - static and dynamic.
- Static deployment groups contain direct references to versions of objects that need to be moved. Users explicitly add the version of the object to be migrated to the deployment group. If the set of deployment objects is not expected to change between deployments, static deployment groups can be created.
- Dynamic deployment groups contain a query that is executed at the time of deployment. The results of the query (i.e., object versions in the repository) are then selected and copied to the deployment group. If the set of deployment objects is expected to change frequently between deployments, dynamic deployment groups should be used.


Dynamic deployment groups are generated from a query. While any available criteria can be used, it is advisable to have developers use labels to simplify the query. For more information, refer to the Strategies for Labels section of Using PowerCenter Labels. When generating a query for deployment groups with mappings and mapplets that contain non-reusable objects, a query condition should be used in addition to the specific selection criteria: the query must include a condition on Is Reusable with a qualifier that covers both Reusable and Non-Reusable objects. Without this qualifier, the deployment may encounter errors if there are non-reusable objects held within the mapping or mapplet.

A deployment group exists in a specific repository. It can be used to move items to any other accessible repository/folder. A deployment group maintains a history of all migrations it has performed. It tracks which versions of objects were moved from which folders in which source repositories, and into which folders in which target repositories those versions were copied (i.e., it provides a complete audit trail of all migrations performed). Given that the deployment group knows what it moved and to where, an administrator can, if necessary, have the deployment group undo the most recent deployment, reverting the target repository to its pre-deployment state.

Using labels (as described in the Using PowerCenter Labels Best Practice) allows objects in the subsequent repository to be tracked back to a specific deployment. It is important to note that the deployment group only migrates the objects it contains to the target repository/folder. It does not, itself, move to the target repository; it still resides in the source repository.

Deploying via the GUI


Migrations can be performed via the GUI or the command line (pmrep). In order to migrate objects via the GUI, simply drag a deployment group from the repository it resides in onto the target repository where the referenced objects are to be moved. The Deployment Wizard appears and steps the user through the deployment process. Once the wizard is complete, the migration occurs, and the deployment history is created.

Deploying via the Command Line


Alternatively, the PowerCenter pmrep command can be used to automate both Folder Level deployments (e.g., in a non-versioned repository) and deployments using Deployment Groups. The commands DeployFolder and DeployDeploymentGroup in pmrep are used respectively for these purposes. Whereas deployment via the GUI requires stepping through a wizard and answering a series of questions to deploy, the command-line deployment requires an XML control file that contains the same


information that the wizard requests. This file must be present before the deployment is executed.

The following steps can be used to create a script that wraps pmrep commands and automates PowerCenter deployments (see the sketch below):

1. Use pmrep ListObjects to return the object metadata to be parsed in another pmrep command.
2. Use pmrep CreateDeploymentGroup to create a dynamic or static deployment group.
3. Use pmrep ExecuteQuery to output the results to a persistent input file. This input file can also be used for the AddToDeploymentGroup command.
4. Use DeployDeploymentGroup to copy a deployment group to a different repository. A control file with all the specifications is required for this command.

Additionally, a web interface can be built for entering, approving, and rejecting code migration requests. This can provide additional traceability and reporting capabilities to the automation of PowerCenter code migrations.
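A minimal shell sketch of this sequence follows. The repository, domain, query, group, and file names are placeholders, the deployment control file (deploy_control.xml here) must be prepared separately as described above, and the pmrep option flags shown are assumptions that should be confirmed against the PowerCenter Command Reference for your version.

# 1. Connect to the source repository (credentials via an encrypted password
#    environment variable, as described in the Configuring Security practice).
pmrep connect -r REP_DEV -d Domain_Dev -n deploy_user -X REP_PASSWD

# 2. Create a static deployment group to hold the release objects.
pmrep createdeploymentgroup -p DG_RELEASE_1 -t static

# 3. Run a saved query, write the result to a persistent input file,
#    then add those object versions to the deployment group.
pmrep executequery -q QRY_RELEASE_1 -t shared -u release1_objects.txt
pmrep addtodeploymentgroup -p DG_RELEASE_1 -i release1_objects.txt

# 4. Deploy the group to the target repository using the prepared control file.
pmrep deploydeploymentgroup -p DG_RELEASE_1 -c deploy_control.xml -r REP_TEST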

Considerations for Deployment and Deployment Groups

Simultaneous Multi-Phase Projects


If multiple phases of a project are being developed simultaneously in separate folders, it is possible to consolidate them by mapping folders appropriately through the deployment group migration wizard. When migrating with deployment groups in this way, the override buttons in the migration wizard are used to select specific folder mappings.

Rolling Back a Deployment


Deployment groups help to ensure that there is a back-out methodology and that the latest version of a deployment can be rolled back. To do this: In the target repository (where the objects were migrated to), go to: Versioning>>Deployment>>History>>View History>>Rollback. The rollback purges all objects (of the latest version) that were in the deployment group. Initiate a rollback on a deployment in order to roll back only the latest versions of


the objects. The rollback ensures that the check-in time for the repository objects is the same as the deploy time. The pmrep command RollBackDeployment can also be used to automate rollbacks. Remember that you cannot roll back part of a deployment; you must roll back all the objects in a deployment group.

Managing Repository Size


As objects are checked in and deployed to target repositories, the number of object versions in those repositories increases, as does the size of the repositories. In order to manage repository size, use a combination of Check-in Date and Latest Status (both are query parameters) to purge the desired versions from the repository and retain only the very latest version. All deleted versions of objects should also be purged to reduce the size of the repository. If it is necessary to keep more than the latest version, labels can be included in the query. These labels are ones that have been applied to the repository for the specific purpose of identifying objects for purging.

Off-Shore, On-Shore Migration


When migrating from an off-shore development environment to an on-shore environment, other aspects of the computing environment may make it desirable to generate a dynamic deployment group. Instead of migrating the group itself to the next repository, a query can be used to select the objects for migration and save them to a single XML file, which can then be transmitted to the on-shore environment through alternative methods. If the on-shore repository is versioned, it activates the import wizard as if a deployment group were being received.

Code Migration from Versioned Repository to a Non-Versioned Repository


In some instances, it may be desirable to migrate objects from a versioned repository to a non-versioned repository. Note that migrating in this manner changes the wizards used, and the export from the versioned repository must take place using XML export.

Last updated: 27-May-08 13:20


Migration Procedures - PowerCenter

Challenge


Develop a migration strategy that ensures clean migration between development, test, quality assurance (QA), and production environments, thereby protecting the integrity of each of these environments as the system evolves.

Description
Ensuring that an application has a smooth migration process between development, QA, and production environments is essential for the deployment of an application. Deciding which migration strategy works best for a project depends on two primary factors.
- How is the PowerCenter repository environment designed? Are there individual repositories for development, QA, and production, or are there just one or two environments that share one or all of these phases?
- How has the folder architecture been defined?

Each of these factors plays a role in determining the migration procedure that is most beneficial to the project. PowerCenter offers flexible migration options that can be adapted to fit the need of each application. PowerCenter migration options include repository migration, folder migration, object migration, and XML import/export. In versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which provides the capability to migrate any combination of objects within the repository with a single command. This Best Practice is intended to help the development team decide which technique is most appropriate for the project. The following sections discuss various options that are available, based on the environment and architecture selected. Each section describes the major advantages of its use, as well as its disadvantages.

Repository Environments
The following section outlines the migration procedures for standalone and distributed repository environments. The distributed environment section touches on several migration architectures, outlining the pros and cons of each. Also, please note that any methods described in the Standalone section may also be used in a Distributed environment.

Standalone Repository Environment


In a standalone environment, all work is performed in a single PowerCenter repository that serves as the metadata store. Separate folders are used to represent the development, QA, and production workspaces and segregate work. This type of architecture within a single repository ensures seamless migration from development to QA, and from QA to production. The following example shows a typical architecture. In this example, the company has chosen to create separate development folders for each of the individual developers for development and unit test purposes. A single shared or common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.


Proposed Migration Process - Single Repository


DEV to TEST - Object Level Migration

Now that we've described the repository architecture for this organization, let's discuss how it will migrate mappings to test, and then eventually to production. After all mappings have completed their unit testing, the process for migration to test can begin. The first step in this process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder. This can be done using one of two methods:

- The first, and most common, method is object migration via an object copy. In this case, a user opens the SHARED_MARKETING_TEST folder and drags the object from SHARED_MARKETING_DEV into the appropriate workspace (i.e., Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file from one folder to another using Windows Explorer.
- The second approach is object migration via object XML import/export. A user can export each of the objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into SHARED_MARKETING_TEST via XML import. With XML import/export, the XML files can be uploaded to a third-party versioning tool, if the organization has standardized on such a tool. Otherwise, versioning can be enabled in PowerCenter. Migration with versioned PowerCenter repositories is covered later in this document.

After you've copied all common or shared objects, the next step is to copy the individual mappings from each development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration methods described above to copy the mappings to the folder, although the XML import/export method is the most intuitive for resolving shared-object conflicts. The migration is slightly different here, however, because when copying the mappings you must ensure that the shortcuts in each mapping are associated with the SHARED_MARKETING_TEST folder. Designer prompts the user to choose the correct shortcut folder, created in the previous example, which points to SHARED_MARKETING_TEST (see image below). You can then continue the migration process until all mappings have been successfully migrated. In PowerCenter 7 and later versions, you can export multiple objects into a single XML file, and then import them at the same time.


The final step in the process is to migrate the workflows that use those mappings. Again, the object-level migration can be completed either through drag-and-drop or by using XML import/export. In either case, this process is very similar to the steps described above for migrating mappings, but differs in that the Workflow Manager provides a Workflow Copy Wizard to guide you through the process. The following steps outline the full process for successfully copying a workflow and all of its associated tasks.

1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the destination folder, the Wizard prompts you to rename it or replace it. If no such workflow exists, a default name is used. Then click Next to continue the copy process.

2. The next step for each task is to see if it exists (as shown below). If the task is present, you can rename or replace the current one. If it does not exist, then the default name is used (see below). Then click Next.

3. Next, the Wizard prompts you to select the mapping associated with each session task in the workflow. Select the mapping and continue by clicking Next.


4. If connections exist in the target repository, the Wizard prompts you to select the connection to use for the source and target. If no connections exist, the default settings are used. When this step is completed, click "Finish" and save the work.

Initial Migration - New Folders Created


The move to production is very different for the initial move than for subsequent changes to mappings and workflows. Since the repository only contains folders for development and test, we need to create two new folders to house the production-ready objects. Create these folders after testing of the objects in SHARED_MARKETING_TEST and MARKETING_TEST has been approved. The following steps outline the creation of the production folders and, at the same time, address the initial test-to-production migration.

1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.


4. The first Wizard screen asks if you want to use the typical folder copy options or the advanced options. In this example, we'll use the advanced options.

5. The second Wizard screen prompts you to enter a folder name. By default, the folder name that appears on this screen is the folder name followed by the date. In this case, enter the name as SHARED_MARKETING_PROD.


6. The third Wizard screen prompts you to select a folder to override. Because this is the first time you are transporting the folder, you won't need to select anything.

7. The final screen begins the actual copy process. Click "Finish" when the process is complete.


Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as the original to copy, and associate the shared objects with the SHARED_MARKETING_PROD folder that you just created. At the end of the migration, you should have two additional folders in the repository environment for production: SHARED_MARKETING_PROD and MARKETING_PROD (as shown below). These folders contain the initially migrated objects. Before you can actually run the workflows in these production folders, you need to modify the session source and target connections to point to the production environment.

When you copy or replace a PowerCenter repository folder, the Copy Wizard copies the permissions for the folder owner to the target folder. The wizard does not copy permissions for users, groups, or all others in the repository to the target folder. Previously, the Copy Wizard copied the permissions for the folder owner, the owner's group, and all users in the repository to the target folder.

Incremental Migration - Object Copy Example


Now that the initial production migration is complete, let's take a look at how future changes will be migrated into the folder.


Any time an object is modified, it must be re-tested and migrated into production for the actual change to occur. These types of changes in production take place on a case-by-case or periodically-scheduled basis. The following steps outline the process of moving these objects individually.

1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object to copy and drag-and-drop it into the appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination folder, Designer prompts you to choose whether to Rename or Replace the object (as shown below). Choose the option to Replace the object.

3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any object in Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are making are what you intend. See below for an example of the mapping compare window.


4. After the object has been successfully copied, save the folder so the changes can take place.
5. The newly copied mapping is now tied to any sessions that the replaced mapping was tied to.
6. Log into Workflow Manager and make the appropriate changes to the session or workflow so it can update itself with the changes.

Standalone Repository Example


In this example, we look at moving development work to QA and then from QA to production, using multiple development folders for each developer, with the test and production folders divided by the data mart they represent. For this example, we focus solely on the MARKETING_DEV data mart, first explaining how to move objects and mappings from each individual folder to the test folder and then how to move tasks, worklets, and workflows to the new area. Follow these steps to copy a mapping from Development to QA:

1. If using shortcuts, first follow these sub-steps; if not using shortcuts, skip to step 2.
   - Copy the tested objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.
   - Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to MARKETING_TEST. Save your changes.
2. Copy the mapping from Development into Test.
   - In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping from each development folder into the MARKETING_TEST folder.
   - When copying each mapping in PowerCenter, Designer prompts you to either Replace, Rename, or Reuse the object, or Skip, for each reusable object, such as source and target definitions. Choose to Reuse the object for all shared objects in the mappings copied into the MARKETING_TEST folder. Save your changes.
3. If a reusable session task is being used, follow these sub-steps. Otherwise, skip to step 4.
   - In the PowerCenter Workflow Manager, open the MARKETING_TEST folder and drag and drop each reusable session from the developers' folders into the MARKETING_TEST folder. A Copy Session Wizard guides you through the copying process.
   - Open each newly copied session and click on the Source tab. Change the source to point to the source database for the Test environment. Click the Target tab. Change each connection to point to the target database for the Test environment. Be sure to double-check the workspace from within the Target tab to ensure that the load options are correct. Save your changes.
4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.
   - Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy Workflow Wizard appears. Follow the same steps listed above to copy the workflow to the new folder.
   - As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to compare conflicts from within Workflow Manager to ensure that the correct migrations are being made. Save your changes.
5. Implement the appropriate security.
   - In Development, the owner of the folders should be a user (or users) in the development group.
   - In Test, change the owner of the test folder to a user in the test group.
   - In Production, change the owner of the folders to a user in the production group.
   - Revoke all rights to Public other than Read for the production folders.

Rules to Configure Folder and Global Object Permissions

Rules in 8.5:
- The folder or global object owner, or a user assigned the Administrator role for the Repository Service, can grant folder and global object permissions.
- Permissions can be granted to users, groups, and all others in the repository.
- The folder or global object owner and a user assigned the Administrator role for the Repository Service have all permissions, which you cannot change.

Rules in Previous Versions:
- Users with the appropriate repository privileges could grant folder and global object permissions.
- Permissions could be granted to the owner, the owner's group, and all others in the repository.
- You could change the permissions for the folder or global object owner.

Disadvantages of a Single Repository Environment


The biggest disadvantage or challenge with a single repository environment is the migration of repository objects with respect to database connections. When migrating objects from Dev to Test to Prod, you can't use the same database connections as those pointing to the development or test environments. A single repository structure can also create confusion, because the same users and groups exist in all environments and the number of folders can increase exponentially.

Distributed Repository Environment



A distributed repository environment maintains separate, independent repositories, hardware, and software for development, test, and production environments. Separating repository environments is preferable for handling development to production migrations. Because the environments are segregated from one another, work performed in development cannot impact QA or production. With a fully distributed approach, separate repositories function much like the separate folders in a standalone environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our Marketing example we would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following example, we discuss a distributed repository architecture. There are four techniques for migrating from development to production in a distributed repository architecture, with each involving some advantages and disadvantages.
- Repository Copy
- Folder Copy
- Object Copy
- Deployment Groups

Repository Copy
So far, this document has covered object-level migrations and folder migrations through drag-and-drop object copying and object XML import/export. This section discusses migrations in a distributed repository environment through repository copies. The main advantages of this approach are:
- The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformations, etc.) at once from one environment to another.
- The ability to automate this process using pmrep commands, thereby eliminating many of the manual processes that users typically perform.
- The ability to move everything without breaking or corrupting any of the objects.

This approach also involves a few disadvantages.



- The first is that everything is moved at once (which is also an advantage). The problem is that everything is moved, ready or not. For example, we may have 50 mappings in QA, but only 40 of them are production-ready. The 10 untested mappings are moved into production along with the 40 production-ready mappings, which leads to the second disadvantage.
- Significant maintenance is required to remove any unwanted or excess objects. There is also a need to adjust server variables, sequences, parameters/variables, database connections, etc. Everything must be set up correctly before the actual production runs can take place.
- Lastly, the repository copy process requires that the existing Production repository be deleted before the Test repository can be copied. This results in a loss of production-environment operational metadata such as load statuses, session run times, etc. High-performance organizations leverage the value of operational metadata to track trends over time related to load success/failure and duration. This metadata can be a competitive advantage for organizations that use this information to plan for future growth.

Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the Repository Copy method:

- Copying the Repository
- Repository Backup and Restore
- PMREP

Copying the Repository


Copying the Test repository to Production through the GUI client tools is the easiest of all the migration methods. First, ensure that all users are logged out of the destination repository and then connect to the PowerCenter Repository Administration Console (as shown below).

If the Production repository already exists, you must delete the repository before you can copy the Test repository. Before you can delete the repository, you must run the repository in exclusive mode.

1. Click on the INFA_PROD repository on the left pane to select it and change the running mode to exclusive by clicking the Edit button on the right pane under the Properties tab.

2. Delete the Production repository by selecting it and choosing Delete from the context menu.


3. Click on the Action drop-down list and choose the Copy Contents From option.


4. In the new window, choose the domain name and the repository service INFA_TEST from the drop-down menus. Enter the username and password for the Test repository.

5. Click OK to begin the copy process.
6. When you've successfully copied the repository to the new location, exit from the PowerCenter Administration Console.
7. In the Repository Manager, double-click on the newly copied repository and log in with a valid username and password.
8. Verify connectivity, then highlight each folder individually and rename them. For example, rename the MARKETING_TEST folder to MARKETING_PROD, and SHARED_MARKETING_TEST to SHARED_MARKETING_PROD.
9. Be sure to remove all objects that are not pertinent to the Production environment from the folders before beginning the actual testing process.
10. When this cleanup is finished, you can log into the repository through the Workflow Manager. Modify the server information and all connections so they are updated to point to the new Production locations for all existing tasks and workflows.

Repository Backup and Restore


Backup and Restore Repository is another simple method of copying an entire repository. This process backs up the repository to a binary file that can be restored to any new location. This method is preferable to the repository copy process because, if any type of error occurs, the repository has already been backed up to a binary file on the repository server.

From 8.5 onwards, security information is maintained at the domain level. Before you back up a repository and restore it in a different domain, verify that users and groups with privileges for the source Repository Service exist in the target domain. The Service Manager periodically synchronizes the list of users and groups in the repository with the users and groups in the domain configuration database. During synchronization, users and groups that do not exist in the target domain are deleted from the repository. You can use infacmd to export users and groups from the source domain and import them into the target domain: use infacmd ExportUsersAndGroups to export the users and groups to a file, and infacmd ImportUsersAndGroups to import the users and groups from the file into a different PowerCenter domain (see the sketch following this procedure).

The following steps outline the process of backing up and restoring the repository for migration.

1. Launch the PowerCenter Administration Console and highlight the INFA_TEST repository service. Select Action -> Backup Contents from the drop-down menu.


2. A screen appears and prompts you to supply a name for the backup file as well as the Administrator username and password. The file is saved to the Backup directory within the repository server's home directory.

3. After you've selected the location and file name, click OK to begin the backup process.
4. The backup process creates a .rep file containing all repository information. Stay logged into the Manage Repositories screen. When the backup is complete, select the repository connection to which the backup will be restored (i.e., the Production repository).

5. The system will prompt you to supply a username, password, and the name of the file to be restored. Enter the appropriate information and click OK. When the restoration process is complete, repeat the steps listed in the Copying the Repository option to delete all of the unused objects and rename the folders.
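As a rough sketch of the cross-domain user and group step mentioned above, the infacmd calls can be scripted ahead of the restore. This is an assumption-laden example rather than a definitive recipe: the option letters shown (-dn, -un, -pd and the export-file argument -f) and all domain names, credentials, and paths are placeholders that should be verified against the Command Line Reference for the installed version.

REM Sketch only: export users and groups from the source domain, then import them into the target domain.
REM Option names are assumed; confirm with the infacmd help for ExportUsersAndGroups and ImportUsersAndGroups.
@echo off
echo Exporting users and groups from the source domain...
<Informatica Installation Directory>\server\bin\infacmd ExportUsersAndGroups -dn SourceDomain -un Administrator -pd AdminPwd -f c:\backup\users_and_groups.xml
echo Importing users and groups into the target domain...
<Informatica Installation Directory>\server\bin\infacmd ImportUsersAndGroups -dn TargetDomain -un Administrator -pd AdminPwd -f c:\backup\users_and_groups.xml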

PMREP
Using the pmrep commands is essentially the same as the Backup and Restore Repository method, except that it is run from the command line rather than through the GUI client tools. pmrep is installed in the PowerCenter Client and PowerCenter Services bin directories. The pmrep utility can be used from the Informatica Server or from any client machine connected to the server. Refer to the Repository Manager Guide for a list of pmrep commands.

pmrep backup backs up the repository to the file specified with the -o option. You must provide the backup file name. Use this command when the repository is running; you must be connected to a repository to use it. The Backup command uses the following syntax:

backup -o <output_file_name> [-d <description>] [-f (overwrite existing output file)] [-b (skip workflow and session logs)] [-j (skip deploy group history)] [-q (skip MX data)] [-v (skip task statistics)]

such as connect, backup, restore, etc.

backupproduction.bat

REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
@echo off
echo Connecting to Production repository...
<Informatica Installation Directory>\Server\bin\pmrep connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Backing up Production repository...
<Informatica Installation Directory>\Server\bin\pmrep backup -o c:\backup\Production_backup.rep

Alternatively, the following steps can be used (sketched in the script below):

1. Use infacmd commands to run the repository service in exclusive mode.
2. Use the pmrep backup command to back up the source repository.
3. Use the pmrep delete command to delete the content of the target repository (if content already exists in the target repository).
4. Use the pmrep restore command to restore the backup file into the target repository.
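A minimal sketch of that alternative sequence is shown below. It assumes the target repository service has already been switched to exclusive mode (in the Administration Console or with infacmd), and the repository names, credentials, host, port, and file paths are placeholders. The exact delete and restore options (and any domain-authentication arguments required by restore from 8.5 onwards) vary by PowerCenter version, so verify them in the pmrep Command Reference before relying on this.

REM migrate_repository.bat - sketch: back up the Test repository and restore it over the Production repository
REM Assumes the target repository service is running in exclusive mode with no users connected.
@echo off
echo Backing up the Test repository...
<Informatica Installation Directory>\Server\bin\pmrep connect -r INFATEST -n Administrator -x AdminPwd -h infarepserver -o 7001
<Informatica Installation Directory>\Server\bin\pmrep backup -o c:\backup\INFATEST_backup.rep -f
echo Deleting the existing content of the Production repository...
<Informatica Installation Directory>\Server\bin\pmrep connect -r INFAPROD -n Administrator -x AdminPwd -h infarepserver -o 7001
<Informatica Installation Directory>\Server\bin\pmrep delete -f
echo Restoring the backup file into the Production repository...
<Informatica Installation Directory>\Server\bin\pmrep restore -i c:\backup\INFATEST_backup.rep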

Post-Repository Migration Cleanup


After you have used one of the repository migration procedures to migrate into Production, follow these steps to convert the repository to Production:

1. Disable workflows that are not ready for Production, or simply delete the mappings, tasks, and workflows.
   - Disable the workflows not being used in the Workflow Manager by opening the workflow properties, then checking the Disabled checkbox under the General tab.
   - Delete the tasks not being used in the Workflow Manager and the mappings in the Designer.
2. Modify the database connection strings to point to the production sources and targets.
   - In the Workflow Manager, select Relational connections from the Connections menu.
   - Edit each relational connection by changing the connect string to point to the production sources and targets. If you are using Lookup transformations in the mappings and the connect string is anything other than $SOURCE or $TARGET, you will need to modify the connect strings appropriately.
3. Modify the pre- and post-session commands and SQL as necessary.
   - In the Workflow Manager, open the session task properties, and from the Components tab make the required changes to the pre- and post-session scripts.
4. Implement appropriate security, such as:
   - In Development, ensure that the owner of the folders is a user in the development group.
   - In Test, change the owner of the test folders to a user in the test group.
   - In Production, change the owner of the folders to a user in the production group.
   - Revoke all rights to Public other than Read for the Production folders.

Folder Copy
Although deployment groups are becoming a very popular migration method, the folder copy method has historically been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly promote all of the objects located within that folder. All source and target objects, reusable transformations, mapplets, mappings, tasks, worklets, and workflows are promoted at once. Because of this, however, everything in the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the Repository Administrator) must manually delete these mappings or workflows from the new folder after the folder is copied.

The three advantages of using the folder copy method are:

- The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and all the objects located within it.
- If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships are automatically converted to point to this newly copied common or shared folder.
- All connections, sequences, mapping variables, and workflow variables are copied automatically.

The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being performed. Therefore, it is necessary to schedule this migration task during a time when the repository is least utilized. Remember that a locked repository means that no jobs can be launched during this process. This can be a serious consideration in real-time or near real-time environments.

The following example steps through the process of copying folders from each of the different environments. The first example uses three separate repositories for development, test, and production.

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:
   - Open the Repository Manager client tool.
   - Connect to both the Development and Test repositories.
   - Highlight the folder to copy and drag it to the Test repository.
   - The Copy Folder Wizard appears to step you through the copy process.
   - When the folder copy process is complete, open the newly copied folder in both the Repository Manager and Designer to ensure that the objects were copied properly.
2. Copy the Development folder to Test. If you skipped step 1, follow these sub-steps:
   - Open the Repository Manager client tool.
   - Connect to both the Development and Test repositories.
   - Highlight the folder to copy and drag it to the Test repository. The Copy Folder Wizard will appear.


3. Follow these steps to ensure that all shortcuts are reconnected:
   - Use the advanced options when copying the folder across.
   - Select Next to use the default name of the folder.

4. If the folder already exists in the destination repository, choose to replace the folder.

The following screen appears to prompt you to select the folder where the new shortcuts are located.


In a situation where the folder names do not match, a folder compare will take place. The Copy Folder Wizard then completes the folder copy process. Rename the folder as appropriate and implement the security.

5. When testing is complete, repeat the steps above to migrate to the Production repository. When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to the appropriate target location. Ensure that all tasks were updated correctly and that folder and repository security is modified for test and production.

Object Copy
Copying mappings into the next stage in a networked environment involves many of the same advantages and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked environment. For additional information, see the earlier description of Object Copy for the standalone environment. One advantage of Object Copy in a distributed environment is that it provides more granular control over objects. Two distinct disadvantages of Object Copy in a distributed environment are:
- Much more work to deploy an entire group of objects.
- Shortcuts must exist prior to importing/copying mappings.

Below are the steps to complete an object copy in a distributed repository environment:

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:
   - In each of the distributed repositories, create a common folder with the exact same name and case.
   - Copy the shortcuts into the common folder in Production, making sure the shortcut has the exact same name.
2. Copy the mapping from the Test environment into Production.
   - In the Designer, connect to both the Test and Production repositories and open the appropriate folders in each.
   - Drag-and-drop the mapping from Test into Production. During the mapping copy process, PowerCenter 7 and later versions allow a comparison of this mapping to an existing copy of the mapping already in Production. Note that the ability to compare objects is not limited to mappings, but is available for all repository objects including workflows, sessions, and tasks.
3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping (first ensure that the mapping exists in the current repository).
   - If copying the workflow, follow the Copy Wizard.
   - If creating the workflow, add a session task that points to the mapping and enter all the appropriate information.
4. Implement appropriate security.
   - In Development, ensure the owner of the folders is a user in the development group.
   - In Test, change the owner of the test folders to a user in the test group.
   - In Production, change the owner of the folders to a user in the production group.
   - Revoke all rights to Public other than Read for the Production folders.

Deployment Groups
For versioned repositories, the use of Deployment Groups for migrations between distributed environments allows the most flexibility and convenience. With Deployment Groups, you can migrate individual objects as you would in an object copy migration, but you also have the convenience of a repository- or folder-level migration because all objects are deployed at once. The objects included in a deployment group have no restrictions and can come from one or multiple folders. Additionally, you can set up a dynamic deployment group, which allows the objects in the deployment group to be defined by a repository query rather than being added to the deployment group manually. Lastly, because deployment groups are available on versioned repositories, a deployment can also be rolled back, reverting to the previous versions of the objects, when necessary.

Advantages of Using Deployment Groups


- Backup and restore of the Repository needs to be performed only once.
- Copying a Folder replaces the previous copy.
- Copying a Mapping allows for different names to be used for the same object.
- Uses for Deployment Groups:
  - Deployment Groups are containers that hold references to objects that need to be migrated.
  - Allows for version-based object migration.
  - Faster and more flexible than folder moves for incremental changes.
  - Allows for migration rollbacks.
  - Allows specifying individual objects to copy, rather than the entire contents of a folder.

Types of Deployment Groups


- Static
  - Contain direct references to versions of objects that need to be moved.
  - Users explicitly add the version of the object to be migrated to the deployment group.
- Dynamic
  - Contain a query that is executed at the time of deployment.
  - The results of the query (i.e., object versions in the repository) are then selected and copied to the target repository.

Pre-Requisites
Create required folders in the Target Repository

Creating Labels
A label is a versioning object that you can associate with any versioned object or group of versioned objects in a repository.
- Advantages
  - Tracks versioned objects during development.
  - Improves query results.
  - Associates groups of objects for deployment.
  - Associates groups of objects for import and export.
- Create label
  - Create labels through the Repository Manager.
  - After creating the labels, go to edit mode and lock them.
  - The "Lock" option is used to prevent other users from editing or applying the label. This option can be enabled only when the label is edited.
  - Some standard label examples are:
    - Development
    - Deploy_Test
    - Test
    - Deploy_Production
    - Production
- Apply Label
  - Create a query to identify the objects that need to be labeled.
  - Run the query and apply the labels (a command-line sketch follows the note below).

Note: By default, the latest version of the object gets labeled.
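The label steps above can also be scripted with pmrep. The following is a sketch under stated assumptions rather than a definitive recipe: the CreateLabel and ApplyLabel option letters shown (-a for the label name, -c for a comment, -n/-o/-f for the object, and -m to move the label to the latest version) should be confirmed against the pmrep Command Reference for the installed version, and the repository, folder, label, and object names are placeholders.

REM Sketch: create a label and apply it to a tested mapping from the command line.
REM Assumes a pmrep connection has already been established, for example:
REM   pmrep connect -r INFADEV -n Administrator -x AdminPwd -h infarepserver -o 7001
@echo off
echo Creating the label...
pmrep createlabel -a Deploy_Test -c "Objects approved for migration to TEST"
echo Applying the label to a mapping...
pmrep applylabel -a Deploy_Test -n m_load_customers -o mapping -f MARKETING_DEV -m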

Queries
A query is an object used to search for versioned objects in the repository that meet specific conditions.
- Advantages
  - Tracks objects during development.
  - Associates a query with a deployment group.
  - Finds deleted objects you want to recover.
  - Finds groups of invalidated objects you want to validate.
- Create a query
  - The Query Browser allows you to create, edit, run, or delete object queries.
- Execute a query
  - Execute through the Query Browser.
  - Execute through pmrep using the following syntax (a usage sketch follows):

    ExecuteQuery -q query_name -t query_type -u persistent_output_file_name -a append -c column_separator -r end-of-record_separator -l end-of-listing_indicator -b verbose
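As a usage sketch of the ExecuteQuery syntax above, the fragment below runs a shared query and writes the matching object versions to a text file for review. The query name, repository connection details, and output path are placeholders, and the -t value for a shared query should be verified in the pmrep Command Reference.

REM Sketch: run a shared object query and persist the results to a file.
@echo off
pmrep connect -r INFADEV -n Administrator -x AdminPwd -h infarepserver -o 7001
pmrep executequery -q RELEASE_20050130_QUERY -t shared -u c:\deploy\release_20050130_objects.txt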

Creating a Deployment Group


Follow these steps to create a deployment group:

1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on Deployment Groups, and choose New Group.

3. In the dialog window, give the deployment group a name, and choose whether it should be static or dynamic. In this example, we are creating a static deployment group. Click OK.


Adding Objects to a Static Deployment Group


Follow these steps to add objects to a static deployment group:

1. In Designer, Workflow Manager, or Repository Manager, right-click an object that you want to add to the deployment group and choose Versioning -> View History. The View History window appears.

2. In the View History window, right-click the object and choose Add to Deployment Group.

3. In the Deployment Group dialog window, choose the deployment group that you want to add the object to, and click OK.

4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want to add dependent objects to the deployment group so that they will be migrated as well. Click OK.


NOTE: The All Dependencies option should be used for any new code that is migrating forward. However, this option can cause issues when moving existing code forward, because All Dependencies also flags shortcuts. During the deployment, PowerCenter tries to re-insert or replace the shortcuts; this does not work, and causes the deployment to fail.

The object will be added to the deployment group at this time. Although the deployment group allows the most flexibility, the task of adding each object to the deployment group is similar to the effort required for an object copy migration. To make deployment groups easier to use, PowerCenter provides the capability to create dynamic deployment groups.

Adding Objects to a Dynamic Deployment Group


Dynamic deployment groups are similar in function to static deployment groups, but differ in the way that objects are added. In a static deployment group, objects are manually added one by one. In a dynamic deployment group, the contents of the deployment group are defined by a repository query. Don't worry about the complexity of writing a repository query; it is quite simple and aided by the PowerCenter GUI interface. Follow these steps to add objects to a dynamic deployment group:

1. First, create a deployment group, just as you did for a static deployment group, but in this case, choose the dynamic option. Also, select the Queries button.


2. The Query Browser window appears. Choose New to create a query for the dynamic deployment group.

3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that should be migrated. The drop-down list of parameters lets you choose from 23 predefined metadata categories. In this case, the developers have assigned the RELEASE_20050130 label to all objects that need to be migrated, so the query is defined as Label Is Equal To RELEASE_20050130. The creation and application of labels are discussed in Using PowerCenter Labels.


4. Save the Query and exit the Query Editor. Click OK on the Query Browser window, and close the Deployment Group editor window.

Executing a Deployment Group Migration


A Deployment Group migration can be executed through the Repository Manager client tool, or through the pmrep command line utility. With the client tool, you simply drag the deployment group from the source repository and drop it on the destination repository. This opens the Copy Deployment Group Wizard, which guides you through the step-by-step options for executing the deployment group.

Rolling Back a Deployment


To roll back a deployment, you must first locate the deployment in the TARGET repository via the menu bar (i.e., Deployments -> History -> View History -> Rollback).

Automated Deployments
For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep DeployDeploymentGroup command, which can execute a deployment group migration without human intervention. This is ideal, since the deployment group allows ultimate flexibility and convenience and the script can be scheduled to run overnight, thereby causing minimal impact on developers and the PowerCenter administrator. You can also use the pmrep utility to automate importing objects via XML.
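A minimal sketch of such a script is shown below. It assumes DeployDeploymentGroup takes the deployment group name, a deployment control file, and the target repository name (-p, -c, and -r here); those option letters, the control-file format (an XML file based on the deployment control DTD shipped with PowerCenter), and all names and paths are assumptions to verify against the pmrep Command Reference.

REM deploy_release.bat - sketch: execute a deployment group migration from TEST to PROD overnight
@echo off
echo Connecting to the source (Test) repository...
<Informatica Installation Directory>\Server\bin\pmrep connect -r INFATEST -n DeployUser -x DeployPwd -h infarepserver -o 7001
echo Deploying the group RELEASE_20050130 to the Production repository...
<Informatica Installation Directory>\Server\bin\pmrep deploydeploymentgroup -p RELEASE_20050130 -c c:\deploy\release_20050130_control.xml -r INFAPROD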

Recommendations
Informatica recommends using the following process when running in a three-tiered environment with development, test, and production servers.

Non-Versioned Repositories
For migrating from development into test, Informatica recommends using the Object Copy method. This method gives you total granular control over the objects that are being moved. It also ensures that the latest development mappings can be moved over manually as they are completed. For recommendations on performing this copy procedure correctly, see the steps listed in the Object Copy section.

Versioned Repositories
For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in a distributed repository environment. This method provides the greatest flexibility in that you can promote any object from within a development repository (even across folders) into any destination repository. Also, by using labels, dynamic deployment groups, and the enhanced pmrep command line utility, the use of the deployment group migration method results in automated migrations that can be executed without manual intervention.

Third-Party Versioning
Some organizations have standardized on third-party version control software. PowerCenter's XML import/export functionality offers integration with such software and provides a means to migrate objects. This method is most useful in a distributed environment because objects can be exported into an XML file from one repository and imported into the destination repository. The XML object copy process allows you to copy nearly all repository objects, including sources, targets, reusable transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7, the export/import functionality allows the export/import of multiple objects to a single XML file. This can significantly cut down on the work associated with object-level XML import/export.

The following steps outline the process of exporting the objects from the source repository and importing them into the destination repository:

Exporting
1. From Designer or Workflow Manager, log in to the source repository. Open the folder and highlight the object to be exported.
2. Select Repository -> Export Objects.

3. The system prompts you to select a directory location on the local workstation. Choose the directory to save the file. Using the default name for the XML file is generally recommended.
4. Open Windows Explorer and go to the Client directory of the PowerCenter installation (for example, C:\Program Files\Informatica PowerCenter <version>\Client; this may vary depending on where you installed the client tools).
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved the XML file.
6. Together, these files are now ready to be added to the version control software.

Importing
Log into Designer or the Workflow Manager client tool and connect to the destination repository. Open the folder where the object is to be imported.

1. Select Repository -> Import Objects.
2. The system prompts you to select a directory location and file to import into the repository.
3. The following screen appears with the steps for importing the object.

4. Select the mapping and add it to the Objects to Import list.


5. Click "Next", and then click "Import". Since the shortcuts have been added to the folder, the mapping will now point to the new shortcuts and their parent folder. 6. It is important to note that the pmrep command line utility was greatly enhanced in PowerCenter 7 and later versions, allowing the activities associated with XML import/export to be automated through pmrep. 7. Click on the destination repository service on the left pane and choose the Action drop-down list box -> Restore. Remember, if the destination repository has content, it has to be deleted prior to restoring).

Last updated: 04-Jun-08 16:18


Migration Procedures - PowerExchange

Challenge


To facilitate the migration of PowerExchange definitions from one environment to another.

Description
There are two approaches to performing a migration:

- Using the DTLURDMO utility
- Using the PowerExchange client tool (Detail Navigator)

DTLURDMO Utility

Step 1: Validate connectivity between the client and listeners

Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<nodename>.

Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Run DTLURDMO to copy PowerExchange objects.


At this stage, if PowerExchange is to run against new versions of the PowerExchange objects rather than existing libraries, you need to copy the datamaps. To do this, use the PowerExchange Copy Utility DTLURDMO. The following section assumes that the entire datamap set is to be copied. DTLURDMO does have the ability to copy selectively, however, and the full functionality of the utility is documented in the PowerExchange Utilities Guide. The types of definitions that can be managed with this utility are:
- PowerExchange data maps
- PowerExchange capture registrations
- PowerExchange capture extraction data maps

On MVS, the input statements for this utility are taken from SYSIN. On non-MVS platforms, the input argument points to a file containing the input definition. If no input argument is provided, the utility looks for a file dtlurdmo.ini in the current path. The utility runs on all capture platforms.

Windows and UNIX Command Line


Syntax:

DTLURDMO <dtlurdmo definition file>

For example:

DTLURDMO e:\powerexchange\bin\dtlurdmo.ini

- DTLURDMO definition file specification - This file is used to specify how the DTLURDMO utility operates. If no definition file is specified, the utility looks for a file dtlurdmo.ini in the current path.

MVS DTLURDMO job utility


Run the utility by submitting the DTLURDMO job, which can be found in the RUNLIB library.
- DTLURDMO definition file specification - This file is used to specify how the DTLURDMO utility operates and is read from the SYSIN card.

AS/400 utility
Syntax:

CALL PGM(<location and name of DTLURDMO executable file>)

For example:

CALL PGM(dtllib/DTLURDMO)

- DTLURDMO definition file specification - This file is used to specify how the DTLURDMO utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib library.

If you want to create a separate DTLURDMO definition file rather than use the default location, you must give the library and filename of the definition file as a parameter. For example:

CALL PGM(dtllib/DTLURDMO) parm ('datalib/deffile(dtlurdmo)')

Running DTLURDMO
The utility should be run extracting information from the files locally, then writing out the datamaps through the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again for the registrations, and then the extract maps if this is a capture environment. Commands for mixed datamaps, registrations, and extract maps cannot be run together.
INFORMATICA CONFIDENTIAL BEST PRACTICES 155 of 954

If only a subset of the PowerExchange datamaps, registrations, and extract maps are required, then selective copies can be carried out. Details of performing selective copies are documented fully in the PowerExchange Utilities Guide. This document assumes that everything is going to be migrated from the existing environment to the new V8.x.x format.

Definition File Example


The following example shows a definition file to copy all datamaps from the existing local datamaps (the local datamaps are defined in the DATAMAP DD card in the MVS JCL, or by the path on Windows or UNIX) to the V8.x.x listener (defined by the TARGET location node1):

USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
DM_COPY;
SELECT schema=*;

Note: The encrypted password (EPWD) is generated from the FILE, ENCRYPT PASSWORD option in the PowerExchange Navigator.

PowerExchange Client Tool (Detail Navigator)

Step 1: Validate connectivity between the client and listeners

Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<nodename>.


Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Start the PowerExchange Navigator

- Select the datamap that is going to be promoted to production.
- On the menu bar, select a file to send to the remote node.
- On the drop-down list box, choose the appropriate location (in this case, mvs_prod).
- Supply the user name and password and click OK.
- A confirmation message for successful migration is displayed.


Last updated: 06-Feb-07 11:39


Running Sessions in Recovery Mode

Challenge


Understanding the recovery options that are available for PowerCenter when errors are encountered during the load.

Description
When a task in the workflow fails at any point, one option is to truncate the target and run the workflow again from the beginning. As an alternative, the workflow can be suspended and the error can be fixed, rather than re-processing the portion of the workflow with no errors. This option, "Suspend on Error", results in accurate and complete target data, as if the session completed successfully with one run. There are also recovery options available for workflows and tasks that can be used to handle different failure scenarios.

Configure Mapping for Recovery


For consistent recovery, the mapping needs to produce the same result, and in the same order, in the recovery execution as in the failed execution. This can be achieved by sorting the input data, either with the sorted ports option in the Source Qualifier (or Application Source Qualifier) or with a Sorter transformation (using the distinct rows option) immediately after the Source Qualifier transformation. Additionally, ensure that all the targets receive data from transformations that produce repeatable data.

Configure Session for Recovery


The recovery strategy can be configured on the Properties page of the Session task. Enable the session for recovery by selecting one of the following three Recovery Strategies:
- Resume from the last checkpoint
  - The Integration Service saves the session recovery information and updates recovery tables for a target database. If a session interrupts, the Integration Service uses the saved recovery information to recover it.
  - The Integration Service recovers a stopped, aborted, or terminated session from the last checkpoint.
- Restart task
  - The Integration Service does not save session recovery information.
  - If a session interrupts, the Integration Service reruns the session during recovery.
- Fail task and continue workflow
  - The Integration Service recovers the workflow; it does not recover the session. The session status becomes failed and the Integration Service continues running the workflow.

Configure Workflow for Recovery


The Suspend on Error option directs the Integration Service to suspend the workflow while the error is being fixed and then it resumes the workflow. The workflow is suspended when any of the following tasks fail:
- Session
- Command
- Worklet
- Email

When a task fails in the workflow, the Integration Service stops running tasks in the path. The Integration Service does not evaluate the output link of the failed task. If no other task is running in the workflow, the Workflow Monitor displays the status of the workflow as "Suspended." If one or more tasks are still running in the workflow when a task fails, the Integration Service stops running the failed task and continues running tasks in other paths. The Workflow Monitor displays the status of the workflow as "Suspending." When the status of the workflow is "Suspended" or "Suspending," you can fix the error, such as a target database error, and recover the workflow in the Workflow Monitor. When you recover a workflow, the Integration Service restarts the failed tasks and continues evaluating the rest of the tasks in the workflow. The Integration Service does not run any task that already completed successfully.

Truncate Target Table


If the truncate table option is enabled in a recovery-enabled session, the target table is not truncated during the recovery process.

Session Logs
In a suspended workflow scenario, the Integration Service uses the existing session log when it resumes the workflow from the point of suspension. However, the earlier runs that caused the suspension are recorded in the historical run information in the repository.

Suspension Email
The workflow can be configured to send an email when the Integration Service suspends the workflow. When a task fails, the workflow is suspended and suspension email is sent. The error can be fixed and the workflow can be resumed subsequently. If another task fails while the Integration Service is suspending the workflow, another suspension email is not sent. The Integration Service only sends out another suspension email if another task fails after the workflow resumes. Check the "Browse Emails" button on the General tab of the Workflow Designer Edit sheet to configure the suspension email.

Suspending Worklets
When the "Suspend On Error" option is enabled for the parent workflow, the Integration Service also suspends the worklet if a task within the worklet fails. When a task in the worklet fails, the Integration Service stops executing the failed task and other tasks in its path. If no other task is running in the worklet, the status of the worklet is "Suspended". If other tasks are still running in the worklet, the status of the worklet is "Suspending". The parent workflow is also suspended when the worklet is "Suspended" or "Suspending".

Starting Recovery
The recovery process can be started using the Workflow Manager or the Workflow Monitor. Alternatively, the recovery process can be started by using pmcmd in command line mode or by using a script (see the sketch below).
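As an illustrative sketch of the pmcmd route, the fragment below recovers a failed workflow from the command line. The recoverworkflow command and the option letters shown (-sv, -d, -u, -p, -f) reflect common pmcmd usage but should be verified against the Command Line Reference for the installed version; the service, domain, folder, and workflow names are placeholders.

REM Sketch: recover a suspended or failed workflow without opening the client tools.
@echo off
echo Recovering the workflow wf_load_marketing...
pmcmd recoverworkflow -sv INT_SVC_PROD -d Domain_PROD -u Administrator -p AdminPwd -f MARKETING_PROD wf_load_marketing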

Recovery Tables and Recovery Process


When the Integration Service runs a session that has a resume recovery strategy, it writes to recovery tables on the target database system. When the Integration Service recovers the session, it uses information in the recovery tables to determine where to begin loading data to target tables. If you want the Integration Service to create the recovery tables, grant table creation privilege to the database user name that is configured in the target database connection. If you do not want the Integration Service to create the recovery tables, create the recovery tables manually.

The Integration Service creates the following recovery tables in the target database:

- PM_RECOVERY - Contains target load information for the session run. The Integration Service removes the information from this table after each successful session and initializes the information at the beginning of subsequent sessions.
- PM_TGT_RUN_ID - Contains information that the Integration Service uses to identify each target on the database. The information remains in the table between session runs. If you manually create this table, you must create a row and enter a value other than zero for LAST_TGT_RUN_ID to ensure that the session recovers successfully.
- PM_REC_STATE - When the Integration Service runs a real-time session that uses the recovery table and that has recovery enabled, it creates the PM_REC_STATE table on the target database to store message IDs and commit numbers. When the Integration Service recovers the session, it uses the information in this table to determine if it needs to write messages to the target table during recovery for a real-time session.

If you edit or drop the recovery tables before you recover a session, the Integration Service cannot recover the session. If you disable recovery, the Integration Service does not remove the recovery tables from the target database; you must remove them manually.

Session Recovery Considerations


The following options affect whether the session is incrementally recoverable:
- Output is deterministic. A property that determines if the transformation generates the same set of data for each session run.
- Output is repeatable. A property that determines if the transformation generates the data in the same order for each session run. You can set this property for Custom transformations.
- Lookup source is static. A Lookup transformation property that determines if the lookup source is the same between the session and recovery. The Integration Service uses this property to determine if the output is deterministic.

Inconsistent Data During Recovery Process


For recovery to be effective, the recovery session must produce the same set of rows; and in the same order. Any change after initial failure (in mapping, session and/or in the Integration Service) that changes the ability to produce repeatable data, results in inconsistent data during the recovery process. The following situations may produce inconsistent data during a recovery session:
- Session performs incremental aggregation and the Integration Service stops unexpectedly.
- Mapping uses a Sequence Generator transformation.
- Mapping uses a Normalizer transformation.
- Source and/or target changes after the initial session failure.
- Data movement mode changes after the initial session failure.
- Code page (server, source, or target) changes after the initial session failure.
- Mapping changes in a way that causes the server to distribute, filter, or aggregate rows differently.
- Session configurations are not supported by PowerCenter for session recovery.
- Mapping uses a lookup table and the data in the lookup table changes between session runs.
- Session sort order changes, when the server is running in Unicode mode.
HA Recovery
Highly-available recovery allows the workflow to resume automatically in the case of Integration Service failover. The following options are available in the properties tab of the workflow:
- Enable HA recovery. Allows the workflow to be configured for high availability.
- Automatically recover terminated tasks. Recovers terminated Session or Command tasks without user intervention.
- Maximum automatic recovery attempts. When you automatically recover terminated tasks, you can choose the number of times the Integration Service attempts to recover the task. The default setting is 5.


Last updated: 26-May-08 11:28


Using PowerCenter Labels

Challenge


Using labels effectively in a data warehouse or data integration project to assist with administration and migration.

Description
A label is a versioning object that can be associated with any versioned object or group of versioned objects in a repository. Labels provide a way to tag a number of object versions with a name for later identification. Therefore, a label is a named object in the repository, whose purpose is to be a pointer or reference to a group of versioned objects. For example, a label called Project X version X can be applied to all object versions that are part of that project and release. Labels can be used for many purposes:
- Track versioned objects during development.
- Improve object query results.
- Create logical groups of objects for future deployment.
- Associate groups of objects for import and export.

Note that labels apply to individual object versions, and not objects as a whole. So if a mapping has ten versions checked in, and a label is applied to version 9, then only version 9 has that label. The other versions of that mapping do not automatically inherit that label. However, multiple labels can point to the same object version for greater flexibility. The Use Repository Manager privilege is required in order to create or edit labels. To create a label, choose Versioning > Labels from the Repository Manager.


When creating a new label, choose a name that is as descriptive as possible. For example, a suggested naming convention for labels is: Project_Version_Action. Include comments for further meaningful description. Locking the label is also advisable. This prevents anyone from accidentally associating additional objects with the label or removing object references for the label. Labels, like other global objects such as Queries and Deployment Groups, can have user and group privileges attached to them. This allows an administrator to create a label that can only be used by specific individuals or groups. Only those people working on a specific project should be given read/write/execute permissions for labels that are assigned to that project.


Once a label is created, it should be applied to related objects. To apply the label to objects, invoke the Apply Label wizard from the Versioning >> Apply Label option on the menu bar in the Repository Manager.

Applying Labels
Labels can be applied to any object and cascaded upwards and downwards to parent and/or child objects. For example, to group dependencies for a workflow, apply a label to all child objects. The Repository Server applies labels to sources, targets, mappings, and tasks associated with the workflow. Use the Move label property to point the label to the latest version of the object(s). Note: Labels can be applied to any object version in the repository except checked-out versions. Execute permission is required for applying labels. After the label has been applied to related objects, it can be used in queries and deployment groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the size of the repository (i.e., to purge object versions).
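For scripted, repeatable migrations, labels can also be created and applied with the pmrep command-line utility rather than through the Repository Manager. The sketch below is illustrative only: the repository, domain, folder, object, and label names are placeholders, and the option flags are quoted from memory and may differ by PowerCenter version, so verify them with pmrep help before relying on them.

# Hypothetical sketch: create a label and apply it to one mapping version with pmrep.
pmrep Connect -r DEV_REPO -d Domain_Dev -n repo_admin -x repo_password
pmrep CreateLabel -a PROJX_V1_READY -c "Objects ready for migration to TEST"
pmrep ApplyLabel -a PROJX_V1_READY -n m_load_customers -o mapping -f PROJX_DEV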

Using Labels in Deployment


An object query can be created using existing labels. Labels can be associated only with a dynamic deployment group; based on the object query, the objects associated with a label can be included in the deployment.


Strategies for Labels


Repository Administrators and other individuals in charge of migrations should develop their own label strategies and naming conventions in the early stages of a data integration project. Be sure that developers are aware of the uses of these labels and when they should apply labels. For each planned migration between repositories, choose three labels for the development and subsequent repositories:
- The first is to identify the objects that developers can mark as ready for migration.
- The second should apply to migrated objects, thus developing a migration audit trail.
- The third is to apply to objects as they are migrated into the receiving repository, completing the migration audit trail.


When preparing for the migration, use the first label to construct a query to build a dynamic deployment group. The second and third labels in the process are optionally applied by the migration wizard when copying folders between versioned repositories. Developers and administrators do not need to apply the second and third labels manually. Additional labels can be created in cooperation with developers to allow the progress of mappings to be tracked, if desired. For example, when an object is successfully unit-tested by the developer, it can be marked as such. Developers can also label the object with a migration label at a later time if necessary. Using labels in this fashion along with the query feature allows complete or incomplete objects to be identified quickly and easily, thereby providing an object-based view of progress.

Last updated: 04-Jun-08 13:47


Deploying Data Analyzer Objects

Challenge


To understand the methods for deploying Data Analyzer objects among repositories and the limitations of such deployment.

Description
Data Analyzer repository objects can be exported to and imported from Extensible Markup Language (XML) files. Export/import facilitates archiving the Data Analyzer repository and deploying Data Analyzer Dashboards and reports from development to production. The following repository objects in Data Analyzer can be exported and imported:
- Schemas
- Reports
- Time Dimensions
- Global Variables
- Dashboards
- Security profiles
- Schedules
- Users
- Groups
- Roles

The XML file created after exporting objects should not be modified. Any change might invalidate the XML file and result in a failure when importing objects into a Data Analyzer repository. For more information on exporting objects from the Data Analyzer repository, refer to the Data Analyzer Administration Guide.

Exporting Schema(s)


To export the definition of a star schema or an operational schema, you need to select a metric or folder from the Metrics system folder in the Schema Directory. When you export a folder, you export the schema associated with the definitions of the metrics in that folder and its subfolders. If the folder you select for export does not contain any objects, Data Analyzer does not export any schema definition and displays the following message: There is no content to be exported. There are two ways to export metrics or folders containing metrics:
- Select the Export Metric Definitions and All Associated Schema Table and Attribute Definitions option. If you select to export a metric and its associated schema objects, Data Analyzer exports the definitions of the metric and the schema objects associated with that metric. If you select to export an entire metric folder and its associated objects, Data Analyzer exports the definitions of all metrics in the folder, as well as schema objects associated with every metric in the folder.
- Alternatively, select the Export Metric Definitions Only option. When you choose to export only the definition of the selected metric, Data Analyzer does not export the definition of the schema table from which the metric is derived or any other associated schema object.

1. Login to Data Analyzer as a System Administrator.
2. Click the Administration tab > XML Export/Import > Export Schemas.
3. All the metric folders in the schema directory are displayed. Click Refresh Schema to display the latest list of folders and metrics in the schema directory.
4. Select the check box for the folder or metric to be exported and click the Export as XML option.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Report(s)
To export the definitions of more than one report, select multiple reports or folders. Data Analyzer exports only report definitions. It does not export the data or the schedule for cached reports. As part of the Report Definition export, Data Analyzer exports the report table, report chart, filters, indicators (i.e., gauge, chart, and table indicators), custom metrics, links to similar reports, and all reports in an analytic workflow, including links to similar reports.


Reports can have public or personal indicators associated with them. By default, Data Analyzer exports only public indicators associated with a report. To export the personal indicators as well, select the Export Personal Indicators check box.

To export an analytic workflow, you need to export only the originating report. When you export the originating report of an analytic workflow, Data Analyzer exports the definitions of all the workflow reports. If a report in the analytic workflow has similar reports associated with it, Data Analyzer exports the links to the similar reports.

Data Analyzer does not export the alerts, schedules, or global variables associated with the report. Although Data Analyzer does not export global variables, it lists all global variables it finds in the report filter. You can, however, export these global variables separately.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Reports.
3. Select the folder or report to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Global Variables


1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Global Variables.
3. Select the Global Variable to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Dashboard
Whenever a dashboard is exported, Data Analyzer exports the reports, indicators, shared documents, and gauges associated with the dashboard. Data Analyzer does not, however, export the alerts, access permissions, attributes or metrics in the report(s), or real-time objects. You can export any of the public dashboards defined in the repository, and can export more than one dashboard at one time.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Dashboards.
3. Select the Dashboard to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a User Security Profile


Data Analyzer maintains a security profile for each user or group in the repository. A security profile consists of the access permissions and data restrictions that the system administrator sets for a user or group. When exporting a security profile, Data Analyzer exports access permissions for objects under the Schema Directory, which include folders, metrics, and attributes. Data Analyzer does not export access permissions for filtersets, reports, or shared documents. Data Analyzer allows you to export only one security profile at a time. If a user or group security profile you export does not have any access permissions or data restrictions, Data Analyzer does not export any object definitions and displays the following message: There is no content to be exported.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Security Profile.
3. Click Export from Users and select the user whose security profile is to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Schedule
You can export a time-based or event-based schedule to an XML file. Data Analyzer runs a report with a time-based schedule on a configured schedule. Data Analyzer runs a report with an event-based schedule when a PowerCenter session completes. When you export a schedule, Data Analyzer does not export the history of the schedule.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Schedules.
3. Select the Schedule to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Users, Groups, or Roles

Exporting Users


You can export the definition of any user defined in the repository. However, you cannot export the definitions of system users defined by Data Analyzer. If you have more than one thousand users defined in the repository, Data Analyzer allows you to search for the users that you want to export. You can use the asterisk (*) or the percent symbol (%) as wildcard characters to search for users to export. You can export the definitions of more than one user, including the following information:
- Login name
- Description
- First, middle, and last name
- Title
- Password
- Change password privilege
- Password never expires indicator
- Account status
- Groups to which the user belongs
- Roles assigned to the user
- Query governing settings

Data Analyzer does not export the email address, reply-to address, department, or color scheme assignment associated with the exported user(s).

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the user(s) to be exported.
5. Click Export as XML.
6. Enter the XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.


Exporting Groups
You can export any group defined in the repository, and can export the definitions of multiple groups. You can also export the definitions of all the users within a selected group. Use the asterisk (*) or percent symbol (%) as wildcard characters to search for groups to export. Each group definition includes the following information:
- Name
- Description
- Department
- Color scheme assignment
- Group hierarchy
- Roles assigned to the group
- Users assigned to the group
- Query governing settings

Data Analyzer does not export the color scheme associated with an exported group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the group to be exported.
5. Click Export as XML.
6. Enter the XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.

Exporting Roles
You can export the definitions of the custom roles defined in the repository. However, you cannot export the definitions of system roles defined by Data Analyzer. You can export the definitions of more than one role. Each role definition includes the name and description of the role and the permissions assigned to each role.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the role to be exported.
5. Click Export as XML.
6. Enter the XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.

Importing Objects
You can import objects into the same repository or a different repository. If you import objects that already exist in the repository, you can choose to overwrite the existing objects. However, you can import only global variables that do not already exist in the repository.

When you import objects, you can validate the XML file against the DTD provided by Data Analyzer. Informatica recommends that you do not modify the XML files after you export them from Data Analyzer. Ordinarily, you do not need to validate an XML file that you create by exporting from Data Analyzer. However, if you are not sure of the validity of an XML file, you can validate it against the Data Analyzer DTD file when you start the import process.

To import repository objects, you must have the System Administrator role or the Access XML Export/Import privilege. When you import a repository object, you become the owner of the object as if you created it. However, other system administrators can also access imported repository objects. You can limit access to reports for users who are not system administrators. If you select to publish imported reports to everyone, all users in Data Analyzer have read and write access to them. You can change the access permissions to reports after you import them.
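As an optional sanity check outside the import wizard, an exported file can also be validated with a standalone XML tool before it is imported. The example below is a hedged sketch: it assumes the common xmllint utility is available, and the name and location of the Data Analyzer DTD file are placeholders that depend on the installation.

# Validate an exported file against the Data Analyzer DTD (both paths are placeholders).
xmllint --noout --dtdvalid /path/to/DataAnalyzer.dtd exported_reports.xml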

Importing Schemas
When importing schemas, if the XML file contains only the metric definition, you must make sure that the fact table for the metric exists in the target repository. You can import a metric only if its associated fact table exists in the target repository or the definition of its associated fact table is also in the XML file. When you import a schema, Data Analyzer displays a list of all the definitions contained in the XML file. It then displays a list of all the object definitions in the XML file that already exist in the repository. You can choose to overwrite objects in the repository. If you import a schema that contains time keys, you must import or create a time dimension.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Schema.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Reports
A valid XML file of exported report objects can contain definitions of cached or on-demand reports, including prompted reports. When you import a report, you must make sure that all the metrics and attributes used in the report are defined in the target repository. If you import a report that contains attributes and metrics not defined in the target repository, you can cancel the import process. If you choose to continue the import process, you may not be able to run the report correctly. To run the report, you must import or add the attribute and metric definitions to the target repository.

You are the owner of all the reports you import, including the personal or public indicators associated with the reports. You can publish the imported reports to all Data Analyzer users. If you publish reports to everyone, Data Analyzer provides read access to the reports to all users. However, it does not provide access to the folder that contains the imported reports. If you want another user to access an imported report, you can put the imported report in a public folder and have the user save or move the imported report to his or her personal folder. Any public indicator associated with the report also becomes accessible to the user.

If you import a report and its corresponding analytic workflow, the XML file contains all workflow reports. If you choose to overwrite the report, Data Analyzer also overwrites the workflow reports. Also, when importing multiple workflows, note that Data Analyzer does not import analytic workflows containing the same workflow report names. Thus, ensure that all imported analytic workflows have unique report names prior to being imported.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Report.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Global Variables


You can import global variables that are not defined in the target repository. If the XML file contains global variables already in the repository, you can cancel the process. If you continue the import process, Data Analyzer imports only the global variables not in the target repository.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Global Variables.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Dashboards
Dashboards display links to reports, shared documents, alerts, and indicators. When you import a dashboard, Data Analyzer imports the following objects associated with the dashboard:
- Reports
- Indicators
- Shared documents
- Gauges

Data Analyzer does not import the following objects associated with the dashboard:
- Alerts
- Access permissions
- Attributes and metrics in the report
- Real-time objects

If an object already exists in the repository, Data Analyzer provides an option to overwrite it. Data Analyzer does not import the attributes and metrics in the reports associated with the dashboard. If the attributes or metrics in a report associated with the dashboard do not exist, the report does not display on the imported dashboard.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Dashboard.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Security Profile(s)


To import a security profile, you must begin by selecting the user or group to which you want to assign the security profile. You can assign the same security profile to more than one user or group. When you import a security profile and associate it with a user or group, you can either overwrite the current security profile or add to it. When you overwrite a security profile, you assign the user or group only the access permissions and data restrictions found in the new security profile. Data Analyzer removes the old restrictions associated with the user or group. When you append a security profile, you assign the user or group the new access permissions and data restrictions in addition to the old permissions and restrictions.

When exporting a security profile, Data Analyzer exports the security profile for objects in the Schema Directory, including folders, attributes, and metrics. However, it does not include the security profile for filtersets.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Security Profile.
3. Click Import to Users.
4. Select the user with which you want to associate the security profile you import.
   - To associate the imported security profiles with all the users on the page, select the "Users" check box at the top of the list.
   - To associate the imported security profiles with all the users in the repository, select Import to All.
   - To overwrite the selected user's current security profile with the imported security profile, select Overwrite.
   - To append the imported security profile to the selected user's current security profile, select Append.
5. Click Browse to choose an XML file to import.
6. Select Validate XML against DTD.
7. Click Import XML.
8. Verify all attributes on the summary page, and choose Continue.


Importing Schedule(s)
A time-based schedule runs reports based on a configured schedule. An event-based schedule runs reports when a PowerCenter session completes. You can import a time-based or event-based schedule from an XML file. When you import a schedule, Data Analyzer does not attach the schedule to any reports.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Schedule.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Users, Groups, or Roles


When you import a user, group, or role, you import all the information associated with each user, group, or role. The XML file includes definitions of roles assigned to users or groups, and definitions of users within groups. For this reason, you can import the definition of a user, group, or role in the same import process. When importing a user, you import the definitions of roles assigned to the user and the groups to which the user belongs.

When you import a user or group, you import the user or group definitions only. The XML file does not contain the color scheme assignments, access permissions, or data restrictions for the user or group. To import the access permissions and data restrictions, you must import the security profile for the user or group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import User/Group/Role.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Tips for Importing/Exporting


- Schedule importing/exporting of repository objects for a time of minimal Data Analyzer activity, when most of the users are not accessing the Data Analyzer repository. This should help to prevent users from experiencing timeout errors or degraded response time.
- Only the System Administrator should perform import/export operations.
- Take a backup of the Data Analyzer repository prior to performing an import/export operation. This backup should be completed using the Repository Backup Utility provided with Data Analyzer.
- Manually add user/group permissions for the report. These permissions are not exported as part of exporting Reports and should be manually added after the report is imported into the desired server.
- Use a version control tool. Prior to importing objects into a new environment, it is advisable to check the XML documents into a version-control tool such as Microsoft's Visual SourceSafe or PVCS. This facilitates the versioning of repository objects and provides a means for rollback to a prior version of an object, if necessary.
- Attach cached reports to schedules. Data Analyzer does not import the schedule with a cached report. When you import cached reports, you must attach them to schedules in the target repository. You can attach multiple imported reports to schedules in the target repository in one process immediately after you import them.
- Ensure that global variables exist in the target repository. If you import a report that uses global variables in the attribute filter, ensure that the global variables already exist in the target repository. If they are not in the target repository, you must either import the global variables from the source repository or recreate them in the target repository.
- Manually add indicators to the dashboard. When you import a dashboard, Data Analyzer imports all indicators for the originating report and workflow reports in a workflow. However, indicators for workflow reports do not display on the dashboard after you import it until they are added manually.
- Check with your System Administrator to understand what level of LDAP integration has been configured (if any). Users, groups, and roles need to be exported and imported during deployment when using repository authentication. If Data Analyzer has been integrated with an LDAP (Lightweight Directory Access Protocol) tool, then users, groups, and/or roles may not require deployment.

When you import users into a Microsoft SQL Server or IBM DB2 repository, Data Analyzer blocks all user authentication requests until the import process is complete.


Installing Data Analyzer

Challenge


Installing Data Analyzer on new or existing hardware, either as a dedicated application on a physical machine (as Informatica recommends) or co-existing with other applications on the same physical server or with other Web applications on the same application server.

Description
Consider the following questions when determining what type of hardware to use for Data Analyzer:

If the hardware already exists:

1. Is the processor, operating system, and database software supported by Data Analyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the Data Analyzer application?
5. Will Data Analyzer share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported? (e.g., Solaris, Windows, AIX, HP-UX, Redhat AS, SuSE)
3. What database and version is preferred and supported for the Data Analyzer repository?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the reporting response time requirements for Data Analyzer. The following questions should be answered in order to estimate the size of a Data Analyzer server:

1. How many users are predicted for concurrent access?
2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate an SSL Web server?

The hardware requirements for the Data Analyzer environment depend on the number of concurrent users, types of reports being used (i.e., interactive vs. static), average number of records in a report, application server and operating system used, among other factors. The following table should be used as a general guide for hardware recommendations for a Data Analyzer installation. Actual results may vary depending upon exact hardware configuration and user volume. For exact sizing recommendations, contact Informatica Professional Services for a Data Analyzer Sizing and Baseline Architecture engagement.

Windows
# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment
50  | 1000  | -  | -   | 1 GB   | 1
100 | 1000  | 2  | 3   | 2 GB   | 1-2
200 | 1000  | 2  | 6   | 3.5 GB | 3
400 | 1000  | 2  | 12  | 6.5 GB | 6
100 | 1000  | 2  | 3   | 2 GB   | 1-2
100 | 2000  | 2  | 3   | 2.5 GB | 1-2
100 | 5000  | 2  | 4   | 3 GB   | 2
100 | 10000 | 2  | 5   | 4 GB   | 2-3
100 | 1000  | 2  | 3   | 2 GB   | 1-2
100 | 1000  | 5  | 3   | 2 GB   | 1-2
100 | 1000  | 7  | 3   | 2.5 GB | 1-2
100 | 1000  | 10 | 3-4 | 3 GB   | 1-2

Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1 SP3 and Windows 2000 on a 4-CPU 2.5 GHz Xeon processor. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the number of total users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. There will be an increase in overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering doesn't have to be across multiple boxes if the server has 4 or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.

IBM AIX
# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment
50  | 1000  | -  | -    | 1 GB   | 1
100 | 1000  | 2  | 2-3  | 2 GB   | 1
200 | 1000  | 2  | 4-5  | 3.5 GB | 2-3
400 | 1000  | 2  | 9-10 | 6 GB   | 4-5
100 | 1000  | 2  | 2-3  | 2 GB   | 1
100 | 2000  | 2  | 2-3  | 2 GB   | 1-2
100 | 5000  | 2  | 2-3  | 3 GB   | 1-2
100 | 10000 | 2  | 4    | 4 GB   | 2
100 | 1000  | 2  | 2-3  | 2 GB   | 1
100 | 1000  | 5  | 2-3  | 2 GB   | 1
100 | 1000  | 7  | 2-3  | 2 GB   | 1-2
100 | 1000  | 10 | 2-3  | 2.5 GB | 1-2

Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere 5.1.1.1 and AIX 5.2.02 on a 4-CPU 2.4 GHz IBM p630. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the number of total users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering doesn't have to be across multiple boxes if the server has 4 or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.

Data Analyzer Installation


The Data Analyzer installation process involves two main components: the Data Analyzer Repository and the Data Analyzer Server, which is an application deployed on an application server. A Web server is necessary to support these components and is included with the installation of the application servers. This section discusses the installation process for JBOSS, BEA WebLogic and IBM WebSphere. The installation tips apply to both Windows and UNIX environments. This section is intended to serve as a supplement to the Data Analyzer Installation Guide. Before installing Data Analyzer, be sure to complete the following steps:
- Verify that the hardware meets the minimum system requirements for Data Analyzer.
- Ensure that the combination of hardware, operating system, application server, repository database, and, optionally, authentication software is supported by Data Analyzer.
- Ensure that sufficient space has been allocated to the Data Analyzer repository.
- Apply all necessary patches to the operating system and database software.
- Verify connectivity to the data warehouse database (or other reporting source) and repository database.
- If LDAP or NT Domain is used for Data Analyzer authentication, verify connectivity to the LDAP directory server or the NT primary domain controller.
- Ensure that the Data Analyzer license file has been obtained from technical support.
- On UNIX/Linux installations, ensure that the OS user running Data Analyzer has execute privileges on all Data Analyzer installation executables.

In addition to the standard Data Analyzer components that are installed by default, you can also install Metadata Manager. With Version 8.0, the Data Analyzer SDK and Portal Integration Kit are now installed with Data Analyzer. Refer to the Data Analyzer documentation for detailed information for these components.

Changes to Installation Process


Beginning with Data Analyzer version 7.1.4, Data Analyzer is packaged with PowerCenter Advanced Edition. To install only the Data Analyzer portion, choose the Custom Installation option during the installation process. On the following screen, uncheck all of the check boxes except the Data Analyzer check box and then click Next.

Repository Configuration
To properly install Data Analyzer you need to have connectivity information for the database server where the repository is going to reside. This information includes:
- Database URL
- Repository username
- Password for the repository username

Installation Steps: JBOSS


The following are the basic installation steps for Data Analyzer on JBOSS:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to installation.
2. Install Data Analyzer. The Data Analyzer installation process will install JBOSS if a version does not already exist, or an existing instance can be selected.
3. Apply the Data Analyzer license key.
4. Install the Data Analyzer Online Help.

Installation Tips: JBOSS


The following are the basic installation tips for Data Analyzer on JBOSS:
- Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of JBOSS. Also, other applications can coexist with Data Analyzer on a single instance of JBOSS. Although this architecture should be considered during hardware sizing estimates, it allows greater flexibility during installation.
- For JBOSS installations on UNIX, the JBOSS Server installation program requires an X-Windows server. If JBOSS Server is installed on a machine where an X-Windows server is not installed, an X-Windows server must be installed on another machine in order to render graphics for the GUI-based installation program. For more information on installing on UNIX, please see the UNIX Servers section of the installation and configuration tips below.
- If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).
- During the Data Analyzer installation process, the user will be prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation as the installer will configure all properties files at installation.
- The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen


Installation Steps: BEA WebLogic


The following are the basic installation steps for Data Analyzer on BEA WebLogic:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to installation.
2. Install BEA WebLogic and apply the BEA license.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.

TIP
When creating a repository in an Oracle database, make sure the storage parameters specified for the tablespace that contains the repository are not set too large. Since many target tablespaces are initially set for very large INITIAL and NEXT values, large storage parameters cause the repository to use excessive amounts of space. Also verify that the default tablespace for the user that owns the repository tables is set correctly.

The following example shows how to set the recommended storage parameters, assuming the repository is stored in the REPOSITORY tablespace:

ALTER TABLESPACE REPOSITORY DEFAULT STORAGE (
  INITIAL 10K
  NEXT 10K
  MAXEXTENTS UNLIMITED
  PCTINCREASE 50
);

Installation Tips: BEA WebLogic


The following are the basic installation tips for Data Analyzer on BEA WebLogic:

- Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebLogic. Also, other applications can coexist with Data Analyzer on a single instance of WebLogic. Although this architecture should be factored in during hardware sizing estimates, it allows greater flexibility during installation.
- With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for WebLogic installations.
- If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).
- During the Data Analyzer installation process, the user will be prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation since the installer will configure all properties files at installation.
- The Data Analyzer license file and BEA WebLogic license must be applied prior to starting Data Analyzer.

Configuration Screen

Installation Steps: IBM WebSphere


The following are the basic installation steps for Data Analyzer on IBM WebSphere:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but the empty database schema needs to exist and be able to be connected to via JDBC prior to installation.
2. Install IBM WebSphere and apply the WebSphere patches. WebSphere can be installed in its Base configuration or in its Network Deployment configuration if clustering will be utilized. In both cases, patchsets will need to be applied.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.
6. Configure the PowerCenter Integration Utility. See the section "Configuring the PowerCenter Integration Utility for WebSphere" in the PowerCenter Installation and Configuration Guide.

Installation Tips: IBM WebSphere


- Starting in Data Analyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebSphere. Also, other applications can coexist with Data Analyzer on a single instance of WebSphere. Although this architecture should be considered during sizing estimates, it allows greater flexibility during installation.
- With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for WebSphere installations.
- For WebSphere on UNIX installations, Data Analyzer must be installed using the root user or system administrator account. Two groups (mqm and mqbrkrs) must be created prior to the installation and the root account should be added to both of these groups.
- For WebSphere on Windows installations, ensure that Data Analyzer is installed under the padaemon local Windows user ID that is in the Administrative group and has the advanced user rights: "Act as part of the operating system" and "Log on as a service." During the installation, the padaemon account will need to be added to the mqm group.
- If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).
- During the WebSphere installation process, the user will be prompted to enter a directory for the application server and the HTTP (web) server. In both instances, it is advisable to keep the default installation directory. Directory names for the application server and HTTP server that include spaces may result in errors.
- During the Data Analyzer installation process, the user will be prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is utilized, have the configuration parameters available during installation as the installer will configure all properties files at installation.
- The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen


Installation and Configuration Tips: UNIX Servers


With Data Analyzer 8 there is a console version of the installation available. For previous versions of Data Analyzer, a graphics display server is required for a Data Analyzer installation on UNIX. On UNIX, the graphics display server is typically an X-Windows server, although an X-Window Virtual Frame Buffer (XVFB) or personal computer X-Windows software such as WRQ Reflection-X can also be used. In any case, the X-Windows server does not need to exist on the local machine where Data Analyzer is being installed, but does need to be accessible. A remote X-Windows, XVFB, or PC-X server can be used by setting the DISPLAY to the appropriate IP address, as discussed below. If the X-Windows server is not installed on the machine where Data Analyzer will be installed, Data Analyzer can be installed using an X-Windows server installed on another machine. Simply redirect the DISPLAY variable to use the X-Windows server on another UNIX machine. To redirect the host output, define the environment variable DISPLAY. On the command line, type the following command and press Enter:

C shell:
setenv DISPLAY <TCP/IP node of X-Windows server>:0

Bourne/Korn shell:
export DISPLAY=<TCP/IP node of X-Windows server>:0

Configuration

Data Analyzer requires a means to render graphics for charting and indicators. When graphics rendering is not configured properly, charts and indicators do not display properly on dashboards or reports.

- For Data Analyzer installations using an application server with JDK 1.4 or greater, the java.awt.headless=true setting can be set in the application server startup scripts to facilitate graphics rendering for Data Analyzer. If the application server does not use JDK 1.4 or later, use an X-Windows server or XVFB to render graphics. The DISPLAY environment variable should be set to the IP address of the X-Windows or XVFB server prior to starting Data Analyzer.
- The application server heap size is the memory allocation for the JVM. The recommended heap size depends on the memory available on the machine hosting the application server and the server load, but the recommended starting point is 512MB. This is the first setting that should be examined when tuning a Data Analyzer instance.
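As an illustration, both settings above are often applied together through the JVM arguments in the application server start script. The snippet below is a sketch only: the JAVA_OPTS variable name is what JBOSS start scripts typically use (WebLogic and WebSphere expose JVM arguments through their own start scripts or administrative console), and the 512MB heap values simply reflect the starting point recommended above.

# Sketch: enable headless graphics rendering and set an initial/maximum heap of 512MB.
JAVA_OPTS="$JAVA_OPTS -Djava.awt.headless=true -Xms512m -Xmx512m"
export JAVA_OPTS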

Last updated: 24-Jul-07 16:40


Data Connectivity using PowerCenter Connect for BW Integration Server

Challenge


Understanding how to use PowerCenter Connect for SAP NetWeaver - BW Option to load data into the SAP BW (Business Information Warehouse).

Description
The PowerCenter Connect for SAP NetWeaver - BW Option supports the SAP Business Information Warehouse as both a source and target.

Extracting Data from BW


PowerCenter Connect for SAP NetWeaver - BW Option lets you extract data from SAP BW to use as a source in a PowerCenter session. PowerCenter Connect for SAP NetWeaver - BW Option integrates with the Open Hub Service (OHS), SAP's framework for extracting data from BW. OHS uses data from multiple BW data sources, including SAP's InfoSources and InfoCubes. The OHS framework includes InfoSpoke programs, which extract data from BW and write the output to SAP transparent tables.

Loading Data into BW


PowerCenter Connect for SAP NetWeaver - BW Option lets you import BW target definitions into the Designer and use the target in a mapping to load data into BW. PowerCenter Connect for SAP NetWeaver - BW Option uses the Business Application Program Interface (BAPI) to exchange metadata and load data into BW. PowerCenter can use SAP's business content framework to provide a high-volume data warehousing solution, or SAP's Business Application Program Interface (BAPI), SAP's strategic technology for linking components into the Business Framework, to exchange metadata with BW. PowerCenter extracts and transforms data from multiple sources and uses SAP's high-speed bulk BAPIs to load the data into BW, where it is integrated with industry-specific models for analysis through the SAP Business Explorer tool.


Using PowerCenter with PowerCenter Connect to Populate BW


The following paragraphs summarize some of the key differences in using PowerCenter with the PowerCenter Connect to populate a SAP BW rather than working with standard RDBMS sources and targets.
- BW uses a pull model. The BW must request data from a source system before the source system can send data to the BW. PowerCenter must first register with the BW using SAP's Remote Function Call (RFC) protocol.
- The native interface to communicate with BW is the Staging BAPI, an API published and supported by SAP. Three products in the PowerCenter suite use this API. PowerCenter Designer uses the Staging BAPI to import metadata for the target transfer structures; the PowerCenter Integration Server for BW uses the Staging BAPI to register with BW and receive requests to run sessions; and the PowerCenter Server uses the Staging BAPI to perform metadata verification and load data into BW.
- Programs communicating with BW use the SAP standard saprfc.ini file to communicate with BW. The saprfc.ini file is similar to the tnsnames file in Oracle or the interface file in Sybase.
- The PowerCenter Designer reads metadata from BW and the PowerCenter Server writes data to BW.
- BW requires that all metadata extensions be defined in the BW Administrator Workbench. The definition must be imported to Designer. An active structure is the target for PowerCenter mappings loading BW.
- Because of the pull model, BW must control all scheduling. BW invokes the PowerCenter session when the InfoPackage is scheduled to run in BW.
- BW only supports insertion of data into BW. There is no concept of updates or deletes through the Staging BAPI.

Steps for Extracting Data from BW


The process of extracting data from SAP BW is quite similar to extracting data from SAP. Similar transports are used on the SAP side, and data type support is the same as that supported for SAP PowerCenter Connect. The steps required for extracting data are:

1. Create an InfoSpoke. Create an InfoSpoke in the BW to extract the data from the BW database and write it to either a database table or a file output target.
2. Import the ABAP program. Import the Informatica-provided ABAP program, which calls the workflow created in the Workflow Manager.
3. Create a mapping. Create a mapping in the Designer that uses the database table or file output target as a source.
4. Create a workflow to extract data from BW. Create a workflow and session task to automate data extraction from BW.
5. Create a Process Chain. A BW Process Chain links programs together to run in sequence. Create a Process Chain to link the InfoSpoke and ABAP programs together.
6. Schedule the data extraction from BW. Set up a schedule in BW to automate data extraction.

Steps To Load Data into BW


1. Install and Configure PowerCenter Components. The installation of the PowerCenter Connect for SAP NetWeaver - BW Option includes both a client and a server component. The Connect server must be installed in the same directory as the PowerCenter Server. Informatica recommends installing the Connect client tools in the same directory as the PowerCenter Client. For more details on installation and configuration, refer to the PowerCenter and PowerCenter Connect installation guides.

Note: For PowerConnect version 8.1 and above, it is crucial to install or upgrade the PowerCenter 8.1 transports on the appropriate SAP system when installing or upgrading PowerCenter Connect for SAP NetWeaver - BW Option. If you are extracting data from BW using OHS, you must also configure the mySAP option. If the BW system is separate from the SAP system, install the designated transports on the BW system. It is also important to note that there are now three categories of transports (as compared to two in previous versions). These are as follows:
- Transports for SAP versions 3.1H and 3.1I.
- Transports for SAP versions 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and above.
- Transports for SAP Unicode versions 4.7 and above; this category has been added for Unicode extraction support, which was not previously available in SAP versions 4.6 and earlier.


2. Build the BW Components. To load data into BW, you must build components in both BW and PowerCenter. You must first build the BW components in the Administrator Workbench:
- Define PowerCenter as a source system to BW. BW requires an external source definition for all non-R/3 sources.
- Create the InfoObjects in BW (this is similar to a database table).
- The InfoSource represents a provider structure. Create the InfoSource in the BW Administrator Workbench and import the definition into the PowerCenter Warehouse Designer.
- Assign the InfoSource to the PowerCenter source system. After you create an InfoSource, assign it to the PowerCenter source system.
- Activate the InfoSource. When you activate the InfoSource, you activate the InfoObjects and the transfer rules.

3. Configure the saprfc.ini file. This file is required for PowerCenter and Connect to connect to BW. PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:
- Type A. Used by the PowerCenter Client and PowerCenter Server. Specifies the BW application server.
- Type R. Used by the PowerCenter Connect for SAP NetWeaver - BW Option. Specifies the external program, which is registered at the SAP gateway.

Note: Do not use Notepad to edit the saprfc.ini file because Notepad can corrupt the file. Set the RFC_INI environment variable for all Windows NT, Windows 2000, and Windows 95/98 machines with a saprfc.ini file. RFC_INI is used to locate the saprfc.ini file.
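For illustration only, the fragments below sketch what the two saprfc.ini entry types might look like. The destination names, host name, system number, gateway service, and program ID are all placeholders, and the exact parameters required can vary by SAP release, so verify them against the SAP RFC documentation and the PowerCenter Connect installation guide.

Type A entry (used by the PowerCenter Client and Server to reach the BW application server):

DEST=BW_DEV
TYPE=A
ASHOST=bwapp01.example.com
SYSNR=00

Type R entry (used by the Connect for BW Server to register at the SAP gateway):

DEST=PMBW_DEV
TYPE=R
PROGID=PID_PMBW_DEV
GWHOST=bwapp01.example.com
GWSERV=sapgw00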

4. Start the Connect for BW Server. Start the Connect for BW Server after you start the PowerCenter Server and before you create the InfoPackage in BW.


5. Build mappings. Import the InfoSource into the PowerCenter repository and build a mapping using the InfoSource as a target. The following restrictions apply to building mappings with a BW InfoSource target:
- You cannot use BW as a lookup table.
- You can use only one transfer structure for each mapping.
- You cannot execute stored procedures in a BW target.
- You cannot partition pipelines with a BW target.
- You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations.
- You cannot build an update strategy in a mapping. BW supports only inserts; it does not support updates or deletes. You can use an Update Strategy transformation in a mapping, but the Connect for BW Server attempts to insert all records, even those marked for update or delete.

6. Load data. To load data into BW from PowerCenter, both PowerCenter and the BW system must be configured. Use the following steps to load data into BW:
q

Configure a workflow to load data into BW. Create a session in a workflow that uses a mapping with an InfoSource target definition. Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter session with the InfoSource. When the Connect for BW Server starts, it communicates with the BW to register itself as a server. The Connect for BW Server waits for a request from the BW to start the workflow. When the InfoPackage starts, the BW communicates with the registered Connect for BW Server and sends the workflow name to be scheduled with the PowerCenter Server. The Connect for BW Server reads information about the workflow and sends a request to the PowerCenter Server to run the workflow. The PowerCenter Server validates the workflow name in the repository and the

INFORMATICA CONFIDENTIAL

BEST PRACTICES

196 of 954

workflow name in the InfoPackage. The PowerCenter Server executes the session and loads the data into BW. You must start the Connect for BW Server after you restart the PowerCenter Server.

Supported Datatypes
The PowerCenter Server transforms data based on the Informatica transformation datatypes. BW can only receive data in packets of 250 bytes. The PowerCenter Server converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one byte for a continuation flag. BW receives data until it reads the continuation flag set to zero. Within the transfer structure, BW then converts the data to the BW datatype. Currently, BW only supports the following datatypes in transfer structures assigned to BAPI source systems (PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT. All other datatypes result in the following error in BW: Invalid data type (data type name) for source system of type BAPI.

Date/Time Datatypes
The transformation date/time datatype supports dates with precision to the second. If you import a date/time value that includes milliseconds, the PowerCenter Server truncates to seconds. If you write a date/time value to a target column that supports milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the date.

Binary Datatypes
BW does not allow you to build a transfer structure with binary datatypes. Therefore, you cannot load binary data from PowerCenter into BW.

Numeric Datatypes
PowerCenter does not support the INT1 datatype.

Performance Enhancement for Loading into SAP BW


If you see a performance slowdown for sessions that load into SAP BW, set the default buffer block size to 15MB to 20MB to enhance performance. You can put 5,000 to 10,000 rows per block, so you can calculate the buffer block size needed with the following formula:
Row size x Rows per block = Default Buffer Block size
For example, if your target row size is 2KB: 2KB x 10,000 rows = 20MB.

Last updated: 04-Jun-08 16:31


Data Connectivity using PowerExchange for WebSphere MQ

Challenge


Integrate WebSphere MQ applications with PowerCenter mappings.

Description
With increasing requirements for both on-demand real-time data integration and the development of Enterprise Application Integration (EAI) architectures, WebSphere MQ has become an important part of the Informatica data integration platform. PowerExchange for WebSphere MQ provides data integration for transactional data generated by continuously running messaging systems. PowerCenter's Zero Latency (ZL) Engine provides immediate processing of trickle-feed data for these types of messaging systems, allowing both uni-directional and bi-directional processing of real-time data flows.

High Volume System Considerations


When working with high volume systems, two things to consider are the volume and the size of the messages coming over the network and whether or not the messages are persistent or non-persistent. Although a queue may be configured for persistence, a specific message can override this setting. When a message is persistent, the Queue Manager first writes the message out to a log before it allows it to be visible in the queue. In a very high volume flow, if this is not handled correctly, it can lead to performance degradation and cause the logging to potentially fill up the file system. Non-persistent messages are immediately visible in the queue for processing, but unlike persistent messages, if the Queue Manager or server crashes they cannot be recovered. To handle this type of flow volume, PowerCenter workflows can be configured to run in a Grid environment. The image below shows the two options that are available for persistence when creating a Local Queue:


In conjunction with the PowerCenter Grid option, WebSphere MQ can also be clustered to allow multiple Queue Managers to process the same message flow(s). In this type of configuration, separate Integration Services can be created, each holding a unique MQSERVER environment variable. Alternatively, a Client Connection can be created for one Integration Service, with multiple connection properties configured for each Queue Manager in the cluster that holds the flow.

Message Affinity
Message Affinity is a consideration that is unique to clustered environments. Message affinity becomes a problem when messages that must be processed in a specific order are processed out of sequence.
Example: In a trading system environment, a user's sell message arrives before the buy message.
Solution: To help limit this behavior, messages can carry a unique ID in the message header to indicate grouping as well as order.
IMPORTANT: It is not common practice to place the re-sequencing of these messages on the middleware software. The sending and receiving applications should be responsible for this algorithm.

Message Sizes
The message size for any given flow needs to be determined before the development and architecture of workflows and queues. By default, all messaging communication objects are set to allow up to a 4 MB message size. If a message in the flow is larger than 4 MB, the Queue Manager logs an error and does not allow the message through.


To overcome this issue, the MQCHLLIB/MQCHLTAB environment variables must be used. The following settings must also be modified to allow for the larger message(s) in the queue:
1. Client Connection Channel: Set the Maximum Message Length to the largest estimated message size (100 MB limit).

2. Local Queue: Set the Max Message Length to the largest message size (100 MB limit).


3. Queue Manager: The Queue Manager Max Message Length setting is key because it caps what the other objects can allow through. If the Queue Manager has a Max Message Length set to anything smaller than what is set in a Channel or a Local Queue, the message will fail. For large messaging systems, create a separate Queue Manager just for those flows. The maximum message size a Queue Manager can handle is 100 MB.
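As an illustrative sketch, the three limits above can be raised with MQSC commands run through runmqsc against the relevant Queue Manager. The queue manager, channel, and queue names below are placeholders, and 104857600 bytes corresponds to the 100 MB limit:

* Raise the Queue Manager limit first, because it caps the other objects
ALTER QMGR MAXMSGL(104857600)
* Client connection channel used by the PowerCenter Integration Service
ALTER CHANNEL(PC.CLIENT.CHANNEL) CHLTYPE(CLNTCONN) MAXMSGL(104857600)
* Local queue that holds the large messages
ALTER QLOCAL(PC.LARGE.MSG.QUEUE) MAXMSGL(104857600)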


Example: A high-volume application requires PowerCenter to process a minimum of 200 MSG/sec, 24/7. One message has four segments and each segment loads to a separate table. Three of the segments are optional and may not be present in a given message. The message is XML and must go through a midstream XML parser in order to get the separate data out for each table. If a midstream XML parser cannot handle segmenting the XML and loading it to the correct database tables fast enough to keep up with the message flow, messages can back up and cause the Queue Manager to overflow.
Solution: First estimate each message's maximum size and then create a separate queue for each of the separate segments within the message. Create individual workflows to handle each queue and to load the data to the correct table. Then use an expression in PowerCenter to break out each segment and load it to the associated queue. For the optional segments, if they don't exist, there is nothing to load. Each workflow can then separately load the segmented XML into its own midstream XML parser and into the correct database.
Result: Processing speed through PowerCenter increased to 400-450 MSG/sec.

Last updated: 27-May-08 13:07


Data Connectivity using PowerExchange for SAP NetWeaver

Challenge


Understanding how to install PowerExchange for SAP NetWeaver, extract data from SAP R/3, and load data into SAP R/3.

Description
SAP R/3 is ERP software that provides multiple business applications/modules, such as financial accounting, materials management, sales and distribution, human resources, CRM, and SRM. The core R/3 system (BASIS layer) is programmed in Advanced Business Application Programming-Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP. PowerExchange for SAP NetWeaver can write, read, and change data in R/3 via the BAPI/RFC and IDoc interfaces. The ABAP interface of PowerExchange for SAP NetWeaver can only read data from SAP R/3.

PowerExchange for SAP NetWeaver provides the ability to extract SAP R/3 data into data warehouses, data integration applications, and other third-party applications. All of this is accomplished without writing complex ABAP code. PowerExchange for SAP NetWeaver generates ABAP programs and is capable of extracting data from transparent tables, pool tables, and cluster tables. When integrated with R/3 using ALE (Application Link Enabling), PowerExchange for SAP NetWeaver can also extract data from R/3 using outbound IDocs (Intermediate Documents) in near real-time.

The ALE concept, available since R/3 Release 3.0, supports the construction and operation of distributed applications. It incorporates controlled exchange of business data messages while ensuring data consistency across loosely-coupled SAP applications. The integration of various applications is achieved by using synchronous and asynchronous communication, rather than by means of a central database.

The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A transparent table definition on the application server is represented by a single physical table on the database server. Pool and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical table on the database server.

Communication Interfaces
TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:
- Common Program Interface-Communications (CPI-C). The CPI-C communication protocol enables online data exchange and data conversion between R/3 and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires information such as the host name of the application server and the SAP gateway. This information is stored on the PowerCenter Server in a configuration file named sideinfo. The PowerCenter Server uses parameters in the sideinfo file to execute ABAP stream mode sessions.
- Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type and the service name and gateway on the application server. This information is stored on the PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing source definitions, installing ABAP programs, and running ABAP file mode sessions.
- Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to another system. The transport system is primarily used to migrate code and configuration from development to QA and production systems. It can be used in the following cases:
- PowerExchange for SAP NetWeaver installation transports
- PowerExchange for SAP NetWeaver generated ABAP programs

Note: If the ABAP programs are installed in the $TMP development class, they cannot be transported from development to production. Ensure you have a transportable development class/package for the ABAP mappings.

Security

You must have proper authorizations on the R/3 system to perform integration tasks. The R/3 administrator needs to create authorizations, profiles, and users for PowerCenter users.

Integration Feature                   | Authorization Object | Activity
Import Definitions, Install Programs | S_DEVELOP            | All activities. Also need to set Development Object ID to PROG.
Extract Data                          | S_TABU_DIS           | READ
Run File Mode Sessions                | S_DATASET            | WRITE
Submit Background Job                 | S_PROGRAM            | BTCSUBMIT, SUBMIT
Release Background Job                | S_BTCH_JOB           | DELE, LIST, PLAN, SHOW. Also need to set Job Operation to RELE.
Run Stream Mode Sessions              | S_CPIC               | All activities
Authorize RFC privileges              | S_RFC                | All activities

You also need access to the SAP GUI, as described in the following SAP GUI Parameters table:

Parameter     | Feature references to this variable | Comments
User ID       | $SAP_USERID                         | Identify the username that connects to the SAP GUI and is authorized for read-only access to the following transactions: SE12, SE15, SE16, SPRO
Password      | $SAP_PASSWORD                       | Identify the password for the above user
System Number | $SAP_SYSTEM_NUMBER                  | Identify the SAP system number
Client Number | $SAP_CLIENT_NUMBER                  | Identify the SAP client number
Server        | $SAP_SERVER                         | Identify the server on which this instance of SAP is running

Key Capabilities of PowerExchange for SAP NetWeaver


Some key capabilities of PowerExchange for SAP NetWeaver include:
- Extract data from SAP R/3 using the ABAP, BAPI/RFC, and IDoc interfaces.
- Migrate/load data from any source into R/3 using the IDoc, BAPI/RFC, and DMI interfaces.
- Generate DMI files ready to be loaded into SAP via SXDA TOOLS, LSMW, or SAP standard delivered programs.
- Support calling BAPI and RFC functions dynamically from PowerCenter for data integration. PowerExchange for SAP NetWeaver can make BAPI and RFC function calls dynamically from mappings to extract or load data.
- Capture changes to the master and transactional data in SAP R/3 using ALE. PowerExchange for SAP NetWeaver can receive outbound IDocs from SAP R/3 in real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real time using ALE, install PowerExchange for SAP NetWeaver on PowerCenterRT.
- Provide rapid development of the data warehouse based on R/3 data using Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content that includes mappings, mapplets, source objects, targets, and transformations.
- Set partition points in a pipeline for outbound/inbound IDoc sessions; sessions that fail when reading outbound IDocs from an SAP R/3 source can be configured for recovery. You can also receive data from outbound IDoc files and write data to inbound IDoc files.
- Insert ABAP code blocks to add functionality to the ABAP program flow and use static/dynamic filters to reduce returned rows.
- Customize the ABAP program flow with joins, filters, SAP functions, and code blocks. For example: qualifying table = table1-field1 = table2-field2, where the qualifying table is the last table in the condition based on the join order, including outer joins.
- Create ABAP program variables to represent SAP R/3 structures, structure fields, or values in the ABAP program.
- Remove ABAP program information from SAP R/3 and the repository when a folder is deleted.
- Provide enhanced platform support by running on 64-bit AIX and HP-UX (Itanium). You can install PowerExchange for SAP NetWeaver for the PowerCenter Server and Repository Server on SuSE Linux or on Red Hat Linux.

Installation and Configuration Steps


PowerExchange for SAP NetWeaver setup programs install components for PowerCenter Server, Client, and repository server. These programs install drivers, connection files, and a repository plug-in XML file that enables integration between PowerCenter and SAP R/3. Setup programs can also install PowerExchange for SAP NetWeaver Analytic Business Components, and PowerExchange for SAP NetWeaver Metadata Exchange. The PowerExchange for SAP NetWeaver repository plug-in is called sapplg.xml. After the plug-in is installed, it needs to be registered in the PowerCenter repository.

For SAP R/3


Informatica provides a group of customized objects required for R/3 integration in the form of transport files. These objects include tables, programs, structures, and functions that PowerExchange for SAP NetWeaver exports to data files. The R/3 system administrator must use the transport control program, tp import, to transport these object files on the R/3 system. The transport process creates a development class called ZERP. The SAPTRANS directory contains data and co files. The data files are the actual transport objects. The co files are control files containing information about the transport request. The R/3 system needs development objects and user profiles established to communicate with PowerCenter. Preparing R/3 for integration involves the following tasks:
- Transport the development objects on the PowerCenter CD to R/3. PowerCenter calls these objects each time it makes a request to the R/3 system.
- Run the transport program that generates unique IDs.
- Establish profiles in the R/3 system for PowerCenter users.
- Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.

For PowerCenter
The PowerCenter server and client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter for integration involves the following tasks:
- Run installation programs on the PowerCenter Server and Client machines.
- Configure the connection files:
  - The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C with the R/3 system. The required parameters for sideinfo are:
    DEST - logical name of the R/3 system
    TYPE - set to A to indicate a connection to a specific R/3 system
    ASHOST - host name of the SAP R/3 application server
    SYSNR - system number of the SAP R/3 application server
  - The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system as an RFC client. The required parameters for saprfc.ini are:
    DEST - logical name of the R/3 system
    LU - host name of the SAP application server machine
    TP - set to sapdp<system number>
    GWHOST - host name of the SAP gateway machine
    GWSERV - set to sapgw<system number>
    PROTOCOL - set to I for TCP/IP connection

The following is a summary of the required steps:
1. Install PowerExchange for SAP NetWeaver on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow Manager.
6. Configure an SAP/ALE IDoc connection in the Workflow Manager to receive IDocs generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.
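The following is a minimal sketch of the two connection files using the parameters listed above. All host names, system numbers, and the DEST value are placeholders, and the exact entry syntax can vary by PowerCenter and SAP version, so verify it against the installation guide.

sideinfo (on the PowerCenter Server):
DEST=R3DEV
TYPE=A
ASHOST=r3_app_server_host
SYSNR=00

saprfc.ini (on the PowerCenter Client and Server):
DEST=R3DEV
LU=r3_app_server_host
TP=sapdp00
GWHOST=r3_gateway_host
GWSERV=sapgw00
PROTOCOL=I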

Configuring the Services File

Windows

If SAPGUI is not installed, you must make entries in the Services file to run stream mode sessions. This file is found in the \WINNT\SYSTEM32\drivers\etc directory. Entries should be similar to the following:
sapdp<system number> <port number of dispatcher service>/tcp
sapgw<system number> <port number of gateway service>/tcp
Note: SAPGUI is not technically required, but experience has shown that evaluators typically want to log into the R/3 system to use the ABAP workbench and to view table contents.

UNIX
The Services file is located in /etc. Add entries similar to the following:
sapdp<system number> <port# of dispatcher service>/tcp
sapgw<system number> <port# of gateway service>/tcp

The system number and port numbers are provided by the BASIS administrator.

Configure Connections to Run Sessions


Informatica supports two methods of communication between the SAP R/3 system and the PowerCenter Server.
- Streaming Mode does not create any intermediate files on the R/3 system. This method is faster, but uses more CPU cycles on the R/3 system.
- File Mode creates an intermediate file on the SAP R/3 system, which is then transferred to the machine running the PowerCenter Server.

If you want to run file mode sessions, you must provide either FTP access or NFS access from the machine running the PowerCenter Server to the machine running SAP R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the same machine; it is possible to run PowerCenter and R/3 on the same system, but highly
unlikely. If you want to use File mode sessions and your R/3 system is on a UNIX system, you need to do one of the following:
- Provide the login and password for the UNIX account used to run the SAP R/3 system.
- Provide a login and password for a UNIX account belonging to the same group as the UNIX account used to run the SAP R/3 system.
- Create a directory on the machine running SAP R/3 and run chmod g+s on that directory. Provide the login and password for the account used to create this directory.

Configure database connections in the Server Manager to access the SAP R/3 system when running a session, then configure an FTP connection to access staging file through FTP.

Extraction Process
R/3 source definitions can be imported from the logical tables using the RFC protocol. Extracting data from R/3 is a four-step process:
1. Import source definitions. The PowerCenter Designer connects to the R/3 application server using RFC. The Designer calls a function in the R/3 system to import source definitions.
Note: If you plan to join two or more tables in SAP, be sure you have optimized join conditions. Make sure you have identified your driving table (e.g., if you plan to extract data from the bkpf and bseg accounting tables, be sure to drive your extracts from the bkpf table). There is a significant difference in performance if the joins are properly defined.
2. Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP source qualifier. In the ERP source qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data. You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP program.
3. Generate and install the ABAP program. You can install two types of ABAP programs for each mapping:
- File mode. Extract data to a file. The PowerCenter Server accesses the file through FTP or NFS mount. This mode is used for large extracts because there are timeouts set in SAP for long-running queries.
- Stream mode. Extract data to buffers. The PowerCenter Server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication. This mode is preferred for short-running extracts.

You can modify the ABAP program block and customize according to your requirements (e.g., if you want to get data incrementally, create a mapping variable/parameter and use it in the ABAP program).

Create Session and Run Workflow


- Stream Mode. In stream mode, the installed ABAP program creates buffers on the application server. The program extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the PowerCenter Server using CPI-C. With this method, the PowerCenter Server can process data when it is received.
- File Mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.

Data Integration Using RFC/BAPI Functions



PowerExchange for SAP NetWeaver can generate RFC/BAPI function mappings in the Designer to extract data from SAP R/3, change data in R/3, or load data into R/3. When it uses an RFC/BAPI function mapping in a workflow, the PowerCenter Server makes the RFC function calls on R/3 directly to process the R/3 data. It doesn't have to generate and install the ABAP program for data extraction.

Data Integration Using ALE


PowerExchange for SAP NetWeaver can integrate PowerCenter with SAP R/3 using ALE. With PowerExchange for SAP NetWeaver, PowerCenter can generate mappings in the Designer to receive outbound IDocs from SAP R/3 in real time. It can also generate mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses an inbound or outbound mapping in a workflow to process data in SAP R/3 using ALE, it doesn't have to generate and install the ABAP program for data extraction.

Analytical Business Components


Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business logic to extract and transform R/3 data. It works in conjunction with PowerCenter and PowerExchange for SAP NetWeaver to extract master data, perform lookups, provide documents, and other fact and dimension data from the following R/3 modules:
- Financial Accounting
- Controlling
- Materials Management
- Personnel Administration and Payroll Accounting
- Personnel Planning and Development
- Sales and Distribution

Refer to the ABC Guide for complete installation and configuration information.

Last updated: 04-Jun-08 17:30


Data Connectivity using PowerExchange for Web Services

Challenge


Understanding PowerExchange for Web Services and configuring PowerCenter to access a secure web service.

Description
PowerExchange for Web Services is a service oriented integration technology that can be utilized for bringing application logic that is embedded in existing systems into the PowerCenter data integration platform. Leveraging the logic in existing systems is a cost-effective method for data integration. For example, an insurance policy score calculation logic that is available in a mainframe application can be exposed as a web service and then used by PowerCenter mappings. PowerExchange for Web Services (WebServices Consumer) allows PowerCenter to act as a web services client to consume external web services. PowerExchange for Web Services uses the Simple Object Access Protocol (SOAP) to communicate with the external web service provider. An external web service can be invoked from PowerCenter in three ways:
q q q

Web Service source Web Service transformation Web Service target

In order to increase performance of message transmission, SOAP requests and responses can be compressed. Furthermore, pass-through partitioned sessions can be used for increasing parallelism in the case of large data volumes.

Web Service Source Usage


PowerCenter supports a request-response type of operation when using a Web Services source. The web service can be used as a source if the input in the SOAP request remains fairly constant (since input values for a web service source can only be provided at the source transformation level). Although Web services source definitions

can be created without using a WSDL, they can be edited in the WSDL workspace in PowerCenter Designer.

Web Service Transformation Usage


PowerCenter also supports a request-response type of operation when using a Web Services transformation. The web service can be used as a transformation if input data is available midstream and the response values will be captured from the web service. The following steps provide an example for invoking a Stock Quote web service to learn the price of each of the ticker symbols available in a flat file:
1. In Transformation Developer, create a web service consumer transformation.
2. Specify the URL for the stock quote wsdl and choose the operation get quote.
3. Connect the input port of this transformation to the field containing the ticker symbols.
4. To invoke the web service for each input row, change to source-based commit and an interval of 1. Also change the Transaction Scope to Transaction in the web services consumer transformation.

Web Service Target Usage


PowerCenter supports a one-way type of operation when using a Web Services target. The web service can be used as a target if it is needed only to send a message (and no response is needed). PowerCenter only waits for the web server to start processing the message; it does not wait for the web server to finish processing the web service operation. Existing relational and flat files can be used for the target definitions; or target columns can be defined manually.

PowerExchange for Web Services and Web Services Provider


PowerCenter Web Services Provider is a separate product from PowerExchange for Web Services. An advantage to using PowerCenter Web Services Provider is that it decouples the web service that needs to be consumed from the client. By using PowerCenter as the glue, changes can be made that are transparent to the client. This is useful because often there is no access to the client code or to the web service. Other considerations include:
- PowerCenter Web Services Provider acts as a Service Provider and exposes many key functionalities as web services.
- In PowerExchange for Web Services, PowerCenter acts as a web service client and consumes external web services. It is not necessary to install or configure Web Services Provider in order to use PowerExchange for Web Services.
- Web services exposed through PowerCenter have two formats that can be invoked by different kinds of client programs (e.g., C#, Java, .NET) by using the WSDL that can be generated from the Web Services Hub:
  - Real-Time: In real-time mode, web-enabled workflows are exposed. The Web Services Provider must be used and be pointed to the workflow that is going to be invoked as a web service. Workflows can be started and protected.
  - Batch: In batch mode, a pre-set group of services is exposed to run and monitor workflows in PowerCenter. This feature can be used for reporting and monitoring purposes.

Last but not least, PowerCenter's open architecture facilitates HTTP and HTTPS requests with an HTTP transformation for GET, POST, and SIMPLE POST methods to read from or write data to an HTTP server.

Configuring PowerCenter to Invoke a Secure Web Service


Secure Sockets Layer (SSL) is used to provide security features such as authentication and encryption to web services applications. The authentication certificates follow the Public Key Infrastructure (PKI) standard, a system of digital certificates provided by certificate authorities to verify and authenticate parties of Internet communications or transactions. These certificates are managed in the following two keystore files:
- Trust store. A trust store holds the public keys for the entities it can trust. The Integration Service uses the entries in the trust store file to authenticate the external web services servers.
- Client store. A client store holds both the entity's public and private keys. The Integration Service sends the entries in the client store file to the web services provider so that the web services provider can authenticate the Integration Service.

By default, the trust certificates file is named ca-bundle.crt and contains certificates issued by major, trusted certificate authorities. The ca-bundle.crt file is located in <PowerCenter Installation Directory>/server/bin.


SSL authentication can be performed in three ways:
- Server authentication
- Client authentication
- Mutual authentication

All of the SSL authentication configurations can be set by entering values for Web Service application connections in Workflow Manager.

Server Authentication:
Since the web service provider is the server and the Integration Service is the client, the web service provider is responsible for authenticating the Integration Service. The Integration Service sends the web service provider a client certificate file containing a public key and the web service provider verifies this file. The client certificate file and the corresponding private key file should be configured for this option.

Client Authentication:
Since the Integration Service is the client of the web service provider, it establishes an SSL session to authenticate the web service provider. The Integration Service verifies that the authentication certificate sent by the web service provider exists in the trust certificates file. The trust certificates file should be configured for this option.

Mutual Authentication
The Integration Service and web service provider exchange certificates and verify each other. For this option the trust certificates file, the client certificate and the corresponding private key file should be configured.

Converting Other Formats of Certificate Files


There are a number of other formats of certificate files available: DER format (.cer and . der extensions); PEM format (.pem extension); and PKCS#12 format (.pfx or .P12 extension). The private key for a client certificate must be in PEM format. Files can be converted from one format of certificate to another using the OpenSSL utility. Refer to the OpenSSL documentation for complete information on such conversions. A few examples are given below:


To convert from DER to PEM (assuming there is a DER file called server.der):
openssl x509 -in server.der -inform DER -out server.pem -outform PEM

To convert a PKCS12 file called server.pfx to PEM:
openssl pkcs12 -in server.pfx -out server.pem
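If the private key is only available inside a PKCS#12 file, one way to extract it into the PEM format required for the client certificate is shown below. The file names are placeholders, and -nodes writes the key unencrypted, so protect the output file appropriately:
openssl pkcs12 -in server.pfx -nocerts -nodes -out server_key.pem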

Web Service Performance Tips


Web services communication takes place in the form of XML documents, and performance is affected by the type of requests being transmitted. Below are some tips that can help to improve performance:
- Avoid frequent transmissions of huge data elements. The nesting of elements in a SOAP request has a significant effect on performance. Run these requests in verbose data mode in order to check for this.
- When data is being retrieved for aggregation purposes or for financial calculations (i.e., not real-time), shift those requests to non-peak hours to improve response time.
- Capture the response time for each request sent by using SYSDATE in an expression before the web service transformation and in an expression after it. This shows the true latency, which can then be averaged to determine scaling needs.
- Try to limit the number of web service calls when possible. If you are using the same calls multiple times to return pieces of information for different targets, it is better to return a complete set of results with a unique ID and then stage the sourcing for the different targets.
- Sending simple datatypes (e.g., integer, float, string) improves performance.
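As a sketch of the response-time tip above, the SYSDATE capture can be implemented with two Expression transformations; the port names are illustrative assumptions only. Because SYSDATE has second-level precision, the measured latency is most meaningful when averaged over many rows.

-- Expression transformation placed before the Web Services Consumer transformation
o_REQUEST_START = SYSDATE

-- Expression transformation placed after the Web Services Consumer transformation
o_LATENCY_SECS = DATE_DIFF(SYSDATE, REQUEST_START, 'SS')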

Last updated: 27-May-08 16:45


Data Migration Principles

Challenge


A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration as part of the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As an end result, implementations are often delayed at a great cost to the organization while Data Migration issues are addressed. Informatica's suite of products provides functionality and process to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on-time and on-budget). In this Best Practice we discuss basic principles for data migration that lower project time, reduce staff development time, lower risk, and lower the total cost of ownership of the project. These principles include:
1. Leverage staging strategies
2. Utilize table driven approaches
3. Develop via Modular Design
4. Focus On Re-Use
5. Common Exception Handling Processes
6. Multiple Simple Processes versus Few Complex Processes
7. Take advantage of metadata

Description
Leverage Staging Strategies
As discussed elsewhere in Velocity, for data migration it is recommended to employ both a legacy staging and a pre-load staging area. The reason for this is simple: it provides the ability to pull data from the production system and use it for data cleaning and harmonization activities without interfering with the production systems. By leveraging this type of strategy you are able to see real production data sooner and follow the guiding principle of 'Convert Early, Convert Often, and Convert with Real Production Data'.

Utilize Table Driven Approaches



Developers frequently find themselves in positions where they need to perform a large amount of cross-referencing, hard-coding of values, or other repeatable transformations during a Data Migration. These transformations are often likely to change over time. Without a table-driven approach, such changes cause code changes, bug fixes, re-testing, and re-deployments during the development effort. Much of this work is unnecessary and can be avoided with the use of configuration or reference data tables. It is recommended to use table-driven approaches such as these whenever possible. Some common table-driven approaches include:
- Default Values: hard-coded values for a given column, stored in a table where the values can be changed whenever a requirement changes. For example, if you have a hard-coded value of NA for any value not populated and then want to change that value to NV, you can simply change the value in a default value table rather than change numerous hard-coded values.
- Cross-Reference Values: frequently in data migration projects there is a need to take values from the source system and convert them to the values of the target system. These values are usually identified up-front, but as the source system changes, additional values are also needed. In a typical mapping development situation this would require adding additional values to a series of IIF or DECODE statements. With a table-driven approach, new data can be added to a cross-reference table and no coding, testing, or deployment is required (see the sketch following this list).
- Parameter Values: by using a table-driven parameter file you can reduce the need for scripting and accelerate the development process.
- Code-Driven Table: in some instances a set of understood rules is known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if you had a rules table keyed by table/column/rule ID, then whenever that combination was found a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table can change for the column and no additional coding is required.
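As an illustration of the cross-reference approach, the sketch below shows a simple cross-reference table that a Lookup transformation (or any lookup mechanism) could read in place of hard-coded IIF/DECODE logic. The table and column names are assumptions made for this sketch, not part of any Informatica product:

CREATE TABLE XREF_VALUES (
    SOURCE_SYSTEM  VARCHAR2(20),
    DOMAIN_NAME    VARCHAR2(30),   -- e.g., 'COUNTRY_CODE'
    SOURCE_VALUE   VARCHAR2(50),
    TARGET_VALUE   VARCHAR2(50)
);

-- When the source system introduces a new value, add a row instead of changing mapping logic:
INSERT INTO XREF_VALUES VALUES ('LEGACY_CRM', 'COUNTRY_CODE', 'UNITED STATES', 'US');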

Develop Via Modular Design


As part of the migration methodology, modular design is encouraged. Modular design is the act of developing a standard way in which similar mappings should function. These standards are then published as templates, and developers are required to build similar mappings in that same manner. This provides rapid development, increases efficiency for testing, and increases ease of maintenance, resulting in a dramatically lower total cost of ownership.


Focus On Re-Use
Re-use should always be considered during Informatica development. However, due to such a high degree of repeatability, re-use is paramount to success on data migration projects. There is often tremendous opportunity for re-use of mappings, strategies, processes, scripts, and testing documents. This reduces the staff time for migration projects and lowers project costs.

Common Exception Handling Processes


The Velocity Data Migration Methodology is iterative by intent, and new data quality rules are added as problems are found with the data. Because of this, it is critical to find data exceptions and write appropriate rules to correct these situations throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions. This common method should then be deployed for all data migration processes.

Multiple Simple Processes versus Few Complex Processes


For data migration projects it is possible to build one process to pull all data for a given entity from all systems to the target system. While this may seem ideal, these types of complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended to develop as many simple processes as needed to complete the effort rather than a few complex processes.

Take Advantage of Metadata


The Informatica data integration platform is highly metadata driven. Take advantage of those capabilities on data migration projects. This can be done via a host of reports against the data integration repository, such as:
1. Illustrate how the data is being transformed (i.e., lineage reports)
2. Illustrate who has access to what data (i.e., security group reports)
3. Illustrate what source or target objects exist in the repository
4. Identify how many mappings each developer has created
5. Identify how many sessions each developer has run during a given time period
6. Identify how many successful/failed sessions have been executed
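For example, a report on successful and failed session runs can be driven from the PowerCenter MX views. The view and column names below are typical of the REP_SESS_LOG view but vary by PowerCenter version, so treat this as a sketch and verify the names against the Repository Guide before relying on it:

-- Recent session runs with row counts (assumed MX view/column names)
SELECT session_name,
       successful_rows,
       failed_rows,
       actual_start
FROM   rep_sess_log
WHERE  actual_start >= SYSDATE - 7
ORDER BY actual_start;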

In summary, these design principles provide significant benefits to data migration

projects and add to the large set of typical best practice items that are available in Velocity. The key to Data Migration projects is architect well, design better, and execute best.
Last updated: 01-Feb-07 18:52


Data Migration Project Challenges

Challenge


A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration as part of the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As an end result, implementations are often delayed at a great cost to the organization while Data Migration issues are addressed. Informatica's suite of products provides functionality and process to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on-time and on-budget). In this best practice, the three main data migration project challenges are discussed. These include:
1. Specifications incomplete, inaccurate, or not completed on-time.
2. Data quality problems impacting project time-lines.
3. Difficulties in project management executing the data migration project.

Description
Unlike other Velocity Best Practices we will not specify the full solution to each. Rather, it is more important to understand these three challenges and take action to address them throughout the implementation.

Migration Specifications
A challenge that data migration projects consistently encounter is problems with migration specifications. Projects require the completion of functional specs to identify what is required of each migration interface. Definitions:
- A migration interface is defined as one to many mappings/sessions/workflows or scripts used to migrate a data entity from one source system to one target system.
- A Functional Requirements Specification normally comprises a document covering details including security, database join needs, audit needs, and primary contact details. These details are normally at the interface level rather than at the column level. It also includes a Target-Source Matrix, which identifies details at the column level such as how source tables/columns map to target tables/columns, business rules, data cleansing rules, validation rules, and other column-level specifics.

Many projects attempt to complete these migrations without these types of specifications. Often these projects have little to no chance to complete on-time or on-budget. Time and subject matter expertise
is needed to complete this analysis; this is the baseline for project success. Projects are disadvantaged when functional specifications are not completed on-time. Developers can often be in a wait mode for extended periods of time when these specs are not completed at the time specified by the project plan. Another project risk occurs when the right individuals are not used to write these specs or often inappropriate levels of importance are applied to this exercise. These situations cause inaccurate or incomplete specifications which prevent data integration developers from successfully building the migration processes. To address the spec challenge for migration projects, projects must have specifications that are completed with accuracy and delivered on time.

Data Quality
Most projects are affected by data quality due to the need to address problems in the source data that fit into the six dimensions of data quality:

Data Quality Dimension | Description
Completeness           | What data is missing or unusable?
Conformity             | What data is stored in a non-standard format?
Consistency            | What data values give conflicting information?
Accuracy               | What data is incorrect or out of date?
Duplicates             | What data records or attributes are repeated?
Integrity              | What data is missing or not referenced?

Data migration data quality problems are typically worse than planned for. Projects need to allow enough time to identify and fix data quality problems BEFORE loading the data into the new target system. Informatica's data integration platform provides data quality capabilities that can help to identify data quality problems in an efficient manner, but Subject-Matter Experts are required to determine how these data problems should be addressed within the business context and process.

Project Management

Project managers are often disadvantaged on these types of projects, as the projects are often much larger, more expensive, and more complex than any prior project they have been involved with. They need to understand early in the project the importance of correctly completed specs and of addressing data quality, and they need to establish a set of tools to accurately and objectively plan the project and evaluate progress. Informatica's Velocity Migration Methodology, its tool sets, and the metadata reporting capabilities are key to addressing these project challenges. The key is to fully understand the pitfalls early in the project, how PowerCenter and Informatica Data Quality can address these challenges, and how metadata reporting can provide objective information relative to project status. In summary, data migration projects are challenged by specification issues, data quality issues, and project management difficulties. By understanding the Velocity Methodology's focus on data migration and how Informatica's products can address these challenges, a successful migration can be achieved and these challenges can be minimized.

Last updated: 01-Feb-07 18:52


Data Migration Velocity Approach

Challenge


A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration as part of the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As an end result, implementations are often delayed at a great cost to the organization while Data Migration issues are addressed. Informatica's suite of products provides functionality and process to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on-time and on-budget). To meet these objectives, a set of best practices focused on Data Migration has been provided in Velocity. This Best Practice provides an overview of how to use Informatica's products in an iterative methodology to expedite a data migration project. The keys to the methodology are further discussed in the Best Practice Data Migration Principles.

Description
The Velocity approach to data migration is illustrated here. While it is possible to migrate data in one step, it is more productive to break these processes up into two or three simpler steps. The goal for data migration is to get the data into the target application as early as possible for large-scale implementations. Typical implementations will have three to four trial cutovers or mock-runs before the final implementation, or Go-Live. The mantra for the Informatica-based migration is to Convert Early, Convert Often, and Convert with Real Production Data. To do this, the following approach is encouraged:

Analysis
In the analysis phase the functional specs will be completed, these will include both functional specs and target-source matrix. See the Best Practice Data Migration Project Challenges for related information.

Acquire

In the acquire phase, the target-source matrix will be reviewed and all source systems/tables will be identified. These tables will be used to develop one mapping per source table to populate a mirrored structure in a legacy database schema. For example, if there were 50 source tables identified across all the Target-Source Matrix documents, 50 legacy tables would be created and 50 mappings would be developed, one for each table. It is recommended to perform the initial development against test data but, once complete, run a single extract of the current production data. This will assist in addressing data quality problems without impacting production systems. It is recommended to run these extracts in low-use time periods and with the cooperation of the operations group responsible for these systems. It is also recommended to take advantage of the Visio Generation Option if available. These mappings are very straightforward, and the use of auto-generation can increase consistency and lower required staff time for the project.

Convert
In this phase, data will be extracted from the legacy stage tables (merged, transformed, and cleansed) to populate a mirror of the target application. As part of this process, a standard exception process should be developed to determine exceptions and expedite data cleansing activities. The results of this convert process should be profiled, and appropriate data quality scorecards should be reviewed. During the convert phase the basic set of exception tests should be executed, with exception details collected for future reporting and correction. The basic exception tests, each described in the table that follows, include:
1. Data Type
2. Data Size
3. Data Length
4. Valid Values
5. Range of Values
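For instance, a valid-values test can be implemented as a simple set-based check against reference data in the staging area. The table and column names below are assumptions used only for this sketch:

-- Flag state codes that do not exist in the target system's reference list
SELECT stg.customer_id,
       stg.state_code
FROM   stg_customer stg
WHERE  stg.state_code NOT IN (SELECT state_code FROM target_state_ref);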

Exception Type Data Type

Exception Description Will the source data value load correctly to the target data type such as a numeric date loading into an Oracle date type? Will a numeric value from a source value load correctly to the target column or will a numeric overflow occur? Is the input value too large for the target column? (This is appropriate for all data types but of particular interest for string data types. For example, in one system a field could be char(256) but most of the values are char(10). In the target the new field is varchar(20) so any value over char (20) should raise an exception.) Is the input value within a tolerable range for the new system? (For example, does the birth date for an Insurance Subscriber fall between Jan 1, 1900 and Jan 1, 2006? If this test fails the date is unreasonable and should be addressed.)

Data Size

Data Length

Range of Values

INFORMATICA CONFIDENTIAL

BEST PRACTICES

224 of 954

Valid Values

Is the input value in a list of tolerant values in the target system? (An example of this would be does the state code for an input record match the list of states in the new target system? If not the data should be corrected prior to entry to the new system.)

Once profiling exercises, exception reports and data quality scorecards are complete a list of data quality issues should be created. This list should then be reviewed with the functional business owners to generate new data quality rules to correct the data. These details should be added to the spec and the original convert process should be modified with the new data quality rules. The convert process should then be re-executed as well as the profiling, exception reporting and data scorecarding until the data is correct and ready for load to the target application.

Migrate
In the migrate phase the data from the convert phase should be loaded to the target application. The expectation is that there should be no failures on these loads. The data should be corrected in the covert phase prior to loading the target application. Once the migrate phase is complete, validation should occur. It is recommended to complete an audit/balancing step prior to validation. This is discussed in the Best Practice Build Data Audit/Balancing Processes. Additional detail about these steps are defined in the Best Practice Data Migration Principles.
Last updated: 06-Feb-07 12:08

INFORMATICA CONFIDENTIAL

BEST PRACTICES

225 of 954

Build Data Audit/Balancing Processes

Challenge


Data Migration and Data Integration projects are often challenged to verify that the data in an application is complete. More specifically, to identify that all the appropriate data was extracted from a source system and propagated to its final target. This best practice illustrates how to do this in an efficient and a repeatable fashion for increased productivity and reliability. This is particularly important in businesses that are either highly regulated internally and externally or that have to comply with a host of government compliance regulations such as Sarbanes-Oxley, BASEL II, HIPAA, Patriot Act, and many others.

Description
The common practice for audit and balancing solutions is to produce a set of common tables that can hold various control metrics regarding the data integration process. Ultimately, business intelligence reports provide insight at a glance to verify that the correct data has been pulled from the source and completely loaded to the target. Each control measure that is being tracked will require development of a corresponding PowerCenter process to load the metrics to the Audit/ Balancing Detail table. To drive out this type of solution execute the following tasks: 1. Work with business users to identify what audit/balancing processes are needed. Some examples of this may be: a. Customers (Number of Customers or Number of Customers by Country) b. Orders (Qty of Units Sold or Net Sales Amount) c. Deliveries (Number of shipments or Qty of units shipped of Value of all shipments) d. Accounts Receivable (Number of Accounts Receivable Shipments or Total Accounts Receivable Outstanding) 2. Define for each process defined in #1 which columns should be used for tracking purposes for both the source and target system. 3. Develop a data integration process that will read from the source system and populate the detail audit/balancing table with the control totals. 4. Develop a data integration process that will read from the target system and populate the detail audit/balancing table with the control totals. 5. Develop a reporting mechanism that will query the audit/balancing table and identify the the source and target entries match or if there is a discrepancy. An example audit/balance table definition looks like this : Audit/Balancing Details

INFORMATICA CONFIDENTIAL

BEST PRACTICES

226 of 954

Column Name          Data Type      Size
AUDIT_KEY            NUMBER         10
CONTROL_AREA         VARCHAR2       50
CONTROL_SUB_AREA     VARCHAR2       50
CONTROL_COUNT_1      NUMBER         10
CONTROL_COUNT_2      NUMBER         10
CONTROL_COUNT_3      NUMBER         10
CONTROL_COUNT_4      NUMBER         10
CONTROL_COUNT_5      NUMBER         10
CONTROL_SUM_1        NUMBER (p,s)   10,2
CONTROL_SUM_2        NUMBER (p,s)   10,2
CONTROL_SUM_3        NUMBER (p,s)   10,2
CONTROL_SUM_4        NUMBER (p,s)   10,2
CONTROL_SUM_5        NUMBER (p,s)   10,2
UPDATE_TIMESTAMP     TIMESTAMP
UPDATE_PROCESS       VARCHAR2       50

Control Column Definition by Control Area/Control Sub Area

Column Name          Data Type      Size
CONTROL_AREA         VARCHAR2       50
CONTROL_SUB_AREA     VARCHAR2       50
CONTROL_COUNT_1      VARCHAR2       50
CONTROL_COUNT_2      VARCHAR2       50
CONTROL_COUNT_3      VARCHAR2       50
CONTROL_COUNT_4      VARCHAR2       50
CONTROL_COUNT_5      VARCHAR2       50
CONTROL_SUM_1        VARCHAR2       50
CONTROL_SUM_2        VARCHAR2       50
CONTROL_SUM_3        VARCHAR2       50
CONTROL_SUM_4        VARCHAR2       50
CONTROL_SUM_5        VARCHAR2       50
UPDATE_TIMESTAMP     TIMESTAMP
UPDATE_PROCESS       VARCHAR2       50
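For reference, a minimal Oracle-style DDL sketch of the Audit/Balancing Details table described above might look like the following. The table name and the absence of constraints are illustrative assumptions; the second (Control Column Definition) table would be created along the same lines.

create table audit_balancing_detail (
    audit_key          number(10),      -- surrogate key for the audit entry
    control_area       varchar2(50),    -- e.g., Customers, Orders, Deliveries
    control_sub_area   varchar2(50),    -- e.g., which side or subset was measured
    control_count_1    number(10),
    control_count_2    number(10),
    control_count_3    number(10),
    control_count_4    number(10),
    control_count_5    number(10),
    control_sum_1      number(10,2),
    control_sum_2      number(10,2),
    control_sum_3      number(10,2),
    control_sum_4      number(10,2),
    control_sum_5      number(10,2),
    update_timestamp   timestamp,
    update_process     varchar2(50)     -- process that wrote the control totals
);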

The following screenshot shows a single mapping that populates both the source and target control values:


The following two screenshots show how two mappings could be used to provide the same results:


Note: One key challenge is how to capture the appropriate control values from the source system if it is continually being updated. The first example with one mapping will not work due to the changes that occur in the time between the extraction of the data from the source and the completion of the load to the target application. In those cases you may want to take advantage of an aggregator transformation to collect the appropriate control totals as illustrated in this screenshot:
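As an illustration of populating the detail table with source-side control totals, the following Oracle-style SQL sketch computes a record count and a net sales amount for a hypothetical src_orders table. In practice this logic would normally be implemented as a PowerCenter mapping with an Aggregator transformation; the table, column, and process names here are assumptions.

-- Source-side control totals for an Orders control area
insert into audit_balancing_detail
       (control_area, control_sub_area,
        control_count_1, control_sum_1,
        update_timestamp, update_process)
select 'ORDERS',               -- control area being measured
       'SOURCE',               -- which side of the interface these totals describe
       count(*),               -- number of order rows extracted
       sum(net_sales_amt),     -- net sales amount of those rows
       systimestamp,
       'm_AUDIT_SRC_ORDERS'    -- process that produced the totals
from   src_orders;

A matching process would write a 'TARGET' row for the same control area after the load completes.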

The following are two straw-man examples of an audit/balancing report, the end result of this type of process. Each report lists a data area (Customer, Orders, Deliveries, and so on) with its Leg count, TT count, and count difference, and its Leg amt, TT amt, and amount difference, so that discrepancies stand out at a glance. For example, an Orders row showing a Leg amt of 21294.22 against a TT amt of 21011.21 flags a difference of 283.01, while a Customer row showing 11230.21 on both sides balances with a difference of 0; the count columns work the same way (for example, Orders with 9827 records on both sides, or Deliveries with 1298 legacy records against 1288 loaded).
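As a sketch of the reporting mechanism behind such a report, the following query compares the source and target entries for each control area and returns only the areas where the counts or amounts disagree. It assumes one 'SOURCE' and one 'TARGET' row per control area per load; the table and sub-area values are illustrative.

select s.control_area,
       s.control_count_1                      as source_count,
       t.control_count_1                      as target_count,
       s.control_count_1 - t.control_count_1  as count_diff,
       s.control_sum_1                        as source_amt,
       t.control_sum_1                        as target_amt,
       s.control_sum_1 - t.control_sum_1      as amt_diff
from   audit_balancing_detail s
join   audit_balancing_detail t
  on   t.control_area = s.control_area
where  s.control_sub_area = 'SOURCE'
  and  t.control_sub_area = 'TARGET'
  and (   s.control_count_1 <> t.control_count_1
       or s.control_sum_1   <> t.control_sum_1);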

In summary, there are two big challenges in building audit/balancing processes:

1. Identifying what the control totals should be
2. Building processes that will collect the correct information at the correct granularity

There is also a set of basic tasks that can be leveraged and shared across any audit/balancing need. By building a common model for meeting audit/balancing needs, projects can lower the time needed to develop these solutions and still reduce risk by having this type of solution in place.


Continuing Nature of Data Quality

Challenge


A data quality (DQ) project usually begins with a specific use case in mind, such as resolving data quality issues as part of a data migration effort or attempting to reconcile data acquired as part of a merger or acquisition. Regardless of the specific data quality need, planning for the data quality project should be considered an iterative process. Because change will always be prevalent, data quality is not something that should be considered an absolute. An organization must be cognizant of the continuing nature of data quality whenever undertaking a project that involves data quality. The goal of this Best Practice is to set forth principles that outline the iterative nature of data quality and the steps that should be considered when planning a data quality initiative. Experience has shown that applying these principles and steps maximizes the potential for ongoing success in data quality projects.

Description
The reasons for considering data quality as an iterative process stem from two core concepts. First, the level of sophistication around data quality will continue to improve as a DQ process is implemented. Specifically, as the results are disseminated throughout the organization, it becomes easier to make decisions on the types of rules and standards that should be implemented, since everyone will be working from a single view of the truth. Although everyone may not agree on how data is being entered or identified, the baseline analysis will identify the standards (or lack thereof) currently in place and provide a starting point to work from. Once the initial data quality process is implemented, the iterative nature begins. The users become more familiar with the data as they review the results of the data quality plans to standardize, cleanse, and de-duplicate the data. As each iteration continues, the data stewards should determine if the business rules and reference dictionaries initially put into place need to be modified to effectively address any new issues that arise.

The second reason that data quality continues to evolve is based on the premise that the data will not remain static. Although a baseline set of data quality rules will eventually be agreed upon, the assumption is that as soon as legacy data has been cleansed, standardized, and de-duplicated it will ultimately change. This change could come from a user updating a record or from a new data source being introduced that ultimately needs to become a part of the master data. In either case, the need to perform additional iterations on the updated records and/or new sources should be considered. The frequency of these iterations will vary and is ultimately driven by the processes for data entry and manipulation within an organization. This can result in anything from a need to cleanse data in real time to performing a nightly or weekly batch process. Regardless, scorecards should be monitored to determine if the business rules initially implemented need to be modified or if they are continuing to meet the needs of the organization as it pertains to data quality.

The questions that should be considered when evaluating the continuing and iterative nature of data quality include:
- Are the business rules and reference dictionaries meeting the needs of the organization when attempting to report on the underlying data?
- If a new data source is introduced, can the same data quality rules be applied, or do new rules need to be developed to meet the type of data found in this new source?
- From a trend perspective, is the quality of data improving over time? If not, what needs to be done to remedy the situation?

The answers to these questions will provide a framework to measure the current level of success achieved in implementing an iterative data quality initiative. Just as data quality should be viewed as iterative, so should these questions. They should be reflected upon frequently to determine if changes are needed to how data quality is implemented within the environment, or to the underlying business rules within a specific DQ process.

Although the reasons to iterate through the data may vary, the following steps will be prevalent in each iteration:

1. Identify the problematic data element that needs to be addressed. This problematic data could include bad addresses, duplicate records, or incomplete data elements, as well as other examples.
2. Define the data quality rules and targets that need to be resolved. This includes rules for specific sources and content around which data quality areas are being addressed.
3. Design data quality plans to correct the problematic data. This could be one or many data quality plans, depending upon the scope and complexity of the source data.
4. Implement quality improvement processes to identify problematic data on an ongoing basis. These processes should detect data anomalies which could lead to known and unknown data problems.
5. Monitor and Repeat. This is done to ensure that the data quality plans correct the data to desired thresholds. Since data quality definitions can be adjusted based on business and data factors, this iterative review is essential to ensure that the stakeholders understand what will change with the data as it is cleansed and how that cleansed data may affect existing business processes and management reporting.

Example of the Iterative Process


As noted in the above diagram, the iterative data quality process will continue to be leveraged within an organization as new master data is introduced. By having defined processes in place up front, the ability to effectively leverage the data quality solution is enhanced. An organization's departments that are charged with implementing and monitoring data quality will be doing so within the confines of the enterprise-wide rules and procedures that have been identified for the organization. The following points should be considered as an expansion of the five steps noted above:

1. Identify & Measure Data Quality: This first point is key. The ability to understand the data within the confines of the six dimensions of data quality forms the foundation for the business rules and processes that will be put in place. Without performing an upfront assessment, the ability to effectively implement a data quality strategy will be negatively impacted. From an ongoing perspective, the data quality assessment allows an organization to see how the data quality procedures put into place have improved the quality of the data. Additionally, as new data enters the organization, the assessment provides key information for making ongoing modifications to the data quality processes.

2. Define Data Quality Rules & Targets: Once the assessment is complete, the second part of the analysis phase involves scorecarding the results in order to put into place success criteria and metrics for the data quality management initiative. From an ongoing perspective, this phase involves performing trend analysis on the data and the rules in place to ensure the data continues to conform to the rules that were put into place during the data quality management initiative.

3. Design Quality Improvement Processes: This phase involves the manipulation of the data to align it with the business rules put into place. Examples of potential improvements include standardization, removing noise, aligning product attributes, and implementing measures or classifications.

4. Implement Quality Improvement Processes: Once the data has been standardized, an adjunct to the enhancement process involves the identification of duplicate data and taking action based upon the business rules that have been identified. The rules to identify and address duplicate data will continue to evolve as data stewards become more familiar with the data and as the policies and procedures set in place by the data governance committee become widely adopted throughout the organization. As this occurs, the ability to find additional duplicates or new relationships within the data begins to arise.

5. Monitor Data Quality versus Targets: The ability to monitor the data quality processes is critical because it provides the organization with a quick snapshot of the health of the data. Through analysis of the scorecard results, the data governance committee will have the information needed to confidently make additional modifications to the data quality strategies in place, if needed. Conversely, the scorecards and trend analysis results provide peace of mind that data quality is being effectively addressed within the organization.

Last updated: 20-May-08 22:53


Data Cleansing

Challenge


Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005 study by the Gartner Group stated that the majority of currently planned data warehouse projects will suffer limited acceptance or fail outright. Gartner declared that the main cause of project problems was a lack of attention to data quality. Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenues. Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management. It is essential that data quality issues are tackled during any large-scale data project to enable project success and future organizational success. Therefore, the challenge is twofold: to cleanse project data, so that the project succeeds, and to ensure that all data entering the organizational data stores provides for consistent and reliable decision-making.

Description
A significant portion of time in the project development process should be dedicated to data quality, including the implementation of data cleansing processes. In a production environment, data quality reports should be generated after each data warehouse implementation or when new source systems are integrated into the environment. There should also be provision for rolling back if data quality testing indicates that the data is unacceptable. Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE) and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a complete solution for identifying and resolving all types of data quality problems and preparing data for the consolidation and load processes.

Concepts
Following are some key concepts in the field of data quality. These concepts provide a foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and effectiveness. The list of concepts can be read as a process, leading from profiling and analysis to consolidation.

Profiling and Analysis - whereas data profiling and data analysis are often synonymous terms, in Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can discover data quality issues at a record and field level, and Velocity best practices recommend the use of IDQ for such purposes.

Note: The remaining items in this document therefore focus on IDQ usage.


Parsing - the process of extracting individual elements within the records, files, or data entry forms in order to check the structure and content of each field and to create discrete fields devoted to specific information types. Examples may include: name, title, company name, phone number, and SSN.

Cleansing and Standardization - refers to arranging information in a consistent manner or preferred format. Examples include the removal of dashes from phone numbers or SSNs (a simple SQL illustration follows this list of concepts). For more information, see the Best Practice Effective Data Standardizing Techniques.

Enhancement - refers to adding useful, but optional, information to existing data or complete data. Examples may include: sales volume, number of employees for a given business, and zip+4 codes.

Validation - the process of correcting data using algorithmic components and secondary reference data sources to check and validate information. Example: validating addresses with postal directories.

Matching and de-duplication - refers to removing, or flagging for removal, redundant or poor-quality records where high-quality records of the same information exist. Use matching components and business rules to identify records that may refer, for example, to the same customer. For more information, see the Best Practice Effective Data Matching Techniques.

Consolidation - using the data sets defined during the matching process to combine all cleansed or approved data into a single, consolidated view. Examples are building best record, master record, or house-holding.
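As a simple illustration of the kind of standardization described above (expressed in generic SQL rather than as an IDQ plan component), the following query strips dashes and spaces from a phone number so that all records share one format. The table and column names (src_customer, phone_number) are hypothetical.

select customer_id,
       phone_number                                      as phone_raw,
       replace(replace(phone_number, '-', ''), ' ', '')  as phone_std   -- consistent, delimiter-free format
from   src_customer;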

Informatica Applications
The Informatica Data Quality software suite has been developed to resolve a wide range of data quality issues, including data cleansing. The suite comprises the following elements:
- IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality functionality on a single computer (Windows only).
- IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.
- IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling PowerCenter users to embed data quality procedures defined in IDQ in their mappings.

IDQ stores all its processes as XML in the Data Quality repository (MySQL). IDQ Server enables the creation and management of multiple repositories.

Using IDQ in Data Projects


IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its own applications or to provide them for addition to PowerCenter transformations. Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is, Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input components, output components, and operational components. Plans can perform analysis, parsing, standardization, enhancement, validation, matching, and consolidation operations on the specified data. Plans are saved into projects that can provide a structure and sequence to your data quality endeavors. The following figure illustrates how data quality processes can function in a project setting:


In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile and easy-to-use dashboards to communicate data quality metrics to all interested parties.

In stage 2, you verify the target levels of quality for the business according to the data quality measurements taken in stage 1, and in accordance with project resourcing and scheduling.

In stage 3, you use Workbench to design the data quality plans and projects to achieve the targets. Capturing business rules and testing the plans are also covered in this stage.

In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy plans and resources to remote repositories and file systems through the user interface. If you are running Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which data cleansing and other data quality tasks are performed on the project data.

In stage 5, you'll test and measure the results of the plans and compare them to the initial data quality assessment to verify that targets have been met. If targets have not been met, this information feeds into another iteration of data quality operations in which the plans are tuned and optimized.

In a large data project, you may find that data quality processes of varying sizes and impact are necessary at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the level of unit testing required.

Using the IDQ Integration


Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality repository and import data quality plans to a PowerCenter transformation. With the Integration component, you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ Workbench or Server. The Integration interacts with PowerCenter in two ways:
- On the PowerCenter client side, it enables you to browse the Data Quality repository and add data quality plans to custom transformations. The data quality plans' functional details are saved as XML in the PowerCenter repository.
- On the PowerCenter server side, it enables the PowerCenter Server (or Integration service) to send data quality plan XML to the Data Quality engine for execution.

The Integration requires that at least the following IDQ components are available to PowerCenter:
- Client side: PowerCenter needs to access a Data Quality repository from which to import plans.
- Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan instructions.

An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North American name and postal address records. The Integration component enables the following process:
- Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality repository.
- The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.
- The PowerCenter Designer user saves the transformation and the mapping containing it to the PowerCenter repository. The plan information is saved with the transformation as XML.

The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant source data and plan information will be sent to the Data Quality engine, which processes the data (in conjunction with any reference data files used by the plan) and returns the results to PowerCenter.

Last updated: 06-Feb-07 12:43


Data Profiling

Challenge


Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This Best Practice is intended to provide an introduction for new users. Bear in mind that Informatica's Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following Velocity Best Practice documents for more information:
- Data Cleansing
- Using Data Explorer for Data Discovery and Analysis

Description
Creating a Custom or Auto Profile
The data profiling option provides visibility into the data contained in source systems and enables users to measure changes in the source data over time. This information can help to improve the quality of the source data. An auto profile is particularly valuable when you are data profiling a source for the first time, since auto profiling offers a good overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source level, and domain inference, distinct value and null value count, and min, max, and average (if numeric) at the column level. Creating and running an auto profile is quick and helps to gain a reasonably thorough understanding of a source in a short amount of time. A custom data profile is useful when there is a specific question about a source, for example when you have a business rule that you want to validate or you want to verify that data matches a particular pattern.

Setting Up the Profile Wizard


To customize the profile wizard for your preferences:
- Open the Profile Manager and choose Tools > Options.
- If you are profiling data using a database user that is not the owner of the tables to be sourced, check the Use source owner name during profile mapping generation option.
- If you are in the analysis phase of your project, choose Always run profile interactively, since most of your data profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data profiles are useful in these phases.)

Running and Monitoring Profiles


Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking Configure Session on the Function-Level Operations tab of the wizard.

- Use Interactive mode to create quick, single-use data profiles. The sessions are created with default configuration parameters. For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow Manager and configure and schedule them appropriately.

Generating and Viewing Profile Reports


Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.

For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer schema and reports xml file. The xml files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client installation. You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.

Sampling Techniques
Four types of sampling techniques are available with the PowerCenter data profiling option:

Technique: No sampling
Description: Uses all source data
Usage: Relatively small data sources

Technique: Automatic random sampling
Description: PowerCenter determines the appropriate percentage to sample, then samples random rows.
Usage: Larger data sources where you want a statistically significant data analysis

Technique: Manual random sampling
Description: PowerCenter samples random rows of the source data based on a user-specified percentage.
Usage: Samples more or fewer rows than the automatic option chooses

Technique: Sample first N rows
Description: Samples the number of user-selected rows
Usage: Provides a quick readout of a source (e.g., first 200 rows)

Profile Warehouse Administration

Updating Data Profiling Repository Statistics


The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script that is generated and run it.

ORACLE
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';

Microsoft SQL Server


select 'update statistics ' + name from sysobjects where name like 'PMDP%'

SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'

INFORMIX


select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'

IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all; ' from syscat.tables where tabname like 'PMDP%'

TERADATA
select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and databasename = 'database_name'

where database_name is the name of the repository database.

Purging Old Data Profiles


Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.

Last updated: 01-Feb-07 18:52


Data Quality Mapping Rules

Challenge


Use PowerCenter to create data quality mapping rules to enhance the usability of the data in your system.

Description
The issue of poor data quality is one that frequently hinders the success of data integration projects. It can produce inconsistent or faulty results and ruin the credibility of the system with the business users. This Best Practice focuses on techniques for use with PowerCenter and third-party or add-on software. Comments that are specific to the use of PowerCenter are enclosed in brackets.

Bear in mind that you can augment or supplant the data quality handling capabilities of PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite dedicated to data quality issues. Data analysis and data enhancement processes, or plans, defined in IDQ can deliver significant data quality improvements to your project data. A data project that has built-in data quality steps, such as those described in the Analyze and Design phases of Velocity, enjoys a significant advantage over a project that has not audited and resolved issues of poor data quality. If you have added these data quality steps to your project, you are likely to avoid the issues described below. A description of the range of IDQ capabilities is beyond the scope of this document. For a summary of Informatica's data quality methodology, as embodied in IDQ, consult the Best Practice Data Cleansing.

Common Questions to Consider


Data integration/warehousing projects often encounter general data problems that may not merit a full-blown data quality project, but which nonetheless must be addressed. This document discusses some methods to ensure a base level of data quality; much of the content discusses specific strategies to use with PowerCenter. The quality of data is important in all types of projects, whether it be data warehousing, data synchronization, or data migration. Certain questions need to be considered for all of these projects, with the answers driven by the project's requirements and the business users that are being serviced. Ideally, these questions should be addressed during the Analyze and Design Phases of the project because they can require a significant amount of re-coding if identified later. Some of the areas to consider are:

Text Formatting
The most common hurdle here is capitalization and trimming of spaces. Often, users want to see data in its raw format without any capitalization, trimming, or formatting applied to it. This is easily achievable as it is the default behavior, but there is danger in taking this requirement literally, since it can lead to duplicate records when some of these fields are used to identify uniqueness and the system is combining data from various source systems.

One solution to this issue is to create additional fields that act as a unique key to a given table, but which are formatted in a standard way. Since the raw data is stored in the table, users can still see it in this format, but the additional columns mitigate the risk of duplication. Another possibility is to explain to the users that raw data in unique, identifying fields is not as clean and consistent as data in a common format; in other words, push back on this requirement.

This issue can be particularly troublesome in data migration projects where matching the source data is a high priority. Failing to trim leading/trailing spaces from data can often lead to mismatched results, since the spaces are stored as part of the data value. The project team must understand how spaces are handled in the source systems to determine the amount of coding required to correct this. [When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.] Remember that certain RDBMS products use the data type CHAR, which stores the data with trailing blanks. These blanks need to be trimmed before matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.
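A minimal SQL sketch of the additional standardized key column described above, assuming a hypothetical stg_customer staging table; the raw value is kept for display, while the standardized column is used only to identify uniqueness.

select customer_id,
       customer_name,                                           -- raw value, shown to users as entered
       upper(rtrim(ltrim(customer_name))) as customer_name_key  -- standardized value used as the match key
from   stg_customer;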


Note that many fixed-width files use spaces rather than nulls as padding. Therefore, developers must put one space beside the text radio button and also tell the product that the space is repeating, to fill out the rest of the precision of the column. The strip trailing blanks facility then strips any remaining spaces from the end of the data value. Embedding database text manipulation functions in lookup transformations is not recommended, because the resulting SQL override forces the developer to cache the lookup table. [In PowerCenter, avoid embedding database text manipulation functions in lookup transformations.] On very large tables, caching is not always realistic or feasible.

Datatype Conversions
It is advisable to use explicit tool functions when converting the data type of a particular data value. [In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed, and 15 digits are carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but this may cause subsequent support or maintenance headaches.]

Dates
Dates can cause many problems when moving and transforming data from one place to another, because an assumption must be made that all data values are in a designated format. [Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, a developer increases the risk of transformation errors, which can cause data to be lost.] An example piece of code would be:

IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, then it is often wise to create a reusable expression that handles dates, so that the proper checks are made. It is also advisable to determine if any default dates should be defined, such as a low date or high date. These should then be used throughout the system for consistency. However, do not fall into the trap of always using default dates, as some dates are meant to be NULL until the appropriate time (e.g., birth date or death date). The NULL in the example above could be changed to one of the standard default dates described here.

Decimal Precision
With numeric data columns, developers must determine the expected or required precision of the columns. (By default, to increase performance, PowerCenter treats all numeric columns as 15-digit floating point decimals, regardless of how they are defined in the transformations. The maximum numeric precision in PowerCenter is 28 digits.) If it is determined that a column realistically needs a higher precision, then the Enable Decimal Arithmetic option in the Session Properties needs to be checked. However, be aware that enabling this option can slow performance by as much as 15 percent. The Enable Decimal Arithmetic option must also be enabled when comparing two numbers for equality.

Trapping Poor Data Quality Techniques


The most important technique for ensuring good data quality is to prevent incorrect, inconsistent, or incomplete data from ever reaching the target system. This goal may be difficult to achieve in a data synchronization or data migration project, but it is very relevant when discussing data warehousing or ODS. This section discusses techniques that you can use to prevent bad data from reaching the system.

Checking Data for Completeness Before Loading


When requesting a data feed from an upstream system, be sure to request an audit file or report that contains a summary of what to expect within the feed. Common requests here are record counts or summaries of numeric data fields. If you have performed a data quality audit, as specified in the Analyze Phase, these metrics and others should be readily available. Assuming that the metrics can be obtained from the source system, it is advisable to create a pre-process step that ensures your input source matches the audit file. If the values do not match, stop the overall process from loading into your target system. The source system can then be alerted to verify where the problem exists in its feed.
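A sketch of such a pre-process check, assuming the totals supplied by the source system have been loaded to a hypothetical audit_file_totals table and the feed has been staged to stg_orders. A pre-load task could run this query and stop the overall process whenever the status is 'FAIL'; all object names here are assumptions.

select a.feed_name,
       a.expected_row_count,
       (select count(*) from stg_orders) as actual_row_count,   -- rows actually received in the staging table
       case
         when a.expected_row_count = (select count(*) from stg_orders)
         then 'OK'
         else 'FAIL'                                             -- counts disagree; do not load the target
       end as load_status
from   audit_file_totals a
where  a.feed_name = 'ORDERS';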

Enforcing Rules During Mapping


Another method of filtering bad data is to have a set of clearly defined data rules built into the load job. The records are then evaluated against these rules and routed to an Error or Bad table for further re-processing accordingly. An example of this is to check all incoming Country Codes against a Valid Values table. If the code is not found, then the record is flagged as an Error record and written to the Error table.

A pitfall of this method is that you must determine what happens to the record once it has been loaded to the Error table. If the record is pushed back to the source system to be fixed, then a delay may occur until the record can be successfully loaded to the target system. In fact, if the proper governance is not in place, the source system may refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the data manually and risk not matching the source system; or 2) relax the business rule to allow the record to be loaded.

Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a team member the role of data steward. It is this person's responsibility to patrol these tables and push back to the appropriate systems as necessary, as well as to help make decisions about fixing or filtering bad data. A data steward should have a good command of the metadata, and he/she should also understand the consequences to the user community of data decisions.
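As an illustration of the Country Code rule above, the following SQL sketch writes records whose code is missing from the Valid Values table to an error table. In a PowerCenter mapping this would typically be implemented as a Lookup plus a Router transformation; the table names (stg_customer, ref_country_code, err_customer) are assumptions.

insert into err_customer
       (customer_id, country_code, error_reason, load_timestamp)
select s.customer_id,
       s.country_code,
       'Unknown country code',          -- rule that was violated
       systimestamp
from   stg_customer s
left join ref_country_code v
       on v.country_code = s.country_code
where  v.country_code is null;          -- code not found in the Valid Values table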


Another solution applicable in cases with a small number of code values is to try to anticipate any mistyped error codes and translate them back to the correct codes. The cross-reference translation data can be accumulated over time. Each time an error is corrected, both the incorrect and correct values should be put into the table and used to correct future errors automatically.
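A minimal sketch of that cross-reference translation, assuming a hypothetical xref_code_corrections table that accumulates incorrect_code/correct_code pairs over time; when no correction is known, the original value is passed through unchanged.

select s.customer_id,
       coalesce(x.correct_code, s.country_code) as country_code  -- substitute the known correction when one exists
from   stg_customer s
left join xref_code_corrections x
       on x.incorrect_code = s.country_code;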

Dimension Not Found While Loading Fact


The majority of current data warehouses are built using a dimensional model. A dimensional model relies on dimension records existing before the fact tables are loaded. This can usually be accomplished by loading the dimension tables before loading the fact tables. However, there are some cases where a corresponding dimension record is not present at the time of the fact load. When this occurs, consistent rules are needed to handle it so that data is not improperly exposed to, or hidden from, the users.

One solution is to continue to load the data to the fact table, but assign the foreign key a value that represents Not Found or Not Available in the dimension. These keys must also exist in the dimension tables to satisfy referential integrity, but they provide a clear and easy way to identify records that may need to be reprocessed at a later date.

Another solution is to filter the record from processing, since it may no longer be relevant to the fact table. The team will most likely want to flag the row through the use of either error tables or process codes so that it can be reprocessed at a later time.

A third solution is to use dynamic caches and load the dimensions when a record is not found there, even while loading the fact table. This should be done very carefully, since it may add unwanted or junk values to the dimension table. One occasion when this may be advisable is in cases where a dimension is simply made up of the distinct combination of values in a data set; such a dimension may require a new record if a new combination occurs.

It is imperative that all of these solutions be discussed with the users before making any decisions, since they will eventually be the ones making decisions based on the reports.
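A sketch of the first solution above, assuming the dimension has been pre-seeded with a -1 "Not Found" row so referential integrity is preserved; the table and column names are illustrative, and in PowerCenter this logic usually lives in a Lookup transformation with a default key value.

select f.order_id,
       f.order_amt,
       coalesce(d.customer_key, -1) as customer_key   -- -1 points at the 'Not Found' row in dim_customer
from   stg_orders f
left join dim_customer d
       on d.customer_id = f.customer_id;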

Last updated: 01-Feb-07 18:52


Data Quality Project Estimation and Scheduling Factors

Challenge


This Best Practice is intended to assist project managers who must estimate the time and resources necessary to address data quality issues within data integration or other data-dependent projects. Its primary concerns are the project estimation issues that arise when you add a discrete data quality stage to your data project. However, it also examines the factors that determine when, or whether, you need to build a larger data quality element into your project.

Description
At a high level, there are three ways to add data quality to your project:
- Add a discrete and self-contained data quality stage, such as that enabled by using pre-built Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse and Match.
- Add an expanded but finite set of data quality actions to the project, for example in cases where pre-built plans do not fit the project parameters.
- Incorporate data quality actions throughout the project.

This document should help you decide which of these methods best suits your project and assist in estimating the time and resources needed for the first and second methods.

Using Pre-Built Plans with Informatica Data Cleanse and Match


Informatica Data Cleanse and Match is a cross-application solution that enables PowerCenter users to add data quality processes defined in IDQ to custom transformations in PowerCenter. It incorporates the following components:
- Data Quality Workbench, a user-interface application for building and executing data quality processes, or plans.
- Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and IDQ.
- At least one set of reference data files that can be read by data quality plans to validate and enrich certain types of project data. For example, Data Cleanse and Match can be used with the North America Content Pack, which includes pre-built data quality plans and complete address reference datasets for the United States and Canada.

Data Quality Engagement Scenarios


Data Cleanse and Match delivers its data quality capabilities out of the box; a PowerCenter user can select data quality plans and add them to a Data Quality transformation without leaving PowerCenter. In this way, Data Cleanse and Match capabilities can be added into a project plan as a relatively short and discrete stage. In a more complex scenario, a Data Quality Developer may wish to modify the underlying data quality plans or create new plans to focus on quality analysis or enhancements in particular areas. This expansion of the data quality operations beyond the pre-built plans can also be handled within a discrete data quality stage.

The Project Manager may decide to implement a more thorough approach to data quality and integrate data quality actions throughout the project plan. In many cases, a convincing case can be made for enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and subtasks concerned with such an endeavor.) This is well worth considering. Often, businesses do not realize the extent to which their business and project goals depend on the quality of their data.

The project impact of these three types of data quality activity can be summarized as follows:

DQ approach                                   Estimated project impact
Simple stage                                  10 days, 1-2 Data Quality Developers
Expanded data quality stage                   15-20 days, 2 Data Quality Developers, high visibility to business
Data quality integrated with data project     Duration of data project, 2 or more project roles, impact on business and project objectives
Note: The actual time that should be allotted to the data quality stages noted above depends on the factors discussed in the remainder of this document.

Factors Influencing Project Estimation


The factors influencing project estimation for a data quality stage range from high-level project parameters to lower-level data characteristics. The main factors are listed below and explained in detail later in this document.
- Base and target levels of data quality
- Overall project duration/budget
- Overlap of sources/complexity of data joins
- Quantity of data sources
- Matching requirements
- Data volumes
- Complexity and quantity of data rules
- Geography

Determine which scenario (out-of-the-box Data Cleanse and Match, expanded Data Cleanse and Match, or a thorough data quality integration) best fits your data project by considering the project's overall objectives and its mix of factors.

The Simple Data Quality Stage



Project managers can consider the use of pre-built plans with Data Cleanse and Match as a simple scenario with a predictable number of function points that can be added to the project plan as a single package. You can add the North America Content Pack plans to your project if the project meets most of the following criteria. Similar metrics apply to other types of pre-built plans:
- Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
- Complexity of data rules is relatively low.
- Business rules present in pre-built plans need minimum fine-tuning.
- Target data quality level is achievable (i.e., <100 percent).
- Quantity of data sources is relatively low.
- Overlap of data sources/complexity of database table joins is relatively low.
- Matching requirements and targets are straightforward.
- Overall project duration is relatively short.
- The project relates to a single country.

Note that the source data quality level is not a major concern.

Implementing the Simple Data Quality Stage


The out-of-the-box scenario is designed to deliver significant increases in data quality in those areas for which the plans were designed (i.e., North American name and address data) in a short time frame. As indicated above, it does not anticipate major changes to the underlying data quality plans. It involves the following three steps:

1. Run pre-built plans.
2. Review plan results.
3. Transfer data to the next stage in the project and (optionally) add data quality plans to PowerCenter transformations.

While every project is different, a single iteration of the simple model may take approximately five days, as indicated below:

- Run pre-built plans (2 days)
- Review plan results (1 day)
- Pass data to the next stage in the project and add plans to PowerCenter transformations (2 days)

Note that these estimates fit neatly into a five-day week but may be conservative in some cases. Note also that a Data Quality Developer can tune plans on an ad-hoc basis to suit the project. Therefore, you should plan for a two-week simple data quality stage.

Step - Simple Stage                                             Days, week 1   Days, week 2
Run pre-built plans                                             2
Review plan results                                             1
Fine-tune pre-built plans if necessary                          2
Re-run pre-built plans                                                         2
Review plan results with stakeholders                                          1
Add plans to PowerCenter transformations and define mappings                   1
Run PowerCenter workflows                                                      1
Review results/obtain approval from stakeholders
Approve and pass all files to the next project stage

Expanding the Simple Data Quality Stage


Although the simple scenario above allows the data quality components to be treated as a black box, it still allows for modifications to the data quality plans. The types of plan tuning that developers can undertake in this time frame include changing the reference dictionaries used by the plans, editing these dictionaries, and re-selecting the data fields used by the plans as keys to identify data matches. The above time frame does not guarantee that a developer can build or re-build a plan from scratch.

The gap between base and target levels of data quality is an important area to consider when expanding the data quality stage. The Developer and Project Manager may decide to add a data analysis step in this stage, or even decide to split these activities across the project plan by conducting a data quality audit early in the project, so that issues can be revealed to the business in advance of the formal data quality stage. The schedule should allow sufficient time for testing the data quality plans and for contact with the business managers in order to define data quality expectations and targets. In addition:

- If a data quality audit is added early in the project, the data quality stage grows into a project-length endeavor.
- If the data quality audit is included in the discrete data quality stage, the expanded, three-week Data Quality stage may look like this:

Step - Enhanced DQ Stage                                           Days, week 1   Days, week 2   Days, week 3
Set up and run data analysis plans                                 1
Review plan results                                                1-2
Conduct advance tuning of pre-built plans                          2
Run pre-built plans                                                               1
Review plan results with stakeholders                                             2
Modify pre-built plans or build new plans from scratch                            2
Re-run the plans                                                                                 1
Review plan results/obtain approval from stakeholders                                            2
Add approved plans to PowerCenter transformations, define mappings                               1
Run PowerCenter workflows                                                                        1
Review results/obtain approval from stakeholders                                                 1
Approve and pass all files to the next project stage

Sizing Your Data Quality Initiatives


The following section describes the factors that affect the estimated time that the data quality endeavors may add to a project. Estimating the specific impact that a single factor is likely to have on a project plan is difficult, as a single data factor rarely exists in isolation from others. If one or two of these factors apply to your data, you may be able to treat them within the scope of a discrete DQ stage. If several factors apply, you are moving into a complex scenario and must design your project plan accordingly.

Base and Target Levels of Data Quality


The rigor of your data quality stage depends in large part on the current (i.e., base) levels of data quality in your dataset and the target levels that you want to achieve. As part of your data project, you should run a set of data analysis plans and determine the strengths and weaknesses of the proposed project data. If your data is already of a high quality relative to project and business goals, then your data quality stage is likely to be a short one. If possible, you should conduct this analysis at an early stage in the data project (i.e., well in advance of the data quality stage). Depending on your overall project parameters, you may have already scoped a Data Quality Audit into your project. However, if your overall project is short in duration, you may have to tailor your data quality analysis actions to the time available.

Action: If there is a wide gap between base and target data quality levels, determine whether a short data quality stage can bridge the gap. If a data quality audit is conducted early in the project, you have latitude to discuss this with the business managers in the context of the overall project timeline. In general, it is good practice to agree with the business to incorporate time into the project plan for a dedicated Data Quality Audit. (See Task 2.8 in the Velocity Work Breakdown Structure.) If the aggregated data quality percentage for your project's source data is greater than 60 percent, and your target percentage level for the data quality stage is less than 95 percent, then you are in the zone of effectiveness for Data Cleanse and Match.

Note: You can assess data quality according to at least six criteria. Your business may need to improve data quality levels with respect to one criterion but not another. See the Best Practice document Data Cleansing.

Overall Project Duration/Budget


A data project with a short duration may not have the means to accommodate a complex data quality stage, regardless of the potential or need to enhance the quality of the data involved. In such a case, you may have to incorporate a finite data quality stage. Conversely, a data project with a long time line may have scope for a larger data quality initiative. In large data projects with major business and IT targets, good data quality may be a significant issue. For example, poor data quality can affect the ability to cleanly and quickly load data into target systems. Major data projects typically have a genuine need for high-quality data if they are to avoid unforeseen problems.

Action: Evaluate the project schedule parameters and expectations put forward by the business and evaluate how data quality fits into these parameters. You must also determine if there are any data quality issues that may jeopardize project success, such as a poor understanding of the data structure. These issues may already be visible to the business community. If not, they should be raised with management. Bear in mind that data quality is not simply concerned with the accuracy of the data values; it can encompass the project metadata as well.

Overlap of Sources/Complexity of Data Joins


When data sources overlap, data quality issues can be spread across several sources. The relationships among the variables within the sources can be complex, difficult to join together, and difficult to resolve, all adding to project time. If the joins between the data are simple, then this task may be straightforward. However, if the data joins use complex keys or exist over many hierarchies, then the data modeling stage can be time-consuming, and the process of resolving the indices may be prolonged. Action: You can tackle complexity in data sources and in required database joins within a data quality stage, but in doing so, you step outside the scope of the simple data quality stage.

Quantity of Data Sources


This issue is similar to that of data source overlap and complexity (above). The greater the quantity of sources, the greater the opportunity for data quality issues to arise. The number of data sources has a particular impact on the time required to set up the data quality solution. (The source data setup in PowerCenter can facilitate the data setup in the data quality stage.) Action: You may find that the number of data sources correlates with the number of data sites covered by the project. If your project includes data from multiple geographies, you step outside the scope of a simple data quality stage.

Matching Requirements
Data matching plans are the most performance-intensive type of data quality plan. Moreover, matching plans are often coupled to a type of data standardization plan (i.e., grouping plan) that prepares the data for match analysis. Matching plans are not necessarily more complex to design than other types of plans, although they may contain sophisticated business rules. However, the time taken to execute a matching plan is exponentially proportional to the volume of data records passed through the plan. (Specifically, the time taken is proportional to the size and number of data groups created in the grouping plans.)

Action: Consult the Best Practice on Effective Data Matching Techniques and determine how long your matching plans may take to run.

Data Volumes
Data matching requirements and data volumes are closely related. As stated above, the time taken to execute a matching plan is exponentially proportional to the volume of data records passed through it. In other types of plans, this exponential relationship does not exist. However, the general rule applies: the larger your data volumes, the longer it takes for plans to execute. Action: Although IDQ can handle data volumes measurable in eight figures, a dataset of more than 1.5 million records is considered larger than average. If your dataset is measurable in millions of records, and high levels of matching/de-duplication are required, consult the Best Practice on Effective Data Matching Techniques.

Complexity and Quantity of Data Rules


This is a key factor in determining the complexity of your data quality stage. If the Data Quality Developer is likely to write a large number of business rules for the data quality plans (as may be the case if data quality target levels are very high or relate to precise data objectives), then the project is de facto moving beyond Data Cleanse and Match capability, and you need to add rule-creation and rule-review elements to the data quality effort.

Action: If the business requires multiple complex rules, you must scope additional time for rule creation and for multiple iterations of the data quality stage. Bear in mind that, as well as being written and added to data quality plans, the rules must be tested and approved by the business.

Geography
Geography affects the project plan in two ways:
- First, the geographical spread of data sites is likely to affect the time needed to run plans, collate data, and engage with key business personnel. Working hours in different time zones can mean that one site is starting its business day while others are ending theirs, and this can affect the tight scheduling of the simple data quality stage.
- Second, project data that is sourced from several countries typically means multiple data sources, with opportunities for data quality issues to arise that may be specific to the country or the division of the organization providing the data source.

There is also a high correlation between the scale of the data project and the scale of the enterprise in which the project will take place. For multi-national corporations, there is rarely such a thing as a small data project!

Action: Consider the geographical spread of your source data. If the data sites are spread across several time zones or countries, you may need to factor in time lags to your data quality planning.


Developing the Data Quality Business Case

Challenge


When a potential data quality issue has been identified, it is imperative to develop a business case that details the severity of the issue along with the benefits to be gained by implementing a data quality strategy. A strong business case can help to build the necessary organizational support for funding a data quality initiative.

Description
Building a business case around data quality often necessitates starting with a pilot project. The purpose of the pilot project is to document the anticipated return on investment (ROI). It is important to ensure that the pilot is both manageable and achievable in a relatively short period of time. Build the business case by conducting a Data Quality Audit on a representative sample set of data, but set a reasonable scope so that the audit can be accomplished within a three- to four-week period. At the conclusion of the Data Quality Audit, a report should be prepared that captures the results of the investigation (i.e., invalid data, duplicate records, etc.) and extrapolates the expected cost savings that can be gained if an Enterprise data quality initiative is pursued.

Below are the five key steps necessary to develop a business case for a Data Quality Audit. Following these steps also provides a solid foundation for detailing the business requirements for an Enterprise data quality initiative.

1. Identify a Test Source

a. What source file(s) are to be considered?

A representative sample set of data should be evaluated. This can be a cross-section of an enterprise data set or data from a specific department in which a potential data quality issue is expected to be found.

b. What data within those files (priority, obsolete, dormant, incorrect) will be used?


Prior to conducting the Data Quality Audit, the type of data within each file should be documented. The results generated during the Audit should be tracked against the anticipated data types. For example, if 10% of the records are incorrectly flagged as priority (when they should be marked obsolete or dormant), any reporting based upon the results of this data will be skewed.

2. Identify Issues

a. What data needs to be fixed?

Any anticipated issues with the data should be identified prior to conducting the Audit in order to ensure that the specific use cases are investigated.

b. What data needs to be changed or enhanced?

A data dictionary should be created or made available to capture any anticipated values that should reside within a given data field. These values will be utilized via a reference lookup to analyze the level of conformity between the actual value and the recorded value in the reference dictionary. Additionally, any missing values should be updated based upon the documented data dictionary value.

c. What is a representative set of business rules to demonstrate functionality?

Prior to conducting the Audit, a discussion should be held regarding the business rules that should be enforced in the provided data set. The intent is to use the expected business rules as a starting point for validation of the data during the Audit. As new rules are likely to be identified during the Audit, having a starting point ensures that initial results can be quickly disseminated to key stakeholders via an initial data quality iteration that leverages the previously documented business rules.

3. Define Scope

a. What can be achieved with which resources in the time available?

The scope of the Audit should be defined in order to ensure that a business case can be made for a data quality initiative within weeks, not months. The project should be seen as a pilot in order to validate the anticipated ROI if an Enterprise initiative is pursued. Just as the scope should be well defined, commitments should be agreed upon prior to starting the project that the required resources (i.e., data steward, IT representative, business user) will be available as needed during the duration of the project. This will ensure that activities such as the data and business rule review remain on schedule.

b. What milestones are critical to other parts of the project?

Any relationships between the outcome of the project and other initiatives within the organization should be identified up front. Although the Audit is a pilot project, the data quality results should be reusable on other projects within the organization. If there are specific milestones for the delivery of results, this should be incorporated into the project plan in order to ensure that other projects are not adversely impacted.

4. Highlight Resulting Issues

a. Highlight typical issues for the Business, Data Owners, the Governance Team and Senior Management.

Upon conclusion of the Audit, the issues uncovered during the project should be summarized and presented to key stakeholders in a workshop setting. During the workshop, the results should be highlighted, along with any anticipated impact to the business if a data quality initiative is not enacted within the organization.

b. Test the execution and resolution of issues.

During the Audit, the resolution of identified issues should occur by leveraging Informatica Data Quality. During the workshop, the means to resolve the issues and the end results should be presented. The types of issues typically resolved include address validation, ensuring conformity of data through the use of reference dictionaries, and the identification and resolution of duplicate data.

5. Build Knowledge

a. Gain confidence and knowledge of data quality management strategies, conference room pilots, migrations, etc.

To reiterate, the intent of the Audit is to quantify the anticipated ROI within an organization if a data quality strategy is implemented. Additionally, knowledge about the data, the business rules, and the potential strategy that can be leveraged throughout the entire organization should be captured.


b. The rules employed will form the basis of an ongoing DQM Strategy in the target systems.

The identified rules should be incorporated into an existing data quality management strategy or utilized as the starting point for a new strategy moving forward.

The above steps are intended as a starting point for developing a framework for conducting a Data Quality Audit. From this Audit, the key stakeholders in an organization should have definitive proof as to the extent of the types of data quality issues within their organization and the anticipated ROI that can be achieved through the introduction of data quality throughout the organization.

Last updated: 21-Aug-07 11:48


Effective Data Matching Techniques

Challenge


Identifying and eliminating duplicates is a cornerstone of effective marketing efforts and customer resource management initiatives, and it is an increasingly important driver of cost-efficient compliance with regulatory initiatives such as KYC (Know Your Customer). Once duplicate records are identified, you can remove them from your dataset, and better recognize key relationships among data records (such as customer records from a common household). You can also match records or values against reference data to ensure data accuracy and validity. This Best Practice is targeted toward Informatica Data Quality (IDQ) users familiar with Informatica's matching approach. It has two high-level objectives:
- To identify the key performance variables that affect the design and execution of IDQ matching plans.
- To describe plan design and plan execution actions that will optimize plan performance and results.

To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.

Description
All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer numbers or product ID fields) that, if present, would allow clear joins between the datasets and improve business knowledge. Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single view of customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from being sent to the same person or household; and it can assist marketing efforts by identifying households or individuals who are heavy users of a product or service. Data can be enriched by matching across production data and reference data sources. Business intelligence operations can be improved by identifying links between two or more systems to provide a more complete picture of how customers interact with a business.

IDQ's matching capabilities can help to resolve dataset duplications and deliver business results. However, a user's ability to design and execute a matching plan that meets the key requirements of performance and match quality depends on understanding the best-practice approaches described in this document.

An integrated approach to data matching involves several steps that prepare the data for matching and improve the overall quality of the matches. The following table outlines the processes in each step.

Step: Profiling
Description: Typically the first stage of the data quality process, profiling generates a picture of the data and indicates the data elements that can comprise effective group keys. It also highlights the data elements that require standardizing to improve match scores.

Step: Standardization
Description: Removes noise, excess punctuation, variant spellings, and other extraneous data elements. Standardization reduces the likelihood that match quality will be affected by data elements that are not relevant to match determination.

Step: Grouping
Description: A post-standardization function in which the group key fields identified in the profiling stage are used to segment data into logical groups that facilitate matching plan performance.

Step: Matching
Description: The process whereby the data values in the created groups are compared against one another and record matches are identified according to user-defined criteria.

Step: Consolidation
Description: The process whereby duplicate records are cleansed. It identifies the master record in a duplicate cluster and permits the creation of a new dataset or the elimination of subordinate records. Any child data associated with subordinate records is linked to the master record.

The sections below identify the key factors that affect the performance (or speed) of a matching plan and the quality of the matches identified. They also outline the best practices that ensure that each matching plan is implemented with the highest probability of success. (This document does not make any recommendations on profiling, standardization or consolidation strategies. Its focus is grouping and matching.) The following table identifies the key variables that affect matching plan performance and the quality of matches identified.

Factor: Group size
Impact: Plan performance
Impact summary: The number and size of groups have a significant impact on plan execution speed.

Factor: Group keys
Impact: Quality of matches
Impact summary: The proper selection of group keys ensures that the maximum number of possible matches are identified in the plan.

Factor: Hardware resources
Impact: Plan performance
Impact summary: Processors, disk performance, and memory require consideration.

Factor: Size of dataset(s)
Impact: Plan performance
Impact summary: This is not a high-priority issue. However, it should be considered when designing the plan.

Factor: Informatica Data Quality components
Impact: Plan performance
Impact summary: The plan designer must weigh file-based versus database matching approaches when considering plan requirements.

Factor: Time window and frequency of execution
Impact: Plan performance
Impact summary: The time taken for a matching plan to complete execution depends on its scale. Timing requirements must be understood up-front.

Factor: Match identification
Impact: Quality of matches
Impact summary: The plan designer must weigh deterministic versus probabilistic approaches.

Group Size
Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons performed in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a matching plan compares the records within each group with one another. When grouping is implemented properly, plan execution speed is increased significantly, with no meaningful effect on match quality.

The most important determinant of plan execution speed is the size of the groups to be processed, that is, the number of data records in each group. For example, consider a dataset of 1,000,000 records, for which a grouping strategy generates 10,000 groups. If 9,999 of these groups have an average of 50 records each, the remaining group will contain more than 500,000 records; based on this one large group, the matching plan would require 87 days to complete, processing 1,000,000 comparisons a minute! In comparison, the remaining 9,999 groups could be matched in about 12 minutes if the group sizes were evenly distributed.

Group size can also have an impact on the quality of the matches returned in the matching plan. Large groups perform more record comparisons, so more likely matches are potentially identified. The reverse is true for small groups. As groups get smaller, fewer comparisons are possible, and the potential for missing good matches is increased. The goal of grouping is to optimize performance while minimizing the possibility that valid matches will be overlooked because like records are assigned to different groups. Therefore, groups must be defined intelligently through the use of group keys.
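As an informal check on these figures (a minimal Python sketch, not an IDQ feature; the group sizes and the rate of one million comparisons per minute are taken from the example above):

    # Estimate pairwise comparisons and matching time for a set of group sizes.
    # Assumes roughly 1,000,000 comparisons per minute, as in the example above.
    def comparisons(group_size):
        # Each record in a group is compared once with every other record in the group.
        return group_size * (group_size - 1) // 2

    group_sizes = [50] * 9999 + [500050]        # 9,999 small groups plus one very large group
    total = sum(comparisons(n) for n in group_sizes)
    minutes = total / 1000000.0                 # comparisons-per-minute assumption
    print("%s comparisons, about %.0f days" % (format(total, ","), minutes / (60 * 24)))
    # The single 500,050-record group alone accounts for roughly 125 billion comparisons
    # (about 87 days at this rate); the 9,999 small groups need only about 12 minutes.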

Group Keys
Group keys determine which records are assigned to which groups. Group key selection, therefore, has a significant affect on the success of matching operations. Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the plan. The selection of group keys, based on key data fields, is critical to ensuring that relevant records are compared against one another. When selecting a group key, two main criteria apply:
- Candidate group keys should represent a logical separation of the data into distinct units where there is a low probability that matches exist between records in different units. This can be determined by profiling the data and uncovering the structure and quality of the content prior to grouping.
- Candidate group keys should also have high scores in three key areas of data quality: completeness, conformity, and accuracy. Problems in these data areas can be improved by standardizing the data prior to grouping.

For example, geography is a logical separation criterion when comparing name and address data. A record for a person living in Canada is unlikely to match someone living in Ireland. Thus, the country-identifier field can provide a useful group key. However, if you are working with national data (e.g., Swiss data), duplicate data may exist for an individual living in Geneva, who may also be recorded as living in Genf or Geneve. If the group key in this case is based on city name, records for Geneva, Genf, and Geneve will be written to different groups and never compared unless variant city names are standardized.
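A minimal sketch of this idea outside of IDQ (the dictionary values, field names, and records are invented for the example) shows how standardizing variant city names before the group key is built keeps equivalent records in the same group:

    # Standardize variant city spellings before building the group key,
    # so that equivalent records are assigned to the same group.
    CITY_DICTIONARY = {"GENF": "GENEVA", "GENEVE": "GENEVA", "GENEVA": "GENEVA"}

    def group_key(record):
        city = record["city"].strip().upper()
        city = CITY_DICTIONARY.get(city, city)     # fall back to the raw value if unknown
        return (record["country"].upper(), city)   # country plus standardized city

    records = [
        {"country": "CH", "city": "Geneva"},
        {"country": "CH", "city": "Genf"},
        {"country": "CH", "city": "Geneve"},
    ]
    print({group_key(r) for r in records})         # one group key, not three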

Size of Dataset
In matching, the size of the dataset typically does not have as significant an impact on plan performance as the definition of the groups within the plan. However, in general terms, the larger the dataset, the more time required to produce a matching plan both in terms of the preparation of the data and the plan execution.

IDQ Components
All IDQ components serve specific purposes, and very little functionality is duplicated across the components. However, there are performance implications for certain component types, combinations of components, and the quantity of components used in the plan. Several tests have been conducted on IDQ (version 2.11) to test source/sink combinations and various operational components. In tests comparing file-based matching against database matching, file-based matching outperformed database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching plans that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed Field Matcher component performed more slowly than plans without a Mixed Field Matcher. Raw performance should not be the only consideration when selecting the components to use in a matching plan. Different components serve different needs and may offer advantages in a given scenario.

Time Window
IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for the completion of a matching plan can have a significant impact on the perception that the plan is running correctly. Knowing the time window for plan completion helps to determine the hardware configuration choices, grouping strategy, and the IDQ components to employ.

Frequency of Execution
The frequency with which plans are executed is linked to the time window available. Matching plans may need to be tuned to fit within the cycle in which they are run. The more frequently a matching plan is run, the more the execution time will have to be considered.

Match Identification
The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key methods for assessing matches are:
- deterministic matching
- probabilistic matching

Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ's fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first check if the last name comparison score was greater than 85 percent. If this is true, it next checks the address. If an 80 percent match is found, it then checks the first name. If a 90 percent match is found on the first name, then the entire record is considered successfully matched. The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to others, and (2) it is similar to the methods employed when manually checking for matches. The disadvantages to this method are its rigidity and its requirement that each dependency be true. This can result in matches being missed, or can require several different rule checks to cover all likely combinations.

Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in order to calculate a weighted average that indicates the degree of similarity between two pieces of information. The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no dependencies on certain data elements matching in order for a full match to be found. Weights assigned to individual components can place emphasis on different fields or areas in a record. However, even if a heavily-weighted score falls below a defined threshold, match scores from less heavily-weighted components may still produce a match.

The disadvantages of this method are a higher degree of required tweaking on the user's part to get the right balance of weights in order to optimize successful matches. This can be difficult for users to understand and communicate to one another. Also, the cut-off mark for good matches versus bad matches can be difficult to assess. For example, a matching plan with 95 to 100 percent success may have found all good matches, but matching plan success between 90 and 94 percent may map to only 85 percent genuine matches. Matches between 85 and 89 percent may correspond to only 65 percent genuine matches, and so on. The following table illustrates this principle.

Close analysis of the match results is required because of the relationship between match quality and the match threshold scores assigned, since there may not be a one-to-one mapping between the plan's weighted score and the number of records that can be considered genuine matches.
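The contrast between the two approaches can be sketched in a few lines of Python (the field scores, thresholds, and weights are invented for illustration and are not IDQ defaults):

    # Field-level scores (0.0 to 1.0) as might be produced by fuzzy comparison components.
    scores = {"last_name": 0.91, "address": 0.84, "first_name": 0.95}

    # Deterministic: a chain of threshold checks, each dependent on the previous one passing.
    def deterministic_match(s):
        return s["last_name"] > 0.85 and s["address"] > 0.80 and s["first_name"] > 0.90

    # Probabilistic (weighted): one weighted average compared against a single cut-off.
    WEIGHTS = {"last_name": 0.5, "address": 0.3, "first_name": 0.2}
    def weighted_score(s):
        return sum(s[field] * weight for field, weight in WEIGHTS.items())

    print(deterministic_match(scores))       # True: every rule in the chain passed
    print(weighted_score(scores) >= 0.88)    # True: 0.91*0.5 + 0.84*0.3 + 0.95*0.2 = 0.897

In the weighted case, choosing the 0.88 cut-off is exactly the tuning burden described above; in the deterministic case, the burden moves into the ordering and thresholds of the individual rules.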

Best Practice Operations


The following section outlines best practices for matching with IDQ.

Capturing Client Requirements


Capturing client requirements is key to understanding how successful and relevant your matching plans are likely to be. As a best practice, be sure to answer the following questions, as a minimum, before designing and implementing a matching plan:
- How large is the dataset to be matched?
- How often will the matching plans be executed?
- When will the match process need to be completed?
- Are there any other dependent processes?
- What are the rules for determining a match?
- What process is required to sign-off on the quality of match results?
- What processes exist for merging records?

Test Results
Performance tests demonstrate the following:
- IDQ has near-linear scalability in a multi-processor environment.
- Scalability in standard installations, as achieved in the allocation of matching plans to multiple processors, will eventually level off.

Performance is the key to success in high-volume matching solutions. IDQ's architecture supports massive scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly enhances IDQ's ability to meet the service levels required by users without sacrificing quality or requiring an overly complex solution. If IDQ is integrated with PowerCenter, matching scalability can be achieved using PowerCenter's partitioning capabilities.

Managing Group Sizes


As stated earlier, group sizes have a significant effect on the speed of matching plan execution. Also, the quantity of small groups should be minimized to ensure that the greatest number of comparisons are captured. Keep the following parameters in mind when designing a grouping plan.

Condition: Maximum group size
Best practice: 5,000 records
Exceptions: Large datasets over 2M records with uniform data. Minimize the number of groups containing more than 5,000 records.

Condition: Minimum number of single-record groups
Best practice: 1,000 groups per one million record dataset.

Condition: Optimum number of comparisons
Best practice: 500,000,000 comparisons +/- 20 percent per 1 million records.

In cases where the datasets are large, multiple group keys may be required to segment the data to ensure that best practice guidelines are followed. Informatica Corporation can provide sample grouping plans that automate these requirements as far as is practicable.
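The distribution of group sizes can be checked against these guidelines before matching. The sketch below is illustrative only (the grouping itself would normally be performed in IDQ or in the staging database), using the 5,000-record threshold from the table above:

    from collections import Counter

    # group_keys: one group-key value per record, as produced by the grouping plan.
    def check_group_sizes(group_keys, max_size=5000):
        sizes = Counter(group_keys)
        oversized = [key for key, n in sizes.items() if n > max_size]
        singles = sum(1 for n in sizes.values() if n == 1)
        comparisons = sum(n * (n - 1) // 2 for n in sizes.values())
        print("%d groups, %d over %d records, %d single-record groups, %s comparisons"
              % (len(sizes), len(oversized), max_size, singles, format(comparisons, ",")))
        return oversized   # groups that should be split with a finer group key

    # Usage: check_group_sizes(keys) on roughly one million records spread over
    # about 10,000 group keys flags any group that breaches the guidelines above.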

Group Key Identification


Identifying appropriate group keys is essential to the success of a matching plan. Ideally, any dataset that is about to be matched has been profiled and standardized to identify candidate keys. Group keys act as a first pass or high-level summary of the shape of the dataset(s). Remember that only data records within a given group are compared with one another. Therefore, it is vital to select group keys that have high data quality scores for completeness, conformity, consistency, and accuracy. Group key selection depends on the type of data in the dataset, for example whether it contains name and address data or other data types such as product codes.

Hardware Specifications
Matching is a resource-intensive operation, especially in terms of processor capability. Three key variables determine the effect of hardware on a matching plan: processor speed, disk performance, and memory. The majority of the activity required in matching is tied to the processor. Therefore, the speed of the processor has a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is one million comparisons per minute, the speed can range from as low as 250,000 comparisons to 6.5 million comparisons per minute, depending on the hardware specification, background processes running, and components used. As a best practice, higher-specification processors (e.g., 1.5 GHz minimum) should be used for high-volume matching plans.

Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and writes data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how quickly data can be read from, and written to, the hard disk. Information that cannot be stored in memory during plan execution must be temporarily written to the hard disk. This increases the time required to retrieve information that otherwise could be stored in memory, and also increases the load on the hard disk. A RAID drive may be appropriate for datasets of 3 to 4 million records, and a minimum of 512MB of memory should be available.

The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms. Specifications for UNIX-based systems vary.

Match volumes: < 1,500,000 records
Suggested hardware specification: 1.5 GHz computer, 512MB RAM

Match volumes: 1,500,000 to 3 million records
Suggested hardware specification: Multi-processor server, 1GB RAM

Match volumes: > 3 million records
Suggested hardware specification: Multi-processor server, 2GB RAM, RAID 5 hard disk

Single Processor vs. Multi-Processor


With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based or database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware, however, that this requires additional effort to create the groups and consolidate the match output. Also, matching plans split across four processors do not run four times faster than a single-processor matching plan. As a result, multi-processor matching may not significantly improve performance in every case.

Using IDQ with PowerCenter and taking advantage of PowerCenter's partitioning capabilities may also improve throughput. This approach has the advantage that splitting plans into multiple independent plans is not typically required. The following table can help in estimating the execution time between a single and multi-processor match plan.

Plan Type: Standardization/grouping
Single processor: Depends on operations and size of data set. (Time equals Y)
Multi-processor: Single-processor time plus 20 percent. (Time equals Y * 1.20)

Plan Type: Matching
Single processor: Estimated 1 million comparisons a minute. (Time equals X)
Multi-processor: Time for single-processor matching divided by the number of processors (NP), plus 25 percent. (Time equals (X / NP) * 1.25)

For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor match plan should require approximately one hour and 12 minutes to group and standardize and two and one half hours to match. The time difference between a single- and multi-processor plan in this case would be more than five hours (i.e., nine hours for the single-processor plan versus approximately three hours and 42 minutes for the quad-processor plan).
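The estimation rules in the table can be expressed directly. The sketch below simply restates the rule-of-thumb formulas (the 20 percent and 25 percent overheads are the figures quoted above, not measured constants):

    # Rule-of-thumb estimate of multi-processor plan duration, in hours.
    def multiprocessor_estimate(grouping_hours, matching_hours, processors):
        grouping = grouping_hours * 1.20                       # Y * 1.20
        matching = (matching_hours / processors) * 1.25        # (X / NP) * 1.25
        return grouping + matching

    single_total = 1 + 8                                       # 9 hours on one processor
    quad_total = multiprocessor_estimate(1, 8, 4)              # 1.2 + 2.5 = 3.7 hours
    print(round(single_total - quad_total, 1))                 # saving of about 5.3 hours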

Deterministic vs. Probabilistic Comparisons


No best-practice research has yet been completed on which type of comparison is most effective at determining a match. Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for deterministic comparisons since they remove the burden of identifying a universal match threshold from the user. Bear in mind that IDQ supports deterministic matching operations only. However, IDQ's Weight Based Analyzer component lets plan designers calculate weighted match scores for matched fields.

Database vs. File-Based Matching


File-based matching and database matching perform essentially the same operations. The major differences between the two methods revolve around how data is stored and how the outputs can be manipulated after matching is complete. With regard to selecting one method or the other, there are no best practice recommendations, since the choice is largely defined by requirements. The following table outlines the strengths and weaknesses of each method:

Ease of implementation: File-based method - Easy to implement; Database method - Requires SQL knowledge
Performance: File-based method - Fastest method; Database method - Slower than file-based method
Space utilization: File-based method - Requires more hard-disk space; Database method - Lower hard-disk space requirement
Operating system restrictions: File-based method - Possible limit to number of groups that can be created; Database method - None
Ability to control/manipulate output: File-based method - Low; Database method - High

High-Volume Data Matching Techniques


This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single and multi-processor environments.

Checking for duplicate records where no clear connection exists among data elements is a resource-intensive activity. In order to detect matching information, a record must be compared against every other record in a dataset. For a single data source, the quantity of comparisons required to check an entire dataset increases geometrically as the volume of data increases. A similar situation arises when matching between two datasets, where the number of comparisons required is a multiple of the volumes of data in each dataset. When the volume of data increases into the tens of millions, the number of comparisons required to identify matches (and, consequently, the amount of time required to check for matches) reaches impractical levels.

Approaches to High-Volume Matching


Two key factors control the time it takes to match a dataset:
- The number of comparisons required to check the data.
- The number of comparisons that can be performed per minute.

The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of records outside of the group. Grouping data greatly reduces the total number of required comparisons without affecting match accuracy. IDQ affects the number of comparisons per minute in two ways:
- Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute. Therefore, hardware with higher processor speeds has higher match throughputs.
- IDQ architecture also allows matching tasks to be broken into smaller tasks and shared across multiple processors. The use of multiple processors to handle matching operations greatly enhances IDQ scalability with regard to high-volume matching problems.

The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.

Multi-Processor Matching: Solution Overview


IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take advantage of a multi-processor environment, the plan designer must develop multiple plans for execution in parallel. To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the plan being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and the plans are executed in parallel.

The following diagram outlines how multi-processor matching can be implemented in a database model. Source data is first grouped and then subgrouped according to the number of processors available to the job. Each subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against each table. Results from each plan are consolidated to generate a single match result for the original source data.

Informatica Corporation Match Plan Tests


Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows 2003 (Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors effectively provided four CPUs on which to run the tests. Several tests were performed using file-based and database-based matching methods and single and multiple processor methods. The tests were performed on one million rows of data. Grouping of the data limited the total number of comparisons to approximately 500,000,000.

Test results using file-based and database-based methods showed near-linear scalability as the number of available processors increased. As the number of processors increased, so too did the demand on disk I/O resources. As the processor capacity began to scale upward, disk I/O in this configuration eventually limited the benefits of adding additional processor capacity. This is demonstrated in the graph below.


Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.
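One simple way to approach the even-distribution problem is a greedy assignment of groups to processors by estimated comparison load. The sketch below is illustrative only; in the solution described above, the subgrouping was performed in the staging database:

    import heapq

    def assign_groups(group_sizes, processors):
        # Greedily assign groups to processors, largest comparison load first,
        # so that the parallel match plans finish at roughly the same time.
        loads = [(0, p, []) for p in range(processors)]   # (total comparisons, processor id, groups)
        heapq.heapify(loads)
        for size in sorted(group_sizes, reverse=True):
            cost = size * (size - 1) // 2                 # comparisons this group will require
            total, p, groups = heapq.heappop(loads)       # least-loaded processor so far
            groups.append(size)
            heapq.heappush(loads, (total + cost, p, groups))
        return loads

    # Example: group sizes of varying scale spread across four processors.
    for total, p, groups in sorted(assign_groups([5000, 4000, 3000, 2000, 1000, 500], 4)):
        print(p, total, groups)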

Last updated: 26-May-08 17:52


Effective Data Standardizing Techniques

Challenge


To enable users to streamline their data cleansing and standardization processes (or plans) with Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a consistent and methodological approach to cleansing and standardizing project data.

Description
Data cleansing refers to operations that remove non-relevant information and noise from the content of the data. Examples of cleansing operations include the removal of person names, "care of" information, excess character spaces, or punctuation from postal address data. Data standardization refers to operations related to modifying the appearance of the data, so that it takes on a more uniform structure, and to enriching the data by deriving additional details from existing content.

Cleansing and Standardization Operations


Data can be transformed into a standard format appropriate for its business type. This is typically performed on complex data types such as name and address or product data. A data standardization operation typically profiles data by type (e.g., word, number, code) and parses data strings into discrete components. This reveals the content of the elements within the data as well as standardizing the data itself. For best results, the Data Quality Developer should carry out these steps in consultation with a member of the business. Often, this individual is the data steward, the person who best understands the nature of the data within the business scenario.

Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the correct fields. However, when using the Profile Standardizer, be aware that there is a finite number of profiles (500) that can be contained within a cleansing plan. Users can extend the number of profiles by using the first 500 profiles within one component and then feeding the data overflow into a second Profile Standardizer via the Token Parser component.

After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to further standardize the data. It may take several iterations of dictionary construction and review before the data is standardized to an acceptable level. Once acceptable standardization has been achieved, data quality scorecard or dashboard reporting can be introduced. For information on dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User Guide.


Discovering Business Rules


At this point, the business user may discover and define business rules applicable to the data. These rules should be documented and converted to logic that can be contained within a data quality plan. When building a data quality plan, be sure to group related business rules together in a single rules component whenever possible; otherwise the plan may become very difficult to read. If there are rules that do not lend themselves easily to regular IDQ components (i.e., when standardizing product data information), it may be necessary to perform some custom scripting using IDQ's Scripting component. This requirement may arise when a string or an element within a string needs to be treated as an array.

Standard and Third-Party Reference Data


Reference data can be a useful tool when standardizing data. Terms with variant formats or spellings can be standardized to a single form. IDQ installs with several reference dictionary files that cover common name and address and business terms. The illustration below shows part of a dictionary of street address suffixes.

Common Issues when Cleansing and Standardizing Data


If the customer has expectations of a bureau-style service, it may be advisable to re-emphasize the score-carding and graded-data approach to cleansing and standardizing. This helps to ensure that the customer develops reasonable expectations of what can be achieved with the data set within an agreed-upon timeframe.


Standardizing Ambiguous Data


Data values can often appear ambiguous, particularly in name and address data where name, address, and premise values can be interchangeable. For example, Hill, Park, and Church are all common surnames. In some cases, the position of the value is important. "St" can be a suffix for Street or a prefix for Saint, and sometimes both can occur in the same string. The address string "St Patricks Church, Main St" can reasonably be interpreted as "Saint Patricks Church, Main Street". In this case, if the delimiter is a space (thus ignoring any commas and periods), the string has five tokens. You may need to write business rules using the IDQ Scripting component, as you are treating the string as an array. "St" at position 1 within the string would be standardized to meaning_1, whereas "St" at position 5 would be standardized to meaning_2. Each data value can then be compared to a discrete prefix and suffix dictionary.
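A sketch of this position-based rule outside of IDQ's Scripting component (the token positions, dictionaries, and function are illustrative only):

    # Standardize "ST" differently depending on its position in the address string:
    # a leading ST is treated as "Saint", a trailing ST as "Street".
    PREFIX_DICT = {"ST": "SAINT"}
    SUFFIX_DICT = {"ST": "STREET"}

    def standardize(address):
        tokens = address.replace(",", "").replace(".", "").upper().split()
        out = []
        for i, token in enumerate(tokens):
            if i == 0:
                out.append(PREFIX_DICT.get(token, token))      # first token: prefix dictionary
            elif i == len(tokens) - 1:
                out.append(SUFFIX_DICT.get(token, token))      # last token: suffix dictionary
            else:
                out.append(token)
        return " ".join(out).title()

    print(standardize("St Patricks Church, Main St"))
    # -> "Saint Patricks Church Main Street"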

Conclusion
Using the data cleansing and standardization techniques described in this Best Practice can help an organization to recognize the value of incorporating IDQ into its development methodology. Because data quality is an iterative process, the business rules initially developed may require ongoing modification, as the results produced by IDQ will be affected by the starting condition of the data and the requirements of the business users.

When data arrives in multiple languages, it is worth creating similar IDQ plans for each country and applying the same rules across these plans. The data would typically be staged in a database, and the plans developed using a SQL statement as input, with a clause such as where country_code = 'DE', for example. Country dictionaries are identifiable by country code to facilitate such statements. Remember that IDQ installs with a large set of reference dictionaries, and additional dictionaries are available from Informatica.

IDQ provides several components that focus on verifying and correcting the accuracy of name and postal address data. These components leverage address reference data that originates from national postal carriers such as the United States Postal Service. Such datasets enable IDQ to validate an address to premise level. Please note that the reference datasets are licensed and installed as discrete Informatica products; it is therefore important to discuss their inclusion in the project with the business in advance so as to avoid budget and installation issues. Several types of reference data, with differing levels of address granularity, are available from Informatica. Pricing for the licensing of these components may vary and should be discussed with the Informatica Account Manager.

Last updated: 01-Feb-07 18:52


Integrating Data Quality Plans with PowerCenter

Challenge


This Best Practice outlines the steps to integrate an Informatica Data Quality (IDQ) plan into a PowerCenter mapping. This document assumes that the appropriate setup and configuration of IDQ and PowerCenter have been completed as part of the software installation process and these steps are not included in this document.

Description
Preparing IDQ Plans for PowerCenter Integration
IDQ plans are typically developed and tested by executing them from Workbench. Plans running locally from Workbench can use any of the available IDQ Source and Sink components. This is not true for plans that are integrated into PowerCenter, as they can only use Source and Sink components that contain the Enable Real-time processing check box; specifically, those components are CSV Source, CSV Match Source, CSV Sink, and CSV Match Sink. In addition, the Real-time Source and Sink can be used; however, they require additional setup, as each field name and length must be defined. Database sources and sinks are not allowed in PowerCenter integration. When IDQ plans are integrated within a PowerCenter mapping, the source and sink need to be enabled by setting the Enable Real-time processing option on them. Consider the following points when developing a plan for integration in PowerCenter.
- If the IDQ plan was developed using a database source and/or sink, you must replace them with CSV Sink/Source or CSV Match Sink/Source.
- If the IDQ plan was developed using a group sink/source (or dual group sink), you must replace them with either CSV Sink/Source or CSV Match Sink/Source, depending on the functionality you are replacing. When replacing a group sink, you also must add functionality to the PowerCenter mapping to replicate the grouping. This is done by placing a join and sort prior to the IDQ plan containing the match.
- PowerCenter only sees the input and output ports of the IDQ plan from within the PowerCenter mapping. This is driven by the input file used for the Workbench plan and the fields selected as output in the sink. If you don't see a field after the plan is integrated in PowerCenter, it means the field is not in the input file or not selected as output.
- PowerCenter integration does not allow input ports to be selected as output if the IDQ transformation is defined as a passive transformation. If the IDQ transformation is configured as active, this is not an issue, as you must select all fields needed as output from the IDQ transformation within the sink transformation of the IDQ plan. Passive and active IDQ transformations follow the general restrictions and rules for active and passive transformations in PowerCenter.
- The delimiter of the Source and Sink must be comma for integrated IDQ plans. Other delimiters such as pipe will cause an error within the PowerCenter Designer. If you encounter this error, go back to Workbench, change the delimiter to comma, save the plan, and then go back to PowerCenter Designer and perform the import of the plan again.
- For reusability of IDQ plans, use generic naming conventions for the input and output ports. For example, rather than naming fields Customer address1, customer address2, and customer city, name them address1, address2, city, etc. Thus, if the same standardization and cleansing is needed by multiple sources, you can integrate the same IDQ plan, which will reduce development time as well as ongoing maintenance.
- Use only necessary fields as input to each mapping plan. If you are working with an input file that has 50 fields and you only really need 10 fields for the IDQ plan, create a file that contains only the necessary field names, save it as a comma-delimited file, and then point to that newly created file from the source of the IDQ plan. This changes the input field reference to only those fields that must be visible in the PowerCenter integration.
- Once the source and sink are converted to real time, you cannot run the plan within Workbench, only within the PowerCenter mapping. However, you may change the check box at any time to revert to standalone processing. Be careful not to refresh the IDQ plan in the mapping within PowerCenter while real time is not enabled. If you do so, the PowerCenter mapping will display an error message and will not allow that mapping to be integrated until the Runtime enable is active again.

Integrating IDQ Plans into PowerCenter Mappings


After the IDQ plans are converted to be real-time enabled, they are ready to integrate into a PowerCenter mapping. Integrating into PowerCenter requires proper installation and configuration of the IDQ/PowerCenter integration, including:
- Making appropriate changes to environment variables (to .profile for UNIX)
- Installing IDQ on the PowerCenter server
- Running IDQ Integration and Content install on the server
- Registering IDQ plug-in via the PowerCenter Admin console

Note: The plug-in must be registered in each repository from which an IDQ transformation is to be developed.

- Installing IDQ Workbench on the workstation
- Installing IDQ Integration and Content on the workstation using the PowerCenter Designer

When all of the above steps are executed correctly, the IDQ transformation icon, shown below, is visible in the PowerCenter repository.

To integrate an IDQ plan, open the mapping, and click on the IDQ icon. Then click in the mapping workspace to insert the transformation into the mapping. The following dialog box appears:

Select Active or Passive, as appropriate. Typically, an active transformation is necessary only for a matching plan. If selecting Active, IDQ plan input needs to have all input fields passed through, as typical PowerCenter rules apply to Active and Passive transformation processing. As the following figure illustrates, the IDQ transformation is empty in its initial, unconfigured state. Notice all ports are currently blank; they will be populated upon import/integration of the IDQ plan.

Double-click on the title bar for the IDQ transformation to open it for editing.


Then select the far right tab, Configuration.

When first integrating an IDQ plan, the connection and repository displays are blank. Click the Connect button to establish a connection to the appropriate IDQ repository.

In the Host Name box, specify the name of the computer on which the IDQ repository is installed. This is usually the PowerCenter server. If the default Port Number (3306) was changed during installation, specify the correct value. Next, click Test Connection.

Note: In some cases, if the User Name has not been granted privileges on the Host server, you will not be allowed to connect. The procedure for granting privileges to the IDQ (MySQL) repository is explained at the end of this document. When the connection is established, click the down arrow to the right of the Plan Name box, and the following dialog is displayed:

Browse to the plan you want to import, then click on the Validate button. If there is an error in the plan, a dialog box appears. For example, if the Source and Sink have not been configured correctly, the following dialog box appears.

If the plan is valid for PowerCenter integration, the following dialog is displayed.


After a valid plan has been configured, the PowerCenter ports (equivalent to the IDQ Source and Sink fields) are visible and can be connected just as in any other PowerCenter transformation.

Refreshing IDQ Plans for PowerCenter Integration


After Data Quality Plans are integrated in PowerCenter, changes made to the IDQ plan in Workbench are not reflected in the PowerCenter mapping until the plan is manually refreshed in the PowerCenter mapping. When you save an IDQ plan, it is saved in the MySQL repository. When you integrate that plan into PowerCenter, a copy of that plan is then integrated in the PowerCenter metadata; the MySQL repository and the PowerCenter repository do not communicate updates automatically. The following paragraphs detail the process for refreshing integrated IDQ plans when necessary to reflect changes made in workbench.
- Double-click on the IDQ transformation in the PowerCenter mapping.
- Select the Configurations tab.
- Select Refresh. This reads the current version of the plan and refreshes it within PowerCenter.
- Select Apply. If any PowerCenter-specific errors were created when the plan was modified, an error dialog is displayed.


Update input, output, and pass-through ports as necessary, then save the mapping in PowerCenter, and test the changes.

Saving IDQ Plans to the Appropriate Repository - MySQL Permissions


Plans that are to be integrated into PowerCenter mappings must be saved to an IDQ Repository that is visible to the PowerCenter Designer prior to integration. The usual practice is to save the plan to the IDQ repository located on the PowerCenter server.

In order for a Workbench client to save a plan to that repository, the client machine must be granted permissions to the MySQL on the server. If the client machine has not been granted access, the client receives an error message when attempting to access the server repository. The person at your organization who has login rights to the server on which IDQ is installed needs to perform this task for all users who will need to save or retrieve plans from the IDQ Server. This procedure is detailed below.
- Identify the IP address for any client machine that needs to be granted access.
- Log in to the server on which the MySQL repository is located and log in to MySQL:

mysql -u root

For a user to connect to the IDQ server, save and retrieve plans, enter the following command:

grant all privileges on *.* to admin@<idq_client_ip>

For a user to integrate an IDQ plan into PowerCenter, grant the following privilege:


grant all privileges on *.* to root@<powercenter_client_ip>

Last updated: 20-May-08 23:18


Managing Internal and External Reference Data

Challenge


To provide guidelines for the development and management of the reference data sources that can be used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition from development to production for reference data files and the plans with which they are associated.

Description
Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan. A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those terms. It may be a list of employees, package measurements, or valid postal addresses: any data set that provides an objective reference against which project data sources can be checked or corrected. Reference files are essential to some, but not all, data quality processes.

Reference data can be internal or external in origin. Internal data is specific to a particular project or client. Such data is typically generated from internal company information. It may be custom-built for the project. External data has been sourced or purchased from outside the organization. External data is used when authoritative, independently-verified data is needed to provide the desired level of data quality to a particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal address data sets that have been verified as current and complete by a national postal carrier such as the United States Postal Service, or company registration and identification information from an industry-standard source such as Dun & Bradstreet.

Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that requires intermediary (third-party) software in order to be read by Informatica applications. Internal data files, as they are often created specifically for data quality projects, are typically saved in the dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases can also be used as a source for internal data. External files are more likely to remain in their original format. For example, external data may be contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal discrete data values.

Working with Internal Data Obtaining Reference Data


Most organizations already possess much information that can be used as reference data, for example, employee tax numbers or customer names. These forms of data may or may not be part of the project source data, and they may be stored in different parts of the organization.

The question arises: are internal data sources sufficiently reliable for use as reference? Bear in mind that in some cases the reference data does not need to be 100 percent accurate. It can be good enough to compare project data against reference data and to flag inconsistencies between them, particularly in cases where both sets of data are highly unlikely to share common errors.

Saving the Data in .DIC File Format


IDQ installs with a set of reference dictionaries that have been created to handle many types of business data. These dictionaries are created using a proprietary .DIC file name extension. DIC is abbreviated from dictionary, and dictionary files are essentially comma delimited text files. You can create a new dictionary in three ways:
- You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of your IDQ (client or server) installation.
- You can use the Dictionary Manager within Data Quality Workbench. This method allows you to create text and database dictionaries.
- You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).

The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct or standardized form of each datum from the dictionary's perspective. The Item columns contain versions of each datum that the dictionary recognizes as identical to or coterminous with the Label entry. Therefore, each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration below). A dictionary can have multiple Item columns.
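For illustration only (the values below are invented, and the exact column layout should be checked against the dictionaries installed with the product), a few lines of a street-suffix dictionary might look like this, with the Label value first and one or more Item variants following it:

    Street,St,Str
    Avenue,Ave,Av
    Boulevard,Blvd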

To edit a dictionary value, open the DIC file and make your changes. You can make changes either through a text editor or by opening the dictionary in the Dictionary Manager.


To add a value to a dictionary, open the DIC file in Dictionary Manager, place the cursor in an empty row, and add a Label string and at least one Item string. You can also add values in a text editor by placing the cursor on a new line and typing Label and Item values separated by commas. Once saved, the dictionary is ready for use in IDQ.

Note: IDQ users with database expertise can create and specify dictionaries that are linked to database tables, and thus can be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data has been originated for other purposes and is likely to change independently of data quality. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data.

Sharing Reference Data Across the Organization


Just as you can publish or export plans from a local Data Quality repository to server repositories, so you can copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like mechanism for moving files to other machines across the network. Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when running a plan. By default, Data Quality relies on dictionaries being located in the following locations:

- The Dictionaries folders installed with Workbench and Server.
- The user's file space in the Data Quality service domain.

IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will fail. This is most relevant when you publish or export a plan to another machine on the network. You must ensure that copies of any dictionary files used in the local plan are available in a suitable location on the service domain (in the user space on the server, or at a location in the server's Dictionaries folders that corresponds to the dictionaries' location on Workbench) when the plan is copied to the server-side repository.

Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file. However, this is the master configuration file for the product and you should not edit it without consulting Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.

Version Controlling Updates and Managing Rollout from Development to Production


Plans can be version-controlled during development in Workbench and when published to a domain repository. You can create and annotate multiple versions of a plan, and review or roll back to earlier versions when necessary. Dictionary files are not version controlled by IDQ, however. You should define a process to log changes and back up your dictionaries, using version control software if possible or a manual method otherwise.

If modifications are to be made to the versions of dictionary files installed by the software, it is recommended that these modifications be made to a copy of the original file, renamed or relocated as desired. This approach avoids the risk that a subsequent installation might overwrite changes.
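As one possible manual method, a short script can snapshot the dictionary folder before changes are made. This is only a sketch; the folder paths are hypothetical and should be adjusted to your installation, and version control software remains the preferred approach where available.

# Minimal sketch of a manual dictionary backup (paths are hypothetical).
import shutil
from datetime import datetime
from pathlib import Path

DICT_DIR = Path("C:/IDQ/Dictionaries")          # assumed dictionary location
BACKUP_ROOT = Path("C:/IDQ/DictionaryBackups")  # assumed backup location

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
target = BACKUP_ROOT / f"dictionaries_{stamp}"
shutil.copytree(DICT_DIR, target)               # copy the whole folder as a snapshot
print(f"Dictionaries backed up to {target}")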

Database reference data can also be version controlled, although this presents difficulties if the database is very large in size. Bear in mind that third-party reference data, such as postal address data, should not ordinarily be changed, and so the need for a versioning strategy for these files is debatable.

Working with External Data

Formatting Data into Dictionary Format


External data may or may not permit the copying of data into text format: for example, external data contained in a database or in library files. Currently, third-party postal address validation data is provided to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The third-party software has a very small footprint.) However, some software files can be amenable to data extraction to file.

Obtaining Updates for External Reference Data


External data vendors produce regular data updates, and it's vital to refresh your external reference data when updates become available. The key advantage of external data, its reliability, is lost if you do not apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept up to date with the latest data as it becomes available for as long as your data subscription warrants. You can check that you possess the latest versions of third-party data by contacting your Informatica Account Manager.

Managing Reference Updates and Rolling Out Across the Organization


If your organization has a reference data subscription, you will receive either regular data files on compact disc or regular information on how to download data from Informatica or vendor web sites. You must develop a strategy for distributing these updates to all parties who run plans with the external data. This may involve installing the data on machines in a service domain.

Bear in mind that postal address data vendors update their offerings every two or three months, and that a significant percentage of postal addresses can change in such time periods. You should plan for the task of obtaining and distributing updates in your organization at frequent intervals. Depending on the number of IDQ installations that must be updated, updating your organization with third-party reference data can be a sizable task.

Strategies for Managing Internal and External Reference Data


Experience working with reference data leads to a series of best practice tips for creating and managing reference data files.

Using Workbench to Build Dictionaries


With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-compatible format.


Let's say you have designed a data quality plan that identifies invalid or anomalous records in a customer database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file to create a dictionary-compatible file.

For example, let's say you have an exception file containing suspect or invalid customer account records. Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a new text file containing the account serial numbers only. This file effectively constitutes the Label column of your dictionary. By opening this file in Microsoft Excel or a comparable program, copying the contents of Column A into Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns. Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account numbers that you can use in any plans checking the validity of the organization's account records.
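The spreadsheet steps described above can also be scripted. The following Python sketch is offered only as an illustration of the Label/Item duplication just described; the file names, the assumption of one account number per input line, and the simple comma-separated output are hypothetical and should be adapted to your own data and to the dictionary layout used by your installation.

# Illustrative sketch: build a dictionary-style file of "bad" account numbers.
INPUT_FILE = "bad_account_numbers.txt"   # output of the parsing plan (assumed name)
OUTPUT_FILE = "bad_accounts.dic"         # dictionary-style file to create (assumed name)

with open(INPUT_FILE, "r", encoding="utf-8") as src, \
     open(OUTPUT_FILE, "w", encoding="utf-8") as dic:
    seen = set()
    for line in src:
        account = line.strip()
        if not account or account in seen:
            continue                     # skip blank lines and duplicates
        seen.add(account)
        # Label and Item1 carry the same value, mirroring the
        # "copy Column A into Column B" spreadsheet step above.
        dic.write(f"{account},{account}\n")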

Using Report Viewer to Build Dictionaries


The IDQ Report Viewer allows you to create exception files and dictionaries on-the-fly from report data. The figure below illustrates how you can drill down into report data, right-click on a column, and save the column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to the column data. In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically, records containing bad zip codes). The plan designer can now create plans to check customer databases against these serial numbers. You can also append data to an existing dictionary file in this manner.

As a general rule, it is a best practice to follow the dictionary organization structure installed by the application, adding to that structure as necessary to accommodate specialized and supplemental dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible modifications, thereby lowering the risk of accidental errors during migration. When following the original dictionary organization structure is not practical or contravenes other requirements, take care to document the customizations.

Since external data may be obtained from third parties and may not be in file format, the most efficient way to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically, this is the machine that hosts the Execution Service.)

Moving Dictionary Files After IDQ Plans are Built


This is a similar issue to that of sharing reference data across the organization. If you must move or relocate your reference data files after plan development, you have three options:

- You can reset the location to which IDQ looks by default for dictionary files.
- You can reconfigure the plan components that employ the dictionaries to point to the new location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
- If deploying plans in a batch or scheduled task, you can append the new location to the plan execution command. You can do this by appending a parameter file to the plan execution instructions on the command line. The parameter file is an XML file that can contain a simple command to use one file path instead of another.

Last updated: 08-Feb-07 17:09


Real-Time Matching Using PowerCenter

Challenge

This Best Practice describes the rationale for matching in real time, along with the concepts and strategies used in planning for and developing a real-time matching solution. It also provides step-by-step instructions on how to build this process using Informatica's PowerCenter and Data Quality.

The cheapest and most effective way to eliminate duplicate records from a system is to prevent them from ever being entered in the first place. Whether the data is coming from a website, an application entry, EDI feeds, messages on a queue, changes captured from a database, or other common data feeds, matching these records against the master data that already exists allows only the new, unique records to be added.
Benefits of preventing duplicate records include:

- Better ability to service customers, with the most accurate and complete information readily available
- Reduced risk of fraud or over-exposure
- Trusted information at the source
- Less effort in BI, data warehouse, and/or migration projects

Description
Performing effective real-time matching involves multiple puzzle pieces:

1. There is a master data set (or possibly multiple master data sets) that contains clean and unique customers, prospects, suppliers, products, and/or many other types of data.
2. To interact with the master data set, there is an incoming transaction, typically thought to be a new item. This transaction can be anything from a new customer signing up on the web to a list of new products; it is anything that is assumed to be new and intended to be added to the master.
3. There must be a process to determine if a new item really is new or if it already exists within the master data set.

In a perfect world of consistent ids, spellings, and representations of data across all companies and systems, checking for duplicates would simply be some sort of exact lookup into the master to see if the item already exists. Unfortunately, this is not the case, and even being creative and using %LIKE% syntax does not provide thorough results. For example, comparing "Bob" to "Robert" or "GRN" to "Green" requires a more sophisticated approach, as illustrated below.
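To make that point concrete, the following Python sketch contrasts an exact lookup with a slightly more forgiving comparison that expands a small nickname table and normalizes case. It is purely illustrative: the nickname entries are hypothetical, and the logic is far simpler than the matching performed by IDQ.

# Illustration only: why exact (or LIKE-style) lookups miss real-world duplicates.

# Hypothetical nickname equivalences; a real solution uses managed reference data.
NICKNAMES = {"bob": "robert", "bill": "william", "peggy": "margaret"}

def normalize(name: str) -> str:
    """Lower-case, trim, and expand a known nickname to its formal form."""
    token = name.strip().lower()
    return NICKNAMES.get(token, token)

master_first_names = ["Robert", "William", "Margaret"]
incoming = "Bob"

# An exact lookup fails even though the person is almost certainly the same.
print(incoming in master_first_names)                                              # False

# Comparing normalized forms finds the likely duplicate.
print(any(normalize(incoming) == normalize(name) for name in master_first_names))  # True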

Standardizing Data in Advance of Matching


The first prerequisite for successful matching is to cleanse and standardize the master data set. This process requires well-defined rules for important attributes. Applying these rules to the data should result in complete, consistent, conformant, valid data, which really means trusted data. These rules should also be reusable so they can be used with the incoming transaction data prior to matching. The more compromises made in the quality of master data by failing to cleanse and standardize, the more effort will need to be put into the matching logic, and the less value the organization will derive from it. There will be many more chances of missed matches allowing duplicates to enter the system.

Once the master data is cleansed, the next step is to develop criteria for candidate selection. For efficient matching, there is no need to compare records that are so dissimilar that they cannot meet the business rules for matching. On the other hand, the set of candidates must be sufficiently broad to minimize the chance that similar records will not be compared. For example, when matching consumer data on name and address, it may be sensible to limit the candidate records pulled to those having the same zip code and the same first letter of the last name, because we can reason that if those elements are different between two records, those two records will not match.

There also may be cases where multiple candidate sets are needed. This would be the case if there are multiple sets of match rules that the two records will be compared against. Adding to the previous example, think of matching on name and address for one set of match rules and name and phone for a second. This would require selecting records from the master that have the same phone number and first letter of the last name.

Once the candidate selection process is resolved, the matching logic can be developed. This can consist of matching one to many elements of the input record to each candidate pulled from the master. Once the data is compared, each pair of records (one input and one candidate) will have a match score or a series of match scores. Scores below a certain threshold can then be discarded and potential matches can be output or displayed.

The full real-time match process flow includes the following steps (a simplified sketch of this flow appears after the list):

1. The input record comes into the server.
2. The server standardizes the incoming record and retrieves candidate records from the master data source that could match the incoming record.
3. Match pairs are then generated, one for each candidate, consisting of the incoming record and the candidate.
4. The match pairs go through the matching logic, resulting in a match score.
5. Records with a match score below a given threshold are discarded.
6. The returned result set consists of the candidates that are potential matches to the incoming record.
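The following Python sketch is a simplified picture of steps 3 through 6: pair the incoming record with each candidate, score the pairs, and keep only the pairs above a threshold. Everything in it is hypothetical; the similarity measure (a simple ratio from Python's standard library), the field weights, and the 0.8 threshold merely stand in for the match plan logic and tuned weights you would define in IDQ.

# Conceptual sketch: pair the incoming record with candidates, score, and filter.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity between 0.0 and 1.0; stands in for real match components."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical field weights; in practice these come from the tuned match plan.
WEIGHTS = {"last_name": 0.4, "first_name": 0.2, "street": 0.4}
THRESHOLD = 0.8

incoming = {"first_name": "Bob", "last_name": "Smith", "street": "100 Cardinal Way"}
candidates = [
    {"id": 1, "first_name": "Robert", "last_name": "Smith", "street": "100 Cardinal Wy"},
    {"id": 2, "first_name": "Karen",  "last_name": "Smyth", "street": "12 Elm St"},
]

def score(record_a: dict, record_b: dict) -> float:
    """Weighted sum of field similarities for one match pair."""
    return sum(w * similarity(record_a[f], record_b[f]) for f, w in WEIGHTS.items())

potential_matches = []
for candidate in candidates:                 # one match pair per candidate
    pair_score = round(score(incoming, candidate), 2)
    if pair_score >= THRESHOLD:              # discard pairs below the threshold
        potential_matches.append((candidate["id"], pair_score))

print(potential_matches)   # candidate 1 is returned; candidate 2 falls below the threshold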

Developing an Effective Candidate Selection Strategy


Determining which records from the master should be compared with the incoming record is a critical decision in an effective real-time matching system. For most organizations it is not realistic to match an incoming record to all master records. Consider even a modest customer master data set with one million records; the amount of processing, and thus the wait in real time, would be unacceptable.

Candidate selection for real-time matching is synonymous with grouping or blocking for batch matching. The goal of candidate selection is to select only that subset of the records from the master that are definitively related by a field, part of a field, or combination of multiple parts/fields. The selection is done using a candidate key or group key. Ideally this key would be constructed and stored in an indexed field within the master table(s), allowing for the quickest retrieval. There are many instances where multiple keys are used to allow for one key to be missing or different, while another pulls in the record as a candidate.

What specific data elements the candidate key should consist of very much depends on the scenario and the match rules. The one common theme with candidate keys is that the data elements used should have the highest levels of completeness and validity possible. It is also best to use elements that can be verified as valid, such as a postal code or a National ID. The table below lists multiple common matching elements and how group keys could be used around the data.

The ideal size of the candidate record sets, for sub-second response times, should be under 300 records. For acceptable two- to three-second response times, candidate record counts should be kept under 5000 records.
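As a concrete illustration of the address-based grouping strategy referred to above, the following Python sketch builds a candidate key from the first three characters of the zip code, the house number, and the first letter of the street name. The field names and the key recipe are assumptions for illustration only; whatever recipe you choose, it must be applied identically when loading the master and when processing the incoming record.

# Illustrative candidate (group) key: zip3 + house number + first letter of street name.
def candidate_key(zipcode: str, house_number: str, street_name: str) -> str:
    """Build the key the same way for the master load and for the incoming record."""
    return (zipcode or "")[:3] + (house_number or "") + (street_name or "")[:1].upper()

# These two representations of the same address land in the same candidate set...
print(candidate_key("94063", "100", "Cardinal"))         # '940100C'
print(candidate_key("94063-1234", "100", "CARDINAL"))    # '940100C'

# ...while an address in a different zip code does not.
print(candidate_key("10019", "100", "Cardinal"))         # '100100C'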

Step by Step Development


The following instructions further explain the steps for building a real-time matching solution using the Informatica suite. They involve the following applications:

- Informatica PowerCenter 8.5.1, utilizing the Web Services Hub
- Informatica Data Explorer 5.0 SP4
- Informatica Data Quality 8.5 SP1, utilizing the North American Country Pack
- SQL Server 2000

Scenario:
- A customer master file is provided with the following structure.
- In this scenario, we are performing a name and address match.
- Because address is part of the match, we will use the recommended address grouping strategy for our candidate key (see Table 1).
- The desire is that different applications from the business will be able to make a web service call to determine if the data entry represents a new customer or an existing customer.

Solution:

1. The first step is to analyze the customer master file. Assume that this analysis shows the postcode field is complete for all records and that the majority of it is of high accuracy. Assume also that neither the first name nor the last name field is completely populated; thus the match rules must account for blank names.

2. The next step is to load the customer master file into the database. Below is a list of tasks that should be implemented in the mapping that loads the customer master data into the database:
- Standardize and validate the address, outputting the discrete address components such as house number, street name, street type, directional, and suite number. (A pre-built mapplet from the country pack does this.)
- Generate the candidate key field, populate it with the selected strategy (assume it is the first 3 characters of the zip, the house number, and the first character of the street name), and generate an index on that field. (Use an Expression transformation on the output of the previous mapplet; hint: substr(in_ZIPCODE, 0, 3)|| in_HOUSE_NUMBER||substr(in_STREET_NAME, 0, 1))
- Standardize the phone number. (A pre-built mapplet from the country pack does this.)
- Parse the name field into individual fields. Although the data structure indicates names are already parsed into first, middle, and last, assume there are examples where the names are not properly fielded. Also remember to output a value to handle nicknames. (A pre-built mapplet from the country pack does this.)

Once complete, your customer master table should look something like this:


3. Now that the customer master has been loaded, a Web Service mapping must be created to handle real-time matching. For this project, assume that the incoming record will include a full name field, address, city, state, zip, and a phone number. All fields will be free-form text. Since we are providing the Service, we will be using a Web Service Provider source and target. Follow these steps to build the source and target definitions.
- Within PowerCenter Designer, go to the Source Analyzer and select the Source menu. From there select Web Service Provider and then Create Web Service Definition.

You will see a screen like the one below where the Service can be named and input and output ports can be created. Since this is a matching scenario, the potential that multiple records will be returned must be taken into account. Select the Multiple Occurring Elements checkbox for the output ports section. Also add a match score output field to return the percentage at which the input record matches the different potential matching records from the master.


Both the source and target should now be present in the project folder.

4. An IDQ match plan must be built for use within the mapping. In developing a plan for real time, using a CSV source and a CSV sink, both enabled for real-time processing, is the most significant difference from a similar match plan designed for use in IDQ standalone. The source will have the _1 and _2 fields that a Group Source would supply built into it, e.g., Firstname_1 and Firstname_2. Another difference from batch matching in PowerCenter is that the DQ transformation can be set to passive. The following steps illustrate converting the North America Country Pack's Individual Name and Address Match plan from a plan built for use in a batch mapping to a plan built for use in a real-time mapping.
- Open the DCM_NorthAmerica project and, from within the Match folder, make a copy of the Individual Name and Address Match plan. Rename it to RT Individual Name and Address Match.
- Create a new stub CSV file with only the header row. This will be used to generate a new CSV Source within the plan. This header must use all of the input fields used by the plan before modification. The header for the stub file will duplicate all of the fields, with one set having a suffix of _1 and the other _2. For convenience, a sample stub header is listed below.

IN_GROUP_KEY_1,IN_FIRSTNAME_1,IN_FIRSTNAME_ALT_1, IN_MIDNAME_1,IN_LASTNAME_1,IN_POSTNAME_1, IN_HOUSE_NUM_1,IN_STREET_NAME_1,IN_DIRECTIONAL_1, IN_ADDRESS2_1,IN_SUITE_NUM_1,IN_CITY_1,IN_STATE_1, IN_POSTAL_CODE_1,IN_GROUP_KEY_2,IN_FIRSTNAME_2, IN_FIRSTNAME_ALT_2,IN_MIDNAME_2,IN_LASTNAME_2, IN_POSTNAME_2,IN_HOUSE_NUM_2,IN_STREET_NAME_2, IN_DIRECTIONAL_2,IN_ADDRESS2_2,IN_CITY_2,IN_STATE_2, IN_POSTAL_CODE_2


- Now delete the CSV Match Source from the plan, add a new CSV Source, and point it at the new stub file. Because the components were originally mapped to the CSV Match Source and that was deleted, the fields within your plan need to be reselected. As you open the different match components and RBAs, you can see the different instances that need to be reselected, as they appear with a red diamond, as seen below.

- Also delete the CSV Match Sink and replace it with a CSV Sink. Only the match score field(s) must be selected for output. This plan will be imported into a passive transformation; consequently, data can be passed around it and does not need to be carried through the transformation. With this implementation you can output multiple match scores, so it is possible to see why two records matched or didn't match on a field-by-field basis. Select the check box for Enable Real-time Processing in both the source and the sink, and the plan will be ready to be imported into PowerCenter.

5. The mapping will consist of:

a. The source and target previously generated
b. An IDQ transformation importing the plan just built
c. The same IDQ cleansing and standardization transformations used to load the master data (refer to step 2 for specifics)
d. An Expression transformation to generate the group key and build a single directional field
e. A SQL transformation to get the candidate records from the master table
f. A Filter transformation to filter out records whose match score falls below a certain threshold
g. A Sequence Generator transformation to build a unique key for each matching record returned in the SOAP response

Within PowerCenter Designer, create a new mapping and drag the web service source and target previously created into the mapping. Add the following country pack mapplets to standardize and validate the incoming record from the web service:
r r r

mplt_dq_p_Personal_Name_Standardization_FML mplt_dq_p_USA_Address_Validation mplt_dq_p_USA_Phone_Standardization_Validation

Add an Expression transformation and build the candidate key from the Address Validation mapplet output fields. Remember to use the same logic as in the mapping that loaded the customer master. Also within the expression, concatenate the pre- and post-directional fields into a single directional field for matching purposes.

Add a SQL transformation to the mapping. The SQL transformation will present a dialog box with a few questions related to the SQL transformation. For this example select Query mode, MS SQL Server (change as desired), and a Static connection. For details on the other options refer to the PowerCenter help. Connect all necessary fields from the source qualifier, DQ mapplets, and Expression transformation to the SQL transformation. These fields should include:
- XPK_n4_Envelope (this is the Web Service message key)
- Parsed name elements
- Standardized and parsed address elements, which will be used for matching
- Standardized phone number

The next step is to build the query from within the SQL transformation to select the candidate records. Make sure that the output fields agree with the query in number, name, and type.
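The actual query depends on the master table and port names in your environment. The following is only a sketch of the general shape of a candidate-selection query, shown as a Python string for consistency with the other examples in this document; the table name, column names, and the placeholder for the bound candidate key are all assumptions, and the parameter-binding notation to use inside the SQL transformation is described in the PowerCenter documentation.

# Sketch only: the shape of a candidate-selection query (all names are assumptions).
CANDIDATE_QUERY = """
SELECT cust_id,
       first_name, middle_name, last_name,
       house_number, street_name, street_type, directional, suite_number,
       city, state, zip_code, phone_std,
       candidate_key
FROM   customer_master
WHERE  candidate_key = ?    -- bound to the key built for the incoming record
"""

# Conceptual usage with any DB-API connection (connection setup omitted):
# cursor.execute(CANDIDATE_QUERY, (incoming_candidate_key,))
# candidates = cursor.fetchall()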

The output of the SQL transform will be the incoming customer record along with the candidate record.

These will be stacked records where the Input/Output fields will represent the input record and the Output-only fields will represent the candidate record. A simple example of this is shown in the table below, where a single incoming record will be paired with two candidate records:

Comparing the new record to the candidates is done by embedding the IDQ plan converted in step 4 into the mapping through the use of the Data Quality transformation. When this transformation is created, select passive as the transformation type. The output of the Data Quality transformation will be a match score, a float value between 0.0 and 1.0.

Using a Filter transformation, all records that have a match score below a certain threshold will be filtered out. For this scenario, the cut-off will be 80%. (Hint: TO_FLOAT(out_match_score) >= .80) Any record coming out of the Filter transformation is a potential match that exceeds the specified threshold, and the record will be included in the response. Each of these records needs a new unique ID, so the Sequence Generator transformation is used.

To complete the mapping, the outputs of the Filter and Sequence Generator transformations need to be mapped to the target. Make sure to map the input primary key field (XPK_n4_Envelope_output) to the primary key field of the envelope group in the target (XPK_n4_Envelope) and to the foreign key of the response element group in the target (FK_n4_Envelope). Map the output of the Sequence Generator to the primary key field of the response element group. The mapping should look like this:


6. Before testing the mapping, create a workflow.


- Using the Workflow Manager, generate a new workflow and session for this mapping using all the defaults. Once created, edit the session task. On the Mapping tab, select the SQL transformation and make sure the connection type is relational. Also make sure to select the proper connection. For more advanced tweaking and web service settings, see the PowerCenter documentation.

The final step is to expose this workflow as a Web Service. This is done by editing the workflow and selecting the Enabled checkbox for Web Services. Once the Web Service is enabled, it should be configured. For all the specific details of this, please refer to the PowerCenter documentation, but for the purpose of this scenario:

a. Give the service the name you would like to see exposed to the outside world
b. Set the timeout to 30 seconds
c. Allow 2 concurrent runs
d. Set the workflow to be visible and runnable

7. The web service is ready for testing.


Testing Data Quality Plans

Challenge


To provide a guide for testing data quality processes or plans created using Informatica Data Quality (IDQ) and to manage some of the unique complexities associated with data quality plans.

Description
Testing data quality plans is an iterative process that occurs as part of the Design Phase of Velocity. Plan testing often precedes the project's main testing activities, as the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to formally test the plans used in the Analyze Phase of Velocity.

The development of data quality plans typically follows a prototyping methodology of create, execute, analyze. Testing is performed as part of the third step, in order to determine that the plans are being developed in accordance with design and project requirements. This method of iterative testing helps support rapid identification and resolution of bugs.

Bear in mind that data quality plans are designed to analyze and resolve data content issues. These are not typically cut-and-dried problems; more often they represent a continuum of data improvement issues where it is possible that every data instance is unique and there is a target level of data quality rather than a right or wrong answer. Data quality plans tend to resolve problems in terms of percentages and probabilities that a problem is fixed. For example, the project may set a target of 95 percent accuracy in its customer addresses. The acceptable level of inaccuracy is also likely to change over time, based upon the importance of a given data field to the underlying business process. As well, accuracy should continuously improve as the data quality rules are applied and the existing data sets adhere to a higher standard of quality.

Common Questions in Data Quality Plan Testing


- What dataset will you use to test the plans? While the ideal situation is to use a data set that exactly mimics the project production data, you may not gain access to this data. If you obtain a full cloned set of the project data for testing purposes, bear in mind that some plans (specifically some data matching plans) can take several hours to complete. Consider testing data matching plans overnight.
- Are the plans using reference dictionaries? Reference dictionary management is an important factor since it is possible to make changes to a reference dictionary independently of IDQ and without making any changes to the plan itself. When you pass an IDQ plan as tested, you must ensure that no additional work is carried out on any dictionaries referenced in the plan. Moreover, you must ensure that the dictionary files reside in locations that are valid for IDQ.
- How will the plans be executed? Will they be executed on a remote IDQ Server and/or via a scheduler? In cases like these, it's vital to ensure that your plan resources, including source data files and reference data files, are in valid locations for use by the Data Quality engine. For details on the local and remote locations to which IDQ looks for source and reference data files, refer to the Informatica Data Quality 8.5 User Guide.
- Will the plans be integrated into a PowerCenter transformation? If so, the plans must have real-time-enabled data source and sink components.

Strategies for Testing Data Quality Plans


The best practice steps for testing plans can be grouped under two headings.

Testing to Validate Rules


1. Identify a small, representative sample of source data.
2. To determine the results to expect when the plans are run, manually process the data based on the rules for profiling, standardization or matching that the plans will apply.
3. Execute the plans on the test dataset and validate the plan results against the manually-derived results (a sketch of how this comparison can be scripted follows the list).
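The comparison in step 3 can be made repeatable with a small script. The sketch below assumes that both the manually derived expected results and the plan output have been exported to CSV files keyed on a record identifier; the file names and the record_id column are hypothetical.

# Compare plan output against manually derived expected results (illustration only).
import csv

def load(path: str, key: str) -> dict:
    """Read a CSV file into a dict of rows keyed on the given column."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key]: row for row in csv.DictReader(f)}

expected = load("expected_results.csv", "record_id")    # hand-validated sample
actual = load("plan_output.csv", "record_id")           # exported plan results

mismatches = [
    (record_id, expected[record_id], actual.get(record_id))
    for record_id in expected
    if actual.get(record_id) != expected[record_id]
]

print(f"{len(expected) - len(mismatches)} of {len(expected)} records match")
for record_id, exp_row, act_row in mismatches:
    print(f"Record {record_id}: expected {exp_row}, got {act_row}")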

Testing to Validate Plan Effectiveness


This process is concerned with establishing that a data enhancement plan has been properly designed; that is, that the plan delivers the required improvements in data quality. This is largely a matter of comparing the business and project requirements for data quality and establishing whether the plans are on course to deliver these. If not, the plans may need a thorough redesign or the business and project targets may need to be revised. In either case, discussions should be held with the key business stakeholders to review the results of the IDQ plan and determine the appropriate course of action. In addition, once the entire data set is processed against the business rules, there may be other data anomalies that were unaccounted for and that may require additional modifications to the underlying business rules and IDQ plans.

Last updated: 05-Dec-07 16:02


Tuning Data Quality Plans

Challenge


This document describes the considerations and issues a user needs to be aware of when making changes to data quality processes defined in Informatica Data Quality (IDQ). In IDQ, data quality processes are called plans. The principal focus of this best practice is how to tune your plans without adversely affecting the plan logic. This best practice is not intended to replace training materials, but to serve as a guide for decision making in the areas of adding, removing, or changing the operational components that comprise a data quality plan.

Description
You should consider the following questions prior to making changes to a data quality plan:
- What is the purpose of changing the plan? You should consider changing a plan if you believe the plan is not optimally configured, the plan is not functioning properly and there is a problem at execution time, or the plan is not delivering expected results as per the plan design principles.
- Are you trained to change the plan? Data quality plans can be complex. You should not alter a plan unless you have been trained or are highly experienced with IDQ methodology.
- Is the plan properly documented? You should ensure all plan documentation on the data flow and the data components is up to date. For guidelines on documenting IDQ plans, see the Sample Deliverable Data Quality Plan Design.
- Have you backed up the plan before editing? If you are using IDQ in a client-server environment, you can create a baseline version of the plan using IDQ version control functionality. In addition, you should copy the plan to a new project folder (viz., Work_Folder) in the Workbench for changing and testing, and leave the original plan untouched during testing.
- Is the plan operating directly on production data? This applies especially to standardization plans. When editing a plan, always work on staged data (database or flat file). You can later migrate the plan to the production environment after complete and thorough testing.


You should have a clear goal whenever you plan to change an existing plan. An event may prompt the change: for example, input data changing (in format or content), or changes in business rules or business/project targets. You should take into account all current change-management procedures, and the updated plans should be thoroughly tested before production processes are updated. This includes integration and regression testing too. (See also Testing Data Quality Plans.) Bear in mind that at a high level there are two types of data quality plans: data analysis and data enhancement plans.
- Data analysis plans produce reports on data patterns and data quality across the input data. The key objective in data analysis is to determine the levels of completeness, conformity, and consistency in the dataset. In pursuing these objectives, data analysis plans can also identify cases of missing, inaccurate or noisy data.
- Data enhancement plans correct completeness, conformity and consistency problems; they can also identify duplicate data entries and fix accuracy issues through the use of reference data.

Your goal in a data analysis plan is to discover the quality and usability of your data. It is not necessarily your goal to obtain the best scores for your data. Your goal in a data enhancement plan is to resolve the data quality issues discovered in the data analysis.

Adding Components
In general, simply adding a component to a plan is not likely to directly affect results if no further changes are made to the plan. However, once the outputs from the new component are integrated into existing components, the data process flow is changed and the plan must be re-tested and results reviewed in detail before migrating the plan into production. Bear in mind, particularly in data analysis plans, that improved plan statistics do not always mean that the plan is performing better. It is possible to configure a plan that moves beyond the point of truth by focusing on certain data elements and excluding others.

When added to existing plans, some components have a larger impact than others. For example, adding a To Upper component to convert text into upper case may not cause the plan results to change meaningfully, although the presentation of the output data will change. However, adding and integrating a Rule Based Analyzer component (designed to apply business rules) may cause a severe impact, as the rules are likely to change the plan logic.

As well as adding a new component (that is, a new icon) to the plan, you can add a new instance to an existing component. This can have the same effect as adding and integrating a new component icon. To avoid overloading a plan with too many components, it is a good practice to add multiple instances to a single component, within reason. Good plan design suggests that instances within a single component should be logically similar and work on the selected inputs in similar ways. The overall name for the component should also be changed to reflect the logic of the instances contained in the component. If you add a new instance to a component, and that instance behaves very differently from the other instances in that component (for example, if it acts on an unrelated set of outputs or performs an unrelated type of action on the data), you should probably add a new component for this instance. This will also help you keep track of your changes onscreen.

To avoid making plans over-complicated, it is often a good practice to split tasks into multiple plans where a large number of data quality measures needs to be checked. This makes plans and business rules easier to maintain and provides a good framework for future development. For example, in an environment where a large number of attributes must be evaluated against the six standard data quality criteria (i.e., completeness, conformity, consistency, accuracy, duplication and consolidation), using one plan per data quality criterion may be a good way to move forward. Alternatively, splitting plans up by data entity may be advantageous. Similarly, during standardization, you can create plans for specific function areas (e.g., address, product, or name) as opposed to adding all standardization tasks to a single large plan. For more information on the six standard data quality criteria, see Data Cleansing.

Removing Components
Removing a component from a plan is likely to have a major impact since, in most cases, data flow in the plan will be broken. If you remove an integrated component, configuration changes will be required to all components that use the outputs from the component. The plan cannot run without these configuration changes being completed. The only exceptions to this case are when the output(s) of the removed component are solely used by a CSV Sink component or by a frequency component. However, in these cases, you must note that the plan output changes since the column(s) no longer appear in the result set.


Editing Component Configurations


Changing the configuration of a component can have a comparable impact on the overall plan as adding or removing a component: the plan's logic changes, and therefore so do the results that it produces. However, although adding or removing a component may make a plan non-executable, changing the configuration of a component can impact the results in more subtle ways. For example, changing the reference dictionary used by a parsing component does not break a plan, but may have a major impact on the resulting output. Similarly, changing the name of a component instance output does not break a plan. By default, component output names cascade through the other components in the plan, so when you change an output name, all subsequent components automatically update with the new output name. It is not necessary to change the configuration of dependent components.

Last updated: 26-May-08 11:12


Using Data Explorer for Data Discovery and Analysis

Challenge


To understand and make full use of Informatica Data Explorer's potential to profile and define mappings for your project data.

Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration, consolidation or integration, from data warehouse/data mart development, ERP migrations, and enterprise application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate understanding of the true structure of the source data in order to correctly transform the data for a given target database design. However, the data's actual form rarely coincides with its documented or supposed form. The key to success for data-related projects is to fully understand the data as it actually is, before attempting to cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this purpose. This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.

Description
Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality, content and structure of project data sources. Data profiling analyzes several aspects of data structure and content, including characteristics of each column or field, the relationships between fields, and the commonality of data values between fields, often an indicator of redundant data.

Data Profiling
Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality standards may either be the native rules expressed in the source data's metadata, or an external standard (e.g., corporate, industry, or government) to which the source data must be mapped in order to be assessed.


Data profiling in IDE is based on two main processes:


- Inference of characteristics from the data
- Comparison of those characteristics with specified standards, as an assessment of data quality

Data mapping involves establishing relationships among data elements in various data structures or sources, in terms of how the same information is expressed or stored in different ways in different sources. By performing these processes early in a data project, IT organizations can preempt the code/load/explode syndrome, wherein a project fails at the load stage because the data is not in the anticipated form. Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure summarizes and abstracts these scenarios into a single depiction of the IDE solution.

The overall process flow for the IDE Solution is as follows:



1. Data and metadata are prepared and imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas.
3. The resultant metadata are exported to and managed in the IDE Repository.
4. In a derived-target scenario, the project team designs the target database by modeling the existing data sources and then modifying the model as required to meet current business and performance requirements. In this scenario, IDE is used to develop the normalized schema into a target database. The normalized and target schemas are then exported to IDE's FTM/XML tool, which documents transformation requirements between fields in the source, normalized, and target schemas.

OR

5. In a fixed-target scenario, the design of the target database is a given (i.e., because another organization is responsible for developing it, or because an off-the-shelf package or industry standard is to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to map the source data fields to the corresponding fields in an externally-specified target schema, and to document transformation requirements between fields in the normalized and target schemas. FTM is used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based metadata structures. Externally specified targets are typical for ERP package migrations, business-to-business integration projects, or situations where a data modeling team is independently designing the target schema.
6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE applications.

IDE's Methods of Data Profiling


IDE employs three methods of data profiling:

Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely metadata and alternate metadata which is consistent with the data.


Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This process can discover primary and foreign keys, functional dependencies, and sub-tables.

Cross-Table profiling - determines the overlap of values across a set of columns, which may come from multiple tables.
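As a conceptual illustration of the kind of value-overlap analysis involved in cross-table profiling (not of IDE itself), the following Python sketch measures how many distinct values two columns share. The column names and data are hypothetical.

# Conceptual illustration of cross-table (value overlap) profiling.
def overlap_report(name_a: str, col_a: list, name_b: str, col_b: list) -> None:
    """Print distinct counts for two columns and the share of values they have in common."""
    a, b = set(col_a), set(col_b)
    shared = a & b
    print(f"{name_a}: {len(a)} distinct values; {name_b}: {len(b)} distinct values")
    print(f"shared: {len(shared)} "
          f"({len(shared) / len(a):.0%} of {name_a}, {len(shared) / len(b):.0%} of {name_b})")

# Hypothetical columns from two tables; a high overlap can indicate redundant data
# or a candidate foreign-key relationship.
orders_cust_id = ["C001", "C002", "C003", "C003", "C007"]
customer_id = ["C001", "C002", "C003", "C004", "C005"]
overlap_report("ORDERS.CUST_ID", orders_cust_id, "CUSTOMER.ID", customer_id)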


Profiling against external standards requires that the data source be mapped to the standard before being assessed (as shown in the following figure). Note that the mapping is performed by IDE's Fixed Target Mapping tool (FTM). IDE can also be used in the development and application of corporate standards, making them relevant to existing systems as well as to new systems.

Data profiling projects may involve iterative profiling and cleansing as well since data cleansing may improve the quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality should be considered as an alternative tool for data cleansing.

IDE and Fixed-Target Migration


Fixed-target migration projects involve the conversion and migration of data from one or more sources to an externally defined, or fixed, target. IDE is used to profile the data and develop a normalized schema representing the data source(s), while IDE's Fixed Target Mapping tool (FTM) is used to map from the normalized schema to the fixed target. The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.
3. The resultant metadata are exported to and managed by the IDE Repository.
4. FTM maps the source data fields to the corresponding fields in an externally specified target schema, and documents transformation requirements between fields in the normalized and target schemas. Externally specified targets are typical for ERP migrations or projects where a data modeling team is independently designing the target schema.
5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE and FTM.
6. The cleansing, transformation, and formatting specs are used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may discover hidden tables within tables.


Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to establish several of the staging databases between the sources and target, as shown below:

Derived-Target Migration
Derived-target migration projects involve the conversion and migration of data from one or more sources to a target database defined by the migration team. IDE is used to profile the data and develop a normalized schema representing the data source(s), and to further develop the normalized schema into a target schema by adding tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or denormalizing the schema to enhance performance. When the target schema is developed from the normalized schema within IDE, the product automatically maintains the mappings from the source to normalized schema, and from the normalized to target schemas. The figure below shows that the general sequence of activities for a derived-target migration project is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and document cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.
3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves removing obsolete or spurious data elements, incorporating new business requirements and data elements, adapting to corporate data standards, and denormalizing to enhance performance.
4. The resultant metadata are exported to and managed by the IDE Repository.
5. FTM is used to develop and document transformation requirements between the normalized and target schemas. The mappings between the data elements are automatically carried over from the IDE-based schema development process.
6. The IDE Repository is used to export an XSLT document containing the transformation and the formatting specs developed with IDE and FTM/XML.
7. The cleansing, transformation, and formatting specs are used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

Last updated: 09-Feb-07 12:55


Working with Pre-Built Plans in Data Cleanse and Match

Challenge


To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data Cleanse and Match (DC&M) product offering. Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter system:
- Data Cleanse and Match Workbench, the desktop application in which data quality processes - or plans - can be designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until needed.
- Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users can connect to the Data Quality repository and read data quality plan information into this transformation.

Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address reference data files. This document addresses the following areas:
- When to use one plan vs. another for data cleansing.
- What behavior to expect from the plans.
- How best to manage exception data.

Description
The North America Content Pack installs several plans to the Data Quality Repository:
- Plans 01-04 are designed to parse, standardize, and validate United States name and address data.
- Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual-source matching operations (identifying matching records between two datasets).

The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.

Plans 01-04: Parsing, Cleansing, and Validation


These plans provide modular solutions for name and address data. The plans can operate on both highly unstructured and well-structured data sources. The level of structure contained in a given data set determines the plan to be used. The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and validate an address.


In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields, only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address standardization, and validation plans may be required to obtain meaning from the data. The purpose of making the plans modular is twofold:
- It is possible to apply these plans on an individual basis to the data. There is no requirement that the plans be run in sequence with each other. For example, the address validation plan (plan 03) can be run successfully to validate input addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on the same dataset. Plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07.
- Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation and extremely complex plan logic that would be difficult to modify and maintain.

01 General Parser
The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example, consider data stored in the following format:

Field1           | Field2           | Field3           | Field4               | Field5
100 Cardinal Way | Informatica Corp | CA 94063         | info@informatica.com | Redwood City
Redwood City     | 38725            | 100 Cardinal Way | CA 94063             | info@informatica.com

While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-specific fields of addresses, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses, depending on the profile of the content. As a result, the above data will be parsed into the following format:


Address1         | Address2         | Address3     | E-mail               | Date       | Company
100 Cardinal Way | CA 94063         | Redwood City | info@informatica.com |            | Informatica Corp
Redwood City     | 100 Cardinal Way | CA 94063     | info@informatica.com | 08/01/2006 |

The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by information type. As demonstrated with the address fields in the above example, the address fields are labeled as addresses; the contents are not arranged in a standard address format, but are flagged as addresses in the order in which they were processed in the file.

The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.

The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and address element are contained in the same field, the General Parser would label the entire field either a name or an address - or leave it unparsed - depending on the elements in the field it can identify first (if any). While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form containing unparsed data.

The effectiveness of the General Parser in recognizing various information types is a function of the dictionaries used to identify that data and the rules used to sort them. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan. Overall, the General Parser is likely to be used only in limited cases where certain types of information may be mixed together (e.g., telephone and email in the same contact field), or in cases where the data has been badly managed, such as when several files of differing structures have been merged into a single file.
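The following Python sketch illustrates, in greatly simplified form, the kind of pattern-based type classification described above. It is not the plan's logic: the patterns, the sample values, and the classification order are all hypothetical, and the real plan relies on dictionaries as well as patterns.

# Greatly simplified illustration of sorting free-form values by information type.
import re

# Hypothetical patterns; the actual plan combines dictionaries and rules.
PATTERNS = [
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("phone", re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")),
    ("ssn",   re.compile(r"^\d{3}-\d{2}-\d{4}$")),
    ("date",  re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),
]

def classify(value: str) -> str:
    """Return the first information type whose pattern matches, else 'unparsed'."""
    for label, pattern in PATTERNS:
        if pattern.match(value.strip()):
            return label
    return "unparsed"   # no type recognized; the plan leaves such data unparsed

for value in ["info@informatica.com", "555-123-4567", "123-45-6789",
              "08/01/2006", "100 Cardinal Way"]:
    print(value, "->", classify(value))

Note that, like the plan itself, such pattern checks classify by structure only; a value of 99/99/9999 would still be classified as a date.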

02 Name Standardization
The Name Standardization plan is designed to take in person name or company name information and apply parsing and standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.

The plan input fields include two inputs for company names. Data entered in these fields is assumed to be a valid company name, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters, numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to validate a company name are not likely to yield usable results. Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.

The second track is person name standardization. While this track is dedicated to standardizing person names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is applied to identify people whose last name is similar (or equal) to a valid company name (for example, John Sears); inputs that contain an identified first name and a company name are treated as a person name. If the company name track inputs are already fully populated for the record in question, then any company name detected in a person name column is moved to a field for unparsed company name output. If the name is not recognized as a company name (e.g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining data is accepted as being a valid person name and parsed as such.

North American person names are typically entered in one of two styles: either firstname middlename surname or surname, firstname middlename. The name parsing algorithms have been built using this assumption. Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name prefixes, name suffixes, firstnames, and any extraneous data (noise) present. Any remaining details are assumed to be middle name or surname details. A rule is applied to the parsed details to check whether the name has been parsed correctly. If not, best-guess parsing is applied to the field based on the possible assumed formats.

When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details, including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated. In cases where no clear gender can be derived from the first name, the gender field is typically left blank or indeterminate. The salutation field is generated according to the derived gender information. This can easily be replicated outside the data quality plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).

Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed according to person name processing rules. Likewise, some person names may be identified as companies and standardized according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required. Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the fields. For example, an address datum such as Corporate Parkway may be standardized as a business name, as Corporate is also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on whether or not the field contains a recognizable company suffix in the text.

To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-execution analysis of the data. Based on the following input:

ROW ID   IN NAME1
1        Steven King
2        Chris Pope Jr.
3        Shannon C. Prince
4        Dean Jones
5        Mike Judge
6        Thomas Staples
7        Eugene F. Sears
8        Roy Jones Jr.
9        Thomas Smith, Sr
10       Eddie Martin III
11       Martin Luther King, Jr.
12       Staples Corner
13       Sears Chicago
14       Robert Tyre
15       Chris News

The following outputs are produced by the Name Standardization plan:


The last entry (Chris News) is identified as a company in the current plan configuration; such results can be refined by changing the underlying dictionary entries used to identify company and person names.

03 US Canada Standardization
This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key search elements into discrete fields, thereby speeding up the validation process. The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that cannot be parsed into the remaining fields is merged into the non-address data field. The plan makes a number of assumptions that may or may not suit your data:
- When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where town names are commonly misspelled, the standardization plan may not correctly parse the information.
- Zip codes are all assumed to be five-digit. In some files, zip codes that begin with 0 may lack this first number and so appear as four-digit codes, and these may be missed during parsing. Adding four-digit zips to the dictionary is not recommended, as these will conflict with the Plus 4 element of a zip code. Zip codes may also be confused with other five-digit numbers in an address line, such as street numbers.
- City names are also commonly found in street names and other address elements. For example, United is part of a country name (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates from right to left across the data, so that country name and zip code fields are analyzed before city names and street addresses. Therefore, the word United may be parsed and written as the town name for a given address before the actual town name datum is reached.
- The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there is no need to include any country code field in the address inputs when configuring the plan.

Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding some pre-processing logic to a workflow prior to passing the data into the plan. The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as address lines 1-3.
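As a simplified illustration of the intended behavior (the field names are illustrative and actual results depend on the plan's dictionaries and rules), a generic input such as:

   Address Field 1:  100 Cardinal Way
   Address Field 2:  Redwood City CA 94063

would typically be parsed along these lines:

   Address Line 1:  100 Cardinal Way
   City:            Redwood City
   State:           CA
   Zip:             94063
   Country Code:    US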

04 NA Address Validation
The purposes of the North America Address Validation plan are:
- To match input addresses against known valid addresses in an address database, and
- To parse, standardize, and enrich the input addresses.

Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times. The address validation APIs store specific area information in memory and continue to use that information from one record to the next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to maximize the usage of data in memory. In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1 User Guide for information on how to interpret them.
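For example, if the input addresses are staged in a relational table, the pre-sort can be pushed to the database before the data reaches the validation plan; the sketch below uses hypothetical table and column names:

   SELECT cust_id, address_line1, address_line2, city, state, zip_code
   FROM   address_stage
   ORDER  BY zip_code;

In a PowerCenter mapping, a Sorter transformation keyed on the zip/postal code port achieves the same effect.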

Plans 05-07: Pre-Match Standardization, Grouping, and Matching


These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs: users run either plans 05 and 06 or plans 05 and 07. The plans work as follows:

- 05 Match Standardization and Grouping. This plan performs basic standardization and grouping operations on the data prior to matching.
- 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.
- 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.

Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English data. Although they work with datasets in other languages, the results may be sub-optimal.

Matching Concepts
To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and group the data. The aim of standardization here is different from that of a classic standardization plan: the intent is to ensure that different spellings, abbreviations, etc. are as similar to each other as possible in order to return a better match set. For example, 123 Main Rd. and 123 Main Road will obtain an imperfect match score, although they clearly refer to the same street address.

Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches in the dataset. Grouping performs two functions: it sorts the records in a dataset to increase matching plan performance, and it creates new data columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group keys. In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)

Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John Smith with the record for Angela Murphy, as they are obviously not going to be considered duplicate entries. The type of grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data are location-based (e.g., city name, zip codes) or person/company-based (surname and company name composites). For more information on grouping strategies and the best result/performance relationship, see the Best Practice Effective Data Matching Techniques.

Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before group keys are generated. It offers a number of grouping options. The plan generates the following group keys:


- OUT_ZIP_GROUP: first 5 digits of ZIP code
- OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name
- OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name
- OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name
- OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name

The grouping output used depends on the data contents and data volume.
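To illustrate how such a composite group key is typically built, the sketch below derives two of the keys listed above in Oracle-style SQL; the table and column names are hypothetical, and the plan itself produces these keys internally:

   SELECT cust_id,
          SUBSTR(zip_code, 1, 5)                                    AS out_zip_group,
          SUBSTR(zip_code, 1, 5) || UPPER(SUBSTR(last_name, 1, 3))  AS out_zip_name3_group
   FROM   customer_stage;

During matching, only records that share the same group key value are compared with each other.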

Plans 06 Single Source Matching and 07 Dual Source Matching


Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used. However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression transformation upstream in the PowerCenter mapping. A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ matching components, consult the Informatica Data Quality 3.1 User Guide. By default, the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The Data Quality Developer can easily adjust this figure in each plan.

PowerCenter Mappings
When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.

To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts data according to the group key to be used during matching. This transformation should follow the standardization and grouping operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active transformation.

The developer can add a Sequencer transformation to the mapping to generate a unique identifier for each input record if these are not present in the source data. (Note that a unique identifier is not required for matching processes.)

When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively. The data from the two sources is then joined together using a Union transformation before being passed to the Integration transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single source version.

Last updated: 09-Feb-07 13:18


Designing Data Integration Architectures

Challenge


Develop a sound data integration architecture that can serve as a foundation for data integration solutions.

Description
Historically, organizations have approached the development of a "data warehouse" or "data mart" as a departmental effort, without considering an enterprise perspective. The result has been silos of corporate data and analysis, which very often conflict with each other in terms of both detailed data and the business conclusions implied by it. Data integration efforts are often the cornerstone of today's IT initiatives. Taking an enterprise-wide, architectural stance in developing data integration solutions provides many advantages, including:
- A sound architectural foundation ensures the solution can evolve and scale with the business over time.
- Proper architecture can isolate the application component (business context) of the data integration solution from the technology.
- Broader data integration efforts will be simplified by using a holistic, enterprise-based approach.
- Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.

As the evolution of data integration solutions (and the corresponding nomenclature) has progressed, the necessity of building these solutions on a solid architectural framework has become more and more clear. To understand why, a brief review of the history of data integration solutions and their predecessors is warranted. As businesses become more global, Service Oriented Architecture (SOA) becomes more of an Information Technology standard. Having a solid architecture is paramount to the success of data integration efforts.

Historical Perspective
Online Transaction Processing Systems (OLTPs) have always provided a very detailed, transaction-oriented view of an organization's data. While this view was indispensable for the day-to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for management decision-making, was severely limited. Initial attempts to address this problem took several directions:

- Reporting directly against the production system. This approach minimized the effort associated with developing management reports, but introduced a number of significant issues:
  - The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different times of the year, month, or even the day, were inconsistent with each other.
  - Ad hoc queries against the production database introduced uncontrolled performance issues, resulting in slow reporting results and degradation of OLTP system performance.
  - Trending and aggregate analysis was difficult (or impossible) with the detailed data available in the OLTP systems.
- Mirroring the production system in a reporting database. While this approach alleviated the performance degradation of the OLTP system, it did nothing to address the other issues noted above.
- Reporting databases. To address the fundamental issues associated with reporting against the OLTP schema, organizations began to move toward dedicated reporting databases. These databases were optimized for the types of queries typically run by analysts, rather than those used by systems supporting data entry clerks or customer service representatives. These databases may or may not have included pre-aggregated data, and took several forms, including traditional RDBMS as well as newer technology Online Analytical Processing (OLAP) solutions.

The initial attempts at reporting solutions were typically point solutions; they were developed internally to provide very targeted data to a particular department within the enterprise. For example, the Marketing department might extract sales and demographic data in order to infer customer purchasing habits. Concurrently, the Sales department was also extracting sales data for the purpose of awarding commissions to the sales force. Over time, these isolated silos of information became irreconcilable, since the extracts and business rules applied to the data during the extract process differed for the different departments. The result of this evolution was that the Sales and Marketing departments might report completely different sales figures to executive management, resulting in a lack of confidence in both departments' "data marts." From a technical perspective, the uncoordinated extracts of the same data from the source systems multiple times placed undue strain on system resources.

The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would be supported by a single set of periodic extracts of all relevant data into the data warehouse (or Operational Data Store), with the data being cleansed and made consistent as part of the extract process. The problem with this solution was its enormous complexity, typically resulting in project failure. The scale of these failures led many organizations to abandon the concept of the enterprise data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these solutions still had all of the issues discussed previously, they had the clear advantage of providing individual departments with the data they needed without the unmanageability of the enterprise solution.

As individual departments pursued their own data and data integration needs, they not only created data stovepipes, they also created technical islands. The approaches to populating the data marts and performing the data integration tasks varied widely, resulting in a single enterprise evaluating, purchasing, and being trained on multiple tools and adopting multiple methods for performing these tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to face the daunting challenge of integrating the disparate data as well as the widely varying technologies. To deal with these issues, organizations began developing approaches that considered the enterprise-level requirements of a data integration solution.

Centralized Data Warehouse


The first approach to gain popularity was the centralized data warehouse, designed to solve the decision support needs of the entire enterprise at one time, with one effort. The data integration process extracts the data directly from the operational systems, transforms it according to the business rules, and loads it into a single target database serving as the enterprise-wide data warehouse.

Advantages
The centralized model offers a number of benefits to the overall architecture, including:
- Centralized control. Since a single project drives the entire process, there is centralized control over everything occurring in the data warehouse. This makes it easier to manage a production system while concurrently integrating new components of the warehouse.
- Consistent metadata. Because the warehouse environment is contained in a single database and the metadata is stored in a single repository, the entire enterprise can be queried whether you are looking at data from Finance, Customers, or Human Resources.
- Enterprise view. Developing the entire project at one time provides a global view of how data from one workgroup coordinates with data from others. Since the warehouse is highly integrated, different workgroups often share common tables such as customer, employee, and item lists.
- High data integrity. A single, integrated data repository for the entire enterprise would naturally avoid all data integrity issues that result from duplicate copies and versions of the same business data.

Disadvantages
Of course, the centralized data warehouse also involves a number of drawbacks, including:
- Lengthy implementation cycle. With the complete warehouse environment developed simultaneously, many components of the warehouse become daunting tasks, such as analyzing all of the source systems and developing the target data model. Even minor tasks, such as defining how to measure profit and establishing naming conventions, snowball into major issues.
- Substantial up-front costs. Many analysts who have studied the costs of this approach agree that this type of effort nearly always runs into the millions. While this level of investment is often justified, the problem lies in the delay between the investment and the delivery of value back to the business.
- Scope too broad. The centralized data warehouse requires a single database to satisfy the needs of the entire organization. Attempts to develop an enterprise-wide warehouse using this approach have rarely succeeded, since the goal is simply too ambitious. As a result, this wide scope has been a strong contributor to project failure.
- Impact on the operational systems. Different tables within the warehouse often read data from the same source tables, but manipulate it differently before loading it into the targets. Since the centralized approach extracts data directly from the operational systems, a source table that feeds into three different target tables is queried three times to load the appropriate target tables in the warehouse. When combined with all the other loads for the warehouse, this can create an unacceptable performance hit on the operational systems.
- Potential integration challenges. A centralized data warehouse has the disadvantage of limited scalability. As businesses change and consolidate, adding new interfaces and/or merging a potentially disparate data source into the centralized data warehouse can be a challenge.

Independent Data Mart


The second warehousing approach is the independent data mart, which gained popularity in 1996 when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the same principles as the centralized approach, but it scales down the scope from solving the warehousing needs of the entire company to the needs of a single department or workgroup. Much like the centralized data warehouse, an independent data mart extracts data directly from the operational sources, manipulates the data according to the business rules, and loads a single target database serving as the independent data mart. In some cases, the operational data may be staged in an Operational Data Store (ODS) and then moved to the mart.

Advantages
The independent data mart is the logical opposite of the centralized data warehouse. The disadvantages of the centralized approach are the strengths of the independent data mart:
- Impact on operational databases localized. Because the independent data mart is trying to solve the DSS needs of a single department or workgroup, only the few operational databases containing the information required need to be analyzed.
- Reduced scope of the data model. The target data modeling effort is vastly reduced since it only needs to serve a single department or workgroup, rather than the entire company.
- Lower up-front costs. The data mart is serving only a single department or workgroup; thus hardware and software costs are reduced.
- Fast implementation. The project can be completed in months, not years. The process of defining business terms and naming conventions is simplified since "players from the same team" are working on the project.

Disadvantages


Of course, independent data marts also have some significant disadvantages:


- Lack of centralized control. Because several independent data marts are needed to solve the decision support needs of an organization, there is no centralized control. Each data mart or project controls itself, but there is no central control from a single location.
- Redundant data. After several data marts are in production throughout the organization, all of the problems associated with data redundancy surface, such as inconsistent definitions of the same data object or timing differences that make reconciliation impossible.
- Metadata integration. Due to their independence, the opportunity to share metadata - for example, the definition and business rules associated with the Invoice data object - is lost. Subsequent projects must repeat the development and deployment of common data objects.
- Manageability. The independent data marts control their own scheduling routines and therefore store and report their metadata differently, with a negative impact on the manageability of the data warehouse. There is no centralized scheduler to coordinate the individual loads appropriately or metadata browser to maintain the global metadata and share development work among related projects.

Dependent Data Marts (Federated Data Warehouses)


The third warehouse architecture is the dependent data mart approach supported by the hub-and-spoke architecture of PowerCenter and PowerExchange. After studying more than one hundred different warehousing projects, Informatica introduced this approach in 1998, leveraging the benefits of the centralized data warehouse and independent data mart. The more general term being adopted to describe this approach is the "federated data warehouse."

Industry analysts have recognized that, in many cases, there is no "one size fits all" solution. Although the goal of true enterprise architecture, with conformed dimensions and strict standards, is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated data warehouse was born. It allows for the relatively independent development of data marts, but leverages a centralized PowerCenter repository for sharing transformations, source and target objects, business rules, etc.

Recent literature describes the federated architecture approach as a way to get closer to the goal of a truly centralized architecture while allowing for the practical realities of most organizations. The centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the organization can develop semi-autonomous data marts, so long as they subscribe to a common view of the business. This common business model is the fundamental, underlying basis of the federated architecture, since it ensures consistent use of business terms and meanings throughout the enterprise. With the exception of the rare case of a truly independent data mart, where no future growth is planned or anticipated, and where no opportunities for integration with other business areas exist, the federated data warehouse architecture provides the best framework for building a data integration solution.


Informatica's PowerCenter and PowerExchange products provide an essential capability for supporting the federated architecture: the shared Global Repository. When used in conjunction with one or more Local Repositories, the Global Repository serves as a sort of "federal" governing body, providing a common understanding of core business concepts that can be shared across the semi-autonomous data marts. These data marts each have their own Local Repository, which typically includes a combination of purely local metadata and shared metadata by way of links to the Global Repository.

This environment allows for relatively independent development of individual data marts, but also supports metadata sharing without obstacles. The common business model and names described above can be captured in metadata terms and stored in the Global Repository. The data marts use the common business model as a basis, but extend the model by developing departmental metadata and storing it locally.

A typical characteristic of the federated architecture is the existence of an Operational Data Store (ODS). Although this component is optional, it can be found in many implementations that extract data from multiple source systems and load multiple targets. The ODS was originally designed to extract and hold operational data that would be sent to a centralized data warehouse, working as a time-variant database to support end-user reporting directly from operational systems. A typical ODS had to be organized by data subject area because it did not retain the data model from the operational system. Informatica's approach to the ODS, by contrast, has virtually no change in data model from the operational system, so it need not be organized by subject area. The ODS does not permit direct end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation functions than a traditional ODS.

Advantages
The Federated architecture brings together the best features of the centralized data warehouse and independent data mart:
- Room for expansion. While the architecture is designed to quickly deploy the initial data mart, it is also easy to share project deliverables across subsequent data marts by migrating local metadata to the Global Repository. Reuse is built in.
- Centralized control. A single platform controls the environment from development to test to production. Mechanisms to control and monitor the data movement from operational databases into the data integration environment are applied across the data marts, easing the system management task.
- Consistent metadata. A Global Repository spans all the data marts, providing a consistent view of metadata.
- Enterprise view. Viewing all the metadata from a central location also provides an enterprise view, easing the maintenance burden for the warehouse administrators. Business users can also access the entire environment when necessary (assuming that security privileges are granted).
- High data integrity. Using a set of integrated metadata repositories for the entire enterprise removes data integrity issues that result from duplicate copies of data.
- Minimized impact on operational systems. Frequently accessed source data, such as customer, product, or invoice records, is moved into the decision support environment once, leaving the operational systems unaffected by the number of target data marts.

Disadvantages
Disadvantages of the federated approach include:
- Data propagation. This approach moves data twice: to the ODS, then into the individual data mart. This requires extra database space to store the staged data as well as extra time to move the data. However, the disadvantage can be mitigated by not saving the data permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or a rolling three months of data can be saved.
- Increased development effort during initial installations. For each table in the target, there needs to be one load developed from the ODS to the target, in addition to all the loads from the source to the targets.

Operational Data Store


Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is not organized by subject area and is not customized for viewing by end users or even for reporting.

The primary focus of the ODS is in providing a clean, consistent set of operational data for creating and refreshing data marts. Separating out this function allows the ODS to provide more reliable and flexible support. Data from the various operational sources is staged for subsequent extraction by target systems in the ODS. In the ODS, data is cleaned and remains normalized, tables from different databases are joined, and a refresh policy is carried out (a change/capture facility may be used to schedule ODS refreshes, for instance). The ODS and the data marts may reside in a single database or be distributed across several physical databases and servers. Characteristics of the Operational Data Store are:
- Normalized
- Detailed (not summarized)
- Integrated
- Cleansed
- Consistent

Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number of ways:
- Normalizes data where necessary (such as non-relational mainframe data), preparing it for storage in a relational system.
- Cleans data by enforcing commonalities in dates, names, and other data types that appear across multiple systems.
- Maintains reference data to help standardize other formats; references might range from zip codes and currency conversion rates to product-code-to-product-name translations.

The ODS may apply fundamental transformations to some database tables in order to reconcile common definitions, but the ODS is not intended to be a transformation processor for end-user reporting requirements.

Its role is to consolidate detailed data within common formats. This enables users to create wide varieties of data integration reports, with confidence that those reports will be based on the same detailed data, using common definitions and formats.

The following table compares the key differences in the three architectures:

Architecture          Centralized Data Warehouse   Independent Data Mart   Federated Data Warehouse
Centralized Control   Yes                          No                      Yes
Consistent Metadata   Yes                          No                      Yes
Cost effective        No                           Yes                     Yes
Enterprise View       Yes                          No                      Yes
Fast Implementation   No                           Yes                     Yes
High Data Integrity   Yes                          No                      Yes
Immediate ROI         No                           Yes                     Yes
Repeatable Process    No                           Yes                     Yes

The Role of Enterprise Architecture


The federated architecture approach allows for the planning and implementation of an enterprise architecture framework that addresses not only short-term departmental needs, but also the long-term enterprise requirements of the business. This does not mean that the entire architectural investment must be made in advance of any application development. However, it does mean that development is approached within the guidelines of the framework, allowing for future growth without significant technological change. The remainder of this chapter will focus on the process of designing and developing a data integration solution architecture using PowerCenter as the platform.

Fitting Into the Corporate Architecture


Very few organizations have the luxury of creating a "green field" architecture to support their decision support needs. Rather, the architecture must fit within an existing set of corporate guidelines regarding preferred hardware, operating systems, databases, and other software. The Technical Architect, if not already an employee of the organization, should ensure that he/she has a thorough understanding of the existing (and future vision of) technical infrastructure. Doing so will eliminate the possibility of developing an elegant technical solution that will never be implemented because it defies corporate standards.


Development FAQs

Challenge


Using the PowerCenter product suite to effectively develop, name, and document components of the data integration solution. While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions that are commonly raised by project teams. It provides answers in a number of areas, including Logs, Scheduling, Backup Strategies, Server Administration, Custom Transformations, and Metadata. Refer to the product guides supplied with PowerCenter for additional information.

Description
The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.

Mapping Design
Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to the target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk writing contention and applies both to a session writing to multiple targets and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?

The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging and better understandability, as well as to create potential partition points. This should be balanced against the fact that more objects mean more overhead for the DTM process. It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to use filters as early as possible in the mapping to remove rows of data that are not needed; this is the SQL equivalent of the WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to increase the performance of the mapping. If this is not possible, a Filter or Router transformation can be used instead.
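For example, rather than reading every row and discarding unwanted rows inside the mapping, the filter can be pushed to the database through the Source Qualifier filter condition or a SQL override. The sketch below is illustrative only; the table, column, and literal values are hypothetical:

   SELECT order_id,
          customer_id,
          order_amount
   FROM   orders
   WHERE  order_status = 'ACTIVE'
     AND  order_date >= TO_DATE('01-JAN-2006', 'DD-MON-YYYY')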

Log File Organization


Q: How does PowerCenter handle logs?

The Service Manager provides accumulated log events from each service in the domain and for sessions and workflows. To perform the logging function, the Service Manager runs a Log Manager and a Log Agent. The Log Manager runs on the master gateway node. It collects and processes log events for Service Manager domain operations and application services. The log events contain operational and error messages for a domain. The Service Manager and the application services send log events to the Log Manager. When the Log Manager receives log events, it generates log event files, which can be viewed in the Administration Console.

The Log Agent runs on the nodes to collect and process log events for sessions and workflows. Log events for workflows include information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for sessions include information about the tasks performed by the Integration Service, session errors, and load summary and transformation statistics for the session. You can view log events for the last workflow run with the Log Events window in the Workflow Monitor.

Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide.

Q: Where can I view the logs?

Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays domain-level operational and error messages. The Workflow Monitor displays session- and workflow-level processing and error messages.

Q: Where is the best place to maintain Session Logs?

One often-recommended location is a shared directory that is accessible to the gateway node. If you have more than one gateway node, store the logs on a shared disk. This keeps all the logs in the same directory. The location can be changed in the Administration Console. If you have more than one PowerCenter domain, you must configure a different directory path for each domain's Log Manager. Multiple domains cannot use the same shared directory path. For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide.

Q: What documentation is available for the error codes that appear within the error log files?

Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Troubleshooting Guide. Error information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific errors, consult your Database User Guide.

Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?

Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the targets. Workflows can be created to run tasks sequentially or concurrently, or have tasks in different paths doing either.

- A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc.
- A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multiprocessing (SMP) architecture.

Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.

Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?

No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is possible, however, to create tasks and flows based on error handling assumptions. If a previously running real-time workflow fails, first recover and then restart that workflow from the Workflow Monitor.

Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor?

Start the Workflow Manager and open the corresponding workflow. Find the failed task and right click to "Recover Workflow From Task."

Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?

Workflow execution needs to be planned around two main constraints:

- Available system resources
- Memory and processors

The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The load manager is always running as a process. If bottlenecks with regard to I/O and network are addressed, a session will be compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive, so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a session needs about 120 percent of a processor for the DTM, reader, and writer in total. For concurrent sessions, one session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server. If possible, sessions should run at "off-peak" hours to have as many available resources as possible.

Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processor calculation; it tends to vary according to system load and the number of PowerCenter sessions running. The first step is to estimate memory usage, accounting for:
- Operating system kernel and miscellaneous processes
- Database engine
- Informatica Load Manager

Next, each session being run needs to be examined with regard to memory usage, including the DTM buffer size and any cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters, and joiners. At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions concurrently.

Load-order dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then facts. Also, some sources may only be available at specific times; some network links may become saturated if overloaded; and some target tables may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify the Server Administrator?

The application level of event notification can be accomplished through post-session email. Post-session email allows you to create two different messages; one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session email:

Email Variable   Description
%s               Session name
%l               Total records loaded
%r               Total records rejected
%e               Session status
%t               Table details, including read throughput in bytes/second and write throughput in rows/second
%b               Session start time
%c               Session completion time
%i               Session elapsed time (session completion time - session start time)
%g               Attaches the session log to the message
%m               Name and version of the mapping used in the session
%n               Name of the folder containing the session
%d               Name of the repository containing the session
%a<filename>     Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files. If you attach a non-text file, the send may fail. Note: The filename cannot include the Greater Than character (>) or a line break.
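For example, a success message might combine several of these variables; the template below is purely illustrative and can be adapted as needed:

   Session %s completed with status %e.
   Total rows loaded: %l
   Total rows rejected: %r
   Start time: %b
   Completion time: %c
   Elapsed time: %i
   %t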

The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter Server must have the rmail tool installed in the path in order to send email. To verify the rmail tool is accessible:

1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.

The output should look like the following:

   Session complete.
   Session name: sInstrTest
   Total Rows Loaded = 1
   Total Rows Rejected = 0

   Status      Rows Loaded   Rows Rejected   ReadThroughput (bytes/sec)   WriteThroughput (rows/sec)   Table Name
   Completed   1             0                                            30                           t_Q3_sales

   No errors encountered.
   Start Time: Tue Sep 14 12:26:31 1999
   Completion Time: Tue Sep 14 12:26:41 1999
   Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts email.

Backup Strategy Recommendation


Q: Can individual objects within a repository be restored from the backup or from a prior version?

At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you can only restore the entire repository). But, it is possible to restore the backup repository into a different database and then manually copy the individual objects back into the main repository. It should be noted that PowerCenter does not restore repository backup files created in previous versions of PowerCenter. To correctly restore a repository, the version of PowerCenter used to create the backup file must be used for the restore as well.

An option for the backup of individual objects is to export them to XML files. This allows for the granular re-importation of individual objects, mappings, tasks, workflows, etc. Refer to Migration Procedures - PowerCenter for details on promoting new or changed objects between development, test, QA, and production environments.

Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?

The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by another user. Notification messages are received through the PowerCenter Client tools.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes. Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility provides the following information:
- CPID - Creator PID (process ID)
- LPID - Last PID that accessed the resource
- Semaphores - used to sync the reader and writer
- 0 or 1 - shows slot in LM shared memory

A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT documentation.

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?

If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.

Custom Transformations
Q: What is the relationship between the Java or SQL transformation and the Custom transformation?

Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality. Other transformations that were built using Custom transformations include HTTP, SQL, Union, XML Parser, XML Generator, and many others. Below is a summary of noticeable differences.

Transformation   # of Input Groups   # of Output Groups   Type
Custom           Multiple            Multiple             Active/Passive
HTTP             One                 One                  Passive
Java             One                 One                  Active/Passive
SQL              One                 One                  Active/Passive
Union            Multiple            One                  Active
XML Parser       One                 Multiple             Active
XML Generator    Multiple            One                  Active

For further details, please see the Transformation Guide.

Q: What is the main benefit of a Custom transformation over an External Procedure transformation? A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation handles both the input and output simultaneously. Additionally, an External Procedure transformation's parameters consist of all the ports of the transformation. The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to be processed before outputting any output rows.

Q: How do I change a Custom transformation from Active to Passive, or vice versa? After the creation of the Custom transformation, the transformation type cannot be changed. In order to set the appropriate type, delete and recreate the transformation.

Q: What is the difference between active and passive Java transformations? When should one be used over the other? An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive Java transformation only allows for the generation of one output row per input row. Use active if you need to generate multiple rows with each input. For example, a Java transformation contains two input ports that represent a start date and an end date. You can generate an output row for each date between the start and end date. Use passive when you need one output row for each input.

Q: What are the advantages of a SQL transformation over a Source Qualifier? A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete, update, and retrieve rows from a database. For example, you might need to create database tables before adding new transactions. The SQL transformation allows for the creation of these tables from within the workflow.

Q: What is the difference between the SQL transformation's Script and Query modes? Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters. For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.
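To illustrate the active Java transformation behavior described above, the sketch below shows the kind of logic placed on the On Input Row tab. The port names are assumptions, and generateRow() is the Java transformation call that emits an additional output row; verify the method names against the Java transformation documentation for your release.

// On Input Row code of an active Java transformation (sketch).
// Emits one output row for every value between the two input ports.
if (!isNull("IN_RANGE_START") && !isNull("IN_RANGE_END")) {
    for (int i = IN_RANGE_START; i <= IN_RANGE_END; i++) {
        OUT_VALUE = i;      // set the output port for this generated row
        generateRow();      // an active transformation may call this many times per input row
    }
}

A passive Java transformation would simply assign the output ports once per input row and never call generateRow().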

Metadata
Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may be extracted from the PowerCenter repository and used in others? With PowerCenter, you can enter description information for all repository objects (sources, targets, transformations, etc.), but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and primary keys are stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time consuming to do so. Therefore, this decision should be made on the basis of how much metadata is likely to be required by the systems that use the metadata. There are some time-saving tools that are available to better manage a metadata strategy and content, such as third-party metadata software and, for sources and targets, data modeling tools.

Q: What procedures exist for extracting metadata from the repository? Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository. Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata. Informatica strongly discourages accessing the repository directly, even for SELECT access, because some releases of PowerCenter change the look and feel of the repository tables, resulting in a maintenance task for you. Rather, views have been created to provide access to the metadata stored in the repository. Additionally, Informatica's Metadata Manager and Data Analyzer allow for more robust reporting against the repository database and are able to present reports to the end-user and/or management.

Versioning
Q: How can I keep multiple copies of the same object within PowerCenter? A: With PowerCenter, you can use version control to maintain previous copies of every changed object. You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an object, control development of the object, and track changes. You can configure a repository for versioning when you create it, or you can upgrade an existing repository to support versioned objects. When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object has an active status. You can perform the following tasks when you work with a versioned object:
- View object version properties. Each versioned object has a set of version properties and a status. You can also configure the status of a folder to freeze all objects it contains or make them active for editing.
- Track changes to an object. You can view a history that includes all versions of a given object, and compare any version of the object in the history to any other version. This allows you to determine changes made to an object over time.
- Check the object version in and out. You can check out an object to reserve it while you edit the object. When you check in an object, the repository saves a new version of the object and allows you to add comments to the version. You can also find objects checked out by yourself and other users.
- Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from the repository.

Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time on making a list of all changed/affected objects? A: Yes there is. You can create Deployment Groups that allow you to group versioned objects for migration to a different repository. You can create the following types of deployment groups:
- Static. You populate the deployment group by manually selecting objects.
- Dynamic. You use the result set from an object query to populate the deployment group.

To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of deployment. You can associate an object query with a deployment group when you edit or create a deployment group. If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another. Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder.

Performance
Q: Can PowerCenter sessions be load balanced?

A: Yes, if the PowerCenter Enterprise Grid Option is available. The Load Balancer is a component of the Integration Service that dispatches tasks to Integration Service processes running on nodes in a grid. It matches task requirements with resource availability to identify the best Integration Service process to run a task. It can dispatch tasks on a single node or across nodes. Tasks can be dispatched in three ways: Round-robin, Metric-based, and Adaptive. Additionally, you can set the Service Levels to change the priority of each task waiting to be dispatched. This can be changed in the Administration Console's domain properties. For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.

Web Services
Q: How does Web Services Hub work in PowerCenter? A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and Repository Service through the Web Services Hub. The Web Services Hub hosts Batch and Real-time Web Services. When you install PowerCenter Services, the PowerCenter installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide. The Web Services Hub connects to the Repository Server and the PowerCenter Server through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(s). The Web Services Hub authenticates the client based on repository user name and password. You can use the Web Services Hub console to view service information and download Web Services Description Language (WSDL) files necessary for running services and workflows.

Last updated: 06-Dec-07 15:00


Event Based Scheduling

Challenge


In an operational environment, the beginning of a task often needs to be triggered by some event, either internal or external to the Informatica environment. In versions of PowerCenter prior to version 6.0, this was achieved through the use of indicator files. In PowerCenter 6.0 and forward, it is achieved through use of the Event-Raise and Event-Wait Workflow and Worklet tasks, as well as indicator files.

Description
Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through the use of indicator files. Users specified the indicator file configuration in the session configuration under advanced options. When the session started, the PowerCenter Server looked for the specified file name; if it wasn't there, it waited until it appeared, then deleted it, and triggered the session. In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and Event-Raise tasks. These tasks can be used to define task execution order within a workflow or worklet. They can even be used to control sessions across workflows.
- An Event-Raise task represents a user-defined event (i.e., an indicator file).
- An Event-Wait task waits for an event to occur within a workflow. After the event triggers, the PowerCenter Server continues executing the workflow from the Event-Wait task forward.

The following paragraphs describe events that can be triggered by an Event-Wait task.

Waiting for Pre-Defined Events


To use a pre-defined event, you need a session, shell command, script, or batch file to create an indicator file. You must create the file locally or send it to a directory local to the PowerCenter Server. The file can be any format recognized by the PowerCenter Server operating system. You can choose to have the PowerCenter Server delete the indicator file after it detects the file, or you can manually delete the indicator file. The PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot delete the indicator file.


When you specify the indicator file in the Event-Wait task, specify the directory in which the file will appear and the name of the indicator file. Do not use either a source or target file name as the indicator file name. You must also provide the absolute path for the file and the directory must be local to the PowerCenter Server. If you only specify the file name, and not the directory, Workflow Manager looks for the indicator file in the system directory. For example, on Windows NT, the system directory is C:/winnt/system32. You can enter the actual name of the file or use server variables to specify the location of the files. The PowerCenter Server writes the time the file appears in the workflow log.

Follow these steps to set up a pre-defined event in the workflow:

1. Create an Event-Wait task and double-click the Event-Wait task to open the Edit Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.
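The indicator file itself is typically created by a post-session command, an external script, or a scheduler step. A minimal sketch is shown below; the directory and file name are assumptions and must match the absolute path configured in the Event-Wait task (and be local to the PowerCenter Server):

# Signal downstream workflows that the extract has completed.
touch /data/infa/indicators/orders_extract_complete.ind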

Pre-defined Event
A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait task to instruct the PowerCenter Server to wait for the specified indicator file to appear before continuing with the rest of the workflow. When the PowerCenter Server locates the indicator file, it starts the task downstream of the Event-Wait.

User-defined Event
A user-defined event is defined at the workflow or worklet level and the Event-Raise task triggers the event at one point of the workflow/worklet. If an Event-Wait task is configured in the same workflow/worklet to listen for that event, then execution will continue from the Event-Wait task forward. The following is an example of using user-defined events: Assume that you have four sessions that you want to execute in a workflow. You want P1_session and P2_session to execute concurrently to save time. You also want to execute Q3_session after P1_session completes. You want to execute Q4_session only when P1_session, P2_session, and Q3_session complete. Follow these steps:


1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session.
3. Declare an event called P1Q3_Complete in the Events tab of the workflow properties.
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This allows the Event-Raise task to trigger the event when P1_session and Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task.

When the PowerCenter Server processes the Event-Wait task, it waits until the Event-Raise task triggers P1Q3_Complete before it executes Q4_session. The PowerCenter Server executes the workflow in the following order:

1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event, P1Q3_Complete, has been triggered.

Be sure to take care in setting the links though. If they are left as the default and Q3_session fails, the Event-Raise will never happen. Then the Event-Wait will wait forever and the workflow will run until it is stopped. To avoid this, check the workflow option 'Suspend on Error'. With this option, if a session fails, the whole workflow goes into suspended mode and can send an email to notify developers.

Last updated: 01-Feb-07 18:53


Key Management in Data Warehousing Solutions

Challenge


Key management refers to the technique that manages key allocation in a decision support RDBMS to create a single view of reference data from multiple sources. Informatica recommends a concept of key management that ensures everything extracted from a source system is loaded into the data warehouse. This Best Practice provides some tips for employing the Informatica-recommended approach to key management, an approach that deviates from many traditional data warehouse solutions, where logical and data warehouse (surrogate) key strategies cause transactions with referential integrity issues to be rejected and logged as errors.

Description
Key management in a decision support RDBMS comprises three techniques for handling the following common situations:
- Key merging/matching
- Missing keys
- Unknown keys

All three methods are applicable to a Reference Data Store, whereas only the missing and unknown keys are relevant for an Operational Data Store (ODS). Key management should be handled at the data integration level, thereby making it transparent to the Business Intelligence layer.

Key Merging/Matching
When companies source data from more than one transaction system of a similar type, the same object may have different, non-unique legacy keys. Additionally, a single key may have several descriptions or attributes in each of the source systems. The independence of these systems can result in incongruent coding, which poses a greater problem than records being sourced from multiple systems.


A business can resolve this inconsistency by undertaking a complete code standardization initiative (often as part of a larger metadata management effort) or applying a Universal Reference Data Store (URDS). Standardizing code requires an object to be uniquely represented in the new system. Alternatively, URDS contains universal codes for common reference values. Most companies adopt this pragmatic approach, while embarking on the longer term solution of code standardization. The bottom line is that nearly every data warehouse project encounters this issue and needs to find a solution in the short term.

Missing Keys
A problem arises when a transaction is sent through without a value in a column where a foreign key should exist (i.e., a reference to a key in a reference table). This normally occurs during the loading of transactional data, although it can also occur when loading reference data into hierarchy structures. In many older data warehouse solutions, this condition would be identified as an error and the transaction row would be rejected. The row would have to be processed through some other mechanism to find the correct code and loaded at a later date. This is often a slow and cumbersome process that leaves the data warehouse incomplete until the issue is resolved. The more practical way to resolve this situation is to allocate a special key in place of the missing key, which links it with a dummy 'missing key' row in the related table. This enables the transaction to continue through the loading process and end up in the warehouse without further processing. Furthermore, the row ID of the bad transaction can be recorded in an error log, allowing the addition of the correct key value at a later time. The major advantage of this approach is that any aggregate values derived from the transaction table will be correct because the transaction exists in the data warehouse rather than being in some external error processing file waiting to be fixed.

Simple Example:

PRODUCT      CUSTOMER    SALES REP    QUANTITY    UNIT PRICE
Audi TT18    Doe10224                 1           35,000

In the transaction above, there is no code in the SALES REP column. As this row is processed, a dummy sales rep key (UNKNOWN) is added to the record to link to a record in the SALES REP table. A data warehouse key (8888888) is also added to the transaction.

PRODUCT      CUSTOMER    SALES REP    QUANTITY    UNIT PRICE    DWKEY
Audi TT18    Doe10224    9999999      1           35,000        8888888

The related sales rep record may look like this:

REP CODE    REP NAME       REP MANAGER
1234567     David Jones    Mark Smith
7654321     Mark Smith
9999999     Missing Rep

An error log entry to identify the missing key on this transaction may look like:

ERROR CODE    TABLE NAME    KEY NAME     KEY
MSGKEY        ORDERS        SALES REP    8888888

This type of error reporting is not usually necessary because the transactions with missing keys can be identified using standard end-user reporting tools against the data warehouse.
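A minimal sketch of the substitution logic in an Expression transformation is shown below, assuming an input port named SALES_REP_CODE (the port name is an assumption; the dummy key value is taken from the example above):

IIF( ISNULL(SALES_REP_CODE) OR LTRIM(RTRIM(SALES_REP_CODE)) = '',
     '9999999',
     SALES_REP_CODE )

The same expression, or an equivalent default value on a Lookup transformation, ensures that every transaction row carries a valid foreign key before it reaches the target.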

Unknown Keys
Unknown keys need to be treated much like missing keys except that the load process has to add the unknown key value to the referenced table to maintain integrity rather than explicitly allocating a dummy key to the transaction. The process also needs to make two error log entries. The first, to log the fact that a new and unknown key has been added to the reference table and a second to record the transaction in which the unknown key was found. Simple example: The sales rep reference data record might look like the following:


DWKEY      REP NAME       REP MANAGER
1234567    David Jones    Mark Smith
7654321    Mark Smith
9999999    Missing Rep

A transaction comes into ODS with the record below:

PRODUCT      CUSTOMER    SALES REP    QUANTITY    UNIT PRICE
Audi TT18    Doe10224    2424242      1           35,000

In the transaction above, the code 2424242 appears in the SALES REP column. As this row is processed, a new row has to be added to the Sales Rep reference table. This allows the transaction to be loaded successfully.

DWKEY      REP NAME    REP MANAGER
2424242    Unknown

A data warehouse key (8888889) is also added to the transaction.

PRODUCT      CUSTOMER    SALES REP    QUANTITY    UNIT PRICE    DWKEY
Audi TT18    Doe10224    2424242      1           35,000        8888889

Some warehouse administrators like to have an error log entry generated to identify the addition of a new reference table entry. This can be achieved simply by adding the following entries to an error log.

ERROR CODE    TABLE NAME    KEY NAME     KEY
NEWROW        SALES REP     SALES REP    2424242

A second log entry can be added with the data warehouse key of the transaction in which the unknown key was found.


ERROR CODE    TABLE NAME    KEY NAME     KEY
UNKNKEY       ORDERS        SALES REP    8888889

As with missing keys, error reporting is not essential because the unknown status is clearly visible through the standard end-user reporting. Moreover, regardless of the error logging, the system is self-healing because the newly added reference data entry will be updated with full details as soon as these changes appear in a reference data feed. This would result in the reference data entry looking complete.

DWKEY      REP NAME       REP MANAGER
2424242    David Digby    Mark Smith
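A sketch of the detection logic for unknown keys is shown below, assuming an unconnected Lookup transformation named LKP_SALES_REP on the sales rep reference table and an input port SALES_REP_CODE (both names are assumptions). A variable port performs the lookup once, and a downstream Router can send flagged rows to the path that inserts the placeholder reference row and writes the error log entries:

v_REP_KEY (variable port):    :LKP.LKP_SALES_REP(SALES_REP_CODE)
o_IS_UNKNOWN (output port):   IIF( ISNULL(v_REP_KEY), 'Y', 'N' )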

Employing the Informatica recommended key management strategy produces the following benefits:
- All rows can be loaded into the data warehouse
- All objects are allocated a unique key
- Referential integrity is maintained
- Load dependencies are removed

Last updated: 01-Feb-07 18:53


Mapping Auto-Generation

Challenge


In the course of developing mappings for PowerCenter, situations can arise where a set of similar functions/procedures must be executed for each mapping. The first reaction to this issue is generally to employ a mapplet. These objects are suited to situations where all of the individual fields/data are the same across uses of the mapplet. However, in cases where the fields are different but the process is the same, a requirement emerges to generate multiple mappings using a standard template of actions and procedures. The potential benefits of autogeneration are focused on a reduction in the Total Cost of Ownership (TCO) of the integration application and include:

- Reduced build time
- Reduced requirement for skilled developer resources
- Promotion of pattern-based design
- Built-in quality and consistency
- Reduced defect rate through elimination of manual errors
- Reduced support overhead

Description
From the outset, it should be emphasized that auto-generation should be integrated into the overall development strategy. It is probable that some components will still need to be manually developed and many of the disciplines and best practices that are documented elsewhere in Velocity still apply. It is best to regard autogeneration as a productivity aid in specific situations and not as a technique that works in all situations. Currently, the autogeneration of 100% of the components required is not a realistic objective. All of the techniques discussed here revolve around the generation of an XML file which shares the standard format of exported PowerCenter components as defined in the powrmart.dtd schema definition. After being generated, the resulting XML document is imported into PowerCenter using standard facilities available through the user interface or via command line. With Informatica technology, there are a number of options for XML targeting which can be leveraged to implement autogeneration. Thus you can exploit these features to make the technology self-generating. The stages in implementing an autogeneration strategy are:

1. Establish the Scope for Autogeneration
2. Design the Assembly Line(s)
3. Build the Assembly Line
4. Implement the QA and Testing Strategies

These stages are discussed in more detail in the following sections.

1. Establish the Scope for Autogeneration


There are three types of opportunities for manufacturing components:
- Pattern-Driven
- Rules-Driven
- Metadata-Driven

A Pattern-Driven build is appropriate when a single pattern of transformation is to be replicated for multiple source-target combinations. For example, the initial extract in a standard data warehouse load typically extracts some source data with standardized filters, and then adds some load metadata before populating a staging table which essentially replicates the source structure.

The potential for Rules-Driven build typically arises when non-technical users are empowered to articulate transformation requirements in a format which is the source for a process generating components. Usually, this is accomplished via a spreadsheet which defines the source-to-target mapping and uses a standardized syntax to define the transformation rules. To implement this type of autogeneration, it is necessary to build an application (typically based on a PowerCenter mapping) which reads the spreadsheet, matches the sources and targets against the metadata in the repository and produces the XML output.

Finally, the potential for Metadata-Driven build arises when the import of source and target metadata enables transformation requirements to be inferred, which also requires a mechanism for mapping sources to targets. For example, when a text source column is mapped to a numeric target column the inferred rule is to test for data type compatibility.

The first stage in the implementation of an autogeneration strategy is to decide which of these autogeneration types is applicable and to ensure that the appropriate technology is available. In most cases, it is the Pattern-Driven build which is the main area of interest; this is precisely the requirement which the mapping generation license option within PowerCenter is designed to address. This option uses the freely distributed Informatica Data Stencil design tool for Microsoft Visio and freely distributed Informatica Velocity-based mapping templates to accelerate and automate mapping design. Generally speaking, applications which involve a small number of highly-complex flows of data tailored to very specific source/target attributes are not good candidates for pattern-driven autogeneration.

Currently, there is a great deal of product innovation in the areas of Rules-Driven and Metadata-Driven autogeneration. One option is to use PowerCenter itself, via an XML target, to generate the required XML files that are later imported as mappings. Depending on the scale and complexity of both the autogeneration rules and the functionality of the generated components, it may be advisable to acquire a license for the PowerCenter Unstructured Data option.

In conclusion, at the end of this stage the type of autogeneration should be identified and all the required technology licenses should be acquired.

2. Design the Assembly Line


It is assumed that the standard development activities in the Velocity Architect and Design phases have been undertaken and at this stage, the development team should understand the data and the value to be added to it. It should be possible to identify the patterns of data movement. The main stages in designing the assembly line are:
- Manually develop a prototype
- Distinguish between the generic and the flow-specific components
- Establish the boundaries and inter-action between generated and manually built components
- Agree the format and syntax for the specification of the rules (usually Excel)
- Articulate the rules in the agreed format
- Incorporate component generation in the overall development process
- Develop the manual components (if any)

It is recommended that a prototype is manually developed for a representative subset of the sources and targets, since the adoption of autogeneration techniques does not obviate the need for a re-usability strategy. Even if some components are generated rather than built, it is still necessary to distinguish between the generic and the flow-specific components. This will allow the generic functionality to be mapped onto the appropriate re-usable PowerCenter components: mapplets, transformations, user-defined functions, etc. The manual development of the prototype also allows the scope of the autogeneration to be established. It is unlikely that every single required PowerCenter component can be generated, and the scope may be restricted by the current capabilities of the PowerCenter Visio Stencil. It is necessary to establish the demarcation between generated and manually-built components. It will also be necessary to devise a customization strategy if the autogeneration is seen as a repeatable process. How are manual modifications to the generated component to be implemented? Should this be isolated in discrete components which are called from the generated components?

If the autogeneration strategy is based on an application rather than the Visio stencil mapping generation option, ensure that the components you are planning to generate are consistent with the restrictions on the XML export file by referring to the product documentation.

TIP
If you modify an exported XML file, you need to make sure that the XML file conforms to the structure of powrmart.dtd. You also need to make sure the metadata in the XML file conforms to Designer and Workflow Manager rules. For example, when you define a shortcut to an object, define the folder in which the referenced object resides as a shared folder. Although PowerCenter validates the XML file before importing repository objects from it, it might not catch all invalid changes. If you import into the repository an object that does not conform to Designer or Workflow Manager rules, you may cause data inconsistencies in the repository. Do not modify the powrmart.dtd file.

CRCVALUE Codes
Informatica restricts which elements you can modify in the XML file. When you export a Designer object, the PowerCenter Client might include a Cyclic Redundancy Checking Value (CRCVALUE) code in one or more elements in the XML file. The CRCVALUE code is another attribute in an element. When the PowerCenter Client includes a CRCVALUE code in the exported XML file, you can modify some attributes and elements before importing the object into a repository. For example, VSAM source objects always contain a CRCVALUE code, so you can only modify some attributes in a VSAM source object. If you modify certain attributes in an element that contains a CRCVALUE code, you cannot import the object.

For more information, refer to the Chapter on Exporting and Importing Objects in the PowerCenter Repository Guide.

3. Build the Assembly Line


Essentially, the requirements for the autogeneration may be discerned from the XML exports of the manually developed prototype.

Autogeneration Based on Visio Data Stencil

(Refer to the product documentation for more information on installation, configuration and usage.) It is important to confirm that all the required PowerCenter transformations are supported by the installed version of the Stencil. The use of an external industry-standard interface such as MS Visio allows the tool to be used by Business Analysts rather than PowerCenter specialists. Apart from allowing the mapping patterns to be specified, the Stencil may also be used as a documentation tool. Essentially, there are three usage stages:

- Implement the Design in a Visio template
- Publish the Design
- Generate the PC Components


A separate Visio template is defined for every pattern identified in the design phase. A template can be created from scratch or imported from a mapping export; an example is shown below:

The icons for transformation objects should be familiar to PowerCenter users. Less easily understood will be the concept of properties for the links (i.e., relationships) between the objects in the Stencil. These link rules define what ports propagate from one transformation to the next and there may be multiple rules in a single link. Essentially, the process of developing the template consists of identifying the dynamic components in the pattern and parameterizing them, such as:

- Source and target table name
- Source primary key, target primary key
- Lookup table name and foreign keys
- Transformations

Once the template is saved and validated, it needs to be published, which simply makes it available in formats that the generating mechanisms can understand, such as:

- Mapping template parameter XML
- Mapping template XML

One of the outputs from the publishing is the template for the definition of the parameters specified in the template. An example of a modified file is shown below:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE PARAMETERS SYSTEM "parameters.dtd">
<PARAMETERS REPOSITORY_NAME="REP_MAIN" REPOSITORY_VERSION="179" REPOSITORY_CODEPAGE="MS1252" REPOSITORY_DATABASETYPE="Oracle">
  <MAPPING NAME="M_LOAD_CUSTOMER_GENERATED" FOLDER_NAME="PTM_2008_VISIO_SOURCE" DESCRIPTION="M_LOAD_CUSTOMER">
    <PARAM NAME="$SRC_KEY$" VALUE="CUSTOMER_CODE" />
    <PARAM NAME="$TGT$" VALUE="CUSTOMER_DIM" />
    <PARAM NAME="$TGT_KEY$" VALUE="CUSTOMER_ID" />
    <PARAM NAME="$SRC$" VALUE="CUSTOMER_MASTER" />
  </MAPPING>
  <MAPPING NAME="M_LOAD_PRODUCT_GENERATED" FOLDER_NAME="PTM_2008_VISIO_SOURCE" DESCRIPTION="M_LOAD_CUSTOMER">
    <PARAM NAME="$SRC_KEY$" VALUE="PRODUCT_CODE" />
    <PARAM NAME="$TGT$" VALUE="PRODUCT_DIM" />
    <PARAM NAME="$TGT_KEY$" VALUE="PRODUCT_ID" />
    <PARAM NAME="$SRC$" VALUE="PRODUCT_MASTER" />
  </MAPPING>
</PARAMETERS>

This file is only used in scripted generation. The other output from the publishing is the template in XML format. This file is only used in manual generation. There is a choice of either manual or scripted mechanisms for generating components from the published files. The manual mechanism involves the importation of the published XML template through the Mapping Template Import Wizard in the PowerCenter Designer. The parameters defined in the template are entered manually through the user interface. Alternately, the scripted process is based on a supplied command-line utility, mapgen. The first stage is to manually modify the published parameter file to specify values for all the mappings to be generated. The second stage is to use PowerCenter to export source and target definitions for all the objects referenced in the parameter file. These are required in order to generate the ports. Mapgen requires the following syntax:

- <-t> Visio Drawing File (i.e., mapping source)
- <-p> ParameterFile (i.e., parameters)
- <-o> MappingFile (i.e., output)
- [-d] TableDefinitionDir (i.e., metadata sources & targets)
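A sample invocation is shown below; the file names are placeholders for the published Visio template, the edited parameter file, the XML file to be generated, and the directory holding the exported source and target definitions:

mapgen -t load_dimension_template.vsd -p dimension_parameters.xml -o generated_mappings.xml -d table_definitions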

The generated output file is imported using the standard import facilities in PowerCenter.

TIP Even if the scripted option is selected as the main generating mechanism, use the Mapping Template Import Wizard in the PC Designer to generate the first mapping; this allows the early identification of any errors or inconsistencies in the template.

Autogeneration Based on Informatica Application

This strategy generates PowerCenter XML but can be implemented through either PowerCenter itself or the Unstructured Data option. Essentially, it will require the same build sub-stages as any other data integration application. The following components are anticipated:

- Specification of the formats for source to target mapping and transformation rules definition
- Development of a mapping to load the specification spreadsheets into a table
- Development of a mapping to validate the specification and report errors
- Development of a mapping to generate the XML output excluding critical errors
- Development of a component to automate the importation of the XML output into PowerCenter

One of the main issues to be addressed is whether there is a single generation engine which deals with all of the required patterns, or a series of pattern-specific generation engines. One of the drivers for the design should be the early identification of errors in the specifications. Otherwise the first indication of any problem will be the failure of the XML output to import in PowerCenter. It is very important to define the process around the generation and to allocate responsibilities appropriately.

Autogeneration Based on Java Application

Assuming the appropriate skills are available in the development team, an alternative technique is to develop a Java application to generate the mapping XML files. The PowerCenter Mapping SDK is a Java API that provides all of the elements required to generate mappings. The Mapping SDK can be found in the client installation directory. It contains:

- The javadoc (api directory), which describes all the classes of the Java API
- The API (lib directory), which contains the jar files used for a Mapping SDK application
- Some basic samples, which show how Java development with the Mapping SDK is done

The Java application also requires a mechanism to define the final mapping between source and target structures; the application interprets this data source and combines it with the metadata in the repository in order to output the required mapping XML.

4. Implement the QA and Testing Strategies


Presumably there should be less of a requirement for QA and Testing with generated components. This does not mean that the need to test no longer exists. To some extent, the testing effort should be re-directed to the components in the Assembly line itself. There is a great deal of material in Velocity to support QA and Test activities. In particular, refer to Naming Conventions. Informatica suggests adopting a Naming Convention that distinguishes between generated and manually-built components. For more information on the QA strategy refer to Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance. Otherwise, the main areas of focus for testing are:
Last updated: 26-May-08 18:26


Mapping Design

Challenge


Optimizing PowerCenter to create an efficient execution environment.

Description
Although PowerCenter environments vary widely, most sessions and/or mappings can benefit from the implementation of common objects and optimization procedures. Follow these procedures and rules of thumb when creating mappings to help ensure optimization.

General Suggestions for Optimizing


1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.
2. Consider more shared memory for large numbers of transformations. Session shared memory between 12MB and 40MB should suffice.
3. Calculate once, use many times (see the variable port sketch following this list).
   - Avoid calculating or testing the same value over and over. Calculate it once in an expression, and set a True/False flag.
   - Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation.
4. Only connect what is used.
   - Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier. This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to only have necessary ports set as input and output to reconnect.
   - In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.
5. Watch the data types.
   - The engine automatically converts compatible types. Sometimes data conversion is excessive. Data types are automatically converted when types differ between connected ports.
   - Minimize data type changes between transformations by planning data flow prior to developing the mapping.
6. Facilitate reuse.
   - Plan for reusable transformations upfront.
   - Use variables. Use both mapping variables and ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times.
   - Use mapplets to encapsulate multiple reusable transformations.
   - Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.
7. Only manipulate data that needs to be moved and transformed.
   - Reduce the number of non-essential records that are passed through the entire mapping.
   - Use active transformations that reduce the number of records as early in the mapping as possible (i.e., placing filters, aggregators as close to source as possible).
   - Select the appropriate driving/master table while using joins. The table with the lesser number of rows should be the driving/master table for a faster join.
8. Utilize single-pass reads.
   - Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads this source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.
   - Remove or reduce field-level stored procedures.
9. Utilize Pushdown Optimization.
   - Design mappings so they can take advantage of the Pushdown Optimization feature. This improves performance by allowing the source and/or target database to perform the mapping logic.
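As an illustration of point 3, the sketch below shows a variable port that cleans a value once so that several output ports can reuse it without repeating the function calls; the port names are assumptions:

v_CLEAN_NAME (variable port):   LTRIM(RTRIM(CUSTOMER_NAME))
o_MATCH_NAME (output port):     UPPER(v_CLEAN_NAME)
o_DISPLAY_NAME (output port):   v_CLEAN_NAME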

Lookup Transformation Optimizing Tips


1. When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10 to 20 percent.
2. The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less. If the row byte count is more than 1,024, then you need to adjust the 500K-row standard down as the number of bytes increase (i.e., a 2,048 byte row can drop the cache row count to between 250K and 300K, so the lookup table should not be cached in this case). This is just a general rule though. Try running the session with a large lookup cached and not cached. Caching is often faster on very large lookup tables.
3. When using a Lookup Table Transformation, improve lookup performance by placing all conditions that use the equality operator = first in the list of conditions under the condition tab.
4. Cache only lookup tables if the number of lookup calls is more than 10 to 20 percent of the lookup table rows. For fewer number of lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (i.e., less than 5,000 rows), cache for more than 5 to 10 lookup calls.
5. Replace lookup with decode or IIF (for small sets of values).
6. If caching lookups and performance is poor, consider replacing with an unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table to a persistent file on the first run, enable the "update else insert" option on the dynamic cache and the engine never has to go back to the database to read data from this table. You can also partition this persistent cache at run time for further performance gains.
8. When handling multiple matches, use the "Return any matching value" setting whenever possible. Also use this setting if the lookup is being performed to determine that a match exists, but the value returned is irrelevant. The lookup creates an index based on the key ports rather than all lookup transformation ports. This simplified indexing process can improve performance.
9. Review complex expressions.
   - Examine mappings via Repository Reporting and Dependency Reporting within the mapping.
   - Minimize aggregate function calls.
   - Replace Aggregate Transformation object with an Expression Transformation object and an Update Strategy Transformation for certain types of Aggregations.

Operations and Expression Optimizing Tips


1. Numeric operations are faster than string operations.
2. Optimize char-varchar comparisons (i.e., trim spaces before comparing).
3. Operators are faster than functions (i.e., || vs. CONCAT).
4. Optimize IIF expressions.
5. Avoid date comparisons in lookup; replace with string.
6. Test expression timing by replacing with constant.
7. Use flat files.
   - Using flat files located on the server machine loads faster than a database located in the server machine.
   - Fixed-width files are faster to load than delimited files because delimited files require extra parsing.
   - If processing intricate transformations, consider loading first to a source flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL Selects where appropriate.
8. If working with data that is not able to return sorted data (e.g., Web Logs), consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter Transformations.
10. Use a Sorter Transformation or hash-auto keys partitioning before an Aggregator Transformation to optimize the aggregate. With a Sorter Transformation, the Sorted Ports option can be used even if the original source cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of the same target.
12. Rejected rows from an update strategy are logged to the bad file. Consider filtering before the update strategy if retaining these rows is not critical because logging causes extra overhead on the engine. Choose the option in the update strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup transformation just in front of the target to retrieve the primary key. The primary key update is much faster than the non-indexed lookup override.

Suggestions for Using Mapplets


A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as necessary. Use the Mapplet Designer to create mapplets.


Mapping SDK

Challenge


Understand how to create PowerCenter repository objects such as mappings, sessions and workflows using the Java programming language instead of the PowerCenter client tools.

Description
PowerCenter's Mapping Software Developer Kit (SDK) is a set of interfaces that can be used to generate PowerCenter XML documents containing mappings, sessions and workflows. The Mapping SDK is a Java API that provides all of the elements needed to set up mappings in the repository where metadata is stored. These elements are the objects usually used in the PowerCenter Designer and Workflow Manager, such as source and target definitions, transformations, mapplets, mappings, tasks, sessions and workflows. The Mapping SDK can be found in the PowerCenter client installation. In the Mapping SDK directory, the following components are available:
- The javadoc (api directory), which describes all the classes of the Java API
- The API (lib directory), which contains the jar files used for a Mapping SDK application
- Some basic samples, which show how Java development with the Mapping SDK can be done

Below is a simplified Class diagram that represents the Mapping SDK:


The purpose of the Mapping SDK feature is to improve design and development efficiency for repetitive tasks during the implementation. The Mapping SDK can also be used for mapping autogeneration purposes to complete data-flow for repetitive tasks with various structures of data. This can be used to create on-demand mappings with the same transformations between various sources and targets. A particular advantage for a project that has been designed using mapping autogeneration comes with project maintenance: the project team will be able to regenerate mappings quickly using the new source or target structure definitions. The sections below are an example of a Mapping SDK implementation for mapping autogeneration purposes. Mapping auto-generation is based on a low-level Java API, which means that there are many ways to create mappings. The development of such a tool requires knowledge and skills in PowerCenter object design as well as Java program development. To implement the mapping auto-generation method, the project team should follow these tasks:
- Identify repetitive data mappings which will be common for task and methodology.
- Create samples of these mappings.
- Define where data structures are stored (e.g., database catalog, file, COBOL copybook).
- Develop a Java application using the Mapping SDK which is able to obtain the data structure of the project and to generate the mapping defined.

Identify Repetitive Data Mappings



In most projects there are some tasks or mappings that are similar and vary only in the structure of the data they transform. Examples of these types of mappings include:
- Loading a table from a flat file
- Performing incremental loads on historical and non-historical tables
- Extracting table data to files

During the design phase of the project, the Business Analyst and the Data Integration developer need to identify which tasks or mappings can be designed as repetitive tasks to improve the future design for similar tasks.

Create A Sample Mapping


During the design phase, the Data Integration developer must develop a sample mapping for each repetitive task that has been identified. This will help to outline how the data mapping could be designed. For example, define the needed transformations, mappings, tasks and processes needed to create the data mapping. A mapping template can be used for this purpose. Frequently, the repetitive tasks correspond to one of the sample data mappings that have been defined as mapping templates in Informatica's Customer Portal.

Define The Location Where Data Structures are Stored


An important point for the mapping auto-generation method is to define where the data structure can be found that is needed to create the final mapping between the source and target structure. You can build a Java application that will build a PowerCenter mapping with dynamic source and target definitions stored in:
- A set of data files
- A database catalog
- A structured file such as a COBOL copybook or an XML Schema file

The final application may contain a set of functionalities to map the source and the target structure definitions.

Develop A Java Application Using The Mapping SDK


As a final step during the build phase, develop a Java application that will create (according to the source and target structure definition) the final mapping definition that includes all of the column specifications for the source and target. This application will be based on the Mapping SDK, which provides all of the resources to create an XML file containing the mapping, session and workflow definition. This application has to be developed in such a way as to generate all of the types of mappings that were defined during the design phase.
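The fragment below is only a simplified stand-in for such an application: instead of the Mapping SDK classes, it uses plain JDK I/O to emit a skeletal export file, and the element and attribute names are illustrative assumptions rather than the complete powrmart.dtd format. A real generator would build the full mapping definition (sources, targets, transformations and connectors) either through the SDK objects or in fully conformant XML.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class MappingXmlSketch {
    public static void main(String[] args) throws IOException {
        // Assumed names; in practice these come from the source/target structure definitions.
        String folderName  = "PTM_2008_VISIO_SOURCE";
        String mappingName = "M_LOAD_CUSTOMER_GENERATED";

        try (PrintWriter out = new PrintWriter(new FileWriter("mapping_export.xml"))) {
            out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            out.println("<POWRMART>");
            out.println("  <REPOSITORY NAME=\"REP_MAIN\" DATABASETYPE=\"Oracle\">");
            out.println("    <FOLDER NAME=\"" + folderName + "\">");
            // A complete export would also contain SOURCE, TARGET and TRANSFORMATION elements here.
            out.println("      <MAPPING NAME=\"" + mappingName + "\" DESCRIPTION=\"generated\"/>");
            out.println("    </FOLDER>");
            out.println("  </REPOSITORY>");
            out.println("</POWRMART>");
        }
    }
}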


Last updated: 29-May-08 13:18


Mapping Templates

Challenge


Mapping Templates demonstrate proven solutions for tackling challenges that commonly occur during data integration development efforts. Mapping Templates can be used to make the development phase of a project more efficient. Mapping Templates can also serve as a medium to introduce development standards into the mapping development process that developers need to follow. A wide array of Mapping Template examples can be obtained for the most current PowerCenter version from the Informatica Customer Portal. As "templates," each of the objects in Informatica's Mapping Template Inventory illustrates the transformation logic and steps required to solve specific data integration requirements. These sample templates, however, are meant to be used as examples, not as means to implement development standards.

Description
Reuse Transformation Logic
Templates can be heavily used in a data integration and warehouse environment, when loading information from multiple source providers into the same target structure, or when similar source system structures are employed to load different target instances. Using templates guarantees that any transformation logic that is developed and tested correctly, once, can be successfully applied across multiple mappings as needed. In some instances, the process can be further simplified if the source/target structures have the same attributes, by simply creating multiple instances of the session, each with its own connection/execution attributes, instead of duplicating the mapping.

Implementing Development Techniques


When the process is not simple enough to allow usage based on the need to duplicate transformation logic to load the same target, Mapping Templates can help to reproduce transformation techniques. In this case, the implementation process requires more than just replacing source/target transformations. This scenario is most useful when certain logic (i.e., a logical group of transformations) is employed across mappings. In many instances this can be further simplified by making use of mapplets. Additionally, user-defined functions can be utilized to reuse expression logic and to build complex expressions using the transformation language.

Transport mechanism
Once Mapping Templates have been developed, they can be distributed by any of the following procedures:
- Copy mapping from development area to the desired repository/folder
- Export mapping template into XML and import to the desired repository/folder

Mapping template examples


The following Mapping Templates can be downloaded from the Informatica Customer Portal and are listed by subject area:

Common Data Warehousing Techniques


- Aggregation using Sorted Input
- Tracking Dimension History
- Constraint-Based Loading
- Loading Incremental Updates
- Tracking History and Current
- Inserts or Updates

Transformation Techniques
- Error Handling Strategy
- Flat File Creation with Headers and Footers
- Removing Duplicate Source Records
- Transforming One Record into Multiple Records
- Dynamic Caching
- Sequence Generator Alternative
- Streamline a Mapping with a Mapplet
- Reusable Transformations (Customers)
- Using a Sorter


- Pipeline Partitioning Mapping Template
- Using Update Strategy to Delete Rows
- Loading Heterogenous Targets
- Load Using External Procedure

Advanced Mapping Concepts


- Aggregation Using Expression Transformation
- Building a Parameter File
- Best Build Logic
- Comparing Values Between Records
- Transaction Control Transformation

Source-Specific Requirements
- Processing VSAM Source Files
- Processing Data from an XML Source
- Joining a Flat File with a Relational Table

Industry-Specific Requirements
- Loading SWIFT 942 Messages
- Loading SWIFT 950 Messages

Last updated: 01-Feb-07 18:53


Naming Conventions

Challenge


A variety of factors are considered when assessing the success of a project. Naming standards are an important, but often overlooked component. The application and enforcement of naming standards not only establishes consistency in the repository, but provides for a developer friendly environment. Choose a good naming standard and adhere to it to ensure that the repository can be easily understood by all developers.

Description
Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the former. Choosing a convention and sticking with it is the key. Having a good naming convention facilitates smooth migrations and improves readability for anyone reviewing or carrying out maintenance on the repository objects. It helps them to understand the processes being affected. If consistent names and descriptions are not used, significant time may be needed to understand the workings of mappings and transformation objects. If no description is provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective. The following pages offer suggested naming conventions for various repository objects. Whatever convention is chosen, it is important to make the selection very early in the development cycle and communicate the convention to project staff working on the repository. The policy can be enforced by peer review and at test phases by adding processes to check conventions both to test plans and to test execution documents.

Suggested Naming Conventions


Designer Objects

Mapping: m_{PROCESS}_{SOURCE_SYSTEM}_{TARGET_NAME}, or suffix with _{descriptor} if there are multiple mappings for that single target table.
Mapplet: mplt_{DESCRIPTION}
Target: {update_type(s)}_{TARGET_NAME}. This naming convention should only occur within a mapping, as the actual target name object affects the actual table that PowerCenter will access.
Aggregator Transformation: AGG_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.
Application Source Qualifier Transformation: ASQ_{TRANSFORMATION}_{SOURCE_TABLE1}_{SOURCE_TABLE2}; represents data from an application source.
Custom Transformation: CT_{TRANSFORMATION} name that describes the processing being done.
Data Quality Transform: IDQ_{descriptor}_{plan}, with the descriptor describing what the plan is doing and the optional plan name included if desired.
Expression Transformation: EXP_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.
External Procedure Transformation: EXT_{PROCEDURE_NAME}
Filter Transformation: FIL_ or FILT_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Flexible Target Key: Fkey{descriptor}
HTTP: http_{descriptor}
Idoc Interpreter: idoci_{Descriptor}_{IDOC Type}, defining what the IDoc does and possibly the IDoc message.
Idoc Prepare: idocp_{Descriptor}_{IDOC Type}, defining what the IDoc does and possibly the IDoc message.
Java Transformation: JV_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Joiner Transformation: JNR_{DESCRIPTION}
Lookup Transformation: LKP_{TABLE_NAME}, or suffix with _{descriptor} if there are multiple lookups on a single table. For unconnected lookups, use ULKP in place of LKP.
Mapplet Input Transformation: MPLTI_{DESCRIPTOR}, indicating the data going into the mapplet.
Mapplet Output Transformation: MPLTO_{DESCRIPTOR}, indicating the data coming out of the mapplet.
MQ Source Qualifier Transformation: MQSQ_{DESCRIPTOR}, defining the messaging being selected.
Normalizer Transformation: NRM_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Rank Transformation: RNK_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Router Transformation: RTR_{DESCRIPTOR}
SAP DMI Prepare: dmi_{Entity Descriptor}_{Secondary Descriptor}, defining what entity is being loaded and a secondary description if multiple DMI objects are being leveraged in a mapping.
Sequence Generator Transformation: SEQ_{DESCRIPTOR}; if generating keys for a target table entity, refer to that entity.
Sorter Transformation: SRT_{DESCRIPTOR}
Source Qualifier Transformation: SQ_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example a certain type of product: SQ_SALES_INSURANCE_PRODUCTS.
Stored Procedure Transformation: SP_{STORED_PROCEDURE_NAME}
Transaction Control Transformation: TCT_ or TRANS_{DESCRIPTOR}, indicating the function of the transaction control.
Union Transformation: UN_{DESCRIPTOR}
Unstructured Data Transform: UDO_{descriptor}, with the descriptor identifying the kind of data being parsed by the UDO transform.
Update Strategy Transformation: UPD_{UPDATE_TYPE(S)}, or UPD_{UPDATE_TYPE(S)}_{TARGET_NAME} if there are multiple targets in the mapping. E.g., UPD_UPDATE_EXISTING_EMPLOYEES.
Web Service Consumer: WSC_{descriptor}
XML Generator Transformation: XMG_{DESCRIPTOR}, defining the target message.
XML Parser Transformation: XMP_{DESCRIPTOR}, defining the messaging being selected.
XML Source Qualifier Transformation: XMSQ_{DESCRIPTOR}, defining the data being selected.
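As an illustration of these conventions, a mapping that loads a customer dimension from an SAP source might contain objects named as follows (all names are hypothetical):

m_SALES_SAP_CUSTOMER_DIM (mapping)
ASQ_SAP_KNA1 (application source qualifier)
EXP_FORMAT_CUSTOMER_NAMES (expression)
LKP_COUNTRY_CODES (connected lookup)
AGG_SUM_SALES_BY_CUSTOMER (aggregator)
SEQ_CUSTOMER_DIM_KEYS (sequence generator)
UPD_UPDATE_EXISTING_CUSTOMERS (update strategy)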

Port Names
Port names should remain the same as the source unless some other action is performed on the port. In that case, the port should be prefixed with the appropriate name. When the developer brings a source port into a lookup, the port should be prefixed with in_. This helps the user immediately identify the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port is transformed into an output port with the same name, prefix the input port with in_. Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it may travel through many other transformations. If the intention is to use the auto-link feature based on names, then outputs may be better left with the name of the target port in the next transformation. For variables inside a transformation, the developer can use the prefix v, var_ or v_ plus a meaningful name.

With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data from the database. Other transformations that are not applicable to the port standards are:

- Normalizer - The ports created in the Normalizer are automatically formatted when the developer configures it.
- Sequence Generator - The ports are reserved words.
- Router - Because output ports are created automatically, prefixing the input ports with I_ prefixes the output ports with I_ as well. Port names should not have any prefix.
- Sorter, Update Strategy, Transaction Control, and Filter - These ports are always input and output. There is no need to rename them unless they are prefixed. Prefixed port names should be removed.
- Union - The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in both the input and output. The port names should not have any prefix.

All other transformation object ports can be prefixed or suffixed with:


- in_ or i_ for Input ports
- o_ or _out (suffix) for Output ports
- io_ for Input/Output ports
- v, v_ or var_ for variable ports
- lkp_ for returns from lookups
- mplt_ for returns from mapplets

Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for longer port names. Transformation object ports can also:
- Have the Source Qualifier port name.
- Be unique.
- Be meaningful.
- Be given the target port name.

Transformation Descriptions
This section defines the standards to be used for transformation descriptions in the Designer.

Source Qualifier Descriptions. Should include the aim of the source qualifier and the data it is intended to select. Should also indicate if any overrides are used; if so, describe the filters or settings used. Some projects prefer items such as the SQL statement to be included in the description as well.

Lookup Transformation Descriptions. Describe the lookup along the lines of "the [lookup attribute] obtained from [lookup table name] to retrieve the [lookup attribute name]", where:
- Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.
- Lookup table name is the table on which the lookup is being performed.
- Lookup attribute name is the name of the attribute being returned from the lookup. If appropriate, specify the condition when the lookup is actually executed.
It is also important to note lookup features such as persistent cache or dynamic lookup.

Expression Transformation Descriptions. Must adhere to the following format: This expression [explanation of what the transformation does]. Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions being performed. Within each Expression, transformation ports have their own description in the format: This port [explanation of what the port is used for].

Aggregator Transformation Descriptions. Must adhere to the following format: This Aggregator [explanation of what the transformation does]. Aggregators can be distinctly different depending on the situation; therefore the explanation should be specific to the actions being performed. Within each Aggregator, transformation ports have their own description in the format: This port [explanation of what the port is used for].

Sequence Generator Transformation Descriptions. Must adhere to the following format: This Sequence Generator provides the next value for the [column name] on the [table name], where:
- Table name is the table being populated by the sequence number, and
- Column name is the column within that table being populated.

Joiner Transformation Descriptions. Must adhere to the following format: This Joiner uses [joining field names] from [joining table names], where:
- Joining field names are the names of the columns on which the join is done, and
- Joining table names are the tables being joined.

Normalizer Transformation Descriptions. Must adhere to the following format: This Normalizer [explanation], where explanation describes what the Normalizer does.

Filter Transformation Descriptions. Must adhere to the following format: This Filter processes [explanation], where explanation describes what the filter criteria are and what they do.

Stored Procedure Transformation Descriptions. Explain the stored procedure's functionality within the mapping (i.e., what does it return in relation to the input ports?).

Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.

Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an example, for an exchange rate mapplet, describe what currency the output value will be in. Answer questions like: Is the currency fixed or based on other data? What kind of rate is used - a fixed inter-company rate, an inter-bank rate, a business rate or a tourist rate? Has the conversion gone through an intermediate currency?

Update Strategy Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or determined by a calculation.

Sorter Transformation Descriptions. Explain which port(s) are being sorted and their sort direction.

Router Transformation Descriptions. Describe the groups and their functions.

Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if any) is expected to take place in later transformations in the mapping.

Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of the control to commit or roll back.

Custom Transformation Descriptions. Describe the function that the custom transformation accomplishes, what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure which is used.

External Procedure Transformation Descriptions. Describe the function of the external procedure, what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure that is used.

Java Transformation Descriptions. Describe the function of the Java code, what data is expected as input and what data is generated as output. Also indicate whether the Java code determines the object to be an Active or Passive transformation.

Rank Transformation Descriptions. Indicate the columns being used in the rank, the number of records returned from the rank, the rank direction, and the purpose of the transformation.

XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the purpose of the XML being generated.

XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser, and indicate the purpose of the transformation.

Mapping Comments
These comments describe the source data obtained and the structure file, table or facts and dimensions that it populates. Remember to use business terms along with such technical details as table names. This is beneficial when maintenance is required or if issues arise that need to be discussed with business analysts.

Mapplet Comments
These comments are used to explain the process that the mapplet carries out. Always be sure to see the notes regarding descriptions for the input and output transformation.

Repository Objects
Repositories, as well as repository level objects, should also have meaningful names. Repositories should prefix with either L_ for local or G for global and a descriptor. Descriptors usually include information about the project and/or level of the environment (e.g., PROD, TEST, DEV).

Folders and Groups


Working folder names should be meaningful and include project name and, if there are multiple folders for that one project, a descriptor. User groups should also include project name and descriptors, as necessary. For example, folder DW_SALES_US and DW_SALES_UK could both have TEAM_SALES as their user group. Individual developer folders or non-production folders should prefix with z_ so that they are grouped together and not confused with working production folders.

Shared Objects and Folders


Any object within a folder can be shared across folders and maintained in one central location. These objects are sources, targets, mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. In addition to facilitating maintenance, shared folders help reduce the size of the repository since shortcuts are used to link to the original, instead of copies. Only users with the proper permissions can access these shared folders. These users are responsible for migrating the folders across the repositories and, with help from the developers, for maintaining the objects within the folders. For example, if an object is created by a developer and is to be shared, the developer should provide details of the object and the level at which the object is to be shared before the Administrator accepts it as a valid entry into the shared folder. The developers, not necessarily the creator, control the maintenance of the object, since they must ensure that a subsequent change does not negatively impact other objects. If the developer has an object that he or she wants to use in several mappings or across multiple folders, like an Expression transformation that calculates sales tax, the developer can place the object in a shared folder. Then use the object in other folders by creating a shortcut to the object. In this case, the naming convention is sc_ (e.g., sc_EXP_CALC_SALES_TAX). The folder should prefix with SC_ to identify it as a shared folder and keep all shared folders grouped together in the repository.

Workflow Manager Objects


Session: s_{MappingName}
Command Object: cmd_{DESCRIPTOR}
Worklet: wk_ or wklt_{DESCRIPTOR}
Workflow: wkf_ or wf_{DESCRIPTOR}
Email Task: email_ or eml_{DESCRIPTOR}
Decision Task: dcn_ or dt_{DESCRIPTOR}
Assign Task: asgn_{DESCRIPTOR}
Timer Task: timer_ or tmr_{DESCRIPTOR}
Control Task: ctl_{DESCRIPTOR} - Specify when and how the PowerCenter Server is to stop or abort a workflow by using the Control task in the workflow.
Event Wait Task: wait_ or ew_{DESCRIPTOR} - Waits for an event to occur. Once the event triggers, the PowerCenter Server continues executing the rest of the workflow.
Event Raise Task: raise_ or er_{DESCRIPTOR} - Represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, the Event-Raise task triggers the event. Use the Event-Raise task with the Event-Wait task to define events.
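For example, a workflow that loads a customer dimension might contain objects named as follows (hypothetical names, following the conventions above):

wkf_LOAD_CUSTOMER_DIM (workflow)
s_m_SALES_SAP_CUSTOMER_DIM (session)
cmd_ARCHIVE_SOURCE_FILES (command object)
email_NOTIFY_LOAD_FAILURE (email task)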

ODBC Data Source Names


All Open Database Connectivity (ODBC) data source names (DSNs) should be set up in the same way on all client machines. PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC DSN, since the PowerCenter Client talks to all databases through ODBC. Also be sure to set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This approach reduces the chance of discrepancies occurring when users work on different (i.e., colleagues') machines and have to recreate a DSN on a separate machine.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example, machine 1 has ODBC DSN Name0 that points to database1; TableA gets analyzed on machine 1 and is uniquely identified as Name0.TableA in the repository. Machine 2 has ODBC DSN Name1 that points to database1; TableA gets analyzed on machine 2 and is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.

Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev, to test, to prod, PowerCenter can wind up with source objects called dev_db01 in the production repository. ODBC database names should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.

Database Connection Information


Security considerations may dictate using the company name of the database or project instead of {user}_{database name}, except for developer scratch schemas, which are not found in test or production environments. Be careful not to include machine names or environment tokens in the database connection name. Database connection names must be very generic to be understandable and ensure a smooth migration. The naming convention should be applied across all development, test, and production environments. This allows seamless migration of sessions when migrating between environments. If an administrator uses the Copy Folder function for migration, session information is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also copied. So, if the developer uses connections with names like Dev_DW in the development repository, they are likely to eventually wind up in the test, and even the production repositories as the folders are migrated. Manual intervention is then necessary to change connection names, user names, passwords, and possibly even connect strings. Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the development environment to the test environment, the sessions automatically use the existing connection in the test repository. With the right naming convention, you can migrate sessions from the test to production repository without manual intervention.

TIP At the beginning of a project, have the Repository Administrator or DBA setup all connections in all environments based on the issues discussed in this Best Practice. Then use permission options to protect these connections so that only specified individuals can modify them. Whenever possible, avoid having developers create their own connections using different conventions and possibly duplicating connections.

Administration Console Objects


Administration console objects such as domains, nodes, and services should also have meaningful names.

Domain: DOM_ or DMN_[PROJECT]_[ENVIRONMENT] - Example: DOM_PROCURE_DEV
Node: NODE[#]_[SERVER_NAME]_[optional_descriptor] - Example: NODE02_SERVER_rs_b (backup node for the repository service)
Integration Service: INT_SVC_[ENVIRONMENT]_[optional descriptor] - Example: INT_SVC_DEV_primary
Repository Service: REPO_SVC_[ENVIRONMENT]_[optional descriptor] - Example: REPO_SVC_TEST
Web Services Hub: WEB_SVC_[ENVIRONMENT]_[optional descriptor] - Example: WEB_SVC_PROD

PowerCenter PowerExchange Application/Relational Connections


Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager. When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target databases. Connections are saved in the repository. For PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you configure depends on the type of source data you want to extract and the extraction mode (e.g., PWX[MODE_INITIAL]_[SOURCE]_[Instance_Name]). The following examples list the source type and extraction mode, the connection category, the connection type, and the recommended naming convention.

DB2/390 Bulk Mode: Relational connection, PWX DB2390 - PWXB_DB2_Instance_Name
DB2/390 Change Mode: Application connection, PWX DB2390 CDC Change - PWXC_DB2_Instance_Name
DB2/390 Real Time Mode: Application connection, PWX DB2390 CDC Real Time - PWXR_DB2_Instance_Name
IMS Batch Mode: Application connection, PWX NRDB Batch - PWXB_IMS_Instance_Name
IMS Change Mode: Application connection, PWX NRDB CDC Change - PWXC_IMS_Instance_Name
IMS Real Time: Application connection, PWX NRDB CDC Real Time - PWXR_IMS_Instance_Name
Oracle Change Mode: Application connection, PWX Oracle CDC Change - PWXC_ORA_Instance_Name
Oracle Real Time: Application connection, PWX Oracle CDC Real Time - PWXR_ORA_Instance_Name

PowerCenter PowerExchange Target Connections


The connection you configure depends on the type of target data you want to load.

DB2/390: PWX DB2390 relational database connection - PWXT_DB2_Instance_Name
DB2/400: PWX DB2400 relational database connection - PWXT_DB2_Instance_Name

Last updated: 05-Dec-07 16:20


Naming Conventions - B2B Data Transformation

Challenge


As with any development process, the use of clear, consistent, and documented naming conventions contributes to the effective use of Informatica B2B Data Transformation. The purpose of this document is to provide suggested naming conventions for the major structural elements of B2B Data Transformation solutions.

Description
The process of creating a B2B Data Transformation solution consists of several logical phases, each of which has implications for naming conventions. Some of these naming conventions are based upon best practices discovered during the creation of B2B Data Transformation solutions; others are restrictions imposed on the naming of solution artifacts due to both the use of the underlying file system and the need to make solutions callable from a wide variety of host runtime and development environments. The main phases involved in the construction of a B2B Data Transformation solution are:

1. The creation of one or more transformation projects using the B2B Data Transformation Studio (formerly known as ContentMaster Studio) authoring environment. A typical solution may involve the creation of many transformation projects.
2. The publication of the transformation projects as transformation services.
3. The deployment of the transformation services.
4. The creation/configuration of the host integration environment to invoke the published transformation services.

Each of these phases has implications for the naming of transformation solution components and artifacts (i.e., projects, TGP scripts, schemas, published services). Several common patterns occur in B2B Data Transformation solutions that have implications for naming:
- Many components are realized physically as file system objects such as files and directories. For maximum compatibility and portability, it is desirable to name these objects so that they can be transferred between Windows, UNIX and other platforms without having to rename them to conform to different file system conventions.
- Inputs and outputs to and from B2B Data Transformation services are often files or entities designated by URLs. Again, restrictions of the underlying file systems play an important role here.
- B2B Data Transformation solutions are designed to be embeddable, that is, callable from a host application or environment through the use of scripts, programming language APIs provided for languages such as C, C# and Java, and agents for PowerCenter and other platforms. Hence some of the naming conventions are based on maximizing usability of transformation services from within various host environments or APIs.
- Within B2B Data Transformation projects, most names and artifacts are global; the scope of names is global to the project.

B2B Data Transformation Studio Designer


B2B Data Transformation Studio is the user interface for the development of B2B Data Transformation solutions. It is based on the open source Eclipse environment and inherits many of its characteristics regarding project naming and structure. The workspace is organized as a set of sub-directories, with one sub-directory representing each project. A specially designated directory named .metadata is used to hold metadata about the current workspace. For more information about Studio Designer and the workspace, refer to Establishing a B2B Data Transformation Development Architecture.


At any common level of visibility, B2B Data Transformation requires that all elements have distinct names. Thus no two projects within a repository or workspace may share the same name. Likewise, no two TGP script files, XML schemas, global parser, mapper, serializer or variable definition may share the same name. Within a transformation (such as parser, mapper or serializer) groupings, actions or subsections of a transformation may be assigned names. In this context, the name does not strictly identify the section but is used as both a developer convenience and as a way to identify the section in the event file. In this case, names are allowed to be duplicated and often the name serves as a shorthand comment about the section. In these cases, there are no restrictions on the name although it is recommended that the name is unique, short and intuitively identifies the section. Often the name may be used to refer to elements in the specification (such as Map 835 ISA Segment). Contrary to the convention for global names, spaces are often used for readability. To distinguish between sub-element names that are only used within transformations, and the names of entry points, scripts and variables that are used as service parameters etc., refer to these names as public names.

B2B Data Transformation Studio Best Practices


As B2B Data Transformation Studio will load all projects in the current workspace into the studio environment, keeping all projects under design in a single workspace leads to both excessive memory usage and logical clutter between transformations belonging to different, possibly unrelated, solutions. Note: B2B Data Transformation Studio allows for the closing of projects to reduce memory consumption. While this helps with memory consumption, it does not address the logical organization benefits of using separate workspaces.

Use Separate Workspaces for Separate Solutions
For distinct logical solutions, it is recommended to use separate workspaces to organize the projects relating to each solution. Refer to Establishing a B2B Data Transformation Development Architecture for more information.

Create Separate Transformation Projects for Each Distinct Service
From a logical organization perspective, it is easier to manage data transformation solutions if only one primary service is published from each project. Secondary services from the same project should be reserved for the publication of test or troubleshooting variations of the same primary service. The one exception to this should be where multiple services are substantially the same, with the same transformation code but with minor differences to inputs. One alternative to publishing multiple services from the same project is to publish a shared service which is then called by the other services in order to perform the common transformation routines. For ease of maintenance, it is often desirable to name the project after the primary service which it publishes. While these do not have to be the same, it is a useful convention and simplifies the management of projects.

Use Names Compatible with Command Line Argument Formats
When a transformation service is invoked at runtime, it may be invoked on the command line (via cm_console), via .NET or Java APIs, via integration agents that invoke a service from a hosting platform such as WebMethods, BizTalk or IBM ProcessServer, or from PowerCenter via the UDO option for PowerCenter.

Use Names Compatible with Programming Language Function Names
While the programming APIs allow for the use of any string as the name, to simplify interoperability with future APIs and command line tools, the service name should be compatible with the rules for C# and Java variable names, and with argument names for Windows, UNIX and other OS command lines.

Use Names Compatible with File System Naming on Unix and Windows
Due to the files produced behind the scenes, published service names and project names need to be compatible with the naming conventions for file and directory names on their target platforms. To allow for optimal cross-platform migration in the future, names should be chosen to be compatible with file naming restrictions on Windows, UNIX and other platforms.

Do Not Include Version or Date Information in Public Names
It is recommended that project names, published service names, names of publicly accessible transformations and other public names do not include version numbers or date of creation information. Due to the way in which B2B Data Transformation operates, the use of dates or version numbers would make it difficult to use common source code control systems to track changes to projects. Unless the version corresponds to a different version of a business problem - such as dealing with two different versions of an HL7 specification - it is recommended that names do not include version or date information.

Naming B2B Data Transformation Projects


When a project is created, the user is prompted for the project name.

Project names will be used by default as the published service name. Both the directory for the project within a workspace and the main cmw project file name will be based on the project name. Due to the recommendation that the project name is used to define the published service name, the project name should not conflict with the name of an existing service unless the project publishes that service.

Note: B2B Data Transformation disallows the use of $, , ~, , ^, *, ?, >, <, comma, `, \, /, ;, | in project names.

Project naming should be clear and consistent within both a repository and a workspace. The exact approach to naming will vary depending on an organization's needs.

Project Naming Best Practices


Project Names Must Be Unique Across the Workspaces in Which They Occur
Also, if project-generated services will be deployed onto separate production environments, the naming of services will need to be unique on those environments as well.

Do Not Name a Project after a Published Service, unless the Project Produces that Published Service
This requirement can be relaxed if service names distinct from project names are being used.

Do Not Name a Project .metadata
This will conflict with the underlying Eclipse metadata.

Do Not Include Version or Date Information in Project Names
While it may be appealing to use version or date indicators in project names, the ideal solution for version tracking of services is to use a source control system such as CVS, Visual Studio SourceSafe, Source Depot or one of the many other commercially available or open-source source control systems.

Consider Including the Source Format in the Name
If transformations within a project will operate predominantly on one primary data source format, including the data source in the project name may be helpful. For example: TranslateHipaa837ToXml.

Consider Including the Target Format in the Name
If transformations within a project will produce predominantly one target data format, including the data format in the project name may be helpful. For example: TranslateCobolCopybookToSwift.

Use Short, Descriptive Project Names
Include enough descriptive information within the project name to indicate its function. Remember that the project name will also determine the default published service name. For ease of readability in B2B Data Transformation Studio, it is also recommended to keep project names to 80 characters or less. Consider also conforming to C identifier names (combinations of a-z, A-Z, 0-9, _), which should provide maximum conformance.

Keep Project Names Compatible with File and Directory Naming Restrictions on Unix, Windows and other Platforms
As project names determine file and directory names for a variety of solution artifacts, it is highly recommended that project names conform to file name restrictions across a variety of file systems. While it is possible to use invalid Unix file names as project names on Windows, and invalid Windows file names on Unix projects, it is recommended to avoid OS file system conflicts where possible to maximize future portability. More detailed file system restrictions are identified in the appendix. Briefly, these include:
- Do not use system file names such as CON, PRN, AUX, CLOCK$, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.
- Do not use reserved Eclipse names such as .metadata.
- Do not use characters such as |\?*<":>+[]/ or control characters.
- Optionally, exclude spaces and other whitespace characters from project names.

Use Project Names Compatible with Deployed Service Names
As it is recommended that, where possible, service names are the same as the project that produces them, names of projects should also follow the service naming recommendations for command line parameters and API identifiers.

Naming Published Services


When a project is published, it will by default have the same name as the project from which it was published.


Many of the restrictions for project names should be observed and if possible, the service should be named after the project name.

Published Service Naming Best Practices


Service Names Must Be Unique across the Environment on which They Will Be Deployed

Allow for Service Names to be Used as Command Line Parameters
The B2B Data Transformation utility cm_console provides for quick testing of published services. It takes as its first argument the name of the service to invoke. For ease of use with cm_console, the service name should not include spaces, tabs or newlines, single or double quotes, or characters such as |, ;, %, $, >, \, /. An example invocation appears after the list of file-name restrictions below.

Allow for Service Names to be Used as Programming Language Identifiers
While B2B Data Transformation currently allows the service name to be passed in as an arbitrary string when calling the Java and .NET APIs, other agents may expose the service as a function or method in their platform. For maximum compatibility it is recommended that service names conform to the rules for C identifiers: begin with a letter, and allow combinations of 0-9, A-Z, a-z and _ only. It is also necessary to consider whether the host environment distinguishes between alphabetic character case when naming variables. Some application platforms may not distinguish between testService, testservice and TESTSERVICE.

Allow for Service Names to be Used as Web Service Names
The WSDL specification allows letters, digits, ., -, _, :, combining characters and extenders to be used in a web service name (or any XML nmtoken-valued attribute). B2B Data Transformation does not permit use of : in a project name, so it is recommended that names be kept to a combination of letters, digits, ., -, _ if they are to be used as web services. Conforming to C identifier names will guarantee compatibility.

Keep Service Names Compatible with File and Directory Naming Restrictions on Unix, Windows and other Platforms
As service names determine file and directory names for a variety of solution artifacts, it is highly recommended that service names conform to file name restrictions across a variety of file systems. While it is possible to use invalid Unix file names as service names on Windows, and invalid Windows file names on Unix services, it is recommended to avoid OS file system conflicts where possible to maximize future portability. More detailed file system restrictions are identified in the appendix below. Briefly, these include:

- Do not use system file names such as CON, PRN, AUX, CLOCK$, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.
- Do not use reserved Eclipse names such as .metadata.
- Do not use characters such as |\?*<":>+[]/ or control characters.
- Optionally, exclude spaces and other whitespace characters from service names.
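As an illustration of the command line consideration above, a service whose name is a plain C-style identifier can be passed to cm_console without quoting, whereas a name containing spaces forces quoting in every script that calls it. The service names below are hypothetical, and any additional cm_console arguments depend on the installed version:

cm_console TranslateHipaa837ToXml
cm_console "Translate Hipaa 837 To Xml"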

Naming Transformation Script Files (TGP Scripts)


TGP scripts have the naming restrictions common to all files on the platform on which they are being deployed.

Naming Transformation Components


Transformation components such as parsers, mappers, serializers, variables etc., must be unique within a project. There is a single global namespace within any B2B Data Transformation project for all transformation components. One exception exists to this global namespace for component names. That is sequences of actions within a component such as a mapper or parser may be given a name. In this case the name is only used for commentary purposes and to assist in matching events to the sequence of the script actions that produced the event. For these sub-component names, no restrictions apply although it is recommended that the names are kept short to ease browsing in the events viewer. The remarks attribute should be used for longer descriptive commentary on the actions taken.

Transformation Component Naming Best Practices


Use Short Descriptive Names
Names of components will show up in event logs, error output and other tracing and logging mechanisms. Keeping names short reduces the screen real estate needed when browsing the event view for debugging.

Incorporate Source and Target Formats in the Name

Optionally Use Prefixes to Annotate Components Used for Internal Use Only
When a component such as a parser or mapper is used for internal purposes only, it may be useful to prefix the component name with a letter sequence indicating the type of component:

Variable: v - Do not adorn variables used for external service parameters.
Mapper: map - Alternatively use a descriptive name, e.g., MapXToY.
Parser: psr - Alternatively use a descriptive name, e.g., ParseMortgageApplication.
Transformer: tr - Alternatively use a descriptive name, e.g., RemoveWhitespace.
Serializer: ser - Alternatively use a descriptive name, e.g., Serialize837.
Preprocessor: pr - Alternatively use a name XToY describing the preprocessing.

In addition, names for components should take into account the following suggested rules:
1. Limit names to a reasonably short length. A limit of 40 characters is suggested.
2. Consider using the name of the input and/or output data.
3. Consider limiting names to alphabetic characters, underscores, and numbers.

Variables Exposed as Service Parameters Should be Unadorned
When a variable is being used to hold a service parameter, no prefix should be used. Use a reasonably short descriptive name instead.

XML Schema Naming


In many B2B Data Transformation solution scenarios, the XML schemas which are the source or target of transformations are defined externally, and control over the naming and style of the schema definition is limited. However, sometimes a transformation project may require one or more intermediate schemas. The following best practices may help with the use of newly created XML schemas in B2B Data Transformation projects.

Use a Target Namespace
Using only no-namespace schemas leads to a proliferation of types within the B2B Data Transformation Studio environment under a single default namespace. Using namespaces on intermediate schemas reduces the logical clutter, in addition to making intermediate schemas more re-usable.

Always Qualify the XML Schema Namespace
Qualify the XML Schema namespace even when using qualified elements and attributes for the domain namespace. It makes schema inclusion and import simpler.

Consider the Use of Explicit Named Complex Types vs. Anonymous Complex Types
The use of anonymous complex types reduces namespace clutter in PowerExchange studio. However, when multiple copies of schema elements are needed, having the ability to define variables of a complex type simplifies the creation of many transformations. By default, a transformation project allows for the existence of one copy of a schema at a time. Through the use of global complex types, additional variables may be defined to hold secondary copies for interim processing.

Example using an anonymous type:

<xsd:element name="Book">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
      <xsd:element name="Author" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

Example using a global (named) type:

<xsd:complexType name="Publication">
  <xsd:sequence>
    <xsd:element name="Title" type="xsd:string"/>
    <xsd:element name="Author" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>

<xsd:element name="Book" type="Publication"/>

Through the use of the second form of the definition, we can create a variable of the type Publication.

Appendix: File Name Restrictions On Different Platforms


Reserved Characters and Words
Many operating systems prohibit control characters from appearing in file names. Unix-like systems are an exception, as the only control character forbidden in file names is the null character, which is the end-of-string indicator in C. Trivially, Unix also excludes the path separator / from appearing in filenames. Some operating systems prohibit particular characters from appearing in file names:

/ (slash): used as a path name component separator in Unix-like systems, MS-DOS and Windows.
\ (backslash): treated the same as slash in MS-DOS and Windows, and as the escape character in Unix systems (see Note below).
? (question mark): used as a wildcard in Unix and Windows; marks a single character.
% (percent sign): used as a wildcard in RT-11; marks a single character.
* (asterisk): used as a wildcard in Unix, MS-DOS, RT-11, VMS and Windows; marks any sequence of characters (Unix, Windows, later versions of MS-DOS) or any sequence of characters in either the basename or extension (thus "*.*" in early versions of MS-DOS means "all files").
: (colon): used to determine the mount point / drive on Windows; used to determine the virtual or physical device, such as a drive, on RT-11 and VMS; used as a pathname separator in classic Mac OS. Doubled after a name on VMS, it indicates the DECnet nodename (equivalent to a NetBIOS (Windows networking) hostname preceded by "\\").
| (vertical bar): designates software pipelining in Windows.
" (quotation mark): used to mark the beginning and end of filenames containing spaces in Windows.
< (less than): used to redirect input; allowed in Unix filenames.
> (greater than): used to redirect output; allowed in Unix filenames.
. (period): allowed, but the last occurrence is interpreted as the extension separator in VMS, MS-DOS and Windows. In other OSes it is usually considered part of the filename, and more than one full stop may be allowed.
Note: Some applications on Unix-like systems might allow certain characters but require them to be quoted or escaped; for example, the shell requires spaces, <, >, |, \ and some other characters such as : to be quoted:


five\ and\ six\<seven   (example of escaping)
'five and six<seven' or "five and six<seven"   (examples of quoting)

In Windows, the space and the period are not allowed as the final character of a filename. The period is allowed as the first character, but certain Windows applications, such as Windows Explorer, forbid creating or renaming such files (despite this convention being used in Unix-like systems to describe hidden files and directories). Among the workarounds are using a different explorer application or saving the file from an application with the desired name.

Some file systems on a given operating system (especially file systems originally implemented on other operating systems), and particular applications on that operating system, may apply further restrictions and interpretations. See the Comparison of File Name Limitations below for more details on restrictions imposed by particular file systems.

In Unix-like systems, MS-DOS, and Windows, the file names "." and ".." have special meanings (current and parent directory respectively). In addition, in Windows and DOS, some words are reserved and cannot be used as filenames. For example, DOS device files: CON, PRN, AUX, CLOCK$, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Operating systems that have these restrictions cause incompatibilities with some other filesystems. For example, Windows will fail to handle, or raise error reports for, these legal UNIX filenames: aux.c, q"uote"s.txt, or NUL.txt.

Comparison of File Name Limitations


The systems below are summarized by case sensitivity, allowed character set, reserved characters or words, maximum length, and comments:

MS-DOS FAT: case-insensitive, case-destruction; allowed characters A-Z 0-9 - _; all other characters reserved; maximum length 8+3.
Win95 VFAT: case-insensitive, case-preservation; any character; reserved characters |\?*<":>+[]/ and control characters; maximum length 255.
WinXP NTFS: case sensitivity optional, case-preservation; any character; reserved characters |\?*<":>/ and control characters; reserved words aux, con, prn; maximum length 255.
OS/2 HPFS: case-insensitive, case-preservation; any character; reserved characters |\?*<":>/; maximum length 254.
Mac OS HFS: case-insensitive, case-preservation; any character; maximum length 255 (the Finder is limited to 31 characters).
Mac OS HFS+ (Mac OS 8.1 - Mac OS X): case-insensitive, case-preservation; any character; reserved characters : on disk, in classic Mac OS and at the Carbon layer in Mac OS X, and / at the Unix layer in Mac OS X; maximum length 255.
Most UNIX file systems: case-sensitive, case-preservation; any character except reserved; reserved characters / and null; maximum length 255; a leading . means ls and file managers will not show the file by default.
Early UNIX (AT&T): case-sensitive, case-preservation; any character; maximum length 14; a leading . indicates a "hidden" file.
POSIX "Fully portable filenames": case-sensitive, case-preservation; allowed characters A-Z a-z 0-9 . _ -; reserved characters / and null; filenames to avoid include a.out, core, .profile, .history, .cshrc; maximum length 14; a hyphen must not be the first character.
BeOS BFS: case-sensitive; UTF-8 character set; maximum length 255.
DEC PDP-11 RT-11: case-insensitive; RADIX-50 character set; maximum length 6+3; flat filesystem with no subdirectories; a full "file specification" includes device, filename and extension (file type) in the format dev:filnam.ext.
DEC VAX VMS: case-insensitive; allowed characters A-Z 0-9 _; maximum length 32 per component (earlier 9 per component; latterly, 255 for a filename and 32 for an extension); a full "file specification" includes nodename, diskname, directory/ies, filename, extension and version in the format OURNODE::MYDISK:[THISDIR.THATDIR]FILENAME.EXTENSION;2.
ISO 9660: case-insensitive; allowed characters A-Z 0-9 _ .; maximum length 255; directories can only go 8 levels deep (8 directory levels maximum for Level 1 conformance).

Last updated: 30-May-08 22:03


Naming Conventions - Data Quality

Challenge


As with any other development process, the use of clear, consistent, and documented naming conventions contributes to the effective use of Informatica Data Quality (IDQ). This Best Practice provides suggested naming conventions for the major structural elements of the IDQ Designer and IDQ Plans.

Description
IDQ Designer
The IDQ Designer is the user interface for the development of IDQ plans. Each IDQ plan holds the business rules and operations for a distinct process. IDQ plans may be constructed for use inside the IDQ Designer (a runtime plan), using the athanor-rt command line utility (also runtime), or within an integration with PowerCenter (a real-time plan). IDQ requires that each IDQ plan belong to a project. Optionally, plans may be organized in folders within a project. Folders may be nested to span more than one level. The organizational structure of IDQ is summarized below, as element and parent:

Repository: None. This is the top-level organization structure.
Project: Repository. There may be multiple projects in a repository.
Folder: Project or Folder. Folders may be nested.
Plan: Project or Folder.

At any common level of visibility, IDQ requires that all elements have distinct names. Thus no two projects within a repository may share the same name. Likewise, no two folders at the same level within a project may share the same name. The rule also applies to plans within the same folder. IDQ will not permit an element to be renamed if the new name would conflict with an existing element at the same level. A dialog will explain the error.


To prevent naming conflicts when an element is copied, it will be prefixed with Copy of if it is pasted at the same level as the source of the copy. If the length of the new name is longer than the allowed length for names of the type of element, the name will be truncated.

Naming Projects
When a project is created, it will by default have the name New Project.

Project naming should be clear and consistent within a repository. The exact approach to naming will vary depending on an organization's needs. Suggested naming rules include:
1. Limit project names to 22 characters if possible. The limit imposed by the repository is 30 characters. Limiting project names to 22 characters allows "Copy of" to be prefixed to copies of a project without truncating characters.
2. Include enough descriptive information within the project name so an unfamiliar user will have a reasonable idea of what plans may be included in the project.
3. If plans within a project will operate on only one data source, including the data source in the project name may be helpful.
4. If abbreviations are used, they should be consistent and documented.

Naming Folders
When a new project is created, by default it will contain four folders, named Consolidation, Matching, Profiling, and Standardization.


This naming convention for folders tracks the major types of IDQ plans. While the default naming convention may prove satisfactory in many cases, it imposes an organizational structure for plans that may not be optimal. Therefore, another naming convention may make more sense in a particular circumstance. Naming guidelines for folders include:
1. Limit folder names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting folder names to 42 characters allows "Copy of" to be prefixed to copies of a folder without truncating characters.
2. Include enough descriptive information within the folder name so an unfamiliar user will have a reasonable idea of what plans may be included in the folder.
3. If abbreviations are used, they should be consistent and documented.

Naming Plans
When a new plan is created, the user is required to select from one of the four main plan classifications, Analysis, Matching, Standardization, or Consolidation. By default, the new plan name will correspond to the option selected.


Including the plan type as part of the plan name is helpful in describing what the plan does. Other suggested naming rules include:
1. Limit plan names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting plan names to 42 characters allows "Copy of" to be prefixed to copies of a plan without truncating characters.
2. Include enough descriptive information within the plan name so an unfamiliar user will have a reasonable idea of what the plan does at a high level.
3. While the project and folder structure will be visible within the IDQ Designer and will be required when using athanor-rt, it is not as readily visible within PowerCenter. Therefore, repetition of the information conveyed by the project and folder names may be advisable.
4. If abbreviations are used, they should be consistent and documented.

Naming Components
Within the Designer, component types may be identified by their unique icons as well as by hovering over a component with a mouse.


However, the component has no visible name at this level. It is only after opening a component for viewing that the component's name becomes visible.

It is suggested that component names be prefixed with an acronym identifying the component type. While less critical than field naming, as discussed below, using a prefix allows for consistent naming, for clarity, and it makes field naming more efficient in some cases. Suggested prefixes are listed below.

Address Validator: AV_
Bigram: BG_
Character Labeller: CL_
Context Parser: CP_
Edit Distance: ED_
Hamming Distance: HD_
Jaro Distance: JD_
Merge: MG_
Mixed Field Matcher: MFM_
Nysiis: NYS_
Profile Standardizer: PS_
Rule Based Analyzer: RBA_
Scripting: SC_
Search Replace: SR_
Soundex: SX_
Splitter: SPL_
To Upper: TU_
Token Labeller: TL_
Token Parser: TP_
Weight Based Analyzer: WBA_
Word Manager: WM_

In addition, names for components should take into account the following suggested rules:
1. Limit names to a reasonably short length. A limit of 32 characters is suggested. In many cases, component names are also useful for field names, and databases limit field lengths at varying sizes.
2. Consider using the name of the input field or at least the field type.
3. Consider limiting names to alphabetic characters, spaces, underscores, and numbers. This will make the corresponding field names compatible with most likely output destinations.
4. If the component type abbreviation itself is not sufficient to identify what the component does, include an identifier for the function of the component in its name.
5. If abbreviations are used, they should be consistent and documented.


Naming Dictionaries
Dictionaries may be given any name suitable for the operating system on which they will be used. It is suggested that dictionary naming consider the following rules:
1. Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both Windows and UNIX, avoid using spaces.
2. If a dictionary supplied by Informatica is to be modified, it is suggested that the dictionary be renamed and/or moved to a new folder. This will avoid accidentally overwriting the modifications when an update is installed.
3. If abbreviations are used, they should be consistent and documented.

Naming Fields
Careful field naming is probably the most critical standard to follow when using IDQ.
- IDQ requires that all fields output by components have unique names; a name cannot be carried through from component to component.
- The power of IDQ leads to complex plans with many components.
- IDQ does not have the data lineage feature of PowerCenter, so the component name is the clearest indicator of the source of an input component when a plan is being examined.

With those considerations in mind, the following naming rules are suggested:

1. Prefix each output field name with the type of component.

Component               Prefix
Address Validator       AV_
Bigram                  BG_
Character Labeller      CL_
Context Parser          CP_
Edit Distance           ED_
Hamming Distance        HD_
Jaro Distance           JD_
Merge                   MG_
Mixed Field Matcher     MFM_
Nysiis                  NYS_
Profile Standardizer    PS_
Rule Based Analyzer     RBA_
Scripting               SC_
Search Replace          SR_
Soundex                 SX_
Splitter                SPL_
To Upper                TU_
Token Labeller          TL_
Token Parser            TP_
Weight Based Analyzer   WBA_
Word Manager            WM_

2. Use meaningful field names, with consistent, documented abbreviations.
3. Use consistent casing.
4. While it is possible to rename output fields in sink components, this practice should be avoided when practical, since there is no convenient way to determine which source field provides data to the renamed output field.

Last updated: 04-Jun-08 18:50

Performing Incremental Loads

Challenge


Data warehousing incorporates very large volumes of data. The process of loading the warehouse in a reasonable timescale without compromising its functionality is extremely difficult. The goal is to create a load strategy that can minimize downtime for the warehouse and allow quick and robust data management.

Description
As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact on the overall system. This Best Practice describes several possible load strategies.

Incremental Aggregation
Incremental aggregation is useful for applying incrementally-captured changes in the source to aggregate calculations in a session. If the source changes only incrementally, and you can capture those changes, you can configure the session to process only those changes with each run. This allows the PowerCenter Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.

If the session performs incremental aggregation, the PowerCenter Integration Service saves index and data cache information to disk when the session finishes. The next time the session runs, the PowerCenter Integration Service uses this historical information to perform the incremental aggregation. To utilize this functionality, set the Incremental Aggregation session attribute. For details, see Chapter 24 in the Workflow Administration Guide.

Use incremental aggregation under the following conditions:
- Your mapping includes an aggregate function.
- The source changes only incrementally.
- You can capture incremental changes (i.e., by filtering source data by timestamp).
- You get only delta records (i.e., you may have implemented the CDC (Change Data Capture) feature of PowerExchange).

Do not use incremental aggregation in the following circumstances:

- You cannot capture new source data.
- Processing the incrementally-changed source significantly changes the target. If processing the incrementally-changed source alters more than half the existing target, the session may not benefit from using incremental aggregation.
- Your mapping contains percentile or median functions.

Some conditions that may help in making a decision on an incremental strategy include:

- Error handling, loading, and unloading strategies for recovering, reloading, and unloading data.
- History tracking requirements for keeping track of what has been loaded and when.
- Slowly-changing dimensions. Informatica Mapping Wizards are a good start to an incremental load strategy; the Wizards generate generic mappings as a starting point (refer to Chapter 15 in the Designer Guide).

Source Analysis
Data sources typically fall into the following possible scenarios:
- Delta records. Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.
- Record indicators or flags. Records include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag for all inserts, updates, and deletes.
- Date-stamped data. Data is organized by timestamps and loaded into the warehouse based upon the last processing date or the effective date range.
- Key values are present. When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.
- No key values present. When no key values are present, surrogate keys are created and all data is inserted into the warehouse based upon the validity of the records.

Identify Records for Comparison


After the sources are identified, you need to determine which records need to be entered into the warehouse and how. Here are some considerations:
- Compare with the target table. When source delta loads are received, determine if the record exists in the target table. The timestamps and natural keys of the record are the starting point for identifying whether the record is new, modified, or should be archived. If the record does not exist in the target, insert it as a new row. If it does exist, determine whether the record needs to be updated, inserted as a new record, removed (deleted from the target), or filtered out and not added to the target.
- Record indicators. Record indicators can be beneficial when lookups into the target are not necessary. Take care to ensure that the record exists for update or delete scenarios, or does not exist for successful inserts. Some design effort may be needed to manage errors in these situations.

Determine Method of Comparison


There are four main strategies in mapping design that can be used as a method of comparison:
- Joins of sources to targets. Records are directly joined to the target using Source Qualifier join conditions or using Joiner transformations after the Source Qualifiers (for heterogeneous sources). When using Joiner transformations, take care to ensure that the data volumes are manageable and that the smaller of the two datasets is configured as the Master side of the join.
- Lookup on target. Using the Lookup transformation, look up the keys or critical columns in the target relational database. Consider the caching and indexing possibilities.
- Load table log. Generate a log table of records that have already been inserted into the target system. You can use this table for comparison with lookups or joins, depending on the need and volume. For example, store keys in a separate table and compare source records against this log table to determine the load strategy. Another example is to store the dates associated with the data already loaded in a log table.
- MD5 checksum function. Generate a unique value for each row of data and then compare previous and current checksum values to determine whether the record has changed (see the sketch following this list).
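As a rough illustration of the checksum approach (the port names are hypothetical, and it assumes the MD5() expression function is available in your PowerCenter version), an Expression transformation can build a row signature that is then compared with the checksum stored alongside the target or log record:

MD5(TO_CHAR(CUST_ID) || '|' || CUST_NAME || '|' || CUST_ADDRESS || '|' || TO_CHAR(UPDATE_DATE, 'YYYYMMDDHH24MISS'))

If the computed value differs from the stored value, the row is treated as a change; if no stored value exists, it is treated as an insert.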

Source-Based Load Strategies

Complete Incremental Loads in a Single File/Table


The simplest method for incremental loads is from flat files or a database in which all records are going to be loaded. This strategy requires bulk loads into the warehouse with no overhead on processing of the sources or sorting the source records. Data can be loaded directly from the source locations into the data warehouse. There is no additional overhead produced in moving these sources into the warehouse.

Date-Stamped Data
This method involves data that has been stamped using effective dates or sequences. The incremental load can be determined by dates greater than the previous load date or by data that has an effective key greater than the last key processed.

With relational sources, the records can be selected based on this effective date so that only those records past a certain date are loaded into the warehouse. Views can also be created to perform the selection criteria (a sketch follows below); this way, the processing does not have to be incorporated into the mappings but is kept on the source component. Placing the load strategy into the other mapping components is more flexible and controllable by the Data Integration developers and the associated metadata.

To compare the effective dates, you can use mapping variables to provide the previous date processed (see the description below). An alternative to repository-maintained mapping variables is the use of control tables to store the dates and update the control table after each load.

Non-relational data can be filtered as records are loaded based upon the effective dates or sequenced keys. A Router transformation or filter can be placed after the Source Qualifier to remove old records.
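A minimal sketch of the view-based approach follows (the table, column, and control-table names are hypothetical):

CREATE VIEW V_ORDERS_DELTA AS
SELECT *
FROM ORDERS
WHERE LAST_UPDATE_DT > (SELECT LAST_LOAD_DT
                        FROM ETL_LOAD_CONTROL
                        WHERE TARGET_NAME = 'ORDERS');

The mapping then reads from V_ORDERS_DELTA as an ordinary relational source, and a post-load step updates ETL_LOAD_CONTROL with the new load date.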

Changed Data Based on Keys or Record Information


Data that is uniquely identified by keys can be sourced according to selection criteria. For example, records that contain primary keys or alternate keys can be used to determine if they have already been entered into the data warehouse. If they exist, you can also check to see whether you need to update these records or discard the source record. It may be possible to perform a join with the target tables in which new data can be selected and loaded into the target. It may also be feasible to perform a lookup on the target to see if the data exists.

Target-Based Load Strategies


- Loading directly into the target. Loading directly into the target is possible when the data is going to be bulk loaded. The mapping is then responsible for error control, recovery, and update strategy.
- Load into flat files and bulk load using an external loader. The mapping loads data directly into flat files. You can then invoke an external loader to bulk load the data into the target. This method reduces the load times (with less downtime for the data warehouse) and provides a means of maintaining a history of data being loaded into the target. Typically, this method is only used for updates into the warehouse.
- Load into a mirror database. The data is loaded into a mirror database to avoid downtime of the active data warehouse. After data has been loaded, the databases are switched, making the mirror the active database and the active the mirror.

Using Mapping Variables


You can use a mapping variable to perform incremental loading. By referencing a date-based mapping variable in the Source Qualifier or join condition, it is possible to select only those rows with a date greater than the previously captured date (i.e., the newly inserted source data). However, the source system must have a reliable date to use. The steps involved in this method are:

Step 1: Create mapping variable


In the Mapping Designer, choose Mappings > Parameters > Variables. Or, to create variables for a mapplet, choose Mapplet > Parameters > Variables in the Mapplet Designer. Click Add and enter the name of the variable (e.g., $$INCREMENT_DATE). In this case, make the variable a date/time. For the Aggregation option, select MAX. In the same screen, state your initial value. This date is used during the initial run of the session and as such should represent a date earlier than the earliest desired data. The date can use any one of these formats:
- MM/DD/RR
- MM/DD/RR HH24:MI:SS
- MM/DD/YYYY
- MM/DD/YYYY HH24:MI:SS

Step 2: Reference the mapping variable in the Source Qualifier


The select statement should look like the following:

Select * from table_A where CREATE_DATE > date('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')

Step 3: Refresh the mapping variable for the next session run using an Expression Transformation
Use an Expression transformation and the pre-defined variable functions to set and use the mapping variable. In the Expression transformation, create a variable port and use the SETMAXVARIABLE variable function to capture the maximum source date selected during each run:

SETMAXVARIABLE($$INCREMENT_DATE, CREATE_DATE)

CREATE_DATE in this example is the date field from the source that should be used to identify incremental rows. You can use the variables in the following transformations:
- Expression
- Filter
- Router
- Update Strategy

As the session runs, the variable is refreshed with the maximum date value encountered between the source and the variable. So, if one row comes through with 9/1/2004, the variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is preserved.

Note: This behavior has no effect on the date used in the Source Qualifier. The initial select always contains the maximum date value encountered during the previous, successful session run. When the mapping completes, the PERSISTENT value of the mapping variable is stored in the repository for the next run of your session. You can view the value of the mapping variable in the session log file.

The advantage of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date) since the variable takes care of it. After a successful session run, the PowerCenter Integration Service saves the final value of each variable in the repository. So when you run your session the next time, only new data from the source system is captured. If necessary, you can override the value saved in the repository with a value saved in a parameter file.
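A minimal parameter file entry for such an override might look like the following (the folder, workflow, and session names are hypothetical; the heading format is described in the Using Parameters, Variables and Parameter Files Best Practice):

[DW_PROD.WF:wf_daily_load.ST:s_incremental_load]
$$INCREMENT_DATE=01/15/2005 00:00:00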

Using PowerExchange Change Data Capture


PowerExchange (PWX) Change Data Capture (CDC) greatly simplifies the identification, extraction, and loading of change records. It supports all key mainframe and midrange database systems, requires no changes to the user application, uses vendor-supplied technology where possible to capture changes, and eliminates the need for programming or the use of triggers.

Once PWX CDC collects changes, it places them in a change stream for delivery to PowerCenter. Included in the change data is useful control information, such as the transaction type (insert/update/delete) and the transaction timestamp. In addition, the change data can be made available immediately (i.e., in real time) or periodically (i.e., where changes are condensed).

The native interface between PowerCenter and PowerExchange is PowerExchange Client for PowerCenter (PWXPC). PWXPC enables PowerCenter to pull the change data from the PWX change stream if real-time consumption is needed or from PWX condense files if periodic consumption is required.

The changes are applied directly. So if the action flag is I, the record is inserted. If the action flag is U, the record is updated. If the action flag is D, the record is deleted. There is no need for change detection logic in the PowerCenter mapping.

In addition, by leveraging group source processing, where multiple sources are placed in a single mapping, the PowerCenter session reads the committed changes for multiple sources in a single efficient pass, and in the order they occurred. The changes are then propagated to the targets, and upon session completion, restart tokens (markers) are written out to a PowerCenter file so that the next session run knows the point to extract from.

Tips for Using PWX CDC

- After installing PWX, ensure the PWX Listener is up and running and that connectivity is established to the Listener. For best performance, the Listener should be co-located with the source system.
- In the PWX Navigator client tool, use metadata to configure data access. This means creating data maps for the non-relational to relational view of mainframe sources (such as IMS and VSAM) and capture registrations for all sources (mainframe, Oracle, DB2, etc.). Registrations define the specific tables and columns desired for change capture. There should be one registration per source. Group the registrations logically, for example, by source database.
- For an initial test, make changes in the source system to the registered sources. Ensure that the changes are committed.
- Still working in PWX Navigator (and before using PowerCenter), perform Row Tests to verify the returned change records, including the transaction action flag (the DTL__CAPXACTION column) and the timestamp. Set the required access mode: CAPX for change and CAPXRT for real time. Also, if desired, edit the PWX extraction maps to add the Change Indicator (CI) column. This CI flag (Y or N) allows for field-level capture and can be filtered in the PowerCenter mapping.
- Use PowerCenter to materialize the targets (i.e., to ensure that sources and targets are in sync prior to starting the change capture process). This can be accomplished with a simple pass-through batch mapping. This same bulk mapping can be reused for CDC purposes, but only if specific CDC columns are not included, and by changing the session connection/mode.
- Import the PWX extraction maps into Designer. This requires the PWXPC component. Specify the CDC Datamaps option during the import.
- Use group sourcing to create the CDC mapping by including multiple sources in the mapping. This enhances performance because only one read/connection is made to the PWX Listener and all changes (for the sources in the mapping) are pulled at one time.
- Keep the CDC mappings simple. There are some limitations; for instance, you cannot use active transformations. In addition, if loading to a staging area, store the transaction types (i.e., insert/update/delete) and the timestamp for subsequent processing downstream. Also, if loading to a staging area, include an Update Strategy transformation in the mapping with DD_INSERT or DD_UPDATE in order to override the default behavior and store the action flags.
- Set up the Application Connection in Workflow Manager to be used by the CDC session. This requires the PWXPC component. There should be one connection and token file per CDC mapping/session. Set the UOW (unit of work) to a low value for faster commits to the target for real-time sessions. Specify the restart token location and file on the PowerCenter Integration Service (within the infa_shared directory) and specify the location of the PWX Listener.
- In the CDC session properties, enable session recovery (i.e., set the Recovery Strategy to Resume from last checkpoint).
- Use post-session commands to archive the restart token files for restart/recovery purposes. Also, archive the session logs.

Last updated: 01-Feb-07 18:53

Real-Time Integration with PowerCenter

Challenge


Configure PowerCenter to work with various PowerExchange data access products to process real-time data. This Best Practice discusses guidelines for establishing a connection with PowerCenter and setting up a real-time session to work with PowerCenter.

Description
PowerCenter with the real-time option can be used to process data from real-time data sources. PowerCenter supports the following types of real-time data:
- Messages and message queues. PowerCenter with the real-time option can be used to integrate third-party messaging applications using a specific PowerExchange data access product. Each PowerExchange product supports a specific industry-standard messaging application, such as WebSphere MQ, JMS, MSMQ, SAP NetWeaver, TIBCO, and webMethods. You can read from messages and message queues and write to messages, messaging applications, and message queues. WebSphere MQ uses a queue to store and exchange data. Other applications, such as TIBCO and JMS, use a publish/subscribe model; in this case, the message exchange is identified using a topic.
- Web service messages. PowerCenter can receive a web service message from a web service client through the Web Services Hub, transform the data, and load the data to a target or send a message back to a web service client. A web service message is a SOAP request from a web service client or a SOAP response from the Web Services Hub. The Integration Service processes real-time data from a web service client by receiving a message request through the Web Services Hub and processing the request. The Integration Service can send a reply back to the web service client through the Web Services Hub or write the data to a target.
- Changed source data. PowerCenter can extract changed data in real time from a source table using the PowerExchange Listener and write the data to a target. Real-time sources supported by PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL Server, Oracle, and VSAM.

Connection Setup
PowerCenter uses some attribute values in order to correctly connect to and identify the third-party messaging application and the message itself. Each PowerExchange product supplies its own connection attributes that need to be configured properly before running a real-time session.

Setting Up Real-Time Session in PowerCenter


The PowerCenter real-time option uses a zero latency engine to process data from the messaging system. Depending on the messaging systems and the application that sends and receives messages, there may be a period when there are many messages and, conversely, there may be a period when there are no messages. PowerCenter uses the attribute Flush Latency to determine how often the messages are being flushed to the target. PowerCenter also provides various attributes to control when the session ends. The following reader attributes determine when a PowerCenter session should end:
- Message Count - Controls the number of messages the PowerCenter Server reads from the source before the session stops reading from the source.
- Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it stops reading from the source.
- Time Slice Mode - Indicates a specific range of time during which the server reads messages from the source. Only PowerExchange for WebSphere MQ uses this option.
- Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading messages from the source.

The specific filter conditions and options available to you depend on which real-time source is being used. For example, consider the attributes for PowerExchange for DB2 for i5/OS:

Set the attributes that control how the reader ends. One or more attributes can be used to control the end of the session. For example, set the Reader Time Limit attribute to 3600 and the reader will end after 3600 seconds. If the idle time limit is set to 500 seconds, the reader will end if it doesn't process any changes for 500 seconds (i.e., it remains idle for 500 seconds). If more than one attribute is selected, the first attribute that satisfies the condition is used to control the end of the session.

Note: The real-time attributes can be found in the Reader Properties for PowerExchange for JMS, TIBCO, webMethods, and SAP iDoc. For PowerExchange for WebSphere MQ, the real-time attributes must be specified as a filter condition.

The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often PowerCenter should flush messages, expressed in milliseconds. For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two seconds. The messages will also be flushed from the reader buffer if the Source Based Commit condition is reached. The Source Based Commit condition is defined in the Properties tab of the session.

The message recovery option can be enabled to ensure that no messages are lost if a session fails as a result of an unpredictable error, such as power loss. This is especially important for real-time sessions because some messaging applications do not store the messages after the messages are consumed by another application.

A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on the source system from an external application. Each UOW may consist of a different number of rows depending on the transaction to the source system. When you use the UOW Count session condition, the Integration Service commits source data to the target when it reaches the number of UOWs specified in the session condition. For example, if the value for UOW Count is 10, the Integration Service commits all data read from the source after the 10th UOW enters the source. The lower you set the value, the faster the Integration Service commits data to the target; a lower value also causes the system to consume more resources.

Executing a Real-Time Session


A real-time session often has to be up and running continuously to listen to the messaging application and to process messages immediately after the messages arrive. Set the reader attribute Idle Time to -1 and Flush Latency to a specific time interval. This is applicable for all PowerExchange products except for PowerExchange for WebSphere MQ, where the session continues to run and flush the messages to the target using the specific flush latency interval.

Another scenario is the ability to read data from another source system and immediately send it to a real-time target, for example, reading data from a relational source and writing it to WebSphere MQ. In this case, set the session to run continuously so that every change in the source system can be immediately reflected in the target.

A real-time session may run continuously until a condition is met to end the session. In some situations it may be required to periodically stop the session and restart it. This is sometimes necessary to execute a post-session command or run some other process that is not part of the session. To stop the session and restart it, it is useful to deploy continuously running workflows. The Integration Service starts the next run of a continuous workflow as soon as it completes the first. To set a workflow to run continuously, edit the workflow and select the Scheduler tab. Edit the Scheduler and select Run Continuously from Run Options. A continuous workflow starts automatically when the Integration Service initializes. When the workflow stops, it restarts immediately.

Real-Time Sessions and Active Transformations


Some of the transformations in PowerCenter are active transformations, which means that the number of input rows and the number of output rows of the transformation are not the same. In most cases, an active transformation requires all of the input rows to be processed before passing output rows to the next transformation or target. For a real-time session, the flush latency will be ignored if the DTM needs to wait for all the rows to be processed.

Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-time session by setting the Transaction Scope property in the active transformation to Transaction. This signals the session to process the data in the transformation every transaction. For example, if a real-time session is using an Aggregator that sums a field of an input, the summation will be done per transaction, as opposed to across all rows. The result may or may not be correct depending on the requirement. Use an active transformation with a real-time session if you want to process the data per transaction.

Custom transformations can also be defined to handle data per transaction so that they can be used in a real-time session.

PowerExchange Real Time Connections


PowerExchange NRDB CDC Real Time connections can be used to extract changes from ADABAS, DATACOM, IDMS, IMS, and VSAM sources in real time. The DB2/390 connection can be used to extract changes for DB2 on OS/390 and the DB2/400 connection to extract from AS/400. There is a separate connection to read from DB2 UDB in real time.

The NRDB CDC connection requires the application name and the restart token file name to be overridden for every session. When the PowerCenter session completes, the PowerCenter Server writes the last restart token to a physical file called the RestartToken File. The next time the session starts, the PowerCenter Server reads the restart token from the file and starts reading changes from the point where it last left off. Every PowerCenter session needs to have a unique restart token filename. Informatica recommends archiving the file periodically.

The reader timeout or the idle timeout can be used to stop a real-time session. A post-session command can be used to archive the RestartToken file.

The encryption mode for this connection can slow down the read performance and increase resource consumption. Compression mode can help in situations where the network is a bottleneck; using compression also increases the CPU and memory usage on the source system.

Archiving PowerExchange Tokens


When the PowerCenter session completes, the Integration Service writes the last restart token to a physical file called the RestartToken File. The token in the file indicates the end point where the read job ended. The next time the session starts, the PowerCenter Server reads the restart token from the file and then starts reading changes from the point where it left off. The token file is overwritten each time the session has to write a token out.

PowerCenter does not implicitly maintain an archive of these tokens. If, for some reason, the changes from a particular point in time have to be replayed, we need the PowerExchange token from that point in time. To enable such a process, it is a good practice to periodically copy the token file to a backup folder. This procedure is necessary to maintain an archive of the PowerExchange tokens.

A real-time PowerExchange session may be stopped periodically, using either the reader time limit or the idle time limit. A post-session command is used to copy the restart token file to an archive folder. The session will be part of a continuously running workflow, so when the session completes after the post-session command, it automatically restarts again. From a data processing standpoint, very little changes; the process pauses for a moment, archives the token, and starts again.

The following are examples of post-session commands that can be used to copy a restart token file (session.token) and append the current system date/time to the file name for archive purposes:

UNIX:

cp session.token session`date '+%m%d%H%M'`.token

Windows:

copy session.token session-%date:~4,2%-%date:~7,2%-%date:~10,4%-%time:~0,2%-%time:~3,2%.token

PowerExchange for WebSphere MQ


1. In the Workflow Manager, connect to a repository and choose Connection > Queue.
2. The Queue Connection Browser appears. Select New > Message Queue.
3. The Connection Object Definition dialog box appears.

You need to specify three attributes in the Connection Object Definition dialog box:
- Name - the name for the connection. (Use <queue_name>_<QM_name> to uniquely identify the connection.)
- Queue Manager - the Queue Manager name for the message queue. (In Windows, the default Queue Manager name is QM_<machine name>.)
- Queue Name - the Message Queue name.

To obtain the Queue Manager and Message Queue names:


- Open the MQ Series Administration Console. The Queue Manager should appear on the left panel.
- Expand the Queue Manager icon. A list of the queues for the queue manager appears on the left panel.

Note that the Queue Manager's name and Queue Name are case-sensitive.

PowerExchange for JMS


PowerExchange for JMS can be used to read or write messages from various JMS providers, such as WebSphere MQ JMS and BEA WebLogic Server. There are two types of JMS application connections:
- JNDI Application Connection, which is used to connect to a JNDI server during a session run.
- JMS Application Connection, which is used to connect to a JMS provider during a session run.

JNDI Application Connection Attributes are:


- Name
- JNDI Context Factory
- JNDI Provider URL
- JNDI UserName
- JNDI Password

JMS Application Connection Attributes are:


- Name
- JMS Destination Type
- JMS Connection Factory Name
- JMS Destination
- JMS UserName
- JMS Password

Configuring the JNDI Connection for WebSphere MQ


The JNDI settings for WebSphere MQ JMS can be configured using a file system service or LDAP (Lightweight Directory Access Protocol). The JNDI setting is stored in a file named JMSAdmin.config. The file should be installed in the WebSphere MQ Java installation/bin directory.

If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory

Find the PROVIDER_URL settings. If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following provider URL setting and provide a value for the JNDI directory:

PROVIDER_URL=file:/<JNDI directory>

<JNDI directory> is the directory where you want JNDI to store the .binding file. Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the provider URL setting and specify a hostname:

#PROVIDER_URL=ldap://<hostname>/context_name

For example, you can specify: PROVIDER_URL=ldap://<localhost>/o=infa,c=rc

If you want to provide a user DN and password for connecting to JNDI, you can remove the # from the following settings and enter a user DN and password:

PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test

The following table shows the JMSAdmin.config settings and the corresponding attributes in the JNDI application connection in the Workflow Manager:

JMSAdmin.config Setting        JNDI Application Connection Attribute
INITIAL_CONTEXT_FACTORY        JNDI Context Factory
PROVIDER_URL                   JNDI Provider URL
PROVIDER_USERDN                JNDI UserName
PROVIDER_PASSWORD              JNDI Password

Configuring the JMS Connection for WebSphere MQ


The JMS connection is defined using a tool in JMS called jmsadmin, which is available in the WebSphere MQ Java installation/bin directory. Use this tool to configure the JMS Connection Factory. The JMS Connection Factory can be a Queue Connection Factory or Topic Connection Factory.
- When a Queue Connection Factory is used, define a JMS queue as the destination.
- When a Topic Connection Factory is used, define a JMS topic as the destination.

The command to define a queue connection factory (qcf) is:

def qcf(<qcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define a JMS queue is:

def q(<JMS_queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)

The command to define a JMS topic connection factory (tcf) is:

def tcf(<tcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define the JMS topic is:

def t(<JMS_topic_name>) topic(pub/sub_topic_name)

The topic name must be unique. For example: topic(application/infa)

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

JMS Object Type                                      JMS Application Connection Attribute
QueueConnectionFactory or TopicConnectionFactory     JMS Connection Factory Name
JMS Queue Name or JMS Topic Name                     JMS Destination
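Putting the commands above together, a jmsadmin session for a queue-based configuration might look like the following sketch (the queue manager, host, port, and queue names are hypothetical):

def qcf(infaQCF) qmgr(QM_dev01) hostname(mqhost01) port(1414)
def q(infaSourceQ) qmgr(QM_dev01) qu(DEV.ORDERS.QUEUE)

In the JMS application connection, infaQCF would then be entered as the JMS Connection Factory Name and infaSourceQ as the JMS Destination, with the JMS Destination Type set to Queue.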

Configure the JNDI and JMS Connection for WebSphere


Configure the JNDI settings for WebSphere to use WebSphere as a provider for JMS sources or targets in a PowerCenterRT session.

JNDI Connection

Add the following option to the file JMSAdmin.bat to configure JMS properly:

-Djava.ext.dirs=<WebSphere Application Server>\bin

For example: -Djava.ext.dirs=WebSphere\AppServer\bin

The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series Java/bin directory.

INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory
PROVIDER_URL=iiop://<hostname>/
For example: PROVIDER_URL=iiop://localhost/
PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test

JMS Connection

The JMS configuration is similar to the JMS Connection for WebSphere MQ.

Configure the JNDI and JMS Connection for BEA WebLogic


Configure the JNDI settings for BEA WebLogic to use BEA WebLogic as a provider for JMS sources or targets in a PowerCenterRT session. PowerCenter Connect for JMS and the WebLogic server hosting JMS do not need to be on the same server; PowerCenter Connect for JMS just needs a URL, as long as the URL points to the right place.

JNDI Connection

The WebLogic Server automatically provides a context factory and URL during the JNDI set-up configuration for WebLogic Server. Enter these values to configure the JNDI connection for JMS sources and targets in the Workflow Manager.

Enter the following value for JNDI Context Factory in the JNDI Application Connection in the Workflow Manager:

weblogic.jndi.WLInitialContextFactory

Enter the following value for JNDI Provider URL in the JNDI Application Connection in the Workflow Manager:

t3://<WebLogic_Server_hostname>:<port>

where WebLogic_Server_hostname is the hostname or IP address of the WebLogic Server and port is the port number for the WebLogic Server.

JMS Connection

The JMS connection is configured from the BEA WebLogic Server console. Select JMS -> Connection Factory. The JMS Destination is also configured from the BEA WebLogic Server console. From the Console pane, select Services > JMS > Servers > <JMS Server name> > Destinations under your domain. Click Configure a New JMSQueue or Configure a New JMSTopic.

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

WebLogic Server JMS Object                 JMS Application Connection Attribute
Connection Factory Settings: JNDIName      JMS Connection Factory Name
Destination Settings: JNDIName             JMS Destination

In addition to the JNDI and JMS settings, BEA WebLogic also offers a function called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.

Configuring the JNDI and JMS Connection for TIBCO


TIBCO Rendezvous Server does not adhere to JMS specifications. As a result, PowerCenter Connect for JMS cannot connect directly with the Rendezvous Server. TIBCO Enterprise Server, which is JMS-compliant, acts as a bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server. Configure a connection bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server for PowerCenter Connect for JMS to be able to read messages from and write messages to TIBCO Rendezvous Server.

To create a connection bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server, follow these steps:

1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise Server.
2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.

Configure the following information in your JNDI application connection:
- JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
- Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and port number of the Enterprise Server.

To make a connection bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:

1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the example below, so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:

tibrv_transports = enabled

2. Enter the following transports in the transports.conf file:

[RV]
type = tibrv                        // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE   // only reliable/certified messages can transfer
daemon = tcp:localhost:7500         // default daemon for the Rendezvous server

The transports in the transports.conf configuration file specify the communication protocol between TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties on a destination can list one or more transports to use to communicate with the TIBCO Rendezvous system.

3. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the file topics.conf, as in the following example:

topicname export="RV"

The export property allows messages published to a topic by a JMS client to be exported to the external systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous reliable and certified messaging protocols.

PowerExchange for webMethods


When importing webMethods sources into the Designer, be sure the webMethods host file doesn't contain a '.' character. You can't use fully-qualified names for the connection when importing webMethods sources. You can use fully-qualified names for the connection when importing webMethods targets because PowerCenter doesn't use the same grouping method for importing sources and targets. To get around this, modify the host file to resolve the name to the IP address. For example:

Host File: crpc23232.crp.informatica.com crpc23232

Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing the webMethods source definition. This step is only required for importing PowerExchange for webMethods sources into the Designer.

If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate document back to the broker for every document it receives. PowerCenter populates some of the envelope fields of the webMethods target to enable the webMethods broker to recognize that the published document is a reply from PowerCenter. The envelope fields destid and tag are populated for the request/reply model: destid should be populated from the pubid of the source document, and tag should be populated from the tag of the source document. Use the option Create Default Envelope Fields when importing webMethods sources and targets into the Designer in order to make the envelope fields available in PowerCenter.

Configuring the PowerExchange for webMethods Connection


To create or edit the PowerExchange for webMethods connection, select Connections > Application > webMethods Broker from the Workflow Manager. PowerExchange for webMethods connection attributes are:

- Name
- Broker Host
- Broker Name
- Client ID
- Client Group
- Application Name
- Automatic Reconnect
- Preserve Client State

Enter the connection to the Broker Host in the following format: <hostname:port>. If you are using the request/reply method in webMethods, you have to specify a client ID in the connection. Be sure that the client ID used in the request connection is the same as the client ID used in the reply connection. Note that if you are using multiple request/reply document pairs, you need to set up different webMethods connections for each pair because they cannot share a client ID.

Last updated: 27-May-08 16:27

Session and Data Partitioning

Challenge


Improving performance by identifying strategies for partitioning relational tables, XML, COBOL and standard flat files, and by coordinating the interaction between sessions, partitions, and CPUs. These strategies take advantage of the enhanced partitioning capabilities in PowerCenter.

Description
On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity. In addition to hardware, consider these other factors when determining if a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and certain assumptions that are explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.

Assumptions
The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits that can be achieved through partitioning.
- Indexing has been implemented on the partition key when using a relational source.
- Source files are located on the same physical machine as the PowerCenter Server process when partitioning flat files, COBOL, and XML, to reduce network overhead and delay.
- All possible constraints are dropped or disabled on relational targets.
- All possible indexes are dropped or disabled on relational targets.
- Table spaces and database partitions are properly managed on the target system.
- Target files are written to the same physical machine that hosts the PowerCenter process in order to reduce network overhead and delay.
- Oracle External Loaders are utilized whenever possible.

First, determine if you should partition your session. Parallel execution benefits systems that have the following characteristics:

Check idle time and busy percentage for each thread. This gives high-level information about the bottleneck point(s). To do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages give the following details for the reader, transformation, and writer threads:
- Total Run Time
- Total Idle Time
- Busy Percentage

Under-utilized or intermittently-used CPUs. To determine if this is the case, check the CPU usage of your machine. The column ID displays the percentage utilization of CPU idling during the specified interval without any I/O wait. If there are CPU cycles available (i.e., twenty percent or more idle time), then this session's performance may be improved by adding a partition.
- Windows 2000/2003 - check the Task Manager Performance tab.
- UNIX - type vmstat 1 10 on the command line.

Sufficient I/O. To determine the I/O statistics:


- Windows 2000/2003 - check the Task Manager Performance tab.
- UNIX - type iostat on the command line. The %iowait column displays the percentage of CPU time spent idling while waiting for I/O requests. The %idle column displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU).

Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:
- Windows 2000/2003 - check the Task Manager Performance tab.
- UNIX - type vmstat 1 10 on the command line. The pi column displays the number of pages swapped in from the page space during the specified interval, and the po column displays the number of pages swapped out to the page space during the specified interval. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.

If you determine that partitioning is practical, you can begin setting up the partition.

Partition Types
PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you must specify a partition type. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

Round-robin Partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need to distribute rows evenly and do not need to group data among partitions. In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.
- Source file 1: 100,000 rows
- Source file 2: 5,000 rows
- Source file 3: 20,000 rows

In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes approximately one third of the data.

Hash Partitioning
The PowerCenter Server applies a hash function to a partition key to group data among partitions. Use hash partitioning where you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition. For example, consider a scenario where you need to sort items by item ID but do not know how many items have a particular ID number.

If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key. An example of this type of partitioning is when you are using Aggregators and need to ensure that groups of data based on a primary key are processed in the same partition.

Key Range Partitioning


With this type of partitioning, you specify one or more ports to form a compound partition key for a source or target. The PowerCenter Server then passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. Refer to the Workflow Administration Guide for further directions on setting up key range partitions.

For example, with key range partitioning set at End range = 2020, the PowerCenter Server passes in data where values are less than 2020. Similarly, for Start range = 2020, the PowerCenter Server passes in data where values are equal to or greater than 2020. Null values or values that do not fall in either partition are passed through the first partition.

Pass-through Partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point without redistributing them. Use pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to (or cannot) change the distribution of data across partitions.

The Data Transformation Manager spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and writer threads) by default. Each of these threads can, at most, process one data set at a time and hence three data sets simultaneously. If there are complex transformations in the mapping, the transformation thread may take a longer time than the other threads, which can slow data throughput.

It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread. When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative process of adding partitions. Continue adding partitions to the session until you meet the desired performance threshold or observe degradation in performance.

Tips for Efficient Session and Data Partitioning


- Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before adding additional partitions. Refer to the Workflow Administration Guide for more information on restrictions on the number of partitions.
- Set DTM buffer memory. For a session with n partitions, set this value to at least n times the original value for the non-partitioned session.
- Set cached values for the Sequence Generator. For a session with n partitions, there is generally no need to use the Number of Cached Values property of the Sequence Generator. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the non-partitioned session.
- Partition the source data evenly. The source data should be partitioned into equal-sized chunks for each partition.
- Partition tables. A notable increase in performance can also be realized when the actual source and target tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables, and the setup of tablespaces.
- Consider using an external loader. As with any session, using an external loader may increase session performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.
- Write throughput. Check the session statistics to see if you have increased the write throughput.
- Paging. Check to see if the session is now causing the system to page. When you partition a session and there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches. When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for each partition. If the memory is not increased, the system may start paging to disk, causing degradation in performance.

When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If the session performance has improved and the session meets your requirements, add another partition.

Session on Grid and Partitioning Across Nodes


Session on Grid provides the ability to run a session on a multi-node Integration Service. This is most suitable for large sessions. For small and medium-sized sessions, it is more practical to distribute whole sessions to different nodes using Workflow on Grid. Session on Grid leverages the existing partitions of a session by executing threads in multiple DTMs. The Log service can be used to get the cumulative log. See PowerCenter Enterprise Grid Option for detailed configuration information.

Dynamic Partitioning
Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. With the Session on Grid option, more partitions can be added when more resources are available. Also, the number of partitions in a session can be tied to partitions in the database, so that PowerCenter partitioning leverages database partitioning and is easier to maintain.

Last updated: 06-Dec-07 15:04

Using Parameters, Variables and Parameter Files

Challenge


Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.

Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were those defined within specific transformations and the server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.

More current versions of PowerCenter made variables and parameters available across the entire mapping rather than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow Manager. Using parameter files, these values can change from session run to session run. With the addition of workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility and reducing parameter file maintenance. Other important functionality that has been added in recent releases is the ability to dynamically create parameter files that can be used in the next session in a workflow or in other workflows.

Parameters and Variables


Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or session. A parameter file can be created using a text editor such as WordPad or Notepad. List the parameters or variables and their values in the parameter file. Parameter files can contain the following types of parameters and variables:
- Workflow variables
- Worklet variables
- Session parameters
- Mapping parameters and variables

When you use parameters or variables in a workflow, worklet, mapping, or session, the Integration Service checks the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize workflow variables, worklet variables, mapping parameters, and mapping variables. If you do not define start values for these parameters and variables, the Integration Service checks for the start value of the parameter or variable in other places.

Session parameters must be defined in a parameter file. Because session parameters do not have default values, if the Integration Service cannot locate the value of a session parameter in the parameter file, it fails to initialize the session.

To include parameter or variable information for more than one workflow, worklet, or session in a single parameter file, create separate sections for each object within the parameter file. You can also create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks use, as necessary. To specify the parameter file that the Integration Service uses with a workflow, worklet, or session, do either of the following:


- Enter the parameter file name and directory in the workflow, worklet, or session properties.
- Start the workflow, worklet, or session using pmcmd and enter the parameter file name and directory in the command line.

If you enter a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd command line, the Integration Service uses the information entered in the pmcmd command line.

Parameter File Format


When entering values in a parameter file, precede the entries with a heading that identifies the workflow, worklet or session whose parameters and variables are to be assigned. Assign individual parameters and variables directly below this heading, entering each parameter or variable on a new line. List parameters and variables in any order for each task. The following heading formats can be defined:
- Workflow variables - [folder name.WF:workflow name]
- Worklet variables - [folder name.WF:workflow name.WT:worklet name]
- Worklet variables in nested worklets - [folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
- Session parameters, plus mapping parameters and variables - [folder name.WF:workflow name.ST:session name] or [folder name.session name] or [session name]

Below each heading, define parameter and variable values as follows:


parameter name=value
parameter2 name=value
variable name=value
variable2 name=value

For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping parameter, $$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The session also uses session parameters to connect to source files and target databases, as well as to write the session log to the appropriate session log file. The following table shows the parameters and variables that can be defined in the parameter file:

Parameter and Variable Type                Parameter and Variable Name    Desired Definition
String Mapping Parameter                   $$State                        MA
Datetime Mapping Variable                  $$Time                         10/1/2000 00:00:00
Source File (Session Parameter)            $InputFile1                    Sales.txt
Database Connection (Session Parameter)    $DBConnection_Target           Sales (database connection)
Session Log File (Session Parameter)       $PMSessionLogFile              d:/session logs/firstrun.txt

The parameter file for the session includes the folder and session name, as well as each parameter and variable:
[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt

The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable. This allows the Integration Service to use the value for the variable that was saved from the previous session run.
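For illustration, the edited parameter file for that second run would then look like the file above with $$State changed and the $$Time entry removed:

[Production.s_MonthlyCalculations]
$$State=MD
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt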

Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to creating a port in most transformations (See the second figure, below).


Variables, by definition, are objects that can change value dynamically. PowerCenter has four functions that change the value of a mapping variable:
- SetVariable
- SetMaxVariable
- SetMinVariable
- SetCountVariable

A mapping variable can store the last value from a session run in the repository to be used as the starting value for the next session run.
- Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
- Aggregation type. This entry creates specific functionality for the variable and determines how it stores data. For example, with an aggregation type of Max, the value stored in the repository at the end of each session run would be the maximum value across ALL records until the value is deleted.
- Initial value. This value is used during the first session run when there is no corresponding and overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified, then a data-type specific default value is used.

Variable values are not stored in the repository when the session:
- Fails to complete.
- Is configured for a test load.
- Is a debug session.




- Runs in debug mode and is configured to discard session output.

Order of Evaluation
The start value is the value of the variable at the start of the session. The start value can be a value defined in the parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined initial value for the variable, or the default value based on the variable data type. The Integration Service looks for the start value in the following order:
1. Value in the session parameter file
2. Value saved in the repository
3. Initial value
4. Default value

Mapping Parameters and Variables


Since parameter values do not change over the course of the session run, the value used is based on:
- Value in the session parameter file
- Initial value
- Default value

Once defined, mapping parameters and variables can be used in the Expression Editor section of the following transformations:
- Expression
- Filter
- Router
- Update Strategy
- Aggregator

Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined join, and source filter sections, as well as in a SQL override in the lookup transformation.

Guidelines for Creating Parameter Files


Use the following guidelines when creating parameter files:
- Enter folder names for non-unique session names. When a session name exists more than once in a repository, enter the folder name to indicate the location of the session.
- Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions individually. Specify the same parameter file for all of these tasks or create several parameter files. If including parameter and variable information for more than one session in the file, create a new section for each session. The folder name is optional.

[folder_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value


- Specify headings in any order. Place headings in any order in the parameter file. However, if defining the same parameter or variable more than once in the file, the Integration Service assigns the parameter or variable value using the first instance of the parameter or variable.
- Specify parameters and variables in any order. Below each heading, the parameters and variables can be specified in any order. When defining parameter values, do not use unnecessary line breaks or spaces. The Integration Service may interpret additional spaces as part of the value.
- List all necessary mapping parameters and variables. Values entered for mapping parameters and variables become the start value for parameters and variables in a mapping. Mapping parameter and variable names are not case sensitive.
- List all session parameters. Session parameters do not have default values. An undefined session parameter can cause the session to fail. Session parameter names are not case sensitive.
- Use correct date formats for datetime values. When entering datetime values, use the following date formats:
  MM/DD/RR
  MM/DD/RR HH24:MI:SS
  MM/DD/YYYY
  MM/DD/YYYY HH24:MI:SS
- Do not enclose parameters or variables in quotes. The Integration Service interprets everything after the equal sign as part of the value.
- Do enclose parameters in single quotes in a Source Qualifier SQL override if the parameter represents a string or date/time value to be used in the SQL override.
- Precede parameters and variables created in mapplets with the mapplet name as follows:
  mapplet_name.parameter_name=value
  mapplet2_name.variable_name=value

Sample: Parameter Files and Session Parameters


Parameter files, along with session parameters, allow you to change certain values between sessions. A commonly-used feature is the ability to create user-defined database connection session parameters to reuse sessions for different relational sources or targets. Use session parameters in the session properties, and then define the parameters in a parameter file. To do this, name all database connection session parameters with the prefix $DBConnection, followed by any alphanumeric and underscore characters. Session parameters and parameter files help reduce the overhead of creating multiple mappings when only certain attributes of a mapping need to be changed.

Using Parameters in Source Qualifiers


Another commonly used feature is the ability to create parameters in Source Qualifiers, which allows you to reuse the same mapping with different sessions to extract the data specified in the parameter file each session references. It can also be useful to have one mapping create a parameter file and a second mapping use it: the first mapping builds a flat file target that serves as a parameter file for another session, and the second mapping pulls its data using a parameter in the Source Qualifier transformation, which is read from the parameter file created by the first mapping.

Sample: Variables and Parameters in an Incremental Strategy


Variables and parameters can enhance incremental strategies. The following example uses a mapping variable, an expression transformation object, and a parameter file for restarting.

Scenario
Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new information. The environment data has an inherent Post_Date that is defined within a column named Date_Entered that can be used. The process will run once every twenty-four hours.

Sample Solution
Create a mapping with source and target objects. From the menu, create a new mapping variable named $$Post_Date with the following attributes:
- TYPE: Variable
- DATATYPE: Date/Time
- AGGREGATION TYPE: MAX
- INITIAL VALUE: 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE in quotation marks. However, if this value is used within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g., TO_DATE(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute (this sample refers to Oracle as the source RDBMS):

DATE_ENTERED > to_Date('$$Post_Date','MM/DD/YYYY HH24:MI:SS')

Also note that the initial value 01/01/1900 will be expanded by the Integration Service to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.

The next step is to forward $$Post_Date and Date_Entered to an Expression transformation. This is where the function for setting the variable resides. Create an output port named Post_Date with a data type of date/time, and place the following function in the expression:

SETMAXVARIABLE($$Post_Date,DATE_ENTERED)


The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed forward. For example:

DATE_ENTERED    Resultant POST_DATE
9/1/2000        9/1/2000
10/30/2001      10/30/2001
9/2/2000        10/30/2001

Consider the following with regard to the functionality:
1. In order for the function to assign a value, and ultimately store it in the repository, the port must be connected to a downstream object. It need not go to the target, but it must go to another Expression transformation. The reason is that the memory will not be instantiated unless it is used in a downstream transformation object.
2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not work. In this case, make the session Data Driven and add an Update Strategy after the transformation containing the SETMAXVARIABLE function, but before the Target.
3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is preserved.

The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900 providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it encounters. Upon successful completion of the session, the variable is updated in the repository for use in the next session run. To view the current value for a particular variable associated with the session, right-click on the session in the Workflow Monitor and choose View Persistent Values. The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered > 02/03/1998 will be processed.


Resetting or Overriding Persistent Values


To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing the Order of Evaluation to use the Initial Value declared from the mapping. If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:
- Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A session may (or may not) have a variable, and the parameter file need not have variables and parameters defined for every session using the parameter file. To override the variable, either change, uncomment, or delete the variable in the parameter file.
- Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.

Configuring the Parameter File Location


Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in the workflow or session properties:
1. Select either the Workflow or Session, choose Edit, and click the Properties tab.
2. Enter the parameter directory and name in the Parameter Filename field.
3. Enter either a direct path or a server variable directory. Use the appropriate delimiter for the Integration Service operating system.

The following graphic shows the parameter filename and location specified in the session task.


The next graphic shows the parameter filename and location specified in the Workflow.

In this example, after the initial session is run, the parameter file contents may look like:

[Test.s_Incremental]
;$$Post_Date=

By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in a subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a simple Perl script or manual change can update the parameter file to:

[Test.s_Incremental]
$$Post_Date=04/21/2001

Upon running the session, the order of evaluation looks to the parameter file first, sees a valid variable and value, and uses that value for the session run. After successful completion, run another script to reset the parameter file.
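As a rough sketch of the kind of script the paragraph above refers to (the file name parmfile.txt and the date value are placeholders, and this assumes a UNIX-style environment with Perl available), the override could be set and later reset with in-place edits such as:

# set the override date for the next run (placeholder date)
perl -pi -e 's/^;?\$\$Post_Date=.*/\$\$Post_Date=04\/21\/2001/' parmfile.txt

# reset the file afterwards so the stored repository value is used again
perl -pi -e 's/^\$\$Post_Date=.*/;\$\$Post_Date=/' parmfile.txt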

Sample: Using Session and Mapping Parameters in Multiple Database Environments


Reusable mappings that can source a common table definition across multiple databases, regardless of differing environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.

Scenario
Company X maintains five Oracle database instances. All instances have a common table definition for sales orders, but each instance has a unique instance name, schema, and login.

DB Instance   Schema     Table        User    Password
ORC1          aardso     orders       Sam     max
ORC99         environ    orders       Help    me
HALC          hitme      order_done   Hi      Lois
UGLY          snakepit   orders       Punch   Judy
GORF          gmer       orders       Brer    Rabbit

Each sales order table has a different name, but the same definition:

ORDER_ID        NUMBER (28)    NOT NULL,
DATE_ENTERED    DATE           NOT NULL,
DATE_PROMISED   DATE           NOT NULL,
DATE_SHIPPED    DATE           NOT NULL,
EMPLOYEE_ID     NUMBER (28)    NOT NULL,
CUSTOMER_ID     NUMBER (28)    NOT NULL,
SALES_TAX_RATE  NUMBER (5,4)   NOT NULL,
STORE_ID        NUMBER (28)    NOT NULL

Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the strings are named according to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then create a Mapping Parameter named $$Source_Schema_Table with the following attributes:


Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required since this solution uses parameter files. Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.

Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.

Override the table names in the SQL statement with the mapping parameter. Using Workflow Manager, create a session based on this mapping. Within the Source Database connection dropdown box, choose the following parameter: $DBConnection_Source. Point the target to the corresponding target and finish. Now create the parameter files. In this example, there are five separate parameter files.

Parmfile1.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1

Parmfile2.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99

Parmfile3.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC

Parmfile4.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY

Parmfile5.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF

Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting sessions with a particular parameter file is as follows:

pmcmd startworkflow -s serveraddress:portno -u Username -p Password -paramfile parmfilename s_Incremental

You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual password.

Notes on Using Parameter Files with Startworkflow


When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter Integration Service runs the workflow using the parameters in the file specified.

For UNIX shell users, enclose the parameter file name in single quotes:

-paramfile '$PMRootDir/myfile.txt'

For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the name includes spaces, enclose the file name in double quotes:

-paramfile "$PMRootDir\my file.txt"

Note: When writing a pmcmd command that includes a parameter file located on another machine, use the backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the server variable.

pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\$PMRootDir/myfile.txt'

In the event that it is necessary to run the same workflow with different parameter files, use the following five separate commands:

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1

Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script can change the parameter file for the next session.


Dynamically creating Parameter Files with a mapping


Using advanced techniques, a PowerCenter mapping can be built that produces as its target file a parameter file (.parm) that can be referenced by other mappings and sessions. When many mappings use the same parameter file, it is desirable to be able to easily re-create the file when mapping parameters are changed or updated. This can also be beneficial when parameters change from run to run.

There are a few different methods of creating a parameter file with a mapping. A mapping template example on my.informatica.com illustrates a method of using a PowerCenter mapping to source from a process table containing mapping parameters and to create a parameter file. The same thing can also be accomplished by sourcing a flat file in parameter file format with code characters in the fields to be altered:

[folder_name.session_name]
parameter_name=<parameter_code>
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name=<parameter_code>
variable_name=value
mapplet_name.parameter_name=value

In place of the text <parameter_code> one could place the text filename_<timestamp>.dat. The mapping would then perform a string replace wherever the text <timestamp> occurred, and the output might look like:

Src_File_Name=filename_20080622.dat

This method works well when values change often and parameter groupings utilize different parameter sets. The overall benefit of using this method is that if many mappings use the same parameter file, changes can be made by updating the source table and recreating the file. Using this process is faster than manually updating the file line by line.
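A minimal sketch of the string replace described above, written in the PowerCenter expression language inside an Expression transformation (the input port name line_in is a placeholder; REPLACESTR and the built-in SESSSTARTTIME variable are standard expression functions):

-- replace every occurrence of <timestamp> in the sourced template line
-- with the session start date formatted as YYYYMMDD
REPLACESTR(0, line_in, '<timestamp>', TO_CHAR(SESSSTARTTIME, 'YYYYMMDD'))

Writing the result to the flat file target yields lines such as Src_File_Name=filename_20080622.dat.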

Final Tips for Parameters and Parameter Files


Use a single parameter file to group parameter information for related sessions. When sessions are likely to use the same database connection or directory, you might want to include them in the same parameter file. When connections or directories change, you can update information for all sessions by editing one parameter file.

Sometimes you reuse session parameters in a cycle. For example, you might run a session against a sales database every day, but run the same session against sales and marketing databases once a week. You can create separate parameter files for each session run. Instead of changing the parameter file in the session properties each time you run the weekly session, use pmcmd to specify the parameter file to use when you start the session.


Use reject file and session log parameters in conjunction with target file or target database connection parameters. When you use a target file or target database connection parameter with a session, you can keep track of reject files by using a reject file parameter. You can also use the session log parameter to write the session log to the target machine.

Use a resource to verify that the session runs on a node that has access to the parameter file. In the Administration Console, you can define a file resource for each node that has access to the parameter file and configure the Integration Service to check resources. Then, edit the session that uses the parameter file and assign the resource. When you run the workflow, the Integration Service runs the session with the required resource on a node that has the resource available.

Save all parameter files in one of the process variable directories. If you keep all parameter files in one of the process variable directories, such as $SourceFileDir, use the process variable in the session property sheet. If you need to move the source and parameter files at a later date, you can update all sessions by changing the process variable to point to the new directory.

Last updated: 29-May-08 17:43


Using PowerCenter with UDB

Challenge


IBM DB2 Universal Database (UDB) is a database platform that can host PowerCenter repositories and act as a source and target database for PowerCenter mappings. Like any software, it has its own way of doing things, and it is important to understand these behaviors in order to configure the environment correctly when implementing PowerCenter and other Informatica products on this database platform. This Best Practice offers a number of tips for using UDB with PowerCenter.

Description
UDB Overview
UDB is used for a variety of purposes and with various environments. UDB servers run on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX. UDB supports two independent types of parallelism: symmetric multi-processing (SMP) and massively parallel processing (MPP). Enterprise-Extended Edition (EEE) is the most common UDB edition used in conjunction with the Informatica product suite. UDB EEE introduces a dimension of parallelism that can be scaled to very high performance. A UDB EEE database can be partitioned across multiple machines that are connected by a network or a high-speed switch. Additional machines can be added to an EEE system as application requirements grow. The individual machines participating in an EEE installation can be either uniprocessors or symmetric multiprocessors.

Connection Setup
You must set up a remote database connection to connect to DB2 UDB via PowerCenter. This is necessary because DB2 UDB sets a very small limit on the number of attachments per user to the shared memory segments when the user is using the local (or indirect) connection/protocol. The PowerCenter Server runs into this limit when it is acting as the database agent or user. This is especially apparent when the repository is installed on DB2 and the target data source is on the same DB2 database. The local protocol limit will definitely be reached when using the same connection node for the repository via the PowerCenter Server and for the targets. This occurs when the session is executed and the server sends requests for multiple agents to be launched. Whenever the limit on the number of database agents is reached, the following error occurs:

CMN_1022 [[IBM][CLI Driver] SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command. SQLSTATE=55032]

The following recommendations may resolve this problem:

- Increase the number of connections permitted by DB2.
- Catalog the database as if it were remote; a sketch of the catalog commands appears after this list. (For information on how to catalog a database with a remote node, refer to Knowledge Base article 14745 in the my.Informatica.com support Knowledge Base.)
- Be sure to close connections when programming exceptions occur.
- Verify that connections obtained in one method are returned to the pool via close() (the PowerCenter Server is very likely already doing this).
- Verify that your application does not try to access pre-empted connections (i.e., idle connections that are now used by other resources).
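As referenced in the list above, a minimal sketch of cataloging the database as if it were remote uses the standard DB2 command line processor; the node name, host, port, and database names below are placeholders for your environment:

db2 catalog tcpip node pmnode remote db2host.mycompany.com server 50000
db2 catalog database PMTGT as PMTGT_R at node pmnode
db2 terminate

PowerCenter connections would then use the alias (PMTGT_R in this sketch) so that the remote protocol is used even when the database is local.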



DB2 Timestamp
DB2 has a timestamp data type that is precise to the microsecond and uses a 26-character format, as follows:

YYYY-MM-DD-HH.MI.SS.MICROS

(where MICROS after the last period represents six decimal places of seconds). The PowerCenter Date/Time datatype only supports precision to the second (using a 19-character format), so under normal circumstances when a timestamp source is read into PowerCenter, the six decimal places after the second are lost. This is sufficient for most data warehousing applications but can cause significant problems where this timestamp is used as part of a key.

If the MICROS need to be retained, this can be accomplished by changing the format of the column from a timestamp data type to a character 26 in the source and target definitions. When the timestamp is read from DB2, it will be read in and converted to character in the YYYY-MM-DD-HH.MI.SS.MICROS format. Likewise, when writing to a timestamp, pass the date as a character in the YYYY-MM-DD-HH.MI.SS.MICROS format. If this format is not retained, the records are likely to be rejected due to an invalid date format error.

It is also possible to maintain the timestamp correctly using the timestamp data type itself. This is done by setting a flag at the PowerCenter Server level; the technique is described in Knowledge Base article 10220 at my.Informatica.com.
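For example, when a PowerCenter Date/Time port must be written to a DB2 timestamp column defined as CHAR(26) in the target, an expression along the following lines can build the string (ORDER_DATE is a placeholder port name; padding the microseconds with zeros is an assumption appropriate when sub-second precision is not available):

-- format a PowerCenter date as a DB2 timestamp string with zero microseconds
TO_CHAR(ORDER_DATE, 'YYYY-MM-DD-HH24.MI.SS') || '.000000'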

Importing Sources or Targets


If the value of the DB2 system variable APPLHEAPSZ is too small when you use the Designer to import sources or targets from a DB2 database, the Designer reports an error accessing the repository. The Designer status bar displays the following message:

SQL Error:[IBM][CLI Driver][DB2]SQL0954C: Not enough storage is available in the application heap to process the statement.

If you receive this error, increase the value of the APPLHEAPSZ variable for your DB2 operating system. APPLHEAPSZ is the application heap size (in 4KB pages) for each process using the database.
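As a sketch, APPLHEAPSZ can be raised with the standard DB2 command line processor; the database name PMREPO and the value 512 are placeholders to be adjusted for your environment:

db2 update db cfg for PMREPO using APPLHEAPSZ 512

The new value generally takes effect once all applications have disconnected from the database.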

Unsupported Datatypes
PowerMart and PowerCenter do not support the following DB2 datatypes:
- Dbclob
- Blob
- Clob
- Real

DB2 External Loaders


The DB2 EE and DB2 EEE external loaders can both perform insert and replace operations on targets. Both can also restart or terminate load operations.
- The DB2 EE external loader invokes the db2load executable located in the PowerCenter Server installation directory. The DB2 EE external loader can load data to a DB2 server on a machine that is remote to the PowerCenter Server.
- The DB2 EEE external loader invokes the IBM DB2 AutoLoader program to load data. The AutoLoader program uses the db2atld executable. The DB2 EEE external loader can partition data and load the partitioned data simultaneously to the corresponding database partitions. When you use the DB2 EEE external loader, the PowerCenter Server and the DB2 EEE server must be on the same machine.

The DB2 external loaders load from a delimited flat file. Be sure that the target table columns are wide enough to store all of the data. If you configure multiple targets in the same pipeline to use DB2 external loaders, each loader must load to a different tablespace on the target database. For information on selecting external loaders, see "Configuring External Loading in a Session" in the PowerCenter User Guide.

Setting DB2 External Loader Operation Modes


DB2 operation modes specify the type of load the external loader runs. You can configure the DB2 EE or DB2 EEE external loader to run in any one of the following operation modes:
- Insert. Adds loaded data to the table without changing existing table data.
- Replace. Deletes all existing data from the table, and inserts the loaded data. The table and index definitions do not change.
- Restart. Restarts a previously interrupted load operation.
- Terminate. Terminates a previously interrupted load operation and rolls back the operation to the starting point, even if consistency points were passed. The tablespaces return to normal state, and all table objects are made consistent.

Configuring Authorities, Privileges, and Permissions


When you load data to a DB2 database using either the DB2 EE or DB2 EEE external loader, you must have the correct authority levels and privileges to load data into the database tables. DB2 privileges allow you to create or access database resources. Authority levels provide a method of grouping privileges and higher-level database manager maintenance and utility operations. Together, these functions control access to the database manager and its database objects. You can access only those objects for which you have the required privilege or authority. To load data into a table, you must have one of the following authorities:

- SYSADM authority
- DBADM authority
- LOAD authority on the database, with INSERT privilege

In addition, you must have proper read access and read/write permissions:
- The database instance owner must have read access to the external loader input files.
- If you run DB2 as a service on Windows, you must configure the service start account with a user account that has read/write permissions to use LAN resources, including drives, directories, and files.
- If you load to DB2 EEE, the database instance owner must have write access to the load dump file and the load temporary file.

Remember, the target file must be delimited when using the DB2 AutoLoader.

Guidelines for Performance Tuning


You can achieve numerous performance improvements by properly configuring the database manager, database, and tablespace container and parameter settings. For example, MAXFILOP is one of the database configuration parameters that you can tune. The default value for MAXFILOP is far too small for most databases. When this value is too small, UDB spends a lot of extra CPU processing time closing and opening files. To resolve this problem, increase the MAXFILOP value until UDB stops closing files.

You must also have enough DB2 agents available to process the workload based on the number of users accessing the database. Incrementally increase the value of MAXAGENTS until agents are not stolen from another application. Moreover, sufficient memory allocated to the CATALOGCACHE_SZ database configuration parameter also benefits the database. If the value of catalog cache heap is greater than zero, both DBHEAP and CATALOGCACHE_SZ should be proportionally increased.

In UDB, the LOCKTIMEOUT default value is -1 (wait indefinitely). In a data warehouse database, set this value to 60 seconds. Remember to define TEMPSPACE tablespaces so that they have at least 3 or 4 containers across different disks, and set the PREFETCHSIZE to a multiple of EXTENTSIZE, where the multiplier is equal to the number of containers. Doing so will enable parallel I/O for larger sorts, joins, and other database functions requiring substantial TEMPSPACE space.


In UDB, the default LOGBUFSZ value of 8 is too small. Try setting it to 128. Also, set an INTRA_PARALLEL value of YES for CPU parallelism. The database configuration parameter DFT_DEGREE should be set to a value between ANY and 1, depending on the number of CPUs available and the number of processes that will be running simultaneously. Setting DFT_DEGREE to ANY can prove to be a CPU hogger, since one process can take up all the processing power with this setting; setting it to 1 provides no parallelism at all. Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for an EEE DB.

Data warehouse databases perform numerous sorts, many of which can be very large. SORTHEAP memory is also used for hash joins, which a surprising number of DB2 users fail to enable. To do so, use the db2set command to set the environment variable DB2_HASH_JOIN=ON. For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If real memory is available, some clients use even larger values for these configuration parameters.

SQL is very complex in a data warehouse environment and often consumes large quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9. UDB uses NUM_IO_CLEANERS for writing to TEMPSPACE, temporary intermediate tables, index creations, and more. Set NUM_IO_CLEANERS equal to the number of CPUs on the UDB server and focus on your disk layout strategy instead.

Lastly, for RAID devices where several disks appear as one to the operating system, be sure to do the following:
1. db2set DB2_STRIPED_CONTAINERS=YES (do this before creating tablespaces or before a redirected restore)
2. db2set DB2_PARALLEL_IO=* (or use TablespaceID numbers for tablespaces residing on the RAID devices, for example DB2_PARALLEL_IO=4,5,6,7,8,10,12,13)
3. Alter the tablespace PREFETCHSIZE for each tablespace residing on RAID devices such that the PREFETCHSIZE is a multiple of the EXTENTSIZE.
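To make the parameter changes above concrete, a sketch of the corresponding DB2 command line statements follows. The database name MYDWDB is a placeholder, and the numeric values simply follow the ranges suggested in this section; verify them against your own workload before applying:

db2set DB2_HASH_JOIN=ON

db2 update dbm cfg using SHEAPTHRES 40000
db2 update dbm cfg using INTRA_PARALLEL YES

db2 update db cfg for MYDWDB using SORTHEAP 8192
db2 update db cfg for MYDWDB using LOGBUFSZ 128
db2 update db cfg for MYDWDB using LOCKTIMEOUT 60
db2 update db cfg for MYDWDB using DFT_QUERYOPT 7

Most of these take effect after the instance is restarted (db2stop/db2start) or after all applications disconnect from the database.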

Database Locks and Performance Problems


When working in an environment with many users that target a DB2 UDB database, you may experience slow and erratic behavior resulting from the way UDB handles database locks. Out of the box, DB2 UDB database and client connections are configured on the assumption that they will be part of an OLTP system and place several locks on records and tables. Because PowerCenter typically works with OLAP systems where it is the only process writing to the database and users are primarily reading from the database, this default locking behavior can have a significant impact on performance.

Connections to DB2 UDB databases are set up using the DB2 Client Configuration utility. To minimize problems with the default settings, make the following changes to all remote clients accessing the database for read-only purposes. To help replicate these settings, you can export the settings from one client and then import the resulting file into all the other clients.
- Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the configuration settings and make sure the Enable Cursor Hold option is not checked.
- Connection Mode should be Shared, not Exclusive.
- Isolation Level should be Read Uncommitted (the minimum level) or Read Committed (if updates by other applications are possible and dirty reads must be avoided).


To set the isolation level to dirty read at the PowerCenter Server level, you can set a flag in the PowerCenter configuration file. For details on this process, refer to Knowledge Base article 13575 in the my.Informatica.com support Knowledge Base.

If you are not sure how to adjust these settings, launch the IBM DB2 Client Configuration utility, highlight the database connection you use, and select Properties. In Properties, select Settings and then select Advanced. You will see these options and their settings on the Transaction tab.

To export the settings from the main screen of the IBM DB2 Client Configuration utility, highlight the database connection you use, then select Export and all. Use the same process to import the settings on another client.

If users run hand-coded queries against the target table using DB2's Command Center, be sure they know to use script mode and avoid interactive mode (by choosing the script tab instead of the interactive tab when writing queries). Interactive mode can lock returned records, while script mode merely returns the result and does not hold them.


If your target DB2 table is partitioned and resides across different nodes in DB2, you can use the DB Partitioning target partition type in the PowerCenter session properties. When DB partitioning is selected, separate connections are opened directly to each node and the load starts in parallel. This improves performance and scalability.

Last updated: 13-Feb-07 17:14


Using Shortcut Keys in PowerCenter Designer

Challenge


Using shortcuts and work-arounds to work as efficiently as possible in PowerCenter Mapping Designer and Workflow Manager.

Description
After you are familiar with the normal operation of PowerCenter Mapping Designer and Workflow Manager, you can use a variety of shortcuts to speed up their operation. PowerCenter provides two types of shortcuts:
- keyboard shortcuts to edit repository objects and maneuver through the Mapping Designer and Workflow Manager as efficiently as possible, and
- shortcuts that simplify the maintenance of repository objects.

General Suggestions

Maneuvering the Navigator Window


Follow these steps to open a folder with workspace open as well:
1. While highlighting the folder, click the Open folder icon. Note: Double-clicking the folder name only opens the folder if it has not yet been opened or connected to.
2. Alternatively, right-click the folder name, then click Open.

Working with the Toolbar and Menubar


The toolbar contains commonly used features and functions within the various client tools. Using the toolbar is often faster than selecting commands from within the menubar.
- To add more toolbars, select Tools | Customize.
- Select the Toolbar tab to add or remove toolbars.

Follow these steps to use drop-down menus without the mouse:
1. Press and hold the <Alt> key. You will see an underline under one letter of each of the menu titles.
2. Press the underlined letter for the desired drop-down menu. For example, press 'r' for the 'Repository' drop-down menu.


3. Press the underlined letter to select the command/operation you want. For example, press 't' for 'Close All Tools'.
4. Alternatively, after you have pressed the <Alt> key, use the right/left arrows to navigate across the menubar, and the up/down arrows to expand and navigate through the drop-down menu. Press Enter when the desired command is highlighted.
- To create a customized toolbar for the functions you frequently use, press <Alt><T> (expands the Tools drop-down menu) then <C> (for Customize).
- To delete customized icons, select Tools | Customize, and then remove the icons by dragging them directly off the toolbar.
- To add an icon to an existing (or new) toolbar, select Tools | Customize and navigate to the Commands tab. Find your desired command, then "drag and drop" the icon onto your toolbar.
- To rearrange the toolbars, click and drag the toolbar to the new location. You can insert more than one toolbar at the top of the designer tool to avoid having the buttons go off the edge of the screen. Alternatively, you can position the toolbars at the bottom, side, or between the workspace and the message windows.
- To dock or undock a window (e.g., Repository Navigator), double-click on the window's title bar. If you are having trouble docking the window again, right-click somewhere in the white space of the runaway window (not the title bar) and make sure that the "Allow Docking" option is checked. When it is checked, drag the window to its proper place and, when an outline of where the window used to be appears, release the window.


Keyboard Shortcuts
Use the following keyboard shortcuts to perform various operations in Mapping Designer and Workflow Manager.

To:                                                          Press:
Cancel editing in an object                                  Esc
Check and uncheck a check box                                Space Bar
Copy text from an object onto a clipboard                    Ctrl+C
Cut text from an object onto the clipboard                   Ctrl+X
Edit the text of an object                                   F2, then move the cursor to the desired location
Find all combination and list boxes                          Type the first letter of the list
Find tables or fields in the workspace                       Ctrl+F
Move around objects in a dialog box (when no objects are     Ctrl+directional arrows
selected, this will pan within the workspace)
Paste copied or cut text from the clipboard into an object   Ctrl+V
Select the text of an object                                 F2
To start help                                                F1

Mapping Designer

Navigating the Workspace


When using the "drag & drop" approach to create Foreign Key/Primary Key relationships between tables, be sure to start in the Foreign Key table and drag the key/field to the Primary Key table. Set the Key Type value to "NOT A KEY" prior to dragging.

Follow these steps to quickly select multiple transformations:
1. Hold the mouse down and drag to view a box.
2. Be sure the box touches every object you want to select. The selected items will have a distinctive outline around them.
3. If you miss one or have an extra, you can hold down the <Shift> or <Ctrl> key and click the offending transformations one at a time. They will alternate between being selected and deselected each time you click on them.

Follow these steps to copy and link fields between transformations:
1. You can select multiple ports when you are trying to link to the next transformation.
2. When you are linking multiple ports, they are linked in the same order as they are in the source transformation. You need to highlight the fields you want in the source transformation and hold the mouse button over the port name in the target transformation that corresponds to the source transformation port.
3. Use the Autolink function whenever possible. It is located under the Layout menu (or accessible by right-clicking somewhere in the workspace) of the Mapping Designer.
4. Autolink can link by name or position. PowerCenter version 6 and later gives you the option of entering prefixes or suffixes (when you click the 'More' button). This is especially helpful when you are trying to autolink from a Router transformation to some target transformation. For example, each group created in a Router has a distinct suffix number added to the port/field name. To autolink, you need to choose the proper Router and Router group in the 'From Transformation' space. You also need to click the 'More' button and enter the appropriate suffix value. You must do both to create a link.
5. Autolink does not work if any of the fields in the 'To' transformation are already linked to another group or another stream. No error appears; the links are simply not created.

Sometimes, a shared object is very close to (but not exactly) what you need. In this case, you may want to make a copy of the object with some minor alterations to suit your purposes. If you try to simply click and drag the object, it will ask you if you want to make a shortcut or it will be reusable every time. Follow these steps to make a non-reusable copy of a reusable object:
1. Open the target folder.
2. Select the object that you want to make a copy of, either in the source or target folder.
3. Drag the object over the workspace.
4. Press and hold the <Ctrl> key (the crosshairs symbol '+' will appear in a white box).
5. Release the mouse button, then release the <Ctrl> key.
6. A copy confirmation window and a copy wizard window appears.

7. The newly created transformation no longer says that it is reusable and you are free to make changes without affecting the original reusable object.

Editing Tables/Transformations
Follow these steps to move one port in a transformation:
1. Double-click the transformation and make sure you are in the "Ports" tab. (You go directly to the Ports tab if you double-click a port instead of the colored title bar.)
2. Highlight the port and click the up/down arrow button to reposition the port.
3. Or, highlight the port and then press <Alt><w> to move the port down or <Alt><u> to move the port up. Note: You can hold down <Alt> and hit <w> or <u> multiple times to reposition the currently highlighted port downwards or upwards, respectively.

Alternatively, you can accomplish the same thing by following these steps:
1. Highlight the port you want to move by clicking the number beside the port.
2. Grab onto the port by its number and continue holding down the left mouse button.
3. Drag the port to the desired location (the list of ports scrolls when you reach the end). A red line indicates the new location.
4. When the red line is pointing to the desired location, release the mouse button.
Note: You cannot move more than one port at a time with this method. See below for instructions on moving more than one port at a time.

If you are using PowerCenter version 6.x, 7.x, or 8.x and the ports you are moving are adjacent, you can follow these steps to move more than one port at a time:
1. Highlight the ports you want to move by clicking the number beside the port while holding down the <Ctrl> key.
2. Use the up/down arrow buttons to move the ports to the desired location.

- To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it.
- To validate a defined default value, first highlight the port you want to validate, and then press <Alt><v>. A message box will confirm the validity of the default value.
- After creating a new port, simply begin typing the name you wish to call the port. There is no need to remove the default "NEWFIELD" text prior to labelling the new port. This method can also be applied when modifying existing port names: simply highlight the existing port by clicking on the port number, and begin typing the modified name of the port.
- To prefix a port name, press <Home> to bring the cursor to the beginning of the port name. To add a suffix to a port name, press <End> to bring the cursor to the end of the port name.
- Checkboxes can be checked (or unchecked) by highlighting the desired checkbox and pressing the space bar to toggle the checkmark on and off.

Follow either of these steps to quickly open the Expression Editor of an output or variable port:
1. Highlight the expression so that there is a box around the cell and press <F2> followed by <F3>.
2. Or, highlight the expression so that there is a cursor somewhere in the expression, then press <F2>.

- To cancel an edit in the grid, press <Esc> so the changes are not saved.
- For all combo/drop-down list boxes, type the first letter on the list to select the item you want. For example, you can highlight a port's Data type box without displaying the drop-down. To change it to 'binary', type <b>. Then use the arrow keys to go down to the next port. This is very handy if you want to change all fields to string, for example, because using the up and down arrows and hitting a letter is much faster than opening the drop-down menu and making a choice each time.
- To copy a selected item in the grid, press <Ctrl><c>.
- To paste a selected item from the Clipboard to the grid, press <Ctrl><v>.
- To delete a selected field or port from the grid, press <Alt><c>.
- To copy a selected row from the grid, press <Alt><o>.
BEST PRACTICES 448 of 954



- To paste a selected row from the grid, press <Alt><p>.

You can use either of the following methods to delete more than one port at a time.

- You can repeatedly hit the cut button; or
- You can highlight several records and then click the cut button. Use <Shift> to highlight many items in a row or <Ctrl> to highlight multiple non-contiguous items. Be sure to click on the number beside the port, not the port name, while you are holding <Shift> or <Ctrl>.

Editing Expressions
Follow either of these steps to expedite validation of a newly created expression:

Click on the <Validate> button or press <Alt> and <v>. Note: This validates and leaves the Expression Editor open.

Or, press <OK> to initiate parsing/validating of the expression. The system closes the Expression Editor if the validation is successful. If you click OK once again in the "Expression parsed successfully" pop-up, the Expression Editor remains open.

There is little need to type in the Expression Editor. The tabs list all functions, ports, and variables that are currently available. If you want an item to appear in the Formula box, just double-click on it in the appropriate list on the left. This helps to avoid typographical errors and mistakes (such as including an output-only port name in an expression formula). In version 6.x and later, if you change a port name, PowerCenter automatically updates any expression that uses that port with the new name. Be careful about changing data types. Any expression using the port with the new data type may remain valid, but not perform as expected. If the change invalidates the expression, it will be detected when the object is saved or if the Expression Editor is active for that expression. The following table summarizes additional shortcut keys that are applicable only when working with Mapping Designer:

To:                                                    Press:
Add a new field or port                                Alt + F
Copy a row                                             Alt + O
Cut a row                                              Alt + C
Move current row down                                  Alt + W
Move current row up                                    Alt + U
Paste a row                                            Alt + P
Validate the default value in a transformation         Alt + V
Open the Expression Editor from the expression field   F2, then press F3
To start the debugger                                  F9

Repository Object Shortcuts


A repository object defined in a shared folder can be reused across folders by creating a shortcut (i.e., a dynamic link to the referenced object). Whenever possible, reuse source definitions, target definitions, reusable transformations, mapplets, and mappings. Reusing objects allows sharing complex mappings, mapplets or reusable transformations across folders, saves space in the repository, and reduces maintenance.

Follow these steps to create a repository object shortcut:
1. Expand the shared folder.
2. Click and drag the object definition into the mapping that is open in the workspace.
3. As the cursor enters the workspace, the object icon appears along with a small curve.
4. A dialog box appears to confirm that you want to create a shortcut. If you want to copy an object from a shared folder instead of creating a shortcut, hold down the <Ctrl> key before dropping the object into the workspace.

Workflow Manager

Navigating the Workspace


When editing a repository object or maneuvering around the Workflow Manager, use the following shortcuts to speed up the operation you are performing:

To:                                            Press:
Create links                                   Ctrl+F2 to select the first task you want to link; Tab to select the rest of the tasks you want to link; Ctrl+F2 again to link all the tasks you selected
Edit task name in the workspace                F2
Expand a selected node and all its children    SHIFT + * (use the asterisk on the numeric keypad)
Move across to select tasks in the workspace   Tab
Select multiple tasks                          Ctrl + mouse click

Repository Object Shortcuts


Mappings that reside in a shared folder can be reused within workflows by creating shortcut mappings. A set of workflow logic can be reused within workflows by creating a reusable worklet.

Last updated: 13-Feb-07 17:25


Working with JAVA Transformation Object

Challenge


Occasionally, data requires special processing that is not easy to accomplish using the existing PowerCenter transformation objects. Tasks such as looping through data an arbitrary number of times are not native to the standard transformation objects. For these situations, the Java Transformation (JTX) provides the ability to develop Java code within a mapping, opening up virtually unlimited transformation capabilities. This Best Practice addresses questions that are commonly raised about the JTX and how to make effective use of it, and supplements the existing PowerCenter documentation on the JTX.

Description
The Java Transformation (JTX), introduced in PowerCenter 8.0, provides a uniform means of entering and maintaining program code written in Java that is executed for every record processed during a session run. The Java code is entered, maintained, and viewed within the PowerCenter Designer tool. Below is a summary of some typical questions about the JTX.

Is a JTX a passive or an active transformation?


A JTX can be either passive or active. When defining a JTX you must choose one or the other type. Once you make this choice you will not be able to change it without deleting the JTX, saving the repository and recreating the object. Hint: If you are working with a versioned repository, you will have to purge the deleted JTX from the repository before you can recreate it with the same name.

What parts of a typical Java class can be used in a JTX?


The following standard features can be used in a JTX:
- import statements can be listed on the tab Import Packages.
- static initialization blocks can be defined on the tab Helper Code.
- static variables of the Java class as a whole (e.g., counters for instances of this class) as well as non-static member variables (one per instance) can be defined on the tab Helper Code.
- static final variables may be defined on the tab Helper Code. However, they are private by nature; no object of any other Java class will be able to utilize them.
- Auxiliary functions, both member (non-static) and static, may be declared and defined on the tab Helper Code.

Important Note: Before trying to start a session utilizing additional import clauses in the Java code, make sure that the environment variable CLASSPATH contains the necessary .jar files or directories before the PowerCenter Integration Service is started.

All non-static member variables declared on the tab Helper Code are automatically available to every partition of a partitioned session without any precautions. In other words, one object of the respective Java class generated by PowerCenter will be instantiated for every single instance of the JTX and for every session partition. For example, if you utilize two instances of the same reusable JTX and have set the session to run with three partitions, then six individual objects of that Java class will be instantiated for this session run.
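For illustration, a minimal sketch of declarations that might be placed on the Helper Code tab; the variable and function names are hypothetical, and the enclosing class is generated by the Designer, so only class-body declarations appear here:

    // Hypothetical Helper Code tab content (class-body declarations only).
    // Static variable shared by all objects generated for this JTX:
    private static int instanceCount = 0;

    // Non-static member variable; one copy per JTX instance and per session partition:
    private long rowsSeen = 0;

    // Static final constant; private to the generated class, so no other Java class can use it:
    private static final String UNKNOWN_VALUE = "N/A";

    // Static initialization block, executed once when the generated class is loaded:
    static {
        instanceCount = 0;
    }

    // Auxiliary member function that the other code tabs of this JTX can call:
    private String defaultIfEmpty(String value) {
        return (value == null || value.length() == 0) ? UNKNOWN_VALUE : value;
    }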

What parts of a typical Java class cannot be utilized in a JTX?


The following standard features of Java are not available in a JTX:
- Standard and user-defined constructors
- Standard and user-defined destructors
- Any kind of direct user interface, be it a Swing GUI or a console-based user interface

What else cannot be done in a JTX?


One important note for a JTX is that you cannot retrieve, change, or utilize an existing DB connection in a JTX (such as a source connection, a target connection, or a relational connection to a LKP). If you would like to establish a database connection, use JDBC in the JTX. In this case, make sure that you provide the necessary parameters by other means.
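As an illustration only, a hedged sketch of how a JDBC connection might be managed from the Helper Code tab; the driver, URL, credentials, and method names below are assumptions, and in practice they should be supplied via mapping parameters or input ports rather than hard-coded:

    // Hypothetical Helper Code declarations for a JDBC connection owned by the JTX itself.
    private java.sql.Connection conn = null;

    // Lazily opens the connection; could be called from On Input Row.
    private java.sql.Connection ensureConnection() throws java.sql.SQLException {
        if (conn == null) {
            // The driver .jar must be on the Integration Service CLASSPATH (see the note above);
            // the URL, user, and password shown here are placeholders.
            conn = java.sql.DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "etl_user", "etl_password");
        }
        return conn;
    }

    // Releases the connection; could be called from On End Of Data.
    private void closeConnection() {
        try {
            if (conn != null) {
                conn.close();
            }
        } catch (java.sql.SQLException e) {
            // Ignore failures on close; the session is ending anyway.
        }
    }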

How can I substitute constructors and the like in a JTX?


User-defined constructors are mainly used to pass certain initialization values to a Java class that you want to process only once. The only way to get this work done in a JTX is to pass those parameters into the JTX as normal ports; then you define a boolean variable with an initial value of true (for example, constructMissing) on the Helper Code tab. The very first block in the On Input Row code will then look like this:

if (constructMissing) {
    // do whatever you would do in the constructor
    constructMissing = false;
}

Interaction with users is mainly done to provide input values to some member functions of a class. This usually is not appropriate in a JTX because all input values should be provided by means of input records. If there is a need to enable immediate interaction with a user for one, several, or all input records, use an inter-process communication (IPC) mechanism to establish communication between the Java class associated with the JTX and an environment available to a user. For example, if the actual check to be performed can only be determined at runtime, you might want to establish a JavaBeans communication between the JTX and the classes performing the actual checks. Beware, however, that this sort of mechanism causes great overhead and may decrease performance dramatically. In many cases such requirements indicate that the analysis process and the mapping design process have not been executed optimally.

How do I choose between an active and a passive JTX?


Use the following guidelines to identify whether you need an active or a passive JTX in your mapping:


- As a general rule of thumb, a passive JTX will usually execute faster than an active JTX.
- If one input record equals one output record of the JTX, you will probably want to use a passive JTX.
- If you have to produce a varying number of output records per input record (i.e., for some input values the JTX will generate one output record, for some values it will generate no output records, and for some values it will generate two or more output records), you have to utilize an active JTX. There is no other choice.
- If you have to accumulate one or more input records before generating one or more output records, you have to utilize an active JTX. There is no other choice.
- If you have to do some initialization work before processing the first input record, this fact in no way determines whether to utilize an active or a passive JTX.
- If you have to do some cleanup work after having processed the last input record, this fact in no way determines whether to utilize an active or a passive JTX.
- If you have to generate one or more output records after the last input record has been processed, you have to use an active JTX. There is no other choice, apart from changing the mapping to produce these additional records by other means.

How do I set up a JTX and use it in a mapping?


As with most standard transformations, you can either define a reusable JTX or create an instance directly within a mapping. The following example describes how to define a JTX in a mapping. For this example, assume that the JTX has one input port of data type String and three output ports of type String, Integer, and Smallint.

Note: As of version 8.1.1 the PowerCenter Designer is extremely sensitive regarding the port structure of a JTX; make sure you read and understand the Notes section below before designing your first JTX, otherwise you will encounter issues when trying to run a session associated with your mapping.

1. Click the button showing the Java icon, then click on the background in the main window of the Mapping Designer. Choose whether to generate a passive or an active JTX (see How do I choose between an active and a passive JTX above). Remember, you cannot change this setting later.
2. Rename the JTX accordingly (i.e., rename it to JTX_SplitString).
3. Go to the Ports tab; define all input-only ports in the Input Group, and define all output-only and input-output ports in the Output Group. Make sure that every output-only and every input-output port is defined correctly.
4. Make sure you define the port structure correctly from the onset, as changing data types of ports after the JTX has been saved to the repository will not always work.
5. Click Apply.
6. On the Properties tab you may want to change certain properties. For example, the setting "Is Partitionable" is mandatory if the session will be partitioned. Follow the hints in the lower part of the screen form that explain the selection lists in detail.
7. Activate the Java Code tab. Enter code pieces where necessary. Be aware that all ports marked as input-output ports on the Ports tab are automatically processed as pass-through ports by the Integration Service. You do not have to (and should not) enter any code referring to pass-through ports. See the Notes section below for more details.
8. Click the Compile link near the lower right corner of the screen form to compile the Java code you have entered. Check the output window at the lower border of the screen form for all compilation errors and work through each error message encountered; then click Compile again. Repeat this step as often as necessary until the Java code compiles without any error messages.
9. Click OK.
10. Connect only ports of the same data type to every input-only or input-output port of the JTX. Connect output-only and input-output ports of the JTX only to ports of the same data type in downstream transformations. If any downstream transformation expects a different data type than the type of the respective output port of the JTX, insert an EXP to convert data types. Refer to the Notes below for more detail.
11. Save the mapping.

Notes:
- The primitive Java data types available in a JTX that can be used for ports connected to other transformations are Integer, Double, and Date/Time. Date/time values are delivered to or by a JTX by means of a Java long value which indicates the difference between the respective date/time value and midnight, Jan 1st, 1970 (the so-called Epoch) in milliseconds; to interpret this value, utilize the appropriate methods of the Java class GregorianCalendar (see the sketch following these notes). Smallint values cannot be delivered to or by a JTX.
- The Java object data types available in a JTX that can be used for ports are String, byte arrays (for Binary ports), and BigDecimal (for Decimal values of arbitrary precision).
- In a JTX you check whether an input port has a NULL value by calling the function isNull("name_of_input_port"). If an input value is NULL, you should explicitly set all depending output ports to NULL by calling setNull("name_of_output_port"). Both functions take the name of the respective input/output port as a string.
- You retrieve the value of an input port (provided this port is not NULL, see the previous item) simply by referring to the name of this port in your Java source code. For example, if you have two input ports i_1 and i_2 of type Integer and one output port o_1 of type String, then you might set the output value with a statement like this one: o_1 = "First value = " + i_1 + ", second value = " + i_2;
- In contrast to a Custom Transformation, it is not possible to retrieve the names, data types, and/or values of pass-through ports unless these pass-through ports have been defined on the Ports tab in advance. In other words, it is impossible for a JTX to adapt to its port structure at runtime (which would be necessary, for example, for something like a Sorter JTX).
- If you have to transfer 64-bit values into a JTX, deliver them to the JTX by means of a string representing the 64-bit number and convert this string into a Java long variable using the static method Long.parseLong(). Likewise, to deliver a 64-bit integer from a JTX to downstream transformations, convert the long variable to a string which will be an output port of the JTX (e.g., using the statement o_Int64 = "" + myLongVariable).
- As of version 8.1.1, the PowerCenter Designer is very sensitive regarding the data types of ports connected to a JTX. Supplying a JTX with anything other than exactly the expected data types, or connecting output ports to transformations expecting other data types (i.e., a string instead of an integer), may cause the Designer to invalidate the mapping such that the only remedy is to delete the JTX, save the mapping, and re-create the JTX.
- Initialization Properties and Metadata Extensions can neither be defined nor retrieved in a JTX.
- The code entered on the Java Code sub-tab On Input Row is inserted into other generated code; only this complete code constitutes the method execute() of the resulting Java class associated with the JTX (see the output of the "View Code" link near the lower-right corner of the Java Code screen form). The same holds true for the code entered on the tabs On End Of Data and On Receiving Transaction with regard to their respective methods. This fact has a couple of implications which are explained in more detail below.
- If you connect input and/or output ports to transformations with differing data types, you might get error messages during mapping validation. One such error message, which occurs quite often, indicates that the byte code of the class cannot be retrieved from the repository. In this case, rectify the port connections to all input and/or output ports of the JTX, edit the Java code (inserting one blank comment line usually suffices), and recompile the Java code.
- The JTX does not currently allow true pass-through ports. They have to be simulated by splitting each of them into one input port and one output port, and then assigning the value of every input port to the respective output port in the Java code. The key here is that the input port of every pair has to be in the Input Group while the respective output port has to be in the Output Group. If you do not do this, there is no warning in the Designer, but the JTX will not function correctly.
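The following hedged sketch pulls several of these notes together in an On Input Row body; the port names i_DateMillis (a Date/Time port arriving as a Java long), i_Int64AsString (String), and o_Formatted (String) are assumptions made for this example:

    // Hypothetical On Input Row code.
    if (isNull("i_DateMillis") || isNull("i_Int64AsString")) {
        // Propagate NULLs explicitly to the depending output port.
        setNull("o_Formatted");
    } else {
        // Interpret the milliseconds-since-Epoch value via GregorianCalendar.
        java.util.GregorianCalendar cal = new java.util.GregorianCalendar();
        cal.setTimeInMillis(i_DateMillis);

        // Convert the 64-bit value that was delivered as a string.
        long bigValue = Long.parseLong(i_Int64AsString);

        // Build the output value; the formatting chosen here is arbitrary.
        o_Formatted = cal.get(java.util.Calendar.YEAR) + "/"
                    + (cal.get(java.util.Calendar.MONTH) + 1) + " -> " + bigValue;
    }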

Where and how to insert what pieces of Java code into a JTX?
A JTX always contains a code skeleton that is generated by the Designer. Every piece of code written by a mapping designer is inserted into this skeleton at designated places. Because these code pieces do not constitute the sole content of the respective methods, there are certain rules and recommendations as to how to write such code.

As mentioned previously, a mapping designer can neither write his or her own constructor nor insert any code into the default constructor or the default destructor generated by the Designer. All initialization work can be done in either of the following two ways:

- as part of the static{} initialization block on the Helper Code tab, or
- by inserting code that in a standalone class would be part of the constructor into the tab On Input Row.

Similarly, code that in a standalone class would be part of the destructor can be inserted into the tab On End Of Data.

The last case (constructor code being part of the On Input Row code) requires a little trick: constructor code is supposed to be executed once only, namely before the first method is called. In order to resemble this behavior, follow these steps:

1. On the tab Helper Code, define a boolean variable (e.g., constructorMissing) and initialize it to true.
2. At the beginning of the On Input Row code, insert code that looks like the following:

if (constructorMissing) {
    // do whatever the constructor should have done
    constructorMissing = false;
}

This ensures that this piece of code is executed only once, namely directly before the very first input row is processed.

The code pieces on the tabs On Input Row, On End Of Data, and On Receiving Transaction are embedded in other code. There is code that runs before the code entered here executes, and there is more code to follow; for example, exceptions raised within code written by a developer are caught there. As a mapping developer you cannot change this order, so you need to be aware of the following important implication.

Suppose you are writing a Java class that performs some checks on an input record and, if the checks fail, issues an error message and then skips processing to the next record. Such a piece of code might look like this:

if (firstCheckPerformed(inputRecord) && secondCheckPerformed(inputRecord)) {
    logMessage("ERROR: one of the two checks failed!");
    return;
}
// else
insertIntoTarget(inputRecord);
countOfSucceededRows++;

This code will not compile in a JTX because it would lead to unreachable code. Why? Because the return at the end of the if statement would allow the respective method (in this case, the method named execute()) to skip the subsequent code that is part of the framework created by the Designer.


In order to make this code work in a JTX, change it to look like this:

if (firstCheckPerformed(inputRecord) && secondCheckPerformed(inputRecord)) {
    logMessage("ERROR: one of the two checks failed!");
} else {
    insertIntoTarget(inputRecord);
    countOfSucceededRows++;
}

The same principle (never use return in these code pieces) applies to all three tabs On Input Row, On End Of Data, and On Receiving Transaction. Another important point is that the code entered on the On Input Row tab is embedded in a try-catch block, so never include any try-catch code of your own on this tab.

How fast does a JTX perform?


A JTX communicates with PowerCenter by means of JNI (the Java Native Interface). This mechanism was defined by Sun Microsystems to allow Java code to interact with dynamically linkable libraries. Though JNI has been designed to perform fast, it still adds some overhead to a session due to:

- the additional process switches between the PowerCenter Integration Service and the Java Virtual Machine (JVM), which executes as a separate operating system process;
- Java not being compiled to machine code but to portable byte code which is interpreted by the JVM (although this has been largely remedied in recent years by the introduction of Just-In-Time compilers);
- the inherent complexity of the object model in Java (except for most sorts of number types and characters, everything in Java is an object that occupies space and execution time).

So a JTX cannot perform as fast as, for example, a carefully written Custom Transformation. The rule of thumb is that a simple JTX requires approximately 50% more total running time than an EXP of comparable functionality. Java code utilizing several of the fairly complex standard classes can be assumed to need even more total runtime when compared to an EXP performing the same tasks.

When should I use a JTX and when not?


As with any other standard transformation, a JTX has its advantages as well as disadvantages. The most significant disadvantages are:
- The Designer is very sensitive with regard to the data types of ports that are connected to the ports of a JTX. However, most of the trouble arising from this sensitivity can be remedied rather easily by simply recompiling the Java code.
- Working with long values representing days and time (for example, within the GregorianCalendar) can be difficult and demanding in terms of runtime resources (memory, execution time). Date/time ports in PowerCenter are far easier to use, so it is advisable to split date/time ports into their individual components, such as year, month, and day, and to process these individual attributes within a JTX if needed.
- In general, a JTX can reduce performance simply by the nature of the architecture. Only use a JTX when necessary.
- A JTX always has exactly one input group and one output group. For example, it is impossible to write a Joiner as a JTX.

Significant advantages to using a JTX are:


- Java knowledge and experience are generally easier to find than comparable skills in other languages.
- Prototyping with a JTX can be very fast. For example, setting up a simple JTX that calculates the calendar week and calendar year for a given date takes approximately 10 to 20 minutes. Writing a Custom Transformation (even for easy tasks) can take several hours.
- Not every data integration environment has access to a C compiler, which is needed to compile Custom Transformations written in C. Because PowerCenter is installed with its own JDK, this problem does not arise with a JTX.

In Summary
- If you need a transformation that adapts its processing behavior to its ports, a JTX is not the way to go. In such a case, write a Custom Transformation in C, C++, or Java to perform the necessary tasks. The CT API is considerably more complex than the JTX API, but it is also far more flexible.
- Use a JTX for development whenever a task cannot be easily completed using other standard options in PowerCenter (as long as performance requirements do not dictate otherwise).
- If performance measurements are slightly below expectations, try optimizing the Java code and the remainder of the mapping in order to increase processing speed.

Last updated: 04-Jun-08 19:14


Error Handling Process

Challenge


For an error handling strategy to be implemented successfully, it must be integral to the load process as a whole. The method of implementation for the strategy will vary depending on the data integration requirements of each project. The resulting error handling process should, however, always involve the following three steps:

1. Error identification
2. Error retrieval
3. Error correction

This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.

Description
A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:

- Relational database error logging
- Email notification of workflow failures
- Session error thresholds
- The reporting capabilities of PowerCenter Data Analyzer
- Data profiling

These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart below:


Error Identification
The first step in the error handling process is error identification. Error identification is often achieved through the use of the ERROR() function within mappings, enablement of relational error logging in PowerCenter, and referential integrity constraints at the database. Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this manner include any database errors (e.g., referential integrity failures), transformation errors, and business rule exceptions for which the ERROR() function was called within the mapping.

Error Retrieval
The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository, it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a Data Analyzer report, or through an email alert that notifies a user when a certain threshold is crossed (such as the number of errors being greater than zero).

Error Correction
The final step in the error handling process is error correction. As PowerCenter automates the process of error identification, and Data Analyzer can be used to simplify error retrieval, error correction is straightforward. After retrieving an error through Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message, error data, and source row data) can be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports emailing a report directly through the web-based interface to make the process even easier. For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be set up to run on a pre-defined schedule.

After the developer or DBA identifies the condition that caused the error, a fix for the error can be implemented. The exact method of data correction depends on various factors such as the number of records with errors, data availability requirements per SLA, the level of data criticality to the business unit(s), and the type of error that occurred. Considerations made during error correction include:

- The owner of the data should always fix the data errors. For example, if the source data is coming from an external system, then the errors should be sent back to the source system to be fixed.
- In some situations, a simple re-execution of the session will reprocess the data.
- Whether partial data that has been loaded into the target systems needs to be backed out in order to avoid duplicate processing of rows.
- Errors can also be corrected through a manual SQL load of the data. If the volume of errors is low, the rejected data can be exported from the Data Analyzer error reports to Microsoft Excel or CSV format and corrected in a spreadsheet. The corrected data can then be manually inserted into the target table using a SQL statement.

Any approach to correct erroneous data should be precisely documented and followed as a standard.


If the data errors occur frequently, then the reprocessing process can be automated by designing a special mapping or session to correct the errors and load the corrected data into the ODS or staging area.

Data Profiling Option


For organizations that want to identify data irregularities post-load but do not want to reject such rows at load time, the PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that provides profile reporting such as orphan record identification, business rule violation, and data irregularity identification (such as NULL or default values). The Data Profiling option comes with a license to use Data Analyzer reports that source the data profile warehouse to deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports can be delivered to users through the same easy-to-use application.

Integrating Error Handling, Load Management, and Metadata


Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the load management process and the load metadata; it is the integration of all these approaches that ensures the system is sufficiently robust for successful operation and management. The flow chart below illustrates this in the end-to-end load process.



Error handling underpins the data integration system from end-to-end. Each of the load components performs validation checks, the results of which must be reported to the operational team. These components are not just PowerCenter processes such as business rule and field validation, but cover the entire data integration architecture, for example: Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity to source systems)? Source File Validation. Is the source file datestamp later than the previous load? File Check. Does the number of rows successfully loaded match the source rows read?

Last updated: 09-Feb-07 13:42


Error Handling Strategies - Data Warehousing

Challenge


A key requirement for any successful data warehouse or data integration project is that it attains credibility within the user community. At the same time, it is imperative that the warehouse be as up-to-date as possible since the more recent the information derived from it is, the more relevant it is to the business operations of the organization, thereby providing the best opportunity to gain an advantage over the competition.

Transactional systems can manage to function even with a certain amount of error since the impact of an individual transaction (in error) has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information. Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e., before the business tries to use the information in error).

The types of error to consider include:

- Source data structures
- Sources presented out of sequence
- Old sources represented in error
- Incomplete source files
- Data-type errors for individual fields
- Unrealistic values (e.g., impossible dates)
- Business rule breaches
- Missing mandatory data
- O/S errors
- RDBMS errors

These cover both high-level concerns (i.e., related to the process or a load as a whole) and low-level concerns (i.e., related to individual fields or columns).

Description
In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that every source element was populated correctly, with meaningful values, never missing a value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of resources, or have permissions and privileges change.

Realistically, however, the operational applications are rarely able to cope with every possible business scenario or combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite the way they were designed. The operational systems also typically need some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse expects.

Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by the business managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current version of the warehouse can be published). Preferably, error data should be identified during the load process and prevented from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no resources are wasted trying to load it.

As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a by-product, adopting this principle also helps to tie both the end-users and those responsible for the source data into the warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and end-users become owners of their data.

As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load management, data quality and key management, and operational processes and procedures. Load management processes record at a high level whether a load was unsuccessful; error management records the details of why the failure occurred. Quality management defines the criteria whereby data can be identified as in error; error management identifies the specific error(s), thereby allowing the source data to be corrected. Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors, perhaps indicating a failure in operational procedure.

Error management must therefore be tightly integrated within the data warehouse load process. This is shown in the high-level flow chart below:



Error Management Considerations

High-Level Issues


From the previous discussion of load management, a number of checks can be performed before any attempt is made to load a source data set. Without load management in place, it is unlikely that the warehouse process will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot (insofar as nearly all maintenance and development resources will be working full time to manually correct bad data in the warehouse). The following assumes that you have implemented load management processes similar to Informatica's best practices.

- Process Dependency checks in the load management can identify when a source data set is missing, duplicates a previous version, or has been presented out of sequence, and where the previous load failed but has not yet been corrected. Load management prevents this source data from being loaded. At the same time, error management processes should record the details of the failed load, noting the source instance, the load affected, and when and why the load was aborted.
- Source file structures can be compared to expected structures stored as metadata, either from header information or by attempting to read the first data row.
- Source table structures can be compared to expectations; typically this can be done by interrogating the RDBMS catalogue directly (and comparing to the expected structure held in metadata), or by simply running a describe command against the table (again comparing to a pre-stored version in metadata).
- Control file totals (for file sources) and row counts (for table sources) are also used to determine whether files have been corrupted or truncated during transfer, or whether tables contain no new data (suggesting a fault in an operational application).

In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and any other relevant process-level details.

Low-Level Issues
Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further error management processes need to be applied to the individual source rows and fields.

- Individual source fields can be compared to expected data types against standard metadata within the repository, or against additional information added by the development team. In some instances, this is enough to abort the rest of the load; if the field structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all or (more worryingly) is likely to be processed unpredictably.
- Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can be used to spot failed date conversions, conversions of strings to numbers, or missing required data. In rare cases, stored procedures can be called if a specific conversion fails; however, this cannot be generally recommended because of the potentially crushing impact on performance if a particularly error-filled load occurs.
- Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges, within PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included in the mapping itself). A more flexible approach is to use external tables to codify the business rules. In this way, only the rules tables need to be amended if a new business rule needs to be applied. Informatica has suggested methods to implement such a process.
- Missing key/unknown key issues have already been defined, with suggested management techniques for identifying and handling them, in the best practice document Key Management in Data Warehousing Solutions. However, from an error handling perspective, such errors must still be identified and recorded, even when key management techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular source data fails, it is difficult to realize when there is a systematic problem in the source systems.
- Inter-row errors may also have to be considered. These may occur when a business process expects a certain hierarchy of events (e.g., a customer query, followed by a booking request, followed by a confirmation, followed by a payment). If the events arrive from the source system in the wrong order, or where key events are missing, it may indicate a major problem with the source system, or with the way in which the source system is being used.

An important principle to follow is to try to identify all of the errors on a particular row before halting processing, rather than rejecting the row at the first instance. This seems to break the rule of not wasting resources trying to load a source data set if we already know it is in error; however, since the row needs to be corrected at source and then reprocessed subsequently, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the first error, re-running, and then identifying a second error (which halts the load for a second time).

OS and RDBMS Issues


Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference information is missing). However, there is little that can be done to identify the more generic RDBMS problems that are likely to occur: changes to schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table space extents available, missing partitions, and the like.

Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax, and authentication may occur outside of the data warehouse. Often such changes are driven by Systems Administrators who, from an operational perspective, are not aware that there is likely to be an impact on the data warehouse, or are not aware that the data warehouse managers need to be kept up to speed.

In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may be impossible to record the nature of the error at that point in time. For example, if RDBMS user ids are revoked, it may be impossible to write a row to an error table if the error process depends on the revoked id; if disk space runs out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host are amended, bad files themselves (or even the log files) may not be accessible.

Most of these types of issues can be managed by a proper load management process, however. Since setting the status of a load to complete should be absolutely the last step in a given process, any failure before, or including, that point leaves the load in an incomplete state. Subsequent runs should note this, and enforce correction of the last load before beginning the new one. The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational Administrators and DBAs have proper and working communication with the data warehouse management to allow proactive control of changes. Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if they occur.

Auto-Correction vs. Manual Correction


Load management and key management best practices (Key Management in Data Warehousing Solutions) have already defined auto-correcting processes; the former to allow loads themselves to launch, rollback, and reload without manual intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved, and incorrect key values are corrected as soon as the source system provides the missing data. We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of the warehouse to be reloaded entirely from scratch.


In addition, such automatic correction of data might hide the fact that one or other of the source systems had a generic fault, or more importantly, had acquired a fault because of on-going development of the transactional applications, or a failure in user training. The principle to apply here is to identify the errors in the load, and then alert the source system users that data should be corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and managed.

Error Management Techniques

Simple Error Handling Structure


The following data structure is an example of the error metadata that should be captured as a minimum within the error handling strategy.

The example defines three main sets of information:

- The ERROR_DEFINITION table stores descriptions for the various types of errors, including:
  - process-level errors (e.g., incorrect source file, load started out of sequence);
  - row-level errors (e.g., missing foreign key, incorrect data type, conversion errors); and
  - reconciliation errors (e.g., incorrect row numbers, incorrect file totals).
- The ERROR_HEADER table provides a high-level view of the process, allowing quick identification of the frequency of errors for particular loads and of the distribution of error types. It is linked to the load management processes via SRC_INST_ID and PROC_INST_ID, from which other process-level information can be gathered.
- The ERROR_DETAIL table stores information about the actual rows in error, including how to identify the specific row that was in error (using the source natural keys and row number), together with a string of field identifier/value pairs concatenated together. It is not expected that this information will be deconstructed as part of an automatic correction load, but if necessary it can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for subsequent reporting; a sketch of such a concatenation follows this list.
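For illustration, a hedged Java sketch (for example, inside a Java Transformation or another custom load component) of how the field identifier/value pairs for ERROR_DETAIL might be concatenated; the delimiter and method name are assumptions:

    // Hypothetical helper that builds the field/value-pair string stored in ERROR_DETAIL.
    // Pairs take the form FIELD=VALUE and are separated by a pipe so that they can later
    // be pivoted back into individual rows for reporting.
    private String buildErrorDetail(String[] fieldNames, String[] fieldValues) {
        StringBuffer pairs = new StringBuffer();
        for (int i = 0; i < fieldNames.length; i++) {
            if (i > 0) {
                pairs.append("|");
            }
            pairs.append(fieldNames[i]).append("=");
            pairs.append(fieldValues[i] == null ? "" : fieldValues[i]);
        }
        return pairs.toString();
    }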

Last updated: 01-Feb-07 18:53


Error Handling Strategies - General

Challenge


The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for handling data errors, and alternatives for addressing the most common types of problems. For the most part, these strategies are relevant whether your data integration project is loading an operational data structure (as with data migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing structure.

Description
Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals:
- The need for accurate information.
- The ability to analyze or process the most complete information available, with the understanding that errors can exist.

Data Integration Process Validation


In general, there are three methods for handling data errors detected in the loading process:
- Reject All. This is the simplest method to implement since all errors are rejected from entering the target when they are detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete. Both dimensional and factual data can be rejected when any errors are encountered, and reports indicate what the errors are and how they affect the completeness of the data. Dimensional or master data errors can cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows have been loaded, the factual data will be reprocessed and loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction, since the users need to take into account that the data they are looking at may not be a complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect downstream transactions. The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through existing mappings once it has been fixed. Minimal additional code may need to be written since the data will only enter the target if it is correct, and it would then be loaded into the data mart using the normal process.

- Reject None. This approach gives users a complete picture of the available data without having to consider data that was not available due to it being rejected during the load process. The problem is that the data may not be complete or accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty transactions. With Reject None, the complete set of data is loaded, but the data may not support correct transactions or aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct, but incorrect detail numbers. After the data is fixed, reports may change, with detail information being redistributed along different hierarchies. The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct all of the target data structures, which can be a time-consuming effort depending on the delay between an error being detected and fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load, and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or data marts.

- Reject Critical. This method provides a balance between missing information and incorrect information. It involves examining each row of data and determining the particular data elements to be rejected. All changes that are valid are processed into the target to allow for the most complete picture, while rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process. This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates. Key elements are required fields that maintain the data integrity of the target and allow hierarchies to be summarized at various levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can usually be loaded with the existing dimensional data unless the update is to a key element. The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data architecture.

Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to understand that some information may be held out of the target, and also that some of the information in the target data structures may be at least temporarily allocated to the wrong hierarchies.

Handling Errors in Dimension Profiles


Profiles are tables used to track history changes to the source data. As the source systems change, profile records are created with date stamps that indicate when the change took place. This allows power users to review the target data using either current (As-Is) or past (As-Was) views of the data. A profile record should occur for each change in the source data.

Problems occur when two fields change in the source system and one of those fields results in an error. The first value passes validation, which produces a new profile record, while the second value is rejected and is not included in the new profile. When this error is fixed, it would be desirable to update the existing profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is complicated. If a third field is changed in the source before the error is fixed, the correction process is complicated further.

The following example represents three field values in a source system. The first row on 1/1/2000 shows the original values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date         Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000     Closed Sunday    Black            Open 9-5
1/5/2000     Open Sunday      BRed             Open 9-5
1/10/2000    Open Sunday      BRed             Open 24hrs
1/15/2000    Open Sunday      Red              Open 24hrs

Three methods exist for handling the creation and update of profiles:

1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid, then the original field value is maintained.

Date         Profile Date    Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000     1/1/2000        Closed Sunday    Black            Open 9-5
1/5/2000     1/5/2000        Open Sunday      Black            Open 9-5
1/10/2000    1/10/2000       Open Sunday      Black            Open 24hrs
1/15/2000    1/15/2000       Open Sunday      Red              Open 24hrs

By applying all corrections as new profiles, this method simplifies the process: every change in the source system is applied directly to the target. Each change, regardless of whether it is a fix to a previous error, is applied as a new change that creates a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should be reflected in the first profile; the second profile should not have been created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3. If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field changes before the second field is fixed, we show the third field changed at the same time as the first. When the second field is fixed, it would also be added to the existing profile, which incorrectly reflects the changes in the source system.

3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the Field 2 value in both.

Date         Profile Date          Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000     1/1/2000              Closed Sunday    Black            Open 9-5
1/5/2000     1/5/2000              Open Sunday      Black            Open 9-5
1/10/2000    1/10/2000             Open Sunday      Black            Open 24hrs
1/15/2000    1/5/2000 (Update)     Open Sunday      Red              Open 9-5
1/15/2000    1/10/2000 (Update)    Open Sunday      Red              Open 24hrs

If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex algorithms that handle the process correctly. It involves being able to determine when an error occurred and examining all profiles generated since then and updating them appropriately. And, even if we create the algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error, causing an automated process to update old profile records, when in reality a new profile record should have been entered.

Recommended Method
A method exists to track old errors so that we know when a value was rejected. When the process encounters a new, correct value, the load strategy flags it as a potential fix that should be applied to old Profile records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis of the data until the correction method is determined, because the current information is reflected in the new Profile.

Data Quality Edits


Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target. The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators can be used to:
- Show the record- and field-level quality associated with a given record at the time of extract.
- Identify data sources and errors encountered in specific records.
- Support the resolution of specific record error types via an update and resubmission process.

Quality indicators can be used to record several types of errors, e.g., fatal errors (missing primary key value), missing data in a required field, wrong data type/format, or invalid data values. If a record contains even one error, data quality (DQ) fields are appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors were encountered.

Records containing a fatal error are stored in a Rejected Record Table and associated with the original file name and record number. These records cannot be loaded to the target because they lack a primary key field to be used as a unique record identifier in the target. The following types of errors cannot be processed:


- A source record does not contain a valid key. This record is sent to a reject queue. Metadata is saved and used to generate a notice to the sending system indicating that x number of invalid records were received and could not be processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has been replaced or not.
- The source file or record is illegible. The file or record is sent to a reject queue. Metadata indicating that x number of invalid records were received and could not be processed may or may not be available for a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual unique records within the file are not identifiable. While information can be provided to the source system site indicating there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.

In these error types, the records can be processed, but they contain errors:
- A required (non-key) field is missing.
- The value in a numeric or date field is non-numeric.
- The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is recorded.

Quality Indicators (Quality Code Table)


The requirement to validate virtually every data element received from the source data systems mandates the development, implementation, capture, and maintenance of quality indicators. These are used to indicate the quality of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data quality problems, systemic issues, business process problems, and information technology breakdowns. The quality indicators are:

- 0 - No Error
- 1 - Fatal Error
- 2 - Missing Data from a Required Field
- 3 - Wrong Data Type/Format
- 4 - Invalid Data Value
- 5 - Outdated Reference Table in Use

These codes provide a concise indication of the quality of the data within specific fields for every data type. They give operations staff, data quality analysts, and users the opportunity to readily identify issues potentially impacting the quality of the data, while providing the level of detail necessary for acute quality problems to be remedied in a timely manner.
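To illustrate, the quality codes and the per-field DQ indicators could be stored as follows. This is a minimal sketch using hypothetical table and column names; the DQ columns simply mirror the fields of the source record.

CREATE TABLE DQ_QUALITY_CODE (
    DQ_CODE        NUMBER(1)    NOT NULL,   -- 0 through 5, as listed above
    DQ_DESCRIPTION VARCHAR2(50) NOT NULL    -- e.g., 'Missing Data from a Required Field'
);

CREATE TABLE CUSTOMER_DQ (
    CUSTOMER_ID      NUMBER     NOT NULL,   -- primary key of the source record
    DQ_CUSTOMER_ID   NUMBER(1),             -- one DQ indicator per source field
    DQ_CUSTOMER_NAME NUMBER(1),
    DQ_CUSTOMER_DOB  NUMBER(1),
    EXTRACT_DATE     DATE       NOT NULL
);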

Handling Data Errors


The need to periodically correct data in the target is inevitable. But how often should these corrections be performed? The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.

Reject Tables vs. Source System


As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the target will not be synchronized with the source systems. This can present credibility problems when trying to track the history of changes in the target data architecture. If all fixes occur in the source systems, then these fixes must be applied correctly to the target data.

Attribute Errors and Default Values


Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected and reapplied to the data in the target. When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter the target. Some rules that have been proposed for handling defaults are as follows:

Value Types | Description | Default
Reference Values | Attributes that are foreign keys to other tables | Unknown
Small Value Sets | Y/N indicator fields | No
Other | Any other type of attribute | Null or business-provided value

Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the Unknown value. (All reference tables contain a value of Unknown for this purpose.) The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators), are referred to as small-value sets. When errors are encountered in translating these values, we use the value that represents off or No as the default. Other values, like numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate Null into these fields, which means undefined in the target. After a source system value is corrected and passes validation, it is corrected in the target.
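For example, a reference lookup that falls back to the Unknown default might be expressed as the following Oracle-style SQL (STORE_STG and REF_STORE_TYPE are hypothetical table names; in PowerCenter the same rule is typically implemented with a Lookup transformation and a default value on the return port):

SELECT s.STORE_ID,
       NVL(r.CODE_TRANSLATION, 'UNKNOWN') AS STORE_TYPE_CODE   -- default when no reference match
FROM   STORE_STG s
LEFT OUTER JOIN REF_STORE_TYPE r
       ON r.SOURCE_VALUE = s.STORE_TYPE;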

Primary Key Errors


The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new key is actually an update to an old key in the source system. For example, a location number is assigned and the new location is transferred to the target using the normal process; then the location number is changed due to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture. Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two rows involves combining the profile information, taking care to coordinate the effective dates of the profiles to sequence properly. If two profile records exist for the same day, then a manual decision is required as to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct the data.


The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the restore to correct the errors.

DM Facts Calculated from EDW Dimensions


If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement. If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.

Fact Errors
If the only business rules that reject fact records are relationship errors to dimensional data, then when we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.

Data Stewards
Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.

Reference Tables
The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the target structures. The translation tables contain one or more rows for each source value and map the value to a matching row in the reference table. For example, the SOURCE column in FILE X on System X can contain O, S or W. The data steward would be responsible for entering in the translation table the following values:

Source Value | Code Translation
O | OFFICE
S | STORE
W | WAREHSE


These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the following entries into the translation table to maintain consistency across systems:

Source Value | Code Translation
OF | OFFICE
ST | STORE
WH | WAREHSE

The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL process uses the reference table to populate the following values into the target:

Code Translation | Code Description
OFFICE | Office
STORE | Retail Store
WAREHSE | Distribution Warehouse

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the entire target data architecture.

Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products, at a minimum. Dimensional data uses the same concept of translation as reference tables. These translation tables map the source system value to the target value. For location, this is straightforward, but over time, products may have multiple source system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves as a good example for error handling.)

There are two possible methods for loading new dimensional entities: either require the data steward to enter the translation data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities, while the second lets the ETL process create the translation, but marks the record as Pending Verification until the data steward reviews it and changes the status to Verified before any facts that reference it can be loaded.

When the dimensional value is left as Pending Verification, however, facts may be rejected or allocated to dummy values. This requires the data stewards to review the status of new values on a daily basis. A potential solution to this issue is to generate an email each night if there are any translation table entries pending verification; the data steward then opens a report that lists them.
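Such a nightly check can be as simple as the following query (PRODUCT_TRANSLATION and its STATUS column are hypothetical names; the notification itself would be sent by the scheduler or a PowerCenter email task):

SELECT SOURCE_VALUE,
       TARGET_PRODUCT_ID,
       CREATED_DATE
FROM   PRODUCT_TRANSLATION
WHERE  STATUS = 'PENDING VERIFICATION'
ORDER BY CREATED_DATE;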

A problem specific to Product occurs when a product created as new is really just a changed SKU number for an existing product. This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to be merged, requiring manual intervention.

The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but really represent two different products). In this case, it is necessary to restore the source information for all loads since the error was introduced. Affected records should be deleted from the target and then reloaded from the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split to generate correct profile information.

Manual Updates
Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.

Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain Warehouse and Store information while another contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the correct information. When this happens, both sources have the ability to update the same row in the target.

If both sources are allowed to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a new profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the new profile, assumes a change occurred and creates another new profile with the old, unchanged value. If the two systems remain different, the process causes two profiles to be loaded every day until the two source systems are synchronized with the same information.

To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this sounds simple, it requires complex logic when creating profiles, because multiple sources can provide information toward the one profile record created for that day.

One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can use the field-level information to update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the correct source fields as primary and by the data integration team to customize the load process.

Last updated: 05-Jun-08 12:48


Error Handling Techniques - PowerCenter Mappings

Challenge


Identifying and capturing data errors using a mapping approach, and making such errors available for further processing or correction.

Description
Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the production environment, data must be checked and validated prior to entry into the target system. One strategy for catching data errors is to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and unexpected transformation or database constraint errors.

Data Validation Errors


The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements. Consider the following questions:
- What types of data errors are likely to be encountered?
- Of these errors, which ones should be captured?
- What process can capture the possible errors?
- Should errors be captured before they have a chance to be written to the target database?
- Will any of these errors need to be reloaded or corrected?
- How will the users know if errors are encountered?
- How will the errors be stored?
- Should descriptions be assigned for individual errors?
- Can a table be designed to store captured errors and the error descriptions?

Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g., executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the target table; constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to the session log and the reject/bad file, thus improving performance. Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to them. This approach can be effective for many types of data content error, including: date conversion, null values intended for not null target fields, and incorrect data formats or data types.

Sample Mapping Approach for Data Validation Errors


In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written to not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated from the data flow and logged in an error table. One solution is to implement a mapping similar to the one shown below:

An expression transformation can be employed to validate the source data, applying rules and flagging records with one or more errors. A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would refer to the mapping name and the ROW_ID would be created by a sequence generator. The composite key is designed to allow developers to trace rows written to the error tables that store information useful for error reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.


The table ERR_DESC_TBL is designed to hold information about the error, such as the mapping name, the ROW_ID, and the error description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of reference for reporting. The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns: ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the entire row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to reprocess them.

The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended for a not null target field could generate an error message such as NAME is NULL or DOB is NULL. This step can be done in an expression transformation (e.g., EXP_VALIDATION in the sample mapping). After the field descriptions are assigned, the error row can be split into several rows, one for each possible error, using a normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only errors that are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would be generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL. The following table shows how the error data produced may look.

Table Name: CUSTOMER_ERR

NAME | DOB | ADDRESS | ROW_ID | MAPPING_ID
NULL | NULL | NULL | 1 | DIM_LOAD

Table Name: ERR_DESC_TBL

FOLDER_NAME | MAPPING_ID | ROW_ID | ERROR_DESC | LOAD_DATE | SOURCE | TARGET
CUST | DIM_LOAD | 1 | Name is NULL | 10/11/2006 | CUSTOMER_FF | CUSTOMER
CUST | DIM_LOAD | 1 | DOB is NULL | 10/11/2006 | CUSTOMER_FF | CUSTOMER
CUST | DIM_LOAD | 1 | Address is NULL | 10/11/2006 | CUSTOMER_FF | CUSTOMER
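A minimal sketch of the two error tables follows; the column names mirror the example above, while the datatypes and lengths are assumptions to be adjusted for the actual target model.

CREATE TABLE CUSTOMER_ERR (
    -- all columns of the target CUSTOMER table, for example:
    NAME        VARCHAR2(30),
    DOB         DATE,
    ADDRESS     VARCHAR2(100),
    -- plus the two additional columns that link to ERR_DESC_TBL:
    ROW_ID      NUMBER        NOT NULL,   -- generated by the sequence generator
    MAPPING_ID  VARCHAR2(50)  NOT NULL    -- name of the mapping that rejected the row
);

CREATE TABLE ERR_DESC_TBL (
    FOLDER_NAME VARCHAR2(50),
    MAPPING_ID  VARCHAR2(50)  NOT NULL,
    ROW_ID      NUMBER        NOT NULL,
    ERROR_DESC  VARCHAR2(200),            -- e.g., 'NAME is NULL'
    LOAD_DATE   DATE,
    SOURCE_NAME VARCHAR2(50),             -- shown as SOURCE in the example above
    TARGET_NAME VARCHAR2(50)              -- shown as TARGET in the example above
);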

The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of data validation errors. Data validation error handling can be extended by including mapping logic to grade error severity. For example, flagging data validation errors as soft or hard.
- A hard error can be defined as one that would fail when being written to the database, such as a constraint error.
- A soft error can be defined as a data content error.

A record flagged as hard can be filtered from the target and written to the error tables, while a record flagged as soft can be written to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for end-user reporting. Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems. The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings that contain errors. The most important aspect of the mapping approach however, is its flexibility. Once an error type is identified, the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors, the operations team can effectively communicate data quality issues to the business users.

Constraint and Transformation Errors


Perfect data can never be guaranteed. In implementing the mapping approach described above to detect errors and log them to an error table, how can we handle unexpected errors that arise in the load? For example, PowerCenter may apply the validated data to the database; however, the relational database management system (RDBMS) may reject it for some unexpected reason, such as a constraint violation. Ideally, we would like to detect these database-level errors automatically and send them to the same error table used to store the soft errors caught by the mapping approach described above.

In some cases, the stop on errors session property can be set to 1 to stop source data for which unhandled errors were encountered from being loaded. In this case, the process stops with a failure, the data must be corrected, and the entire source may need to be reloaded or recovered. This is not always an acceptable approach. An alternative is to have the load process continue when records are rejected, and then reprocess only the records that were found to be in error. This can be achieved by configuring the stop on errors property to 0 and switching on relational error logging for a session.

By default, the error messages from the RDBMS and any uncaught transformation errors are sent to the session log. Switching on relational error logging redirects these messages to a selected database in which four tables are automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS and PMERR_SESS. The PowerCenter Workflow Administration Guide contains detailed information on the structure of these tables. The PMERR_MSG table stores the error messages encountered in a session, and the following four columns of this table allow us to retrieve any RDBMS errors:

- SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange (MX) view REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.


- TRANS_NAME: Name of the transformation where the error occurred. When an RDBMS error occurs, this is the name of the target transformation.
- TRANS_ROW_ID: Specifies the row ID generated by the last active source. This field contains the row number at the target when the error occurred.
- ERROR_MSG: The error message generated by the RDBMS.

With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load session (i.e., an additional PowerCenter session) can be implemented to read the PMERR_MSG table, join it with the MX view REP_LOAD_SESSIONS in the repository, and insert the error details into ERR_DESC_TBL. When the post-load process ends, ERR_DESC_TBL contains both soft errors and hard errors.

One problem with capturing RDBMS errors in this way is mapping them to the relevant source key to provide lineage. This can be difficult when the source and target rows are not directly related (i.e., one source row can result in zero or more rows at the target). In this case, the mapping that loads the source must write translation data to a staging table (including the source key and target row number). The translation table can then be used by the post-load session to identify the source key by the target row number retrieved from the error log. The source key stored in the translation table could be a row number in the case of a flat file, or a primary key in the case of a relational data source.
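A sketch of the query behind such a post-load session is shown below. The PMERR_MSG columns are those described above; the REP_LOAD_SESSIONS columns and the join condition are assumptions that should be verified against the MX view definitions for your PowerCenter version.

SELECT rls.MAPPING_NAME,
       err.TRANS_NAME,
       err.TRANS_ROW_ID,
       err.ERROR_MSG
FROM   PMERR_MSG         err
JOIN   REP_LOAD_SESSIONS rls
       ON rls.SESSION_ID = err.SESS_INST_ID;   -- assumed join column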

Reprocessing
After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed by members of the business or operational teams. The rows listed in this table have not been loaded into the target database. The operations team can, therefore, fix the data in the source that resulted in soft errors and may be able to explain and remediate the hard errors. Once the errors have been fixed, the source data can be reloaded.

Ideally, only the rows resulting in errors during the first run should be reprocessed in the reload. This can be achieved by including a filter and a lookup in the original load mapping and using a parameter to configure the mapping for an initial load or for a reprocess load. If the mapping is reprocessing, the lookup searches for each source row number in the error table, while the filter removes source rows for which the lookup has not found errors. If initial loading, all rows are passed through the filter, validated, and loaded. With this approach, the same mapping can be used for initial and reprocess loads.

During a reprocess run, the records successfully loaded should be deleted (or marked for deletion) from the error table, while any new errors encountered should be inserted as if it were an initial run. On completion, the post-load process is executed to capture any new RDBMS errors. This ensures that reprocessing loads are repeatable and result in a reducing number of records in the error table over time.
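Expressed as SQL, the reprocess-mode lookup and filter amount to selecting only the source rows recorded in the error table. CUSTOMER_STG is a hypothetical staging table, and in the mapping this logic is a Lookup plus a Filter controlled by a mapping parameter (for example $$LOAD_TYPE, an illustrative name).

SELECT stg.*
FROM   CUSTOMER_STG stg
WHERE  EXISTS (
         SELECT 1
         FROM   ERR_DESC_TBL err
         WHERE  err.ROW_ID     = stg.ROW_ID        -- source row number captured on the first run
           AND  err.MAPPING_ID = 'DIM_LOAD'
       );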

Last updated: 01-Feb-07 18:53


Error Handling Techniques - PowerCenter Workflows and Data Analyzer

Challenge


Implementing an efficient strategy to identify different types of errors in the ETL process, correct the errors, and reprocess the corrected data.

Description
Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the standards of acceptable data quality; and process errors, which are driven by the stability of the process itself. The first step in implementing an error handling strategy is to understand and define the error handling requirement. Consider the following questions:
- What tools and methods can help in detecting all the possible errors?
- What tools and methods can help in correcting the errors?
- What is the best way to reconcile data across multiple systems?
- Where and how will the errors be stored (i.e., relational tables or flat files)?

A robust error handling strategy can be implemented using PowerCenter's built-in error handling capabilities along with Data Analyzer as follows:

- Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process failures.
- Data Errors: Set up the ETL process to:
  - Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables for analysis, correction, and reprocessing.
  - Set up Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.
  - Set up customized Data Analyzer reports and dashboards at the project level to provide information on failed sessions, sessions with failed rows, load time, etc.

Configuring an Email Task to Handle Process Failures


Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the event of a session failure. Create a reusable email task and use it in the On Failure Email property settings in the Components tab of the session, as shown in the following figure.


When you configure the subject and body of a post-session email, use email variables to include information about the session run, such as session name, mapping name, status, total number of records loaded, and total number of records rejected. The following table lists all the available email variables:
Email Variables for Post-Session Email

Email Variable | Description
%s | Session name.
%e | Session status.
%b | Session start time.
%c | Session completion time.
%i | Session elapsed time (session completion time - session start time).
%l | Total rows loaded.
%r | Total rows rejected.
%t | Source and target table details, including read throughput in bytes per second and write throughput in rows per second. The PowerCenter Server includes all information displayed in the session detail dialog box.
%m | Name of the mapping used in the session.
%n | Name of the folder containing the session.
%d | Name of the repository containing the session.
%g | Attach the session log to the message.
%a<filename> | Attach the named file. The file must be local to the PowerCenter Server. The following are valid file names: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. Note: The file name cannot include the greater than character (>) or a line break.
Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include these variables in the email message only.

Configuring Row Error Logging in PowerCenter


PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these tables to capture data errors greatly reduces the time and effort required to implement an error handling strategy when compared with a custom error handling solution. When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the PowerCenter Server logs error information that allows you to determine the cause and source of the error. The PowerCenter Server logs information such as source name, row ID, current row data, transformation, timestamp, error code, error message, repository name, folder name, session name, and mapping information. This error metadata is logged for all row-level errors, including database errors, transformation errors, and errors raised through the ERROR() function, such as business rule violations.

Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily. When you enable error logging and choose the Relational Database Error Log Type, the PowerCenter Server offers you the following features:
- Generates the following tables to help you track row errors:
  - PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source row.
  - PMERR_MSG. Stores metadata about an error and the error message.
  - PMERR_SESS. Stores metadata about the session.
  - PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs.
- Appends error data to the same tables cumulatively, if they already exist, for further runs of the session.
- Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors to go to one set of error tables, you can specify the prefix as EDW_.
- Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do this, specify the same error log table name prefix for all sessions.
Example:
In the following figure, the session s_m_Load_Customer loads Customer data into the EDW Customer table. The Customer table in EDW has the following structure:

CUSTOMER_ID     NOT NULL NUMBER (PRIMARY KEY)
CUSTOMER_NAME   NULL     VARCHAR2(30)
CUSTOMER_STATUS NULL     VARCHAR2(10)

There is a primary key constraint on the column CUSTOMER_ID. To take advantage of PowerCenter's built-in error handling features, you would set the session properties as shown below:

The session property Error Log Type is set to Relational Database, and Error Log DB Connection and Table name Prefix values are given accordingly. When the PowerCenter server detects any rejected rows because of Primary Key Constraint violation, it writes information into the Error Tables as shown below:

EDW_PMERR_DATA
WORKFLOW_RUN_ID | WORKLET_RUN_ID | SESS_INST_ID | TRANS_NAME | TRANS_ROW_ID | TRANS_ROW_DATA | SOURCE_ROW_ID | SOURCE_ROW_TYPE | SOURCE_ROW_DATA | LINE_NO
 |  | 3 | Customer_Table | 1 | D:1001:000000000000|D:Elvis Pres|D:Valid | -1 | -1 | N/A | 1
 |  |   | Customer_Table | 2 | D:1002:000000000000|D:James Bond|D:Valid | -1 | -1 | N/A |
 |  |   | Customer_Table | 3 | D:1003:000000000000|D:Michael Ja|D:Valid | -1 | -1 | N/A |

EDW_PMERR_MSG
WORKFLOW_RUN_ID | SESS_INST_ID | SESS_START_TIME | REPOSITORY_NAME | FOLDER_NAME | WORKFLOW_NAME | TASK_INST_PATH | MAPPING_NAME | LINE_NO
6 | 3 | 9/15/2004 18:31 | pc711 | Folder1 | wf_test1 | s_m_test1 | m_test1 | 1
  |   | 9/15/2004 18:33 | pc711 | Folder1 | wf_test1 | s_m_test1 | m_test1 |
  |   | 9/15/2004 18:34 | pc711 | Folder1 | wf_test1 | s_m_test1 | m_test1 |

EDW_PMERR_SESS
WORKFLOW_RUN_ID | SESS_INST_ID | SESS_START_TIME | REPOSITORY_NAME | FOLDER_NAME | WORKFLOW_NAME | TASK_INST_PATH | MAPPING_NAME | LINE_NO
6 | 3 | 9/15/2004 18:31 | pc711 | Folder1 | wf_test1 | s_m_test1 | m_test1 | 1
  |   | 9/15/2004 18:33 | pc711 | Folder1 | wf_test1 | s_m_test1 | m_test1 |
  |   | 9/15/2004 18:34 | pc711 | Folder1 | wf_test1 | s_m_test1 | m_test1 |

EDW_PMERR_TRANS
WORKFLOW_RUN_ID | SESS_INST_ID | TRANS_NAME | TRANS_GROUP | TRANS_ATTR | LINE_NO
 |  | Customer_Table | Input | Customer_Id:3, Customer_Name:12, Customer_Status:12 | 1

By looking at the workflow run id and other fields, you can analyze the errors and reprocess them after fixing the errors.
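For example, the rejected rows for a run can be reviewed with a query such as the following (the column names follow the example tables above; verify them against the error table definitions for your PowerCenter version):

SELECT s.FOLDER_NAME,
       s.WORKFLOW_NAME,
       s.MAPPING_NAME,
       d.TRANS_NAME,
       d.TRANS_ROW_ID,
       d.TRANS_ROW_DATA                      -- the rejected row, pipe-delimited
FROM   EDW_PMERR_DATA d
JOIN   EDW_PMERR_SESS s
       ON  s.WORKFLOW_RUN_ID = d.WORKFLOW_RUN_ID
       AND s.SESS_INST_ID    = d.SESS_INST_ID
ORDER BY d.TRANS_ROW_ID;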

Error Detection and Notification using Data Analyzer


Informatica provides Data Analyzer for PowerCenter Repository Reports with every PowerCenter license. Data Analyzer is Informatica's powerful business intelligence tool that is used to provide insight into the PowerCenter repository metadata.

You can use the Operations Dashboard provided with the repository reports as one central location to gain insight into production environment ETL activities. In addition, the following capabilities of Data Analyzer are recommended best practices:
- Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever an entry is made into the error tables PMERR_DATA or PMERR_TRANS.
- Configure reports and dashboards to provide detailed session run information grouped by projects/PowerCenter folders for easy analysis.
- Configure reports to provide detailed information on the row-level errors for each session. This can be accomplished by using the four error tables as sources of data for the reports.

Data Reconciliation Using Data Analyzer


Business users often like to see certain metrics matching from one system to another (e.g., source system to ODS, ODS to targets, etc.) to ascertain that the data has been processed accurately. This is frequently accomplished by writing tedious queries, comparing two separately produced reports, or using constructs such as DBLinks. Upgrading the Data Analyzer license from Repository Reports to a full license enables Data Analyzer to source your company's data (e.g., source systems, staging areas, ODS, data warehouse, and data marts) and provides a reliable and reusable way to accomplish data reconciliation.

Using Data Analyzer's reporting capabilities, you can select data from various data sources such as ODS, data marts, and data warehouses to compare key reconciliation metrics and numbers through aggregate reports. You can further schedule the reports to run automatically every time the relevant PowerCenter sessions complete, and set up alerts to notify the appropriate business or technical users in case of any discrepancies.

For example, a report can be created to ensure that the same number of customers exists in the ODS in comparison to a data warehouse and/or any downstream data marts. The reconciliation reports should be relevant to a business user by comparing key metrics (e.g., customer counts, aggregated financial metrics, etc.) across data silos. Such reconciliation reports can be run automatically after PowerCenter loads the data, or they can be run by technical or business users on demand. This process allows users to verify the accuracy of data and builds confidence in the data warehouse solution.
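A simple reconciliation metric of this kind might be based on a query such as the following (ODS_CUSTOMER and DW_CUSTOMER_DIM are hypothetical table names; in Data Analyzer the two counts would normally be defined as metrics against the respective data sources and compared on one report):

SELECT 'ODS'            AS SYSTEM_NAME, COUNT(*) AS CUSTOMER_COUNT FROM ODS_CUSTOMER
UNION ALL
SELECT 'DATA WAREHOUSE' AS SYSTEM_NAME, COUNT(*) AS CUSTOMER_COUNT FROM DW_CUSTOMER_DIM;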

Last updated: 09-Feb-07 14:22


Business Case Development

Challenge


Establishing an Integration Competency Center (ICC) or shared infrastructure takes money, resources and management attention. While most enterprises have requirements and defined standards for business case documents to justify projects that involve a financial expenditure or significant organizational change, integration initiatives present unique challenges that are associated with multi-functional, cross-organizational initiatives. Most IT personnel are quite capable of articulating the technical case for the degree of centralization that the ICC, standard technology or shared integration system represents, but proving the business case is likely to be more challenging, as the technical case alone is unlikely to secure the required funding. It is important to identify the departments and individuals that are likely to benefit directly and indirectly from the implementation of the ICC. This Best Practice provides a systematic approach to researching, documenting and presenting the business justification for these sorts of complex integration initiatives.

Description
The process to establish a business case for an ICC or shared infrastructure such as an Enterprise Data Warehouse, ETL hub, Enterprise Service Bus, or Data Governance program (to name just a few) is fundamentally an exercise in analysis and persuasion. This process is demonstrated graphically in the figure below:

The following sections describe each step in this process.

Step 1: Clarify Business Need


Data integration investments should be part of a business strategy that must, in turn, be part of the overall corporate strategy. Mismatched IT investments will only move the organization in the wrong direction more quickly. Consequently, an investment in data integration should (depending on the enterprise requirements) be based on a requirement to:
- Improve data quality
- Reduce future integration costs
- Reduce system architecture complexity
- Increase implementation speed for new systems
- Reduce corporate costs
- Support business priorities

The first step in the business case process therefore is to state the integration problem in such a way as to clearly define the circumstances leading to the consideration of the investment. This step is important because it identifies both the questions to be resolved by the analysis and the boundaries of the investigation. The problem statement identifies the need to be satisfied, the problem to be solved or the opportunity to be exploited. The problem statement should address:
- The corporate and program goals and other objectives affected by the proposed investment
- A description of the problem, need, or opportunity
- A general indication of the range of possible actions

Although the immediate concern may be to fulfill the needs of a specific integration opportunity, you must, nevertheless, consider the overall corporate goals. A business solution that does not take into account corporate priorities and business strategies may never deliver its expected benefits due to unanticipated changes within the organization or its processes. There is a significant danger associated with unverified assumptions that can derail business case development at the outset of the process. It is imperative to be precise about the business need that the ICC is designed to address; abandon preconceptions and get to the core of the requirements. Do not assume that the perceived benefits of a centralized service such as an ICC are so obvious that they do not need to be specifically stated. In summary, key activities in Step 1 include:
- Define and agree on the problem, opportunity or goals that will guide the development of the business case.
- Use brainstorming techniques to envision how you would describe the business need in a compelling way.
- Start with the end in mind based on what you know.
- Prepare a plan to gather the data and facts needed to justify the vision.
- Understand the organization's financial accounting methods and business case development standards.
- Understand the enterprise funding approval governance processes.

Note: It is important to review the core assumptions as the business case evolves and new information becomes available.


TIP
Management by Fact (MBF) is a tool used in a number of methodologies including Six-Sigma, CMM, and QPM. It is a concise summary of a quantified problem statement, performance history, prioritized root causes, and corresponding countermeasures for the purpose of data-driven problem analysis and management. MBF:

- Uses the facts
- Eliminates bias
- Tightly couples resources and effort to problem-solving
- Clarifies the problem
  - Use 4 Whats to help quantify the problem statement
  - Quantify the gap between actual and desired performance
- Determines root cause
  - Separate beliefs from facts
  - Use 5 Whys: by the time you have answered the 5th why, you should understand the root cause

The key output from Step 1 is a notional problem statement (with placeholders or guesses for key measures) and several graphical sketches representing a clearly defined problem/needs statement that uses business terms to describe the problem. The following figure shows the basic structure of a problem statement with supporting facts and a proposed resolution that can be summarized on a one-page PowerPoint slide. While the basic structure should be defined in Step 1, the supporting details and action plans will emerge from the subsequent analysis steps.

Step 2: Identify Options and Define Approach


The way in which you describe solutions or opportunities is likely to shape the analysis that follows. Do not focus on specific technologies, products or methods, as this may exclude other options that might produce the same benefits, but at a lower cost or increased benefits for the same cost. Instead, try to identify all of the possible ways in which the organization can meet the business objectives described in the problem statement. In this way, the options that are developed and analyzed will have a clear relationship to the organization's needs. Unless this relationship is clear, you may be accused of investing in technology for technology's sake.

Available options must include the base case, as well as a range of other potential solutions. The base case should show how an organization would perform if it did not pursue the data integration investment proposal or otherwise change its method of operation. It is important to highlight any alternative solutions to the integration investment. A description of what is meant by doing nothing is required here. It is not adequate to state the base case simply as the continuation of the current situation. It must account for future developments over a period long enough to serve as a basis of comparison for a new system. For example, an organization that keeps an aging integration technique may face increasing maintenance costs as the systems get older and the integrations more complex. There may be more frequent system failures and changes causing longer periods of down time. Maintenance costs may become prohibitive, service delays intolerable or workloads unmanageable. Alternatively, demand for a business unit's services may ultimately decrease, permitting a reduction of costs without the need for an integration investment. Be sure to examine all the options in both the short and long term.

Short Term: The document should highlight the immediate effect of doing nothing. For example, the competition may have already implemented systems such as a Customer Data Integration hub or a Data Quality program and are able to offer more competitive services. Thus, the enterprise may already be losing market share because of its inability to change and react to market conditions. If there is significant market share loss, it should be presented so as to emphasize the need for something to be done.

Long Term: The base case should predict the long-term costs and benefits of maintaining the current method of operation, taking into account the known external pressures for change, such as predicted changes in demand for service, budgets, and staffing or business direction.

Problems can be solved in different ways and to different extents. In some cases, options are available that concentrate on making optimum use of existing systems or on altering current procedures. These options may require little or no new investment and should be considered.

A full-scale analysis of all options is neither achievable nor necessary. A screening process is the best way to ensure that the analysis proceeds with only the most promising options. Screening allows a wide range of initial options to be considered, while keeping the level of effort reasonable. Establishing a process for screening options has the added advantage of setting out in an evaluation framework the reasons for selecting, as well as rejecting, particular options. Options should be ruled out as soon as it becomes clear that other choices are superior from a cost-benefit perspective.
A comparative cost-benefit framework should quickly identify the key features likely to make a difference among options. Grouping options with similar key features can help identify differences associated with cost disadvantages or benefit advantages that would persist even if subjected to more rigorous analysis. Options may be ruled out on the basis that their success depends too heavily on unproven technology or that they just will not work. Take care not to confuse options that will not work with options that are merely less desirable. Options that are simply undesirable will drop out when you begin to measure the costs and benefits. The objective is to subject options to an increasingly rigorous analysis. A good rule of thumb is that, when in doubt about the economic merits of a particular option, the analyst should retain it for subsequent, more detailed rounds of estimation.

To secure funds in support of ICC infrastructure investments, a number of broad-based strategies and detailed methods can be used. Below are five primary strategies that address many of the funding challenges:

1. Recurring quick wins. This strategy involves making a series of small incremental investments as separate projects, each of which provides demonstrable evidence of progress. This strategy works best when the work can be segmented.

2. React to a crisis. While it may not be possible to predict when a crisis will occur, experienced integration staff are often able to see a pattern and anticipate in what areas a crisis is likely to emerge. By way of analogy, it may not be easy to predict when the next earthquake will occur, but we can be quite accurate about predicting where it is likely to occur based on past patterns. The advantage that can be leveraged in a crisis situation is that senior management attention is clearly focused on solving the problem. A challenge, however, is that there is often a tendency to solve the problem quickly, which may not allow sufficient time for a business case that addresses the underlying structural issues and root causes of the problem. This strategy therefore requires that the ICC team perform some advance work, be prepared with a rough investment proposal for addressing structural issues, and be ready to present it quickly when the opportunity presents itself.

3. Executive vision. This strategy relies on ownership being driven by a top-level executive (e.g., CEO, CFO, CIO, etc.) who has control over a certain amount of discretionary funding. In this scenario, a business case may not be required because the investment is being driven by a belief in core principles and a top-down vision. This is often the path of least resistance if you have the fortune to have an executive with the appropriate vision that aligns with the ICC charter/mission. The downside is that if the executive leaves the organization or is promoted into another role, the ICC momentum and any associated investment may fade away if insufficient cross-functional support has been developed.

4. Ride on a wave. This strategy involves tying the infrastructure investment to a large project with definite ROI and implementing the foundational elements to serve future projects and the enterprise overall rather than just the large project's needs. Examples include purchasing the hardware and software for an enterprise data integration hub in conjunction with a corporate merger/acquisition program, or building an enterprise hub as part of a large ERP system implementation. This strategy may make it easier to secure the funds for an infrastructure that is hard to justify on its own merits, but it has the risk of becoming too project-specific and not as reusable by the rest of the enterprise.

5. Create the wave. This strategy involves developing a clear business case with defined benefits and a revenue/cost sharing model that are agreed to in advance by all stakeholders who will use the shared infrastructure. This is one of the most difficult strategies to execute because it requires a substantial up-front investment in building the business case and gaining broad-based organizational support. But it can also be one of the most rewarding because all the hard work to build support and address the political issues is done early.

In summary, the activities in Step 2 to identify the options and develop the approach are:
should retain it for subsequent, more detailed rounds of estimation. To secure funds in support of ICC infrastructure investments, a number of broad-based strategies and detailed methods can be used. Below are five primary strategies that address many of the funding challenges: 1. Recurring quick wins. This strategy involves making a series of small incremental investments as separate projects, each of which provides demonstrable evidence of progress. This strategy works best when the work can be segmented. 2. React to a crisis. While it may not be possible to predict when a crisis will occur, experienced integration staff are often able to see a pattern and anticipate in what areas a crisis is likely to emerge. By way of analogy, it may not be easy to predict when the next earthquake will occur, but we can be quite accurate about predicting where it is likely to occur based on past patterns. The advantage that can be leveraged in a crisis situation is that senior management attention is clearly focused at solving the problem. A challenge however is that there is often a tendency to solve the problem quickly, which may not allow sufficient time for a business case that addresses the underlying structural issues and root causes of the problem. This strategy therefore requires that the ICC team perform some advance work and be prepared with a rough investment proposal for addressing structural issues and be ready to quickly present it when the opportunity presents itself. 3. Executive vision. This strategy relies on ownership being driven by a top level executive (e.g., CEO, CFO, CIO, etc.) who has control over a certain amount of discretionary funding. In this scenario, a business case may not be required because the investment is being driven by a belief in core principles and a top-down vision. This is often the path of least resistance if you have the fortune to have an executive with the appropriate vision that aligns with the ICC charter/mission. The downside is that if the executive leaves the organization or is promoted into another role, the ICC momentum and any associated investment may fade away if insufficient cross-functional support has been developed. 4. Ride on a wave. This strategy involves tying the infrastructure investment to a large project with definite ROI and implementing the foundational elements to serve future projects and the enterprise overall rather than just the large projects needs. Examples include purchasing the hardware and software for an enterprise data integration hub in conjunction with a corporate merger/acquisition program or building an enterprise hub as part of a large ERP system implementation. This strategy may make it easier to secure the funds for an infrastructure that is hard to justify on its own merits, but has the risk of becoming too project-specific and not as reusable by the rest of the enterprise. 5. Create the wave. This strategy involves developing a clear business case with defined benefits and a revenue/cost sharing model that are agreed to in advance by all stakeholders who will use the shared infrastructure. This is one of the most difficult strategies to execute because it requires a substantial up-front investment in building the business case and gaining broad-based organizational support. But it can also be one of the most rewarding because all the hard work to build support and address the political issues is done early. In summary, the activities in Step 2 to identify the options and develop the approach are:
- Assemble the evaluation team
  - Include a mix of resources if possible, including some change agents and change resistors
  - Use internal resources that know their way around the organization
  - Use external resources to ask the naive questions and side-step internal politics
  - Prepare for evaluation, including understanding the organizational financial standards and internal accounting methods
- Define the options and a framework for structuring the investment decision
  - Identify the options (including the baseline status quo option)
  - Screen the options based on short-term and long-term effects
- Determine the business case style

The key deliverable resulting from step 2 is a list of options and a rough idea of the answers to the following questions for each of them:
- How does the solution relate to the enterprise objectives and priorities?
- What overall program performance results will the option achieve?
- What may happen if the option is not selected?
- What additional outcomes or benefits may occur if this option is selected?
- Who are the stakeholders? What are their interests and responsibilities? What effect does the option have on them?
- What will be the implications for the organization's human resources?
- What are the projected improvements in timeliness, productivity, cost savings, cost avoidance, quality and service?
- How much will it cost to implement the integration solution?
- Does the solution involve the innovative use of technology? If so, what risks does that involve?

TIP

Return on Investment (ROI) is often used as a generic term for any kind of measure that compares the financial costs and benefits of an action. A more narrow finance definition is rate of return: for example, if you put $100 into a savings account and have $105 a year later, the ROI is 5%. Some of the most common ROI methods are:
q

Payback period in months or years is equal to the investment amount divided by the incremental annual cash flow; for example, if I invest $1 million, how long will it take to earn the same amount in incremental revenue? Net Present Value (NPV) is the present (discounted) value of future cash inflows minus the present value of the investment and any associated future cash outflows. Internal Rate of Return (IRR) is the discount rate that results in a net present value of zero for a series of future cash flows.

q q
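As a worked illustration of these three measures, a minimal sketch follows. The $1 million investment, the $400,000 annual cash flows, and the payback_period/npv/irr helpers are hypothetical assumptions for illustration, not figures or tools from this Best Practice.

```python
# Hypothetical cash flows to illustrate the three ROI measures defined above.

def payback_period(investment, annual_cash_flow):
    """Years for cumulative incremental cash flow to repay the investment."""
    return investment / annual_cash_flow

def npv(rate, cash_flows):
    """Net Present Value: cash_flows[0] is the (negative) up-front investment."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def irr(cash_flows, low=0.0, high=1.0, tol=1e-6):
    """Internal Rate of Return via bisection: the discount rate where NPV is zero."""
    for _ in range(200):
        mid = (low + high) / 2
        if npv(mid, cash_flows) > 0:
            low = mid      # NPV still positive: the rate can go higher
        else:
            high = mid     # NPV negative: the rate is too high
        if high - low < tol:
            break
    return (low + high) / 2

# Invest $1 million, then receive $400,000 of incremental cash flow per year for four years.
flows = [-1_000_000, 400_000, 400_000, 400_000, 400_000]
print(payback_period(1_000_000, 400_000))   # 2.5 years
print(round(npv(0.10, flows)))              # NPV at a 10% discount rate (about 268,000)
print(round(irr(flows), 3))                 # about 0.219, i.e. roughly a 22% IRR
```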

Step 3: Determine Costs


It is necessary to define the costs associated with options that will influence the investment decision. Be sure to consider all the cost dimensions, including:

- Fixed vs. direct vs. variable costs
- Internal vs. external costs
- Capital vs. operating costs
- One-time costs vs. ongoing cost model

Before defining costs, it is generally useful to create a classification for the various kinds of activities that will make up the project effort. This structure will vary greatly from project to project. Following is an example for a data integration project that has a large number of interfaces with several levels of complexity. This classification helps categorize the different integration efforts within the data integration initiative. The categories are based on 1) the number of fields that must be transformed between source and target; 2) the number of targets for each transformation; and 3) the complexity of the data structure. The categories are:

- Simple: two applications, fewer than 50 fields, and simple flat message layouts
- Moderate: two or more applications, fewer than 50 fields, or hierarchical message layouts
- Complex: multiple applications, more than 50 fields, or complex message layouts

Note: This is a very simple example; some projects may have 10 or more classification schemes for various aspects of the project.

Once the classification for a given project is determined, it can be used to develop cost models for various scenarios. The fixed and direct costs are the costs that do not vary with the number of integration interfaces to be built. They may be incurred over a period of time but can be envisaged as one total at the end of that period.

Direct up-front costs are the out-of-pocket development and implementation expenses. These can be substantial and should be carefully assessed. Fortunately, these costs are generally well documented and easily determined, except for projects that involve new technology or software applications. The main categories of direct/fixed costs are:
- hardware and peripherals
- packaged and customized software
- initial data collection or conversion of archival data
- data quality analysis and profiling
- facilities upgrades, including site preparation and renovation
- design and implementation
- testing and prototyping
- documentation
- additional staffing requirements
- initial user training
- transition, such as costs of running parallel systems
- quality assurance and post-implementation reviews

Direct ongoing costs are the out-of-pocket expenses that occur over the lifecycle of the investment. The costs to operate a facility, as well as to develop or implement an option, must be identified. The main categories of direct ongoing costs are:
- salaries for staff
- software license fees; maintenance and upgrades
- computing equipment and maintenance
- user support
- ongoing training
- reviews and audits

Note: Not all of these categories are included in every data integration implementation. It is important to pick the costs that reflect your implementation accurately.

The primary output from Step 3 is a financial model (typically a spreadsheet) of the options, with different views according to the interests of the main stakeholders; a simple illustrative sketch of such a model follows.
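As an illustration only, the sketch below shows how a cost model might combine fixed/direct costs with a variable build cost driven by the Simple/Moderate/Complex classification above. The unit costs, interface counts, ongoing costs, and the scenario_cost helper are hypothetical assumptions, not figures from this Best Practice.

```python
# Hypothetical unit costs per interface by complexity category (illustration only;
# the Simple/Moderate/Complex categories come from the example classification above).
UNIT_COST = {"Simple": 10_000, "Moderate": 25_000, "Complex": 60_000}

def scenario_cost(fixed_costs, interface_counts, annual_ongoing, years):
    """Rough total cost of a scenario: fixed/direct costs + variable build costs
    + direct ongoing costs over the evaluation horizon."""
    build = sum(UNIT_COST[cat] * n for cat, n in interface_counts.items())
    return fixed_costs + build + annual_ongoing * years

# One possible scenario: $500K of fixed costs, a mixed portfolio of interfaces,
# and $200K/year of ongoing costs over a 3-year horizon.
print(scenario_cost(
    fixed_costs=500_000,
    interface_counts={"Simple": 40, "Moderate": 20, "Complex": 5},
    annual_ongoing=200_000,
    years=3,
))  # 2,300,000 with these assumptions
```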

TIP
Business Case Bundling: Some elements of a program may be necessary but hard to justify on their own; include these elements with other elements that are easier to justify.
Enterprise Level Licensing: In determining costs, there is often an opportunity to get help from external resources such as suppliers of actual or potential technology and services, since they build business cases for a living and can often suggest creative solutions. For example, there are many licensing options for all the components within the Informatica product suite; your Account Manager can provide information on the costs associated with the organizational options you have identified. In the context of an ICC, there may well be significant cost advantages in licensing at the level of the enterprise.

TIP
A Total Cost of Operation (or Ownership, TCO) assessment ideally offers a picture of not only the cost of purchase but all aspects of the further use and maintenance of a solution. This includes items such as:

- Development expenses, testing infrastructure and expenses, and deployment costs
- Costs of training support personnel and the users of the system
- Costs associated with failure or outage (planned and unplanned)
- Diminished performance incidents (i.e., if users are kept waiting)
- Costs of security breaches (in loss of reputation and recovery costs)
- Costs of disaster preparedness and recovery, floor space and electricity
- Marginal incremental growth, decommissioning, e-waste handling, and more

When incorporated in any financial benefit analysis (e.g., ROI, IRR, EVA), a TCO provides a cost basis for determining the economic value of that investment. TCO can, and often does, vary dramatically against TCA (total cost of acquisition). Although TCO is far more relevant in determining the viability of any capital investment, many organizations make ROI investment decisions by considering only the initial implementation costs.


Step 4: Define Benefits


This step identifies and quantifies the potential benefits of a proposed integration investment. Both quantitative and qualitative benefits should be defined. Determine how improvements in productivity and service are defined and also the methods for realizing the benefits.
- Direct (hard) and indirect (soft) benefits
- Financial model of the benefits
- Collect industry studies to complement internal analysis
- Identify anecdotal examples to reinforce facts
- Define how benefits will be measured

To structure the evaluation, you will have to clearly identify and quantify the project's advantages. A structure is required to set a range within which the benefits of an integration implementation can be realized. Conservative, moderate, and optimistic values are used in the attempt to produce a final range that realistically contains the benefits to the enterprise of an integration project, but also reflects the difficulty of assigning precise values to some of the elements (a simple sketch of such a range follows the list below).

- Conservative values reflect the highest costs and lowest benefits possible
- Moderate values reflect those you believe most accurately reflect the true value of the data integration implementation
- Optimistic estimates reflect values that are highly favorable but also not improbable
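As an illustration only, a three-scenario benefit range might be modeled as follows. The annual benefit amounts, the flat benefit profile, the 10% discount rate, and the five-year horizon are all hypothetical assumptions, not figures from this Best Practice.

```python
# Hypothetical sketch of a conservative / moderate / optimistic benefit range.
DISCOUNT_RATE = 0.10   # assumed discount rate for bringing future benefits to present value
YEARS = 5              # assumed estimation horizon following the project

annual_benefit = {            # assumed annual benefit under each scenario
    "conservative": 500_000,
    "moderate": 1_000_000,
    "optimistic": 1_600_000,
}

def present_value(annual, rate=DISCOUNT_RATE, years=YEARS):
    """Discounted value of a flat annual benefit stream over the estimation horizon."""
    return sum(annual / (1 + rate) ** t for t in range(1, years + 1))

for scenario, amount in annual_benefit.items():
    print(f"{scenario:>12}: {present_value(amount):,.0f}")
```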

Many of the greatest benefits of a data integration project are realized months or even years after the project has been completed. These benefits should be estimated for each of the three to five years following the project. In order to account for the changing value of money over time, all future pre-tax costs and benefits are discounted at the Internal Rate of Return (IRR) percent.

Direct (Hard) Benefits: The enterprise will immediately notice a few direct benefits from improving data integration in its projects. These are mainly cost savings over traditional point-to-point integration in which interfaces and transformations between applications are hard-coded, and the cost savings the enterprise will incur due to the enhanced integration and automation made possible by data integration. Key considerations include:

- Cost savings
- Reduction in complexity
- Reduction in staff training
- Reduction in manual processes
- Incremental revenue linked directly to the project
- Governance and compliance controls that are directly linked

Indirect (Soft) Benefits: The greatest benefits from an integration project usually stem from the extended flexibility the system will have. For this reason, these benefits tend to be longer-term and indirect. Among them are:
- Increase in market share
- Decrease in cost of future application upgrades
- Improved data quality and reporting accuracy
- Decrease in effort required for integration projects
- Improved quality of work for staff and reduced turnover
- Better management decisions
- Reduced wastage and re-work
- Ability to adopt a managed service strategy
- Increased scalability and performance
- Improved services to suppliers and customers
- Increase in transaction auditing capabilities
- Decreased time to market for mission critical projects
- Increased security features
- Improved regulatory compliance

It is possible to turn indirect benefits into direct benefits by performing a detailed analysis and working with finance and management stakeholders to gain support. This may not always be necessary, but often is essential (especially with a Create the Wave business case style). Since it can take a lot of time and effort to complete this analysis, the recommended best practice is to select only one indirect benefit as the target for a detailed analysis. Refer to Appendix C at the end of this document for an additional list of possible business value categories and analysis options.

TIP
Turn subjective terms into numbers: Fact-based, quantitative drivers and metrics are more compelling than subjective ones. Outcome-based objectives are more persuasive than activity-based measures. If detailed analysis is not feasible for the entire scope, it may be sufficient to use a combination of big-picture top-down numbers for macro-level analysis plus a micro-level analysis on a representative piece of the whole.

The following figure illustrates an example of a compelling exposition of ICC benefits. Note that the initial cost of integrations developed by the ICC is greater than hand-coding, but after 100 integrations the ICC cost is less. In this example (which is based on a real-life case), the enterprise developed more than 300 integrations per year, which translates into a saving of $3 million per year.
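The break-even logic behind a chart like this can be sketched as follows. The fixed framework cost and the per-integration costs below are illustrative assumptions chosen to be consistent with the crossover at 100 integrations and the $3 million annual saving described above; they are not figures taken from the actual case.

```python
# Illustrative assumptions (not actual case figures): the ICC carries a one-time
# framework cost but a lower marginal cost per integration than hand-coding.
ICC_FIXED = 1_000_000                 # assumed one-time cost of the reusable framework
ICC_PER_INTEGRATION = 10_000          # assumed marginal cost per integration via the ICC
HAND_CODED_PER_INTEGRATION = 20_000   # assumed marginal cost per hand-coded integration

# Break-even volume: the point after which cumulative ICC cost drops below hand-coding.
break_even = ICC_FIXED / (HAND_CODED_PER_INTEGRATION - ICC_PER_INTEGRATION)
print(break_even)  # 100 integrations with these assumptions

# Once the framework cost is sunk, the marginal saving at 300 integrations per year:
annual_saving = 300 * (HAND_CODED_PER_INTEGRATION - ICC_PER_INTEGRATION)
print(annual_saving)  # 3,000,000 per year, consistent with the example described above
```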


It is also useful to identify the project beneficiaries and to understand their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart, a useful reference document that ensures that all project team members understand the corporate/business organization.

TIP
Leverage Industry Studies: Industry studies from research firms such as Gartner, Forrester and AMR Research can be used to highlight the value of an integration approach. For example, all the percentages below are available from various research reports. This example is for a US-based telecommunications company with annual revenue of $10B. The industry study suggests that an ICC would save $30 million per year for an organization of this size.

Company revenue (telecommunications industry):                               $10,000,000,000
% of revenue spent on IT (1):                                          5.0%     $500,000,000
% of IT budget spent on investments (1):                              40.0%     $200,000,000
% of investment projects spent on integration (2):                    35.0%      $70,000,000
% of integration project savings resulting from an ICC (3):           30.0%      $21,000,000
% of IT budget spent on MOOSE (1):                                    60.0%     $300,000,000
% of MOOSE spent on maintenance (guesstimate - no study available):   15.0%      $45,000,000
% of integration savings on maintenance costs resulting from an ICC (3): 20.0%    $9,000,000
Total potential annual savings resulting from an ICC:                            $30,000,000

Notes:
1. Forrester, 11-13-2007, "US IT Spending Benchmarks For 2007"
2. Gartner, 11-6-2003, "Client Issues for Application Integration"
3. Gartner, 4-4-2008, "Cost Cutting Through the Use of an Integration Competency Center or SOA Center of Excellence"
MOOSE = Maintain and Operate the IT Organization, Systems, and Equipment
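For completeness, the arithmetic in the TIP above can be reproduced as a short sketch; the percentages and the $10B revenue figure are the ones cited in the example.

```python
# Reproduces the arithmetic of the industry-study example above.
revenue = 10_000_000_000

it_budget         = revenue * 0.05            # % of revenue spent on IT
investment_spend  = it_budget * 0.40          # % of IT budget spent on investments
integration_spend = investment_spend * 0.35   # % of investment projects spent on integration
project_savings   = integration_spend * 0.30  # % of integration project savings from an ICC

moose               = it_budget * 0.60        # % of IT budget spent on MOOSE
maintenance         = moose * 0.15            # % of MOOSE spent on maintenance (guesstimate)
maintenance_savings = maintenance * 0.20      # % of integration savings on maintenance from an ICC

total = project_savings + maintenance_savings
print(f"${total:,.0f}")  # $30,000,000
```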

Step 5: Analyze Options


After you have identified the options, the next step is to recommend one. Before selecting an option to recommend, you need to have a good understanding of the organization's goals, its business processes, and the business requirements that must be satisfied. To evaluate investment options, select criteria that will allow measurement and comparison. The following list presents some possible analyses, starting with those that involve hard financial returns and progressing to those that are more strategic:

- Analysis of cost effectiveness: demonstrates, in financial terms, improvements in performance or in service delivery; and shows whether the benefits from the data integration investment outweigh its costs.
- Analysis of displaced or avoided costs: compares the proposed system's costs to those of the system it would displace or avoid; and may justify the proposal on a least-cost basis if it can be assumed that the new system will have as many benefits as the current system.
- Work value analysis: requires analysis of work patterns throughout the organization and of ways that would re-adjust the number and types of skills required; and assumes that additional work needs to be done, that management allocates resources efficiently, and that workers allocate time efficiently.
- Cost of quality analysis: estimates the savings to be gained by reducing the cost of quality assurance, such as the cost of preventing or repairing a product failure; and can consider savings that are internal and external to the organization, such as the enterprise's cost to return a product.
- Option value analysis: estimates the value of future opportunities that the organization may now pursue because of the project; uses decision trees and probability analysis; and includes savings on future projects, portions of the benefits of future projects, and reductions in the risks associated with future projects.
- Analysis of technical importance: justifies an infrastructure investment because a larger project that has already received approval could not proceed without it. This is likely when enterprises initiate a data integration program as a consequence of a merger or acquisition and two large ERP systems need to communicate.
- Alignment with business objectives: includes the concept of strategic alignment modeling, which is one way to examine the interaction between IT strategy and business strategy; and allows managers to put a value on the direct contribution of an investment to the strategic objectives of the organization.
- Analysis of level-of-service improvements: estimates the benefits to enterprises of increases in the quantity, quality, or delivery of services; and must be done from the enterprise's viewpoint.
- Research and development (R&D): is a variant of option value analysis, except that the decision on whether to invest in a large data integration project depends on the outcome of a pilot project; is most useful for high-risk projects, where R&D can assess the likelihood of failure and help managers decide whether to abort the project or better manage its risks; and requires management to accept the consequences of failure and to accept that the pilot is a reasonable expense in determining the viability of a data integration project.

TIP
Use analytical techniques, such as discounted cash flow (DCF), internal rate of return (IRR), return on investment (ROI), net present value (NPV), or break-even/payback analysis to estimate the dollar value of options.

After you have quantified the costs and benefits, it is essential to conduct a cost-benefit analysis of the various options. Showing the incremental benefits of each option relative to the base case requires less analysis, since the analyst does not have to evaluate the benefits and costs of an entire program or service. Some benefits may not be quantifiable. Nevertheless, these benefits should be included in the analysis, along with the benefits to individuals within and external to the organization. You have to look at the project from two perspectives: the organization's perspective as the supplier of products and services, and the enterprise's or public's perspective as the consumer of those services.

Hard cost savings come from dedicated resources (people and equipment) while more uncertain savings come from allocated costs such as overheads and workload. When estimating cost avoidance, keep these two types of savings separate. Assess how likely it is that the organization will realize savings from allocated resources, and estimate how long it will take to realize these savings.

TIP Since the cost-benefit analysis is an important part of the decision-making process, verify calculations thoroughly. Check figures on spreadsheets both before and during the analysis. Include techniques and assumptions in the notes accompanying the analysis of each option.

Step 6: Evaluate Risks


Step 6 presents ways to help identify and evaluate the risks that an integration investment may face so that they can be included in the business case. It also discusses how to plan to control or minimize the risk associated with implementing a data integration investment. Key activities are:

- Identify the risks
- Characterize in terms of impact, likelihood of occurrence, and interdependence
- Prioritize to determine which risks need the most immediate attention
- Devise an approach to assume, avoid or control the risks

The purpose of risk assessment and management is to determine and resolve threats to the successful achievement of investment objectives and especially to the benefits identified in the business case. The assessment and management of risk are ongoing processes that continue throughout the duration of an integration implementation and are used to make decisions about the project implementation. The first decision faced by an integration investment option is whether to proceed. The better the risks are understood and planned for when this decision is made, the more reliable the decision and the better the chances of success.

The method underlying most risk assessment and management approaches can be summarized by the following five-step process:
1. Identify the risks facing the project
2. Characterize the risks in terms of impact, likelihood of occurrence, and interdependence
3. Prioritize the risks to determine which need the most immediate attention
4. Devise an approach to assume, avoid or control the risks
5. Monitor the risks

All but the last of these can and should be undertaken as part of the business-case analysis conducted prior to the decision to proceed.

TIP A group can assess risk more thoroughly than an individual. Do not use the unreliable practice of discounting the expected net gains and then assuming that the remainder is safe.

Not all risks are created equal. For each risk identified, characterize the degree of risk in terms of:
- its impact on the project (e.g., slight delay or show-stopper)
- the probability of its occurrence (e.g., from very unlikely to very likely)
- its relationship to other risks (e.g., poor data quality can lead to problems with data mapping)


Once the risks have been identified and characterized, they can then be ranked in order of priority to determine which should be tackled first. Priority should be based on a combination of an event's impact, likelihood, and interdependence. For example, risks that have a severe impact and are very likely to occur should be dealt with first, to avoid having to deal with additional risks. You can assign priorities to risk factors by assigning a weight to each risk for each of the three characteristics (i.e., impact, likelihood and interdependence) and multiplying the three values to create a composite score. The risk with the highest score gets the highest priority.

TIP
A general rule of thumb is to develop a risk mitigation plan only for the top five risks, based on the rationale that a) there is no point focusing on lower priority risks if the major ones aren't addressed and b) due to limited management attention it is not feasible to tackle too many at once. After you have mitigated some of the highest priority risks, re-evaluate the list on an ongoing basis and focus again on the top five.
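As an illustration only, the composite scoring and top-five selection described above might be sketched as follows; the risk names and the 1-to-5 weights are hypothetical.

```python
# Each risk gets a 1-5 weight for impact, likelihood and interdependence (hypothetical values).
risks = {
    "Poor source data quality":      {"impact": 5, "likelihood": 4, "interdependence": 4},
    "Key SME unavailable":           {"impact": 3, "likelihood": 3, "interdependence": 2},
    "New middleware technology":     {"impact": 4, "likelihood": 2, "interdependence": 3},
    "Scope creep from LOB projects": {"impact": 3, "likelihood": 4, "interdependence": 2},
    "Vendor license cost overrun":   {"impact": 2, "likelihood": 2, "interdependence": 1},
    "Hardware delivery delay":       {"impact": 2, "likelihood": 3, "interdependence": 1},
}

# Composite score = impact x likelihood x interdependence; highest score = highest priority.
scored = sorted(
    ((r["impact"] * r["likelihood"] * r["interdependence"], name) for name, r in risks.items()),
    reverse=True,
)

# Per the rule of thumb in the TIP, build mitigation plans only for the top five risks.
for score, name in scored[:5]:
    print(score, name)
```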

Three main types of risks arise in IT projects:


- Lack of control. Risks of this type arise from a project team's lack of control over the probability of occurrence of an event and/or its consequences. For example, the risks related to senior managers' decisions are often a result of the lack of control a project team has over senior managers.
- Lack of information. Risks of this type arise from a project team's lack of information regarding the probability of occurrence of an event or its consequences. For example, risks related to the use of new technologies are often the result of a lack of information about the potential or performance of these technologies.
- Lack of time. Risks of this type arise from a project team's inability to find the time to identify the risks associated with the project or a given course of action, or to assess the probability of occurrence of an event or the impact of its consequences.

There are three main types of responses to risk in data integration projects and they are listed in ascending order of their potential to reduce risk:
- Assume. In this type of response, a department accepts the risk and does not take action to prevent an event's occurrence or to mitigate its impact.
- Control. In this type of response, a department takes no action to reduce the probability of occurrence of an event but, upon occurrence, attempts to mitigate its impact.
- Avoid. In this type of response, a department takes action prior to the occurrence of an event in order either to reduce its probability of occurrence or to mitigate its impact.

Selection of a type of response depends on the priority assigned to a risk, its nature (i.e., whether it is amenable to control or avoidance), and the resources available to the project. In general, the higher the priority of a risk, the more vigorous the type of response applied.

TIP Do not avoid or hide risks. The credibility of the business case will be enhanced by clearly identifying risks and developing a mitigation strategy to address them.

TIP Do not assume that maintaining the status quo constitutes minimum risk. What is changing in the external environment, in the behavior of your customers, suppliers and competitors?

Step 7: Package the Case


Assemble the business case documentation and package it for consumption by the targeted stakeholders. Key activities in this step include:
- Identify the audience
- Prepare the contents of the report
- Package the case in different formats to make a compelling and engaging presentation:
  - Descriptive graphics
  - Animation or simulation of process or data quality issues
  - Case studies and anecdotes
  - Comparative financial data
  - Customer testimonials

Proposal Writing contains further details and advice for developing persuasive proposals.

TIP
The way the business case is presented can make a significant difference; sometimes format may be more important than content. Graphics go a long way; it's worth experimenting with many different graphic techniques to find the one that works.

Step 8: Present the Case


The best analysis and documentation will be useless unless the decision-makers buy in and give the necessary approvals. Step 8 provides suggestions to help ensure that your recommendations get a fair hearing. Utilize both formal and informal methods to present the proposal successfully. Key activities in this step include:
- Find the right sponsor(s)
- Leverage individual and group presentations
- Promotion and cross-functional buy-in
- Model, pilot and test market the proposed solution


Find a sponsor who can galvanize support for the business case, as well as for the subsequent implementation. The sponsor should be a person in a senior management position. Avoid starting a project without a sponsor.

The investment proposal will compete for the attention of decision-makers and the organization as a whole. This attention is crucial, and Informatica can be a vital element in helping enterprises to lobby the project decision-makers throughout the lifecycle of the decision process. Consequently, the proposal itself must be promoted, marketed, and sold. Market your proposal with an eye towards the enterprise culture and the target audience. Word of mouth is an important, but often overlooked, way of delivering limited information to a finite group.

A business case must convince decision-makers that the analysis, conclusions, and recommendations are valid. To do this, use theoretical or practical models, pilot projects, and test marketing. Remember, seeing is believing and a demonstration is worth a thousand pictures. Furthermore, the model or pilot allows the assessment of any ongoing changes to the environment or to the assumptions. One can then answer the "what if" scenarios that inevitably surface during the decision-making process. At the same time, there is a basis for re-assessment and revision of the investments.

Leverage the 30-3-30-3 technique to prepare the appropriate message. See Appendix A at the end of this document for details.

TIP
Remember that presentations are never about simply communicating information; they are about initiating some action. The presentation is not about what you want to tell your audience, but what they need to know in order to take action. Prepare for your presentation with these questions:

- What action do I want the stakeholders to take?
- What questions will they have and what information will they need in order to take the desired action?

Step 9: Review Results


Step 9 outlines a process for conducting ongoing reviews during the lifecycle of the data integration project. Be realistic in your assessment of the feedback from the preceding stage. Whatever the difficulties encountered in the decision-making process, follow up after the decision to review the ongoing validity of the investment and reinforce support. Key activities in this step include:
- Plan for scheduled and unscheduled reviews
- Develop a stakeholder communication plan
- Initiate key metrics tracking and reporting

Reviews help to verify that the IT investment decision remains valid, and that all costs and benefits resulting from that decision are understood, controlled, and realized. The investment analysis contained in the business case defines the goals of the implementation project and serves as a standard against which to measure the project's prospects for success at review time. The following types of reviews can be conducted:

- Independent reviews. These are conducted by an independent party at major checkpoints to identify environmental changes, overrun of time and cost targets, or other problems.
- Internal peer reviews. The object of the peer review is for the group to verify that the project is still on course and to provide expert advice, counsel and assistance to the project manager. In this way, the combined skills and experience of internal staff are applied to the project.
- External peer reviews. ICCs may also draw upon similar people in other departments or organizations to provide a different perspective and to bring a wide range of expertise to bear on project strategies, plans and issues.
- Project team sanity checks. Another source of early warning for project problems is the project team members. These people are the most intimately aware of difficulties or planned activities that may pose particular challenges.
- Oversight reviews. These reviews, under a senior steering committee, should be planned to take place at each checkpoint to reconfirm that the project is aligned with ICC priorities and directions and to advise senior management on project progress.
- Investment reviews. The enterprise auditor can also review the performance of projects and, upon completion, the performance of the investment. At an investment review, the auditor reviews and verifies the effect of the investment to ascertain that the investment was justified.

The reviews should start as soon as money is spent on the investment. Major project reviews should be scheduled to coincide with the release of funds allocated to the project. In this approach, the project sponsor releases only the funds needed to reach the next scheduled review. The performance of the project is reviewed at each scheduled checkpoint or when the released funds run out. After review, departmental management can decide to proceed with the project as planned, modify the project or its funding, or even terminate the project, limiting the loss to the amount previously released. Investment reviews can be scheduled to coincide with project reviews during investment implementation.
- The first investment review should be conducted no later than the midpoint of the project schedule, when the deliverables are under development.
- The second should be conducted after the end of the implementation project, when the deliverables have just started to be used in production.
- A final review should be conducted after the investment has been in production for between six months and a year.

The exact dates for these reviews should, ideally, be determined by the timing of the investment deliverables. This timing should be clearly indicated in the investment plan.

The approved investment analysis should form the basis for criteria used in all reviews. The project schedule of deliverables, based on the investment analysis, establishes the timing criteria for project reviews. After each review, the sponsor should say whether the investment will stop or continue. An investment may be stopped, even temporarily, for any of the following reasons:
- There is no agreement on how to conduct the review.
- The review showed that most of the expected results were not achieved.
- There were changes to the approved investment analysis, and it was not clear that the enterprise was made aware of the full implications of the changes.
- Changes to the approved investment analysis were accepted, but there was no additional funding for the changes or the enterprise had not accepted the new risks.


For the final investment review, the enterprise should demonstrate to the auditor that the investment achieved the expected results, and the auditor should report on the investment's level of success.

TIP
Leverage techniques to maintain effective communications and avoid organizational support diffusion. Keeping the business case as a live document will allow updates and course corrections to reflect changing priorities and market pressures. Follow through to build credibility for the next project.

Appendix A: 30-3-30-3 for Presenting the Business Case


Purpose of the Session
  30 Seconds: Generate curiosity (e.g., elevator speech)
  3 Minutes: Describe status (e.g., status report)
  30 Minutes: Educate on value (e.g., review session)
  3 Hours: Collaboration (e.g., conference)

Focus of the Session
  30 Seconds: Future oriented and focus on the positive
  3 Minutes: Issues, concerns, success stories
  30 Minutes: Current state status and value provided to the business and technology users
  3 Hours: Whole picture, cover all aspects of integration. Leave no stone unturned

You want the audience to think what?
  30 Seconds: Your enthusiasm and passion for data integration
  3 Minutes: How much you have achieved with little or no funding
  30 Minutes: Data integration is valuable but not easy
  3 Hours: E.g., ICC activities are integrated into all aspects of the project lifecycle

Message
  30 Seconds: Simple and high level; establish connections or relationships
  3 Minutes: Segmented into the layers; simple and straightforward
  30 Minutes: Points of integration, how data quality impacts the business and customers
  3 Hours: Detailed definitions, examples of value, stress the importance of growth

Audience Action Desired
  30 Seconds: Request for additional information regarding integration and your initiative
  3 Minutes: Support for data integration and the ICC
  30 Minutes: Understand the value as well as the utility of data integration
  3 Hours: Agreement and consensus

Adapted from R. Todd Stephens, 2005. Used by permission.

Appendix B: Case Studies

This section provides two ICC case studies based on real-world examples. The case studies have been disguised to allow them to be as specific as possible about the details.

Case Study 1: Shared Services ICC - an Executive Vision Investment Strategy


In case study 1, several senior IT executives of GENCO had a strong belief that the organization would benefit from a shared services integration team. An ICC was established, including a team of software developers, with the expectation that the group would develop some highly reusable software and would recover most of the staff costs by charging their time out to projects. After almost one year, it became clear that the line-of-business (LOB) project teams were not accepting the ICC, so some changes were made to the team to turn it around.

The turnaround began with the introduction of a new ICC director and the development of a business case: specifically, a financial justification to create a framework of reusable components such that the traditional data integration effort could be transformed from a custom development effort into a more efficient assembly process. The business case took three months to develop and resulted in approval of an investment of $3 million.

The underlying premise of the business case was simple: GENCO was building more than 100 batch and real-time interfaces per year at an average cost of $30,000 per interface and an average development time of 30 days. And because there were no enterprise-wide standards, each interface was a "work of art" that presented challenges to maintain and support. The proposal was to invest $3 million to produce a standard framework to reduce the cost per interface to $10,000 and shorten the development lifecycle to 10 days; the hard savings would be $2 million in the first year, plus soft benefits of reducing the time-to-market window and standardizing the integration software to reduce software maintenance.

While the business case was compelling, it was not easy to come up with the numbers. For example, some project teams did not want to share any information about their cost or time to build integrations. On the surface, their excuse was that they didn't have metrics and their staff were all so busy that they didn't have time to do analysis on past projects. The underlying reason may have been that they didn't believe in the value of an ICC and were fearful of losing control over some aspect of their work to a centralized group. In another example, data from a large project was uncovered that showed that the average cost to build an interface was $50,000, but the IT executive in charge refused to acknowledge the numbers on the basis that it was an "apples to oranges" comparison and that the numbers therefore weren't relevant (the real reason may have been more political). In the end, it required a negotiation with the executive to agree on the $30,000 baseline metric. Although the actual baseline cost was higher, the negotiated baseline was still sufficient to make a strong business case.

The $3-million investment was approved even though only one-third of it was needed to fund the reusable software. The rest was used for implementing a metadata repository and a semi-automated process to effectively manage and control the development of interfaces by a distributed team (including team members in India), educating and training the LOB project teams on how to use the new capability, and creating a subsidy to allow the ICC to sell the initial integration project work at a lower rate than the actual cost in the first few months until the cost efficiencies took hold. Note that the funding request did not split the $3 million into the various components.
It used the quantifiable cost reduction opportunity that had significant hard benefits to justify a broader investment, which included elements that were more difficult to quantify and justify.

What were the results? In the 18 months after the business case was approved, the ICC delivered 300 integrations in line with the projected cost reductions, which meant that the financial results significantly exceeded the approved plan. Furthermore, a typical integration was being built in five days or less, which also exceeded the time-to-market goal. The


ICC made an effort to communicate progress on a quarterly basis to the CIO and the executive team, with particular emphasis on the measurable benefits. Finally, the metadata repository to track and report progress of all integration requests was in place, with an easy-to-use interface for project managers to have visibility into the process. This turned out to be one of the major factors in breaking down the "not invented here" syndrome by providing transparency to the project teams and following through on delivery commitments. This was another key factor in sustaining cross-functional support after the initial funding approval.
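As a worked illustration of the arithmetic in this case study, the expected hard savings and the simple payback period implied by the quoted figures can be checked as follows; the payback figure is derived from those numbers rather than stated in the case.

```python
# Figures quoted in Case Study 1 above.
interfaces_per_year = 100
old_cost_per_interface = 30_000
new_cost_per_interface = 10_000
investment = 3_000_000

annual_hard_savings = interfaces_per_year * (old_cost_per_interface - new_cost_per_interface)
print(annual_hard_savings)      # 2,000,000 per year, as stated in the business case

simple_payback_years = investment / annual_hard_savings
print(simple_payback_years)     # 1.5 years to recover the $3M investment (derived)
```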

Case Study 2: Integration Hub Consolidation - a Create the Wave Investment Strategy
A CIO was once heard to exclaim, "I have a billion-dollar budget and no money to spend." This wasn't the CIO of BIGCO (the pseudonym for this case study), but it could have been. The problem at BIGCO was that an ever-increasing portion of the annual IT budget was being spent just to "keep the lights on" for items such as ongoing maintenance of applications, regulatory changes demanded by the federal government, disaster recovery capabilities mandated by the board, and ongoing operations.

One of the biggest perceived drivers of this trend was unnecessary complexity in the IT environment. Clearly, some amount of complexity is necessary in a modern IT environment due to the inherent intricacy of a multinational business operating in many legal jurisdictions, with millions of customers, 100,000-plus employees, hundreds of products, and dozens of channels for customers and suppliers to interact. However, a tremendous amount of unnecessary complexity at BIGCO was self-imposed by past practices such as acquiring other companies without fully consolidating the systems, implementation of application systems in silos resulting in duplicate and overlapping data and functions across the enterprise, lack of governance resulting in incremental growth of systems to address only tactical needs, and integration as an afterthought without an enterprise standard framework.

No one at BIGCO disagreed with the problem, all the way from the CEO (who discussed it regularly in public forums) to the CIO to the software developers. Metaphorically, much of the low-hanging fruit had already been picked, but the really juicy fruit was still at the top of the tree. It was hard to pick because of the challenges mentioned at the introduction of this paper. This case explores how these challenges were addressed in a specific scenario: consolidating 30 legacy integration systems and transforming them into an efficient enterprise hub using the latest technologies. The 30 systems had been built up incrementally over 10 years through thousands of projects without a master architectural blueprint. Each change was rational on its own, but the result had multiple instances of middleware in a complex integration situation that clearly cost too much to maintain, was difficult to change, and was susceptible to chaotic behavior in day-to-day operations.

A lot of money was at stake in this case. The 30 systems had an annual run-rate operating cost of $50 million, and an initial back-of-the-envelope analysis showed that it could be cut in half. While there was some top-down executive support, much broader cross-organizational support was necessary, so the ICC team decided to use the Create the Wave strategy.

The first step was to build a business case. This turned out to be a 6-month exercise involving a core team of four staff members, who engaged more than 100 stakeholders from multiple functions across the enterprise. They started out by gathering 18 months of historical cost information about each of the 30 systems. Some stakeholders didn't think 18 months was sufficient, so the team went to three years of history and for many of the systems eventually tracked down five years of history. At the core of the business case, the ICC team wanted to show what would happen to the $50-million run-rate cost over the next three years under the status quo scenario and compare it to the run-rate cost in a simplified environment. They used MS Excel to construct the financial business model. It started as a couple of worksheets that grew over time.
The final version was 13MB and comprised 48 worksheets showing five years of history and three years of projections for various scenarios, plus month-by-month project costs for two years. All of it was sliced and diced to show various views for different organizational groups.

What were the results of this case study? The final business model showed that an investment of $20 million would result in a net ongoing operational saving of $25 million per year. The gross savings relative to the baseline cost of $50 million per year were actually projected to be $30 million, but because the project was also introducing new capabilities for building an enterprise hub, the new capabilities were projected to add $5 million per year to the run-rate operating cost. The net savings were $25 million annually. The lesson here once again is to include some hard-to-justify elements in a larger project that can be justified.

Appendix C: Business Value Analysis


1. ASK, using the table below for ideas/examples:
- What is the business goal of this project?
- Is this relevant? For example, is the business goal of this project to...?
- What are the business metrics or KPIs associated with this goal?
- How will the business measure the success of this project?
- Are any of these examples relevant?

2. PROBE: If the business sponsor needs more help understanding how data impacts business value, use these example projects and data capabilities to probe.
- These data integration projects are often associated with this business goal. Is this data integration project being driven by this business goal?
- How does data accessibility affect the business? Does having access to all your data improve the business? Do these examples resonate?
- How does data availability affect the business? Does having data available when it's needed improve the business? Do these examples resonate?
- How does data quality affect the business? Does having good data quality improve the business? Do these examples resonate?
- How does data consistency affect the business? Does having consistent data improve the business? Do these examples resonate?
- How does data audit ability affect the business? Does having an audit trail on your data improve the business? Do these examples resonate?
- How does data security affect the business? Does ensuring secure data access improve the business? Do these examples resonate?

3. DOCUMENT the key metrics and estimated impact based on the sponsor's input:

- The key business metrics relevant to this project
- The sponsor's estimated impact on that metric (e.g. increase cross-sell rate from 3% to 5%)
- What is the estimated dollar value of that impact? (The sponsor must provide this estimate based on their own calculations.)


A. INCREASE REVENUE

New Customer Acquisition
  Explanation: Lower the costs of acquiring new customers
  Typical Metrics: cost per new customer acquisition; cost per lead; # new customers acquired/month per sales rep or per office/store
  Data Integration Examples: Marketing analytics; Customer data quality improvement; Integration of 3rd party data (from credit bureaus, directory services, salesforce.com, etc.)

Cross-Sell / Up-Sell
  Explanation: Increase penetration and sales within existing customers
  Typical Metrics: % cross-sell rate; # products/customer; % share of wallet; customer lifetime value
  Data Integration Examples: Single view of customer across all products, channels; Marketing analytics & customer segmentation; Customer lifetime value analysis

Sales and Channel Management
  Explanation: Increase sales productivity, and improve visibility into demand
  Typical Metrics: sales per rep or per employee; close rate; revenue per transaction
  Data Integration Examples: Sales/agent productivity dashboard; Sales & demand analytics; Customer master data integration; Demand chain synchronization

New Product / Service Delivery
  Explanation: Accelerate new product/service introductions, and improve "hit rate" of new offerings
  Typical Metrics: # new products launched/year; new product/service launch time; new product/service adoption rate
  Data Integration Examples: Data sharing across design, development, production and marketing/sales teams; Data sharing with 3rd parties, e.g. contract manufacturers, channels, marketing agencies, etc.

Pricing / Promotions
  Explanation: Set pricing and promotions to stimulate demand while improving margins
  Typical Metrics: margins; profitability per segment; cost-per-impression, cost-per-action
  Data Integration Examples: Cross-geography/cross-channel pricing visibility; Differential pricing analysis and tracking; Promotions effectiveness analysis

B. LOWER COSTS

Supply Chain Management
  Explanation: Lower procurement costs, increase supply chain visibility, and improve inventory management
  Typical Metrics: purchasing discounts; inventory turns; quote-to-cash cycle time; demand forecast accuracy
  Data Integration Examples: Product master data integration; Demand analysis; Cross-supplier purchasing history

Production & Service Delivery
  Explanation: Lower the costs to manufacture products and/or deliver services
  Typical Metrics: production cycle times; cost per unit (product); cost per transaction (service); straight-through-processing rate
  Data Integration Examples: Cross-enterprise inventory rollup; Scheduling and production synchronization

Logistics & Distribution
  Explanation: Lower distribution costs and improve visibility into distribution chain
  Typical Metrics: distribution costs per unit; average delivery times; delivery date reliability
  Data Integration Examples: Integration with 3rd party logistics management and distribution partners

Invoicing, Collections and Fraud Prevention
  Explanation: Improve invoicing and collections efficiency, and detect/prevent fraud
  Typical Metrics: # invoicing errors; DSO (days sales outstanding); % uncollectible; % fraudulent transactions
  Data Integration Examples: Invoicing/collections reconciliation; Fraud detection

Financial Management
  Explanation: Streamline financial management and reporting
  Typical Metrics: end-of-quarter days to close; financial reporting efficiency; asset utilization rates
  Data Integration Examples: Financial data warehouse & reporting; Financial reconciliation; Asset management & tracking

C. MANAGE RISK

Compliance (e.g. SEC/SOX/Basel II/PCI) Risk
  Explanation: Prevent compliance outages to avoid investigations, penalties, and negative impact on brand
  Typical Metrics: # negative audit/inspection findings; probability of compliance lapse; cost of compliance lapses (fines, recovery costs, lost business); audit/oversight costs
  Data Integration Examples: Financial reporting; Compliance monitoring & reporting

Financial/Asset Risk Management
  Explanation: Improve risk management of key assets, including financial, commodity, energy or capital assets
  Typical Metrics: errors & omissions; probability of loss; expected loss; safeguard and control costs
  Data Integration Examples: Risk management data warehouse; Reference data integration; Scenario analysis; Corporate performance management

Business Continuity / Disaster Recovery Risk
  Explanation: Reduce downtime and lost business, prevent loss of key data, and lower recovery costs
  Typical Metrics: mean time between failure (MTBF); mean time to recover (MTTR); recovery time objective (RTO); recovery point objective (RPO - data loss)
  Data Integration Examples: Resiliency and automatic failover/recovery for all data integration processes

Examples of Key Capabilities by Data Attribute


A. INCREASE REVENUE

New Customer Acquisition
  Accessibility: Cross-firewall access to third party customer data, e.g. credit bureaus, address directories, list brokers, etc.
  Availability: Accelerated delivery of sales lead data to appropriate channels
  Quality: Customer targeting and on-boarding based on accurate customer/prospect/market data
  Consistency: Sharing of correlated customer data with sales and third party channels to reduce channel conflict and duplication
  Audit Ability: Predict the impact of changes, e.g. switching credit bureaus or implementing a new marketing system
  Security: Secure access to valuable customer lead, financial and other information

Cross-Sell / Up-Sell
  Accessibility: Opportunity identification with integrated access to CRM, SFA, ERP and others
  Availability: Real-time customer analytics enabling tailored cross-selling at customer touch points
  Quality: Accurate, complete, de-duplicated customer data to create a single view
  Consistency: Single view of customer reconciling differences in business definitions & structures across groups
  Audit Ability: Improve governance of customer master data by maintaining visibility into definition of and changes to data
  Security: Customer data privacy and security assurance to protect customers and comply with regulations

Sales and Channel Management
  Accessibility: Incorporation of revenue data, internal or external SFA data, and data in forecast spreadsheets
  Availability: Continuous availability of lead, activity, pipeline and revenue data to sales, partners and channels
  Quality: Completeness and validity of sales pipeline and demand data
  Consistency: Alignment of channel/sales incentives based on consistent sales productivity data
  Audit Ability: Provide traceability for demand and revenue reports through data lineage
  Security: Secure access for partners/distributors to share sensitive demand and revenue information

New Product / Service Delivery
  Accessibility: Access to data in both applications/systems as well as design documents
  Availability: Distributed, round-the-clock environment for collaborative data sharing
  Quality: Accurate, de-duplicated product design and development data across functional and geographical boundaries
  Consistency: Consistent application of product and service definitions and descriptions across functions and with partners
  Audit Ability: Ensuring compliance with product regulations through version control and view of lineage
  Security: Improved collaboration on prototyping, testing and piloting through secure data sharing

Pricing / Promotions
  Accessibility: Holistic pricing management based on data from applications, including pricing spreadsheets
  Availability: Real-time pricing data to enable constant monitoring & on-the-fly discount/pricing adjustments based on demand
  Quality: Complete, accurate product pricing and profitability data
  Consistency: Global/cross-functional reconciliation of pricing and promotions data
  Audit Ability: Rationalization of pricing and improved record keeping for price changes
  Security: Segregation of differential pricing and promotions data for different customers, channels, etc.

B. LOWER COSTS

Supply Chain Management
  Accessibility: Access to EDI and unstructured data (typically in Excel/Word) from suppliers/distributors
  Availability: Real-time supply chain management, aligned with just-in-time production models
  Quality: De-duplicated, complete view of products and materials data to improve supply chain efficiency
  Consistency: Reconciled view of purchases across all suppliers to improve purchasing effectiveness & negotiation stance
  Audit Ability: Improve governance of product master data by maintaining visibility into definition of and changes to data
  Security: Encrypted data exchange with extended network of suppliers/distributors

Production & Service Delivery
  Accessibility: Integrated access to EDI, MRP, SCM and other data
  Availability: Near real-time production and transaction data to streamline operations
  Quality: Improved planning and product management based on accurate materials, inventory and order data
  Consistency: Reconciled product and materials master data to ensure accurate inventory and production planning
  Audit Ability: Ensure compliance with production regulations through version control and view of lineage
  Security: Role-based access to critical operational data, based on business need

Logistics & Distribution
  Accessibility: Bi-directional integration of data with 3rd party logistics and distribution partners
  Availability: Availability of order status and delivery data on a real-time, as-needed basis
  Quality: Reduction in logistics and distribution errors with accurate, validated data
  Consistency: Consistent definition across extended ecosystem of key data such as ship-to, delivery information
  Audit Ability: Predict the impact of changes, e.g. flagging dependencies on a 3rd party provider's data
  Security: Encrypted data exchange with extended network of logistics partners, distributors and customers

Invoicing, Collections and Fraud Prevention
  Accessibility: Integration of historical customer data with third party data to detect suspect transactions and prevent fraud
  Availability: Hourly or daily availability of reconciled invoicing and payments data
  Quality: Reduced errors in invoicing/billing to accelerate collections
  Consistency: Reconciliation of purchase orders to invoices to payments across geographies, organizational hierarchies
  Audit Ability: Detection/prevention of inefficiencies or fraud through dashboards and alerts
  Security: Secure customer access to billing and payment data

Financial Management
  Accessibility: Incorporation of spreadsheet data with data from financial management and reporting systems
  Availability: On-demand availability of financial management data to business users
  Quality: Improved fidelity in financial management with accurate, complete financial data
  Consistency: Consistent interpretation of chart of accounts across all functions and geographies
  Audit Ability: Built-in audit trail on financial reporting data to ensure transparency and regulatory compliance
  Security: Segregated, secure access to sensitive financial data

C. MANAGE RISK

Compliance (e.g. SEC/SOX/Basel II/PCI) Risk
  Accessibility: Leverage compliance metrics tracked in spreadsheets, along with system-based data
  Availability: On-demand, continuous availability to reporting and monitoring systems
  Quality: Proactive reduction of data conformity and accuracy issues through scoring and monitoring
  Consistency: Reconcile data being reported across groups, functions to ensure consistency
  Audit Ability: Ensuring compliance on data integrity through version control and data lineage
  Security: Secured, encrypted reporting data available to authorized, designated personnel

Financial & Asset Risk Management
  Accessibility: Integrate financial and risk management systems data with spreadsheet-based data
  Availability: Real-time availability of key financial and risk indicators for ongoing monitoring & prevention
  Quality: Continuous data quality monitoring to maintain fidelity on financial data
  Consistency: Validation of correlated data for financial reporting and risk management
  Audit Ability: Visualize data relationships and dependencies at both business and IT level
  Security: Ease management oversight through access records and granular privilege management

Business Continuity / Disaster Recovery Risk
  Accessibility: Access to off-premise data to support secondary/backup systems
  Availability: High availability and automatic failover/recovery to prevent or minimize downtime
  Quality: Updated, de-duplicated data reduces and simplifies data storage and management requirements
  Consistency: Synchronized data across primary and backup systems
  Audit Ability: Impact and dependency analysis across multiple applications and systems for continuity and recovery planning
  Security: Secure cross-firewall access for operations from secondary data centers

Last updated: 03-Jun-08 16:03


Canonical Data Modeling

Challenge


A challenge faced by most large corporations is achieving efficient information exchange in a heterogeneous environment. The typical large enterprise has hundreds of applications that serve as systems of record for information, that were developed independently based on incompatible data models, yet that must share information efficiently and accurately in order to effectively support the business and create positive customer experiences. The key issue is one of scale and complexity that is not typically evident in small to medium sized organizations. The problem arises when there are a large number of application interactions in a constantly changing application portfolio. If these interactions are not designed and managed effectively, they can result in production outages, poor performance, high maintenance costs and lack of business flexibility.

Business-to-Business (B2B) systems often grow organically over time to include systems that an organization builds in addition to buys. The importance of canonical data models grows as a system grows. The challenge that the use of canonical data models solves is to reduce the number of transformations needed between systems and to reduce the number of interfaces that a system supports. The need for this is usually not obvious when there are only one or two formats in an end-to-end system, but it becomes apparent when the system reaches a critical mass in the number of data formats supported (and in the work required to integrate a new system, customer, or document type). For example, if a B2B system accepts 20 different inputs, passes that data to legacy systems, and generates 40 different outputs, it is apparent that unless the legacy systems use some shared canonical model, introducing a new input type requires modifications to the legacy systems, flow processes, etc. Put simply, if you have 20 different inputs and 40 different outputs, and all outputs can be produced from any input, then you will need 800 different paths unless you take the approach of transforming all inputs to one or more canonical forms and transforming all responses from one or more canonical forms to the 40 different outputs. This is a fundamental aspect of how Informatica B2B Data Transformation operates, in that all inputs are parsed from the original form to XML (not necessarily the same XML schema) and all outputs are serialized from XML to the target output form. The cost of creating canonical models is that they often require design and maintenance involvement from staff on multiple teams.

This best practice describes three canonical techniques which can help to address the issues of data heterogeneity in an environment where application components must share information in order to provide effective business solutions.
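The arithmetic behind the 20-input/40-output example above can be sketched as follows. The single-canonical-form assumption (giving 20 + 40 = 60 transformations) is an illustrative simplification for this sketch, not a prescription from this Best Practice.

```python
# Illustrative arithmetic for the 20-input / 40-output example above.
inputs, outputs = 20, 40

# Point-to-point: every input format needs its own transformation path to every output format.
point_to_point_paths = inputs * outputs
print(point_to_point_paths)   # 800 paths, as stated above

# Canonical approach (assuming a single canonical form): each format is mapped
# to or from the canonical form exactly once.
canonical_transforms = inputs + outputs
print(canonical_transforms)   # 60 transformations

# Incremental cost of adding one new input format:
print(outputs)                # 40 new point-to-point transformations would be needed
print(1)                      # vs. 1 new parser to the canonical form
```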

Description
This section introduces three canonical best practices in support of the Velocity methodology and modeling competencies:

1. Canonical Data Modeling
2. Canonical Interchange Modeling
3. Canonical Physical Formats

Canonical techniques are valuable when used appropriately in the right circumstances. The key best practices, which are elaborated in the following sections, are:

- Use canonical data models in business domains where there is a strong emphasis on building rather than buying application systems.
- Use canonical interchange modeling at build time to analyze and define information exchanges in a heterogeneous application environment.
- Use canonical physical formats at run time in many-to-many or publish/subscribe integration patterns, particularly in the context of a business event architecture.
- Plan for appropriate tools to support analysts and developers.
- Develop a plan to maintain and evolve the canonical models as discrete enterprise components. The ongoing costs to maintain the canonical models can be significant and should be budgeted accordingly.

For a large-scale system-of-systems in a distributed computing environment, the most desirable scenario is to achieve loose coupling and high cohesion, resulting in a solution that is highly reliable, efficient, easy to maintain, and quick to adapt to changing business needs. Canonical techniques can play a significant role in achieving this ideal state. The graphic below outlines how the three canonical techniques generally align with, and enable the qualities in, each of the four Coupling/Cohesion quadrants. Note: There is some overlap between the techniques since there is no hard black-and-white definition of these techniques and their impact on a specific application.


Each of the three techniques has a sweet spot; that is, each can be applied in a way that is extremely effective and provides significant benefits. The application of these methods to a given implementation imparts architectural qualities to the solution. This best practice does not attempt to prescribe which qualities are desirable or not, since that is the responsibility of the solutions architect to determine. For example, tight coupling could be a good thing or a bad thing depending on the needs and expectations of the customer. Tight coupling generally results in better response time and network performance in comparison to loose coupling, but it can also have a negative impact on the adaptability of components.

Furthermore, the three canonical best practices are generally not used in isolation; they are typically used in conjunction with other methods as part of an overall solutions methodology. As a result, it is possible to expand, shrink, or move the sweet spot depending on how each technique is used with other methods. This best practice does not address the full spectrum of dependencies with other methods and their resultant implications, but it does attempt to identify some common pitfalls to be avoided.

Common Pitfalls
Peanut Butter: One pitfall that is pertinent to all three canonical practices is the Peanut Butter pattern, which basically involves applying the methods in all situations. To cite a common metaphor: to a hammer, everything looks like a nail. It certainly is possible to drive a screw with a hammer, but it's not pretty and not ideal. When, and exactly how, to apply the canonical best practices should be a conscious, well-considered decision based on a keen understanding of the resulting implications.

Canonical Data Modeling



Canonical Data Modeling is a technique for developing and maintaining a logical model of the data required to support the needs of the business for a subject area. Some models may be relevant to an industry supply chain, the enterprise as a whole, or a specific line of business or organizational unit. The intent of this technique is to direct development and maintenance efforts such that the internal data structures of application systems conform to the canonical model as closely as possible. This technique seeks to eliminate heterogeneity by aligning the internal data representation of applications with a common shared model. In an ideal scenario, there would be no need to perform any transformations at all when moving data from one component to another, but for practical reasons this is virtually impossible to achieve at an enterprise scale. Newly built components are easier to align with the common models, but legacy applications may also be aligned with the common model over time as enhancements and maintenance activities are carried out.

Common Pitfalls
- Data model bottleneck: A Canonical Data Model is a centralization strategy that requires an adequate level of ongoing support to maintain and evolve it. If the central support team is not staffed adequately, it will become a bottleneck for changes, which could severely impact agility.
- Heavy-weight serialized objects: There are two widely-used techniques for exchanging data in a distributed computing environment -- serialized objects and message transfer. The use of serialized objects can negate the positive benefits of high cohesion if they are used to pass around large, complex objects that are not stable and are subject to frequent changes. The negative impacts include excessive processing capacity consumption, increased network latency and higher project costs through extended integration test cycles.

Canonical Interchange Modeling


Canonical Interchange Modeling is a technique for analyzing and designing information exchanges between services that have incompatible underlying data models. This technique is particularly useful for modeling interactions between heterogeneous applications in a many-to-many scenario. The intent of this technique is to make data mapping and transformations transparent at build time. This technique maps data from many components to a common Canonical Data Model which thereby facilitates rapid mapping of data between individual components, since they all have a common reference model.

Common Pitfalls
- Mapping with unstructured tools: Mapping data interchanges for many enterprise business processes can be extremely complex. For example, Excel is not sophisticated enough to handle the details in environments with a large number of entities (typically over 500) and with more than two source or target applications. Without adequate tools, such as Informatica's Metadata Manager, the level of manual effort needed to maintain the canonical models and the mappings to dependent applications in a highly dynamic environment can become a major resource drain that is unsustainable and error-prone.

Proper tools are needed for complex environments.


- Indirection at run-time: Interchange Modeling is a build-time technique. If the same concept of an intermediate canonical format is applied at run-time, it results in extra overhead and a level of indirection that can significantly impact performance and reliability. The negative impacts can become even more severe when used in conjunction with a serialized object information exchange pattern; that is, large complex objects that need to go through two (or more) conversions when being moved from application A to B (this can become a show-stopper for high-performance real-time applications when SOAP and XML are added to the equation).

Canonical Physical Format


Canonical Physical Format prescribes a specific runtime data format and structure for exchanging information. The prescribed generic format may be derived from the Canonical Data Model or may simply be a standard message format that all applications are required to use for certain types of information. The intent of this technique is to eliminate heterogeneity for data in motion by using standard data structures at run-time for all information exchanges. The format is frequently independent of either the source or the target system and requires that all applications in a given interaction transform the data from their internal format to the generic format.

Common Pitfalls
- Complex common objects: Canonical Physical Formats are particularly useful when simple common objects are exchanged frequently between many service providers and many service consumers. Care should be taken not to use this technique for larger or more complex business objects, since it tends to tightly couple systems, which can lead to longer time to market and increased maintenance costs.
- Non-transparent transformations: Canonical Physical Formats are most effective when the transformations from a component's internal data format to the canonical format are simple and direct, with no semantic impedance mismatch. Care should be taken to avoid semantic transformations or multiple transformations in an end-to-end service flow. While integration brokers (or ESBs) are a useful technique for loose coupling, they also add a level of indirection which can complicate debugging and run-time problem resolution. The level of complexity can become paralyzing over time if service interactions result in middleware calling middleware with multiple transformations in an end-to-end data flow.
- Inadequate exception handling: The beauty of a loosely-coupled architecture is that components can change without impacting others. The danger is that in a large-scale distributed computing environment with many components changing dynamically, the overall system-of-systems can assume chaotic (unexpected) behavior. One effective counter strategy is to ensure that every system that accepts Canonical Physical Formats also includes a manual work queue for any inputs that it cannot interpret. The recommended approach is to make exception handling an integral part of the normal day-to-day operating procedure by pushing each unrecognized message/object into a work queue for a human to review and disposition.

Canonical Modeling Methodology



Canonical models may be defined in any number of business functional or process domains at one of four levels:

1. Supply Chain - external inter-company process and data exchange definitions
2. Enterprise - enterprise-wide data definitions (i.e., master data management programs)
3. Organization - a specific business area or functional group within the enterprise
4. System - a defined system or system-of-systems

For example, a supply chain canonical model in the mortgage industry is MISMO (Mortgage Industry Standards Maintenance Organization) which publishes an XML message architecture and a data dictionary for:
- Underwriting
- Mortgage insurance application
- Credit reporting
- Flood and title insurance
- Property appraisal
- Loan delivery
- Product and pricing
- Loan servicing
- Secondary mortgage market investor reporting

The MISMO standards are defined at the Supply Chain level, and companies in this industry may choose to adopt these standards and participate in their evolution. Even if a company doesn't want to take an active role, it will no doubt need to understand the standards, since other companies in the supply chain will send data in these formats and may demand that they receive information according to these standards. A company may also choose to adopt the MISMO standard at the Enterprise level, possibly with some extensions or modifications, to suit its internal master data management initiative. Or one business unit, such as the Mortgage business within a financial institution, may adopt the MISMO standards as its canonical information exchange model or data dictionary, again possibly with extensions or modifications. Finally, a specific application system, or collection of systems, may select the MISMO standards as its canonical model, also with some potential changes.

In one of the more complex scenarios, a given company may need to understand and manage an external Supply Chain canonical model, an Enterprise version of the canonical format, one or many organizational versions, and one or many system versions. Furthermore, all of the models are dynamic and change from time to time, which requires careful monitoring and version control. A change at one level may also have a ripple effect and drive changes in other levels (either up or down). As shown in the figure below, steps 1 through 5 are a one-time effort for each domain, while steps 6 through 11 are repeated for each project that intends to leverage the canonical models.


1. Define Scope: Determine the business functional or process domain and the level (Supply Chain, Enterprise, Organization or System) of the canonical modeling effort.
2. Select Tools and Repository: In small-scale or simple domains, tools such as Excel and a source code repository may be adequate. In complex environments with many groups or individuals involved, a more comprehensive structured metadata repository is needed, with a mechanism for access by a broad range of users.
3. Identify Content Administrator: In small-scale or simple domains, the administration of the canonical models may be a part-time job for a data analyst, metadata administrator, process analyst or developer. In large and complex environments it is often necessary to have a separate administrator for each level and each domain.
4. Define Communities: Each level and each domain should have a defined community of stakeholders. At the core of each community are the canonical administrator, data analysts, process analysts and developers directly involved in developing and maintaining the canonical model. A second layer of stakeholders consists of the individuals that need to understand and apply the canonical models. A third and final layer of stakeholders consists of individuals such as managers, architects, program managers and business leaders that need to understand the benefits and constraints of canonical models.
5. Establish Governance Process: Define how the canonical models will be developed and changed over time, as well as the roles and authorities of the individuals in the defined community. This step also defines the method of communication between individuals, frequency of meetings, versioning process, publishing methods and approval process.
6. Select Canonical Technique: Each project needs to decide which of the three techniques will be used: Canonical Data Modeling, Canonical Interchange Modeling or Canonical Physical Formats. This decision is generally made by the solution architect.
7. Document Sources and Targets: This step involves identifying existing documentation for the systems and information exchanges involved in the project. If the documentation doesn't exist, in most cases it must be reverse-engineered (unless a given system is being retired).
8. Identify Related Canonicals: This step involves identifying relevant or related canonicals in other domains or at other levels that may already be defined in the enterprise. It is also often worth exploring the data models of the large ERP system vendors involved in the project, as well as researching which external industry standards may be applicable.
9. Develop Canonical Model: This step involves a) an analysis effort, b) an agreement process to gain consensus across the defined community, and c) a documentation effort to capture the results. The canonical model may be developed either a) top-down based on the expertise and understanding of domain experts, b) bottom-up by rationalizing and normalizing definitions from different systems, or c) by adopting and tailoring existing canonical models.
10. Build Target Scenario: This step is the project effort associated with leveraging the canonical model in the design, construction or operation of the system components. Note that the canonical models may be used only at design time (as in the case of canonical interchange modeling) or also at construction and run time in the case of the other two canonical techniques.
11. Refresh Metadata & Models: This critical step ensures that any extensions or modifications to the canonical models developed during the course of the specific project are documented and captured in the repository, and that other enterprise domains are aware of the changes in the event that other models may be impacted as well.

Summary
The key best practices are:
- Use canonical data models in business domains where there is a strong emphasis on building rather than buying application systems.
- Use canonical interchange modeling at build time to analyze and define information exchanges in a heterogeneous application environment.
- Use canonical physical formats at run time in many-to-many or publish/subscribe integration patterns, particularly in the context of a business event architecture.
- Plan for appropriate tools such as Informatica Metadata Manager to support analysts and developers.
- Develop a plan to maintain and evolve the canonical models as discrete enterprise components. The ongoing costs to maintain the canonical models can be significant and should be budgeted accordingly.

In summary, there are three defined Canonical Best Practices, each of which has a distinct objective. Each method imparts specific qualities on the resultant implementation, which can be compared using a coupling/cohesion matrix. It is the job of the architect and systems integrator to make a conscious decision and select the methods that are most appropriate in a given situation. The methods can be very effective, but they also come with a cost, so care should be taken to acquire appropriate tools and to plan for the ongoing maintenance and support of the canonical artifacts.

Appendix A: Case Study in Applying Canonical Physical Formats in the Insurance Industry
This appendix contains selected elements of Informatica's approach for implementing the ACORD Life data and messaging standards. ACORD XML for Life, Annuity & Health is a family of specifications for the insurance industry intended to enable real-time, cross-platform business partner message/information sharing. The primary Informatica product that supports this capability is B2B Data Transformation (or DT for short). The core challenge is to transform custom source ACORD Life messages into a common Enterprise canonical format (see the figure below) while achieving key quality metrics such as:

1. Scalability: Ability to support many formats and a high volume of messages
2. Maintainability: Ability to make changes to source, process and product characteristics of existing transformations
3. Extensibility: Ease of adding new source formats to the system
4. Cost effectiveness: Minimizing the cost of implementation and ongoing maintenance

Informatica B2B Data Transformation Capability Overview

Core Mapping Capabilities


The B2B Data Transformation core is ideally suited to address the complex XML-to-XML transformations required to support most projects. Several key capabilities are described below.

Map Objects

The basic building block for a transformation, the Map object, takes a source and a target XPath as input and moves data accordingly. Any kind of transformation logic can be associated with the element-level data move through the use of Transformers. For instance, a sequence of Transformers may include data cleansing, a string manipulation, and a lookup.
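The following minimal Python sketch (hypothetical code, not the B2B Data Transformation API or Studio artifacts) mimics the idea just described: copy a value from a source XPath to a target XPath, passing it through a chain of transformer functions along the way.

```python
# Illustrative sketch only. It mimics a Map object: move data from a source
# XPath to a target XPath, applying a sequence of transformers to the value.
import xml.etree.ElementTree as ET

def lookup(table):
    return lambda value: table.get(value, value)

TRANSFORMERS = [
    str.strip,                               # data cleansing: trim whitespace
    str.upper,                               # string manipulation
    lookup({"MR": "Mister", "MS": "Ms."}),   # lookup against a hypothetical code table
]

def map_element(source_root, source_xpath, target_root, target_xpath, transformers):
    value = source_root.findtext(source_xpath)
    for transform in transformers:
        value = transform(value)
    target_root.find(target_xpath).text = value

source = ET.fromstring("<Party><Prefix>  mr </Prefix></Party>")
target = ET.fromstring("<Person><Salutation/></Person>")
map_element(source, "Prefix", target, "Salutation", TRANSFORMERS)
print(ET.tostring(target, encoding="unicode"))
# <Person><Salutation>Mister</Salutation></Person>
```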


Groups

Individual Map objects may be combined into Groups. Groups are collections of objects (like Maps and other Groups) that either succeed or fail together. This transactional in-memory behavior is essential for complex transformations like ACORD Life. For instance, if an XML aggregate with a particular id is not found, then all other mappings in the same Group will fail.

Sequences

B2B Data Transformation also provides the ability to handle the most complex XML sequences on both the source and target at once. Specifically, at any given time the transformation may access any source or target construct based on its order or its key. For instance, in the complex transformation example below, the transformation logic can be easily expressed with a few B2B Data Transformation constructs:
- what source Party the extension code 60152 is contained within
- what target Party corresponds to the source Party
- update the Relation object that describes the target Party

The ability to combine such processing logic with direct manipulation of source and target XSD structures is a unique characteristic of B2B Data Transformation. As a result, the logic that is captured in B2B Data Transformation is compact, maintainable, and extensible.
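To illustrate the all-or-nothing Group behavior described above, here is a minimal, hypothetical Python sketch (again, not the B2B Data Transformation API): a group of maps is applied to a working copy of the target document, and the real target is updated only if every map in the group succeeds.

```python
# Illustrative sketch of transactional in-memory Group behavior.
import copy

class MapFailure(Exception):
    pass

def run_group(target_doc, maps):
    """Apply all maps to a working copy; commit only if none of them fails."""
    working = copy.deepcopy(target_doc)
    try:
        for apply_map in maps:
            apply_map(working)     # each map mutates the working copy or raises
    except MapFailure:
        return target_doc          # roll back: discard the working copy
    return working                 # commit: all maps succeeded together

# Hypothetical maps: the second one fails because an aggregate id is missing.
def set_face_value(doc):
    doc["Policy"]["FaceValue"] = "250000"

def link_party(doc):
    if "Party_7" not in doc:       # XML aggregate with a particular id not found
        raise MapFailure("Party_7 not present")
    doc["Relation"] = {"from": "Policy", "to": "Party_7"}

target = {"Policy": {}}
result = run_group(target, [set_face_value, link_party])
print(result)                      # {'Policy': {}} -- neither change was applied
```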

B2B Data Transformation Development Environment


B2B Data Transformation provides a visual interactive Studio environment that simplifies and shortens the development process.


In the Studio, transformations are developed incrementally with continuous visual feedback simulating how the transformation logic would be applied to the actual data.

Specification Driven Transformations


The Informatica Analyst extends the Studio IDE to further accelerate the process of documenting and creating complex XML-to-XML transformations. The Analyst is an Excel spreadsheet-based tool that provides a drag-and-drop environment for defining a complex transformation between two XML schemas. Once the transformation specification is defined, the actual transformation runtime is automatically generated from the specification.


The Analyst capabilities may be a key accelerator to bootstrap the development of Custom Source-to-GBO transformations and to implement a number of required transformation components. AutoMapping of like elements between schemas (see spreadsheet cell W3 in the above figure) is another key feature that may prove to be valuable in this project. AutoMapping creates maps between elements of the same name that are located in similar branches of the XML. It also creates three important reports:
- non-mapped source elements
- non-mapped target elements
- stub implementation of non-matching types for the mapped elements

These reports form a foundation for documentation, specification creation and implementation of an XML-to-XML transformation. This technology asset is a key component of the overall solution and its use and specific application will be more clearly defined in the component design phase of this project.
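As a rough illustration of the AutoMapping idea and its reports, the following Python sketch pairs elements that share a name between two toy schemas (given here as path lists with hypothetical element names) and lists the non-mapped elements on each side. The real feature also considers branch similarity; this sketch is simplified to name matching only.

```python
# Illustrative sketch of name-based AutoMapping between two schema trees.
source_paths = [
    "OLifE/Party/Person/FirstName",
    "OLifE/Party/Person/LastName",
    "OLifE/Party/Person/BirthDate",
    "OLifE/Holding/Policy/FaceAmt",
]
target_paths = [
    "GBO/Party/Person/FirstName",
    "GBO/Party/Person/LastName",
    "GBO/Party/Person/Gender",
    "GBO/Holding/Policy/FaceAmt",
]

def leaf(path):                       # element name = last path segment
    return path.split("/")[-1]

source_by_name = {leaf(p): p for p in source_paths}
target_by_name = {leaf(p): p for p in target_paths}

auto_maps = {name: (source_by_name[name], target_by_name[name])
             for name in source_by_name if name in target_by_name}
unmapped_source = [p for p in source_paths if leaf(p) not in target_by_name]
unmapped_target = [p for p in target_paths if leaf(p) not in source_by_name]

print("auto-generated maps:", auto_maps)
print("non-mapped source elements:", unmapped_source)   # BirthDate
print("non-mapped target elements:", unmapped_target)   # Gender
```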

ACORD Versioning Methodology


ACORD Life specifications undergo continuous change. Most significant changes to this model were introduced before version 2.12, but the specification is still evolving. The only reliable mechanism for discovering deltas between two versions of ACORD Life is a side-by-side comparison of their respective XML schemas (XSDs). Informatica has developed and deployed a number of successful solutions in the area of ACORD XML versioning. Specifically, the Analyst tool was created to support ACORD XML versioning-class solutions.

High-Level Transformation Design


For the purpose of technical transformation analysis in this proposal, Informatica introduces a functional breakdown of the end-to-end transformation into finer-grained components. Note: The breakdown and composition order of components may change during the actual delivery of a project.

Descriptions of the individual tasks within the transformation are provided below.

ACORD Versioning

The versioning step provides a mechanism for moving data between two distinct versions of the ACORD Life standard. Due to the complexity of the Life schemas and the minimal revision documentation provided by ACORD, the task of implementing versioning is complex and time-consuming.


Informatica tools and techniques will allow implementing generic versioning transformations rapidly and reliably (see prior sections).

Structural Changes

Life models are flexible enough to allow the same information to be described in multiple ways. For instance, the same data, such as a policy face value, may be placed in various valid locations between source and target formats.

In addition, custom ACORD implementations may also alter the structure of data. For example, the format below describes a primary insured role of the person in the Person aggregate itself rather than in a respective Relation aggregate.

These changes need to be accommodated by the transformation. Another type of structural change results from a certain intended style or layout of the message that is maintained by a target format. For instance, the target format lays out the Parties data in a different way than the source.


Structural changes lead to another significant processing step that is required for transformations: maintaining referential integrity. As important aggregates (like Parties and Holdings) are reformatted and re-labeled, the Relation objects need to reflect the new, modified aggregate IDs in a way that is consistent with the original Relations.

Vendor Extension Processing

Each source format contains custom extensions that need to be transformed into the target format. Sometimes the extension processing is as simple as a straight element-to-element mapping, as shown below.

In other scenarios it may be more complex and involve additional changes to multiple target elements and aggregates.

Code Lists

Another typical aspect of a transformation deals with content substitution, primarily with code lists. For instance, product codes may differ between the formats.

Enrichment

The final transformation step is data enrichment. Enrichment varies from inserting new system information into the XML message to analyzing data patterns, content or business rules to derive new elements, like the SSN type code below.
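The following Python sketch, using hypothetical codes and rules, illustrates the two content-level steps just described: code-list substitution (translating a source product code to the target code list) and enrichment (deriving a new element, an SSN type code, from a data pattern).

```python
# Illustrative sketch of code-list substitution and rule-based enrichment.
import re

PRODUCT_CODE_LIST = {          # source code -> target code (hypothetical values)
    "TERM10": "T-010",
    "WHOLE":  "WL-001",
}

def substitute_codes(record):
    record["ProductCode"] = PRODUCT_CODE_LIST.get(record["ProductCode"],
                                                  record["ProductCode"])
    return record

def enrich_ssn_type(record):
    # Hypothetical business rule: a 9-digit government ID is tagged as an SSN,
    # anything else as an "other" tax identifier type.
    govt_id = record.get("GovtID", "")
    record["SSNTypeCode"] = "SSN" if re.fullmatch(r"\d{9}", govt_id) else "OTHER"
    return record

record = {"ProductCode": "TERM10", "GovtID": "123456789"}
record = enrich_ssn_type(substitute_codes(record))
print(record)
# {'ProductCode': 'T-010', 'GovtID': '123456789', 'SSNTypeCode': 'SSN'}
```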

Run-Time Configuration
Run-time configuration of the system is optimized in the following dimensions (see the figure below):
- Separation of implementation and configuration concerns for individual formats
- Reuse of common transformation logic and configuration

Transformations for each one of the source formats are created, maintained and extended in isolation from other sources. Transformation logic and configuration data may be versioned independently for each one of the sources.


Similarly, multiple versions may be created for each of the Common Components. Depending on the choice of run-time container for the transformations (Java or WPS), the invoking logic would have to maintain the knowledge of how multiple versions are compatible with each other. Such logic may be driven by a configuration file.

Project Methodology
The following activities are typical in projects associated with using the ACORD data and message standards:

1. Requirement, system, and data analysis
2. Creation of transformation specifications
3. Configuration system design
4. Common component implementation
   a. Schema minimization
   b. ACORD versioning
   c. Enrichment components
5. Source-specific component implementation
   a. Source schema customization
   b. Transformation development

The rest of this section details each of the activities.

Analysis
In this initial project phase, we thoroughly assess and catalogue existing input and output data samples, schemas, and other available relevant documentation.


Based on the assessment, we build a transformation profile outlining transformation patterns required for the implementation. The patterns include type conversions, common templates, lookup tables, etc.

Specifications
In this phase we customize our generic transformation specification template in accordance with the transformation profile. We then jointly populate the specification with transformation rules.

Configuration Design
The next step is to understand the AS-IS configuration system. Then, based on additional requirements, we collaboratively design an approach for a new configuration system. The new system needs to support two flavors of configuration:

- Centralized configuration for system-wide rules and properties; for instance, the addition of a new Product and corresponding Business Rules
- Source-specific configuration for individual source formats; for instance, conversion tables for code lists specific to the source

As a part of the design process we go through a number of life-cycle use-case scenarios to understand how the new system would be maintained and extended.

Common Components
This project phase includes a number of practical activities that need to be performed before source-specific transformation implementation may begin.

Schema Minimization

A very important step in the implementation process is to effectively minimize the full ACORD Life schema. The process is a combination of manual trimming and specialized tools that remove orphan elements.

ACORD Versioning

Versioning accommodates changes in schema that exist between two ACORD Life versions. For instance, versioning would move data from ACORD Life 2.5 to 2.13 in a single transformation. Versioning can be used as a black-box reusable transformation if it is applied to a consistent source format. For instance, a GBO Enterprise Life 2.5 to GBO Enterprise Life 2.13 versioning transformation may be used in conjunction with any transformation that produces GBO Enterprise Life 2.5.


However, if versioning is used in the context of a customized source format (like ABC Life 2.5 to GBO Enterprise Life 2.13), it is not reusable and becomes an embedded part of the custom source transformation. Techniques for creating a mapping for black-box or embedded versioning are the same. Exact determination of the approach needs to be made in the context of the project, when more information is available.

Enrichment Components

Enrichment logic should be reused by all sources and should be performed once the data is moved into a common format. At this stage we should have determined what the common format is. It is likely a custom version of ACORD Life 2.13 extended with Enterprise-specific elements. However, detailed requirements analysis and design need to be performed to determine the exact format and the enrichment functionality.

Source-specific Implementation
Source Schema Customization

Once an underlying ACORD Life schema for a transformation source format is minimized, it then needs to be customized in order to align with the source data. The goal of this work is to derive a schema against which a source data sample would validate. This includes the addition of custom extensions (OLifeExtension) as well as schema changes to accommodate source deviations from the standard Life schema. One type of change deals with schema content restrictions and data types. For instance, in the sample below, the source datatype needs to change from double to string or integer.

Another type of schema change is structural, where elements change order, element names, place in the hierarchy, etc.

Transformation Development

In this phase we use the customized schema and the transformation specification to implement the transformation. In the process of development, we produce a number of reusable transformation components (such as lookup tables, target skeletons, etc.) that can be reused across multiple source-specific implementations.


Chargeback Accounting

Challenge


The ICC operates within a larger organization that needs to make periodic decisions about its priorities for allocating limited resources. In this context, an ICC needs not only to secure initial funding but also to demonstrate that continued investment is justified. This Best Practice describes the options for estimating and allocating costs in an Informatica ICC. One of these options is to dispense with a chargeback mechanism and simply use a centrally funded cost center. However, there are additional benefits to a systematic approach to chargeback, not the least of which is the psychological impact on consumers and providers of the service. It also encourages users to understand the internal financial accounting of the organization so that customers and management alike can work within the realm of its constraints and understand how it relates to the culture.

Description
The kind of chargeback model that is appropriate varies according to several factors. A simple alignment between the five ICC models and a similar classification of chargeback models would be convenient but, in practice, there are several combinations of chargeback models that may be used with the ICC models. This Best Practice focuses on specific recommendations related to the most common and recommended patterns. This document also introduces an economic framework for evaluating funding alternatives and the organizational behavior that results from them. It includes the following sections:
- Economic Framework
- Chargeback Models - Alignment with ICC Models

Economic Framework
As the following figure illustrates, the horizontal dimension of the economic framework is the investment category, with strategic demands at one end of the spectrum and tactical demands at the other end. Strategic demands typically involve projects that drive business transformations or process changes and usually have a well-defined business case. Tactical demands are associated with day-to-day operations, or keeping the lights on. In the middle of the spectrum, some organizations have an additional category for infrastructure investments; that is, project-based funding focused on technology refresh or mandatory compliance-related initiatives. These are projects that are generally considered nondiscretionary and hence appear to be maintenance.

The vertical dimension is the funding source and refers to who pays for the services: the consumer or the provider. In a free market economy, money is used in the exchange of products or services. For internal shared services organizations, rather than exchanging real money, accounting procedures are used to move costs between accounting units. When costs are transferred from an internal service provider to the consumer of the service, it is generally referred to as a chargeback. The following figure shows the economic framework for evaluating funding alternatives.


If we lay these two dimensions out along the X and Y axes with a dividing line in the middle, we end up with these four quadrants:

1. Demand-Based Sourcing: This operating model responds to enterprise needs by scaling its delivery resources in response to fluctuating project demands. It seeks to recover all costs through internal accounting allocations to the projects it supports. The general premise is that the ICC can drive down costs and increase value by modeling itself after external service providers and operating as a competitive entity.
2. Usage-Based Chargeback: This operating model is similar to the Demand-Based Sourcing model but generally focuses on providing services for ongoing IT operations rather than project work. The emphasis, once again, is that the ICC operates like a standalone business that is consumer-centric, market-driven, and constantly improving its processes to remain competitive. While the Demand-Based Sourcing model may have a project-based pricing approach, the Usage-Based model uses utility-based pricing schemes.
3. Enterprise Cost Center: Typically, this operating model is a centrally funded function. This model views the ICC as a relatively stable support function with predictable costs and limited opportunities for process improvements.
4. Capacity-Based Sourcing: This operating model strives to support investment projects using a centrally funded project support function. Centrally funded ICCs that support projects are an excellent model for implementing practices or changes that project teams may resist. Not charging project teams for central services is one way to encourage their use. The challenge with this model is to staff the group with adequate resources to handle peak workloads and to have enough non-project work to keep the staff busy during non-peak periods.

In general, ICCs that are funded by consumers and are more strategic in nature rely on usage-based chargeback mechanisms. ICCs that are provider-funded and tactical rely on capacity-based sourcing or cost center models.


Chargeback Models
This Best Practice defines the following types of chargeback models:
- Service-Based Pricing
- Fixed Price
- Tiered Flat Rate
- Resource Usage
- Direct Cost
- Cost Allocation

Service-Based Pricing and Fixed Price


These are the most sophisticated of the chargeback models and require that the ICC clearly define its service offerings and structure a pricing model based on defined service levels. Service-based pricing is used for ongoing services, while fixed pricing is used for incremental investment projects. In other words, both are a fixed price for a defined service. This model is most suitable for a mature ICC that has well-defined service offerings and a good cost-estimating model.

Advantages:

- Within reasonable limits, the client's budget is based on its ability to make an informed decision on purchases from the supplier
- The client transfers the risk of cost over-runs to the ICC
- Incentive for the ICC to control and manage the project and deliver on time and on budget

Disadvantages:

- Internal accounting is more complex and there may not be a good mechanism to cover cost over-runs that are not funded by the client
- A given client may pay a higher price if the actual effort is less than expected (note: at an enterprise level this is not a disadvantage since cost over-runs on one project are funded by cost under-runs on other projects)

Tiered Flat Rate


The Tiered Flat Rate is sometimes called a utility pricing model. In this model, the consumer pays a flat rate per unit of service, but the flat rate may vary based on the total number of units or some other measure. This model is based on the assumption that there are economies of scale associated with the volume of usage and therefore the price should vary based on it.

Advantages:

- Within reasonable limits, the client's budget is based on its ability to make an informed decision on purchases from the supplier
- The client is encouraged to continue using ICC services as volume increases rather than looking for other sources


Disadvantages:
- May discourage the client from becoming more efficient and reducing usage, since dropping to a lower volume tier may cost more per unit of consumption

Resource Usage
The client pays according to resource usage; the following types of resources are available:
- Number of records
- Data volume
- CPU time

Informatica technology can support the collection of metrics for all three types of resource usage. PowerCenter Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the PowerCenter metadata repository. The Repository Manager generates the following views when you create or upgrade a repository.
- REP_WFLOW_RUN: This view displays the run statistics for all workflows by folder.
- REP_SESS_LOG: This view provides log information about sessions.
- REP_SESS_TBL_LOG: This view contains information about the status of an individual session run against a target.
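As a sketch of how such metadata could feed a number-of-records chargeback, the following Python example aggregates monthly row counts per project folder from the REP_SESS_LOG view. The rate, the connection (a throwaway SQLite stand-in is used here), and the column names in the query (SUBJECT_AREA, SUCCESSFUL_ROWS, ACTUAL_START) are assumptions; verify the column names against the MX view definitions for your PowerCenter version.

```python
# Illustrative sketch only -- column names and rates are assumptions.
import sqlite3   # stand-in for the real repository database driver

QUERY = """
    SELECT SUBJECT_AREA         AS project_folder,
           SUM(SUCCESSFUL_ROWS) AS rows_loaded
    FROM   REP_SESS_LOG
    WHERE  ACTUAL_START >= :month_start AND ACTUAL_START < :month_end
    GROUP  BY SUBJECT_AREA
"""

RATE_PER_MILLION_ROWS = 120.00   # hypothetical internal rate

def monthly_record_charges(conn, month_start, month_end):
    cur = conn.execute(QUERY, {"month_start": month_start, "month_end": month_end})
    return {folder: round(rows / 1_000_000 * RATE_PER_MILLION_ROWS, 2)
            for folder, rows in cur.fetchall()}

# Example usage against an in-memory table shaped like the view:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE REP_SESS_LOG (SUBJECT_AREA TEXT, SUCCESSFUL_ROWS INT, ACTUAL_START TEXT)")
conn.execute("INSERT INTO REP_SESS_LOG VALUES ('CLAIMS_DW', 4200000, '2008-05-10')")
print(monthly_record_charges(conn, "2008-05-01", "2008-06-01"))   # {'CLAIMS_DW': 504.0}
```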

Additionally, Informatica Complex Data Exchange can be used to parse the output from a range of platform and database accounting reports and extract usage metrics. The advantages and disadvantages of each resource measure are shown below.

a) Number of Records

Advantages:

- The number of records processed per period of time can be easily measured in PowerCenter metadata
- It is easier to compute the cost per record
- Not dependent upon server speed for measurement

Disadvantages:
- The number of rows may not equate to large data volumes
- It is difficult to fairly equate rows to a monetary amount
- Depending on the robustness of the implemented solution, it may take more hardware resources for Project A to process n rows than it takes Project B to process n rows

b) Data Volume


Advantages:
- More logical than counts of data records
- Easily measured in PowerCenter metadata

Disadvantages:
- Not simple to compute the total amount of machine resources used and hence the cost
- Depending on the robustness of the implemented solution, it may take more hardware resources for Project A to process n rows than it takes Project B to process n rows

c) CPU Time

Advantages:

- Easily measured by PowerCenter metadata
- Ability to measure server utilization versus the previous two methods

Disadvantages:
- Need to identify the processes by user to charge by this method
- In a shared hardware environment, if other processes are running on the server at the same time, the server run time may be longer, unfairly penalizing the customer processing this data

Direct Cost
The client pays the direct costs associated with a request, which may include incrementally purchased hardware or software as well as a per-hour or per-day cost for developers or analysts. The development group provides an estimate (with assumptions) of the cost to deliver based on its understanding of the client's requirements.

Advantages:

- Changes to requirements are easily absorbed into the project contract
- Project activities are independent of pricing pressures
- The client has clear visibility as to exactly what he or she is paying for

Disadvantages:
- The client absorbs the risk associated with both the product definition and the estimation of delivery cost
- The client must be concerned with, and therefore pay attention to, the day-to-day details of the ICC
- There is no cost incentive for the development group to deliver in the most cost-effective way and hence the total cost of ownership might be high

Cost Allocation

Costs can also be allocated on a more-or-less arbitrary basis irrespective of any actual resource usage. This method is typically used for ongoing operating costs but may also be used for project work. The general presumption with this method is that most IT costs are either fixed or shared and therefore should simply be allocated or spread across the various groups that utilize the services.

Advantages:

- Ease of budgeting and accounting
- All centralized and fixed costs are accounted for regardless of the demand

Disadvantages:
- Needs sponsorship from the executives for larger IT budgets rather than departmental funding
- High-level allocation may not be seen as fair if one business unit is larger than another
- There is little-to-no connection between the specific services that a consumer uses and the costs they pay (since the costs are based on other arbitrary measures)

Each model has a unique profile of simplicity, fairness, predictability and controllability from the consumer perspective, which is represented graphically in the following figure.

In general, Informatica recommends a hybrid approach of service-based, flat rate, and measured resource usage methods of charging for services provided to internal clients:


- Direct cost for hardware and software procurement
- Fixed price / service-based for project implementation
- Measured resource usage for operations
- Tiered flat rate for support and maintenance
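As a simple, hypothetical illustration of this hybrid approach (all rates and volumes are invented for the example), the sketch below assembles a monthly internal invoice for one consuming application from the four components listed above.

```python
# Illustrative sketch of a hybrid internal invoice; all figures are hypothetical.
SUPPORT_TIERS = [                 # (volume ceiling in GB/month, support rate per GB)
    (100, 800.0),
    (1_000, 200.0),
    (float("inf"), 50.0),
]

def tiered_support_charge(gb_per_month):
    for ceiling, rate in SUPPORT_TIERS:
        if gb_per_month <= ceiling:
            return gb_per_month * rate

def monthly_invoice(direct_costs, fixed_project_fee, gb_loaded, ops_rate_per_gb):
    return {
        "direct (hardware/software)":  direct_costs,
        "project (fixed price)":       fixed_project_fee,
        "operations (measured usage)": gb_loaded * ops_rate_per_gb,
        "support (tiered flat rate)":  tiered_support_charge(gb_loaded),
    }

invoice = monthly_invoice(direct_costs=12_000, fixed_project_fee=25_000,
                          gb_loaded=250, ops_rate_per_gb=40.0)
invoice["total"] = sum(invoice.values())
print(invoice)   # total: 97000.0
```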

Alignment with ICC Models


There are five standard ICC models, as illustrated in the figure below:

How do the five ICC models align with the financial framework, and is there an ideal approach for each of the ICC organizational models? The short answer is that it depends. In other words, many organizational constraints that can be linked to accounting rules, corporate culture, or management principles may dictate one approach or another. The reality is that any combination of the four financial models can be used with any of the five ICC models. That said, there is a common pattern, or recommended sweet spot, for how to best align the ICC model with the financial accounting models. The following figure summarizes that alignment.


The Best Practices ICC typically focuses on promoting integration standards and best practices for new projects or initiatives, which puts it on the strategic end of the budget spectrum. Furthermore, it is often a centrally funded group with little or no chargeback, in support of a charter to act as an organizational change agent. Zero chargeback costs encourage project teams to use the ICC and therefore spread the adoption of best practices.

The Standard Services ICC is often a hybrid model encompassing both centrally funded governance technology or governance activities (which service consumers are not likely to pay for) as well as training services or shared software development (especially in an SOA COE), typically charged back to projects.

The Shared Services ICC is the most common approach and may involve both project activities and operational activities. Because most Shared Services groups are organized as a federation, the charge-back accounting is complicated to the point where it is too cumbersome or meaningless (e.g., people costs are already distributed because the resources reside in different cost centers). If a charge-back scheme is used for a Shared Services ICC, it is typically a hybrid approach based on a combination of project charges and operational allocations.

The Central Services ICC requires more mature charge-back techniques based on service levels or usage. This is important because it requires strong consumer orientation and incentives to encourage responsiveness in order to be perceived positively and sustain operations. In most organizations with a Central Services group, if the service consumers do not feel their needs are being met, they find another alternative and, over time, the ICC is likely to disappear or morph into a shared services function. In other words, a centrally funded Central Services group is not a sustainable model. It puts too much emphasis on central planning, which results in dysfunctional behavior and therefore cannot be sustained indefinitely.

The Self-Service ICC is typically either 100 percent centrally funded with no chargeback or 100 percent fully cost recovered. This particular ICC can typically be outsourced, or operate internally on a fully-loaded cost basis, or be absorbed into the general network and IT infrastructure. A hybrid funding model for a Self-Service ICC is unusual.

Appendix A: Chargeback Case Studies


This section provides two ICC chargeback case studies based on real-world examples. The case studies have been disguised to allow us to be as specific as possible about the details.

Case Study #1: Charge-Back Model - ETL COE Production Support Chargeback


A large U.S.-based financial institution, BIGBANK, was looking for a way to reduce the cost of loading its Teradata-based data warehouse. The extract, transform, load (ETL) process was mainframe based, with an annual internal cost of more than $10 million, which was charged back to end users through an allocation process based on the percentage of data stored on the warehouse by each line of business (LOB). Load volume into the warehouse was 20 terabytes per month and demand for new loads was growing steadily. BIGBANK decided to implement a mid-range solution for all new ETL processes and eventually retire the more expensive mainframe-based solution.

Initial implementation costs of a highly scalable mid-range solution, including licensing, hardware, storage, and labor, were approximately $2.2 million annually. This solution consisted of an 11-node, grid computing-based Sun solution with a shared Oracle RAC data repository. Three nodes were dedicated to production, with two nodes each for development, system integration test, user acceptance test, and contingency. Estimated ETL load capacity for this solution was greater than 40 TB per month.

Management wanted to implement the new solution using a self-funding mechanism, specifically a charge-back model whereby the projects and business units using the shared infrastructure would fund it. To achieve this goal, the cost recovery model had to be developed and it had to be compelling. Furthermore, given that the ETL capacity of the new mid-range environment exceeded the load volumes of the existing Teradata warehouse, there was significant opportunity for expanding how many applications could use the new infrastructure.

The initial thought was to use load volumes measured in GB/month to determine charge-back costs based on the total monthly cost of the environment, which included the non-production elements. There would be an allocation to each LOB based upon data moved in support of a named application using the environment. Load volumes were measured daily using internal mid-range measurement tools and costs were assigned based upon GB moved per month. The problem with this approach was that early adopters would be penalized, so instead a fixed price cap was set on the cost/GB/month. Initially, the cost cap for the first four consumers was set at $800/GB to find the right balance between covering much of the cost and keeping the price point tolerable. The plan was to further reduce the cost/GB as time went on and more groups used the new system. After 18 months, with over 30 applications onboard and loading more than six TB/month, the GB/month cost was reduced to less than $50/GB.

Load volumes and the associated costs were tracked monthly. Every six months, the costs were adjusted based upon the previous six months of data and assigned to the appropriate named applications. Over time, the charge-back methodology of GB/month proved to be incomplete. Required labor was driven more by the number of load jobs per supported application and less by total volumes. The charge-back model was adjusted to tie labor costs to the total number of jobs per application per month. Hardware and software costs remained tied to GB loaded per month.

All in all, the charge-back approach was an effective way to use projects to fund a shared infrastructure. At the time of this writing, use of the new mid-range solution continues to grow. There is no set date when the legacy mainframe ETL jobs will be fully retired, but with all new ETL work being deployed on the mid-range infrastructure, the legacy jobs will gradually shrink due to attrition; eventually, it will be an easy decision to invest in migrating the remaining ones to the new environment.
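To show how the final BIGBANK allocation logic might look in miniature, the following Python sketch (with invented application names and amounts) allocates hardware and software costs by GB loaded per month, capped at a maximum rate per GB, and allocates labor costs by each application's share of load jobs.

```python
# Illustrative sketch of volume-plus-job-count chargeback; all numbers invented.
def monthly_chargeback(apps, hw_sw_cost, labor_cost, gb_rate_cap):
    total_gb = sum(a["gb_loaded"] for a in apps.values())
    total_jobs = sum(a["jobs"] for a in apps.values())
    gb_rate = min(hw_sw_cost / total_gb, gb_rate_cap)   # price cap protects early adopters
    charges = {}
    for name, a in apps.items():
        charges[name] = round(
            a["gb_loaded"] * gb_rate                     # infrastructure by volume
            + labor_cost * a["jobs"] / total_jobs,       # labor by job count
            2)
    return gb_rate, charges

apps = {
    "FRAUD_MART": {"gb_loaded": 900,  "jobs": 120},
    "RISK_DW":    {"gb_loaded": 2600, "jobs": 45},
    "CRM_FEED":   {"gb_loaded": 500,  "jobs": 35},
}
rate, charges = monthly_chargeback(apps, hw_sw_cost=180_000, labor_cost=60_000,
                                   gb_rate_cap=800.0)
print(round(rate, 2), charges)
# 45.0 {'FRAUD_MART': 76500.0, 'RISK_DW': 130500.0, 'CRM_FEED': 33000.0}
```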

Case Study #2: Charge-Back Model - ETL COE Initiative-Driven Capacity Chargeback


Building further upon the case study at BIGBANK, funding for incremental production support personnel was identified as a serious risk early in the stages of deployment of the new shared infrastructure. The new Data Integration environment and its associated processes were internally marketed as highly available and immediately ready for any application that needed the service. Over time, however, it became increasingly obvious that as more applications moved from project status to production status, incremental production support staff would be required. Forecasting that incremental production support labor and having the funds available to source the labor became increasingly challenging in an organization that planned base support budgets annually. In short, demand for production support resources was driven by projects, which was out of sync with the annual operating budget planning cycle.

As stated in the previous case study, the environment was self-funded by the applications using a fairly simple charge-back methodology. The charge-back methodology assumed that sufficient production support staff would be available to support all consuming applications. However, the data used to calculate a monthly application chargeback was based upon actual throughput metrics after several months in production. In other words, the metrics that showed that additional staff would be required became apparent months after the workload had already increased. When an application came aboard that required extensive support but did not have incremental labor forecast, the production support staff in place was forced to resort to heroic efforts to maintain the application. The resultant staff angst and internal customer dissatisfaction were significant.

To solve this issue, the concept of an operational surcharge to support the project before moving the application into production was instituted, based upon estimated data and job volumes. Called Operational Capacity from New Initiatives (OCNI), this cost was added to the project estimate before initial funding was allocated to the project. Once the project was approved and funds transferred between cost centers, the OCNI funds were pooled in a separate cost center (i.e., held in escrow), often from multiple projects, until the work volume in the environment exceeded prescribed limits (usually the average hours worked by the production support staff over four weeks). When work volume limits were exceeded, incremental staff was sourced and paid for with the escrowed OCNI dollars. At the end of the budget year, the incremental staff were moved to the base operating budget and the cycle started over again with the new budget year. This allowed the flexibility to rapidly add production support staff as well as effectively plan for base staff in the following budget year forecast.

The result was that operational support resources were more closely aligned with the actual support workload. The operational staff were not as stressed, the internal consumers were happier because their costs were included upfront in the planning cycle, the finance staff were pleased to make the model work within the constraints of the company's accounting rules, and IT management had increased confidence in making decisions related to the annual operating budget planning. In summary, it was a win-win for everyone.
Last updated: 29-May-08 16:48


Engagement Services Management

Challenge


Because Integration Competency Centers (ICCs) are, by definition, shared services functions that support many and varied customers, it is essential that they operate as internal businesses with a portfolio of services that potential customers can easily find, understand and order. This Best Practice focuses on defining and developing the various services to be provided by the ICC. The number of different services provided (e.g., Production Operations, Metadata Management and Integration Training) determines the initial size and scope of the ICC. Once the services have been defined and the appropriate capabilities established, the organization can consider sustaining the service(s) in an operational mode. This document does not address operations management or the service level agreement (SLA).

Description
Services to be offered by an ICC should include the following attributes:

- Name of service
- Description / narrative of service
- Who is the likely buyer or consumer of the service
- Value proposition
- Cost of service
- Ordering mechanism and delivery process

Many people confuse service definitions with delivery processes. There is a natural tendency for individuals to describe what they do from their own perspective rather than from the perspective of the customer who is the consumer of the service. Lack of clarity on this distinction is a primary cause of failed adoption of an ICC. It is imperative, therefore, to internalize this distinction as a first step. Any attempt to develop a service portfolio or value proposition in advance of obtaining this insight is pointless, because the result will be an organization that is perceived to be internally rather than externally focused, a situation that defeats the purpose of an ICC. The sequence of steps needed to fully define the portfolio of services is as follows:
- Define the services, which in turn
- Defines engagement and delivery processes, which in turn
- Specifies the capabilities and activities, which in turn
- Drives requirements for tools and skills.

In summary, the first step is to define the service from the customer's perspective. For example, consider a package shipping company. If it defines its service and value proposition as guaranteed rapid delivery of packages from anywhere to anywhere in the world, it is likely to optimize processes such as an extensive network of local delivery vehicles, a fleet of airplanes and sophisticated package sorting and tracking systems. If, on the other hand, it defines its service and value proposition as low-cost delivery of bulk goods to major U.S. cities, it is likely to optimize its network of truck and train delivery vehicles between major cities. Note that in this second scenario the customer is different (i.e., a smaller number of commercial customers instead of a large number of
consumer customers) and the value proposition is also different (i.e., low cost versus speed and flexibility). Thus, it is essential to begin with a description of the service based on a clear understanding of who the customer is and what the value proposition is from the customer's perspective. Once that has been established, you can begin to design the processes, including how the service will be discovered and ordered by the customers. After the process definitions are complete, the ICC can proceed to define the capabilities and activities necessary to deliver the service requests and also to determine the tools and staff skills required.

Note: Fulfillment elements such as capabilities, activities and tools (while essential to maintain competitive service delivery) are irrelevant to the customer. For example, customers don't care how the delivery company knows about the current status of each package, as long as the organization can report the status requested by the customer. Similarly for an ICC, the internal customers don't really care how the developer optimizes the performance of a given ETL transformation; they only care that it satisfies their functional and quality requirements.

There are two key tests for determining if the services have been defined at the appropriate level of detail:
- The first is to list and count them. If you have identified more than ten ICC services, you have probably mistaken fulfillment elements for services.
- The second is to apply a market-based price to the service. If an external organization with a comparable service description and a specific pricing model cannot be located, the service is probably defined at the wrong level.

Because defining services correctly is the foundation for a successful ICC operation, all automation, organization, and process engineering efforts should be postponed until all misconceptions are resolved. Service identification begins with the question, "What need or want are we fulfilling?" In other words, "What is our market, who are our customers and what do they require from us?" A service is defined in terms of explicit value to the customer and addresses items such as scope, depth, and breadth of services offered. For example, it should consider whether the service will be a one-size-fits-all offering, or whether gradations in service levels will be supported.

Individual services can then be aggregated into a service portfolio, which is the external representation of the ICC's mission, scope, and strategy. As such, it articulates the services that the ICC chooses to offer. Two points are implied:
- The ICC will consciously determine the services it offers; simply performing a service "because we always have" isn't relevant.
- No ICC can be world-class in everything; just as an enterprise may develop strategic partnerships to outsource non-core competencies, so must the ICC. This means sourcing strategically for services that are non-core to ensure that the ICC is obtaining the best value possible across the entire service portfolio. Outsourcing selected portions of the service delivery can have other benefits as well, such as the ability to scale up resources during periods of peak demand rather than hiring (and later laying off) employees.

A value proposition states the unique benefits of the service in terms the customer can relate to. It answers the questions, "Why should I (i.e., the customer) buy this service?" and "Why should I buy it from the ICC?" ICCs are well positioned (due to their cross-functional charter) to understand the nuances of internal customer needs, to predict future direction and to develop value propositions in a way that appeals to them.

The following figure is an example of an externally focused, value-based service portfolio for an ICC. It may be difficult to obtain relevant external benchmarks for comparison, but it should always be possible to find variants or lower-level services that can be aggregated, or higher-level services that can be decomposed. This sample list was generated by browsing the Internet, discovering five sites that offer similar services and then synthesizing the service descriptions to align with the ICC charter within the culture and terminology of a given enterprise.

See Selecting the Right ICC Model for a full list of services that can be provided.

Service Name: Product Evaluation & Selection
Value Proposition: The Product Evaluation & Selection service is a thorough, fact-based evaluation and selection process to provide a better understanding of differences among vendor offerings, seen in the light of the ICC landscape, and to identify vendors that best meet the requirements. This service considers all the enterprise requirements and resolves conflicts between competing priorities including security, performance, legal, purchasing, technology, standards, risk compliance and operational management (just to name a few). It is the most efficient way to involve cross-functional teams to ensure that once a product is selected, all the organizational processes are in place to ensure that it is implemented and supported effectively.
Consumer: Application Teams, Architecture, PMO
Cost: No charge for quick assessment and 1-page vendor brief; direct cost chargeback for in-depth evaluation.
Process: RFI, RFP

Service Name: Application Portfolio Optimization
Value Proposition: The Application Portfolio Optimization service provides a thorough inventory, complete assessment, analysis and rationalization for a specific LOB and its IT applications with respect to business strategy, current and future business and technology requirements and industry standards. It gives LOBs the ability to assess application rationalization opportunities across a variety of business functions and technology platforms. It provides a holistic application portfolio view of planning and investment prioritization that includes application system capability sequencing and dependencies (roadmaps). It provides support to LOB teams to reduce ongoing IT costs, improve operational stability and accelerate implementation of new capabilities by systematically reducing the number of applications and data replications.
Consumer: CIO Teams, LOB Business Executives
Cost: Negotiated Fixed Price
Process: Information Architecture Roadmap

Service Name: Integration Training & Best Practices
Value Proposition: The Integration Training & Best Practices service facilitates capturing and disseminating intellectual capital associated with integration processes, techniques, principles and tools to create synergies across the company. The integration practice leverages model-based planning techniques to simplify and focus complex decision-making for strategic investments. It includes a formal peer-review process for promoting integration practices that work well in one LOB or technology domain to a standard Best Practice that is applicable across the enterprise.
Consumer: Integration Team members across IT
Cost: No charge for ad-hoc support and brief (<1 hour) presentations; direct cost chargeback for formal training sessions.
Process: Integration Training, Internal Newsletter, Blogs, Brown Bag Presentations, Integration Principles

Service Name: Integration Consulting
Value Proposition: The Integration Consulting service enables project teams to tap into a group of dedicated domain experts to adopt and successfully implement new technologies. This service translates business and technology strategies into technical design requirements and assists projects with integration activities and deliverables through any and all stages of the project lifecycle. Performance is measured by factors such as investment expense, operating cost, system availability and the degree to which the solutions can support both the existing business strategy and be adapted to sustain emerging trends.
Consumer: Application Teams
Cost: Direct cost chargeback
Process: Integration Project Request

TIP
Each set of services based upon an ICC model allows the organization to focus on the type of services that will provide the best ROI for that model. Aligning services and resources to cost savings helps organizations derive value from the ICC.

Other Factors Affecting Service Offerings: Strategic vs. Tactical Priorities


Another significant challenge of determining service offerings is the question of targeting offerings based on Strategic Initiatives or Tactical Projects. For example, if an ICC has a charter to directly support all strategic initiatives and provide support to tactical projects on an advisory basis, the service portfolio might include several comprehensive service offerings for strategic initiatives (e.g., end-to-end analysis, design, development, deployment and ongoing maintenance of integrations) and offer a Best Practices Advisory service for tactical projects. By reviewing a list of IT projects provided by the Project Management Office (PMO) for an organization, projects can be scored on a 1 to 5 numerical scale or simply as High, Medium or Low depending on the level of cross-functional integration that is required. The following figure illustrates the number of tactical versus strategic projects that an ICC might address.


Once projects are categorized and scored with regard to integration needs, the ICC could provide central services, such as development management, for strategic projects that have a high index of integration, while projects with low levels of cross-functionality could be supported with a Best Practices ICC model. The goal is to focus ICC service offerings on strategic integration initiatives and to provide minimal ICC services for tactical projects.
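To make the scoring approach concrete, the hypothetical Python sketch below derives a 1-5 integration index from a few project attributes and maps it to a level of ICC involvement. The attribute names, weights and thresholds are illustrative assumptions only, not prescribed values.

    # Hypothetical sketch of scoring projects by level of cross-functional integration.
    # The attributes, weights and thresholds are illustrative assumptions only.

    def integration_score(num_source_systems, num_consuming_lobs, num_shared_data_domains):
        """Return a 1-5 integration index from a few simple project attributes."""
        raw = num_source_systems + 2 * num_consuming_lobs + num_shared_data_domains
        return min(5, max(1, round(raw / 4)))

    def icc_service_level(score):
        """Map a 1-5 score to a level of ICC involvement."""
        if score >= 4:
            return "Strategic: comprehensive ICC services, end to end"
        if score >= 2:
            return "Moderate: shared development and advisory services"
        return "Tactical: Best Practices advisory only"

    # Example: a project touching 6 source systems, 3 LOBs and 4 shared data domains
    print(icc_service_level(integration_score(6, 3, 4)))   # -> Strategic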

Summary
The key to a successful ICC is to offer a set of services that add value for the ICC's internal customers. Services can be very helpful in reducing project overhead for common functions in each data integration project. When such functions are removed from the project, the core integration effort can be reduced substantially.
Last updated: 29-May-08 16:48


Information Architecture

Challenge


Implementing best practices for a data governance program requires the following activities:

- Creating various views or models (i.e., levels of abstraction) for multiple stakeholders.
- Adopting a shared modeling tool and repository that supports easy access to information.
- Keeping the models current as the plans and environment change.
- Maintaining clear definitions of data, involved applications/systems and process flow/dependencies.
- Leveraging metadata for Data Governance processes (i.e., inquiry, impact analysis, change management, etc.).
- Clearly defining the integration and interfaces among the various Informatica tools and between Informatica tools and other repositories and vendor tools.

Description
Information architecture is the art and science of presenting and visually depicting concept models of complex information systems in a clear and simplified format for all of the various stakeholders and roles. There are three key elements of the Information Architecture best practice:

- Methodology for how to create and sustain the models
- Framework for organizing various model views
- Repository for storing models and their representations

Methodology
The information architecture methodology is described here in the context of a broader data governance methodology, as shown in the figure below. Many of the activities and techniques are applicable in other contexts such as data migration programs or data rationalization in support of mergers and acquisitions. It is the task of the architect and program team to tailor the methodology for a given program or enterprise strategy or purpose.

The following paragraphs provide a high-level description of the ten steps of the data governance methodology. The information architecture methodology described in this best practice is most closely aligned with step 3 and steps 5 through 10. For details on steps 1-4, refer to the Data Governance Enterprise Strategy document.

1. Organize Governance Committee: Identify the business and IT leaders that will serve as the decision group for the enterprise, define the committee charter and business motivation for its existence, and establish its operating model. Committee members need to understand why they are there, know the boundaries of the issues to be discussed, and have an idea of how they will go about the task at hand.
2. Define Governance Framework: Define the what, who, how and when of the governance process; document data policies, integration principles and technology standards that all programs must comply with.
3. Develop Enterprise Reference Models: Establish top-down conceptual reference models including a) Target Operating Blueprint, b) Business Function/Information Matrix and c) Business Component Model.
4. Assign Organizational Roles: Identify data owners and stewards for information domains, responsible parties/owners of shared business functions in an SOA strategy, or compliance coordinators in a Data Governance program.
5. Scope Program: Leverage the enterprise models to clearly define the scope of a given program and develop a plan for the roadmapping effort. Identify the high-level milestones required to complete the program and provide a general description of what is to take place within each of the larger milestones identified.
6. Assess Baseline and Data Quality: Leverage the enterprise models and the scope definition to complete a current-state architectural assessment, profile data quality, and identify business and technical opportunities.
7. Develop Target Architecture: Develop a future-state data/systems/service architecture in an iterative fashion in conjunction with Step 6. As additional business and technical opportunities become candidates for inclusion, the projected target architecture will also change.
8. Plan Migration Roadmap: Develop the overall program implementation strategy and roadmap. From the efforts in Step 5, identify and sequence the activities and deliverables within each of the larger milestones. This is a key part of the implementation strategy, with the goal of developing a macro-managed roadmap which adheres to defined best practices. Identifying activities does not include technical tasks, which are covered in the next steps.
9. Develop Program Models: Create business data models and information exchange models for the defined program (i.e., logical and physical models are generally created by discrete projects within the program). The developed program models use functional specifications in conjunction with technical specifications.
10. Implement Projects: This is a standard project and program management discipline with the exception that some data governance programs have no defined end. It may be necessary to loop back to step 5 periodically and/or provide input to steps 2, 3 or 4 to keep them current and relevant as needs change. As the projects are implemented, observe which aspects could have been more clearly defined and at which step an improvement should take place.

Information Architecture Framework


The information architecture framework is illustrated in the following figure.

Key features of the framework include:

- A four-layer architecture, with each layer focusing on a level of abstraction that is relevant for a particular category of stakeholder and the information they need:
  - Layer 4 Enterprise View: Overarching context for information owners & stewards
  - Layer 3 Business View: Domain models for business owners and project sponsors
  - Layer 2 Solution View: Architecture models for specific systems and solutions
  - Layer 1 Technology View: Technical models for developers, engineers and operations staff
- Layer 3 is based on the reference models defined in Layer 4; these layers are developed from a top-down perspective.
- Layers 1 and 2 are created top-down when doing custom development (i.e., able to control and influence the data models) and bottom-up when doing legacy or package integration (i.e., little ability to control the data model and generally a need to reverse engineer the models using analytical tools).
- Relevant information about the models is maintained in a metadata repository, which may be centralized (i.e., contains all metadata) or federated (i.e., contains some metadata as well as common keys that can be used to link with other repositories to develop a consolidated view, as required).
- Separate models for representing data at rest (i.e., data persisted in a repository and maintained by an application component) and data in motion (i.e., data exchanged between application components).

Reference Models
There are three models in the enterprise reference model layer of the information architecture framework, but only one instance of these models for any given enterprise.

- Target Operating Blueprint: Business context diagram for the enterprise showing key elements of organizational business units, brands, suppliers, customers, channels, regulatory agencies, and markets.
- Business Function/Information Matrix: Used to define essential operational capabilities and related service and information flows to generate a create/use matrix. Basic service functions are used as navigation points to process models (i.e., reflected in process and target systems models).
- Business Component Model: Used to define reference system families derived from the create/use matrix. Serves as a navigation point into other systems view models.

Reference models may be purchased, developed from scratch, or adapted from vendor/industry models. A number of IT vendors and analyst firms offer various industry or domain-specific reference models. The level of detail and usefulness of the models varies greatly. It is not in the scope of this best practice to evaluate such models, only to recognize that they exist and may be worthy of consideration.

There are also a significant number of open industry standard reference models that should be considered. For example, the Supply-Chain Operations Reference (SCOR) is a process reference model that has been developed and endorsed by the Supply-Chain Council (SCC) as the cross-industry de facto standard diagnostic tool for supply chain management. Another example is the Mortgage Industry Standards Maintenance Organization, which maintains process and information exchange definitions in the mortgage industry. The reference models that are available from Proact, Inc. are particularly well suited to data integration and data governance programs.

Some key advantages of buying a framework rather than developing one from scratch include:
- Minimizing internal company politics: Since most internal groups within a company have their own terminology (i.e., domain-specific reference model), it is often a very contentious issue to rationalize differences between various internal models and decide which one to promote as the common enterprise model. A common technique that is often attempted, but frequently fails, is to identify the most commonly used internal model and make it the enterprise model. This can alienate other functions who don't agree with the model and can, in the long run, undermine the data governance program and cause it to fail. An effective external model, however, can serve as a rallying point and bring different groups from the organization together rather than pitting them against each other or forcing long, drawn-out debates.
- Avoiding paving the cow path: The cow path is a metaphor for the legacy solutions that have evolved over time. An internally developed model often tends to reflect the current systems and processes (some of which may not be ideal) since there is a tendency to abstract away details from current processes. This in turn can entrench current practices which may in fact not be ideal. An external model, almost by definition, is generic and does not include organization-specific implementation details.
- Faster development: It is generally much quicker to purchase a model (and tailor it if necessary) than to develop a reference model from the ground up. The difference in time can be very significant. A rough rule of thumb is that adopting an external model takes roughly one to three months while developing a model can take one to three years. While the reference model may involve some capital costs, the often hidden costs of developing a reference model from scratch are much greater.

Regardless of whether you buy or build the reference models, in order for them to be effective and successful, they must have the following attributes:

- Holistic: The models must describe the entire enterprise and not just one part. Furthermore, the models must be hierarchical and support several levels of abstraction. The lowest level of the hierarchy must be mutually exclusive and comprehensive (ME&C), which means that each element in the model describes a unique and non-overlapping portion of the enterprise while the collection of elements describes the entire enterprise. Note: It is critical to resist the urge to model only a portion of the enterprise. For example, if the data governance program focus is on customer data information, it may seem easier and more practical to only model customer-related functions and data. The issue is that without the context of a holistic model, the definition of functions and data will inherently be somewhat ambiguous and therefore be an endless source of debate and disagreement.
- Practical: It is critical to establish the right level of granularity of the enterprise models. If they are too high-level, they will be too conceptual; if they are too low-level, the task of creating the enterprise models can become a boiling-the-ocean problem and consume a huge amount of time and resources. Both of these extremes of too little detail or too much detail are non-practical and the root cause of failure for many data governance programs.

TIP
There are two secrets to achieving the right level of granularity. First, create a hierarchy of functions and information subjects. At the highest level it is common to have in the range of 5-10 functions and information subjects that describe the entire enterprise. Second, at the lowest level in the hierarchy, stop modeling when you start getting into "how" rather than "what". A good way to recognize that you are in the realm of "how" is if you are getting into technology-specific or implementation details. A general rule of thumb is that an enterprise reference model at the greatest level of detail typically has between 100 and 200 functions and information subjects.

- Stable: Once developed, reference models should not change frequently unless the business itself changes. If the reference models did a good job separating the "what" from the "how", then a business process change should not impact the reference models; but if the organization expands its product or service offerings into new areas, either through a business transformation initiative or a merger/acquisition, then the reference model should change. Examples of scenarios that would cause the reference model to change include a retail organization transforming its business by manufacturing some of its own products, or a credit card company acquiring a business that originates and services securitized car and boat loans.

Reference models, once created, serve several critical roles:

1. They define the scope of selected programs and activities. The holistic and ME&C nature of the reference models allows a clear definition of what is in scope and out of scope.
2. They provide a common language and framework to describe and map the current-state enterprise architecture. The reference model is particularly useful for identifying overlapping or redundant applications and data.
3. They are particularly useful for identifying opportunities for different functional groups in the enterprise to work together on common solutions.
4. They provide tremendous insight for creating target architectures that reflect sound principles of well-defined but decoupled components.

Information Model
There are two information models on the Business View (Layer 3) of the information architecture framework. These are sometimes referred to as semantic models since there may be separate instances of the models for different business domains.

- Business Glossary: List of business data elements with a corresponding description, enterprise-level or domain-specific, validation rules and other relevant metadata. Used to identify source of record, quality metrics, ownership authority and stewardship responsibility.
- Information Object Model: Used to provide traceability from the enterprise function and information subject models to the business glossary (i.e., an information object includes a list of data elements from the business glossary). Possible use for assessing current information management capabilities (reflected in process and target systems models) or as a conceptual model for custom-developed application components.

The Business Glossary is implemented as a set of objects in Metadata Manager to capture, navigate, and publish business terms. This model is typically implemented as a custom extension to Metadata Manager (refer to the Metadata Manager Best Practices for more details) rather than as Word or Excel documents (although these formats are acceptable for very simple glossaries in specific business domains). The Business Glossary allows business users, data stewards, business analysts, and data analysts to create, edit, and delete business terms that describe key concepts of the business. While business terms are the main part of the model, it can also be used to describe related concepts like data stewards, synonyms, categories/classifications, rules, valid values, quality metrics, and other items. Refer to Metadata Manager Business Glossary and the Data Quality and Profiling Best Practices for more information.

Creating a data model for the business glossary is a normal data modeling activity and may be customized for each enterprise based on needs. A basic version of the model should contain the following classes, properties and associations:

- Category (name, description, context)
- Business Term (name, description, context, rule, default value, quality score, importance level)
- Data Steward (name, description, email address, phone number)
- Domain (either a data type, a range, or a set of valid values)
- Valid Value (name of the value itself and a description)

Relationships:

- Category Contains BusinessTerm
- Category Contains Category
- DataSteward Owns BusinessTerm
- BusinessTerm Has ValidValue
- BusinessTerm HasSynonym BusinessTerm
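A minimal Python sketch of this basic glossary model is shown below. The class and attribute names follow the list above; the data-structure form itself is illustrative and is not the Metadata Manager implementation.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DataSteward:
        name: str
        description: str = ""
        email: str = ""
        phone: str = ""

    @dataclass
    class ValidValue:
        name: str
        description: str = ""

    @dataclass
    class Domain:
        # Either a data type, a range, or a set of valid values
        data_type: Optional[str] = None
        value_range: Optional[str] = None
        valid_values: List[ValidValue] = field(default_factory=list)

    @dataclass
    class BusinessTerm:
        name: str
        description: str = ""
        context: str = ""
        rule: str = ""
        default_value: str = ""
        quality_score: Optional[float] = None
        importance_level: str = ""
        steward: Optional[DataSteward] = None                           # DataSteward Owns BusinessTerm
        valid_values: List[ValidValue] = field(default_factory=list)    # BusinessTerm Has ValidValue
        synonyms: List["BusinessTerm"] = field(default_factory=list)    # BusinessTerm HasSynonym BusinessTerm

    @dataclass
    class Category:
        name: str
        description: str = ""
        context: str = ""
        terms: List[BusinessTerm] = field(default_factory=list)         # Category Contains BusinessTerm
        subcategories: List["Category"] = field(default_factory=list)   # Category Contains Category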

The following diagram shows the discovery, definition and specification flow among various users and between the analyst tools and PowerCenter. The diagram illustrates pictorially the general processes involved in several different use cases.

Process Models
There are three process models on the Business View (Layer 3) of the information architecture framework.
- Information Exchange Model: Used to identify information exchanges across business components, including use of integration systems (e.g., hubs, busses, warehouses, etc.) to enable the exchanges. All of the various information exchanges are represented within a single diagram or model at this level of abstraction.
- Operational Sequence Model: Alternate representation for operational scenarios using UML techniques. Individual operational processes may have their own model representation.
- Business Event Model: A common (canonical) description of business events including business process trigger, target subscribers and payload description (from elements in the business glossary). This model also fits into the classification of semantic models, but in this case for data in motion rather than data at rest.

For further details on how to develop the information exchange model and operational sequence model, refer to the Proact enterprise architecture standards.

Repository
Earlier sections of this document addressed the information architecture methodology and framework. The final dimension of the Information Architecture best practice is a repository-based meta-model and modeling tool. Informatica's Metadata Manager provides an enterprise repository to store models as well as other customizable metadata. Metadata Manager also offers an integrated solution with PowerCenter metadata. Refer to the Metadata Management Enterprise Competency for more information.

Last updated: 03-Jun-08 15:13


People Resource Management

Challenge


Selecting the right team members to staff an Integration Competency Center (ICC) is a significant implementation challenge, as is developing the team members' skills to correspond with the ICC's service offerings. Each member of the team has strengths and weaknesses in a variety of disciplines. The key to building a successful ICC team is to fit the discipline strengths to the needs of the ICC. This Best Practice focuses on the process of selecting individual team members and organizing the team with appropriate reporting structures. Overall, the team members should have the following qualities, which promote team unity and allow them to work effectively in a cross-functional, multi-disciplinary environment.

Competency: Know the Business (Understands the complexities and key factors that impact the business)

Junior ICC Staff: Understands the company's vision, mission, strategy, goals, and culture. Has a general knowledge of business and integration program operations. Articulates the business value delivered through the ICC.

Senior ICC Staff: Understands and articulates the integration and financial plan for the current year and how those plans support the strategy and goals of the enterprise. Continually examines competitor and best-in-class performers to identify ways to enhance the integration service and value to the enterprise. Integrates industry, market, and competitive data into the integration planning process. Proactively connects others to best-in-class performance to drive enhancement of the integration practice. Analyzes and understands the needs of all key stakeholders and takes steps to ensure their continued satisfaction. Partners with LOB teams to define and develop shared goals and expectations. Finds and develops common ground among a wide range of stakeholders. Coaches staff across the organization on integration practices.

Master ICC Staff: Demonstrates deep/broad integration and functional skills. Demonstrates a business perspective that is much broader than one function/group. Cuts to the heart of complex business/functional issues. Monitors and evaluates the execution and quality of the strategic planning process.

Competency: Collaborate (Integrates and collaborates well inside and outside of the organization, holding a customer-centric position)

- Surfaces and resolves conflict with minimal noise. Builds partnerships across the integration community and positively represents the team in the services we provide. Builds trust by consistently delivering on promises.
- Builds broad-based business relationships across the organization (including business executives). Creates win/win scenarios with key vendors. Leverages external industry organizations to achieve enterprise goals.


Competency: Customer Focus (Knows and cares about customers; works well in a team to exceed expectations)

Junior ICC Staff: Defines and reviews the requirements, deliverables and costs of proposed solutions. Identifies variances from system performance and customer requirements and collaborates with LOB teams to improve the variances. Ensures each recommended solution has a scope, timelines, desired outcomes, and performance measures that are well defined and communicated. Asks open-ended probing questions to further define problems, uncover needs and clarify objectives. Maintains a current view of best-in-class practices through self-learning and benchmarking. Scans the environment to remain abreast of new developments in business & technology trends.

Senior ICC Staff: Contracts and sets clear expectations with internal customers about goals, roles, resources, costs, timing, etc. Positions and sells business and technology partners on innovative opportunities impacting people, process and/or technology. Forecasts how the business is changing and how IT will need to support it.

Master ICC Staff: Advises senior executives on how solutions will support short- and long-term strategic direction. Drives multi-year strategy and funding/cost-saving opportunities across the enterprise.

Competency: Drive for Learning (Sizes up and acts on the learning implications of business strategy and its execution)

- Improves the quality of the integration program by developing and coaching staff across the enterprise to build their individual and collective performance and capability to the standards that will meet the current and future needs of the business. Is recognized as an expert in one or more broad integration domains.
- Is recognized as an expert outside of the enterprise in one or more integration domains. Represents the enterprise on industry boards and committees. Participates in major industry and academic conferences. Serves as an active member in international standards committees.

Competency: Capitalize on Opportunities (Recognizes possibilities that increase the depth of integration solutions)

Junior ICC Staff: Identifies patterns and relationships between seemingly disparate data to generate new solutions or alternatives. Gathers necessary data to define the symptoms and root causes (who, what, why and costs) of a problem. Develops alternatives based on facts, available resources, and constraints.

Senior ICC Staff: Initiates assessments to investigate business and technology threats and opportunities. Translates strategies into specific objectives and action plans. Collaborates across functional groups to determine impact before implementing new processes and procedures. Uses financial, competitive and statistical modeling to define and analyze opportunities. Integrates efforts across LOBs to support strategic priorities.

Master ICC Staff: Uncovers hidden growth opportunities within market/industry segments to create competitive advantage. Formulates effective strategies consistent with the business and competitive strategy of the enterprise in a global economy. Identifies factors in the external and internal environment affecting the organization's strategic plans and objectives.


Competency: Change Leadership (Initiates and creates conditions for change)

Junior ICC Staff: Does not wait for orders to take action on new ideas. Expresses excitement freely concerning new ideas and change.

Senior ICC Staff: Transcends silos to achieve enterprise results. Skillfully influences peers and colleagues to promote and sell ideas. Displays personal courage by taking a stand on controversial and challenging changes. Leads integrated charge across LOB organizations to achieve competitive advantage. Identifies opportunities, threats, strengths and weaknesses of the enterprise. Demonstrates a sense of urgency to capitalize on innovations and opportunities. Challenges the status quo. Acts in a strategic role in the development and maintenance of integrations for a line of business or infrastructure sub-domain that are in compliance with Enterprise standards. Provides in-depth technical and systems consultation to internal clients and technical management to ensure alignment with standards. Guides the organization in proper application of integration practice. Leverages both deep and broad technical knowledge, strong influencing and facilitation skills, and knowledge of integration processes and techniques to influence organizational alignment around a common direction.

Master ICC Staff: Leverages industry, market and competitor trends to make a compelling case for change within the company. Mobilizes the organization to adapt to marketplace changes. Proactively plans responses to new and disruptive technologies.

Competency: Organizational Alignment (Creates process and infrastructure to carry out plans and strategies)

- Advises application teams on technology direction. Develops and maintains business system and corporate integration solutions. Responsible for working on medium to complex integration projects, recommending exceptions to standards, reviewing and approving architectural impact designs and directing implementation of the integration for multiple applications. Conducts complex technology and system assessments for component integration. Acts as a lead in component integration and participates in enterprise integration activity.
- Performs as the integration subject matter expert in a specific domain. Organizes, leads, and facilitates cross-entity, enterprise-wide redesign initiatives that will encompass an end-to-end analysis and future-state redesign that requires specialized knowledge or skill critical to the redesign effort.


Competency: Accountable (Can be counted on to strive for outstanding results)

Junior ICC Staff: Asks probing questions to uncover and manage the needs and interests of all parties involved. Explores alternative positions and identifies common interests to reach a win/win outcome.

Senior ICC Staff: Skillfully influences others to acquire resources, overcome barriers, and gain support to ensure team success. Negotiates project timeline changes to meet unforeseen developments or additional unplanned requests. Escalates issues to appropriate parties when a decision cannot be reached. Assumes accountability for delivering results that require collaboration with individuals or groups in multiple functions. Collaborates with partners across functions to define and implement innovations that improve process execution and service delivery.

Master ICC Staff: Skillfully influences peers and management to promote and sell ideas. Is accountable for planning, conducting, and directing the most complex, strategic, corporate-wide business problems to be solved with automated systems. Engages others in strategic discussions to leverage their insights and create shared ownership of the outcomes.

Description
For each of the ICC models, the number of shared resources increases with the size of the organization and the type of ICC model that is chosen. In the following table, the number of ICC staff is represented as a percentage of the total IT staff (i.e., total IT includes both internal employees and external contract staff). For example, if a Best Practices ICC is implemented for a company with 100 IT staff, the ICC would require one to two resources; for a Shared Services ICC, it would require five to ten resources. Anything less than one dedicated resource means there is no actual ICC.

* - Number of ICC shared resources as a percentage increases dramatically as Integration Developers are added to perform the integration as part of the ICC
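Since the sizing table itself is not reproduced here, the Python sketch below only encodes the two ratio ranges that can be inferred from the 100-person example in the text; treat the percentages as assumptions, not as figures from the table, and adjust them for the actual organization and ICC model.

    # Rough sizing sketch. The ratio ranges are assumptions inferred from the example
    # above (100 IT staff -> 1-2 for Best Practices, 5-10 for Shared Services); actual
    # ratios come from the referenced table and vary by organization and ICC model.

    ASSUMED_STAFF_RATIOS = {
        "Best Practices": (0.01, 0.02),
        "Shared Services": (0.05, 0.10),
    }

    def icc_staff_estimate(total_it_staff, model):
        low, high = ASSUMED_STAFF_RATIOS[model]
        return max(1, round(total_it_staff * low)), max(1, round(total_it_staff * high))

    print(icc_staff_estimate(100, "Best Practices"))   # -> (1, 2)
    print(icc_staff_estimate(100, "Shared Services"))  # -> (5, 10)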

Best Practice Model


The recommended minimum number of dedicated, shared resources is usually two. The maximum size of the group depends upon the amount of infrastructure currently in the organization. The roles for this model include the following:


Training and Knowledge Coordinator



Develops and maintains mapping patterns, reusable templates, and best-practices documents, usually stored in a central location and readily available to anyone in the enterprise. Coordinates vendor-offered or internally sponsored training in specific integration technology products. Also prepares and delivers events such as seminars or internal presentations to various teams

Metadata Specialist / Data Steward



Metadata Specialist: Creates standards for capturing and maintaining metadata. The metadata repositories themselves will likely be maintained by various groups around the enterprise in a federated model, so the focus in this model is to define and enforce federated metadata standards and processes to sustain them.

Data Steward: Responsible for the data ownership and stewardship of business elements within a particular subject area. Handles data definition and attributing. Works in conjunction with the Metadata Specialist.

The Metadata Specialist role is responsible for capturing the sources of information governed by the ICC while the Data Steward is usually a business user or knowledgeable IT resource that is very familiar with the data. This is a corporate investment that may begin by cataloguing information and ultimately becomes the repository for business information.

Technology Standards Model


Up to four shared resources are typically required to support the evolution process. Note that in smaller teams, one individual may play more than one role. The primary roles for this model are shown in the organization chart below:

Standards Coordinator

Actively monitors and promotes industry standards and adapts relevant ones to the needs of the enterprise. Defines, documents, and communicates internal enterprise standards. May also act as the companys official representative to external standards organizations, and may propose or influence the development of new industry standards. Works with the Knowledge Coordinator to publish standards within the organization.

Technical Architect

Develops and maintains the layout and details of the software configurations and physical hardware used to support the ICC, including tools for the ICC operations (e.g., monitoring, element management, metadata repository, and scanning tools) or for the middleware systems (e.g., message brokers, Web service application servers, and ETL hubs).


Vendor Manager

Leads the efforts to select integration products and participates in the selection of vendors for the servers, storage, and network facilities needed for integration efforts. Handles vendor relationships on an ongoing basis, including maintaining awareness of trends, supporting contract negotiations, and escalating service and support issues.

The organization may be a formal one if an ICC Director has been named or may be informal if the individuals report to different managers and coordinate their efforts through some type of committee or consensus team. Many of the roles may be part of an overall steering committee with each of these roles reporting to an unofficial management group.

Shared Services Model


The shared services model adds a management structure and organization in contrast to the prior models that are typically based on a committee-type structure. The management structure of this model generally provides development support resources and also begins to perform consulting for each of the key projects. The number of services is defined by the ICC Director based on the scope and mission of the organization. This model is probably the most dynamic of any since the scope of services dictates the initial number of roles. Refer to Engagement Services Management for options.

The above organization chart includes the key positions for this model. Any of the positions below may also be added:

ICC Director

Ensures that the ICC is in alignment with the IT strategy and anticipates business requirements. Manages the ICC staff, prepares business cases for integration investment initiatives, and is responsible for the annual budgets.

Data Architect

Provides project-level architecture review as part of the design process for data integration projects, develops and maintains the enterprise integration data model, supports complex data analysis and mapping activities, and enforces enterprise data standards.

Project Manager

Supplies full-time management resources experienced in data integration to ensure project success. Is adept at managing dependencies between teams, identifying cross-system risks, resolving issues that cut across application areas, planning complex release schedules, and enforcing conformance to an integration methodology.

Quality Assurance Specialist



Develops QA standards and practices for integration testing, furnishes data validation on integration load tasks, supports string testing (a unit test for integration elements) activities in support of integration development, and leads the execution of end-to-end integration testing.

Change Control Specialist



Manages the migration to production of shared objects that may impact multiple project teams, determines impacts of

changing one system in an end-to-end business process, and facilitates a common release schedule when multiple system changes need to be synchronized.

Business Analyst

Facilitates solutions to complex, cross-functional business challenges; evaluates the applicability of technologies, including commercial ERP software; models business processes; and documents business rules concerning process and data integrations.

Integration Developer

Reviews design specifications in detail to ensure conformance to standards and identify any issues upfront, performs detailed data analysis activities, develops data transformation mappings, and optimizes performance of integration system elements.

Centralized Services Model


The centralized services model formalizes both the development and production resources as part of the ICC. It focuses primarily on actual project delivery and production support; thus, the roles of client management and production control are introduced. Again, the organization of the ICC differs from organization to organization, as centralized resources such as Security and Production Control may already exist and can be leveraged.

The key positions added in this model include:

Engagement Manager

Maintains the relationship with the internal customers of the ICC. Acts as the primary point of contact for all new work requests, supplies estimates or quotes to customers, and creates statements of work (or engagement contracts).

Repository Administrator

Ensures leverage and reuse of development assets, monitors repository activities, resolves data quality issues, and requests routine or custom reports.

Security Administrator

Provides access to the tools and technology needed to complete data integration development and overall data security, maintains middleware-access control lists, supports the deployment of new security objects, and handles configuration of middleware security.

Sample Organization Charts


The following organization charts represent samples of a Shared Services model and a Central Services model, respectively. The solid-line relationships indicate responsibility for setting priorities, conducting performance reviews and establishing development priorities. The dashed-line relationships indicate coordination, communication, training and development responsibilities. These relationships need a strong affiliation with the resource manager in setting priorities, but the ICC is not fully responsible for the resource management component.

The chart above shows an example of a Shared Services model organization structure. This model includes both solid- and dashed-line relationships, with the standards and development groups represented by dashed lines. The focus of this model is to build the Operations, Metadata, and Education functions, which coordinate with the Standards committees and Development groups. The Shared Services enable the development teams to succeed with the integration effort, but only influence the associated priorities and policies. As the Shared Services model matures and becomes more recognized in the IT organization, the ICC should become increasingly responsible for project development.

The following chart shows an example of a developed Central Services model organization. This model shows only dashed-line relationships with the Standards Committee and the Project Management Office (PMO).


The focus of this model is to build the project development capability through the introduction of Data Movement Architecture, Engagement Management and Development capabilities. These disciplines enable the ICC to have more influence in establishing priorities so that integration issues become more important to the development process.

Last updated: 29-May-08 16:48


Planning the ICC Implementation

Challenge


After choosing a model for an Integration Competency Center (ICC), as described in Selecting the Right ICC Model, the challenge is to get the envisioned ICC off the ground. This Best Practice answers the Why, What, When and Who questions for the next 30, 60, 90 and 120+ day increments in the form of activities, deliverables and milestones. The most critical factor in the planning is choosing the right project with which to start the ICC.

Description

Different ICC Models Require Different Resources


As neither the Best Practices nor the Technical Standards ICC has its own platform, these models require significantly fewer activities than the Shared Services or Central Services models. The diagram below shows how they can improve project delivery through the sharing of past experiences.

Planning the Best Practices ICC


The Best Practices ICC does not require the implementation of a platform. It can be developed and built from the ground up by documenting current practices that work well and taking the time to improve upon them. This is best done by a group that carries out many projects and that can make the time to review its processes. The processes can be improved upon and then published on the company intranet for other project teams to make use of.

The Best Practices ICC model does not enforce standards, so a certain amount of evangelizing may be required for the practices to be adopted by other project teams. It is only with the agreement and adoption of the best practices by others that an ICC results. Therefore, navigating the political and personal goals of individuals leading other projects and getting their buy-in to the ICC best practices is important.

Planning the Technical Standards ICC


As discussed above, the Best Practices ICC model does not include an enforcement role. To achieve enforcement of standards across projects, there must be managerial agreement or centralized edicts that determine and enforce standards. In planning the development of a Technical Standards ICC, the key elements are authority and consensus. Either this will be a practice consolidation exercise in its own right, or the model is established by completing a project successfully according to best practices and obtaining agreement to turn those best practices into enforced standards and approve exceptions. The Technical Standards ICC does not necessarily have an implementation of a common shared platform; enforcement of standards, though, does mean that a common and shared platform is the first logical extension of this model.

Executive Sponsorship

The most critical success factor for an ICC is having unwavering executive sponsorship. Since an ICC brings together people, processes, and technology, executive (typically CIO-level) sponsorship is needed to institute the level of organizational change necessary to implement the ICC. Since an ICC is a paradigm shift for employees who are accustomed to a project-silo approach, there can be resistance to a new, more efficient paradigm. Sometimes resistance is due to perceived job insecurity stemming from an ICC. This perception should be curtailed, as a functioning ICC actually opens the door for more data integration opportunities due to the lower cost of integrating the data. Executive-level sponsorship will greatly help the perception of the ICC and will facilitate the necessary level of organizational change.

Note: In practice, Informatica has found that executive sponsorship is most crucial in organizations that are most resistant to change. Examples include government or educational entities, financial institutions, or organizations that have a long, established history.

Next, it is important for the IT organization to recognize the value of data integration. If the organization is not familiar with data integration or Informatica, a successful project must occur first. A level of success must be established with data integration and Informatica before an ICC can be established. Establishing successful projects and proving value quickly and incrementally is essential. Once the value of data integration has been established within the organization, the business case can be made to implement an ICC to lower the incremental cost of data integration. Just as quick wins were important in establishing the business case for an ICC, they are also important as the ICC is implemented. As such, the 30/60/90/120+ day plan below outlines an example ICC rollout plan with incremental deliverables and opportunities for quick wins.

Initial Project Selection


Planning an ICC involves the considered selection of an initial project to showcase ICC function and success. An appropriate project is needed that avoids risk to the ICC approach itself. A high-risk, complex project with high visibility that fails as the first project (even if not due to the ICC implementation) could damage the standing of the ICC within the organization. Therefore, high-risk and overly complex projects should be avoided. The initial project should also be representative of other projects planned for the next year. If this is the case, there is likely to be more scope for sharing and reuse between the projects, and therefore more benefit will be derived from the ICC. The initial project should be carefully chosen and ideally fit the following criteria:
- A pilot project
- Moderately challenging
- Representative of other projects to be undertaken in the next year

When choosing an initial project for the ICC, consider the following:
- Project Scope and Features
  - Projects that demonstrate the values of the ICC
  - Reusable opportunities
- Budget Issues
  - Central funding to encourage acceptance
  - Initial setup costs
  - The project sponsor may not be able to provide all the necessary budget for the initial project
- Obtaining Resources
  - Staffing resources from within the organization may need to be allocated by other authorities

Note: The budget allocated will determine the scope and model of the ICC. Make sure that the scope and budget are sized appropriately. If they don't match, plan to utilize some of the Financial Management best practices in order to obtain adequate funding.

TIP
A period of evangelizing within the organization may be required to garner support for adoption of the ICC. As projects continue development and implementation prior to the formal adoption of an ICC, there may be an opportunity to develop best practices in grass-roots fashion that become part of a formal ICC established at a later date.

Choose initial projects that have:

- well-established requirements
- a well-defined business benefit
- reasonable complexity relative to delivery expectations
- a lower level of risk
- good visibility, but a low level of political intricacies

Resources
The implementation of the ICC requires resources. The ICC has to provide the starting point for cross-project reuse of sharable assets such as technology, practices, shared objects and processes. It will need budget and manpower. These resources will need to be obtained in the form of central budget and allocation, with or without a chargeback model. Alternatively, they can be paid for from the budgets of the projects that will be undertaken with the ICC. In the event of budget and resource issues, problems can be circumvented with a grass-roots best practices configuration. Ultimately, the ICC will need resources and management support if it is going to provide more than best practices learned from previous projects. The resources required fall into two broad categories:
q q

Resources to implement the ICC Infrastructure that will drive change and improvements. Provision of development and production support.

Planning: Shared Services and Central Services ICC

Establishing a 120 Day Implementation Plan
This Best Practice suggests planning the ICC implementation at periods of 30, 60, 90 and 120 days out. Certain ICC implementations with a central offering will have a plan at 120 days or further out for additional infrastructure and shared services. The purpose is to use the four iterations to show milestones at each phase. The plan is designed and treated as any other project in the IT department, with a defined start and end time within which the ICC is designed, developed and launched. 120 days (or four months) was chosen to help scope the process for each organization's culture and to help the ICC Director properly set expectations with management about showing value after this period of time. Larger organizations might want to develop an implementation plan around a less intrusive model, such as Best Practices or Technology Standards, where there is still payback but less organizational change and alignment is required.

For the purposes of this Best Practice, the 120 day plan is based upon the Shared or Central Services model and ensures that incremental deliverables are accomplished with respect to the implementation of the ICC. This timeline also provides opportunities to enjoy successes at each step of the way and to communicate those successes to the ICC executive sponsor as each milestone is achieved. It is also important to note that since the Central Services ICC model is lengthy (6+ months) to fully implement, it might be worthwhile to repeat the cycle: first implement a Best Practices model in one 4-month iteration, then implement a Shared or Central Services model in another 120 day plan.

Some of the pre-project activities include:

- Business case for the ICC
- Hardware/software ordered

The table below lists the milestones that can be expected to show successful progress of a Shared Services ICC project once the pre-project activities have been completed.

120-Day ICC Start-up Project Milestones

People
- Day 30: ICC Director named; resource plan approved; sponsors and stakeholders identified
- Day 60: Core team members on board; key partnerships with internal governance groups formalized
- Day 90: Subcontractor and 3rd party agreements signed off; initial team training completed; enterprise training plan documented
- Day 120: Staff competency evaluations and development plans documented

Process & Policy
- Day 30: ICC charter approved; early adopters and project opportunities identified
- Day 60: Stakeholder communication plan documented; ICC services defined; core integration standards or principles documented
- Day 90: ICC service engagement and delivery process defined; internal communications and marketing plan documented; chargeback model approved; operating procedures documented (e.g., availability management, failover, disaster recovery, backup, configuration management)
- Day 120: Services are discoverable and orderable by internal customers; regular metrics reporting in place; ongoing metadata management process in place

Technology
- Day 60: ICC tools selected; Service Level Agreement template established
- Day 90: Integration platform configured
- Day 120: Applications connected and using the integration platform; SLA agreements signed off

The table implies the gradual ramp-up of an effective ICC model, starting with building the initial team and then progressing to more formalized procedures. A successful ICC is likely to become the corporate enterprise standard for data integration, so policy development with regard to the ICC can be expected. Such policy developments should be considered major success milestones.

Planning: Activities, Deliverables and Milestones for the 120 Day Plan
Below is the 120 day plan broken into 30/60/90/120 day increments that are further categorized by Activities, Deliverables and Milestones. This makes it easier for those engaged in the initiation of an ICC to see what they should be focusing on and what the end results should be.

30 Day Scorecard
The following plan outlines the people, process, and technology steps that should occur during the first 30 days of the ICC rollout:

Activities

- Name a Director for the ICC organization
- Solicit agreement for the ICC approach
- Identify the Executive Sponsor and key stakeholders
- Identify required resource roles and skills:
  - Identify, assemble, and budget for the human resources necessary to support the ICC rollout; this may be spread over several roles and individuals
- Define the Project Charter for the ICC:
  - The ICC launch should be treated as a project
- Refine the Business Case:
  - Identify, estimate, and budget for the necessary technical resources (e.g., hardware, software).

Note: To encourage projects to utilize the ICC model, it can often be effective to provide hardware and software resources without any internal chargeback for the first year of the ICC. Alternatively, the hardware and software costs can be funded by the projects that are likely to leverage the ICC.

- Identify early adopter projects and plans that can be supported by the ICC
- Install and implement infrastructure for the ICC (hardware and software):
  - Implement a technical infrastructure for the ICC. This includes implementing the hardware and software required to support the initial five projects (or the projects within the scope of the first year of the ICC) in both a development and production capacity. Typically, this technical infrastructure is not the end-goal configuration, but it should include a hardware and software configuration that can easily meld into the end-goal configuration. The hardware and software requirements of the short-term technical infrastructure are generally limited to the components required for the projects that will leverage the infrastructure during the first year. Future scalability is a consideration here, so consider that new servers could be added to a grid later.

Deliverables

- Resource plan
- ICC Project Charter
- List of prospective early adopter projects
- Ballpark ICC budget estimate
- Technical infrastructure installed

Milestones

- ICC Executive Sponsor approval of the initial Project Charter and refined Business Case
- ICC Director on the project full time
- List of 1-3 early adopter projects
- Technical infrastructure completed and installed

60 Day Scorecard
As the ICC successfully engages with its initial projects, the following activities should occur in the 30 to 60 day period.

Activities

- Establish ICC support processes (key partnership groups such as the PMO, Systems Management, Database, Enterprise Architecture, etc.)
- Develop a stakeholder communication plan
- Allocate core team resources to support new and forthcoming projects on the ICC platform
- Develop and adopt core development standards, using sources like Velocity to see best practices in use
- Define ICC services
- Evaluate and select ICC tools
- Develop a Service Level Agreement template (an illustrative sketch follows this list)
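The Service Level Agreement template mentioned above can start life as a simple structured checklist of the commitments each ICC service makes. The sketch below is illustrative only: the field names and sample values are assumptions, not an Informatica-defined format.

```python
# Illustrative skeleton of an SLA template for one ICC service offering.
# Field names and sample values are assumptions for discussion purposes.
SLA_TEMPLATE = {
    "service_name": "Nightly batch data integration",
    "client_organization": "Finance",
    "availability_target_pct": 99.5,               # platform availability per month
    "load_window": {"start": "01:00", "end": "05:00"},
    "incident_response_hours": {"urgent": 1, "standard": 8},
    "escalation_contacts": ["icc-oncall", "icc-manager"],
    "chargeback_basis": "CPU-hours",
    "review_cycle_months": 6,
}

def finished_within_window(completion_time: str, sla: dict = SLA_TEMPLATE) -> bool:
    """Crude check that a load completed before the agreed window closed (HH:MM strings)."""
    return completion_time <= sla["load_window"]["end"]

print(finished_within_window("04:42"))  # True
```

Capturing the template as structured data rather than prose alone makes it easier later to report SLA compliance automatically.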

Deliverables

- Best practice and standards documents for:
  - Error handling processes
  - Naming standards
  - Slowly changing data management
  - Deployment
  - Performance tuning
  - Detailed design documents
  - Other Velocity Best Practice deliverables
- List of tools added to the ICC environment, such as:
  - PowerCenter Metadata Reporter
  - PowerCenter Team Based Development Model
  - Metadata Manager
  - Data Quality, Profiling and Cleansing
  - Various PowerExchange connectivity access products
  - Other tools appropriate for ICC management
- Agreements formalizing key partnerships with internal governance groups
- Stakeholder communication plan
- ICC service offerings
- Service Level Agreement template

Milestones

- Service offerings introduced to support best practices (see Engagement Services Management for a format and outline of marketable services)
- Best Practice and Standards documents available
- Core team members assigned roles and responsibilities
- Key partnerships with internal governance groups formally established
- Regular stakeholder and sponsor meetings (weekly)

90 Day Scorecard

Activities

- Select key contractors and 3rd parties that would assist in ICC operations
- Initial training class for early adopter project teams
- Development of an enterprise training plan
- Final revision of ICC delivery processes and communication of them
- Define rules of engagement when using ICC delivery processes
- Develop a chargeback model for services (a simple allocation sketch follows this list)
- Develop an internal communications and marketing plan
- Implement Disaster Recovery and High Availability as features of the ICC:
  - As projects with disaster recovery/failover needs join the ICC, the appropriate DR/failover implementation should be completed for the ICC infrastructure.
- Approve operational service level agreements (SLAs) between the ICC and hosted projects
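The chargeback model called for above often reduces to allocating the ICC's run cost across client projects in proportion to a usage metric. The following is a minimal sketch of one such proportional allocation; the metric, figures and function name are illustrative assumptions, not a prescribed Informatica approach.

```python
# Hypothetical usage-based chargeback: split the ICC's monthly run cost across
# client projects in proportion to a simple usage metric (e.g., CPU-hours).
def allocate_chargeback(total_monthly_cost: float, usage_by_project: dict) -> dict:
    """Return each project's share of total_monthly_cost, proportional to usage."""
    total_usage = sum(usage_by_project.values())
    if total_usage == 0:
        # No measurable usage this period: split evenly so costs are still recovered.
        even_share = total_monthly_cost / len(usage_by_project)
        return {project: round(even_share, 2) for project in usage_by_project}
    return {
        project: round(total_monthly_cost * usage / total_usage, 2)
        for project, usage in usage_by_project.items()
    }

if __name__ == "__main__":
    usage = {"CustomerDW": 1200, "SAP_Interfaces": 800, "Compliance_Feeds": 400}  # CPU-hours
    print(allocate_chargeback(60000.00, usage))
    # {'CustomerDW': 30000.0, 'SAP_Interfaces': 20000.0, 'Compliance_Feeds': 10000.0}
```

More refined models can weight several metrics (rows processed, number of interfaces, support tickets), but the proportional structure stays the same.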

Deliverables

- Operational procedure manuals
- Operational service level agreements (SLAs) with the projects leveraging ICC services
- Published rules of engagement when utilizing ICC services
- Published internal chargeback model for ICC client organizations
- Signed subcontractor and 3rd party agreements
- Internal communications and marketing plan
- Disaster Recovery / High Availability features in place

Milestones

- Disaster recovery and high availability services in place and referenced in SLAs
- Robust list of ICC services defined and available as consumable items for client organizations
- Chargeback models approved
- Service level agreements and rules of engagement in place for ICC services
- Initial early adopter training class completed
- Published training schedule for the enterprise available

120 Day Scorecard

Activities

- Operational environment ready for project on-boarding
- Develop change control procedures
- Reporting metrics to show the performance of ICC development
- Initial projects on-boarded to the ICC
- Metadata strategy developed and process in place
- Publish a list of additional software components that can be leveraged by ICC customers/projects. Examples include:
  - High Availability
  - PowerCenter Enterprise Grid Option
  - Unstructured Data Option
- Forecast a longer-term technical infrastructure, including both hardware and software. This technical infrastructure can generally provide cost-effective options for horizontal scaling, such as leveraging Informatica's Enterprise Grid capabilities on a relatively inexpensive hardware platform such as Linux or Windows.
- Further refinement of chargeback models

Deliverables

- Competency evaluations of ICC staff and key project team members
- ICC helpdesk for production support on a 24/7 basis for urgent issues
- Operational environment in place
- Change control process documented
- Metadata strategy published
- SLA agreements signed off

Milestones
- ICC established as the enterprise standard for all data integration project needs
- ICC service catalogue available for services such as data architecting, data modeling, development, testing and others such as business analysis
- ICC operations support established
- Applications connected and using the shared integration platform

Last updated: 29-May-08 16:48


Proposal Writing

Challenge


Writing an effective proposal is about persuading an executive audience. Your primary focus should not be to provide an exhaustive list of details or to try to answer all of the questions that may come up as it is read. This best practice provides guidelines on how to write a proposal that reaches your audience with:
- Basic concepts of structure and presentation
- Principles for graphics
- Principles for text

This best practice, which picks up the thread from Business Case Development, is less concerned with researching options and content than with delivering a proposal that is well-organized and persuasive. Remember: "A good proposal is not one that is written, but one that is read."

Description
Basic Principles
"If I had more time, this letter would have been shorter." - Voltaire, Goethe, Twain. . .

Stated by great minds, spoken across times and languages, the above quote remains true today. It takes more effort to be concise than it does to include every detail. It may seem counter-intuitive, but providing all of the possible detail is futile if your document isn't read. The key is to determine what the most important messages are and to focus your efforts on those.

1. Visualize first: Before anything else, think about your end target. Build your graphics to communicate that message, and then add text to persuade. (More on the role of graphics versus text later.)
2. Sell benefits, not just features: Don't spend all of your time explaining what. If you can intrigue your audience and convince them why, you can explain exactly how later. First, you must persuade them that they want to know more.
3. Every page should answer "Why?" and "So what?": A lot of documents do a great job with what, especially those written by substance-oriented people, but all that effort can be wasted if you don't get your readers to care and understand how what you are suggesting is a benefit. Always answer "why?" and "so what?"
4. The entire core document (excluding attachments) should be viewable as an Executive Summary:
   a. Consistent look and feel: Keep it crisp, clean and united as one document. Readers should be able to absorb the entire message in a scan that takes only a few minutes.
   b. Divided into modules: No matter what the cost and scope, the core elements of any proposal are essentially the same. Divide your proposal into logical modules that make it easy to navigate and absorb.
   c. Two pages for each module: Each section of the document should be no more than two pages, with half of the length used for graphics.
   d. Persuasive text and descriptive graphics, not the other way around: All of your substance should be represented in the graphics. The text should be reserved primarily for the why.
5. More emphasis on format than content: Packaging has a huge impact on readability, and in this context getting the document read is the primary objective. Take the time to focus on the document's format without compromising your content. Format can have a huge impact on acceptance, readability and communicating points accurately.

Recommended Structure
Making due allowance for variances in complexity and scale, it is generally recommended that each proposal have seven sections, each of which is two pages long; a document which is only fourteen pages long is more likely to be read in its entirety. The following sections should be included:
- Business Opportunity
- Alternatives Considered
- Proposed Approach
- Implementation Plan
- Deliverables
- Resource Plan
- Financial Summary

Each section should include one page of text and a one-page graphic. If required, additional appendices can be included for detailed specifications.

Guiding Principles for Graphics


Graphics are no less important than text, and in some cases are more important. A quick scan of the document's graphics should convey your end message and intrigue your reader to look more closely. Edward Tufte's landmark book Visual Explanations provides some deep insights for advanced practices, but if you are just starting out, consider the following guidelines:

1. Descriptive in nature - should tell the story: Your reader should be able to look at the graphics and know what the document is about without reading any text. The graphics should illustrate the what of your proposal in a clear and concise manner.
2. Ideally, achieve a 10-second and 10-minute impact: Your graphic should be clear enough that the core message is apparent within ten seconds, yet complex enough that ten minutes later the reader will still be extracting new content.
3. Don't turn sideways / fold out for big graphs: Remember, take the time to focus on the format; put yourself in your reader's shoes and make it easy for them.
4. Good graphics are hard to create: And they take time. Practice, practice, practice. The creation of excellent graphics often takes multiple iterations, whiteboard brainstorming sessions, drawing ideas out and getting peer reviews. Keep your audience in mind; it is not always clear what is best for each audience.
5. Should be complete before text: Your text should support your graphics, persuading the reader that the ideas illustrated are good ones. Making clear connections and eliminating redundancy is easier when the graphics come first. This also helps in evaluating whether the graphics can stand alone in conveying your end message.
6. Will be read: Reading images is often faster than reading text, an important consideration when appealing to an executive audience.
7. Are worth 1,000 words: Don't spend another thousand words rehashing what your graphics have already said. Make sure your graphics send a clear, eloquent message. If your graphic can stand alone, it can be passed on; make sure it stands up out of context.

The following graphic from Edward Tufte's work is a comprehensive example of a picture worth 1,000 words:


Guiding Principles for Text


After taking the time to create graphics that communicate your end message, don't waste your text on redundancy. Instead, use your text to persuade. A good exercise to help focus on communicating what is important: try to cut out half the words without losing content. If you put it all in, nothing gets read.

1. Should communicate only two things:
   a. Benefits: Your graphics should convey what you are proposing, but they won't explain what is in it for the executive. Use your text to do this.
   b. Advantage of your proposal/solution/recommendation: Focus on explaining why this approach is the optimal one.
2. Persuasive in nature: Again, leave the what to the graphics. Use specific statements and facts to support your point. Numbers can be very persuasive, but be prepared to substantiate them!
3. Grade 8 reading level: Put yourself in the reader's shoes and make it easy for them to read. Avoid big words and convoluted sentence structure. A good check is to consider whether an eighth grader could read it and understand your message. If it's a technical subject, also consider whether someone who doesn't do more than surf the Internet could understand it.
4. Related to graphics: Since your graphics are descriptive and your text is persuasive, ensure that there are clear associations between the two. A logical what and a persuasive why are meaningless unless it's easy to see how what you are proposing provides the outlined benefits.
5. Written from the reader's perspective (avoid we, us, you, etc.): Again, step into the reader's shoes. Make it easy for them to understand the key points. You aren't trying to impress them with your knowledge; you are trying to reach and persuade them.
6. Avoid absolutes (always, never, best, etc.): Absolutes can get you into trouble! Review your document for statements with absolutes, take those words out and reread your document. Often nothing meaningful is lost and statements are crisper.
7. Edited by a 3rd party: Additional perspectives engaged in critiquing and brainstorming are invaluable, and an objective party may more easily identify unnecessary information.

For example, compare the following proposal introductions:

"This project will define and document Product & Account Architecture, covering business strategy, business architecture, and technical architecture for product sales and fulfillment, service enrollment and fulfillment, account closing/service discontinuation, account/service change and selected account transactions."

Or:

"The cost for this project is $10.3M with annual savings of $12.5M, resulting in a 10-month payback. In addition to the hard benefits, this project will increase customer satisfaction, improve data accuracy and enhance compliance enforcement."

Clearly, the second paragraph has greater impact. The tone is persuasive rather than academic.


Summary
The single objective of a proposal is to win support for the proposed solution; this provides a very simple measure of the effectiveness of any paragraph of text or any graphic. Essentially, the entire document should be regarded as an extended executive summary. Every page should answer "Why?" and "So what?" It is best to visualize first and then to write. Text should be persuasive and graphics should be descriptive, not the other way around. Sell benefits rather than features.

Last updated: 30-May-08 14:13


Selecting the Right ICC Model

Challenge


Choosing the right Integration Competency Center (ICC) implementation requires answering questions about the services the ICC should offer and an understanding of the current organizational culture, financial accounting processes and human resource allocation. Consider the benefits of an ICC implementation versus the costs and risks of a project-silo approach (which essentially is a decision not to implement an ICC). Choosing the right model is significant because a good ICC will grow in importance to the organization. This Best Practice provides an overview of items to consider that can help in choosing the appropriate structure for an ICC.

The main challenge in ICC startup is to identify the appropriate organizational structure for a given enterprise. There are four main factors that help to determine the type of ICC model that an organization should implement. They are:

- IT Organization Size
- Business Value/Opportunity Planning
- IT Strategic Alignment
- Urgency/Bias for Action by Business Community

Description
What are the ICC Models?
Integration Competency Centers fall into five main models:

- Best Practices
- Technology Standards
- Shared Services
- Central Services
- Self Service

The first model in the figure below (Project Silos) is not really an ICC. It is the situation that often exists before organizations build an ICC infrastructure to improve data integration efficiency.

Model 1 - Best Practices


A Best Practices ICC is the easiest to implement, which makes it a good first step for an organization that wants to begin leveraging integration expertise. The Best Practices ICC model focuses on establishing proven processes across business units, defining processes for data integration initiatives and recommending appropriate technology, but it does not share the development workload with individual project teams. The result is a higher overall ROI for each data integration initiative.

To achieve this goal, a Best Practices ICC documents and distributes recommended operating procedures and standards for development, management and mapping patterns. It also defines how to manage change within an integration project.

The people who lead this effort are typically those in the organization who have the most integration expertise. They form a virtual team consisting of project managers and ETL lead developers from different projects. The most important roles in this type of ICC are the knowledge coordinator, who collects and distributes best practices, and the ICC manager, who ensures that the ICC anticipates business requirements and that business managers and customers turn to the ICC for assistance with integration initiatives.

The primary function of this ICC model is to document best practices. It does not include a central support or development team to implement those standards across projects. To implement a Best Practices ICC, companies need a flexible development environment that supports diverse teams and that enables the team to enhance and extend existing systems and processes.

Model 2 - Technology Standards


The Technology Standards model standardizes development processes on a single, unified technology platform, enabling greater reuse of work from project to project. Although neither technology nor people are shared, standardization creates synergies among disparate project teams. A Technology Standards ICC provides the same knowledge leverage as a Best Practices ICC, but enforces technical consistency in software development and hardware choices. A Technology Standards ICC focuses on processes, including standardizing and enforcing naming conventions, establishing metadata standards, instituting change management procedures and providing standards training. This type of ICC also reviews emerging technologies, selects vendors, and manages hardware and software systems. The people within a Technology Standards ICC typically come from different development teams, and may move from one team to another. However, at its core is a group of best practices leaders. These most likely include the following roles:
- Technology Leader
- Metadata Administrator
- Knowledge Coordinator
- Training Coordinator
- Vendor Manager
- ICC Manager

A Technology Standards ICC standardizes all integration activities on a common platform and links repositories for optimized metadata sharing. To support these activities the ICC needs technologies that provide for metadata management; enable maximum reuse of systems, processes, resources and interfaces; and offer a robust repository, including embedded rules and relationships and a model for sharing data.

Model 3 - Shared Services


The Shared Services model defines processes, standardizes architecture and maintains a centralized team for shared work, but most development work occurs in the distributed lines of business. This hybrid centralized/decentralized model optimizes resources. A Shared Services ICC optimizes the efficiency of integration project teams by providing a common, supported technical environment and services ranging from development support all the way through to a help desk for projects in production. This type of ICC is significantly more complex than a Best Practices or Technology Standards model.

It establishes processes for knowledge management, including product training, standards enforcement, technology benchmarking, and metadata management, and it facilitates impact analysis, software quality and effective use of developer resources across projects. The team takes responsibility for the technical environment, including hardware and software procurement, architecture, migration, installation, upgrades, and compliance. The Shared Services ICC is responsible for departmental cost allocation; for ensuring high levels of availability through careful capacity planning; and for security, including repository administration and disaster recovery planning. The ICC also takes on the task of selecting and managing professional services vendors.

The Shared Services ICC supports development activities, including performance and tuning. It provides QA, change management, acceptance and documentation of shared objects. It supports projects through a development help desk, estimation, architecture review, detailed design review and system testing, and it supports cross-project integration through schedule management and impact analysis. When a project goes into production, the ICC helps to resolve problems through an operations help desk and data validation. It monitors schedules and the delivery of operations metadata. It also manages change from migration to production, provides change control review and supports process definition.

The roles within a Shared Services ICC include technology leader, technical architect, and data integration architect (someone who understands ETL, EAI, EII, Web services, and other integration technologies). A repository administrator and a metadata administrator ensure leverage and reuse of development assets across projects, set up user groups and connections, administer user privileges and monitor repository activities. This type of ICC also requires a knowledge coordinator, a training coordinator, a vendor manager, an ICC manager, a product specialist, a production operator, a QA manager, and a change control coordinator. A Shared Services ICC requires a shared environment for development, QA, and production.

Model 4 - Central Services


Centralized integration initiatives can be the most efficient and have the most impact on the organization. A Central Services ICC controls integration across the enterprise. It carries out the same processes as the other models, but in addition usually has its own budget and a chargeback methodology. It also offers more support for development projects, providing management, development resources, data profiling, data quality and unit testing. Because a Central Services ICC is more involved in development activities than the other models, it requires a production operator and a data integration developer.

In this ICC model, standards and processes are defined, technology is shared and a centralized team is responsible for all development work on integration initiatives. Like a Shared Services ICC, it also includes the roles of technology leader, technical architect, data integration architect, repository administrator, metadata administrator, knowledge coordinator, training coordinator, vendor manager, ICC manager, product specialist, production operator, QA manager and change control coordinator.

To achieve its goals, a Central Services ICC needs a live and shared view of the entire production environment. Tools to maximize reuse of systems, processes, resources and interfaces are essential, as is visibility into dependencies and assets. A Central Services ICC depends on robust metadata management tools and tools that enable the team to enhance and extend existing systems and processes.

Model 5 - Self Service


The Self Service ICC model achieves both a highly efficient operation and an environment where innovation can flourish. Self Service ICCs require strict enforcement of a set of application integration standards through automated processes, and have a number of tools and systems in place that support automated or semi-automated processes.

Multiple Model Approach

In addition to considering the above models, it is desirable to offer services and an organization structure based upon the level of integration involved. The type of projects being carried out weighs heavily on the ICC model chosen. The figure below illustrates the viewpoint that certain ICC models suit certain types of projects. Strategic projects initiated by an organization may usually be better served under a Central Services model. Tactical operational projects initiated and controlled by lower tiers of management may be better served with a Best Practices or Shared Services ICC implementation that reflects the degree of autonomy present at that level of the organization.

It is also important to note that even if the organization adopts a Central Services ICC model, not all projects may fall into it. Some projects require very specific SLAs (Service Level Agreements) that are much more stringent than those of other projects, and as such they may require a less stringent ICC model.

Activities

A high level plan to select the right ICC model is outlined below. The following sections elaborate on the key decision criteria and activities associated with these steps.

1. Evaluate Selection Criteria: Determine the recommended model based on best practice guidelines related to organizational size, strategic alignment, business value, and urgency for realizing significant benefits.
2. Document Objective and Potential Benefits: Define the business intent or purpose of the ICC and the desired or expected benefits (this is not a business case at this stage).
3. Define Service Scope: Define the scope of services that will be offered by the ICC (detailed service definitions occur in step 6 or 7).
4. Determine Organizational Constraints: Identify budget limitations or operational constraints, such as geographic distribution of operations and the degree of independence of operational groups.
5. Select an ICC Model: Recommend a model and gain executive support or sponsorship to proceed. If there is an urgent need to implement the ICC, move to step 6; otherwise proceed to step 7.
6. Develop a 120 Day Implementation Plan: Leverage Planning the ICC Implementation to implement and launch an ICC.
7. Evolve to Target State: Develop a future-state vision and begin implementing it. Planning the ICC Implementation may still provide useful guidance, but the time-frame may be measured in years rather than months.

Which Model is Best for an Organization?


There are four main factors that help to determine the type of ICC model that an organization should implement. They are:
- IT Organization Size
- Business Value/Opportunity Planning
- IT Strategic Alignment
- Urgency/Bias for Action by Business Community

Criteria 1 - IT Organization Size

The size of the IT organization is one factor in selecting the right ICC model. While any ICC model could be used in an organization of any size, each model has a sweet spot that optimizes the benefits to the enterprise and minimizes potential negative factors. For example, the Best Practices model could be used in a very large multi-national organization, but it would not capture all of the benefits that could be realized. Conversely, a Central Services or Shared Services model could be implemented in an IT department with 50 staff, but the formality of these models would add extra overhead that is not necessary in smaller groups. Below are initial guidelines to help determine which model fits best. Note that the number of staff includes both employees of the organization and contract personnel.

Size of IT Organization / Suggested Model
- Less than 200 staff: Best Practices
- 200-500 staff: Technology Standards
- 500-2,000 staff: Central Services
- Greater than 2,000 staff: Shared Services

The sweet spot for Central Services is between 500-2,000 IT staff. Organizations of that size can benefit most from a centralized approach while avoiding negative factors like diseconomies of scale that can occur in larger organizations. Organizations with greater than 2,000 staff are generally better suited to leverage a Shared Services model. In very large organizations with 5,000 or more staff and several CIO groups, it is common to see a Shared Services model where each organization under a CIO has a Central Services ICC with the efforts across the groups being coordinated by an enterprise Shared Services ICC.

Criteria 2 - Business Value/Opportunity Planning

Identifying the value and opportunity of data integration to the business is another factor in selecting the right model. For example, if there are defined business initiatives that require integration (e.g., a customer data warehouse), there is more opportunity to leverage investment in an ICC. Conversely, if the business units that make up a large corporation operate as separate and autonomous groups with few end-to-end processes and little need to share information (a holding company, for example), then the focus for obtaining efficient IT operations should be more on optimizing IT practices and leveraging technology standards.

Business System Pattern / Suggested Model
- Siloed business units, separate IT operations infrastructure: Best Practices
- Siloed business units, shared IT operations infrastructure: Technology Standards
- Master Data Management or Customer Data Integration type data initiatives: Shared Services
- Data Integration Vision established or Enterprise Data initiatives: Central Services

The main opportunity here for a more centralized approach is that there is a business initiative to bring all enterprise information together to find the single version of the truth. An ICC can certainly gain sponsorship in building the sustaining integration organization that is required to meet this vision. Less visionary initiatives may require educating business users of the need for data integration and might require more of a pay as you go approach to show successes incrementally. However, elaborative meetings to extract the business vision on data could be very helpful in defining the vision and identifying the need for data integration.

Criteria 3 - IT Strategic Alignment

How the IT organization is aligned to data integration is another key criterion for measuring readiness for each of the different ICC models. Evaluate whether the key IT projects are integration-driven or operations-driven, or whether they are focused on key business application systems.

IT Project Pattern / Suggested Model
- Main projects are separate, supporting a single business unit and not integrated with other systems: Best Practices
- Main projects are separate, supporting a single business unit, but need to access data from 1-3 other systems: Technology Standards
- Main projects are wider in scope and require cross-functional teams to address integration with more than 3 other systems: Shared Services
- Main projects are focused on major integration efforts such as customer integration, key system data exchanges, etc.: Central Services

Consider looking at the IT Project Portfolio to gain an understanding of the criteria used to prioritize projects. Projects that are focused on reductions in cost or redundancy can be helpful in identifying key integration and cross functional needs across an organization. Other items to look for are key data integration initiatives such as consolidation/retirement of applications that occur through merger activities, reduction/simplification of processes, etc. with a focus to reduce IT support costs and invest in infrastructure to free up development resources.

Criteria 4 - Urgency/Bias for Action by Business Community

A final criterion for selecting an ICC model is how quickly the organization needs to act. A sense of urgency (or bias for action) may provide the incentive to move more quickly to a Central Services model.

Urgency Level / Suggested Model
- Perceived benefits from sharing practices, but no immediate or pressing reason to address integration needs from a business or IT perspective: Best Practices
- Desire to standardize technology and reduce variations in the IT infrastructure, but no compelling need at the moment to address business or data integration issues: Technology Standards
- A number of opportunities for collaboration, resource sharing, and reuse of development components, but these may be addressed incrementally over time: Shared Services
- One or more key strategic initiatives driven top-down that require collaboration and coordination across multiple groups and must show progress and results quickly: Central Services

A sense of urgency helps to identify the business case and provides the impetus necessary to execute it. Therefore, it is considered a factor in choosing the right model.

Matching Budget to Suggested Model Selection


To put all of this together, use the four criteria above to determine which model to implement. Match the budget to the suggested model. It may be necessary to build a business case (particularly when implementing a Shared or Central Services model). Examples of how to build a business case and develop a chargeback model can be found in the Financial Management competency. The remainder of this best practice is focused on the activities that make up an ICC. This information will help in estimating the cost and organizational impact of establishing the ICC function.
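The four criteria can also be expressed as a simple decision aid. The sketch below encodes the guideline tables above as lookups and takes a majority vote across the criteria; the function and category names are hypothetical, and a real selection should still weigh the budget and organizational constraints discussed in this section.

```python
# Illustrative encoding of the four ICC model selection criteria as lookups.
# Thresholds and mappings follow the guideline tables; names are assumptions.
from collections import Counter

def model_by_it_size(staff: int) -> str:
    """Criterion 1: IT organization size (employees plus contractors)."""
    if staff < 200:
        return "Best Practices"
    if staff <= 500:
        return "Technology Standards"
    if staff <= 2000:
        return "Central Services"      # sweet spot for Central Services
    return "Shared Services"           # very large organizations

BUSINESS_VALUE = {                     # Criterion 2: business value/opportunity
    "siloed_units_separate_it": "Best Practices",
    "siloed_units_shared_it": "Technology Standards",
    "mdm_or_cdi_initiatives": "Shared Services",
    "enterprise_data_integration_vision": "Central Services",
}
IT_ALIGNMENT = {                       # Criterion 3: IT strategic alignment
    "single_unit_no_integration": "Best Practices",
    "single_unit_1_to_3_systems": "Technology Standards",
    "cross_functional_over_3_systems": "Shared Services",
    "major_integration_programs": "Central Services",
}
URGENCY = {                            # Criterion 4: urgency/bias for action
    "no_pressing_need": "Best Practices",
    "standardize_technology": "Technology Standards",
    "incremental_opportunities": "Shared Services",
    "strategic_top_down_initiatives": "Central Services",
}

def suggest_icc_model(staff, business_pattern, project_pattern, urgency):
    """Return the per-criterion suggestions and the most common one."""
    votes = [
        model_by_it_size(staff),
        BUSINESS_VALUE[business_pattern],
        IT_ALIGNMENT[project_pattern],
        URGENCY[urgency],
    ]
    return votes, Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    votes, pick = suggest_icc_model(
        800, "mdm_or_cdi_initiatives",
        "cross_functional_over_3_systems", "strategic_top_down_initiatives")
    print(votes, "->", pick)
```

Where the criteria disagree (as in the example), the tie-break is a business decision rather than a formula; the business case and chargeback considerations in the Financial Management competency can help settle it.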

Objectives and Benefits


After selecting a model for an ICC implementation there are important questions to explore:
- What are the objectives of implementing an ICC?
- What will the services and benefits offered by the ICC consist of?

Typical ICC objectives include:



- Promoting data integration as a formal discipline.
- Developing a set of experts with data integration skills and processes, and leveraging their knowledge across the organization.
- Building and developing skills, capabilities and best practices for integration processes and operations.
- Monitoring, assessing and selecting integration technology and tools.
- Managing integration pilots.
- Leading and supporting integration projects with the cooperation of subject matter experts.
- Reusing development work such as source definitions, application interfaces and codified business rules.

Although a successful project that shares its lessons with other teams can be a great way to begin developing organizational awareness of the value of an ICC, setting up a more formal ICC requires upper management buy-in and funding. Some of the typical benefits that can be realized from doing so include:

- Rapid development of in-house expertise through coordinated training and shared knowledge.
- Leveraging of shared resources and "best practice" methods and solutions.
- More rapid project deployments.
- Higher quality/reduced risk for data integration projects.
- Reduced costs of project development and maintenance.
- Shorter time to ROI.

When examining the move towards an ICC model that optimizes and (in certain situations) centralizes integration functions, consider two things:

- The problems, costs and risks associated with a project silo-based approach
- The potential benefits of an ICC environment

ICC Activities
What Activities does an ICC perform? The common activities provided by an ICC can be divided into four major categories:
- Knowledge Management
- Environment
- Development Support
- Production Support

The ICC Activities Summary table below breaks down the activities that can be provided by ICC based on the four categories above:

Knowledge Management
- Training: Standards Training, Product Training
- Standards: Standards Development, Standards Enforcement, Methodology, Mapping Patterns
- Technology: Emerging Technologies, Benchmarking
- Metadata: Metadata Standards, Metadata Enforcement, Data Integration Catalog

Development Support
- Performance: Performance and Tuning
- Shared Objects: Shared Object Quality Assurance, Shared Object Change Management, Shared Object Acceptance, Shared Object Documentation
- Project Support: Development Helpdesk, Software/Method Selection, Project Estimation, Project Management, Project Architecture Review, Detailed Design Review, Development Resources, Data Profiling, Data Quality Testing, Unit Testing, System Testing
- Cross Project Integration: Schedule Management/Planning, Impact Analysis

Environment
- Hardware: Vendor Selection and Management, Hardware Procurement, Hardware Architecture, Hardware Installation, Hardware Upgrades
- Software: Vendor Selection and Management, Software Procurement, Software Architecture, Software Installation, Software Upgrades, Compliance (Licensing)
- Professional Services: Vendor Selection and Management, Vendor Qualification
- Security: Security Administration, Disaster Recovery
- Financial: Budget, Departmental Cost Allocation
- Scalability/Availability: High Availability, Capacity Planning

Production Support
- Issue Resolution: Operations Helpdesk, Data Validation
- Production Monitoring: Schedule Monitoring, Operations Metadata Delivery
- Change Management: Object Migration, Change Control Review, Process Definition

The activities that could potentially be provided by ICCs for each category are described in the tables below. The ICC models that they usually fall into are abbreviated as:
- Best Practices (BP)
- Technology Standards (TS)
- Shared Services (SS)
- Central Services (CS)

ICC Knowledge Management Activities

- Standards Training (BP, TS, SS, CS): Training on best practices, including but not limited to naming conventions, unit test plans, configuration management strategy and project methodology.
- Product Training (SS, CS): Coordination of vendor-offered or internally sponsored training on specific technology products.
- Standards Development (BP, TS, SS, CS): Creating best practices, including but not limited to naming conventions, unit test plans and coding standards.
- Standards Enforcement (BP, TS, SS, CS): Ensuring development teams use documented best practices, through formal development reviews, metadata reports, project audits or other means.
- Methodology (SS, CS): Creating methodologies to support development initiatives, for example methodologies for rolling out data warehouses and data integration projects. Typical topics in a methodology include but are not limited to project management, project estimation, development standards and operational support.
- Mapping Patterns (SS, CS): Developing and maintaining mapping patterns (templates) to speed up development time and promote mapping standards across projects.
- Emerging Technologies (TS, SS, CS): Assessing emerging technologies, determining if/where they fit in the organization and setting policies around their adoption and use.
- Benchmarking (TS, SS, CS): Conducting and documenting tests on hardware and software in the organization to establish performance benchmarks.
- Metadata Standards (BP, TS, SS, CS): Documenting the metadata standards that development teams are expected to conform to.
- Metadata Enforcement (SS, CS): Enforcing development teams' conformance to the documented metadata standards.
- Data Integration Catalog (SS, CS): Tracking the list of systems involved in data integration efforts, the integration between systems, and the use/subscription of data integration feeds. This information is critical to managing the interconnections in the environment in order to avoid duplication of integration efforts and to know when particular integration feeds are no longer needed.

ICC Environment Activities

- Hardware Vendor Selection and Management (TS, SS, CS): Selection of vendors for the hardware needed for integration efforts, which may span servers, storage and network facilities.
- Hardware Procurement (SS, CS): Responsible for the purchasing process for hardware items, which may include receiving and cataloging the physical hardware.
- Hardware Architecture (SS, CS): Developing and maintaining the physical layout and details of the hardware used to support the Integration Competency Center.
- Hardware Installation (SS, CS): Setting up and activating new hardware as it becomes part of the physical architecture supporting the Integration Competency Center.
- Hardware Upgrades (SS, CS): Managing the upgrade of hardware, including operating system patches, additional CPU/memory upgrades, replacement of old technology, etc.
- Software Vendor Selection and Management (TS, SS, CS): Selection of vendors for the software tools needed for integration efforts. Activities may include formal RFPs, vendor presentation reviews, software selection criteria, maintenance renewal negotiations and all activities related to managing the software vendor relationship.
- Software Procurement (SS, CS): Responsible for the purchasing process for software packages and licenses.
- Software Architecture (SS, CS): Developing and maintaining the architecture of the software packages used in the competency center. This may include flowcharts and decision trees of what software to select for specific tasks.
- Software Installation (SS, CS): Setting up and installing new software as it becomes part of the architecture supporting the Integration Competency Center.
- Software Upgrades (SS, CS): Managing the upgrade of software, including patches and new releases. Depending on the nature of the upgrade, significant planning and rollout effort may be required (training, testing, physical installation on client machines, etc.).
- Compliance (Licensing) (SS, CS): Monitoring and ensuring proper licensing compliance across development teams. Formal audits or reviews may be scheduled, and documentation should be kept matching installed software with purchased licenses.
- Professional Services Vendor Selection and Management (SS, CS): Selection of vendors for professional services related to integration efforts. Activities may include managing vendor rates and bulk discount negotiations, payment of vendors, reviewing past vendor work, managing the list of preferred vendors, etc.
- Professional Services Vendor Qualification (SS, CS): Activities may range from formal vendor interviews as consultants/contractors are proposed for projects and checking vendor references and certifications, to formally qualifying selected vendors for specific work tasks (e.g., Vendor A is qualified for Java development while Vendor B is qualified for ETL and EAI work).
- Security Administration (SS, CS): Providing access to the tools and technology needed to complete data integration development efforts, including software user IDs, source system user IDs/passwords, and overall data security of the integration efforts. Ensures enterprise security processes are followed.
- Disaster Recovery (SS, CS): Performing risk analysis in order to develop and execute a disaster recovery plan, including repository backups, off-site backups, failover hardware, notification procedures and other tasks related to a catastrophic failure (e.g., a server room fire that destroys development and production servers).
- Budget (CS): Yearly budget management for the Integration Competency Center, including managing outlays for services, support, hardware, software and other costs.
- Departmental Cost Allocation (SS, CS): For organizations where shared services costs are spread across departments or business units. Activities include defining the metrics used for cost allocation, reporting on the metrics, and applying cost factors for billing on a weekly, monthly or quarterly basis as dictated.
- High Availability (SS, CS): Design and implementation of hardware, software and procedures to ensure high availability of the data integration environment.
- Capacity Planning (SS, CS): Design and planning for additional integration capacity to address the organization's future growth in the size and volume of data integration.

ICC Development Support Activities

- Performance and Tuning (SS, CS): Provide targeted performance and tuning assistance for integration efforts. Provide ongoing assessments of load windows and schedules to ensure service level agreements are being met.
- Shared Object Quality Assurance (SS, CS): Provide quality assurance services for shared objects so that objects conform to standards and do not adversely affect the various projects that may be using them.
- Shared Object Change Management (SS, CS): Manage the migration to production of shared objects, which may impact multiple project teams. Activities include defining the schedule for production moves, notifying teams of changes, and coordinating the migration of the object to production.
- Shared Object Acceptance (SS, CS): Define and document the criteria for a shared object and officially certify an object as one that will be shared across project teams.
- Shared Object Documentation (SS, CS): Define the standards for documentation of shared objects and maintain a catalog of all shared objects and their functions.
- Development Helpdesk (SS): Provide a helpdesk of expert product personnel to support project teams, giving teams that are new to developing data integration routines a place to turn for experienced guidance.
- Software/Method Selection (SS, CS): Provide a workflow or decision tree to use when deciding which data integration technology to apply to a given request.
- Requirements Definition (SS, CS): Develop the process to gather and document integration requirements. Depending on the level of service, this may include assisting with, or even fully gathering, the requirements for the project.
- Project Estimation (SS, CS): Develop project estimation models and provide estimation assistance for data integration efforts.
- Project Management (CS): Provide full-time management resources experienced in data integration to ensure successful projects.
- Project Architecture Review (SS, CS): Provide project-level architecture review as part of the design process for data integration projects, helping to ensure standards are met and the project architecture fits the enterprise architecture vision.
- Detailed Design Review (SS, CS): Review design specifications in detail to ensure conformance to standards and identify issues before development work begins.
- Development Resources (SS, CS): Provide skilled resources for completion of the development efforts.
- Data Profiling (CS): Provide data profiling services to identify data quality issues and develop plans for addressing the issues found.
- Data Quality (CS): Define and meet data quality levels and thresholds for data integration efforts.
- Unit Testing (CS): Define and execute unit testing of data integration processes. Deliverables include documented test plans, test cases and verification against end-user acceptance criteria.
- System Testing (SS, CS): Define and perform system testing to ensure that data integration efforts work seamlessly across multiple projects and teams.
- Schedule Management/Planning (SS, CS): Provide a single point for managing load schedules across the physical architecture to make best use of available resources and appropriately handle integration dependencies.
- Impact Analysis (SS, CS): Provide impact analysis on proposed and scheduled changes that may affect the integration environment, including but not limited to system enhancements, new systems, retirement of old systems, data volume changes, shared object changes, hardware migration and system outages.

ICC Production Support Activities

- Operations Helpdesk (SS, CS): First line of support for operations issues, providing high-level issue resolution. The helpdesk fields support cases and issues related to scheduled jobs, system availability and other production support tasks.
- Data Validation (SS, CS): Provide data validation on integration load tasks. Data may be held from end-user access until some level of validation has been performed, ranging from manual review of load statistics to automated review of record counts, grand total comparisons, expected size thresholds or any other metric an organization may define to catch potential data inconsistencies before they reach end users.
- Schedule Monitoring (SS, CS): Nightly/daily monitoring of the data integration load jobs, ensuring jobs are properly initiated, are not being delayed, and complete successfully. May provide first-level support for the load schedule while escalating issues to the appropriate support teams.
- Operations Metadata Delivery (SS, CS): Responsible for providing metadata to system owners and end users regarding the production load process, including load times, completion status, known issues and other pertinent information about the current state of the integration job stream.
- Object Migration (SS, CS): Coordinate movement of development objects and processes to production. The ICC may even physically control migration such that all migration is scheduled, managed, and performed by the ICC.
- Change Control Review (SS, CS): Conduct formal and informal reviews of production changes before migration is approved. At this time, standards may be enforced, system tuning reviewed, production schedules updated, and formal sign-off on production changes issued.
- Process Definition (BP, TS, SS, CS): Develop and document the change management process such that development objects are efficiently and flawlessly migrated into the production environment. This may include notification rules, scheduled migration plans, emergency fix procedures, etc.
Choosing the ICC Model


Other factors in choosing an ICC model can depend on the nature of the organization. For example, if the data integration functions are already centrally managed for established corporate reasons, it is unlikely that there will be a sudden move to shared services. In turn, shared services may be more realistic in a culture where independent departments take on and manage their own data integration projects. A central services model would only result if senior managers wanted to consolidate the management and development of projects for reasons of cost, knowledge or quality and make it policy. The higher the degree of centralization, the greater the potential cost savings. Some organizations have the flexibility to easily move toward central services, while others don't, due either to organizational or regulatory constraints. There is no ideal model, only one that is appropriate to the environment in which it operates and that will deliver increased efficiency and a quicker, higher ROI for data integration projects.

The adoption of the Central Services model does not necessarily mandate the inclusion of all applications within the orbit of the ICC. Some projects require very specific SLAs (Service Level Agreements) that are much more stringent than other projects, and as such they may require a less stringent ICC model.

The tables above show how to compare the services that will be provided by an ICC against the four models. Having considered the service categories, the appropriate ICC Organizational Model may be indicated. Working the exercise in reverse may reveal services that will need to be provided for a chosen ICC model that may not be possible initially or that would require extra resources and budget. The desired services can be marked down and compared against the standard models that are shown. Review Engagement Services Management to determine how to bundle these activities into client services for consumption.


Other more general questions to consider when envisioning services the ICC can provide are based on the level of responsibility the ICC and its management group may take on. Will the ICC be responsible for:

- A shared cross functional integration system?
- Enforcing technology standards?
- Maintaining a metadata repository?
- Developing shared common objects incorporating business logic?
- End to end business process monitoring?
- Will production support be provided?

Consider whether the expertise and resources actually exist within the host organization to provide the services. Further, how would intra-organizational politics affect the reception of any of the models for an ICC? Would a new ICC group taking on those responsibilities be acceptable to other significant persons and departments who may be threatened by, or benefit from, the introduction of the ICC model? Conversely, which individuals and departments would support the creation of an ICC with one of the four main models described in this Best Practice? More information is available in the following publication: Integration Competency Center: An Implementation Methodology by John Schmidt and David Lyle, Copyright 2005 Informatica Corporation.

Last updated: 02-Jun-08 19:36


Creating Inventories of Reusable Objects & Mappings

Challenge


Successfully identify the need and scope of reusability. Create inventories of reusable objects within a folder, or of shortcuts across folders (local shortcuts) or shortcuts across repositories (global shortcuts). Successfully identify and create inventories of mappings based on business rules.

Description
Reusable Objects
Prior to creating an inventory of reusable objects or shortcut objects, be sure to review the business requirements and look for any common routines and/or modules that may appear in more than one data movement. These common routines are excellent candidates for reusable objects or shortcut objects. In PowerCenter, these objects can be created as:
- single transformations (i.e., lookups, filters, etc.)
- a reusable mapping component (i.e., a group of transformations - mapplets)
- single tasks in workflow manager (i.e., command, email, or session)
- a reusable workflow component (i.e., a group of tasks in workflow manager - worklets).

Please note that shortcuts are not supported for workflow level objects (Tasks). Identify the need for reusable objects based on the following criteria:
- Is there enough usage and complexity to warrant the development of a common object?
- Are the data types of the information passing through the reusable object the same from case to case, or is it simply the same high-level steps with different fields and data?

Identify the Scope based on the following criteria:


- Do these objects need to be shared within the same folder? If so, create reusable objects within the folder.
- Do these objects need to be shared in several other PowerCenter repository folders? If so, create local shortcuts.
- Do these objects need to be shared across repositories? If so, create a global repository and maintain these re-usable objects in the global repository. Create global shortcuts to these reusable objects from the local repositories.

Note: Shortcuts cannot be created for workflow objects.

PowerCenter Designer Objects


Creating and testing common objects does not always save development time or facilitate future maintenance. For example, if a simple calculation like subtracting a current rate from a budget rate that is going to be used for two different mappings, carefully consider whether the effort to create, test, and document the common object is worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the calculation were to be performed in a number of mappings, if it was very difficult, and if all occurrences would be updated following any change or fix, then the calculation would be an ideal case for a reusable object. When you add instances of a reusable transformation to mappings, be careful that the changes do not invalidate the mapping or generate unexpected data. The Designer stores each reusable transformation as metadata, separate from any mapping that uses the transformation. The second criterion for a reusable object concerns the data that will pass through the reusable object. Developers often encounter situations where they may perform a certain type of high-level process (i.e., a filter, expression, or update strategy) in two or more mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreating the same lookup logic in each mapping. This seems like a great candidate for a mapplet. However, after performing half of the mapplet work, the developers may realize that the actual data or ports passing through the high-level logic are totally different from case to case, thus making the use of a mapplet impractical. Consider whether there is a practical way to generalize the common logic so that it can be successfully applied to multiple cases. Remember, when creating a reusable object, the actual object will be replicated in one to many mappings. Thus, in each mapping using the mapplet or reusable transformation object, the same size and number of ports must pass into and out of the mapping/reusable object. Document the list of the reusable objects that pass this criteria test, providing a high-level description of what each object will accomplish. The detailed design will occur in a future subtask, but at this point the intent is to identify the number and functionality of reusable objects that will be built for the project. Keep in mind that it will be impossible to identify one hundred percent of the reusable objects at this point; the goal here is to create an
inventory of as many as possible, and hopefully the most difficult ones. The remainder will be discovered while building the data integration processes.

PowerCenter Workflow Manager Objects


In some cases, we may have to read data from different sources, put it through the same transformation logic, and write the data to either one destination database or multiple destination databases. Also, depending on the availability of the source, these loads sometimes have to be scheduled at different times. This is an ideal case for creating a re-usable session and performing session overrides at the session instance level for the database connections, pre-session commands and post-session commands. Logging load statistics, failure criteria and success criteria are usually common pieces of code that would be executed for multiple loads in most projects. Some of these common tasks include:
- Notification when the number of rows loaded is less than expected
- Notification when there are any reject rows, using email tasks and link conditions
- Successful completion notification based on success criteria like number of rows loaded, using email tasks and link conditions
- Fail the load based on failure criteria like load statistics or the status of some critical session, using a control task
- Stop/Abort a workflow based on some failure criteria, using a control task
- Based on previous session completion times, calculate the amount of time the downstream session has to wait before it can start, using worklet variables, a timer task and an assignment task

Re-usable worklets can be developed to encapsulate the above-mentioned tasks and can be used in multiple loads. By passing workflow variable values to the worklets and assigning them to worklet variables, one can easily encapsulate common workflow logic.

Mappings
A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. In a simple world, a single source table would populate a single target table. However, in practice, this is usually not the case. Sometimes multiple sources of data need to be combined to create a target table, and sometimes a single source of data creates many target tables. The latter is especially true for mainframe data sources where COBOL OCCURS statements litter the landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes to a separate


table. The goal here is to create an inventory of the mappings needed for the project. For this exercise, the challenge is to think in individual components of data movement. While the business may consider a fact table and its three related dimensions as a single object in the data mart or warehouse, five mappings may be needed to populate the corresponding star schema with data (i.e., one for each of the dimension tables and two for the fact table, each from a different source system). Typically, when creating an inventory of mappings, the focus is on the target tables, with an assumption that each target table has its own mapping, or sometimes multiple mappings. While often true, if a single source of data populates multiple tables, this approach yields multiple mappings. Efficiencies can sometimes be realized by loading multiple tables from a single source. By simply focusing on the target tables, however, these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a spreadsheet listing all of the target tables. Create a column with a number next to each target table. For each of the target tables, in another column, list the source file or table that will be used to populate the table. In the case of multiple source tables per target, create two rows for the target, each with the same number, and list the additional source(s) of data. The table would look similar to the following:

Number | Target Table | Source
1 | Customers | Cust_File
2 | Products | Items
3 | Customer_Type | Cust_File
4 | Orders_Item | Tickets
4 | Orders_Item | Ticket_Items

When completed, the spreadsheet can be sorted either by target table or source table. Sorting by source table can help determine potential mappings that create multiple targets. When using a source to populate multiple tables at once for efficiency, be sure to keep restartability and reloadability in mind. The mapping will always load two or more target tables from the source, so there will be no easy way to rerun a single table. In this example, potentially the Customers table and the Customer_Type tables can be loaded in the same mapping. When merging targets into one mapping in this manner, give both targets the same


number. Then, re-sort the spreadsheet by number. For the mappings with multiple sources or targets, merge the data back into a single row to generate the inventory of mappings, with each number representing a separate mapping. The resulting inventory would look similar to the following:

Number | Target Table | Source
1 | Customers, Customer_Type | Cust_File
2 | Products | Items
4 | Orders_Item | Tickets, Ticket_Items

At this point, it is often helpful to record some additional information about each mapping to help with planning and maintenance. First, give each mapping a name. Apply the naming standards generated in 3.2 Design Development Architecture. These names can then be used to distinguish mappings from one another and can also be put on the project plan as individual tasks. Next, determine for the project a threshold for a high, medium, or low number of target rows. For example, in a warehouse where dimension tables are likely to number in the thousands and fact tables in the hundred thousands, the following thresholds might apply:
- Low: 1 to 10,000 rows
- Medium: 10,000 to 100,000 rows
- High: 100,000+ rows

Assign a likely row volume (high, medium or low) to each of the mappings based on the expected volume of data to pass through the mapping. These high level estimates will help to determine how many mappings are of high volume; these mappings will be the first candidates for performance tuning. Add any other columns of information that might be useful to capture about each mapping, such as a high-level description of the mapping functionality, resource (developer) assigned, initial estimate, actual completion time, or complexity rating.

Last updated: 05-Jun-08 13:10


Metadata Reporting and Sharing

Challenge


Using Informatica's suite of metadata tools effectively in the design of the end-user analysis application.

Description
The Informatica tool suite can capture extensive levels of metadata, but the amount of metadata that is entered depends on the metadata strategy. Detailed information or metadata comments can be entered for all repository objects (e.g., mappings, sources, targets, transformations, ports, etc.). Also, all information about column size and scale, data types, and primary keys is stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it also requires extra time and effort to do so. But once that information is fed into the Informatica repository, it can be retrieved using Metadata Reporter at any time. There are several out-of-the-box reports, and customized reports can also be created to view that information. There are several options available for exporting these reports (e.g., Excel spreadsheet, Adobe .pdf file, etc.). Informatica offers two ways to access the repository metadata:
- Metadata Reporter, a web-based application that allows you to run reports against the repository metadata. This is a very comprehensive tool that is powered by the functionality of Informatica's BI reporting tool, Data Analyzer. It is included on the PowerCenter CD.
- Metadata Exchange (MX) views. Because Informatica does not support or recommend direct reporting access to the repository tables, even for Select Only queries, the second way of reporting on repository metadata is through the views provided by Metadata Exchange (MX).

Metadata Reporter
The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter is based on the Data Analyzer and PowerCenter products. It provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations, reports to access to every Informatica object stored in the repository, and even reports to access objects in the Data Analyzer repository. The architecture of the Metadata Reporter is web-based, with an Internet browser front end. Because Metadata Reporter runs on Data Analyzer, you must have Data Analyzer installed and running before you proceed with Metadata Reporter setup. Metadata Reporter setup includes the following .XML files to be imported from the PowerCenter CD in the same sequence as they are listed below:
1. Schemas.xml
2. Schedule.xml
3. GlobalVariables_Oracle.xml (this file is database specific; Informatica provides GlobalVariable files for DB2, SQLServer, Sybase and Teradata. Select the appropriate file based on your PowerCenter repository environment.)
4. Reports.xml
5. Dashboards.xml

Note: If you have set up a new instance of Data Analyzer exclusively for Metadata Reporter, you should have no problem importing these files. However, if you are using an existing instance of Data Analyzer which you currently use for some other reporting purpose, be careful while importing these files. Some of the files (e.g., global variables, schedules, etc.) may already exist with the same name. You can rename the conflicting objects. The following are the folders that are created in Data Analyzer when you import the above-listed files:
- Data Analyzer Metadata Reporting - contains reports for the Data Analyzer repository itself (e.g., Todays Logins, Reports Accessed by Users Today).
- PowerCenter Metadata Reports - contains reports for the PowerCenter repository. To better organize reports based on their functionality, these reports are further grouped into the following subfolders:
  - Configuration Management - contains a set of reports that provide detailed information on configuration management, including deployment and label details. Subfolders: Deployment, Label, Object Version.
  - Operations - contains a set of reports that enable users to analyze operational statistics including server load, connection usage, run times, load times, number of runtime errors, etc. for workflows, worklets and sessions. Subfolders: Session Execution, Workflow Execution.
  - PowerCenter Objects - contains a set of reports that enable users to identify all types of PowerCenter objects, their properties, and interdependencies on other objects within the repository. Subfolders: Mappings, Mapplets, Metadata Extension, Server Grids, Sessions, Sources, Target, Transformations, Workflows, Worklets.
  - Security - contains a set of reports that provide detailed information on the users, groups and their association within the repository.

Informatica recommends retaining this folder organization, adding new folders if necessary. The Metadata Reporter provides 44 standard reports which can be customized with the use of parameters and wildcards. Metadata Reporter is accessible from any computer with a browser that has access to the web server where the Metadata Reporter is installed, even without the other Informatica client tools being installed on that computer. The Metadata Reporter connects to the PowerCenter repository using JDBC drivers. Be sure the proper JDBC drivers are installed for your database platform. (Note: you can also use the JDBC-to-ODBC bridge to connect to the repository, e.g., jdbc:odbc:<data_source_name>.)
- Metadata Reporter is comprehensive. You can run reports on any repository. The reports provide information about all types of metadata objects.
- Metadata Reporter is easily accessible. Because the Metadata Reporter is web-based, you can generate reports from any machine that has access to the web server.
- The reports in the Metadata Reporter are customizable. The Metadata Reporter allows you to set parameters for the metadata objects to include in the report.
- The Metadata Reporter allows you to go easily from one report to another. The name of any metadata object that displays on a report links to an associated report. As you view a report, you can generate reports for objects on which you need more information.

The following tables list the reports provided by the Metadata Reporter, along with their location and a brief description:

Reports For PowerCenter Repository

Sr No | Name | Folder | Description
1 | Deployment Group | Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group | Displays deployment groups by repository.
2 | Deployment Group History | Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group History | Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates. This is a primary report in an analytic workflow.
3 | Labels | Public Folders>PowerCenter Metadata Reports>Configuration Management>Labels>Labels | Displays labels created in the repository for any versioned object by repository.
4 | All Object Version History | Public Folders>PowerCenter Metadata Reports>Configuration Management>Object Version>All Object Version History | Displays all versions of an object by the date the object is saved in the repository. This is a standalone report.
5 | Server Load by Day of Week | Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Server Load by Day of Week | Displays the total number of sessions that ran, and the total session run duration, for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
6 | Session Run Details | Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Session Run Details | Displays session run details for any start date by repository by folder. This is a primary report in an analytic workflow.
7 | Target Table Load Analysis (Last Month) | Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Target Table Load Analysis (Last Month) | Displays the load statistics for each table for last month by repository by folder. This is a primary report in an analytic workflow.
8 | Workflow Run Details | Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Workflow Run Details | Displays the run statistics of all workflows by repository by folder. This is a primary report in an analytic workflow.
9 | Worklet Run Details | Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Worklet Run Details | Displays the run statistics of all worklets by repository by folder. This is a primary report in an analytic workflow.
10 | Mapping List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping List | Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets. This is a primary report in an analytic workflow.
11 | Mapping Lookup Transformations | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Lookup Transformations | Displays Lookup transformations used in a mapping by repository and folder. This report is a standalone report and also the first node in the analytic workflow associated with the Mapping List primary report.
12 | Mapping Shortcuts | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Shortcuts | Displays mappings defined as a shortcut by repository and folder.
13 | Source to Target Dependency | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Source to Target Dependency | Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
14 | Mapplet List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet List | Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets. This is a primary report in an analytic workflow.
15 | Mapplet Lookup Transformations | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Lookup Transformations | Displays all Lookup transformations used in a mapplet by folder and repository. This report is a standalone report and also the first node in the analytic workflow associated with the Mapplet List primary report.
16 | Mapplet Shortcuts | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Shortcuts | Displays mapplets defined as a shortcut by repository and folder.
17 | Unused Mapplets in Mappings | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Unused Mapplets in Mappings | Displays mapplets defined in a folder but not used in any mapping in that folder.
18 | Metadata Extensions Usage | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Metadata Extensions>Metadata Extensions Usage | Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.
19 | Server Grid List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Server Grid>Server Grid List | Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.
20 | Session List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sessions>Session List | Displays all sessions and their properties by repository by folder. This is a primary report in an analytic workflow.
21 | Source List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source List | Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in an analytic workflow.
22 | Source Shortcuts | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source Shortcuts | Displays sources that are defined as shortcuts by repository and folder.
23 | Target List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target List | Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in an analytic workflow.
24 | Target Shortcuts | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target Shortcuts | Displays targets that are defined as shortcuts by repository and folder.
25 | Transformation List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation List | Displays transformations defined by repository and folder. This is a primary report in an analytic workflow.
26 | Transformation Shortcuts | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation Shortcuts | Displays transformations that are defined as shortcuts by repository and folder.
27 | Scheduler (Reusable) List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Scheduler (Reusable) List | Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder. This is a primary report in an analytic workflow.
28 | Workflow List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Workflow List | Displays workflows and workflow properties by repository by folder. This report is a primary report in an analytic workflow.
29 | Worklet List | Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Worklets>Worklet List | Displays worklets and worklet properties by repository by folder. This is a primary report in an analytic workflow.
30 | Users By Group | Public Folders>PowerCenter Metadata Reports>Security>Users By Group | Displays users by repository and group.

Reports For Data Analyzer Repository

Sr No | Name | Folder | Description
1 | Bottom 10 Least Accessed Reports this Year | Public Folders>Data Analyzer Metadata Reporting>Bottom 10 Least Accessed Reports this Year | Displays the ten least accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
2 | Report Activity Details | Public Folders>Data Analyzer Metadata Reporting>Report Activity Details | Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year" and "Usage by Login (Month To Date)".
3 | Report Activity Details for Current Month | Public Folders>Data Analyzer Metadata Reporting>Report Activity Details for Current Month | Provides information about reports accessed in the current month until the current date.
4 | Report Refresh Schedule | Public Folders>Data Analyzer Metadata Reporting>Report Refresh Schedule | Provides information about the next scheduled update for scheduled reports. It can be used to decide schedule timing for various reports for optimum system performance.
5 | Reports Accessed by Users Today | Public Folders>Data Analyzer Metadata Reporting>Reports Accessed by Users Today | Part of the analytic workflow for "Today's Logins". It provides detailed information on the reports accessed by users today. This can be used independently to get comprehensive information about today's report activity details.
6 | Todays Logins | Public Folders>Data Analyzer Metadata Reporting>Todays Logins | Provides the login count and average login duration for users who logged in today.
7 | Todays Report Usage by Hour | Public Folders>Data Analyzer Metadata Reporting>Todays Report Usage by Hour | Provides information about the number of reports accessed today for each hour. The analytic workflow attached to it provides more details on the reports accessed and users who accessed them during the selected hour.
8 | Top 10 Most Accessed Reports this Year | Public Folders>Data Analyzer Metadata Reporting>Top 10 Most Accessed Reports this Year | Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
9 | Top 5 Logins (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>Top 5 Logins (Month To Date) | Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
10 | Top 5 Longest Running On-Demand Reports (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date) | Shows the five longest running on-demand reports for the current month to date. It displays the average total response time, average DB response time, and the average Data Analyzer response time (all in seconds) for each report shown.
11 | Top 5 Longest Running Scheduled Reports (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date) | Shows the five longest running scheduled reports for the current month to date. It displays the average response time (in seconds) for each report shown.
12 | Total Schedule Errors for Today | Public Folders>Data Analyzer Metadata Reporting>Total Schedule Errors for Today | Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.
13 | User Logins (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>User Logins (Month To Date) | Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
14 | Users Who Have Never Logged On | Public Folders>Data Analyzer Metadata Reporting>Users Who Have Never Logged On | Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.

Customizing a Report or Creating New Reports


Once you select a report, you can customize it by setting the parameter values and/or creating new attributes or metrics. Data Analyzer includes simple steps to create new reports or modify existing ones. Adding or modifying filters offers tremendous reporting flexibility. Additionally, you can set up report templates and export them as Excel files, which can be refreshed as necessary. For more information on the attributes, metrics, and schemas included with the Metadata Reporter, consult the product documentation.

Wildcards
The Metadata Reporter supports two wildcard characters:

- Percent symbol (%) - represents any number of characters and spaces.
- Underscore (_) - represents one character or space.

You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values and is the same as using %. The following examples show how you can use the wildcards to set parameters. Suppose you have the following values available to select:
items, items_in_promotions, order_items, promotions

The following table shows the return values for some wildcard combinations you can use:

Wildcard Combination | Return Values
% | items, items_in_promotions, order_items, promotions
<blank> | items, items_in_promotions, order_items, promotions
%items | items, order_items
item_ | items
item% | items, items_in_promotions
___m% | items, items_in_promotions, promotions
%pr_mo% | items_in_promotions, promotions

A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document. For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the PowerCenter documentation.

Security Awareness for Metadata Reporter


Metadata Reporter uses Data Analyzer for reporting out of the PowerCenter/Data Analyzer repository. Data Analyzer has a robust security mechanism that is inherited by Metadata Reporter. You can establish groups, roles, and/or privileges for users based on their profiles. Since the information in the PowerCenter repository does not change often after it goes to production, the Administrator can create some reports and export them to files that can be distributed to the user community. If the number of users for Metadata Reporter is limited, you can implement security using report filters or the data restriction feature. For example, if a user in the PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it to the user's profile. For more information on the ways in which you can implement security in Data Analyzer, refer to the Data Analyzer documentation.

Metadata Exchange: the Second Generation (MX2)


The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data warehouse and display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the underlying repository tables while exposing the metadata in several categories that were more suitable for external parties. Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy are effectively using the MX views to report and query the Informatica metadata.

Informatica currently supports the second generation of Metadata Exchange called MX2. Although the overall motivation for creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2 supersede those of MX. The primary requirements and features of MX2 are:

- Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing and manipulating records of data in a relational paradigm, it is not suitable for procedural programming tasks that can be achieved by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based API for accessing and manipulating the PowerCenter Repository from various programming languages.
- Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the repository database and thus can be used independent of any of the Informatica software products. The same requirement also holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently of the client or server products.
- Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica partners by means of the new MX2 interfaces.
- Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with the appropriate verification and validation features to ensure the integrity of the metadata in the repository.
- Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.
- Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools.
- Synchronization of metadata based on changes from up-stream and down-stream tools. Given that metadata is likely to reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to implement automatic metadata synchronization and change propagation across different tools that access the PowerCenter Repository.
- Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft's Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant can seamlessly interface with the PowerCenter Repository by means of MX2.

Last updated: 27-May-08 12:03


Repository Tables & Metadata Management

Challenge


Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.

Description
Regular actions such as backups, testing backup and restore procedures, and deleting unwanted information from the repository maintain the repository for better performance.

Managing Repository
The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating statistics.

Repository backup
A repository backup can be performed using the client tool Repository Server Admin Console or the command line program pmrep. Backups using pmrep can be automated and scheduled to run regularly.
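As an illustration, a minimal pmrep-based backup script might look like the following sketch. The repository name, credentials, host, port and paths are placeholders, and the exact pmrep option letters vary by PowerCenter version, so verify them against the pmrep command reference before use.

#!/bin/sh
# Sketch of a nightly repository backup using pmrep (connection details are placeholders).
BACKUP_DIR=/opt/informatica/repo_backups
STAMP=`date +%Y%m%d`

# Connect to the repository (option letters may differ between PowerCenter versions).
pmrep connect -r DEV_REPO -n Administrator -x AdminPassword -h repo_host -o 5001

# Write the backup file, overwriting any existing file of the same name.
pmrep backup -o $BACKUP_DIR/DEV_REPO_$STAMP.rep -f

# Compress the backup to save space.
gzip $BACKUP_DIR/DEV_REPO_$STAMP.rep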

This shell script can be scheduled to run as a cron job for regular backups. Alternatively, the script can be called from PowerCenter via a command task. The command task can be placed in a workflow and scheduled to run daily.


The following are some useful practices for maintaining backups:

- Frequency: Backup frequency depends on the activity in the repository. For production repositories, backup is recommended once a month or prior to a major release. For development repositories, backup is recommended once a week or once a day, depending upon the team size.
- Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a utility such as winzip or gzip.
- Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than the repository itself.
- Move backups offline: Review the backups on a regular basis to determine how long they need to remain online. Any that are not required online should be moved offline, to tape, as soon as possible.

Restore repository
Although the repository restore function is used primarily as part of disaster recovery, it can also be useful for testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica recommends testing the backup files and recovery process at least once each quarter. The repository can be restored using the client tool, Repository Server Administrator Console, or the command line program pmrepagent.
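As a sketch only, a command line restore into an empty repository database might look like the line below. The repository name, database type, credentials, connect string, file name and option letters are assumptions; check the pmrepagent command reference for your PowerCenter version.

# Restore a repository backup file into an empty target database (all values are placeholders).
pmrepagent restore -r DEV_REPO -t Oracle -u repo_db_user -p repo_db_password -c ORCL -i /opt/informatica/repo_backups/DEV_REPO_20080601.rep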

Restore folders
There is no easy way to restore only one particular folder from backup. First the backup repository has to be restored into a new repository, then you can use the client tool, repository manager, to copy the entire folder from the restored repository into the target repository.

Remove older versions


Use the purge command to remove older versions of objects from the repository. To purge a specific version of an object, view the history of the object, select the version, and purge it.

Finding deleted objects and removing them from repository


If a PowerCenter repository is enabled for versioning through the use of the Team Based Development option, objects that have been deleted from the repository are not visible in the client tools. To list or view deleted objects, use the find checkouts command in the client tools, a query generated in the Repository Manager, or a specific query.

After an object has been deleted from the repository, you cannot create another object with the same name unless the deleted object has been completely removed from the repository. Use the purge command to completely remove deleted objects from the repository. Keep in mind, however, that you must remove all versions of a deleted object to completely remove it from the repository.

Truncating Logs
You can truncate the log information (for sessions and workflows) stored in the repository either by using repository manager or the pmrep command line program. Logs can be truncated for the entire repository or for a particular folder. Options allow truncating all log entries or selected entries based on date and time.
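For example, a pmrep-based truncation might be sketched as follows; the folder name, date format and option letters are assumptions to be checked against the pmrep command reference for your version.

# After connecting to the repository with pmrep connect (see the backup sketch above),
# remove log entries older than a given date for a single folder (values are placeholders).
pmrep truncatelog -t "05/01/2008 00:00:00" -f SALES_DM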


Repository Performance
Analyzing (or updating the statistics of) repository tables can help improve repository performance. Because this process should be carried out for all tables in the repository, a script offers the most efficient means. You can then schedule the script to run using either an external scheduler or a PowerCenter workflow with a command task to call the script.
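For an Oracle-hosted repository, such a script might be sketched as follows. The OPB_ table name pattern and the use of DBMS_STATS are assumptions to validate against your environment and database platform.

-- Sketch: refresh optimizer statistics for all PowerCenter repository tables (Oracle).
-- Run as the repository database user; adapt for other database platforms.
BEGIN
  FOR t IN (SELECT table_name FROM user_tables WHERE table_name LIKE 'OPB_%') LOOP
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname => USER,
      tabname => t.table_name,
      cascade => TRUE);  -- also gather statistics on the table's indexes
  END LOOP;
END;
/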

Repository Agent and Repository Server performance


Factors such as team size, network, number of objects involved in a specific operation, number of old locks (on repository objects), etc. may reduce the efficiency of the repository server (or agent). In such cases, the various causes should be analyzed and the repository server (or agent) configuration file modified to improve performance.

Managing Metadata
The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7. Minor changes in the queries may be required for PowerCenter repositories residing on other databases.

Failed Sessions
The following query lists the failed sessions in the last day. To make it work for the last n days, replace SYSDATE - 1 with SYSDATE - n.

SELECT Subject_Area AS Folder,
       Session_Name,
       Last_Error AS Error_Message,
       DECODE (Run_Status_Code, 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Status,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code != 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)

Long running Sessions


The following query lists long running sessions (those that ran for more than 10 minutes) in the last day. To make it work for the last n days, replace SYSDATE - 1 with SYSDATE - n.

SELECT Subject_Area AS Folder,
       Session_Name,
       Successful_Source_Rows AS Source_Rows,
       Successful_Rows AS Target_Rows,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code = 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
AND    (Session_TimeStamp - Actual_Start) > (10/(24*60))
ORDER BY Session_TimeStamp

Invalid Tasks
The following query lists the folder name, task type, task name, version number, and last saved date for all invalid tasks.

SELECT SUBJECT_AREA AS FOLDER_NAME,
       DECODE(IS_REUSABLE, 1, 'Reusable', ' ') || ' ' || TASK_TYPE_NAME AS TASK_TYPE,
       TASK_NAME AS OBJECT_NAME,
       VERSION_NUMBER,          -- comment out for V6
       LAST_SAVED
FROM   REP_ALL_TASKS
WHERE  IS_VALID = 0
AND    IS_ENABLED = 1
--AND  CHECKOUT_USER_ID = 0     -- comment out for V6
--AND  IS_VISIBLE = 1           -- comment out for V6
ORDER BY SUBJECT_AREA, TASK_NAME

Load Counts
The following query lists the load counts (number of rows loaded) and status for sessions run in the last day.

SELECT subject_area,
       workflow_name,
       session_name,
       DECODE (Run_Status_Code, 1, 'Succeeded', 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Session_Status,
       successful_rows,
       failed_rows,
       actual_start
FROM   REP_SESS_LOG
WHERE  TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
ORDER BY subject_area,
         workflow_name,
         session_name,
         Session_Status

Using Metadata Extensions

Challenge


To provide for efficient documentation and achieve extended metadata reporting through the use of metadata extensions in repository objects.

Description
Metadata Extensions, as the name implies, help you to extend the metadata stored in the repository by associating information with individual objects in the repository. Informatica Client applications can contain two types of metadata extensions: vendor-defined and user-defined.

- Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.
- User-defined. You create user-defined metadata extensions using PowerCenter clients. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.

You can create reusable or non-reusable metadata extensions. You associate reusable metadata extensions with all repository objects of a certain type. So, when you create a reusable extension for a mapping, it is available for all mappings. Vendor-defined metadata extensions are always reusable. Non-reusable extensions are associated with a single repository object. Therefore, if you edit a target and create a non-reusable extension for it, that extension is available only for the target you edit. It is not available for other targets. You can promote a non-reusable metadata extension to reusable, but you cannot change a reusable metadata extension to non-reusable. Metadata extensions can be created for the following repository objects:

- Source definitions
- Target definitions
- Transformations (Expressions, Filters, etc.)
- Mappings
- Mapplets
- Sessions
- Tasks
- Workflows
- Worklets

Metadata Extensions offer a very easy and efficient method of documenting important information associated with repository objects. For example, when you create a mapping, you can store the mapping owner's name and contact information with the mapping, or when you create a source definition, you can enter the name of the person who created or imported the source.

The power of metadata extensions is most evident in the reusable type. When you create a reusable metadata extension for any type of repository object, that metadata extension becomes part of the properties of that type of object. For example, suppose you create a reusable metadata extension for source definitions called SourceCreator. When you create or edit any source definition in the Designer, the SourceCreator extension appears on the Metadata Extensions tab. Anyone who creates or edits a source can enter the name of the person that created the source into this field.

You can create, edit, and delete non-reusable metadata extensions for sources, targets, transformations, mappings, and mapplets in the Designer. You can create, edit, and delete non-reusable metadata extensions for sessions, workflows, and worklets in the Workflow Manager. You can also promote non-reusable metadata extensions to reusable extensions using the Designer or the Workflow Manager. You can also create reusable metadata extensions in the Workflow Manager or Designer. You can create, edit, and delete reusable metadata extensions for all types of repository objects using the Repository Manager. If you want to create, edit, or delete metadata extensions for multiple objects at one time, use the Repository Manager. When you edit a reusable metadata extension, you can modify the properties Default Value, Permissions and Description.

Note: You cannot create non-reusable metadata extensions in the Repository Manager. All metadata extensions created in the Repository Manager are reusable. Reusable metadata extensions are repository wide.

You can also migrate Metadata Extensions from one environment to another. When you do a copy folder operation, the Copy Folder Wizard copies the metadata extension values associated with those objects to the target repository. A non-reusable metadata extension will be copied as a non-reusable metadata extension in the target repository. A reusable metadata extension is copied as reusable in the target repository, and the object retains the individual values. You can edit and delete those extensions, as well as modify the values.

Metadata Extensions provide for extended metadata reporting capabilities. Using the Informatica MX2 API, you can create useful reports on metadata extensions. For example, you can create and view a report on all the mappings owned by a specific team member. You can use various programming environments such as Visual Basic, Visual C++, C++ and the Java SDK to write API modules. The Informatica Metadata Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++ applications. Additionally, Metadata Extensions can also be populated via data modeling tools such as ERwin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange for Data Models. With Informatica Metadata Exchange for Data Models, the Informatica Repository interface can retrieve and update the extended properties of source and target definitions in PowerCenter repositories. Extended Properties are the descriptive, user-defined, and other properties derived from your data modeling tool, and you can map any of these properties to the metadata extensions that are already defined in the source or target object in the Informatica repository.

Last updated: 27-May-08 12:04

INFORMATICA CONFIDENTIAL

BEST PRACTICES

605 of 954

Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

Challenge
The role that the PowerCenter repository can play in an automated QA strategy is often overlooked and under-appreciated. This repository is essentially a database about the transformation process and the software developed to implement it; the challenge is to devise a method to exploit this resource for QA purposes. To address this challenge, Informatica PowerCenter provides several pre-packaged reports (PowerCenter Repository Reports) that can be installed on a Data Analyzer or Metadata Manager installation. These reports provide a great deal of useful information about PowerCenter object metadata and operational metadata that can be used for quality assurance.

Description
Before considering the mechanics of an automated QA strategy, it is worth emphasizing that quality should be built in from the outset. If the project involves multiple mappings repeating the same basic transformation pattern(s), it is probably worth constructing a virtual production line. This is essentially a template-driven approach to accelerate development and enforce consistency through the use of the following aids:

- A shared template for each type of mapping.
- Checklists to guide the developer through the process of adapting the template to the mapping requirements.
- Macros/scripts to generate productivity aids such as SQL overrides, etc.

It is easier to ensure quality from a standardized base than to rely on developers to accurately repeat the same basic keystrokes. Underpinning the exploitation of the repository for QA purposes is the adoption of naming standards which categorize components. By running the appropriate query on the repository, it is possible to identify those components whose attributes differ from those predicted for the category. Thus, it is quite possible to automate some aspects of QA. Clearly, the function of naming conventions is not just to standardize, but also to provide logical access paths into the information in the repository; names can be used to identify patterns and/or categories and thus allow assumptions to be made about object attributes. Along with the facilities provided to query the repository, such as the Metadata Exchange (MX) Views and the PowerCenter Metadata Manager, this opens the door to an automated QA strategy.

For example, consider the following situation: it is possible that the EXTRACT mapping/session should always truncate the target table before loading; conversely, the TRANSFORM and LOAD phases should never truncate a target. Possible code errors in this respect can be identified as follows:

- Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or LOAD.
- Develop a query on the repository to search for sessions named EXTRACT, which do not have the truncate target option set.
- Develop a query on the repository to search for sessions named TRANSFORM or LOAD, which do have the truncate target option set.
- Provide a facility to allow developers to run both queries before releasing code to the test environment.

Alternatively, a standard may have been defined to prohibit unconnected output ports from transformations (such as Expressions) in a mapping. These can be identified very easily from the MX view REP_MAPPING_UNCONN_PORTS.

The following bullets represent a high-level overview of the steps involved in automating QA:
- Review the transformations/mappings/sessions/workflows and allocate them to broadly representative categories.
- Identify the key attributes of each category.
- Define naming standards to identify the category for transformations/mappings/sessions/workflows.
- Analyze the MX views to source the key attributes.
- Develop the query to compare actual and expected attributes for each category.

After you have completed these steps, it is possible to develop a utility that compares actual and expected attributes for developers to run before releasing code into any test environment (a minimal sketch of such a utility follows this list). Such a utility may incorporate the following processing stages:
- Execute a profile to assign environment variables (e.g., repository schema user, password, etc.).
- Select the folder to be reviewed.
- Execute the query to find exceptions.
- Report the exceptions in an accessible format.
- Exit with failure if exceptions are found.
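The sketch below illustrates what such a utility might look like; it is illustrative only, not a definitive implementation. It assumes a Python environment with the pyodbc driver and an ODBC DSN that points at the PowerCenter repository schema, and the folder-name column (SUBJECT_AREA) should be verified against the MX view definitions in your PowerCenter version. Only the REP_MAPPING_UNCONN_PORTS check named in this Best Practice is shown; additional exception queries (such as the EXTRACT/TRANSFORM/LOAD truncate checks) can be added once the relevant MX views have been analyzed.

import sys
import pyodbc

# Each exception query should return zero rows when the folder conforms to the
# naming standards. Column names are illustrative; confirm them against the MX
# view definitions in your repository version.
EXCEPTION_QUERIES = {
    "Unconnected output ports":
        "SELECT * FROM REP_MAPPING_UNCONN_PORTS WHERE SUBJECT_AREA = ?",
}

def run_checks(dsn, user, password, folder):
    conn = pyodbc.connect("DSN=%s;UID=%s;PWD=%s" % (dsn, user, password))
    cursor = conn.cursor()
    exceptions = 0
    for description, query in EXCEPTION_QUERIES.items():
        rows = cursor.execute(query, folder).fetchall()
        if rows:
            exceptions += len(rows)
            print("EXCEPTION: %s - %d object(s) in folder %s"
                  % (description, len(rows), folder))
            for row in rows:
                print("    %s" % str(row))
    conn.close()
    return exceptions

if __name__ == "__main__":
    dsn, user, password, folder = sys.argv[1:5]
    # A non-zero exit code signals the release process to stop.
    sys.exit(1 if run_checks(dsn, user, password, folder) else 0)

Because the utility exits with a non-zero code when exceptions are found, it can be wired into the release scripts so that non-conforming folders are caught before migration to the test environment.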

TIP: Queries that bypass the MX views and read the repository tables directly will require modification whenever PowerCenter is upgraded; for this reason, Informatica does not recommend them.

The principal objective of any QA strategy is to ensure that developed components adhere to standards and to identify defects before incurring overhead during the migration from development to test/production environments. Qualitative, peer-based reviews of PowerCenter objects due for release obviously have their part to play in this process.

Using Metadata Manager and PowerCenter Repository Reports for Quality Assurance
The need for the Informatica Metadata Reporter grew out of a number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations. In this section, we focus primarily on how these reports and custom reports can help ease the QA process. The following reports can help identify regressions in load performance:
- Session Run Details
- Workflow Run Details
- Worklet Run Details
- Server Load by Day of the Week, which can help determine the load on the server before and after QA migrations and may help balance the loads through the week by modifying the schedules.
- Target Table Load Analysis, which can help identify any data regressions in the number of records loaded in each target (if a baseline was established before the migration/upgrade).
- The Failed Session report, which lists failed sessions at a glance and is very helpful after a major QA migration or QA of an Informatica upgrade process.

During large deployments to QA, the code review team can look at the following reports to determine whether standards (i.e., naming standards, comments for repository objects, metadata extension usage, etc.) were followed. Accessing this information from PowerCenter Repository Reports typically reduces the time required for review because the reviewer doesn't need to open each mapping and check for these details. All of the following are out-of-the-box reports provided by Informatica:
- Label report
- Mappings list
- Mapping shortcuts
- Mapping Lookup transformations
- Mapplet list
- Mapplet shortcuts
- Mapplet Lookup transformations
- Metadata extensions usage
- Sessions list
- Worklets list
- Workflows list
- Source list
- Target list
- Custom reports based on the review requirements

In addition, note that the following reports are also useful during migration and upgrade processes:
- The Invalid Objects and Deployment Group reports in the QA repository help to determine which deployments caused the invalidations.
- The Invalid Objects report against the development repository helps to identify the invalid objects that are part of a deployment before QA migration.
- The Invalid Objects report helps in QA of an Informatica upgrade process.

The following table summarizes some of the reports that Informatica ships with a PowerCenter Repository Reports installation:

1. Deployment Group: Displays deployment groups by repository.
2. Deployment Group History: Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates.
3. Labels: Displays labels created in the repository for any versioned object by repository.
4. All Object Version History: Displays all versions of an object by the date the object is saved in the repository.
5. Server Load by Day of Week: Displays the total number of sessions that ran, and the total session run duration, for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
6. Session Run Details: Displays session run details for any start date by repository by folder.
7. Target Table Load Analysis (Last Month): Displays the load statistics for each table for the last month by repository by folder.
8. Workflow Run Details: Displays the run statistics of all workflows by repository by folder.
9. Worklet Run Details: Displays the run statistics of all worklets by repository by folder.
10. Mapping List: Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets.
11. Mapping Lookup Transformations: Displays Lookup transformations used in a mapping by repository and folder.
12. Mapping Shortcuts: Displays mappings defined as a shortcut by repository and folder.
13. Source to Target Dependency: Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
14. Mapplet List: Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets.
15. Mapplet Lookup Transformations: Displays all Lookup transformations used in a mapplet by folder and repository.
16. Mapplet Shortcuts: Displays mapplets defined as a shortcut by repository and folder.
17. Unused Mapplets in Mappings: Displays mapplets defined in a folder but not used in any mapping in that folder.
18. Metadata Extensions Usage: Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.
19. Server Grid List: Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.
20. Session List: Displays all sessions and their properties by repository by folder. This is a primary report in a data integration workflow.
21. Source List: Displays relational and non-relational sources by repository and folder. It also shows the source properties. This is a primary report in a data integration workflow.
22. Source Shortcuts: Displays sources that are defined as shortcuts by repository and folder.
23. Target List: Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in a data integration workflow.
24. Target Shortcuts: Displays targets that are defined as shortcuts by repository and folder.
25. Transformation List: Displays transformations defined by repository and folder. This is a primary report in a data integration workflow.
26. Transformation Shortcuts: Displays transformations that are defined as shortcuts by repository and folder.
27. Scheduler (Reusable) List: Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder.
28. Workflow List: Displays workflows and workflow properties by repository by folder.
29. Worklet List: Displays worklets and worklet properties by repository by folder.

Last updated: 05-Jun-08 13:27


Configuring Standard Metadata Resources

Challenge


Metadata that is derived from a variety of sources and tools is often disparate and fragmented. To be of value, metadata needs to be consolidated into a central repository. Informatica's Metadata Manager provides a central repository for the capture and analysis of critical metadata. Before you can browse and search metadata in the Metadata Manager warehouse, you must configure Metadata Manager, create resources, and then load the resource metadata.

Description
Informatica Metadata Manager is a web-based metadata management tool that you can use to browse and analyze metadata from disparate metadata repositories. Metadata Manager helps you understand and manage how information and processes are derived, the fundamental relationships between information and processes, and how they are used. Metadata Manager extracts metadata from application, business intelligence, data integration, data modeling, and relational metadata sources. It uses PowerCenter workflows to extract metadata from these sources and load it into a centralized metadata warehouse called the Metadata Manager warehouse.

Metadata Manager uses resources to represent the metadata it manages. Each resource represents metadata from one metadata source. Metadata Manager shows the metadata for each resource in the metadata catalog, a hierarchical representation of the metadata in the Metadata Manager warehouse.

There are several steps to configuring a standard resource in Metadata Manager. It is very important to identify, set up, and test your resource connections before configuring a resource in Metadata Manager. Informatica recommends establishing naming standards, usually prefixed by the metadata source type (e.g., for a SQL Server relational database, use SS_databasename_schemaname). The steps below describe how to load metadata from a metadata source into the Metadata Manager warehouse; each detailed section then shows the information needed for the individual standard resource types.

Loading Metadata Resource into Metadata Manager Warehouse


The Load page in the Metadata Manager Application is used to create and load resources into the Metadata Manager warehouse. Use the Load page to monitor and schedule resource loads, purge metadata from the Metadata Manager warehouse, and manage the search index. Complete the following steps to load metadata from a metadata source into the Metadata Manager warehouse:

1. Set up Metadata Manager and metadata sources. Create a Metadata Manager Service, install Metadata Manager, and configure the metadata sources from which you want to extract metadata.
2. Create resources. Create resources that represent the metadata sources from which you want to extract metadata.
3. Configure resources. Configure the resources, including metadata source files and direct source connections, parameters, and connection assignments. You can also purge metadata for a previously loaded resource and update the index for resources.
4. Load and monitor resources. Load a resource to load its metadata into the Metadata Manager warehouse. When you load a resource, Metadata Manager extracts and loads the metadata for the resource. You can monitor the status of all resources and the status of individual resources, and you can schedule resource loads.
5. Manage resource and object permissions for Metadata Manager users. You can configure the resources and metadata objects in the warehouse to which Metadata Manager users have access.

Use the Metadata Manager command line programs to load resources, monitor the status of resource loads and PowerCenter workflows, and back up and restore the Metadata Manager repository.

Configure Metadata Resources


Before you configure resources and load metadata into the Metadata Manager warehouse, you must configure the metadata sources. For metadata sources that use a source file, you select the source file when you configure the resource. If you do not correctly configure the metadata sources, the metadata load can fail or the metadata can be loaded incorrectly in the Metadata Manager warehouse. Table 2-1 describes the configuration tasks for the metadata sources:

Table 2-1. Metadata Source Configuration Tasks

Application
- SAP: Install SAP transports and configure permissions. For more information, see SAP.

Business Intelligence
- Business Objects: Export documents, universes, and Crystal Reports to a repository. For more information, see Business Objects.
- Cognos ReportNet Content Manager: Verify that you have access to the ReportNet URL. Metadata Manager uses the ReportNet URL to access the source repository metadata.
- Cognos Impromptu: Use the Cognos client tool to export the metadata to a .cat file.
- Hyperion Essbase: Export metadata to an XML file. For more information, see Hyperion Essbase.
- IBM DB2 Cube Views: Export metadata to an XML file. For more information, see IBM DB2 Cube Views.
- Microstrategy: Configure the database user account and projects. For more information, see Microstrategy.

Data Integration
- PowerCenter: Metadata Manager extracts the latest version of objects that are checked into the PowerCenter repository. Check in all metadata objects that you want to extract from the PowerCenter repository. For more information, see PowerCenter.

Database Management
- IBM DB2 UDB, Informix, Microsoft SQL Server, Oracle, Sybase, Teradata: Configure the permissions for the database user account. For more information, see Relational Database Sources.

Data Modeling*
- Embarcadero ERStudio: Use the ERStudio client tool to export the metadata to a .dm1 file.
- ERwin: Export metadata. For more information, see ERwin.
- Oracle Designer: Use the Oracle Designer client tool to export the metadata to a .dat file.
- Rational Rose ER: Use the Rational Rose client tool to export the metadata to an .mdl file.
- Sybase PowerDesigner: Use the Sybase PowerDesigner client tool to save the model to a .pdm file in XML format.
- Visio: Use the Visio client tool to export the metadata to an .erx file.

Custom
- Custom: Export metadata to a .csv or .txt file. For more information, see Custom Metadata Sources.

* You can load multiple models from the same data modeling tool source. For more information, see Data Modeling Tool Sources.

Standard Resource Types

Business Objects


The Business Objects resource requires you to install Business Objects Designer on the machine hosting the Metadata Manager console and to provide a user name and password for the Business Objects repository. Export the Business Objects universes, documents, and Crystal Reports to the Business Objects source repository. You can extract documents, universes, and Crystal Reports that have been exported to the source repository; you cannot extract metadata from documents or universes that have not been exported. Export from source repositories to make sure that the metadata in the Metadata Manager warehouse is consistent with the metadata that is distributed to Business Objects users.

Use Business Objects Designer to export a universe to the Business Objects source repository. For example, to begin the export process in Business Objects Designer, click File > Export. A secured connection type is required to export a universe to a Business Objects source repository.

Use Business Objects to export a document to the Business Objects repository. For example, to begin the export process in Business Objects, click File > Publish To > Corporate Documents. Use the Business Objects Central Management Console to export Crystal Reports to the Business Objects repository. The screenshot below displays the information you will need to add the resource.

Custom Metadata Sources


If you create a custom resource and use a metadata source file, you must export the metadata to a metadata file with a .csv or .txt file extension. When you configure the custom resource, you specify the metadata file.

Data Modeling Tool Sources


You can load multiple models from a data modeling tool into the Metadata Manager warehouse. After you load the metadata, the Metadata Manager catalog shows the models from the same modeling tool under the resource name. This behavior applies to all data modeling tool resource types.

ERwin / ER-Studio
Metadata Manager extracts ERwin metadata from a metadata file. When you configure the connection to the ERwin source repository in Metadata Manager, you specify the metadata file. The required format for the metadata file depends on the version of the ERwin source repository. The following table specifies the required file type for each supported version:

- ERwin 3.0 to 3.5.2: .erx
- ERwin 4.0 SP1 to 4.1: .er1 or .xml
- ERwin 7.x: .erwin or .xml

The screenshot below displays the information you will need to add the resource.

Hyperion Essbase
Use the Hyperion Essbase client tool to export the metadata to an .xml file. Metadata Manager extracts Hyperion Essbase metadata from a metadata file with an .xml file extension. When you set up the resource for Hyperion Essbase in Metadata Manager, you specify the metadata file. Use the Hyperion Essbase Integration Server to export the source metadata to an XML file, exporting one model to each metadata file. To export the Hyperion model to an XML file:

1. Log in to Hyperion Essbase Integration Server.
2. Create the Hyperion source or open an existing model.
3. Click File > Save to save the model if you created or updated it.
4. Click File > XML Import/Export.
5. On the Export tab, select the model.
6. Click Save As XML File. A pop-up window appears.
7. Select the location where you want to store the XML file.

The screenshot below displays the information you will need to add the resource.

IBM DB2 Cube Views


Use the IBM DB2 Cube Views OLAP Center GUI to export cube models to .xml files. When you configure the resource for DB2 Cube Views in Metadata Manager, you specify the metadata files.

TIP: You can load multiple cube models into the Metadata Manager warehouse. Export each cube model into a separate .xml file and give the file the same name as the cube model. If you export multiple cube models into one .xml file, export the same cube models into the same .xml file each time you export them.

The screenshot below displays the information you will need to add the resource.


Microstrategy
To configure Microstrategy, complete the following tasks:
- Configure permissions.
- Configure multiple projects (optional).

The screenshot below displays the information you will need to add the resource.


Configure Permissions

The Microstrategy project user account for which you provide the user name and password must have the Bypass All Object Security Access Checks administration privilege. You set this privilege in the Microstrategy Desktop client tool.

Note: Although Microstrategy allows you to connect to a project source using database or network authentication, Metadata Manager uses project source authentication.

Configure Multiple Projects in the Same Metadata File

Microstrategy projects can be from different project sources. You can load multiple Microstrategy projects under the same Microstrategy resource. You must provide the user name and password for each project source, and project names must be unique. When you configure the Microstrategy resource, you specify the project source, project, user name, and password for each project.

PowerCenter
The screenshot below displays the information you will need to add the resource.


Relational Database Sources


Configure the permissions for the IBM DB2 UDB, Informix, Microsoft SQL Server, Oracle, Sybase ASE, and Teradata database user account. The database user account you use to connect to the metadata source must have SELECT permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms. Note: For Oracle resources, the user account must also be granted the SELECT_CATALOG_ROLE role.

DB2 Resource
jdbc:informatica:db2://host_name:port;DatabaseName=database_name

Informix Resource
jdbc:informatica:informix://host_name:port;InformixServer=server_name;DatabaseName=database_name

SQL Server Resource
jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name
Connection String: For a default instance, SQL Server Name@Database Name. For a named instance, Server Name\Instance Name@Database Name.

Oracle Resource
jdbc:informatica:oracle://host_name:port;SID=sid
Connect String: the Oracle instance name.

If the metadata in the Oracle source database contains Unicode characters, set the NLS_LENGTH_SEMANTICS parameter from BYTE to CHAR. Specify a user name and password with access to the Oracle database metadata. Be sure that the user has the Select Any Table privilege and SELECT permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms. Also ensure that the user has SELECT permission on SYS.V_$INSTANCE. One resource is needed for each Oracle instance.
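Because a missing privilege typically surfaces only part-way through a resource load, it can be worth smoke-testing the source connection first. The following is a minimal sketch, not part of Metadata Manager itself; it assumes the cx_Oracle Python driver is available and simply exercises the Oracle permissions listed above. Adapt the queries for the other relational source types.

import cx_Oracle

def check_oracle_source(user, password, dsn, schema):
    # dsn is in the form "host:port/SID", matching the resource definition above.
    conn = cx_Oracle.connect(user, password, dsn)
    cursor = conn.cursor()
    # Catalog visibility: the resource user should be able to see the schema's tables.
    cursor.execute("SELECT COUNT(*) FROM ALL_TABLES WHERE OWNER = :owner",
                   owner=schema.upper())
    print("Tables visible in schema:", cursor.fetchone()[0])
    # SELECT permission on SYS.V_$INSTANCE, as required above.
    cursor.execute("SELECT COUNT(*) FROM SYS.V_$INSTANCE")
    print("SYS.V_$INSTANCE accessible:", cursor.fetchone()[0] > 0)
    conn.close()

# Hypothetical example values:
# check_oracle_source("mm_src_user", "secret", "dbhost:1521/ORCL", "SALES_DW")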

Teradata Resource:
jdbc:teradata://database_server_name/Database=database_name
Connect String: be sure that the user has access to all the system DBC tables.

SAP
To configure SAP, complete the following tasks:
- Install PowerCenter transports.
- Configure the user authorization profile.

Installing Transports

To extract metadata from SAP, you must install PowerCenter transports. The transports are located in the following folder in the location where you downloaded PowerCenter: <download location>\saptrans\mySAP

Table 2-2 describes the transports you must install:

Table 2-2. SAP Transports for Metadata Manager and SAP

SAP Version: 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and higher
Data and Cofile Names: XCONNECT_DESIGN_R900116.R46, XCONNECT_DESIGN_K900116.R46
Transport Request: R46K900084
Functionality: For mySAP R/3 ECC (R/3) and mySAP add-on components, including CRM, BW, and APO: supports table metadata extraction for SAP in Metadata Manager.

SAP Version: Unicode versions 4.7 and higher
Data and Cofile Names: XCONNECT_DESIGN_R900109.U47, XCONNECT_DESIGN_K900109.U47
Transport Request: U47K900109
Functionality: For mySAP R/3 ECC (R/3) and mySAP add-on components, including CRM, BW, and APO: supports table metadata extraction for SAP in Metadata Manager.

You must install the other mySAP transports before you install the transports for Metadata Manager.

Configure User Authorization Profile

The SAP administrator needs to create the product and development user authorization profile. Table 2-3 describes the user authorization profile:

Table 2-3. SAP User Authorization Profile

Authorization Object: S_RFC
Description: Authorization check for RFC access.
Class: Cross Application Authorization Objects
Field Values: Activity: 16 (Execute); Name of RFC to be protected: *; Type of RFC object to be protected: FUGR

Last updated: 02-Jun-08 22:53


Custom XConnect Implementation

Challenge


Metadata Manager uses XConnects to extract source repository metadata and load it into the Metadata Manager Warehouse. The Metadata Manager Configuration Console is used to run each XConnect. A custom XConnect is needed to load metadata from a source repository for which Metadata Manager does not prepackage an out-of-the-box XConnect.

Description
This document organizes all steps into phases, where each phase and step must be performed in the order presented. To integrate custom metadata, complete tasks for the following phases:
- Design the metamodel.
- Implement the metamodel design.
- Set up and run the custom XConnect.
- Configure the reports and schema.

Prerequisites for Integrating Custom Metadata


To integrate custom metadata, install Metadata Manager and the other required applications. The custom metadata integration process assumes knowledge of the following topics:
- Common Warehouse Metamodel (CWM) and Informatica-defined metamodels. The CWM metamodel includes industry-standard packages, classes, and class associations. The Informatica metamodel components supplement the CWM metamodel by providing repository-specific packages, classes, and class associations. For more information about the Informatica-defined metamodel components, run and review the metamodel reports.
- PowerCenter functionality. During the metadata integration process, XConnects are configured and run. The XConnects run PowerCenter workflows that extract custom metadata and load it into the Metadata Manager Warehouse.
- Data Analyzer functionality. Metadata Manager embeds Data Analyzer functionality to create, run, and maintain a metadata reporting environment. Knowledge of creating, modifying, and deleting reports, dashboards, and analytic workflows in Data Analyzer is required. Knowledge of creating, modifying, and deleting table definitions, metrics, and attributes is required to update the schema with new or changed objects.

Design the Metamodel


During this planning phase, the metamodel is designed; it will be implemented in the next phase. A metamodel is the logical structure that classifies the metadata from a particular repository type. Metamodels consist of classes, class associations, and packages, which group related classes and class associations. An XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations. This task consists of the following steps:

1. Identify Custom Classes. To identify custom classes, determine the various types of metadata in the source repository that need to be loaded into the Metadata Manager Warehouse. Each type of metadata corresponds to one class.
2. Identify Custom Class Properties. After identifying the custom classes, each custom class must be populated with properties (i.e., attributes) in order for Metadata Manager to track and report values belonging to class instances.
3. Map Custom Classes to CWM Classes. Metadata Manager prepackages all CWM classes, class properties, and class associations. To quickly develop a custom metamodel and reduce redundancy, reuse the predefined class properties and associations instead of recreating them. To determine which custom classes can inherit properties from CWM classes, map custom classes to the packaged CWM classes. Define in Metadata Manager any properties that cannot be inherited.
4. Determine the Metadata Tree Structure. Configure the way the metadata tree displays objects. Determine the groups of metadata objects in the metadata tree, then determine the hierarchy of the objects in the tree. Assign the TreeElement class as a base class to each custom class.
5. Identify Custom Class Associations. The metadata browser uses class associations to display metadata. For each identified class association, determine whether you can reuse a predefined association from a CWM base class or whether you need to manually define an association in Metadata Manager.
6. Identify Custom Packages. A package contains related classes and class associations. Multiple packages can be assigned to a repository type to define the structure of the metadata contained in the source repositories of the given repository type. Create packages to group related classes and class associations.

To see an example of sample metamodel design specifications, see Appendix A in the Metadata Manager Custom Metadata Integration Guide.

Implement the Metamodel Design


Using the metamodel design specifications from the previous task, implement the metamodel in Metadata Manager. This task includes the following steps:

1. Create the originator (aka owner) of the metamodel. When creating a new metamodel, specify the originator of each metamodel. An originator is the organization that creates and owns the metamodel. When defining a new custom originator in Metadata Manager, select Customer as the originator type.
   - Go to the Administration tab.
   - Click Originators under Metamodel Management.
   - Click Add to add a new originator.
   - Fill out the requested information (Note: Domain Name, Name, and Type are mandatory fields).
   - Click OK when you are finished.
2. Create the packages that contain the classes and associations of the subject metamodel. Define the packages to which custom classes and associations are assigned. Packages contain classes and their class associations. Packages have a hierarchical structure, where one package can be the parent of another package; parent packages are generally used to group child packages together.
   - Go to the Administration tab.
   - Click Packages under Metamodel Management.
   - Click Add to add a new package.
   - Fill out the requested information (Note: Name and Originator are mandatory fields). Choose the originator created above.
   - Click OK when you are finished.

3. Create Custom Classes. In this step, create the custom classes identified in the metamodel design task.
   - Go to the Administration tab.
   - Click Classes under Metamodel Management.
   - From the drop-down menu, select the package that you created in the step above.
   - Click Add to create a new class.
   - Fill out the requested information (Note: Name, Package, and Class Label are mandatory fields).
   - Base Classes: in order to see the metadata in the Metadata Manager metadata browser, you need to add at least the base class TreeElement. To do this:
     a. Click Add under Base Classes.
     b. Select the package.
     c. Under Classes, select TreeElement.
     d. Click OK (you should now see the class properties in the properties section).
   - To add custom properties to your class, click Add. Fill out the property information (Name, Data Type, and Display Label are mandatory fields). Click OK when you are done.
   - Click OK at the top of the page to create the class.
   Repeat the above steps for additional classes.
4. Create Custom Class Associations. In this step, implement the custom class associations identified in the metamodel design phase. In the previous step, CWM classes were added as base classes, and any of the class associations from the CWM base classes can be reused. Define those custom class associations that cannot be reused. If you only need the ElementOwnership association, skip this step.
   - Go to the Administration tab.
   - Click Associations under Metamodel Management.
   - Click Add to add a new association.
   - Fill out the requested information (all bold fields are required).
   - Click OK when you are finished.

5. Create the Repository Type. Each type of repository contains unique metadata. For example, a PowerCenter data integration repository type contains workflows and mappings, but a Data Analyzer business intelligence repository type does not. Repository types maintain the uniqueness of each repository.
   - Go to the Administration tab.
   - Click Repository Types under Metamodel Management.
   - Click Add to add a new repository type.
   - Fill out the requested information (Note: Name and Product Type are mandatory fields).
   - Click OK when you are finished.
6. Configure a Repository Type Root Class. Root classes display under the source repository in the metadata tree. All other objects appear under the root class. To configure a repository root class:
   - Go to the Administration tab.
   - Click Custom Repository Type Root Classes under Metamodel Management.
   - Select the custom repository type.
   - Optionally, select a package to limit the number of classes that display.
   - Select the Root Class option for all applicable classes.
   - Click Apply to apply the changes.

Set Up and Run the XConnect


The objective of this task is to set up and run the custom XConnect. Custom XConnects involve a set of mappings that transform source metadata into the required format specified in the Informatica Metadata Extraction (IME) files. The custom XConnect extracts the metadata from the IME files and loads it into the Metadata Manager Warehouse. This task includes the following steps:

1. Determine which Metadata Manager Warehouse tables to load. Although you do not have to load all Metadata Manager Warehouse tables, you must load the following tables:
   - IMW_ELEMENT: the IME_ELEMENT interface file loads the element names from the source repository into the IMW_ELEMENT table. Note that element is used generically to mean packages, classes, or properties.
   - IMW_ELMNT_ATTR: the IME_ELMNT_ATTR interface file loads the attributes belonging to elements from the source repository into the IMW_ELMNT_ATTR table.
   - IMW_ELMNT_ASSOC: the IME_ELMNT_ASSOC interface file loads the associations between elements of a source repository into the IMW_ELMNT_ASSOC table.
   To stop the metadata load into particular Metadata Manager Warehouse tables, disable the worklets that load those tables.
2. Reformat the source metadata. In this step, reformat the source metadata so that it conforms to the format specified in each required IME interface file. (The IME files are packaged with the Metadata Manager documentation.) Present the reformatted metadata in a valid source type format. To extract the reformatted metadata, the integration workflows require that it be in one or more of the following source type formats: database table, database view, or flat file. Note that you can load metadata into a Metadata Manager Warehouse table using more than one of the accepted source type formats.
3. Register the Source Repository Instance in Metadata Manager. Before the custom XConnect can extract metadata, the source repository must be registered in Metadata Manager. When registering the source repository, the Metadata Manager application assigns a unique repository ID that identifies the source repository. Once registered, Metadata Manager adds an XConnect in the Configuration Console for the source repository. To register the source repository, go to the Metadata Manager web interface and register the repository under the custom repository type created above. All packages, classes, and class associations defined for the custom repository type apply to all repository instances registered to the repository type. When defining the repository, provide descriptive information about the repository instance. Create the repository that will hold the metadata extracted from the source system:
   - Go to the Administration tab.
   - Click Repositories under Repository Management.
   - Click Add to add a new repository.
   - Fill out the requested information (Note: Name and Repository Type are mandatory fields). Choose the repository type created above.
   - Click OK when finished.

4. Configure the Custom Parameter Files. Custom XConnects require that the parameter file be updated by specifying the following information:
   - The source type (database table, database view, or flat file).
   - The names of the database views or tables used to load the Metadata Manager Warehouse, if applicable.
   - The list of all flat files used to load a particular Metadata Manager Warehouse table, if applicable.
   - The worklets you want to enable and disable.

Understanding Metadata Manager Workflows for Custom Metadata


- wf_Load_IME: a custom workflow, created by a developer, that extracts and transforms metadata from the source repository into IME format.
- Metadata Manager prepackages the following integration workflows for custom metadata. These workflows read the IME files mentioned above and load them into the Metadata Manager Warehouse:
  - WF_STATUS: extracts and transforms statuses from any source repository and loads them into the Metadata Manager Warehouse. To resolve status IDs correctly, the workflow is configured to run before the WF_CUSTOM workflow.
  - WF_CUSTOM: extracts and transforms custom metadata from IME files and loads that metadata into the Metadata Manager Warehouse.

5. Configure the Custom XConnect. The XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations specified in the custom metamodel. When the custom repository type is defined, Metadata Manager registers the corresponding XConnect in the Configuration Console. The following information in the Configuration Console configures the XConnect:
   - Under the Administration tab, select Custom Workflow Configuration and choose the repository type to which the custom repository belongs.
   - Workflows to load the metadata:
     - Custom XConnect: the wf_Load_IME workflow.
     - Metadata Manager: the WF_CUSTOM workflow (prepackages all worklets and sessions required to populate all Metadata Manager Warehouse tables, except the IMW_STATUS table).
     - Metadata Manager: the WF_STATUS workflow (populates the IMW_STATUS table).
     Note: the Metadata Manager Server does not load Metadata Manager Warehouse tables that have disabled worklets.
   - Under the Administration tab, select Custom Workflow Configuration and choose the parameter file used by the workflows to load the metadata (the parameter file name is assigned at first data load). This parameter file name has the form nnnnn.par, where nnnnn is a five-digit integer assigned at the time of the first load of this source repository. The script promoting Metadata Manager from the development environment to test and from the test environment to production preserves this file name.

6. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata load, reset the $$SRC_INCR_DATE parameter to extract metadata at shorter intervals, such as every few days. The value depends on how often the Metadata Manager Warehouse needs to be updated. If the source does not provide the date when the records were last updated, records are extracted regardless of the $$SRC_INCR_DATE parameter setting.
7. Run the Custom XConnect. Using the Configuration Console, Metadata Manager administrators can run the custom XConnect and ensure that the metadata loads correctly.

Note: When loading metadata with Effective From and Effective To Dates, Metadata Manager does not validate whether the Effective From Date is less than the Effective To Date. Ensure that each Effective To Date is greater than the Effective From Date. If you do not supply Effective From and Effective To Dates, Metadata Manager sets the Effective From Date to 1/1/1899 and the Effective To Date to 1/1/3714.

To Run a Custom XConnect


- Log in to the Configuration Console.
- Click Source Repository Management.
- Click Load next to the custom XConnect you want to run.

Configure the Reports and Schema


The objective of this task is to set up the reporting environment, which is needed to run reports on the metadata stored in the Metadata Manager Warehouse. The setup of the reporting environment depends on the reporting requirements. The following options are available for creating reports:
- Use the existing schema and reports. Metadata Manager contains prepackaged reports that can be used to analyze business intelligence metadata, data integration metadata, data modeling tool metadata, and database catalog metadata. Metadata Manager also provides impact analysis and lineage reports that provide information on any type of metadata.
- Create new reports using the existing schema. Build new reports using the existing Metadata Manager metrics and attributes.
- Create new Metadata Manager Warehouse tables and views to support the schema and reports. If the prepackaged Metadata Manager schema does not meet the reporting requirements, create new Metadata Manager Warehouse tables and views. Prefix the name of custom-built tables with Z_IMW_ and custom-built views with Z_IMA_. If you build new Metadata Manager Warehouse tables or views, register the tables in the Metadata Manager schema and create new metrics/attributes in the Metadata Manager schema. Note that the Metadata Manager schema is built on the Metadata Manager views.

After the environment setup is complete, test all schema objects, such as dashboards, analytic workflows, reports, metrics, attributes, and alerts.

Last updated: 05-Jun-08 14:15


Customizing the Metadata Manager Interface

Challenge


Customizing the Metadata Manager Presentation layer to meet specific business needs.

Description
There are several areas in which the Metadata Manager Application interface can be customized to meet specific business needs. Customizations can be done by configuring security as well as the Metadata Manager Application interface. The first step to customization is configuring security according to business needs. By configuring security, only certain users will be able to access, search, and customize specific areas of Metadata Manager. Use the PowerCenter Administration Console to first create different roles, groups, and users. After users have been created, use the Security page in the Metadata Manager Application to manage permissions. The sections below cover some of the areas to configure when customizing Metadata Manager to meet specific business needs.

Metadata Manager Interface


The Metadata Manager Application interface consists of the following pages:

- Browse: browse and search the metadata catalog, create and view shortcuts and shared folders, view information about metadata objects, run data lineage and where-used analysis, and add information about metadata objects.
- Model: create and edit custom models, add custom attributes to packaged and custom models, and import and export custom models.
- Load: create and load resources to load metadata into the Metadata Manager warehouse. Use the Load page to monitor and schedule resource loads, purge metadata from the Metadata Manager warehouse, and manage the search index.
- Security: manage permissions on resources and metadata objects in the Metadata Manager warehouse.


The Metadata Manager Custom Metadata Integration Guide provides methodology and procedures for integrating custom metadata into the Metadata Manager warehouse. The Metadata Manager Custom Metadata Integration Guide is written for system administrators who want to load metadata from a repository type for which Metadata Manager does not package a model. This guide assumes that system administrators have knowledge of relational database concepts, models, and PowerCenter. Metadata Manager uses models to define the metadata it extracts from metadata sources. The following types of custom metadata can be added into the Metadata Manager warehouse:
- Metadata for a custom metadata source. Load or add metadata from a source for which Metadata Manager does not package a resource type (e.g., from a Microsoft Access database). Metadata Manager does not package a resource type for Microsoft Access; a custom model can be created for the source metadata and then loaded into the Metadata Manager warehouse. This is also called creating a custom resource.
- Attributes. Add custom attributes to the existing metadata in the Metadata Manager warehouse. For example, to add an additional attribute to a report for Cognos ReportNet, the Cognos ReportNet model can be edited in Metadata Manager. Add the attribute and then add the metadata for the attribute on the Browse page.
- Relationships. Add relationships from custom metadata classes to model classes for which Metadata Manager packages a resource type. For example, a column in a custom metadata source is also used in an Oracle table. A class-level relationship can be created between the custom source column and the Oracle table column, and then the object-level relationship is created on the Browse page. The relationship can be created to run data lineage and where-used analysis on the custom metadata.

The Model page in Metadata Manager is where models for Metadata Manager are created or edited. After you create or edit the model for the custom metadata, you add the metadata to the Metadata Manager warehouse. You can add the metadata using the metadata catalog. You can also create a custom resource, create a template and generate PowerCenter workflows using the Custom Metadata Configurator, and load the metadata on the Load page in Metadata Manager. After you add the custom metadata into the Metadata Manager warehouse, use the Metadata Manager or the packaged Metadata Manager reports to analyze the metadata. You can create new reports to analyze additional information. You can also export and import the models, or export and import the metadata that you added to the metadata catalog.

Adding and Loading Metadata for a Custom Metadata Source


INFORMATICA CONFIDENTIAL BEST PRACTICES 633 of 954

When you add metadata for a custom metadata source, you define a model for the source metadata that describes the type of metadata Metadata Manager extracts. You create the model and add classes, attributes, and relationships. After you define the model, you can add the metadata to the metadata catalog using the Browse page, or you can use the Custom Metadata Configurator to create a template and then load the metadata into the Metadata Manager warehouse from metadata source files. When you create a template, you use the Custom Metadata Configurator to create the template and the PowerCenter objects, including the mappings, sessions, and workflows, that Metadata Manager uses to extract metadata from metadata source files. You can export the metadata from the metadata source to a metadata source file, create a custom resource in Metadata Manager, and load the metadata from the metadata source files.

Adding Custom Metadata


To add metadata for a custom metadata source, complete the following steps:

1. Create the model. Create the model to represent the metadata in the metadata source using the Model page in Metadata Manager.
2. Add classes, attributes, and relationships. Add custom classes, attributes, and relationships to the model using the Model page.
3. Add the metadata to the Metadata Manager warehouse. Create a resource in the metadata catalog that represents the source metadata using the Browse page. Add custom metadata objects based on the classes you create.

Metadata Manager Reporting


You can access Metadata Manager Reporting from Metadata Manager to run reports. To access Data Analyzer from Metadata Manager, complete the following steps:

1. Create a Reporting Service. Create a Reporting Service in the PowerCenter Administration Console and use the Metadata Manager repository as the data source.
2. Launch Metadata Manager Reporting. On the Metadata Manager Browse page, click Reports in the toolbar. If you have the required privileges on the Reporting Service, Metadata Manager logs you into the Data Analyzer instance being used for Metadata Manager. You can then run the Metadata Manager reports.

Metadata Manager includes the following types of reports:

- Primary reports. The top-level report in an analytic workflow. To access all lower-level reports in the analytic workflow, first run this report on the Analyze tab.
- Standalone reports. Unlike analytic workflow reports, you run these reports independently of other reports.
- Workflow reports. The lower-level reports in an analytic workflow. To access a workflow report, first run the associated primary report and all workflow reports that precede the given workflow report.

You can use these reports to perform several types of analysis on metadata stored in the Metadata Manager warehouse. Metadata Manager prepackages reports for business intelligence, data modeling, data integration, database management, and the metamodel.

Customizing Metadata Manager Reporting


You can create new reporting elements and attributes under Schema Design. These elements can be used in new reports or in extensions of existing reports. You can also extend or customize out-of-the-box reports, indicators, or dashboards; Informatica recommends using the Save As new report option for such changes in order to avoid any conflicts during upgrades. The Metadata Manager Reports Reference provides a guide to the reports and attributes being used. Further, you can use Data Analyzer's 1-2-3-4 report creation wizard to create new reports. Informatica recommends saving such reports in a new report folder to avoid conflicts during upgrades.

Customizing Metadata Manager ODS Reports


Use the operational data store (ODS) report templates to analyze metadata stored in a particular repository. Although these reports can be used as is, they can also be customized to suit particular business requirements. Out-of-the-box reports can be used as a guideline for creating reports for other types of source repositories, such as a repository for which Metadata Manager does not prepackage a standard resource.

Last updated: 02-Jun-08 23:22


Estimating Metadata Manager Volume Requirements

Challenge


Understanding the relationship between various inputs for the Metadata Manager solution in order to estimate data volumes for the Metadata Manager Warehouse.

Description
The size of the Metadata Manager warehouse is directly proportional to the size of metadata being loaded into it. The size is dependent on the number of element attributes being captured in source metadata and the associations defined in the metamodel. When estimating volume requirements for a Metadata Manager implementation, consider the following Metadata Manager components:
- Metadata Manager Service: manages the source repository metadata stored in the Metadata Manager Warehouse. You can use Metadata Manager, which uses the Metadata Manager Service, to search, view, and configure source repository metadata and run reports.
- Metadata Manager Integration Repository: this PowerCenter repository stores the workflows, which are resource components that extract source metadata and load it into the Metadata Manager Warehouse.
- Metadata Manager Warehouse: stores the Metadata Manager metadata, as well as source repository metadata and metamodels.

Considerations
Volume estimation for Metadata Manager is an iterative process. Use Metadata Manager in the development environment to get accurate size estimates for Metadata Manager in the production environment. The required steps are as follows:

1. Identify the source metadata that needs to be loaded in the Metadata Manager production warehouse.
2. Size the Metadata Manager development warehouse based on the initial sizing estimates (as explained under the Sizing Estimate Example section of this document).
3. Run the resource loads and monitor the disk usage. The development metadata loaded during the initial run of the resources should be used as a baseline for further sizing estimates.
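The extrapolation itself is simple arithmetic, as the sketch below illustrates. It is a hedged example only: the per-MB ratios and fixed overhead follow the sizing table in the Sizing Estimate Example section, and should be replaced with the baselines measured in your own development loads.

# Ratios (warehouse MB per MB of input metadata) taken from the sizing table
# in the Sizing Estimate Example section; substitute your measured baselines.
MB_PER_INPUT_MB = {"PowerCenter": 10.0, "Data Analyzer": 4.0,
                   "Database": 5.0, "Other Resource": 4.5}
FIXED_OVERHEAD_MB = 50.0  # metamodel and other tables

def estimate_warehouse_mb(input_sizes_mb):
    total = FIXED_OVERHEAD_MB
    for resource_type, size_mb in input_sizes_mb.items():
        total += size_mb * MB_PER_INPUT_MB[resource_type]
    return total

# Example: 200MB of PowerCenter metadata plus 40MB of database catalog metadata
print(estimate_warehouse_mb({"PowerCenter": 200, "Database": 40}))  # 2250.0 MB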


Sizing Estimate Example


The following table is an initial estimation matrix that should be helpful in deriving a reasonable initial estimate. For increased input sizes, expect the Metadata Manager Warehouse target size to increase in direct proportion.

Resource                     Input Size   Expected Metadata Manager Warehouse Target Size
Metamodel and other tables   -            50MB
PowerCenter                  1MB          10MB
Data Analyzer                1MB          4MB
Database                     1MB          5MB
Other Resource               1MB          4.5MB

Last updated: 02-Jun-08 23:31


Metadata Manager Business Glossary

Challenge


A group of people working towards a common goal needs shared definitions for the information they are dealing with. Implementing a Business Glossary with Metadata Manager provides a vocabulary that facilitates better understanding between business and IT.

Description
Data values and data object names, such as the names of entities and attributes, can be interpreted differently by various groups. In many organizations, business analysts create spreadsheets or Word documents to manage business terms. Without a common repository to store these business terms, it is often a challenge to communicate them to other groups.

By creating a Business Glossary, Metadata Manager can be leveraged to associate business terms with IT assets. This is achieved by creating and configuring a custom model in Metadata Manager. The custom metadata model is used to build the Business Glossary as a searchable online catalog; the Business Glossary can also be printed and published as a report. In order to capture the association between the business terms and actual IT implementations, it is necessary to utilize the classes from the Business Nomenclature package of the Common Warehouse Metamodel (CWM). Below are the major elements of the Business Nomenclature package (Poole et al., 2002b):

For an example implementation of a Business Glossary, assume that business terms are currently maintained in spreadsheets. By building a metadata adapter, the business terms can be loaded from the spreadsheets into Metadata Manager. Predefined PowerCenter workflows that come with Metadata Manager can use the spreadsheets as a source and load them into the Metadata Manager warehouse. The main steps involved in implementing a Business Glossary with Metadata Manager are:

1. Create the model.
2. Configure classes.
3. Configure the class attributes.
4. Configure the class relationships.
5. Load the custom resource using the Custom Metadata Configurator.
6. Add the custom resource to the Browse page.

Before creating the model, it is necessary to identify the model name/description, parent classes and subclasses, attributes, and relationships. Subsequently, a new model for the Business Glossary should be created and classes for this model should be configured. The following table is a summary of the repository and class model definition:

Class Name: Category
Description: Represents a category in the glossary
Base Classes: CWM:Business Nomenclature:Glossary, IMM Tree Element
Attributes: Name

Class Name: Term
Description: Represents a term in the business glossary
Base Classes: CWM:Business Nomenclature:Term
Attributes: Name, subtype, version, supplier

An association called ConceptToImplementation will be added to link the Term class to physical implementations such as database tables. Once the class is opened on the Model page from the Model navigator, attributes and relationships can be configured.

The next step is to load the spreadsheets into Metadata Manager. Element metadata and association metadata files should be created in order to map attributes and associations to classes. The IME_ELEMENT interface should be used to load every entity identified as an element in the source repository. Any named entity in the source system can be classified as an element. IME_ELEMENT has some predefined attributes, such as Description, which can be mapped to this interface. The value of the element class identifier, CLASS_ID, must exist in the IMW_CLASS table; if not, the ETL process rejects the record. The combination of REPOSITORY_ID, CLASS_ID, and ELEMENT_ID must be unique for each element. The following is an example element metadata file:

BUS_GLOSSARY

Businessglossary.category

CRM

CRM related terms

BUS_GLOSSARY

Businessglossary.term

Acquisition Cost

Total dollar amount spent to acquire a new customer

BUS_GLOSSARY

Businessglossary.term

Advertisement

A paid, public, promotional message for an identified sponsor promoting a companys products and services

BUS_GLOSSARY

Businessglossary.term

Campaign Management

A marketing process to effectively promote the products and services of a given organization to prospects

The ime_elmnt_assoc interface should be used to load associations between two elements in a source repository. An association is the way in which two elements are related. Examples of associations in a database environment include associations between an index and the table on which it is defined, a table and its corresponding columns, and a column and its constraints. There are two elements in each association: the From element is the element from which the association is defined, and the To element is the element to which the association is defined. While mapping associations to this interface, ensure that one of the following is true:

- The values in the FROM_CLASS_ID and TO_CLASS_ID are the same as the From and To classes that are defined in the association in the IMW_ASSOCIATION table.
- The classes specified in the FROM_CLASS_ID and TO_CLASS_ID of the interface have base classes that are the From and To classes defined in the association in the IMW_ASSOCIATION table.

In other words, you cannot load an association for an object if the association is not defined for its class or the base class of its class. This interface loads the IMW_ELMNT_ASSOC table. The Metadata Manager server uses this table to show metadata in the browser, and the ETL process uses this table to load other Metadata Manager warehouse tables. The following is an outline of the necessary fields for the association metadata file:

LABEL | COLUMN NAME | DESCRIPTION
Repository | REPOSITORY_ID | Uniquely identifies the repository which stores the association between two elements.
From Element Class | FROM_CLASS_ID | Uniquely identifies the class of the From element in the association being loaded.
From Element | FROM_ELEMENT_ID | Uniquely identifies the From element in the association being loaded.
From Element Repository | FROM_REPO_ID | Uniquely identifies the repository to which the From element belongs.
To Element Class | TO_CLASS_ID | Uniquely identifies the class of the To element in the association being loaded.
To Element | TO_ELEMENT_ID | Uniquely identifies the To element of the association being loaded.
To Element Repository | TO_REPO_ID | Uniquely identifies the repository to which the To element belongs.
Association | ASSOCIATION_ID | Uniquely identifies the association between the From element and To element in the source repository.
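As an illustration of how these fields fit together, the sketch below (Python, using the column names from the outline above) composes a single association record that links a glossary Term to a physical table through the ConceptToImplementation association. The To-side class and element identifiers are hypothetical; confirm the exact column order and whether a header row is expected against the IME interface documentation.

import csv

FIELDS = ["REPOSITORY_ID", "FROM_CLASS_ID", "FROM_ELEMENT_ID", "FROM_REPO_ID",
          "TO_CLASS_ID", "TO_ELEMENT_ID", "TO_REPO_ID", "ASSOCIATION_ID"]

# Hypothetical association: the term "Acquisition Cost" implemented by a database table.
record = {
    "REPOSITORY_ID": "BUS_GLOSSARY",
    "FROM_CLASS_ID": "Businessglossary.term",
    "FROM_ELEMENT_ID": "Acquisition Cost",
    "FROM_REPO_ID": "BUS_GLOSSARY",
    "TO_CLASS_ID": "Relational.Table",            # illustrative class name
    "TO_ELEMENT_ID": "CRM_DB.ACQUISITION_COST",   # illustrative element name
    "TO_REPO_ID": "CRM_ORACLE",
    "ASSOCIATION_ID": "ConceptToImplementation",
}

with open("bus_glossary_assocs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(record)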

The last step before generating the PowerCenter mappings and workflows to load the spreadsheets into the Metadata Manager warehouse is configuring a template to store information about how to map the metadata object attributes to the class attributes. In order to load the source spreadsheets, it is necessary to generate comma-delimited files. This can be done by saving the spreadsheets in .csv format. The names of the .csv files must be entered in the indirect (file list) files that PowerCenter uses as sources.
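Where several spreadsheets are exported, the indirect file itself can be generated rather than maintained by hand. A minimal sketch, assuming the PowerCenter convention that a file list contains one data file path per line; the directory and file names are hypothetical.

from pathlib import Path

src_dir = Path("/data/mm_glossary/srcfiles")   # hypothetical source file directory

# List every exported .csv and write one path per line to the indirect (file list) file.
with open(src_dir / "glossary_filelist.txt", "w") as f:
    for path in sorted(src_dir.glob("*.csv")):
        f.write(str(path) + "\n")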

Once the workflows complete and the spreadsheets are loaded into the Metadata Manager warehouse, the Business Glossary information can be searched and browsed like any other metadata resource. Last but not least, a Business Glossary implementation with Metadata Manager is intended not only to capture the business terms, but also to provide a way to relate technical and business concepts. This allows Business Users, Data Stewards, Business Analysts, and Data Analysts to describe key concepts of their business and communicate them to other groups.

Last updated: 27-May-08 12:20


Metadata Manager Load Validation

Challenge


Just as it is essential to know that all data for the current load cycle has loaded correctly, it is important to ensure that all metadata extractions (Metadata Resources) loaded correctly into the Metadata Manager warehouse. If metadata extractions do not execute successfully, the Metadata Manager warehouse will not be current with the most up-to-date metadata.

Description
The process for validating Metadata Manager metadata loads is very simple using the Metadata Manager Application interface. In the Metadata Manager Application interface, you can view the run history for each of the resources. For load validation, use the Load page in the Metadata Manager Application interface, the PowerCenter Workflow Monitor, and the PowerCenter Administration Console logs. The Workflow Monitor in PowerCenter will also have a workflow and session log for the resource load. Resources can fail for a variety of reasons common in IT, such as unavailability of the database, network failure, improper configuration, etc. More detailed error messages can be found in the activity log or in the workflow log files. The following installation directories will also have additional log files that are used for the resource load process:

\server\tomcat\mm_files\MM_PC851\mm_load
\server\tomcat\mm_files\MM_PC851\mm_index
\server\tomcat\logs

Loading and Monitoring Resources Overview


After you configure the metadata source and create a resource, you can load the resource. When you load a resource, Metadata Manager uses the connection information for the resource to extract the metadata from the metadata source. Metadata Manager converts the extracted metadata into IME files and runs PowerCenter workflows to load the metadata from the IME files into the Metadata Manager warehouse. You can use the Load page to perform the following resource tasks:

- Load a resource. Load the source metadata for a resource into the Metadata Manager warehouse. Metadata Manager extracts metadata and profiling information, and indexes the resource.
- Monitor a resource. Use the Metadata Manager activity log, resource output log, and PowerCenter Workflow Monitor to monitor and troubleshoot resource loads.
- Schedule a resource. Create a schedule to select the time and frequency that Metadata Manager loads a resource. You can attach the schedule to a resource.

Loading Resources
You can load a resource for Metadata Manager immediately in the Load page. Metadata Manager loads the resource and displays the results of the resource load in the Resource List. When Metadata Manager loads a resource, it completes the following tasks:

- Loads metadata. Loads the metadata for the resource into the Metadata Manager warehouse.
- Extracts profiling information. Extracts profiling information from the source database. If you load a relational database resource, you can extract profiling information from tables and columns in the database.
- Indexes the resource. Creates or updates the index files for the resource.

You can start the load process from the Resource List section of the Load page.


To load a resource:
1. On the Load page, select the resource you want to load in the Resource List.
2. Click Load. Metadata Manager adds the resource to the load queue and starts the load process. If Metadata Manager finds an unassigned connection to another metadata source, Metadata Manager pauses the load. You must configure the connection assignment to proceed. Configure the connection assignments for the resource in the Resource Properties section and click Resume. For more information about configuring connection assignments, see Configuring Connection Assignments.
3. To cancel the load, click Cancel.

When the resource load completes, Metadata Manager updates the Last Status Date and Last Status for the resource. You can use the activity log and the output log to view more information about the resource load.

Resuming a Failed Resource Load


If a resource load fails when PowerCenter runs the workflows that load the metadata into the warehouse, you can resume the resource load. You can use the output log in Metadata Manager and the workflow and session logs in the PowerCenter Workflow Manager to troubleshoot the error and resume the resource load.

To resume a failed load:
1. On the Load page, select the resource in the Resource List for which you want to resume the resource load.
2. Click Resume. Metadata Manager continues loading the resource from the previous point of failure and completes any profiling or indexing operations.

Load Queue
When you load a resource, Metadata Manager places the resource in a load queue. The load queue controls the order in which Metadata Manager loads resources. Metadata Manager places resources in the load queue when you start the resource load from the Load page or when a scheduled resource load begins. If a resource load fails, Metadata Manager keeps the resource in the load queue until the timeout interval for the resource load is exceeded. When the timeout interval is exceeded, Metadata Manager removes the resource from the load queue and begins loading the next resource in the queue. You can configure the number of resources that Metadata Manager loads simultaneously and the timeout interval for resource loads when you configure the Metadata Manager Service in the PowerCenter Administration Console.

Loading Metadata Sources in Order


To view data lineage or where-used analysis across source repositories and databases, you configure the connection assignments for the resource, load the metadata for the database or other source repository, and then load the resource that contains the connections. For example, if you want to run data lineage analysis between a PowerCenter repository and an Oracle database, you must load the Oracle database, configure the connection assignments for the PowerCenter resource, and then load the PowerCenter resource.
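For environments with many interdependent resources, the same rule can be expressed as a simple dependency list and sorted so that connection targets always load first. This is only an illustrative sketch (the resource names and dependencies are hypothetical); in practice the ordering is normally enforced through Metadata Manager schedules.

from graphlib import TopologicalSorter  # Python 3.9+

# Each resource maps to the set of resources it depends on (its connection targets).
dependencies = {
    "Oracle_CRM": set(),                 # database resource, no prerequisites
    "PowerCenter_Repo": {"Oracle_CRM"},  # PowerCenter resource references the Oracle connection
    "BO_Reports": {"PowerCenter_Repo"},  # BI resource depends on PowerCenter lineage
}

# Produces an order in which every resource loads after its prerequisites.
load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order)  # ['Oracle_CRM', 'PowerCenter_Repo', 'BO_Reports']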

Monitoring Resources
You can monitor resource load runs to determine if they are successful. If a resource load fails, troubleshoot the failure and load the resource again. You can use the following logs in Metadata Manager to view information about resource loads and troubleshoot errors:

- Activity log. Contains the status of resource load operations for all resources.
- Output log. Contains detailed information about each resource load operation.

You can also use the PowerCenter Workflow Monitor to view the workflows as they load the metadata. Use session and workflow logs to troubleshoot errors. If you load multiple resources of the same resource type concurrently, the Integration Service runs multiple instances of the workflow that corresponds to the resource type. Each workflow instance includes separate workflow and session logs. You can also use mmcmd and mmwfdrundetails to get more information about the status of a resource load and about the PowerCenter workflows and sessions that load metadata.

Note: Profiling may show as successful although some of the PowerCenter sessions that load profiling information fail. Sessions can fail because of run-time resource constraints. If one or more sessions fail but the other profiling sessions complete successfully, profiling displays as successful on the Load page.

Activity Log
The activity log contains details on each resource load. Use the activity log to get more details on a specific resource load. The following table describes the contents of the activity log:

Property | Description
Resource | Name of the resource.
Task Type | Type of task performed by Metadata Manager: Metadata Load (loads metadata into the Metadata Manager warehouse), Profiling (extracts profiling information from the source database), or Indexing (creates or updates index files for the resource).
User | Metadata Manager user that started the resource load.
Start Date | The date and time the resource load started.
End Date | The date and time the resource load completed.
Status | The status of the metadata load, profiling, and indexing operations.

To view the contents of the activity log:
1. On the Load page, click Activity Log. The Activity Log window appears.
2. To filter the contents of the Activity Log window, select the time frame in the Show logs list.
3. To sort by column, click the column name.
4. To refresh the log to see recent changes, click Refresh.

Output Log
The output log displays the results of the most recent resource load for a resource. Use the output log for detailed information about the operations performed by Metadata Manager when it loads the resource. The following example shows an excerpt from an output log:

MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Starting metadata load...
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Resource Name: PowerCenter_85
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Resource Type: PowerCenter
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Resource Group: Data Integration
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Metadata load started...
MetadataLoad [Sun Sep 16 09:20:55 PDT 2007] : Task started: ETLTaskHandler.
MetadataLoad [Sun Sep 16 09:20:55 PDT 2007] : Opened connection to PowerCenter repository
MetadataLoad [Sun Sep 16 09:20:55 PDT 2007] : Connected to the Repository Service
MetadataLoad [Sun Sep 16 09:21:20 PDT 2007] : Started workflow WF_PC8X_STAGE
MetadataLoad [Sun Sep 16 09:21:20 PDT 2007] : Waiting for workflow to complete...
MetadataLoad [Sun Sep 16 09:24:06 PDT 2007] : Completed workflow WF_PC8X_STAGE
MetadataLoad [Sun Sep 16 09:24:09 PDT 2007] : Started workflow WF_PC
MetadataLoad [Sun Sep 16 09:24:09 PDT 2007] : Waiting for workflow to complete...
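Because each output log line follows the same pattern (process name, bracketed timestamp, message), the log can be scanned to report how long each PowerCenter workflow ran. A minimal sketch that assumes the format shown in the excerpt above; the log file name is hypothetical.

import re
from datetime import datetime

LINE = re.compile(r"MetadataLoad \[(.+?)\] : (.+)")

def parse_ts(text):
    # Drop the timezone abbreviation (e.g., PDT), which strptime cannot reliably parse.
    parts = text.split()
    return datetime.strptime(" ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y")

starts = {}
with open("mm_output.log") as f:            # hypothetical copy of the output log
    for raw in f:
        m = LINE.match(raw.strip())
        if not m:
            continue
        ts, msg = parse_ts(m.group(1)), m.group(2)
        if msg.startswith("Started workflow "):
            starts[msg.split()[-1]] = ts
        elif msg.startswith("Completed workflow ") and msg.split()[-1] in starts:
            wf = msg.split()[-1]
            print(wf, (ts - starts[wf]).total_seconds(), "seconds")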
Last updated: 02-Jun-08 23:54


Metadata Manager Migration Procedures

Challenge


This Best Practice describes the processes to follow when Metadata Manager is deployed in multiple environments and out-of-the-box Metadata Manager components are customized or configured, or new components are added to Metadata Manager. The components that may change include:

- Reports: This includes changes to the reporting schema and the out-of-the-box reports. It also includes any newly created reports or schema elements generated to cater to custom reporting needs and located at a specific implementation instance of the product.
- Metamodel: This includes additionally created metamodel components needed to help associate any custom metadata against repository types and domains that are not covered by the out-of-the-box Metadata Manager repository types.
- Metadata: This includes additionally created metadata objects, their properties, or associations against repository instances configured within Metadata Manager. These repository instances could either belong to the repository types supported out of the box by Metadata Manager or any new repository types configured through custom additions to the metamodels.
- Integration Repository: This includes changes to the out-of-the-box PowerCenter workflows or mappings. It also includes any new PowerCenter objects (mappings, transformations, etc.) or associated workflows.

Description
Report Changes
The following chart depicts the various scenarios related to the reporting area and the actions that need to be taken for the migration of the changed components. It is always advisable to create new schema elements (metrics, attributes, etc.) or new reports in a new Data Analyzer folder (the migration target) to facilitate exporting and importing the Data Analyzer objects across development, test, and production.

Nature of Report Change: Modify a schema component (metric, attribute, etc.)
- Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed components.
- Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Answer Yes to overriding the definitions that already exist for the changed schema components. Test and verify the changes within the Test environment.
- Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Answer Yes to overriding the definitions that already exist for the changed schema components. Test and verify the changes within the Production environment.

Nature of Report Change: Modify an existing report (add or delete metrics, attributes, or filters, change formatting, etc.)
- Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed report.
- Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Answer Yes to overriding the definitions that already exist for the changed report. Test and verify the changes within the Test environment.
- Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Answer Yes to overriding the definitions that already exist for the changed report. Test and verify the changes within the Production environment.

Nature of Report Change: Add a new schema component (metric, attribute, etc.)
- Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new schema components.
- Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Test environment.
- Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Production environment.

Nature of Report Change: Add a new report
- Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new report.
- Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Test environment.
- Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Production environment.

Metamodel Changes
The following chart depicts the various scenarios related to the metamodel area and the actions that need to be taken related to the migration of the changed components.

Nature of the Change: Add a new metamodel component
- Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new metamodel components (the export can be done at three levels: Originators, Repository Types, and Entry Points) using the Export Metamodel option.
- Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment using the Import Metamodel option. Test and verify the changes within the Test environment.
- Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment using the Import Metamodel option. Test and verify the changes within the Production environment.

Integration Repository Changes


The following chart depicts the various scenarios related to the integration repository area and the actions that need to be taken for the deployment of the changed components. It is always advisable to create new mappings, transformations, workflows, etc., in a new PowerCenter folder so that it becomes easy to export the ETL objects across development, test, and production.

Nature of the Change: Modify an existing mapping, transformation, and/or the associated workflows
- Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed objects.
- Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Answer Yes to overriding the definitions that already exist for the changed object. Test and verify the changes within the Test environment.
- Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Answer Yes to overriding the definitions that already exist for the changed object. Test and verify the changes within the Production environment.

Nature of the Change: Add a new ETL object (mapping, transformation, etc.)
- Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new objects.
- Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Test environment.
- Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Production environment.

Last updated: 03-Jun-08 00:05


Metadata Manager Repository Administration

Challenge


The task of administering the Metadata Manager Repository involves taking care of both the integration repository and the Metadata Manager warehouse. This requires knowledge of both PowerCenter administrative features (i.e., the integration repository used in Metadata Manager) and Metadata Manager administration features.

Description
A Metadata Manager administrator needs to be involved in the following areas to ensure that the Metadata Manager warehouse is fulfilling the end-user needs:

- Migration of Metadata Manager objects created in the Development environment to QA or the Production environment
- Creation and maintenance of access and privileges of Metadata Manager objects
- Repository backups
- Job monitoring
- Metamodel creation

Migration from Development to QA or Production


In cases where a client has modified out-of-the-box objects provided in Metadata Manager or created a custom metamodel for custom metadata, the objects must be tested in the Development environment prior to being migrated to the QA or Production environments. The Metadata Manager Administrator needs to do the following to ensure that the objects are in sync between the two environments:

- Install a new Metadata Manager instance for the QA/Production environment. This involves creating a new integration repository and Metadata Manager warehouse.
- Export the metamodel from the Development environment and import it to QA or Production via XML Import/Export functionality (in the Metadata Manager Administration tab) or via the Metadata Manager command line utility.
- Export the custom or modified reports created or configured in the Development environment and import them to QA or Production via XML Import/Export functionality. This functionality is identical to the function in Data Analyzer.

Providing Access and Privileges


Users can perform a variety of Metadata Manager tasks based on their privileges. The Metadata Manager Administrator can assign privileges to users by assigning them roles. Each role has a set of privileges that allow the associated users to perform specific tasks. The Administrator can also create groups of users so that all users in a particular group have the same functions. When an Administrator assigns a role to a group, all users of that group receive the privileges assigned to the role. The Metadata Manager Administrator can assign privileges that enable users to perform any of the following tasks in Metadata Manager:

- Configure reports. Users can view particular reports, create reports, and/or modify the reporting schema.
- Configure the Metadata Manager warehouse. Users can add, edit, and delete repository objects using Metadata Manager.
- Configure metamodels. Users can add, edit, and delete metamodels.

Metadata Manager also allows the Administrator to create access permissions on specific source repository objects for specific users. Users can be restricted to reading, writing, or deleting source repository objects that appear in Metadata Manager. Similarly, the Administrator can establish access permissions for source repository objects in the Metadata Manager warehouse. Access permissions determine the tasks that users can perform on specific objects. When the Administrator sets access permissions, he or she determines which users have access to the source repository objects that appear in Metadata Manager. The Administrator can assign the following types of access permissions to objects:

- Read - Grants permission to view the details of an object and the names of any objects it contains.
- Write - Grants permission to edit an object and create new repository objects in the Metadata Manager warehouse.
- Delete - Grants permission to delete an object from a repository.
- Change permission - Grants permission to change the access permissions for an object.


When a repository is first loaded into the Metadata Manager warehouse, Metadata Manager provides all permissions to users with the System Administrator role. All other users receive read permissions. The Administrator can then set inclusive and exclusive access permissions.

Metamodel Creation
In cases where a client needs to create custom metamodels for sourcing custom metadata, the Metadata Manager Administrator needs to create new packages, originators, repository types and class associations.

Job Monitoring
When Metadata Manager Resources are running in the Production environment, Informatica recommends monitoring loads through the Metadata Manager console. The Load page in the Metadata Manager Application interface has an Activity Log that can identify the total time it takes for a Resource to complete. The console maintains a history of all runs of a Resource, enabling a Metadata Manager Administrator to ensure that load times are meeting the SLA agreed upon with end users and that the load times are not increasing inordinately as data increases in the Metadata Manager warehouse. The Activity Log provides the following details about each repository load:

- Repository Name - name of the source repository defined in Metadata Manager
- Run Start Date - day of week and date the Resource run began
- Start Time - time the Resource run started
- End Time - time the Resource run completed
- Duration - number of seconds the Resource run took to complete
- Ran From - machine hosting the source repository
- Last Refresh Status - status of the Resource run, and whether it completed successfully or failed

Repository Backups
When Metadata Manager is running in either the Production or QA environment, Informatica recommends taking periodic backups of the following areas:

- Database backups of the Metadata Manager warehouse.
- The integration repository. Informatica recommends either of two methods for this backup:
  - The PowerCenter Repository Server Administration Console or the pmrep command line utility.
  - The traditional, native database backup method.

The native PowerCenter backup is required, but Informatica recommends using both methods because, if database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.
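The integration repository backup can be scripted around pmrep so that it runs alongside the warehouse database backup. A minimal sketch, assuming PowerCenter 8.x pmrep syntax (connect, then backup); the repository, domain, user, and file names are placeholders, and the exact flags should be confirmed in the pmrep Command Reference for your version. The database backup of the Metadata Manager warehouse itself would use the native database utility and is not shown.

import subprocess
from datetime import date

# Placeholder connection details - replace with values for your environment.
repo, domain, user, password = "MM_REPO", "Domain_MM", "Administrator", "secret"
backup_file = f"mm_repo_backup_{date.today():%Y%m%d}.rep"

# Connect to the integration repository, then take a native PowerCenter backup.
subprocess.run(["pmrep", "connect", "-r", repo, "-d", domain, "-n", user, "-x", password],
               check=True)
subprocess.run(["pmrep", "backup", "-o", backup_file], check=True)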

Last updated: 03-Jun-08 00:20


Upgrading Metadata Manager

Challenge


This best practices document summarizes the instructions for a Metadata Manager upgrade.

Description
Install and configure PowerCenter Services before you start the Metadata Manager upgrade process. To upgrade a Metadata Manager repository, create a Metadata Manager Service and associate it with the Metadata Manager repository you wish to upgrade. Then use the Metadata Manager options to upgrade the Metadata Manager repository. Use the Administration Console to perform the Metadata Manager upgrade tasks, and refer to the PowerCenter Configuration Guide for detailed instructions on upgrading Metadata Manager.

Before you start the upgrade process, be sure to check the Informatica support information for the Metadata Manager upgrade path. For instance, Superglue 2.1 (as Metadata Manager was previously called) should first be upgraded to Metadata Manager 8.1 and then to Metadata Manager 8.5, while Superglue 2.2 or Metadata Manager 8.1 can be upgraded to Metadata Manager 8.5 in one step. Also verify the requirements for the following Metadata Manager 8.5 components:

- Metadata Manager and Metadata Manager Client
- Web browser
- Databases
- Third-party software
- Code pages
- Application server

For more information about requirements for the PowerCenter components, see Chapter 3, PowerCenter Prerequisites, in the PowerCenter Installation Guide. For information about requirements for the Metadata Manager components, see Chapter 2, Verify Prerequisites, in the PowerCenter Configuration Guide.

- Disk space for the Metadata Manager repository. A disk space allocation of 1 GB is considered the starting size; it can grow considerably beyond this when many or large metadata resources are loaded.
- Flash 9 plug-in from Adobe. The Flash 9 plug-in from Adobe is required to properly display data lineage. To run data lineage analysis in Metadata Manager or from the Designer, download and install the Flash 9 plug-in on the web browser. You can obtain the Flash plug-in from the Adobe web site; it can be downloaded after the upgrade. When starting a data lineage display, Metadata Manager will prompt for the Adobe installation and point to the correct web site. To check whether Adobe Flash Player 9 is installed on a Windows client, check Start > Control Panel > Add or Remove Programs > Adobe Flash Player 9 (usually the first entry).

As we already know from the existing installation, Metadata Manager is made up of various components. Except for the Metadata Manager repository, all other Metadata Manager components (i.e., Metadata Manager Server, PowerCenter Repository, PowerCenter Clients, and Metadata Manager Clients) should be uninstalled and then reinstalled with the latest version of Metadata Manager. Keep in mind that all modifications and/or customizations to the standard version of Metadata Manager will be lost and will need to be re-created and re-tested after the upgrade process.

Upgrade Steps
1. Set up a new repository database and user account.
   - Set up a new database/schema for the PowerCenter Metadata Manager repository.
   - For Oracle, set the appropriate storage parameters.
   - For IBM DB2, use a single node tablespace to optimize PowerCenter performance.
   - For IBM DB2, configure the system temporary table spaces and update the heap sizes.
   - Create a database user account for the PowerCenter Metadata Manager repository. The database user must have permissions to create and drop tables and indexes, and to select, insert, update, and delete data from tables.

2. Make a copy of the existing Metadata Manager repository.
   - You can use any backup or copy utility provided with the database to make a copy of the working Metadata Manager repository prior to upgrading Metadata Manager. Use the copy of the Metadata Manager repository for the new Metadata Manager installation.

3. Back up the existing parameter files.
   - Make a copy of the existing parameter files. If you have custom Resources and the parameter, attribute, and data files of these custom Resources are in a different place, do not forget to take a backup of them too. You may need to refer to these files when you later configure the parameters for the custom Resources as part of the Metadata Manager client upgrade.
   - For PowerCenter 8.5, you can find the parameter files in the following directory: PowerCenter_Home\server\infa_shared\SrcFiles
   - For Metadata Manager 8.5, you can find the parameter files in the following directory: PowerCenter_Home\Server\SrcFiles

4. Export the Metadata Manager mappings that you customized or created for your environment.
   - If you made any changes to the standard Metadata Manager mappings, or created new mappings within the Metadata Manager integration repository, make an export of these mappings, workflows, and/or sessions. If you created additional reports, make an export of these reports too.

5. Install Metadata Manager.
   - Select the Custom installation set and install Metadata Manager. The installer creates a Repository Service and Integration Service in the PowerCenter domain and creates a PowerCenter repository for Metadata Manager.


6. Stop the Metadata Manager server.
   - You must stop the Metadata Manager server before you upgrade the Metadata Manager repository contents.

7. Upgrade the Metadata Manager repository.
   - Use the Metadata Manager upgrade utility shipped with the latest version of Metadata Manager to upgrade the Metadata Manager repository.

8. Complete the Metadata Manager post-upgrade tasks.
   - After you upgrade the Metadata Manager repository, perform the following tasks:
     - Update metamodels for Business Objects and Cognos ReportNet Content Manager.
     - Delete obsolete Metadata Manager objects.
     - Refresh Metadata Manager views.
     - For a DB2 Metadata Manager repository, import metamodels.

9. Upgrade the Metadata Manager Client.
   - For instructions on upgrading the Metadata Manager Client, refer to the PowerCenter Configuration Guide. After you complete the upgrade steps, verify that all dashboards and reports are working correctly in Metadata Manager. When you are sure that the new version is working properly, you can delete the old instance of Metadata Manager and switch to the new version.

10. Compare and redeploy the exported Metadata Manager mappings that were customized or created for your environment.
   - If you had any modified Metadata Manager mappings in the previous release of Metadata Manager, check whether the modifications are still necessary. If the modifications are still needed, override or rebuild the changes in the new PowerCenter mappings.
   - Import the customized reports into the new environment and check that the reports are still working with the new Metadata Manager environment. If not, make the necessary modifications to make them compatible with the new structure.

11. Upgrade the Custom Resources.
   - If you have any custom Resources in your environment, you need to regenerate the Resource mappings that were generated by the previous version of the custom Resource configuration wizard. Before starting the regeneration process, ensure that the absolute paths to the .csv files are the same as in the previous version. If all the paths are the same, no further actions are required after the regeneration of the workflows and mappings.

12. Uninstall the previous version of Metadata Manager.
   - Verify that the browser and all reports are working correctly in the new version of Metadata Manager. If the upgrade is successful, you can uninstall the previous version of Metadata Manager.

Last updated: 05-Jun-08 14:26


Daily Operations

Challenge


Once the data warehouse has been moved to production, the most important task is keeping the system running and available for the end users.

Description
In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer to the Production Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an Operations Manual, to help in the support of the production data warehouse.

Monitoring the System


Monitoring the system is useful for identifying any problems or outages before the users notice. The Production Support team must know what failed, where it failed, when it failed, and who needs to be working on the solution. Identifying outages and/or bottlenecks can help to identify trends associated with various technologies. The goal of monitoring is to reduce downtime for the business user. Comparing the monitoring data against threshold violations, service level agreements, and other organizational requirements helps to determine the effectiveness of the data warehouse and any need for changes.

Service Level Agreement


The Service Level Agreement (SLA) outlines how the overall data warehouse system is to be maintained. This is a high-level document that discusses system maintenance and the components of the system, and identifies the groups responsible for monitoring the various components. The SLA should be able to be measured against key performance indicators. At a minimum, it should contain the following information:

- Times when the system should be available to users.
- Scheduled maintenance window.
- Who is expected to monitor the operating system.
- Who is expected to monitor the database.
- Who is expected to monitor the PowerCenter sessions.
- How quickly the support team is expected to respond to notifications of system failures.
- Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure.

Operations Manual
The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the data warehouse system maintenance. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that can arise. This manual should contain information on how to maintain all data warehouse system components. At a minimum, the Operations Manual should contain:

- Information on how to stop and re-start the various components of the system.
- IDs and passwords (or how to obtain passwords) for the system components.
- Information on how to re-start failed PowerCenter sessions and recovery procedures.
- A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times.
- Error handling strategies.
- Who to call in the event of a component failure that cannot be resolved by the Production Support team.

PowerExchange Operations Manual


The need to maintain archive logs and listener logs, use started tasks, perform recovery, and handle other operational functions on MVS are challenges that need to be addressed in the Operations Manual. If listener logs are not cleaned up on a regular basis, operations is likely to face space issues. Setting up archive logs on MVS requires datasets to be allocated and sized. Recovery after failure requires operations intervention to restart workflows and set the restart tokens. For Change Data Capture, operations are required to start the started tasks in a scheduler and/or after an IPL, and there are certain commands that need to be executed by operations. The PowerExchange Reference Guide (8.1.1) and the related Adapter Guide provide detailed information on the operation of PowerExchange Change Data Capture.

Archive/Listener Log Maintenance


The archive log should be controlled by using the Retention Period specified in the EDMUPARM ARCHIVE_OPTIONS parameter ARCHIVE_RTPD=. The default supplied in the install (in RUNLIB member SETUPCC2) is 9999. This is generally longer than most organizations need. To change it, just rerun the first step (and only the first step) in SETUPCC2 after making the appropriate changes. Any new archive log datasets will be created with the new retention period. This does not, however, fix the old archive datasets; to do that, use SMS to override the specification, removing the need to change the EDMUPARM. The default listener log is part of the joblog of the running listener. If the listener job runs continuously, there is a potential risk of the spool file reaching its maximum size and causing issues with the listener. If, for example, the listener started task is scheduled to restart every weekend, the log will be refreshed and a new spool file will be created. If necessary, change the started task listener jobs from //DTLLOG DD SYSOUT=* to //DTLLOG DD DSN=&HLQ..LOG; this will write the log to the member LOG in HLQ..RUNLIB.

Recovery After Failure


The last-resort recovery procedure is to re-execute your initial extraction and load, and restart the CDC process from the new initial load start point. Fortunately, there are other solutions; in any case, if you need every change, reinitializing may not be an option.

Application ID
PowerExchange documentation talks about consuming applications, the processes that extract changes, whether they are realtime or change (periodic batch extraction). Each consuming application must identify itself to PowerExchange. Realistically, this means that each session must have an application id parameter containing a unique label.

Restart Tokens
PowerExchange remembers each time that a consuming application successfully extracts changes. The end-point of the extraction (the address in the database log, RBA or SCN) is stored in a file on the server hosting the Listener that reads the changed data. Each of these memorized end-points (i.e., Restart Tokens) is a potential restart point. It is possible, using the Navigator interface directly, or by updating the restart file, to force the next extraction to restart from any of these points. If you're using the ODBC interface for PowerExchange, this is the best solution to implement.

If you are running periodic extractions of changes and everything finishes cleanly, the restart token history is a good approach to recovering back to a previous extraction. You simply choose the recovery point from the list and re-use it. There are more likely scenarios, though. If you are running realtime extractions, potentially never-ending or running until there's a failure, there are no end-points to memorize for restarts. If your batch extraction fails, you may already have processed and committed many changes. You can't afford to miss any changes and you don't want to reapply the same changes you've just processed, but the previous restart token does not correspond to the reality of what you've processed.

If you are using the PowerExchange Client for PowerCenter (PWXPC), the best answer to the recovery problem lies with PowerCenter, which has historically been able to deal with restarting this type of process: Guaranteed Message Delivery. This functionality is applicable to both realtime and change CDC options. The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run for each Application Id in files on the PowerCenter Server. The directory and file name are required parameters when configuring the PWXPC connection in the Workflow Manager. This functionality greatly simplifies recovery procedures compared to using the ODBC interface to PowerExchange. To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the session properties. During normal session execution, the PowerCenter Server stores recovery information in cache files in the directory specified for $PMCacheDir.

Normal CDC Execution


If the session ends "cleanly" (i.e., zero return code), PowerCenter writes tokens to the restart file, and the GMD cache is purged. If the session fails, you are left with unprocessed changes in the GMD cache and a Restart Token corresponding to the point in time of the last of the unprocessed changes. This information is useful for recovery.

Recovery
If a CDC session fails, and it was executed with recovery enabled, you can restart it in recovery mode either from the PowerCenter Client interfaces or using the pmcmd command line instruction. Obviously, this assumes that you are able to identify that the session failed previously.

1. Start from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges the records from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends cleanly.

The CDC session is now ready for you to execute in normal mode again.

Recovery Using PWX ODBC Interface


You can, of course, successfully recover if you are using the ODBC connectivity to PowerCenter, but you have to build in some things yourself: coping with processing all the changes from the last restart token, even if you've already processed some of them. When you re-execute a failed CDC session, you receive all the changed data since the last PowerExchange restart token. Your session has to cope with processing some of the same changes you already processed at the start of the failed execution, either by using lookups/joins to the target to see if you've already applied the change you are processing, or by simply ignoring database error messages such as those from trying to delete a record you already deleted.

If you run DTLUAPPL to generate a restart token periodically during the execution of your CDC extraction and save the results, you can use the generated restart token to force a recovery at a more recent point in time than the last session-end restart token. This is especially useful if you are running realtime extractions using ODBC; otherwise you may find yourself re-processing several days of changes you've already processed. Finally, you can always re-initialize the target and the CDC processing:

- Take an image copy of the tablespace containing the table to be captured, with the QUIESCE option.
- Monitor the EDMMSG output from the PowerExchange Logger job. Look for message DTLEDM172774I, which identifies the PowerExchange Logger sequence number corresponding to the QUIESCE event. The Logger output shows detail in the following format:

  DB2 QUIESCE of TABLESPACE TSNAME.TBNAME at DB2 RBA/LRSN 000849C56185
  EDP Logger RBA . . . . . . . . . : D5D3D3D34040000000084E0000000000
  Sequence number . . . . . . . . : 000000084E0000000000
  Edition number . . . . . . . . . : B93C4F9C2A79B000
  Source EDMNAME(s) . . . . . . . : DB2DSN1CAPTNAME1

- Take note of the log sequence number.
- Repeat for all tables that form part of the same PowerExchange Application.
- Run the DTLUAPPL utility, specifying the application name and the registration name for each table in the application. Alter the SYSIN as follows:

  MOD APPL REGDEMO DSN1      (where REGDEMO is the Registration name from Navigator)
  add RSTTKN CAPDEMO         (where CAPDEMO is the Capture name from Navigator)
  SEQUENCE 000000084E0000000000000000084E0000000000
  RESTART D5D3D3D34040000000084E0000000000
  END APPL REGDEMO           (where REGDEMO is the Registration name from Navigator)

Note how the sequence number is a repeated string from the sequence number found in the Logger messages after the Copy/Quiesce.

Note that the Restart parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the same message sequence. This sets the extraction start point on the PowerExchange Logger to the point at which the QUIESCE was done above. The image copy obtained above can be used for the initial materialization of the target tables.
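Because the SEQUENCE value is simply the Logger sequence number repeated, and the RESTART value is the EDP Logger RBA, the DTLUAPPL SYSIN can be generated from the values reported in the DTLEDM172774I message. A minimal Python sketch using the sample values from above; the registration and capture names are the example names, not required values.

def dtluappl_sysin(reg_name, cap_name, sequence, edp_rba):
    # Build the DTLUAPPL SYSIN deck from the Logger QUIESCE message values.
    return "\n".join([
        f"MOD APPL {reg_name} DSN1",
        f"add RSTTKN {cap_name}",
        f"SEQUENCE {sequence * 2}",   # the sequence number repeated, as in the example above
        f"RESTART {edp_rba}",
        f"END APPL {reg_name}",
    ])

print(dtluappl_sysin("REGDEMO", "CAPDEMO",
                     "000000084E0000000000",
                     "D5D3D3D34040000000084E0000000000"))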

PowerExchange Tasks: MVS Start and Stop Command Summary


Task: Listener
- Description: The PowerExchange Listener is used for bulk data movement and for registering sources for Change Data Capture.
- Start Command: /S DTLLST
- Stop Commands (in order of preference):
  - /F DTLLST,CLOSE (preferred method)
  - /F DTLLST,CLOSE,FORCE (if CLOSE doesn't work)
  - /P DTLLST (if FORCE doesn't work)
  - /C DTLLST (if STOP doesn't work)

Task: Agent
- Description: The PowerExchange Agent, used to manage connections to the Logger and to handle repository and other tasks. This must be started before the Logger.
- Start Command: /S DTLA
- Stop Commands: /DTLA DRAIN and /DTLA SHUTDOWN
- Notes: SHUTDOWN COMPLETELY can be used only at the request of Informatica Support.

Task: Logger
- Description: The PowerExchange Logger, used to manage the linear datasets and hiperspace that hold change capture data.
- Start Command: /S DTLL (if you are installing, you need to run setup2 here prior to starting the Logger)
- Stop Commands: /P DTLL or /F DTLL,STOP; /F DTLL,display is a display command.

Task: ECCR (DB2)
- Description: The DB2 ECCR. There must be registrations present prior to bringing up most adaptor ECCRs.
- Start Command: /S DTLDB2EC
- Stop Commands: /F DTLDB2EC,STOP or /F DTLDB2EC,QUIESCE or /P DTLDB2EC
- Notes: The STOP command just cancels the ECCR; QUIESCE waits for open UOWs to complete. /F DTLDB2EC,display will publish stats into the ECCR sysout.

Task: Condense
- Description: The PowerExchange Condenser, used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row.
- Start Command: /S DTLC
- Stop Command: /F DTLC,SHUTDOWN

Task: Apply
- Description: The PowerExchange Apply process, used in situations where straight replication is required and the data is not moved through PowerCenter before landing in the target.
- Start Command: Submit JCL or /S DTLAPP
- Stop Commands: (1) To identify all tasks running through a certain listener, issue F <Listener job>,D A. (2) Then to stop the Apply (where name = DBN2, the apply name), issue F DTLLST,STOPTASK name. If the CAPX access and apply is running locally, not through a listener, then issue F <Listener job>,CLOSE.

Notes:
1. /P is an MVS STOP command; /F is an MVS MODIFY command.
2. Remove the / if the command is issued from the console rather than SDSF.

If you attempt to shut down the Logger before the ECCR(s), a message indicates that there are still active ECCRs and that the Logger will come down after the ECCRs go away. You can shut the Listener and the ECCR(s) down at the same time.

The Listener:
1. F <Listener_job>,CLOSE
2. If this isn't coming down fast enough for you, issue F <Listener_job>,CLOSE FORCE
3. If it still isn't coming down fast enough, issue C <Listener_job>
Note that these commands are listed in order from the most to the least desirable method for bringing the Listener down.

The DB2 ECCR:
1. F <DB2 ECCR>,QUIESCE - this waits for all open UOWs to finish, which can be a while if a long-running batch job is running.
2. F <DB2 ECCR>,STOP - this terminates immediately.
3. P <DB2 ECCR> - this also terminates immediately.

Once the ECCR(s) are down, you can then bring the Logger down.
The Logger: P <Logger_job_name>
The Agent: CMDPREFIX SHUTDOWN

If you know that you are headed for an IPL, you can issue all these commands at the same time. The Listener and ECCR(s) should start coming down; if you are looking for speed, issue F <Listener_job>,CLOSE FORCE to shut down the Listener, then issue F <DB2 ECCR>,STOP to terminate the DB2 ECCR, then shut down the Logger and the Agent.

Note: Bringing the Agent down before the ECCR(s) are down can result in a loss of captured data. If a new file/DB2 table/IMS database is being updated during this shutdown process and the Agent is not available, the call to see if the source is registered returns a "Not being captured" answer. The update, therefore, occurs without you capturing it, leaving your target in a broken state (which you won't know about until too late!).

Sizing the Logger


When you install PWX-CHANGE, up to two active log data sets are allocated with minimum size requirements. The information in this section can help to determine if you need to increase the size of the data sets, and if you should allocate additional log data sets. When you define your active log data sets, consider your system's capacity and your changed data requirements, including archiving and performance issues. After the PWX Logger is active, you can change the log data set configuration as necessary. In general, remember that you must balance the following variables:

- Data set size
- Number of data sets
- Amount of archiving

The choices you make depend on the following factors:

- Resource availability requirements
- Performance requirements
- Whether you are running near-realtime or batch replication
- Data recovery requirements

An inverse relationship exists between the size of the log data sets and the frequency of archiving required. Larger data sets need to be archived less often than smaller data sets. Note: Although smaller data sets require more frequent archiving, the archiving process requires less time. Use the following formulas to estimate the total space you need for each active log data set. For an example of the calculated data set size, refer to the PowerExchange Reference Guide.

- active log data set size in bytes = (average size of captured change record * number of changes captured per hour * desired number of hours between archives) * (1 + overhead rate)
- active log data set size in tracks = active log data set size in bytes / number of usable bytes per track
- active log data set size in cylinders = active log data set size in tracks / number of tracks per cylinder

When determining the average size of your captured change records, note the following information:

- PWX Change Capture captures the full object that is changed. For example, if one field in an IMS segment has changed, the product captures the entire segment.
- The PWX header adds overhead to the size of the change record. Per record, the overhead is approximately 300 bytes plus the key length.
- The type of change transaction affects whether PWX Change Capture includes a before-image, an after-image, or both:
  - DELETE includes a before-image.
  - INSERT includes an after-image.
  - UPDATE includes both.

Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:

- Overhead for control information
- Overhead for writing recovery-related information, such as system checkpoints

You have some control over the frequency of system checkpoints when you define your PWX Logger parameters. See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information about this parameter.

DASD Capacity Conversion Table

Space Information | Model 3390 | Model 3380
usable bytes per track | 49,152 | 40,960
tracks per cylinder | 15 | 15

This example is based on the following assumptions:

- estimated average size of a changed record = 600 bytes
- estimated rate of captured changes = 40,000 changes per hour
- desired number of hours between archives = 12
- overhead rate = 5 percent
- DASD model = 3390

The estimated size of each active log data set in bytes is calculated as follows:

600 * 40,000 * 12 * 1.05 = 302,400,000

The number of cylinders to allocate is calculated as follows:

302,400,000 / 49,152 = approximately 6152 tracks
6152 / 15 = approximately 410 cylinders

The following example shows an IDCAMS DEFINE statement that uses the above calculations:

DEFINE CLUSTER (NAME (HLQ.EDML.PRILOG.DS01)
  LINEAR
  VOLUMES(volser)
  SHAREOPTIONS(2,3)
  CYL(410) )
DATA (NAME(HLQ.EDML.PRILOG.DS01.DATA) )
The variable HLQ represents the high-level qualifier that you defined for the log data sets during installation.
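The sizing arithmetic can also be scripted so that it is easy to re-run as change volumes grow. A minimal Python sketch that reproduces the worked example above; the default geometry values are taken from the DASD conversion table for a model 3390.

def active_log_cylinders(avg_record_bytes, changes_per_hour, hours_between_archives,
                         overhead_rate=0.05, bytes_per_track=49152, tracks_per_cylinder=15):
    # Apply the formulas above: bytes -> tracks -> cylinders.
    size_bytes = (avg_record_bytes * changes_per_hour * hours_between_archives
                  * (1 + overhead_rate))
    tracks = size_bytes / bytes_per_track
    return round(tracks / tracks_per_cylinder)

# Worked example: 600 bytes per change, 40,000 changes per hour, 12 hours between archives.
print(active_log_cylinders(600, 40000, 12))   # approximately 410 cylinders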

Additional Logger Tips


The Logger format utility (EDMLUTL0) formats only the primary space allocation. This means that the Logger does not use secondary allocation. This includes Candidate Volumes and Space, such as that allocated by SMS when using a STORCLAS with the Guaranteed Space attribute. Logger active logs should be defined through IDCAMS with:

- No secondary allocation.
- A single VOLSER in the VOLUME parameter.
- An SMS STORCLAS, if used, without GUARANTEED SPACE=YES.

PowerExchange Agent Commands



You can use commands from the MVS system to control certain aspects of PowerExchange Agent processing. To issue a PowerExchange Agent command, enter the PowerExchange Agent command prefix (as specified by CmdPrefix in your configuration parameters), followed by the command. For example, if CmdPrefix=AG01, issue the following command to close the Agent's message log:

AG01 LOGCLOSE

The PowerExchange Agent intercepts agent commands issued on the MVS console and processes them in the agent address space. If the PowerExchange Agent address space is inactive, MVS rejects any PowerExchange Agent commands that you issue. If the PowerExchange Agent has not been started during the current IPL, or if you issue the command with the wrong prefix, MVS generates the following message:

IEE305I command COMMAND INVALID

See the PowerExchange Reference Guide (8.1.1) for detailed information on Agent commands.

PowerExchange Logger Commands


The PowerExchange Logger uses two types of commands: interactive and batch. You run interactive commands from the MVS console when the PowerExchange Logger is running. You can use PowerExchange Logger interactive commands to:

- Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer connections.
- Resolve in-doubt UOWs.
- Stop a PowerExchange Logger.
- Print the contents of the PowerExchange active log file (in hexadecimal format).

You use batch commands primarily in batch change utility jobs to make changes to parameters and configurations when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands to:

- Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange Logger names, archive log options, buffer options, and mode (single or dual).
- Add log definitions to the restart data set.
- Delete data set records from the restart data set.
- Display log data sets, UOWs, and reader/writer connections.

See the PowerExchange Reference Guide (8.1.1), Chapter 4, page 59, for detailed information on Logger commands.

Last updated: 05-Jun-08 14:43


Data Integration Load Traceability

Challenge


Load management is one of the major difficulties facing a data integration or data warehouse operations team. This Best Practice tries to answer the following questions:
- How can the team keep track of what has been loaded?
- What order should the data be loaded in?
- What happens when there is a load failure?
- How can bad data be removed and replaced?
- How can the source of data be identified?
- When was it loaded?

Description
Load management provides an architecture to allow all of the above questions to be answered with minimal operational effort.

Benefits of a Load Management Architecture

Data Lineage


The term Data Lineage is used to describe the ability to track data from its final resting place in the target back to its original source. This requires the tagging of every row of data in the target with an ID from the load management metadata model. This serves as a direct link between the actual data in the target and the original source data.

To give an example of the usefulness of this ID, a data warehouse or integration competency center operations team, or possibly end users, can, on inspection of any row of data in the target schema, link back to see when it was loaded, where it came from, any other metadata about the set it was loaded with, validation check results, the number of other rows loaded at the same time, and so forth.

It is also possible to use this ID to link one row of data with all of the other rows loaded at the same time. This can be useful when a data issue is detected in one row and the operations team needs to see if the same error exists in all of the other rows. More importantly, the ability to easily identify the source data for a specific row in the target enables the operations team to quickly identify where a data issue may lie.

It is often assumed that data issues are produced by the transformation processes executed as part of the target schema load. Using the source ID to link back to the source data makes it easy to identify whether the issues were in the source data when it was first encountered by the target schema load processes or if those load processes caused the issue. This ability can save a huge amount of time, expense, and frustration, particularly in the initial launch of any new subject areas.

Process Lineage
Tracking the order that data was actually processed in is often the key to resolving processing and data issues. Because choices are often made during the processing of data based on business rules and logic, the order and path of processing differs from one run to the next. Only by actually tracking these processes as they act upon the data can issue resolution be simplified.

Process Dependency Management


Having a metadata structure in place provides an environment to facilitate the application and maintenance of business dependency rules. Once a structure is in place that identifies every process, it becomes very simple to add the necessary metadata and validation processes required to ensure enforcement of the dependencies among processes. Such enforcement resolves many of the scheduling issues that operations teams typically face.

Process dependency metadata needs to exist because it is often not possible to rely on the source systems to deliver the correct data at the correct time. Moreover, in some cases, transactions are split across multiple systems and must be loaded into the target schema in a specific order. This is usually difficult to manage because the various source systems have no way of coordinating the release of data to the target schema.

Robustness
Using load management metadata to control the loading process also offers two other big advantages, both of which fall under the heading of robustness because they allow for a degree of resilience to load failure.

Load Ordering
Load ordering is a set of processes that use the load management metadata to identify the order in which the source data should be loaded. This can be as simple as making sure the data is loaded in the sequence it arrives, or as complex as having a pre-defined load sequence planned in the metadata. There are a number of techniques used to manage these processes. The most common is an automated process that generates a PowerCenter load list from flat files in a directory, then archives the files in that list after the load is complete. This process can use embedded data in file names or can read header records to identify the correct ordering of the data. Alternatively the correct order can be pre-defined in the load management metadata using load calendars. Either way, load ordering should be employed in any data integration or data warehousing

implementation because it allows the load process to be automatically paused when there is a load failure, and ensures that the data that has been put on hold is loaded in the correct order as soon as possible after a failure. The essential part of the load management process is that it operates without human intervention, helping to make the system self healing!
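
A minimal sketch of such an automated load-ordering process is shown below. It assumes the incoming flat files carry a sortable timestamp in their names and that the PowerCenter session reads an indirect file list; the directory names, connection details, and workflow name are placeholders to be replaced with your own values.

#!/bin/sh
# Build a load list in arrival order, run the load, then archive the loaded files.
INBOX=/data/inbound/customer          # assumption: incoming flat files land here
ARCHIVE=/data/archive/customer        # assumption: processed files are moved here
FILELIST=/data/inbound/customer.lst   # indirect file list read by the PowerCenter session

# File names are assumed to embed a sortable timestamp, e.g. customer_20080605143000.dat,
# so a simple sort yields the correct load order.
ls $INBOX/customer_*.dat 2>/dev/null | sort > $FILELIST

if [ ! -s $FILELIST ]; then
    echo "No files to load - pausing this cycle."
    exit 0
fi

# Start the workflow that reads the file list (pmcmd arguments are placeholders).
pmcmd startworkflow -sv IS_Name -d Domain_Name -u Administrator -p password \
      -f FolderName -wait wf_load_customer
if [ $? -eq 0 ]; then
    # Archive only the files named in the list that was just loaded.
    for f in `cat $FILELIST`; do
        mv $f $ARCHIVE/
    done
fi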

Rollback
If there is a loading failure or a data issue in normal daily load operations, it is usually preferable to remove all of the data loaded as one set. Load management metadata allows the operations team to selectively roll back a specific set of source data, the data processed by a specific process, or a combination of both. This can be done using manual intervention or by a developed automated feature.

Simple Load Management Metadata Model

As you can see from the simple load management metadata model above, there are two sets of data linked to every transaction in the target tables. These represent the two major types of load management metadata:
- Source tracking
- Process tracking


Source Tracking
Source tracking looks at how the target schema validates and controls the loading of source data. The aim is to automate as much of the load processing as possible and track every load from the source through to the target schema.

Source Definitions
Most data integration projects use batch load operations for the majority of data loading. The sources for these come in a variety of forms, including flat file formats (ASCII, XML etc), relational databases, ERP systems, and legacy mainframe systems. The first control point for the target schema is to maintain a definition of how each source is structured, as well as other validation parameters. These definitions should be held in a Source Master table like the one shown in the data model above. These definitions can and should be used to validate that the structure of the source data has not changed. A great example of this practice is the use of DTD files in the validation of XML feeds. In the case of flat files, it is usual to hold details like:
- Header information (if any)
- How many columns
- Data types for each column
- Expected number of rows

For RDBMS sources, the Source Master record might hold the definition of the source tables or store the structure of the SQL statement used to extract the data (i.e., the SELECT, FROM and ORDER BY clauses). These definitions can be used to manage and understand the initial validation of the source data structures. Quite simply, if the system is validating the source against a definition, there is an inherent control point at which problem notifications and recovery processes can be implemented. It's better to catch a bad data structure than to start loading bad data.
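
As a simple illustration of this control point, a check along the following lines can compare an incoming delimited file against the column count and header recorded in the Source Master definition before any load starts. This is a sketch only; the expected values and delimiter are hypothetical and would normally be read from the Source Master table rather than hard-coded.

#!/bin/sh
# Validate an incoming delimited file against its Source Master definition.
FILE=$1
EXPECTED_COLS=12                       # assumption: value held in the Source Master record
EXPECTED_HEADER="CUST_ID|NAME|ADDR1"   # assumption: start of the expected header row
DELIM="|"

ACTUAL_COLS=`head -1 $FILE | awk -F"$DELIM" '{print NF}'`
ACTUAL_HEADER=`head -1 $FILE | cut -c1-${#EXPECTED_HEADER}`

if [ "$ACTUAL_COLS" -ne "$EXPECTED_COLS" ] || [ "$ACTUAL_HEADER" != "$EXPECTED_HEADER" ]; then
    echo "Structure check failed for $FILE - notifying operations and skipping the load."
    exit 1
fi
echo "Structure check passed for $FILE."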

Source Instances
A Source Instance table (as shown in the load management metadata model) is designed to hold one record for each separate set of data of a specific source type being loaded. It should have a direct key link back to the Source Master table which defines its type.


The various source types may need slightly different source instance metadata to enable optimal control over each individual load. Unlike the source definitions, this metadata will change every time a new extract and load is performed. In the case of flat files, this would be a new file name and possibly date/time information from its header record. In the case of relational data, it would be the selection criteria (i.e., the SQL WHERE clause) used for each specific extract, and the date and time it was executed.

This metadata needs to be stored in the source tracking tables so that the operations team can identify a specific set of source data if the need arises. This need may arise if the data needs to be removed and reloaded after an error has been spotted in the target schema.

Process Tracking
Process tracking describes the use of load management metadata to track and control the loading processes rather than the specific data sets themselves. There can often be many load processes acting upon a single source instance set of data. While it is not always necessary to be able to identify when each individual process completes, it is very beneficial to know when a set of sessions that move data from one stage to the next has completed. Not all sessions are tracked this way because, in most cases, the individual processes are simply storing data into temporary tables that will be flushed at a later date. Since load management process IDs are intended to track back from a record in the target schema to the process used to load it, it only makes sense to generate a new process ID if the data is being stored permanently in one of the major staging areas.

Process Definition
Process definition metadata is held in the Process Master table (as shown in the load management metadata model ). This, in its basic form, holds a description of the process and its overall status. It can also be extended, with the introduction of other tables, to reflect any dependencies among processes, as well as processing holidays.

Process Instances
A process instance is represented by an individual row in the load management metadata Process Instance table. This represents each instance of a load process that is actually run. This holds metadata about when the process started and stopped, as well as its current status. Most importantly, this table allocates a unique ID to each instance. The unique ID allocated in the process instance table is used to tag every row of source data. This ID is then stored with each row of data in the target table.

Integrating Source and Process Tracking


Integrating source and process tracking can produce an extremely powerful investigative and control tool for the administrators of data warehouses and integrated schemas. This is achieved by simply linking every process ID with the source instance ID of the source it is processing. This requires that a write-back facility be built into every process to update its process instance record with the ID of the source instance being processed.

The effect is a one-to-many relationship between the Source Instance table and the Process Instance table, with the Process Instance table containing several rows for each set of source data loaded into a target schema. For example, in a data warehousing project there might be a row for loading the extract into a staging area, a row for the move from the staging area to an ODS, and a final row for the move from the ODS to the warehouse.

Integrated Load Management Flow Diagram

Tracking Transactions
This is the simplest data to track since it is loaded incrementally and not updated. This means that the process and source tracking discussed earlier in this document can be applied as is.

Tracking Reference Data


This task is complicated by the fact that reference data, by its nature, is not static. This means that if you simply update the data in a row any time there is a change, there is no way that the change can be backed out using the load management practice described earlier. Instead, Informatica recommends always using slowly changing dimension processing on every reference data and dimension table to accomplish source and process tracking. Updating the reference data as a slowly changing table retains the previous versions of updated records, thus allowing any changes to be backed out.

Tracking Aggregations
Aggregation also causes additional complexity for load management because the resulting aggregate row very often contains the aggregation across many source data sets. As with reference data, this means that the aggregated row cannot be backed out in the same way as transactions. This problem is managed by treating the source of the aggregate as if it was an original source. This means that rather than trying to track the original source, the load management metadata only tracks back to the transactions in the target that have been aggregated. So, the mechanism is the same as used for transactions but the resulting load management metadata only tracks back from the aggregate to the fact table in the target schema.

Last updated: 20-Dec-07 15:44


Disaster Recovery Planning with PowerCenter HA Option

Challenge


Develop a Disaster Recovery (DR) Plan for PowerCenter running on Unix/Linux platforms. Design a PowerCenter data integration platform for high availability (HA) and disaster recovery that can support a variety of mission-critical and time-sensitive operational applications across multiple business and functional areas.

Description
To enable maximum resilience, the data integration platform design should provide redundancy and remoteness. The target architecture proposed in this document is based upon the following assumptions:
- A PowerCenter HA option license is present.
- A Cluster File System will be used to provide concurrent file access from multiple servers in order to provide a flexible, high-performance, and highly available platform for shared data in a SAN environment.
- Four servers will be available for installing PowerCenter components.
- PowerCenter binaries, the repository/domain database, and the shared file system for PowerCenter working files are considered in the failover scenario. The DR plan does not take into consideration source and target databases, FTP servers, or scheduling tools.
- A standby database server (which requires replicated logs for recovery) will be used as the disaster recovery solution for the database tier. It will provide disaster tolerance for both the PowerCenter repository and the domain database. As this server will be used to achieve high availability, it should have performance characteristics in parity with the primary repository database server.
- Recovery time for storage can be reduced using near real-time replication of data-over-distance from the primary SAN to a mirror SAN. Storage vendors should be consulted for optimal SAN and mirror SAN configuration.

Primary Data Center During Normal Operation


PowerCenter Domain During Normal Operation


The Informatica Service Manager on Node 1 and Node 2 is running. The Informatica Service Manager on Node 3 and Node 4 is shut down.

A node is a logical representation of a physical machine. Each node runs a Service Manager (SM) process to control the services running on that node. A node is considered unavailable if the SM process is not up and running. For example, the SM process may not be running if the administrator has shut down the machine or the SM process. SM processes periodically exchange a heartbeat signal amongst themselves to detect any node/network failure. Upon detecting a primary (or backup) node failure, the remaining nodes determine the new primary (or backup) node via a distributed voting algorithm. Typically, the administrator will configure the OS to automatically start the SM whenever the OS boots up or in the event the SM fails unexpectedly. For unexpected failures of the SM, monitoring scripts should be used because the SM is the primary point of control for PowerCenter services on a node.

When PowerCenter is installed on a Unix/Linux platform, the same user id (uid) and group id (gid) should be created for all Unix/Linux users on Node1, Node2, Node3 and Node4. When the infa_shared directory is placed on a shared file system like CFS, all Unix/Linux users should be granted read/write access to the same files. For example, if a workflow running on Node1 creates a log file in the log directory, Node2, Node3 and Node4 should be able to read and update this file.

To install and configure PowerCenter services on four nodes:

1. For the Node1 installation, choose the option to create the domain.
2. For the Node2, Node3 and Node4 installations, choose the option to join the domain.
3. Node1 will be the master gateway. For Node2, Node3 and Node4 choose Serves as Gateway: Yes.
4. For Node1, use the following URL to confirm that it is the Master Gateway:
   http://node1_hostname:6001/coreservices/DomainService
   The result should look like this:
   /coreservices/AlertService : enabled
   /coreservices/AuthenticationService : initialized
   /coreservices/AuthorizationService : enabled
   /coreservices/DomainConfigurationService : enabled
   /coreservices/DomainService : [DOM_10004] Domain service is currently master gateway node and enabled.
   /coreservices/DomainService/InitTime : Fri Aug 03 09:59:03 EDT 2007
   /coreservices/LicensingService : enabled
   /coreservices/LogService : enabled
   /coreservices/LogServiceAgent : initialized
   /coreservices/NodeConfigurationService : enabled
5. For Node2, Node3 and Node4 respectively, use the following URL to confirm that they are not Master Gateways:
   http://node2_hostname:6001/coreservices/DomainService
   The result should look like this:
   /coreservices/AlertService : uninitialized
   /coreservices/AuthenticationService : initialized
   /coreservices/AuthorizationService : initialized
   /coreservices/DomainConfigurationService : initialized
   /coreservices/DomainService : [DOM_10005] Domain service is currently non-master gateway node and listening.
   /coreservices/DomainService/InitTime : Fri Aug 03 09:59:03 EDT 2007
   /coreservices/LicensingService : initialized
   /coreservices/LogService : initialized
   /coreservices/LogServiceAgent : initialized
   /coreservices/NodeConfigurationService : enabled
6. Confirm the following settings:
   a. For Node1, Repository Service should be created as primary.
   b. For Node1, Acts as backup Integration Service should be checked.
   c. For Node2, Integration Service should be created as primary.
   d. For Node2, Acts as backup Repository Service should be checked.
   e. Node3 and Node4 should be assigned as backup nodes for the Repository Service and Integration Service.

Note: During the failover in order for Node3 and Node4 to act as primary repository services, they will need to have access to the standby repository database.

After the installation, persistent cache files, parameter files, logs, and other run-time files should be configured to use the directory created on the shared file system by pointing the $PMRootDir variable to this directory. Alternatively, a symbolic link can be created from the default infa_shared location to the infa_shared directory created on the shared file system.

After the initial setup, Node3 and Node4 should be shut down from the Administration Console. During normal operations Node3 and Node4 will be unavailable. In the event of a failover to the secondary data center, it is assumed that the servers for Node1 and Node2 will become unavailable. When the hosts for Node3 and Node4 are rebooted, the following script, placed in init.d, will start the Service Manager process:

TOMCAT_HOME=/u01/app/informatica/pc8.0.0/server/tomcat/bin

case "$1" in
'start')
        # Start the PowerCenter daemons:
        su - pmuser -c "$TOMCAT_HOME/infaservice.sh startup"
        exit
        ;;
'stop')
        # Stop the PowerCenter daemons:
        su - pmuser -c "$TOMCAT_HOME/infaservice.sh shutdown"
        exit
        ;;
esac

Every node in the domain sends a heartbeat to the primary gateway at a periodic interval. The default value for this interval is 15 seconds (this may change in a future release). The heartbeat is a message sent over an HTTP connection. As part of the heartbeat, each node also updates the gateway with the service processes currently running on the node. If a node fails to send a heartbeat during the default timeout value, which is a multiple of the heartbeat interval (the default value is 90 seconds), then the primary gateway node marks the node unavailable and fails over any of the services running on that node. Six chances are given for the node to update the master before it is marked as down. This avoids any false alarms for a single packet loss or in cases of heavy network load where the packet delivery could take longer.

When Node3 and Node4 are started in the backup data center, they will try to establish a connection to the Master Gateway Node1. After failing to reach Node1, one of them will establish itself as the new Master Gateway. When normal operations resume, Node1 and Node2 will be rebooted and the Informatica Service Manager process will start on these nodes. Since the Informatica Service Manager process on Node3 and Node4 will be shut down, Node1 will try to become the Master Gateway.

The change in configuration required for the DR servers (there will be two servers, as in production) can be set up as a script to automate the switchover to DR. For example, the database connectivity should be configured such that failover to the standby database is transparent to the PowerCenter repository and the domain database. All database connectivity information should be identical in both data centers to make sure that the same source and target databases are used. For scheduling tools, FTP servers, and message queues, additional steps are required to switch to the ETL platform in the backup data center.

As a result of using the PowerCenter HA option, redundancy in the primary data center is achieved. By using SAN mirroring, a standby repository database, and PowerCenter installations at the backup data center, remoteness is achieved. A further scale-out approach is recommended using the PowerCenter grid option to leverage resources on all of the servers. A single cluster file system across all nodes is essential to coordinate read/write access to the storage pool, ensure data integrity, and attain performance.

Backup Data Center After Failover From Primary Data Center


PowerCenter Domain During DR Operation


The Informatica Service Manager on Node 3 and Node 4 is running. The Informatica Service Manager on Node 1 and Node 2 is shut down.

Last updated: 04-Dec-07 18:00


High Availability

Challenge


Increasingly, a number of customers find that their Data Integration implementation must be available 24x7 without interruption or failure. This Best Practice describes the High Availability (HA) capabilities incorporated in PowerCenter and explains why it is critical to address both architectural (e.g., systems, hardware, firmware) and procedural (e.g., application design, code implementation, session/workflow features) recovery with HA.

Description
One of the common requirements of high volume data environments with non-stop operations is to minimize the risk exposure from system failures. PowerCenter's High Availability Option provides failover, recovery, and resilience for business-critical, always-on data integration processes. When considering HA recovery, be sure to explore the following two components of HA that exist on all enterprise systems:

External Resilience
External resilience has to do with the integration and specification of domain name servers, database servers, FTP servers, and network access servers in a defined, tested 24x7 configuration. The nature of Informatica's data integration setup places it at many interface points in system integration. Before placing and configuring PowerCenter within an infrastructure that has an HA expectation, the following questions should be answered:

- Is the pre-existing set of servers already in a sustained HA configuration? Is there a schematic with applicable settings to use for reference? If so, is there a unit test or system test to exercise before installing PowerCenter products? It is important to remember, as a prerequisite for the PowerCenter architecture, that the external systems must be HA.
- What are the bottlenecks or perceived failure points of the existing system? Are these bottlenecks likely to be exposed or heightened by placing PowerCenter in the infrastructure (e.g., five times the amount of Oracle traffic, ten times the amount of DB2 traffic, a UNIX server that always shows 10% idle may now have twice as many processes running)?
- Finally, if a proprietary solution (such as IBM HACMP or Veritas Storage Foundation for Windows) has been implemented with success at a customer site, this sets a different expectation. The customer may merely want the grid capability of multiple PowerCenter nodes to recover Informatica tasks, and expect their O/S level HA capabilities to provide file system or server bootstrap recovery upon a fundamental failure of those back-end systems. If these back-end systems have a script/command capability to, for example, restart a repository service, PowerCenter can be installed in this fashion. However, PowerCenter's HA capability extends as far as the PowerCenter components.


Internal Resilience
In an HA PowerCenter environment, key elements to keep in mind are:

- Rapid and constant connectivity to the repository metadata.
- Rapid and constant network connectivity between all gateway and worker nodes in the PowerCenter domain.
- A common highly-available storage system accessible to all PowerCenter domain nodes with one service name and one file protocol. Only domain nodes on the same operating system can share gateway and log files (see Admin Console->Domain->Properties->Log and Gateway Configuration).

Internal resilience occurs within the PowerCenter environment among PowerCenter services, the PowerCenter Client tools, and other client applications such as pmrep and pmcmd. Internal resilience can be configured at the following levels:
- Domain. Configure service connection resilience at the domain level in the general properties for the domain. The domain resilience timeout determines how long services attempt to connect as clients to application services or the Service Manager. The domain resilience properties are the default values for all services in the domain.
- Service. It is possible to configure service connection resilience in the advanced properties for an application service. When configuring connection resilience for an application service, this overrides the resilience values from the domain settings.
- Gateway. The master gateway node maintains a connection to the domain configuration database. If the domain configuration database becomes unavailable, the master gateway node tries to reconnect. The resilience timeout period depends on user activity and whether the domain has one or multiple gateway nodes:
  - Single gateway node. If the domain has one gateway node, the gateway node tries to reconnect until a user or service tries to perform a domain operation. When a user tries to perform a domain operation, the master gateway node shuts down.
  - Multiple gateway nodes. If the domain has multiple gateway nodes and the master gateway node cannot reconnect, then the master gateway node shuts down. If a user tries to perform a domain operation while the master gateway node is trying to connect, the master gateway node shuts down. If another gateway node is available, the domain elects a new master gateway node. The domain tries to connect to the domain configuration database with each gateway node. If none of the gateway nodes can connect, the domain shuts down and all domain operations fail.

Common Elements of Concern in an HA Configuration

Restart and Failover


Restart and failover has to do with the Domain Services (Integration and Repository). If these services are not highly available, the scheduling, dependencies (e.g., touch files, FTP, etc.), and artifacts of the ETL process cannot be highly available.

If a service process becomes unavailable, the Service Manager can restart the process or fail it over to a backup node based on the availability of the node. When a service process restarts or fails over, the service restores the state of operation and begins recovery from the point of interruption. Backup nodes can be configured for services with the high availability option. If an application service is configured to run on primary and backup nodes, one service process can run at a time. The following situations describe restart and failover for an application service:

- If the primary node running the service process becomes unavailable, the service fails over to a backup node. The primary node may be unavailable if it shuts down or if the connection to the node becomes unavailable.
- If the primary node running the service process is available, the domain tries to restart the process based on the restart options configured in the domain properties. If the process does not restart, the Service Manager can mark the process as failed. The service then fails over to a backup node and starts another process. If the Service Manager marks the process as failed, the administrator must enable the process after addressing any configuration problem.

If a service process fails over to a backup node, it does not fail back to the primary node when the node becomes available. The service process can be disabled on the backup node to cause it to fail back to the primary node.

Recovery
Recovery is the completion of operations after an interrupted service is restored. When a service recovers, it restores the state of operation and continues processing the job from the point of interruption. The state of operation for a service contains information about the service process. The PowerCenter services include the following states of operation:
- Service Manager. The Service Manager for each node in the domain maintains the state of service processes running on that node. If the master gateway shuts down, the newly elected master gateway collects the state information from each node to restore the state of the domain.
- Repository Service. The Repository Service maintains the state of operation in the repository. This includes information about repository locks, requests in progress, and connected clients.
- Integration Service. The Integration Service maintains the state of operation in the shared storage configured for the service. This includes information about scheduled, running, and completed tasks for the service. The Integration Service maintains the session and workflow state of operations based on the recovery strategy configured for the session and workflow.

When designing a system that has HA recovery as a core component, be sure to include architectural and procedural recovery. Architectural recovery for a PowerCenter domain involves the Service Manager, Repository Service, and Integration Service restarting in a complete, sustainable, and traceable manner. If the Service Manager and Repository Service recover but the Integration Service cannot, the restart is not successful and has little value to a production environment. Field experience with PowerCenter has yielded these key items in planning a proper recovery upon a systemic failure:
- A PowerCenter domain cannot be established without at least one gateway node running. Even if a domain consists of ten worker nodes and one gateway node, none of the worker nodes can run ETL jobs without a gateway node managing the domain.
- An Integration Service cannot run without its associated Repository Service being started and connected to its metadata repository.
- A Repository Service cannot run without its metadata repository DBMS being started and accepting database connections. Often database connections are established on periodic windows that expire, which puts the repository offline.
- If the installed domain configuration is running from Authentication Module Configuration and the LDAP Principal User account becomes corrupt or inactive, all PowerCenter repository access is lost. If the installation uses any additional authentication outside PowerCenter (such as LDAP), an additional recovery and restart plan is required.

Procedural recovery is supported with many features of PowerCenter. Consider the following very simple mapping that might run in production for many ETL applications:

Suppose there is a situation where the FTP server sending this ff_customer file is inconsistent. Many times the file is not there, but the processes depending on this must always run. The process is always insert only. You do not want the succession of ETL that follows this small process to fail; it can run to customer_stg with current records only. This setting in the Workflow Manager, Session, Properties would fit your need:


Since it is not critical that the ff_customer records run each time, record the failure but continue the process. Now say the situation has changed. Sessions are failing on a PowerCenter server due to target database timeouts. A requirement is given that the session must recover from this:


Resuming from last checkpoint restarts the process from its prior commit, allowing no loss of ETL work. To finish this second case, consider three basic items on the workflow side when the HA option is implemented:


An Integration Service in an HA environment can only recover those workflows marked with Enable HA recovery. For all critical workflows, this should be considered. For a mature set of ETL code running in QA or Production, consider the following workflow property:


This would automatically recover tasks from where they failed in a workflow upon an application or system-wide failure. Consider carefully the use of this feature, however. Remember, automated restart of critical ETL processes without interaction can have vast unintended side effects. For instance, if a database alias or synonym was dropped, all ETL targets may now refer to different objects than the original intent. Only PowerCenter environments with HA, mature production support practices, and a complete operations manual per Velocity should expect complete recovery with this feature.

In an HA environment, certain components of the Domain can go offline while the Domain stays up to execute ETL jobs. This is a time to use the Suspend On Error feature from the General tab of the Workflow settings. The backup Integration Service would then pick up this workflow and resume processing based on the resume settings of this workflow:


Features
A variety of HA features exist in PowerCenter. Specifically, they include:
- Integration Service HA option
- PowerCenter Enterprise Grid option
- Repository Service HA option

First, proceed from an assumption that nodes have been provided such that a basic HA configuration of PowerCenter can take place. A lab-tested version completed by Informatica is configured as below with an HP solution. Your solution can be completed with any reliable clustered file system. Your first step would always be implementing and thoroughly exercising a


clustered file system:

Now, let's address the options in order:

Integration Service HA Option


You must have the HA option on the license key for this to be available on install. Note that once the base PowerCenter install is configured, all nodes are available from the Admin Console->Domain->Integration Services->Grid/Node Assignments. From the above example, you would see Node 1, Node 2, and Node 3 as dropdown options on that browse page. With the HA (Primary/Backup) install complete, Integration Services are then displayed with both P and B in a configuration, with the current operating node highlighted:


If a failure were to occur on this HA configuration, the Integration Service INT_SVCS_DEV would poll the Domain Domain_Corp_RD for another Gateway Node, then assign INT_SVCS_DEV over to that Node, in this case Node_Corp_RD02. The B button would then highlight, showing this Node as providing INT_SVCS_DEV.

A vital component of configuring the Integration Service for HA is making sure the Integration Service files are stored in a shared persistent environment. The paths for Integration Service files must be specified for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If an Integration Service is configured to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When an Integration Service is enabled, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption.

All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location. By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. The shared location for these directories can be set by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. The key HA concern here is that $PMRootDir should be on the highly-available clustered file system mentioned above.

Integration Service Grid Option


The Grid option provides implicit HA since the Integration Service can be configured as active/active to provide redundancy. The Server Grid option should be included on the license key for this to be available upon install. In configuring the $PMRootDir files for the Integration Service, retain the methodology described above. Also, in Admin Console->Domain->Properties->Log and Gateway Configuration, the log and directory paths should be on the clustered file system mentioned above. A grid must be created before it can be used in a PowerCenter domain. Be sure to remember these key points:
- PowerCenter supports nodes from heterogeneous operating systems, bit modes, and other characteristics within the same domain. However, if there are heterogeneous nodes in a grid, then you can run Workflow on Grid.
- For the Session on Grid option, a homogeneous grid is required. A homogeneous grid is necessary for Session on Grid because a session may have a shared cache file and other objects that may not be compatible with all of the operating systems.

If you have a large volume of disparate hardware, it is certainly possible to make perhaps two grids centered on two different operating systems. In either case, the performance of your clustered file system is going to affect the performance of your server grid, and should be considered as part of your performance/maintenance strategy.

Repository Service HA Option


You must have the HA option on the license key for this to be available on install. There are two ways to include the Repository Service HA capability when configuring PowerCenter:
- The first is during install. When the Install Program prompts for your nodes to do a Repository install (after answering Yes to Create Repository), you can enter a second node where the Install Program can create and invoke the PowerCenter service and Repository Service for a backup repository node. Keep in mind that all of the database, OS, and server preparation steps referred to in the PowerCenter Installation and Configuration Guide still hold true for this backup node. When the install is complete, the Repository Service displays a P/B link similar to that illustrated above for the INT_SVCS_DEV example Integration Service.
- A second method for configuring Repository Service HA allows for measured, incremental implementation of HA from a tested base configuration. After ensuring that your initial Repository Service settings (e.g., resilience timeout, codepage, connection timeout) and the DBMS repository containing the metadata are running and stable, you can add a second node and make it the Repository Backup. Install the PowerCenter Service on this second server following the PowerCenter Installation and Configuration Guide. In particular, skip creating Repository Content or an Integration Service on the node. Following this, go to Admin Console->Domain and select Create->Node. The server to contain this node should be of the exact same configuration/clustered file system/OS as the Primary Repository Service. The following dialog should appear:

Assign a logical name to the node to describe its place, and select Create. The node should now be running as part of your domain, but if it isn't, refer to the PowerCenter Command Line Reference with the infaservice and infacmd commands to ensure the node is running on the domain. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and the browser window displays:


Click OK and the Repository Service is now configured in a Primary/Backup setup for the domain. To ensure the P/B setting, test the following elements of the configuration:

1. Be certain the same version of the DBMS client is installed on the server and can access the metadata.
2. Both nodes must be on the same clustered file system.
3. Log onto the OS for the Backup Repository Service and ping the Domain Master Gateway Node. Be sure a reasonable response time is being given at an OS level (i.e., less than 5 seconds).
4. Take the Primary Repository Service Node offline and validate that the polling, failover, restart process takes place in a methodical, traceable manner for the Repository Service on the Domain. This should be clearly visible from the node logs on the Primary and Secondary Repository Service boxes [$INFA_HOME/server/tomcat/logs] or from Admin Console->Repository->Logs.

Note: Remember that when a node is taken offline, you cannot access the Admin Console from that node.
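
For steps 3 and 4, simple operating system checks along the following lines can be used. The host name is an example, ping options vary slightly by platform, and the exact log file names under the tomcat/logs directory vary by release.

# Step 3: confirm a reasonable response time from the backup node to the master gateway host.
ping -c 5 node1_hostname

# Step 4: watch the node log on the backup Repository Service box while the primary
# node is taken offline, to trace the poll/failover/restart sequence.
tail -f $INFA_HOME/server/tomcat/logs/node.log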

Using a Script to Monitor Informatica Services


A script should be used with the High Availability option to check all the Informatica services in the domain, as well as the domain itself. If any of the services are down, the script can bring them back up. To implement the HA option using a script, the Domain, Repository, and Integration Service details need to be provided as input to the script, and the script needs to be scheduled to run at regular intervals. The script can be developed with eight functions (and one main function to check and bring up the services). A script can be implemented in any environment by providing input in the <Input Environment Variables> section only. Comments have been provided for each function to make them easy to understand. Below is a brief description of the eight functions, followed by a skeleton of such a script:

- print_msg: Called to print output to the I/O and also writes to the log file.
- domain_service_lst: Accepts the list of services to be checked for in the domain.
- check_service: Calls the service manager, repository, and integration functions internally to check if they are up and running.
- check_repo_service: Checks if the repository is up or down. If it is down, it calls another function to bring it up.
- enable_repo_service: Called to enable the Repository Service.
- check_int_service: Checks if the integration is up or down. If it is down, it calls another function to bring it up.
- enable_int_service: Called to enable the Integration Service.
- disable_int_service: Called to disable the Integration Service.
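
A reduced skeleton of such a script is sketched below. It covers only a few of the functions listed above; the domain, service, and credential values are placeholders, the usual command line environment (PATH, INFA_DOMAINS_FILE, and so on) is assumed to be in place for pmcmd, pmrep, and infacmd, and the command options shown should be verified against the PowerCenter Command Line Reference for the release in use.

#!/bin/sh
# <Input Environment Variables> - placeholder values; edit per environment.
INFA_HOME=/u01/app/informatica/pc8.0.0
DOMAIN=Domain_Name                 # assumption: domain name
REPO_SVC=REP_SVC                   # assumption: Repository Service name
INT_SVC=INT_SVC                    # assumption: Integration Service name
DOMAIN_USER=Administrator          # assumption: domain/repository user
DOMAIN_PWD=password
LOGFILE=/var/tmp/infa_ha_monitor.log

print_msg () {
    # Write a timestamped message to standard output and to the log file.
    echo "`date '+%Y-%m-%d %H:%M:%S'` $1" | tee -a $LOGFILE
}

check_repo_service () {
    # A successful pmrep connect indicates the Repository Service is up.
    $INFA_HOME/server/bin/pmrep connect -r $REPO_SVC -d $DOMAIN \
        -n $DOMAIN_USER -x $DOMAIN_PWD > /dev/null 2>&1
}

enable_repo_service () {
    print_msg "Repository Service $REPO_SVC is down - enabling it"
    $INFA_HOME/server/bin/infacmd.sh EnableService -dn $DOMAIN -un $DOMAIN_USER \
        -pd $DOMAIN_PWD -sn $REPO_SVC >> $LOGFILE 2>&1
}

check_int_service () {
    # pingservice returns non-zero when the Integration Service does not respond.
    $INFA_HOME/server/bin/pmcmd pingservice -sv $INT_SVC -d $DOMAIN > /dev/null 2>&1
}

enable_int_service () {
    print_msg "Integration Service $INT_SVC is down - enabling it"
    $INFA_HOME/server/bin/infacmd.sh EnableService -dn $DOMAIN -un $DOMAIN_USER \
        -pd $DOMAIN_PWD -sn $INT_SVC >> $LOGFILE 2>&1
}

# Main: check each service and bring it back up if it is down.
check_repo_service || enable_repo_service
check_int_service  || enable_int_service
print_msg "Service check completed"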

Last updated: 25-May-08 19:00


Load Validation

Challenge


Knowing that all data for the current load cycle has loaded correctly is essential for effective data warehouse management. However, the need for load validation varies depending on the extent of error checking, data validation, and data cleansing functionalities inherent in your mappings. For large data integration projects with thousands of mappings, the task of reporting load statuses becomes overwhelming without a well-planned load validation process.

Description
Methods for validating the load process range from simple to complex. Use the following steps to plan a load validation process:

1. Determine what information you need for load validation (e.g., workflow names, session names, session start times, session completion times, successful rows, and failed rows).
2. Determine the source of the information. All of this information is stored as metadata in the PowerCenter repository, but you must have a means of extracting it.
3. Determine how you want the information presented to you. Should the information be delivered in a report? Do you want it emailed to you? Do you want it available in a relational table so that history is easily preserved? Do you want it stored as a flat file?

Weigh all of these factors to find the correct solution for your project. Below are descriptions of five possible load validation solutions, ranging from fairly simple to increasingly complex:

1. Post-session Emails on Success or Failure


One practical application of the post-session email functionality is the situation in which a key business user waits for completion of a session to run a report. Email is configured to notify the user when the session was successful so that the report can be run. Another practical application is the situation in which a production support analyst needs to be notified immediately of any failures. Configure the session to send an email to the analyst upon failure. For round-the-clock support, a pager number that has the ability to receive email can be used in place of an email address. Post-session email is configured in the session, under the General tab and Session Commands. A number of variables are available to simplify the text of the email:
- %s  Session name
- %e  Session status
- %b  Session start time
- %c  Session completion time
- %i  Session elapsed time
- %l  Total records loaded
- %r  Total records rejected
- %t  Target table details
- %m  Name of the mapping used in the session
- %n  Name of the folder containing the session
- %d  Name of the repository containing the session
- %g  Attach the session log to the message
- %a <file path>  Attach a file to the message
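
For example, the body of a notification email might be configured along the following lines (a sketch only; the variables are substituted by the Integration Service when the message is sent):

Session %s finished with status %e.
Start: %b  End: %c  Elapsed: %i
Rows loaded: %l  Rows rejected: %r
%g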

2. Other Workflow Manager Features


In addition to post-session email messages, there are other features available in the Workflow Manager to help validate loads. Control, Decision, Event, and Timer tasks are some of the features that can be used to place multiple controls on the behavior of loads. Another solution is to place conditions within links. Links are used to connect tasks within a workflow or worklet. Use the pre-defined or user-defined variables in the link conditions. In the example below, upon the successful completion of both sessions A and B, the PowerCenter Server executes session C.

3. PowerCenter Reports (PCR)


The PowerCenter Reports (PCR) is a web-based business intelligence (BI) tool that is included with every PowerCenter license to provide visibility into metadata stored in the PowerCenter repository in a manner that is easy to comprehend and distribute. The PCR includes more than 130 pre-packaged metadata reports and dashboards delivered through Data Analyzer, Informatica's BI offering. These pre-packaged reports enable PowerCenter customers to extract extensive business and technical metadata through easy-to-read reports including:


- Load statistics and operational metadata that enable load validation.
- Table dependencies and impact analysis that enable change management.
- PowerCenter object statistics to aid in development assistance.
- Historical load statistics that enable planning for growth.

In addition to the 130 pre-packaged reports and dashboards that come standard with PCR, you can develop additional custom reports and dashboards that are based upon the PCR limited-use license that allows you to source reports from the PowerCenter repository. Examples of custom components that can be created include:
- Repository-wide reports and/or dashboards with indicators of daily load success/failure.
- Customized project-based dashboard with visual indicators of daily load success/failure.
- Detailed daily load statistics report for each project that can be exported to Microsoft Excel or PDF.
- Error handling reports that deliver error messages and source data for row-level errors that may have occurred during a load.

Below is an example of a custom dashboard that gives instant insight into the load validation across an entire repository through four custom indicators.


4. Query Informatica Metadata Exchange (MX) Views


Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the PowerCenter repository. The Repository Manager generates these views when you create or upgrade a repository. Almost any query can be put together to retrieve metadata related to the load execution from the repository. The MX view REP_SESS_LOG is a great place to start. This view is likely to contain all the information you need. The following sample query shows how to extract folder name, session name, session end time, successful rows, and session duration:

select subject_area,
       session_name,
       session_timestamp,
       successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
from   rep_sess_log a
where  session_timestamp = (select max(session_timestamp)
                            from   rep_sess_log
                            where  session_name = a.session_name)
order by subject_area, session_name

The sample output would look like this:


TIP Informatica strongly advises against querying directly from the repository tables. Because future versions of PowerCenter are likely to alter the underlying repository tables, PowerCenter supports queries from the unaltered MX views, not the repository tables.

5. Mapping Approach
A more complex approach, and the most customizable, is to create a PowerCenter mapping to populate a table or a flat file with desired information. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information. The following graphic illustrates a sample mapping:

This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute


minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations. Note: Unless you have acquired additional licensing, a customized metadata data mart cannot be a source for a PCR report. However, you can use a business intelligence tool of your choice instead.

Last updated: 06-Dec-07 15:10


Repository Administration

Challenge


Defining the role of the PowerCenter Administrator to understand the tasks required to properly manage the domain and repository.

Description
The PowerCenter Administrator has many responsibilities. In addition to regularly backing up the domain and repository, truncating logs, and updating the database statistics, he or she also typically performs the following functions:
- Determines metadata strategy
- Installs/configures client/server software
- Migrates development to test and production
- Maintains PowerCenter servers
- Upgrades software
- Administers security and folder organization
- Monitors and tunes the environment

Note: The Administrator is also typically responsible for maintaining domain and repository passwords; changing them on a regular basis and keeping a record of them in a secure place.

Determine Metadata Strategy


The PowerCenter Administrator is responsible for developing the structure and standard for metadata in the PowerCenter Repository. This includes developing naming conventions for all objects in the repository, creating a folder organization, and maintaining the repository. The Administrator is also responsible for modifying the metadata strategies to suit changing business needs or to fit the needs of a particular project. Such changes may include new folder names and/or a different security setup.

Install/Configure Client/Server Software


This responsibility includes installing and configuring the application servers in all applicable environments (e.g., development, QA, production, etc.). The Administrator must have a thorough understanding of the working environment, along with access to resources such as a Windows 2000/2003 or UNIX Admin and a DBA. The Administrator is also responsible for installing and configuring the client tools. Although end users can generally install the client software, the configuration of the client tool connections benefits from being consistent throughout the repository environment. The Administrator, therefore, needs to enforce this consistency in order to maintain an organized repository.

Migrate Development to Production


When the time comes for content in the development environment to be moved to the test and production environments, it is the responsibility of the Administrator to schedule, track, and copy folder changes. Also, it is crucial to keep track of the changes that have taken place. It is the role of the Administrator to track these changes through a change control process. The Administrator should be the only individual able to physically move folders from one environment to another. If a versioned repository is used, the Administrator should set up labels and instruct the developers on the labels that they must apply to their repository objects (i.e., reusable transformations, mappings, workflows and sessions). This task also requires close communication with project staff to review the status of items of work to ensure, for example, that only tested or approved work is migrated.

Maintain PowerCenter Servers


The Administrator must also be able to understand and troubleshoot the server environment. He or she should have a good understanding of PowerCenter's service-oriented architecture and how the domain and application services interact with each other. The Administrator should also understand what the Integration Service does when a session is running and be able to identify those processes. Additionally, certain mappings may produce files in addition to the standard session and workflow logs. The Administrator should be familiar with these files and know how and where to maintain them.

Upgrade Software
If and when the time comes to upgrade software, the Administrator is responsible for overseeing the installation and upgrade process.


Security and Folder Administration


Security administration covers both the PowerCenter domain and the repository. For the domain, it involves creating, maintaining, and updating all domain users and their associated rights and privileges to services and alerts. For the repository, it involves creating, maintaining, and updating all users within the repository, including creating and assigning groups based on new and changing projects and defining which folders are to be shared, and at what level. Folder administration involves creating and maintaining the security of all folders. The Administrator should be the only user with privileges to edit folder properties.

Monitor and Tune Environment


Proactively monitoring the domain and user activity helps ensure a healthy, functioning PowerCenter environment. The Administrator should review user activity for the domain to verify that the appropriate rights and privileges have been applied, and review domain activity to verify correct CPU and license usage. The Administrator should have sole responsibility for implementing performance changes to the server environment. He or she should observe server performance throughout development so as to identify any bottlenecks in the system. In the production environment, the Repository Administrator should monitor the jobs and any growth (e.g., increases in data or throughput time) and communicate such changes to the appropriate staff to address bottlenecks, accommodate growth, and ensure that the required data is loaded within the prescribed load window.

Last updated: 06-Dec-07 15:10


Third Party Scheduler

Challenge


Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes the various levels at which a third-party scheduler can be integrated with PowerCenter.

Description
Tasks such as getting server and session properties, getting session status, or starting or stopping a workflow or a task can be performed either through the Workflow Monitor or by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be integrated with PowerCenter at any of several levels. The level of integration depends on the complexity of the workflow/schedule and the skill sets of production support personnel. Many companies want to automate the scheduling process by using scripts or third-party schedulers. In some cases, they are using a standard scheduler and want to continue using it to drive the scheduling process. A third-party scheduler can start or stop a workflow or task, obtain session statistics, and get server details using pmcmd commands; pmcmd is a command-line program used to communicate with the PowerCenter server.
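As an illustration of the calls that a scheduler typically wraps, the sketch below starts a workflow, waits for it to complete, and then queries its run details. The service, domain, folder, and workflow names are placeholders, and the option syntax shown (-sv/-d) applies to domain-based PowerCenter releases; older releases use the -s host:port form shown later in this Best Practice:

# Start the workflow and wait for completion; the exit code reflects success or failure
pmcmd startworkflow -sv INT_SVC_DEV -d DEV_DOMAIN -u Administrator -p AdminPassword -f SALES_DW -wait wf_load_sales
echo "startworkflow returned $?"

# Retrieve run status and statistics for the same workflow
pmcmd getworkflowdetails -sv INT_SVC_DEV -d DEV_DOMAIN -u Administrator -p AdminPassword -f SALES_DW wf_load_sales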

Third Party Scheduler Integration Levels


In general, there are three levels of integration between a third-party scheduler and PowerCenter: Low, Medium, and High.

Low Level
Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter workflow. This process subsequently kicks off the rest of the tasks or sessions. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial workflow. In this level of integration, nearly all control lies with the PowerCenter scheduler. This type of integration is very simple to implement because the third-party scheduler kicks off only one process, and it is often used simply to satisfy a corporate mandate for a standard scheduler. This type of integration also takes advantage of the robust functionality offered by the Workflow Monitor. Low-level integration requires production support personnel to have a thorough knowledge of PowerCenter. Because Production Support personnel in many companies are only knowledgeable about the company's standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the Project Development team.

Medium Level
With Medium-level integration, a third-party scheduler kicks off some, but not all, workflows or tasks. Within the tasks, many sessions may be defined with dependencies. PowerCenter controls the dependencies within the tasks. With this level of integration, control is shared between PowerCenter and the third-party scheduler, which requires more integration between the third-party scheduler and PowerCenter. Medium-level integration requires Production Support personnel to have a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not have in-depth knowledge about the tool, they may be unable to fix problems that arise, so the production support burden is shared between the Project Development team and the Production Support team.

High Level
With High-level integration, the third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible for controlling all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter. Production Support personnel may have limited knowledge of PowerCenter but must have thorough knowledge of the scheduling tool. Because Production Support personnel in many companies are knowledgeable only about the company's standard scheduler, one of the main advantages of this level of integration is that if the batch fails at some point, the Production Support personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.


Sample Scheduler Script


There are many independent scheduling tools on the market. The following is an example of an AutoSys script that can be used to start tasks; it is included here simply as an illustration of how a scheduler can be implemented in the PowerCenter environment. This script can also capture the return codes and abort on error, returning a success or failure (with associated return codes) to the command line or the AutoSys GUI monitor.

# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When Who Why
#
#------------------------------------------------------------------
. jobstart $0 $*

# set variables
ERR_DIR=/tmp

# Temporary file will be created to store all the Error Information
# The file format is TDDHHMISS<PROCESS-ID>.lst
CurDayTime=`date +%d%H%M%S`
FName=T$CurDayTime$$.lst

if [ $STEP -le 1 ]
then
    echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."
    cd /dbvol03/vendor/informatica/pmserver/

    #pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
    #pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
    # The above lines need to be edited to include the name of the workflow or the task that you are attempting to start.

    # Checking whether to abort the Current Process or not
    RetVal=$?
    echo "Status = $RetVal"
    if [ $RetVal -ge 1 ]
    then
        jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
        exit 1
    fi
    echo "Step 1: Successful"
fi

jobend normal
exit 0

Last updated: 06-Dec-07 15:10


Updating Repository Statistics

Challenge


The PowerCenter repository has more than 170 tables, and most have one or more indexes to speed up queries. Most databases use column distribution statistics to determine which index to use to optimize performance. It can be important, especially in large or high-use repositories, to update these statistics regularly to avoid performance degradation.

Description
For PowerCenter, statistics are updated during copy, backup, or restore operations. In addition, the pmrep command has an option to update statistics that can be scheduled as part of a regularly-run script. For PowerCenter 6 and earlier, there are specific strategies for Oracle, Sybase, SQL Server, DB2, and Informix, discussed below. Each example shows how to extract the information out of the PowerCenter repository and incorporate it into a custom script.

Features in PowerCenter version 7 and later

Copy, Backup and Restore Repositories
PowerCenter automatically identifies and updates the statistics of all repository tables and indexes when a repository is copied, backed up, or restored. If you follow a strategy of regular repository backups, the statistics are updated as well.

PMREP Command
PowerCenter also has a command line option to update statistics in the database, which allows the command to be run from a Windows batch file or UNIX shell script. The format of the command is:

pmrep updatestatistics {-s filelistfile}

The -s option allows you to skip tables for which you do not want to update statistics.

Example of Automating the Process


One approach to automating this is to use a UNIX shell script that invokes the pmrep updatestatistics command and to call that script from a command task in a dedicated PowerCenter workflow that runs on a schedule. Note: Workflow Manager supports command tasks as well as scheduling, so the script can be executed and scheduled entirely from within PowerCenter.
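A minimal sketch of such a wrapper script is shown below; the repository, domain, and user names are placeholders, and the optional -s skip-file argument described above can be appended if certain tables should be excluded:

#!/bin/sh
# update_repo_stats.sh - called from a PowerCenter Command task or from cron
pmrep connect -r Dev_Repository -d Dev_Domain -n Administrator -x AdminPassword || exit 1
pmrep updatestatistics
rc=$?
pmrep cleanup
exit $rc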


In addition, this workflow can be scheduled to run on a daily, weekly, or monthly basis, which keeps the statistics up to date so that performance does not degrade.

Tuning Strategies for PowerCenter version 6 and earlier


The following are strategies for generating scripts to update distribution statistics. Note that all PowerCenter repository tables and index names begin with "OPB_" or "REP_".

Oracle
Run the following queries:

select 'analyze table ', table_name, ' compute statistics;' from user_tables
where table_name like 'OPB_%'

select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes
where INDEX_NAME like 'OPB_%'

This will produce output like:


'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;

Save the output to a file. Then, edit the file and remove all the headers (i.e., the lines that look like: 'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'). Run this as a SQL script. This updates statistics for the repository tables.

MS SQL Server
Run the following query:

select 'update statistics ', name from sysobjects
where name like 'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT

Save the output to a file, then edit the file and remove the header information (i.e., the top two lines) and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.


Sybase
Run the following query:

select 'update statistics ', name from sysobjects
where name like 'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT

Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.

Informix
Run the following query:

select 'update statistics low for table ', tabname, ' ;' from systables
where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like:

(constant) tabname (constant)
update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;

Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks like: (constant) tabname (constant)). Run this as a SQL script. This updates statistics for the repository tables.


DB2
Run the following query:

select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;' from sysstat.tables
where tabname like 'OPB_%'

This will produce output like:

runstats on table PARTH.OPB_ANALYZE_DEP and indexes all;
runstats on table PARTH.OPB_ATTR and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT and indexes all;

Save the output to a file. Run this as a SQL script to update statistics for the repository tables.

Last updated: 06-Dec-07 15:10


Determining Bottlenecks

Challenge


Because there are many variables involved in identifying and rectifying performance bottlenecks, an efficient method for determining where bottlenecks exist is crucial to good data warehouse management.

Description
The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist, using a process of elimination and investigating each area in the order indicated:

1. Target
2. Source
3. Mapping
4. Session
5. System

Best Practice Considerations

Use Thread Statistics to Identify Target, Source, and Mapping Bottlenecks
Use thread statistics to identify source, target or mapping (transformation) bottlenecks. By default, an Integration Service uses one reader, one transformation, and one target thread to process a session. Within each session log, the following thread statistics are available:
- Run time - the amount of time the thread was running.
- Idle time - the amount of time the thread was idle due to other threads within the application or Integration Service. This value does not include time the thread is blocked due to the operating system.
- Busy - the percentage of the overall run time that the thread is not idle. This percentage is calculated using the following formula:


(run time - idle time) / run time x 100

For example, a thread that ran for 600 seconds and was idle for 540 of them was busy (600 - 540) / 600 x 100 = 10 percent of the time.

By analyzing the thread statistics found in an Integration Service session log, it is possible to determine which thread is being used the most. If a transformation thread is 100 percent busy and there are additional resources (e.g., CPU cycles and memory) available on the Integration Service server, add a partition point in the segment. If the reader or writer thread is 100 percent busy, consider using string data types in the source or target ports, since non-string ports require more processing.

Use the Swap Method to Test Changes in Isolation


Attempt to isolate performance problems by running test sessions. You should be able to compare the session's original performance with the tuned session's performance. The swap method is very useful for determining the most common bottlenecks. It involves the following five steps:

1. Make a temporary copy of the mapping, session and/or workflow that is to be tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance improvements to gauge which tuning methods work most effectively in the environment.
3. Document the change made to the mapping, session and/or workflow and the performance metrics achieved as a result of the change. The actual execution time may be used as a performance metric.
4. Delete the temporary mapping, session and/or workflow upon completion of performance tuning.
5. Make appropriate tuning changes to mappings, sessions and/or workflows.

Evaluating the Five Areas of Consideration

Target Bottlenecks

Relational Targets


The most common performance bottleneck occurs when the Integration Service writes to a target database. This type of bottleneck can easily be identified with the following procedure:

1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.

If session performance increases significantly when writing to a flat file, you have a write bottleneck. Consider performing the following tasks to improve performance:

- Drop indexes and key constraints
- Increase checkpoint intervals
- Use bulk loading
- Use external loading
- Minimize deadlocks
- Increase database network packet size
- Optimize target databases

Flat file targets


If the session targets a flat file, you probably do not have a write bottleneck. If the session is writing to a SAN or a non-local file system, performance may be slower than writing to a local file system. If possible, a session can be optimized by writing to a flat file target local to the Integration Service. If the local flat file is very large, you can optimize the write process by dividing it among several physical drives. If the SAN or non-local file system is significantly slower than the local file system, work with the appropriate network/storage group to determine if there are configuration issues within the SAN.

Source Bottlenecks

Relational sources


If the session reads from a relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks.

Using a Filter Transformation


Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.

Using a Read Test Session

You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing any transformation logic from the mapping. Use the following steps to create a read test mapping:

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.

Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.

Using a Database Query

You can also identify source bottlenecks by executing a read query directly against the source database. To do so, perform the following steps (a command-line sketch follows the list):

- Copy the read query directly from the session log.
- Run the query against the source database with a query tool such as SQL*Plus. Measure the query execution time and the time it takes for the query to return the first row.
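One way to capture the total execution time from a UNIX prompt is sketched below; the connect string is a placeholder and source_query.sql is assumed to contain the query copied from the session log. The time to return the first row can then be observed by running the same query interactively in SQL*Plus and noting when output begins.

# Time the complete result set; discard the rows themselves
time sqlplus -s user_id/password@source_db @source_query.sql > /dev/null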

If there is a long delay between the two time measurements, you have a source bottleneck. If your session reads from a relational source and is constrained by a source bottleneck, review the following suggestions for improving performance:
- Optimize the query.
- Create tempdb as an in-memory database.
- Use conditional filters.
- Increase database network packet size.
- Connect to Oracle databases using the IPC protocol.

Flat file sources


If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the line sequential buffer length to a size large enough to hold approximately four to eight rows of data at a time (for flat files) may improve performance when reading flat file sources. Also, ensure the flat file source is local to the Integration Service.

Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine if the bottleneck is in the mapping.

Begin by adding a Filter transformation in the mapping immediately before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck.

You can also use the performance details to identify mapping bottlenecks: high Rowsinlookupcache and high Errorrows counters indicate mapping bottlenecks. Follow these steps to identify mapping bottlenecks:

Create a test mapping without transformations

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

Check for High Rowsinlookupcache counters

Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.


Check for High Errorrows counters

Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors. For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance.

Session Bottlenecks
Session performance details can be used to flag other problem areas. Create performance details by selecting Collect Performance Data in the session properties before running the session. View the performance details through the Workflow Monitor as the session runs, or view the resulting file. The performance details provide counters about each source qualifier, target definition, and individual transformation within the mapping to help you understand session and mapping efficiency. To view the performance details during the session run:
- Right-click the session in the Workflow Monitor.
- Choose Properties.
- Click the Properties tab in the details dialog box.

To view the resulting performance data file, look for the file session_name.perf in the same directory as the session log and open the file in any text editor. All transformations have basic counters that indicate the number of input rows, output rows, and error rows. Source qualifiers, normalizers, and targets have additional counters indicating the efficiency of data moving into and out of buffers. Some transformations have counters specific to their functionality. When reading performance details, the first column displays the transformation name as it appears in the mapping, the second column contains the counter name, and the third column holds the resulting number or efficiency percentage.

Low buffer input and buffer output counters


If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.

Aggregator, Rank, and Joiner readfromdisk and writetodisk counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes.

If the session performs incremental aggregation, the Aggregator_readfromdisk and writetodisk counters display a number other than zero because the Integration Service reads historical aggregate data from the local disk during the session and writes to disk when saving historical data. Evaluate the incremental Aggregator_readfromdisk and writetodisk counters during the session. If the counters show any numbers other than zero during the session run, you can increase performance by tuning the index and data cache sizes.

Note: PowerCenter versions 6.x and above include the ability to assign memory allocation per object. In versions earlier than 6.x, aggregator, rank, and joiner caches were assigned at a global/session level.

For further details on eliminating session bottlenecks, refer to the Best Practices: Tuning Sessions for Better Performance and Tuning SQL Overrides and Environment for Better Performance.

System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the Integration Service. The Integration Service uses system resources to process transformations, session execution, and the reading and writing of data. The Integration Service also uses system memory for other data tasks such as creating aggregator, joiner, rank, and lookup table caches. You can use system performance monitoring tools to monitor the amount of system resources the Server uses and identify system bottlenecks.
- Windows NT/2000. Use system tools such as the Performance and Processes tabs in the Task Manager to view CPU usage and total memory usage. You can also view more detailed performance information by using the Performance Monitor in the Administrative Tools on Windows.
- UNIX. Use the following system tools to monitor system performance and identify system bottlenecks (a sketch of capturing these measurements during a load follows this list):
  - lsattr -E -l sys0 - to view current system settings
  - iostat - to monitor loading operations for every disk attached to the database server
  - vmstat or sar -w - to monitor disk swapping actions
  - sar -u - to monitor CPU loading
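To capture a profile for the duration of a load, these tools can be started in the background just before the workflow begins; in this sketch the 5-second interval and 720 samples (roughly one hour) are arbitrary values:

sar -u 5 720 > /tmp/cpu_during_load.log &
vmstat 5 720 > /tmp/vmstat_during_load.log &
iostat 5 720 > /tmp/iostat_during_load.log &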

For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX Systems and Performance Tuning Windows 2000/2003 Systems.

Last updated: 01-Feb-07 18:54


Performance Tuning Databases (Oracle)

Challenge


Database tuning can result in a tremendous improvement in loading performance. This Best Practice covers tips on tuning Oracle.

Description
Performance Tuning Tools
Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we've included only a short description of some of the major ones here.

V$ Views
V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS is the owner of these views, only SYS can query them. Keep in mind that querying these views impacts database performance, with each query having an immediate cost. With this in mind, carefully consider which users should be granted the privilege to query these views. You can grant viewing privileges with either the SELECT privilege, which can be granted for individual V$ views, or the SELECT ANY TABLE privilege, which allows the user to view all V$ views. Using the SELECT ANY TABLE option requires the O7_DICTIONARY_ACCESSIBILITY parameter to be set to TRUE, which allows the ANY keyword to apply to SYS-owned objects.

Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them.

Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated and copied to SQL*Plus or another SQL tool and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine if the SQL for these transformations should be tested.

SQL Trace
SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. This utility is run for a session with the ALTER SESSION SET SQL_TRACE = TRUE statement.

TKPROF
The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.
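A typical formatting step is sketched below. The trace file name and location vary by instance (the file is written to the directory identified by user_dump_dest), and the explain credentials and sort options shown are only one possible choice:

# Format the raw trace file produced while SQL_TRACE was enabled for the session
tkprof orcl_ora_12345.trc trace_report.txt explain=user_id/password sort=exeela,fchela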

UTLBSTAT & UTLESTAT


Executing UTLBSTAT creates tables to store dynamic performance statistics and begins the statistics collection process. Run this utility after the database has been up and running (for hours or days). Accumulating statistics may take time, so you need to run this utility for a long while and through several operations (i.e., both loading and querying). UTLESTAT ends the statistics collection process and generates an output file called report.txt. This report should give the DBA a fairly complete idea about the level of usage the database experiences and reveal areas that should be addressed.

Disk I/O
Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Co-locate tables that are heavily used with tables that are rarely used to help minimize disk contention. Separate indexes from their tables so that queries that read both the index and the table are not contending for the same resource. Also be sure to implement disk striping; this, or RAID technology, can help immensely in reducing disk contention. While this type of planning is time-consuming, the payoff is well worth the effort in terms of performance gains.

Dynamic Sampling
Dynamic sampling enables the server to improve performance by:
- Estimating single-table predicate statistics where available statistics are missing or may lead to bad estimations.
- Estimating statistics for tables and indexes with missing statistics.
- Estimating statistics for tables and indexes with out-of-date statistics.

Dynamic sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter, which accepts values from "0" (off) to "10" (aggressive sampling) with a default value of "2". At compile time, Oracle determines if dynamic sampling can improve query performance. If so, it issues recursive statements to estimate the necessary statistics. Dynamic sampling can be beneficial when (a sketch of raising the sampling level for a test follows this list):

- The sample time is small compared to the overall query execution time.
- Dynamic sampling results in a better performing query.
- The query can be executed multiple times.
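As a simple illustration, the sampling level can be raised for a test session and for an individual statement; the level of 4, the connect string, and the filter column below are arbitrary placeholders:

sqlplus -s user_id/password@infadb <<EOF
ALTER SESSION SET optimizer_dynamic_sampling = 4;
SELECT /*+ dynamic_sampling(f 4) */ COUNT(*)
FROM order_fact f
WHERE order_date >= SYSDATE - 7;
EOF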

Automatic SQL Tuning in Oracle Database 10g


In its normal mode, the query optimizer needs to make decisions about execution plans in a very short time. As a result, it may not always be able to obtain enough information to make the best decision. Oracle 10g allows the optimizer to run in tuning mode, where it can gather additional information and make recommendations about how specific statements can be tuned further. This process may take several minutes for a single statement so it is intended to be used on high-load, resource-intensive statements. In tuning mode, the optimizer performs the following analysis:
- Statistics Analysis. The optimizer recommends the gathering of statistics on objects with missing or stale statistics. Additional statistics for these objects are stored in an SQL profile.
- SQL Profiling. The optimizer may be able to improve performance by gathering additional statistics and altering session-specific parameters such as the OPTIMIZER_MODE. If such improvements are possible, the information is stored in an SQL profile. If accepted, this information can then be used by the optimizer when running in normal mode. Unlike a stored outline, which fixes the execution plan, an SQL profile may still be of benefit when the contents of the table alter drastically. Even so, it is sensible to update profiles periodically. SQL profiling is not performed when the tuning optimizer is run in limited mode.
- Access Path Analysis. The optimizer investigates the effect of new or modified indexes on the access path. Because its index recommendations relate to a specific statement, where practical it also suggests the use of the SQL Access Advisor to check the impact of these indexes on a representative SQL workload.
- SQL Structure Analysis. The optimizer suggests alternatives for SQL statements that contain structures that may affect performance. Be aware that implementing these suggestions requires human intervention to check their validity.

TIP The automatic SQL tuning features are accessible from Enterprise Manager on the "Advisor Central" page.

Useful Views
Useful views related to automatic SQL tuning include:
- DBA_ADVISOR_TASKS
- DBA_ADVISOR_FINDINGS
- DBA_ADVISOR_RECOMMENDATIONS
- DBA_ADVISOR_RATIONALE
- DBA_SQLTUNE_STATISTICS
- DBA_SQLTUNE_BINDS
- DBA_SQLTUNE_PLANS
- DBA_SQLSET
- DBA_SQLSET_BINDS
- DBA_SQLSET_STATEMENTS
- DBA_SQLSET_REFERENCES
- DBA_SQL_PROFILES
- V$SQL
- V$SQLAREA
- V$ACTIVE_SESSION_HISTORY

Memory and Processing


Memory and processing configuration is performed in the init.ora file. Because each database is different and requires an experienced DBA to analyze and tune it for optimal performance, a standard set of parameters to optimize PowerCenter is not practical and is not likely to ever exist.


TIP Changes made in the init.ora file take effect after a restart of the instance. Use svrmgr to issue the shutdown and startup commands (or shutdown immediate) to the instance. Note that svrmgr is no longer available as of Oracle 9i, as Oracle has moved to the web-based Enterprise Manager in Oracle 10g. If you are using Oracle 9i, install the Oracle client tools and log onto Oracle Enterprise Manager. Some other tools, such as DBArtisan, also expose the initialization parameters.
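Where svrmgr is not available, the same restart can be issued from SQL*Plus by a user with SYSDBA privileges; a minimal sketch, assuming operating-system authentication as the Oracle software owner:

sqlplus "/ as sysdba" <<EOF
shutdown immediate
startup
EOF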
The settings presented here are those used on a four-CPU AIX server running Oracle 7.3.4, set to make use of the parallel query option to facilitate parallel processing of queries and indexes. We've also included the descriptions and documentation from Oracle for each setting to help DBAs of other (i.e., non-Oracle) systems determine what the commands do in the Oracle environment, to facilitate setting their native database commands and settings in a similar fashion.

HASH_AREA_SIZE = 16777216
- Default value: 2 times the value of SORT_AREA_SIZE
- Range of values: any integer
- This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt.)

HASH_JOIN_ENABLED

- In Oracle 7 and Oracle 8, the hash_join_enabled parameter must be set to true. In Oracle 8i and above, hash_join_enabled=true is the default value.

HASH_MULTIBLOCK_IO_COUNT
- Allows multiblock reads against the TEMP tablespace.
- It is advisable to set the NEXT extent size to greater than the value of hash_multiblock_io_count to reduce disk I/O. This is the same behavior seen when setting the db_file_multiblock_read_count parameter for data tablespaces, except that this one applies only to multiblock access of segments of the TEMP tablespace.

STAR_TRANSFORMATION_ENABLED

- Determines whether a cost-based query transformation will be applied to star queries.
- When set to TRUE, the optimizer will consider performing a cost-based query transformation on the n-way join table.

OPTIMIZER_INDEX_COST_ADJ
- Numeric parameter set between 0 and 1000 (default 1000).
- This parameter lets you tune the optimizer behavior for access path selection to be more or less index-friendly.

Optimizer_percent_parallel=33
This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full-table scan operation. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans.

Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL.

parallel_max_servers=40

- Used to enable parallel query.
- Initially not set on install.
- Maximum number of query servers or parallel recovery processes for an instance.

Parallel_min_servers=8
- Used to enable parallel query.
- Initially not set on install.
- Minimum number of query server processes for an instance. This is also the number of query server processes Oracle creates when the instance is started.

SORT_AREA_SIZE=8388608
- Default value: operating system-dependent
- Minimum value: the value equivalent to two database blocks
- This parameter specifies the maximum amount, in bytes, of program global area (PGA) memory to use for a sort. After the sort is complete, and all that remains is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system. Increasing the SORT_AREA_SIZE improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time. The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.

Automatic Shared Memory Management in Oracle 10g


Automatic Shared Memory Management puts Oracle in control of allocating memory within the SGA. The SGA_TARGET parameter sets the amount of memory available to the SGA. This parameter can be altered dynamically up to a maximum of the SGA_MAX_SIZE parameter value. Provided the STATISTICS_LEVEL is set to TYPICAL or ALL, and the SGA_TARGET is set to a value other than "0", Oracle will control the memory pools that would otherwise be controlled by the following parameters:
- DB_CACHE_SIZE (default block size)
- SHARED_POOL_SIZE
- LARGE_POOL_SIZE
- JAVA_POOL_SIZE

If these parameters are set to a non-zero value, they represent the minimum size for the pool. These minimum values may be necessary if you experience application errors when certain pool sizes drop below a specific threshold. The following parameters must be set manually and take memory from the quota allocated by the SGA_TARGET parameter:


- DB_KEEP_CACHE_SIZE
- DB_RECYCLE_CACHE_SIZE
- DB_nK_CACHE_SIZE (non-default block size)
- STREAMS_POOL_SIZE
- LOG_BUFFER

IPC as an Alternative to TCP/IP on UNIX


On an HP/UX server with Oracle as a target (i.e., PMServer and the Oracle target on the same box), using an IPC connection can significantly reduce the time it takes to build a lookup cache. In one case, a fact mapping that was using a lookup to get five columns (including a foreign key) and about 500,000 rows from a table was taking 19 minutes. Changing the connection type to IPC reduced this to 45 seconds. In another mapping, the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, a 500,000 row write (array inserts), and a primary key with unique index in place. Performance went from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).

A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:

DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = armafix)
        (PORT = 1526)
      )
    )
    (CONNECT_DATA = (SID = DW))
  )

Make a new entry in the tnsnames like this, and use it for connection to the local Oracle instance:

DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS =
      (PROTOCOL = ipc)
      (KEY = DW)
    )
    (CONNECT_DATA = (SID = DW))
  )

Improving Data Load Performance

Alternative to Dropping and Reloading Indexes


Experts often recommend dropping and reloading indexes during very large loads to a data warehouse, but there is no easy way to do this. For example, writing a SQL statement to drop each index, then writing another SQL statement to rebuild it, can be a very tedious process.

Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by allowing you to disable and re-enable existing indexes. Oracle stores the name of each index in a table that can be queried. With this in mind, it is an easy matter to write a SQL statement that queries this table, then generate SQL statements as output to disable and enable these indexes.

Run the following to generate output to disable the foreign keys in the data warehouse:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME
|| ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ;
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys also speeds loads. Run the results of this SQL statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME
|| ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;

Finally, disable any unique constraints with the following:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME
|| ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like DISABLE.SQL.

To re-enable the indexes, rerun these queries after replacing DISABLE with ENABLE. Save the results in another file with a name such as ENABLE.SQL and run it as a post-session command. Re-enable constraints in the reverse order that you disabled them: re-enable the unique constraints first, and re-enable primary keys before foreign keys.

TIP Dropping or disabling foreign keys often boosts loading, but also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables, you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.

Optimizing Query Performance

Oracle Bitmap Indexing


With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree index. A b-tree index can greatly improve query performance on data that has high cardinality or contains mostly unique values, but it is not much help for low-cardinality, highly-duplicated data and may even increase query time. A typical example of a low-cardinality field is gender: it is either male or female (or possibly unknown). This kind of data is an excellent candidate for a bitmap index, and can significantly improve query performance.


Keep in mind, however, that b-tree indexing is still the Oracle default. If you don't specify an index type when creating an index, Oracle defaults to b-tree. Also note that for certain columns, bitmaps are likely to be smaller and faster to create than a b-tree index on the same column.

Bitmap indexes are suited to data warehousing because of their performance, size, and ability to be created and dropped very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. But it is important to note that when a bitmap-indexed column is updated, every row associated with that bitmap entry is locked, making bitmap indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and recreate or re-enable them after the load.

The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query processes by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joins back to the Fact table. With a bitmapped index on the Fact table, a star query may be created that accesses the Fact table first followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This star query access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in the init.ora file and if there are single-column bitmapped indexes on the fact table foreign keys.

Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word bitmap between create and index. All other syntax is identical.

Bitmap Indexes
drop index emp_active_bit;
drop index emp_gender_bit;

create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);

B-tree Indexes
drop index emp_active;
drop index emp_gender;

create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);

Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word BITMAP in the Uniqueness column rather than the word UNIQUE. Bitmap indexes cannot be unique. To enable bitmap indexes, you must set the following items in the instance initialization file:
- compatible = 7.3.2.0.0 # or higher
- event = "10111 trace name context forever"
- event = "10112 trace name context forever"
- event = "10114 trace name context forever"

Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error appears in the SQL statement; the keyword bitmap won't be recognized.

TIP To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word parallel appears in the banner text.

Index Statistics

Table method


Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as part of normal DBA procedures. Updating the table and index statistics for the data warehouse should improve query performance on Fact and Dimension tables (including appending and updating records). The following SQL statement can be used to analyze the tables in the database:


SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'
FROM USER_TABLES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following result:

ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;
ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;
ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to analyze the indexes in the database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.

Schema method
Another way to update index statistics is to compute statistics by schema rather than by table. If the data warehouse indexes are the only indexes located in a single schema, you can use the following command to update the statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');

In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execution privilege for dbms_utility to the database user executing this command.

TIP These SQL statements can be very resource intensive, especially for very large tables. For this reason, Informatica recommends running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use estimate instead of compute in the above examples.

Parallelism
Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of processors being the minimum degree.

SQL Level Parallelism


Hints are used to define parallelism at the SQL statement level. The following examples demonstrate how to utilize four processors:

SELECT /*+ PARALLEL(order_fact,4) */ ;

SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ ;

TIP When using a table alias in the SQL Statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message.
Example of improper use of an alias:

SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME
FROM EMP A


Here, the parallel hint will not be used because the alias A is used for table EMP. The correct way is:

SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME
FROM EMP A

Table Level Parallelism


Parallelism can also be defined at the table and index level. The following example demonstrates how to set a table's degree of parallelism to four for all eligible SQL statements on this table:

ALTER TABLE order_fact PARALLEL 4;

Ensure that Oracle is not contending with other processes for these resources, or you may end up with degraded performance due to resource contention.

Additional Tips

Executing Oracle SQL Scripts as Pre- and Post-Session Commands on UNIX
You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:

sqlplus -s user_id/password@database @script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named infadb), you would execute the following as a post-session command:

sqlplus -s user_id/password@infadb @enable.sql

In some environments, this may be a security issue since both the username and password are hard-coded and unencrypted. To avoid this, use the operating system's authentication to log onto the database instance. In the following example, the Informatica id pmuser is used to log onto the Oracle database. Create the Oracle user pmuser with the following SQL statement:

CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .

In the following pre-session command, pmuser (the id Informatica is logged onto the operating system as) is automatically passed from the operating system to the database and used to execute the script:

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL

You may want to use the init.ora parameter os_authent_prefix to distinguish between normal Oracle users and externally-identified ones.

DRIVING_SITE Hint

If the source and target are on separate instances, the Source Qualifier transformation should be executed on the target instance. For example, suppose you want to join two source tables (A and B) together, which may reduce the number of selected rows. However, Oracle fetches all of the data from both tables, moves the data across the network to the target instance, and then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the Generate SQL option in the source qualifier and include the driving_site hint in the SQL statement as:

SELECT /*+ DRIVING_SITE */ ;

Last updated: 01-Feb-07 18:54


Performance Tuning Databases (SQL Server)

Challenge


Database tuning can result in tremendous improvement in loading performance. This Best Practice offers tips on tuning SQL Server.

Description
Proper tuning of the source and target database is a very important consideration in the scalability and usability of a business data integration environment. Managing performance on an SQL Server involves the following points:
- Manage system memory usage (RAM caching).
- Create and maintain good indexes.
- Partition large data sets and indexes.
- Monitor disk I/O subsystem performance.
- Tune applications and queries.
- Optimize active data.

Taking advantage of grid computing is another option for improving the overall SQL Server performance. To set up a SQL Server cluster environment, you need to set up a cluster where the databases are split among the nodes. This provides the ability to distribute the load across multiple nodes. To achieve high performance, Informatica recommends using a fibre-attached SAN device for shared storage.

Manage RAM Caching


Managing RAM buffer cache is a major consideration in any database server environment. Accessing data in RAM cache is much faster than accessing the same information from disk. If database I/O can be reduced to the minimal required set of data and index pages, the pages stay in RAM longer. Too much unnecessary data and index information flowing into buffer cache quickly pushes out valuable pages. The primary goal of performance tuning is to reduce I/O so that buffer cache is used effectively.


Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage:
- Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.
- SQL Server allows several selectable models for database recovery; these include:
  - Full Recovery
  - Bulk-Logged Recovery
  - Simple Recovery

Create and Maintain Good Indexes


Creating and maintaining good indexes is key to maintaining minimal I/O for all database queries.

Partition Large Data Sets and Indexes


To reduce overall I/O contention and improve parallel operations, consider partitioning table data and indexes. Multiple techniques for achieving and managing partitions using SQL Server 2000 are addressed in this document.

Tune Applications and Queries


Tuning applications and queries is especially important when a database server is likely to be servicing requests from hundreds or thousands of connections through a given application. Because applications typically determine the SQL queries that are executed on a database server, it is very important for application developers to understand SQL Server architectural basics and know how to take full advantage of SQL Server indexes to minimize I/O.

Partitioning for Performance


The simplest technique for creating disk I/O parallelism is to use hardware partitioning and create a single "pool of drives" that serves all SQL Server database files except transaction log files, which should always be stored on physically-separate disk drives dedicated to log files. (See Microsoft documentation for installation procedures.)


Objects For Partitioning Consideration


The following areas of SQL Server activity can be separated across different hard drives, RAID controllers, and PCI channels (or combinations of the three):
- Transaction logs
- Tempdb
- Database
- Tables
- Nonclustered indexes

Note: In SQL Server 2000, Microsoft introduced enhancements to distributed partitioned views that enable the creation of federated databases (commonly referred to as scale-out), which spread resource load and I/O activity across multiple servers. Federated databases are appropriate for some high-end online transaction processing (OLTP) applications, but this approach is not recommended for addressing the needs of a data warehouse.

Segregating the Transaction Log


Transaction log files should be maintained on a storage device that is physically separate from devices that contain data files. Depending on your database recovery model setting, most update activity generates both data device activity and log activity. If both are set up to share the same device, the operations to be performed compete for the same limited resources. Most installations benefit from separating these competing I/O activities.

Segregating tempdb
SQL Server creates a database, tempdb, on every server instance to be used by the server as a shared working area for various activities, including temporary tables, sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY clauses, queries using DISTINCT (temporary worktables have to be created to remove duplicate rows), cursors, and hash joins. To move the tempdb database, use the ALTER DATABASE command to change the physical file location of the SQL Server logical file name associated with tempdb. For example, to move tempdb and its associated log to the new file locations E:\mssql7 and C:\temp, use the following commands:


ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'e:\mssql7\tempnew_location.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'c:\temp\tempnew_loglocation.mdf')

The master, msdb, and model databases are not used much during production (as compared to user databases), so it is generally not necessary to consider them in I/O performance tuning. The master database is usually used only for adding new logins, databases, devices, and other system objects.

Database Partitioning
Databases can be partitioned using files and/or filegroups. A filegroup is simply a named collection of individual files grouped together for administration purposes. A file cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and image data can all be associated with a specific filegroup. This means that all their pages are allocated from the files in that filegroup. The three types of filegroups are:
- Primary filegroup. Contains the primary data file and any other files not placed into another filegroup. All pages for the system tables are allocated from the primary filegroup.
- User-defined filegroup. Any filegroup specified using the FILEGROUP keyword in a CREATE DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL Server Enterprise Manager.
- Default filegroup. Contains the pages for all tables and indexes that do not have a filegroup specified when they are created. In each database, only one filegroup at a time can be the default filegroup. If no default filegroup is specified, the default is the primary filegroup.

Files and filegroups are useful for controlling the placement of data and indexes and eliminating device contention. Quite a few installations also leverage files and filegroups as a mechanism that is more granular than a database in order to exercise more control over their database backup/recovery strategy.

Horizontal Partitioning (Table)

INFORMATICA CONFIDENTIAL

BEST PRACTICES

743 of 954

Horizontal partitioning segments a table into multiple tables, each containing the same number of columns but fewer rows. Determining how to partition tables horizontally depends on how data is analyzed. A general rule of thumb is to partition tables so queries reference as few tables as possible. Otherwise, excessive UNION queries, used to merge the tables logically at query time, can impair performance.

When you partition data across multiple tables or multiple servers, queries accessing only a fraction of the data can run faster because there is less data to scan. If the tables are located on different servers, or on a computer with multiple processors, each table involved in the query can also be scanned in parallel, thereby improving query performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up a table, can execute more quickly.

By using a partitioned view, the data still appears as a single table and can be queried as such without having to reference the correct underlying table manually.
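The following is a minimal sketch of the partitioned-view technique; the table and column names are hypothetical, and the CHECK constraints on the partitioning column are what allow the optimizer to skip member tables that a query does not need:

-- Member tables hold disjoint date ranges, enforced by CHECK constraints
CREATE TABLE Sales_2006Q1 (
    sale_id   INT      NOT NULL,
    sale_date DATETIME NOT NULL CHECK (sale_date >= '20060101' AND sale_date < '20060401'),
    amount    MONEY    NOT NULL,
    CONSTRAINT PK_Sales_2006Q1 PRIMARY KEY (sale_id, sale_date)
)
CREATE TABLE Sales_2006Q2 (
    sale_id   INT      NOT NULL,
    sale_date DATETIME NOT NULL CHECK (sale_date >= '20060401' AND sale_date < '20060701'),
    amount    MONEY    NOT NULL,
    CONSTRAINT PK_Sales_2006Q2 PRIMARY KEY (sale_id, sale_date)
)
GO
-- The partitioned view presents the member tables as one logical table
CREATE VIEW Sales AS
    SELECT sale_id, sale_date, amount FROM Sales_2006Q1
    UNION ALL
    SELECT sale_id, sale_date, amount FROM Sales_2006Q2
GO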

Cost Threshold for Parallelism Option


Use this option to specify the threshold where SQL Server creates and executes parallel plans. SQL Server creates and executes a parallel plan for a query only when the estimated cost to execute a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to execute the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors (SMP).

Max Degree of Parallelism Option


Use this option to limit the number of processors (from a maximum of 32) to use in parallel plan execution. The default value is zero, which uses the actual number of available CPUs. Set this option to one to suppress parallel plan generation. Set the value to a number greater than one to restrict the maximum number of processors used by a single query execution.

Priority Boost Option


Use this option to specify whether SQL Server should run at a higher scheduling priority than other processes on the same computer. If you set this option to one, SQL Server runs at a priority base of 13. The default is zero, which is a priority base of seven.

Set Working Set Size Option



Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources, and it can vary dynamically between the minimum server memory and maximum server memory settings. Setting set working set size means the operating system does not attempt to swap out SQL Server pages, even if they can be used more readily by another process when SQL Server is idle.
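As a brief, hedged sketch, the parallelism and priority settings discussed above are changed through sp_configure; the values shown are purely illustrative, and the advanced options must be exposed first:

-- Expose advanced configuration options
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
-- Only generate parallel plans for queries whose estimated serial cost exceeds 20 seconds (illustrative value)
EXEC sp_configure 'cost threshold for parallelism', 20
-- Cap the number of processors a single parallel query may use (illustrative value)
EXEC sp_configure 'max degree of parallelism', 4
-- Run SQL Server at a higher scheduling priority; evaluate carefully on shared servers
EXEC sp_configure 'priority boost', 1
RECONFIGURE
GO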

Optimizing Disk I/O Performance


When configuring a SQL Server that contains only a few gigabytes of data and does not sustain heavy read or write activity, you need not be particularly concerned with the subject of disk I/O and balancing of SQL Server I/O activity across hard drives for optimal performance. To build larger SQL Server databases however, which can contain hundreds of gigabytes or even terabytes of data and/or that sustain heavy read/ write activity (as in a DSS application), it is necessary to drive configuration around maximizing SQL Server disk I/O performance by load-balancing across multiple hard drives.

Partitioning for Performance


For SQL Server databases that are stored on multiple disk drives, performance can be improved by partitioning the data to increase the amount of disk I/O parallelism. Partitioning can be performed using a variety of techniques. Methods for creating and managing partitions include configuring the storage subsystem (i.e., disk, RAID partitioning) and applying various data configuration mechanisms in SQL Server such as files, file groups, tables and views. Some possible candidates for partitioning include:
- Transaction log
- Tempdb
- Database
- Tables
- Non-clustered indexes

Using bcp and BULK INSERT


Two mechanisms exist inside SQL Server to address the need for bulk movement of data: the bcp utility and the BULK INSERT statement.


- bcp is a command prompt utility that copies data into or out of SQL Server.
- BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt.

TIP: Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, suppose you attempt to load 1,000,000 rows of new data into a table and the server loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows must be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000, you could save significant recovery time, because SQL Server would only have to roll back at most 9,999 rows instead of 999,999.
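As a rough sketch (the server, table, file names, and delimiters below are hypothetical), the batch size is controlled with the -b flag in bcp and the BATCHSIZE option in BULK INSERT:

-- From the command prompt: -c character format, -t field terminator, -b batch size, -T trusted connection
bcp StagingDB.dbo.Customer in C:\loads\customer.txt -c -t"," -b 10000 -T -S MYSERVER

-- Equivalent load from within Transact-SQL
BULK INSERT StagingDB.dbo.Customer
FROM 'C:\loads\customer.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 10000, TABLOCK)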

General Guidelines for Initial Data Loads


While loading data:
- Remove indexes.
- Use BULK INSERT or bcp.
- Parallel load using partitioned data files into partitioned tables.
- Run one load stream for each available CPU.
- Set the Bulk-Logged or Simple Recovery model.
- Use the TABLOCK option.
- Create indexes.
- Switch to the appropriate recovery model.
- Perform backups.

General Guidelines for Incremental Data Loads


- Load data with indexes in place.


- Use performance and concurrency requirements to determine locking granularity (sp_indexoption).
- Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to preserve point-in-time recovery, such as online users modifying the database during bulk loads.
- Read operations should not affect bulk loads.

Last updated: 01-Feb-07 18:54


Performance Tuning Databases (Teradata)

Challenge


Database tuning can result in tremendous improvement in loading performance. This Best Practice provides tips on tuning Teradata.

Description
Teradata offers several bulk load utilities including:
- MultiLoad, which supports inserts, updates, deletes, and upserts to any table.
- FastExport, which is a high-performance bulk export utility.
- BTEQ, which allows you to export data to a flat file but is suitable for smaller volumes than FastExport.
- FastLoad, which is used for loading inserts into an empty table.
- TPump, which is a lightweight utility that does not lock the table being loaded.

Tuning MultiLoad
There are many aspects to tuning a Teradata database. Several aspects of tuning can be controlled by setting MultiLoad parameters to maximize write throughput. Other areas to analyze when performing a MultiLoad job include estimating space requirements and monitoring MultiLoad performance.

MultiLoad parameters
Below are the MultiLoad-specific parameters that are available in PowerCenter:
- TDPID. A client-based operand that is part of the logon string.
- Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database.
- Checkpoint. A checkpoint interval is similar to a commit interval for other databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed.
- Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running.
- Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table.
- Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain error tables.
- Max Sessions. This parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working AMP (Access Module Processor).
- Sleep. This parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.
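For orientation, the sketch below shows where several of these settings appear in a stand-alone MultiLoad script. It is illustrative only: the object names are hypothetical, the .LAYOUT, .DML, and .IMPORT commands are omitted, PowerCenter generates the actual control file from the connection attributes, and the exact option syntax should be confirmed against the Teradata MultiLoad reference.

.LOGTABLE  stage_db.ml_restartlog;
.LOGON     tdpid/load_user,load_password;
.BEGIN IMPORT MLOAD
    TABLES      stage_db.customer
    WORKTABLES  stage_db.wt_customer
    ERRORTABLES stage_db.et_customer stage_db.uv_customer
    CHECKPOINT  10000   /* checkpoint every 10,000 records          */
    SESSIONS    8       /* no more than one per working AMP         */
    TENACITY    4       /* hours to keep retrying the logon         */
    SLEEP       6;      /* minutes between logon attempts           */
/* .LAYOUT, .DML LABEL, and .IMPORT ... APPLY commands go here */
.END MLOAD;
.LOGOFF;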

Estimating Space Requirements for MultiLoad Jobs


Always estimate the final size of your MultiLoad target tables and make sure the destination has enough space to complete your MultiLoad job. In addition to the space that may be required by target tables, each MultiLoad job needs permanent space for:
- Work tables
- Error tables
- Restart Log table

Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table. Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes:


PERM = (using data size + 38)
       x (number of rows processed)
       x (number of apply conditions satisfied)
       x (number of Teradata SQL statements within the applied DML)

Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.
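As a worked illustration of the formula above, assume a hypothetical load with a 200-byte using data size, 1,000,000 rows processed, one apply condition satisfied, and one SQL statement in the applied DML:

PERM = (200 + 38) x 1,000,000 x 1 x 1
     = 238,000,000 bytes (roughly 227 MB of permanent space)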

Monitoring MultiLoad Performance


Below are tips for analyzing MultiLoad performance:

1. Determine which phase of the MultiLoad job is causing poor performance. If the bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system. If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system. The MultiLoad job output lists the job phases and other useful information; save these listings for evaluation.
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.
8. Check whether the input data is skewed with respect to the primary index of the target table. Teradata depends on random, well-distributed data for data input and retrieval. For example, a file containing a million rows with the single value 'AAAAAA' for the primary index will take an extremely long time to load.
9. One common tool for investigating load issues, skewed data, and locks is Performance Monitor (PMON). PMON requires MONITOR access on the Teradata system; if you do not have MONITOR access, the DBA can help you to look at the system.
10. SQL against the system catalog can also be used to identify performance bottlenecks. Spool space (a type of work space) is built up as data is transferred to the database, so if the load is going well, spool builds rapidly. Use the following query to check (substitute the user id performing the load):

SELECT SUM(currentspool) FROM dbc.diskspace WHERE databasename = '<load user id>';

After spool has reached its peak, it falls rapidly as data is inserted from spool into the table. If the spool grows slowly, the input data is probably skewed.

FastExport
FastExport is a bulk export Teradata utility. One way to pull data for lookups and sources is to use ODBC, since there is no native connectivity to Teradata; however, ODBC is slow. For higher performance, use FastExport if the number of rows to be pulled is on the order of a million rows. FastExport writes to a file, and the lookup or source qualifier then reads this file. FastExport is integrated with PowerCenter.

BTEQ
BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you to export data to a flat file, but it is suitable for smaller volumes of data. It provides faster performance than ODBC but does not tax Teradata system resources the way FastExport can. A possible use for BTEQ with PowerCenter is to export smaller volumes of data (i.e., less than 1 million rows) to a flat file, which is then read by PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-session script.
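A minimal BTEQ export sketch that could be called from a pre-session command follows; the logon values, query, output file, and export mode are all hypothetical and should be adapted to how the flat file is defined in PowerCenter:

.LOGON tdpid/report_user,report_password;
.EXPORT REPORT FILE = /informatica/srcfiles/customer_extract.dat;
SELECT customer_id, customer_name
FROM   stage_db.customer
WHERE  load_date = CURRENT_DATE;
.EXPORT RESET;
.LOGOFF;
.QUIT;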

TPump
TPump is a load utility primarily intended for streaming data (think of loading bundles of messages arriving from MQ using PowerCenter Real Time), but it can also load from a file or a named pipe. While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility. Another important difference between MultiLoad and TPump is that TPump locks at the row-hash level instead of the table level, thus providing users read access to fresher data. Teradata states that it has improved the speed of TPump for loading files so that it is comparable with MultiLoad, so it is worth running a test load with TPump first. Be cautious about using TPump to load streaming data if the data throughput is large.

Push Down Optimization


PowerCenter embeds a powerful engine with its own memory management system and the algorithms needed to perform transformation operations such as aggregation, sorting, joining, and lookups. This is typically referred to as an ETL architecture, in which Extracts, Transformations, and Loads are performed: data is extracted from the data source to the PowerCenter engine (which can be on the same machine as the source or on a separate machine), where all the transformations are applied, and the result is then pushed to the target. Some of the performance considerations for this type of architecture are:
- Is the network fast enough and tuned effectively to support the necessary data transfer?
- Is the hardware on which PowerCenter is running sufficiently robust, with high processing capability and high memory capacity?

ELT (Extract, Load, Transform) is a relatively new design and runtime paradigm that became popular with the advent of high-performance RDBMS platforms for DSS and OLTP workloads. Because Teradata typically runs on well-tuned operating systems and well-tuned hardware, the ELT paradigm tries to push as much of the transformation logic as possible onto the Teradata system. The ELT design paradigm can be achieved through the Pushdown Optimization option offered with PowerCenter.

ETL or ELT
Because many database vendors and consultants advocate using ELT (Extract, Load and Transform) over ETL (Extract, Transform and Load), the use of Pushdown Optimization can be somewhat controversial. Informatica advocates using Pushdown Optimization as an option to solve specific performance situations rather than as the default design of a mapping.


The following scenarios can help in deciding when to use ETL with PowerCenter and when to use ELT (i.e., Pushdown Optimization):

1. When the load needs to look up only dimension tables, there may be no need to use Pushdown Optimization. In this context, PowerCenter's ability to build dynamic, persistent caches is significant. If a daily load involves tens or hundreds of fact files to be loaded throughout the day, then dimension surrogate keys can easily be obtained from PowerCenter's in-memory cache. Compare this with the cost of running the same dimension lookup queries on the database.
2. In many cases large Teradata systems contain only a small amount of data. In such cases there may be no need to push down.
3. When only simple filters or expressions need to be applied to the data, there may be no need to push down. The special case is that of applying filters or expression logic to non-unique columns in incoming data in PowerCenter. Compare this to loading the same data into the database and then applying a WHERE clause on a non-unique column, which is highly inefficient for a large table. The principle here is: filter and resolve the data as it gets loaded instead of loading it into a database, querying the RDBMS to filter/resolve, and re-loading it into the database. In other words, ETL instead of ELT.
4. Pushdown Optimization needs to be considered only if a large set of data needs to be merged or queried to get to your final load set.

Maximizing Performance using Pushdown Optimization


You can push transformation logic to either the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration. When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and processes any transformation logic that it cannot push to the database. Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to


Pushdown Optimization.

Known Issues with Teradata


You may encounter the following problems using ODBC drivers with a Teradata database:
- Teradata sessions fail if the session requires a conversion to a numeric data type and the precision is greater than 18.
- Teradata sessions fail when you use full pushdown optimization for a session containing a Sorter transformation.
- A sort on a distinct key may give inconsistent results if the sort is not case sensitive and one port is a character port.
- A session containing an Aggregator transformation may produce different results from PowerCenter if the group by port is a string data type and it is not case-sensitive.
- A session containing a Lookup transformation fails if it is configured for target-side pushdown optimization.
- A session that requires type casting fails if the casting is from x to date/time.
- A session that contains a date to string conversion fails.

Working with SQL Overrides


You can configure the Integration Service to perform an SQL override with Pushdown Optimization. To perform an SQL override, you configure the session to create a view. When you use a SQL override for a Source Qualifier transformation in a session configured for source or full Pushdown Optimization with a view, the Integration Service creates a view in the source database based on the override. After it creates the view in the database, the Integration Service generates a SQL query that it can push to the database. The Integration Service runs the SQL query against the view to perform Pushdown Optimization. Note: To use an SQL override with pushdown optimization, you must configure the session for pushdown optimization with a view.

Running a Query
If the Integration Service did not successfully drop the view, you can run a query against the source database to search for the views generated by the Integration Service. When the Integration Service creates a view, it uses a prefix of PM_V. You can search for views with this prefix to locate the views created during pushdown optimization. Teradata-specific SQL:

SELECT TableName
FROM   DBC.Tables
WHERE  CreatorName = USER
AND    TableKind = 'V'
AND    TableName LIKE 'PM\_V%' ESCAPE '\'
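Any orphaned views found can then be dropped manually; the name below is only a placeholder for a view name returned by the query above:

DROP VIEW stage_db.PM_V_example_view;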

Rules and Guidelines for SQL Override


Use the following rules and guidelines when you configure pushdown optimization for a session containing an SQL override:

Last updated: 01-Feb-07 18:54


Performance Tuning in a Real-Time Environment

Challenge


As Data Integration becomes a broader and more service-oriented Information Technology initiative, real-time and right-time solutions will become critical to the success of the overall architecture. Tuning real-time processes is often different from tuning batch processes.

Description
To remain agile and flexible in increasingly competitive environments, today's companies are dealing with sophisticated operational scenarios such as consolidation of customer data in real time to support a call center, or the delivery of precise forecasts for supply chain optimization. To support such highly demanding operational environments, data integration platforms must do more than serve analytical data needs. They must also support real-time, 24x7, mission-critical operations that involve live or current information available across the enterprise and beyond. They must access, cleanse, integrate and deliver data in real time to ensure up-to-the-second information availability.

Data integration platforms must also intelligently scale to meet both increasing data volumes and increasing numbers of concurrent requests that are typical of shared-services Integration Competency Center (ICC) environments. They must be extremely reliable, providing high availability to minimize outages and ensure seamless failover and recovery, as every minute of downtime can have a large impact on business operations.

PowerCenter can be used to process data in real time. Real-time processing is on-demand processing of data from real-time sources. A real-time session reads, processes and writes data to targets continuously; by default, a session reads and writes bulk data at scheduled intervals unless it is configured for real-time processing. To process data in real time, the data must originate from a real-time source. Real-time sources include JMS, WebSphere MQ, TIBCO, webMethods, MSMQ, SAP, and web services. Real-time processing can also be used for processes that require immediate access to dynamic data (e.g., financial data).

Latency Impact on performance


Use the Real-Time Flush Latency session condition to control the target commit latency when running in real-time mode. PWXPC commits source data to the target at the end of the specified maximum latency period. This parameter requires a valid value and has a valid default value.

When the session runs, PWXPC begins to read data from the source. After data is provided to the source qualifier, the Real-Time Flush Latency interval begins. At the end of each Real-Time Flush Latency interval, once an end-UOW boundary is reached, PWXPC issues a commit to the target. The following message appears in the session log to indicate that this has occurred:

[PWXPC_10082] [INFO] [CDCDispatcher] raising real-time flush with restart tokens [restart1_token], [restart2_token] because Real-time Flush Latency [RTF_millisecs] occurred

Only complete UOWs are committed during real-time flush processing. The commit to the target when reading CDC data is not strictly controlled by the Real-Time Flush Latency specification; the UOW Count and the Commit Threshold values also determine the commit frequency.

The value specified for Real-Time Flush Latency also controls the PowerExchange Consumer API (CAPI) interface timeout value (PowerExchange latency) on the source platform. The CAPI interface timeout value is displayed in the following PowerExchange message on the source platform (and in the session log if Retrieve PWX Log Entries is specified in the connection attributes):

PWX-09957 CAPI i/f: Read times out after <n> seconds

The CAPI interface timeout also affects latency because it determines how quickly changes are returned to the PWXPC reader by PowerExchange. PowerExchange ensures that it returns control back to PWXPC at least once every CAPI interface timeout period. This allows PWXPC to regain control and, if necessary, perform the real-time flush of the data returned. A high Real-Time Flush Latency specification also affects the speed with which stop requests from PowerCenter are handled, because the PWXPC CDC reader must wait for PowerExchange to return control before it can handle the stop request.

TIP: Use the PowerExchange STOPTASK command to shut down more quickly when using a high Real-Time Flush Latency value.

For example, if the value for Real-Time Flush Latency is 10 seconds, PWXPC issues a commit for all data read after 10 seconds have elapsed and the next end-UOW boundary is received. The lower the value is set, the faster the data is committed to the target. If the lowest possible latency is required for the application of changes to the target, specify a low Real-Time Flush Latency value.

Warning: When you specify a low Real-Time Flush Latency interval, the session might consume more system resources on the source and target platforms. This is because:
- The session commits to the target more frequently, therefore consuming more target resources.
- PowerExchange returns more frequently to the PWXPC reader, thereby passing fewer rows on each iteration and consuming more resources on the source PowerExchange platform.

Balance performance and resource consumption with latency requirements when choosing the UOW Count and Real-Time Flush Latency values.

Commit Interval Impact on performance


Commit Threshold is only applicable to real-time CDC sessions. Use the Commit Threshold session condition to cause commits before reaching the end of the UOW when processing large UOWs. This parameter requires a valid value and has a valid default value.

Commit Threshold can be used to cause a commit before the end of a UOW is received, a process also referred to as sub-packet commit. The value specified in the Commit Threshold is the number of records within a source UOW to process before inserting a commit into the change stream. This attribute is different from the UOW Count attribute in that it is a count of records within a UOW rather than of complete UOWs. The Commit Threshold counter is reset when either the number of records specified or the end of the UOW is reached. This attribute is useful when there are extremely large UOWs in the change stream that might cause locking issues on the target database or resource issues on the PowerCenter Integration Service.

The Commit Threshold count is cumulative across all sources in the group. This means that sub-packet commits are inserted into the change stream when the specified count is reached, regardless of the number of sources to which the changes actually apply. For example, a UOW contains 900 changes for one source followed by 100 changes for a second source and then 500 changes for the first source. If the Commit Threshold is set to 1000, the commit record is inserted after the 1000th change record, which is after the 100 changes for the second source.

Warning: A UOW may contain changes for multiple source tables. Using Commit Threshold can cause commits to be generated at points in the change stream where the relationship between these tables is inconsistent. This may then result in target commit failures.

If 0 or no value is specified, commits occur on UOW boundaries only. Otherwise, the value specified is used to insert commit records into the change stream between UOW boundaries, where applicable. The value of this attribute overrides the value specified in the PowerExchange DBMOVER configuration file parameter SUBCOMMIT_THRESHOLD. For more information on this PowerExchange parameter, refer to the PowerExchange Reference Manual.

The commit to the target when reading CDC data is not strictly controlled by the Commit Threshold specification. The commit records inserted into the change stream as a result of the Commit Threshold value affect the UOW Count counter, and the UOW Count and Real-Time Flush Latency values determine the target commit frequency. For example, a UOW contains 1,000 change records (any combination of inserts, updates, and deletes). If 100 is specified for the Commit Threshold and 5 for the UOW Count, then a commit record is inserted after each 100 records and a target commit is issued after every 500 records.

Last updated: 29-May-08 18:40


Performance Tuning UNIX Systems

Challenge


Identify opportunities for performance improvement within the complexities of the UNIX operating environment.

Description
This section provides an overview of the subject area, followed by discussion of the use of specific tools.

Overview
All system performance issues are fundamentally resource contention issues. In any computer system, there are three essential resources: CPU, memory, and I/O (namely disk and network I/O). From this standpoint, performance tuning for PowerCenter means ensuring that PowerCenter and its sub-processes have adequate resources to execute in a timely and efficient manner.

Each resource has its own particular set of problems, and resource problems are complicated because all resources interact with each other. Performance tuning is about identifying bottlenecks and making trade-offs to improve the situation. The best approach is to take a baseline measurement first and obtain a good understanding of how the system behaves, then evaluate the bottleneck revealed on each system resource during your load window and remove whichever resource contention offers the greatest opportunity for performance enhancement. Here is a summary of each system resource area and the problems it can have.

CPU
On any multiprocessing and multi-user system, many processes want to use the CPUs at the same time. The UNIX kernel is responsible for allocating a finite number of CPU cycles across all running processes. If the total demand on the CPU exceeds its finite capacity, then all processing is likely to reflect a negative impact on performance; the system scheduler puts each process in a queue to wait for CPU availability.


An average of the count of active processes in the system for the last 1, 5, and 15 minutes is reported as the load average when you execute the command uptime. The load average provides a basic indicator of the number of contenders for CPU time. Likewise, the vmstat command provides the average usage of all the CPUs along with the number of processes contending for CPU (the value under the r column).

On SMP (symmetric multiprocessing) architecture servers, watch for even utilization of all the CPUs; how well all the CPUs are utilized depends on how well an application can be parallelized. If a process is incurring a high degree of involuntary context switches by the kernel, binding the process to a specific CPU may improve performance.

Memory
Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory, the system starts paging, or moving portions of active processes to disk in order to reclaim physical memory. When this happens, performance decreases dramatically. Paging is distinguished from swapping, which means moving entire processes to disk and reclaiming their space. Paging and excessive swapping indicate that the system can't provide enough memory for the processes that are currently running. Commands such as vmstat and pstat show whether the system is paging; ps, prstat and sar can report the memory requirements of each process.

Disk I/O
The I/O subsystem is a common source of resource contention problems. A finite amount of I/O bandwidth must be shared by all the programs (including the UNIX kernel) that currently run. The system's I/O buses can transfer only so many megabytes per second; individual devices are even more limited. Each type of device has its own peculiarities and, therefore, its own problems. Tools are available to evaluate specific parts of the I/O subsystem:

- iostat can give you information about the transfer rates for each disk drive.
- ps and vmstat can give some information about how many processes are blocked waiting for I/O.
- sar can provide voluminous information about I/O efficiency.
- sadp can give detailed information about disk access patterns.


Network I/O
The source data, the target data, or both are likely to be connected through an Ethernet channel to the system where PowerCenter resides. Be sure to consider the number of Ethernet channels and the bandwidth available to avoid congestion.

- netstat shows packet activity on a network; watch for a high collision rate of output packets on each interface.
- nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not from the NFS server) and watch for a high ratio of timeouts to total calls and for "not responding" messages.

Given that these issues all boil down to access to some computing resource, mitigation of each issue consists of making some adjustment to the environment to provide more (or preferential) access to the resource; for instance:

- Adjusting execution schedules to take advantage of low-usage times may improve availability of memory, disk, network bandwidth, CPU cycles, etc.
- Migrating other applications to other hardware is likely to reduce demand on the hardware hosting PowerCenter.
- For CPU-intensive sessions, raising CPU priority (or lowering priority for competing processes) provides more CPU time to the PowerCenter sessions.
- Adding hardware resources, such as memory, can make more resource available to all processes.
- Re-configuring existing resources may provide for more efficient usage, such as assigning different disk devices for input and output, striping disk devices, or adjusting network packet sizes.

Detailed Usage
The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips are likely to be more helpful than others in a particular environment, all are worthy of consideration. Availability, syntax and format of each varies across UNIX versions.
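As a starting point, a simple baseline-capture script along the following lines can be run during the load window; this is a sketch only (exact flags vary by UNIX flavor, and the output path is hypothetical):

#!/bin/sh
# Capture a coarse system baseline during the PowerCenter load window (adjust flags per platform)
OUT=/tmp/perf_baseline_`date +%Y%m%d_%H%M`.log
{
  echo "=== load average ===";                uptime
  echo "=== cpu/memory (10 samples, 5s) ==="; vmstat 5 10
  echo "=== per-disk I/O ===";                iostat 5 10
  echo "=== cpu utilization ===";             sar -u 5 10
  echo "=== network interfaces ===";          netstat -i
} > "$OUT" 2>&1
echo "Baseline written to $OUT"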

Running ps -axu


Run ps -axu to check for the following items:


- Are there any processes waiting for disk access or for paging? If so, check the I/O and memory subsystems.
- What processes are using most of the CPU? This may help to distribute the workload better.
- What processes are using most of the memory? This may help to distribute the workload better.
- Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.

Identifying and Resolving Memory Issues


Use vmstat or sar to check for paging/swapping actions. Check the system to ensure that excessive paging/swapping does not occur at any time during the session processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/swapping. If paging or excessive swapping does occur at any time, increase memory to prevent it. Paging/swapping, on any database system, causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound server, this can effectively shut down the PowerCenter process and any databases running on the server.

Some swapping may occur normally regardless of the tuning settings, because some processes use the swap space by design. To check swap space availability, use pstat and swap. If the swap space is too small for the intended applications, it should be increased.

Run vmstat 5 (sar -wpgr) for SunOS, or vmstat -S 5, to detect and confirm memory problems, and check for the following:
- Are page-outs occurring consistently? If so, you are short of memory.
- Are there a high number of address translation faults (System V only)? This suggests a memory shortage.
- Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate extreme memory shortage.
- If you don't have vmstat -S, look at the w and de fields of vmstat. These should always be zero.

If memory seems to be the bottleneck, try the following remedial steps:



- Reduce the size of the buffer cache (if your system has one) by decreasing BUFPAGES.
- If you have statically allocated STREAMS buffers, reduce the number of large (e.g., 2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.
- Reduce the size of your kernel's tables. This may limit the system's capacity (i.e., number of files, number of processes, etc.).
- Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much.
- Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily.
- Try to limit the time spent running sendmail, which is a memory hog.
- If you don't see any significant improvement, add more memory.


Identifying and Resolving Disk I/O Issues


Use iostat to check I/O load and utilization as well as CPU load. iostat can monitor the I/O load on specific disks on the UNIX server. Take notice of how evenly disk activity is distributed among the system disks; if it is not, are the most active disks also the fastest disks? Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends (bad)?
- Reorganize your file systems and disks to distribute I/O activity as evenly as possible. Using symbolic links helps to keep the directory structure the same while still moving the data files that are causing I/O contention.
- Use your fastest disk drive and controller for your root file system; this almost certainly has the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one file system and use the fastest drive for that file system.
- Put performance-critical files on a file system with a large block size: 16KB or 32KB (BSD).
- Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your system's memory performance.


- Rebuild your file systems periodically to eliminate fragmentation (i.e., back up, build a new file system, and restore).
- If you are using NFS and remote files, look at your network situation; you don't have local disk I/O problems.
- Check memory statistics again by running vmstat 5 (sar -rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problems first. Swapping makes performance worse.

If your system has a disk capacity problem and is constantly running out of disk space, try the following actions:

- Write a find script that detects old core dumps, editor backup and auto-save files, and other trash, and deletes them automatically. Run the script through cron.
- Use the disk quota system (if your system has one) to prevent individual users from gathering too much storage.
- Use a smaller block size on file systems that are mostly small files (e.g., source code files, object modules, and small data files).

Identifying and Resolving CPU Overload Issues


Use uptime or sar -u to check for CPU loading. sar provides more detail, including %usr (user), %sys (system), %wio (waiting on I/O), and %idle (percent of idle time). A target goal should be %usr + %sys = 80, %wio = 10, and %idle = 10. If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy %sys load together with a high %idle, this is indicative of memory contention and swapping/paging problems; in this case, it is necessary to make memory changes to reduce the load on the server.

When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up somewhere. This points to CPU overload.

- Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be performance problems, but any savings will help.
- Get users to run jobs at night with at or any queuing system that's available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done in the morning.


- Using nice to lower the priority of CPU-bound jobs improves interactive performance. Conversely, using nice to raise the priority of CPU-bound jobs expedites them but may hurt interactive performance. In general, though, using nice is only a temporary solution; if your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.

Identifying and Resolving Network I/O Issues


Suspect problems with network capacity or with data integrity if users experience slow performance when they are using rlogin or when they are accessing files via NFS.

Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If the number of input or output errors is large, suspect hardware problems. A large number of input errors indicates problems somewhere on the network; a large number of output errors suggests problems with your system and its interface to the network.

If collisions and network hardware are not a problem, figure out which system appears to be slow. Use spray to send a large burst of packets to the slow system. If the number of dropped packets is large, the remote system most likely cannot respond to incoming data fast enough. Look to see if there are CPU, memory or disk I/O problems on the remote system. If not, the system may just not be able to tolerate heavy network workloads; try to reorganize the network so that this system isn't a file server.

A large number of dropped packets may also indicate data corruption. Run netstat -s on the remote system, then spray the remote system from the local system and run netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is a slow network server. If the increase in socket full drops is less than the number of dropped packets, look for network errors.

Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets.

Try to prevent users from running I/O-intensive programs across the network. The grep utility is a good example of an I/O-intensive program. Instead, have users log into the remote system to do their work.

- Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system.
- Use systems with good network performance as file servers.
- lsattr -E -l sys0 can be used to determine some current settings on some UNIX environments. (On Solaris, you execute prtenv.) Of particular interest is maxuproc, the setting that determines the maximum number of processes per user. On most UNIX environments this defaults to 40, but it should be increased to 250 on most systems.
- Choose a file system. Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD); vxfs, the Veritas file system; and lastly raw devices which, in reality, are not a file system at all. Additionally, for the PowerCenter Enterprise Grid Option cluster file system (CFS), products such as GFS for Red Hat Linux, Veritas CFS, and GPFS for IBM AIX are some of the available choices.

Cluster File System Tuning


In order to take full advantage of the PowerCenter Enterprise Grid Option, a cluster file system (CFS) is recommended. The PowerCenter Grid option requires the directories for each Integration Service to be shared with other servers, which allows Integration Services to share files, such as cache files, between different session runs. CFS performance is a result of tuning parameters and tuning the infrastructure; therefore, using the parameters recommended by each CFS vendor is the best approach for CFS tuning.

PowerCenter Options
The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about the running tasks, including CPU%, memory, and swap usage. The PowerCenter 64-bit option can allocate more memory to sessions and achieve higher throughputs compared to the 32-bit version of PowerCenter.

Last updated: 06-Dec-07 15:16



Performance Tuning Windows 2000/2003 Systems

Challenge


Windows Server is designed as a self-tuning operating system. Standard installation of Windows Server provides good performance out-of-the-box, but optimal performance can be achieved by tuning. Note: Tuning is essentially the same for both Windows 2000 and 2003-based systems.

Description
The following tips have proven useful in performance-tuning Windows servers. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration. The two places to begin tuning a Windows server are:

- Performance Monitor.
- The Performance tab (press Ctrl+Alt+Del, choose Task Manager, and click on the Performance tab).

Although the Performance Monitor can be tracked in real-time, creating a result-set representative of a full day is more likely to render an accurate view of system performance.

Resolving Typical Windows Server Problems


The following paragraphs describe some common performance problems in a Windows Server environment and suggest tuning solutions.

Server Load: Assume that some software will not be well coded and that some background processes (e.g., a mail server or web server) running on the same machine can starve the machine's CPUs. In this situation, off-loading the CPU hogs may be the only recourse.


Device Drivers: The device drivers for some types of hardware are notorious for consuming CPU clock cycles inefficiently. Be sure to obtain the latest drivers from the hardware vendor to minimize this problem.

Memory and services: Although adding memory to Windows Server is always a good solution, it is also expensive and usually must be planned in advance. Before adding memory, check the Services in Control Panel, because many background applications do not uninstall the old service when installing a new version. Thus, both the unused old service and the new service may be using valuable CPU and memory resources.

I/O Optimization: This is, by far, the best tuning option for database applications in the Windows Server environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too. Using electrostatic devices and fast-wide SCSI can also help to increase performance. Further, fragmentation can usually be eliminated by using a Windows Server disk defragmentation product. Finally, on Windows servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices; Windows Server, by default, sets the disk device priority low.

Monitoring System Performance in Windows Server


In Windows Server, PowerCenter uses system resources to process transformations, session execution, and the reading and writing of data. The PowerCenter Integration Service also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. With Windows Server, you can use the System Monitor in the Performance Console of the administrative tools, or the system tools in the Task Manager, to monitor the amount of system resources used by PowerCenter and to identify system bottlenecks. Windows Server provides the following tools (accessible under Control Panel/Administrative Tools/Performance) for monitoring resource usage on your computer:
- System Monitor
- Performance Logs and Alerts

These Windows Server monitoring tools enable you to analyze usage and detect bottlenecks at the disk, memory, processor, and network level.

System Monitor
The System Monitor displays a graph which is flexible and configurable. You can copy counter paths and settings from the System Monitor display to the Clipboard and paste counter paths from Web pages or other sources into the System Monitor display. Because the System Monitor is portable, it is useful in monitoring other systems that require administration.

Performance Monitor
The Performance Logs and Alerts tool provides two types of performance-related logs (counter logs and trace logs) and an alerting function.

Counter logs record sampled data about hardware resources and system services based on performance objects and counters in the same manner as System Monitor; they can, therefore, be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel. Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity.

The alerting function allows you to define a counter value that triggers actions such as sending a network message, running a program, or starting a log. Alerts are useful if you are not actively monitoring a particular counter threshold value but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You may want to set alerts based on established performance baseline values for your system.

Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.)

The predefined log settings under Counter Logs (i.e., System Overview) are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it reaches a maximum size. If you start logging with the default settings, data is saved to the Perflogs folder on the root directory and includes the counters Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\% Processor Time. If you want to create your own log setting, right-click one of the log types.

PowerCenter Options
The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about running tasks, including CPU%, memory, and swap usage.

PowerCenter's 64-bit option running on Intel Itanium processor-based machines and 64-bit Windows Server 2003 can allocate more memory to sessions and achieve higher throughput than the 32-bit version of PowerCenter on Windows Server.

Using the PowerCenter Grid option on Windows Server enables distribution of a session, or of the sessions in a workflow, to multiple servers and reduces the processing load window. The PowerCenter Grid option requires that the directories for each Integration Service be shared with the other servers. This allows Integration Services to share files, such as cache files, among various session runs. With a Cluster File System (CFS), Integration Services running on various servers can perform concurrent reads and writes to the same block of data.

Last updated: 01-Feb-07 18:54


Recommended Performance Tuning Procedures

Challenge


To optimize PowerCenter load times by employing a series of performance tuning procedures.

Description
When a PowerCenter session or workflow is not performing at the expected or desired speed, there is a methodology that can help to diagnose problems that may be adversely affecting various components of the data integration architecture. While PowerCenter has its own performance settings that can be tuned, you must consider the entire data integration architecture, including the UNIX/Windows servers, network, disk array, and the source and target databases to achieve optimal performance. More often than not, an issue external to PowerCenter is the cause of the performance problem. In order to correctly and scientifically determine the most logical cause of the performance problem, you need to execute the performance tuning steps in a specific order. This enables you to methodically rule out individual pieces and narrow down the specific areas on which to focus your tuning efforts.

1. Perform Benchmarking
You should always have a baseline of current load times for a given workflow or session with a similar row count. Perhaps you are not achieving your required load window, or you simply suspect that your processes could run more efficiently based on comparison with similar tasks that run faster. Use the benchmark to estimate what your desired performance goal should be, and tune to that goal. Begin with the problem mapping that you created, along with a session and workflow that use all default settings. This helps to identify which changes have a positive impact on performance.

2. Identify the Performance Bottleneck Area


This step helps to narrow down the areas on which to focus further. Follow the areas and sequence below when attempting to identify the bottleneck:
- Target
- Source
- Mapping
- Session/Workflow
- System

The methodology steps you through a series of tests using PowerCenter to identify trends that point to where to focus next. Remember to go through these tests in a scientific manner: run them multiple times before reaching any conclusion, and always keep in mind that fixing one bottleneck area may create a different bottleneck. For more information, see Determining Bottlenecks.

3. "Inside" or "Outside" PowerCenter


Depending on the results of the bottleneck tests, optimize inside or outside PowerCenter. Be sure to perform the bottleneck tests in the order prescribed in Determining Bottlenecks, since this is also the order in which you should make any performance changes. Problems outside PowerCenter are anything indicating that the source of the performance problem is external to PowerCenter. The most common performance problems outside PowerCenter are source/target database problems, network bottlenecks, and server or operating system problems.
- For source database-related bottlenecks, refer to Tuning SQL Overrides and Environment for Better Performance.
- For target database-related problems, refer to Performance Tuning Databases - Oracle, SQL Server, or Teradata.
- For operating system problems, refer to Performance Tuning UNIX Systems or Performance Tuning Windows 2000/2003 Systems.

Problems inside PowerCenter are anything that PowerCenter controls, such as the actual transformation logic and the PowerCenter workflow/session settings. The session settings contain quite a few memory settings and partitioning options that can greatly improve performance; refer to Tuning Sessions for Better Performance for more information. Although there are certain procedures to follow to optimize mappings, keep in mind that, in most cases, the mapping design is dictated by business logic; there may be a more efficient way to perform the business logic within the mapping, but you cannot ignore the necessary business logic to improve performance. Refer to Tuning Mappings for Better Performance for more information.

4. Re-Execute the Problem Workflow or Session


After you have completed the recommended steps for each relevant performance bottleneck, re-run the problem workflow or session and compare load performance against the baseline. This step is iterative and should be performed after any performance-based setting is changed. You are trying to answer the question: did the performance change have a positive impact? If so, move on to the next bottleneck. Be sure to prepare detailed documentation at every step along the way so that you have a clear record of what was and wasn't tried. While it may seem as though there are an enormous number of areas where a performance problem can arise, if you follow the steps for finding the bottleneck(s) and apply the tuning techniques specific to each one, you are likely to improve performance and achieve your desired goals.

Last updated: 01-Feb-07 18:54


Tuning and Configuring Data Analyzer and Data Analyzer Reports

Challenge
A Data Analyzer report that is slow to return data means lag time to a manager or business analyst. It can be a crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some suggestions for tuning Data Analyzer and Data Analyzer reports.

Description
Performance tuning reports occurs both at the environment level and the reporting level. Often, report performance can be enhanced by looking closely at the objective of the report rather than the suggested appearance. The following guidelines should help with tuning the environment and the report itself.

1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform benchmarks at various points throughout the day and evening hours to account for inconsistencies in network traffic, database server load, and application server load. This provides a baseline to measure changes against.

2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data elements, filters, and calculations. Also be sure to remove any extraneous charts or graphs. Consider whether the report can be broken into multiple reports or presented at a higher level. These are often ways to create more visually appealing reports and allow for linked detail reports or drill-down to the detail level.

3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the report to run during hours when system use is minimized. Consider scheduling large numbers of reports to run overnight. If mid-day updates are required, test the performance at lunch hours and consider scheduling for that time period. Reports that require filters by users can often be copied and filters pre-created to allow for scheduling of the report.

4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the creation of secondary indices to improve the efficiency of the query. Next, execute reports while a DBA monitors the database environment. This provides the DBA the opportunity to tune the database for querying. Finally, look into changes in database settings. Increasing the database memory in the initialization file often improves Data Analyzer performance significantly.

5. Investigate Network. Reports are simply database queries, which can be found by clicking the "View SQL" button on the report. Run the query from the report against the database using a client tool on the server on which the database resides. One caveat is that even the database tool on the server may contact the outside network. Work with the DBA during this test to use a local database connection (e.g., Bequeath/IPC, Oracle's local database communication protocol) and monitor the database throughout this process. This test may pinpoint whether the bottleneck is occurring on the network or in the database. If, for instance, the query performs well regardless of where it is executed, but the report continues to be slow, this indicates an application server bottleneck. Common locations for network bottlenecks include router tables, web server demand, and server input/output. Informatica recommends installing Data Analyzer on a dedicated application server.

6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of tuning involves changes to the database tables. Review the under-performing reports. Can any of these be generated from aggregate tables instead of from base tables? Data Analyzer makes efficient use of linked aggregate tables by determining on a report-by-report basis if the report can utilize an aggregate table. By studying the existing reports and future requirements, you can determine what key aggregates can be created in the ETL tool and stored in the database. Calculated metrics can also be created in an ETL tool and stored in the database instead of created in Data Analyzer. Each time a calculation must be done in Data Analyzer, it is performed as part of the query process. To determine if a query can be improved by building these elements in the database, try removing them from the report and comparing report performance. Consider whether these elements appear in a multitude of reports or only a few.

7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual report query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the SQL into a query utility and execute it. (DBA assistance may be beneficial here.) If the query appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a query utility. Additional options include utilizing database views to cache data prior to report generation. Reports are then built based on the view.

Note: If you edit the report query, the query must be edited again after each report change and may need to be edited during migrations. Be aware that this is a time-consuming and difficult-to-maintain method of performance tuning. The Data Analyzer repository database should be tuned for an OLTP workload.

Tuning Java Virtual Machine (JVM)

JVM Layout
The Java Virtual Machine (JVM) heap is the repository for all live objects, dead objects, and free memory. The JVM has the following primary jobs:

- Execute code
- Manage memory
- Remove garbage objects

The size of the JVM heap determines how often, and for how long, garbage collection runs. If you are using the WebLogic application server, the JVM parameters can be set in startWebLogic.cmd or startWebLogic.sh.

Parameters of the JVM


1. The -Xms and -Xmx parameters define the minimum and maximum heap size; for large applications like Data Analyzer, the values should be set equal to each other.
2. Start with -ms=512m -mx=512m.
3. As needed, increase the JVM heap by 128m or 256m to reduce garbage collection.
4. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize command-line parameter controls the permanent generation's size.
5. The NewSize and MaxNewSize parameters control the new generation's minimum and maximum size. -XX:NewRatio=5 divides old-to-new in the ratio 5:1 (i.e., the old generation occupies 5/6 of the heap while the new generation occupies 1/6 of the heap).
   - When the new generation fills up, it triggers a minor collection, in which surviving objects are moved to the old generation.
   - When the old generation fills up, it triggers a major collection, which involves the entire object heap. This is more expensive in terms of resources than a minor collection.
6. If you increase the new generation size, the old generation size decreases. Minor collections occur less often, but the frequency of major collections increases.
7. If you decrease the new generation size, the old generation size increases. Minor collections occur more often, but the frequency of major collections decreases.
8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).
9. Enable additional JVMs if you expect large numbers of users. Informatica typically recommends two to three CPUs per JVM.
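For reference only, the heap guidance above might translate into a JAVA_OPTIONS setting in startWebLogic.sh (or the equivalent SET line in startWebLogic.cmd) along the following lines; the variable name and the sizes shown are illustrative assumptions and should be validated against your application server version and observed garbage collection behavior:

JAVA_OPTIONS="-Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewRatio=5 -verbose:gc"

Here -Xms and -Xmx are set equal as recommended in item 1, -XX:NewRatio=5 keeps the new generation at roughly 1/6 of the heap, and -verbose:gc prints collection activity so that the frequency of minor and major collections can be observed before the heap is grown further.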

Other Areas to Tune

Execute Threads


- Execute threads are the threads available to process simultaneous operations in WebLogic.
- Too few threads means CPUs are under-utilized and jobs wait for threads to become available.
- Too many threads means the system wastes resources managing threads, and the OS performs unnecessary context switching.
- The default is 15 threads. Informatica recommends using the default value, but you may need to experiment to determine the optimal value for your environment.
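As a hedged sketch only (element and attribute names should be confirmed against the config.xml schema of your WebLogic release, and the server name below is a placeholder), the execute thread count described above is typically expressed as an ExecuteQueue entry on the server definition:

<Server Name="myserver" ListenPort="7001">
    <!-- Default execute queue; ThreadCount controls the number of execute threads -->
    <ExecuteQueue Name="weblogic.kernel.Default" ThreadCount="15"/>
</Server>

Raise ThreadCount only after monitoring shows requests queuing for threads; as noted above, too many threads simply adds context-switching overhead.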

Connection Pooling
The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.
- Initial capacity = 15
- Maximum capacity = 15
- The sum of connections across all pools should be equal to the number of execute threads.

Connection pooling avoids the overhead of growing and shrinking the pool size dynamically by setting the initial and maximum pool size to the same level.

Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available on Windows NT/2000 (installed by default), Solaris 2.6/2.7, AIX 4.3, HP/UX, and Linux.

- Check Enable Native I/O on the server attribute tab.
- This adds <NativeIOEnabled> to config.xml with a value of true.
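For illustration, a fixed-size connection pool of the kind described above is usually declared in WebLogic's config.xml roughly as follows; the pool name, driver, URL, and user are placeholders, and the attribute names should be verified for your WebLogic version:

<JDBCConnectionPool Name="DataAnalyzerRepositoryPool"
    Targets="myserver"
    DriverName="oracle.jdbc.driver.OracleDriver"
    URL="jdbc:oracle:thin:@dbhost:1521:orcl"
    Properties="user=repo_user"
    InitialCapacity="15"
    MaxCapacity="15"
    CapacityIncrement="0"/>

Setting InitialCapacity equal to MaxCapacity keeps the pool at a constant size, which is what avoids the grow/shrink overhead discussed above.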

For WebSphere, use the Performance Tuner to modify the configurable parameters. For optimal configuration, separate the application server, the data warehouse database, and the repository database onto separate dedicated machines.

Application Server-Specific Tuning Details

JBoss Application Server


Web Container. Tune the web container by modifying the following configuration file so that it accepts a reasonable number of HTTP requests as required by the Data Analyzer installation. Ensure that an optimal number of threads is made available to the web container so that it can accept and process more HTTP requests.

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml

The following is a typical configuration:

<!-- A HTTP/1.1 Connector on port 8080 -->
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector" port="8080"
    minProcessors="10" maxProcessors="100" enableLookups="true" acceptCount="20"
    debug="0" tcpNoDelay="true" bufferSize="2048" connectionLinger="-1"
    connectionTimeout="20000" />

The following parameters may need tuning:
- minProcessors. Number of threads created initially in the pool.
- maxProcessors. Maximum number of threads that can ever be created in the pool.
- acceptCount. Controls the length of the queue of waiting requests when no more threads are available from the pool to process the request.
- connectionTimeout. Amount of time to wait before a URI is received from the stream. The default is 20 seconds. This avoids problems where a client opens a connection and does not send any data.
- tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer to be full. This reduces latency at the cost of more packets being sent over the network. The default is true.
- enableLookups. Determines whether a reverse DNS lookup is performed. This can be enabled to prevent IP spoofing. Enabling this parameter can cause problems when a DNS is misbehaving. The enableLookups parameter can be turned off when you implicitly trust all clients.
- connectionLinger. How long connections should linger after they are closed. Informatica recommends using the default value: -1 (no linger).

In the Data Analyzer application, each web page can potentially make more than one request to the application server. Hence, maxProcessors should always be more than the actual number of concurrent users. For an installation with 20 concurrent users, a minProcessors of 5 and a maxProcessors of 100 are suitable values. If the number of threads is too low, the following message may appear in the log files:

ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads

JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first time, Informatica ships Data Analyzer with pre-compiled JSPs. The following is a typical configuration:

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml

<servlet>
    <servlet-name>jsp</servlet-name>
    <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
    <init-param>
        <param-name>logVerbosityLevel</param-name>
        <param-value>WARNING</param-value>
        <param-name>development</param-name>
        <param-value>false</param-value>
    </init-param>
    <load-on-startup>3</load-on-startup>
</servlet>

The following parameter may need tuning:

- development. Set the development parameter to false in a production installation.

Database Connection Pool. Data Analyzer accesses the repository database to retrieve metadata information. When it runs reports, it accesses the data sources to get the report information. Data Analyzer keeps a pool of database connections for the repository. It also keeps a separate database connection pool for each data source. To optimize Data Analyzer database connections, you can tune the database connection pools.

Repository Database Connection Pool. To optimize the repository database connection pool, modify the JBoss configuration file:

<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml

The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or other databases. For example, for an Oracle repository, the configuration file name is oracle_ds.xml. With some versions of Data Analyzer, the configuration file may simply be named DataAnalyzer-ds.xml. The following is a typical configuration:


<datasources> <local-tx-datasource> <jndi-name>jdbc/IASDataSource</jndi-name> <connection-url> jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-url> <driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class> <user-name>powera</user-name> <password>powera</password> <exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter </exception-sorter-class-name> <min-pool-size>5</min-pool-size> <max-pool-size>50</max-pool-size> <blocking-timeout-millis>5000</blocking-timeout-millis> <idle-timeout-minutes>1500</idle-timeout-minutes> </local-tx-datasource> </datasources> The following parameters may need tuning:
- min-pool-size. The minimum number of connections in the pool. (The pool is lazily constructed; that is, it is empty until it is first accessed. Once used, it always has at least min-pool-size connections.)
- max-pool-size. The strict maximum size of the connection pool.
- blocking-timeout-millis. The maximum time in milliseconds that a caller waits to get a connection when no more free connections are available in the pool.
- idle-timeout-minutes. The length of time an idle connection remains in the pool before it is closed.

The max-pool-size value is recommended to be at least five more than the maximum number of concurrent users, because there may be several scheduled reports running in the background and each of them needs a database connection. A higher value is recommended for idle-timeout-minutes. Because Data Analyzer accesses the repository very frequently, it is inefficient to spend resources on checking for idle connections and cleaning them out. Checking for idle connections may also block other threads that require new connections.

Data Source Database Connection Pool. Similar to the repository database connection pool, each data source also has a pool of connections that Data Analyzer dynamically creates as soon as the first client requests a connection. The tuning parameters for these dynamic pools are present in the following file:

<JBOSS_HOME>/bin/IAS.properties

The following is a typical configuration:

#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20

The following JBoss-specific parameters may need tuning:
- dynapool.initialCapacity. The minimum number of initial connections in the data source pool.
- dynapool.maxCapacity. The maximum number of connections that the data source pool may grow to.
- dynapool.poolNamePrefix. A prefix added to the dynamic JDBC pool name for identification purposes.
- dynapool.waitSec. The maximum amount of time (in seconds) a client will wait to grab a connection from the pool if none is readily available.
- dynapool.refreshTestMinutes. Determines the frequency at which a health check is performed on the idle connections in the pool. This should not be performed too frequently because it locks up the connection pool and may prevent other clients from grabbing connections from the pool.
- dynapool.shrinkPeriodMins. Determines the amount of time (in minutes) an idle connection is allowed to be in the pool. After this period, the number of connections in the pool shrinks back to the value of its initialCapacity parameter. This is done only if the allowShrinking parameter is set to true.

EJB Container
Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time functionalities. Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune the EJB pool parameters in the following file: <JBOSS_HOME>/server/Informatica/conf/standardjboss.xml. The following is a typical configuration: <container-configuration> <container-name> Standard Stateless SessionBean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name> stateless-rmi-invoker</invoker-proxy-binding-name> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor> org.jboss.ejb.plugins.LogInterceptor</interceptor>

<interceptor> org.jboss.ejb.plugins.SecurityInterceptor</interceptor> <!-- CMT --> <interceptor transaction="Container"> org.jboss.ejb.plugins.TxInterceptorCMT</interceptor> <interceptor transaction="Container" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor transaction="Container"> org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor </interceptor> <!-- BMT --> <interceptor transaction="Bean"> org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor </interceptor> <interceptor transaction="Bean"> org.jboss.ejb.plugins.TxInterceptorBMT</interceptor> <interceptor transaction="Bean" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor> org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> </container-interceptors> <instance-pool> org.jboss.ejb.plugins.StatelessSessionInstancePool</instance-pool> <instance-cache></instance-cache> <persistence-manager></persistence-manager> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> </container-configuration> The following parameter may need tuning:
- MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. If <strictMaximumSize> is set to false, the number of active objects can exceed <MaximumSize> if there are requests for more objects. However, only the <MaximumSize> number of objects can be returned to the pool.

Additionally, there are two other parameters that you can set to fine-tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.

- strictMaximumSize. When the value is set to true, <strictMaximumSize> enforces a rule that only <MaximumSize> number of objects can be active. Any subsequent requests must wait for an object to be returned to the pool.
- strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests wait for an object to be made available in the pool.

Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning parameters. The main difference is that MDBs are not invoked by clients. Instead, the messaging system delivers messages to the MDB when they are available.


To tune the MDB parameters, modify the following configuration file: <JBOSS_HOME>/server/informatica/conf/standardjboss.xml The following is a typical configuration: <container-configuration> <container-name>Standard Message Driven Bean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name>message-driven-bean </invoker-proxy-binding-name> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor> <interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor </interceptor> <!-- CMT --> <interceptor transaction="Container"> org.jboss.ejb.plugins.TxInterceptorCMT</interceptor> <interceptor transaction="Container" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor </interceptor> <interceptor transaction="Container"> org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor </interceptor> <!-- BMT --> <interceptor transaction="Bean"> org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor </interceptor> <interceptor transaction="Bean"> org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT </interceptor> <interceptor transaction="Bean" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor> org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> </container-interceptors> <instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool </instance-pool> <instance-cache></instance-cache> <persistence-manager></persistence-manager> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> </container-configuration> The following parameter may need tuning: MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if <strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are request for more objects. However, only the <MaximumSize> number of objects can be returned to the pool. Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data
Analyzer to increase the throughput for high-concurrency installations.


strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> number of objects will be active. Any subsequent requests must wait for an object to be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests wait for an object to be made available in the pool. Enterprise Java Beans (EJB). Data Analyzer EJBs use BMP (bean-managed persistence) as opposed to CMP (container-managed persistence). The EJB tuning parameters are very similar to the stateless bean tuning parameters. The EJB tuning parameters are in the following configuration file: <JBOSS_HOME>/server/informatica/conf/standardjboss.xml. The following is a typical configuration: <container-configuration> <container-name>Standard BMP EntityBean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name>entity-rmi-invoker </invoker-proxy-binding-name> <sync-on-commit-only>false</sync-on-commit-only> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor> <interceptor>org.jboss.ejb.plugins.SecurityInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.TxInterceptorCMT </interceptor> <interceptor metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.EntityLockInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor </interceptor> <interceptor> org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> <interceptor> org.jboss.ejb.plugins.EntitySynchronizationInterceptor </interceptor> </container-interceptors> <instance-pool>org.jboss.ejb.plugins.EntityInstancePool </instance-pool> <instance-cache>org.jboss.ejb.plugins.EntityInstanceCache </instance-cache> <persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager
</persistence-manager> <locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock </locking-policy> <container-cache-conf> <cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy </cache-policy> <cache-policy-conf> <min-capacity>50</min-capacity> <max-capacity>1000000</max-capacity> <overager-period>300</overager-period> <max-bean-age>600</max-bean-age> <resizer-period>400</resizer-period> <max-cache-miss-period>60</max-cache-miss-period> <min-cache-miss-period>1</min-cache-miss-period> <cache-load-factor>0.75</cache-load-factor> </cache-policy-conf> </container-cache-conf> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> <commit-option>A</commit-option> </container-configuration> The following parameter may need tuning: MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if <strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are request for more objects. However, only the <MaximumSize> number of objects are returned to the pool. Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.
strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> number of objects can be active. Any subsequent requests must wait for an object to be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests will wait for an object to be made available in the pool.

RMI Pool
The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for remote method invocation (RMI). If you use the Java RMI protocol to access the Data Analyzer API from other custom applications, you can optimize the RMI thread pool parameters. To optimize the RMI pool, modify the following configuration file: <JBOSS_HOME>/server/informatica/conf/jboss-service.xml The following is a typical configuration:


<mbeancode="org.jboss.invocation.pooled.server.PooledInvoker"name="jboss:service=invoker, type=pooled"> <attribute name="NumAcceptThreads">1</attribute> <attribute name="MaxPoolSize">300</attribute> <attribute name="ClientMaxPoolSize">300</attribute> <attribute name="SocketTimeout">60000</attribute> <attribute name="ServerBindAddress"></attribute> <attribute name="ServerBindPort">0</attribute> <attribute name="ClientConnectAddress"></attribute> <attribute name="ClientConnectPort">0</attribute> <attribute name="EnableTcpNoDelay">false</attribute> <depends optional-attribute-name="TransactionManagerService"> jboss:service=TransactionManager </depends> </mbean> The following parameters may need tuning:
- NumAcceptThreads. The controlling threads used to accept connections from the client.
- MaxPoolSize. A strict maximum size for the pool of threads to service requests on the server.
- ClientMaxPoolSize. A strict maximum size for the pool of threads to service requests on the client.
- Backlog. The number of requests in the queue when all the processing threads are in use.
- EnableTcpNoDelay. Indicates whether information should be sent before the buffer is full. Setting it to true may increase the network traffic because more packets will be sent across the network.

WebSphere Application Server 5.1

The Tivoli Performance Viewer can be used to observe the behavior of some of the parameters and arrive at good settings.

Web Container
Navigate to Application Servers > [your_server_instance] > Web Container > Thread Pool to tune the following parameters.
- Minimum Size: Specifies the minimum number of threads to allow in the pool. The default value of 10 is appropriate.
- Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent usage scenario (with a 3-VM load-balanced configuration), a value of 50-60 has been determined to be optimal.
- Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that should elapse before a thread is reclaimed. The default of 3500 ms is considered optimal.
- Is Growable: Specifies whether the number of threads can increase beyond the maximum size configured for the thread pool. Be sure to leave this option unchecked. Also, the maximum threads should be hard-limited to the value given in Maximum Size.

Note: In a load-balanced environment, there is likely to be more than one server instance, and these may be spread across multiple machines. In such a scenario, be sure that the changes have been properly propagated to all of the server instances.
Transaction Services
Total transaction lifetime timeout: In certain circumstances (e.g., import of large XML files), the default value of 120 seconds may not be sufficient and should be increased. This parameter can be modified during runtime also.

Diagnostic Trace Services


- Disable the trace in a production environment.
- Navigate to Application Servers > [your_server_instance] > Administration Services > Diagnostic Trace Service and make sure Enable Tracing is not checked.

Debugging Services
Ensure that the tracing is disabled in a production environment. Navigate to Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service > Debugging Service and make sure Startup is not checked.

Performance Monitoring Services


This set of parameters is for monitoring the health of the Application Server. This monitoring service tries to ping the application server after a certain interval; if the server is found to be dead, it then tries to restart the server. Navigate to Application Servers > [your_server_instance] > Process Definition > MonitoringPolicy and tune the parameters according to a policy determined for each Data Analyzer installation. Note: The parameter Ping Timeout determines the time after which a no-response from the server implies that it is faulty. The monitoring service then attempts to kill the server and restart it if Automatic restart is checked. Take care that Ping Timeout is not set to too small a value.

Process Definitions (JVM Parameters)


For a Data Analyzer installation with a high number of concurrent users, Informatica recommends that the minimum and the maximum heap size be set to the same values. This avoids the heap allocation-reallocation expense during a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica recommends setting the values of minimum heap and maximum heap size to at least 1000MB. Further tuning of this heap-size is recommended after carefully studying the garbage collection behavior by turning on the verbosegc option. The following is a list of java parameters (for IBM JVM 1.4.1) that should not be modified from the default values for Data Analyzer installation:
- -Xnocompactgc. This parameter switches off heap compaction altogether. Switching off heap compaction results in heap fragmentation. Since Data Analyzer frequently allocates large objects, heap fragmentation can result in OutOfMemory exceptions.
- -Xcompactgc. Using this parameter leads to each garbage collection cycle carrying out compaction, regardless of whether it is useful.
- -Xgcthreads. This controls the number of garbage collection helper threads created by the JVM during startup. The default is N-1 threads for an N-processor machine. These threads provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause time during garbage collection.
- -Xclassnogc. This disables collection of class objects.
- -Xinitsh. This sets the initial size of the application-class system heap. The system heap is expanded as needed and is never garbage collected.

You may want to alter the following parameters after carefully examining the application server processes:
Navigate to Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine.

- Verbose garbage collection. Check this option to turn on verbose garbage collection. This can help in understanding the behavior of garbage collection for the application. It has a very low overhead on performance and can be turned on even in the production environment.
- Initial heap size. This is the ms value. Only the numeric value (without MB) needs to be specified. For concurrent usage, the initial heap size should be started at 1000 and, depending on the garbage collection behavior, can potentially be increased up to 2000. A value beyond 2000 may actually reduce throughput because the garbage collection cycles will take more time to go through the large heap, even though the cycles may occur less frequently.
- Maximum heap size. This is the mx value. It should be equal to the Initial heap size value.
- RunHProf. This should remain unchecked in production mode, because it slows down the VM considerably.
- Debug Mode. This should remain unchecked in production mode, because it slows down the VM considerably.
- Disable JIT. This should remain unchecked (i.e., JIT should never be disabled).

Performance Monitoring Services


Be sure that performance monitoring services are not enabled in a production environment. Navigate to Application Servers > [your_server_instance] > Performance Monitoring Services and be sure Startup is not checked.

Database Connection Pool


The repository database connection pool can be configured by navigating to JDBC Providers > User-defined JDBC Provider > Data Sources > IASDataSource > Connection Pools The various parameters that may need tuning are:
- Connection Timeout. The default value of 180 seconds should be good. This implies that after 180 seconds, the request to grab a connection from the pool times out. After it times out, Data Analyzer throws an exception; in that case, the pool size may need to be increased.
- Max Connections. The maximum number of connections in the pool. Informatica recommends a value of 50.
- Min Connections. The minimum number of connections in the pool. Informatica recommends a value of 10.
- Reap Time. This specifies the frequency of the pool maintenance thread. It should not run too frequently, because while the pool maintenance thread is running it blocks the whole pool and no process can grab a new connection from the pool. If the database and the network are reliable, this should have a very high value (e.g., 1000).
- Unused Timeout. This specifies the time in seconds after which an unused connection is discarded, until the pool size reaches the minimum size. In highly concurrent usage, this should be a high value. The default of 1800 seconds should be fine.
- Aged Timeout. Specifies the interval in seconds before a physical connection is discarded. If the database and the network are stable, there should not be a reason for age timeout. The default is 0 (i.e., connections do not age). If the database or the network connection to the repository database frequently comes down (compared to the life of the AppServer), this can be used to age out stale connections.

Much like the repository database connection pools, the data source or data warehouse databases also have a pool of connections that are created dynamically by Data Analyzer as soon as the first client makes a request. The tuning parameters for these dynamic pools are present in <WebSphere_Home>/AppServer/IAS.properties file. The following is a typical configuration:.

#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20


The various parameters that may need tuning are:


- dynapool.initialCapacity - the minimum number of initial connections in the data-source pool.
- dynapool.maxCapacity - the maximum number of connections that the data-source pool may grow up to.
- dynapool.poolNamePrefix - a prefix added to the dynamic JDBC pool name for identification purposes.
- dynapool.waitSec - the maximum amount of time (in seconds) that a client will wait to grab a connection from the pool if none is readily available.
- dynapool.refreshTestMinutes - determines the frequency at which a health check on the idle connections in the pool is performed. Such checks should not be performed too frequently because they lock up the connection pool and may prevent other clients from grabbing connections from the pool.
- dynapool.shrinkPeriodMins - determines the amount of time (in minutes) an idle connection is allowed to be in the pool. After this period, the number of connections in the pool decreases (to its initialCapacity). This is done only if allowShrinking is set to true.

Message Listeners Services


To process scheduled reports, Data Analyzer uses Message-Driven-Beans. It is possible to run multiple reports within one schedule in parallel by increasing the number of instances of the MDB catering to the Scheduler (InfScheduleMDB). Take care however, not to increase the value to some arbitrarily high value since each report consumes considerable resources (e.g., database connections, and CPU processing at both the application-server and database server levels) and setting this to a very high value may actually be detrimental to the whole system. Navigate to Application Servers > [your_server_instance] > Message Listener Service > Listener Ports > IAS_ScheduleMDB_ListenerPort . The parameters that can be tuned are:
- Maximum sessions. The default value is one. In a highly concurrent user scenario, Informatica does not recommend going beyond five.
- Maximum messages. This should remain as one. This implies that each report in a schedule is executed in a separate transaction instead of a batch. Setting it to more than one may have unwanted effects, such as transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.

Plug-in Retry Intervals and Connect Timeouts


When Data Analyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform the load-balancing between the servers in the cluster. The proxy HTTP server sends the request to the plug-in, and the plug-in then routes the request to the proper application server. The plug-in file can be generated automatically by navigating to Environment > Update web server plugin configuration. The default plug-in file contains ConnectTimeout=0, which means that it relies on the TCP timeout setting of the server. It is possible to have different timeout settings for different servers in the cluster. The timeout setting implies that if the server does not respond within the given number of seconds, it is marked as down and the request is sent over to the next available member of the cluster. The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as down. The default value is 10 seconds. This means that if a cluster member is marked as down, the server does not try to send a request to the same member for 10 seconds.
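As an illustrative fragment only (the cluster, server, host, and port values below are placeholders, and the plugin-cfg.xml generated for your environment should be taken as the authority), these two parameters typically appear in the plug-in file as attributes of the ServerCluster and Server elements:

<ServerCluster Name="DataAnalyzerCluster" RetryInterval="10">
    <Server Name="node1_server1" ConnectTimeout="10">
        <Transport Hostname="appserver1" Port="9080" Protocol="http"/>
    </Server>
    <Server Name="node2_server1" ConnectTimeout="10">
        <Transport Hostname="appserver2" Port="9080" Protocol="http"/>
    </Server>
</ServerCluster>

A non-zero ConnectTimeout lets the plug-in mark an unresponsive member as down after the specified number of seconds instead of waiting for the operating system's TCP timeout.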


Tuning Mappings for Better Performance

Challenge


In general, mapping-level optimization takes time to implement, but it can significantly boost performance. Sometimes the mapping is the biggest bottleneck in the load process because business rules determine the number and complexity of transformations in a mapping. Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic issues. Tuning mappings is a grouped approach: the first group of techniques can be of assistance almost universally, bringing about a performance increase in all scenarios; the second group of tuning processes may yield only a small performance increase, or may be of significant value, depending on the situation. Some factors to consider when choosing tuning processes at the mapping level include the specific environment, software/hardware limitations, and the number of rows going through a mapping. This Best Practice offers some guidelines for tuning mappings.

Description
Analyze mappings for tuning only after you have tuned the target and source for peak performance. To optimize mappings, you generally reduce the number of transformations in the mapping and delete unnecessary links between transformations. For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations), limit connected input/output or output ports. Doing so can reduce the amount of data the transformations store in the data cache. Having too many Lookups and Aggregators can encumber performance because each requires index cache and data cache. Since both are fighting for memory space, decreasing the number of these transformations in a mapping can help improve speed. Splitting them up into different mappings is another option. Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on the cache directory. Unless the seek/access time is fast on the directory itself, having too many Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.

Consider Single-Pass Reading


If several mappings use the same data source, consider a single-pass reading. If you have several sessions that use the same sources, consolidate the separate mappings with either a single Source Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the separate data flows. Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that function is called in the session. For example, if you need to subtract percentage from the PRICE ports for both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage before splitting the pipeline.


Optimize SQL Overrides


When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned depends on the underlying source or target database system. See Tuning SQL Overrides and Environment for Better Performance for more information .

Scrutinize Datatype Conversions


PowerCenter Server automatically makes conversions between compatible datatypes. When these conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary. In some instances however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.

Eliminate Transformation Errors


Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During transformation errors, the PowerCenter Server engine pauses to determine the cause of the error, removes the row causing the error from the data flow, and logs the error in the session log. Transformation errors can be caused by many things including: conversion errors, conflicting mapping logic, any condition that is specifically set up as an error, and so on. The session log can help point out the cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for these transformations. If you need to run a session that generates a large number of transformation errors, you might improve performance by setting a lower tracing level. However, this is not a long-term response to transformation errors. Any source of errors should be traced and eliminated.

Optimize Lookup Transformations


There are a several ways to optimize lookup transformations that are set up in a mapping.

When to Cache Lookups


Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis. Note: All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are sufficient to ensure that caches will not page to disks. Information regarding memory and cache sizing for Lookup transformations are covered in the Best Practice: Tuning Sessions for Better Performance. A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard to the number of rows expected to be processed. For example, consider the following example.

In Mapping X, the source and lookup contain the following number of records:

ITEMS (source): 5000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100000 records

Number of Disk Reads

                                Cached Lookup    Un-cached Lookup
LKP_Manufacturer
  Build Cache                         200                  0
  Read Source Records                5000               5000
  Execute Lookup                        0               5000
  Total # of Disk Reads              5200              10000
LKP_DIM_ITEMS
  Build Cache                      100000                  0
  Read Source Records                5000               5000
  Execute Lookup                        0               5000
  Total # of Disk Reads            105000              10000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a total of 5200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it will take a total of 10,000 total disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed. So this lookup should be cached. This is the more likely scenario. Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk reads would total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup will be executed. Thus, the lookup should not be cached.

Use the following eight-step method to determine if a lookup should be cached:

1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a "where" clause on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the break-even row point:

(LS*NRS*CRS)/(CRS-NRS) = X

where X is the break-even point. If your expected number of source records is less than X, it is better not to cache the lookup. If your expected number of source records is more than X, it is better to cache the lookup.

For example: Assume the lookup takes 166 seconds to cache (LS=166). Assume with a cached lookup the load is 232 rows per second (CRS=232). Assume with a non-cached lookup the load is 147 rows per second (NRS=147). The formula would result in: (166*147*232)/(232-147) = 66,603. Thus, if the source has less than 66,603 records, the lookup should not be cached. If it has more than 66,603 records, then the lookup should be cached.

Sharing Lookup Caches


There are a number of methods for sharing lookup caches:
- Within a specific session run for a mapping, if the same lookup is used multiple times in a mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup. Using the same lookup multiple times in the mapping will be more resource intensive with each successive instance. If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to set up the multiple lookups to bring back the same columns even though not all return ports are used in all lookups. Bringing back a common set of columns may reduce the number of disk reads.
- Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of the database. This feature should only be used when the lookup table is not expected to change between session runs.
- Across different mappings and sessions, the use of a named persistent cache allows sharing an existing cache file.

Reducing the Number of Cached Rows


There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the WHERE clause to reduce the set of records included in the resulting cache. Note: If you use a SQL override in a lookup, the lookup must be cached.
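For example, if only current MANUFACTURER rows are ever matched, the override might restrict the cache as in the following sketch. The column names and the ACTIVE_FLAG filter are assumptions for illustration only; preserve whatever ORDER BY handling the generated lookup query requires when overriding:

SELECT MANUFACTURER_ID, MANUFACTURER_NAME
FROM MANUFACTURER
WHERE ACTIVE_FLAG = 'Y'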

Optimizing the Lookup Condition


In the case where a lookup uses more than one lookup condition, list the conditions that use an equal sign (=) first in order to optimize lookup performance.

Indexing the Lookup Table


The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a result, indexes on the database table should include every column used in a lookup condition. This can improve performance for both cached and un-cached lookups.
- In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log contains the ORDER BY statement.
- In the case of an un-cached lookup, since a SQL statement is created for each row passing into the Lookup transformation, performance can be helped by indexing columns in the lookup condition (a sample index statement follows this list).
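As a minimal sketch, if the lookup condition on the DIM_ITEMS table used columns ITEM_ID and SUPPLIER_ID (the column and index names are assumptions for illustration), an index covering those columns would support both the cache-build ORDER BY and the per-row queries of an un-cached lookup:

CREATE INDEX IDX_DIM_ITEMS_LKP ON DIM_ITEMS (ITEM_ID, SUPPLIER_ID)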

Use a Persistent Lookup Cache for Static Lookups


If the lookup source does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session, eliminating the time required to read the lookup source.

Optimize Filter and Router Transformations


Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve performance. Avoid complex expressions when creating the filter condition. Filter transformations are most effective when a simple integer or TRUE/FALSE expression is used in the filter condition. Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if rejected rows do not need to be saved.
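For example, rather than removing inactive rows in the middle of the mapping, a simple predicate can be placed in the Source Qualifier's Source Filter attribute. The column name below is a hypothetical illustration:

ACCOUNT_STATUS = 'ACTIVE'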

Replace multiple filter transformations with a router transformation. This reduces the number of transformations in the mapping and makes the mapping easier to follow.

Optimize Aggregator Transformations


Aggregator transformations often slow performance because they must group data before processing it. Use simple columns in the group by condition to make the Aggregator transformation more efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.

Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is sorted by group and, as a group is passed through an Aggregator, calculations can be performed and information passed on to the next transformation. Without sorted input, the Server must wait for all rows of data before processing aggregate calculations. Use of the Sorted Input option is usually accompanied by a Source Qualifier that uses the Number of Sorted Ports option.

Use an Expression and an Update Strategy instead of an Aggregator transformation. This technique can only be used if the source data can be sorted, and it assumes the mapping would otherwise use an Aggregator with the Sorted Input option. In the Expression transformation, variable ports are required to hold data from the previous row of data processed. The premise is to use the previous row of data to determine whether the current row is part of the current group or is the beginning of a new group. If the row is part of the current group, its data is used to continue calculating the current group function. An Update Strategy transformation then follows the Expression transformation and sets the first row of a new group to insert and the following rows to update.

Use incremental aggregation if you can capture changes from the source that affect less than half the target. When you use incremental aggregation, you apply the captured source changes to the aggregate calculations in a session. The PowerCenter Server updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.

Joiner Transformation Joining Data from the Same Source


You can join data from the same source in the following ways:
- Join two branches of the same pipeline.
- Create two instances of the same source and join pipelines from these source instances.

You may want to join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data. When you join the data using this method, you can maintain the original data and transform parts of that data within one mapping. When you join data from the same source, you can create two branches of the pipeline. When you branch
a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input. If you want to join unsorted data, you must create two instances of the same source and join the pipelines. For example, you may have a source with the following ports:
- Employee
- Department
- Total Sales

In the target table, you want to view the employees who generated sales that were greater than the average sales for their respective departments. To accomplish this, you create a mapping with the following transformations:
- Sorter transformation. Sort the data.
- Sorted Aggregator transformation. Average the sales data and group by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.
- Sorted Joiner transformation. Use a sorted Joiner transformation to join the sorted aggregated data with the original data.
- Filter transformation. Compare each employee's sales against the average sales for the department and filter out employees whose sales do not exceed the average.

Note: You can also join data from output groups of the same transformation, such as the Custom transformation or XML Source Qualifier transformations. Place a Sorter transformation between each output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.

Joining two branches can affect performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.

You can also join same source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances. Note: When you join data using this method, the PowerCenter Server reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:
- Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.
- Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
- Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
- Join two instances of a source if one pipeline may process much more slowly than the other pipeline.

Performance Tips
- Use the database to do the join when sourcing data from the same database schema. Database systems usually can perform the join more quickly than the PowerCenter Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema.
- Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of data is also smaller.
- Join sorted data when possible. You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.
- For an unsorted Joiner transformation, designate the source with fewer rows as the master source. For optimal performance and disk storage, designate the master source as the source with the fewer rows. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.
- For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master source. For optimal performance and disk storage, designate the master source as the source with fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the PowerCenter Server must cache more rows, and performance can be slowed.
- Optimize sorted Joiner transformations with partitions. When you use partitions with a sorted Joiner transformation, you may optimize performance by grouping data and using n:n partitions.

Add a hash auto-keys partition upstream of the sort origin


To obtain expected results and get best performance when partitioning a sorted Joiner transformation, you must group and sort data. To group data, ensure that rows with the same key value are routed to the same partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort the data ensures that you maintain grouping and sort the data within each group.

Use n:n partitions


You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not need to cache all of the master data. This reduces memory usage and speeds processing. When you use 1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache
to disk if the memory cache fills. When the Joiner transformation receives the data from the detail pipeline, it must then read the data from disk to compare the master and detail pipelines.

Optimize Sequence Generator Transformations


Sequence Generator transformations need to determine the next available sequence number; thus, increasing the Number of Cached Values property can increase performance. This property determines the number of values the PowerCenter Server caches at one time. If it is set to cache no values, the PowerCenter Server must query the repository each time to determine the next number to be used. Consider configuring the Number of Cached Values to a value greater than 1000. Note that any cached values not used in the course of a session are lost, since the sequence value stored in the repository is advanced to the end of the cached block; the next request for values begins after the discarded block.

Avoid External Procedure Transformations


For the most part, making calls to external procedures slows a session. If possible, avoid the use of these Transformations, which include Stored Procedures, External Procedures, and Advanced External Procedures.

Field-Level Transformation Optimization


As a final step in the tuning process, you can tune expressions used in transformations. When examining expressions, focus on complex expressions and try to simplify them when possible. To help isolate slow expressions, do the following:

1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.

Processing field-level transformations takes time. If the transformation expressions are complex, then processing is even slower. It's often possible to get a 10 to 20 percent performance improvement by optimizing complex field-level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.

Factoring Out Common Logic


Factoring out common logic can reduce the number of times a mapping performs the same logic. If a mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the logic to be performed just once. For example, a mapping has five target tables. Each target requires a Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup to a position before the data flow splits.

Minimize Function Calls



Any time a function is called it takes resources to process. There are several common examples where function calls can be reduced or eliminated.

Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the PowerCenter Server must search and group the data. Thus, the following expression:

SUM(Column A) + SUM(Column B)

can be optimized to:

SUM(Column A + Column B)

In general, operators are faster than functions, so operators should be used whenever possible. For example, an expression that involves a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion. For example:

IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))

can be optimized to:

IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)

The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in three IIFs, three comparisons, and two additions.


Be creative in making expressions more efficient. The following is an example of a rework that reduces three comparisons to one:

IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')

(Note that this rewrite is only equivalent when X is known to stay within a range, such as 1 through 12, where the two tests agree.)

Calculate Once, Use Many Times


Avoid calculating or testing the same value multiple times. If the same sub-expression is used several times in a transformation, consider making the sub-expression a local variable. The local variable can be used only within the transformation in which it was created. Calculating the variable only once and then referencing the variable in following sub-expressions improves performance.

Choose Numeric vs. String Operations


The PowerCenter Server processes numeric operations faster than string operations. For example, if a lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.

Optimizing Char-Char and Char-Varchar Comparisons


When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. To resolve this, enable the Treat CHAR as CHAR On Read option in the PowerCenter Server setup so that the Server does not trim trailing spaces from the end of CHAR source fields.

Use DECODE Instead of LOOKUP


When a LOOKUP function is used, the PowerCenter Server must query a lookup table in the database. When a DECODE function is used, the lookup values are incorporated into the expression itself, so the Server does not need to query a separate table. Thus, when looking up a small set of unchanging values, using DECODE may improve performance.

Reduce the Number of Transformations in a Mapping


Because there is always overhead involved in moving data among transformations, try, whenever possible, to reduce the number of transformations. Also, remove unnecessary links between transformations to minimize the amount of data moved. This is especially important for data being pulled from the Source Qualifier transformation.

Use Pre- and Post-Session SQL Commands


You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier transformation and in the Properties tab of the target instance in a mapping. To increase the load speed, use these commands to drop indexes on the target before the session runs, then recreate them when the
session completes. Apply the following guidelines when using SQL statements:
- You can use any command that is valid for the database type. However, the PowerCenter Server does not allow nested comments, even though the database may.
- You can use mapping parameters and variables in SQL executed against the source, but not against the target.
- Use a semi-colon (;) to separate multiple statements. The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ... */. If you need to use a semi-colon outside of quotes or comments, you can escape it with a backslash (\).
- The Workflow Manager does not validate the SQL.
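For the drop/recreate index technique described above, a sketch might look like the following. The index, table, and column names are hypothetical placeholders:

Pre-session SQL (target):
DROP INDEX IDX_SALES_FACT_CUST

Post-session SQL (target):
CREATE INDEX IDX_SALES_FACT_CUST ON SALES_FACT (CUST_ID)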

Use Environmental SQL


For relational databases, you can execute SQL commands in the database environment when connecting to the database. You can use this for source, target, lookup, and stored procedure connections. For instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the guidelines listed above for using the SQL statements.
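For example, the connection environment SQL for an Oracle connection might set the session isolation level. This is a minimal sketch; the appropriate level depends on your environment and database:

ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED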

Use Local Variables


You can use local variables in Aggregator, Expression, and Rank transformations.

Temporarily Store Data and Simplify Complex Expressions


Rather than parsing and validating the same expression each time, you can define these components as variables. This also allows you to simplify complex expressions. For example, the following expressions:

AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND ( OFFICE_ID = 1000 ) ) )
SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND ( OFFICE_ID = 1000 ) ) )

can use variables to simplify complex expressions and temporarily store data:

Port            Value
V_CONDITION1    JOB_STATUS = 'Full-time'
V_CONDITION2    OFFICE_ID = 1000
AVG_SALARY      AVG( SALARY, V_CONDITION1 AND V_CONDITION2 )
SUM_SALARY      SUM( SALARY, V_CONDITION1 AND V_CONDITION2 )


Store Values Across Rows


You can use variables to store data from prior rows. This can help you perform procedural calculations. To compare the previous state to the state just read:

IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )

Capture Values from Stored Procedures


Variables also provide a way to capture multiple columns of return values from stored procedures.

Last updated: 13-Feb-07 17:43


Tuning Sessions for Better Performance

Challenge


Running sessions is where the pedal hits the metal. A common misconception is that this is the area where most tuning should occur. While it is true that various specific session options can be modified to improve performance, PowerCenter 8 comes with PowerCenter Enterprise Grid Option and Pushdown optimizations that also improve performance tremendously.

Description
Once you optimize the source and target databases and the mapping, you can focus on optimizing the session. The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter and Lookup transformations (with caching enabled) use caches. The PowerCenter Server uses index and data caches for each of these transformations. If the allocated data or index cache is not large enough to store the data, the PowerCenter Server stores the data in a temporary disk file as it processes the session data. Each time the PowerCenter Server pages to the temporary file, performance slows.

You can see when the PowerCenter Server pages to the temporary file by examining the performance details. The transformation_readfromdisk or transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or Joiner transformation indicate the number of times the PowerCenter Server must page to disk to process the transformation.

Index and data caches should both be sized according to the requirements of the individual lookup. The sizing can be done using the estimation tools provided in the Transformation Guide, or through observation of actual cache sizes in the session caching directory.

The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM [type of transformation] [generated session instance id number]_[transformation instance id number]_[partition index].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. A RAID 0 arrangement that gives maximum performance with no redundancy is
recommended for volatile cache file directories (i.e., no persistent caches). If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, the RAM allocated needs to be available on the server. If the server doesn't have available RAM and uses paged memory, your session is again accessing the hard disk. In this case, it is more efficient to allow PowerCenter to page the data rather than the operating system. Adding additional memory to the server is, of course, the best solution. Refer to Session Caches in the Workflow Administration Guide for detailed information on determining cache sizes.

The PowerCenter Server writes to the index and data cache files during a session in the following cases:
- The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.
- The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the PowerCenter Server runs the session for the first time.
- The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.
- The Data Transformation Manager (DTM) process in a session runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.

When a session is running, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.

Configuring Automatic Memory Settings


PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you can configure the Integration Service to automatically calculate cache memory settings at run time. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session
caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches. The values stored in the data and index caches depend upon the requirements of the transformation. For example, the Aggregator index cache stores group values as configured in the group by ports, and the data cache stores calculations based on the group by ports. When the Integration Service processes a Sorter transformation or writes data to an XML target, it also creates a cache.

Configuring Session Cache Memory


The Integration Service can determine cache memory requirements for the Lookup, Aggregator, Rank, Joiner, and Sorter transformations, as well as XML targets. You can configure Auto for the index and data cache sizes in the transformation properties or on the Mapping tab of the session properties.

Max Memory Limits


Configuring maximum memory limits allows you to ensure that you reserve a designated amount or percentage of memory for other processes. You can configure the memory limit as a numeric value and as a percent of total memory. Because available memory varies, the Integration Service bases the percentage value on the total memory on the Integration Service process machine.

For example, you configure automatic caching for three Lookup transformations in a session. Then, you configure a maximum memory limit of 500MB for the session. When you run the session, the Integration Service divides the 500MB of allocated memory among the index and data caches for the Lookup transformations. When you configure a maximum memory value, the Integration Service divides memory among transformation caches based on the transformation type. When you configure both a numeric value and a percent, the Integration Service compares the values and uses the lower value as the maximum memory limit.

When you configure automatic memory settings, the Integration Service specifies a minimum memory allocation for the index and data caches. The Integration Service allocates 1,000,000 bytes to the index cache and 2,000,000 bytes to the data cache for each transformation instance. If you configure a maximum memory limit that is less than the minimum value for an index or data cache, the Integration Service overrides this value. For example, if you configure a maximum memory value of 500 bytes for
a session containing a Lookup transformation, the Integration Service disables the automatic memory settings and uses the default values.

When you run a session on a grid and you configure Maximum Memory Allowed for Auto Memory Attributes, the Integration Service divides the allocated memory among all the nodes in the grid. When you configure Maximum Percentage of Total Memory Allowed for Auto Memory Attributes, the Integration Service allocates the specified percentage of memory on each node in the grid.

Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:
- Allocate enough space to hold at least one row in each aggregate group. Remember that you only need to configure cache memory for an Aggregator transformation that does not use sorted ports. The PowerCenter Server uses Session Process memory to process an Aggregator transformation with sorted ports, not cache memory.
- Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run.
- When configuring the Aggregator data cache size, remember that the data cache holds row data for variable ports and connected output ports only. As a result, the data cache is generally larger than the index cache. To reduce the data cache size, connect only the necessary output ports to subsequent transformations.

Joiner Caches
When a session is run with a Joiner transformation, the PowerCenter Server reads from master and detail sources concurrently and builds index and data caches based on the master rows. The PowerCenter Server then performs the join based on the detail source data and the cache data. The number of rows the PowerCenter Server stores in the cache depends on the

partitioning scheme, the data in the master source, and whether or not you use sorted input. After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins. The PowerCenter Server uses the index cache to test the join condition. When it finds source data and cache data that match, it retrieves row values from the data cache.

Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.
- Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session with a persistent cache lookup is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, you must be sure to set the Recache from Database option to ensure that the lookup cache files are rebuilt. You can also delete the cache files before the session run to force the session to rebuild the caches.
- Lookup caching should be enabled for relatively small tables. Refer to the Best Practice Tuning Mappings for Better Performance to determine when lookups should be cached.
- When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the lookup query and processing is the same, regardless of whether the lookup table is cached or not. However, when the transformation is configured not to cache, the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can usually increase session performance.
- Just as for a Joiner, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.

Allocating Buffer Memory


The Integration Service can determine the memory requirements for the buffer memory:
- DTM Buffer Size
- Default Buffer Block Size

You can also configure DTM buffer size and the default buffer block size in the session properties. When the PowerCenter Server initializes a session, it allocates blocks of
memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks. To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer size and/or the buffer block size based on the default settings, to create the required number of session blocks. If there are XML sources or targets in the mappings, use the number of groups in the XML source or target in the total calculation for the total number of sources and targets.

Increasing the DTM Buffer Pool Size


The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the Server. When the DTM buffer memory is increased, the PowerCenter Server creates more buffer blocks, which can improve performance during momentary slowdowns.

If a session's performance details show low numbers for your source and target BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer pool size may improve performance. Increasing the DTM buffer memory allocation generally causes performance to improve initially and then level off. (Conversely, it may have no impact at all on source- or target-bottlenecked sessions and may not have an impact on DTM-bottlenecked sessions.)

When the DTM buffer memory allocation is increased, you need to evaluate the total memory available on the PowerCenter Server. If a session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions or batches must not exceed the total memory for the PowerCenter Server system. You can increase the DTM buffer size in the Performance settings of the Properties tab.

Running Workflows and Sessions Concurrently


The PowerCenter Server can process multiple sessions in parallel and can also process multiple partitions of a pipeline within a session. If you have a symmetric multiprocessing (SMP) platform, you can use multiple CPUs to concurrently process session data or partitions of data. This provides improved performance since true parallelism is achieved. On a single processor platform, these tasks share the CPU, so there is no parallelism.


To achieve better performance, you can create a workflow that runs several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available.

Partitioning Sessions
Performance can be improved by processing data in parallel in a single session by creating multiple partitions of the pipeline. If you have PowerCenter partitioning available, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the PowerCenter Server to create multiple connections to sources and process partitions of source data concurrently. When you create or edit a session, you can change the partitioning information for each pipeline in a mapping. If the mapping contains multiple pipelines, you can specify multiple partitions in some pipelines and single partitions in others. Keep the following attributes in mind when specifying partitioning information for a pipeline:
- Location of partition points. The PowerCenter Server sets partition points at several transformations in a pipeline by default. If you have PowerCenter partitioning available, you can define other partition points. Select those transformations where you think redistributing the rows in a different way is likely to increase the performance considerably.
- Number of partitions. By default, the PowerCenter Server sets the number of partitions to one. You can generally define up to 64 partitions at any partition point. When you increase the number of partitions, you increase the number of processing threads, which can improve session performance. Increasing the number of partitions or partition points also increases the load on the server. If the server contains ample CPU bandwidth, processing rows of data in a session concurrently can increase session performance. However, if you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system. You can also overload source and target systems, so that is another consideration.
- Partition types. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

1. Round-robin partitioning. PowerCenter distributes rows of data evenly to all partitions. Each partition processes approximately the same number of rows. In a pipeline that reads data from file sources of different sizes, you can use round-robin partitioning to ensure that each partition receives approximately the same number of rows.
2. Hash keys. The PowerCenter Server uses a hash function to group rows of data among partitions. The Server groups the data based on a partition key. There are two types of hash partitioning:
   - Hash auto-keys. The PowerCenter Server uses all grouped or sorted ports as a compound partition key. You can use hash auto-keys partitioning at or before Rank, Sorter, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter these transformations.
   - Hash user keys. The PowerCenter Server uses a hash function to group rows of data among partitions based on a user-defined partition key. You choose the ports that define the partition key.
3. Key range. The PowerCenter Server distributes rows of data based on a port or set of ports that you specify as the partition key. For each port, you define a range of values. The PowerCenter Server uses the key and ranges to send rows to the appropriate partition. Choose key range partitioning where the sources or targets in the pipeline are partitioned by key range.
4. Pass-through partitioning. The PowerCenter Server processes data without redistributing rows among partitions. Therefore, all rows in a single partition stay in that partition after crossing a pass-through partition point.
5. Database partitioning. You can optimize session performance by using the database partitioning partition type instead of the pass-through partition type for IBM DB2 targets.

If you find that your system is under-utilized after you have tuned the application, databases, and system for maximum single-partition performance, you can reconfigure your session to have two or more partitions to make your session utilize more of the hardware. Use the following tips when you add partitions to a session:
- Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.
- Set DTM buffer memory. For a session with n partitions, this value should be at least n times the value for the session with one partition.
- Set cached values for Sequence Generator. For a session with n partitions, there should be no need to use the Number of Cached Values property of the Sequence Generator transformation. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the session with one partition.
- Partition the source data evenly. Configure each partition to extract the same number of rows, or redistribute the data among partitions early using a partition point with round-robin. This is actually a good way to prevent hammering of the source system. You could have a session with multiple partitions where one partition returns all the data and the override SQL in the other partitions is set to return zero rows (where 1 = 2 in the where clause prevents any rows being returned). Some source systems react better to multiple concurrent SQL queries; others prefer smaller numbers of queries.
- Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time), then performance may improve for this session by adding a partition.
- Monitor the system after adding a partition. If the CPU utilization does not go up, the wait for I/O time goes up, or the total data transformation rate goes down, then there is probably a hardware or software bottleneck. If the wait for I/O time goes up a significant amount, then check the system for hardware bottlenecks. Otherwise, check the database configuration.
- Tune databases and system. Make sure that your databases are tuned properly for parallel ETL and that your system has no bottlenecks.

Increasing the Target Commit Interval


One method of resolving target database bottlenecks is to increase the commit interval. Each time the target database commits, performance slows. If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance may improve. When increasing the commit interval at the session level, you must remember to increase the size of the database rollback segments to accommodate the larger number of rows. One of the major reasons that Informatica set the default commit interval to 10,000 is to accommodate the default rollback segment / extent size of most databases. If you increase both the commit interval and the database rollback segments, you should see an increase in performance. In some cases though, just increasing the commit interval without making the appropriate database changes may cause the session to fail part way through (i.e., you may get a database error like "unable to extend rollback segments" in Oracle).
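For instance, on an Oracle target a DBA might enlarge a rollback segment before raising the commit interval. The segment name and storage values below are placeholders to be sized for your own environment:

ALTER ROLLBACK SEGMENT RBS01 STORAGE (NEXT 10M MAXEXTENTS 500)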

Disabling High Precision


If a session runs with high precision enabled, disabling high precision may improve session performance.


The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, you must configure it so that the PowerCenter Server recognizes this datatype by selecting Enable High Precision in the session property sheet. However, since reading and manipulating a high-precision datatype can slow the PowerCenter Server down, session performance may be improved by disabling decimal arithmetic. When you disable high precision, the PowerCenter Server reverts to using a datatype of Double.

Reducing Error Tracking


If a session contains a large number of transformation errors, you may be able to improve performance by reducing the amount of data the PowerCenter Server writes to the session log. To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data. However, if terse is not an acceptable level of detail, you may want to consider leaving the tracing level at Normal and focus your efforts on reducing the number of transformation errors. Note that the tracing level must be set to Normal in order to use the reject loading utility. As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level to verbose initialization or verbose data.
- Verbose initialization logs initialization details in addition to normal tracing, the names of index and data files used, and detailed transformation statistics.
- Verbose data logs each row that passes into the mapping. It also notes where the PowerCenter Server truncates string data to fit the precision of a column and provides detailed transformation statistics. When you configure the tracing level to verbose data, the PowerCenter Server writes row data for all rows in a block when it processes a transformation.

However, the verbose initialization and verbose data logging options significantly affect the session performance. Do not use Verbose tracing options except when testing sessions. Always remember to switch tracing back to Normal after the testing is complete. The session tracing level overrides any transformation-specific tracing levels within the mapping. Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of
reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors. PowerCenter uses the mapping tracing level when the session tracing level is set to none.

Pushdown Optimization
You can push transformation logic to the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration.

When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to pushdown optimization.

Source-Side Pushdown Optimization Sessions


In source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target until it reaches a downstream transformation that cannot be pushed to the database. The Integration Service generates a SELECT statement based on the transformation logic up to the transformation it can push to the database, and pushes all transformation logic that is valid to push by executing the generated SQL statement at run time. Then, it reads the results of this SQL statement and continues to run the session. Similarly, if the Source Qualifier contains a SQL override, the Integration Service creates a view for the override, generates a SELECT statement, and runs that SELECT statement against the view. When the session completes, the Integration Service drops the view from the database.

Target-Side Pushdown Optimization Sessions


When you run a session configured for target-side pushdown optimization, the

Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push to the database. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database. Then, it executes the generated SQL.

Full Pushdown Optimization Sessions


To use full pushdown optimization, the source and target must be on the same database. When you run a session configured for full pushdown optimization, the Integration Service analyzes the mapping from source to target and analyzes each transformation in the pipeline until it analyzes the target. It generates and executes SQL against the sources and targets. When you run a session with full pushdown optimization, the database must run a long transaction if the session contains a large quantity of data. Consider the following database performance issues when you generate a long transaction:
- A long transaction uses more database resources.
- A long transaction locks the database for longer periods of time, thereby reducing database concurrency and increasing the likelihood of deadlock.
- A long transaction can increase the likelihood that an unexpected event may occur.

The Rank transformation cannot be pushed to the database. If you configure the session for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source. It pushes the Expression transformation and target to the target database, and it processes the Rank transformation. The Integration Service does not fail the session if it can push only part of the transformation logic to the database and the session is configured for full optimization.

Using a Grid
You can use a grid to increase session and workflow performance. A grid is an alias assigned to a group of nodes that allows you to automate the distribution of workflows and sessions across nodes.


When you use a grid, the Integration Service distributes workflow tasks and session threads across multiple nodes. Running workflows and sessions on the nodes of a grid provides the following performance gains:
- Balances the Integration Service workload.
- Processes concurrent sessions faster.
- Processes partitions faster.

When you run a session on a grid, you improve scalability and performance by distributing session threads to multiple DTM processes running on nodes in the grid. To run a workflow or session on a grid, you assign resources to nodes, create and configure the grid, and configure the Integration Service to run on a grid.

Running a Session on Grid


When you run a session on a grid, the master service process runs the workflow and workflow tasks, including the Scheduler. Because it runs on the master service process node, the Scheduler uses the date and time of the master service process node to start scheduled workflows. The Load Balancer distributes Command tasks as it does when you run a workflow on a grid. In addition, when the Load Balancer dispatches a Session task, it distributes the session threads to separate DTM processes.

The master service process starts a temporary preparer DTM process that fetches the session and prepares it to run. After the preparer DTM process prepares the session, it acts as the master DTM process, which monitors the DTM processes running on other nodes. The worker service processes start the worker DTM processes on other nodes. The worker DTM runs the session. Multiple worker DTM processes running on a node might be running multiple sessions or multiple partition groups from a single session, depending on the session configuration.

For example, you run a workflow on a grid that contains one Session task and one Command task. You also configure the session to run on the grid. When the Integration Service process runs the session on a grid, it performs the following tasks:
- On Node 1, the master service process runs workflow tasks. It also starts a temporary preparer DTM process, which becomes the master DTM process. The Load Balancer dispatches the Command task and session threads to nodes in the grid.
- On Node 2, the worker service process runs the Command task and starts the worker DTM processes that run the session threads.
- On Node 3, the worker service process starts the worker DTM processes that run the session threads.

For information about configuring and managing a grid, refer to the PowerCenter Administrator Guide and to the best practice PowerCenter Enterprise Grid Option. For information about how the DTM distributes session threads into partition groups, see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.

Last updated: 06-Dec-07 15:20


Tuning SQL Overrides and Environment for Better Performance

Challenge


Tuning SQL Overrides and SQL queries within the source qualifier objects can improve performance in selecting data from source database tables, which positively impacts the overall session performance. This Best Practice explores ways to optimize a SQL query within the source qualifier object. The tips here can be applied to any PowerCenter mapping. While the SQL discussed here is executed in Oracle 8 and above, the techniques are generally applicable, but specifics for other RDBMS products (e.g., SQL Server, Sybase, etc.) are not included.

Description
SQL Queries Performing Data Extractions
Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must look at the type of execution being forced by hints, the execution plan, and the indexes on the query tables in the SQL, the logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.

DB2 Coalesce and Oracle NVL


When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In Oracle, the NVL function is used, while in DB2, the COALESCE function is used. Here is an example of the Oracle NVL function:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM exp.exp_bio_result bio, sar.sar_data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log
                  WHERE load_status = 'P')

Here is the same query in DB2:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log
                  WHERE load_status = 'P')


Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views
In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around this limitation.

You can create views in the database and use them as you would tables, either as source tables or in the FROM clause of the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain. The logic is now in two places: in an Informatica mapping and in a database view.

You can use in-line views, which are SELECT statements in the FROM or WHERE clause. This can help focus the query to a subset of data in the table and work more efficiently than using a traditional join. Here is an example of an in-line view in the FROM clause:

SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,
       N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,
       N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,
       N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
     (SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
      FROM EXPERIMENT_PARAMETER R, NEW_GROUP_TMP TMP
      WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID
      AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID
     ) X
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
ORDER BY N.DOSE_REGIMEN_ID

Surmounting the Single SQL Statement Limitation in DB2: Using the Common Table Expression temp tables and the WITH Clause
The Common Table Expression (CTE) stores data in temp tables during the execution of the SQL statement. The WITH clause lets you assign a name to a CTE block. You can then reference the CTE block in multiple places in the query by specifying the query name. For example:

WITH maxseq AS (SELECT MAX(seq_no) as seq_no FROM data_load_log WHERE load_status = 'P')
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log, maxseq
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = maxseq.seq_no

Here is another example using a WITH clause that uses recursive SQL:

WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS
  (SELECT PERSON_ID, NAME, PARENT_ID, 1
   FROM PARENT_CHILD
   WHERE NAME IN ('FRED', 'SALLY', 'JIM')
   UNION ALL
   SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1
   FROM PARENT_CHILD C, PERSON_TEMP RECURS
   WHERE C.PARENT_ID = RECURS.PERSON_ID
   AND RECURS.LVL < 5)
SELECT * FROM PERSON_TEMP

The PARENT_ID in any particular row refers to the PERSON_ID of the parent. Pretty stupid since we all have two parents, but you get the idea. The LVL counter prevents infinite recursion.

CASE (DB2) vs. DECODE (Oracle)


The CASE syntax is allowed in Oracle, but you are much more likely to see DECODE logic, even for a single condition, because DECODE was the only legal way to test a condition in earlier Oracle versions. DECODE is not allowed in DB2. Because DECODE performs only equality tests, a range check like the one below is typically written with the SIGN function. In Oracle:

SELECT EMPLOYEE, FNAME, LNAME,
DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',
DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID', 'THE REST OF US')) AS SALARY_COMMENT
FROM EMPLOYEE

In DB2:


SELECT EMPLOYEE, FNAME, LNAME,
CASE WHEN SALARY < 10000 THEN 'NEED RAISE'
WHEN SALARY > 1000000 THEN 'OVERPAID'
ELSE 'THE REST OF US' END AS SALARY_COMMENT
FROM EMPLOYEE

Debugging Tip: Obtaining a Sample Subset


It is often useful to get a small sample of the data from a long-running query that returns a large set of data. The sampling logic can be commented out or removed once the query is put into general use. DB2 uses the FETCH FIRST n ROWS ONLY clause to do this:

SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
FETCH FIRST 12 ROWS ONLY

Oracle does it with the ROWNUM pseudo-column:

SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
AND ROWNUM <= 12

INTERSECT, INTERSECT ALL, UNION, UNION ALL


Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL return all rows, including duplicates.
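As a quick, hedged illustration using the CUSTOMERS and EMPLOYEES tables that appear in later examples, the first query returns each distinct NAME_ID only once, while the second returns every row from both tables, duplicates included:

SELECT NAME_ID FROM CUSTOMERS
UNION
SELECT NAME_ID FROM EMPLOYEES

SELECT NAME_ID FROM CUSTOMERS
UNION ALL
SELECT NAME_ID FROM EMPLOYEES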

System Dates in Oracle and DB2


Oracle uses the system variable SYSDATE for the current date and time, and lets you display the date and/or time in any format using the date functions. Here is an example that returns yesterday's date in Oracle (default format mm/dd/yyyy):

SELECT TRUNC(SYSDATE) - 1 FROM DUAL

DB2 uses the system variables (called special registers) CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP. Here is an example for DB2:

SELECT FNAME, LNAME, CURRENT DATE AS TODAY
FROM EMPLOYEE

Oracle: Using Hints


Hints affect the way a query or sub-query is executed and can, therefore, provide a significant performance increase in queries. Hints cause the database engine to relinquish control over how a query is executed, thereby giving the developer control over the execution. Hints are always honored unless execution is not possible. Because the database engine does not evaluate whether the hint makes sense, developers must be careful in implementing hints. Oracle has many types of hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel execution hints. Optimizer and access method hints are the most common.

In the latest versions of Oracle, cost-based query analysis is built in and rule-based analysis is no longer possible. It was in rule-based Oracle systems that hints mentioning specific indexes were most helpful. In Oracle version 9.2, however, the use of /*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using older versions of Oracle, however, the use of the proper INDEX hints should help performance.

The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The table below provides a partial list of optimizer hints and descriptions.

Optimizer hints: Choosing the best join method


Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts while the nested loop involves no sorts. The hash join also requires memory to build the hash table. Hash joins are most effective when the amount of data is large and one table is much larger than the other. Here is an example of a select that performs best as a hash join:

SELECT COUNT(*)
FROM CUSTOMERS C, MANAGERS M
WHERE C.CUST_ID = M.MANAGER_ID

Considerations                                              Join Type
Better throughput                                           Sort/Merge
Better response time                                        Nested loop
Large subsets of data                                       Sort/Merge
Index available to support join                             Nested loop
Limited memory and CPU available for sorting                Nested loop
Parallel execution                                          Sort/Merge or Hash
Joining all or most of the rows of large tables             Sort/Merge or Hash
Joining small sub-sets of data with an index available      Nested loop

ALL_ROWS: The database engine creates an execution plan that optimizes for throughput. Favors full table scans. The optimizer favors sort/merge joins.

FIRST_ROWS: The database engine creates an execution plan that optimizes for response time. It returns the first row of data as quickly as possible. Favors index lookups. The optimizer favors nested-loop joins.

CHOOSE: The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance is extremely poor.

RULE: The database engine creates an execution plan based on a fixed set of rules.

USE_NL: Use nested-loop joins.

USE_MERGE: Use sort/merge joins.

HASH: The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.

Access method hints


Access method hints control how data is accessed. These hints are used to force the database engine to use indexes, hash scans, or ROWID scans. The following table provides a partial list of access method hints.

ROWID: The database engine performs a scan of the table based on ROWIDs. DO NOT USE in Oracle 9.2 and above.

INDEX: The database engine performs an index scan of a specific table; in 9.2 and above, the optimizer does not use any indexes other than those mentioned in the hint.

USE_CONCAT: The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.

The syntax for using a hint in a SQL statement is as follows:

Select /*+ FIRST_ROWS */ empno, ename From emp;


Select /*+ USE_CONCAT */ empno, ename From emp;

SQL Execution and Explain Plan


The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This change can be accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best SQL execution, it relies upon optimization of the Oracle parameters and up-to-date database statistics. If these statistics are not maintained, cost-based query execution can suffer over time. When that happens, rule-based execution can actually provide better execution time. The developer can determine which type of execution is being used by running an explain plan on the SQL query in question. Note that the step in the explain plan that is indented the most is the statement that is executed first. The results of that statement are then used as input by the next-level statement. Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible, since full table scans cause degradation in performance. Information provided by the explain plan can be enhanced using the SQL Trace utility, which provides the following additional information:

- The number of executions
- The elapsed time of the statement execution
- The CPU time used to execute the statement

The SQL Trace Utility adds value because it definitively shows the statements that are using the most resources, and can immediately show the change in resource consumption after the statement has been tuned and a new explain plan has been run.
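As a hedged sketch of how these two tools are typically invoked in Oracle (the statement being analyzed is reused from the sampling example above; DBMS_XPLAN is available in Oracle 9.2 and later, and older releases can query PLAN_TABLE directly or use utlxpls.sql):

EXPLAIN PLAN FOR
SELECT EMPLOYEE, FNAME, LNAME FROM EMPLOYEE WHERE JOB_TITLE = 'WORKERBEE';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Enable SQL Trace for the current session, run the statement, then format the
-- resulting trace file with the tkprof utility to see executions, elapsed time,
-- and CPU time per statement.
ALTER SESSION SET SQL_TRACE = TRUE;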

Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being used, it is possible to force the query to use it by using an access method hint, as described earlier.
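As a hedged sketch only (the index name here is hypothetical, and recall the earlier caution that INDEX hints can hurt performance on Oracle 9.2 and above), an access method hint naming a specific index looks like this:

SELECT /*+ INDEX(E EMP_JOB_TITLE_IDX) */ EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE E
WHERE JOB_TITLE = 'WORKERBEE'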

Reviewing SQL Logic


The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for additional filters to select only certain data, as well as the need to restructure the where clause to use indexes. In extreme cases, the entire SQL statement may need to be re-written to become more efficient.

Reviewing SQL Syntax


SQL Syntax can also have a great impact on query performance. Certain operators can slow performance, for example:
- EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent query and cannot take advantage of indexes, while the IN clause is executed once, does use indexes, and may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:

SELECT * FROM DEPARTMENTS
WHERE DEPT_ID IN (SELECT DISTINCT DEPT_ID FROM MANAGERS)   -- Faster

SELECT * FROM DEPARTMENTS D
WHERE EXISTS (SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)

Situation                                                      EXISTS                                            IN
Index supports sub-query                                       Yes                                               Yes
No index to support sub-query                                  No; table scans per parent row                    Yes; table scan once
Sub-query returns many rows                                    Probably not                                      Yes
Sub-query returns one or a few rows                            Yes                                               Yes
Most of the sub-query rows are eliminated by the parent query  No                                                Yes
Index in parent that matches sub-query columns                 Possibly not, since EXISTS cannot use the index   Yes; IN uses the index

- Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way can improve performance by more than 100 percent (see the sketch after this list).

- Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup objects within the mapping to fill in the optional information.
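A minimal sketch of the INTERSECT-to-EXISTS rewrite referenced in the list above, reusing the CUSTOMERS and EMPLOYEES tables from the surrounding examples:

SELECT C.NAME_ID FROM CUSTOMERS C
INTERSECT
SELECT E.NAME_ID FROM EMPLOYEES E

-- Equivalent EXISTS form, which usually performs better:
SELECT DISTINCT C.NAME_ID FROM CUSTOMERS C
WHERE EXISTS (SELECT 1 FROM EMPLOYEES E WHERE E.NAME_ID = C.NAME_ID)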

Choosing the Best Join Order


- Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the incremental ETL load (see the sketch after this list).
- Always put the small table column on the right side of the join.
- Use the driving table first in the WHERE clause, and work from it outward. In other words, be consistent and orderly about placing columns in the WHERE clause.
- Outer joins limit the join order that the optimizer can use. Don't use them needlessly.
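A minimal sketch of these guidelines, reusing the staging and detail tables from the earlier in-line view example (the column names are assumed from that example):

SELECT N.DOSE_REGIMEN_ID, N.DOSE_REGIMEN_TEXT
FROM NEW_GROUP_TMP TMP, EXPERIMENT_PARAMETER R, DOSE_REGIMEN N      -- smallest (staging) table first in the join order
WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID     -- driving table in the first condition, small-table column on the right
AND N.DOSE_REGIMEN_ID = R.DOSE_REGIMEN_ID                           -- work outward from the driving table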

Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN
- Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this may not be a problem on small tables, it can become a performance drain on large tables.

SELECT NAME_ID FROM CUSTOMERS
WHERE NAME_ID NOT IN (SELECT NAME_ID FROM EMPLOYEES)


- Avoid use of the NOT EXISTS clause. This clause is better than NOT IN, but may still cause a full table scan.

SELECT C.NAME_ID FROM CUSTOMERS C
WHERE NOT EXISTS (SELECT * FROM EMPLOYEES E WHERE C.NAME_ID = E.NAME_ID)

- In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.

SELECT C.NAME_ID FROM CUSTOMERS C
MINUS
SELECT E.NAME_ID FROM EMPLOYEES E

- Also consider using outer joins with IS NULL conditions for anti-joins.

SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID (+)
AND E.NAME_ID IS NULL

Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses as they may change based on the database engine.
- In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders entered into the system since the previous load of the database, then, in the product information lookup, only select the products that match the distinct product IDs in the incremental sales orders.

- Avoid range lookups. A range lookup is a SELECT that uses a BETWEEN in the WHERE clause with limit values that are themselves retrieved from a table. Here is an example:

SELECT R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.GCW_LOAD_DATE
FROM CDS_SUPPLIER R,
(SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,
L.LOAD_DATE AS LOAD_DATE
FROM ETL_AUDIT_LOG L
WHERE L.LOAD_DATE_PREV IN
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y)
) Z
WHERE R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE

The work-around is to use an in-line view in the FROM clause to apply the lower limit, and join it to the main query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds. Here is the improved SQL:

SELECT R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.LOAD_DATE
FROM
/* In-line view for lower limit */
(SELECT R1.BATCH_TRACKING_NO,
R1.SUPPLIER_DESC,
R1.SUPPLIER_REG_NO,
R1.SUPPLIER_REF_CODE,
R1.LOAD_DATE
FROM CDS_SUPPLIER R1,
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y) Z
WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV
ORDER BY R1.LOAD_DATE) R,
/* end in-line view for lower limit */
(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE
FROM ETL_AUDIT_LOG D) A /* upper limit */
WHERE R.LOAD_DATE <= A.LOAD_DATE

Tuning System Architecture


Use the following steps to improve the performance of any system:

1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are met, consider reducing the number of measurements because performance monitoring itself uses system resources. Otherwise continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option, implement one at a time. If there are no options left at any level, this indicates that the system has reached its limits and hardware upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.

System Resources
The PowerCenter Server uses the following system resources:
- CPU
- Load Manager shared memory
- DTM buffer memory
- Cache memory

When tuning the system, evaluate the following considerations during the implementation process.
- Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of network hops between the PowerCenter Server and the databases.
- Use multiple PowerCenter Servers on separate systems to potentially improve session performance.
- When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can potentially slow session performance.
- Check hard disks on related machines. Slow disk access on source and target databases, source and target file systems, as well as the PowerCenter Server and repository machines, can slow session performance.
- When an operating system runs out of physical memory, it starts paging to disk to free physical memory. Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system memory when sessions use large cached lookups or sessions have many partitions.
- In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources. Use processor binding to control processor usage by the PowerCenter Server. In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor set only runs the PowerCenter Server (see the sketch after this list). For details, see the project system administrator and Sun Solaris documentation. In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For details, see the project system administrator and HP-UX documentation. In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands. The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details, see the project system administrator and AIX documentation.
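As a hedged sketch of the Solaris processor binding mentioned in the last item above (the processor IDs, processor set ID, and process ID are placeholders; consult the Solaris man pages for psrset and pbind on your system):

# create a processor set from processors 2 and 3; the command prints the new set ID (for example, 1)
psrset -c 2 3

# bind the running PowerCenter Server process (PID 12345) to processor set 1
psrset -b 1 12345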

Database Performance Features


Nearly everything is a trade-off in the physical database implementation. Work with the DBA in determining which of the many available alternatives is the best implementation choice for the particular database. The project team must have a thorough understanding of the data, database, and desired use of the database by the end-user community prior to beginning the physical implementation process. Evaluate the following considerations during the implementation process.
- Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and primary key to foreign key relationships, and also eliminating join tables.
- Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, rebuilding them after the load using post-session scripts (see the sketch after this list).
- Constraints. Avoid constraints if possible and try to enforce integrity by incorporating that additional logic in the mappings.
- Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for queries (temporary) and INSERTs and UPDATEs (rollback). The rollback area must be large enough to hold all the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions, particularly on initial loads.
- OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to determine after the fact. DBAs must work with the System Administrator to ensure all the database processes have the same priority.
- Striping. Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk I/O throughput.
- Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by separating the disk controllers.
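A minimal sketch of the pre- and post-session SQL mentioned in the Indexes item above; the table and index names are hypothetical, and the exact rebuild syntax depends on the database:

-- Pre-session SQL: drop the index before the bulk load
DROP INDEX IDX_SALES_FACT_CUST_ID;

-- Post-session SQL: recreate the index after the load completes
CREATE INDEX IDX_SALES_FACT_CUST_ID ON SALES_FACT (CUST_ID);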

Last updated: 13-Feb-07 17:47


Using Metadata Manager Console to Tune the XConnects

Challenge


Improving the efficiency and reducing the run-time of your XConnects through the parameter settings of the Metadata Manager console.

Description
Remember that the minimum system requirements for a machine hosting the Metadata Manager console are:
- Windows operating system (2000, NT 4.0 SP 6a)
- 400MB disk space
- 128MB RAM (256MB recommended)
- 133 MHz processor

If the system meets or exceeds the minimal requirements, but an XConnect is still taking an inordinately long time to run, use the following steps to try to improve its performance. To improve performance of your XConnect loads from database catalogs:
- Modify the inclusion/exclusion schema list (if the number of schemas to be loaded is greater than the number to be excluded, use the exclusion list).
- Carefully examine how many old objects the project needs by default. Modify the sysdate - 5000 filter to a smaller value to reduce the result set.

To improve performance of your XConnect loads from the PowerCenter repository:


- Load only the production folders that are needed for a particular project.
- Run the XConnects with just one folder at a time, or select the list of folders for a particular run.


Advanced Client Configuration Options

Challenge


Setting the Registry to ensure consistent client installations, resolve potential missing or invalid license key issues, and change the Server Manager Session Log Editor to your preferred editor.

Description
Ensuring Consistent Data Source Names
To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. You can then distribute this file and import the connection information for each client machine.

Solution:

- From Repository Manager, choose Export Registry from the Tools drop-down menu.
- For all subsequent client installs, simply choose Import Registry from the Tools drop-down menu.

Resolving Missing or Invalid License Keys


The missing or invalid license key error occurs when attempting to install PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than Administrator. This problem also occurs when the client software tools are installed under the Administrator account, and a user with a non-administrator ID subsequently attempts to run the tools. The user who attempts to log in using the normal non-administrator userid will be unable to start the PowerCenter Client tools. Instead, the software displays the message indicating that the license key is missing or invalid. Solution:
- While logged in as the installation user with administrator authority, use regedt32 to edit the registry.
- Under HKEY_LOCAL_MACHINE, open Software/Informatica/PowerMart Client Tools/.
- From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and PowerMart Client tools.)

Changing the Session Log Editor


In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to Wordpad within the workflow monitor client tool. To choose a different editor, just select Tools>Options in the workflow monitor. Then browse for the editor that you want on the General tab. For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad unless the wordpad.exe can be found in the path statement. Instead, a window appears the first time a session log is viewed from the PowerCenter Server Manager prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it. Solution:
- While logged in as the installation user with administrator authority, use regedt32 to go into the registry.
- Move to registry path location: HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files.
- From the menu bar, select View Tree and Data.
- Select the Log File Editor entry by double-clicking on it. Replace the entry with the appropriate editor entry (typically WordPad.exe or Write.exe).
- Select Registry --> Exit from the menu bar to save the entry.

For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow Monitor. The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for workflow and session logs.


Adding a New Command Under Tools Menu


Other tools, in addition to the PowerCenter client tools, are often needed during development and testing. For example, you may need a tool such as Enterprise Manager (SQL Server) or Toad (Oracle) to query the database. You can add shortcuts to executable programs from any client tool's Tools drop-down menu to provide quick access to these programs.

Solution: Choose Customize under the Tools menu and add a new item. Once it is added, browse to find the executable it is going to call (as shown below).


Once this is done, you can easily call another program from your PowerCenter client tools. In the following example, TOAD can be called quickly from the Repository Manager tool.

Changing Target Load Type


In PowerCenter versions 6.0 and earlier, each time a session was created, it defaulted to be of type bulk, although this was not necessarily what was desired and could cause the session to fail under certain conditions if not changed. In versions 7.0 and above, you can set a property in Workflow Manager to choose the default load type to be either 'bulk' or 'normal'.


Solution:
- In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab.
- Click the button for either 'normal' or 'bulk', as desired.
- Click OK, then close and reopen the Workflow Manager tool.

After this, every time a session is created, the target load type for all relational targets will default to your choice.

Resolving Undocked Explorer Windows


The Repository Navigator window sometimes becomes undocked. Docking it again can be frustrating because double clicking on the window header does not put it back in place.

Solution:
- To get the window correctly docked, right-click in the white space of the Navigator window and make sure that the Allow Docking option is checked.
- If it is already checked, double-click on the title bar of the Navigator window.

Resolving Client Tool Window Display Issues


If one of the windows (e.g., Navigator or Output) in a PowerCenter 7.x or later client tool (e.g., Designer) disappears, try the following solutions to recover it:

- Clicking View > Navigator
- Toggling the menu bar
- Uninstalling and reinstalling the client tools

Note: If none of the above solutions resolve the problem, you may want to try the following solution using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause serious problems that may require reinstalling the operating system. Informatica does not guarantee that any problems caused by using Registry Editor incorrectly can be resolved. Use the Registry Editor at your own risk. Solution: Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can often be resolved as follows:
- Close the client tool.
- Go to Start > Run and type "regedit".
- Go to the key HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\x.y.z, where x.y.z is the version and maintenance release level of the PowerCenter client as follows:

PowerCenter Version    Folder Name
7.1                    7.1
7.1.1                  7.1.1
7.1.2                  7.1.1
7.1.3                  7.1.1
7.1.4                  7.1.1
8.1                    8.1

- Open the key of the affected tool (for the Repository Manager, open Repository Manager Options).
- Export all of the Toolbars sub-folders and rename them.
- Re-open the client tool.


Enhancing the Look of the Client Tools


The PowerCenter client tools allow you to customize the look and feel of the display. Here are a few examples of what you can do.

Designer

- From the Menu bar, select Tools > Options.
- In the dialog box, choose the Format tab.
- Select the feature that you want to modify (i.e., workspace colors, caption colors, or fonts).

Changing the background workspace colors can help identify which workspace is currently open. For example, changing the Source Analyzer workspace color to green or the Target Designer workspace to purple to match their respective metadata definitions helps to identify the workspace. Alternatively, click the Select Theme button to choose a color theme, which displays background colors based on predefined themes.


Workflow Manager
You can modify the Workflow Manager using the same approach as the Designer tool. From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or customize each element individually.

Workflow Monitor
You can modify the colors in the Gantt Chart view to represent the various states of a task. You can also select two colors for one task to give it a dimensional appearance; this can be helpful in distinguishing between running tasks, succeeded tasks, etc. To modify the Gantt Chart appearance, go to the Menu bar and select Tools > Options and Gantt Chart.

Using Macros in Data Stencil


Data Stencil contains unsigned macros. Set the security level in Visio to Medium so you can enable macros when you start Data Stencil. If the security level for Visio is set to High or Very High, you cannot run the Data Stencil macros. To set the security level in Visio, select Tools > Macros > Security from the menu. On the Security Level tab, select Medium. When you start Data Stencil, Visio displays a security warning about viruses in macros. Click Enable Macros to enable the macros for Data Stencil.

Last updated: 19-Mar-08 19:00


Advanced Server Configuration Options

Challenge


Correctly configuring Advanced Integration Service properties, Integration Service process variables, and automatic memory settings; using custom properties to write service logs to files; and adjusting semaphore and shared memory settings in the UNIX environment.

Description
Configuring Advanced Integration Service Properties
Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties > Edit. The following Advanced properties are included:

Limit on Resilience Timeouts (Optional): Maximum amount of time (in seconds) that the service holds on to resources for resilience purposes. This property places a restriction on clients that connect to the service. Any resilience timeouts that exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

Resilience Timeout (Optional): Period of time (in seconds) that the service tries to establish or reestablish a connection to another service. If blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

Configuring Integration Service Process Variables


One configuration best practice is to properly configure and leverage the Integration service (IS) process variables. The benefits include:
- Ease of deployment across environments (DEV > TEST > PRD).
- Ease of switching sessions from one IS to another without manually editing all the sessions to change directory paths.
- All the variables are related to directory paths used by a given Integration Service.

You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption. All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location.


By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. You must specify the directory path for each type of file. Each registered server has its own set of variables, and the list is fixed, not user-extensible. You specify the following directories using service process variables:

Service Process Variable      Value
$PMRootDir                    (no default - user must insert a path)
$PMSessionLogDir              $PMRootDir/SessLogs
$PMBadFileDir                 $PMRootDir/BadFiles
$PMCacheDir                   $PMRootDir/Cache
$PMTargetFileDir              $PMRootDir/TargetFiles
$PMSourceFileDir              $PMRootDir/SourceFiles
$PMExtProcDir                 $PMRootDir/ExtProc
$PMTempDir                    $PMRootDir/Temp
$PMSuccessEmailUser           (no default - user must insert a value)
$PMFailureEmailUser           (no default - user must insert a value)
$PMSessionLogCount            0
$PMSessionErrorThreshold      0
$PMWorkflowLogCount           0
$PMWorkflowLogDir             $PMRootDir/WorkflowLogs
$PMLookupFileDir              $PMRootDir/LkpFiles
$PMStorageDir                 $PMRootDir/Storage

Writing PowerCenter 8 Service Logs to Files


Starting with PowerCenter 8, all logging for the services and sessions uses the Log service and can only be viewed through the PowerCenter Administration Console. However, it is still possible to get this information logged to a file, similar to previous versions. To write all Integration Service logs (session, workflow, server, etc.) to files:

1. Log in to the Admin Console.
2. Select the Integration Service.
3. Add a Custom property called UseFileLog and set its value to "Yes".
4. Add a Custom property called LogFileName and set its value to the desired file name.
5. Restart the service.

Integration Service Custom Properties (undocumented server parameters) can be entered here as well:

1. At the bottom of the list, enter the Name and Value of the custom property.
2. Click OK.

Adjusting Semaphore Settings on UNIX Platforms


When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server. Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers. The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system:
- HP-UX: Use sam (1M) to change the parameters.
- Solaris: Use admintool or edit /etc/system to change the parameters.
- AIX: Use smit to change the parameters.
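Regardless of platform, it can help to check the semaphores currently allocated before and after changing the kernel parameters. As a hedged sketch (flags and output vary by UNIX flavor):

# list the semaphore sets currently allocated on the system
ipcs -s

# on Linux, ipcs -ls also reports the configured semaphore limits
ipcs -ls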

Setting Shared Memory and Semaphore Parameters on UNIX Platforms


Informatica recommends setting the following parameters as high as possible for the UNIX operating system. However, if you set these parameters too high, the machine may not boot. Always refer to the operating system documentation for parameter limits. Note that different UNIX operating systems set these variables in different ways or may be self tuning. Always reboot the system after configuring the UNIX kernel.

HP-UX
For HP-UX release 11i, the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to 500. NCALL is referred to as NCALLOUT. Use the HP System V IPC Shared-Memory Subsystem to update parameters. To change a value, perform the following steps:

1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double-click the Kernel Configuration icon.
3. Double-click the Configurable Parameters icon.
4. Double-click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.

The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.

IBM AIX
None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.

SUN Solaris
Keep the following points in mind when configuring and tuning the SUN Solaris platform:

1. Edit the /etc/system file and add the following variables to increase shared memory segments:

set shmsys:shminfo_shmmax=value
set shmsys:shminfo_shmmin=value
set shmsys:shminfo_shmmni=value
set shmsys:shminfo_shmseg=value
set semsys:seminfo_semmap=value
set semsys:seminfo_semmni=value
set semsys:seminfo_semmns=value
set semsys:seminfo_semmsl=value
set semsys:seminfo_semmnu=value
set semsys:seminfo_semume=value

2. Verify the shared memory value changes:

# grep shmsys /etc/system

3. Restart the system:

# init 6

Red Hat Linux


The default shared memory limit (shmmax) on Linux platforms is 32MB. This value can be changed in the proc file system without a restart. For example, to allow 128MB, type the following command:

$ echo 134217728 >/proc/sys/kernel/shmmax

You can put this command into a script run at startup. Alternatively, you can use sysctl(8), if available, to control this parameter. Look for a file called /etc/sysctl.conf and add a line similar to the following:

kernel.shmmax = 134217728

This file is usually processed at startup, but sysctl can also be called explicitly later. To view the values of other parameters, look in the files /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/sem.h.

SuSE Linux
The default shared memory limits (shmmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a restart. For example, to allow 512MB, type the following commands:

#sets shmall and shmmax shared memory
echo 536870912 >/proc/sys/kernel/shmall    #Sets shmall to 512 MB
echo 536870912 >/proc/sys/kernel/shmmax    #Sets shmmax to 512 MB

You can also put these commands into a script run at startup. Also change the settings for the system memory user limits by modifying a file called /etc/profile. Add lines similar to the following:

#sets user limits (ulimit) for system memory resources
ulimit -v 512000    #set virtual (swap) memory to 512 MB
ulimit -m 512000    #set physical memory to 512 MB

Configuring Automatic Memory Settings


With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer memory size and session cache size at runtime. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches. Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer memory and cache memory settings, consider the overall memory usage for best performance. Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the Integration Service disables automatic memory settings and uses default values.

Last updated: 01-Feb-07 18:54


Causes and Analysis of UNIX Core Files

Challenge


This Best Practice explains what UNIX core files are and why they are created, and offers some tips on analyzing them.

Description
Fatal run-time errors in UNIX programs usually result in the termination of the UNIX process by the operating system. Usually, when the operating system terminates a process, a "core dump" file is also created, which can be used to analyze the reason for the abnormal termination.

What is a Core File and What Causes it to be Created?


UNIX operating systems may terminate a process before its normal, expected exit for several reasons. These reasons are typically bad behavior by the program, and include attempts to execute illegal or incorrect machine instructions, attempts to allocate memory outside the memory space allocated to the program, attempts to write to memory marked read-only by the operating system, and other similar incorrect low-level operations. Most of these bad behaviors are caused by errors in the program's logic. UNIX may also terminate a process for some reasons that are not caused by programming errors. The main examples of this type of termination are when a process exceeds its CPU time limit, and when a process exceeds its memory limit.

When UNIX terminates a process in this way, it normally writes an image of the process's memory to disk in a single file. These files are called "core files", and are intended to be used by a programmer to help determine the cause of the failure. Depending on the UNIX version, the name of the file may be "core", or in more recent UNIX versions, "core.nnnn", where nnnn is the UNIX process ID of the process that was terminated.

Core files are not created for "normal" runtime errors such as incorrect file permissions, lack of disk space, inability to open a file or network connection, and other errors that a program is expected to detect and handle. However, under certain error conditions a program may not handle the error correctly and may follow a path of execution that causes the OS to terminate it and produce a core dump. Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger behavior that causes unexpected core dumps. For example, using an ODBC driver library from one vendor and an ODBC driver manager from another vendor may result in a core dump if the libraries are not compatible. A similar situation can occur if a process is using libraries from different versions of a database client, such as a mixed installation of Oracle 8i and 9i. An installation like this should not exist, but if it does, core dumps are often the result.

Core File Locations and Size Limits


A core file is written to the current working directory of the process that was terminated. For PowerCenter, this is always the directory the services were started from. For other applications, this may not be true. UNIX also implements a per-user resource limit on the maximum size of core files, controlled by the ulimit command. If the limit is 0, core files are not created. If the limit is less than the total memory size of the process, a partial core file is written. Refer to the Best Practice Understanding and Setting UNIX Resources for PowerCenter Installations.
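As a quick illustration of the ulimit control described above (the option letter and units can vary slightly between shells):

# show the current core file size limit for this shell
ulimit -c

# allow full core files for processes started from this shell
ulimit -c unlimited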

Analyzing Core Files


Core files provide valuable insight into the state and condition a process was in just before it was terminated. A core file also contains the history, or log, of routines that the process went through before that fateful function call; this log is known as the stack trace. There is little information in a core file that is relevant to an end user; most of the contents of a core file are only relevant to a developer, or someone who understands the internals of the program that generated the core file. However, there are a few things that an end user can do with a core file in the way of initial analysis.

The most important aspect of analyzing a core file is extracting this stack trace out of the core dump. Debuggers are the tools that help retrieve the stack trace and other vital information out of the core. Informatica recommends using the pmstack utility.

The first step is to save the core file under a new name so that it is not overwritten by a later crash of the same application. One option is to append a timestamp to the core, but it can be renamed to anything:

mv core core.ddmmyyhhmi


The second step is to log in with the same UNIX user id that started up the process that crashed. This sets the debugger's environment to be the same as that of the process at startup time.

The third step is to go to the directory where the program is installed and run the "file" command on the core file. This returns the name of the process that created the core file:

file <fullpathtocorefile>/core.ddmmyyhhmi

Core files can be generated by the PowerCenter executables (i.e., pmserver, infaservices, and pmdtm) as well as by other UNIX commands executed by the Integration Service, typically from command tasks and pre- or post-session commands. If a PowerCenter process is terminated by the OS and a core is generated, the session or server log typically indicates "Process terminating on Signal/Exception" as its last entry.

Using the pmstack Utility


Informatica provides a pmstack utility that can automatically analyze a core file. If the core file is from PowerCenter, it generates a complete stack trace from the core file, which can be sent to Informatica Customer Support for further analysis. The trace contains everything necessary to further diagnose the problem. Core files themselves are normally not useful on a system other than the one where they were generated.

The pmstack utility can be downloaded from the Informatica Support knowledge base as article 13652, and from the support ftp server at tsftp.informatica.com. Once downloaded, run pmstack with the -c option, followed by the name of the core file:

$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info : -rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'pmdtm'
Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..
Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support

You can then look at the generated trace file or send it to support.

Pmstack also supports a -p option, which can be used to extract a stack trace from a running process. This is sometimes useful when a process appears to be hung, to determine what the process is actually doing.
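As a hedged illustration (the process ID below is only a placeholder), a running PowerCenter process can be inspected with:

$ pmstack -p 21896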

Last updated: 19-Mar-08 19:01


Domain Configuration

Challenge


The domain architecture in PowerCenter simplifies the administration of disparate PowerCenter services across the enterprise as well as the maintenance of security throughout PowerCenter. It allows previously separately-administered application services and nodes to be grouped into logical folders within the domain, based on administrative ownership. It is vital, when installing or upgrading PowerCenter, that the Application Administrator understand the terminology and architecture surrounding the Domain Configuration in order to effectively administer, upgrade, deploy, and maintain PowerCenter services throughout the enterprise.

Description
The domain architecture allows PowerCenter to provide a service-oriented architecture where you can specify, from one central location, which services are running on which node or physical machine. The components in the domain are aware of each other's presence and continually monitor one another via heartbeats. The various services within the domain can move from one physical machine to another without any interruption to the PowerCenter environment. As long as clients can connect to the domain, the domain can route their needs to the appropriate physical machine. From a monitoring perspective, the domain provides the ability to monitor all services in the domain as well as control security from a central location. You no longer have to log into and ping multiple machines in a robust PowerCenter environment; instead, a single screen displays the current availability state of all services. For more details on the individual components and detailed configuration of a domain, refer to the PowerCenter Administrator Guide.

Key Domain Components


There are several key domain components to consider during installation and setup:
- Master Gateway. The node designated as the master gateway or domain controller is the main entry point to the domain. This server or set of servers should be your most reliable and available machine in the architecture. It is the first point of entry for all clients wishing to connect to one of the PowerCenter services. If the master gateway is unavailable, the entire domain is unavailable. You may designate more than one node to run the gateway service. One gateway is always the master or primary, but by having the gateway service running on more than one node in a multi-node configuration, your domain can continue to function if the master gateway is no longer available. In a high-availability environment, it is critical to have one or more nodes running the gateway service as a backup to the master gateway.


- Shared File System. The PowerCenter domain architecture provides centralized logging capability and, when high availability is enabled, a highly available environment with automatic fail-over of workflows and sessions. In order to achieve this, the base PowerCenter server file directories must reside on a file system that is accessible by all nodes in the domain. When PowerCenter is initially installed, this directory is called infa_shared and is located under the server directory of the PowerCenter installation. It includes logs and checkpoint information that is shared among nodes of the domain. Ideally, this file system is both high-performance and highly available.

- Domain Metadata. As of PowerCenter 8, a store of metadata exists to hold all of the configuration settings for the domain. This domain repository is separate from the one or more PowerCenter repositories in a domain. Instead, it is a handful of tables that replace the older version 7.x pmserver.cfg, pmrep.cfg and other PowerCenter configuration information. As of PowerCenter 8.5, all PowerCenter security is also maintained here. Upon installation you will be prompted for the RDBMS location for the domain repository. This information should be treated like a PowerCenter repository, with regularly-scheduled backups and a disaster recovery plan. Without this metadata, a domain is unable to function. The RDBMS user provided to PowerCenter requires permissions to create and drop tables, as well as insert, update, and delete records. Ideally, if you are going to be grouping multiple independent nodes within this domain, the domain configuration database should reside on a separate and independent server so as to eliminate the single point of failure if the node hosting the domain configuration database fails.

Domain Architecture
Just as in other PowerCenter architectures, the premise of the architecture is to maintain flexibility and scalability across the environment. There is no single best way to deploy the architecture. Rather, each environment should be assessed for external factors, and PowerCenter should then be configured to function best in that particular environment. The advantage of the service-oriented architecture is that components in the architecture (i.e., repository services, integration services, and others) can be moved among nodes without needing to make changes to the mappings or workflows. Starting in PowerCenter 8.5, all reporting components of PowerCenter (Data Analyzer and Metadata Manager) are also configured and administered from the domain. Because of this architecture, it is very simple to alter architecture components if you find a suboptimal configuration and want to change it in your environment. The key here is that you are not tied to any choices you make at installation time and have the flexibility to make changes to your architecture as your business needs change.


Tip: While the architecture is very flexible and provides easy movement of services throughout the environment, an item to consider carefully at installation time is the name of the domain and its nodes. These are somewhat troublesome to change later because of their criticality to the domain. It is not recommended that you embed server IP addresses or names in the domain name or the node names. You never know when you may need to move to new hardware or move nodes to new locations. For example, instead of naming your domain PowerCenter_11.5.8.20, consider naming it Enterprise_Dev_Test. This makes it more intuitive to understand what domain you are attaching to, and if you ever decide to move the main gateway to another server, you don't need to change the domain or node name. While these names can be changed, the change is not easy and requires using command line programs to alter the domain metadata.

In the next sections, we look at a couple of sample domain configurations.

Single Node Domain


Even in a single server/single node installation, you must still create a domain. In this case, all domain components reside on a single physical machine (i.e., node). You can have any number of PowerCenter services running in this domain. It is important to note that with PowerCenter 8 and beyond, you can run multiple Integration Services at the same time on the same machine, even in an NT/Windows environment. Naturally, this configuration exposes a single point of failure for every component in the domain, and high availability is not available in this situation.

Multiple Node Domains



Domains can continue to expand to meet the demands of true enterprise-wide data integration.

Domain Architecture for Production/Development/Quality Assurance Environments


The architecture picture becomes more complex when you consider a typical development environment, which usually includes some level of a Development, Quality Assurance, and Production environment. In most implementations, these are separate PowerCenter repositories and associated servers. It is possible to define a single domain to include one or more of these development environments. However, there are a few points to consider:
- If the domain gateway is unavailable for any reason, the entire domain is inaccessible. Keep in mind that if you place your development, quality assurance, and production services in a single domain, you have the possibility of affecting your production environment with development and quality assurance work. If you decide to restart the domain in Development for some reason, you are effectively restarting development, quality assurance, and production at the same time. Also, if you experience some sort of failure that affects the domain in production, you have also brought down your development environment and have no place to test a fix for the problem, since your entire environment is compromised.
- For the domain, you should have a common, shared, high-performance file system to hold the centralized logging and checkpoint files. If you have all three environments together in one domain, you are mixing production logs with development logs and other files on the same physical disk. Your production backups and disaster recovery files will have more than just production information in them.

- For a future upgrade, it is very likely that you will need to upgrade all components of the domain at once to the new version of PowerCenter. If you have placed development, quality assurance, and production in the same domain, you may need to upgrade all of it at once. This is an undesirable situation in most data integration environments.

For these reasons, Informatica generally recommends having at least two separate domains in any environment:
- Production domain
- Development/Quality Assurance domain

Some architects choose to deploy a separate domain for each environment to further isolate them and to ensure that no disruptions occur in the Quality Assurance environment due to changes in the development environment. The tradeoff is an additional administration console to log into and maintain. Keep in mind that while you may have separate domains with separate domain metadata repositories, there is no need to migrate any metadata from the domain repositories between development, Quality Assurance, and production. The domain metadata repositories collect information based on the physical location and connectivity of the components, so it makes no sense to migrate them between environments. You do need to provide a separate database location for each, but there is no need to migrate the data within; each one is specific to the environment it services.

Administration
The domain administrator has access to start and shut down all services within the domain, as well as the ability to create other users and delegate roles and responsibilities to them. Keep in mind that if the domain itself is shut down, it has to be restarted via the command line or the host operating system GUI.
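For example, on UNIX the Informatica Services can be stopped and restarted with the infaservice utility shipped under the server installation; this is a sketch only, and the installation path shown is an assumption that will differ in your environment:

cd /opt/informatica/powercenter/server/tomcat/bin
./infaservice.sh shutdown
./infaservice.sh startup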


PowerCenter's High Availability option provides the ability to create multiple gateway nodes in a domain, such that if the master gateway node fails, another can assume its responsibilities, including authentication, logging, and service management.

Security and Folders


Much like traditional repository security, security in the domain interface is set up on a per-folder basis, with owners designated for logical groupings of objects/services in the domain. One of the major differences is that domain security allows the creation of subfolders to segment nodes and services as desired. There are many considerations when deciding on a folder structure, keeping in mind that this logical administrative interface should be accessible to Informatica administrators only, and not to users and groups associated with a developer role (which are designated at the repository level).

New legislation in the United States and Europe, such as Basel II and the Public Company Accounting Reform and Investor Protection Act of 2002 (also known as SOX, SarbOx, and Sarbanes-Oxley), has been widely interpreted to place many restrictions on the ability of persons in development roles to have direct write access to production systems; consequently, administration roles should be planned accordingly. An organization may simply need to use different folders to group objects into Development, Quality Assurance, and Production roles, each with separate administrators. In some instances, systems may need to be entirely separate, with different domains for the Development, Quality Assurance, and Production systems. Sharing of metadata remains simple between separate domains, with PowerCenter's ability to link domains and copy data between linked domains.

For Data Migration projects, it is recommended to establish a standardized architecture that includes a set of folders, connections, and developer access in accordance with the needs of the project. Typically this includes folders for:

- Acquiring data
- Converting data to match the target system
- The final load to the target application
- Establishing reference data structures

When configuring security in PowerCenter 8.5, there are two interrelated security aspects that should be addressed when planning a PowerCenter security policy:

- Role Differentiation: Groups should be created separately to define the roles and privileges typically needed for an Informatica Administrator and for an Informatica Developer. This separation at the group level allows for more efficient administration of PowerCenter user privileges and provides a more secure PowerCenter environment.
- Maintenance of Privileges: As privileges are typically the same for several users within a PowerCenter environment, care should be taken to define these distinct separations ahead of time, so that privileges can be defined at the group level rather than at the individual user level. As a best practice, users should not be granted user-specific privileges unless the need is temporary.

Maintenance
As part of a regular backup of metadata, a recurring backup should be scheduled for the PowerCenter domain configuration database metadata. This can be accomplished through PowerCenter by using the infasetup command, further explained in the Command Line Reference. The schema should also be added to the normal RDBMS backup schedule, thus providing two reliable backup methods for disaster recovery purposes.
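A sketch of such a scheduled domain backup using infasetup is shown below; the connection values and file locations are placeholders, and the exact option names should be verified against the Command Line Reference for your version:

infasetup backupDomain -da dbhost:1521 -du domain_db_user -dp domain_db_password \
  -dt Oracle -bf /backup/domain_backup_`date +%Y%m%d`.bak -dn Enterprise_Dev_Test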

Licensing
As part of PowerCenter 8.5's Service-Oriented Architecture (SOA), licensing for PowerCenter services is centralized within the domain. License key file(s) are received from Informatica at the same time the download location for the software is provided. A PowerCenter service is enabled by adding license object(s) and assigning individual PowerCenter services to the license(s). This can be done during installation, or initial/incremental license keys can be added after installation via the web-based Administration Console (or the infacmd command line utility), as sketched below.
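A sketch of adding an incremental license key from the command line; the domain, user, license name, and key file location are placeholders, and the option names should be verified in the Command Line Reference:

infacmd addLicense -dn Enterprise_Dev_Test -un Administrator -pd Administrator_password \
  -ln PowerCenter_License -lf /install/license.key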

Last updated: 26-May-08 17:36


Managing Repository Size

Challenge


The PowerCenter repository is expected to grow over time as new development and production runs occur. Eventually it can grow to a size that slows repository performance or makes backups increasingly difficult. This Best Practice discusses methods to manage the size of the repository. The release of PowerCenter version 8.x added several features that aid in managing repository size. Although the repository is slightly larger with version 8.x than it was with previous versions, the client tools have increased functionality to reduce the impact of repository size. PowerCenter versions earlier than 8.x require more administration to keep repository sizes manageable.

Description
Why should we manage the size of the repository? Repository size affects the following:

- Database backups and restores. If database backups are being performed, the size required for the backup can be reduced. If PowerCenter backups are being used, you can limit what gets backed up.
- Overall query time of the repository, which slows the performance of the repository over time. Analyzing tables on a regular basis can aid repository table performance.
- Migrations (i.e., copying from one repository to the next). Limit data transfer between repositories to avoid locking up the repository for a long period of time. Some options are available to avoid transferring all run statistics when migrating.

A typical repository starts off small (i.e., 50MB to 60MB for an empty repository) and grows to upwards of 1GB for a large repository. The type of information stored in the repository includes:

- Versions
- Objects
- Run statistics
- Scheduling information
- Variables

Tips for Managing Repository Size

Versions and Objects


Delete old versions and purged objects from the repository. Use repository queries in the client tools to create reusable queries that identify out-of-date versions and objects for removal; the Query Browser can run object queries on both versioned and non-versioned repositories. Old versions and objects not only increase the size of the repository, but also make it more difficult to manage further into the development cycle. Cleaning up the folders makes it easier to determine what is valid and what is not. One way to keep the repository size small is to use shortcuts: create shared folders when the same source/target definitions or reusable transformations are used in multiple folders.

Folders
Remove folders and objects that are no longer used or referenced. Unnecessary folders increase the size of the repository backups. These folders should not be a part of production but they may exist in development or test repositories.

Run Statistics
Remove old run statistics from the repository if you no longer need them. History is important for determining trending, scaling, and performance tuning needs, but you can always generate reports with the PowerCenter Metadata Reporter and save the reports of the data you need. To remove run statistics, go to the Repository Manager and truncate the logs based on dates.
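This can also be scripted; a sketch using pmrep is shown below, where the repository, domain, and credential values are placeholders and the end-time syntax should be confirmed in the pmrep reference:

pmrep connect -r PROD_REPO -d Enterprise_Dev_Test -n repo_user -x repo_password
pmrep truncatelog -t "05/01/2008 00:00:00"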

Recommendations
Informatica strongly recommends upgrading to the latest version of PowerCenter, since the most recent release includes features such as the options to skip workflow and session logs, skip deployment group history, and skip MX data. The repository size in version 8.x and above is larger than in previous versions of PowerCenter, but the added size does not significantly affect the performance of the repository. It is still advisable to analyze the repository tables or update their statistics regularly to keep them optimized.

Informatica does not recommend directly querying the repository tables or performing deletes on them. Use the client tools unless otherwise advised by Informatica technical support personnel.
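For example, a repository backup that skips this bulk data might be scripted roughly as follows; the skip-option letters shown are assumptions that should be verified against the pmrep Command Line Reference for your release:

pmrep connect -r PROD_REPO -d Enterprise_Dev_Test -n repo_user -x repo_password
pmrep backup -o /backup/PROD_REPO_backup.rep -f -b -j -q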

Last updated: 01-Feb-07 18:54


Organizing and Maintaining Parameter Files & Variables

Challenge


Organizing variables and parameters in Parameter files and maintaining Parameter files for ease of use.

Description
Parameter files are a means of providing run-time values for parameters and variables defined in a workflow, worklet, session, mapplet, or mapping. A parameter file can have values for multiple workflows, sessions, and mappings, and can be created with a text editor such as Notepad or vi, or generated by a shell script or an Informatica mapping. Variable values are stored in the repository and can be changed within mappings. However, variable values specified in parameter files supersede values stored in the repository. The values stored in the repository can be cleared or reset using the Workflow Manager.

Parameter File Contents


A parameter file contains the values for variables and parameters. Although a parameter file can contain values for more than one workflow (or session), it is advisable to build a parameter file that contains the values for a single workflow or a logical group of workflows, for ease of administration. When using the command line mode to execute workflows, multiple parameter files can also be configured and used for a single workflow if the same workflow needs to be run with different parameters.

Types of Parameters and Variables


A parameter file contains the following types of parameters and variables:

- Service Variable. Defines a service variable for an Integration Service.
- Service Process Variable. Defines a service process variable for an Integration Service that runs on a specific node.
- Workflow Variable. References values and records information in a workflow. For example, use a workflow variable in a Decision task to determine whether the previous task ran properly.
- Worklet Variable. References values and records information in a worklet. You can use predefined worklet variables in a parent workflow, but you cannot use workflow variables from the parent workflow in a worklet.
- Session Parameter. Defines a value that can change from session to session, such as a database connection or file name.
- Mapping Parameter. Defines a value that remains constant throughout a session, such as a state sales tax rate.
- Mapping Variable. Defines a value that can change during the session. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time the session runs.

Configuring Resources with Parameter File


If a session uses a parameter file, it must run on a node that has access to the file. You create a resource for the parameter file and make it available to one or more nodes. When you configure the session, you assign the parameter file resource as a required resource. The Load Balancer dispatches the Session task to a node that has the parameter file resource. If no node has the parameter file resource available, the session fails.

Configuring Pushdown Optimization with Parameter File


Depending on the database workload, you may want to use source-side, target-side, or full pushdown optimization at different times. For example, you may want to use partial pushdown optimization during the database's peak hours and full pushdown optimization when activity is low. Use the $$PushdownConfig mapping parameter to apply different pushdown optimization configurations at different times. The parameter lets you run the same session using the different types of pushdown optimization. When you configure the session, choose $$PushdownConfig for the Pushdown Optimization attribute and define the parameter in the parameter file. Enter one of the following values for $$PushdownConfig in the parameter file:

- None. The Integration Service processes all transformation logic for the session.
- Source. The Integration Service pushes part of the transformation logic to the source database.
- Source with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database.
- Target. The Integration Service pushes part of the transformation logic to the target database.
- Full. The Integration Service pushes all transformation logic to the database.
- Full with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database. The Integration Service pushes any remaining transformation logic to the target database.
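For example, a parameter file entry for a session might look like the following; the folder, workflow, and session names are hypothetical:

[FINANCE.WF:wf_load_daily.ST:s_m_load_daily]
$$PushdownConfig=Source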

Parameter File Name


Informatica recommends giving the parameter file the same name as the workflow, with a suffix of .par. This helps in identifying and linking the parameter file to a workflow.

Parameter File: Order of Precedence


While it is possible to assign Parameter Files to a session and a workflow, it is important to note that a file specified at the workflow level always supersedes files specified at session levels.

Parameter File Location



Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

Place the parameter files in a directory that can be accessed using a server variable. This makes it possible to move sessions and workflows to a different server without modifying workflow or session properties. You can override the location and name of the parameter file specified in the session or workflow when executing workflows via the pmcmd command.

The following points apply to both parameter and variable files; however, they are more relevant to parameters and parameter files, and are therefore described in those terms.

Multiple Parameter Files for a Workflow


To run a workflow with different sets of parameter values during every run:

1. Create multiple parameter files with unique names.
2. Change the parameter file name (to match the parameter file name defined in the session or workflow properties). You can do this manually or with a pre-session shell (or batch) script.
3. Run the workflow.

Alternatively, run the workflow using pmcmd with the -paramfile option in place of steps 2 and 3, as shown below.
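For example, steps 2 and 3 can be combined into a single pmcmd call; the service, domain, credential, folder, file, and workflow names below are placeholders:

pmcmd startworkflow -sv INT_SVC -d Enterprise_Dev_Test -u pc_user -p pc_password \
  -f FINANCE -paramfile /app/data/parmfiles/wf_load_daily_clientA.par wf_load_daily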

Generating Parameter Files


Based on requirements, you can obtain the values for certain parameters from relational tables or generate them programmatically. In such cases, the parameter files can be generated dynamically using shell (or batch scripts) or using Informatica mappings and sessions. Consider a case where a session has to be executed only on specific dates (e.g., the last working day of every month), which are listed in a table. You can create the parameter file containing the next run date (extracted from the table) in more than one way.

Method 1:
1. The workflow is configured to use a parameter file.
2. The workflow has a Decision task before running the session: comparing the current system date against the date in the parameter file.
3. Use a shell (or batch) script to create the parameter file. Use an SQL query to extract a single date that is greater than the system date (today) from the table and write it to a file in the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily.

The following figure shows the use of a shell script to generate a parameter file.
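A minimal ksh sketch of this method follows; it assumes an Oracle source, a run-dates table named RUN_CALENDAR, and hypothetical connection, folder, and workflow names:

#!/bin/ksh
# Extract the next scheduled run date and write it to the parameter file
NEXT_DATE=`sqlplus -s etl_user/etl_pwd@SRCDB <<EOF
set heading off feedback off pages 0
select to_char(min(run_date),'MM/DD/YYYY') from run_calendar where run_date > sysdate;
exit
EOF`

PARMFILE=/app/data/parmfiles/wf_monthly_load.par
echo "[FINANCE.WF:wf_monthly_load]" > $PARMFILE
echo "\$\$NEXT_RUN_DATE=$NEXT_DATE" >> $PARMFILE

# Run the workflow with the generated parameter file
pmcmd startworkflow -sv INT_SVC -d Enterprise_Dev_Test -u pc_user -p pc_password \
  -f FINANCE -paramfile $PARMFILE wf_monthly_load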


The following figure shows a generated parameter file.

Method 2:
1. The workflow is configured to use a parameter file.
2. The initial value for the date parameter is the first date on which the workflow is to run.
3. The workflow has a Decision task before running the session: comparing the current system date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the workflow, using either a Command task that calls a shell script or a Session task that uses a mapping. This task extracts a date that is greater than the system date (today) from the table and writes it into the parameter file in the required format.
5. Schedule the workflow using the Scheduler to run daily (as shown in the following figure).

Parameter File Templates


In other cases, the parameter values change between runs, but the change can be incorporated into the parameter files programmatically, so there is no need to maintain separate parameter files for each run. Consider, for example, a service provider who gets the source data for each client from flat files located in client-specific directories and writes the processed data into a global database. The source data structure, target data structure, and processing logic are all the same. The log file for each client run has to be preserved in a client-specific directory, and the directory names have the client id as part of the directory structure (e.g., /app/data/Client_ID/). You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one parameter file per client. However, the number of parameter files may become cumbersome to manage as the number of clients increases.

In such cases, a parameter file template (i.e., a parameter file containing values for some parameters and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual parameter file for a specific client, replacing the placeholders with actual values, and then execute the workflow using pmcmd.

[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log

Using a script, replace Client_ID and curdate with actual values before executing the workflow.

The following text is an excerpt from a parameter file that contains service variables for one Integration Service and parameters for four workflows:

[Service:IntSvs_01]
$PMSuccessEmailUser=pcadmin@mail.com
$PMFailureEmailUser=pcadmin@mail.com
[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]
$$platform=unix
[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]
$$platform=unix
$DBConnection_ora=qasrvrk2_hp817
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]
$$DT_WL_lvl_1=02/01/2005 01:05:11
$$Double_WL_lvl_1=2.2
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]
$$DT_WL_lvl_2=03/01/2005 01:01:01
$$Int_WL_lvl_2=3
$$String_WL_lvl_2=ccccc

Use Case 1: Fiscal Calendar-Based Processing


Some financial and retail companies use a fiscal calendar for accounting purposes. Use mapping parameters to process the correct fiscal period. For example, create a calendar table in the database with the mapping between the Gregorian calendar and the fiscal calendar. Create mapping parameters in the mappings for the starting and ending dates, and create another mapping with the logic to create the parameter file. Run the parameter file creation session before running the main session. The calendar table could be joined directly with the main table, but the performance may not be good in some databases, depending upon how the indexes are defined. Using a parameter file avoids that join and can result in better performance.
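As an illustration, the generated parameter file for the main session might contain entries like the following (the folder, workflow, and parameter names are hypothetical), which the source qualifier filter would then reference:

[FINANCE.WF:wf_load_gl]
$$FISCAL_PERIOD_START=01/26/2008 00:00:00
$$FISCAL_PERIOD_END=02/22/2008 23:59:59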

Use Case 2: Incremental Data Extraction


Mapping parameters and variables can be used to extract only the data inserted or updated since the previous extract. Use mapping parameters or variables in the source qualifier to determine the beginning and ending timestamps for the extraction. For example, create a user-defined mapping variable $$PREVIOUS_RUN_DATE_TIME that saves the timestamp of the last row the Integration Service read in the previous session. Use this variable for the beginning timestamp and the built-in variable $$$SessStartTime for the ending timestamp in the source filter. Use the following filter to incrementally extract data from the database:

LOAN.record_update_timestamp > TO_DATE($$PREVIOUS_RUN_DATE_TIME)
and LOAN.record_update_timestamp <= TO_DATE($$$SessStartTime)

Use Case 3: Multi-Purpose Mapping


Mapping parameters can be used to extract data from different tables using a single mapping. In some cases the table name is the only difference between extracts. For example, consider two similar extracts from the tables FUTURE_ISSUER and EQUITY_ISSUER, where the column names and data types within the tables are the same. Use a mapping parameter $$TABLE_NAME in the source qualifier SQL override and create two parameter files, one for each table name. Then either run the workflow using the pmcmd command with the corresponding parameter file, or create two sessions, each with its corresponding parameter file.
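A sketch of this approach follows; the column names and the folder/workflow names are hypothetical:

SQL override in the Source Qualifier:
SELECT issuer_id, issuer_name, issue_date FROM $$TABLE_NAME

Parameter file for the first extract:
[SECURITIES.WF:wf_extract_issuer]
$$TABLE_NAME=FUTURE_ISSUER

Parameter file for the second extract:
[SECURITIES.WF:wf_extract_issuer]
$$TABLE_NAME=EQUITY_ISSUER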

Use Case 4: Using Workflow Variables


You can create variables within a workflow. When you create a variable in a workflow, it is valid only in that workflow. Use the variable in tasks within that workflow. You can edit and delete user-defined workflow variables.

Use user-defined variables when you need to make a workflow decision based on criteria you specify. For example, you create a workflow to load data to an orders database nightly. You also need to load a subset of this data to headquarters periodically, every tenth time you update the local orders database. Create separate sessions to update the local database and the one at headquarters, and use a user-defined variable to determine when to run the session that updates the orders database at headquarters.

To configure user-defined workflow variables, set up the workflow as follows:

1. Create a persistent workflow variable, $$WorkflowCount, to represent the number of times the workflow has run.
2. Add a Start task and both sessions to the workflow.
3. Place a Decision task after the session that updates the local orders database. Set up the decision condition to check whether the number of workflow runs is evenly divisible by 10, using the modulus (MOD) function, as in the expression shown after these steps.
4. Create an Assignment task to increment the $$WorkflowCount variable by one.
5. Link the Decision task to the session that updates the database at headquarters when the decision condition evaluates to true; link it to the Assignment task when the decision condition evaluates to false.

When you configure workflow variables using conditions, the session that updates the local database runs every time the workflow runs, and the session that updates the database at headquarters runs every tenth time the workflow runs.
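The decision condition in step 3 can be written with the MOD function, for example:

MOD($$WorkflowCount, 10) = 0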

Last updated: 09-Feb-07 16:20


Platform Sizing

Challenge


Determining the appropriate platform size to support the PowerCenter environment based on customer infrastructure and requirements.

Description
The main factors that affect the sizing estimate are the input parameters derived from the requirements and the constraints imposed by the existing infrastructure and budget. Other important factors include the choice of the Grid/High Availability Option, future growth estimates, and real-time versus batch load requirements. The required platform size to support PowerCenter depends upon each customer's unique infrastructure and processing requirements. The Integration Service allocates resources for individual extraction, transformation, and load (ETL) jobs, or sessions. Each session has its own resource requirement. The resources required for the Integration Service depend upon the number of sessions, the complexity of each session (i.e., what it does while moving data), and how many sessions run concurrently. This Best Practice discusses the relevant questions pertinent to estimating the platform requirements.

TIP: An important concept regarding platform sizing is not to size your environment too soon in the project lifecycle. A common mistake is to size the servers before any ETL has been designed or developed; in many cases these platforms turn out to be too small for the resulting system. It is better to analyze sizing requirements after the data transformation processes have been well defined during the design and development phases.

Environment Questions
To determine platform size, consider the following questions regarding your environment:

- What sources do you plan to access? How do you currently access those sources?
- Have you decided on the target environment (e.g., database, hardware, operating system)? If so, what is it?
- Have you decided on the PowerCenter environment (e.g., hardware, operating system, 32/64-bit processing)?
- Is it possible for the PowerCenter services to be on the same server as the target?
- How do you plan to access your information (e.g., cube, ad-hoc query tool) and what tools will you use to do this?
- What other applications or services, if any, run on the PowerCenter server?
- What are the latency requirements for the PowerCenter loads?

PowerCenter Sizing Questions


To determine server size, consider the following questions:

- Is the overall ETL task currently being performed? If so, how is it being done, and how long does it take?
- What is the total volume of data to move?
- What is the largest table (i.e., bytes and rows)? Is there any key on this table that can be used to partition load sessions, if needed?
- How often does the refresh occur? Will the refresh be scheduled at a certain time, or driven by external events?
- Is there a "modified" timestamp on the source table rows?
- What is the batch window available for the load?
- Are you doing a load of detail data, aggregations, or both?
- If you are doing aggregations, what is the ratio of source/target rows for the largest result set? How large is the result set (bytes and rows)?

The answers to these questions provide an approximate guide to the factors that affect PowerCenter's resource requirements. To simplify the analysis, focus on the large jobs that drive the resource requirement.

PowerCenter Resource Consumption


The following sections summarize some recommendations for PowerCenter resource consumption.

Processor
Allow 1 to 1.5 CPUs per concurrent non-partitioned session or transformation job. Note that a virtual CPU should be counted as approximately 0.75 of a CPU; for example, 4 CPUs with 4 cores each could be counted as 12 virtual CPUs.

Memory
- 20 to 30MB of memory for the Integration Service for session coordination.
- 20 to 30MB of memory per session, if there are no aggregations, lookups, or heterogeneous data joins. Note that 32-bit systems have an operating system limitation of 2GB per session.
- Caches for aggregations, lookups, or joins use additional memory:
  - Lookup tables are cached in full; the memory consumed depends on the size of the tables and the selected data ports.
  - Aggregate caches store the individual groups; more memory is used if there are more groups. Sorting the input to aggregations greatly reduces the need for memory.
  - Joins cache the master table in a join; the memory consumed depends on the size of the master.
- Full pushdown optimization uses far fewer resources on the PowerCenter server than partial (source-side or target-side) pushdown optimization.

System Recommendations
PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple servers using the Grid Option. The Grid Option allows for adding capacity at a low cost while providing implicit high availability with the active/active Integration Service configuration. Below are the recommendations for a single-node PowerCenter server.

Minimum Server
One node with 4 CPUs, 16GB of memory (rather than the minimal requirement of 4GB of RAM), and 6GB of storage for the PowerCenter binaries. A separate file system is recommended for the infa_shared working file directory; it can be sized according to the workload profile.

Disk Space
Disk space is not a factor if the machine is used only for PowerCenter services, unless one or more of the following conditions exist:

- Data is staged to flat files on the PowerCenter machine.
- Data is stored in incremental aggregation files for adding data to aggregates. The space consumed is about the size of the data aggregated.
- Temporary space is needed for paging for transformations that require large caches that cannot be entirely held in system memory.
- Session logs are saved by timestamp.

If any of these factors applies, additional storage should be allocated for the file system used by the infa_shared directory. Typically, Informatica customers allocate a minimum of 100 to 200 GB for this file system. Informatica recommends monitoring disk space on a regular basis or maintaining a script to purge unused files, as sketched below.
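A minimal sketch of such a purge script is shown below; the infa_shared location, the subdirectory names, and the 30-day retention are assumptions to be adjusted to local standards:

#!/bin/ksh
# Purge working files older than 30 days from the shared working file system
INFA_SHARED=/mountpoint/infa_shared
for dir in SessLogs WorkflowLogs BadFiles
do
    find $INFA_SHARED/$dir -type f -mtime +30 -exec rm -f {} \;
done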

Sizing Analysis

The basic goal is to size the server so that all jobs can complete within the specified load window. Consider the answers to the questions in the "Environment Questions" and "PowerCenter Sizing Questions" sections to estimate the required number of sessions, the volume of data each session moves, and its lookup table, aggregation, and heterogeneous join caching requirements. Use these estimates with the recommendations in the "PowerCenter Resource Consumption" section to determine the number of processors, the memory, and the disk space required to meet the load window.

PowerCenter provides an advanced level of automatic memory configuration, with the option of manual configuration. The minimum required cache memory for each active transformation in a mapping can be calculated and accumulated for concurrent jobs. You can use the Cache Calculator feature for Aggregator, Joiner, Rank, and Lookup transformations.

Note that the deployment environment often creates performance constraints that hardware capacity cannot overcome. The Integration Service throughput is usually constrained by one or more of the environmental factors addressed by the questions in the "Environment Questions" section. For example, if the data sources and target are both remote from the PowerCenter server, the network is often the constraining factor. At some point, additional sessions, processors, and memory may not yield faster execution because the network (not the PowerCenter services) imposes the performance limit.

The hardware sizing analysis is highly dependent on the environment in which the server is deployed. You need to understand the performance characteristics of the environment before drawing any sizing conclusions. It is also vitally important to remember that other applications (in addition to PowerCenter) are likely to use the platform. PowerCenter often runs on a server with a database engine and query/analysis tools. In fact, in an environment where PowerCenter, the target database, and query/analysis tools all run on the same server, the query/analysis tool often drives the hardware requirements. However, if the loading is performed after business hours, the query/analysis tool requirements may not be a sizing limitation.

Last updated: 27-May-08 14:44


PowerCenter Admin Console

Challenge


Using the PowerCenter Administration Console to administer the PowerCenter domain and services.

Description
PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines. The PowerCenter domain is the fundamental administrative unit in PowerCenter. A domain is a collection of nodes and services that you can group in folders based on administration ownership. The Administration Console consolidates administrative tasks for domain objects such as services, nodes, licenses, and grids. For more information on domain configuration, refer to the Best Practice on Domain Configuration.

Folders and Security


It is a good practice to create folders in the domain in order to organize objects and manage security. Folders can contain nodes, services, grids, licenses and other folders. Folders can be created based on functionality type, object type, or environment type.
- Functionality-type folders group services based on a functional area, such as Sales or Marketing.
- Object-type folders group objects based on the service type; for example, a folder containing all Integration Services.
- Environment-type folders group objects based on the environment; for example, if development and testing share the same domain, group the services according to the environment.

Create user accounts in the Administration Console, then grant permissions and privileges on the folders those users need access to. It is a good practice for the administrator to monitor user activity in the domain periodically and save the reports for audit purposes.


Nodes, Services, and Grids


A node is the logical representation of a machine in a domain. One node in the domain acts as a gateway to receive service requests from clients and route them to the appropriate service and node. Node properties can be set and modified using the Administration Console. It is important to note that the property that controls the maximum number of sessions/tasks that can run on a node is Maximum Processes. Set this threshold to a suitably high number; for example, 200 is a good threshold. If you are using Adaptive dispatch mode, it is a good practice to recalculate the CPU profile when the node is idle, since the calculation uses 100 percent of the CPU.

The Administration Console also allows you to manage application services and to access the properties of all services in one place. For more information on configuring these properties, refer to the Best Practice on Advanced Server Configuration Options. In addition, you can create grids and assign nodes to them using the Administration Console.
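For routine checks, similar information is also available from the command line; the following is a sketch in which the domain name and credentials are placeholders and the option abbreviations should be verified against the Command Line Reference for your release:

infacmd listServices -dn Enterprise_Dev_Test -un Administrator -pd Administrator_password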

Last updated: 01-Feb-07 18:54


PowerCenter Enterprise Grid Option

Challenge


Build a cost-effective and scaleable data integration architecture that is reliable and able to respond to changing business requirements.

Description
The PowerCenter Grid Option enables enterprises to build dynamic and scalable data integration infrastructures that have the flexibility to meet diverse business needs. The Grid Option can exploit underutilized computing resources to handle peak load periods, and its dynamic partitioning and load balancing capabilities can improve the overall reliability of a data integration platform. If a server fails in a grid-only configuration without the HA option/capability, the tasks assigned to it are not automatically recovered, but any subsequent tasks are assigned to other servers.

The foundation for a successful PowerCenter Grid Option implementation is the storage subsystem. In order to provide a high-performance single file system view to PowerCenter, it is necessary to set up either a Clustered File System (CFS) or Network Attached Storage (NAS). While NAS is directly accessible from multiple nodes and can use the existing network fabric, CFS allows nodes to share the same directories by managing concurrent read/write access. CFS should be configured for simultaneous reads/writes, and the CFS block size should be set to optimize PowerCenter disk I/O. A separate mount point for the infa_shared directory should be created using the shared file system. The infa_shared directory contains the working file subdirectories such as Cache, SrcFiles, and TgtFiles. The PowerCenter binaries should be installed on local storage. Some CFS alternatives include:

- Red Hat Global File System (GFS)
- Sun Cluster (CFS, QFS)
- Veritas Storage Foundation Cluster File System
- HP Serviceguard Cluster File System
- IBM AIX Cluster File System (GPFS)

NAS provides access to storage over the network. NAS devices contain a server that provides file services to other hosts on the LAN using network file access methods such as CIFS or NFS. Most NAS devices offer file services over the Windows-centric SMB (Server Message Block) and CIFS (Common Internet File System) protocols, the UNIX favorite NFS (Network File System), or the near-universal HTTP. With newer protocols such as DAFS (Direct Access File System), traditionally I/O-intensive applications are also moving to NAS. NAS is directly connected to the LAN and hence consumes large amounts of LAN bandwidth. In addition, special backup methods need to be used for backup and disaster recovery.


The PowerCenter Integration Service reads from and writes to the shared file system in a grid configuration. When persistent lookups are used, there may be simultaneous reads from multiple nodes. The Integration Service performs random reads for lookup caches. If cache performance degrades as a result of using a certain type of CFS or NAS product, the cache directory can be placed on local storage. In the case of persistent cache files that need to be accessed from multiple nodes, the persistent cache file can be built on one node first and then copied to the other nodes. This reduces the random-read performance impact of the CFS or NAS product.

When installing the PowerCenter Grid Option on UNIX, use the same user id (uid) and group id (gid) for each UNIX account. If the infa_shared directory is placed on a shared file system such as CFS or NAS, the UNIX accounts should have read/write access to the same files. For example, if a workflow running on node1 creates a persistent cache file in the Cache directory, node2 should be able to read and update this file. When installing the PowerCenter Grid Option on Windows, the user assigned to the Informatica Services joining the grid should have permissions to access the shared directory. This can be accomplished by granting Full Control, Change, and Read access to the shared directory for the machine account. As a post-installation step, the persistent cache files, parameter files, logs, and other run-time files should be configured to use the shared file system by pointing the $PMRootDir variable to this directory.

PowerCenter resources can be configured to assign specific tasks to specific nodes. The objective in this type of configuration is to create a dynamic grid to meet changing business needs. For example, a dummy custom resource can be defined and assigned to tasks. This custom resource can be made permanently available to the production nodes. If, during peak month-end processing, the need arises to use an additional node from the test environment, simply make this custom resource available to the additional node to allow production tasks to run on the new server.

In metric-based dispatch mode and adaptive dispatch mode, the Load Balancer collects and stores statistics from the last three runs of each task and compares them with node load metrics. This metadata is available in the OPB_TASK_STATS repository table. The CPU and memory metrics available in this table can be used for capacity planning and departmental chargebacks. Since this table only contains statistics from the last three runs of a task, it is necessary to build a process that extracts data from this table into a custom history table. The history table can be used to calculate averages and perform trend analysis.

Proactive monitoring for Service Manager failures is essential. The Service Manager manages both the Integration Service and the Repository Service. In a two-node grid configuration, two Service Manager processes are running. Use custom scripts or third-party tools such as Tivoli Monitoring or HP OpenView to check the health and availability of the PowerCenter Service Manager process. Below is a sample script that can be called from a monitoring tool:

#!/bin/ksh
# Initializing runtime variables
typeset integer no_srv=0
srv_env=`uname -n`
# Check whether the PowerCenter process (tomcat) is currently running in the background
no_srv=`ps -ef | grep tomcat | grep -v grep | wc -l`
# If it is not running, exit with a message
if [ $no_srv -eq 0 ]
then
    echo "PowerCenter service process on $srv_env is not running"
    exit 1
fi
exit 0

To upgrade a two-node grid without incurring downtime, follow the steps below:

1. Set up a separate schema/database to hold a copy of the production repository.
2. Take node1 out of the existing grid.
3. Upgrade the binaries and the repository while node2 is handling the production loads.
4. Switch the production loads to node1.
5. While node1 is handling the production loads, upgrade the node2 binaries.
6. After the node2 upgrade is complete, node2 can be put back on the grid.

With Session on Grid, PowerCenter automatically distributes the partitions of the transformations across the grid; you do not need to specify the distribution of nodes for each transformation. By using dynamic partitioning, which bases the number of partitions on the number of nodes in the grid, a session can scale up automatically when the number of nodes in the grid is expanded.

Last updated: 27-May-08 13:54


Understanding and Setting UNIX Resources for PowerCenter Installations

Challenge


This Best Practice explains what UNIX resource limits are, and how to control and manage them.

Description
UNIX systems impose per-process limits on resources such as processor usage, memory, and file handles. Understanding and setting these resources correctly is essential for PowerCenter installations.

Understanding UNIX Resource Limits


UNIX systems impose limits on several different resources. The resources that can be limited depend on the actual operating system (e.g., Solaris, AIX, Linux, or HPUX) and the version of the operating system. In general, all UNIX systems implement per-process limits on the following resources. There may be additional resource limits, depending on the operating system.

- Processor time: The maximum amount of processor time that can be used by a process, usually in seconds.
- Maximum file size: The size of the largest single file a process can create. Usually specified in blocks of 512 bytes.
- Process data: The maximum amount of data memory a process can allocate. Usually specified in KB.
- Process stack: The maximum amount of stack memory a process can allocate. Usually specified in KB.
- Number of open files: The maximum number of files that can be open simultaneously.
- Total virtual memory: The maximum amount of memory a process can use, including stack, instructions, and data. Usually specified in KB.
- Core file size: The maximum size of a core dump file. Usually specified in blocks of 512 bytes.

These limits are implemented on an individual process basis. The limits are also inherited by child processes when they are created. In practice, this means that the resource limits are typically set at log-on time, and apply to all processes started from the login shell. In the case of PowerCenter, any limits in effect before the Integration Service is started also apply to all sessions (pmdtm) started from that node. Any limits in effect when the Repository Service is started also apply to all pmrepagents started from that repository service (repository service process is an instance of the repository service running on a particular machine or node).

When a process exceeds a resource limit, UNIX fails the operation that caused the limit to be exceeded. Depending on the limit that is reached, memory allocations fail, files can't be opened, and processes are terminated when they exceed their processor time. Since PowerCenter sessions often use a large amount of processor time, open many files, and can use large amounts of memory, it is important to set resource limits correctly, so that the operating system does not limit access to required resources while problems are still prevented.

Hard and Soft Limits


Each resource that can be limited actually allows two limits to be specified: a soft limit and a hard limit. Hard and soft limits can be confusing. From a practical point of view, the difference between hard and soft limits doesn't matter to PowerCenter or any other process; the lower value is enforced when it is reached, whether it is a hard or a soft limit. The difference between hard and soft limits really only matters when changing resource limits. The hard limits are the absolute maximums set by the System Administrator and can only be changed by the System Administrator. The soft limits are recommended values set by the System Administrator and can be increased by the user, up to the hard limits.

UNIX Resource Limit Commands


The standard interface to UNIX resource limits is the ulimit shell command. This command displays and sets resource limits. The C shell implements a variation of this command called limit, which has different syntax but the same functions.
- ulimit -a: Displays all soft limits
- ulimit -a -H: Displays all hard limits in effect

Recommended ulimit settings for a PowerCenter server:

- Processor time: Unlimited. This is needed for the pmserver and pmrepserver processes that run forever.
- Maximum file size: Based on what's needed for the specific application. This is an important parameter for keeping a session from filling a whole file system, but it needs to be large enough not to affect normal production operations.
- Process data: 1GB to 2GB.
- Process stack: 32MB.
- Number of open files: At least 256. Each network connection counts as a file, so source, target, and repository connections, as well as cache files, all use file handles.
- Total virtual memory: The largest expected size of a session. 1GB should be adequate, unless sessions are expected to create large in-memory aggregate and lookup caches that require more memory. If you have sessions that are likely to require more than 1GB, set the total virtual memory appropriately. Remember that on a 32-bit OS, the maximum virtual memory for a session is 2GB.
- Core file size: Unlimited, unless disk space is very tight. The largest core files can be roughly 2-3GB, but they should be deleted after analysis, and there really shouldn't be multiple core files lying around.

Setting Resource Limits


Resource limits are normally set in the login script, either .profile for the Korn shell or .bash_profile for the bash shell. One ulimit command is required for each resource being set, and usually the soft limit is set. A typical sequence is:

ulimit -S -c unlimited
ulimit -S -d 1232896
ulimit -S -s 32768
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited

After running this, the limits are changed:

% ulimit -S -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) 1232896
file size (blocks, -f) 2097152
max memory size (kbytes, -m) unlimited
open files (-n) 1024
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
virtual memory (kbytes, -v) unlimited

Setting or Changing Hard Resource Limits


Setting or changing hard resource limits varies across UNIX types. Most current UNIX systems set the initial hard limits in the file /etc/profile, which must be changed by a System Administrator. In some cases, it is necessary to run a system utility such as smit on AIX to change the global system limits.

Last updated: 01-Feb-07 18:54


PowerExchange for Oracle CDC

Challenge


Configure the Oracle environment for optimal performance when using PowerExchange Change Data Capture (CDC) in a production environment.

Description
There are two performance types that need to be considered when dealing with Oracle CDC: latency of the data and restartability of the environment. Some of the factors that impact these performance types are configurable within PowerExchange, while others are not. These two performance types are addressed separately in this Best Practice.

Data Latency Performance


The objective of latency performance is to minimize the amount of time that it takes for a change made to the source database to appear in the target database. Some of the factors that can affect latency performance are discussed below.

Location of PowerExchange CDC


The optimal location for installing PowerExchange CDC is on the server that contains the Oracle source database. This eliminates the need to use the network to pass data between Oracle's LogMiner and PowerExchange. It also eliminates the need to use SQL*Net for this process and minimizes the amount of data being moved across the network. For best results, install the PowerExchange Listener on the same server as the source database server.

Volume of Data
The volume of data that the Oracle Log Miner has to process in order to provide changed data to PowerExchange can have a significant impact on performance. Bear in mind that in addition to the changed data rows, other processes may be writing large volumes of data to the Oracle redo logs. These include, but are not limited to:
- Oracle catalog dumps
- Oracle workload monitor customizations
- Other (non-Oracle) tools that use the redo logs to provide proprietary information

In order to optimize PowerExchange's CDC performance, the amount of data these processes write to the Oracle redo logs needs to be minimized, both in terms of volume and frequency. This includes limiting LogMiner to a single invocation. Review the processes that actively write data to the Oracle redo logs and tune them within the context of the production environment. Monitoring the redo log switches and the creation of archived log files is one way to determine how busy the source database is. The size of the archived log files and how often they are created over a day will give a good idea of the performance implications.
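One way to gauge this activity is a simple query against Oracle's V$ARCHIVED_LOG view; the following is a sketch and assumes the monitoring user has access to the dynamic performance views:

-- Archived log count and volume per day for the last seven days
SELECT TRUNC(completion_time) AS log_date,
       COUNT(*) AS archived_logs,
       ROUND(SUM(blocks * block_size) / 1024 / 1024) AS total_mb
FROM   v$archived_log
WHERE  completion_time > SYSDATE - 7
GROUP BY TRUNC(completion_time)
ORDER BY 1;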

Server Workload
Optimize the performance of the Oracle database server by reducing the number of unnecessary tasks it is performing concurrently with the PowerExchange CDC components. This may include a full review of the backup and restore schedules, Oracle import and export processing and other application software utilized within the production server environment. PowerCenter also contributes to the workload on the server where PowerExchange CDC is running; so it is important to optimize these workload tasks. This can be accomplished through mapping design. If possible, include all of the processing of PowerExchange CDC sources within the same mapping. This will minimize the number of tasks generated and will ensure that all of the required data from either the Oracle archive log (i.e., near real time) or the CDC files (i.e., CAPXRT, condense) process within a single pass of the logs or CDC files.

Condense Option Considerations


The condense option for Oracle CDC provides only the required data by reducing the collected data based on the Unit of Work information. This can prevent the transfer of unnecessary data and save CPU and memory resources.

In order to properly allocate space for the files created by the condense process, it is necessary to perform capacity planning. In determining the space required for the CDC data files, it is important to know whether before and after images (or just after images) are required. The retention period for these files must also be considered. The retention period is defined by the COND_CDCT_RET_P parameter in the dtlca.cfg file; the value of this parameter specifies the retention period in days. The general algorithms for calculating this space are outlined below.

After image only:

Estimated condense file disk space for Table A = ((width of Table A in bytes * estimated number of data changes for Table A per 24-hour period) + 700 bytes for the six fields added to each CDC record) * the value of the COND_CDCT_RET_P parameter

Before/after image:

Estimated condense file disk space for Table A = (((width of Table A in bytes * estimated number of data changes for Table A per 24-hour period) * 2) + 700 bytes for the six fields added to each CDC record) * the value of the COND_CDCT_RET_P parameter

Accurate capacity planning can be accomplished by running sample condense jobs for a given number of source changes to determine the storage required. The size of the files created by the condense process can be used to project the actual storage required in a production environment.
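As a worked example with purely hypothetical figures: for a table 500 bytes wide averaging 10,000 changes per 24-hour period, with after images only and COND_CDCT_RET_P=5, the estimate would be ((500 * 10,000) + 700) * 5 = 25,003,500 bytes, or roughly 24 MB of condense file space.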

Continuous Capture Extract Option Considerations


When Continuous Capture Extract is used for Oracle CDC, condense files can be consumed with CAPXRT processing. Since the PowerCenter session waits for the creation of new condense files (rather than stopping and restarting) the CPU and memory impact of real-time processing is reduced. Similar to the Condense option, there is a need to perform proper capacity planning for the files created as a result of using the Continuous Capture Extract option.

PowerExchange CDC Restart Performance


The amount of time required to restart the PowerExchange CDC process should be considered when determining performance. The PowerExchange CDC process will need to be restarted whenever any of the following events occur:
- A schema change is made to a table.
- An existing change registration is amended.
- A PowerExchange service pack is applied or a configuration file is changed.
- An Oracle patch or bug fix is applied.
- An operating system patch or upgrade is applied.

A copy of the Oracle catalog must be placed on the archive log in order for LogMiner to function correctly. The frequency of these copies is very site specific, and it can affect the amount of time that it takes the CDC process to restart. Several parameters in the dbmover.cfg configuration file can assist in optimizing restart performance:

RSTRADV: Specifies the number of seconds to wait after receiving a Unit of Work (UOW) for a source table before advancing the restart tokens by returning an empty UOW. This parameter is very beneficial when the frequency of updates on some tables is low in comparison to other tables.

CATINT: Specifies the frequency with which the Oracle catalog is copied to the archive logs. Since LogMiner needs a copy of the catalog on the archive log to become operational, this parameter affects which archive log is used to restart the CDC process. When Oracle places a catalog copy on the archive log, it first flushes all of the online redo logs to the archive logs before writing out the catalog.

CATBEGIN: Specifies the time of day, on a 24-hour clock, at which the Oracle catalog copy process should begin.

CATEND: Specifies the time of day, on a 24-hour clock, at which the Oracle catalog copy process should end.

Code these parameters carefully, because they affect the amount of time it takes to restart the PowerExchange CDC process. The following is a sample of the dbmover.cfg parameters that affect the Oracle CDC process:

/********************************************************************/
/* Change Data Capture Connection Specifications
/********************************************************************/
/*
CAPT_PATH=/mountpoint/infapwx/v851/chgreg
CAPT_XTRA=/mountpoint/infapwx/v851/chgreg/camaps
/*
CAPI_SRC_DFLT=(ORA,CAPIUOWC)
CAPI_SRC_DFLT=(CAPX,CAPICAPX)
/*
/********************************************************************/
/* Oracle Change Data Capture Parameters
/********************************************************************/
/* see Oracle Adapter Guide
/*     Chapter 3 Preparing for Oracle CDC
/* see Reference Guide
/*     Chapter 9 - Configuration File Parameters
/* see Readme_ORACAPT.txt
/********************************************************************/
/*
/********************************************************************/
/*************** Oracle - Change Data Capture **************/
/********************************************************************/
ORACLEID=(ORACAPT,oracle_sid,connect_string,capture_connect_string)
CAPI_CONNECTION=(NAME=CAPIUOWC,TYPE=(UOWC,CAPINAME=CAPIORA,RSTRADV=60))
CAPI_CONNECTION=(NAME=CAPIORA,DLLTRACE=ABC,TYPE=(ORCL,CATINT=30,
    CATBEGIN=00:01,CATEND=23:59,COMMITINT=5,
    REPNODE=local,BYPASSUF=Y,ORACOLL=ORACAPT))
/*
/****************** Oracle - Continuous CAPX ***************/
/*
/*CAPI_CONNECTION=(NAME=CAPICAPX,TYPE=(CAPX,DFLTINST=ORACAPT))
/*

The following is a sample of the dtlca.cfg parameters that control the Oracle CDC condense process:

/********************************************************************/
/* PowerExchange Condense Configuration File
/* See Oracle Adapter Guide
/*     Chapter 3 Preparing for Oracle CDC
/*     Chapter 6 Condensing Changed Data
/********************************************************************/
/* The value for the DBID parameter must match the Collection-ID
/* contained in the ORACLE-ID statement in the dbmover.cfg file.
/********************************************************************/
/*
DBID=ORACAPT
DB_TYPE=ORA
/*
EXT_CAPT_MASK=/mountpoint/infapwx/v851/condense/condense
CHKPT_BASENAME=/mountpoint/infapwx/v851/condense/condense.CHKPT
CHKPT_NUM=10
COND_CDCT_RET_P=5
/*
/********************************************************************/
/* COLL_END_LOG equal to 1 means BATCH MODE
/* COLL_END_LOG equal to 0 means CONTINUOUS MODE
/********************************************************************/
/*
COLL_END_LOG=0
NO_DATA_WAIT=2
NO_DATA_WAIT2=60
/*
/********************************************************************/
/* FILE_SWITCH_CRIT of M means minutes
/* FILE_SWITCH_CRIT of R means records
/********************************************************************/
/*
FILE_SWITCH_CRIT=M
FILE_SWITCH_VAL=15
/*
/********************************************************************/
/* CAPT_IMAGE of AI means AFTER IMAGE
/* CAPT_IMAGE of BA means BEFORE and AFTER IMAGE
/********************************************************************/
/*
CAPT_IMAGE=AI
/*
UID=Database User Id
PWD=Database User Id Password
/*
SIGNALLING=Y
/*
/********************************************************************/
/* The following parameters are only used during a cold start and force
/* the cold start to use the most recent catalog copy. Without these
/* parameters, if the v_$transaction and v_$archive_log views are out of
/* sync, there is a very good chance that the most recent catalog copy
/* will not be used for the cold start.
/********************************************************************/
/*
SEQUENCE_TOKEN=0
RESTART_TOKEN=0

Last updated: 27-May-08 15:07


PowerExchange for SQL Server CDC

Challenge


Install, configure, and performance tune PowerExchange for MS SQL Server Change Data Capture (CDC).

Description
PowerExchange Real-Time for MS SQL Server uses SQL Server publication technology to capture changed data. To use this feature, Distribution must be enabled. The publisher database handles replication, while the distributor database transfers the replicated data to PowerExchange, which is installed on the distribution database server. The following figure depicts a typical high-level architecture:

When looking at the architecture for SQL Server capture, we see that PowerExchange treats the SQL Server Publication process as a virtual change stream. By turning on the standard SQL Server publication process, SQL Server publishes changes to the SQL Server Distribution database. PowerExchange then reads the changes from the Distribution database. When Publication is used and the Distribution function is enabled, support for capturing changes for a table of interest is dynamically activated through the registration of a source in the PowerExchange Navigator GUI (i.e., PowerExchange makes the appropriate calls to SQL Server automatically, via SQL DMO objects).

Key Setup Steps


The key steps involved in setting up the change capture process are:

1. Modify the PowerExchange dbmover.cfg file on the server. Example statements that must be added:

CAPI_CONN_NAME=CAPIMSSC
CAPI_CONNECTION=(NAME=CAPIMSSC,
    TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,repnode=SDMS052))

2. Configure MS SQL Server replication. Microsoft SQL Server replication must be enabled using the Microsoft SQL Server Publication technology. Informatica recommends enabling distribution through the SQL Server Management Console. Multiple SQL Servers can use a single Distribution database; however, Informatica recommends using one Distribution database for Production and a separate one for Development/Test. In addition, for a busy environment, placing the Distribution database on a separate server is advisable. Also, configure the Distribution database for a retention period of 10 to 14 days.

3. Ensure that the MS SQL Server Agent Service is running.

4. Register sources using the PowerExchange Navigator. Source tables must have a primary key. Note that system admin authority is required to register source tables.

Performance Tuning Tips


If you plan to capture large numbers of transaction updates, consider using a dedicated server as the host of the distribution database. This will avoid contention for CPU and disk storage with a production instance.

SQL Server CDC latency can appear high when change volumes are low: it can take approximately ten seconds for changes made at the source to take effect at the target. You can alter the following parameters to reduce this latency:

- POLWAIT
- PollingInterval

POLWAIT

This parameter specifies the number of seconds to wait between polls for new data after the end of the current data has been reached.

- Specify this parameter in the dbmover.cfg file on the Microsoft SQL Server Distribution database machine.
- The default is ten seconds. Reducing this value to one or two seconds can improve performance.
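As a sketch only, POLWAIT might be added to the MSQL CAPI_CONNECTION shown in the setup steps above as follows (the DISTSRV/DISTDB values repeat the earlier sample values; verify the exact parameter placement in the PowerExchange Reference Manual for your release):

CAPI_CONNECTION=(NAME=CAPIMSSC,
    TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,repnode=SDMS052,POLWAIT=2))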

PollingInterval

You can also decrease the polling interval parameter of the Log Reader Agent in Microsoft SQL Server. Reducing this value reduces the delay in polling for new records.

- Modify this parameter using the SQL Server Enterprise Manager.
- The default value for this parameter is 10 seconds.

Be aware, however, that the trade-off with the above options is, to some extent, increased overhead and frequency of access to the source distribution database. To minimize overhead and frequency of access to the database, increase the delay between the time an update is performed and the time it is extracted. Increasing the value of POLWAIT in the dbmover.cfg file reduces the frequency with which the source distribution database is accessed. In addition, increasing the value of Real-Time Flush Latency in the PowerCenter Application Connection can also reduce the frequency of access to the source.
Last updated: 27-May-08 12:39


PowerExchange Installation (for AS/400)

Challenge


Installing and configuring PowerExchange for AS/400 includes setting up the LISTENER, modifying configuration files and creating application connections for use in the sessions.

Description
Installing PowerExchange on AS/400 is a relatively straight-forward task that can be accomplished with the assistance of resources such as:
- AS/400 system programmer
- DB2 DBA

Be sure to adhere to the sequence of the following steps to successfully install PowerExchange:

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the AS/400.
3. Start the PowerExchange Listener on the AS/400.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the AS/400 from the workstation.
6. Install the PowerExchange client (Navigator) on the UNIX/NT server running the PowerCenter Integration Service.
7. Test connectivity to the AS/400 from the server.
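The connectivity tests in steps 5 and 7 typically use the DTLREXE ping utility, in the same way as described in the mainframe installation Best Practice; for example, assuming a node named as400node has been defined in the local dbmover.cfg:

DTLREXE PROG=PING LOC=as400node

A successful test returns the DTL-00755 'DTLREXE Command OK!' message.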

Install PowerExchange on the AS/400


Informatica recommends using the following naming conventions for PowerExchange:
- datalib - for the user-specified database library
- condlib - for the user-specified condensed files library
- dtllib - for the software library name
- dtluser - as the userid

The following example demonstrates use of the recommended naming conventions:


- PWX851P01D    PWX-PowerExchange, 851-Version, P01-Patch level, D-Datalib
- PWX851P01C    PWX-PowerExchange, 851-Version, P01-Patch level, C-Condlib
- PWX851P01S    PWX-PowerExchange, 851-Version, P01-Patch level, S-Sourcelib
- PWX851P01M    PWX-PowerExchange, 851-Version, P01-Patch level, M-Maplib
- PWX851P01X    PWX-PowerExchange, 851-Version, P01-Patch level, X-Extracts
- PWX851P01T    PWX-PowerExchange, 851-Version, P01-Patch level, T-Templib


- PWX851P01I    PWX-PowerExchange, 851-Version, P01-Patch level, I-ICUlib

Informatica recommends using PWXADMIN as the user id for the PowerExchange Administrator. The installation steps are described below.

Step 1: Create the PowerExchange Libraries

Create the software library using the following command:

CRTLIB LIB(PWX851P01S) CRTAUT(*CHANGE)

If the sourcing or targeting flat/sequential files are from or to the AS/400, you will need to create a data maps library. Use the following command to create the data maps library:

CRTLIB LIB(PWX851P01M) CRTAUT(*CHANGE)

Because later on in the installation process you must choose a different library name to store the datamaps, you will need to change the DMX_DIR= parameter from stdatamaps to PWX851P01M in the configuration file (datalib/cfg member DBMOVER). You may choose to run PowerExchange within an Independent Auxiliary Storage Pool (IASP). If you intend to use the IASP, use the following command:
CRTLIB LIB(PWX851P01S) CRTAUT(*CHANGE) ASP(*ASPDEV) ASPDEV(YOURASPDEV)

Step 2: Create Library SAVE File for Restore

CRTSAVF FILE(QGPL/PWX851P01T)

If you intend to run PowerExchange with multibyte support, you need to create a second save file using the following command:
CRTSAVF FILE(QGPL/PWX851P01I)

Step 3: FTP Binary File to AS/400


You should have a file (pwxas4.vnnn.exe, where nnn is the version/release/modification level) containing the appropriate PowerExchange software for the AS/400. This file is a self-extracting executable. For the current release of PowerExchange 8.5.1, the file is pwxas4_v851_01.exe.


- Select this file from the CD or the directory that the software was copied into, and double-click it.
- Copy the PWXAS4.V851 file to your temp library on the AS/400 (PWX851P01T) by entering the following command:

PUT PWXAS4.V851 QGPL/PWX851P01T

- Copy the PWXAS4.V851.ICU file to your temp ICU library on the AS/400 (PWX851P01I) by entering the following command:

PUT PWXAS4.V851.ICU QGPL/PWX851P01I

Step 4: Restore the Install Library


You must now restore the library. After it is decompressed, the library is shipped to dtllib. Use the following command:
RSTLIB SAVLIB(DTLV851) DEV(*SAVF) SAVF(QGPL/PWX851P01T) RSTLIB(PWX851P01S) MBROPT(*ALL) ALWOBJDIF(*ALL)

If you intend to run PowerExchange with multibyte support, you must restore the additional objects using the following command:
RSTOBJ OBJ(*ALL) SAVLIB(DTLV851) DEV(*SAVF) OBJTYPE(*ALL) SAVF(QGPL/PWX851P01I) MBROPT(*ALL) ALWOBJDIF(*ALL) RSTLIB(PWX851P01S)

If you intend to run PowerExchange within an Independent Auxiliary Storage Pool (IASP), you need to specify the IASP device so that the objects are restored into it. Use the RSTASPDEV(YOURASPDEV) parameter. The following examples show the restore commands with RSTASPDEV specified, including the additional objects for multibyte support.
RSTLIB SAVLIB(DTLVXYZ) DEV(*SAVF) SAVF(QGPL/LIBREST) MBROPT(*ALL) ALWOBJDIF(*NONE) RSTLIB(DTLLIB) RSTASPDEV(YOURASPDEV)

RSTOBJ OBJ(*ALL) SAVLIB(DTLVXYZ) DEV(*SAVF) OBJTYPE(*ALL) SAVF(QGPL/LIBRESTICU) MBROPT(*ALL) ALWOBJDIF(*ALL) RSTLIB(DTLLIB) RSTASPDEV(YOURASPDEV)

Step 5: Update License Key File


PowerExchange requires a license key to run successfully. It is held in the file dtllib/LICENSE(KEY), which must be in the same library as the dtllst program (the PowerExchange LISTENER). The license key is normally IP address specific. Update the single record member with the 44 byte key, with hyphens every 4 bytes.
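For illustration, a 44-byte key takes the same form as the example shown in the mainframe installation Best Practice, e.g., 1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1.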

Step 6: Create PowerExchange Environment


After you have installed the software, you will need to create a PowerExchange environment in which the software can run. This environment will consist of dtllib, datalib and (optionally) two additional libraries (for data capture processing), as follows:
- datalib - Base database files library
- condlib - Condensed files library

Additionally, the cpxlib (Capture Extract) library is required if the environment is to support change data capture processing. Files in these libraries are deleted during normal operation by PowerExchange; you should not, therefore, place your own files in them without first contacting Informatica Support. To create the PowerExchange environment, first add the software library to the top of the library list:
ADDLIBLE PWX851P01S POSITION(*FIRST)

Then use one of the following commands to create a new subsystem in which to run PowerExchange:
CRTPWXENV DESC(PWX_V851P01_Install) DATALIB(PWX851P01D) CONDLIB(PWX851P01C) ASPDEV(*NONE) CRTSYSOBJ(*YES) CPXLIB(PWX851P01X)

CRTPWXENV DESC('User Description') DATALIB(datalib) CONDLIB(*NONE) ASPDEV(*NONE) CRTSYSOBJ(*YES)

Note: If you restored dtllib into an IASP, you must specify that device name in the CRTPWXENV command. For example:
CRTPWXENV DESC('User Description') DATALIB(DATALIB) CONDLIB(*NONE) CRTSYSOBJ(*YES) ASPDEV(YOURASPDEV)

Step 7: Update Configuration File


One of the PowerExchange configuration files is datalib/CFG(DBMOVER); it holds many defaults and the information that PowerExchange uses to communicate with other platforms. You may not need to customize this file at this stage of the installation process; for additional information on its contents, refer to the Configuration File Parameters section of the PowerExchange Reference Manual. An example of the DBMOVER file is shown below:

/********************************************************************/
/* PowerExchange Configuration File
/********************************************************************/
LISTENER=(node1,TCPIP,2480)
NODE=(local,TCPIP,127.0.0.1,2480)
NODE=(node1,TCPIP,127.0.0.1,2480)
NODE=(default,TCPIP,x,2480)
APPBUFSIZE=256000
COLON=:
COMPRESS=Y
CONSOLE_TRACE=Y
DECPOINT=.
DEFAULTCHAR=*
DEFAULTDATE=19800101
DMX_DIR=PWX851P01M
MAXTASKS=25
MSGPREFIX=PWX
NEGSIGN=-
NOGETHOSTBYNAME=N
PIPE=|
POLLTIME=1000
SECURITY=(0,N)
TIMEOUTS=(600,600,600)
/* sample trace TRACE=(TCPIP,1,99)
/* Enable to extract BIT data as CHAR: DB2_BIN_AS_CHAR=Y
/* uncomment and modify the CAPI_CONNECTION lines to activate changed data
/* propagation
CAPI_CONNECTION=(NAME=DTECAPU,
    TYPE=(UOWC,CAPINAME=DTLJPAS4))
CAPI_CONNECTION=(NAME=DTLJPAS4,
    TYPE=(AS4J,JOURNAL=REPORTSDB2/QSQJRN,INST=FOCUST1,EOF=N,
    STOPIT=(CONT=5),LIBASUSER=N,AS4JRNEXIT=N))
CPX_DIR=PWX851P01X

Step 8: Change Object Ownership


After all of the components in the shipped library have been restored, they are owned by the userid dtluser (the userid used in Informatica's internal systems). Use the following commands to change the ownership:

CALL PGM(PWX851P01S/CHGALLOBJ) PARM(PWX851P01S PWXADMIN)
CALL PGM(PWX851P01S/CHGALLOBJ) PARM(PWX851P01D PWXADMIN)

If Change Capture is installed, also run:

CALL PGM(PWX851P01S/CHGALLOBJ) PARM(PWX851P01C PWXADMIN)
CALL PGM(PWX851P01S/CHGALLOBJ) PARM(PWX851P01X PWXADMIN)

Step 9: Authorize PowerExchange Userid Security Setting


Prior to running jobs, you will need to grant the PowerExchange administrator userid (PWXADMIN in these examples) *EXECUTE authority to the following objects:

- QSYGETPH
- QSYRLSPH
- QWTSETP
- QCLRPGMI

Use the following commands to grant the authority:


GRTOBJAUT OBJ(QSYGETPH) OBJTYPE(*PGM) AUT(*EXECUTE) USER(PWXADMIN)
GRTOBJAUT OBJ(QSYRLSPH) OBJTYPE(*PGM) AUT(*EXECUTE) USER(PWXADMIN)
GRTOBJAUT OBJ(QWTSETP) OBJTYPE(*PGM) AUT(*EXECUTE) USER(PWXADMIN)
GRTOBJAUT OBJ(QCLRPGMI) OBJTYPE(*PGM) AUT(*EXECUTE *READ) USER(PWXADMIN)

Step 10: Start PowerExchange Listener


The standard command form to start the Listener is as follows:

SBMJOB CMD(CALL PGM(dtllib/DTLLST) PARM(NODE1)) JOB(MYJOB) JOBD(datalib/DTLLIST) JOBQ(*JOBD) PRTDEV(*JOBD) OUTQ(*JOBD) CURLIB(*CRTDFT) INLLIBL(*JOBD) INLASPGRP(*JOBD)

Step 11: Stopping the PowerExchange LISTENER


The standard command form to stop the LISTENER is as follows:
SNDLSTCMD LSTMSGQLIB(PWX851P01D) LSTCMD(CLOSE)

Once the LISTENER start/stop test is complete, installation on the AS/400 is finished. The LISTENER can then be started and the PowerCenter application connections can be configured.

PowerCenter Real-Time Application Connections


- Click the PWX DB2400 CDC Real Time connection as shown below and fill in the parameter details.
- The user name and password can be anything if security in DBMOVER is set to 0; otherwise, they must be populated with a proper AS/400 user id and password.


Mention the restart token folder name and file name as shown below:


Set the Number of Runs to Keep parameter to the number of versions that you want to keep in your restart token file. If the workflow needs to run continuously in real-time mode, set the idle time to -1. If the real-time session needs to run from the time it is triggered until the end of the available changes, set the idle time to 0.


Leave the Journal name blank if your tables reside on the default journal that is specified in the DBMOVER file. Alternatively, the journal name can be overridden by specifying the journal library and file. The first figure below shows an instance where the connection uses the default journal; the second figure below shows the journal override.


The Session settings for the real time session look like the following:


Last updated: 30-May-08 12:46


PowerExchange Installation (for Mainframe)

Challenge


Installing and configuring a PowerExchange listener on a mainframe, ensuring that the process is both efficient and effective.

Description
PowerExchange installation is very straight-forward and can generally be accomplished in a timely fashion. When considering a PowerExchange installation, be sure that the appropriate resources are available. These include, but are not limited to:
- MVS systems operator
- Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
- MVS security resources

Be sure to adhere to the sequence of the following steps to successfully install PowerExchange. Note that in this very typical scenario, the mainframe source data is going to be pulled across to a server box.

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install Navigator on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.

Complete the PowerExchange Pre-install Checklist and Obtain Valid License Keys
Reviewing the environment and recording the information in a detailed checklist facilitates the PowerExchange install. The checklist (which is a prerequisite) is installed in the Documentation Folder when the PowerExchange software is installed. It is also available within the client from the PowerExchange Program Group. Be sure to complete all relevant sections.

You will need a valid license key in order to run any of the PowerExchange components. This is a 44 or 64-byte key that uses hyphens every 4 bytes. For example:

1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1

The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F). Keys are valid for a specific time period and are also linked to an exact or generic TCP/IP address. They also control access to certain databases. You cannot successfully install PowerExchange without a valid key for all required components.

Note: When copying software from one machine to another, you may encounter license key problems since the license key is IP specific. Be prepared to deal with this eventuality, especially if you are going to a backup site for disaster recovery testing. In the case of such an event, Informatica Product Shipping or Support can generate a temporary key very quickly.

Install PowerExchange on the Mainframe


Step 1: Create a folder c:\PWX on the workstation. Copy the file with a naming convention similar to PWXOS26.Vxxx.EXE from the PowerExchange CD, or from the extract of the downloaded zip file, to this directory. Double-click the file to unzip its contents into this directory.

Step 2: Create the PDS HLQ.PWXVxxx.RUNLIB and HLQ.PWXVxxx.BINLIB with fixed-block, record-length 80 attributes on the mainframe in order to pre-allocate the needed libraries. Ensure sufficient space for the required jobs/tasks by setting the cylinders to 150 and the directory blocks to 50.

Step 3: Run the MVS_Install file. This displays the MVS Install Assistant. Configure the IP Address, Logon ID, Password, HLQ, and Default volume setting on the display screen. Also, enter the license key. Click the Custom buttons to configure the desired data sources. Be sure that the HLQ on this screen matches the HLQ of the allocated RUNLIB (from Step 2). Save these settings and click Process. This creates the JCL libraries and opens the following screen to FTP these libraries to MVS. Click XMIT to complete the FTP process.

Note: A new installer GUI was added as of PowerExchange 8.5. Simply follow the installation screens in the GUI for this step.

Step 4: Edit the JOBCARD in RUNLIB and configure it as per the environment (e.g., execution class, message class, etc.).

Step 5: Edit the SETUPBLK member in RUNLIB. Copy in the JOBCARD and SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with return code 0 (success) or 1, and a list of the needed installation jobs can be found in the XJOBS member.

Start The PowerExchange Jobs/Tasks on the Mainframe


The installed PowerExchange Listener can be run as a normal batch job or as a started task. Informatica recommends that it initially be submitted as a batch job: RUNLIB(STARTLST). If it will be run as a started task, then copy the PSTRTLST member in RUNLIB to the started task proclib. It should return:

DTL-00607 Listener VRM x.x.x Build Vxxx_P0x started.

If implementing change capture, start the PowerExchange Agent (as a started task):

/S DTLA

It should return:

DTLEDMI1722561: EDM Agent DTLA has completed initialization.

Note: The load libraries must be APF authorized prior to starting the Agent.

Install The PowerExchange Client (Navigator) on a Workstation


Step 1: Run the Windows or UNIX installation file in the software folder on the installation CD and follow the prompts.

Step 2: Enter the license key.

Step 3: Follow the wizard to complete the install and reboot the machine.


Step 4: Add a node entry to the configuration file \Program Files\Informatica\Informatica Power Exchange\dbmover.cfg to point to the Listener on the mainframe:

node = (mainframe location name, TCPIP, mainframe IP address, 2480)

Test Connectivity to the Mainframe from the Workstation


Ensure communication to the PowerExchange Listener on the mainframe by entering the following in DOS on the workstation (using the mainframe location or node name defined in dbmover.cfg):

DTLREXE PROG=PING LOC=mainframe location

It should return:

DTL-00755 DTLREXE Command OK!

Install PowerExchange on the UNIX Server


Step 1: Create a user for the PowerExchange installation on the UNIX box.

Step 2: Create a UNIX directory /opt/inform/pwxvxxxp0x.

Step 3: FTP the file \software\Unix\dtlxxx_vxxx.tar on the installation CD to the pwx installation directory on UNIX.

Step 4: Use the UNIX tar command to extract the files. The command is tar xvf pwxxxx_vxxx.tar.

Step 5: Update the logon profile with the correct path, library path, and home environment variables (a sketch follows the odbc.ini example below).

Step 6: Update the license key file on the server.

Step 7: Update the configuration file on the server (dbmover.cfg) by adding a node entry to point to the Listener on the mainframe.

Step 8: If using an ETL tool in conjunction with PowerExchange via ODBC, update the odbc.ini file on the server by adding data source entries that point to PowerExchange-accessed data:


[pwx_mvs_db2]
DRIVER=<install dir>/libdtlodbc.so
DESCRIPTION=MVS DB2
DBTYPE=db2
LOCATION=mvs1
DBQUAL1=DB2T
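A minimal sketch of the profile updates in Step 5 and the node entry in Step 7 follows; the environment variable names and node name are assumptions (they vary by UNIX platform and PowerExchange release), so verify them against the PowerExchange installation guide:

# .profile additions (illustrative names and paths)
PATH=$PATH:/opt/inform/pwxvxxxp0x
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/inform/pwxvxxxp0x
PWX_HOME=/opt/inform/pwxvxxxp0x
export PATH LD_LIBRARY_PATH PWX_HOME

# dbmover.cfg node entry pointing to the mainframe Listener (matches the mvs1 location above)
NODE=(mvs1,TCPIP,mainframe IP address,2480)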

Test Connectivity to the Mainframe from the Server


Ensure communication to the PowerExchange Listener on the mainframe by entering the following on the UNIX server:

DTLREXE PROG=PING LOC=mainframe location

It should return:

DTL-00755 DTLREXE Command OK!

Changed Data Capture


There is a separate manual for each type of change data capture option; that manual contains the specifics behind the following general steps. You will need to understand the appropriate options guide to ensure success.

Step 1: APF authorize the .LOAD and the .LOADLIB libraries. This is required for external security. (A sample authorization command follows the step list below.)

Step 2: Copy the Agent from the PowerExchange PROCLIB to the system site PROCLIB.

Step 3: After the Agent has been started, run job SETUP2.

Step 4: Create an active registration in Navigator for a table/segment/record that is set up for changes.


Step 5: Start the ECCR.

Step 6: Issue a change to the table/segment/record that you registered in Navigator.

Step 7: Perform an extraction map row test in Navigator.
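As referenced in Step 1 above, one common way to APF-authorize a load library dynamically on z/OS is the SETPROG operator command; the data set name and volume below are placeholders, and your site standards may instead require a PROGxx PARMLIB update:

SETPROG APF,ADD,DSNAME=HLQ.PWXVxxx.LOADLIB,VOLUME=volser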

Last updated: 10-Jun-08 15:40


Assessing the Business Case

Challenge


Assessing the business case for a project must consider both the tangible and intangible potential benefits. The assessment should also validate the benefits and ensure that they appear realistic to the Project Sponsor and Key Stakeholders in order to secure project funding.

Description
A Business Case should include both qualitative and quantitative measures of potential benefits.

The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the Statement of Project Goals and Objectives (both generated in Subtask 1.1.1 Establish Business Project Scope) and focuses on discussions with the project beneficiaries regarding the expected benefits in terms of problem alleviation, cost savings or controls, and increased efficiencies and opportunities. Many qualitative items are intangible, but you may be able to cite examples of the potential costs or risks if the system is not implemented. An example may be the cost of bad data quality resulting in the loss of a key customer, or an invalid analysis resulting in bad business decisions. Risk factors may be classified as business, technical, or execution in nature. Examples of these risks are uncertainty of value or the unreliability of collected information, new technology employed, or a major change in business thinking for the personnel executing the change. It is important to identify an estimated value added or cost eliminated to strengthen the business case; the better the definition of these factors, the better the value of the business case.

The Quantitative Assessment portion of the Business Case provides specific measurable details of the proposed project, such as the estimated ROI. This may involve the following calculations:
- Cash flow analysis - Projects positive and negative cash flows for the anticipated life of the project. Typically, ROI measurements use the cash flow formula to depict results.


- Net present value - Evaluates cash flow according to the long-term value of current investment. Net present value shows how much capital needs to be invested currently, at an assumed interest rate, in order to create a stream of payments over time. For instance, to generate an income stream of $500 per month over six months at an interest rate of eight percent would require an investment (i.e., a net present value) of $2,311.44.
- Return on investment - Calculates the net present value of total incremental cost savings and revenue divided by the net present value of total costs, multiplied by 100. This type of ROI calculation is frequently referred to as return-on-equity or return-on-capital.
- Payback period - Determines how much time must pass before an initial capital investment is recovered.
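To make the $2,311.44 figure concrete: the present value of an n-period payment stream is PV = P x [1 - (1 + r)^-n] / r. Treating the eight percent as the per-period rate (an assumption made here so that the numbers match the example), PV = 500 x [1 - (1.08)^-6] / 0.08 = 500 x 4.6229, which is approximately $2,311.44.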

The following are the steps to calculate the quantitative business case or ROI:

Step 1 Develop Enterprise Deployment Map. This is a model of the project phases over a timeline, estimating as specifically as possible participants, requirements, and systems involved. A data integration or migration initiative or amendment may require estimating customer participation (e.g., by department and location), subject area and type of information/analysis, numbers of users, numbers and complexity of target data systems (data marts or operational databases, for example) and data sources, types of sources, and size of data set. A data migration project may require customer participation, legacy system migrations, and retirement procedures. The types of estimations vary by project types and goals. It is important to note that the more details you have for estimations, the more precise your phased solutions are likely to be. The scope of the project should also be made known in the deployment map.

Step 2 Analyze Potential Benefits. Discussions with representative managers and users or the Project Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for presenting this analysis is often a "before" and "after" format that compares the current situation to the project expectations. Include in this step costs that can be avoided by the deployment of this project.

Step 3 Calculate Net Present Value for all Benefits. Information gathered in this step should help the customer representatives to understand how the expected benefits are going to be allocated throughout the organization over time, using the enterprise deployment map as a guide.

Step 4 Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of the project. Cost estimates should address the following fundamental cost components:


- Hardware
- Networks
- RDBMS software
- Back-end tools
- Query/reporting tools
- Internal labor
- External labor
- Ongoing support
- Training

Step 5 Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial ROI snapshots until costs can be more clearly predicted.

Step 6 Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:
- Scope creep, which can be mitigated by thorough planning and tight project scope.
- Integration complexity, which may be reduced by standardizing on vendors with integrated product sets or open architectures.
- Architectural strategy that is inappropriate.
- Current support infrastructure may not meet the needs of the project.
- Conflicting priorities may impact resource availability.
- Other miscellaneous risks from management or end users who may withhold project support; from the entanglements of internal politics; and from technologies that don't function as promised.
- Unexpected data quality, complexity, or definition issues often are discovered late in the course of the project and can adversely affect effort, cost, and schedule. This can be somewhat mitigated by early source analysis.

Step 7 Determine Overall ROI. When all other portions of the business case are complete, calculate the project's "bottom line". Determining the overall ROI is simply a matter of subtracting the net present value of total costs from the net present value of total incremental revenue plus cost savings.
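As a purely illustrative example (the figures are hypothetical): if the net present value of benefits (cost savings plus incremental revenue) is $1,200,000 and the net present value of total costs is $800,000, the overall net benefit is $400,000, and the ROI calculated as described above is (1,200,000 / 800,000) x 100 = 150 percent.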

Final Deliverable
The final deliverable of this phase of development is a complete business case that documents both tangible (quantified) and intangible (non-quantified, but estimated) benefits and risks, to be presented to the Project Sponsor and Key Stakeholders. This allows them to review the Business Case in order to justify the development effort. If your organization has a Project Office that provides governance for projects and priorities, much of this material is often part of the original Project Charter, which states items such as scope, initial high-level requirements, and key project stakeholders. However, developing a full Business Case can validate any initial analysis and provide additional justification. Additionally, the Project Office should provide guidance in building and communicating the Business Case. Once it is completed, the Project Manager is responsible for scheduling the review and socialization of the Business Case.

Last updated: 01-Feb-07 18:54


Defining and Prioritizing Requirements

Challenge


Defining and prioritizing business and functional requirements is often accomplished through a combination of interviews and facilitated meetings (i.e., workshops) between the Project Sponsor and beneficiaries and the Project Manager and Business Analyst. Requirements need to be gathered from business users who currently use and/or have the potential to use the information being assessed. All input is important since the assessment should encompass an enterprise view of the data rather than a limited functional, departmental, or line-of-business view. Types of specific detailed data requirements gathered include:
- Data names to be assessed
- Data definitions
- Data formats and physical attributes
- Required business rules, including allowed values
- Data usage
- Expected quality levels

By gathering and documenting some of the key detailed data requirements, a solid understanding of the business rules involved is reached. Certainly, not all elements can be analyzed in detail, but doing this helps in getting to the heart of the business system so that you are better prepared when speaking with business and technical users.

Description
The following steps are key for successfully defining and prioritizing requirements:

Step 1: Discovery
Gathering business requirements is one of the most important stages of any data integration project. Business requirements affect virtually every aspect of the data integration project, from Project Planning and Management to End-User Application Specification. They are like a hub that sits in the middle and touches the various stages (spokes) of the data integration project. There are two basic techniques for gathering requirements and investigating the underlying operational data: interviews and facilitated sessions.

Data Profiling
Informatica Data Explorer (IDE) is an automated data profiling and analysis software product that can be extremely beneficial in defining and prioritizing requirements. It provides a detailed description of data content, structure, rules, and quality by profiling the actual data that is loaded into the product. Some industry examples of why data profiling is crucial prior to beginning the development process are:
- Cost of poor data quality is 15 to 25 percent of operating profit.
- Poor data management is costing global business $1.4 billion a year.
- 37 percent of projects are cancelled; 50 percent are completed but with 20 percent overruns, leaving only 13 percent completed on time and within budget.
- Using a data profiling tool can lower the risk and the cost of the project and increase the chances of success.
- Data profiling reports can be posted to a central location where all team members can review results and track accuracy. IDE provides the ability to promote collaboration through tags, notes, action items, transformations, and rules.

By profiling the information, the framework is set for an effective interview process with business and technical users.

Interviews
By conducting interview research before starting the requirements gathering process, interviewees can be categorized into functional business management and Information Technology (IT) management. This, in conjunction with effective data profiling, helps to establish a comprehensive set of business requirements.


Business Interviewees. Depending on the needs of the project, even though you may be focused on a single primary business area, it is always beneficial to interview horizontally to achieve a good cross-functional perspective of the enterprise. This also provides insight into how extensible your project is across the enterprise. Before you interview, be sure to develop an interview questionnaire based upon profiling results, as well as business questions; schedule the interview time and place; and prepare the interviewees by sending a sample agenda. When interviewing business people, it is always important to start with the upper echelons of management so as to understand the overall vision, assuming you have the business background, confidence and credibility to converse at those levels. If not adequately prepared, the safer approach is to interview middle management. If you are interviewing across multiple teams, you might want to scramble interviews among teams. This way if you hear different perspectives from finance and marketing, you can resolve the discrepancies with a scrambled interview schedule. A note to keep in mind is that business is sponsoring the data integration project and is going to be the end-users of the application. They will decide the success criteria of your data integration project and determine future sponsorship. Questioning during these sessions should include the following:
- Who are the stakeholders for this milestone delivery (IT, field business analysts, executive management)?
- What are the target business functions, roles, and responsibilities?
- What are the key relevant business strategies, decisions, and processes (in brief)?
- What information is important to drive, support, and measure success for those strategies/processes? What key metrics? What dimensions for those metrics?
- What current reporting and analysis is applicable? Who provides it? How is it presented? How is it used? How can it be improved?

IT Interviewees. The IT interviewees have a different flavor than the business user community. Interviewing the IT team is generally very beneficial because it is composed of data gurus who deal with the data on a daily basis. They can provide great insight into data quality issues, help in systematically exploring the legacy source systems, and help in understanding business user needs around critical reports. If you are developing a prototype, they can help get things done quickly and address important business reports. Questioning during these sessions should include the following:
- Request an overview of existing legacy source systems. How does data currently flow from these systems to the users?
- What day-to-day maintenance issues does the operations team encounter with these systems?
- Ask for their insight into data quality issues.
- What business users do they support? What reports are generated on a daily, weekly, or monthly basis? What are the current service level agreements for these reports?
- How can the DI project support the IS department needs?
- Review data profiling reports and analyze the anomalies in the data. Note and record each of the comments from the more detailed analysis. What are the key business rules involved in each item?

Facilitated Sessions
Facilitated sessions - sometimes known as JAD (Joint Application Development) or RAD (Rapid Application Development) sessions - are ways to work with a group of business and technical users to capture the requirements. They can be very valuable in gathering comprehensive requirements and building the project team. The difficulty is the amount of preparation and planning required to make the session a pleasant and worthwhile experience. Facilitated sessions provide quick feedback by gathering all the people from the various teams into a meeting and initiating the requirements process. You need a facilitator who is experienced in these meetings to ensure that all the participants get a chance to speak and provide feedback. During individual (or small group) interviews with high-level management, there is often a focus and clarity of vision that may be hindered in large meetings. Thus, it is extremely important to encourage all attendees to participate and to prevent a small number of participants from dominating the requirements process. A challenge of facilitated sessions is matching everyone's busy schedules and actually getting them into a meeting room. However, this part of the process must be focused and brief or it can become unwieldy, with too much time expended just trying to coordinate calendars among worthy forum participants. Set a time period and target list of participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available. Questions asked during facilitated sessions are similar to the questions asked of the business and IS interviewees.

Step 2: Validation and Prioritization


The Business Analyst, with the help of the Project Architect, documents the findings of the discovery process after interviewing the business and IT management. The next step is to define the business requirements specification. The resulting Business Requirements Specification includes a matrix linking the specific business requirements to their functional requirements. Defining the business requirements is a time-consuming process and should be facilitated by forming a working group. A working group usually consists of business users, business analysts, the project manager, and other individuals who can help to define the business requirements. The working group should meet weekly to define and finalize the business requirements. The working group helps to:
- Design the current state and future state
- Identify supply format and transport mechanism
- Identify required message types
- Develop Service Level Agreement(s), including timings
- Identify supply management and control requirements
- Identify common verifications, validations, business validations and transformation rules
- Identify common reference data requirements
- Identify common exceptions
- Produce the physical message specification

At this time also, the Architect develops the Information Requirements Specification to clearly represent the structure of the information requirements. This document, based on the business requirements findings, can facilitate discussion of informational details and provide the starting point for the target model definition. The detailed business requirements and information requirements should be reviewed with the project beneficiaries and prioritized based on business need and the stated project objectives and scope.

Step 3: The Incremental Roadmap


Concurrent with the validation of the business requirements, the Architect begins the Functional Requirements Specification, providing details on the technical requirements for the project.

As general technical feasibility is compared to the prioritization from Step 2, the Project Manager, Business Analyst, and Architect develop consensus on a project "phasing" approach. Items of secondary priority and those with poor near-term feasibility are relegated to subsequent phases of the project. Thus, they develop a phased, or incremental, "roadmap" for the project (Project Roadmap).

Final Deliverable
The final deliverable of this phase of development is a complete list of business requirements, a diagram of the current and future states, and a list of the high-level business rules affected by the requirements that will effect the change from current to future. This provides the development team with much of the information needed to begin the design effort for the system modifications. Once completed, the Project Manager is responsible for scheduling the review and socialization of the requirements and plan in order to achieve sign-off on the deliverable. This is presented to the Project Sponsor for approval and becomes the first "increment" or starting point for the Project Plan.

Last updated: 01-Feb-07 18:54


Developing a Work Breakdown Structure (WBS)

Challenge


Developing a comprehensive work breakdown structure (WBS) is crucial for capturing all the tasks required for a data integration project. Many times, items such as full analysis, testing, or even specification development, can create a sense of false optimism for the project. The WBS clearly depicts all of the various tasks and subtasks required to complete a project. Most project time and resource estimates are supported by the WBS. A thorough, accurate WBS is critical for effective monitoring and also facilitates communication with project sponsors and key stakeholders.

Description
The WBS is a deliverable-oriented hierarchical tree that allows large tasks to be visualized as a group of related smaller, more manageable subtasks. These tasks and subtasks can then be assigned to various resources, which helps to identify accountability and is invaluable for tracking progress. The WBS serves as a starting point as well as a monitoring tool for the project.

One challenge in developing a thorough WBS is obtaining the correct balance between sufficient detail and too much detail. The WBS shouldn't include every minor detail in the project, but it does need to break the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a day. It is also important to maintain a consistent level of detail across the project. A well-designed WBS can be extracted at a higher level to communicate overall project progress, as shown in the following sample. The actual WBS used by the project manager may, for example, be a level of detail deeper than the overall project WBS to ensure that all steps are completed, but the communication can roll up a level or two to make things more clear.

Plan                                                % Complete   Budget Hours   Actual Hours
Architecture - Set up of Informatica Environment        82%           167            137
    Develop analytic solution architecture              46%            28             13
    Design development architecture                     59%            32             19
    Customize and implement Iterative Framework        100%            32             32
    Data Profiling Legacy Stage                        150%            10             15
    Pre-Load Stage                                     150%            10             15
    Reference Data                                     128%            18             23
    Reusable Objects                                    56%            27             15
    Review and signoff of Architecture                  50%            10              5
Analysis - Target-to-Source Data Mapping                48%          1000            479
    Customer (9 tables)                                 87%           135            117
    Product (7 tables)                                  98%           215            210
    Inventory (3 tables)                                 0%            60              0
    Shipping (3 tables)                                  0%            60              0
    Invoicing (7 tables)                                 0%           140              0
    Orders (13 tables)                                  37%           380            140
    Review and signoff of Functional Specification       0%            10              0
Total Architecture and Analysis                         52%          1167            602

A fundamental question is whether to include activities as part of a WBS. The following statements are generally true for most projects, most of the time, and therefore are appropriate as the basis for resolving this question.

- The project manager should have the right to decompose the WBS to whatever level of detail he or she requires to effectively plan and manage the project. The WBS is a project management tool that can be used in different ways, depending upon the needs of the project manager. The lowest level of the WBS can be activities.
- The hierarchical structure should be organized by deliverables and milestones with process steps detailed within it. The WBS can be structured from a process or life cycle basis (i.e., the accepted concept of Phases), with non-deliverables detailed within it.
- At the lowest level in the WBS, an individual should be identified and held accountable for the result. This person should be an individual contributor, creating the deliverable personally, or a manager who will in turn create a set of tasks to plan and manage the results.
- The WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. In part, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks.
  - For example, consider multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). Subtasks 4.3.1 through 4.3.4 may have sequential requirements that force them to be completed in order, while subtasks 4.3.5 through 4.3.7 can - and should - be completed in parallel if they do not have sequential requirements.
  - It is important to remember that a task is not complete until all of its corresponding subtasks are completed - whether sequentially or in parallel. For example, the Build Phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the Deploy Phase long before the Build Phase is complete.

The Project Plan provides a starting point for further development of the project WBS. This sample is a Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the Informatica methodology. The Project Manager can use this WBS as a starting point, but should review it to ensure that it corresponds to the specific development effort, removing any steps that aren't relevant or adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the development effort. If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is also available. The phases, tasks, and subtasks can be exported from Excel into many other project management tools, simplifying the effort of developing the WBS.

Sometimes it is best to build an initial task list and timeline using a facilitator working with the project team. The project manager can act as a facilitator or can appoint one, freeing up the project manager and enabling team members to focus on determining the actual tasks and effort needed. Depending on the size and scope of the project, sub-projects may be beneficial, with multiple project teams creating their own project plans. The overall project manager then brings the plans together into a master project plan. This group of projects can be defined as a program, and the project manager and project architect manage the interaction among the various development teams.

Caution: Do not expect plans to be set in stone. Plans inevitably change as the project progresses; new information becomes available; scope, resources and priorities change; deliverables are (or are not) completed on time, etc. The process of estimating and modifying the plan should be repeated many times throughout the project. Even initial planning is likely to take several iterations to gather enough information. Significant changes to the project plan become the basis for communicating with the project sponsor(s) and/or key stakeholders with regard to decisions to be made and priorities to be rearranged. The goal of the project manager is to be non-biased toward any decision, but to place the responsibility with the sponsor to shape direction.

Approaches to Building WBS Structures: Waterfall vs. Iterative


Data integration projects differ somewhat from other types of development projects, although they also share some key attributes. The following list summarizes some unique aspects of data integration projects:
- Business requirements are less tangible and predictable than in OLTP (online transactional processing) projects. Database queries are very data intensive, involving few or many tables, but with many, many rows. In OLTP, transactions are data selective, involving few or many tables and comparatively few rows.
- Metadata is important, but in OLTP the meaning of fields is predetermined on a screen or report. In a data integration project (e.g., warehouse or common data management, etc.), metadata and traceability are much more critical.
- Data integration projects, like all development projects, must be managed. To manage them, they must follow a clear plan. Data integration project managers often have a more difficult job than those managing OLTP projects because there are so many pieces and sources to manage.

Two purposes of the WBS are to manage work and ensure success. Although this is the same as any project, data integration projects are unlike typical waterfall projects in that they are based on an iterative approach. Three of the main principles of iteration are as follows:

- Iteration. Division of work into small chunks of effort, using lessons learned from earlier iterations.
- Time boxing. Delivery of capability in short intervals, with the first release typically requiring from three to nine months (depending on complexity) and quarterly releases thereafter.
- Prototyping. Early delivery of a prototype, with a working database delivered approximately one-third of the way through.

Incidentally, most iteration projects follow an essentially waterfall process within a given increment. The danger is that projects can iterate or spiral out of control.

The three principles listed above are very important because even the best data integration plans are likely to invite failure if these principles are ignored. An example of a failure waiting to happen, even with a fully detailed plan, is a large common data management project that gathers all requirements upfront and delivers the application all-at-once after three years. It is not the "large" that is the problem, but the "all requirements upfront" and the "all-at-once in three years." Even enterprise data warehouses are delivered piece-by-piece using these three (and other) principles. The feedback you can gather from increment to increment is critical to the success of the future increments. The benefit is that such incremental deliveries establish patterns for development that can be used and leveraged for future deliveries.

What is the Correct Development Approach?


The correct development approach is usually dictated by corporate standards and by departments such as the Project Management Office (PMO). Regardless of the development approach chosen, high-level phases typically include planning the project; gathering data requirements; developing data models; designing and developing the physical database(s); sourcing, profiling, and mapping the data; and extracting, transforming, and loading the data. Lower-level planning details are typically carried out by the project manager and project team leads.

Preparing the WBS



The WBS can be prepared using manual or automated techniques, or a combination of the two. In many cases, a manual technique is used to identify and record the high-level phases and tasks, and the information is then transferred to project tracking software such as Microsoft Project.

Project team members typically begin by identifying the high-level phases and tasks, writing the relevant information on large sticky notes or index cards, and then mounting the notes or cards on a wall or white board. Use one sticky note or card per phase or task so that they can easily be rearranged as the project order evolves. As the project plan progresses, you can add information to the cards or notes to flesh out the details, such as task owner, time estimates, and dependencies. This information can then be fed into the project tracking software.

Once you have a fairly detailed methodology, you can enter the phase and task information into your project tracking software. When the project team is assembled, you can enter additional tasks and details directly into the software. Be aware, however, that the project team can better understand a project and its various components if they actually participate in the high-level development activities, as they do in the manual approach. Using software alone, without input from relevant project team members, to designate phases, tasks, dependencies, and time lines can be difficult and prone to errors and omissions. Benefits of developing the project timeline manually, with input from team members, include:
- Tasks, effort, and dependencies are visible to all team members.
- The team has a greater understanding of and commitment to the project.
- Team members have an opportunity to work with each other and set the foundation. This is particularly important if the team is geographically dispersed and cannot work face-to-face throughout much of the project.

How Much Descriptive Information is Needed?


The project plan should incorporate a thorough description of the project and its goals. Be sure to review the business objectives, constraints, and high-level phases, but keep the description as short and simple as possible. In many cases, a verb-noun form works well (e.g., interview users, document requirements).

After you have described the project at a high level, identify the tasks needed to complete each phase. It is often helpful to use the notes section in the tracking software (e.g., Microsoft Project) to provide narrative for each task or subtask. In general, decompose the tasks until they have a rough duration of two to 20 days, and break down the tasks only to the level of detail that you are willing to track. Include key checkpoints or milestones as tasks to be completed. Again, a noun-verb form works well for milestones (e.g., requirements completed, data model completed).

Assigning and Delegating Responsibility


Identify a single owner for each task in the project plan. Although other resources may help to complete the task, the individual designated as the owner is ultimately responsible for ensuring that the task and any associated deliverables are completed on time.

After the WBS is loaded into the selected project tracking software and refined for the specific project requirements, the Project Manager can begin to estimate the level of effort involved in completing each of the steps. When the estimate is complete, the project manager can assign individual resources and prepare a project schedule. The end result is the Project Plan. Refer to Developing and Maintaining the Project Plan for further information about the project plan. Use your project plan to track progress. Be sure to review and modify estimates and keep the project plan updated throughout the project.

Last updated: 09-Feb-07 16:29


Developing and Maintaining the Project Plan

Challenge


The challenge of developing and maintaining a project plan is to incorporate all of the necessary components while retaining the flexibility necessary to accommodate change. A two-fold approach is required to meet the challenge:

1. A project that is clear in scope contains the following elements:

- A designated begin and end date.
- Well-defined business and technical requirements.
- Adequate assigned resources.

Without these components, the project is subject to slippage and to incorrect expectations being set with the Project Sponsor.

2. Project plans are subject to revision and change throughout the project. It is imperative to establish a communication plan with the Project Sponsor; such communication may involve a weekly status report of accomplishments and/or a report on issues and plans for the following week. This type of forum is very helpful in involving the Project Sponsor in actively making decisions about changes in scope or timeframes.

If your organization has the concept of a Project Office that provides governance for the project and priorities, look for a Project Charter that contains items such as scope, initial high-level requirements, and key project stakeholders. Additionally, the Project Office should provide guidance in funding and resource allocation for key projects.

Informatica's PowerCenter and Data Quality are not exempt from this project planning process. However, the purpose here is to provide some key elements that can be used to develop and maintain a data integration, data migration, or data quality project.

Description
Use the following steps as a guide for developing the initial project plan:

1. Define major milestones based on the project scope. (Be sure to list all key items such as analysis, design, development, and testing.)
2. Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point or for recommending tasks for inclusion.
3. Continue the detail breakdown, if possible, to a level at which logical chunks of work can be completed and assigned to resources for accountability purposes. This level provides satisfactory detail to facilitate estimation, assignment of resources, and tracking of progress. If the detail tasks are too broad in scope, such as requiring multiple resources, estimates are much less likely to be accurate and resource accountability becomes difficult to maintain.
4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if applicable). This helps to build commitment for the project plan.
5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must start or complete concurrently with another).
6. Define the resources based on the role definitions and estimated number of resources needed for each role.
7. Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.
8. Ensure that the project plan follows your organization's system development methodology.

Note: Informatica Professional Services has found success in projects that blend the waterfall method with the iterative method. The waterfall method works well in the early stages of a project, such as analysis and initial design. Iterative methods work well in accelerating development and testing, where feedback from extensive testing validates the design of the system.

At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities. Set the constraint type to As Soon As Possible and avoid setting a constraint date. Use the Effort-Driven approach so that the Project Plan can be easily modified as adjustments are made.

By setting the initial definition of tasks and efforts, the resulting schedule should provide a realistic picture of the project, unfettered by concerns about ideal user-requested completion dates. In other words, be as realistic as possible in your initial estimations, even if the resulting schedule is likely to miss Project Sponsor expectations. This helps to establish good communications with your Project Sponsor so you can begin to negotiate scope and resources in good faith.

This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for opportunities for parallel activities, perhaps adding resources if necessary, to improve the schedule. When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions, dependencies, assignments, milestone dates, etc. Expect to modify the plan as a result of this review.

Reviewing and Revising the Project Plan


Once the Project Sponsor and Key Stakeholders agree to the initial plan, it becomes the basis for assigning tasks and setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule and updating the plan based on status and changes to assumptions.

One of the key communication methods is establishing a weekly or bi-weekly Project Sponsor meeting. Attendance at this meeting should include the Project Sponsor, Key Stakeholders, Lead Developers, and the Project Manager. Elements of a Project Sponsor meeting should include: a) Key Accomplishments (milestones and events at a high level), b) Progress to Date against the initial plan, c) Actual Hours vs. Budgeted Hours, d) Key Issues, and e) Plans for Next Period.

Key Accomplishments
Listing key accomplishments provides an audit trail of activities completed for comparison against the initial plan. This is an opportunity to bring in the lead developers and have them report to management on what they have accomplished; it also provides them with an opportunity to raise concerns, which is valuable from a motivation perspective because it gives them ownership and accountability to management. Keep accomplishments at a high level and coach team members to be brief, keeping their presentations to a maximum of five to ten minutes during this portion of the meeting.

Progress against Initial Plan


The following matrix shows progress on relevant stages of the project. Roll up tasks to a management level so the report is readable to the Project Sponsor (see the sample below).

Plan                                                          Percent Complete   Budget Hours
Architecture - Set up of Informatica Migration Environment                            167
  Develop data integration solution architecture                    10%                28
  Design development architecture                                   28%                32
  Customize and implement Iterative Migration Framework             80%                32
  Data Profiling                                                    100%               10
  Legacy Stage                                                      100%               10
  Pre-Load Stage                                                     83%               18
  Reference Data                                                     19%               27
  Reusable Objects                                                    0%               10
  Review and signoff of Architecture                                  -                 -
Analysis - Target-to-Source Data Mapping                                             1000
  Customer (9 tables)                                                90%              135
  Product (6 tables)                                                 90%              215
  Inventory (3 tables)                                                0%               60
  Shipping (3 tables)                                                 0%               60
  Invoicing (7 tables)                                               57%              140
  Orders (19 tables)                                                 40%              380
  Review and signoff of Functional Specification                      0%               10

Budget versus Actual


A key measure to be aware of is budgeted vs. actual cost of the project. The Project Sponsor needs to know if additional funding is required; forecasting actual hours against budgeted hours allows the Project Sponsor to determine when additional funding or a change in scope is required. Many projects are cancelled because of cost overruns, so it is the Project Manager's job to keep expenditures under control. The following example shows how a budgeted vs. actual report may look.

                  10-Apr  17-Apr  24-Apr  1-May  8-May  15-May  22-May  29-May  To Date  Total
Resource A            28      40      24     40     40      40      40      32             284
Resource B                            10     40     40      40      40      32             202
Resource C                                   40     36      40      40      32             188
Resource D                            24     40     36      40      40      32             212
Project Manager                              12      8       8      16      32              76
Actual total          28      40      58    172    160     168     176     160     462     962
Budgeted total        97     110     160    160    160     160     160     160     687   1,167

Key Issues
This is the most important part of the meeting. Presenting key issues (e.g., resource commitment, user roadblocks, key design concerns) to the Project Sponsor and Key Stakeholders as they occur allows them to make immediate decisions and minimizes the risk of impact to the project.

Plans for Next Period


This communicates back to the Project Sponsor where the resources are to be deployed. If key issues dictate a change, this is an opportunity to redirect the resources and use them correctly. Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment Sample Deliverable), or changes in priority or approach, as they arise to determine if they affect the plan. It may be necessary to revise the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add new tasks or postpone existing ones.

Tracking Changes
One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If company and project management do not require tracking against a baseline, simply maintain the plan through updates without a baseline. Maintain all records of Project Sponsor meetings and recap changes in scope after the meeting is completed.

Summary
Managing a data integration, data migration, or data quality project requires good project planning and communications. Many data integration projects fail because of issues such as poor data quality or the complexity of integration issues. However, good communication and expectation setting with the Project Sponsor can prevent such issues from causing a project to fail.

Last updated: 01-Feb-07 18:54


Developing the Business Case

Challenge


The challenge is identifying the departments and individuals that are likely to benefit directly from the project implementation. Understanding these individuals and their business information requirements is key to defining and scoping the project.

Description
The following four steps summarize business case development and lay a good foundation for proceeding into detailed business requirements for the project. 1. One of the first steps in establishing the business scope is identifying the project beneficiaries and understanding their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart that is useful for ensuring that all project team members understand the corporate/business organization.
- Activity - Interview project sponsor to identify beneficiaries, define their business roles and project participation.
- Deliverable - Organization chart of corporate beneficiaries and participants.

2. The next step in establishing the business scope is to understand the business problem or need that the project addresses. This information should be clearly defined in a Problem/Needs Statement, using business terms to describe the problem. For example, the problem may be expressed as "a lack of information" rather than "a lack of technology" and should detail the business decisions or analysis that is required to resolve the lack of information. The best way to gather this type of information is by interviewing the Project Sponsor and/or the project beneficiaries.
- Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding problems and needs related to the project.
- Deliverable - Problem/Needs Statement

3. The next step in creating the project scope is defining the business goals and objectives for the project and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be a high-level expression of the desired business solution (e.g., what strategic or tactical benefits does the business expect to gain from the project?) and should avoid any technical considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this type of information. It may be practical to combine information gathering for the needs assessment and goals definition, using individual interviews or general meetings to elicit the information.

- Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding business goals and objectives for the project.
- Deliverable - Statement of Project Goals and Objectives

4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries of the project based on the Statement of Project Goals and Objectives and the associated project assumptions. This statement should focus on the type of information or analysis that will be included in the project rather than what will not. The assumptions statements are optional and may include qualifiers on the scope, such as assumptions of feasibility, specific roles and responsibilities, or availability of resources or data.

- Activity - Business Analyst develops the Project Scope and Assumptions statement for presentation to the Project Sponsor.
- Deliverable - Project Scope and Assumptions statement

Last updated: 01-Feb-07 18:54


Managing the Project Lifecycle

Challenge


To establish an effective communications plan that provides ongoing management throughout the project lifecycle and keeps the Project Sponsor informed of the status of the project.

Description
The quality of a project can be directly correlated to the amount of review that occurs during its lifecycle and the involvement of the Project Sponsor and Key Stakeholders.

Project Status Reports


In addition to the initial project plan review with the Project Sponsor, it is critical to schedule regular status meetings with the sponsor and project team to review status, issues, scope changes, and schedule updates. This is known as the Project Sponsor meeting. Gather status, issue, and schedule update information from the team one day before the status meeting in order to compile and distribute the Project Status Report. In addition, make sure the lead developers of major assignments are present to report on status and issues, if applicable.

Project Management Review


The Project Manager should coordinate, if not facilitate, reviews of requirements, plans and deliverables with company management, including business requirements reviews with business personnel and technical reviews with project technical personnel. Set a process in place beforehand to ensure appropriate personnel are invited, any relevant documents are distributed at least 24 hours in advance, and that reviews focus on questions and issues (rather than a laborious "reading of the code"). Reviews may include:


- Project scope and business case review.
- Business requirements review.
- Source analysis and business rules reviews.
- Data architecture review.
- Technical infrastructure review (hardware and software capacity and configuration planning).
- Data integration logic review (source-to-target mappings, cleansing and transformation logic, etc.).
- Source extraction process review.
- Operations review (operations and maintenance of load sessions, etc.).
- Reviews of the operations plan, QA plan, and deployment and support plan.

Project Sponsor Meetings


A Project Sponsor meeting should be held weekly or bi-weekly to communicate progress to the Project Sponsor and Key Stakeholders. The purpose is to keep key user management involved and engaged in the process, to communicate any changes to the initial plan, and to have them weigh in on the decision process. Elements of the meeting include:

- Key Accomplishments.
- Activities Next Week.
- Tracking of Progress to Date (Budget vs. Actual).
- Key Issues / Roadblocks.

It is the Project Manager's role to stay neutral on any issue and to effectively state facts, allowing the Project Sponsor or other key executives to make decisions. Many times this process builds the partnership necessary for success.

Change in Scope
Directly address and evaluate any changes to the planned project activities, priorities, or staffing as they arise, or are proposed, in terms of their impact on the project plan. The Project Manager should institute a change management process in response to any issue or request that appears to add or alter expected activities and has the potential to affect the plan.


- Use the Scope Change Assessment to record the background problem or requirement and the recommended resolution that constitutes the potential scope change. Note that such a change-in-scope document helps capture key documentation that is particularly useful if the project overruns or fails to deliver upon Project Sponsor expectations.
- Review each potential change with the technical team to assess its impact on the project, evaluating the effect in terms of schedule, budget, staffing requirements, and so forth.
- Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any potential risks to the project.

Even if there is no evident effect on the schedule, it is important to document these changes because they may affect project direction and it may become necessary, later in the project cycle, to justify these changes to management.

Management of Issues
Any questions, problems, or issues that arise and are not immediately resolved should be tracked to ensure that someone is accountable for resolving them and that their impact remains visible. Use the Issues Tracking template, or something similar, to track each issue, its owner, dates of entry and resolution, and the details of the issue and its solution. Significant or "showstopper" issues should also be mentioned in the status report and communicated through the weekly Project Sponsor meeting. This way, the Project Sponsor has the opportunity to resolve a potential issue before it affects the project.

Project Acceptance and Close


A formal project acceptance and close helps document the final status of the project. Rather than simply walking away from a project when it seems complete, this explicit close procedure both documents and helps finalize the project with the Project Sponsor. For most projects this involves a meeting where the Project Sponsor and/or department managers acknowledge completion or sign a statement of satisfactory completion.


Even for relatively short projects, use the Project Close Report to finalize the project with a final status report detailing:
- What was accomplished.
- Any justification for tasks expected but not completed.
- Recommendations.

Prepare for the close by considering what the project team has learned about the environments, procedures, data integration design, data architecture, and other project plans. Formulate the recommendations based on issues or problems that need to be addressed. Succinctly describe each problem or recommendation and if applicable, briefly describe a recommended approach.

Last updated: 01-Feb-07 18:54


Using Interviews to Determine Corporate Data Integration Requirements

Challenge


Data warehousing projects are usually initiated out of a business need for a certain type of report (e.g., "we need consistent reporting of revenue, bookings, and backlog"). Except in the case of narrowly-focused, departmental data marts, however, this is not enough guidance to drive a full data integration solution. Further, a successful, single-purpose data mart can build a reputation such that, after a relatively brief period of proving its value to users, business management floods the technical group with requests for more data marts in other areas. The only way to avoid silos of data marts is to think bigger at the beginning and canvas the enterprise (or at least the department, if that's your limit of scope) for a broad analysis of data integration requirements.

Description
Determining the data integration requirements in satisfactory detail and clarity is a difficult task, however, especially while ensuring that the requirements are representative of all the potential stakeholders. This Best Practice summarizes the recommended interview and prioritization process for this requirements analysis.

Process Steps
The first step in the process is to identify and interview all major sponsors and stakeholders. This typically includes the executive staff and CFO since they are likely to be the key decision makers who will depend on the data integration. At a minimum, figure on 10 to 20 interview sessions.

The next step in the process is to interview representative information providers. These individuals include the decision makers who provide the strategic perspective on what information to pursue, as well as details on that information and how it is currently used (i.e., reported and/or analyzed).

Be sure to provide feedback to all of the sponsors and stakeholders regarding the findings of the interviews and the recommended subject areas and information profiles. It is often helpful to facilitate a Prioritization Workshop with the major stakeholders, sponsors, and information providers in order to set priorities on the subject areas.


Conduct Interviews
The following paragraphs offer some tips on the actual interviewing process. Two sections at the end of this document provide sample interview outlines for the executive staff and information providers.

Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A focused, consistent interview format is desirable. Don't feel bound to the script, however, since interviewees are likely to raise some interesting points that may not be included in the original interview format. Pursue these subjects as they come up, asking detailed questions. This approach often leads to discoveries of strategic uses for information that may be exciting to the client and provide sparkle and focus to the project.

Questions to the executives or decision-makers should focus on what business strategies and decisions need information to support or monitor them. (Refer to Outline for Executive Interviews at the end of this document.) Coverage here is critical: if key managers are left out, you may miss a critical viewpoint and an important buy-in.

Interviews of information providers are secondary but can be very useful. These are the business analyst-types who report to decision-makers and currently provide reports and analyses, using Excel, Lotus, or a database program to consolidate data from more than one source and provide regular and ad hoc reports or conduct sophisticated analysis. In subsequent phases of the project, you must identify all of these individuals, learn what information they access, and how they process it. At this stage, however, you should focus on the basics, building a foundation for the project and discovering what tools are currently in use and where gaps may exist in the analysis and reporting functions.

Be sure to take detailed notes throughout the interview process. If there are a lot of interviews, you may want the interviewer to partner with someone who can take good notes, perhaps on a laptop to save note transcription time later. It is important to take down the details of what each person says because, at this stage, it is difficult to know what is likely to be important. While some interviewees may want to see detailed notes from their interviews, this is not very efficient since it takes time to clean up the notes for review. The most efficient approach is to simply consolidate the interview notes into a summary format following the interviews.

Be sure to review previous interviews as you go through the interviewing process. You can often use information from earlier interviews to pursue topics in later interviews in more detail and with varying perspectives.


The executive interviews must be carried out in business terms. There can be no mention of the data warehouse or systems of record, or of particular source data entities or issues related to sourcing, cleansing, or transformation. It is strictly forbidden to use any technical language. It can be valuable to have an industry expert prepare, and even accompany, the interviewer to provide business terminology and focus.

If the interview falls into technical details (for example, into a discussion of whether certain information is currently available or could be integrated into the data warehouse), it is up to the interviewer to re-focus immediately on business needs. If this focus is not maintained, the opportunity for brainstorming is likely to be lost, which will reduce the quality and breadth of the business drivers.

Because of the above caution, it is rarely acceptable to have IS resources present at the executive interviews. These resources are likely to engage the executive (or vice versa) in a discussion of current reporting problems or technical issues and thereby destroy the interview opportunity.

Keep the interview groups small. One or two Professional Services personnel should suffice, with at most one client project person. Especially for executive interviews, there should be one interviewee. There is sometimes a need to interview a group of middle managers together, but if there are more than two or three, you are likely to get much less input from the participants.

Distribute Interview Findings and Recommended Subject Areas


At the completion of the interviews, compile the interview notes and consolidate the content into a summary. This summary should help to break out the input into departments or other groupings significant to the client. Use this content and your interview experience, along with best practices or industry experience, to recommend specific, well-defined subject areas.

Remember that this is a critical opportunity to position the project to the decision-makers by accurately representing their interests while adding enough creativity to capture their imagination. Provide them with models or profiles of the sort of information that could be included in a subject area so they can visualize its utility. This sort of visionary concept of their strategic information needs is crucial to drive their awareness and is often suggested during interviews of the more strategic thinkers. Tie descriptions of the information directly to stated business drivers (e.g., key processes and decisions) to further accentuate the business solution. A typical table of contents in the initial Findings and Recommendations document might look like this:


I. Introduction
II. Executive Summary
   A. Objectives for the Data Warehouse
   B. Summary of Requirements
   C. High Priority Information Categories
   D. Issues
III. Recommendations
   A. Strategic Information Requirements
   B. Issues Related to Availability of Data
   C. Suggested Initial Increments
   D. Data Warehouse Model
IV. Summary of Findings
   A. Description of Process Used
   B. Key Business Strategies (including descriptions of processes, decisions, and other drivers)
   C. Key Departmental Strategies and Measurements
   D. Existing Sources of Information
   E. How Information is Used
   F. Issues Related to Information Access
V. Appendices
   A. Organizational structure and departmental roles
   B. Departmental responsibilities and relationships

Conduct Prioritization Workshop


This is a critical workshop for building consensus on the business drivers. Key executives and decision-makers should attend, along with some key information providers. It is advisable to schedule this workshop offsite to assure attendance and attention, but the workshop must be efficient, typically confined to a half-day. Be sure to announce the workshop well enough in advance to ensure that key attendees can put it on their schedules. Sending the announcement of the workshop may coincide with the initial distribution of the interview findings. The workshop agenda should include the following items:
- Agenda and Introductions
- Project Background and Objectives
- Validate Interview Findings: Key Issues
- Validate Information Needs
- Reality Check: Feasibility
- Prioritize Information Needs
- Data Integration Plan
- Wrap-up and Next Steps

Keep the presentation as simple and concise as possible, and avoid technical discussions or detailed sidetracks.

Validate information needs


Key business drivers should be determined well in advance of the workshop, using information gathered during the interviewing process. Prior to the workshop, these business drivers should be written out, preferably in display format on flipcharts or similar presentation media, along with relevant comments or additions from the interviewees and/or workshop attendees.

During the validation segment of the workshop, attendees need to review and discuss the specific types of information that have been identified as important for triggering or monitoring the business drivers. At this point, it is advisable to compile as complete a list as possible; it can be refined and prioritized in subsequent phases of the project. As much as possible, categorize the information needs by function, maybe even by specific driver (i.e., a strategic process or decision). Considering the information needs on a function-by-function basis fosters discussion of how the information is used and by whom.

Reality check: feasibility


With the results of brainstorming over business drivers and information needs listed (all over the walls, presumably), take a brief detour into reality before prioritizing and planning. You need to consider overall feasibility before establishing the first-priority information area(s) and setting a plan to implement the data warehousing solution with initial increments to address those first priorities.

Briefly describe the current state of the likely information sources (SORs). What information is currently accessible with a reasonable likelihood of the quality and content necessary for the high-priority information areas? If there is likely to be a high degree of complexity or technical difficulty in obtaining the source information, you may need to reduce the priority of that information area (i.e., tackle it after some successes in other areas). Avoid getting into too much detail or technical issues. Describe the general types of information that will be needed (e.g., sales revenue, service costs, customer descriptive information, etc.), focusing on what you expect will be needed for the highest priority information needs.

Data Integration Plan


The project sponsors, stakeholders, and users should all understand that the process of implementing the data warehousing solution is incremental. Develop a high-level plan for implementing the project, focusing on increments that are both high-value and high-feasibility. Implementing these increments first provides an opportunity to build credibility for the project. The objective during this step is to obtain buy-in for your implementation plan and to begin to set expectations in terms of timing. Be practical though; don't establish too rigorous a timeline!

Wrap-up and next steps


At the close of the workshop, review the group's decisions (in 30 seconds or less), schedule the delivery of notes and findings to the attendees, and discuss the next steps of the data warehousing project.

Document the Roadmap


As soon as possible after the workshop, provide the attendees and other project stakeholders with the results:
- Definitions of each subject area, categorized by functional area.
- Within each subject area, descriptions of the business drivers and information metrics.
- Lists of the feasibility issues.
- The subject area priorities and the implementation timeline.

Outline for Executive Interviews


I. Introductions
II. General description of information strategy process
   A. Purpose and goals
   B. Overview of steps and deliverables
      - Interviews to understand business information strategies and expectations
      - Document strategy findings
      - Consensus-building meeting to prioritize information requirements and identify quick hits
      - Model strategic subject areas
      - Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
   A. Description of business vision and strategies
   B. Perspective on strategic business issues and how they drive information needs
      - Information needed to support or achieve business goals
      - How success is measured
IV. Briefly describe your roles and responsibilities.
      - The interviewee may provide this information before the actual interview. In this case, simply review it with the interviewee and ask if there is anything to add.
   A. What are your key business strategies and objectives?
      - How do corporate strategic initiatives impact your group?
      - These may include MBOs (personal performance objectives) and workgroup objectives or strategies.
   B. What do you see as the Critical Success Factors for an Enterprise Information Strategy?
      - What are its potential obstacles or pitfalls?
   C. What information do you need to achieve or support key decisions related to your business objectives?
   D. How will your organization's progress and final success be measured (e.g., metrics, critical success factors)?
   E. What information or decisions from other groups affect your success?
   F. What are other valuable information sources (i.e., computer reports, industry reports, email, key people, meetings, phone)?
   G. Do you have regular strategy meetings? What information is shared as you develop your strategy?
   H. If it is difficult for the interviewee to brainstorm about information needs, try asking the question this way: "When you return from a two-week vacation, what information do you want to know first?"
   I. Of all the information you now receive, what is the most valuable?
   J. What information do you need that is not now readily available?
   K. How accurate is the information you are now getting?
   L. To whom do you provide information?
   M. Who provides information to you?
   N. Who would you recommend be involved in the cross-functional Consensus Workshop?


Outline for Information Provider Interviews


I. Introductions
II. General description of information strategy process
   A. Purpose and goals
   B. Overview of steps and deliverables
      - Interviews to understand business information strategies and expectations
      - Document strategy findings and model the strategic subject areas
      - Consensus-building meeting to prioritize information requirements and identify quick hits
      - Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
   1. Understanding of how business issues drive information needs
   2. High-level understanding of what information is currently provided to whom
      - Where does it come from?
      - How is it processed?
      - What are its quality or access issues?
IV. Briefly describe your roles and responsibilities.
      - The interviewee may provide this information before the actual interview. In this case, simply review it with the interviewee and ask if there is anything to add.
   A. Who do you provide information to?
   B. What information do you provide to help support or measure the progress/success of their key business decisions?
   C. Of all the information you now provide, what is the most requested or most widely used?
   D. What are your sources for the information (both in terms of systems and personnel)?
   E. What types of analysis do you regularly perform (i.e., trends, investigating problems)? How do you provide these analyses (e.g., charts, graphs, spreadsheets)?
   F. How do you change/add value to the information?
   G. Are there quality or usability problems with the information you work with? How accurate is it?

Last updated: 05-Jun-08 15:16


Upgrading Data Analyzer

Challenge


Seamlessly upgrade Data Analyzer from one release to another while safeguarding the repository.

Description
Upgrading Data Analyzer involves two steps:

1. Upgrading the Data Analyzer application.
2. Upgrading the Data Analyzer repository.

Steps Before The Upgrade


1. Back up the repository. To ensure a clean backup, shut down Data Analyzer and create the backup, following the steps in the Data Analyzer manual.
2. Restore the backed-up repository into an empty database or a new schema. This ensures that you have a hot backup of the repository if, for some reason, the upgrade fails.

Steps for Upgrading the Data Analyzer Application


The upgrade process varies depending on the application server on which Data Analyzer is hosted.

For WebLogic:
1. Install WebLogic 8.1 without uninstalling the existing application server (WebLogic 6.1).
2. Install the Data Analyzer application on the new WebLogic 8.1 application server, making sure to use a different port than the one used in the old installation. When prompted for a repository, choose the existing-repository option and provide the connection details of the database that hosts the backed-up copy of the old Data Analyzer repository.
3. When the installation is complete, use the Upgrade utility to connect to the database that hosts the backed-up Data Analyzer repository and perform the upgrade.

For JBoss and WebSphere:

1. Uninstall Data Analyzer.
2. Install the new Data Analyzer version.
3. When prompted for a repository, choose the existing-repository option and provide the connection details of the database that hosts the backed-up Data Analyzer repository.
4. Use the Upgrade utility to connect to the database that hosts the backed-up Data Analyzer repository and perform the upgrade.

When the repository upgrade is complete, start Data Analyzer and perform a simple acceptance test. You can use the following test case (or a subset of it) as an acceptance test:

1. Open a simple report.
2. Open a cached report.
3. Open a report with filtersets.
4. Open a sectional report.
5. Open a workflow and also its nodes.
6. Open a report and drill through it.

When all the reports open without problems, the upgrade can be considered complete. Once the upgrade is verified, repeat the above process on the actual repository.

Note: This upgrade process creates two instances of Data Analyzer, so when the upgrade is successful, uninstall the older version, following the steps in the Data Analyzer manual.

Last updated: 01-Feb-07 18:54


Upgrading PowerCenter

Challenge


Upgrading an existing installation of PowerCenter to a later version encompasses upgrading the repositories, implementing any necessary modifications, testing, and configuring new features. With PowerCenter 8.1, the expansion of the Service-Oriented Architecture with its domain and node concept brings additional challenges to the upgrade process. The challenge is for data integration administrators to approach the upgrade process in a structured fashion and minimize risk to the environment and ongoing project work. Some of the challenges typically encountered during an upgrade include:

- Limiting development downtime.
- Ensuring that development work performed during the upgrade is accurately migrated to the upgraded environment.
- Testing the upgraded environment to ensure that data integration results are identical to the previous version.
- Ensuring that all elements of the various environments (e.g., Development, Test, and Production) are upgraded successfully.

Description
Some typical reasons for initiating a PowerCenter upgrade include:

- Additional features and capabilities in the new version of PowerCenter that enhance development productivity and administration.
- Keeping pace with higher demands for data integration.
- Achieving process performance gains.
- Maintaining an environment of fully supported software as older PowerCenter versions reach end-of-support status.

Upgrade Team
Assembling a team of knowledgeable individuals to carry out the PowerCenter upgrade is key to completing the process within schedule and budgetary guidelines. Typically, the upgrade team needs the following key players:

- PowerCenter Administrator
- Database Administrator
- System Administrator
- Informatica team - the business and technical users that "own" the various areas in the Informatica environment. These resources are required for knowledge transfer and testing during the upgrade process and after the upgrade is complete.
PowerCenter Administrator Database Administrator System Administrator Informatica team - the business and technical users that "own" the various areas in the Informatica environment. These resources are required for knowledge transfer and testing during the upgrade process and after the upgrade is complete.

Upgrade Paths
The upgrade process details depend on which of the existing PowerCenter versions you are upgrading from and which version you are moving to. The following bullet items summarize the upgrade paths for the various PowerCenter versions:
- PowerCenter 8.1.1 (available since September 2006):
  - Direct upgrade for PowerCenter 6.x to 8.1.1
  - Direct upgrade for PowerCenter 7.x to 8.1.1
  - Direct upgrade for PowerCenter 8.0 to 8.1.1
- Other versions:
  - For version 4.6 or earlier - upgrade to 5.x, then to 7.x, and then to 8.1.1
  - For version 4.7 or later - upgrade to 6.x and then to 8.1.1

Upgrade Tips
Some of the following items may seem obvious, but adhering to these tips should help to ensure that the upgrade process goes smoothly.
- Be sure to have sufficient memory and disk space (database) for the installed software. As new features are added to PowerCenter, the repository grows in size anywhere from 5 to 25 percent per release to accommodate the metadata for the new features. Plan for this increase in all of your PowerCenter repositories.
- Always read and save the upgrade log file.
- Back up the Repository Server and PowerCenter Server configuration files prior to beginning the upgrade process.
- Test any AEPs/EPs (Advanced External Procedures/External Procedures) prior to beginning the upgrade. Recompiling may be necessary.
- PowerCenter 8.x and beyond require domain metadata in addition to the standard PowerCenter repositories. Work with your DBA to create a location for the Domain Metadata Repository, which is created at install time.
- Ensure that all repositories for upgrade are backed up and that they can be restored successfully (see the sketch after this list). Repositories can be restored to the same database in a different schema to allow an upgrade to be carried out in parallel. This is especially useful if the PowerCenter test and development environments reside in a single repository.
- When naming your nodes and domains in PowerCenter 8, think carefully about the naming convention before the upgrade. While changing the name of a node or the domain later is possible, it is not an easy task since the name is embedded in much of the general operation of the product. Avoid using IP addresses and machine names for the domain and node names since machine IP addresses and server names may change over time.
- With PowerCenter 8, a central location exists for shared files (i.e., log files, error files, checkpoint files, etc.) across the domain. If using the Grid option or High Availability option, it is important that this file structure is on a high-performance file system and viewable by all nodes in the domain. If High Availability is configured, this file system should also be highly available.
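The following is a minimal sketch of the kind of backup-and-restore verification described above, run with the pmrep command-line utility from the existing installation. The repository names, domain, user, and file names are placeholders, and exact pmrep options vary between PowerCenter releases, so confirm the syntax against the pmrep help for your version before relying on it.

    #!/bin/sh
    # Pre-upgrade safety check: back up the repository with pmrep and prove the
    # backup file can be restored. All names below are illustrative placeholders.

    BACKUP_FILE=dev_repo_preupgrade.rep

    # Connect to the source repository (PowerCenter 8-style options shown; earlier
    # releases identify the Repository Server with -h host -o port instead of -d).
    pmrep connect -r DEV_REPO -d mydomain -n admin_user -x "$ADMIN_PWD" || exit 1

    # Write the backup file into the Repository Service backup directory.
    pmrep backup -o "$BACKUP_FILE" || exit 1

    # Prove the backup is usable: connect to an empty scratch repository service
    # created for verification only, then restore the file into it. Depending on
    # the release this step can also be performed from the Administration Console.
    pmrep connect -r DEV_REPO_VERIFY -d mydomain -n admin_user -x "$ADMIN_PWD" || exit 1
    pmrep restore -i "$BACKUP_FILE" || exit 1

    echo "Backup and trial restore completed."

If the trial restore succeeds, the scratch copy can also serve as the repository that is upgraded first, leaving the original untouched until the process has been proven.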

Upgrading Multiple Projects


Be sure to consider the following items if the upgrade involves multiple projects:
- All projects sharing a repository must upgrade at the same time (and test concurrently).
- Projects using multiple repositories must all upgrade at the same time.
- After the upgrade, each project should undergo full regression testing.

Upgrade Project Plan


The full upgrade process from version to version can be time consuming, particularly around the testing and verification stages. Informatica strongly recommends developing a project plan to track progress and to inform managers and team members of the tasks that need to be completed, as well as of any uncertainties or missed steps.

Scheduling the Upgrade


When an upgrade is scheduled in conjunction with other development work, it is prudent to have it occur within a separate test environment that mimics (or at least closely resembles) production. This reduces the risk of unexpected errors and can decrease the effort spent on the upgrade. It may also allow the development work to continue in parallel with the upgrade effort, depending on the specific site setup.

Environmental Impact
With each new PowerCenter release, there is the potential for the upgrade to affect your data integration environment based on new components and features. The PowerCenter 8 upgrade changes the architecture from PowerCenter version 7, so you should spend time planning the upgrade strategy concerning domains, nodes, domain metadata, and the other architectural components of PowerCenter 8. Depending on the complexity of your data integration environment, this may be a minor or major impact. Single integration server/single repository installations are not likely to notice much of a difference in the architecture, but customers striving for highly-available systems with enterprise scalability may need to spend time understanding how to alter their physical architecture to take advantage of these new features in PowerCenter 8. For more information on these architecture changes, reference the PowerCenter documentation and the Best Practice on Domain Configuration.

Upgrade Process
Informatica recommends using the following approach to handle the challenges inherent in an upgrade effort.

Choosing an Appropriate Environment


It is always advisable to have at least three separate environments: one each for Development, Test, and Production. The Test environment is generally the best place to start the upgrade process since it is likely to be the most similar to Production. If possible, select a test sandbox that parallels production as closely as possible. This enables you to carry out data comparisons between PowerCenter versions. An added benefit of starting the upgrade process in a test environment is that development can continue without interruption.

Your corporate policies on development, test, and sandbox environments, and the work that can or cannot be done in them, will determine the precise order for the upgrade and any associated development changes. Note that if changes are required as a result of the upgrade, they need to be migrated to Production. Use the existing version to back up the PowerCenter repository, then ensure that the backup works by restoring it to a new schema in the repository database.

Alternatively, you can begin the upgrade process in the Development environment or create a parallel environment in which to start the effort. The decision to use or copy an existing platform depends on the state of project work across all environments. If it is not possible to set up a parallel environment, the upgrade may start in Development, then progress to the Test and Production systems. However, using a parallel environment is likely to minimize development downtime. The important thing is to understand the upgrade process and your own business and technical requirements, then adapt the approaches described in this document to one that suits your particular situation.

Organizing the Upgrade Effort


Begin by evaluating the entire upgrade effort in terms of resources, time, and environments. This includes training and the availability of database, operating system, and PowerCenter administrator resources, as well as time to perform the upgrade and carry out the necessary testing in all environments. Refer to the release notes to help identify mappings and other repository objects that may need changes as a result of the upgrade.

Provide detailed training for the upgrade team to ensure that everyone directly involved in the upgrade process understands the new version and is capable of using it for their own development work and assisting others with the upgrade process.

Run regression tests for all components on the old version. If possible, store the results so that you can use them for comparison purposes after the upgrade is complete.

Before you begin the upgrade, be sure to back up the repository and server caches, scripts, logs, bad files, parameter files, source and target files, and external procedures. Also be sure to copy backed-up server files to the new directories as the upgrade progresses.

If you are working in a UNIX environment and have to use the same machine for the existing and upgraded versions, be sure to use separate users and directories. Be careful to ensure that profile path statements do not overlap between the new and old versions of PowerCenter. For additional information, refer to the installation guide for path statements and environment variables for your platform and operating system.

Installing and Configuring the Software


- Install the new version of the PowerCenter components on the server.
- Ensure that the PowerCenter client is installed on at least one workstation to be used for upgrade testing and that connections to repositories are updated if parallel repositories are being used.
- Re-compile any Advanced External Procedures/External Procedures if necessary, and test them.
- The PowerCenter license key is now in the form of a file. During the installation of PowerCenter, you'll be asked for the location of this key file. The key should be saved on the server prior to beginning the installation process.
- When installing PowerCenter 8.x, you'll configure the domain, node, repository service, and the integration service at the same time. Ensure that you have all necessary database connections ready before beginning the installation process.
- If upgrading to PowerCenter 8.x from PowerCenter 7.x (or earlier), you must gather all of the configuration files that are going to be used in the automated process to upgrade the Integration Services and repositories. See the PowerCenter Upgrade Manual for more information on how to gather them and where to locate them for the upgrade process.
- Once the installation has been completed, use the Administration Console to perform the upgrade. Unlike previous versions of PowerCenter, in version 8 the Administration Console is a web application. The Administration Console URL is http://hostname:portnumber, where hostname is the name of the server where the PowerCenter services are installed and portnumber is the port identified during the installation process. The default port number is 6001.
- Re-register any plug-ins (such as PowerExchange) to the newly upgraded environment.
- You can start both the repository and integration services from the Administration Console.
- Analyze upgrade activity logs to identify areas where changes may be required, and rerun full regression tests on the upgraded repository.
- Execute test plans. Ensure that there are no failures and all the loads run successfully in the upgraded environment. Verify the data to ensure that there are no changes and no additional or missing records.

Implementing Changes and Testing


If changes are needed, decide where those changes are going to be made. It is generally advisable to migrate work back from test to an upgraded development environment, complete the necessary changes, and then migrate forward through test to production. Assess the changes when the results from the test runs are available. If you decide to deviate from best practice and make changes in test and migrate them forward to production, remember that you'll still need to implement the changes in development; otherwise, these changes will be lost the next time work is migrated from development to the test environment. When you are satisfied with the results of testing, upgrade the other environments by backing up and restoring the appropriate repositories (see the sketch below). Be sure to closely monitor the production environment and check the results after the upgrade. Also remember to archive and remove old repositories from the previous version.
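A minimal sketch of the backup step of that promotion, assuming the 8.x pmrep client; the repository, domain, user, and file names are placeholders, and the restore into the target environment can be performed through the Administration Console or the command-line tools documented for your release.

    # Back up the validated repository.
    pmrep connect -r TEST_REPO -d Domain_Test -n Administrator -x AdminPassword
    pmrep backup -o /backups/TEST_REPO_upgraded.rep

    # Restore the backup file into the target environment's Repository Service
    # (for example, through the Administration Console), then upgrade and test there.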

After the Upgrade


- If multiple nodes were configured and you own the PowerCenter Grid option, you can create a server grid to test performance gains.
- If you own the high-availability option, configure your environment for high availability, including setting up failover gateway node(s) and designating primary and backup nodes for your various PowerCenter services. In addition, the shared file location for the domain should reside on a highly available, high-performance file server.
- Start measuring data quality by creating a sample data profile.
- If LDAP is in use, associate LDAP users with PowerCenter users.
- Install PowerCenter Reports and configure the built-in reports for the PowerCenter repository.

Repository Versioning
After upgrading to version 8.x, you can set the repository to versioned if you purchased the Team-Based Management option and enabled it via the license key. Keep in mind that once the repository is set to versioned, it cannot be set back to non-versioned. You can enable the team-based development option in the Administration Console.

Upgrading Folder Versions


After upgrading to version 8.x, you'll need to remember the following:
- There are no more folder versions in version 8.
- The folder with the highest version number becomes the current folder.
- Other versions of the folders are renamed folder_<folder_version_number>.
- Shortcuts are created to mappings from the current folder.

Upgrading Pmrep and Pmcmd Scripts


- Folder versions no longer apply to pmrep and pmrepagent scripts.
- Ensure that the workflow/session folder names in scripts match the upgraded folder names.
- Note that the pmcmd command structure changed significantly after version 5. Version 5 pmcmd commands can still run in version 8, but may not be supported in future versions, so plan to move scripts to the service-based syntax (see the sketch below).
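A minimal example of the service-based pmcmd syntax, assuming an 8.x Integration Service; the service, domain, user, folder, and workflow names are placeholders, and the options should be verified against the Command Reference for your release.

    pmcmd startworkflow -sv IS_DEV -d Domain_Dev -u Administrator -p AdminPassword \
        -f SALES_DW -wait wf_load_daily_sales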

Advanced External Procedure Transformations


AEPs are upgraded to Custom transformations, which are non-blocking. To take advantage of the non-blocking behavior you need to recompile the procedure; if recompilation is not required, you can continue to use the old DLL or shared library.

Upgrading XML Definitions


- Version 8 supports XML schema.
- The upgrade removes namespaces and prefixes for multiple namespaces.
- Circular reference definitions are read-only after the upgrade.
- Some datatypes are changed in XML definitions by the upgrade.

For more information on the specific changes to the PowerCenter software for your particular upgraded version, reference the release notes as well as the PowerCenter documentation.

Last updated: 01-Feb-07 18:54


Upgrading PowerExchange

Challenge


Upgrading and configuring PowerExchange on a mainframe to a new release while ensuring minimal impact to the current PowerExchange schedule.

Description
The PowerExchange upgrade is essentially a new installation with a few additional steps and some changes to the standard installation steps. Planning for a PowerExchange upgrade requires the same resources as the initial implementation. These include, but are not limited to:

- MVS systems operator
- Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
- MVS security resources

Since an upgrade is so similar to an initial implementation of PowerExchange, this document does not address the details of the installation. Instead, it addresses the steps that are not documented in the installation Best Practice, as well as changes to existing steps in that document. For details on installing a new PowerExchange release, see the Best Practice PowerExchange Installation (for Mainframe).

Upgrading PowerExchange on the Mainframe


The following steps are modifications to the installation steps or additional steps required to upgrade PowerExchange on the mainframe. More detailed information for upgrades can also be found in the PWX Migration Guide that comes with each release.

1. Choose a new high-level qualifier when allocating the libraries, RUNLIB and BINLIB, on the mainframe. Consider using the version of PowerExchange as part of the dataset name; an example would be SYSB.PWX811.RUNLIB. These two libraries need to be APF-authorized.
2. Back up the mainframe datasets and libraries. Also back up the PowerExchange paths on the client workstations and the PowerCenter server.
3. When executing the MVS Install Assistant and providing values on each screen, make sure the following parameters differ from those used in the existing version of PowerExchange:
   - Specify new high-level qualifiers for the PowerExchange datasets, libraries, and VSAM files. The value needs to match the qualifier used for the RUNLIB and BINLIB datasets allocated earlier. Consider including the version of PowerExchange in the high-level nodes of the datasets; an example could be SYSB.PWX811.
   - The PowerExchange Agent/Logger three-character prefix needs to be unique and differ from that used in the existing version of PowerExchange. Make sure the values on the Logger/Agent/Condenser Parameters screen reflect the new prefix.
   - For DB2, the plan name specified should differ from that used in the existing release.
4. Run the jobs listed in the XJOBS member in the RUNLIB.
5. Before starting the Listener, rename the DBMOVER member in the new RUNLIB dataset.
6. Copy the DBMOVER member from the current PowerExchange RUNLIB to the corresponding library for the new release of PowerExchange. Update the port numbers to reflect the new ports, and update any dataset names specified in the NETPORT statements to reflect the new high-level qualifier.
7. Start the Listener and make sure the PING works. See the installation Best Practice or the Implementation Guide for more details.
8. Migrate the existing Datamaps to the new release using the DTLURDMO utility. Details and examples can be found in the PWX Utilities Guide and the PWX Migration Guide.

At this point, the mainframe upgrade is complete for bulk processing. For PowerExchange Change Data Capture or Change Data Capture Real-time, complete the additional steps in the installation manual and also perform the following:

1. Use the DTLURDMO utility to migrate existing Capture Registrations and Capture Extractions to the new release.
2. Create a Registration Group for each source.
3. Open and save each Extraction Map in the new Extraction Groups.
4. Ensure that the values for the CHKPT_BASENAME and EXT_CAPT_MASK parameters are correct before running a Condense.


Upgrading PowerExchange on a Client Workstation and the Server


The installation procedures on the client workstations and the server are the same as they are for an initial implementation, with a few exceptions:

1. Specify new paths during the installation of the new release.
2. After the installation, copy the old DBMOVER.CFG configuration member to the new path and modify the ports to reflect those of the new release.
3. Make sure the path settings reflect the path specified earlier for the new release.

Testing can begin now. When testing is complete, the new version can go live. (A sketch of these steps on a UNIX server follows.)
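A minimal sketch of these client/server steps on a UNIX PowerCenter server, assuming the old and new PowerExchange releases are installed in separate directories; every path, environment variable, and node name is a placeholder, and the DTLREXE ping invocation should be confirmed against the PowerExchange documentation for your release.

    # Carry the existing configuration forward, then adjust it for the new release.
    cp /opt/pwx71/dbmover.cfg /opt/pwx81/dbmover.cfg
    vi /opt/pwx81/dbmover.cfg     # update the Listener/node port numbers to the new ports

    # Point the environment at the new release.
    export PWX_HOME=/opt/pwx81
    export PATH=$PWX_HOME:$PATH

    # Verify connectivity to the new mainframe Listener.
    dtlrexe prog=ping loc=mvs_node1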

Go Live With New Release


1. Stop all workflows (a pmcmd sketch for this step and step 6 follows this list).
2. Stop all production updates to the existing sources.
3. Ensure all captured data has been processed.
4. Stop all tasks on the mainframe (Agent, Listener, etc.).
5. Start the new tasks on the mainframe.
6. Resume production updates to the sources and resume the workflow schedule.
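A minimal sketch of steps 1 and 6 from the PowerCenter side, assuming an 8.x Integration Service and the service-based pmcmd syntax; the service, domain, folder, workflow, and credential names are placeholders, and the options should be verified against the Command Reference for your release.

    # Step 1: stop and unschedule the affected workflows before the cutover.
    pmcmd stopworkflow -sv IS_PROD -d Domain_Prod -u Administrator -p AdminPassword \
        -f CDC_LOADS wf_cdc_orders
    pmcmd unscheduleworkflow -sv IS_PROD -d Domain_Prod -u Administrator -p AdminPassword \
        -f CDC_LOADS wf_cdc_orders

    # Step 6: once the new mainframe tasks are running, put the schedule back in place.
    pmcmd scheduleworkflow -sv IS_PROD -d Domain_Prod -u Administrator -p AdminPassword \
        -f CDC_LOADS wf_cdc_orders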

After the Migration


Consider removing or uninstalling the old release's software on the workstations and the server to avoid any conflicts.

Last updated: 01-Feb-07 18:54
