DataStage tip for beginners - parallel lookup types


Vincent McBurney | Jan 6, 2006 | Comments (35)

Tooling Around in the IBM InfoSphere

by Vincent McBurney, IBM Information Champion

The blog dedicated to a tool-based approach to data integration, with news and tips on IBM InfoSphere, Informatica, Oracle and more.

Parallel DataStage jobs can have many sources of reference data for lookups, including database tables, sequential files and native datasets. Which is the most efficient? This question has popped up several times over on DSXchange. In DataStage server jobs the answer is quite simple: local hash files are the fastest method of key-based lookup, as long as the time taken to build the hash file does not wipe out the benefit of using it. In a parallel job a very large number of stages can act as a lookup source, a much wider variety than in server jobs; this includes most data sources plus the parallel staging formats of datasets and lookup filesets. I have discounted database lookups, as the overhead of database connectivity and any network passage makes them slower than most local storage. I ran a test comparing datasets, sequential files and lookup filesets as reference sources, increasing the row volumes to see how they responded. The test had three jobs, each with a sequential file input stage and a reference stage writing to a copy stage.
Small lookups

I set the input and lookup volumes to 1,000 rows. All three jobs processed in 17 or 18 seconds. No lookup tables were created apart from the one already inside the lookup fileset. This indicates the lookup data fit into memory and did not overflow to a resource file.
1 million rows


The lookup against the dataset took 35 seconds, the lookup fileset took 18 seconds and the lookup against the sequential file took 35 seconds, even though it had to partition the data. I assume this is because the input also had to be partitioned, and that was the bottleneck in the job.
2 million rows

Starting to see some big differences now. The lookup fileset, down at 45 seconds, took less than three times as long as the 1,000 row test. The dataset is up to 1:17 and the sequential file up to 1:32. The cost of partitioning the lookup data is really showing now.
3 million rows

The fileset, still at 45 seconds, swallowed the extra 1 million rows with ease. The dataset is up to 2:06 and the sequential file up to 2:20. As a final test I replaced the lookup stage with a join stage and tested the dataset and sequential file reference links. The dataset join finished in 1:02 and the sequential file join in 1:15. A large join proved faster than a large lookup, but not as fast as a lookup fileset.
Conclusion

If your lookup volume is low enough to fit into memory then the source is irrelevant: they all load up very quickly, and even database lookups are fast. If you have very large lookup files spilling into lookup table resources then the lookup fileset outstrips the other options, and a join becomes a viable alternative. Joins are a bit harder to design, as you can only join one source at a time, whereas a lookup stage can reference multiple sources. I usually go with lookups for code-to-description or code-to-key references regardless of size, and reserve joins for references that bring back lots of columns. I will certainly be making more use of the lookup fileset to get more performance from jobs. Sparse database lookups, which I didn't test, are an option if you have a very large reference table and a small number of input rows.
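As a rough summary, the conclusion above could be sketched as a decision helper. This is plain Python, not DataStage: the branch conditions paraphrase the post, and the returned strings are illustrative labels, not DataStage settings.

```python
# A rough decision helper distilled from the conclusion above.
# The branch conditions paraphrase the blog post; the returned
# strings are illustrative labels, not DataStage stage names/settings.
def choose_reference_strategy(fits_in_memory, wide_reference):
    if fits_in_memory:
        # Small reference data: any source loads quickly, even a database.
        return "lookup"
    if wide_reference:
        # References returning many columns: the author reserves joins for these.
        return "join"
    # Large keyed reference data: the pre-indexed lookup fileset wins.
    return "lookup fileset"

print(choose_reference_strategy(True, False))   # → lookup
print(choose_reference_strategy(False, True))   # → join
print(choose_reference_strategy(False, False))  # → lookup fileset
```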

Work With Me Want to work for the IBM Information Management Australian partner of the year 2010? I am building a team of great Information Server and DataStage consultants in Melbourne, Sydney, Canberra, Adelaide and Perth. Send an email to vmcburney at focuss.com.au.

Links

it.toolbox.com/blogs/infosphere/datastage-tip-for-beginners-parallel-lookup-types-7183


35 Comments
Unknown User | Jan 6, 2006

Lookup stage can perform lookups against only three things: a virtual Data Set, a Lookup File Set (which is aided by an index) or a "sparse" lookup against a DB2 or Oracle table. I don't think the tests you did really differentiate very well between the first two. But I can't think of a good way to do so, until you have many processing nodes. Loading the Lookup File Set (presumably from a sequential file) and importing from the sequential file into a virtual Data Set would be comparable operations.
Vincent McBurney | Jan 7, 2006

Thanks Ray, being a blog it is a quick post and a quick set of tests, not really the number of rows and nodes that gives it a really thorough going over. I just wanted to demonstrate that on large lookups the lookup fileset has the lookup table already prepared, while the virtual dataset needs to build it. I've also noted in the past that the lookup fileset can be faster as it does not return the key lookup fields back to the lookup stage and has a lower volume of data being moved around. You can derive a formula of number of input rows versus number of lookup rows versus bits of lookup data etc. It helps to know what the alternatives are.
Unknown User | Feb 5, 2006

This information is very helpful for beginners. Thanks for your posting. Raghu

Unknown User | Jan 23, 2007

Very impressive Vincent. Keep up the good work. This is a good aid for beginners. Maybe you can make it a FAQ entry at DSXchange.
Unknown User | Feb 20, 2007

hi, very good useful information. continue...... regards

Unknown User | Apr 5, 2007

What are the different types of lookup? When do you use a normal lookup and when do you use a sparse lookup?
Vincent McBurney | Apr 5, 2007

A sparse lookup fires a database command per row of input data; a normal database lookup fires a single large database select statement at the start of the job and caches the data within the DataStage architecture to provide lookup results. It then comes down to balance. A small number of input rows against a large number of reference rows makes sparse better, or a conditional lookup with a small number of lookup calls makes sparse better. But most of the time a normal lookup is better, as it restricts the number of round trips to the database.
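The difference can be illustrated outside DataStage. This is a Python/SQLite sketch, not DataStage code, and the table and column names are invented: a normal lookup makes one bulk select and probes an in-memory cache, while a sparse lookup makes one select per input row.

```python
# Illustrative sketch (not DataStage): contrasting a "normal" lookup
# (one bulk SELECT cached in memory) with a "sparse" lookup (one SELECT
# per input row). Table and column names are made up for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (code TEXT PRIMARY KEY, descr TEXT)")
conn.executemany("INSERT INTO ref VALUES (?, ?)",
                 [("A", "Alpha"), ("B", "Beta")])

input_rows = ["A", "B", "A"]

# Normal lookup: one round trip, cache everything, probe in memory.
cache = dict(conn.execute("SELECT code, descr FROM ref"))
normal = [cache.get(code) for code in input_rows]

# Sparse lookup: one round trip per input row.
sparse = []
for code in input_rows:
    row = conn.execute("SELECT descr FROM ref WHERE code = ?",
                       (code,)).fetchone()
    sparse.append(row[0] if row else None)

print(normal)  # → ['Alpha', 'Beta', 'Alpha']
print(sparse)  # same result, but len(input_rows) database calls
```

Both produce the same answers; the trade-off is purely in how many round trips the database sees, which is the balance described above.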
Unknown User | Jun 12, 2007

Dear Vincent, as a beginner in DataStage I read your work here, whether I understand it fully or not. Right now I am just going through DataStage tutorials and PDFs. Can you guide me further on this? vasan
Unknown User | Jul 6, 2007

Hi, can anyone tell me how we can create a 64-bit hash file in DataStage?

Vincent McBurney | Jul 6, 2007

There are a few threads with instructions on this at dsxchange such as RESIZE 32 to 64.
Unknown User | Jul 9, 2007

What about having the database server do the lookup using a left join? I'm thinking it might be a good option for people who are not using DataStage Enterprise Edition, as the DB server can use parallel processing to perform the join task.
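The commenter's suggestion can be sketched in plain Python with SQLite standing in for the database server; the table and column names are invented for the example. The database resolves the reference data itself with a LEFT JOIN, keeping unmatched rows with nulls:

```python
# Sketch: pushing the lookup into the database as a LEFT JOIN, so the
# database server (not the ETL engine) resolves the reference data.
# Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, cust_code TEXT);
    CREATE TABLE customers (cust_code TEXT PRIMARY KEY, cust_name TEXT);
    INSERT INTO orders VALUES (1, 'A'), (2, 'Z');
    INSERT INTO customers VALUES ('A', 'Acme');
""")

rows = conn.execute("""
    SELECT o.order_id, o.cust_code, c.cust_name
    FROM orders o
    LEFT JOIN customers c ON c.cust_code = o.cust_code
    ORDER BY o.order_id
""").fetchall()
print(rows)  # → [(1, 'A', 'Acme'), (2, 'Z', None)]
```

The unmatched order survives with a NULL name, much like a lookup set to continue on failure, but the join work happens on the database side.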

Unknown User | Jul 25, 2007

Which one is faster: a lookup from a hash file, or joining the tables in an Oracle database query?

Vincent McBurney | Jul 25, 2007

This is part of the "art" of DataStage: knowing when to use a database lookup and when to use a cached or hashed file. There are a lot of variables, such as the size of the reference data versus the size of the input data, local versus remote databases, sharing hash files between jobs, etc. The best way to find out is to test both versions of the job to see which is faster. You don't need to do this for every job; just do it on your high volume jobs to get a feel for the best approach and then apply it to other jobs.
Ramu Kalvakuntla | Aug 15, 2007

Hi guys, I have a lot of experience in Ab Initio but am new to DataStage. I have a situation related to the lookup vs join question. I have two high volume tables, each having about 500 million records. We need to get data for a given month by joining these two big tables. Both tables are partitioned on a primary key.

The client had a PL/SQL script which joins these two tables and loads into a temp table. While loading into the temp table it joins one day of data at a time, as they are OLTP tables and joining one complete month throws a rollback segment error on Oracle. This process takes an average of 9 to 12 hours to load into the temp table. On average we get about 5 million records into the temp table.

I did a proof of concept in Ab Initio and got the above data in less than an hour. Here is what I did: first I unloaded data from table1 using the date range for one complete month (the date field is indexed). This took about 15 minutes to unload 15 million records. Then I partitioned the data on the primary key, sorted on the same key, and used the DB Join component in Ab Initio. This graph runs 12-way parallel. I got 5 million records out after the join because not all the records from table1 are in table2.

Now how do we do a similar thing in DataStage? I read a lot of documentation but I couldn't find anything like DB Join. I found the sparse lookup and database lookup. Can I use either of these to get the performance I got in Ab Initio? Any suggestion would be appreciated. Thanks, Ramu Kalvakuntla, Solutions Architect
Unknown User | Oct 27, 2007

Hi, can anyone give me some concepts on T-ETL? I'm currently working on Information Server with Federation Server. Thanks, Maha
Unknown User | Nov 3, 2007

Hi, can anyone give me some ideas about lookups in DataStage parallel? Thanks, jayappa

Unknown User | Nov 5, 2007

Hi, could anyone explain in detail how hash partitioning works in DataStage Parallel Extender? Thanks, Raj
suresh babu | Dec 6, 2007

What is the sequence of execution of derivations, constraints and stage variables, and how is that sequence executed?
Vincent McBurney | Dec 6, 2007

I would describe it as top to bottom and left to right. The stage variables are executed in the order they appear in the table, and the code is executed from left to right. That lets you use a trick to compare the current row to the previous row. In this example the ChangeFlag tells you whether the customerid field in the current row is different from its value in the previous row:

Stage Variable: Derivation


ChangeFlag: input.customerid <> OldCustomerId
OldCustomerId: input.customerid
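The same trick can be sketched in plain Python (this is not DataStage code): the stage variables become ordinary variables updated once per row, in the same top-to-bottom order, so ChangeFlag is computed before OldCustomerId is refreshed.

```python
# Sketch (not DataStage code): emulating the stage-variable trick in Python.
# "Stage variables" become ordinary variables updated once per row, in order.
def flag_changes(rows):
    """Yield (row, change_flag) where change_flag is True when
    customerid differs from the previous row's customerid."""
    old_customer_id = None  # plays the role of OldCustomerId
    for row in rows:
        # ChangeFlag is evaluated first, as it sits above OldCustomerId.
        change_flag = row["customerid"] != old_customer_id
        # OldCustomerId updates afterwards, so it still holds the prior value
        # when ChangeFlag is computed.
        old_customer_id = row["customerid"]
        yield row, change_flag

rows = [{"customerid": 1}, {"customerid": 1}, {"customerid": 2}]
print([flag for _, flag in flag_changes(rows)])  # → [True, False, True]
```

The ordering is the whole point: swap the two assignments and ChangeFlag would always compare a row against itself.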


annesow janya | Dec 12, 2007

Hi, we use a sparse lookup when the number of records in the reference table is large compared to the source table. Regards, Anne
annesow janya | Dec 12, 2007

The order of execution is stage variables, constraints and then derivations. Stage variables are executed first because the output of one constraint cannot be an input to another constraint; we specify the parameter or condition in stage variables. Stage variables obtain values during the read and do not pass that value to the output. Please correct me if I'm wrong. Regards, Anne
Unknown User | Mar 5, 2008

Hi, I have created a DS job with only two stages: a source sequential file and a target DB2 stage, where in the DB2 stage I am updating the key column of the table. But I am facing a deadlock issue on this table: SQLSTATE=40001, reason code "2", SQL-911. Can anyone tell me why this problem occurs? Nobody else is using my table, only me, but I still faced this problem after creating 3 different jobs. Please can anyone advise me about this error in DataStage? Veera B. H
ANUP KUMAR | May 16, 2008

Vincent, what are the ways/stages we can use to do sparse database lookups against a Teradata database? (I need to hit a query joining a couple of tables.)
sologubg | Jun 20, 2008

I read a seq file that has two types of records (different layouts); the first position of each record has RECORD_TYPE = '1' or '2'. The first (control) record has '1', all others (details) have '2'. I want to write just the first record (RECORD_TYPE = '1') to the output (hash) and stop reading after that, because the job produces warnings when it reads the detail records. What is the best solution?


I could not find any split processing to split that input file (given the different layouts for the 2 record types). Stage variables to stop the job after the first record, or after record_type > '1'? How do I do that?
Unknown User | Sep 23, 2008

Hi Vincent, I appreciate your work comparing sequential files, datasets and lookup filesets. Yes, I agree that lookup filesets are faster than the rest as they are indexed, but the problem comes with the size of the data. When we create a lookup fileset it gets created with the default partitioning method, i.e. entire, so when there is a very large amount of lookup data you should certainly use a hash partitioning method, which is going to improve the performance; otherwise a lookup fileset is a waste.
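The partitioning point can be illustrated with a toy hash partitioner in plain Python (the node count is illustrative, not a DataStage default). Under entire partitioning every node holds a full copy of the reference rows; under hash partitioning each key lands on exactly one node, so per-node memory shrinks as nodes are added:

```python
# Sketch of why hash partitioning matters for large reference data:
# with "entire" partitioning every node holds ALL reference rows, while
# hash partitioning sends each key to exactly one node. Node count and
# key volume here are illustrative only.
from collections import defaultdict

def hash_partition(keys, nodes):
    """Assign each key to a node by hashing, as a hash partitioner would."""
    parts = defaultdict(list)
    for key in keys:
        parts[hash(key) % nodes].append(key)
    return parts

ref_keys = [f"K{i}" for i in range(1_000)]
parts = hash_partition(ref_keys, nodes=4)

# Entire partitioning: each of 4 nodes stores all 1,000 keys (4,000 copies).
# Hash partitioning: the 4 nodes together store 1,000 keys, roughly 250 each.
print(sum(len(v) for v in parts.values()))  # → 1000
```

The catch, as the comment implies, is that the input rows must then be hash partitioned on the same key so each probe lands on the node holding its reference row.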
Amritendra kumar | Jan 16, 2009

I am facing a deadlock in DB2 while using DataStage. My source table is DB2 and I am rejecting records into the same table and updating the key columns there, but I am facing a deadlock problem. Can anybody advise how I can handle this situation? It's urgent!
Oct 10, 2009

Hi, I am new to DataStage. Can anyone suggest DataStage PDFs?


USER_1801390 | Nov 23, 2009

What is meant by a virtual data set in DataStage?

USER_2104857 | Jul 22, 2010

In DataStage, if we want to join two tables having bulk data, it may be better to use the funnel stage; it depends on what type of data you have. To use the funnel stage the 2 tables should have the same metadata. If the metadata is different in the two tables it is better to use the join stage or merge stage. Comparing the two stages: if you have a low volume it is advisable to use join, otherwise it is better to use the merge stage. The join stage will take more time than the merge stage, even though join, lookup and merge do similar jobs. My option for a high volume of data is the merge stage, because in merge we can have any number of input links and reject links. RAHUL MORAVINENI
navinkumar kurapati | Nov 3, 2010

Virtual dataset: in DataStage, data is passed between stages through links, while a dataset is an OS file, so the data residing in links is called a virtual dataset.
USER_2646209 | Mar 22

Please clarify what exactly the pools in the configuration file do.


Want to read more from Vincent McBurney? Check out the blog archive.
Category: DataStage
Keyword tags: DataStage, parallel jobs, lookup stage, join stage, lookup fileset


Copyright 1998-2012 Ziff Davis, Inc (Toolbox.com). All rights reserved. All product names are trademarks of their respective companies. Toolbox.com is not affiliated with or endorsed by any company listed at this site.
