
Datastage Interview Questions & Answers

1. Tell me about your current project?
Ans: Explain in detail.

2. What is the difference between OLTP and OLAP?
Ans: OLTP systems contain normalized data, whereas OLAP systems contain denormalized data. OLTP stores current data, whereas OLAP stores current and historical data for analysis. Short transactional queries (single-row reads and writes) are fast in OLTP. Analytical queries over large volumes of data run faster in OLAP, because denormalized data needs fewer joins.

3. What are the dimensions and facts you are loading?
Ans: Name some dimension tables and fact tables from your project.

4. How are these dimensions and facts connected?
Ans: Through the primary keys in the dimension tables and the matching foreign keys in the fact tables.

5. What is the use of having flag values and timestamp values in target tables?
Ans: Flag values and timestamp values are used to maintain history.

6. What is the difference between star schema and snowflake schema?
Ans: In a star schema, dimension tables contain denormalized data and fact tables contain normalized data, whereas in a snowflake schema both dimension and fact tables contain normalized data.

7. What is the use of partitioning and what are the types of partitioning?
Ans: Partitioning is needed to process huge amounts of data. With partitioning we can send the data to different nodes. Parallelism is of 2 types:
1) Pipeline parallelism: the ability of downstream stages to begin processing a row as soon as the upstream stage has finished processing that row.
2) Partition parallelism: the data is split across nodes. For example, with 100 records and a 4-node configuration file, each node processes 25 records.

8. What are link partitioner and link collector?
Ans: The link partitioner sends the data to different nodes and the link collector collects the data from those nodes.

9. How do you preserve partitioning?
Ans: By using SAME partitioning we can preserve partitioning.

10. What are the types of partitioning techniques?
Ans: Hash, entire, same, modulus, auto, etc.

11. If we use SAME partitioning in the first stage, which partitioning method will it take?
Ans: DataStage uses round robin when it partitions the data initially.

12. What is the use of modulus partitioning?
Ans: If the key column is an integer, we use modulus. Hash partitioning also works, but modulus performs better, because hash partitioning must first compute a hash code for each key before it can assign the record to a node, and that takes more time.

13. What are the types of transformers used in DataStage PX?
Ans: Transformers are of 2 types:
a. Basic Transformer
b. Parallel (normal) Transformer
Difference: A Basic Transformer compiles in BASIC, whereas a normal Transformer compiles in C++. A Basic Transformer does not run on multiple nodes, whereas a normal Transformer can run on multiple nodes, giving better performance. A Basic Transformer takes less time to compile than a normal Transformer.
Usage: A Basic Transformer should be used in server jobs. A normal Transformer should be used in parallel jobs, where it runs on multiple nodes and gives better performance.
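The modulus, hash and round robin methods from questions 7, 11 and 12 can be illustrated with a small Python sketch. This is only a model of the idea, not DataStage's own implementation: the 4-node count, the function names and the use of CRC32 as the hash are all assumptions made for the example.

```python
from collections import defaultdict
import zlib

NODES = 4  # assumption: a 4-node configuration file


def modulus_partition(key: int, nodes: int = NODES) -> int:
    """Modulus partitioning: node = integer key mod number of nodes."""
    return key % nodes


def hash_partition(key: str, nodes: int = NODES) -> int:
    """Hash partitioning: compute a hash code first, then take the remainder.

    CRC32 is a stand-in hash chosen for the sketch.
    """
    return zlib.crc32(key.encode("utf-8")) % nodes


def round_robin_partition(row_index: int, nodes: int = NODES) -> int:
    """Round robin: rows are dealt to the nodes in turn, ignoring key values."""
    return row_index % nodes


def distribute(keys, partitioner):
    """Group keys by the node number that the partitioner assigns."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partitioner(key)].append(key)
    return dict(buckets)


records = list(range(100))  # 100 records with integer keys 0..99
by_modulus = distribute(records, modulus_partition)
print({node: len(rows) for node, rows in sorted(by_modulus.items())})
# {0: 25, 1: 25, 2: 25, 3: 25}
```

Both key-based methods send equal keys to the same node, which is what joins, aggregations and duplicate removal rely on. Modulus skips the hashing step, which is the extra work the answer to question 12 refers to.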
14. What performance tuning have you done in your last project to speed up slow-running jobs?
Ans:
1. Use the Data Set stage instead of sequential files wherever possible.
2. Use the Join stage instead of the Lookup stage when the data is huge.
3. Use operator stages such as Remove Duplicates, Filter and Copy instead of the Transformer stage.
4. Sort the data before sending it to the Change Capture stage or the Remove Duplicates stage.
5. Hash partition and sort on the key column before an aggregate operation.
6. Filter unwanted records at the beginning of the job flow itself.

15. What is Peek Stage? When do you use it?
Ans: The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links. The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets, like the Head stage and the Tail stage. The Peek stage can be helpful for monitoring the progress of your application or for diagnosing a bug in your application.

16. What is row generator? When do you use it?
Ans: The Row Generator stage is a Development/Debug stage. It has no input links and a single output link. The Row Generator stage produces a set of mock data fitting the specified metadata. This is useful when we want to test a job but have no real data available to process.

17. What is RCP? How is it implemented?
Ans: DataStage is flexible about metadata. It can cope with the situation where the metadata isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the metadata when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). It can be enabled for a project via the DataStage Administrator, and set for individual links via the Outputs page Columns tab for most stages or the Outputs page General tab for Transformer stages. RCP is implemented through a schema file, a plain text file that contains a record (or row) definition. Make sure RCP is turned on when you rely on a schema file to define the columns.

18. What are Stage Variables, Derivations and Constraints?
Ans:
Stage Variable - An intermediate processing variable that retains its value during read and doesn't pass the value into a target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constraint - A condition that is either true or false and that controls the flow of data on a link.
The order of execution is Stage variables -> Constraints -> Derivations.

19. What is the significance of surrogate key in DataStage?
Ans: The surrogate key is mainly used in SCD type 2. For example, I have a table EMP and empno is the primary key. Whenever I try to load a second row with the same empno, the load fails with a unique (primary key) constraint error. For that reason we have the surrogate key concept in DataStage. The Surrogate Key Generator produces sequence numbers, and by using these surrogate key numbers we can uniquely identify each record in a table.
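The surrogate key idea from question 19 can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`assign_surrogate_keys`, the `emp_sk` column and the sample rows are hypothetical), not the behaviour of the Surrogate Key Generator stage itself.

```python
from itertools import count


def assign_surrogate_keys(rows, start=1):
    """Attach a sequential surrogate key (hypothetical column emp_sk) to each row."""
    next_key = count(start)
    return [{"emp_sk": next(next_key), **row} for row in rows]


# Two versions of employee 101: the natural key empno repeats,
# which a primary key defined on empno would reject.
history = [
    {"empno": 101, "city": "Pune"},
    {"empno": 102, "city": "Delhi"},
    {"empno": 101, "city": "Mumbai"},  # changed row kept as history
]

keyed = assign_surrogate_keys(history)
for row in keyed:
    print(row)
```

Each row now has its own `emp_sk`, so both versions of employee 101 can be stored while `empno` stays available as the business key.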

20. What is the difference between SCD type1, type2 and type3?
Ans:
Type1: Maintains only current data.
Type2: Maintains current data and full historical data.
Type3: Maintains current data and previous data.

21. How to implement SCD type2 in DataStage?
Ans: First we use the Change Capture stage. It compares the before dataset with the after dataset and gives the change codes for copy, insert, update and delete records. Then we generate the surrogate key by using the Surrogate Key Generator stage, and then use the Change Apply stage to apply the changes.

22. How to capture duplicate data in DataStage? (or) I have a file A with a column eno whose values are 1, 2, 3, 4, 1 and 2. I want 1, 2, 3, 4 (unique records) in one file and 1, 2 (duplicate records) in another file. How will you do that in DataStage?
Ans: By using the Create Key Change Column property in the Sort stage we can capture the duplicates. This property gives 1 to the first record of each key value and 0 to the repeated records. Then, by using a Filter or Transformer stage, we can send the unique records into one file and the duplicate records into another file.

23. I have a file A with a column eno whose values are 1, 2, 3, 4, 1 and 2. I want 3, 4 (completely unique records) in one file and 1, 2 (completely duplicate records) in another file. How will you do that in DataStage?
Ans: By using the Aggregator stage. We need to set aggregation type = count rows, output column name = out1 and group by = eno (the key column). Then, by using a Filter or Transformer stage, we can send the records with a count of 1 into one file and the records with a count greater than 1 into another file.

I have File A and File B. Both are customer files. File A has columns a, b and c. File B has columns b and c. I want an output File C with columns a, b and c. I have 10 records in File A and 5 records in File B. How do you concatenate these two files in DataStage?
Ans: File B has no column a, so we generate a dummy column named a in File B by using the Column Generator stage. Then, by using the Funnel stage, we can concatenate the 2 files. You will get 15 records in the output.

24. What is the difference between Normal Lookup and Sparse Lookup?
Ans: A normal lookup loads the reference data into memory, whereas a sparse lookup queries the database for each input row. If the reference table has less data than the primary table, it is better to go for a normal lookup. If the reference table has far more data than the primary table, it is better to go for a sparse lookup.
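The two duplicate-handling answers in questions 22 and 23 are easy to confuse, so here is a small Python sketch of the logic behind each. It only imitates what the Sort stage key change column and the Aggregator row count produce; the function names are hypothetical.

```python
from collections import Counter


def key_change_flags(values):
    """Sort the values and flag the first row of each key group with 1, the rest with 0."""
    flagged = []
    previous = object()  # sentinel that never equals a real key
    for value in sorted(values):
        flagged.append((value, 1 if value != previous else 0))
        previous = value
    return flagged


def split_by_key_change(values):
    """Question 22: first occurrences go to one output, repeats to another."""
    flagged = key_change_flags(values)
    unique = [value for value, flag in flagged if flag == 1]
    duplicates = [value for value, flag in flagged if flag == 0]
    return unique, duplicates


def split_by_count(values):
    """Question 23: keys seen exactly once versus keys seen more than once."""
    counts = Counter(values)
    only_once = sorted(key for key, n in counts.items() if n == 1)
    repeated = sorted(key for key, n in counts.items() if n > 1)
    return only_once, repeated


eno = [1, 2, 3, 4, 1, 2]
print(split_by_key_change(eno))  # ([1, 2, 3, 4], [1, 2])
print(split_by_count(eno))       # ([3, 4], [1, 2])
```

The key change flag depends on sorted input, which is why the property lives in the Sort stage. The count-based split needs no ordering, only grouping on the key.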

25. What is meant by Junk dimension?
Ans: Junk dimensions are dimensions that contain miscellaneous data (like flags and indicators) that do not fit in any base dimension table and are therefore stored in a separate table.
Example: If you have fields like flags or indicators that repeat in each fact record, you may create a separate table to hold all possible flags and indicators and keep a reference to it in the fact table.

26. What is meant by Degenerated dimension?
Ans: A degenerate dimension is data that is dimensional in nature but stored in a fact table.
Example: Suppose you have a dimension with Order Number and Invoice Number fields that has a one-to-one relationship with the fact table. In that case you may prefer one table with a billion records instead of two tables with a billion records each. To save space, you would store these fields within the fact table itself instead of keeping them in a separate dimension table.

27. What type of data are you getting?
Ans: Customer data only.

28. I have created a dataset on a 4 node configuration file. How many files will be created in total? What are they?
Ans: A total of 5 files will be created: 1 descriptor file and 4 dataset files.

29. I have created a dataset on a 2 node configuration file. Can I use the same dataset on a 4 node configuration file?
Ans: Yes, we can do that. But the reverse is not possible: if you create a dataset on a 4 node configuration file and try to reuse the same dataset on a 2 node configuration file, the job will run without any error, but you will not get the expected data in the output.

30. What value would be listed in the dataset when the column value is "NULL"?
Ans: The dataset will show NULL when there is a null in the data.

Oracle Interview Questions & Answers

31. What is meant by Referential Integrity?
Ans: Referential integrity maintains the relationship between tables and keeps their data consistent: every foreign key value in a child table must match a primary key value in the parent table.

32. How do you connect to the Oracle server?
Ans: By creating a DSN (data source name), we can connect to the server.

33. What is the difference between Union and Union All?
Ans: Union sorts the combined set and removes duplicates, whereas Union All does not remove duplicates. Union All is faster than Union because Union's duplicate elimination requires a sorting operation, which takes time.

34. What is the difference between Delete and Truncate?
Ans:
a. Delete is a DML command, whereas Truncate is a DDL command.
b. We can write a WHERE clause in a Delete operation, whereas we cannot write a WHERE clause in a Truncate operation.
c. We can roll back the data after Delete, whereas we cannot roll back the data after Truncate.
d. Truncate is faster than Delete. A delete operation first stores the data in rollback space and then performs the delete, whereas a truncate operation removes the data directly.

35. What is the difference between a view and a materialized view?
Ans:
a. A view is a logical representation of data, whereas a materialized view is a physical duplicate representation of data.
b. A view doesn't hold data but points to the data, whereas a materialized view holds data.
c. Whenever you update a base table, the corresponding views reflect the change automatically, whereas a materialized view is refreshed on demand or at scheduled intervals.
d. The main purpose of a materialized view is to do calculations and display data from multiple tables using joins.

36. What is the difference between Where clause and Having clause?
Ans:
a. The Where clause can be used without a GROUP BY clause, whereas the Having clause is normally used together with GROUP BY.
b. The Where clause selects rows before grouping, whereas the Having clause selects groups after grouping.
c. The Where clause cannot contain aggregate functions, whereas the Having clause can contain aggregate functions.

37. What is the difference between In clause and Exists clause?
Ans: If the result of the subquery is small, it is better to use the In clause, whereas if the result of the subquery is huge, it is better to use the Exists clause.

38. What is the use of DROP option in the ALTER TABLE command?
Ans: The Drop option in the Alter Table command is used to drop columns or constraints specified on the table.
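The answers to questions 33 and 36 can be checked on a toy data set. The sketch below runs the queries through Python's built-in `sqlite3` module purely as a convenient stand-in for Oracle; the tables `a`, `b` and `emp` and their rows are invented for the example, and other Oracle behaviour (for instance rollback of Truncate) is not modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (eno INTEGER);
    CREATE TABLE b (eno INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
    CREATE TABLE emp (empno INTEGER, deptno INTEGER, sal INTEGER);
    INSERT INTO emp VALUES (1, 10, 1000), (2, 10, 3000), (3, 20, 500), (4, 20, 700);
    """
)

# Question 33: UNION removes duplicates, UNION ALL keeps them.
union_rows = conn.execute(
    "SELECT eno FROM a UNION SELECT eno FROM b ORDER BY eno"
).fetchall()
union_all_rows = conn.execute(
    "SELECT eno FROM a UNION ALL SELECT eno FROM b ORDER BY eno"
).fetchall()

# Question 36: WHERE filters rows before grouping ...
where_rows = conn.execute(
    "SELECT deptno, SUM(sal) FROM emp WHERE sal > 600 GROUP BY deptno ORDER BY deptno"
).fetchall()
# ... HAVING filters whole groups after the aggregate is computed.
having_rows = conn.execute(
    "SELECT deptno, SUM(sal) FROM emp GROUP BY deptno HAVING SUM(sal) > 2000 ORDER BY deptno"
).fetchall()

print(union_rows)      # [(1,), (2,), (3,), (4,)]
print(union_all_rows)  # [(1,), (2,), (2,), (3,), (3,), (4,)]
print(where_rows)      # [(10, 4000), (20, 700)]
print(having_rows)     # [(10, 4000)]
```

In the WHERE query the 500 salary is dropped before the sum, so department 20 still appears with 700. In the HAVING query department 20 is summed in full (1200) and then removed as a group.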

39. What is the use of CASCADE Constraints?
Ans: When this clause is used with the DROP command, a parent table can be dropped even when a child table exists.

40. How to find duplicate records in a table?
Ans:
Select empno, count(*) from EMP group by empno having count(*) > 1;

41. How to remove duplicates from a table?
Ans:
a) Delete from EMP where rowid not in (select min(rowid) from EMP group by empno);
b) Delete from EMP where empno in (select empno from EMP group by empno having count(*) > 1);
Note that a) keeps one row per empno, whereas b) deletes every row whose empno is duplicated, including the first copy.

42. How to retrieve the first 10 records from a table?
Ans:
Select * from (select empno, ename, sal, row_number() over (order by sal desc) as rn from EMP) where rn <= 10;
This returns the 10 highest-paid employees. For the first 10 rows in no particular order, use:
Select * from EMP where rownum <= 10;

43. How to find the 2nd highest salary from a table?
Ans:
Select max(sal) as high2 from EMP where sal < (Select max(sal) from EMP);

44. How to find the 5th highest salary from a table?
Ans:
Select min(sal) as high5 from (select distinct sal from EMP order by sal desc) where rownum <= 5;

45. How to find the nth salary from a table?
Ans:
Select distinct e1.sal from EMP e1 where &N = (Select count(distinct e2.sal) from EMP e2 where e1.sal <= e2.sal);

46. Tell me the syntax of the decode statement?
Ans: The decode function has the same functionality as an IF-THEN-ELSE statement.
Decode(expression, search, result [, search, result]... [, default])
Expression is the value to compare. Search is the value that is compared against expression. Result is the value returned if expression equals search. Default is returned when no search value matches.
Example:
Select ename, decode(empid, 1000, 'IBM', 2000, 'Microsoft', 3000, 'Capgemini', 'TCS') as result from emp;
The above decode statement is equivalent to the following IF-THEN-ELSE statement:
IF empid = 1000 THEN
    result := 'IBM';
ELSIF empid = 2000 THEN
    result := 'Microsoft';
ELSIF empid = 3000 THEN
    result := 'Capgemini';
ELSE
    result := 'TCS';
END IF;

Note: Also prepare the SQL queries in the SQL Question & Answers document. Prepare Unix commands as well (the difference between find and grep, and how to delete a dataset by using the orchadmin command). Prepare the questions below as well:
1. Merge statement (insert, delete and update in one SQL statement).
2. Reference table with 500,000 (5 lakh) records and primary table with 50,000 records: go for sparse lookup. If it is the other way round, go for normal lookup.
3. Unix command to run a DataStage job.
4. Normal table vs dimension table. I have 2 tables A and B: how do you classify which is a normal table and which is a dimension table? A normal table doesn't have hierarchies, but a dimension table does.
5. Normalization and denormalization.
6. Is normalization done on the basis of dimension or fact? (On the basis of attributes.)
7. Select empno, count(*) from emp group by empno having count(*) > 1
8. Decode function syntax.
9. Types of source systems, and issues you have faced in your project.
10. Unix command to find and replace a string: the sed command.
11. How to extract only duplicates in DataStage: Create Key Change Column in the Sort stage.
12. $? and $0: what do they do?
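The duplicate and nth-salary queries from questions 40, 41 and 45 can be tried on a small table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Oracle, which is an assumption of this example: SQLite also exposes a `rowid`, but the Oracle `&N` substitution variable is replaced by a bound parameter, and the sample rows and the helper `nth_highest_salary` are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE emp (empno INTEGER, ename TEXT, sal INTEGER);
    INSERT INTO emp VALUES
        (1, 'A', 5000), (2, 'B', 4000), (3, 'C', 4000),
        (4, 'D', 3000), (1, 'A', 5000), (5, 'E', 2000);
    """
)

# Question 40: empno values that occur more than once.
duplicates = conn.execute(
    "SELECT empno, COUNT(*) FROM emp GROUP BY empno HAVING COUNT(*) > 1"
).fetchall()


def nth_highest_salary(n):
    """Question 45: a salary is nth highest when n distinct salaries are >= it."""
    row = conn.execute(
        """
        SELECT DISTINCT e1.sal FROM emp e1
        WHERE ? = (SELECT COUNT(DISTINCT e2.sal) FROM emp e2 WHERE e1.sal <= e2.sal)
        """,
        (n,),
    ).fetchone()
    return row[0] if row else None


second = nth_highest_salary(2)
fifth = nth_highest_salary(5)  # only 4 distinct salaries exist

# Question 41 (a): keep the row with the smallest rowid for each empno.
conn.execute(
    "DELETE FROM emp WHERE rowid NOT IN (SELECT MIN(rowid) FROM emp GROUP BY empno)"
)
remaining = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]

print(duplicates)  # [(1, 2)]
print(second)      # 4000
print(fifth)       # None
print(remaining)   # 5
```

The comparison inside the correlated subquery has to be `<=` rather than `=`: with `=` the count is always 1, so only N = 1 would ever return rows.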
