
TERADATA TUNING GUIDELINE DOCUMENT


Author(s):
Anusha Srichander (anusha.srichander@wipro.com)
Sri Sakthi Shalini A (sre.anbalagan@wipro.com)
Savitha Ramasami (Savitha.ramasami@wipro.com)

Reviewed by:
Padmapriya Mohankumar (padmapriya.mohankumar@wipro.com)

Account: BTS ES DWH- MFG


INTRODUCTION

TERADATA QUERY OPTIMIZATION:
Optimization is the technique of selecting the least expensive (fastest) plan for a query to fetch its results. The optimizer considers the possible query plans for a given input query and attempts to determine which of those plans will be the most efficient. Cost-based query optimizers assign an estimated "cost" to each possible query plan and choose the plan with the smallest cost. The cost estimates the runtime expense of evaluating the query in terms of the number of I/O operations required, the CPU requirements, and other factors determined from the data dictionary.

Optimization depends directly on the availability of:
1. CPU resources
2. System resources such as AMPs and PEs

Teradata performance tuning is the technique of improving a process so that a query performs faster with minimal use of CPU resources. The factors to focus on in optimization are:
(i) Skew factor
(ii) CPU consumption
(iii) Spool space

SQL OPTIMIZATION TIPS:
The following points should be taken into consideration when tuning Teradata SQL.

( 1 ) STATISTICS
Collecting statistics is one of the primary steps in Teradata query optimization and is essential for the optimal performance of the Teradata query optimizer. The query optimizer relies on statistics to help it determine the best way to access data.


Statistics also help the optimizer ascertain how many rows exist in the tables being queried and predict how many rows will qualify for given conditions. Missing or outdated statistics might result in the optimizer choosing a less-than-optimal method for accessing data tables. Statistics also help Teradata determine the spool file size needed to contain the resulting data; accurate statistics can make the difference between a successful query and one that runs out of spool space.

Syntax:
To check the statistics defined for a table:
HELP STATS table_name;
To collect or refresh statistics:
COLLECT STATS ON table_name [INDEX | COLUMN] (col_name, ..., col_name);

DIAGNOSTIC STATEMENT:
DIAGNOSTIC HELPSTATS ON FOR SESSION;
This statement can be used to determine the statistics that might be required to improve the performance of the SQL. An EXPLAIN of the query needs to be run after the above statement to see the statistics suggestions.

The optimizer reports one of the following confidence levels based on the available statistics:
1) No Confidence - no statistics are defined for the table.
2) Low Confidence - statistics exist but are difficult to use precisely.
3) High Confidence - the optimizer is sure of its estimates based on the available statistics.

Statistics need to be collected for:
1. All non-unique indexes
2. The UPI of small tables (tables with fewer than x rows per AMP, where x depends on the number of available AMPs)
3. All indexes of a join index
4. Any column used in joins
5. Any column used in a WHERE clause
6. Indexes of global temporary tables

Statistics cannot be collected on:
1. Volatile tables
2. LOB columns
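As a worked sketch of the statements above, assuming a hypothetical EMPLOYEE table with EMP_ID, EMP_NAME and DEPT_ID columns (the table, its columns and its index are illustrative, not part of the original document):

DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT EMP_ID, EMP_NAME
FROM EMPLOYEE
WHERE DEPT_ID = '2087';                         -- with DIAGNOSTIC HELPSTATS on, the plan ends with a list of suggested statistics

COLLECT STATISTICS ON EMPLOYEE COLUMN (DEPT_ID); -- column referenced in the WHERE clause
COLLECT STATISTICS ON EMPLOYEE INDEX (EMP_ID);   -- index on the table, assuming EMP_ID is indexed

HELP STATS EMPLOYEE;                             -- confirm which statistics have been collected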


Collected statistics are not automatically updated by the system; refresh statistics when 5-10% of the table rows have changed. Always collect statistics at the column level even when collecting on an index, because indexes can be dropped at any time and are often dropped and recreated.

When to collect statistics - after the following:
1. FastLoads
2. MultiLoads
3. Non-utility loads (TPump/BTEQ/ODBC/JDBC) - collect statistics after a significant percentage of data values have changed.
4. Deletes/purges - collect statistics after a significant percentage of data has been removed from the table.
5. Recovery - if a table is lost and then recovered from an archive, recollect statistics.
6. Reconfiguration - recollect all statistics after a system reconfiguration.

( 2 ) USAGE OF DISTINCT KEYWORD
In Teradata it is generally better to use GROUP BY instead of DISTINCT, as both keywords produce the same result. DISTINCT performs better for columns with a low number of rows per value (rows per value < number of AMPs). GROUP BY performs better for columns with a large number of rows per value (rows per value > number of AMPs).

( 3 ) USAGE OF IN CLAUSE
While there may be no theoretical limit to the number of items in an IN clause, you may see performance degradation once the list exceeds a few hundred items. There is an optimizer tweak in place that tries to build a spool file when a large IN list is encountered. When the number of values exceeds an acceptable limit, the hard-coded values can be inserted into a volatile table or a global temporary table (with statistics collected on the global temporary table), and the IN clause can be converted into an equi-join.


Example:
SELECT MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC, COUNT(*)
FROM EMPLOYEE MD
   , DEPARTMENT VN
WHERE MD.DEPT_ID = VN.DEPT_ID
  AND VN.LOC = 'USA'
  AND MD.DEPT_ID BETWEEN '501' AND '9450'
  AND MD.DEPT_ID IN ('2087','309','4009','123','5743','456','0987','2643','7545','8655','9737','8654','5744','67894','0444','755','78446','96422','59426','8365','7072','639620','6497','7220','7294','8937','5978','7894','2497','2864','0742')
GROUP BY MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC
HAVING COUNT(MD.EMP_ID) >= 100
ORDER BY 2;

CREATE MULTISET VOLATILE TABLE EMP_DEPT AS
(SELECT DEPT_ID FROM DEPARTMENT)
WITH NO DATA
ON COMMIT PRESERVE ROWS;

After inserting all of the hard-coded values above into the volatile table, the query can be modified as below:

SELECT MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC, COUNT(*)
FROM EMPLOYEE MD
   , DEPARTMENT VN
   , EMP_DEPT TEMP
WHERE MD.DEPT_ID = VN.DEPT_ID
  AND VN.LOC = 'USA'
  AND MD.DEPT_ID = TEMP.DEPT_ID
  AND MD.DEPT_ID BETWEEN '501' AND '9450'
GROUP BY MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC
HAVING COUNT(MD.EMP_ID) >= 100
ORDER BY 2;

( 4 ) USAGE OF MANIPULATED COLUMNS IN WHERE CLAUSE
When manipulated columns (columns wrapped in functions or expressions) are used in the WHERE clause or in JOIN conditions, the optimizer will not be able to use the statistics collected on those columns.

( 5 ) DATATYPE MISMATCH
Try to avoid data transformation during a join. Columns with the same data type should be used in the join condition; otherwise one of the compared fields will undergo conversion before the join happens. A sketch of this situation is shown below.
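A minimal sketch, assuming hypothetical ORDERS and CUSTOMER tables (not from the original document) where the customer identifier is stored as INTEGER in one table and VARCHAR in the other:

-- Mismatched types: the VARCHAR column is implicitly converted row by row,
-- so statistics and index access paths on it cannot be used efficiently.
SELECT O.ORDER_ID, C.CUST_NAME
FROM ORDERS O
JOIN CUSTOMER C
  ON O.CUST_ID = C.CUST_CD;                    -- INTEGER = VARCHAR forces a conversion

-- Making the conversion explicit at least shows where the cost is paid,
-- but the real fix is to define both columns with the same data type.
SELECT O.ORDER_ID, C.CUST_NAME
FROM ORDERS O
JOIN CUSTOMER C
  ON O.CUST_ID = CAST(C.CUST_CD AS INTEGER);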


( 6 ) CHARACTER SET
While joining two tables, make sure that both join columns use the same character set (for example, LATIN vs UNICODE). Otherwise an implicit conversion of one to the other takes place, resulting in poor performance.

( 7 ) DATE COMPARISON
When comparing date values over a range, the query may result in a product join. This can be reduced by using the CALENDAR view in SYS_CALENDAR, Teradata's built-in calendar database.

Example:
INSERT INTO table_a
SELECT t2.a1, t2.a2, t2.a3, t2.a4
FROM table_2 t2
JOIN table_3 t3
  ON t2.a1 = t3.a1
 AND t2.a5_dt >= t3.a4_dt
 AND t2.a5_dt <= t3.a5_dt;

The above query can be rewritten with SYS_CALENDAR to reduce (but not completely eliminate) the product join:

INSERT INTO table_a
SELECT t2.a1, t2.a2, t2.a3, t2.a4
FROM table_2 t2
JOIN SYS_CALENDAR.CALENDAR sys_cal
  ON sys_cal.calendar_date = t2.a5_dt
JOIN table_3 t3
  ON t2.a1 = t3.a1
 AND sys_cal.calendar_date >= t3.a4_dt
 AND sys_cal.calendar_date <= t3.a5_dt;

( 8 ) PROPER USAGE OF ALIAS AND TABLE NAMES


Example:
INSERT INTO table_a
SELECT t2.a1, t2.a2, t2.a3, t2.a4
FROM table_2 t2
JOIN table_3 t3
  ON t2.a1 = t3.a1
JOIN table_4 t4
  ON table_3.a1 = t4.a1;

Because table_3 is referenced both through its alias (t3) and through its full table name (table_3), Teradata treats these as two separate instances of the table and scans table_3 twice. Even when the tables involved hold only a few rows, this may result in high CPU usage; if either table is very large, it may lead to a spool space error. The corrected query, using the alias consistently, is shown below.
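A corrected version of the example above, assuming the intent was to join table_4 to table_3 on a1 (the table and column names are illustrative, as in the original):

INSERT INTO table_a
SELECT t2.a1, t2.a2, t2.a3, t2.a4
FROM table_2 t2
JOIN table_3 t3
  ON t2.a1 = t3.a1
JOIN table_4 t4
  ON t3.a1 = t4.a1;      -- alias t3 used consistently, so table_3 is scanned only once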

( 9 ) MISSING JOINS
Example: Consider three tables with the following columns:
tab1: empno, ename
tab2: deptno, ename
tab3: dname

SELECT tab1.empno, tab2.deptno
FROM tab1, tab2, tab3
WHERE tab1.ename = tab2.ename;

Here tab3 appears in the FROM clause but has no join condition, so it is joined as a Cartesian product. Even if the tables involved are small, this product join may consume high CPU; if any of the tables is very large, it may lead to a spool space error. A corrected version is sketched below.
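One way to fix the example above, assuming tab3 is not actually needed by the query (nothing is selected from it and it has no join condition); if tab3 is needed, it must instead be given its own join condition:

SELECT tab1.empno, tab2.deptno
FROM tab1
JOIN tab2
  ON tab1.ename = tab2.ename;      -- tab3 removed, so no Cartesian product is produced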

( 10 ) UNNECESSARY JOINS
Try to avoid unnecessary joins, especially LEFT OUTER joins to tables from which no data is being selected and on which no filter is applied.


Example:
SELECT E.EMP_ID,
       E.EMP_NAME,
       D.DEPT_NAME,
       D.DEPT_LOC
FROM EMPLOYEE E
JOIN DEPT D
  ON E.DEPT_ID = D.DEPT_ID
LEFT JOIN PAY_ROLL P
  ON E.EMP_ID = P.EMP_ID
WHERE E.JOIN_DT >= DATE '2009-10-12';

Since no columns are selected from PAY_ROLL and no filter references it, the LEFT JOIN can be dropped. The above query can be modified as:

SELECT E.EMP_ID,
       E.EMP_NAME,
       D.DEPT_NAME,
       D.DEPT_LOC
FROM EMPLOYEE E
JOIN DEPT D
  ON E.DEPT_ID = D.DEPT_ID
WHERE E.JOIN_DT >= DATE '2009-10-12';

( 11 ) PROPER SELECTION OF PI FOR A TABLE
The Primary Index (PI) is the sole mechanism by which data is distributed over the AMPs, so the PI should be chosen on a column with the most unique values. When a table is created without an explicit index definition, Teradata assumes the first column to be the PI by default. The query below can be used to see how the rows of a table are distributed across the AMPs of the system:

SELECT HASHAMP(HASHBUCKET(HASHROW(primary_index_columns))) AS "AMP",
       COUNT(*)
FROM your_table
GROUP BY 1
ORDER BY 2 DESC;


When a table does not have any column with sufficiently unique values, an identity column may help.

IDENTITY COLUMNS:
Example of an identity column definition:

unq_pk INTEGER GENERATED ALWAYS AS IDENTITY
       (START WITH 1 INCREMENT BY 1 MINVALUE -2147483647 MAXVALUE 100000000 CYCLE),

Here "unq_pk" is the column name and "INTEGER" is its data type. The values are generated automatically by the system whenever rows are inserted into the table holding the identity column, though not necessarily in strict sequential order. Using such a column as the PI distributes the data almost evenly across the available AMPs, which reduces skewing of data onto a single AMP. A complete table definition is sketched below.

RESTRICTIONS ON IDENTITY COLUMNS
An identity column need not be the PI, and it cannot be part of a composite (multi-column) PI or SI.
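A minimal sketch, assuming a hypothetical transaction table with no naturally unique column (the table and non-identity columns are illustrative):

CREATE MULTISET TABLE TXN_LOG
(
  unq_pk   INTEGER GENERATED ALWAYS AS IDENTITY
           (START WITH 1 INCREMENT BY 1
            MINVALUE -2147483647 MAXVALUE 100000000 CYCLE),
  txn_dt   DATE,
  txn_amt  DECIMAL(18,2)
)
PRIMARY INDEX (unq_pk);      -- rows hash on the system-generated value, spreading them evenly over the AMPs

-- With GENERATED ALWAYS, the identity value is never supplied by the load:
INSERT INTO TXN_LOG (txn_dt, txn_amt)
VALUES (DATE '2009-10-12', 100.00);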

The PI column:
1. Should contain the maximum number of unique values.
2. Should be unchangeable (rarely updated).
3. Should be defined explicitly in the CREATE TABLE statement.

( 12 ) SPLITTING THE QUERY


Reduce the number of joins in a query. If many joins are needed, split the query into two parts by moving a few of the joins into a volatile table, making sure the dense (large) tables are filtered first so that unnecessary redistribution is avoided. Another technique is to materialize all the INNER JOINs into a volatile table and then join that result with the remaining LEFT OUTER joins, as sketched below.
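A minimal sketch of the second technique, reusing the hypothetical EMPLOYEE, DEPT and PAY_ROLL tables from the earlier examples (PAY_AMT is an assumed column):

-- Step 1: materialize the inner joins, with their filters, into a volatile table.
CREATE MULTISET VOLATILE TABLE EMP_DEPT_VT AS
(
  SELECT E.EMP_ID, E.EMP_NAME, D.DEPT_NAME, D.DEPT_LOC
  FROM EMPLOYEE E
  JOIN DEPT D
    ON E.DEPT_ID = D.DEPT_ID
  WHERE E.JOIN_DT >= DATE '2009-10-12'
)
WITH DATA
PRIMARY INDEX (EMP_ID)
ON COMMIT PRESERVE ROWS;

-- Step 2: join the smaller intermediate result with the remaining LEFT OUTER join.
SELECT V.EMP_ID, V.EMP_NAME, V.DEPT_NAME, P.PAY_AMT
FROM EMP_DEPT_VT V
LEFT JOIN PAY_ROLL P
  ON V.EMP_ID = P.EMP_ID;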

( 13 ) CONCATENATION OF THE SQL
Concatenation allows you to retrieve data correlated to a MIN/MAX function in a single pass. This is a special application of concatenation that removes the need for a correlated subquery.

Example: To find the employee with the highest salary in each department, the query might be:

SELECT Dept_No, Salary, Last_Name, Fname
FROM Employee_Table
WHERE (Dept_No, Salary) IN
      (SELECT Dept_No, MAX(Salary)
       FROM Employee_Table
       GROUP BY Dept_No)
ORDER BY Dept_No;

The above query could be rewritten as:

SELECT Dept_No,
       MAX(Salary || ' ' || Last_Name || ', ' || Fname)
FROM Employee_Table
GROUP BY Dept_No
ORDER BY Dept_No;

Note: If two or more employees share the same maximum salary, this query selects only one employee per department.

( 14 ) USAGE OF NESTED VIEWS:
Teradata has no materialized views, so it is better to avoid creating views by nesting many other views together. If the logic is really required, a table can be constructed and data loaded into it incrementally (for a particular date or fiscal period), as sketched below.
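A minimal sketch of the incremental-load alternative, assuming a hypothetical SALES_SUMMARY target and DAILY_SALES source (all names and columns are illustrative):

-- Loaded once per day instead of being recomputed through layers of nested views.
INSERT INTO SALES_SUMMARY (SALE_DT, DEPT_ID, TOTAL_AMT)
SELECT SALE_DT, DEPT_ID, SUM(SALE_AMT)
FROM DAILY_SALES
WHERE SALE_DT = CURRENT_DATE - 1      -- only the latest date/fiscal period is processed
GROUP BY SALE_DT, DEPT_ID;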

( 15 ) USAGE OF DERIVED TABLES:



Before using a derived table in a query, make sure that it returns a minimal number of rows. If the same derived query has to be used more than once, it is better to populate those values into a volatile table and join to it as needed.

( 16 ) USAGE OF PI AND NON-PI JOINS:
If a query joins a PI column to a non-PI column, the table on the non-PI side is likely to contain duplicate values relative to the PI column, so pre-aggregating the non-PI side can reduce the rows carried into the join.

Example:
SELECT E.DEPT_ID, D.DEPT_LOC
FROM EMP E, DEPT D
WHERE E.DEPT_ID = D.DEPT_ID;

The above query can be rewritten as:

SELECT E.DEPT_ID, D.DEPT_LOC
FROM DEPT D,
     (SELECT DEPT_ID FROM EMP GROUP BY 1) E
WHERE E.DEPT_ID = D.DEPT_ID;

( 17 ) USAGE OF SET AND MULTISET TABLE:
A MULTISET table accepts duplicate records, whereas a SET table does not. A SET table with no unique indexes forces Teradata to check for duplicate rows every time a row is inserted or updated, which can add significant overhead to such operations. It is better to use MULTISET after ensuring that the records being loaded into the target are always unique. A sketch of the two definitions follows.
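A minimal sketch contrasting the two table kinds, using a hypothetical CUSTOMER layout (names and columns are illustrative):

-- SET table with only a non-unique PI: Teradata performs a duplicate-row check
-- on every insert and update, which adds overhead.
CREATE SET TABLE CUSTOMER_SET
(
  CUST_ID   INTEGER,
  CUST_NAME VARCHAR(100)
)
PRIMARY INDEX (CUST_ID);

-- MULTISET table: duplicate rows are accepted and no duplicate-row check is done,
-- so loads are cheaper when the loading process already guarantees uniqueness.
CREATE MULTISET TABLE CUSTOMER_MS
(
  CUST_ID   INTEGER,
  CUST_NAME VARCHAR(100)
)
PRIMARY INDEX (CUST_ID);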

CONCLUSION:



This document serves as a set of tips and guidelines for tuning Teradata SQL.

