Rolf Hanusa
An outer join is defined in pieces: it is the UNION ALL of several sets of rows.
Which pieces are combined depends on the type of outer join:
• Piece 1: The inner join result of the two tables as described by the full ON
clause, with all conditions applied
• Piece 2: All rows from the left table not included in Piece 1, extended with NULL
values for each column of the right table
• Piece 3: All rows from the right table not included in Piece 1, extended with NULL
values for each column of the left table
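For each type of outer join (left, right, full), just put the proper "pieces" together using
UNION ALL: a left outer join is Pieces 1 and 2, a right outer join is Pieces 1 and 3, and a
full outer join is all three.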
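To make the "pieces" definition concrete, here is a minimal sketch that builds a left outer join by hand from Piece 1 and Piece 2 and compares it with the native LEFT OUTER JOIN. It runs against an in-memory SQLite database rather than Teradata, and the tables left_t and right_t and their contents are hypothetical, made up purely for illustration.

```python
import sqlite3

# Hypothetical toy tables; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_t  (id INTEGER, l_val TEXT);
    CREATE TABLE right_t (id INTEGER, r_val TEXT);
    INSERT INTO left_t  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO right_t VALUES (2, 'x'), (3, 'y'), (4, 'z');
""")

# Piece 1: the inner join, with the full ON clause applied.
piece1 = """
    SELECT l.id, l.l_val, r.r_val
    FROM left_t l INNER JOIN right_t r ON l.id = r.id
"""

# Piece 2: left rows not in Piece 1, extended with NULL for the right column.
piece2 = """
    SELECT l.id, l.l_val, NULL
    FROM left_t l
    WHERE NOT EXISTS (SELECT 1 FROM right_t r WHERE l.id = r.id)
"""

# Left outer join assembled by hand: Piece 1 UNION ALL Piece 2.
built_from_pieces = sorted(
    conn.execute(piece1 + " UNION ALL " + piece2).fetchall(),
    key=lambda row: row[0],
)

# The same join written directly.
native_left_join = sorted(
    conn.execute("""
        SELECT l.id, l.l_val, r.r_val
        FROM left_t l LEFT OUTER JOIN right_t r ON l.id = r.id
    """).fetchall(),
    key=lambda row: row[0],
)

print(built_from_pieces == native_left_join)
```

In this toy data the row with id 1 has no partner in right_t, so it arrives through Piece 2 with a NULL (None in Python) in the right-hand column.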
One or more join conditions, also called "connecting terms," are required in the ON
clause for each relation in an outer join. These join conditions are used to define the rows
in the outer table that take part in the match to the inner table.
I recommend that you use only join conditions in ON clauses. However, when a
search condition (used for row selection) is required on the inner table, it should be put in
the ON clause as well. A search condition in the ON clause of the inner table will not
limit the number of rows in the answer set. It only defines the rows eligible to take part in
the match to the outer table.
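The effect of a search condition in the ON clause can be sketched on a small scale. The example below uses an in-memory SQLite database, not Teradata, and the tables outer_t and inner_t, the column names, and the data are all assumptions made for illustration. The point to watch is that every outer row survives; the search condition only decides which inner rows are allowed to match.

```python
import sqlite3

# Hypothetical toy tables; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outer_t (id INTEGER);
    CREATE TABLE inner_t (id INTEGER, period INTEGER, amount INTEGER);
    INSERT INTO outer_t VALUES (1), (2), (3);
    INSERT INTO inner_t VALUES (1, 199706, 10), (1, 199707, 20), (2, 199706, 30);
""")

rows = conn.execute("""
    SELECT o.id, i.amount
    FROM outer_t o
    LEFT OUTER JOIN inner_t i
      ON o.id = i.id            -- join condition
     AND i.period = 199707      -- search condition on the inner table
    ORDER BY o.id
""").fetchall()

outer_count = conn.execute("SELECT COUNT(*) FROM outer_t").fetchone()[0]

# The answer set still has one row per outer row in this data.
print(len(rows), outer_count)
for row in rows:
    print(row)
```

Row 2 has an inner row, but not for period 199707, so it is treated exactly like row 3, which has no inner row at all: both come back with a NULL amount.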
An outer join can also include a WHERE clause; however, the results you get when
you do include it may be surprising--or at least not obvious. This will be explained in
more detail later in the article. To limit the number of qualifying rows in the outer table
(and therefore the answer set), the search condition for the outer table must be in the
WHERE clause. Note: The WHERE clause is applied only after the outer join has been
produced.
Here's a little-known (or less understood) outer join rule: If a search condition on the
inner table is placed in the WHERE clause, the JOIN is logically equivalent to an INNER
JOIN, even if you code OUTER JOIN in the query. Read on to see how this can impact
your results.
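A small sketch of this rule, again using an in-memory SQLite database with hypothetical tables (outer_t, inner_t) and made-up data rather than Teradata. The search condition here is an equality test, which can never be true for the NULLs that the outer join supplies, so the WHERE clause discards every nonmatching row and the result is the same as that of an inner join.

```python
import sqlite3

# Hypothetical toy tables; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outer_t (id INTEGER);
    CREATE TABLE inner_t (id INTEGER, period INTEGER, amount INTEGER);
    INSERT INTO outer_t VALUES (1), (2), (3);
    INSERT INTO inner_t VALUES (1, 199706, 10), (1, 199707, 20), (2, 199706, 30);
""")

# Coded as an outer join, but the inner-table condition sits in WHERE.
outer_with_where = conn.execute("""
    SELECT o.id, i.amount
    FROM outer_t o LEFT OUTER JOIN inner_t i ON o.id = i.id
    WHERE i.period = 199707
    ORDER BY o.id
""").fetchall()

# The same query coded as an inner join.
inner = conn.execute("""
    SELECT o.id, i.amount
    FROM outer_t o INNER JOIN inner_t i ON o.id = i.id
    WHERE i.period = 199707
    ORDER BY o.id
""").fetchall()

print(outer_with_where == inner)
print(outer_with_where)
```

Rows 2 and 3 of outer_t are gone from both answer sets, even though the first query asks for an outer join.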
These rules are not strange concepts unique to Teradata. This is a fully SQL-92-
compliant implementation (for better or worse). Teradata's optimizer does, however, take
advantage of these concepts in processing these queries. Instead of executing the outer
join just as it is defined, the optimizer rewrites the query to roll the whole, complex
process into a single step, as well as to eliminate outer joins that really aren't.
The following examples represent actual cases that I have encountered as a DBA.
Although I've changed them slightly to avoid any conflict of interest, the basic syntax and
counts remain accurate. Since Teradata EXPLAINs may be new to some readers, they
have been altered slightly for clarity (that is, aliases were replaced with database names,
and so forth).
Before writing a query, it is important to understand the business question that it is
supposed to answer. Here is a simple explanation of the business question we are trying
to answer in the remainder of this article:
We want to know all the customers (using table CUSTOMER, which contains over
18 million rows):
• Who reside in the DISTRICT of K,
And:
• Who have a SERVICE_TYPE of ABC or XYZ,
And: Their monthly revenue (using table REVENUE, which contains over 234
million rows) for the month of July 1997 (199707)
• Using DATA_DATE = 199707,
And (here's where the outer join comes in):
If the customer revenue is unknown (that is, if no revenue records are found), we
want to keep the customer record with a NULL for MONTHLY_REVENUE.
Sounds simple enough, doesn't it? I thought so too until I started analyzing my
original answer sets and found them to be incorrect and, in some cases, very surprising.
In fact, until I researched several coding alternatives and repeatedly questioned one of
NCR's developers (who now probably uses caller ID to screen my calls), I was convinced
that Teradata's optimizer had lost its mind. It hadn't, but I almost did. You'll see what I
mean as we go through the following examples and analyze the results.
The first example (see Listing 1) is a single table select, which provides the base of
customer records that we want. The second example (see Listing 2) is an inner join that
will help explain the remaining queries and results. It starts with the same base of
customer records but matches them with revenue records for a particular month. Note that
all customer records found a matching revenue record.
Listing 1:
SELECT C.CUSTNUM
FROM SAMPDB.CUSTOMER C
WHERE C.DISTRICT = 'K'
AND (C.SERVICE_TYPE = 'ABC'
OR C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;
In Listing 3, an outer join is requested, but if we apply the rules stated above, we end up
with a surprising result. Although we are asking for a LEFT OUTER JOIN, it is in fact
treated as an inner join. Because all the selection criteria are in the WHERE clause, they
are logically applied only after the outer join processing has been completed. This means
that Listings 2 and 3 are logically equivalent and will provide the same result. It is important
to note that Teradata recognizes that this query is the same as an inner join and executes
it as such (see EXPLAIN Exq3). Therefore, it executes with the speed of an inner join.
Note: For those of you who are unfamiliar with a Teradata EXPLAIN, it is a textual
description of the processing steps that the Teradata Optimizer will use to execute an
SQL query.
EXPLAIN Exq3:
EXPLAIN Exq5 (As you can see, this EXPLAIN output is identical to EXPLAIN
Exq3 and, as expected, so is the answer set.):
1. First, we lock SAMPDB.CUSTOMER for access, and we lock
SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a
RowHash match scan with a condition of ("(SAMPDB.T1.DISTRICT = 'K') and
((SAMPDB.T1.SERVICE_TYPE = 'ABC') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ'))"),
which is joined to SAMPDB.CUSTOMER with a condition of
("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and
SAMPDB2.REVENUE are joined using a merge join, with a join condition of
("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table
SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1,
which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key
in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The estimated time
for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The
total estimated time is 0 hours and 6 minutes and 2 seconds.
Finally, we have the correct answer. This example (see Listing 6) is an outer join
whose answer set answers the original business question.
This query returns 18,034 rows, of which 13,010 have non-NULL values for
MONTHLY_REVENUE; the remaining 5,024 rows carry a NULL.
In this query, the left (outer) table is limited by the search conditions in the WHERE
clause, and the search condition in the ON clause for the right (inner) table defines the
NULL-able nonmatching rows. This EXPLAIN confirms that this is in fact an outer join
(see EXPLAIN Exq6).
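The placement just described can be sketched as follows. This is not the original Listing 6; it is a small reconstruction of the pattern from the business question, run against an in-memory SQLite database instead of Teradata. The table and column names (CUSTOMER, REVENUE, CUSTNUM, DISTRICT, SERVICE_TYPE, DATA_DATE, MONTHLY_REVENUE) come from the article, but the database qualifiers are dropped, and the column types and the handful of sample rows are assumptions.

```python
import sqlite3

# Toy versions of the article's tables; types and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMER (CUSTNUM INTEGER, DISTRICT TEXT, SERVICE_TYPE TEXT);
    CREATE TABLE REVENUE  (CUSTNUM INTEGER, DATA_DATE INTEGER,
                           MONTHLY_REVENUE REAL);
    INSERT INTO CUSTOMER VALUES
        (1, 'K', 'ABC'), (2, 'K', 'XYZ'), (3, 'K', 'DEF'), (4, 'M', 'ABC');
    INSERT INTO REVENUE VALUES
        (1, 199707, 50.0), (2, 199706, 75.0), (4, 199707, 20.0);
""")

rows = conn.execute("""
    SELECT C.CUSTNUM, R.MONTHLY_REVENUE
    FROM CUSTOMER C
    LEFT OUTER JOIN REVENUE R
      ON C.CUSTNUM = R.CUSTNUM          -- join condition
     AND R.DATA_DATE = 199707           -- search condition, inner table
    WHERE C.DISTRICT = 'K'              -- search conditions, outer table
      AND (C.SERVICE_TYPE = 'ABC' OR C.SERVICE_TYPE = 'XYZ')
    ORDER BY 1
""").fetchall()

for row in rows:
    print(row)
```

Customers 3 and 4 are removed by the WHERE clause because they fail the outer-table conditions. Customer 2 qualifies but has revenue only for 199706, so that customer is kept with a NULL for MONTHLY_REVENUE, which is what the business question asks for.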
EXPLAIN Exq6:
1. First, we lock SAMPDB.CUSTOMER for access, and we lock
SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a
RowHash match scan with a condition of ("((SAMPDB.T1.SERVICE_TYPE = 'ABC')
or (SAMPDB.T1.SERVICE_TYPE = 'XYZ')) and (SAMPDB.T1.DISTRICT = 'K')"),
which is joined to SAMPDB.CUSTOMER with a condition of
("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and
SAMPDB2.REVENUE are left outer joined using a merge join, with a join condition of
("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table
SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1,
which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key
in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The estimated time
for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The
total estimated time is 0 hours and 6 minutes and 2 seconds.
As the previous examples show, outer joins, when used properly, provide additional
information from a single query that formerly required multiple queries and/or steps to
achieve. However, the proper use of outer joins requires training and/or experience
because simple logic does not always apply. Use the following steps to be sure that you're
getting the "correct" answer (that is, the one you expect to get):
1. Make sure that you understand the question you are trying to answer; you should
have a pretty good idea what the answer set should look like.
2. Write the query, keeping in mind the proper placement of join conditions and
search conditions:
• All join conditions are placed in the ON clause.
• Search conditions for the inner table are placed in the ON clause, while search
conditions on the outer table are placed in the WHERE clause.
3. Always EXPLAIN the query before executing it. Look for the words "outer join."
If you don't see them, it's not one.
4. Run the query and compare the result with your expectations.
If your answer set matches your expectations, it is probably correct. If not, check the
locations of any selection criteria that you have placed in the ON and/or WHERE clauses.
As this article demonstrates, many results are possible and the correct solution is not
necessarily intuitive, especially in a more complex query. Now let's look at that 12-way
complex join...
Rolf Hanusa is the project leader and lead DBA for Southwestern Bell's Corporate
Data Warehouse (CDW) Project. Rolf has more than 10 years of experience as a DBA,
supporting both Teradata and DB2 DSS systems. He is also an active member of the
Partners Product Advisory Council, a group of NCR/Teradata customers that provides
input to NCR on the product direction of NCR's large system products, as well as
enhancements to the Teradata RDBMS. You can reach him via email at
rh9151@stlmail1.sbc.com.