
Advanced Database Management System
Unit 2: Query Processing and Optimization
Dhanashree Huddedar

Index

Overview

Measures of Query Cost

Selection operation

Sorting

Join Operation

Other Operations

Evaluation of Expressions

Transformation of relational expressions

Estimating Statistics of Expression results

Choice of Evaluation Plan

Materialized Views

Overview
Query processing: the list of activities that are performed to obtain the required tuples that satisfy a given query.
Query optimization: the process of choosing a suitable execution strategy for processing a query.
Two internal representations of a query:
Query Tree
Query Graph

Basic Steps in Query Processing


1. Parsing and translation
2. Optimization
3. Evaluation
4. Execution

Translating SQL Queries into Relational Algebra

Parser & Translator:
Checks the syntax of the query
Verifies the schema elements (relations, attributes)
Converts the query into a relational algebra (R.A.) expression

Optimizer:
Finds all equivalent R.A. expressions
Finds the R.A. expression with the least cost
Cost (CPU, block accesses, time spent)
Creates a query evaluation plan, which tells which R.A. expression and which algorithm is used

Query evaluation plan:
Evaluate the above plan and get the result

Translating SQL Queries into Relational Algebra

Query block:
The basic unit that can be translated into algebraic operators and optimized.
A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
Nested queries within a query are identified as separate query blocks.
Aggregate operators in SQL must be included in the extended algebra.
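As an illustration (a standard textbook example using the EMPLOYEE relation that appears later in this unit), the query

SELECT Lname FROM EMPLOYEE WHERE Salary > (SELECT MAX(Salary) FROM EMPLOYEE WHERE Dno = 5)

decomposes into two blocks. The inner block, (SELECT MAX(Salary) FROM EMPLOYEE WHERE Dno = 5), translates to ℱ MAX Salary (σ Dno=5 (EMPLOYEE)); the outer block translates to π Lname (σ Salary>c (EMPLOYEE)), where c is the value returned by the inner block.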


Measures of Query Cost

Cost is generally measured as total elapsed time for answering a query.
Many factors contribute to time cost:
disk accesses, CPU, or even network communication
Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account:
Number of seeks
Number of blocks read
Number of blocks written
Cost to write a block is greater than cost to read a block, because data is read back after being written to ensure that the write was successful.

Measures of Query Cost (Cont.)

Cost of an algorithm (e.g., for join or selection) depends on the database buffer size:
More memory for the DB buffer reduces disk accesses. Thus DB buffer size is a parameter for estimating cost.
We refer to the cost estimate of algorithm S as cost(S).
Cost of writing output to disk is not considered.

Measures of Query Cost (Cont.)

Several algorithms can reduce disk I/O by using extra buffer space.
The amount of real memory available to the buffer depends on other concurrent queries and OS processes, and is known only during execution.
We often use worst-case estimates, assuming only the minimum amount of memory needed for the operation is available.
Required data may be buffer-resident already, avoiding disk I/O.
But this is hard to take into account for cost estimation.

Catalogue Information for Cost Estimation

Example queries used to compare the selection algorithms:

OP1: σ Ssn = 123456789 (EMPLOYEE)
OP2: σ Dnumber > 5 (DEPARTMENT)
OP3: σ Dno = 5 (EMPLOYEE)
OP4: σ Dno = 5 AND Salary > 30000 AND Sex = 'F' (EMPLOYEE)
OP5: σ Essn = 123456789 AND Pno = 10 (WORKS_ON)

Selection operation

S1: Linear search (brute force algorithm).
Retrieve every record in the file, and test whether its attribute values satisfy the selection condition.
Since the records are grouped into disk blocks, each disk block is read into a main memory buffer, and then a search through the records within the block is conducted in main memory.

S2: Binary search.
If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search, which is more efficient than linear search, can be used.
An example is OP1 if Ssn is the ordering attribute for the EMPLOYEE file.
Binary search is rarely used in a DBMS because ordered files are not used unless they also have a corresponding primary index.
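Below is a minimal Python sketch of S1 and S2, assuming the file is represented as a list of blocks, each holding record dicts, and that for S2 the file is sorted on the search key. All names here are illustrative, not part of the original slides.

# Minimal sketch of S1 (linear search) and S2 (binary search).
# Assumptions: `blocks` is a list of blocks, each block a list of
# record dicts; for S2 the file is ordered on `key`.

def linear_search_select(blocks, predicate):
    """S1: scan every block and test every record."""
    result = []
    for block in blocks:              # one disk read per block
        for record in block:          # in-memory search within the block
            if predicate(record):
                result.append(record)
    return result

def binary_search_select(blocks, key, value):
    """S2: binary search on a file ordered by `key` (equality on a key attribute)."""
    lo, hi = 0, len(blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]           # one disk read per probe
        if block[-1][key] < value:
            lo = mid + 1
        elif block[0][key] > value:
            hi = mid - 1
        else:                         # value, if present, lies in this block
            return [r for r in block if r[key] == value]
    return []

# Example (OP1-style lookup, blocks sorted on Ssn):
# binary_search_select(emp_blocks, "Ssn", "123456789")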

S3a: Using a primary index.
If the selection condition involves an equality comparison on a key attribute with a primary index (for example, Ssn = 123456789 in OP1), use the primary index to retrieve the record.
Note that this condition retrieves a single record (at most).

S3b: Using a hash key.
If the selection condition involves an equality comparison on a key attribute with a hash key (for example, Ssn = 123456789 in OP1), use the hash key to retrieve the record.
Note that this condition retrieves a single record (at most).

S4: Using a primary index to retrieve multiple records.
If the comparison condition is >, >=, <, or <= on a key field with a primary index (for example, Dnumber > 5 in OP2), use the index to find the record satisfying the corresponding equality condition (Dnumber = 5), then retrieve all subsequent records in the (ordered) file. For the condition Dnumber < 5, retrieve all the preceding records.

S5: Using a clustering index to retrieve multiple records.
If the selection condition involves an equality comparison on a nonkey attribute with a clustering index (for example, Dno = 5 in OP3), use the index to retrieve all the records satisfying the condition.

S6: Using a secondary (B+-tree) index on an equality comparison.
This search method can be used to retrieve a single record if the indexing field is a key (has unique values) or to retrieve multiple records if the indexing field is not a key. It can also be used for comparisons involving >, >=, <, or <=.

Sorting

We may build an index on the relation, and then use the index to read the relation in sorted order. This may lead to one disk block access for each tuple.
For relations that fit in memory, techniques like quicksort can be used.
For relations that don't fit in memory, external sort-merge is a good choice.

External Sort-Merge
Let M denote memory size (in pages).
1. Create sorted runs. Let i be 0 initially.
Repeatedly do the following till the end of the relation:
(a) Read M blocks of the relation into memory
(b) Sort the in-memory blocks
(c) Write the sorted data to run Ri; increment i
Let the final value of i be N.
2. Merge the runs (next slide).

External Sort-Merge (Cont.)

2. Merge the runs (N-way merge). We assume (for now) that N < M.
1. Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer page.
2. Repeat:
1. Select the first record (in sort order) among all buffer pages.
2. Write the record to the output buffer. If the output buffer is full, write it to disk.
3. Delete the record from its input buffer page. If the buffer page becomes empty, read the next block (if any) of the run into the buffer.
3. Until all input buffer pages are empty.

External Sort-Merge (Cont.)

If N >= M, several merge passes are required.
In each pass, contiguous groups of M - 1 runs are merged.
A pass reduces the number of runs by a factor of M - 1, and creates runs longer by the same factor.
E.g., if M = 11 and there are 90 runs, one pass reduces the number of runs to 9, each 10 times the size of the initial runs.
Repeated passes are performed till all runs have been merged into one.
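Below is a minimal Python sketch of external sort-merge under simplifying assumptions: "M blocks of memory" is approximated by runs of M records, and runs are held as in-memory lists rather than disk files; heapq.merge stands in for the N-way merge loop. All names are illustrative.

# Minimal sketch of external sort-merge. Records are assumed to be
# comparable values; runs are in-memory lists to keep the sketch short.
import heapq

def create_runs(records, M):
    """Step 1: read M records at a time, sort them, emit a sorted run."""
    runs, buf = [], []
    for r in records:
        buf.append(r)
        if len(buf) == M:
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    return runs

def merge_runs(runs):
    """Step 2: N-way merge. heapq.merge repeatedly picks the first
    record, in sort order, among all input runs."""
    return list(heapq.merge(*runs))

def external_sort(records, M):
    runs = create_runs(records, M)
    # With N >= M runs, a real implementation would merge groups of
    # M - 1 runs per pass; heapq.merge here merges all runs in one pass.
    return merge_runs(runs)

# Example:
# external_sort([19, 24, 31, 33, 14, 16, 16, 21, 3, 2, 7, 14], M=3)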

Example: External Sorting Using Sort-Merge

[Figure: a worked example of external sort-merge. The initial relation of (letter, number) records, e.g., (a,19), (g,24), (d,31), (c,33), (b,14), (e,16), (r,16), (d,21), (m,3), ..., is split into sorted runs ("create runs"); "merge pass 1" combines these into longer runs, and "merge pass 2" produces the final sorted output.]

JOIN Operation
The JOIN operation is one of the most time-consuming operations in query processing.
Implementing the JOIN Operation:
Join (EQUIJOIN, NATURAL JOIN)
Two-way join: a join on two files, e.g., R ⋈ A=B S
Multi-way join: a join involving more than two files, e.g., R ⋈ A=B S ⋈ C=D T

Methods for Implementing Joins

J1: Nested-loop join (brute force):
For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].
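A minimal Python sketch of J1, assuming R and S are lists of record dicts joined on R.A = S.B (names are illustrative):

# Minimal sketch of nested-loop join (J1).
def nested_loop_join(R, S, A, B):
    result = []
    for t in R:                  # outer loop over R
        for s in S:              # inner loop over S
            if t[A] == s[B]:     # test the join condition
                result.append({**t, **s})
    return result

# Example:
# nested_loop_join([{"A": 1}], [{"B": 1, "x": "y"}], "A", "B")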

J2: Single-loop join (using an access structure to retrieve the matching records):
If an index (or hash key) exists for one of the two join attributes, say B of S, retrieve each record t in R, one at a time, and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].

J3: Sort-merge join:
If the records of R and S are physically sorted (ordered) by value of the join attributes A and B, respectively, we can implement the join in the most efficient way possible.
Both files are scanned in order of the join attributes, matching the records that have the same values for A and B.
In this method, the records of each file are scanned only once each for matching with the other file, unless both A and B are non-key attributes, in which case the method needs to be modified slightly.

Join Operation (Cont.)

J4: Hash join:
The records of files R and S are both hashed to the same hash file, using the same hashing function on the join attributes A of R and B of S as hash keys.
A single pass through the file with fewer records (say, R) hashes its records to the hash file buckets.
A single pass through the other file (S) then hashes each of its records to the appropriate bucket, where the record is combined with all matching records from R.
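A minimal Python sketch of J4, assuming the hash table built on the smaller file R fits in memory (names are illustrative):

# Minimal sketch of hash join (J4). R and S are lists of record dicts,
# joined on R.A = S.B; an in-memory dict plays the role of the hash file.
from collections import defaultdict

def hash_join(R, S, A, B):
    buckets = defaultdict(list)
    for t in R:                           # single pass over R: build buckets
        buckets[t[A]].append(t)
    result = []
    for s in S:                           # single pass over S: probe buckets
        for t in buckets.get(s[B], []):   # combine with all matching R records
            result.append({**t, **s})
    return result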

Factors Affecting JOIN Performance

Available buffer space
Join selection factor
Choice of inner vs. outer relation

Other Operations
Other relational operations and extended relational operations, such as duplicate elimination, projection, set operations, outer join, and aggregation.

Duplicate elimination: can be implemented via hashing or sorting.
After sorting, duplicates will appear adjacent to each other, and all but one copy of each set of duplicates can be deleted.
Optimization: duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge.
Hashing is similar: duplicates will come into the same bucket.

Projection:
Perform projection on each tuple, followed by duplicate elimination.
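A minimal Python sketch of projection with hash-based duplicate elimination; a Python set stands in for the hash structure (names are illustrative):

# Minimal sketch of projection followed by duplicate elimination.
# Tuples are dicts; duplicates hash to the same set entry.
def project(tuples, attrs):
    seen = set()
    out = []
    for t in tuples:
        key = tuple(t[a] for a in attrs)   # project each tuple
        if key not in seen:                # duplicate check via hashing
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

# Example:
# project([{"Dno": 5, "Sex": "F"}, {"Dno": 5, "Sex": "M"}], ["Dno"])
# -> [{"Dno": 5}]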

Other Operations (Cont.)
Aggregation can be implemented in a manner similar to duplicate elimination.
Sorting or hashing can be used to bring tuples in the same group together, and then the aggregate functions can be applied to each group.
Optimization: combine tuples in the same group during run generation and intermediate merges, by computing partial aggregate values:
For count, min, max, sum: keep aggregate values on the tuples found so far in the group. When combining partial aggregates for count, add up the counts.
For avg, keep sum and count, and divide sum by count at the end.
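A minimal Python sketch of hash aggregation for avg, keeping the partial aggregates (sum, count) per group and dividing at the end (names are illustrative):

# Minimal sketch of hash aggregation computing avg per group.
def avg_by_group(tuples, group_attr, value_attr):
    partial = {}                       # group key -> [partial sum, partial count]
    for t in tuples:
        s_c = partial.setdefault(t[group_attr], [0, 0])
        s_c[0] += t[value_attr]        # accumulate partial sum
        s_c[1] += 1                    # accumulate partial count
    return {g: s / c for g, (s, c) in partial.items()}

# Example:
# avg_by_group([{"Dno": 5, "Salary": 30000}, {"Dno": 5, "Salary": 40000}],
#              "Dno", "Salary")   # -> {5: 35000.0}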

Other Operations (Cont.)
Set operations (∪, ∩ and −): can either use a variant of merge-join after sorting, or a variant of hash-join.

E.g., set operations using hashing:
1. Partition both relations using the same hash function.
2. Process each partition i as follows:
1. Using a different hashing function, build an in-memory hash index on ri.
2. Process si as follows:
r ∪ s:
1. Add tuples in si to the hash index if they are not already in it.
2. At the end of si, add the tuples in the hash index to the result.

Other Operations: Set Operations

E.g., set operations using hashing (cont.):
1. As before, partition r and s.
2. As before, process each partition i as follows:
1. Build a hash index on ri.
2. Process si as follows:
r ∩ s:
1. Output tuples in si to the result if they are already there in the hash index.
r − s:
1. For each tuple in si, if it is there in the hash index, delete it from the index.
2. At the end of si, add the remaining tuples in the hash index to the result.
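A minimal in-memory Python sketch of the three hash-based set operations, skipping the partitioning step (i.e., treating the whole input as one partition); tuples are assumed hashable, e.g., plain tuples (names are illustrative):

# Minimal sketches of hash-based set operations; a Python set plays
# the role of the in-memory hash index on r.
def hash_union(r, s):
    index = set(r)                        # build hash index on r
    for t in s:
        index.add(t)                      # add s tuples not already present
    return list(index)

def hash_intersection(r, s):
    index = set(r)
    return [t for t in s if t in index]   # output s tuples found in the index

def hash_difference(r, s):
    index = set(r)
    for t in s:
        index.discard(t)                  # delete matching tuples from the index
    return list(index)                    # remaining tuples form r - s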

Other Operations: Outer Join

Outer join can be computed either as:
a join followed by the addition of null-padded non-participating tuples, or
by modifying the join algorithms.

Modifying merge join to compute r ⟕ s:
In r ⟕ s, the non-participating tuples are those in r − ΠR(r ⋈ s).
During merging, for every tuple tr from r that does not match any tuple in s, output tr padded with nulls.
Right outer join and full outer join can be computed similarly.

Other Operations: Outer Join (Cont.)

Modifying hash join to compute r ⟕ s:
If r is the probe relation, output non-matching r tuples padded with nulls.
If r is the build relation, when probing keep track of which r tuples matched s tuples. At the end of si, output the non-matched r tuples padded with nulls.
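A minimal Python sketch of a left outer join computed as a hash join with null padding for non-participating r tuples; None stands in for NULL (names are illustrative):

# Minimal sketch of r LEFT OUTER JOIN s on r.A = s.B.
# r and s are lists of dicts; s_attrs are the attributes of s to null-pad.
def left_outer_join(r, s, A, B, s_attrs):
    index = {}
    for t in s:                           # build hash index on s
        index.setdefault(t[B], []).append(t)
    result = []
    for t in r:                           # probe with r
        matches = index.get(t[A], [])
        if matches:
            for m in matches:
                result.append({**t, **m})
        else:                             # non-participating r tuple:
            nulls = {a: None for a in s_attrs}
            result.append({**t, **nulls}) # pad with nulls
    return result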

Evaluation of Expressions
So far we have seen algorithms for individual operations.
Alternatives for evaluating an entire expression tree:
Materialization: generate the result of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat.
Pipelining: pass on tuples to parent operations even as an operation is being executed.
We study the above alternatives in more detail below.

Materialization
Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations.
E.g., in the expression tree, compute and store
σ building = "Watson" (department)
then compute and store its join with instructor, and finally compute the projection on name.

Materialization (Cont.)
Materialized evaluation is always applicable.
Cost of writing results to disk and reading them back can be quite high.
Our cost formulas for operations ignore the cost of writing results to disk, so:
Overall cost = sum of costs of individual operations + cost of writing intermediate results to disk
Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is being filled.
Allows overlap of disk writes with computation and reduces execution time.

Pipelining

Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
E.g., in the previous expression tree, don't store the result of
σ building = "Watson" (department)
instead, pass tuples directly to the join. Similarly, don't store the result of the join; pass tuples directly to the projection.
Much cheaper than materialization: no need to store a temporary relation to disk.
Pipelining may not always be possible, e.g., for sort and hash-join.
For pipelining to be effective, use evaluation algorithms that generate output tuples even as tuples are received for the inputs to the operation.
Pipelines can be executed in two ways: demand driven and producer driven.

Pipelining (Cont.)

In demand-driven or lazy evaluation:
The system repeatedly requests the next tuple from the top-level operation.
Each operation requests the next tuple from its children operations as required, in order to output its next tuple.
In between calls, the operation has to maintain state so it knows what to return next.

In producer-driven or eager pipelining:
Operators produce tuples eagerly and pass them up to their parents.
A buffer is maintained between operators; the child puts tuples in the buffer, and the parent removes tuples from the buffer.
If the buffer is full, the child waits till there is space in the buffer, and then generates more tuples.
The system schedules operations that have space in their output buffer and can process more input tuples.

Alternative names: pull and push models of pipelining.

Pipelining (Cont.)
Implementation of demand-driven pipelining:
Each operation is implemented as an iterator implementing the following operations:
open()
E.g., file scan: initialize file scan; state: pointer to beginning of file
E.g., merge join: sort relations; state: pointers to beginning of sorted relations
next()
E.g., for file scan: output the next tuple, and advance and store the file pointer
E.g., for merge join: continue with the merge from the earlier state till the next output tuple is found. Save pointers as iterator state.
close()
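A minimal Python sketch of demand-driven pipelining using generators, which package the open()/next()/close() state automatically; the pipeline below evaluates π name (σ building = "Watson" (department)) tuple by tuple (names are illustrative):

# Minimal sketch of demand-driven (pull) pipelining with generators.
# Each operator pulls the next tuple from its child only on demand,
# so no intermediate result is materialized.
def file_scan(relation):
    for t in relation:             # open(): start of relation; next(): yield one tuple
        yield t

def select(child, predicate):
    for t in child:                # pull the next tuple from the child on demand
        if predicate(t):
            yield t

def projection(child, attrs):
    for t in child:
        yield {a: t[a] for a in attrs}

# Example pipeline:
# department = [{"name": "Physics", "building": "Watson"}, ...]
# plan = projection(select(file_scan(department),
#                          lambda t: t["building"] == "Watson"), ["name"])
# next(plan) pulls one tuple through the whole pipeline (lazy evaluation).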

Important Questions

Define query optimization and explain its significance for a DBMS.
List and explain the steps followed to process a high-level query.
Discuss the sort-merge algorithm with an example.
List and explain the search algorithms used to implement Select and Join statements.
Explain the following relational operation algorithms: Select, Join, Project, Union, Cartesian Product.
What is the difference between pipelining and materialization?
What is meant by cost-based query optimization?
