ArchDBMS 3 2x2

Module 3: File Organizations and Indexes

Module Outline
3.1 Comparison of file organizations
3.2 Overview of indexes
3.3 Properties of indexes
3.4 Indexes and SQL

This is a repetition of material from Information Systems.

[Figure: DBMS architecture. Components shown: Web Forms, Applications, SQL Interface, SQL Commands; Query Processor (Plan Executor, Parser, Operator Evaluator, Optimizer); Files and Index Structures ("You are here!"); Buffer Manager; Disk Space Manager; Transaction Manager and Lock Manager (Concurrency Control); Recovery Manager; all inside the DBMS. The Database holds Index Files, Data Files, and the System Catalog.]

- A heap file provides just enough structure to maintain a collection of records (of a table).
- The heap file supports sequential scans (openScan) over the collection, e.g.

    SELECT A, B
    FROM   R

  No further operations receive specific support from the heap file.
- For queries like

    SELECT A, B
    FROM   R
    WHERE  C > 42

  or

    SELECT A, B
    FROM   R
    ORDER BY C ASC

  it would definitely be helpful if the SQL query processor could rely on a particular organization of the records in the file for table R.

File organization for table R
Which organization of records in the file for table R could speed up the evaluation of the two queries above?
3.1 Comparison of file organizations

This section . . .
- . . . presents a comparison of 3 file organizations:
  (1) files of randomly ordered records (heap files)
  (2) files sorted on some record field(s)
  (3) files hashed on some record field(s).
- . . . introduces the index concept:
  A file organization is tuned to make a certain query (class) efficient, but if we have to support more than one query class, we may be in trouble. Consider:

    Q:  SELECT A, B, C
        FROM   R
        WHERE  A > 0 AND A < 100

  If the file for table R is sorted on C, this does not buy us anything for query Q.
  If Q is an important query but is not supported by R's file organization, we can build a support data structure, an index, to speed up (queries similar to) Q.

- We will now enter a competition in which 3 file organizations are assessed in 5 disciplines:
  (1) Scan: fetch all records in a given file.
  (2) Search with equality test: needed to implement SQL queries like

        SELECT *
        FROM   R
        WHERE  C = 42

  (3) Search with range selection: needed to implement SQL queries like (upper or lower bound might be unspecified)

        SELECT *
        FROM   R
        WHERE  A > 0 AND A < 100

  (4) Insert a given record in the file, respecting the file's organization.
  (5) Delete a record (identified by its rid), fix up the file's organization if needed.
3.1.1 Cost model

- Performing these 5 database operations clearly involves block I/O, the major cost factor.
- However, we additionally have to pay for CPU time used to search inside a page, compare a record field to a selection constant, etc.
- To analyze cost more accurately, we introduce the following parameters:

  Parameter   Description
  b           # of pages in the file
  r           # of records on a page
  D           time needed to read/write a disk page
  C           CPU time needed to process a record (e.g., compare a field value)
  H           CPU time taken to apply a hash function to a record

- Remarks:
  D ≈ 15 ms
  C ≈ H ≈ 0.1 µs
  This is a coarse model to estimate the actual execution time (we do not model network access, cache effects, burst I/O, . . .).

Aside: Hashing

A hashed file uses a hash function h to map a given record onto a specific page of the file.
Example: h uses the lower 3 bits of the first field (of type INTEGER) of the record to compute the corresponding page number:

    h(⟨42, true, "foo"⟩)  = 2    (42 = 101010₂)
    h(⟨14, true, "bar"⟩)  = 6    (14 = 1110₂)
    h(⟨26, false, "baz"⟩) = 2    (26 = 11010₂)

- The hash function determines the page number only; record placement inside a page is not prescribed by the hashed file.
- If a page p is filled to capacity, a chain of overflow pages is maintained (hanging off page p) to store additional records with h(⟨. . .⟩) = p.
- To avoid immediate overflowing when a new record is inserted into a hashed file, pages are typically filled to 80 % only when a heap file is initially (re)organized into a hashed file.
(We will come back to hashing later.)
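The hashing example can be replayed in a few lines of Python. This is only a sketch: the function name `page_number` and the dictionary standing in for the file's pages are illustrative assumptions, not part of the lecture material.

```python
def page_number(record, bits=3):
    """Map a record to a page using the lower `bits` bits of its first field."""
    key = record[0]                      # first field, assumed to be an INTEGER
    return key & ((1 << bits) - 1)       # keep only the lowest `bits` bits


# The three records from the example.
records = [(42, True, "foo"), (14, True, "bar"), (26, False, "baz")]

# A dictionary stands in for the pages of the hashed file (hypothetical layout).
pages = {}
for rec in records:
    pages.setdefault(page_number(rec), []).append(rec)

for page, recs in sorted(pages.items()):
    print(page, recs)
```

Records 42 and 26 share their lower three bits (010), so both land on page 2. A real hashed file would start an overflow chain once that page is full.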
3.1.2 Scan

(1) Heap file
Scanning the records of a file involves reading all b pages as well as processing each of the r records on each page:

    $\mathrm{Scan}_{\mathrm{heap}} = b \cdot (D + r \cdot C)$

(2) Sorted file
The sort order does not help much here. However, the scan retrieves the records in sorted order (which can be a big plus):

    $\mathrm{Scan}_{\mathrm{sort}} = b \cdot (D + r \cdot C)$

(3) Hashed file
Again, the hash function does not help. We simply scan from the beginning (skipping over the spare free space typically found in hashed files):

    $\mathrm{Scan}_{\mathrm{hash}} = \underbrace{(100/80)}_{=1.25} \cdot b \cdot (D + r \cdot C)$

Scanning a hashed file
In which order does a scan of a hashed file retrieve its records?

3.1.3 Search with equality test (A = const)

(1) Heap file
The equality test is (a) on a primary key, (b) not on a primary key:

    (a) $\mathrm{Search}_{\mathrm{heap}} = \tfrac{1}{2} \cdot b \cdot (D + r \cdot C)$
    (b) $\mathrm{Search}_{\mathrm{heap}} = b \cdot (D + r \cdot C)$

(2) Sorted file (sorted on A)
We assume the equality test to be on the field determining the sort order. The sort order enables us to use binary search:

    $\mathrm{Search}_{\mathrm{sort}} = \log_2 b \cdot D + \log_2 r \cdot C$

(If more than one record qualifies, all other matches are stored right after the first hit.)

(3) Hashed file (hashed on A)
Hashed files support equality searching best. The hash function directly leads us to the page containing the hit (overflow chains ignored here):

    (a) $\mathrm{Search}_{\mathrm{hash}} = H + D + \tfrac{1}{2} \cdot r \cdot C$
    (b) $\mathrm{Search}_{\mathrm{hash}} = H + D + r \cdot C$

(All qualifying records are on the same page or, if present, in its overflow chain.)
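To get a feeling for the magnitudes, the scan and equality-search formulas can be evaluated numerically. The following is a sketch under stated assumptions: D = 15 ms and C = H = 0.1 µs, and the function names and the `org`/`unique` parameters are made up for illustration.

```python
from math import log2

D = 15e-3    # seconds to read/write one page (assumed: 15 ms)
C = 0.1e-6   # seconds of CPU time per record (assumed: 0.1 microseconds)
H = 0.1e-6   # seconds to apply the hash function (assumed: 0.1 microseconds)


def scan_cost(b, r, org="heap"):
    """Cost of a full scan; a hashed file is only 80 % full, hence the factor 1.25."""
    factor = 1.25 if org == "hash" else 1.0
    return factor * b * (D + r * C)


def search_eq_cost(b, r, org="heap", unique=True):
    """Cost of an equality search; unique=True is case (a), a primary-key test."""
    if org == "heap":
        full = b * (D + r * C)
        return 0.5 * full if unique else full
    if org == "sort":
        return log2(b) * D + log2(r) * C
    if org == "hash":
        return H + D + (0.5 if unique else 1.0) * r * C
    raise ValueError(f"unknown file organization: {org}")


for org in ("heap", "sort", "hash"):
    print(f"{org}: scan {scan_cost(1000, 100, org):.4f} s, "
          f"search {search_eq_cost(1000, 100, org):.4f} s")
```

For b = 1000 and r = 100 the hashed file answers an equality search with roughly one page access, the sorted file needs about ten, and the heap file reads half the file on average.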
3.1.4 Search with range selection (A ≥ lower AND A ≤ upper)

(1) Heap file
Qualifying records can appear anywhere in the file:

    $\mathrm{Range}_{\mathrm{heap}} = b \cdot (D + r \cdot C)$

(2) Sorted file (sorted on A)
Use equality search (with A = lower), then sequentially scan the file until a record with A > upper is found:

    $\mathrm{Range}_{\mathrm{sort}} = \log_2 b \cdot D + \log_2 r \cdot C + n/r \cdot D + n \cdot C$

(n denotes the number of hits in the range)

(3) Hashed file (hashed on A)
Hashing offers no help here as hash functions are designed to scatter records all over the hashed file (e.g., h(⟨7, . . .⟩) = 7, h(⟨8, . . .⟩) = 0):

    $\mathrm{Range}_{\mathrm{hash}} = 1.25 \cdot b \cdot (D + r \cdot C)$

3.1.5 Insert

(1) Heap file
We can add the record to some arbitrary page (e.g., the last page). This involves reading and writing the page:

    $\mathrm{Insert}_{\mathrm{heap}} = 2 \cdot D + C$

(2) Sorted file
On average, the new record will belong in the middle of the file. After insertion, we have to shift all subsequent records (in the latter half of the file):

    $\mathrm{Insert}_{\mathrm{sort}} = \underbrace{\log_2 b \cdot D + \log_2 r \cdot C}_{\text{search}} + \underbrace{\tfrac{1}{2} \cdot b \cdot (2 \cdot D + r \cdot C)}_{\text{shift latter half}}$

(3) Hashed file
We pretend to search for the record, then read and write the page determined by the hash function (we assume the spare 20 % space on the page is sufficient to hold the new record):

    $\mathrm{Insert}_{\mathrm{hash}} = \underbrace{H + D}_{\text{search}} + C + D$
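The range-selection and insert formulas can be compared the same way. Again this is only a sketch: the parameter values (D = 15 ms, C = H = 0.1 µs) and the function names are assumptions made for illustration.

```python
from math import log2

D, C, H = 15e-3, 0.1e-6, 0.1e-6   # assumed: 15 ms per page I/O, 0.1 microseconds of CPU work


def range_cost(b, r, n, org):
    """Cost of a range selection with n hits."""
    if org == "heap":
        return b * (D + r * C)
    if org == "sort":
        return log2(b) * D + log2(r) * C + (n / r) * D + n * C
    if org == "hash":
        return 1.25 * b * (D + r * C)
    raise ValueError(f"unknown file organization: {org}")


def insert_cost(b, r, org):
    """Cost of inserting one record."""
    if org == "heap":
        return 2 * D + C
    if org == "sort":
        search = log2(b) * D + log2(r) * C
        shift = 0.5 * b * (2 * D + r * C)    # rewrite the latter half of the file
        return search + shift
    if org == "hash":
        return (H + D) + C + D
    raise ValueError(f"unknown file organization: {org}")


for org in ("heap", "sort", "hash"):
    print(f"{org}: range {range_cost(1000, 100, 10, org):.4f} s, "
          f"insert {insert_cost(1000, 100, org):.4f} s")
```

The output shows the trade-off: the sorted file wins the range selection by about two orders of magnitude for b = 1000, but pays for it with an insert that rewrites half the file.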
3.1.6 Delete (record specified by its rid)

(1) Heap file
If we do not try to compact the file (because the file uses free space management) after we have found and removed the record, the cost is:

    $\mathrm{Delete}_{\mathrm{heap}} = \underbrace{D}_{\text{search by rid}} + C + D$

(2) Sorted file
Again, we access the record's page and then (on average) shift the latter half of the file to compact the file:

    $\mathrm{Delete}_{\mathrm{sort}} = D + \underbrace{\tfrac{1}{2} \cdot b \cdot (2 \cdot D + r \cdot C)}_{\text{shift latter half}}$

(3) Hashed file
Accessing the page using the rid is even faster than the hash function, so the hashed file behaves like the heap file:

    $\mathrm{Delete}_{\mathrm{hash}} = D + C + D$

- There is no single file organization that serves all 5 operations equally fast.
  Dilemma: more advanced file organizations make a real difference in speed!

[Figure: range selections for increasing file sizes (D = 15 ms, C = 0.1 µs, r = 100, n = 10). Curves: sorted file, heap/hashed file. Axes: time [s] from 0.01 to 10000 over b [pages] from 10 to 100000.]

[Figure: deletions for increasing file sizes (as above, n = 1). Curves: sorted file, heap/hashed file. Axes: time [s] from 0.01 to 10000 over b [pages] from 10 to 100000.]

- There exist index structures which offer all the advantages of a sorted file and support insertions/deletions efficiently¹: B+ trees.

¹ At the cost of a modest space overhead.
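The numbers behind the deletion comparison can be tabulated directly from the formulas. This is a sketch assuming D = 15 ms, C = 0.1 µs, and r = 100; the function name `delete_cost` is an illustrative choice.

```python
D, C = 15e-3, 0.1e-6   # assumed: 15 ms per page I/O, 0.1 microseconds per record


def delete_cost(b, r, org):
    """Cost of deleting one record that is identified by its rid."""
    if org in ("heap", "hash"):
        return D + C + D                      # read page, remove record, write page
    if org == "sort":
        return D + 0.5 * b * (2 * D + r * C)  # read page, then shift the latter half
    raise ValueError(f"unknown file organization: {org}")


for b in (10, 100, 1_000, 10_000, 100_000):
    print(f"b = {b:>7}: sorted {delete_cost(b, 100, 'sort'):10.3f} s, "
          f"heap/hashed {delete_cost(b, 100, 'heap'):.3f} s")
```

The heap/hashed cost stays constant at about 30 ms, while the sorted-file cost grows linearly with b.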
3.2 Overview of indexes

- If the basic organization of a file does not support a particular operation, we can additionally maintain an auxiliary structure, an index, which adds the needed support.
- We will use indexes like guides. Each guide is specialized to accelerate searches on a specific attribute A (or a combination of attributes) of the records in its associated file:
  (1) Query the index for the location of a record with A = k (k is the search key),
  (2) The index responds with an associated index entry k* (k* contains enough information to access the actual record in the file),
  (3) Read the actual record by using the guiding information in k*; the record will have an A-field with value k.²

  [Figure: k → index → k* → ⟨. . . , A = k, . . .⟩, with the three arrows labelled 1, 2, 3.]

² This is true for so-called exact match indexes. In the more general case with similarity indexes, the records are not guaranteed to contain the value k, they are only candidates for having this value.

- We can design the index entries, i.e., the k*, in various ways:

  Variant   Index entry k*
  a         ⟨k, ⟨. . . , A = k, . . .⟩⟩
  b         ⟨k, rid⟩
  c         ⟨k, [rid1, rid2, . . . ]⟩

- Remarks:
  With variant a, there is no need to store the data records in addition to the index; the index itself is a special file organization.
  If we build multiple indexes for a file, at most one of these should use variant a to avoid redundant storage of records.
  Variants b and c use rid(s) to point into the actual data file.
  Variant c leads to fewer index entries if multiple records match a search key k, but index entries are of variable length.
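The difference between index entry variants b and c, and the three-step lookup, can be made concrete with a small Python sketch. The rids (hypothetical (page, slot) pairs), the in-memory dictionary standing in for the data file, and all function names are assumptions for illustration only.

```python
from collections import defaultdict

# Data file: rid -> (name, age, sal) record. The rids are made-up (page, slot) pairs.
data_file = {
    (0, 0): ("Smith", 44, 3000),
    (0, 1): ("Jones", 40, 6003),
    (0, 2): ("Tracy", 44, 5004),
}


def build_variant_b(data_file, field):
    """Variant b: one <k, rid> entry per record."""
    return sorted((rec[field], rid) for rid, rec in data_file.items())


def build_variant_c(data_file, field):
    """Variant c: one <k, [rid1, rid2, ...]> entry per distinct search key."""
    entries = defaultdict(list)
    for rid, rec in sorted(data_file.items()):
        entries[rec[field]].append(rid)
    return dict(entries)


def lookup(index_c, data_file, k):
    """Query the index with k, take the entry k*, then read the records it points to."""
    rids = index_c.get(k, [])                 # steps 1 and 2: index returns k*
    return [data_file[rid] for rid in rids]   # step 3: follow the rids


AGE = 1
index_b = build_variant_b(data_file, AGE)
index_c = build_variant_c(data_file, AGE)
print(len(index_b), "entries in variant b,", len(index_c), "entries in variant c")
print(lookup(index_c, data_file, 44))
```

Two records share age 44, so variant c needs two entries where variant b needs three, at the price of a variable-length rid list.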
Example:³
- The data file contains ⟨name, age, sal⟩ records, the file itself (index entry variant a) is hashed on field age (hash function h1).
- The index file contains ⟨sal, rid⟩ index entries (variant b), pointing into the data file.
- This file organization + index efficiently supports equality searches on the age and sal keys.

[Figure: data file hashed on age (hash function h1; buckets h(age)=0, h(age)=1, h(age)=2) containing the records Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004; Ashby, 25, 3000; Basu, 33, 4003; Bristow, 29, 2007; Cass, 50, 5004; Daniels, 22, 6003. Index file of <sal,rid> entries hashed on sal (hash function h2; buckets h(sal)=0 and h(sal)=3) with the sal values 3000, 3000, 5004, 5004, 4003, 2007, 6003, 6003. The labels 1, 2, 3 mark the lookup steps.]

³ 1, 2, 3 refer to the index lookup scheme at the beginning of this section.

3.3 Properties of indexes

3.3.1 Clustered vs. unclustered indexes

- Suppose we have to support range selections on records such that lower ≤ A ≤ upper for field A.
- If we maintain an index on the A-field, we can
  (1) query the index once for a record with A = lower, and then
  (2) sequentially scan the data file from there until we encounter a record with field A > upper.
- This will work provided that the data file is sorted on the field A:

[Figure: a B+ tree index file whose index entries (k*) point to the data records of the data file in matching order.]
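The two-step range selection over a data file sorted on A can be sketched as follows. A binary search from Python's `bisect` module stands in for the index lookup (the slides use a B+ tree), and the in-memory list standing in for the data file is an assumption; the records are the ones from the example, sorted on age.

```python
from bisect import bisect_left

# Data file sorted on field A = age; each record is (age, name).
data_file = [
    (22, "Daniels"), (25, "Ashby"), (29, "Bristow"), (33, "Basu"),
    (40, "Jones"), (44, "Smith"), (44, "Tracy"), (50, "Cass"),
]
keys = [rec[0] for rec in data_file]   # stand-in for the index on A


def range_select(lower, upper):
    """Return all records with lower <= A <= upper."""
    start = bisect_left(keys, lower)   # step 1: a single index lookup for A = lower
    hits = []
    for rec in data_file[start:]:      # step 2: sequential scan from there
        if rec[0] > upper:             # first record with A > upper ends the scan
            break
        hits.append(rec)
    return hits


print(range_select(30, 44))
```

The scan never looks at the index again after step 1, which is only correct because neighbouring keys are also neighbours in the data file.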
- If the data file associated with an index is sorted on the index search key, the index is said to be clustered.
- In general, the cost for a range selection grows tremendously if the index on A is unclustered. In this case, proximity of index entries does not imply proximity in the data file.
  As before, we can query the index for a record with A = lower. To continue the scan, however, we have to revisit the index entries which point us to data pages scattered all over the data file:

[Figure: a B+ tree index file whose index entries (k*) point to data records scattered across the data file.]

- Remarks:
  If the index entries (k*) are of variant a, the index is obviously clustered by definition.
  A data file can have at most one clustered index (but any number of unclustered indexes).

3.3.2 Dense vs. sparse indexes

- A clustered index comes with more advantages than the improved speed for range selections presented above. We can additionally design the index to be space efficient:
  To keep the size of the index file small, we maintain one index entry k* per data file page (not one index entry per data record). Key k is the smallest search key on that page.
  Indexes of this kind are called sparse (otherwise indexes are dense).
- To search a record with field A = k in a sparse A-index, we
  (1) locate the largest index entry k'* such that k' ≤ k, then
  (2) access the page pointed to by k'*, and
  (3) scan this page (and the following pages, if needed) to find records with ⟨. . . , A = k, . . .⟩.
  Since the data file is clustered (i.e., sorted) on field A, we are guaranteed to find matching records in the proximity.
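The three search steps in a sparse index can be sketched in Python. The page layout (three records per page), the use of `bisect` in place of a real index structure, and the function name are assumptions; the sketch also assumes distinct search keys, so a run of equal keys never starts on the page before the one the index entry points to.

```python
from bisect import bisect_right

# Data file: pages of (name, age, sal) records, clustered (sorted) on name.
pages = [
    [("Ashby", 25, 3000), ("Basu", 33, 4003), ("Bristow", 30, 2007)],
    [("Cass", 50, 5004), ("Daniels", 22, 6003), ("Jones", 40, 6003)],
    [("Smith", 44, 3000), ("Tracy", 44, 5004)],
]

# Sparse index: one entry per page, holding the smallest search key on that page.
sparse_index = [page[0][0] for page in pages]


def sparse_lookup(k):
    """Find all records with name = k via the sparse index."""
    pos = bisect_right(sparse_index, k) - 1   # step 1: largest entry k' with k' <= k
    if pos < 0:
        return []                             # k is smaller than every key in the file
    hits = []
    for page in pages[pos:]:                  # step 2: access the page k'* points to
        for rec in page:                      # step 3: scan it (and following pages)
            if rec[0] == k:
                hits.append(rec)
            elif rec[0] > k:
                return hits                   # sorted file: no later match possible
    return hits


print(sparse_index)
print(sparse_lookup("Daniels"))
```

The index holds three entries for eight records; a dense index on the same field would hold eight.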
Example:
- Again, the data file contains ⟨name, age, sal⟩ records. We maintain a clustered sparse index on field name and an unclustered dense index on field age. Both use index entry variant b to point into the data file:

[Figure: Sparse Index on Name with entries Ashby, Cass, Smith. Data File with records Ashby, 25, 3000; Basu, 33, 4003; Bristow, 30, 2007; Cass, 50, 5004; Daniels, 22, 6003; Jones, 40, 6003; Smith, 44, 3000; Tracy, 44, 5004. Dense Index on Age with entries 22, 25, 30, 33, 40, 44, 44, 50.]

- Remarks:
  Sparse indexes need 2–3 orders of magnitude less space than dense indexes.
  We cannot build a sparse index that is unclustered (i.e., there is at most one sparse index per file).

SQL queries and index exploitation
How do you propose to evaluate query SELECT MAX(age) FROM employees?
How about SELECT MAX(name) FROM employees?

3.3.3 Primary vs. secondary indexes

Terminology
In the literature, you may often find a distinction between primary (mostly used for indexes on the primary key) and secondary (mostly used for indexes on other attributes) indexes. The terminology, however, is not very uniform, so some textbooks may use those terms for different properties.
You might as well find primary denoting variant a of indexes according to Section 3.2, while secondary may be used to characterize the other two variants b and c.
3.3.4 Multi-attribute indexes

Each of the index techniques sketched so far and discussed in the sequel can be applied to a combination of attribute values in a straightforward way:
- concatenate indexed attributes to form an index key,
  e.g. ⟨lastname, firstname⟩ → searchkey
- define index on searchkey
- index will support lookup based on both attribute values,
  e.g. . . . WHERE lastname='Doe' AND firstname='John' . . .
- possibly will also support lookup based on prefix of values,
  e.g. . . . WHERE lastname='Doe' . . .

There are more sophisticated index structures that can provide support for symmetric lookups for both attributes alone (or, to be more general, for all subsets of indexed attributes). These are often called multi-dimensional indexes. A large number of such indexes have been proposed, especially for geometric applications.

3.4 Indexes and SQL

The SQL-92 standard does not include any statement for the specification (creation, dropping) of index structures. In fact, SQL does not even require SQL systems to provide indexes at all!
Nonetheless, practically all SQL implementations support one or more kinds of indexes. A typical SQL statement to create an index would look like this:
Index specification in SQL

    CREATE INDEX IndAgeRating ON Students
           WITH STRUCTURE = BTREE,
                KEY = (age,gpa)

N.B. SQL-99 does not include indexes either.
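Because index syntax is not standardized, the statement looks different in each system. As one concrete illustration, the sketch below creates a comparable two-attribute index in SQLite through Python's standard `sqlite3` module. SQLite has no `WITH STRUCTURE` clause (its indexes are always B-trees), and the table layout and sample rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, age INTEGER, gpa REAL)")

# SQLite's dialect: the key attributes follow the table name in parentheses.
conn.execute("CREATE INDEX IndAgeRating ON Students (age, gpa)")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [("Doe", 21, 3.4), ("Roe", 22, 3.9), ("Poe", 21, 2.8)],
)

# The system catalog records the new index.
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(index_names)

# Ask the optimizer how it would evaluate a lookup on the key prefix (age only).
for row in conn.execute("EXPLAIN QUERY PLAN SELECT name FROM Students WHERE age = 21"):
    print(row)
```

The exact wording of the query plan differs between SQLite versions, so it is printed rather than interpreted here.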
