UNIT V
PART - A (2 Marks)
If a transaction writes a data item without having read it, the write is called a blind write. Blind
writes sometimes cause inconsistency.
A schedule S is serial if, for every transaction T participating in the schedule, all the operations of T
are executed consecutively in the schedule; otherwise the schedule is called a non-serial schedule.
It is used to prevent concurrent transactions from interfering with one another and enforcing an
additional condition that guarantees serializability.
A time stamp is a unique identifier for each transaction generated by the system. Concurrency control
protocols use this time stamp to ensure serializability.
A shared lock allows other transactions to read the data item but not to write it. An exclusive lock
allows both reading and writing of the data item, and the lock is held exclusively by a single transaction.
Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item
that is locked by some other transaction in the set.
o High throughput: the number of transactions executed in a given amount of time is high.
o Low delay: the average response time is reduced. Average response time is the average time for a
transaction to be completed after it has been submitted.
Recoverable schedule is the one where for each pair of transactions Ti and Tj such that Tj reads a data
item previously written by Ti, the commit operation of Ti appears before the commit operation of Tj.
The DBMS must devise an execution strategy for retrieving the result of the query from the database
files. The process of choosing a suitable execution strategy for processing a query is known as query
optimization.
A distributed system consists of a collection of sites, connected together via some kind of
communication network, in which each site is a full database system site in its own right.
o But the sites have agreed to work together so that a user at any site can access data anywhere in the
network exactly as if the data were all stored at the user's own site.
Reliability: the probability that the system is up and running at any given moment.
Availability: the probability that the system is up and running continuously throughout a specific
period.
The objective of minimizing network utilization implies the query optimization process itself needs to be
distributed as well as the query execution process.
R. Loganathan, AP/CSE. Mahalakshmi Engineering College, Trichy
14. Explain Client/Server. (A.U Nov/Dec 2009)
Client/Server refers primarily to an architecture, or logical division of responsibility: the client is the
application, called the front end; the server is the DBMS, which is the back end.
15. What are the rules that have to be followed during fragmentation?
It is a problem that occurs when there is more than one copy of a data item in different locations and
changes are made only in some copies, not in all copies.
o Majority locking
The process of generating and reproducing multiple copies of data at one or more sites is called
replication.
o Read-only snapshots
o Procedural replication
In some architectures, multiple CPUs work in parallel, are physically located close together in the same
building, and communicate at very high speed. The databases operating in such environments are
known as parallel databases.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in
support of management's decisions.
Data mining refers to the mining or discovery of new information in terms of patterns or rules from
vast amounts of data.
FLASH MEMORY
Data survives power failure
Data can be written at a location only once, but the location can be erased and written to again
Can support only a limited number (10K to 1M) of write/erase cycles
Reads are roughly as fast as main memory
But writes are slow (a few microseconds), and erase is slower
Cost per unit of storage roughly similar to main memory
Widely used in embedded devices such as digital cameras
Is a type of EEPROM (Electrically Erasable Programmable Read-Only Memory)
MAGNETIC DISK
Data is stored on a spinning disk, and read/written magnetically
Primary medium for the long-term storage of data; typically stores entire database.
Data must be moved from disk to main memory for access, and written back for storage
Much slower access than main memory (more on this later)
Direct access: possible to read data on disk in any order, unlike magnetic tape
Capacities range up to roughly 400 GB currently
Much larger capacity and cost/byte than main memory/flash memory
Growing constantly and rapidly with technology improvements (factor of 2 to 3 every 2
years)
Survives power failures and system crashes
Disk failure can destroy data, but is rare
OPTICAL STORAGE
non-volatile, data is read optically from a spinning disk using a laser
CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
Write-one, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R,
DVD+R)
Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-
RAM)
Reads and writes are slower than with magnetic disk
Juke-box systems, with large numbers of removable disks, a few drives, and a
mechanism for automatic loading/unloading of disks, are available for storing large
volumes of data
TAPE STORAGE
non-volatile, used primarily for backup (to recover from disk failure), and for archival
data
sequential access: much slower than disk
very high capacity (40 to 300 GB tapes available)
tape can be removed from the drive; storage costs are much cheaper than disk, but drives are
expensive
Tape jukeboxes available for storing massive amounts of data: hundreds of terabytes (1
terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)
2 MAGNETIC-DISK
o Reserved space
o Pointers
The reserved-space method can use fixed-length records of a known maximum length; unused space in
shorter records is filled with a null or end-of-record symbol.
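The reserved-space idea can be sketched in Python (a minimal sketch: MAX_LEN, the null filler, and the sample record values are illustrative assumptions, not from the notes):

```python
# Sketch: reserved-space representation. Every record occupies MAX_LEN
# bytes; shorter records are padded with a null filler so that record i
# always starts at fixed offset i * MAX_LEN.
MAX_LEN = 16
EOR = b"\x00"  # null / end-of-record filler (assumed)

def pack(record: bytes) -> bytes:
    assert len(record) <= MAX_LEN
    return record + EOR * (MAX_LEN - len(record))

def unpack(slot: bytes) -> bytes:
    # strip the trailing filler (assumes real data never ends in a null)
    return slot.rstrip(EOR)

block = b"".join(pack(r) for r in [b"Perryridge", b"Downtown"])
assert unpack(block[0:MAX_LEN]) == b"Perryridge"
assert unpack(block[MAX_LEN:2 * MAX_LEN]) == b"Downtown"
```

The fixed offsets are what make record access simple; the padding is the wasted space the method trades for that simplicity.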
Pointer method
A variable-length record is represented by a list of fixed-length records, chained
together via pointers.
Can be used even if the maximum record length is not known
Disadvantage of the pointer structure: space is wasted in all records except the first in a chain.
Solution is to allow two kinds of block in file:
Anchor block: contains the first records of chains
Overflow block: contains records other than those that are the first records of
chains
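The pointer method with anchor and overflow blocks can be sketched as follows (the Piece class, block lists, and sample fragments are hypothetical names for illustration):

```python
# Sketch: a variable-length record as a chain of fixed-length pieces.
# Anchor blocks hold only the first piece of each chain; overflow blocks
# hold the continuation pieces.
class Piece:
    def __init__(self, data, nxt=None):
        self.data = data   # fixed-length fragment of the record
        self.nxt = nxt     # pointer to the next piece, or None

anchor_block = []     # first pieces only
overflow_block = []   # continuation pieces only

def store(fragments):
    """Chain the fragments; first goes to the anchor block, rest to overflow."""
    pieces = [Piece(f) for f in fragments]
    for p, q in zip(pieces, pieces[1:]):
        p.nxt = q
    anchor_block.append(pieces[0])
    overflow_block.extend(pieces[1:])
    return pieces[0]

def read(first):
    data, p = [], first
    while p:
        data.append(p.data)
        p = p.nxt
    return "".join(data)

head = store(["Round Hill ", "A-305 ", "350"])
assert read(head) == "Round Hill A-305 350"
```

Separating anchor from overflow blocks avoids the wasted space of mixing short continuation pieces among first pieces.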
4. Explain about Organization Of Records In Files.
Heap: a record can be placed anywhere in the file where there is space
Sequential: store records in sequential order, based on the value of the search key of each
record
Hashing: a hash function is computed on some attribute of each record; the result specifies in
which block of the file the record should be placed.
Records of each relation may be stored in a separate file. In a clustering file organization
records of several different relations can be stored in the same file
Motivation: store related records on the same block to minimize I/O
Sequential File Organization
Suitable for applications that require sequential processing of the entire file. The records in the
file are ordered by a search key.
Deletion: use pointer chains
Insertion: locate the position where the record is to be inserted
if there is free space insert there
if no free space, insert the record in an overflow block
In either case, pointer chain must be updated
Need to reorganize the file from time to time to restore sequential order
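The insert/overflow/reorganize cycle above can be sketched with in-memory lists standing in for disk blocks (a simplified sketch; the record values are assumptions):

```python
# Sketch: sequential file kept in search-key order, with an overflow area
# for inserts and a periodic reorganization that restores sequential order.
sequential = [(10, "A"), (20, "B"), (40, "D")]   # ordered by search key
overflow = []

def insert(key, rec):
    # Simplified: instead of shifting records to make room, new records
    # go to the overflow area (the pointer chain in the real scheme).
    overflow.append((key, rec))

def reorganize():
    # Merge the overflow area back in to restore sequential order.
    global sequential, overflow
    sequential = sorted(sequential + overflow)
    overflow = []

insert(30, "C")
reorganize()
assert [k for k, _ in sequential] == [10, 20, 30, 40]
```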
Clustering File Organization
Simple file structure stores each relation in a separate file
Can instead store several relations in one file using a clustering file organization
Example: consider two relations, depositor and customer
good for queries involving the join of depositor and customer, and for queries involving one
single customer and his accounts
bad for queries involving only customer
results in variable size records
INDEXING AND HASHING
Comparison of Ordered Indexing and Hashing
Index Definition in SQL
Multiple-Key Access
Basic Concepts
Indexing mechanisms used to speed up access to desired data.
E.g., author catalog in library
Search key: attribute or set of attributes used to look up records in a file.
Otherwise k ≥ Km−1, where there are m pointers in the node. Then follow Pm to the child
node.
If the node reached by following the pointer above is not a leaf node, repeat the
above procedure on the node, and follow the corresponding pointer.
Eventually reach a leaf node. If for some i, key Ki = k, follow pointer Pi to the
desired record or bucket. Else no record with search-key value k exists.
In processing a query, a path is traversed in the tree from the root to some leaf node.
If there are K search-key values in the file, the path is no longer than ⌈log⌈n/2⌉(K)⌉.
Procedure find(value V)
set C = root node
while C is not a leaf node begin
Let Ki = smallest search-key value, if any, greater than V
if there is no such value then set C = node pointed to by the last non-null pointer Pm in C
else set C = node pointed to by Pi
end
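The search just described can be sketched in Python on a tiny two-level tree (the Node layout and sample keys are assumptions for illustration, not a full B+-tree implementation):

```python
# Sketch of B+-tree find: walk from the root to a leaf, at each interior
# node following the pointer for the smallest key greater than the target.
import bisect

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # sorted search-key values
        self.children = children  # child pointers (interior nodes only)
        self.records = records    # key -> record map (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def find(root, k):
    c = root
    while not c.is_leaf:
        # index of the pointer to follow for search key k
        i = bisect.bisect_right(c.keys, k)
        c = c.children[i]
    return c.records.get(k)  # None if no record with search-key value k

leaf1 = Node([5, 10], records={5: "r5", 10: "r10"})
leaf2 = Node([15, 20], records={15: "r15", 20: "r20"})
root = Node([15], children=[leaf1, leaf2])
assert find(root, 10) == "r10"
assert find(root, 12) is None
```

The while loop mirrors the pseudocode: the path length is bounded by the tree height, hence the logarithmic cost stated above.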
Redistribute the pointers between the node and a sibling such that both have
more than the minimum number of entries.
Update the corresponding search-key value in the parent of the node.
Node deletions may cascade upwards until a node with ⌈n/2⌉ or more pointers is found. If
the root node has only one pointer after deletion, it is deleted and the sole child becomes the
root.
5. Explain about B-TREE INDEX FILES.
Similar to a B+-tree, but a B-tree allows search-key values to appear only once; this eliminates
redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer
field for each search key in a nonleaf node must be included.
Generalized B-tree leaf node
Advantages of B-Tree indices:
May use fewer tree nodes than a corresponding B+-tree.
Sometimes possible to find search-key value before reaching leaf node.
Disadvantages of B-Tree indices:
Only small fraction of all search-key values are found early
Non-leaf nodes are larger, so fan-out is reduced. Thus, B-trees typically have greater
depth than the corresponding B+-tree
Insertion and deletion more complicated than in B+-Trees
Implementation is harder than B+-Trees.
Typically, advantages of B-Trees do not outweigh disadvantages.
A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block).
In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of all
bucket addresses B.
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus entire
bucket has to be searched sequentially to locate a record.
6. Explain about Hash Functions.
The worst hash function maps all search-key values to the same bucket; this makes access
time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of
search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of records
assigned to it irrespective of the actual distribution of search-key values in the file.
Typical hash functions perform computation on the internal binary representation of the
search-key.
For example, for a string search-key, the binary representations of all the characters in
the string could be added and the sum modulo the number of buckets could be returned.
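The string hash just described can be written directly (illustrative only; the bucket count is an assumption):

```python
# Example hash function for a string search key: add the character codes
# and take the sum modulo the number of buckets.
N_BUCKETS = 10

def h(key: str) -> int:
    return sum(ord(ch) for ch in key) % N_BUCKETS

b = h("Perryridge")
assert 0 <= b < N_BUCKETS   # every key maps to a valid bucket address
```

Simple additive hashes like this are easy to state but can be badly non-uniform (anagrams collide), which is exactly the skew discussed below.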
Handling of Bucket Overflows
Bucket overflow can occur because of
Insufficient buckets
Skew in distribution of records. This can occur due to two reasons:
multiple records have same search-key value
chosen hash function produces non-uniform distribution of key values
Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is
handled by using overflow buckets.
Overflow chaining: the overflow buckets of a given bucket are chained together in a linked
list.
Above scheme is called closed hashing.
An alternative, called open hashing, which does not use overflow buckets, is not
suitable for database applications.
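Closed hashing with overflow chaining can be sketched as follows (illustrative: the bucket count, capacity, and integer keys are assumptions, and a Python list stands in for the chained overflow buckets):

```python
# Sketch: each primary bucket holds a limited number of entries; extra
# entries go to an overflow chain attached to that bucket.
BUCKETS, CAPACITY = 4, 2

def hash_fn(key):
    return key % BUCKETS

table = [{"entries": [], "overflow": []} for _ in range(BUCKETS)]

def insert(key):
    b = table[hash_fn(key)]
    if len(b["entries"]) < CAPACITY:
        b["entries"].append(key)    # primary bucket has room
    else:
        b["overflow"].append(key)   # goes to the overflow chain

def lookup(key):
    b = table[hash_fn(key)]
    # the whole bucket, including its overflow chain, must be searched
    return key in b["entries"] or key in b["overflow"]

for k in (0, 4, 8, 12):  # all hash to bucket 0; overflow after capacity 2
    insert(k)
assert lookup(12) and not lookup(16)
```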
Hash Indices
Hashing can be used not only for file organization, but also for index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash file
structure.
Strictly speaking, hash indices are always secondary indices
if the file itself is organized using hashing, a separate primary hash index on it
using the same search-key is unnecessary.
However, we use the term hash index to refer to both secondary index structures
and hash organized files.
Deficiencies of Static Hashing
In static hashing, function h maps search-key values to a fixed set B of bucket addresses.
Databases grow with time. If the initial number of buckets is too small, performance
will degrade due to too many overflows.
If the file size at some point in the future is anticipated and the number of buckets is
allocated accordingly, a significant amount of space will be wasted initially.
If database shrinks, again space will be wasted.
One option is periodic re-organization of the file with a new hash function, but it
is very expensive.
These problems can be avoided by using techniques that allow the number of buckets to be
modified dynamically.
DYNAMIC HASHING
Good for database that grows and shrinks in size
Allows the hash function to be modified dynamically
EXTERNAL SORT-MERGE
Use N blocks of memory to buffer input runs, and 1 block to buffer output.
Read the first block of each run into its buffer page
repeat
Select the first record (in sort order) among all buffer pages
Write the record to the output buffer. If the output buffer is full write it to
disk.
Delete the record from its input buffer page. If the buffer page becomes
empty then read the next block (if any) of the run into the buffer.
until all input buffer pages are empty.
If N ≥ M, several merge passes are required.
1. In each pass, contiguous groups of M - 1 runs are merged.
2. A pass reduces the number of runs by a factor of M -1, and creates runs longer by the same
factor.
E.g., if M = 11 and there are 90 runs, one pass reduces the number of runs to 9, each 10 times the
size of the initial runs.
Repeated passes are performed till all runs have been merged into one.
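The merge loop above can be sketched in Python, with in-memory lists standing in for sorted disk runs (an assumed simplification; on disk, each run would be read block by block into its buffer page):

```python
# Sketch of the N-way merge step: repeatedly take the smallest front
# record among all input runs and move it to the output.
import heapq

def merge_runs(runs):
    # seed the heap with the first record of each non-empty run
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        rec, i, j = heapq.heappop(heap)
        out.append(rec)                       # write to the output buffer
        if j + 1 < len(runs[i]):              # refill from the same run
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

assert merge_runs([[1, 4, 7], [2, 5], [3, 6, 8]]) == [1, 2, 3, 4, 5, 6, 7, 8]
```

A heap makes selecting the smallest front record O(log N) per output record instead of scanning all N buffer pages each time.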
Example: External Sorting Using Sort-Merge
Cost analysis:
Total number of merge passes required: ⌈log_(M−1)(br / M)⌉.
Block transfers for initial run creation as well as in each pass is 2br
Cost of seeks
During run generation: one seek to read each run and one seek to write each run
During the merge phase
Buffer size: bb (read/write bb blocks at a time)
2 ⌈br / bb⌉ seeks for each merge pass
except the final one, which does not require a write
Total number of seeks: 2 ⌈br / M⌉ + ⌈br / bb⌉ (2 ⌈log_(M−1)(br / M)⌉ − 1)
JOIN OPERATION
Several different algorithms to implement joins
1. Nested-loop join
2. Block nested-loop join
3. Indexed nested-loop join
4. Merge-join
5. Hash-join
Choice based on cost estimate
Examples use the following information
Number of records of customer: 10,000 depositor: 5000
Number of blocks of customer: 400 depositor: 100
Nested-Loop Join
To compute the theta join r ⋈θ s:
for each tuple tr in r do begin
  for each tuple ts in s do begin
    test pair (tr, ts) to see if they satisfy the join condition θ
    if they do, add tr · ts to the result
  end
end
r is called the outer relation and s the inner relation of the join.
Requires no indices and can be used with any kind of join condition.
Expensive since it examines every pair of tuples in the two relations.
In the worst case, if there is enough memory only to hold one block of each relation, the
estimated cost is nr ∗ bs + br block transfers, plus nr + br seeks.
If the smaller relation fits entirely in memory, use that as the inner relation.
Reduces cost to br + bs block transfers and 2 seeks
Assuming worst-case memory availability, the cost estimate with depositor as the outer relation
is nr ∗ bs + br = 5000 ∗ 400 + 100 = 2,000,100 block transfers and nr + br = 5000 + 100 = 5100 seeks.
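The nested-loop algorithm above can be sketched with toy relations (the depositor/account tuples below are hypothetical, not the 10,000/5000-record relations of the example):

```python
# Sketch of nested-loop join: test every pair of tuples against the
# join condition theta.
depositor = [("Hayes", "A-102"), ("Jones", "A-101")]
account = [("A-101", 500), ("A-102", 400)]

def nested_loop_join(r, s, theta):
    result = []
    for tr in r:                # outer relation
        for ts in s:            # inner relation
            if theta(tr, ts):   # test the join condition
                result.append(tr + ts)
    return result

rows = nested_loop_join(depositor, account,
                        lambda tr, ts: tr[1] == ts[0])
assert rows == [("Hayes", "A-102", "A-102", 400),
                ("Jones", "A-101", "A-101", 500)]
```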
Modify merge-join to compute the left outer join r ⟕ s: during merging, for every tuple tr from r
that does not match any tuple in s, output tr padded with nulls. Right outer join and full outer
join can be computed similarly.
Modifying hash join to compute r ⟕ s:
o If r is the probe relation, output non-matching r tuples padded with nulls.
o If r is the build relation, when probing keep track of which r tuples matched s tuples; at the
end of si, output non-matched r tuples padded with nulls.
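A hash-based left outer join along the lines just described can be sketched as follows (illustrative relations and key positions; the null-padding width is hardcoded for two s attributes):

```python
# Sketch: build a hash table on s, probe with r, and pad non-matching
# r tuples with nulls (here r is the probe relation).
def hash_left_outer_join(r, s, r_key, s_key):
    build = {}
    for ts in s:                          # build phase on s
        build.setdefault(ts[s_key], []).append(ts)
    out = []
    for tr in r:                          # probe phase with r
        matches = build.get(tr[r_key])
        if matches:
            out.extend(tr + ts for ts in matches)
        else:
            out.append(tr + (None,) * 2)  # pad with nulls (2 s-attributes)
    return out

r = [("Hayes", "A-102"), ("Lindsay", "A-999")]
s = [("A-102", 400)]
rows = hash_left_outer_join(r, s, r_key=1, s_key=0)
assert ("Hayes", "A-102", "A-102", 400) in rows
assert ("Lindsay", "A-999", None, None) in rows
```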