
ADBT Notes (Unit 1)

Introduction
• Parallel machines are becoming quite common and affordable
  • Prices of microprocessors, memory, and disks have dropped sharply
• Databases are growing increasingly large
  • Large volumes of transaction data are collected and stored for later analysis
  • Multimedia objects such as images are increasingly stored in databases
• Large-scale parallel database systems are increasingly used for:
  • storing large volumes of data
  • processing time-consuming decision-support queries
  • providing high throughput for transaction processing

Data can be partitioned across multiple disks for parallel I/O


Different queries can be run in parallel with each other (Inter-Query
Parallelism)
– Concurrency control takes care of conflicts
Queries are expressed in a high-level language (SQL), then translated to relational
algebra
– Individual relational operations (e.g., sort, join, aggregation) can be
executed in parallel (Intra-Query Parallelism)
– data can be partitioned and each processor can work independently
on its own partition.
Thus, databases naturally lend themselves to parallelism
– Potential parallelism is everywhere in database processing
There are two types of parallelism, both natural to DBMS processing:

– Pipelined parallelism: many machines each doing one step in a
multi-step process.
– Partitioned parallelism: many machines doing the same thing to
different pieces of data.

Database systems can be:

• Centralized (hierarchical, network, relational, OODBMS, ORDBMS)
  • based on a single processing unit or compute architecture
  • all data is maintained at a single site
  • processing of individual transactions is sequential
• Parallel (to meet the complex requirements of users)
• Distributed

Four Distinct Motivations:

• Performance: Using several resources (e.g., CPUs and disks) in
parallel can significantly improve performance.
• Increased availability: If a site containing a relation goes down, the
relation continues to be available if a copy is maintained at another site.
• Distributed access to data: An organization may have branches in
several cities. A bank manager is likely to look up the accounts of
customers at the local branch, and this locality can be exploited by
distributing the data accordingly.
• Analysis of distributed data: Organizations increasingly want
to examine all the data available to them, even when it is stored
across multiple sites and on multiple database systems.
Q1. What is a Parallel Database System?

1. A parallel database system is one that seeks to improve performance
through parallel implementation of various operations such as loading
data, building indexes, and evaluating queries.

2. Parallel database systems consist of multiple processors and multiple
disks connected by a fast interconnection network. Parallel processing
divides a large task into smaller tasks and executes the smaller tasks
concurrently on several nodes.
3. A coarse-grain parallel machine consists of a small number of powerful
processors
4. A massively parallel or fine grain parallel machine utilizes thousands
of smaller processors.
5. Two main performance measures:
– throughput :
The number of tasks that can be completed in a given time
interval
– response time :
the amount of time it takes to complete a single task from the
time it is submitted
Advantages of Parallel databases
1. Increased throughput
2. Improved response time
3. Useful for applications that query extremely large databases and
process extremely large numbers of transactions per second.
4. Substantial performance improvements
5. Increased availability of system.
6. Greater flexibility
7. Possible to serve a large number of users.

Disadvantages
1. More start-up costs
2. Interference Problem
3. Skew problem
Q2. What are Speedup & Scaleup?
SpeedUp:
1. Speedup: more resources means proportionally less time for a given
amount of data.

2. A fixed-sized problem executing on a small system is given to a system
which is N times larger.
3. It is measured by:
        speedup = (small system elapsed time) / (large system elapsed time)
4. Speedup is linear if this ratio equals N.

ScaleUp:
1. Scale-up: if resources are increased in proportion to the increase in data
size, elapsed time stays constant.
2. Increase the size of both the problem and the system: an N-times larger
system is used to perform an N-times larger job. It is measured by:
        scaleup = (small system, small problem elapsed time) / (big system, big problem elapsed time)
3. Scaleup is linear if this ratio equals 1.
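As a minimal sketch, both metrics can be computed directly from elapsed times; the timing figures below are made up for illustration:

```python
# Sketch: computing speedup and scaleup from elapsed times.
# All timing values here are hypothetical, purely for illustration.

def speedup(small_sys_time, large_sys_time):
    """Speedup = small system elapsed time / large system elapsed time
    (same fixed-size problem on both systems)."""
    return small_sys_time / large_sys_time

def scaleup(small_sys_small_prob_time, big_sys_big_prob_time):
    """Scaleup = (small system, small problem time) /
    (N-times system, N-times problem time)."""
    return small_sys_small_prob_time / big_sys_big_prob_time

# A small system takes 100 s; a 4x larger system takes 25 s on the same problem:
s = speedup(100.0, 25.0)    # 4.0 -> linear speedup for a 4x larger system

# A 4x larger system runs a 4x larger job in the same 100 s:
c = scaleup(100.0, 100.0)   # 1.0 -> linear scaleup
```

Sublinear behavior (startup costs, interference, skew, as listed below) shows up as a speedup ratio below N or a scaleup ratio above 1.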
Factors Limiting Speedup and Scaleup:
Speedup and scaleup are often sublinear due to:
1. Startup costs:
Cost of starting up multiple processes may dominate computation
time, if the degree of parallelism is high.
2. Interference:
Processes accessing shared resources (e.g.,system bus, disks, or
locks) compete with each other
3. Skew:
Overall execution time is determined by the slowest of the
parallel tasks.

Q3. ARCHITECTURES FOR PARALLEL DATABASES



Three main architectures are proposed for building parallel databases:
1. Shared memory :
a. processors share a common memory.
b. Multiple CPUs are attached to an interconnection network and
can access a common region of main memory

2. Shared disk :
a. processors have direct access to all disks.
b. Each CPU has a private memory and direct access to all disks
through an interconnection network.

3. Shared nothing :
a. processors share neither a common memory nor common disk
b. each CPU has local main memory and disk space, but no two
CPUs can access the same storage area; all communication
between CPUs is through a network connection.
1. Shared memory architecture:
1. Multiple CPUs are attached to an interconnection network and can
access a common region of main memory
2. Extremely efficient communication between processors — data in shared
memory can be accessed by any processor without having to move it
using software.
3. Disadvantage – architecture is not scalable beyond 32 or 64 processors
since the bus or the interconnection network becomes a bottleneck
4. Widely used for lower degrees of parallelism (4 to 8).

2. Shared disk architecture

1. Each CPU has a private memory and direct access to all disks through
an interconnection network.
2. All processors can directly access all disks via an interconnection
network, but the processors have private memories.
a. The memory bus is not a bottleneck
b. Architecture provides a degree of fault-tolerance — if a
processor fails, the other processors can take over its tasks since
the database is resident on disks that are accessible from all
processors.
3. Examples: IBM Sysplex and DEC clusters
4. Downside:
Bottleneck now occurs at interconnection to the disk subsystem.
Shared-disk systems can scale to a somewhat larger number of
processors, but communication between processors is slower.
3. Shared nothing architecture
1. Node consists of a processor, memory, and one or more disks.
Processors at one node communicate with another processor at another
node using an interconnection network. A node functions as the server for
the data on the disk or disks the node owns.
Examples: Teradata, Tandem, nCUBE
2. Data accessed from local disks (and local memory accesses) do not pass
through interconnection network, thereby minimizing the interference
of resource sharing.
3. Shared-nothing multiprocessors can be scaled up to thousands of
processors without interference.
4. Main drawback:
cost of communication and non-local disk access; sending data
involves software interaction at both ends.

Hierarchical Architecture
Combines characteristics of shared-memory, shared-disk, and
shared-nothing architectures.
PARALLEL QUERY EVALUATION

Introduction:

Q4. Parallel Database Issues:



1. Considering parallel execution of a single query
a. Data partitioning
b. Parallelization of existing operator evaluation code
2. If one operator consumes the output of a second operator, pipelined
parallelism can be used to evaluate the two operators concurrently; each
individual operator in a query plan can also be evaluated in a parallel fashion.
3. The key to evaluating an operator in parallel is to partition the input
data; then work on each partition in parallel and combine the results.
This approach is called data-partitioned parallel evaluation.

Q5. Different Types of DBMS Parallelism (||-ism) ?



There are 3 Types of DBMS Parallelism:
1. Input/ Output parallelism
2. Inter-query parallelism
3. Intra-query parallelism:
a. Intra-operator parallelism
b. Inter-operator parallelism

COMPARISON OF PARTITIONING TECHNIQUES
Evaluate how well partitioning techniques support the following types of data
access:
1.Scanning the entire relation.
2.Locating a tuple associatively – point queries
E.g., r.A = 25.
3.Locating all tuples such that the value of a given attribute lies within a
specified range – range queries.
E.g., 10 ≤ r.A < 25.
1. Input/ Output parallelism

Parallel I/O refers to the process of writing to, or reading from, two
or more I/O devices simultaneously.

a. Data partitioning
Reduce the time required to retrieve relations from disk by partitioning
the relations on multiple disks.
1. Horizontal partitioning :
tuples of a relation are divided among many disks such that each
tuple resides on one disk. (number of disks = n)
2. Vertical partitioning :
involves creating tables with fewer columns and using additional
tables to store the remaining columns.

Horizontal partitioning
a. Round-robin partitioning
b. Hash partitioning
c. Range partitioning
d. Schema partitioning

a. Round-robin partitioning
Sends the i-th tuple inserted in the relation to disk (i mod n), which
ensures an even distribution of tuples across disks.

Advantages
1. Best suited for sequential scan of entire relation on each query.
2. All disks have almost an equal number of tuples; retrieval work
is thus well balanced between disks.

Disadvantages
1. No clustering -- tuples are scattered across all disks.
2. Point queries and range queries are complicated to process, since
every disk must be searched.
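Round-robin placement can be sketched in a few lines of Python; the disk "arrays" below are just lists standing in for real disks:

```python
# Sketch of round-robin partitioning: the i-th inserted tuple goes to
# disk (i mod n). Tuples and disk count are illustrative.

def round_robin_partition(tuples, n_disks):
    disks = [[] for _ in range(n_disks)]
    for i, t in enumerate(tuples):
        disks[i % n_disks].append(t)   # i-th tuple -> disk i mod n
    return disks

disks = round_robin_partition(list(range(10)), 3)
# The 10 tuples spread evenly: disk sizes differ by at most one.
```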
b. Hash partitioning
1. Choose one or more attributes as the partitioning attributes with
range 0…n - 1
2. Choose hash function h
3. Let i denote result of hash function h applied to the partitioning
attribute value of a tuple. Send tuple to disk i.
4. Eg: if hash function applied on the attribute results 2 then it is
placed on disk 2.
Advantages
1. Assuming hash function is good, and partitioning attributes form
a key, tuples will be equally distributed between disks which
prevents skewing.
2. Retrieval work is then well balanced between disks.
3. Good for point queries on partitioning attribute
4. Can lookup single disk, leaving others available for answering
other queries.

Eg: SELECT * FROM EMPLOYEE WHERE EMPID=10235;

5. Useful for sequential scans of the entire relation: the time taken to
scan is roughly 1/n of the time on a single disk, where n is the number
of disks.

Disadvantages
1. Not well suited for point queries on non-partitioning attributes
2. No clustering, so difficult to answer range queries

SELECT * FROM EMPLOYEE WHERE EMPID > 10235 AND EMPID < 20235;
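The scheme can be sketched as follows; Python's built-in hash stands in for the hash function h, and the EMPLOYEE rows are made up for illustration:

```python
# Sketch of hash partitioning on a partitioning attribute (here EMPID).
# A point query on the partitioning attribute touches exactly one disk.

def hash_partition(rows, key, n_disks):
    disks = [[] for _ in range(n_disks)]
    for row in rows:
        i = hash(row[key]) % n_disks   # h(partitioning attribute value) -> disk i
        disks[i].append(row)
    return disks

employees = [{"EMPID": e, "NAME": f"emp{e}"} for e in range(100)]
disks = hash_partition(employees, "EMPID", 4)

# Point query WHERE EMPID = 42: only disk hash(42) % 4 is searched,
# leaving the other disks free to answer other queries.
target = disks[hash(42) % 4]
match = [r for r in target if r["EMPID"] == 42]
```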

c. Range partitioning
1. The administrator specifies that attribute values within a range are to
be placed on a certain disk.
2. Eg: range partitioning with three disks numbered 0, 1, 2 might place
tuples for employee numbers up to 1000 on disk 0, tuples for employee
numbers 1001-1500 on disk 1, and tuples for employee numbers 1501 to
2000 on disk 2.
Advantages
1. Offers good performance for range based queries and exact match
queries involving partitioning attributes.
2. Search narrows to exactly those disks that might have any tuples of
interest.
Disadvantages:
It can cause skewing in some cases.
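Using the employee-number example above, range partitioning reduces to a binary search over the partition vector; the boundary semantics here (values equal to a boundary go to the lower disk, so "up to 1000" is inclusive) are an assumption for the sketch:

```python
import bisect

# Sketch of range partitioning with partition vector [1000, 1500] and
# three disks 0, 1, 2 (the employee-number example above).

def range_partition(rows, key, vector):
    disks = [[] for _ in range(len(vector) + 1)]
    for row in rows:
        # first boundary >= value; a value equal to a boundary stays
        # on the lower disk, so "up to 1000" lands on disk 0
        i = bisect.bisect_left(vector, row[key])
        disks[i].append(row)
    return disks

emps = [{"EMPID": e} for e in (5, 999, 1000, 1001, 1499, 1500, 1501, 1999)]
disks = range_partition(emps, "EMPID", [1000, 1500])
# Range query EMPID BETWEEN 1001 AND 1500 touches only disk 1.
```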

d. Schema partitioning
Different relations within a database are placed on different disks.
Disadvantage:
More prone to data skewing.

Q6. Handling of data Skew



The distribution of tuples to disks may be skewed — that is, some disks have
many tuples, while others may have fewer tuples.
Types of skew:
1. Attribute-value skew.
a. Some values appear in the partitioning attributes of many
tuples; all the tuples with the same value for the
partitioning attribute end up in the same partition.
b. Can occur with range-partitioning and hash-partitioning
2. Partition skew.
a. With range-partitioning, badly chosen partition vector may
assign too many tuples to some partitions and too few to
others.
b. Less likely with hash-partitioning if a good hash-function is
chosen.
Handling Skew using Histograms
1. Balanced partitioning vector can be constructed from histogram in a
relatively straightforward fashion
- Assume uniform distribution within each range of the histogram
2. Histogram can be constructed by scanning relation, or sampling (blocks
containing) tuples of the relation
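A balanced (equi-depth) partition vector can be sketched from a sorted sample of the attribute values; the skewed sample below is invented for illustration:

```python
# Sketch: build a balanced range-partition vector from a sorted sample
# (an equi-depth histogram), assuming roughly uniform distribution
# within each histogram range.

def partition_vector(sample, n_parts):
    """Return n_parts - 1 boundary values splitting the sample
    into partitions of (roughly) equal depth."""
    s = sorted(sample)
    step = len(s) // n_parts
    return [s[(i + 1) * step] for i in range(n_parts - 1)]

# A skewed attribute: heavy duplication around values 1 and 100.
values = [1] * 50 + list(range(2, 52)) + [100] * 50
vec = partition_vector(values, 4)
# Boundaries follow the data's quantiles rather than its value range,
# so each partition receives a comparable number of tuples.
```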
Handling Skew Using Virtual Processor Partitioning:
Skew in range partitioning can be handled elegantly using virtual
processor partitioning:
a. create a large number of partitions (say 10 to 20 times the number
of processors)
b. Assign virtual processors to partitions either in round-robin fashion
or based on estimated cost of processing each virtual partition
Basic idea:
a. If any normal partition would have been skewed, it is very likely
the skew is spread over a number of virtual partitions
b. Skewed virtual partitions get spread across a number of processors,
so work gets distributed evenly!
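The round-robin assignment of virtual partitions to real processors can be sketched as a simple mapping; the counts are illustrative:

```python
# Sketch of virtual-processor partitioning: create many more partitions
# than real processors and deal them out round-robin, so a skewed range
# spanning several consecutive virtual partitions lands on different
# real processors.

def assign_virtual_partitions(n_virtual, n_procs):
    """Map each virtual partition to a real processor, round-robin."""
    return {v: v % n_procs for v in range(n_virtual)}

mapping = assign_virtual_partitions(n_virtual=40, n_procs=4)   # 10x factor
owned = [sum(1 for p in mapping.values() if p == i) for i in range(4)]
# Each real processor owns 10 virtual partitions; consecutive virtual
# partitions (0, 1, 2, 3, ...) go to different processors.
```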

Q5 continued:
2. Inter-query parallelism

1. Multiple queries / transactions execute in parallel with one another.
2. DBMS uses means of transaction dispatching.
3. Incoming requests are routed to the least busy processor to keep
the overall workload balanced.
4. Difficult to automate this process in Shared nothing architecture.
5. Shared disk architecture with the use of lock management
supports inter-query parallelism.
Advantages
Increases transaction throughput
– used primarily to scale up a transaction processing system
to support a larger number of transactions per second
Disadvantages
a. Response times of individual operations are not made faster.
b. More complicated to implement in a shared-disk or shared-nothing
architecture than in a shared-memory architecture.

3. Intra-query Parallelism(Parallel Query processing)

Execution of a single query in parallel on multiple processors, typically
using a shared-nothing parallel architecture.

Example:
- A relation has been partitioned across multiple disks by range partitioning on
some attribute.
- The user wants to sort on the partitioning attribute.
- The sort operation can be implemented by sorting each partition in parallel,
then concatenating the sorted partitions to get the final sorted relation.

The query is parallelized by parallelising individual operations.


Two approaches
1. Each CPU can execute the same task against some portion of the data.
2. The task can be divided into different subtasks with each CPU executing
a different subtask.
Advantages:
Speeds up long-running queries and queries involving multiple joins.

a. Intra operation parallelism


Parallelize the execution of each individual operation of a task like sort,
join, projection etc.

b. Inter operation parallelism


The different operations in a query expression are executed in
parallel.
Two types are
• Pipelined parallelism
• Independent parallelism.
Q7. What are the two types of inter Operation Parallelism

Two types are:
a. Pipelined parallelism
b. Independent parallelism.
1. Pipelined parallelism
The output tuples of one operation A are consumed by second
operation B, even before the first operation has produced the entire set of
tuples in its output.
It is possible to run operations A and B simultaneously on different
processors, so that the operation B consumes tuples in parallel with
operation A producing them.
Advantages
a. Useful with a smaller number of CPUs.
b. Pipelined execution avoids writing intermediate results to disk.
Disadvantages
a. Does not scale up well.
b. It is not possible to pipeline relational operators that do not produce
output until all inputs have been accessed (e.g., sorting and aggregation).
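The tuple-at-a-time flow of pipelined parallelism can be modeled with Python generators; this is only a single-process sketch (a real system would run operators A and B on different processors), and the scan/select operators are invented for illustration:

```python
# Sketch of pipelined evaluation: operator B consumes operator A's output
# tuples as they are produced, without materializing the full intermediate
# result. Generators model the tuple-at-a-time flow.

def scan(rows):                      # operator A: produces tuples one at a time
    for r in rows:
        yield r

def select_gt(tuples, threshold):    # operator B: consumes A's output lazily
    for t in tuples:
        if t > threshold:
            yield t

# B starts emitting results before A has finished producing its output;
# no intermediate relation is ever written out.
result = list(select_gt(scan(range(10)), 6))   # -> [7, 8, 9]
```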

2. Independent parallelism

The operations in a query processor that do not depend on one another


can be executed in parallel.

Advantage

Independent parallelism is useful with lower degree of parallelism.

Disadvantage

It doesn't provide a high degree of parallelism.

Example:
SELECT * FROM employees ORDER BY last_name;
• The execution plan implements a full scan of the employees table.
• This operation is followed by a sorting of the retrieved rows, based on the
value of the last_name column.
• There are actually eight parallel execution servers involved in the query
even though the DOP is 4.
• All of the parallel execution servers involved in the scan operation send
rows to the appropriate parallel execution server performing the SORT
operation.
• If a row scanned by a parallel execution server contains a value for the
last_name column between A and G, that row is sent to the first ORDER
BY parallel execution server.
• When the scan operation is complete, the sorting processes can return
the sorted results to the query coordinator, which, in turn, returns the
complete query results to the user.

Parallelizing Sequential Operator Evaluation Code


1. Streams (from different disks or the output of other operators) are
merged as needed to provide the inputs for a relational operator, and the
output of an operator is split as needed to parallelize subsequent
processing.
2. The merge and split operators should be able to buffer some data and
should be able to halt the operators producing their input data.
3. They can then regulate the speed of the execution according to the
execution speed of the operator that consumes their output.
Parallelizing Individual operations

Various operations can be implemented in parallel in a shared-nothing
architecture.

Q8. What is Bulk Loading and Scanning?



1. Pages can be read in parallel while scanning a relation, and the
retrieved tuples can then be merged, if the relation is partitioned
across several disks.
2. If hashing or range partitioning is used, selection queries can be
answered by going to just those processors that contain relevant
tuples.
3. Bulk loading (loading multiple rows of data into a database table)
can also be done in parallel.

Q9. What is Sorting ?



1. Each CPU sorts the part of the relation that is on its local disk,
and these sorted sets of tuples are then merged.
2. First redistribute all tuples in the relation using range
partitioning.
3. Each processor then sorts the tuples assigned to it, using some
sequential sorting algorithm
4. The entire sorted relation can be retrieved by visiting the
processors in an order corresponding to the ranges assigned to
them and simply scanning the tuples
Q.10 Types Of Sorting:

a. Range-Partitioning Sort

1. Choose processors P0, ..., Pm, where m ≤ n - 1, to do the sorting.
2. Create range-partition vector with m entries, on the sorting
attributes
3. Redistribute the relation using range partitioning
– all tuples that lie in the ith range are sent to processor Pi
– Pi stores the tuples it received temporarily on disk Di.
– This step requires I/O and communication overhead.
4. Each processor Pi sorts its partition of the relation locally.
5. Each processors executes same operation (sort) in parallel with
other processors, without any interaction with the others
(data parallelism).
6. The final merge operation is trivial: range-partitioning ensures that,
for all i < j, the key values in processor Pi are all less than the
key values in Pj.
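The steps above can be sketched in Python; the "processors" are just lists, and the partition vector is made up for illustration:

```python
import bisect

# Sketch of a range-partitioning sort: redistribute tuples using a
# range-partition vector, sort each partition locally (data parallelism),
# then concatenate -- the final "merge" is trivial because every key in
# partition i is less than every key in partition j for i < j.

def range_partition_sort(tuples, vector):
    parts = [[] for _ in range(len(vector) + 1)]
    for t in tuples:                        # step 3: redistribute by range
        parts[bisect.bisect_left(vector, t)].append(t)
    for p in parts:                         # steps 4-5: independent local sorts
        p.sort()
    out = []
    for p in parts:                         # step 6: concatenate in range order
        out.extend(p)
    return out

data = [42, 7, 99, 15, 63, 3, 88, 27]
sorted_data = range_partition_sort(data, [20, 60])
```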

b. Parallel External Sort-Merge:

1. Assume the relation has already been partitioned among disks
D0, ..., Dn-1 (in whatever manner).
2. Each processor Pi locally sorts the data on disk Di.
3. The sorted runs on each processor are then merged to get the
final sorted output.
4. Parallelize the merging of sorted runs as follows:
– The sorted partitions at each processor Pi are range-
partitioned across the processors P0, ..., Pm-1.
– Each processor Pi performs a merge on the streams as
they are received, to get a single sorted run.
– The sorted runs on processors P0,..., Pm-1 are concatenated
to get the final result.
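The local-sort-then-merge structure can be sketched as follows; the initial partitioning is arbitrary, and Python's heapq.merge stands in for the streamed k-way merge a real processor would perform:

```python
import heapq

# Sketch of parallel external sort-merge: each "processor" sorts its
# local partition (step 2), and the sorted runs are then merged into a
# single sorted output stream (steps 3-4).

local_partitions = [[9, 1, 5], [8, 2], [7, 3, 4, 6]]  # arbitrary partitioning
sorted_runs = [sorted(p) for p in local_partitions]   # independent local sorts
merged = list(heapq.merge(*sorted_runs))              # streamed k-way merge
```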
Q11. What are Joins?
There are Two types of Joins:
a. Parallel Join
b. Partitioned Join
Parallel Join:
1. The join operation requires pairs of tuples to be tested to see if
they satisfy the join condition, and if they do, the pair is
added to the join output.
2. Parallel join algorithms attempt to split the pairs to be tested
over several processors.
3. Each processor then computes part of the join locally.
4. In a final step, the results from each processor can be collected
together to produce the final result.
Partitioned Join
1. Partition the two input relations across the processors, and
compute the join locally at each processor.
2. Let r and s be the input relations, and we want to compute
r ⋈ r.A=s.B s, where r and s are each partitioned into n partitions,
denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
3. Can use either range partitioning or hash partitioning.
4. r and s must be partitioned on their join attributes r.A and s.B,
using the same range-partitioning vector or hash function.
5. Partitions ri and si are sent to processor Pi,
6. Each processor Pi locally computes ri ⋈ ri.A=si.B si. Any of the
standard join methods can be used.
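A partitioned join can be sketched as below; Python's hash stands in for the shared partitioning function, the local join is a simple nested loop (any standard join method would do), and the relations are made up for illustration:

```python
# Sketch of a partitioned join: hash-partition r and s on their join
# attributes with the SAME function, so matching tuples land in the
# same partition; then join each partition pair locally.

def partitioned_join(r, s, a, b, n):
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(t[a]) % n].append(t)   # same partitioning function
    for t in s:
        s_parts[hash(t[b]) % n].append(t)   # ... applied to both relations
    out = []
    for i in range(n):                      # "processor" Pi joins r_i with s_i
        for tr in r_parts[i]:
            for ts in s_parts[i]:
                if tr[a] == ts[b]:
                    out.append((tr, ts))
    return out

r = [{"A": 1}, {"A": 2}, {"A": 3}]
s = [{"B": 2}, {"B": 3}, {"B": 4}]
result = partitioned_join(r, s, "A", "B", 2)   # matches on 2 and 3
```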
Q12. What is Improved Parallel Hash Join

Partitioned Parallel Hash-Join
1. Assume s is smaller than r and therefore s is chosen as the
build relation.
2. A hash function h1 takes the join attribute value of each
tuple in s and maps this tuple to one of the n processors.
3. Each processor Pi reads the tuples of s that are on its disk Di,
and sends each tuple to the appropriate processor based on hash
function h1. Let si denote the tuples of relation s that are sent to
processor Pi.
4. As tuples of relation s are received at the destination processors,
they are partitioned further using another hash function, h2,
which is used to compute the hash-join locally.
5. Once the tuples of s have been distributed, the larger relation r is
redistributed across the n processors using the hash function h1.
– Let ri denote the tuples of relation r that are sent to
processor Pi.
6. As the r tuples are received at the destination processors, they
are repartitioned using the function h2
7. Each processor Pi executes the build and probe phases of the
hash-join algorithm on the local partitions ri and si of r and s to
produce a partition of the final result of the hash-join.
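The two-level use of h1 (distribution across processors) and h2 (local build/probe) can be sketched as follows; both hash functions and the relations are illustrative stand-ins:

```python
# Sketch of the partitioned parallel hash join described above:
# h1 redistributes the build relation s and the probe relation r across
# n "processors"; h2 builds each processor's local in-memory hash table.

def h1(v, n):                     # distribution hash: join value -> processor
    return hash(v) % n

def h2(v):                        # local hash for the in-memory table
    return hash((v, "local"))

def parallel_hash_join(r, s, a, b, n):
    s_at = [[] for _ in range(n)]
    r_at = [[] for _ in range(n)]
    for t in s:                               # redistribute build relation s
        s_at[h1(t[b], n)].append(t)
    for t in r:                               # redistribute probe relation r
        r_at[h1(t[a], n)].append(t)
    out = []
    for i in range(n):                        # each processor Pi, independently:
        table = {}                            #   build phase on s_i using h2
        for t in s_at[i]:
            table.setdefault(h2(t[b]), []).append(t)
        for t in r_at[i]:                     #   probe phase on r_i
            for m in table.get(h2(t[a]), []):
                if m[b] == t[a]:
                    out.append((t, m))
    return out

r = [{"A": k} for k in range(6)]
s = [{"B": k} for k in (1, 3, 5, 7)]          # s is smaller: build relation
joined = parallel_hash_join(r, s, "A", "B", 3)
```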
