You are on page 1of 31

Database Design and The EntityRelationship Model

Lecture 21 R&G - Chapter 2

A relationship, I think, is like a shark, you know? It has to constantly move forward or it dies. And I think what we got on our hands is a dead shark.
Woody Allen (from Annie Hall, 1979)

Administriva
Homework 4 Due One Week From Today Midterm is graded

Exam Summary
20

Diff from Mean Exam 1

Average: 74.5 Max: 93.5 Min: 36.5 Std Dev: 11.6

15 Good on Exam 1 Worse on Exam 2 10

0 -40 -30 -20 -10 -5 0 10 20 30

-10 Bad on Exam 1 Good on Exam 2 -15

-20 Diff from Mean Exam 2

Exam Details
Section: Low High / Total (Average) 1: External Sorting 10 24 / 24 (18) 2: Query Execution 17 43 / 44 (36) 3: Query Optim. 6.5 29 / 32 (20)

Section 1: External Sorting


For all questions in this section, assume that you have 96 pages in the buffer pool, and you wish to sort EMP by Ename. Using the general external sort, how many sorting passes are required? (6pts) #passes = 1 + Ceil(log(B-1)(N/B)) = 1 + Ceil(log(95)(10,000/96)) =1+2 =3 What is the I/O cost of doing this sort? (6pts) I/O Cost = 2N * (# passes) = 20,000 * 3

Section 1 (cont)
Database systems sometimes use blocked I/O because reading a block of continuous pages is more efficient that doing a separate I/O for each page. Assume that in our computer system, it is much more efficient to read and write blocks of 32 pages at a time, so all reads and writes from files must be in blocks of 32 pages (and if a file has less than 32 pages, it is padded with blank pages). Consider doing external sort with blocked I/O, which gets faster I/O at the expense of more sorting passes. If the database must read and write 32 pages at a time, how many sorting passes are required? Show either the formula you use, or the temp file sizes after each pass. (4pts) Pass 0: 105 runs of 96 pages Pass 1: 2-way merge 96 + 96 -> result is 53 runs of 192 pages Pass 2: 2-way merge of 192 + 192 -> result is 27 runs of 384 pages Pass 3: -> result is 14 runs of 768 pages, Pass 4: -> result is 7 runs 1536 pages, Pass 5: -> result is 4 runs 3072 pages, Pass 6: -> result is 2 runs of 6144, Pass 7: -> result is 10,000 8 passes total

Section 1 (cont)
Blocked I/O is used because reading a 32-page block is faster than 32 separate 1-page I/Os. For each sorting pass, you still must read and write every page in the file, but instead of doing 10,000 1-page I/Os, instead you do (10,000/32) block I/Os. Assume that in our system, we can read a 32-page block in the time it would normally take to do 8 single-page I/Os. Is the blocked I/O sort faster or slower than a regular sort from question 2? By approximately what ratio? (4pts) Assume that instead of a heap file, the records from EMP are stored a clustered B-Tree index, whose key is Ename, using alternative 1 (i.e., the full data records are stored in the leaves of the tree). The B-tree has depth 3. Assuming the B-Tree already exists, what is the approximate I/O cost to use the B-tree to get the records in sorted order? (4pts)

Section 2: Relational Operators


Consider the operation: (EID < 5000)EMP

What is the I/O cost of this operation if there is no index? (4pts) Lecture 15 slide 9: no index -> sequential scan, N I/Os, meaning 10,000 I/Os
What is the reduction factor? (4pts) 5000/100,000 = 0.05 Assume that instead of a heap file, the records from EMP are stored in a clustered B-Tree index, whose key is Ename, using alternative 1 (i.e., the full data records are stored in the leaves of the tree). The B-tree has depth 3. Assuming the B-Tree already exists, what is the I/O cost of this selection operation? (4pts) Due to a copy paste error on my part, the B-Tree Index is *not* useful for the query. I gave credit either way, whether people tried to use the index as an index, or for sequential scan. In considering the cost of using the index, either as an index or for a scan, full credit required knowing that B-Trees have an overhead of approximately 50%, i.e., there are 50% more leaves than the number of pages in a heap file.

Relational Operators (cont)


Consider the query: select distinct Ename from EMP order by Ename If you had 96 pages of memory, what algorithm would you use to execute this query and why? (4pts) Order by suggests sort-merge to remove dups. If, instead, the query were: select distinct Ename from EMP and you had 101 pages of memory, what algorithm would you use and why? (4pts) Could use hashing to remove dups, which would have a similar I/O cost but be more CPU efficient.

Relational Operators (cont)


Consider the join: In_Dept Dept Assuming 100 pages of memory, what is the I/O cost of this join using Blocked Nested Loops? (4pts) Lec 15, slide 27: Blocked nested loops: M + Ceil(M/(B-2)) * N 550 + Ceil(550/100)) * 20 = 550 + 6 * 20 = 670 What is the I/O cost of this using Index Nested Loops, with an unclustered Hash index on Dept.DID? Assume an average cost of 1.2 I/Os to get to the right hash bucket. (4pts) Lec 15, slide 27: index nested loops: M + #tuples in M * index cost 550 + 110,000 * (1.2 + 1) = 242,000 Consider the join: Dept In_Dept. Assuming 100 pages of memory, what is the I/O cost of this using Sort-Merge-Join, optimized so that the last merge pass is combined with the join if possible? (4pts) Best: 1 pass over In_Dept, to make 6 runs, sort Dept in memory, merge. Cost 550*3 + 20 = 1670 By the book: 3(M + N) = 3*570 = 1710 Assuming 100 pages of memory, what is the I/O cost of this using regular (not hybrid) hash join? (4pts) Best: partition In_Dept (2 * 550), hash Dept in memory, match. Cost 550 * 3 + 20 = 1670 By the book: 3(M + N) = 1710

Relational Operators (cont)


Consider the join: ( (EID <= 100)EMP) In_Building Assume there is no index on EMP, and an unclustered hash index on In_Building.EID, and plan operators are pipelined where possible. You have 100 pages of memory. Assume that it takes on average 1.2 I/Os get reach the appropriate hash bucket. What is the I/O cost of this using Blocked Nested Loops? (4pts) M + Ceil(M/(B-2)) * N If you do not push select down:10,000 + Ceil(10,000/(100-2)) * 550 = 10,000 + 103*550 = 66,650 If you do push select, pipeline result: Size of selection in pages is S = (I/100,000)*10,000 = I/10 = 10 Join cost: 10,000 + Ceil(10/(100-2)) * (550) = 10,550 What is the I/O cost of this using Index Nested Loops? (4pts) If you do push select, pipeline result: Size of selection in pages is: S = (I/100,000)*10,000 = I/10 = 10 each EID will have 1.1 matching tuples. Join cost: M + (# tuples * index cost) 10,000 + 100 * (1.2 + 1.1) = 10,230

Section 3: Query Optimization


There are several parts of an SQL query that can cause a column (or columns) to be considered an "interesting order", meaning that an ordered input can lower the cost of evaluating that part of the query. Briefly describe three of these, and for each one briefly explain why. (6 pts) Order By, Group By, Joins, Distinct, some aggregations (e.g. Max, Min) Write the reduction factor that would be used by a System R style query optimizer for each of the following predicates (6pts) Building.BID > 150

EMP.Ename = Joe

BID ranges from 1 to 200, so this would be .

IN_DEPT.EID = 003

We dont know about distribution of names, so 1/10. Since there are 100,000 employees, this is 1/100,000

Given the following query, where X is the join operator: (EMP.Ename) (Dept.Budget > 500000) (Emp X In_Dept X Dept) Mark whether each of the following queries are equivalent (True/False). __T___(EMP.Ename) (Dept.Budget > 500000) (Emp X Dept X In_Dept) __T___(EMP.Ename) (Dept.Budget > 500000) (Dept X In_Dept X Emp) __T___(EMP.Ename) (Emp X In_Dept X (Dept.Budget > 500000)(Dept)) __F___(Dept.Budget > 500000) ((EMP.Ename) (Emp) X In_Dept X Dept)

Query Optimization (cont)


Given the schema on the first page of the exam, assume the following: There are unclustered hash indexes on both Emp.EID and Dept.DID. There are clustered B-Tree Indexes (alt 1 data records store in the B-Tree leaves) on both Emp.EID and Dept.DID. There is an unclustered Btree index on (Emp.Salary, Emp.EID) You can assume that Btree-Indexes have heights of 3. And the cost for getting to the data record using a hash index is 2. What is the best access plan for the following query (4pts). SELECT E.eid, E.Salary FROM Emp E WHERE E.Salary > 100,000 Very best is unclustered B-tree, index only plan Draw all possible join orders considered by a System-R style query optimizer for the following query. (4pts) SELECT E.eid, D.dname FROM Emp E, In_Dept I, Dept D WHERE E.sal = 64,000 AND D.budget > 500,000 AND E.eid = I.eid AND I.did = D.did (E X I) X D) (I X E) X D) (D X I) X E) (I X D) X E)

Where are we?


This week: Database Design (me) The ER Model as a design tool Next Week: Concurrency Control & Recovery (Prof. Roth) Week after: More Database Design (me)

Today and Thursday: The ER Model


A different data model from Relational
Most commonly used for database design Today: Details of the ER Model Thursday: Translating ER Schemas to Relational

Databases Model the Real World


Data Model translates real world things into structures computers can store Many models: Relational, E-R, O-O, Network, Hierarchical, etc. Relational Rows & Columns Keys & Foreign Keys to link Relations
Enrolled
sid 53666 53666 53650 53666 cid grade Carnatic101 C Reggae203 B Topology112 A History105 B

Students
sid 53666 53688 53650 name login Jones jones@cs Smith smith@eecs Smith smith@math age 18 18 19 gpa 3.4 3.2 3.8

Database Design
The process of modelling things in the real world into elements of a data model. I.E., describing things in the real world using a data model. E.G., describing students and enrollments using various tables with key/foreign key relationships The Relational model is not the only model in use

A Problem with the Relational Model


CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2)) CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)

With complicated schemas, it may be hard for a person to understand the structure from the data definition.

Enrolled
cid Carnatic101 Reggae203 Topology112 History105 grade C B A B sid 53666 53666 53650 53666

Students
sid 53666 53688 53650 name login Jones jones@cs Smith smith@eecs Smith smith@math age 18 18 19 gpa 3.4 3.2 3.8

One Solution: The E-R Model


Instead of relations, it has: Entities and Relationships These are described with diagrams, both structure, notation more obvious to humans A visual language for describing schemas
name since lot Students did Courses dname budget

ssn

Enrolled_in

Steps in Database Design


Requirements Analysis user needs; what must database do? Conceptual Design high level descr (often done w/ER model) Logical Design translate ER into DBMS data model Schema Refinement (in 2 weeks) consistency, normalization Physical Design (discussed already) indexes, disk layout Security Design who accesses what, and how

Conceptual Design
Define enterprise entities and relationships What information about entities and relationships should be in database? What are the integrity constraints or business rules that hold? A database `schema in the ER Model is represented pictorially (ER diagrams). Can map an ER diagram into a relational schema.

ER Model Basics

ssn

name

lot

Employees

Entity: Real-world thing, distinguishable from other objects. Entity described by set of attributes.
Entity Set: A collection of similar entities. E.g., all employees. All entities in an entity set have the same set of attributes. (Until we consider hierarchies, anyway!) Each entity set has a key (underlined). Each attribute has a domain.

ER Model Basics (Contd.)


name since lot Employees did dname budget Departments

ssn

Works_In

Relationship: Association among two or more entities. E.g., Attishoo works in Pharmacy department. relationships can have their own attributes. Relationship Set: Collection of similar relationships. An n-ary relationship set R relates n entity sets E1 ... En ; each relationship in R involves entities e1 E1, ..., en En

ER Model Basics (Cont.)


name ssn lot

since dname did

Employees

budget
Departments Works_In

supervisor

subordinate

Reports_To

Same entity set can participate in different relationship sets, or in different roles in the same set.

name ssn lot

since did

dname budget

Key Constraints
Employees

An employee can work in many departments; a dept can have many employees. In contrast, each dept has at most one manager, according to the key constraint on Manages.

Manages

Departments

Works_In since

Many-toMany

1-to Many

1-to-1

Participation Constraints
Does every employee work in a department? If so, this is a participation constraint the participation of Employees in Works_In is said to be total (vs. partial) What if every department has an employee working in it? Basically means at least one
name ssn lot

since
dname

did
Manages

budget
Departments

Employees

Works_In

Means: exactly one


since

Weak Entities
A weak entity can be identified uniquely only by considering the primary key of another (owner) entity. Owner entity set and weak entity set must participate in a one-to-many relationship set (one owner, many weak entities). Weak entity set must have total participation in this identifying relationship set.
name
ssn lot cost pname age

Employees

Policy

Dependents

Weak entities have only a partial key (dashed underline)

Binary vs. Ternary Relationships


name
ssn lot Employees pname age

If each policy is owned by just 1 employee: Key constraint on Policies would mean policy can only cover 1 dependent!

Covers

Dependents

Bad design
policyid name ssn Employees Purchaser lot

Policies cost pname age

Dependents

Think through all the constraints in the 2nd diagram!

Beneficiary

Better design
policyid

Policies cost

Binary vs. Ternary Relationships (Contd.)


Previous example illustrated case when two binary relationships were better than one ternary relationship.

Opposite example: a ternary relation Contracts relates entity sets Parts, Departments and Suppliers, and has descriptive attribute qty. No combination of binary relationships is an adequate substitute.

Binary vs. Ternary Relationships (Contd.)


qty Parts Contract Departments

VS.
Suppliers
Parts needs Departments

can-supply

Suppliers

deals-with

S can-supply P, D needs P, and D deals-with S does not imply that D has agreed to buy P from S. How do we record qty?

Summary so far
Entities and Entity Set (boxes) Relationships and Relationship sets (diamonds) binary n-ary Key constraints (1-1,1-M, M-M, arrows on 1 side) Participation constraints (bold for Total) Weak entities - require strong entity for key Next, a couple more advanced concepts

You might also like