DBMS

CSCI 585 UNIVERSITY OF SOUTHERN CALIFORNIA Cyrus Shahabi
SPRING 2001 COMPUTER SCIENCE DEPARTMENT Jan. 11,2001
Introduction
Database: An integrated collection of data, usually stored on
secondary storage, pertaining to some organization.
Database management system (DBMS): A collection of software
programs that control the database, allowing individuals to
access (read/write) the data contained therein. Example: a
university database might contain information about entities such
as students, faculty, courses, and classrooms, together with
information relating the entities to each other: students enrolled
in courses, faculty teaching courses, courses using rooms.
Database system: The database and the DBMS together.
Before DBMS:
After DBMS:
Why a DBMS? (or Problems with the "file-processing" approach)
1. Reduce application development time
- DBMS handles common functionality (e.g., all students
with GPA > 3.0)
- Ideally, the DBMS interface is "what" oriented (instead
of "how")
- Provides many tools to assist users
2. Resolve data redundancy and inconsistency (data sharing):
data is better utilized (discovered and reused), and redundancy
of data is minimized
3. Data independence: In the file-processing approach, application
programs depend on data representation and storage details:
- Different file formats
- Different programming languages
A database system, however, contains not only the database
itself but also a complete definition or description of the
database, termed metadata. The DBMS software refers
to the metadata to know the structure of a specific database.
4. Data integrity: one may enforce consistency constraints on
data, e.g., number of seats sold ≤ number of seats on the
plane × 1.1
5. Centralized control: the DBA tunes the database to balance
users' needs
6. Security: mechanisms to prevent unauthorized access. These
mechanisms are based on content instead of the file-oriented
approach.
7. Concurrency control: avoids undesirable race conditions that
arise with simultaneous access/updates to data
8. Crash recovery: ensures the integrity of data in the presence
of failures
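As an illustration of item 4, here is a minimal sketch of a consistency constraint enforced by the DBMS rather than by each application. It assumes the seat example means that seats sold may exceed capacity by at most 10%; the table `flight`, its columns, and the flight number are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per flight, with capacity and seats sold so far.
conn.execute("""
    CREATE TABLE flight (
        flight_no  TEXT PRIMARY KEY,
        seats      INTEGER NOT NULL,
        seats_sold INTEGER NOT NULL,
        CHECK (seats_sold <= seats * 1.1)  -- assumed rule: at most 10% overbooking
    )
""")
conn.execute("INSERT INTO flight VALUES ('XX100', 100, 105)")  # within the limit

try:
    conn.execute("UPDATE flight SET seats_sold = 120 WHERE flight_no = 'XX100'")
    violation_rejected = False
except sqlite3.IntegrityError:
    violation_rejected = True  # the DBMS refused the inconsistent update

seats_sold = conn.execute("SELECT seats_sold FROM flight").fetchall()[0][0]
```

The rule is declared once, in the schema, so every program that updates `flight` is subject to it.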
Data model: A set of concepts to describe the structure of a
database. That is, a specification tool (class of data structures)
for describing data, including entities, relationships, semantics,
constraints, etc. Purpose: to capture information about real-world
data.
- High-level or conceptual data models: specify the overall logical
structure of the database (closer to human perception).
They describe what data is stored in the database. Examples
are:
1. Object-based data models: these models are closer to
human perception and farther from system perception.
Examples are: 1) Entity-Relationship, and 2) Object-Oriented
models.
2. Record-based data models: concepts that may be understood
by end users but are not too far from the way data is
organized within the computer. Examples are: 1) Relational,
2) Network, and 3) Hierarchical models.
- Low-level or physical data models: describe data at the lowest
level (closer to system perception). They describe how
data is stored in the database (e.g., record formats and
orderings). Examples are: 1) Unifying and 2) Frame memory
models.
Instance of the database: the collection of information stored
in the database at a particular moment in time (changes
frequently).
Database schema: the overall design of the database (changes
infrequently). Analogy: the schema is like a variable's declaration,
and an instance is like the variable's content at a given moment.
Database management systems architecture:
(Note: The only data that actually exist is at the physical level.)
Data Independence:
1. Physical data independence: Modify the physical scheme
(data structures, e.g., B-tree or hash index) without causing
application programs to be rewritten. Such modifications
are needed to improve performance or to accommodate new
software releases. Most relational vendors support this kind of
data independence.
2. Logical data independence: Modify the conceptual scheme
(e.g., add a new attribute to a table, rename an attribute)
without causing application programs to be rewritten. This
kind of data independence is harder to achieve.
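The following sketch shows both kinds of independence on a toy scale, using SQLite through Python's `sqlite3` module. The `student` table, the index name, and the sample rows are assumptions made for illustration. The same query text is run before and after a physical change (adding an index) and a logical change (adding an attribute).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ss INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Tony", 3.6), (2, "Pamu", 2.9), (3, "Joe", 3.2)])

# The "application program": it names only the attributes it needs.
QUERY = "SELECT name FROM student WHERE gpa > 3.0 ORDER BY name"

before = conn.execute(QUERY).fetchall()

# Physical change: a new access structure; the query text is untouched.
conn.execute("CREATE INDEX student_gpa_idx ON student (gpa)")
after_physical = conn.execute(QUERY).fetchall()

# Logical change: a new attribute; the query still runs unchanged.
conn.execute("ALTER TABLE student ADD COLUMN address TEXT")
after_logical = conn.execute(QUERY).fetchall()
```

Adding an attribute is the easy logical change. Renaming `gpa` would break `QUERY` unless a view preserved the old name, which is one reason logical independence is harder to achieve.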
There are several languages associated with a database:
1. Data Definition Language (DDL): The database scheme is
specified by a set of definitions expressed in a special
language called the DDL. The result of compiling DDL
statements is a set of tables stored in a file called the data
catalog. This file contains metadata (data about the data stored
in the database).
2. Data Manipulation Language (DML): a language that enables
users to access or manipulate data (retrieve, insert, replace,
delete) as organized by a certain data model. We will
look at a commercial DML named SQL. In general, there are
two types of DML:
(a) Procedural: Describes what data is needed and how to
get it: e.g., relational algebra
(b) Non-procedural: Describes what data is needed without
specifying how to get it: e.g., tuple relational calculus
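To make the procedural versus non-procedural distinction concrete, the sketch below answers the same question ("students with GPA > 3.0") in both styles. The data and names are made up for illustration. The Python loop stands in for a procedural program that spells out how to scan the data, while the SQL text only states what is wanted.

```python
import sqlite3

students = [(1, "Tony", 3.6), (2, "Pamu", 2.9), (3, "Joe", 3.2)]

# Procedural style: the program spells out how to find the answer.
procedural = []
for ss, name, gpa in students:
    if gpa > 3.0:
        procedural.append(name)

# Non-procedural style: state what is wanted; the DBMS chooses the access path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ss INTEGER, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", students)
declarative = [row[0] for row in
               conn.execute("SELECT name FROM student WHERE gpa > 3.0")]
```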
There are several kinds of users associated with a system:
1. Database administrator: defines schemas, storage structures
and access methods, physical organization, authorization, and
integrity constraints.
2. Application programmers: they write programs and make
them available to the end users
3. Sophisticated users: they use a query language (SQL) to
access the database interactively
4. Naive (end) users: they invoke the application programs
Session 2: E-R Data Model (CH-2)
CSCI-585 Tuesday Jan. 16, 2001
Cyrus Shahabi
• Data Model: A collection of conceptual tools for describing data, data
relationships, data semantics & consistency constraints. Example: Entity
Relationships Data Model (E-R model)
• Based on a perception of a real world which consists of a set of basic
objects called entities and relationships among these objects.
• Entity: An entity is an object that exists and is distinguishable from other
objects (e.g., Tony with SS#, csci585 in Spring-2001, …)
• Entity set: A set of entities of the same type (e.g., students, courses).
Presented as:
Student
• Entity sets need not be disjoint (e.g., Pamu (TA) is both a student and
an employee of USC)
• Attributes: An entity is represented by a set of attributes (e.g., student:
name, SS#, age, address, …). An attribute can be considered a function
that maps an entity into a domain (e.g., SS#: entity -> integer).
Presented as:
• Relationship: A relationship is an association among several entities
(e.g., the enrolled relationship associates Tony with csci-585)
• Relationship set: A set of relationships with the same type. Formal
definition:
Presented as: Enrolled
• Role: The function that an entity plays in a relationship is called its role.
Normally implicit.
• Recursive relationship: Same entity set participates more than once in a
relationship in different roles. Role names, hence, become essential:
Manage Employee
• A relationship may also have descriptive attributes:
Enrolled
Students Courses
• Degree of relationship: Number of participating entity sets in a
relationship.
• Binary relationships: A relationship that involves two entities
(e.g., enrolled)
• Ternary (N-ary) relationships: A relationship that involves three
(N) entities (e.g., Tony enrolled in csci-585 at USC)
• It is always possible to replace a non-binary relationship set by a number
of distinct binary relationship sets (e.g., Tony enrolled in csci-585,
csci-585 is-offered at USC). Hence, we can restrict ER to include only
binary relationships.
• Mapping cardinalities (an example of defining constraints in a data
model): The number of entities to which another entity can be associated
via a relationship set (depends on the real world that is being modeled by
the relationship set).
• The possible mapping cardinalities of a relationship set R between two
entity sets A and B (binary) are:
• One-to-one (1:1): Women marrying Men (assuming no
polygamy!)
• Many-to-one (N:1): Children having mothers
• One-to-many (1:N): Mothers having children
• Many-to-many (M:N): Students enrolled in courses
(From now on, when I say entity or relationship, I mean entity set and
relationship set)
• Keys: Entities and relationships are distinguishable using various keys.
• Superkey: A combination of one or more attributes that allows us to
uniquely identify an entity in an entity set (e.g., SS#, name & SS#).
• Candidate key: A minimal superkey (no proper subset is a superkey) that
uniquely identifies an entity (e.g., SS#, name & address, phone#).
• Primary key: A candidate key chosen by the DBA to identify entities of an
entity set (e.g., SS#).
SS#
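The superkey and candidate-key definitions can be checked mechanically against a sample of entities. The sketch below is illustrative only: the helper names and the three sample students are assumptions. Note also that a key is a property of the schema, so a sample can refute a proposed key but can never prove one.

```python
from itertools import combinations

def is_superkey(rows, attrs):
    """True if no two rows agree on every attribute in attrs (for this sample)."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True

def candidate_keys(rows, all_attrs):
    """Minimal superkeys of this sample, smallest attribute sets first."""
    found = []
    for size in range(1, len(all_attrs) + 1):
        for attrs in combinations(all_attrs, size):
            if any(set(k) <= set(attrs) for k in found):
                continue  # contains a smaller key, so it is not minimal
            if is_superkey(rows, attrs):
                found.append(attrs)
    return found

students = [
    {"ss": 1, "name": "Tony", "address": "LA"},
    {"ss": 2, "name": "Tony", "address": "SF"},
    {"ss": 3, "name": "Pamu", "address": "LA"},
]
keys = candidate_keys(students, ("ss", "name", "address"))
```

On this sample `("ss", "name")` is a superkey but not a candidate key, because `("ss",)` alone already identifies every student.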
• Weak entity set: An entity set that does not have enough attributes to
form a primary key (e.g., transaction#, date, amount. Different accounts
might have similar transaction#).
Transaction
• Strong entity set: One with a primary key.

• How to distinguish entities of a weak entity set?
• Discriminator: Set of attributes in a weak entity set that allows
distinguishing among all those entities in the entity set that depend on
one particular strong entity (e.g., transaction# is unique within the same
account#)
• Primary key of a weak entity set is formed by the primary key of the
strong entity set on which it is existence-dependent (termed the owner
entity set), plus its discriminator (e.g., account# + transaction#).
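A minimal sketch of the account/transaction example, using SQLite through Python's `sqlite3` module. Table and column names (`account`, `txn`, `account_no`, `txn_no`) are assumed for illustration. The weak entity set's primary key is the owner's key plus the discriminator, so the same transaction number may recur under different accounts but not within one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE account (account_no INTEGER PRIMARY KEY, balance REAL)")
conn.execute("""
    CREATE TABLE txn (
        account_no INTEGER NOT NULL REFERENCES account(account_no),  -- owner key
        txn_no     INTEGER NOT NULL,                                 -- discriminator
        txn_date   TEXT,
        amount     REAL,
        PRIMARY KEY (account_no, txn_no)
    )
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(10, 500.0), (20, 75.0)])
conn.execute("INSERT INTO txn VALUES (10, 1, '2001-01-16', 50.0)")
conn.execute("INSERT INTO txn VALUES (20, 1, '2001-01-16', 20.0)")  # same txn_no, other account

try:
    conn.execute("INSERT INTO txn VALUES (10, 1, '2001-01-17', 5.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # (10, 1) already exists

txn_count = conn.execute("SELECT COUNT(*) FROM txn").fetchall()[0][0]
```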
• Attributes for relationships: Can be migrated to one of the participating
entity sets for 1:1, 1:N and N:1 relationships (which one?). But NOT for
M:N relationships.
• Attribute types:
• Composite vs. simple: A composite attribute is useful when we sometimes
need to refer to the entire attribute as a unit and sometimes to each of
its components:
Street City State Zipcode
Address
• Multivalued vs. single-valued: A multivalued attribute has a set of values
for the same entity.
Degrees
• Derived vs. stored: The value of a derived attribute can be computed from
either other attributes (e.g., age from DOB) or related entities (e.g.,
NumberOfEmployees).
Age
• Null attribute: Used when a value is not applicable for an attribute of a
particular entity (e.g., ApartmentNumber, Degrees); or the value
exists but is missing (e.g., null value for Weight of a person); or it
is not known whether the value exists (e.g., null value for phone#).

Session 3: Relational Data Model (CH-4)
CSCI-585 Thursday Jan. 18, 2001
Cyrus Shahabi
• The relational data model represents the database as a collection of
tables, termed relations, each with a unique name (not to be
confused with the relationships in the ER model).
• A row (also termed a tuple or a record) in a table represents a
collection of related values.
• Each value corresponds to an attribute or a field.
• A column of a table represents an attribute.
• Since a table is a collection of such relationships among values, the
concept of table is very similar to the mathematical concept of relation.
• Given a relation with attributes (A1:D1, A2:D2, …, An:Dn), the
domain of this relation is D1 x D2 x … x Dn. The relation (or
table) itself is a subset of this domain (e.g., student (SS#:
integer, name: string, gpa: float) has domain integer x string x
float).
• A tuple variable t refers to a tuple in the relation. One may
access the value of an attribute contained in a tuple variable by
t[attr-name] (e.g., t[SS#]).
• Review:
• Database scheme describes the logical design of the database
(e.g., the table frame).
• Database instance is the data in the database at a given instant in
time (e.g., the contents of the table).
Various types of constraints:
• Domain constraints: Value of each attribute A must be an
atomic value from the domain for that attribute (e.g., age:
integer; 4 is acceptable and 4.3 is not!).
• Key constraints: Since a relation is a set of tuples, by definition
all tuples in a relation must be distinct. That is, no two
tuples can have the same combination of values for all their
attributes. The same concepts of superkey, candidate key and
primary key apply here.
• Entity integrity constraint: No primary key value can be null.
(Issue: Different attribute names may refer to the same real-world
concept and cannot be forced to have the same name. Meanwhile,
the same attribute name might refer to different real-world
concepts. Hence, some sort of regulation (or constraint) is required.)
• Referential integrity constraint: A tuple in one relation that
refers to another relation must refer to an existing tuple in that
relation (e.g., value of DNO for EMPLOYEE must match the
Dnumber value of some tuple in DEPARTMENT).
• A set of attributes FK in R1 is a foreign key of R1 if:
1. The attributes in FK have the same domains as the primary key
attributes PK of relation R2.
2. A value of FK in a tuple t1 of R1 either occurs as a value
of PK for some tuple t2 in R2 or is null.
• A foreign key can refer to its own relation (e.g., manager_ss#).
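The referential integrity rules above can be tried out with SQLite through Python's `sqlite3` module. This is a sketch under assumptions: the column names are adapted (`ss`, `dno`, `manager_ss`), the sample rows are invented, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE department (dnumber INTEGER PRIMARY KEY, dname TEXT)")
conn.execute("""
    CREATE TABLE employee (
        ss         INTEGER PRIMARY KEY,
        name       TEXT,
        dno        INTEGER REFERENCES department(dnumber),
        manager_ss INTEGER REFERENCES employee(ss)   -- refers to its own relation
    )
""")
conn.execute("INSERT INTO department VALUES (1, 'Toy')")
conn.execute("INSERT INTO employee VALUES (100, 'Tony', 1, NULL)")   # null FK is allowed
conn.execute("INSERT INTO employee VALUES (101, 'Pamu', NULL, 100)")

def try_insert(row):
    """Return True if the DBMS accepts the employee row, False if it is rejected."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

dangling_dept = try_insert((102, "Joe", 9, 100))   # department 9 does not exist
dangling_mgr = try_insert((103, "Bob", 1, 999))    # employee 999 does not exist
employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchall()[0][0]
```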
ER-to-Relational Mapping
• Strong entity set with attributes a1, a2, …, an: represent it as a
table with n distinct columns (one column per attribute).
Example: ….
Each row in this table corresponds to one entity of the entity set.
We may add/delete/modify rows in the table.
• Weak entity set with attributes a1, a2, …, an and an owner entity
set with primary key b1, b2, …, bm: represent it as a table with
n+m columns, one for each of {a1, a2, …, an} U {b1, b2, …, bm}.
b1, b2, …, bm is the foreign key of the resulting relation
referring to the corresponding relation of the owner entity set.
Example:
• (Idea: keep rows unique.)
• N-ary relationship set R with attributes a1, a2, …, an among
entity sets Ei (say m entity sets): represent it as a table with
n+m columns (assuming single-attribute primary keys), one for each of
{a1, a2, …, an} U {prim-key(E1), prim-key(E2), …, prim-key(Em)}.
• Binary relationship set R with attributes a1, a2, …, an among
entity sets corresponding to relations S and T:
• If 1:1 then choose either relation (say S) and extend it with
prim-key(T) U {a1, a2, …, an}
• If 1:N or N:1 then choose the N-side relation (say S) and
extend it with prim-key(T) U {a1, a2, …, an}
• If N:M then create a new relation as:
prim-key(S) U prim-key(T) U {a1, a2, …, an}
• For a multivalued attribute A of entity set S, create a new relation
as: A U prim-key(S)
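The binary-relationship rules can be written down as a small function. This is only a sketch: the function `map_binary`, the dictionary layout for entity sets, and the sample entity sets are all hypothetical, and the function returns column lists rather than creating tables.

```python
def map_binary(s, t, cardinality, rel_name, rel_attrs=()):
    """Return {relation name: column list} after mapping a binary relationship.

    s and t describe entity sets as {"name": ..., "key": [...], "attrs": [...]}.
    For "1:N" or "N:1" the caller passes the N-side entity set as s.
    """
    tables = {s["name"]: s["key"] + s["attrs"],
              t["name"]: t["key"] + t["attrs"]}
    if cardinality in ("1:1", "1:N", "N:1"):
        # Extend the chosen relation S with prim-key(T) U {a1, ..., an}.
        tables[s["name"]] = tables[s["name"]] + t["key"] + list(rel_attrs)
    elif cardinality == "M:N":
        # New relation: prim-key(S) U prim-key(T) U {a1, ..., an}.
        tables[rel_name] = s["key"] + t["key"] + list(rel_attrs)
    else:
        raise ValueError(f"unknown cardinality: {cardinality}")
    return tables

student = {"name": "Student", "key": ["sid"], "attrs": ["sname"]}
course = {"name": "Course", "key": ["course_no"], "attrs": ["title"]}
emp = {"name": "Emp", "key": ["ss"], "attrs": ["ename"]}
dept = {"name": "Dept", "key": ["dno"], "attrs": ["dname"]}

enrolled = map_binary(student, course, "M:N", "Enrolled", ["grade"])
works_in = map_binary(emp, dept, "N:1", "WorksIn", ["since"])
```

In the N:1 case no `WorksIn` table is produced: the department key and the relationship attribute `since` migrate into `Emp`, the N-side relation.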
Session 4: EER: Extended (or Enhanced) ER Model
(CH-3)
CSCI-585 T&TH Jan. 23&25, 2001
Cyrus Shahabi
• Example ER for a real-world problem

• Generalization is the result of computing the union of two or
more entity sets (or subclasses) to produce a higher-level entity
set (or superclass). It represents the containment relationship
that exists between the higher-level entity set and one or more
lower-level entity sets.
• Specialization constructs the lower-level entity sets that are
subsets of a higher-level entity set. Specialization is the reverse of
generalization (for the remainder of this session we focus on
specialization without loss of generality). Example:
(Attribute inheritance)
• There might exist many specializations of the same entity set
based on different distinguishing characteristics. Hence, an
entity can be a member of a number of subclasses. Example:
• An entity cannot exist merely as a member of a subclass
without also being a member of the superclass. However, it is not
essential that every entity in a superclass be a member of some subclass.
• Why specialization:
1. Define a set of subclasses of an entity set.
2. Associate additional specific attributes with each subclass.
3. Establish additional specific relationship sets between each
subclass and other entity sets.
Different types of specialization
• Predicate-defined (or condition-defined): Determine subclass
membership by examining the value of a specific attribute
(termed the defining attribute).
• User-defined: The user specifies subclass membership
individually for each entity.
• Disjoint: An entity can be a member of at most one of the
subclasses.
• Overlap: The subclasses are not disjoint.
• Total: Every entity in the superclass must be a member of some
subclass.
• Partial: An entity might belong to no subclass.
EER-to-Relational Mapping
• Option 1: One table for the superclass + one table for each
subclass, consisting of the subclass's own attributes plus
the primary key of the superclass.
• Option 2: Same as option 1, but without creating a table for the
superclass; each subclass table also carries the superclass attributes:
(Only if specialization is both disjoint and total)
• Option 3: A single table including all the attributes of the
superclass and all subclasses, plus an extra type attribute t to
indicate the subclass to which each tuple belongs:
(Only if specialization is disjoint; null value for t if partial; t can
be the defining attribute for the predicate-defined specialization)
• Option 4: Same as option 3 except there are m Boolean type
attributes, one for each of the m subclasses:
(This option can support overlap specialization)
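To compare the options, the sketch below builds Option 1 and Option 3 side by side in SQLite (through Python's `sqlite3` module). The superclass `employee` and the subclasses `secretary` and `engineer`, with their attributes, are assumed for illustration and do not come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 1: superclass table plus one table per subclass sharing its key.
    CREATE TABLE employee  (ss INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE secretary (ss INTEGER PRIMARY KEY REFERENCES employee(ss),
                            typing_speed INTEGER);
    CREATE TABLE engineer  (ss INTEGER PRIMARY KEY REFERENCES employee(ss),
                            eng_type TEXT);

    -- Option 3: one table, a type attribute t, nulls for inapplicable columns.
    CREATE TABLE employee_flat (ss INTEGER PRIMARY KEY, name TEXT, t TEXT,
                                typing_speed INTEGER, eng_type TEXT);

    INSERT INTO employee  VALUES (1, 'Tony'), (2, 'Pamu'), (3, 'Joe');
    INSERT INTO secretary VALUES (1, 60);
    INSERT INTO engineer  VALUES (2, 'software');

    INSERT INTO employee_flat VALUES
        (1, 'Tony', 'secretary', 60,   NULL),
        (2, 'Pamu', 'engineer',  NULL, 'software'),
        (3, 'Joe',  NULL,        NULL, NULL);   -- partial: member of no subclass
""")

# Option 1 needs a join to reach inherited attributes; option 3 does not.
opt1 = conn.execute("""SELECT e.name, g.eng_type
                       FROM employee e JOIN engineer g ON g.ss = e.ss""").fetchall()
opt3 = conn.execute("""SELECT name, eng_type FROM employee_flat
                       WHERE t = 'engineer'""").fetchall()
```

The trade-off is visible in the data: Option 3 avoids the join but stores nulls for every attribute that does not apply to a tuple's subclass.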
Session 5: SQL (CH-7)
CSCI-585 Tuesday Jan. 30, 2001
Cyrus Shahabi
(Disclaimer: Some example queries are covered, but you need to go read
the book and do more exercise on your own, not everything is covered!)
Emp (SS#, name, age, salary, dno)
Dept (dno, dname, floor, mgrSS#)
• Structured Query Language (SQL) consists of four basic
commands: Select, Insert, Update, and Delete.
• The select command has the following syntax:
select [distinct] target-list
from tuple variable list
[where qualification]
[group by target list subset]
[having set-qualification]
[order by target list subset]
The items in square brackets ([ ]) are optional.
• The result of a select query is a relation.
• Target list determines the attributes of the resulting relation.

• Asterisk * can be used in target list to choose all the
attributes.
select *
from Emp
• To eliminate duplicates from the result, use the keyword
distinct in the target list. For example, the following query
shows all distinct salaries (with no duplicates):
select distinct e.salary

from Emp e
• Qualification determines a condition that is met by the tuples of
the resulting relation. For example, the "salary" attribute of all
the tuples in the following resulting relation is greater than 50k.
select *
from Emp e
where e.salary > 50000
• Conceptualize the execution of a select query as the following steps:
1) evaluate qualification to locate qualifying tuples,
2) project out columns of these tuples using the target-list, and
3) make the resulting tuples into a relation and eliminate duplicates
if necessary.
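The three conceptual steps can be mimicked in a few lines of Python. This is a sketch of the conceptual model only, not of how a real query processor works; the function `run_select` and the sample tuples are made up for illustration.

```python
def run_select(relation, qualification, target_list, distinct=False):
    """Evaluate a single-relation select following the three conceptual steps."""
    # Step 1: evaluate the qualification to locate qualifying tuples.
    qualifying = [t for t in relation if qualification(t)]
    # Step 2: project out the columns named in the target list.
    projected = [tuple(t[a] for a in target_list) for t in qualifying]
    # Step 3: eliminate duplicates if requested (keeps first occurrences).
    if distinct:
        projected = list(dict.fromkeys(projected))
    return projected

emp = [
    {"ss": 1, "name": "Tony", "salary": 60000},
    {"ss": 2, "name": "Pamu", "salary": 60000},
    {"ss": 3, "name": "Joe", "salary": 40000},
]

def high_paid(t):
    return t["salary"] > 50000

with_duplicates = run_select(emp, high_paid, ["salary"])
without_duplicates = run_select(emp, high_paid, ["salary"], distinct=True)
```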
• Qualification list can be a Boolean combination (and, or, not) of
selection and/or join clauses.
• A selection clause is a comparison between an attribute of a tuple
variable (e.g., e.salary) and a constant. The comparison operator is
one of {=, <>, <=, >=, >, <}.
select *
from Emp e
where e.salary > 50000
• A join clause is a comparison between attributes of two tuple
variables ranging over different relations. Once again, the comparison
operator is one of {=, <>, <=, >=, >, <}.
Find the name and department name of all the employees:

select e.name, d.dname
from Emp e, Dept d
where e.dno = d.dno
• Join and selection clauses combined.
Find the names of all employees who work in the Toy
department.
select e.name
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Toy'
• SQL provides set operations: union, intersect, minus (called except in
the SQL standard).
Example: Retrieve the social security numbers of those employees
who work in both the shoe and toy departments.
(for this example assume: Emp (SS#, name, age, salary, dno))
(select distinct e.SS#
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Toy')
intersect
(select distinct e.SS#
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Shoe')
Note that the following query is WRONG!
select distinct e.SS#
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Toy' and
d.dname = 'Shoe'
This is because the department name of a single Dept tuple has a single
value (either Toy or Shoe, never both). The result of this query is an
empty relation.
(Note: to find employees who work in either the shoe or the toy
department, use union instead of intersect.)
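The difference between the correct and the wrong formulation is easy to see on sample data. The sketch below uses SQLite through Python's `sqlite3` module and assumes that Emp holds one tuple per (employee, department) pair, so that SS# alone is not a key. That assumption, the lowercase table names, and the rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dno INTEGER PRIMARY KEY, dname TEXT);
    -- Assumption: one emp row per (employee, department) pair.
    CREATE TABLE emp (ss INTEGER, name TEXT, dno INTEGER, PRIMARY KEY (ss, dno));
    INSERT INTO dept VALUES (1, 'Toy'), (2, 'Shoe');
    INSERT INTO emp VALUES (100, 'Tony', 1), (100, 'Tony', 2), (200, 'Pamu', 1);
""")

both = conn.execute("""
    SELECT e.ss FROM emp e, dept d WHERE e.dno = d.dno AND d.dname = 'Toy'
    INTERSECT
    SELECT e.ss FROM emp e, dept d WHERE e.dno = d.dno AND d.dname = 'Shoe'
""").fetchall()

# The WRONG query: no single dept tuple is named both 'Toy' and 'Shoe'.
wrong = conn.execute("""
    SELECT DISTINCT e.ss FROM emp e, dept d
    WHERE e.dno = d.dno AND d.dname = 'Toy' AND d.dname = 'Shoe'
""").fetchall()

either = conn.execute("""
    SELECT e.ss FROM emp e, dept d WHERE e.dno = d.dno AND d.dname = 'Toy'
    UNION
    SELECT e.ss FROM emp e, dept d WHERE e.dno = d.dno AND d.dname = 'Shoe'
""").fetchall()
```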

• SQL has a set membership operator: in.
Example: select those employees in the shoe department who earn
more than 50,000 and who also work for the toy department.
(for this example assume: Emp (SS#, name, age, salary, dno))
select e.name
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Shoe' and e.salary > 50000
and e.SS# in (
select e2.SS#
from Emp e2, Dept d2
where e2.dno = d2.dno and d2.dname = 'Toy')
Note: not in can be used for the negation of in (… who do not work
for the toy department).
• SQL has set comparison operators: contains, some, all. It is
important to note that these are Boolean operators that produce
either True or False as their final output. These operators do
NOT produce a set of tuples as output. Set comparison
operators are used in the qualification list of a query.
• A contains B means set A is a superset of set B.
Example: Social security numbers of employees who manage
all the departments on the first floor.
select d1.mgrSS#
from Dept d1
where (
(select d2.dname
from Dept d2
where d1.mgrSS# = d2.mgrSS#)
contains
(select dname
from Dept
where floor = 1)
)
Note that such a manager may also manage departments on the second and
third floors.
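Many SQL systems, SQLite among them, have no contains operator. A common workaround, sketched here with SQLite through Python's `sqlite3` module, rewrites "A contains B" as "there is no element of B that is missing from A" using nested not exists. The column name `mgr_ss` and the sample departments are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dno INTEGER PRIMARY KEY, dname TEXT,
                       floor INTEGER, mgr_ss INTEGER);
    INSERT INTO dept VALUES
        (1, 'Toy',   1, 100),
        (2, 'Shoe',  1, 100),
        (3, 'Candy', 2, 100),
        (4, 'Book',  2, 200);
""")

# Managers for whom no first-floor department exists that they do not manage.
managers = conn.execute("""
    SELECT DISTINCT d1.mgr_ss
    FROM dept d1
    WHERE NOT EXISTS (
        SELECT 1 FROM dept f
        WHERE f.floor = 1
          AND NOT EXISTS (SELECT 1 FROM dept d2
                          WHERE d2.mgr_ss = d1.mgr_ss
                            AND d2.dname = f.dname)
    )
""").fetchall()
```

Manager 100 qualifies even though one of the managed departments (Candy) is on the second floor.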
• Find the name of employees whose salary is higher than the

salary of everyone on the first floor.
select name
from Emp
where salary > all
( select e.salary
from Emp e, Dept d
where e.dno = d.dno and d.floor=1)
• Find the name of employees whose salary is higher than the

salary of some of the employees on the second floor.
select name
from Emp
where salary > some
( select e.salary
from Emp e, Dept d
where e.dno = d.dno and d.floor=2)
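Some systems lack the all and some operators; SQLite is one. The sketch below expresses the two queries above with exists instead: "salary > all S" becomes "no element of S is greater than or equal to salary", and "salary > some S" becomes "some element of S is smaller than salary". It uses Python's `sqlite3` module, the sample data is invented, and null salaries are ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dno INTEGER PRIMARY KEY, dname TEXT, floor INTEGER);
    CREATE TABLE emp  (ss INTEGER PRIMARY KEY, name TEXT,
                       salary INTEGER, dno INTEGER);
    INSERT INTO dept VALUES (1, 'Toy', 1), (2, 'Shoe', 2);
    INSERT INTO emp VALUES
        (100, 'Tony', 40000, 1),
        (200, 'Pamu', 50000, 1),
        (300, 'Joe',  45000, 2),
        (400, 'Bob',  70000, 2);
""")

# salary > all (salaries on the first floor)
above_all = conn.execute("""
    SELECT x.name FROM emp x
    WHERE NOT EXISTS (SELECT 1 FROM emp e, dept d
                      WHERE e.dno = d.dno AND d.floor = 1
                        AND e.salary >= x.salary)
    ORDER BY x.name
""").fetchall()

# salary > some (salaries on the second floor)
above_some = conn.execute("""
    SELECT x.name FROM emp x
    WHERE EXISTS (SELECT 1 FROM emp e, dept d
                  WHERE e.dno = d.dno AND d.floor = 2
                    AND x.salary > e.salary)
    ORDER BY x.name
""").fetchall()
```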
• SQL supports five aggregate functions that can be applied to

any attribute:
1. count([distinct] X): number of [distinct] values for attribute X
2. sum([distinct] X): sum of [distinct] values for attribute X
3. avg([distinct] X): average of [distinct] values for attribute X
4. max(X): maximum value for attribute X
5. min(X): minimum value for attribute X
The result of each of these functions is a single value.
• Find the number of employees whose salary is higher than
100,000:
select count(SS#)
from Emp
where salary > 100000
• Count (*) can be used (without distinct) if the number of rows in
the resulting relation is required.
select count(*)
from Emp
where salary > 100000
• Qualified aggregates: the set on which the aggregate is applied
can be restricted by the where clause.
Find the average salary of the employees in the toy department.
select avg(e.salary)
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Toy'
• Aggregate functions can be used only in the target list of a
query or in the having clause (some versions also allow them directly
in the where clause). To compare against an aggregate in the where
clause, use a subquery.
Find the employees who make more than the average salary:
select SS#
from Emp
where salary >
(select avg(salary)
from Emp)
• Group-by: At times, one might want the system to partition the
tuples into groups and apply an aggregate function to each group.
Compute the average salary for each department.
select d.dno, d.dname, avg(e.salary)
from Emp e, Dept d
where e.dno = d.dno
group by d.dno, d.dname
• Having can be used to restrict the relevant groups.
Find the average salary of those departments with more than 10
employees.
select d.dno, d.dname, avg(e.salary)
from Emp e, Dept d
where e.dno = d.dno
group by d.dno, d.dname
having count(e.SS#) > 10
Note: The where qualification is applied before the having qualification.
The where clause works on tuples while the having clause works on
groups.
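The order in which where and having apply can be observed on a small example. This sketch uses SQLite through Python's `sqlite3` module; the data is invented and the group-size threshold is lowered to 2 so that a handful of rows suffices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp  (ss INTEGER PRIMARY KEY, salary INTEGER, dno INTEGER);
    INSERT INTO dept VALUES (1, 'Toy'), (2, 'Shoe');
    INSERT INTO emp VALUES (1, 30000, 1), (2, 50000, 1), (3, 70000, 1),
                           (4, 90000, 2);
""")

rows = conn.execute("""
    SELECT d.dno, d.dname, AVG(e.salary)
    FROM emp e, dept d
    WHERE e.dno = d.dno AND e.salary > 40000   -- applied to tuples first
    GROUP BY d.dno, d.dname
    HAVING COUNT(e.ss) >= 2                    -- then applied to groups
    ORDER BY d.dno
""").fetchall()
```

The 30000 salary is removed by where before grouping, so the Toy average is taken over two tuples only; the Shoe group is then dropped by having because it holds a single tuple.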
Session 6: SQL … (CH-7)
CSCI-585 Monday Feb. 2, 2001
Cyrus Shahabi
(Some example queries, but you need to go read the book and do more
exercise on your own, not everything is covered!)
Emp (SS#, name, age, salary, dno)

Dept (dno, dname, floor, mgrSS#)
• SQL provides commands to change the state of the database: insert,
delete, and update.
• Insert has two different forms:
1. insert into rel-name values value list
2. insert into rel-name select
To illustrate, assume the existence of two relations:
register(sid, sname, paid, course#) and CSCI585(sid, sname).
If Joe and Bob register for csci585 without having paid:
insert into register values
('666-66-6666', 'Joe', 'no', 585),
('777-77-7777', 'Bob', 'no', 585)
To insert into the CSCI585 relation all CSCI585 students who have
paid:
insert into CSCI585
select r.sid, r.sname
from register r
where r.paid = 'yes' and r.course# = 585
Note that the target list of the select command must conform to the
schema of CSCI585.
• Delete has the following syntax:
delete from rel-name where qualification
Example: Fire all those employees whose salary is less than
average.
delete from Emp
where salary < (select avg(salary)
from Emp)
Problem: The average changes as we delete! Some versions disallow
the above types of delete; some enforce the following semantics:
Step 1: execute query:
select *
from rel-name
where qualification
Step 2: remove tuples found in Step 1 from rel-name
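Why the two-step semantics matters can be shown with a small simulation in plain Python. Both functions are hypothetical illustrations rather than how any particular DBMS is implemented, and the salaries are assumed to be distinct.

```python
def delete_naive(salaries):
    """Tuple-at-a-time delete that recomputes the average after every removal."""
    remaining = list(salaries)
    for s in list(salaries):
        if s < sum(remaining) / len(remaining):
            remaining.remove(s)
    return remaining

def delete_two_step(salaries):
    """Step 1: find qualifying tuples in the original state. Step 2: remove them."""
    average = sum(salaries) / len(salaries)
    doomed = [s for s in salaries if s < average]
    return [s for s in salaries if s not in doomed]

salaries = [10, 50, 60]          # original average is 40
naive_result = delete_naive(salaries)
two_step_result = delete_two_step(salaries)
```

With the naive strategy, removing 10 raises the average to 55, so 50 is fired as well even though it was above the original average. The two-step semantics removes only 10.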
• Update command has the following syntax:

update rel-name
set target-list
where qualification
• Example: Give a 10% raise to all employees in the toy
department.
update Emp
set salary = 1.1 * salary
where SS# in (
select e.SS#
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Toy')
• What if we wanted to give a 10% raise to all employees who

earn less than average (same discussion as delete)?
update Emp
set salary = 1.1 * salary
where salary < (select avg(salary)
from Emp)
• Hence, the semantics of update are as follows:
Step 1: Execute the following two queries:
insert into del-temp
select full-target-list
from rel-name
where qualification
insert into app-temp
select extended-target-list
from rel-name
where qualification
The extended target list in our example would be:
(SS#, name, age, salary * 1.1, dno).
The full target list in our example would be:
(SS#, name, age, salary, dno).
Step 2: Remove tuples in del-temp from rel-name
Step 3: Insert tuples in app-temp into rel-name
• Order by: To sort the results of a query.
Example: List all employees in ascending order by age and

descending order by salary (default is ascending)
select SS#, name
from Emp
order by age asc, salary desc
• Views: To provide a higher level of abstraction.
Syntax: create view v as <query expression>
Example: A view of all employees working in the toy department
create view Toy_employee as
select SS#, name, salary
from Emp, Dept
where Emp.dno = Dept.dno and dname = 'Toy'
A view name can appear in any place that a relation name may
appear.
• Insertion to views: Use of null values!
• Updates on views: Work for views based on a single
relation where a candidate key of the base relation is
included in the view attributes. Forbidden (ambiguous)
on a view that is defined in terms of more than one
relation, or on views with grouping and aggregate
functions, such as:
create view AvgDeptSal (dno, dname, AvgSalary) as
select d.dno, d.dname, avg(e.salary)
from Emp e, Dept d
where e.dno = d.dno
group by d.dno, d.dname
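A view is stored as a query, not as data, so it always reflects the current base relations. The sketch below shows this with SQLite through Python's `sqlite3` module; the names and rows are illustrative. SQLite is stricter than the rule stated above: it treats every view as read-only, so even a single-relation view cannot be updated directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp  (ss INTEGER PRIMARY KEY, name TEXT,
                       salary INTEGER, dno INTEGER);
    INSERT INTO dept VALUES (1, 'Toy'), (2, 'Shoe');
    INSERT INTO emp VALUES (100, 'Tony', 40000, 1), (200, 'Pamu', 50000, 2);

    CREATE VIEW toy_employee AS
        SELECT e.ss, e.name, e.salary
        FROM emp e, dept d
        WHERE e.dno = d.dno AND d.dname = 'Toy';
""")

before = conn.execute("SELECT name FROM toy_employee ORDER BY name").fetchall()
conn.execute("INSERT INTO emp VALUES (300, 'Joe', 45000, 1)")  # change a base relation
after = conn.execute("SELECT name FROM toy_employee ORDER BY name").fetchall()

try:
    conn.execute("UPDATE toy_employee SET salary = 0")  # update through a two-relation view
    view_update_allowed = True
except sqlite3.Error:
    view_update_allowed = False
```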
• Data definition:
1. create schema s
Creates a schema!
2. create table r (A1 D1, A2 D2, …, An Dn)
Ai is an attribute name and Di is its domain (data type).
(Look into the book and Informix manual for details. Different
products have different syntaxes. You can define primary keys and
foreign keys here as well.)
3. drop table r
Gets rid of the entire relation r
4. delete from r
Deletes only the tuples but keeps the relation
5. alter table r add A D
Only in some versions (modifies the database schema)
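The difference between deleting tuples, altering a relation, and dropping it can be observed directly. This sketch uses SQLite's dialect through Python's `sqlite3` module; as noted above, other products use different syntax (SQLite writes `ALTER TABLE … ADD COLUMN`, for instance). The relation `r` and its columns are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (a1 INTEGER PRIMARY KEY, a2 TEXT)")
conn.execute("INSERT INTO r VALUES (1, 'x')")

# alter table: the schema changes; existing tuples get null for the new attribute.
conn.execute("ALTER TABLE r ADD COLUMN a3 REAL")
columns = [row[1] for row in conn.execute("PRAGMA table_info(r)").fetchall()]
new_value = conn.execute("SELECT a3 FROM r").fetchall()[0][0]

# delete: the tuples go away but the relation remains.
conn.execute("DELETE FROM r")
rows_left = conn.execute("SELECT COUNT(*) FROM r").fetchall()[0][0]

# drop table: the relation itself is removed from the catalog.
conn.execute("DROP TABLE r")
still_in_catalog = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'r'"
).fetchall()
```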
Session 7: Object-Oriented Model (AR-3: Connolly CH-17)
CSCI-585 Tuesday Feb. 6, 2001
Cyrus Shahabi
• Relational databases (2nd generation) were designed for
traditional banking-type applications with well-structured,
homogeneous data elements (vertical & horizontal homogeneity)
and a minimal fixed set of limited operations (e.g., set- & tuple-oriented
operations).
• New applications (e.g., CAD, CAM, CASE, OA, and CAP),
however, require concurrent modeling of both data and
processes acting upon the data.
• Hence, a combination of the database and software-engineering
disciplines led to the 3rd generation of database management
systems: Object Database Management Systems (ODBMS).
• Note that a classic debate in the database community is whether we
need a new model, or whether the relational model is sufficient and can
be extended to support new applications.
• People in favor of relational model argue that:
• New versions of SQL (e.g., SQL-92 and SQL3) are designed
to incorporate functionality required by new applications
(UDT, UDF, …).
• Embedded SQL can address almost all the requirements of
the new applications.
• “Object people”, however, counter-argue that in the above-
mentioned solutions, it is the application rather than the inherent
capabilities of the model that provides the required
functionality.
• Object people say there is an impedance mismatch between
programming languages (handling one row of data at a time)
and SQL (multiple row handling) which makes conversions
inefficient.
• Relational people say, instead of defining new models, let’s
introduce set-level functionality into programming languages.
• What do you think?
• Read “Evolution of Data Management” by Jim Gray.
• Read “Object-Relational DBMS – The Next Wave” by Michael
Stonebraker. (Both members of National Academy of
Engineering.)
• Other problems with RDBMS:
• Designed for short-lived transactions (long-duration transactions
are poorly supported)
• Schema changes are difficult: most organizations are locked
into their existing database structures. Taylor in 1992 said:
Organizations are unable to make these changes because they
cannot afford the time and expense required to modify their
information systems (sounds familiar? Y2K, Euro, …).
• Poor at navigational access (moving between
records/objects), but strong in content-based associative
access (e.g., try navigating your family tree with a "people" relation
in SQL!).
Object-Oriented Concepts
• Abstraction and Encapsulation: Provided by Abstract Data

Types (ADT).
• Abstraction is the process of identifying the essential aspects
of an entity and ignoring the unimportant properties. Focus
on what an object is and what it does, rather than how it
should be implemented.*
• Encapsulation (or information hiding) provides data
independence by separating the external aspects of an object
from its internal details, which is hidden from the outside
world.*
• Objects and Attributes:
• Object is a uniquely identifiable entity that contains both the
attributes that describe the state of a real-world object and
the actions that conceptualize the behavior of a real-world
object. The difference between object and entity is that
object encapsulates both state and behavior while entity only
models state.*
• Attributes (or instance variables) describe the current state of
an object (the notation for attribute: object-name.attribute-
name).*
• Object Identity: A unique system-wide identifier (Object
Identifier or OID for short) associated with an object, which is
independent of its attributes (i.e., its state) and is invisible to the
user. Hence, two objects may have the same state but different
OIDs.
• Differences between OID in ODBMS and primary-keys in
RDBMS:
1. Primary key is only unique across the relation not across
the entire system.
2. Primary key depends on the state of the object.
3. OIDs are efficient (storage-wise).
4. OIDs are fast (pointer to actual locations).
5. OIDs are invisible to the user.*
• Methods: define the behavior of the object. They can be
used to change the object’s state by modifying its attribute
values, or to query the value of the selected attributes.
A method consists of a name and a body that performs the
behavior associated with the method name
(notation: object-name.method-name).*
• Classes: A group of objects with the same attributes and
methods. Hence, the attributes and the associated methods are
defined once for the class rather than separately for each object.
• The instances of a class are those objects belonging to a
class.* (What is the difference between a class instance and
an object instance?)
• A class is also an object and has its own attributes and methods,
called class attributes and class methods, respectively.*
(e.g., constructor and destructor methods)
• Every class is an object (or an instance) of a higher-level
class called a metaclass.
• Subclasses, Superclasses and Inheritance

• Overloading
• Polymorphism and Dynamic Binding
Session 8: Object-Oriented Model … (AR-3: Connolly CH-17)
CSCI-585 Thursday Feb. 8, 2001
Cyrus Shahabi
Object-Oriented Concepts
• Abstraction and Encapsulation

• Objects and Attributes
• Object Identity
• Methods and Messages
• Classes
• Subclasses, Superclasses and Inheritance
• Subclass: A class of objects that is defined as a special case
of a more general class (the process of forming subclasses is
called specialization).
• Superclass: A class of objects that is defined as a general
case of a number of special classes (the process of forming a
superclass is called generalization).
• All instances of a subclass are also instances of its superclass.
• Inheritance: By default, a subclass inherits all the properties
of its superclass (or it can redefine some (or all) of the
inherited methods). Additionally, it may define its own
unique properties.
1. Single inheritance: When a subclass inherits from no more
than one superclass (note: forming class hierarchies is
permissible here).
2. Multiple inheritance: When a subclass inherits from more
than one superclass (note: a mechanism is required to
resolve conflicts when the superclasses have the same
attributes and/or methods). Due to its complexity, not all
OO languages and database systems support this concept.
3. Repeated inheritance: A special case of multiple
inheritance where the multiple superclasses inherit from a
common superclass (note: must ensure that subclasses do
not inherit properties multiple times).
• Overriding: To redefine an inherited property by defining the
same property differently at the subclass level.
• Overloading: A general case of overriding where the same
method name is reused within a class definition or across
class definitions (overriding being the case in which a subclass
redefines an inherited method). Hence, a single message can perform
different functions depending on which object receives it and,
if appropriate, what parameters are passed to the method (e.g.,
print method for different objects).
• Polymorphism: “Having many forms” in Greek, is a general
case of overloading.
1. Inclusion polymorphism: Same as overriding.
2. Operation (or ad hoc) polymorphism: Same as overloading.
3. Parametric polymorphism (or Genericity): It uses types as
parameters in generic type (or class) definition.
• Binding: The process of selecting the appropriate method
based on an object’s type. If the determination of an
object’s type can be deferred until runtime (rather than
compile time), the selection is called dynamic or late
binding.
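The concepts above (overriding, overloading, parametric polymorphism, dynamic binding) can be sketched in Java. This is only an illustration: the class names Person, Staff, and Registry and the method describe are hypothetical and do not come from the course material.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration only.
class Person {
    protected final String name;

    Person(String name) {
        this.name = name;
    }

    // A method that subclasses may override.
    String describe() {
        return "Person " + name;
    }

    // Overloading: same method name, different parameter list.
    String describe(String prefix) {
        return prefix + " " + describe();
    }
}

// Staff is a subclass (specialization) of Person.
class Staff extends Person {
    private final int dno;

    Staff(String name, int dno) {
        super(name);
        this.dno = dno;
    }

    // Overriding: redefines the inherited method at the subclass level.
    @Override
    String describe() {
        return "Staff " + name + " (dept " + dno + ")";
    }
}

// Parametric polymorphism (genericity): the element type is a parameter.
class Registry<T extends Person> {
    private final List<T> members = new ArrayList<>();

    void add(T member) {
        members.add(member);
    }

    List<String> describeAll() {
        List<String> out = new ArrayList<>();
        for (T m : members) {
            // Dynamic (late) binding: the method is chosen by the runtime type.
            out.add(m.describe());
        }
        return out;
    }
}

class BindingDemo {
    public static void main(String[] args) {
        Registry<Person> registry = new Registry<>();
        registry.add(new Person("Ann"));
        registry.add(new Staff("Bob", 7));
        System.out.println(registry.describeAll());
        System.out.println(new Staff("Bob", 7).describe("Dr."));
    }
}
```

Note that the overloaded describe(String) is inherited unchanged by Staff, yet it still reaches the overridden describe() because the inner call is bound at runtime.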
• Developing an ODBMS:
1. Extending an existing OO language with database
capabilities (e.g., GemStone extending Smalltalk).
2. Providing extensible OO-DBMS libraries (e.g., ObjectStore).
(Class libraries that support persistence, aggregation, data types,
etc.)
3. Extending an existing database language with OO
capabilities (e.g., SQL3, ODMG standard for Object SQL).
4. Embedding ODB language constructs in a conventional host
language (e.g., O2 provides embedded extensions for C).
5. Developing a novel database data model/data language (e.g.,
SIM).
Session 9: SQL3 (CH-17 of Connolly)
CSCI-585 Tuesday Feb. 13, 2001
Cyrus Shahabi
• What we covered up to now was SQL2 or SQL92 (ISO 1992
standard). The next standard is SQL3, which incorporates some
object-oriented concepts.
• Abstract Data Types: ADT defines both attribute
specifications and function specifications.
• Encapsulation is imposed using public and private tags for
both functions and attributes.
• Syntax of defining ADTs is very similar to C++ syntax.
CREATE TYPE person (
PRIVATE
  date_of_birth DATE,
PUBLIC
  Name VARCHAR NOT NULL,
  SNO VARCHAR NOT NULL,
  Age UPDATABLE VIRTUAL GET WITH get_age SET WITH set_age,
  CONSTRUCTOR FUNCTION person (P person, N VARCHAR, S VARCHAR, DOB DATE)
  RETURNS person
  BEGIN
    SET P.Name = N;
    SET P.SNO = S;
    SET P.date_of_birth = DOB;
  END,
  DESTRUCTOR PROCEDURE remove_person (P person)
  BEGIN
    DESTROY P;
  END,
  ACTOR FUNCTION get_age (P person) RETURNS INTEGER
    RETURN /* code to calculate age */
  END,
  ACTOR FUNCTION set_age (P person, DOB DATE)
    RETURN /* set date_of_birth */
  END );
• Attribute types: in addition to regular stored attributes, one
can define virtual attributes. Virtual attributes are the same as
derived attributes (e.g., age) and are accessed using their
defined GET function and modified via their defined SET
function.
• Function types:
• Constructors: to initialize new instances of ADT
• Destructors: to release resources used by instances
• Actors: to perform all other operations (e.g., GET and SET
functions for virtual attributes).
• Object Identity: A unique OID is associated with each instance
of ADT. OID value is stored in an implicit stored attribute that
cannot be assigned or updated by users. The type of the OID
attribute is REF, which is similar to a pointer type in C++.
• Subtypes and Supertypes: are supported via the UNDER clause:
create type staff under person (<same as ADT definition>)
Function names can be overloaded.
• Data Definition: for upward compatibility to SQL92, it is still
necessary to define a table even though the table is a single
column with one ADT.
create table people (info person);
create table EMP (info staff, dno integer);
• Querying: Same syntax as SQL92, except now you can include
ADT’s attributes and actor-functions in the “where” and
“select” clauses (if they are public).
select info.Name, info.get_age(info)
from people
where info.get_age(info) > 21
Application Programming
for Relational Databases
Cyrus Shahabi
Computer Science Department
University of Southern California
shahabi@usc.edu
• Overview
• JDBC Package
• Connecting to databases with JDBC
• Executing select queries
• Executing update queries
Overview
• Role of an application: Update databases,
extract info, through:
– User interfaces
– Non-interactive programs
• Development tools (Access, Oracle):
– For user interfaces
• Programming languages (C, C++, Java,… ):
– User Interfaces
– Non-Interactive programs
– More professional
Client server architecture
• Database client:
– Connects to DB to manipulate data:
• Software package
• Application (incorporates software package)
• Client software:
– Provide general and specific capabilities
– Oracle provides different capabilities than
Sybase (its own methods, … )
• Client-Server architectures:
– 2 tier
– 3 tier
– Layer 1:
• user interface
– Layer 2:
• Middleware
– Layer 3:
• DB server
• Middleware:
– Server for client
– Client for DB
• Example: Web interaction with DB
– Layer 1: web browser
– Layer 2: web server + cgi program
– Layer 3: DB server
• Application layer (1):
– User interfaces
– Other utilities (report generator, …)
– Connect to middleware
– Can connect to DB too
– Can have more than one connection
– Can issue SQL, or invoke methods in lower layers.
• Middleware layer (2):
– More reliable than user applications
Database interaction in Access
• Direct interaction with DB
• For implementing applications
• Not professional
• Developer edition:
– Generates stand alone application
• Access application:
– GUI + “Visual Basic for Applications” code
• Connection to DB through:
– Microsoft Jet database engine
• Support SQL access
• Different file formats
– Open Database Connectivity (ODBC)
• Support SQL DBs
• Requires driver for each DB server
– Driver allows the program to become a client for DB
• Client behaves Independent of DB server
• Making data source
available to ODBC
application:
– Install ODBC driver manager
– Install specific driver for a DB
server
– Database should be registered
for ODBC manager
• How application works with data
source:
– Contacts driver manager to
request for specific data source
– Manager finds appropriate
driver for the source
Database interaction in Java
• Includes:
– java.sql package
• Set of classes
• Supports JDBC (java database connectivity?)
strategy, independent of the DB server
– Difference between JDBC and ODBC:

• JDBC driver manager is part of the application
Database interaction in
Embedded SQL
• Extension of a language (C++, C) with new
commands:
void addEmployee(char *ssn, char *lastname, char *firstname) {
    EXEC SQL
        INSERT INTO customer (ssn, lastname, firstname)
        VALUES (:ssn, :lastname, :firstname);
}
• Not legal C or C++ on its own
• Compilation is preceded by a preprocessor that translates
embedded SQL into legal C
– Advantages: ???
– Disadvantages:
Not portable between database systems
JDBC package
• Collection of interfaces and classes:
– Driver: creates a connection
– Connection: represents a connection
– DatabaseMetaData: information about the DB
server
– Statement: executing queries
– PreparedStatement: precompiled and stored
query
– CallableStatement: execute SQL stored procedures
– ResultSet: results of execution of queries
– ResultSetMetaData: meta data for ResultSet
• Each JDBC package implements the

JDBC, different strategies
• Strategies to USE
JDBC
– JDBC-ODBC bridge
• Con: ODBC must be
installed
– JDBC database client
• Con: JDBC driver for
each server must be
available
– JDBC middleware
client
• Pro: Only one JDBC
driver is required
• Application does not
need direct connection
to DB (e.g., applet)
Connecting with JDBC
• Database connection needs two pieces
– JDBC package driver class name
• Package driver provide connection to DB
– URL of the database
• JDBC package designator
• Location of the server
• Database designator, in form of:
– Server name, Database name, Username,
password, …
– Properties
Connecting to DB with JDBC
• Step 1: Find, open and load appropriate driver
• 1. Class.forName( "sun.jdbc.odbc.JdbcOdbcDriver" );
• 2. Class.forName( "oracle.thin.Driver" );
• 3. Class.forName( "symantec.dbAnywhere.driver" );
• Informs “DriverManager” that the driver is available
(registers the driver with DriverManager)
• (Example 1)
• Step 2: Make connection to the DB
• Connection conn = DriverManager.getConnection( URL, Properties );
– Properties: specific to the driver
• URL = Protocol + user
– Protocol= jdbc:<subprotocol>:<subname>
» E.g.: jdbc:odbc:mydatabase
» E.g.:
jdbc:oracle:thin://oracle.cs.fsu.edu/bighit
• (Example 1)
• Step 3: Make Statement object
– Used to send SQL to DB
• executeQuery(): SQL that returns table
• executeUpdate(): SQL that doesn’t return table
• execute(): SQL that may return a table, a row count, or multiple results
• Step 4: obtain metadata (optional)
• DatabaseMetaData object
– getTimeDateFunctions(): all date and time functions
– ….
• (Example 2)
Executing select queries
• Step 5: issue select queries
– Queries that return table as result
– Using statement object
– Uses executeQuery() method
– Return the results as ResultSet object
• Meta data in ResultSetMetaData object
– Every call to executeQuery() deletes previous
results
• (Example 2)
Executing select queries
• Step 6: retrieve the results of select queries
– Using ResultSet object
• Returns results as a set of rows
• Accesses values by column name or column number
• Uses a cursor to move between the results
• Supported methods:
– JDBC 1: scroll forward
– JDBC 2: scroll forward/backward, absolute/relative
positioning, updating results.
– JDBC 2: supports SQL99 data types(blob, clob,…)
• Meta data in ResultSetMetaData:

• Number of columns, Column names, column type name,
• (Example 2)
Executing update queries
• Step 7: issue update queries
– Queries that return a row count (integer) as result
• Number of rows affected by the query
• 0 for statements that return nothing (e.g., DDL); errors raise SQLException
– Using statement object
– Uses executeUpdate() method
– Meta data in ResultSetMetaData object
• (Example 3)
Executing update queries
• Step 8: More Advanced
– Use PreparedStatement
• faster than regular Statement
• (Example 4)
– Cursors
• forward, backward, absolute/relative positions
• (Example 5)
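Steps 2 to 8 above can be pulled together in a minimal sketch. The table people(name, age), the class JdbcSketch, and its method names are assumptions made for illustration. The URL reuses the jdbc:odbc:mydatabase example from Step 2, so main only works where a matching driver is installed (the JDBC-ODBC bridge is no longer shipped with current JDKs, and JDBC 4 drivers on the classpath are registered automatically, so Class.forName is usually unnecessary today).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class JdbcSketch {
    // Step 2: a JDBC URL has the form jdbc:<subprotocol>:<subname>.
    static String jdbcUrl(String subprotocol, String subname) {
        return "jdbc:" + subprotocol + ":" + subname;
    }

    // Steps 3, 5, 6: create a Statement, run a select, walk the ResultSet.
    // The table and column names are hypothetical.
    static List<String> selectPeople(Connection conn) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT name, age FROM people")) {
            while (rs.next()) {                  // cursor moves forward one row
                String name = rs.getString("name"); // access by column name
                int age = rs.getInt(2);              // access by column number
                rows.add(name + ":" + age);
            }
        }
        return rows;
    }

    // Steps 7, 8: a parameterized update through a PreparedStatement.
    // Returns the number of rows affected.
    static int incrementAge(Connection conn, String name) throws SQLException {
        String sql = "UPDATE people SET age = age + 1 WHERE name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        String url = jdbcUrl("odbc", "mydatabase");
        // Throws SQLException if no registered driver accepts the URL.
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println(selectPeople(conn));
            System.out.println(incrementAge(conn, "Ann") + " row(s) updated");
        }
    }
}
```

The try-with-resources blocks close the ResultSet, Statement, and Connection even when an SQLException is thrown, which the step list above leaves implicit.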
Introduction to Spatial Database
Systems
by Cyrus Shahabi
from
Ralf Hartmut Güting’s
VLDB Journal v3, n4, October 1994
Outline
• Introduction & definition
• Modeling
• Querying
• Data structures and algorithms
• System architecture
• Conclusion and summary
Introduction
• Various fields/applications require management of
geometric, geographic or spatial data:
– A geographic space: surface of the earth
– Man-made space: layout of VLSI design
– Model of rat brain
Introduction …
• Common challenge: dealing with large
collections of relatively simple geometric
objects
• Different from image and pictorial database
systems:
– Containing sets of objects in space rather than
images or pictures of a space
Definition
• A spatial database system:
– Is a database system
• A DBMS with additional capabilities for handling
spatial data
– Offers spatial data types (SDTs) in its data
model and query language
• Structure in space: e.g., POINT, LINE, REGION
• Relationships among them: (l intersects r)
– Supports SDT in its implementation
• Providing at least spatial indexing (retrieving objects
in particular area without scanning the whole space)
• Efficient algorithm for spatial joins (not simply
filtering the cartesian product)
Modeling
• WLOG assume 2-D and GIS application,
two basic things need to be represented:
– Objects in space: cities, forests, or rivers
– modeling single objects
– Space: say something about every point in
space (e.g., partition of a country into districts)
– modeling spatially related collections of
objects
Modeling …
• Fundamental abstractions for modeling
single objects:
– Point: object represented only by its location in
space, e.g., center of a state
– Line (actually a curve or polyline):
representation of moving through or
connections in space, e.g., road, river
– Region: representation of an extent in 2d-space,
e.g., lake, city
Modeling …
• Instances of spatially related
collections of objects:
– Partition: set of region objects that are
required to be disjoint (adjacency or
region objects with common
boundaries), e.g., thematic maps
– Networks: embedded graph in plane
consisting of set of points (vertices)
and lines (edges) objects, e.g.
highways, power supply lines, rivers
Modeling …
A sample (ROSE) spatial type system
EXT={lines, regions}, GEO={points, lines, regions}
• Spatial predicates for topological relationships:
– inside: geo × regions → bool
– intersect, meets: ext1 × ext2 → bool
– adjacent, encloses: regions × regions → bool
• Operations returning atomic spatial data types:
– intersection: lines × lines → points
– intersection: regions × regions → regions
– plus, minus: geo × geo → geo
– contour: regions → lines
Modeling …
• Spatial operators returning numbers
– dist: geo1 × geo2 → real
– perimeter, area: regions → real
• Spatial operations on set of objects
– sum: set(obj) × (obj → geo) → geo
– A spatial aggregate function, geometric union of all
attribute values, e.g., the union of a set of provinces determines
the area of the country
– closest: set(obj) × (obj → geo1) × geo2 → set(obj)
– Determines within a set of objects those whose spatial
attribute value has minimal distance from the geometric query
object
Modeling …
• Spatial relationships:
– Topological relationships: e.g., adjacent, inside, disjoint.
Are invariant under topological transformations like
translation, scaling, rotation
– Direction relationships: e.g., above, below, or north_of,
southwest_of, …
– Metric relationships: e.g., distance
• Enumeration of all possible topological relationships
between two simple regions (no holes, connected):
– Based on comparing two objects’ boundaries (δA) and
interiors (Ao), there are 4 intersection sets, each of which may be
empty or not, giving 2^4 = 16 combinations. 8 of these are not valid
and 2 are symmetric, so:
• 6 valid topological relationships:
disjoint, in, touch, equal, cover, overlap
Modeling …
• DBMS data model must be extended by SDTs at
the level of atomic data types (such as integer,
string), or better be open for user-defined types
(OR-DBMS approach):
relation states (sname: STRING; area: REGION; spop: INTEGER)
relation cities (cname: STRING; center: POINT; ext: REGION;
cpop: INTEGER);
relation rivers (rname: STRING; route: LINE)
Querying
• Two main issues:
1. Connecting the operations of a spatial algebra
(including predicates to express spatial
relationships) to the facilities of a DBMS
query language.
2. Providing graphical presentation of spatial
data (i.e., results of queries), and graphical
input of SDT values used in queries.
Querying …
Fundamental spatial algebra operations:
• Spatial selection: returning those objects satisfying a
spatial predicate with the query object
– “All cities in Bavaria”
SELECT cname FROM cities c WHERE c.center inside Bavaria.area
– “All rivers intersecting a query window”
SELECT * FROM rivers r WHERE r.route intersects Window
– “All big cities no more than 100 km from Hagen”
SELECT cname FROM cities c WHERE dist(c.center, Hagen.center) <
100 and c.cpop > 500k
(conjunction with other predicates and query optimization)
Querying …
• Spatial join: A join which compares any two
joined objects based on a predicate on their spatial
attribute values.
– “For each river passing through Bavaria, find all cities
within less than 50 km.”
SELECT r.rname, c.cname, length(intersection(r.route, c.ext))
FROM rivers r, cities c
WHERE r.route intersects Bavaria.area and
dist(r.route, c.ext) < 50 Km
Querying …
• Graphical I/O issue: how to determine “Window” or
“Bavaria” in previous examples (input); or how to
show “intersection(route, Bavaria.area)” or “r.route”
(output) (results are usually a combination of several
queries).
• Requirements for spatial querying [Egenhofer]:
– Spatial data types
– Graphical display of query results
– Graphical combination (overlay) of several query results
(start a new picture, add/remove layers, change order of layers)
– Display of context (e.g., show background such as a raster
image (satellite image) or boundary of states)
– Facility to check the content of a display (which query
contributed to the content)
Querying …
• Extended dialog: use pointing device to select objects within a
subarea, zooming, …
• Varying graphical representations: different colors, patterns,
intensity, symbols to different objects classes or even objects
within a class
• Legend: clarify the assignment of graphical representations to
object classes
• Label placement: selecting object attributes (e.g., population) as
labels
• Scale selection: determines not only size of the graphical
representations but also what kind of symbol be used and
whether an object be shown at all
• Subarea for queries: focus attention for follow-up queries
Data Structures & Algorithms
1. Implementation of spatial algebra in an
integrated manner with the DBMS query
processing.
2. Not just simply implementing atomic
operations using computational geometry
algorithms, but consider the use of the
predicates within set-oriented query
processing ⇒ spatial indexing or access
methods, and spatial join algorithms
Data Structures …
• Representation of a value of a SDT must be
compatible with two different views:
1. DBMS perspective:
• Same as attribute values of other types with respect to
generic operations
• Can have varying and possibly large size
• Reside permanently on disk page(s)
• Can efficiently be loaded into memory
• Offers a number of type-specific implementations of
generic operations needed by the DBMS (e.g.,
transformation functions from/to ASCII or graphic)
Data Structures …
2. Spatial algebra implementation perspective, the
representation:
• Is a value of some programming language data type
• Is some arbitrary data structure which is possibly
quite complex
• Supports efficient computational geometry
algorithms for spatial algebra operations
• Is not geared only to one particular algorithm but is
balanced to support many operations well enough
Data Structures …
• From both perspectives, the representation should
be mapped by the compiler into a single or
perhaps a few contiguous areas (to support DBMS
paging). Also supports:
• Plane sweep sequence: object’s vertices stored in a
specific sweep order (e.g., x-order) to expedite
plane-sweep operation.
• Approximations: stores some approximations as
well, e.g., MBR
• Stored unary function values: such as perimeter or
area be stored once the object is constructed to
eliminate future expensive computations.
Spatial Indexing
• To expedite spatial selection (as well as other
operations such as spatial joins, …)
• It organizes space and the objects in it in some
way so that only parts of the space and a subset
of the objects need to be considered to answer a
query.
• Two main approaches:
1. Dedicated spatial data structures (e.g., R-tree)
2. Spatial objects mapped to a 1-D space to utilize
standard indexing techniques (e.g., B-tree)
Spatial Indexing
• A fundamental idea: use of approximations: 1)
continuous (e.g., bounding box), or 2) grid.
• Filter and refine strategy for query processing:
1. Filter: returns a set of candidate objects which is a
superset of the objects fulfilling a predicate
2. Refine: for each candidate, the exact geometry is
checked
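The filter and refine strategy can be sketched as follows, assuming a window query over discs that are approximated by their bounding boxes. The classes Rect and Disc and the linear scan are illustrative assumptions; a real system would run the filter step through a spatial index rather than scanning every object.

```java
import java.util.ArrayList;
import java.util.List;

class FilterRefine {
    // Axis-aligned rectangle, used both as query window and as bounding box.
    static final class Rect {
        final double xMin, yMin, xMax, yMax;

        Rect(double xMin, double yMin, double xMax, double yMax) {
            this.xMin = xMin; this.yMin = yMin; this.xMax = xMax; this.yMax = yMax;
        }

        boolean intersects(Rect o) {
            return xMin <= o.xMax && o.xMin <= xMax
                && yMin <= o.yMax && o.yMin <= yMax;
        }
    }

    // Hypothetical spatial object: a disc standing in for an arbitrary region.
    static final class Disc {
        final String id;
        final double cx, cy, r;

        Disc(String id, double cx, double cy, double r) {
            this.id = id; this.cx = cx; this.cy = cy; this.r = r;
        }

        // Approximation: the minimum bounding rectangle of the disc.
        Rect mbr() {
            return new Rect(cx - r, cy - r, cx + r, cy + r);
        }

        // Exact geometry: compare the radius with the distance from the
        // centre to the nearest point of the window.
        boolean intersects(Rect w) {
            double nx = Math.max(w.xMin, Math.min(cx, w.xMax));
            double ny = Math.max(w.yMin, Math.min(cy, w.yMax));
            double dx = cx - nx, dy = cy - ny;
            return dx * dx + dy * dy <= r * r;
        }
    }

    // Step 1 (filter): cheap test on approximations, returns a superset.
    static List<Disc> filter(List<Disc> objects, Rect window) {
        List<Disc> candidates = new ArrayList<>();
        for (Disc d : objects) {
            if (d.mbr().intersects(window)) candidates.add(d);
        }
        return candidates;
    }

    // Step 2 (refine): exact test on each candidate.
    static List<String> refine(List<Disc> candidates, Rect window) {
        List<String> hits = new ArrayList<>();
        for (Disc d : candidates) {
            if (d.intersects(window)) hits.add(d.id);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Disc> objects = new ArrayList<>();
        objects.add(new Disc("A", 5, 5, 1));       // inside the window
        objects.add(new Disc("B", 12, 12, 2.5));   // only its box touches the window
        objects.add(new Disc("C", 20, 20, 1));     // far away
        Rect window = new Rect(0, 0, 10, 10);
        List<Disc> candidates = filter(objects, window);
        System.out.println(candidates.size() + " candidates, hits: "
            + refine(candidates, window));          // 2 candidates, hits: [A]
    }
}
```

Disc B is a false hit: its bounding box overlaps the window's corner but the disc itself does not reach it, so only the refine step rejects it.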
Spatial Indexing …
• Spatial data structures either store points or
rectangles (for line or region values)
• Operations on those structures: insert, delete,
member
• Query types for points:
– Range query: all points within a query rectangle
– Nearest neighbor: point closest to a query point
– Distance scan: enumerate points in increasing distance
from a query point.
• Query types for rectangles (relative to a query rectangle):
– Intersection query
– Containment query
• A spatial index structure organizes points into buckets.
• Each bucket has an associated bucket region, a part of
space containing all objects stored in that bucket.
• For point data structures, the regions are disjoint &
partition space so that each point belongs to
precisely one bucket.
• For rectangle data structures, bucket regions may
overlap.
[Figure: a kd-tree partitioning of 2d-space where each bucket can hold up to 3 points]
• One dimensional embedding: z-order or bit-interleaving
– Find a linear order for the cells of the grid while maintaining
“locality” (i.e., cells close to each other in space are also close to each
other in the linear order)
– Define this order recursively for a grid that is obtained by
hierarchical subdivision of space
[Figure: z-order numbering of grid cells at successive levels of subdivision (cells 0, 1; then 00, 01, 10, 11; and so on)]
• Any shape (approximated as set of cells) over the grid
can now be decomposed into a minimal number of cells
at different levels (using always the highest possible
level)
[Figure: 8 × 8 grid with both axes labeled 000 to 111; the example shape is decomposed into cells labeled 1000, 10010, and 100110]
• Hence, for each spatial object, we can obtain a set of
“spatial keys”
• Index: can be a B-tree of lexicographically ordered list
of the union of these spatial keys
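Bit interleaving can be sketched as below. The class name ZOrder and the choice to put the x bit before the y bit in each pair are assumptions; the notes do not fix the bit order, and the opposite convention works equally well as long as it is applied consistently.

```java
class ZOrder {
    // Interleaves the low `bits` bits of x and y, most significant first,
    // taking the x bit before the y bit in each pair (an assumed convention).
    static int interleave(int x, int y, int bits) {
        int z = 0;
        for (int i = bits - 1; i >= 0; i--) {
            z = (z << 1) | ((x >> i) & 1);
            z = (z << 1) | ((y >> i) & 1);
        }
        return z;
    }

    // Spatial key as a bit string of length 2 * bits, so that keys of
    // different levels can be compared lexicographically.
    static String key(int x, int y, int bits) {
        StringBuilder sb = new StringBuilder(
            Integer.toBinaryString(interleave(x, y, bits)));
        while (sb.length() < 2 * bits) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Cell (x=3, y=5) on an 8 x 8 grid: x = 011, y = 101.
        System.out.println(key(3, 5, 3));            // 011011
        // The enclosing cell one level up is (x=1, y=2) on the 4 x 4 grid.
        System.out.println(key(1, 2, 2));            // 0110
    }
}
```

The key of a coarser cell is a prefix of the keys of all cells it contains, which is why a lexicographically ordered structure such as a B-tree keeps spatially nested cells next to each other.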
• Spatial index structures for points:
[Figure: a grid file (scales X1 to X4 and Y1 to Y3, with cells mapped to buckets) and a KD-tree]
• Spatial index structures for rectangles: unlike points,
rectangles don’t fall into a unique cell of a partition and
might intersect partition boundaries
– Transformation approach: instead of k-dimensional
rectangles, 2k-dimensional points are stored using a point data
structure
– Overlapping regions: partitioning space is abandoned &
bucket regions may overlap (e.g., R-tree & R*-tree)
– Clipping: keep partitioning, a rectangle that intersects
partition boundaries is clipped and represented within each
intersecting cell (e.g., R+-tree)
• A rectangle with 4 coordinates (Xleft, Xright, Ybottom, Ytop)
can be considered as a point in 4d-space
• For illustration, consider how an interval i = (i1, i2)
with 2 coordinates can be mapped to 2d-space (as a
point):
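Intersection query with interval i:
Find all points (x, y) where:
x < i2 and y > i1
[Figure: 2d-space with i1 and i2 marked on both the X and Y axes; the qualifying points lie left of x = i2 and above y = i1]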
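The transformation approach for the 1-D case can be sketched as follows: each stored interval (lo, hi) becomes the 2-D point (x, y) = (lo, hi), and an intersection query with i = (i1, i2) becomes the range condition x < i2 and y > i1. The class and method names are hypothetical, and the linear scan stands in for a lookup in a point data structure.

```java
import java.util.ArrayList;
import java.util.List;

class IntervalAsPoint {
    // A stored interval (lo, hi) represented as the point (x, y) = (lo, hi).
    static final class Point {
        final double x, y;

        Point(double lo, double hi) {
            this.x = lo;
            this.y = hi;
        }
    }

    // Returns the positions of all stored intervals that intersect (i1, i2),
    // using the strict inequalities given in the notes.
    static List<Integer> intersecting(List<Point> points, double i1, double i2) {
        List<Integer> hits = new ArrayList<>();
        for (int k = 0; k < points.size(); k++) {
            Point p = points.get(k);
            if (p.x < i2 && p.y > i1) {
                hits.add(k);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Point> stored = new ArrayList<>();
        stored.add(new Point(1, 3));
        stored.add(new Point(4, 6));
        stored.add(new Point(7, 9));
        // Query interval (2, 5) overlaps the first two intervals only.
        System.out.println(intersecting(stored, 2, 5));   // [0, 1]
    }
}
```

The same idea carries over to rectangles: (Xleft, Xright, Ybottom, Ytop) becomes a 4-D point and the query becomes a conjunction of four such inequalities.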
Spatial Join
• Traditional join methods such as hash join or
sort/merge join are not applicable.
• Filtering cartesian product is expensive.
• Two general classes:
1. Grid approximation/bounding box
2. None/one/both operands are represented in a spatial index
structure
– Grid approximations and overlap predicate:
– A parallel scan of two sets of z-elements corresponding to
two sets of spatial objects is performed
– Too fine a grid, too many z-elements per object
(inefficient)
– Too coarse a grid, too many “false hits” in a spatial join
Spatial Join …
• Bounding boxes: for two sets of rectangles R, S find all
pairs (r,s), r in R, s in S, such that r intersects s:
– No spatial index on R and S: bb_join which uses a
computational geometry algorithm to detect rectangle
intersection, similar to external merge sorting
– Spatial index on either R or S: index join, which scans the
non-indexed operand and, for each object, uses the bounding
box of its SDT attribute as a search argument on
the indexed operand (only efficient if non-indexed
operand is not too big or else bb_join might be better)
– Both R and S are indexed: synchronized traversal of
both structures so that pairs of cells of their respective
partitions covering the same part of space are
encountered together.
System Architecture
• Extensions required to a standard DBMS architecture:
– Representations for the data types of a spatial algebra
– Procedures for the atomic operations (e.g., overlap)
– Spatial index structures
– Access operations for spatial indices (e.g., insert)
– Filter and refine techniques
– Spatial join algorithms
– Cost functions for all these operations (for query optimizer)
– Statistics for estimating selectivity of spatial selection and
join
– Extensions of optimizer to map queries into the specialized
query processing method
– Spatial data types & operations within data definition and
query language
– User interface extensions to handle graphical representation
and input of SDT values
System Architecture …
• The only clean way to accommodate these
extensions is an integrated architecture based on
the use of an extensible DBMS.
• There is no difference in principle between:
– a standard data type such as a STRING and a spatial
data type such as REGION
– same for operations: concatenating two strings or
forming intersection of two regions
– clustering and secondary index for standard attribute
(e.g., B-tree) & for spatial attribute (R-tree)
– sort/merge join and bounding-box join
– query optimization (only reflected in the cost functions)
System Architecture
Extensibility of the architecture is orthogonal to the data
model implemented by that architecture:
– Probe is OO
– DASDBS is nested relational
– POSTGRES, Starburst and Gral are extended relational models
• OO is good due to extensibility at the data type level,
but lacks extensibility at index structures, query
processing or query optimization.
• Hence, current commercial solutions are OR-DBMSs:
– NCR Teradata Object Relational (TOR)
– IBM DB2 (spatial extenders)
– Informix Universal Server (spatial datablade)
– Oracle 8i (spatial cartridges)
CSCI585-Spring2001
XML: Extensible Markup Language
Cyrus Shahabi
Computer Science Department
University of Southern California
shahabi@usc.edu
XML Overview
• XML is a meta-language, a simplified form of
SGML (Standard Generalized Markup Language)
• XML was initiated in large part by Jon Bosak of
Sun Microsystems, Inc., through a W3C working
group
• References:
1. “XML Pocket Reference,” Robert Eckstein, O’Reilly
& Associates, Inc., 1999
2. “Describing and Manipulating XML Data,” Sudarshan S.
Chawathe, Bulletin of Data Engineering, v22, n3, Sep. 99
3. “Querying XML Data,” Deutsch et al., Bulletin of Data
Engineering, v22, n3, Sep. 99
XML Overview (cont.)
• An XML compliant application generally needs
three files to display XML content:
– The XML document
• Contains the data tagged with meaningful XML
elements
– A document type definition - DTD
• Specifies the rules how elements and attributes are
logically related
– A stylesheet
• Dictates the formatting when the XML document is
displayed. Examples: CSS - cascading style sheets,
XSL - extensible stylesheet language
XML Terminology
• Element, e.g.,:
<Body>
This is text formatted according to the
body element
</Body>
– An element consists always of two tags:
• An opening tag, e.g., <Body>
• A closing tag, e.g., </Body>
• An element can have attributes, e.g.,:
<Price currency="Euro">25.43</Price>
– Attribute values must always be in quotes
(unlike HTML)
A Simple XML Document
• Example: Book description
<?xml version="1.0" standalone="no"?>
<!DOCTYPE BOOKCATALOG SYSTEM "http:// tt.com/bookcatalog.dtd">
<book>
<title>The spy who came in from the cold</title>
<author>John <lastname>Le Carre</lastname></author>
<price currency="USD">5.59</price>
<review><author>Ben</author>Perhaps one of the finest...</review>
<review><author>Jerry</author>An intriguing tale of...</review>
<bestseller authority="NY Times"/>
</book>
• Markup: Text delimited by angle brackets (< …>)
• Character Data: the rest
• Element names are not unique
– (e.g., two <review>)
• Attribute names are unique within an element
– (e.g., one “currency” attribute in price)
• Elements can be empty and hence presented concisely
– (e.g., <bestseller></bestseller> = <bestseller/>)
• An XML document is well-formed if it satisfies simple
syntactic constraints
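Well-formedness can be checked mechanically with any XML parser. The sketch below uses the JAXP DOM parser that ships with the JDK; the class name WellFormedCheck and the sample strings are assumptions for illustration, and the check covers syntax only (no DTD validation is requested).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the text parses as XML, false if the parser reports
    // a syntax error such as a missing closing tag or an unquoted attribute.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing errors to stderr;
            // fatal errors are still thrown as SAXException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<book><title>T</title>"
            + "<bestseller authority=\"NY Times\"/></book>";
        String unclosed = "<book><title>T</book>";
        String unquoted = "<price currency=USD>5.59</price>";
        System.out.println(isWellFormed(good));      // true
        System.out.println(isWellFormed(unclosed));  // false
        System.out.println(isWellFormed(unquoted));  // false
    }
}
```

The sample strings omit the DOCTYPE line on purpose, since a parser may otherwise try to fetch the external DTD.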
A Simple Document Type Definition
• Example DTD
<!ELEMENT book (title, author+, price, review*, bestseller?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA|lastname|firstname|fullname)*>
<!ELEMENT price (#PCDATA)>
<!ATTLIST price currency CDATA "USD"
source (list|regular|sale) "list"
taxed CDATA #FIXED "yes">
<!ELEMENT bestseller EMPTY>
<!ATTLIST bestseller authority CDATA #REQUIRED>
The DTD Language (0)
• Example DTD
<!ELEMENT book (title, author+, price, review*, bestseller?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA|lastname|firstname|fullname)*>
<!ELEMENT price (#PCDATA)>
– Required child elements for book element:
title, author, price
– #PCDATA: parsed character data
The DTD Language (1)

■ An XML-compliant document is composed of elements:
◆ Simple elements
    <!ELEMENT title ANY>
  • The element can contain valid tags and character data
    <!ELEMENT title (#PCDATA)>
  • The element cannot contain tags, only character data
◆ Nested elements
    <!ELEMENT book (title)>
    <!ELEMENT title (#PCDATA)>
◆ Nested and ordered elements:
    <!ELEMENT books (title,author)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
  • The order of the elements must be title, then author
◆ Nested either-or elements
    <!ELEMENT books (title|author)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
  • There must be either a title or an author element, but not both.
◆ Grouping and recurrence:
    <!ELEMENT book (title, author+, price, review*, bestseller?)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA|lastname|firstname|fullname)*>
  • ? 0 or 1 time
  • + 1 or more times
  • * 0 or more times
◆ The book declaration requires every book element to have a price sub-element
◆ The use of some element names (e.g., review, lastname) without a corresponding declaration is not an error; such elements are simply not constrained by this DTD

■ Inside a DTD we can declare an entity, which allows us to use an entity reference to substitute a series of characters, similar to macros.
◆ Format:
    <!ENTITY name "replacement_characters">
  • Example for the © symbol:
    <!ENTITY copyright "©">
◆ Usage: entity references must be prefixed with '&' and followed by a semicolon (';'):
    <copyright>
    &copyright; 2000 MyCompany, Inc.
    </copyright>
■ Parameter entity references appear only within a DTD and cannot be used in an XML document. They are prefixed with a %.
◆ Format and usage:
    <!ENTITY % name "replacement_characters">
  • Example:
    <!ENTITY % pcdata "(#PCDATA)">
    <!ELEMENT authortitle %pcdata;>

■ External entities allow us to include data from another XML document (think of an #include<...> statement in C):
◆ Format and usage:
    <!ENTITY quotes SYSTEM "http://www.stocks.com/quotes.xml">
  • Example:
    <document>
      <heading>Current stock quotes</heading>
      &quotes;
    </document>
◆ Works well for the inclusion of dynamic data.
■ Attribute declarations in the DTD. Attributes for various XML elements must be specified in the DTD.
◆ Format and usage:
    <!ATTLIST target_element attr_name attr_type default>
  • Examples:
    <!ATTLIST box length CDATA "0">
    <!ATTLIST box width CDATA "0">
    <!ATTLIST frame visible (true|false) "true">
    <!ATTLIST person marital (single | married | divorced | widowed) #IMPLIED>
  • Examples:
    <!ELEMENT price (#PCDATA)>
    <!ATTLIST price currency CDATA "USD"
                    source (list|regular|sale) "list"
                    taxed CDATA #FIXED "yes">
    <!ELEMENT bestseller EMPTY>
    <!ATTLIST bestseller authority CDATA #REQUIRED>
◆ currency, of type character data, default USD
◆ source, one of the three enumerated values, default list
◆ taxed, with the fixed value yes
◆ A fixed attribute is a special case of a default
◆ It determines that the default value cannot be changed by an XML document conforming to the DTD
◆ E.g., a book in our XML example must be taxed
■ Default modifiers in DTD attributes:

    Modifier    Description
    #REQUIRED   The attribute's value must be specified with the element.
    #IMPLIED    The attribute value can remain unspecified.
    #FIXED      The attribute value is fixed and cannot be changed by the user.

■ Datatypes in DTD attributes:

    Type        Description
    CDATA       Character data
    enumerated  A series of values of which only 1 can be chosen
    ENTITY      An entity declared in the DTD
    ENTITIES    Multiple whitespace-separated entities declared in the DTD
    ID          A unique element identifier
    IDREF       The value of a unique ID type attribute
    IDREFS      Multiple whitespace-separated IDREFs of elements
    NMTOKEN     An XML name token
    NMTOKENS    Multiple whitespace-separated XML name tokens
    NOTATION    A notation declared in the DTD
■ Example: Sales Order Document

"An order document is comprised of several sales orders. Each individual order has a number and it contains the customer information, the date when the order was received, and the items ordered. Each customer has a number, a name, street, city, state, and ZIP code. Each item has an item number, parts information and a quantity. The parts information contains a number, a description of the product and its unit price. The numbers should be treated as attributes."

■ Example: Sales Order Document DTD

    <!ELEMENT Orders (SalesOrder+)>
    <!ELEMENT SalesOrder (Customer,OrderDate,Item+)>
    <!ELEMENT Customer (CustName,Street,City,State,ZIP)>
    <!ELEMENT OrderDate (#PCDATA)>
    <!ELEMENT Item (Part,Quantity)>
    <!ELEMENT Part (Description,Price)>
    <!ELEMENT CustName (#PCDATA)>
    <!ELEMENT Street (#PCDATA)>
    <!ELEMENT ... (#PCDATA)>
    <!ATTLIST SalesOrder SONumber CDATA #REQUIRED>
    <!ATTLIST Customer CustNumber CDATA #REQUIRED>
    <!ATTLIST Part PartNumber CDATA #REQUIRED>
    <!ATTLIST Item ItemNumber CDATA #REQUIRED>
■ Example: Sales Order XML Document

    <Orders>
      <SalesOrder SONumber="12345">
        <Customer CustNumber="543">
          <CustName>ABC Industries</CustName>
          <Street>123 Main St.</Street>
          <City>Chicago</City>
          <State>IL</State> <ZIP>60609</ZIP>
        </Customer>
        <OrderDate>10222000</OrderDate>
        <Item ItemNumber="1">
          <Part PartNumber="234">
            <Description>Turkey wrench</Description>
            <Price>9.95</Price>
          </Part>
          <Quantity>10</Quantity>
        </Item>
      </SalesOrder>
    </Orders>

■ An XML document that satisfies the constraints of a DTD is said to be valid with respect to that DTD.
■ Document type declaration (in the "prolog" of an XML document):
    <!DOCTYPE BOOKCATALOG SYSTEM "http://tt.com/bookcatalog.dtd">
■ The XML document claims validity with respect to the BOOKCATALOG DTD
Extensible Stylesheet Language (XSL)

■ XSL is a language for transforming and formatting XML
■ Recently, the transformation and formatting parts of XSL were separated
■ Here, we focus on the XSL transformation language, called XSLT
■ An XSLT stylesheet is a collection of transformation rules that operate (non-destructively) on a source XML document (source tree) to produce a new XML document (result tree)
■ Each rule consists of a pattern and a template
◆ Patterns are matched against nodes of the source tree
◆ Templates are instantiated to produce part of the result tree

Example XSL

    <xsl:stylesheet
        xmlns:xsl="http://w3.org/XSL/Transform/1.0"
        xmlns="http://w3.org/TR/xhtml1"
        indent-result="yes">

◆ Declare the XSL and XHTML namespaces used by the stylesheet
◆ The XHTML namespace is made the default namespace
Example XSL …

◆ Each template element describes one transformation rule
◆ The match attribute of a template element specifies the rule pattern while its content is the template used to produce the corresponding portion of the result tree

    <xsl:template match="/">
      <html><head><title>Our New Catalog</title></head>
      <body>
        <xsl:apply-templates/>
      </body>
      </html>
    </xsl:template>

◆ The pattern "/" denotes the root of the source tree
◆ The template contains some standard XHTML header and trailer constructs
◆ The apply-templates element is a rule-processing instruction that denotes recursive processing of the contents of the matched element
◆ XSLT includes several other instructions which permit templates with constructs such as for-loops, conditional sections, and sorting

Example XSL …

    <xsl:template match="book/title">
      <h1><xsl:apply-templates/></h1>
    </xsl:template>

    <xsl:template match="book/author">
      <xsl:apply-templates/>
    </xsl:template>

◆ The pattern "book/title" matches a title element if its parent is a book element
◆ The template calls for recursive processing of the contents, enclosed in XHTML literals for bold display (...)
◆ XSL processing includes implicit rules that match elements, attributes, and character data (text) not matched by any explicit rules; these rules simply copy data from source to result tree
◆ In our example, all character data (such as the text "The spy..." in the title) is copied to the result tree
XSL Example …

    <xsl:template match="book/price">
      <xsl:apply-templates/> <xsl:apply-templates select="@*"/>
    </xsl:template>

◆ An additional apply-templates instruction extracts the currency attribute using the syntax @*

XSL Example …

    <xsl:template match="book/review[1]" priority="1.0">
      <xsl:apply-templates/>
    </xsl:template>

◆ Matches only the first review element in each book element due to the "[1]" specification
◆ The template simply copies the contents to the result tree (using recursive processing with apply-templates combined with the default rules)

    <xsl:template match="book/review" priority="0.5">
    </xsl:template>
    </xsl:stylesheet>

◆ Includes only the first review for each book
◆ We ensure that the first review for each book is processed using Rule 5 instead of Rule 6 by assigning Rule 5 a higher priority
XML-QL: For Querying XML Data

■ Motivation:
◆ Why query XML data?
◆ Why a new query language for XML?
■ Requirements for an XML query language
■ XML-QL features
■ Other XML query languages

Why query XML data?

■ Data interchange on the Internet
◆ Integrating, transforming, cleaning and aggregating XML data
■ Examples:
◆ Businesses publish data about their products & services, for customers to compare and process
◆ Business partners could exchange internal operational data between their information systems on secure channels
◆ Search robots could automatically integrate information from related sources that publish their data in XML format
  • stock quotes from financial sites
  • sports scores from news sites
Why a new query language?

■ Semi-structured: XML data is not rigidly structured
■ Self-describing: schema exists with data
■ Can naturally model irregularities
◆ Missing elements (e.g., bestseller?)
◆ Multiple occurrences of the same element (review*)
◆ Elements with atomic values in some data items and structured values in others
◆ Collections of elements with heterogeneous structure

Requirements for an XML query language

1. Precise Semantics
2. Rewritability, Optimizability
3. Query Operations
4. Compositional Semantics
5. No Schema Required
6. Exploit Available Schema
7. Preserve Order and Association
8. Mutually Embedding with XML
9. Support for New Datatypes
10. Suitable for Metadata
11. Server-side Processing
12. Programmatic Manipulation
13. XML Representation
XML-QL features

■ Running example:

    <!ELEMENT bib ((book|article)*)>
    <!ELEMENT book (author+, title, publisher)>
    <!ATTLIST book year CDATA>
    <!ELEMENT article (author+, title, year?, (shortversion|longversion))>
    <!ATTLIST article type CDATA>
    <!ELEMENT publisher (name, address)>
    <!ELEMENT author (firstname?, lastname)>

XML-QL features …

■ Selection and extraction:
◆ Books published by Addison-Wesley after 1991

    WHERE <bib> <book year=$y>
            <publisher><name>Addison-Wesley</name></publisher>
            <title> $t </title>
            <author> $a </author>
          </book> </bib> IN "www.a.b.c/bib.xml", $y > 1991
    CONSTRUCT $a

■ WHERE: what to select (somewhat similar to the where-clause in SQL)
■ CONSTRUCT: what to return (somewhat similar to the select-clause in SQL)
■ Extraction: by binding the variables $t, $a and $y, and returning only $a
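To make the selection-and-extraction semantics concrete, here is a rough Python analogue of the query, a sketch only. No XML-QL engine is involved; the pattern match is written out by hand with `xml.etree.ElementTree`. The document in `BIB_XML` is a made-up stand-in for "www.a.b.c/bib.xml", and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for "www.a.b.c/bib.xml"; entries are invented.
BIB_XML = """<bib>
  <book year="1995">
    <author><firstname>John</firstname><lastname>Smith</lastname></author>
    <title>Tractability</title>
    <publisher><name>Addison-Wesley</name><address>A1</address></publisher>
  </book>
  <book year="1990">
    <author><lastname>Early</lastname></author>
    <title>Too Old</title>
    <publisher><name>Addison-Wesley</name><address>A1</address></publisher>
  </book>
  <book year="1998">
    <author><lastname>Other</lastname></author>
    <title>Elsewhere</title>
    <publisher><name>Some Press</name><address>A2</address></publisher>
  </book>
</bib>"""

def authors_of_aw_books_after(bib, year):
    """Mimic: WHERE <bib><book year=$y> ... </> IN ..., $y > year CONSTRUCT $a."""
    result = []
    for book in bib.findall("book"):                         # binds $y, $t, $a per book
        if book.findtext("publisher/name") != "Addison-Wesley":
            continue                                         # publisher pattern fails
        if book.find("title") is None:
            continue                                         # <title> $t </title> must match
        if int(book.get("year")) <= year:
            continue                                         # condition $y > 1991 fails
        result.extend(book.findall("author"))                # CONSTRUCT $a
    return result

authors = authors_of_aw_books_after(ET.fromstring(BIB_XML), 1991)
last_names = [a.findtext("lastname") for a in authors]
```

Only the 1995 Addison-Wesley book survives both the pattern and the year condition, so only its author elements are returned.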
XML-QL features …

■ Result:

    <firstname> John </firstname> <lastname> Smith </lastname>
    <firstname> Joe </firstname> <lastname> Doe </lastname>
    <lastname> Aravind </lastname>
    <firstname> Sue </firstname> <lastname> Smith </lastname>
    ...

XML-QL features …

■ Reduction and Restructuring:

    WHERE <bib> <book year=$y> <publisher> <name>Addison-Wesley </> </>
            <title> $t </>
            <author> $a </>
          </> </> IN "www.a.b.c/bib.xml", $y > 1991
    CONSTRUCT <result> <author> $a </>
                       <title> $t </>
              </>

■ Restructuring: by the template expression in CONSTRUCT
■ Reduction: by controlling what elements are returned by CONSTRUCT (e.g., only author and title)
◆ To avoid repeating template+variable twice: ELEMENT_AS
■ Result:

    <result> <author><firstname> John </firstname> <lastname> Smith </lastname> </author>
             <title> Tractability </title> </result>
    <result> <author><firstname> John </firstname> <lastname> Smith </lastname> </author>
             <title> Decidability </title> </result>
    <result> <author><lastname> Arvind </lastname> </author>
             <title> Efficiency </title> </result>
    ...

■ More complex restructuring:
◆ Group results by book title

    WHERE <bib> <book> <title> $t </>
                       <publisher> <name> Addison-Wesley </> </>
          </> CONTENT_AS $p </> IN "www.a.b.c/bib.xml"
    CONSTRUCT <result> <title> $t </>
                WHERE <author> $a </> IN $p
                CONSTRUCT <author> $a </>
              </>

■ CONTENT_AS is like ELEMENT_AS, but binds the variable to the element's content
XML-QL features …

■ Result:
◆ The first WHERE clause binds the variable $p to the content of <book>...</book>
◆ For each such binding one <result> and one <title> element are emitted
◆ The inner WHERE clause is evaluated, which, in turn, produces one or several authors.

    <result>
      <title> An Introduction to Database Systems </title>
      <author> <lastname> Date </lastname> </author>
    </result>
    <result>
      <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
      <author> <lastname> Date </lastname> </author>
      <author> <lastname> Darwen </lastname> </author>
    </result>

XML-QL features …

■ Combination:
■ Assume another source with the following DTD:

    <!ELEMENT reviews (entry*)>
    <!ELEMENT entry (title, review)>
    <!ELEMENT review (#PCDATA)>

◆ Let's combine <book> elements (source 1) with <entry> elements (source 2)

    WHERE <bib> <book> <title> $t </> <publisher> $p </> </> </> IN "www.a.b.c/bib.xml",
          <reviews> <entry> <title> $t </> <review> $r </> </> </> IN "www.a.b.c/reviews.xml"
    CONSTRUCT <book> <title> $t </> <publisher> $p </> <review> $r </> </>

■ Join on common title $t
■ Result:
◆ The query tries every match in the first data source against every match in the second data source, and
◆ checks if they have the same title:
◆ if yes, a book is output

    <book> <title> Tractability </title> <publisher>...</publisher>
           <review>...</review> </book>
    <book> <title> Decidability </title> <publisher>...</publisher>
           <review>...</review> </book>
    ...

■ Combination (2):

    WHERE <bib> <book> <title> $t1 </> <publisher> $p </> </> </> IN "www.a.b.c/bib.xml",
          <reviews> <entry> <title> $t2 </> <review> $r </> </> </> IN "www.a.b.c/reviews.xml",
          similar($t1, $t2)
    CONSTRUCT <book> <title> $t1 </> <publisher> $p </> <review> $r </> </>

■ Instead of identical titles, the above returns "similar" items
◆ similar is a user-defined external function
XML-QL features …

■ No schema required:
◆ With "tag variables": all publications published in 1995 in which Smith is either an author or an editor:

    WHERE <bib> <$p> <title> $t </title>
                     <year> 1995 </>
                     <$e> Smith </> </>
          </> IN "www.a.b.c/bib.xml", $e IN {author, editor}
    CONSTRUCT <$p> <title> $t </title>
                   <$e> Smith </>
              </>

◆ $p is a tag variable that can be bound to any tag, e.g., book, article, etc.
◆ $e is a tag variable; the query constrains it to be bound to one of author or editor

XML-QL features …

■ No schema required (2):
◆ With "regular-path expressions"
◆ Querying nested and cyclic structures, such as trees, directed-acyclic graphs, and arbitrary graphs.
◆ By traversing arbitrary paths through XML elements:

    <!ELEMENT part (name, brand, part*)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT brand (#PCDATA)>

◆ Name of every part element that contains a brand element equal to Ford, regardless of the nesting level at which the part occurs:

    WHERE <(part)*> <name> $r </> <brand> Ford </> </> IN "www.a.b.c/bib.xml"
    CONSTRUCT <result> $r </>

◆ The wildcard matches any tag and can appear wherever a tag is permitted
Other XML Query Languages

■ Lore (Lightweight Object Repository) & Lorel
■ XSL, easy to express recursive processing:
◆ All author elements, regardless of how deep they occur in the data:

    <xsl:template> <xsl:apply-templates/> </xsl:template>
    <xsl:template match="author"> <result> <xsl:value-of/> </result>
    </xsl:template>

■ XQL: XSL match patterns + some concise syntax for constructing results
■ XML-GL: similar in expressive power to XML-QL but with a GUI
■ WebL: markup algebra + service combinators

■ Your third homework (HW#3):
Create a DTD, an XSL, and an example XML document from a given EER diagram and then query it using an XML query language of choice (suggested: XML-QL)!
Information Integration
José Luis Ambite

USC/Information Sciences Institute
Outline
• Information Integration
– Definition
– Motivating example
– Architectures
• Datalog
• Source Descriptions & Query Reformulation
– Global-as-View
– Local-as-View: Bucket Algorithm
– Source Capabilities: Recursive Rewritings
• Wrappers
• Matching Objects Across Sources
Single Interface to Multiple Sources
[Figure: decision support tools and application programs query an Information Agent, which accesses databases, knowledge bases, the Web, and computer programs.]
Motivation:
TheaterLoc Entertainment Agent
[Figure: the TheaterLoc agent and its sources: Tiger Map Server, Etak Geocoder, Zagat, CuisineNet, Hollywood.com Trailers, Yahoo Movies.]
2
TheaterLoc
The problem of providing
uniform (sources transparent to user)
access to (query, and eventually updates too)
multiple (even 2 is a problem!)
autonomous (not affect the behavior of sources)
heterogeneous (different data models, schemas)
structured (at least semistructured)
data sources (not only databases)
3
Related Technologies
• Distributed databases:
– Sources are homogeneous
– Data is distributed a priori
– Sources are not autonomous
– Similarities at the optimization and execution level
• Information retrieval:
– Keyword search, no semantics
• Data mining:
– Discovering properties and patterns in data
Principal Dimensions of Information Integration
• Virtual vs. materialized architecture
• Access: query only or query & update?
• Mediated schema
– Mediated schema requires schema integration and then query
reformulation. Two main approaches:
• Global as View
• Local as View
– Language for descriptions and queries: conjunctive queries (CQs), union
of CQs, Datalog (recursion), first-order logic (∧,∨,¬), description logics…
• Types of Sources
– Structured (DB’s) vs. semi-structured (Web)
– Source capabilities: positive and negative
4
Materialized Architecture:
Data Warehouse
Virtual Architecture:
Mediator
5
Mediator
Architecture
• User queries in
global (mediator)
schema
• Mediator translates
and decomposes
user query into
multiple source
queries
Datalog
• Datalog program = set of Datalog rules
• Datalog rule = conjunctive query
Datalog (the first line is the rule head, the remaining lines are the body):
Big-LA-buyers(buyer,seller,product, price) :-
    Person(buyer, “Los Angeles”, phone),
    Purchase(buyer, seller, product, price),
    price > 10000.
First-Order Logic:
∀ buyer, seller, product, price
    (∃ phone Person(buyer, “Los Angeles”, phone) ^
     Purchase(buyer, seller, product, price) ^
     price > 10000)
    → Big-LA-buyers(buyer,seller,product, price)
Conjunctive Queries and Views
CREATE VIEW Big-LA-buyers AS
SELECT buyer, seller, product, price
FROM Person, Purchase
WHERE Person.city = “Los Angeles” AND
Person.name = Purchase.buyer AND
Purchase.price > 10000
Big-LA-buyers(buyer,seller,product, price) :-
Person(buyer, “Los Angeles”, phone),
Purchase(buyer, seller, product, price),
price > 10000.
Datalog rule ~ view definition
Rule body ~ select-from-where construct of SQL
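The correspondence between the Datalog rule and the SQL view can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the table contents are invented sample rows, string literals use SQL single quotes, and the view is named `Big_LA_buyers` because unquoted SQL identifiers cannot contain hyphens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person(name TEXT, city TEXT, phone TEXT);
CREATE TABLE Purchase(buyer TEXT, seller TEXT, product TEXT, price REAL);

-- Hypothetical sample rows, not taken from the slides.
INSERT INTO Person VALUES ('ann', 'Los Angeles', '555-0100');
INSERT INTO Person VALUES ('bob', 'Chicago',     '555-0101');
INSERT INTO Purchase VALUES ('ann', 'acme', 'car',  25000);
INSERT INTO Purchase VALUES ('ann', 'acme', 'pen',  2);
INSERT INTO Purchase VALUES ('bob', 'acme', 'boat', 90000);

-- Same select-from-where body as the slide; only the view name is adapted.
CREATE VIEW Big_LA_buyers AS
  SELECT buyer, seller, product, price
  FROM Person, Purchase
  WHERE Person.city = 'Los Angeles'
    AND Person.name = Purchase.buyer
    AND Purchase.price > 10000;
""")

rows = conn.execute("SELECT * FROM Big_LA_buyers").fetchall()
```

Each condition in the WHERE clause plays the role of one body atom or comparison in the Datalog rule: the city constant, the shared `buyer` variable, and `price > 10000`.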
Transitive Closure
Suppose we are representing a graph by a relation Edge(X,Y):
Edge(a,b), Edge(a,c), Edge(b,d), Edge(c,d), Edge(d,e)
[Figure: the graph with nodes a, b, c, d, e and the edges listed above]
How do we answer the query:
Find all nodes reachable from a.
Recursion in Datalog
Path(X, Y) :- Edge(X, Y).
Path(X, Y) :- Path(X, Z), Path(Z, Y).
Semantics: evaluate the rules bottom-up until a fixpoint:
Iteration #0: Edge: {(a,b), (a,c), (b,d), (c,d), (d,e)}
Path: {}
Iteration #1: Path: {(a,b), (a,c), (b,d), (c,d), (d,e)}
Iteration #2: Path gets the new tuples: (a,d), (b,e), (c,e)
Iteration #3: Path gets the new tuple: (a,e)
Iteration #4: Nothing changes => stop.
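The bottom-up evaluation traced in the iterations above can be reproduced with a short Python sketch of naive fixpoint evaluation. The function name and the per-iteration history it returns are assumptions added for illustration; a real Datalog engine would typically use semi-naive evaluation instead.

```python
EDGE = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")}

def transitive_closure(edge):
    """Naive bottom-up evaluation of
         Path(X, Y) :- Edge(X, Y).
         Path(X, Y) :- Path(X, Z), Path(Z, Y).
    Returns the Path relation and the set of new tuples per iteration."""
    path = set()
    added_per_iteration = []
    while True:
        derived = set(edge)                                    # first rule
        derived |= {(x, y) for (x, z) in path
                           for (z2, y) in path if z == z2}     # second rule
        new = derived - path
        added_per_iteration.append(new)
        if not new:                                            # fixpoint: nothing changes
            return path, added_per_iteration
        path |= new

path, history = transitive_closure(EDGE)
reachable_from_a = sorted(y for (x, y) in path if x == "a")
```

Here `history` holds the tuples added in iterations #1 through #4, and `reachable_from_a` answers the query "find all nodes reachable from a".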
8
Source Descriptions
Elements of source descriptions:
• Contents: source contains movies, directors, cast.
• Constraints: only movies produced after 1965.
• Completeness: contains all American movies.
• Capabilities:
– Negative: source requires movie title or director as input
– Positive: source can perform selections, joins, …
Desiderata for source

descriptions
• Distinguish between sources with closely related
data: so we can prune access to irrelevant sources.
• Enable easy addition of new information sources:
because sources are dynamically being added and
removed.
• Be able to find sources relevant to a query:
reformulate queries such that we obtain guarantees
on which sources we access.
9
Approaches to Specification of
Source Descriptions
• Global-as-view:
– Mediator relation defined as view over source
relations
Ex: TSIMMIS (Stanford), HERMES (Maryland).
• Local-as-View:
– Source relation defined as view over mediator
relations
Ex: Information Manifold (AT&T),
Tukwila(UW), InfoMaster (Stanford).
Query Reformulation
Problem: rewrite the user query expressed in the mediated
schema into a query expressed in the source schemas.
Given a query Q in terms of the mediated-schema relations,
and descriptions of the information sources,
Find a query Q’ that uses only the source relations, such that
• Q’ |= Q (i.e., answers are correct; i.e., Q’ ⊆ Q) and
• Q’ provides all possible answers to Q given the sources.
10
Answering queries using views
• Query Containment: q’ ⊆ q ↔ ∀D q’(D) ⊆ q(D)
• Query Equivalence: q’ = q ↔ q’ ⊆ q ^ q ⊆ q’
Given query q and view definitions V={V1…Vn}
• q’ is an Equivalent Rewriting of q using V if:
– q’ refers only to views in V, and
– q’ = q
• q’ is a Maximally-Contained Rewriting of q using V if:
– q’ refers only to views in V, and
– q’ ⊆ q, and
– there is no rewriting q1, such that q’ ⊆ q1 ⊆ q and q1 ≠ q’
Global-as-View (GAV)
Each mediator relation is defined as a view
over source relations.
MovieActor(title,actor) ←
DB1(title,actor,year)
MovieActor(title, actor) ←
DB2(title,director,actor,year)
MovieReview(title, review) ←
DB1(title,actor,year) ^ DB3(title,review)
Query Reformulation in GAV
Query reformulation = rule unfolding + simplification
Query: Find reviews for ‘Brando’ movies
q(title,review) :- MovieActor(title,‘Brando’),
MovieReview(title,review)
Unfolding MovieActor with the DB1 rule gives
q’(title,review) :- DB1(title,‘Brando’,year),
DB1(title,actor,year’), DB3(title,review)
where the atom DB1(title,actor,year’) is redundant, so the rewriting simplifies to:
1. q’(title,review) :- DB1(title,‘Brando’,year),
DB3(title,review)
Unfolding MovieActor with the DB2 rule gives (redundant wrt 1):
2. q’(title,review) :-
DB2(title,director,‘Brando’,year),
DB1(title,actor,year’), DB3(title,review)
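As a sanity check on rule unfolding, the following Python sketch evaluates the query once through the mediator relations and once through the unfolded source-level rewritings, and compares the answers. The source contents (`m1`, `m2`, the reviews) are invented; only the relation schemas and rules come from the slides.

```python
# Hypothetical source contents; titles, years and reviews are made up.
DB1 = [("m1", "Brando", 1972), ("m1", "Pacino", 1972)]
DB2 = [("m2", "Coppola", "Brando", 1979)]
DB3 = [("m1", "a classic"), ("m2", "harrowing")]

def movie_actor():
    # MovieActor(title, actor) <- DB1(title, actor, year)
    # MovieActor(title, actor) <- DB2(title, director, actor, year)
    return {(t, a) for (t, a, _) in DB1} | {(t, a) for (t, _, a, _) in DB2}

def movie_review():
    # MovieReview(title, review) <- DB1(title, actor, year) ^ DB3(title, review)
    return {(t, r) for (t, _, _) in DB1 for (t3, r) in DB3 if t == t3}

def q_mediated():
    # q(title, review) :- MovieActor(title, 'Brando'), MovieReview(title, review)
    return {(t, r) for (t, a) in movie_actor() if a == "Brando"
                   for (t2, r) in movie_review() if t == t2}

def q_unfolded():
    # Rewriting 1: DB1(title, 'Brando', year), DB3(title, review)
    r1 = {(t, r) for (t, a, _) in DB1 if a == "Brando"
                 for (t3, r) in DB3 if t == t3}
    # Rewriting 2: DB2(title, director, 'Brando', year),
    #              DB1(title, actor, year'), DB3(title, review)
    r2 = {(t, r) for (t, _, a, _) in DB2 if a == "Brando"
                 for (t1, _, _) in DB1 if t == t1
                 for (t3, r) in DB3 if t == t3}
    return r1 | r2

mediated_answers = q_mediated()
unfolded_answers = q_unfolded()
```

In this toy data `m2` has a review in DB3 but does not appear in DB1, so MovieReview never exposes it and neither evaluation returns it.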
12
Local-as-View (LAV)
Each source relation is defined as a view over
mediator relations
S1: V1(title, year, director) →
Movie(title,year,director,genre) ^
American(director) ^ year ≥1960 ^
genre = ‘Comedy’
S2: V2 (title, review) →
Movie(title,year,director,genre) ^ year≥1990 ^
MovieReview(title, review)
Query Reformulation in LAV

Query: Reviews for comedies produced after 1950
q(title,review) :- Movie(title,year,director,‘Comedy’),
year ≥ 1950, MovieReview(title,review)
Reformulated query (with q’ ⊆ q):
q’(title,review) :- V1(title,year,director),
V2(title,review)
LAV vs. GAV
See [Ullman,ICDT-1997] for a detailed comparison.
• Local as View:
– Easier to add sources: specify the query expression.
– Easier to specify constraints on contents of the sources:
they are part of the query expression describing them.
• Global as view:
– Easier query reformulation
GLAV combines both (Friedman & Millstein 1999)
Query Reformulation in LAV

The Bucket Algorithm
Given: user query q, source descriptions {Vi}
1. Find relevant sources (fill buckets)
For each relation g in query q
• Find Vj that contains relation g
• Check that constraints in Vj are compatible with q
2. Combine source relations {Vj} from each
bucket into a conjunctive query q’ and
check for containment (q’ ⊆ q)
The Bucket Algorithm: Example
V1(student,number,year) → Registered(student,course,year),
Course(course,number), number ≥ 500, year ≥ 1992
V2(student,dept,course) → Registered(student,course,year),
Enrolled(student,dept)
V3(student,course) → Registered(student,course,year), year≤1990
V4(student,course,number) → Registered(student,course,year),
Course(course,number), Enrolled(student,dept), number ≤ 100
User Query (using mediator relations):

q(S,D) :- Enrolled(S,D), Registered(S,C,Y), Course(C,N),
N ≥300, Y≥1995.
1. Filling the Buckets

q(S,D) :- Enrolled(S,D), Registered(S,C,Y), Course(C,N), N ≥ 300, Y ≥ 1995

Buckets, one per subgoal of q:

    Enrolled(S,D)    Registered(S,C,Y)    Course(C,N)
    V2(S,D,C’)       V1(S,N’,Y)           V1(S’,N,Y’)
    V4(S,C’,N’)      V2(S,D’,C)
                     V4(S,C,N’)
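The bucket-filling step for this example can be sketched in Python. This is a simplified illustration, not the full algorithm: it only checks that a view mentions the subgoal's relation and that the view's range constraints overlap the query's constraints on the corresponding variables. It ignores which variables the view exports in its head. The data layout (`VIEWS`, `QUERY_ATOMS`, `QUERY_BOUNDS`) and function names are assumptions.

```python
import math

ANY = (-math.inf, math.inf)   # no constraint on a variable

# View bodies and their range constraints, transcribed from V1..V4.
VIEWS = {
    "V1": {"atoms": [("Registered", ("student", "course", "year")),
                     ("Course", ("course", "number"))],
           "bounds": {"number": (500, math.inf), "year": (1992, math.inf)}},
    "V2": {"atoms": [("Registered", ("student", "course", "year")),
                     ("Enrolled", ("student", "dept"))],
           "bounds": {}},
    "V3": {"atoms": [("Registered", ("student", "course", "year"))],
           "bounds": {"year": (-math.inf, 1990)}},
    "V4": {"atoms": [("Registered", ("student", "course", "year")),
                     ("Course", ("course", "number")),
                     ("Enrolled", ("student", "dept"))],
           "bounds": {"number": (-math.inf, 100)}},
}

# q(S,D) :- Enrolled(S,D), Registered(S,C,Y), Course(C,N), N >= 300, Y >= 1995
QUERY_ATOMS = [("Enrolled", ("S", "D")),
               ("Registered", ("S", "C", "Y")),
               ("Course", ("C", "N"))]
QUERY_BOUNDS = {"N": (300, math.inf), "Y": (1995, math.inf)}

def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])

def fill_buckets(query_atoms, query_bounds, views):
    buckets = {}
    for pred, q_args in query_atoms:
        bucket = []
        for name, view in views.items():
            for v_pred, v_args in view["atoms"]:
                if v_pred != pred:
                    continue
                # Compare constraints position by position for this subgoal.
                if all(overlaps(query_bounds.get(q, ANY), view["bounds"].get(v, ANY))
                       for q, v in zip(q_args, v_args)):
                    bucket.append(name)
        buckets[pred] = bucket
    return buckets

BUCKETS = fill_buckets(QUERY_ATOMS, QUERY_BOUNDS, VIEWS)
```

V3 is pruned from the Registered bucket because year ≤ 1990 contradicts Y ≥ 1995, and V4 is pruned from the Course bucket because number ≤ 100 contradicts N ≥ 300.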
2. Checking Containment
q(S,D) :- Enrolled(S,D), Registered(S,C,Y), Course(C,N), N ≥ 300, Y ≥ 1995
Candidate rewritings built from the buckets:
q’(S,D) :- V2(S,D,C’), V1(S,N’,Y), V1(S’,N,Y’), N ≥ 300, Y ≥ 1995
q’(S,D) :- V2(S,D,C’), V1(S,N,Y), N ≥ 300, Y ≥ 1995
Expanding the view definitions in the second candidate:
Registered(S,C’,Y) ^ Enrolled(S,D) ^
Registered(S,C’’,Y) ^ Course(C’’,N) ^ N ≥ 500 ^ Y ≥ 1992 ^
N ≥ 300 ^ Y ≥ 1995 →
Enrolled(S,D) ^ Registered(S,C’’,Y) ^ Course(C’’,N) ^ N ≥ 500 ^ Y ≥ 1995 →
Enrolled(S,D) ^ Registered(S,C,Y) ^ Course(C,N) ^ N ≥ 300 ^ Y ≥ 1995
=> q’ ⊆ q (and q’ is a maximally-contained rewriting of q)
Modeling Source Capabilities

Negative capabilities:
• A web-site may require certain inputs (in an
HTML form) to answer a query
• Need to consider only valid query execution plans.
Positive capabilities:
• A source may be ODBC-compliant database.
• Need to decide the placement of operations
according to capabilities.
Problem: how to describe and exploit source
capabilities.
16
Negative Capabilities:
Binding Patterns
Sources:
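AAAIdb^f(X) → AAAIPapers(X)
CitationDB^bf(X,Y) → Cites(X,Y)
AwardDB^b(X) → AwardPaper(X)
(The superscript is the binding pattern: b = the argument must be bound, f = the argument may be free.)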
Query: find all the award winning papers:

q(X) :- AwardPaper(X)
Recursive Rewritings
q(X) :- AwardPaper(X)
• Problem: Unbounded union of conjunctive queries
q1(X) :- AAAIdb(X), AwardDB(X)
q1(X) :- AAAIdb(X1), CitationDB(X1,X), AwardDB(X)
…
q1(X) :- AAAIdb(X1), CitationDB(X1,X2), …,
CitationDB(Xn,X), AwardDB(X)
• Solution: Recursive Rewriting
papers(X) :- AAAIdb(X)
papers(X) :- papers(Y), CitationDB(Y,X)
q’(X) :- papers(X), AwardDB(X)
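The recursive rewriting can be executed as a simple worklist loop that respects the binding patterns: AAAIdb can be enumerated freely, CitationDB is only called with a bound first argument, and AwardDB is only used as a membership test. The following Python sketch assumes made-up source contents (`p1` to `p5`) and hypothetical wrapper functions.

```python
# Hypothetical source contents; paper identifiers are invented.
AAAI_PAPERS = {"p1", "p2"}
CITES = {"p1": {"p3"}, "p3": {"p4"}}
AWARD_PAPERS = {"p4", "p5"}

def aaai_db():
    """AAAIdb^f(X): X is free, so the source can list its papers."""
    return set(AAAI_PAPERS)

def citation_db(x):
    """CitationDB^bf(X, Y): X must be bound; returns the papers X cites."""
    return set(CITES.get(x, set()))

def award_db(x):
    """AwardDB^b(X): X must be bound; answers a membership question only."""
    return x in AWARD_PAPERS

def award_winners():
    papers = set()
    frontier = aaai_db()                     # papers(X) :- AAAIdb(X)
    while frontier:
        papers |= frontier
        cited = set()
        for y in frontier:                   # papers(X) :- papers(Y), CitationDB(Y, X)
            cited |= citation_db(y)
        frontier = cited - papers            # only follow papers not yet seen
    return {x for x in papers if award_db(x)}    # q'(X) :- papers(X), AwardDB(X)

winners = award_winners()
```

In this toy data `p5` is an award paper that no reachable paper cites, so the plan cannot discover it: the rewriting returns the answers obtainable under the binding restrictions, not necessarily every award paper.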
17
[Open world Assumption (source descriptions with containment)]
Wrapper Building Tools

• Wrappers provide uniform query language for data access
  Query: Restaurants in Santa Monica?

    Name              Address
    Chinois on Main   2709 Main St.
    Chao Dara         13 Union Sq.
    …                 ...

• Wrapping Web pages by non-experts
– Demonstration-oriented user interface enables users to show the system what to extract “by example”
– System automatically induces extraction patterns
– Simplifies wrapper maintenance
Example of Extraction Rule
[Muslea et al 1999]
RULE = sequence of landmarks (e.g., Cuisine : )
Page:
Name:Chinois on Main Cuisine : Pacific New Wave 
Start: SkipTo(Cuisine :) SkipTo() End: SkipTo( )
Example of Rule Induction

[Muslea et al 1999]
Training Examples:
Cuisine:ThaiReview: Good
Review: Excellent
SkipTo( )
SkipTo( ) ... SkipTo( : ) SkipTo() ... SkipTo()SkipTo()
… SkipTo(Review :) SkipTo( ) ...
19
Matching Objects Across Sources
Problem: how to decide that objects in two sources
refer to the same object in the real world
Example: Zagat Fodors
CPK California Pizza Kitchen
Ralph’s Ralph’s Grill
• Information Retrieval techniques for similarity
joins [Cohen,SIGMOD-98] [Tejada&Knoblock]
20
CSCI585
Introduction to Temporal Database

Research
by Cyrus Shahabi
from
Christian S. Jensen’s
Chapter 1
C. Shahabi 1
Outline
CSCI585
■ Introduction & definition

■ Modeling
■ Querying
■ Database design
◆ Logical design
◆ Conceptual design
■ DBMS implementation
◆ Query processing
◆ Implementation of algebraic operators
◆ Indexing structures
■ Summary
■ Open problems
Introduction
■ Most applications of database technology

are temporal in nature:
◆ Financial apps.: portfolio management,
accounting & banking
◆ Record-keeping apps.: personnel, medical-
record and inventory management
◆ Scheduling apps.: airline, car, hotel
reservations and project management
◆ Scientific apps.: weather monitoring
Definitions
■ Temporal DBMS manages time-referenced data,
hence, times are associated with database
entities
■ Two types of time: valid time and transaction
time
■ Valid time, vt, of a fact (any logical statement that
is either true or false) is the collected times
(possibly spanning the past, present & future)
when the fact is true
■ Although all facts have a valid time, the valid
time of a fact may not necessarily be recorded in
the database (unknown or irrelevant to the app.)
◆ If a database models different worlds, database facts
might have several valid times, one for each world
Definitions …
■ Transaction time, tt: the time that a fact is current

in the database
■ Tt may be associated with any database entity,
not only with facts
■ Although all entities can be assigned a tt, the
database designer may decide to not capture this
aspect for some entities
■ Tt aspect of an entity has a duration: from
insertion to deletion, with multiple insertions and
deletions being possible for the same entity !
■ Hence, deletion is purely logical (the entity is not physically
removed but ceases to be part of the database’s
current state)
Definitions …
■ Tt captures the time-varying states of the db; apps.
that demand accountability and traceability rely
on dbs that record tt
■ Tt, unlike vt, is well-behaved and may be
supplied automatically by the DBMS
■ Both tt and vt values are drawn from a time
domain, which may or may not stretch infinitely
into the past and future
■ Time domain may be discrete or continuous
■ In databases, a finite and discrete time domain is
typically assumed
Definitions …
■ Time is assumed to be totally ordered, but various
partial orders and cyclic time have also been
suggested
■ Uniqueness of “Now”:
◆ the current time is ever-increasing,
◆ all activity is trapped at the current time, and
◆ current time separates the past from the future
■ The spatial equivalent “here” doesn’t have the
above properties; the biggest difference between
time and space is that time cannot be reused!
■ The uniqueness of now is one of the reasons why
techniques from other research areas are not
readily (or not at all) applicable to temporal data
■ Now offers new data management challenges
particular to temporal databases
Modeling
■ To extend a DBMS to become temporal,

mechanisms must be provided for capturing
valid and transaction times of the facts recorded
by relations (temporal relations)
■ More than 24 extended relational models
proposed to add time to relational model, most of
which supported only valid time
■ We consider three bitemporal ones for a video
rental application: customers check out tapes
for certain durations of time and dates.
Modeling …
■ Bitemporal Conceptual Data Model (BCDM):
timestamps tuples with sets of (tt, vt) values

    cID   TapeNum  Timestamp
    C101  T1234    {(2,2), (2,3), (2,4), (3,2), (3,3), (3,4),
                    …, (UC,2), (UC,3), (UC,4)}
    C102  T1245    {(5,5), (6,5), (6,6), (7,5), (7,6), (7,7),
                    (8,5), (8,6), (8,7), …, (UC,5), (UC,6), (UC,7)}
    C102  T1234    {(9,9), (9,10), (9,11), (10,9), (10,10),
                    (10,11), (10,12), (10,13), …, (13,9),
                    (13,10), (13,11), (13,12), (13,13), (14,9),
                    …, (14,14), (15,9), …, (15,15), (16,9),
                    …, (16,15), …, (UC,9), …, (UC,15)}

■ C101 rents T1234 on May 2nd for 3 days, & returns it on 5th
■ C102 rents T1245 on 5th open-ended, & returns it on 8th
■ C102 rents T1234 on 9th to be returned on 12th. On 10th the rent
is extended to include 13th but tape is not returned until 16th.
C. Shahabi 9
CSCI585
Modeling …
■ Bitemporal Conceptual Data Model (BCDM):
timestamps tuples with sets of (tt, vt) values
[Figure: bitemporal regions of the three rentals (C101/T1234, C102/T1245, C102/T1234); axis ticks at 1, 5, 9, 17]
C. Shahabi 10
CSCI585
Modeling …
■ BCDM pros:
◆ Since no two tuples with mutually identical explicit
values are allowed in a BCDM relation instance, the full
history of a fact is contained in exactly one tuple
◆ Relation instances that are syntactically different have
different information content and vice versa
■ BCDM cons:
◆ Bad internal representation and display to users of
temporal info
◆ Varying length and voluminous timestamps of tuples
are impractical to manage directly
◆ Timestamp values are hard to comprehend in BCDM
format
C. Shahabi 11
CSCI585
Modeling …
■ Fixed-length format for tuples, where each
tuple’s timestamp encodes a rectangular or stair-
based bitemporal region
■ Several tuples may be needed to represent a
single fact
■ C101 rents T1234 on May 2nd for 3 days, & returns it on 5th
■ C102 rents T1245 on 5th open-ended, & returns it on 8th
■ C102 rents T1234 on 9th to be returned on 12th. On 10th the
rent is extended to include 13th but tape is not returned
until 16th.

cID   TapeNum  Ts  Te  Vs  Ve
C101  T1234    2   UC  2   4
C102  T1245    5   7   5   now
C102  T1245    8   UC  5   7
C102  T1234    9   9   9   11
C102  T1234    10  13  9   13
C102  T1234    14  15  9   now
C102  T1234    16  UC  9   15
C. Shahabi 12
CSCI585
Modeling …
■ Non-first-normal-form representation
■ Relation is thought of as recording
information about some types of objects
(e.g., information about customers)
■ C101 rents T1234 on May 2nd for 3 days, & returns it on 5th
■ C102 rents T1245 on 5th open-ended, & returns it on 8th
■ C102 rents T1234 on 9th to be returned on 12th. On 10th the
rent is extended to include 13th but tape is not returned
until 16th.

CustomerID                   TapeNum
[2, Now] x [2, 4]    C101    [2, Now] x [2, 4]    T1234

[5, 7] x [5, inf]    C102    [5, 7] x [5, inf]    T1245
[8, Now] x [5, 7]            [8, Now] x [5, 7]
[9, 9] x [9, 11]             [9, 9] x [9, 11]     T1234
[10, 13] x [9, 13]           [10, 13] x [9, 13]
[14, 15] x [9, inf]          [14, 15] x [9, inf]
[16, Now] x [9, 15]          [16, Now] x [9, 15]
C. Shahabi 13
CSCI585
Modeling …
■ Note that the 2nd tuple records two facts: rental
information for customer C102 for the two tapes
■ Pros of the two latter models:
◆ No need to update the relation at every tick; this is
achieved by introducing a “now” variable that assumes
the current time value
■ Two choices to enter time values into relations
1. At the level of tuples (tuple timestamping)
2. At the level of attribute values (attribute timestamping)
C. Shahabi 14
CSCI585
Modeling …
■ Relation instances that all three models may
record are snapshot equivalent (corresponding
to a point-based view of data), e.g.,

First relation   Second relation   Third relation
A  Vs  Ve        A  Vs  Ve         A  Vs  Ve
a  2   8         a  2   4          a  2   8
b  2   8         a  5   8          b  2   4
                 b  2   8          b  5   8

■ The first relation is the coalesced version of the
other two, but all three are snapshot equivalent
■ The coalescing operation merges value-equivalent
tuples (same non-timestamp attributes) with
adjacent or overlapping time intervals
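Coalescing can be sketched in a few lines of Python. This is only an illustration, assuming closed intervals [Vs, Ve] over a discrete time domain (so [2,4] and [5,8] are adjacent); the name `coalesce` and the tuple layout (A, Vs, Ve) are choices made here, not part of the slides.

```python
from itertools import groupby

def coalesce(rows):
    """Merge value-equivalent tuples whose closed intervals overlap or are adjacent.

    rows: iterable of (value, vs, ve) over a discrete time domain.
    """
    out = []
    # Sort on the explicit value first, then on the valid-time start.
    for value, group in groupby(sorted(rows), key=lambda r: r[0]):
        cur_s, cur_e = None, None
        for _, vs, ve in group:
            if cur_s is None:
                cur_s, cur_e = vs, ve
            elif vs <= cur_e + 1:            # overlapping or adjacent
                cur_e = max(cur_e, ve)
            else:                            # gap: close the current interval
                out.append((value, cur_s, cur_e))
                cur_s, cur_e = vs, ve
        out.append((value, cur_s, cur_e))
    return out

first = [("a", 2, 8), ("b", 2, 8)]
second = [("a", 2, 4), ("a", 5, 8), ("b", 2, 8)]
third = [("a", 2, 8), ("b", 2, 4), ("b", 5, 8)]
print(coalesce(second) == first, coalesce(third) == first)
```

Coalescing the second and third relations yields the first one, which is why the three are snapshot equivalent.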
C. Shahabi 15
CSCI585
Modeling …
■ BCDM only allows coalesced relation
instances, i.e., relations are only different
if they are not snapshot equivalent
◆ The last two relations are not legal in BCDM
■ However, the three relations are not
equivalent from an interval-based view:
◆ First relation: a tape was checked out for 7
days
◆ Second relation: the tape was checked out for
3 days initially and then for 4 more days
C. Shahabi 16
CSCI585
Querying
■ Temporal queries “can” be expressed via
conventional query languages such as SQL (e.g.,
current temporal applications); however, with
great difficulty
S-CheckedOut
cID   TapeNum
C101  T1234
C102  T1425
C102  T1324
C103  T1243

V-CheckedOut
cID   TapeNum  Vs  Ve
C101  T1234    2   now
C101  T1245    5   10
C102  T1245    22  25
C102  T1425    9   19
C102  T1434    4   14
C102  T1324    9   now
C103  T1243    7   21
■ At time 17, the first relation is a snapshot of the
second
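The snapshot claim can be checked with a small timeslice sketch in Python. The function name `timeslice` and the string marker for now are assumptions made for illustration; the data is the V-CheckedOut relation from the slide.

```python
NOW = "now"  # marker for the variable "now" (illustrative encoding)

V_CHECKED_OUT = [
    ("C101", "T1234", 2, NOW),
    ("C101", "T1245", 5, 10),
    ("C102", "T1245", 22, 25),
    ("C102", "T1425", 9, 19),
    ("C102", "T1434", 4, 14),
    ("C102", "T1324", 9, NOW),
    ("C103", "T1243", 7, 21),
]

def timeslice(relation, t):
    """Return the snapshot at time t, omitting the timestamp attributes.

    Assumes t is not in the future, so an interval ending at "now" covers t
    whenever it has already started.
    """
    snapshot = []
    for cid, tape, vs, ve in relation:
        end = t if ve == NOW else ve
        if vs <= t <= end:
            snapshot.append((cid, tape))
    return snapshot

print(timeslice(V_CHECKED_OUT, 17))
```

At time 17 the result contains exactly the four tuples of S-CheckedOut.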
C. Shahabi 17
CSCI585
Querying …
■ Number of current checkouts:
◆ SELECT COUNT (TapeNum) FROM S-CheckedOut
■ Temporal generalization of the above query: time-varying
count of tapes checked out
◆ If now is replaced with a fixed time value, this can be done
in SQL in 6 steps and 35 lines!
■ Specifying a key constraint:
◆ ALTER TABLE S-CheckedOut ADD PRIMARY KEY
(TapeNum)
■ TapeNum is also a key for V-CheckedOut at each
point in time
◆ It takes 12 lines and a complex SQL statement to express
this constraint
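Outside SQL, the time-varying count is a short sweep over interval endpoints. The sketch below is illustrative only: it assumes closed integer intervals, replaces now with a hypothetical fixed value (30), and the name `count_over_time` is not from the slides.

```python
from collections import defaultdict

NOW_VALUE = 30  # hypothetical fixed stand-in for "now"

# (Vs, Ve) of each V-CheckedOut tuple, with now replaced by NOW_VALUE
PERIODS = [(2, NOW_VALUE), (5, 10), (22, 25), (9, 19), (4, 14), (9, NOW_VALUE), (7, 21)]

def count_over_time(intervals):
    """Return [(start, end, count)] for maximal periods with a constant non-zero count."""
    delta = defaultdict(int)
    for vs, ve in intervals:
        delta[vs] += 1        # a tape is checked out from vs ...
        delta[ve + 1] -= 1    # ... and is no longer out after ve
    result, count, start = [], 0, None
    for t in sorted(delta):
        if delta[t] == 0:     # checkouts and returns cancel: count unchanged
            continue
        if count > 0:
            result.append((start, t - 1, count))
        count += delta[t]
        start = t
    return result

for start, end, count in count_over_time(PERIODS):
    print(f"[{start}, {end}]: {count} tapes checked out")
```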
C. Shahabi 18
CSCI585
Querying …
■ Hence, some 40 temporal query languages have

been proposed (most with their own data model),
e.g., TSQL2
■ Simple queries should remain simple:
◆ VALIDTIME
SELECT COUNT (TapeNum) FROM V-CheckedOut
◆ CONSTRAINT temporalkey VALIDTIME UNIQUE TapeNum
■ Early languages based on: relational algebra
■ Later: calculus-based, Datalog-based and OO
■ Recent: extensions to SQL
C. Shahabi 19
CSCI585
Querying …
■ Many modeling issues impact the language
design, e.g., time stamping tuples or attributes
■ Language design must consider:
◆ time-varying nature of data,
◆ predicates on temporal values,
◆ temporal constructs,
◆ supporting states and/or events,
◆ supporting multiple calendars,
◆ modification of temporal relations,
◆ cursors, views, integrity constraints, handling now,
aggregates, schema versioning, periodic data
C. Shahabi 20
CSCI585
Querying …
■ Desired properties of temporal query

languages:
1. Temporal upward compatibility: conventional
queries and modifications of temporal
relations should act on the current state
2. Pervasive support for sequence queries: that
request the history of something, e.g.,
temporal aggregation above
3. Support for point-based and interval-based
view of data
4. Adequate expressive power
5. Ability to be efficiently implemented
C. Shahabi 21
CSCI585
DBMS Design
■ Database schemas capturing time-referenced

data are complex
■ Two traditional contexts of database design:
◆ Data model of DBMS at 3 levels: view, logical, physical
(e.g., relational model for the first two)
◆ A high-level conceptual design model: ER model
■ Then, mappings bring a conceptual design into a
schema that conforms to the specific
implementation data model (e.g., ER to relational
mapping)
■ Here: we consider temporal database “logical”
and “conceptual” design
C. Shahabi 22
CSCI585
Logical Design
■ Need for design guidelines such as normalization
guidelines, but conventional normalization
concepts are not applicable to temporal
relational data models
■ A range of temporal normalization concepts have
been proposed: temporal dependencies, keys
and normal forms
■ Conventional dependencies do not apply:
TapeNum does not determine cID (go through the 3
examples)
■ But it should: at any point in time, a tape can
only be checked out by a single customer
◆ → TapeNum temporally determines cID, but the reverse
does not hold
C. Shahabi 23
CSCI585
Logical Design …
1. A temporal relation satisfies a temporal

dependency if all its snapshots satisfy the
corresponding conventional dependency
■ How to determine snapshots? Timeslice
operators:
◆ Temporal predicate as argument: e.g., contain
◆ A time point as parameter: e.g., (tt, vt)
◆ Returns snapshot of the relation corresponding to the
specified time point, omitting the timestamp attribute
■ Problem: this is an atemporal approach, which applies
to each snapshot of a temporal relation in
isolation and hence fails to account for
“temporal” aspects of data
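The snapshot-based definition can be illustrated with a brute-force check in Python. This is a sketch under assumptions: closed integer valid-time intervals, a finite horizon, and hypothetical function names; it is not an algorithm from the slides.

```python
def determines(relation):
    """Conventional dependency TapeNum -> cID over the whole relation."""
    holder = {}
    return all(holder.setdefault(tape, cid) == cid for cid, tape, _, _ in relation)

def temporally_determines(relation, horizon):
    """TapeNum -> cID must hold in every snapshot t = 0..horizon."""
    for t in range(horizon + 1):
        holder = {}
        for cid, tape, vs, ve in relation:
            if vs <= t <= ve and holder.setdefault(tape, cid) != cid:
                return False
    return True

# Tape T1245 is rented by two customers, but never at the same time.
rentals = [("C101", "T1245", 5, 10), ("C102", "T1245", 22, 25)]
print(determines(rentals), temporally_determines(rentals, 30))
```

The conventional dependency fails on this relation while the temporal one holds; adding a rental of T1245 by a third customer during days 8 to 9 would violate the temporal dependency as well.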
C. Shahabi 24
CSCI585
Logical Design …
2. Consider dependencies and associated

normal forms that hold between time
points
■ Build in the notion of time granularity
into the normalization concepts
■ Not only consider snapshots computed
at non-decomposable time points, but
also at coarser granularities:
◆ Video rental examples: day as finest
granularity, weeks and months may also be
considered
C. Shahabi 25
CSCI585
Logical Design …
3. Introducing new concepts that capture the

temporal aspects of data and may form the
basis for new database design guidelines
■ Most prominent candidate: time patterns
◆ Video rental example: since the set of tapes checked
out by a customer changes more frequently than the
customer’s address, they should be stored in separate
relations
■ Another candidate: lifespan
■ Attributes with different lifespan (to avoid null
values) or with different precision (hour vs. day)
should be stored separately
C. Shahabi 26
CSCI585
Conceptual Design
■ ER diagrams become obscure and cluttered
when an attempt is made to capture temporal
aspects (see example)
■ CheckedOut relationship should become ternary
by introducing an artificial entity set to capture
time of rental
■ However, issues still remain: varying rental price
over time, transaction time inclusion, …
■ Some industrial solutions: ignore temporal
aspects in the ER diagram and supplement it
with textual phrases, e.g., “full temporal support”
◆ → no automatic mapping from ER to model
■ Dozens of temporally enhanced ER models
proposed
C. Shahabi 27
CSCI585
Conceptual Design …
1. Give all existing ER constructs temporal

semantics, similar to “applies to all snapshots”
for normalization
➩ Does not result in any new syntactical constructs
➪ Rules out databases with non-temporal parts: while
the syntax of legacy diagrams remains valid, their
semantics have changed!
2. Devise new notational shorthand for frequent
temporal aspects in ER diagram (e.g., time
varying attributes)
➩ Both non-temporal and mixed databases can be
modeled
➪ More difficult to understand
C. Shahabi 28
CSCI585
Conceptual Design …
■ All existing models assume mapping to

relational model
■ None tries to map to one of the several
time-extended relational models
■ Mappings to emerging models (e.g.,
SQL3/ORDBMS) are also missing.
C. Shahabi 29
CSCI585
DBMS Implementation
■ Integrated approach: internal modules of a DBMS

are modified or extended to support time-varying
data
◆ Efficiency
■ Layered approach: a software layer interposed
between the user applications and DBMS that
converts temporal query language statements to
conventional statements
◆ Realistic for short and medium term
■ Popular approach: integrated, utilizing
timestamping tuples with time intervals
C. Shahabi 30
CSCI585
Query Processing
■ Temporal queries are large and complex

■ Also, the predicates might be temporal, e.g.,
overlap among two time intervals
■ Unlike the equality predicate in conventional joins,
temporal joins require multiple inequality
predicates to be examined: two intervals i and j
overlap iff st(i) <= end(j) and st(j) <= end(i)
■ Coalescing of data should be implemented
efficiently: interactions among coalescing,
duplicate removal and ordering
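The overlap predicate and a naive join built on it can be sketched as follows. This assumes closed intervals given as (st, end) pairs; `overlap_join` is a plain nested-loop illustration with hypothetical names, not one of the optimized temporal join algorithms.

```python
def overlaps(i, j):
    """Closed intervals i and j share at least one time point."""
    return i[0] <= j[1] and j[0] <= i[1]

def overlap_join(r, s):
    """Nested-loop temporal join: pair tuples whose intervals overlap.

    r, s: lists of (value, st, end). Each result carries the common interval.
    """
    out = []
    for a, a_st, a_end in r:
        for b, b_st, b_end in s:
            if overlaps((a_st, a_end), (b_st, b_end)):
                out.append((a, b, max(a_st, b_st), min(a_end, b_end)))
    return out

print(overlap_join([("x", 2, 8)], [("y", 5, 10), ("z", 9, 12)]))
```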
C. Shahabi 31
CSCI585
Query Processing …
■ Opportunities for temporal query optimization:

◆ Time advances continuously, hence for transaction
time, time value used most recently in updates is the
largest value used so far
→ natural sorting and clustering: if current and
logically deleted tuples are stored separately, then
• Current clustered on st(tt)
• Deleted clustered on end(tt)
◆ Integrity constraint st(j)<end(j)
◆ Intervals associated with a key value are contiguous in
time (end of one interval is the beginning of the other)
C. Shahabi 32
CSCI585
Implementation of Algebraic Operators
■ Efficient implementation of temporal selection,
joins, aggregates, and duplicate elimination →
temporal index structures
■ A variety of binary temporal joins have been
proposed: time-join, time-equijoin, … as
extensions of nested-loop or merge joins that
exploit orders or local workspace, as well as
partitioning-based joins
■ Also, incremental techniques for implementing
operators on relations capturing transaction time
have been discussed
◆ Caching the results of previous computations to be
reused later (easy to do since the records of updates,
i.e., changes to previously cached results, are already
contained in a temporal DBMS)
C. Shahabi 33
CSCI585
Imp. Of Algebraic Ops…
■ Efficient implementation of time-varying

aggregates
■ Efficient implementation of coalescing:
1. Sorting the argument relation on the explicit
attribute values as well as the valid time
2. Perform the merging in the subsequent scan
C. Shahabi 34
CSCI585
Indexing Structures
■ Similar to spatial index structures; can be
based on traditional indexes such as the B+-tree
or multidimensional ones such as the R-tree
■ Index structures usually used for
selection operators
■ Active research investigation: use index
structures for temporal joins, coalescing
and aggregates
C. Shahabi 35
CSCI585
Summary
■ Popular approaches:
◆ Snapshot-based semantics for database design
◆ BCDM for modeling
◆ TSQL2 as a query language
■ Well understood issues (some with efficient
implementation):
◆ Semantics of the time domain: its structure,
dimensionality, and indeterminacy
◆ Representational issues and operations on timestamps
◆ Temporal joins, aggregates and coalescing
◆ Temporal index structures supporting vt, tt, or both
◆ Prototype implementations of temporal DBMS
C. Shahabi 36
CSCI585
Open Problems
■ Legacy awareness
■ Architecture awareness
■ Visualization of temporal data
■ Conceptual design
■ Performance (cost models for temporal
operators and maintaining statistics for
query optimizer)
C. Shahabi 37
CSCI585
Open Problems …
■ Related research that can benefit from

and/or challenge temporal DBMS
research:
◆ Active databases
◆ Spatiotemporal databases
◆ Moving objects
◆ Multimedia, virtual reality, immersive apps.
◆ Temporal data mining
◆ Warehousing
C. Shahabi 38
Multimedia Storage Servers
Cyrus Shahabi
shahabi@usc.edu
Integrated Media Systems Center &
Computer Science Department
University of Southern California
Los Angeles CA, 90089-0781 1
CSCI585-
CSCI585-Spring2001
OUTLINE
Introduction
Continuous Media
Magnetic Disk Drives
Display of CM (single disk, multi-disks )
Optimization Techniques
Additional Issues
Case Study (Yima)
C. Shahabi 2
CSCI585-
CSCI585-Spring2001
What is a Multimedia Server?
Network
Storage Manager
Memory
• Multiple streams of audio and video should

be delivered to many users simultaneously
C. Shahabi 3
CSCI585-
CSCI585-Spring2001
Some Applications
Video-on-demand Medical databases
News-on-demand NASA databases
News-editing
Movie-editing
Interactive TV
Digital libraries
Distance Learning
C. Shahabi 4
CSCI585-
CSCI585-Spring2001
Challenge: Continuous Media

CM object consists of a sequence of media quanta
(e.g., audio samples or video frames), which
convey meaning only when presented in time.
Storage & Retrieval
Continuous display
High bandwidth requirement
Large size
Communications
End-user (display and interface)
C. Shahabi 5
CSCI585-
CSCI585-Spring2001
Continuous Display
Data should be transferred from the storage device (disk)
to the memory (or display) at a pre-specified rate.
Otherwise: frequent disruptions & delays, termed hiccups
NTSC quality: 45 Mb/s
C. Shahabi 6
CSCI585-
CSCI585-Spring2001
High Bandwidth & Large Size

         Access Time   Transfer Rate  Cost / Megabyte
Memory   1 ~ 5 ns      > 1 GB/s       ~ $1
Disk     5 ~ 20 ms     < 40 MB/s      < $0.01
Optical  100 ~ 300 ms  < 5 MB/s       < $0.005
Tape     sec ~ min     < 10 MB/s      < $0.001
HDTV quality: ~ 1 Gb/s
2-hr HDTV: ~ 900 GB
C. Shahabi 7
CSCI585-
CSCI585-Spring2001
Compression
MPEG-1: 30:1 reduction in both size and
bandwidth requirement (NTSC 45 Mb/s is
reduced to 1.5 Mb/s)
MPEG-2: 3-10:1 reduction
(NTSC ~ 4, DVD ~ 8, HDTV ~ 20 Mb/s)
Problem: compression loses information
(cannot be tolerated by some applications,
e.g., NASA)
C. Shahabi 8
CSCI585-
CSCI585-Spring2001
Assumed Hardware Platform

Multiple magnetic disk drives:
Not too expensive (as compared to RAM)
Not too slow (as compared to tape)
Not too small (as compared to CD-ROM)
And it’s already everywhere!
C. Shahabi 9
CSCI585-
CSCI585-Spring2001
Magnetic Disk Drives

An electro-mechanical random access storage device
Magnetic head(s) read and write data from/to the disk
C. Shahabi 10
CSCI585-
CSCI585-Spring2001
Disk Device Comparison
C. Shahabi 11
CSCI585-
CSCI585-Spring2001
Disk Seek Characteristic
C. Shahabi 12
CSCI585-
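CSCI585-Spring2001
Disk Seek Time Model
\[
T_{Seek} = \begin{cases} c_1 + (c_2 \times \sqrt{d}) & \text{if } d < z \text{ cylinders} \\ c_3 + (c_4 \times d) & \text{if } d \ge z \text{ cylinders} \end{cases}
\]
\[
T_{AvgRotLatency} = \frac{1}{2} \times \frac{60 \text{ sec}}{rpm}
\]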
C. Shahabi 13
CSCI585-
CSCI585-Spring2001
Disk Service Time Model
\[ T_{Service} = T_{Transfer} + T_{AvgRotLatency} + T_{Seek} \]
\[ BW_{Effective} = \frac{B}{T_{Service}}, \qquad T_{Transfer} = \frac{B}{BW_{Max}} \]
TTransfer: data transfer time [s]
TAvgRotLatency: average rotational latency [s]
TService: service time [s]
B: block size [MB]
BWEffective: effective bandwidth [MB/s]
C. Shahabi 14
CSCI585-
CSCI585-Spring2001
Data Retrieval Overhead
C. Shahabi 15
CSCI585-
CSCI585-Spring2001
Sample Calculations
Assumptions:
TSeek = 10 ms
BWMax = 20 MB/s
Spindle speed: 10,000 rpm
\[ BW_{Effective} = \frac{B}{\frac{B}{BW_{Max}} + \frac{30 \text{ sec}}{rpm} + T_{Seek}} \]
B            1 KB        10 KB      100 KB     1 MB        10 MB
BWEffective  0.076 MB/s  0.74 MB/s  5.55 MB/s  15.87 MB/s  19.49 MB/s
% of BWMax   0.38%       3.7%       27.8%      79.4%       97.5%
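The table can be reproduced with a few lines of Python. The function name and default arguments are illustrative; the parameter values are the assumptions listed on the slide, with 1 KB taken as 0.001 MB.

```python
def effective_bandwidth(block_mb, bw_max=20.0, rpm=10_000, t_seek=0.010):
    """Effective bandwidth in MB/s for one block of size block_mb [MB]."""
    t_transfer = block_mb / bw_max        # data transfer time [s]
    t_rot = 0.5 * 60.0 / rpm              # average rotational latency [s]
    return block_mb / (t_transfer + t_rot + t_seek)

for label, b in [("1 KB", 0.001), ("10 KB", 0.01), ("100 KB", 0.1),
                 ("1 MB", 1.0), ("10 MB", 10.0)]:
    bw = effective_bandwidth(b)
    print(f"{label:>6}: {bw:7.3f} MB/s ({100 * bw / 20.0:.2f}% of BWMax)")
```

The fixed overhead of 13 ms per block (seek plus rotational latency) dominates for small blocks, which is why the effective bandwidth approaches BWMax only for blocks of several megabytes.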
C. Shahabi 16
CSCI585-
CSCI585-Spring2001
Summary
Average rotational latency depends on the spindle
speed of the disk platters (rpm)
Seek time is a non-linear function of the number
of cylinders traversed
Average rotational latency + seek time = overhead
(wasteful)
Average rotational latency and seek time reduce
the maximum bandwidth of a disk drive to the
effective bandwidth
C. Shahabi 17
CSCI585-
CSCI585-Spring2001
Continuous Display (1 disk)

[Figure: timeline of retrieving blocks X1, X2, X3 from disk and displaying X1, X2, X3 from memory]
Traditional production/consumption problem
RC = Consumption Rate, e.g., MPEG-1: 1.5 Mb/s
RD = Production Rate, e.g., Seagate Barracuda: 68 Mb/s
For now: RC < RD
Partition video X into n blocks: X1, X2, ..., Xn
(to reduce the buffer requirement)
C. Shahabi 18
CSCI585-
CSCI585-Spring2001
Round-robin Display
[Figure: timeline; blocks X1, Y3, X2, Y4, X3, Y5 are retrieved from disk in turn, with a seek time between retrievals, while X1, X2, X3 and Y3, Y4, Y5 are displayed from memory]
Time period: time to display a block (is fixed)
System Throughput (N): number of streams
Assuming random assignment of the blocks:
Maximum seek time between block retrievals
Waste of disk bandwidth ==> lower throughput
Tp=?, N=?, Memory=?, max-latency=?
C. Shahabi 19
CSCI585-
CSCI585-Spring2001
Cycle-based Display
[Figure: timeline; blocks of X, Y and Z (X1, Y3, Z5, Z6, Y4, X2, Y5, X3, Z7) are retrieved from disk cycle by cycle while X1, Y3, Z5 and then X2, Y4, Z6 are displayed from memory]
Using disk scheduling techniques
Less seek time ==> Less disk bandwidth waste ==> Higher
throughput
Larger buffer requirement
Tp=?, N=?, Memory=?, max-latency=?
C. Shahabi 20
CSCI585-
CSCI585-Spring2001
Group Sweeping Scheme (GSS)
[Figure: retrieval order W1 X1 Y3 Z5 X2 W2 Z6 Y4 W3 X3 Z7 Y5, with Group 1 and Group 2 served in Subcycle 1 and Subcycle 2; display X1, W1, then display X2, W2]
Can shuffle the order of block retrievals within a group
Cannot shuffle the order of groups
GSS when g=1 is cycle-based
GSS when g=N is round-robin
Optimal value of g can be determined to minimize memory
buffer requirements
Tp=?, N=?, Memory=?, max-latency=?
C. Shahabi 21
CSCI585-
CSCI585-Spring2001
Constrained Data Placement

Partition the disk into R regions
During each time period only blocks that reside in the
same region are retrieved
Maximum seek time is reduced almost by a factor of R
Introduces startup latency time
Tp=?, N=?, Memory=?, max-latency=?
[Figure: (a) OREO: blocks of X, Y and Z placed across regions R0 to R5; (b) Time Period Schedule: first time period, second time period]
C. Shahabi 22
CSCI585-
CSCI585-Spring2001
Hybrid
For the blocks retrieved within a region, use the GSS
scheme
This is the most general approach
Tp=?, N=?, Memory=?, max-latency=?
By varying R and g all the possible display techniques
can be achieved
Round-robin (R=1, g=N)
Cycle-based (R=1, g=1)
Constrained placement (R>0, g=1), ...
A configuration planner calculates the optimal values of
R & g for a given application.
C. Shahabi 23
CSCI585-
CSCI585-Spring2001
Display of Mix of Media

[Figure: timeline; blocks X1 Y3 Z5, Z6 Y4 X2, Y5 X3 Z7 are retrieved from disk while X1, Y3, Z5 and then X2, Y4, Z6 are displayed from memory]
Mix of media types: different RC’s: audio, video

Rc(Y) < Rc(X) < Rc(Z)
Different block sizes: Rc(X)/B(X)=Rc(Y)/B(Y)= ...
Display time of a block (time period) is still fixed
C. Shahabi 24
CSCI585-
CSCI585-Spring2001
Multiple-disks
Single disk: even in the best case with 0 seek time,
68/1.5=45 MPEG-1 streams
Typical applications (MOD): 1000 streams
Solution: aggregate bandwidth and storage space of
multiple disk drives
How to place a video?
C. Shahabi 25
CSCI585-
CSCI585-Spring2001
RAID Striping
All disks take part in transmission of a block
Can be conceptualized as a single disk
Even distribution of display load
Efficient admission
Is not scalable in throughput
[Figure: block X1 is split into fragments X1.1, X1.2, X1.3 on disks d1, d2, d3; likewise X2.1, X2.2, X2.3]
CSCI585-
CSCI585-Spring2001
Round-robin retrieval
Only a single disk takes part in transmission of each block
Round-robin retrieval of the blocks
Even distribution of display load
Efficient admission
Not scalable in latency
[Figure: blocks of X, Y, Z and W placed round-robin over disks d1, d2, d3, with a retrieval schedule over time that displays X1,Y1,W1,Z1, then X2,Y2,W2,Z2, then X3,Y3,W3,Z3]
C. Shahabi 27
CSCI585-
CSCI585-Spring2001
Hybrid Striping
Partition D disks into clusters of d disks
Each block is declustered across the d disks that constitute
a cluster (each cluster is a logical disk drive)
RAID striping within a cluster
Round-robin retrieval across the clusters
RAID striping (d=D), Round-robin retrieval (d=1)
C0 C1 C2
X 0.0 X 0.1 X 1.0 X 1.1 X 2.0 X 2.1

X 3.0 X3.1 X 4.0 X4.1 X 5.0 X5.1
... ... ... ... ... ...
d0 d1 d2 d3 d4 d5
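The placement in the diagram above can be written as a small mapping. This is a sketch that assumes blocks are assigned to clusters round-robin starting at C0 and that fragment j of a block goes to the j-th disk of its cluster; the function name is hypothetical.

```python
def fragment_location(block, fragment, num_disks, cluster_size):
    """Return (cluster, disk) holding fragment `fragment` of block `block`."""
    num_clusters = num_disks // cluster_size
    cluster = block % num_clusters               # round-robin across clusters
    disk = cluster * cluster_size + fragment     # striping within the cluster
    return cluster, disk

# D = 6 disks in clusters of d = 2, as in the diagram
for block in range(6):
    print(block, [fragment_location(block, f, 6, 2) for f in range(2)])
```

With cluster_size equal to the number of disks this degenerates to RAID striping, and with cluster_size 1 to round-robin retrieval.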
C. Shahabi 28
CSCI585-
CSCI585-Spring2001
Scalability
By varying the value of d different throughput and
latency time can be achieved
For an application, the maximum throughput and the
tolerable latency time can be given as inputs to a planner
The output will then be the optimal value of d for that
application
Note that hybrid always outperforms RAID striping and
round-robin retrieval
[Figure: two plots against "Factor of increase in resources (memory+disk)" (0 to 20), each with curves for d=1 and d=D: "Worst latency (sec)" (0 to 80) and "Max. No. of users" (0 to 800)]
C. Shahabi 29
CSCI585-
CSCI585-Spring2001
Latency & Throughput (Hybrid Striping)

Conceptualize a set of slots supported by a cluster in a Tp as a
group
A request maps onto one group and groups visit clusters in a
round-robin manner
During a time period, each occupied slot of a group retrieves a
block that resides in the cluster that is being visited by that group

Cluster  1Tp  2Tp  3Tp  4Tp  5Tp  6Tp
C0       G1   G2   G3   G4   G5   G0
C1       G0   G1   G2   G3   G4   G5
C2       G5   G0   G1   G2   G3   G4
C3       G4   G5   G0   G1   G2   G3
C4       G3   G4   G5   G0   G1   G2
C5       G2   G3   G4   G5   G0   G1

[Figure: objects X (X0 ... X6, starting at C2) and Y (Y0 ... Y6, starting at C0) placed over clusters C0 to C5, with groups G1 G0 G5 G4 G3 G2 currently visiting C0 to C5]
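The rotation in the schedule follows a simple modular rule, sketched below. The formula is read off the schedule shown on this slide (time periods numbered from 1) and the function name is an assumption for illustration.

```python
def group_at(cluster, period, num_clusters):
    """Group visiting `cluster` during time period `period` (1-based)."""
    return (period - cluster) % num_clusters

# Rebuild the schedule for 6 clusters and 6 time periods
for c in range(6):
    print(f"C{c}", " ".join(f"G{group_at(c, t, 6)}" for t in range(1, 7)))
```

Each group advances by one cluster per time period, so a request assigned to a group finds its next block in the cluster that the group visits next.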
C. Shahabi 30
CSCI585-
CSCI585-Spring2001
Admission Control
When a request arrives, search for an empty slot in the group
currently accessing the cluster which has the first block
failure: look up a group and find no empty slot
success: find a group with an empty slot and assign the request
[Figure: groups G0 G5 G4 G3 G2 G1 over clusters C0 to C5, which hold blocks of X, Y and Z; labels 3 2 1 5 4 above the groups]
C. Shahabi 31
CSCI585-
CSCI585-Spring2001
Throughput & Startup Latency

The number of slots (N ) in a time period defines the
maximum number of simultaneous displays (throughput)
supported by a cluster
N and Tp could be configured using various techniques
such as GSS
The throughput of a system with C clusters is:
N ×C
If a request experiences i failures before success, the
average startup latency is:
\[ L = \begin{cases} 0.5 \times T_p & (i = 0) \\ i \times T_p & (i \neq 0) \end{cases} \]
How to determine the value of i ?
C. Shahabi 32
CSCI585-
CSCI585-Spring2001
Throughput & Startup Latency

The probability of a failure is a function of system load
Develop a probabilistic approach to determine the expected
latency of a request
p(k) : probability that there are k active requests in the
system (using a queuing model)
p f (i , k ) : probability that a request has i failures before a
success when there are k active requests in the system
The expected latency is:
\[ E[L] = \sum_{k=0}^{m-1} p(k)\, p_f(0,k)\, 0.5\, T_p + \sum_{k=0}^{m-1} \sum_{i=1}^{\lfloor k/N \rfloor} p(k)\, p_f(i,k)\, i\, T_p \]
C. Shahabi 33
CSCI585-
CSCI585-Spring2001
Optimization Techniques
Two techniques to minimize startup latency

Request Migration
Object Replication
Three techniques to maximize throughput
Batching requests
Piggybacking
Buffer sharing
C. Shahabi 34
CSCI585-Spring2001
Request Migration
Migrating a request from a busy group to a group with
more idle slots reduces the possible latency of future
requests
G0 G5 G4 G3 G2 G1
X4 X5 X0 X1 X2 X3
X6 ...
Y0 Y1 Y2 Y3 Y4 Y5
Y6 ... Z0 Z1 ...
C0 C1 C2 C3 C4 C5
C. Shahabi 35
CSCI585-Spring2001
Object Replication
With multiple copies of an object, simultaneous checking
for an empty slot reduces the worst case startup latency
and the average latency
G0 G5 G4 G3 G2 G1
X4 X5 X0 X1 X2 X3
X6 ...
X1 X2 X3 X4 X5 X0
... X6
C0 C1 C2 C3 C4 C5
C. Shahabi 36
CSCI585-
CSCI585-Spring2001
Maximize Throughput
Objective: support more requests than the
maximum throughput of the system!
Observation: for most of the applications (e.g.,
MOD) many requests for the same stream (e.g.,
newly released movies) arrive close to each other
Solution: Share streams among multiple requests
How?
C. Shahabi 37
CSCI585-
CSCI585-Spring2001
Stream Sharing
Batching requests: introduce a startup delay and
multiplex single stream to support all the requests
for the same stream
Piggybacking: if two (or more) streams are close
enough, speed-up one and slow down the other so
they meet and then share streams
Buffer sharing: keep blocks of the first request in
memory so that the second request can be served
from memory (rather than disk)
C. Shahabi 38
CSCI585-
CSCI585-Spring2001
Additional Issues (1)

VBR support
Approach: approximate bandwidth with segments of
constant bitrate and buffers
Pause and resume
Stop a stream at a specific point and resume from
the same point
Fast forward and fast rewind (speed up stream)
Send more data during each time period
Drop some frames or blocks at the server side
Have separate “trick” files and switch between them
C. Shahabi 39
CSCI585-
CSCI585-Spring2001

Synchronization between multiple streams
Audio/video synchronization is very sensitive
Synchronization must be addressed at different levels
o Server
o Network
o Client
Random Data Placement

Deadline driven scheduling approach
Probabilistic model of hiccup probability
C. Shahabi 40
CSCI585-
CSCI585-Spring2001
Multi-zone disk drives (variable transfer rate)
E.g., Track-pairing
E.g., FIXB (elevator allocation of blocks over zones)
Heterogeneity
Different media types
Different disk types
Fault tolerance: what kind of failures are acceptable?
Replication vs. parity-based
OnLine reorganization
C. Shahabi 41
Yima Multichannel Streaming System
CSCI585-Spring2001
[Figure: Yima Server (cluster with storage systems) connected over a 100 Mb/s or 1 Gb/s network to Yima Clients: HDTV (MPEG-2), NTSC (MPEG-2), 10.2 Channel Audio]
Multi-node, scalable server architecture
Media format independent: supports DVD MPEG-2 (8 Mb/s), HD MPEG-2 (20
Mb/s), MPEG-4 (800 Kb/s), 10.2 channel audio, etc.
Standard transmission protocols: RTP, RTSP
Selective retransmission of lost media packets for improved playback quality
Synchronization across multiple media streams
C. Shahabi 42
