The Relational Model
An overview
and its role in a growing world of
“unstructured” data
Me
●
Daniel Hanks
●
Systems / Database admin @ United Online
●
danhanks@gmail.com
●
http://brainshed.com – Geek stuff / blog
●
http://hanksplace.net – Family History
●
Slides available @
http://brainshed.com/relational_model.odp
History
●
I was building a web-based, database-backed
app.
●
I wanted to learn more about good database
design.
●
That eventually led me to Fabian Pascal and C.
J. Date.
●
The rest is history.
The iceberg syndrome
●
We're only scratching the surface.
●
There's a lot more depth to all this than we'll
cover here.
●
I don't have graduate degrees in math or comp.
sci., so I won't go too deep.
●
I do hope to raise your awareness of what the
relational model actually is, and increase your
desire to learn more about it.
RDBMS
●
“Relational” Database Management System
●
What does “Relational” mean?
SQL ≠ Relational
●
Oracle ≠ fully relational
●
PostgreSQL ≠ fully relational
●
MySQL ≠ fully relational
– (If I could capitalize the operator here, I would ;P ).
●
Most current “R”DBMSs stink (as far as they
implement the relational model). Some stink
more than others.
What is the relational model?
Essentially:
Applying math and logic to data management
“It was late in 1968 that Codd, a mathematician by
training, first realized that the discipline of
mathematics could be used to inject some solid
principles and rigor into the field of database
management”
C. J. Date, Database In Depth
Is it still relevant?
●
It's almost 40 years old. That means it's
antiquated and about as useful as COBOL,
right?
●
We've got Über-object-xml-hierarchical-
folksonomic databases now, so why would we
want to use a relational database?
Absolutely relevant
Calculus has been around for hundreds of years
and it's still pretty useful.
Math and Logic
The relational model is founded on math and
logic. Principles, as opposed to practices.
Not set in stone
“One widespread misconception about the
relational model is that it's a totally static thing.
It's not. It's like mathematics in that respect:
mathematics too is not a static thing, but
changes over time. In fact, the relational model
can itself be seen as a small branch of
mathematics; as such, it evolves over time as
new theorems are proved and new results
discovered.”
C. J. Date, Database In Depth
Historical players
●
E.F. Codd – wrote the original paper defining
the Relational Model.
●
C.J. Date – An associate of Codd.
Current players
●
C. J. Date – Still going strong.
●
Hugh Darwen – Partner in crime with Date.
●
Fabian Pascal – Crotchety and harsh, but worth
reading.
Physical vs. Logical
●
The Relational Model is all about the Logical.
●
Physical implementation details are orthogonal
to the model, including:
– Indexes
– Performance
– Storage
– Transactions
– Etc., Etc.
Mathematical foundations
●
Set Theory
●
Predicate Logic
Set Theory
●
A set is a collection of distinct elements.
●
No duplicates.
●
A set is defined by its elements.
●
{ 1, 2, 3 } = { 3, 2, 1} = {2, 3, 1}.
●
Elements can be anything (including other
sets).
●
{1, 2, {2, 3}}, and so forth.
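As a minimal sketch, these properties can be tried out in Python, whose built-in set types behave like mathematical sets for this purpose. The variable names are illustrative only.

```python
# Order does not matter: a set is defined only by its elements.
a = {1, 2, 3}
b = {3, 2, 1}
same_set = (a == b)                        # True

# Duplicates collapse: listing an element twice adds nothing.
no_dupes = {1, 2, 2, 3}
size = len(no_dupes)                       # 3

# Sets can contain other sets. Python requires the inner set to be
# immutable (a frozenset) so that it can be hashed.
nested = {1, 2, frozenset({2, 3})}
has_inner = frozenset({3, 2}) in nested    # True
```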
Set Theory
●
Operators
– With set operators we can manipulate sets and
derive new sets.
– UNION, INTERSECTION, DIFFERENCE, and so
forth.
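A short sketch of the three operators named above, again using Python sets. The two sets of names are made up for the example.

```python
engineers = {"John", "Jim"}
managers = {"Jim", "Sally"}

union = engineers | managers           # elements in either set
intersection = engineers & managers    # elements in both sets
difference = engineers - managers      # elements in the first set only
```

Each result is itself a set, so it can be fed straight into another operator. The relational operators have the same property: they take relations and return relations.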
Predicate Logic
●
A predicate is a generalized statement of “truth”
about our “universe.”
●
Generalized because it's parameterized.
●
Example:
– <Name> has id <ID> and works in Department <D>
●
We'll discuss this more later.
Key Players in the model
●
Types (domains)
●
Tuples (rows)
●
Relations (tables)
●
Databases
Types
Each piece of data stored in a database is of a
particular type or domain.
Types
This is the most basic level of data integrity:
ensuring that each piece of data stored in a
database belongs to the set of allowed values
for its type.
Types
●
Types can be simple (numbers, strings, etc).
●
Or complex
– Point (x,y)
– JPEG Image
– Lon/Lat coordinate
– Even relation types (more on relations in a moment)
●
The important bit is that we have adequate
operators to manipulate values of each type.
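To make "a type is a set of allowed values plus operators" concrete, here is a rough sketch of two complex types in Python. The class names, the distance operator, and the latitude range check are assumptions chosen for illustration; they are not taken from any particular DBMS.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A complex type with two components and one operator."""
    x: float
    y: float

    def distance_to(self, other: "Point") -> float:
        # An operator defined on values of this type.
        return math.hypot(self.x - other.x, self.y - other.y)


@dataclass(frozen=True)
class Latitude:
    """A type whose allowed values are restricted to -90..90 degrees."""
    degrees: float

    def __post_init__(self) -> None:
        # Values outside the domain can never be constructed.
        if not -90.0 <= self.degrees <= 90.0:
            raise ValueError(f"latitude out of range: {self.degrees}")


d = Point(0.0, 0.0).distance_to(Point(3.0, 4.0))   # 5.0
```

Because an invalid `Latitude` cannot exist, every stored value of that type is already known to be in the domain.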
Relations
What's a “relation?”
Relations
●
Based on the mathematical concept of a
relation.
●
Remember “relational expressions” from math?
●
1 < 2
Relations
●
If we generalize that:
●
x < y, for any x and y drawn from a set of numbers R
●
This relation is the set of pairs that satisfy this
expression {(0,1), (0,2), (1,2), ...}.
●
This is a 'binary' relation—2 values in each
element.
●
The relational model utilizes n-ary relations:
Relations
●
Going back to predicates.
●
<Name> has ID <id>, and works in Dept <D>.
●
A relation expressing this predicate would be
the set of all sets of values which satisfy the
predicate truthfully (in our enterprise or
universe):
●
John has id 12345, and works in Engineering.
●
Sally has id 54321, and works in Finance, etc.
Relations
●
Or, expressed mathematically:
●
{ {Name => John, id => 12345, Dept => Eng},
{Name => Sally, id => 54321, Dept => Fin} }
●
Expressed tabularly:
Name   ID     Dept
John   12345  Engineering
Sally  54321  Finance
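The same relation can be sketched in Python as a set of tuples, where each tuple is itself a set of (attribute, value) pairs. The helper `make_tuple` is a hypothetical name introduced only for this example.

```python
def make_tuple(**attrs):
    """A relational tuple: an unordered set of (attribute, value) pairs."""
    return frozenset(attrs.items())


employee = {
    make_tuple(Name="John", ID=12345, Dept="Engineering"),
    make_tuple(Name="Sally", ID=54321, Dept="Finance"),
}

# Attribute order is irrelevant, just like element order in a set.
reordered = make_tuple(Dept="Finance", ID=54321, Name="Sally")
is_member = reordered in employee          # True

# Plugging each tuple into the predicate yields one true proposition.
facts = sorted(
    "{Name} has ID {ID}, and works in Dept {Dept}.".format(**dict(t))
    for t in employee
)
```

The table is just one convenient picture of this value. Its left-to-right column order and top-to-bottom row order carry no meaning.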
Relations
●
We call the set of predicate placeholders
'attributes': {Name, ID, Dept} (fields in SQL).
●
Each attribute has a corresponding type (or
domain) that constrains the possible set of
values for that attribute (String Name, Integer
ID, String Dept).
●
Each set of values for the attributes of the
relation is called a tuple.
Tuples
●
Each set of values that satisfy the predicate of a
relation is a tuple in that relation.
●
{John, 12345, Engineering},
{Sally, 54321, Finance}
●
The values in each tuple correspond to the
attributes in the relation, and conform to the
type or domain of each attribute.
The strength of the model
A relational database is much more than just
tables, columns, and rows...
The strength of the model
●
A relation is a set of tuples.
●
From a logical point of view (predicate logic) a
relation is the logical AND of all the tuples
that are elements of the relation:
●
John has id 12345, works in Engineering AND
Sally has id 54321 works in Finance AND
Jim has id 13245 works in Marketing AND ...
The strength of the model
●
A database is a set of relations.
●
A database is the logical AND of all the tuples
in all of the relations in the database.
The strength of the model
●
If each tuple represents an axiom, that is, a
statement of truth or a fact about our universe...
●
We can combine these axioms with
mathematical operators, and using logical rules
of inference...
●
We can derive new facts that are
mathematically provable.
●
Thus ...
The strength of the model
●
Queries in our database are really proofs, which
use the basic axioms stored as tuples, as well
as relational operators and logical inference
rules.
●
If our database is designed on mathematically
sound principles, our query results are
mathematically sound theorems about various
aspects of the universe we are trying to model.
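A small sketch of "query as proof": two relations hold the axioms, and a join followed by a projection derives facts that were never stored. The helper names (`rel`, `natural_join`, `project`) and the building data are assumptions made up for this illustration.

```python
def rel(*rows):
    """Build a relation: a set of tuples, each a frozenset of (attr, value)."""
    return {frozenset(row.items()) for row in rows}


def natural_join(r, s):
    """Combine tuples of r and s that agree on all shared attributes."""
    result = set()
    for t in r:
        for u in s:
            td, ud = dict(t), dict(u)
            shared = td.keys() & ud.keys()
            if all(td[a] == ud[a] for a in shared):
                result.add(frozenset({**td, **ud}.items()))
    return result


def project(r, attrs):
    """Keep only the named attributes; duplicate tuples collapse."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}


# Axioms: "<Name> works in <Dept>" and "<Dept> is located in <Building>".
works_in = rel({"Name": "John", "Dept": "Engineering"},
               {"Name": "Sally", "Dept": "Finance"})
located_in = rel({"Dept": "Engineering", "Building": "B1"},
                 {"Dept": "Finance", "Building": "B2"})

# Theorem: "<Name> works in a department located in <Building>".
derived = project(natural_join(works_in, located_in), {"Name", "Building"})
```

Nobody stated that John works in building B1. It follows from the stored facts, and the query result is only as trustworthy as those facts and the operators applied to them.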
The strength of the model
●
That means you can put a lot of trust in your
query results.
●
So, a truly relational database is more than just
tables, columns, and rows.
●
Your database is not a spreadsheet :). It is a
formal proof system...if it's designed well, and
follows some basic rules:
Duplicates
●
No duplicate tuples (rows) in the relational
model (unlike SQL).
●
Remember, relations are sets. Sets don't have
duplicates.
●
Besides, what would a duplicate tuple mean in
light of our discussion on predicates?
●
Duplicates state the same fact twice, thus are
redundant and meaningless.
Duplicates
●
John has id 12345, works in Engineering AND
●
John has id 12345, works in Engineering AND
●
Makes no sense, adds no significant meaning
in the context of predicate logic.
●
Unfortunately, that doesn't stop most
“R”DBMSes from letting you do it :(.
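A quick way to see the difference, sketched with Python's bundled sqlite3 module standing in for an SQL DBMS. The table is deliberately declared without a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key declared, so the DBMS has no way to reject repeats.
conn.execute("CREATE TABLE employee (name TEXT, id INTEGER, dept TEXT)")

row = ("John", 12345, "Engineering")
conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)
conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)  # accepted silently

# SQL counts the same fact twice...
sql_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# ...whereas a true relation (a set) holds it once.
relation = set(conn.execute("SELECT name, id, dept FROM employee"))
set_count = len(relation)
```

Here `sql_count` is 2 while `set_count` is 1, so any aggregate computed over the table is already off.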
Nulls
●
Avoid nulls
●
Codd proposed nulls 10 years after first
introducing the relational model
●
Date argues they aren't necessary, and even
harmful, for the following reasons...
Nulls
●
Predicate logic (remember predicates?) is
based on 2-valued logic, true and false.
●
By introducing nulls into the picture we move
away from the safety and mathematical
strength of 2-valued logic into 3-valued logic
(true, false, and unknown).
●
The truth tables for 3VL operators are larger,
and in some cases nonintuitive.
Nulls
●
Nulls can lead to erroneous query results and,
worse, to results that are not obviously
erroneous.
●
You have to always remember to check for the
possibility of null when you use them.
●
Nulls don't behave intuitively when used with
particular operators, and implementation can
vary from one vendor to another.
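One such surprise, sketched with Python's sqlite3 module: in 2-valued logic every employee is either in Engineering or not, so the two counts below should add up to the total. With a null present they do not. The third row (Jim, with an unknown department) is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT PRIMARY KEY, dept TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("John", "Engineering"), ("Sally", "Finance"), ("Jim", None)],
)


def count(where: str) -> int:
    """Count the rows matching a WHERE clause (trusted, hard-coded text only)."""
    sql = f"SELECT COUNT(*) FROM employee WHERE {where}"
    return conn.execute(sql).fetchone()[0]


total = count("1 = 1")
in_eng = count("dept = 'Engineering'")
not_in_eng = count("dept <> 'Engineering'")
# Jim's row evaluates to "unknown" for both tests and is dropped by both.
missing = total - (in_eng + not_in_eng)
```

Nothing in either result hints that a row was left out, which is exactly the kind of error that is hard to notice.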
Nulls
●
You can design without them.
●
It may complicate the design a bit, but the
resulting mathematical soundness is worth it.
●
In a sense, disallowing nulls is just another
form of integrity constraint.
Keys
●
Repeat after me:
– “A key is NOT an index.”
– Indexes are physical implementation mechanisms
used to enforce keys.
– So what's a key?
Keys
●
A “key” is any set of attributes whose values
uniquely identify each tuple in a given relation
(strictly speaking, a superkey). Every relation
has at least one: the set of all its attributes.
●
A “candidate key” is a key with the smallest
possible set of attributes: remove any attribute
and it no longer identifies tuples uniquely.
●
A “primary key” is simply an arbitrary choice
among the candidate keys.
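The definitions can be sketched as a brute-force check over sample data. One loud caveat: a key is a rule the designer declares about every legal value of the relation, so a sample can only refute a proposed key, never establish one. The function names and the two extra rows are hypothetical.

```python
from itertools import combinations


def is_key(rows, attrs):
    """True if no two rows in this sample agree on every attribute in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def candidate_keys(rows, all_attrs):
    """Keys that contain no smaller key, found by trying small sets first."""
    found = []
    for size in range(1, len(all_attrs) + 1):
        for attrs in combinations(all_attrs, size):
            if any(set(k) <= set(attrs) for k in found):
                continue  # reducible: it contains a key already found
            if is_key(rows, attrs):
                found.append(attrs)
    return found


ATTRS = ("Name", "ID", "Dept")
rows = [
    {"Name": "John", "ID": 12345, "Dept": "Engineering"},
    {"Name": "Sally", "ID": 54321, "Dept": "Finance"},
    {"Name": "John", "ID": 13245, "Dept": "Finance"},    # a second John
    {"Name": "Sally", "ID": 99999, "Dept": "Finance"},   # a second Sally
]

all_attrs_is_key = is_key(rows, ATTRS)
keys = candidate_keys(rows, ATTRS)
```

For this sample the only candidate key is `("ID",)`. The full attribute set is a key too, but it is not a candidate key because it can be reduced.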
Keys
●
A “foreign key” is a set of attributes in one
relation that correspond to a candidate key in
another relation.
●
The set of values in each tuple for the attributes
of a foreign key must exist as a set of values in
a tuple for the corresponding attributes of the
'foreign' relation.
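The foreign key rule amounts to a subset test, sketched here in plain Python. The function name and the sample relations are assumptions for illustration; a real DBMS enforces this declaratively.

```python
def fk_violations(referencing, fk_attrs, referenced, key_attrs):
    """Return referencing rows whose foreign key values match no referenced row."""
    valid = {tuple(row[a] for a in key_attrs) for row in referenced}
    return [
        row for row in referencing
        if tuple(row[a] for a in fk_attrs) not in valid
    ]


departments = [{"Dept": "Engineering"}, {"Dept": "Finance"}]
employees = [
    {"Name": "John", "ID": 12345, "Dept": "Engineering"},
    {"Name": "Sally", "ID": 54321, "Dept": "Finance"},
    {"Name": "Jim", "ID": 13245, "Dept": "Marketing"},   # no such department
]

bad = fk_violations(employees, ("Dept",), departments, ("Dept",))
```

Only Jim's row is reported, because "Marketing" does not appear as a key value in `departments`.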
Views
●
A view is simply a relation, defined as the result
of a given relational expression.
●
All rules that apply to relations, apply to views.
●
Anything you can do to a relation, you should
be able to do to a view.
– Including updates (!)
View updates
●
Some vendors offer limited support (e.g.,
Oracle)
●
The “view update problem” is a real challenge.
– Any takers?
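A small demonstration of the gap, using Python's sqlite3 module as a convenient stand-in (other products differ in how much they support). Reading from the view works like reading from any table, but SQLite refuses the insert, even though the view's definition implies the missing department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY, name TEXT NOT NULL, dept TEXT NOT NULL)"
)
conn.execute("INSERT INTO employee VALUES (12345, 'John', 'Engineering')")
conn.execute("INSERT INTO employee VALUES (54321, 'Sally', 'Finance')")

# A view is a relation defined by a relational expression.
conn.execute(
    "CREATE VIEW engineer AS"
    " SELECT id, name FROM employee WHERE dept = 'Engineering'"
)

engineers = conn.execute("SELECT id, name FROM engineer").fetchall()

try:
    # Logically this could mean: insert (13245, 'Jim', 'Engineering').
    conn.execute("INSERT INTO engineer VALUES (13245, 'Jim')")
    view_insert_ok = True
except sqlite3.OperationalError:
    view_insert_ok = False   # SQLite treats plain views as read-only
```

Working out what an update to a view should do to the underlying relations, in general and without surprises, is the hard part.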
Practical recommendations
Given that most vendors don't faithfully implement
the relational model, what is one to do?
Practical recommendations
●
Use well-thought-out data types, supplemented
by check constraints and “domain” tables:
CREATE TABLE widget_type (
    widget_type VARCHAR(32) PRIMARY KEY
);
CREATE TABLE widget (
    id          INTEGER     PRIMARY KEY,
    name        VARCHAR(32) NOT NULL,
    widget_type VARCHAR(32) NOT NULL
                REFERENCES widget_type(widget_type)
);
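To see the “domain” table doing its job, here is a sketch that runs the same two-table schema through Python's sqlite3 module. The sample widget names are made up, and the `PRAGMA` line is specific to SQLite, which leaves foreign key enforcement off unless asked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite-specific: off by default

conn.execute("CREATE TABLE widget_type (widget_type VARCHAR(32) PRIMARY KEY)")
conn.execute("""
    CREATE TABLE widget (
        id          INTEGER     PRIMARY KEY,
        name        VARCHAR(32) NOT NULL,
        widget_type VARCHAR(32) NOT NULL
                    REFERENCES widget_type(widget_type)
    )
""")

conn.execute("INSERT INTO widget_type VALUES ('gear')")
conn.execute("INSERT INTO widget VALUES (1, 'small gear', 'gear')")

try:
    # 'sprocket' is not in the domain table, so this must be refused.
    conn.execute("INSERT INTO widget VALUES (2, 'mystery part', 'sprocket')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

widget_count = conn.execute("SELECT COUNT(*) FROM widget").fetchone()[0]
```

The set of allowed widget types is now data rather than application code, so adding a type is an insert, not a schema change.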
Practical recommendations
●
Select appropriate keys. Make sure every table
has a key (a set of attributes that uniquely
identify every tuple). This will prevent any
possibility of duplicate rows.
●
Use referential integrity where needed (don't
leave it up to the client code to enforce it).
Practical recommendations
●
Avoid nulls.
●
If nulls are “necessary”, understand the logical
and mathematical implications of using them.
●
Don't model more than one entity class in a
single table.
●
Normalize.
●
Understand the data-integrity implications of
denormalization.
Practical recommendations
●
Use plenty of well-thought-out constraints.
●
Chances are, someone or something, down the
road, will try to insert data into your table that
was never intended for that table.
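A sketch of what that looks like in practice, using Python's sqlite3 module. The specific rules (positive ids, non-empty names, a fixed list of departments) and the bad rows are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        id   INTEGER PRIMARY KEY CHECK (id > 0),
        name TEXT    NOT NULL    CHECK (length(name) > 0),
        dept TEXT    NOT NULL
             CHECK (dept IN ('Engineering', 'Finance', 'Marketing'))
    )
""")

candidates = [
    (12345, "John", "Engineering"),   # fine
    (-1, "Nobody", "Engineering"),    # id must be positive
    (1, "", "Finance"),               # name must not be empty
    (2, "Jim", "Basement"),           # not a known department
    (3, "Ann", None),                 # dept must not be null
]

accepted, rejected = [], []
for row in candidates:
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError:
        rejected.append(row)   # the DBMS, not the client, said no
```

Every client that touches the table gets the same rules, including the ones written years later by someone who never read the original code.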
A world of unstructured data
●
Files, video, audio, text, ad infinitum (and
growing).
●
But is it really unstructured?
●
Much of it really does have structure. And most
of it has metadata, which is certainly structured.
The role of Open Source
●
When will we see an industrial-strength
implementation of the relational model?
– There are several nascent projects.
●
When will we see an S3-scale implementation?
●
What becomes possible when the mainstream
catches on to the power of the model?
Implementations
●
Rel (Java)
●
Duro (C, Tcl)
●
MuldisDB (Perl 5 / Perl 6)
●
Alphora (commercial, non-open-source)
– Though they apostatized on the null issue
●
Dee (Python)
●
See http://thethirdmanifesto.com for URLs
The Role of Open Source
●
I don't think we've seen all the possibilities that
the relational model offers – since none of the
big players have provided a faithful
implementation of it.
●
That's where YOU come in :).
Losing my religion
●
Use the right tool for the job.
●
In theory, anything can be modelled with the
relational model, but sometimes it's not easy.
●
The model is powerful and is worth
understanding in depth. Don't discount it just
because the big players don't implement it well.
●
That said, it's not the do-all, end-all, so consider
all the options.
Recommended Reading
●
C. J. Date – Database in Depth, O'Reilly, 2005
– Brief (~200 pp), relatively inexpensive,
approachable.
●
C. J. Date – An Introduction to Database
Systems, 8th Ed, Addison-Wesley, 2003
– College text, large (~1000 pp), pricey, in-depth,
excellent.
Recommended Reading
●
Fabian Pascal – Practical Issues in Database
Management, Addison-Wesley, 2000
– Approachable, relatively brief (~250 pp).
●
C. J. Date – What Not How, Addison-Wesley,
2000
– A good overview of the benefits of thinking
declaratively, as opposed to procedurally.
Recommended Reading
●
C. J. Date and Hugh Darwen – Databases,
Types and the Relational Model (3rd
Edition), Addison-Wesley, 2006
– “Graduate-level” reading. A thorough definition of a
truly relational language.
●
http://thethirdmanifesto.com
– A website dedicated to topics in the above book.
Includes a very high-signal mailing list.
Recommended Reading
●
Codd, E. F. (1970). "A relational model of data
for large shared data banks". Communications
of the ACM, 13(6):377–387.
– A revised version of the original IBM report that
started it all.
●
http://en.wikipedia.org/wiki/Relational_model
– Wikipedia article on the Relational Model
Recommended Reading
●
http://www.dbdebunk.com
– Fabian Pascal's Website. Lots of good stuff there if
you can put up with his crustiness.
●
Everything else written by Date (there's a lot of
it).
The Relational Model
●
Questions?