
An Introduction to Relational Databases
Prof. Nicolas van Zeebroeck
September 2014

Translated by Nicolas Ameye from the book Les fondements de l'informatique: du bit au cloud,
3rd edition, by H. Bersini, MP & R Spinette, and N van Zeebroeck, Vuibert, 2014.
Editions Vuibert & Nicolas van Zeebroeck 2014. Do not copy or distribute without permission.

Context
The operating system that runs any computer, however small, is always responsible for
managing the file system: file names, directories, access rights, and actual location on
the disks. However, the operating system knows nothing about the structure, format and
contents of files. It simply manages their storage and their location on the hard drive,
as well as their access rights. When a program requests access to a file, the operating
system passes the file to it as a continuous and undifferentiated stream of bytes. It is
up to the program to read through the contents of the file sequentially, from beginning
to end, until it locates the information it was looking for. This access mode (which is
called "sequential" and which is the only way to access files offered by most operating
systems) generally suits most files and programs well. Nevertheless, such an access mode
assumes either that files are loaded entirely before they are actually read or edited
(which is the case, for instance, when you open a Word document or an Excel spreadsheet),
or that they are by nature read from beginning to end (as with an MP3 audio file). When
files are small enough, the access mode hardly matters, because loading them completely
is rarely a problem. The limitations of this mode of operation start to appear when files
are voluminous or made up of repetitive elements that all share the same structure. We
refer to these elements as "articles" (or records), each of which is divided into fields.
Moreover, we can define one or several fields which play the particular role of an
identifier or "attribute". An identifier is a field whose content is used to distinguish
one or more articles within the file. An identifier whose content is not repeated in the
file is said to be "unique", whereas an identifier with identical contents in several
articles of the file is said to be "multiple".
To illustrate this notion of identifier, let us suppose that you have decided to store
on your computer, on the one hand, the inventory of your wine cellar and, on

N. VAN ZEEBROECK - A NOTE ON RELATIONAL DATABASES (SEPT. 2014)

the other hand the list of transactions of your bank account. Grouping these two types of
information in one and the same file would probably not be a good idea, because the
data of your cellar are totally different from those of your bank account.





[Figure 1: sequential file, with a unique identifier and a multiple identifier marked]


So let us suppose that we keep these two kinds of information in two separate files. We
will first deal with the file containing the transactions of your bank account, assumed
to be your only account for the sake of simplicity. We will use this example to
illustrate the notions of "attribute", sequential access, indexed sequential access and
the sorting of articles. We shall then refine these notions by going down into the wine
cellar (coming back up would be more difficult).
As for the bank account, every transaction will be represented by an article containing
essentially a transaction number, a value date, a beneficiary account number, a more or
less cryptic description, an amount, and a balance. The transaction number is chosen as
the identifier because it is a serial number that differs for every article. In what
order are we going to organize the transactions in the file? The most logical arrangement
is to place them one after the other, in the order of the serial number which we chose as
identifier. We have thus built a sequential file, ordered according to the successive
values of a unique identifier (as shown in Figure 1). This sequential arrangement of
articles or transactions corresponds to the order in which the information accumulates,
and accurately reflects its chronological evolution as well as that of the account balance.
More generally, the sequence of articles within a file can be ordered according to the
contents of one or several identifiers of the article, whether unique or multiple. These
identifiers are then called "sort keys", or simply "keys". Articles can be sorted in
increasing or decreasing order of the contents of the keys. When the identifier is
multiple, the key has "duplicates". In our example with the bank transactions, we
decided that the file had to be sorted by transaction number. However, we could
have chosen another key, for example the amounts of the transactions listed in
descending order, showing the highest amounts first. Duplicates can then appear, because
this key is not a unique identifier: two different transactions can have the same
amount.
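
As a minimal sketch of these notions, the Python fragment below sorts a handful of transactions first by a unique key, then by a key with duplicates. The field names, the values and the helper has_duplicates are illustrative assumptions, not part of the original example.

```python
from operator import itemgetter

# Hypothetical bank transactions; field names and values are illustrative only.
transactions = [
    {"number": 1, "date": "2014-09-01", "amount": 120.00},
    {"number": 2, "date": "2014-09-03", "amount": 45.50},
    {"number": 3, "date": "2014-09-03", "amount": 120.00},
    {"number": 4, "date": "2014-09-07", "amount": 19.99},
]

def has_duplicates(records, key):
    """Return True if at least two records share the same value for `key`."""
    values = [record[key] for record in records]
    return len(values) != len(set(values))

# Sorted by the unique identifier, in increasing order.
by_number = sorted(transactions, key=itemgetter("number"))

# Sorted by amount, in decreasing order: this key has duplicates.
by_amount = sorted(transactions, key=itemgetter("amount"), reverse=True)

print([t["number"] for t in by_amount])       # [1, 3, 2, 4]
print(has_duplicates(transactions, "number"))  # False
print(has_duplicates(transactions, "amount"))  # True
```

Transactions 1 and 3 share the same amount; Python's sort is stable, so they keep their original relative order within the sorted result.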

Let us finally go down into the wine cellar. Here the choice of the identifier is less
obvious. It could be the name of the castle (château). Yes, but the Bordeaux region
probably has more Castles Bellevue than it has distinct names. Let us imagine that you
have two sorts of Castle Bellevue: one from Médoc, and another from Saint-Emilion.
Furthermore, they are not all of the same vintage. Let's leave it there: you can easily
see that a unique identifier is harder to choose. You can opt for a composite identifier
comprising at least the name of the castle, the appellation and the vintage. The location
in the cellar, in the form of a "vault number", would be a deceptively good idea, because
bottles get drunk and locations are reused. And in what order should the articles be
arranged in the file? The chronological order in which bottles are placed in the cellar
or consumed is simple to implement and to operate, but it does not meet every need. You
would have to read through the file to find out at what price you bought the Castle
Bellevue Médoc 1998. All that reading seems superfluous when the answer lies in a single
article. Do not give up: we shall see shortly that other forms of logical file
organization are much better suited to this kind of situation.
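
To make the cost of this reading concrete, here is a small sketch, assuming a made-up inventory of three bottles: the composite identifier is modelled as a tuple, and the search counts how many articles must be read before the right one is found. All names and prices are hypothetical.

```python
# Hypothetical cellar inventory; all names and prices are made up for illustration.
cellar = [
    {"castle": "Bellevue", "appellation": "Saint-Emilion", "vintage": 1998, "price": 22.0},
    {"castle": "Bellevue", "appellation": "Médoc", "vintage": 2001, "price": 15.0},
    {"castle": "Bellevue", "appellation": "Médoc", "vintage": 1998, "price": 18.5},
]

def composite_key(article):
    """Composite identifier: no single field is unique, but the triple is."""
    return (article["castle"], article["appellation"], article["vintage"])

def sequential_search(articles, wanted):
    """Read articles one by one; return (article, reads) or (None, reads)."""
    reads = 0
    for article in articles:
        reads += 1
        if composite_key(article) == wanted:
            return article, reads
    return None, reads

article, reads = sequential_search(cellar, ("Bellevue", "Médoc", 1998))
print(article["price"], reads)  # 18.5 3
```

With three bottles the cost is negligible, but the number of reads grows with the size of the file, which is what motivates the other access methods.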
Both examples highlight that programs have to share a common understanding of the
structure and internal organization of a file. This logical organization will depend
mainly on the nature of the applications which will have to use the file, sometimes in
the presence of contradictory requirements. Indeed, the chosen organization influences
not only the way the contents of the file are arranged, but also the possibilities of
accessing a fraction of the file without having to process it in its entirety. Let us now
examine and compare various access methods: sequential (the one you already know),
indexed, and by calculated address.
An example of sequential access, this time concerning a file whose contents are
organized as articles of fixed length, is that of the banking transactions described
above. Sequential access is well suited to listing the banking transactions, but a more
direct access would clearly be desirable in order to find a transaction made on a given
date, or a piece of information about a particular wine in our cellar.

Indexed access
Access by means of an index (like the index finger pointing at an object) is based on the
principle that files consist of articles or records that are worth consulting
individually, and that each of these records can be found through the value of a
particular identifier, which we called a "key" above. Figure 2 illustrates the principle
of indexed access. Every file is accompanied by an auxiliary table associating each value
of the index with the number of the block containing the article. Knowing the key is
therefore enough to locate the block. The index can be composite, when several fields are
combined to form the key. We can finally obtain the purchase price of the Castle Bellevue
Médoc 1998 without having to read through the file. It is also possible to create
multi-indexed sequential files, containing as many auxiliary tables as there are
different indexes. One more index, and we know immediately which wine is stored in a
particular location in our cellar. Similarly, we can transform our sequential file of
banking transactions into a multi-indexed file, with one index on account numbers and
another on dates.
[Figure 2: indexed access. Columns: key or index (1, 2, ..., 24, 25, 50); transaction
(transaction numbers 1, 25, 50).]


This will allow us to find immediately the transactions made on a given date with a
given account. For voluminous files, where the number of entries is too high, it is
necessary, as Figure 3 shows, to turn to a hierarchical organization of these
auxiliary tables: the entries of the first table point to intermediate tables, which
themselves point to the blocks containing the desired article (our example uses a single
intermediate level, but this subdivision can be repeated as often as required). This
access mode is similar to the one that the Unix system uses to organize its file
system.
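
The principle of an auxiliary table pointing to blocks can be sketched as follows. This is only an illustration under stated assumptions: the block size of 25, the 75 made-up transactions and the function find are hypothetical, and the table is searched with a binary search (Python's bisect module), a choice made here for brevity; the table could equally be scanned entry by entry.

```python
from bisect import bisect_right

BLOCK_SIZE = 25  # assumed number of articles per block

# Hypothetical file: 75 transactions numbered 1..75, stored in blocks.
articles = [{"number": n, "amount": 10.0 * n} for n in range(1, 76)]
blocks = [articles[i:i + BLOCK_SIZE] for i in range(0, len(articles), BLOCK_SIZE)]

# Auxiliary table: the first key of each block; the position in the list
# is the block number. Here it holds the keys 1, 26 and 51.
index_keys = [block[0]["number"] for block in blocks]

def find(number):
    """Locate the block through the index, then read only that block."""
    block_no = bisect_right(index_keys, number) - 1
    if block_no < 0:
        return None
    for article in blocks[block_no]:
        if article["number"] == number:
            return article
    return None

print(find(30))  # {'number': 30, 'amount': 300.0}
```

Only one block of 25 articles is read instead of the whole file. A hierarchical index applies the same idea once more, to the auxiliary table itself.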

Direct access by calculated address


Indexed access is unquestionably an excellent solution for applications requiring access
that is as fast and as direct as possible to a particular article on the basis of the
value of its key. However, this access comes at the price of consulting cumbersome
tables, wherever they are stored; tables which, moreover, must be read sequentially until
the value of the key that locates the article is reached. This sequential consultation
can be extremely expensive in terms of computing time. An alternative solution, which
could be even more efficient, is to associate the value of the key directly with the
address of the article within the file, without resorting to any table embodying this
association.


This address could be expressed as a reference consisting of a block number and a
location inside that block. Every possible value of the key would then correspond to a
single reference, associated with a portion of space on the disk. However, this solution
is rarely feasible, because the values of keys generally present too many gaps, which
would leave unused zones on the disk. The idea of associating the address with the value
of the key can nevertheless be retained, provided the address is produced by a
mathematical operation applied directly to the key.

[Figure 3: Hierarchical indexed access. A first table (transaction numbers 1, 101, 201)
points to intermediate tables (transaction numbers 1, 11, 21, ..., 91 and 101, 111,
121, ...), which point to the blocks containing the articles.]


The calculation that produces this association and allows direct access is sometimes
called "randomization", in reference to the generation of a random number (it is more
commonly known as "hashing"). This terminology should not hide the fact that this
particular form of "randomization" always generates the same result for a given value of
the key. This is indeed the indispensable condition for later finding the location of an
article from the value of its key. To be effective, the generation of addresses from the
keys must spread the articles as evenly as possible across the file. It will be necessary
to accept that several different key values can end up at the same address, and to manage
this problem when it occurs.
Direct access by calculated address favours speed of access as well as the addition of
articles. However, preserving a sequential order of the articles is possible only at the
price of considerable complexity.
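
A minimal sketch of access by calculated address, commonly called hashing, is given below. The text does not specify the mathematical operation, so the remainder of a division by the number of addresses is used here as an assumption, and keys that land on the same address are simply kept together in a list (one possible way of managing the problem, chosen for illustration).

```python
NUM_BUCKETS = 7  # assumed number of addresses available in the file

def address(key):
    """Calculated address: always the same result for the same key."""
    return key % NUM_BUCKETS

# Each address holds a list, so that several keys may share it.
buckets = [[] for _ in range(NUM_BUCKETS)]

def store(key, article):
    buckets[address(key)].append((key, article))

def fetch(key):
    """Go straight to the calculated address; no auxiliary table is consulted."""
    for stored_key, article in buckets[address(key)]:
        if stored_key == key:
            return article
    return None

# Hypothetical transaction numbers with large gaps between them.
for number in (3, 10, 58, 250, 4022):
    store(number, f"transaction {number}")

print(address(3), address(10))  # 3 3  (two keys, one address)
print(fetch(10))                # transaction 10
```

Keys 3 and 10 collide at address 3, yet both remain retrievable. Note also that the articles are no longer stored in the order of their keys, which is the price mentioned above.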


Database management systems


The access methods which we have just seen (sequential, indexed sequential, direct)
apply to an isolated file, made up of a single type of information. But the real world is
more complicated than that. The ambition of computing is nevertheless to describe this
real world, on a reduced scale, in a model which describes not only all the elements of
information but also the relations between them. Somewhere, there is a relationship
between certain transactions of your bank account and bottles in your wine cellar. More
generally, there is a relationship between banking transactions and the holders of the
accounts who make them.
A database is designed to bring together various files in a structured set where the
relations between similar fields of various articles can be expressed. A database consists
of all the data stored on disk and of the software which ensures their creation,
consultation and maintenance. Such software is called a database management
system or DBMS.
The complexity of this subject would justify a full book, and libraries and bookshops
abound with them. Our ambition in this chapter is not to cover all the refinements of
which databases are capable, but to familiarize you with the foundations of databases
and the way they are queried. This seems essential to us, because databases are at the
heart of computing and of all enterprise systems. There are various ways to organize and
manage databases, and particularly to implement the links between files. The model that
has prevailed for many years is the so-called "relational" model, and it is thus
logically the model which we choose to present.





[Figure 4: link from 1 bank account to its N transactions]


In the relational model, information is grouped according to its nature into tables,
which can be linked by a mechanism of join keys. As shown in Figure 4, the table of
"account holders" is linked to the table "Transactions" through the key formed by the
account number. Every holder, with his account number, is unique, and several
transactions can be identified by the same account number. We are thus in the presence of
a "1 to N" relationship. The "account number" relationship makes it possible to find the
information about the account holder for a given transaction and, conversely, all the
transactions of a particular holder.
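
The "1 to N" relationship can be tried out with the sqlite3 module from Python's standard library. This is only a sketch: the table names, column names and sample rows are assumptions made for the illustration, and SQLite merely stands in for whichever DBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holders (
        account_number TEXT PRIMARY KEY,
        name           TEXT NOT NULL
    );
    CREATE TABLE transactions (
        transaction_number INTEGER PRIMARY KEY,
        account_number     TEXT NOT NULL REFERENCES holders(account_number),
        amount             REAL NOT NULL
    );
    INSERT INTO holders VALUES ('ACC-1', 'Alice'), ('ACC-2', 'Bob');
    INSERT INTO transactions VALUES
        (1, 'ACC-1', 120.0), (2, 'ACC-2', -45.5), (3, 'ACC-1', 19.99);
""")

# From one transaction to its single holder.
holder = conn.execute("""
    SELECT h.name
    FROM transactions AS t JOIN holders AS h USING (account_number)
    WHERE t.transaction_number = ?
""", (3,)).fetchone()[0]

# From one holder to all of that holder's transactions.
numbers = [row[0] for row in conn.execute("""
    SELECT t.transaction_number
    FROM holders AS h JOIN transactions AS t USING (account_number)
    WHERE h.name = ?
    ORDER BY t.transaction_number
""", ("Alice",))]

print(holder, numbers)  # Alice [1, 3]
```

The account number appears once in the holders table and as many times as needed in the transactions table, which is exactly the "1 to N" situation described above.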
The choice of a database is of strategic importance for any organization, because the fit
of the data model with the reality of the organization and with its evolution can depend
on it. DBMS are an important area of development in information technology. Various kinds
of DBMS are available, ranging from relatively limited products such as Microsoft Access
or Apple FileMaker (often used by private individuals) to highly sophisticated systems
intended for enterprise applications such as Microsoft SQL Server, the open source system
MySQL and, at the top in terms of capabilities (and costs), Oracle. DBMS sometimes differ
substantially in their implementation and operation. This also means that switching from
one DBMS to another can pose major difficulties in a domain that is crucial for the smooth
running of an organization, so the long-term viability of the DBMS vendor turns out to be
an important consideration.
On the other hand, they share a common purpose: to provide access to the data
through a language called SQL (Structured Query Language). SQL has become a de facto
standard, but it nevertheless has dialectal variations from one vendor to another, which
sometimes holds a few surprises.

Tables
Inside a database, information is structured and stored within tables, which are to
the DBMS what files are to the operating system. The table constitutes the physical
storage unit of information and thus has two sides: on the one hand, a table determines
the structure of its contents and, on the other hand, it contains the information itself.
The first step in creating a database consists in identifying the various
tables. To do so, we shall ensure that every table can be associated with a logical object
in the database, which we call an "entity".
An entity is an abstract group of information, meaning that an entity corresponds to an
identifiable object in the real world which the database is intended to represent: a bank
account, a customer, a supplier, a vintage wine, etc. To illustrate this notion and those
that will follow, we propose to build together a small database. The objective of this
database will be to manage the registration of students in various courses, and the
grades they obtained. In this small simplified example, we can readily identify at least
two entities: students and courses. We shall see later on how to manage their
registrations and grades.
Every entity is characterized by a number of pieces of information which describe its
identity and its properties. We call these elements of information the "attributes" of the
entity. In the case of a bank account, these will be for example the account number, its
holder, its type (current, savings, etc.), or its balance. Let us return to our
example of students and courses. What information characterizes each of these two
entities? When we think of a course, the first elements which come to mind are for
example its title, the number of hours or credits which it represents (in the European
higher education system, this notion is expressed in a unit called ECTS, for European
Credit Transfer System), and its teacher. As for the students, we will naturally think of
their name, first name, date of birth, or address.
Of course, there are many other facets of a student which we could consider: his
nationality, his height, the names of his relatives, his income or that of his parents, his
bank account number, the colour of his hair, the name of his favourite band, the sports
which he practises, etc.
When designing a database, the IT specialist has to refrain from modelling the
world in the greatest detail. Firstly, because it would simply be impossible. Secondly,
because every attribute created in a table represents additional volume and
additional information to be recorded and maintained, which would make the database
excessively voluminous, heavy to manage and difficult to exploit. And finally, and
perhaps most importantly, most of this information would be of no use for the
services (registrations in courses, grades) which our small system aims to
provide. Furthermore, it could also be considered illegal or immoral for a
university to want to store relatively personal or confidential data. The designer of a
database will thus have to think very carefully about the attributes which are absolutely
necessary for the services which the database is intended to provide, and make sure the
structure of the entities includes these attributes, all these attributes, and only these
attributes.
At this stage, our database is composed of two entities, and thus of two tables: the
courses and the students. It is time to think about how we are going to uniquely identify
the articles contained in each of these tables, that is, about their identification key.
For the courses, we could consider using the title of the course as unique identifier, but
it would be risky, because in a university there are very often several courses of
accounting, mathematics, statistics or physics in various faculties, which are not given
by the same teacher or associated with the same number of credits. Some teachers also have
more "credit" than others. As in the example of our wine bottles, and as very often in
databases, the best solution consists in adding an attribute which has no other
purpose than to assign a unique identifier to every article. In universities, we usually
refer to the unique identifier or code of a course as its "mnemonic" and to that of a
student as his "roll number". As these two attributes serve to identify articles, they
constitute an "index key" (a mechanism which makes it possible to locate every article on
the physical storage medium). And as these attributes take a value that is unique and
specific to every article, they ensure its unambiguous identification. We shall call these
keys "primary" (primary key or PK). Every entity (table) must always contain at least one
attribute ensuring this unique identification (sometimes it is the combination of several
attributes that plays this role), and we shall take care to mark this or these essential
attributes distinctively in any representation of the table. When a primary key is made of
a value that has been created specifically for the sake of uniquely identifying records in
a table, it is called a surrogate key. Student roll numbers and course mnemonics are
examples of surrogate keys. Let's now draw the diagram of our database at this stage of
development. This is what Figure 5 illustrates.





Figure 5: Schema of a database containing two entities


Figure 5 represents the structure of the entities which compose our database. This
representation depicts every entity as a box whose header indicates
the name of the entity (the object that the entity represents) and whose main part lists
the various attributes. It is a rather simple and concise way to represent the structure of
an entity, i.e. its internal organization. But this representation sometimes causes some
confusion, because we arrange the attributes one below the other, while in reality the
attributes represent the columns of the table.
Let's return for a moment to the notion of table. Every entity corresponds to a table. Now,
a table generally consists of columns and lines. Columns correspond to the
various attributes which characterize the articles of the table (we also call them fields),
and every article corresponds to a line, which we shall prefer to call a "tuple" (or
record). From these synonyms, the following terminological equivalences emerge:

Physical    Virtual    Usual                 Conceptual
File        Table      Object represented    Entities
Field       Column     Characteristics       Attribute
Article     Line       Person                Tuple or record


These equivalences are important, because both professionals and the specialized
literature juggle with these various terms and use them interchangeably. You
will often hear about a file which contains columns and records, cheerfully mixing
terms borrowed from different views. But it matters little, as long as you can picture
the grouping, in a database, of elements which share the same perfectly regular internal
structure. These elements constitute the lines of a table whose structure is determined
by the columns. If we try to be rigorous, we shall say that tables represent conceptual
entities, columns represent their attributes, and lines represent the tuples, which are
so many concrete instances of the entity.
Let us focus for a moment on this notion of perfectly regular structure. Can we assert
that the lines of a table have exactly the same structure merely because they are
described by the same number of columns with the same titles? Conceptually, yes. In
practice, for the computer, it is a little less obvious. Think again of the way
information is physically stored in secondary memory. Inside a file, articles are written
one after the other. Likewise, within every article, fields are written one after the
other. The main interest of databases lies in the ability to consult or modify a given
piece of information without having to read through or rewrite the entire file. Imagine
you want to modify the title of a course which appears in the middle of the Courses table.
The academics in charge of the programme found, for example, that the course entitled
"Computing" would gain in clarity if it were renamed "Introduction to the Foundations of
Computing and Microcomputing". Much clearer, right? Well, let us ask the DBMS to find the
article corresponding to this course (we shall use the mnemonic "INFOS202", the value of
its primary key) and then ask it to replace "Computing" in the field "Title" with the
value "Introduction to the Foundations of Computing and Microcomputing". What impact do
you think this modification could have on the physical storage of the information
contained in this file? Since all the information is stored contiguously there, shouldn't
the replacement of a text string of 9 characters by another containing 63 characters
raise some issues? The answer is obvious. The insertion of 54 additional characters is
going to require shifting all the information which follows the field "Title" of the tuple
"INFOS202", and to impose a complete rewriting of the file from the point of insertion. We
have just caused exactly what we thought databases allowed us to avoid.
Then what is the solution? Let's follow our logic through to the end. We have said
that the various articles of a file in a database had to have exactly the same structure.
By merely imposing the number and the titles of the columns, we did not meet this
requirement. For the structure of articles to be exactly the same, the size and the form
of a field must be strictly identical for all the articles. We cannot just list the
columns of each table; we must specify their nature and format. Let us thus take our two
entities and their respective attributes again, and specify the nature and the format of
each of them, using the variable types available in most programming languages: boolean
(binary), integer, double, string.
DBMS generally allow these types to be refined, for example by setting the length of
integers (that is, the number of bits on which they are coded) and that of strings.
To these types are added some others that are typical of databases, such as dates. These
are coded as integers representing the number of days since the start of an agreed
year, for example since January 1st, 1900 (so the date of May 28th, 2014 would be
represented by the value 41787 under the common spreadsheet convention, which numbers
January 1st, 1900 as day 1 and counts a non-existent February 29th, 1900).
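
The date arithmetic can be checked with Python's datetime module. The sketch below compares a strict count of elapsed days with the spreadsheet-style numbering; the choice of December 30th, 1899 as reference day is the usual way of reproducing that numbering and is stated here as an assumption about the convention behind the value 41787.

```python
from datetime import date

target = date(2014, 5, 28)

# Strict calendar count of the days elapsed since January 1st, 1900.
elapsed = (target - date(1900, 1, 1)).days

# Spreadsheet-style serial number: counting from December 30th, 1899 reproduces
# the numbering in which January 1st, 1900 is day 1 and a (non-existent)
# February 29th, 1900 is also counted.
serial = (target - date(1899, 12, 30)).days

print(elapsed, serial)  # 41785 41787
```

Whatever the reference day, the principle is the same: a date is stored as a single integer, and the difference between two dates is a simple subtraction.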
The solution to our problem now seems rather obvious: all we need is to specify the type
of every field in the database. Most of the fields in our example are textual and
thus consist of character strings, for which we arbitrarily set a length (Char(X), where X
is the maximum number of characters the field can hold). Only the field "ECTS" in the
table "Courses" is of type integer (Int), and the field "Date of birth" of the table
"Student" is of type date.

Table Course
Attribute           Type
Mnemonic            Char(8)
Title               Char(256)
Teacher             Char(50)
Ects                Int
Domain              Char(50)

Table Student
Attribute           Type
Roll number (PK)    Char(6)
Name                Char(50)
First name          Char(50)
Date of birth       Date
Address             Char(256)
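
These two definitions can be written down almost literally in SQL. The sketch below uses the sqlite3 module from Python's standard library; the identifiers in lower case, the sample course and its teacher are assumptions made for the illustration. Note that SQLite accepts declarations such as CHAR(8) but does not enforce the declared length, whereas many other DBMS do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course (
        mnemonic CHAR(8) PRIMARY KEY,
        title    CHAR(256),
        teacher  CHAR(50),
        ects     INT,
        domain   CHAR(50)
    );
    CREATE TABLE student (
        roll_number   CHAR(6) PRIMARY KEY,
        name          CHAR(50),
        first_name    CHAR(50),
        date_of_birth DATE,
        address       CHAR(256)
    );
""")

# Hypothetical course record.
conn.execute("INSERT INTO course VALUES (?, ?, ?, ?, ?)",
             ("INFOS202", "Computing", "A. Teacher", 5, "Informatics"))

try:
    # A second course with the same primary key must be refused.
    conn.execute("INSERT INTO course VALUES (?, ?, ?, ?, ?)",
                 ("INFOS202", "Another course", "B. Teacher", 3, "Other"))
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

print(duplicate_accepted)  # False
```

The primary key declaration is what makes the DBMS reject the second insertion.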


Now that we have specified the type of every column of our tables, the problem of line
editing is solved. Even if we had to replace the title "Computing" with "Introduction to
the Foundations of Computing and Microcomputing", it would no longer require shifting all
the information that follows, because a space of 256 characters has already been
reserved inside the table for every "Title" field, even if the actual title has far fewer.
This solves the problem at hand, by guaranteeing a perfect uniformity of articles
and a stable file structure, but it can create another one, linked to the
storage volume.
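
The effect of fixed-length fields can be observed with Python's struct module. In this sketch an in-memory buffer stands in for the file on disk, the record layout is reduced to three fields, and one byte per character is assumed for brevity; the course data are hypothetical.

```python
import io
import struct

# Assumed layout: mnemonic (8 bytes), title (256 bytes), ECTS (4-byte integer).
RECORD = struct.Struct("<8s256si")

def pack(mnemonic, title, ects):
    # struct pads short byte strings with zero bytes up to the declared length.
    return RECORD.pack(mnemonic.encode("ascii"), title.encode("ascii"), ects)

def unpack(raw):
    mnemonic, title, ects = RECORD.unpack(raw)
    return (mnemonic.rstrip(b"\0").decode("ascii"),
            title.rstrip(b"\0").decode("ascii"),
            ects)

f = io.BytesIO()  # stand-in for the file
for course in [("MATHS101", "Mathematics", 5),
               ("INFOS202", "Computing", 5),
               ("STATS303", "Statistics", 4)]:
    f.write(pack(*course))
size_before = len(f.getvalue())

# Overwrite record number 1 in place: its position is simply 1 * record size.
f.seek(1 * RECORD.size)
f.write(pack("INFOS202",
             "Introduction to the Foundations of Computing and Microcomputing", 5))
size_after = len(f.getvalue())

# The record that follows has not moved and is intact.
f.seek(2 * RECORD.size)
following = unpack(f.read(RECORD.size))

print(size_before, size_after, following)
```

The file keeps the same size before and after the change, and the position of every record can be computed directly from its rank, without reading the records that precede it.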
Indeed, on the basis of the field lengths set arbitrarily above, we can easily estimate
the space which will invariably be needed to store an article of each table. If
every character is coded on 32 bits (as is the case in the UTF-32 encoding of Unicode),
and so are an integer and a date, every line of the table "Courses" will occupy the space
corresponding to 364 characters and 1 integer, that is 365 times 32 bits, thus 11680 bits
or 1460 bytes. Likewise, the records of the table "Student" will occupy the space
corresponding to 362 characters and a date coded as an integer, that is 363 times 32 bits
each, thus 11616 bits or 1452 bytes. When we consider that databases in firms often
contain millions or even hundreds of millions of lines in every table, the resulting
volume can quickly become gigantic.
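
The estimate is easy to reproduce. The short sketch below takes the field lengths of the two tables, counts an integer or a date as one 32-bit unit as the text assumes, and computes the size of one record and of a table of one million lines (the row count is an arbitrary illustration).

```python
BITS_PER_UNIT = 32  # per character, integer or date, as assumed in the text

# Number of 32-bit units per field (a character string of length X counts X units).
course = {"Mnemonic": 8, "Title": 256, "Teacher": 50, "Ects": 1, "Domain": 50}
student = {"Roll number": 6, "Name": 50, "First name": 50,
           "Date of birth": 1, "Address": 256}

def record_bytes(fields):
    """Size in bytes of one fixed-length record."""
    return sum(fields.values()) * BITS_PER_UNIT // 8

def table_bytes(fields, rows):
    """Size in bytes of a table holding `rows` records."""
    return record_bytes(fields) * rows

print(record_bytes(course), record_bytes(student))  # 1460 1452
print(table_bytes(course, 1_000_000))               # 1460000000
```

One million courses would thus occupy about 1.46 billion bytes, most of it reserved for characters that are never used.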
For that reason, the designer of the database would be well advised to strive to
minimize the size of each field without risking losing the capacity to store the
real information. You now understand why those of us who have long family names often see
them truncated when the information systems of the administrations with which
they interact reproduce them. It is for the sake of economy that database administrators
try hard to limit the size of all the fields of their databases, sometimes at the
price of losing part of the information.
Strings are by far the most problematic case in this respect, because their
length is highly variable by nature. Imposing a fixed length on every string can, from
this point of view, be very suboptimal, because it requires reserving storage capacity
which far exceeds the needs of most records. In our example, we allowed 50
characters to store the first name. This should be enough in most cases (and even then),
but for the vast majority of students it is far too much and entails a waste
of space in the database: most of the space reserved for the characters will be
filled with spaces, and our database will essentially be made up of blanks.
DBMS solve this problem by offering a particular textual variable type called
"variable character field" ("varchar" in a large number of DBMS). Instead of imposing a
fixed length on each string, we impose a maximum length without
reserving the storage space corresponding to this maximum.
Technically, instead of specifying a character format such as Char(50), we specify
Varchar(50). In so doing, instead of filling the unused characters with blank
spaces, the DBMS stores only the actual characters and precedes them with a
byte indicating the actual length of the string (which allows it to know where this field
stops and where the following one begins).
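
The length-prefix mechanism can be sketched in a few lines. The functions encode_varchar and decode_fields are hypothetical helpers written for this illustration, one byte per character is assumed, and the actual on-disk format of any given DBMS will differ in its details.

```python
def encode_varchar(text, max_length=50):
    """Length-prefixed storage: 1 byte for the length, then the characters."""
    if max_length > 255:
        raise ValueError("a single length byte cannot describe more than 255 characters")
    data = text.encode("ascii")  # one byte per character, for simplicity
    if len(data) > max_length:
        raise ValueError(f"value longer than {max_length} characters")
    return bytes([len(data)]) + data

def decode_fields(raw):
    """Split a run of consecutive length-prefixed fields back into strings."""
    fields, pos = [], 0
    while pos < len(raw):
        length = raw[pos]  # the length byte says where this field stops
        fields.append(raw[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return fields

# Hypothetical first name and name stored one after the other.
record = encode_varchar("Ada") + encode_varchar("Lovelace")

print(len(record))            # 13
print(decode_fields(record))  # ['Ada', 'Lovelace']
```

Thirteen bytes are used where two Char(50) fields would have reserved one hundred, and the length bytes are enough to find the boundary between the two fields.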
It seems simple, but doesn't it recreate the problem linked to the contiguous storage
of information, i.e. the risk of having to rewrite the entire contents of
a file at every insertion? In theory, yes. In practice, no, because DBMS solve it by always
leaving a little free space inside every article. When a value grows, rather than
moving and shifting all the articles which follow, they generally just have to rewrite the
single article concerned, reorganizing its free space. When the growth of a value
exceeds the free space in the article concerned, the DBMS will prefer to move the article
itself to the end of the file (or to the first space of sufficient size). In such cases, the
management of indexes becomes considerably more complicated, but let us leave these
considerations to DBMS designers, who relish them.
The precise definition of the type of every field is not solely intended to
facilitate the management of information storage. In reality, we define the
nature of the data in databases, as in any computer program, first of all to ensure their
consistency, integrity and future processing. By requiring that the field "Date of birth"
be of type Date, we forbid the entry of any value which does not comply with
this requirement, and we thus guarantee the integrity of the data. This provides a
certainty on which programmers can rely when they write code intended
to process or manipulate this information. There is no way to store the title of the
course in the field "ECTS": the DBMS would not allow it. And it is for the best, because it
would be a very unpleasant surprise, with heavy consequences, for a program which relies
on the fact that the information contained in this field is numeric.
This concern for the integrity and coherence of data is such that most DBMS allow
validation rules to be defined for every field. To begin with, we can always determine
whether or not a field may be left blank. By default, a field can remain empty (unless it
is a primary key or, in some systems, a field carrying a unique index), but it is
possible to force the entry of a value for the record to be valid. We can also impose
limits on the allowed values of numeric or date fields (for example, by stating that a
date of birth cannot lie in the future, or that an age cannot be negative), or impose a
certain format on textual data (for example, by stating that a mnemonic consists of 5
letters followed by 3 digits and cannot contain spaces). It is worth being aware of these
possibilities, but the way these rules are defined varies so much from one system to
another that it is beyond the scope of this note.
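To give an idea of what such rules can look like, here is a minimal sketch using SQLite through Python's standard sqlite3 module. The syntax is specific to SQLite and is only one possibility among many. The column names are ours, and the typeof() test is needed because SQLite, unlike most DBMS, does not enforce column types strictly by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Courses (
        -- Format rule: 5 capital letters followed by 3 digits, no spaces.
        mnemonic TEXT PRIMARY KEY
            CHECK (mnemonic GLOB '[A-Z][A-Z][A-Z][A-Z][A-Z][0-9][0-9][0-9]'),
        -- This field may not be left blank.
        title    TEXT NOT NULL,
        -- This field must hold a strictly positive integer.
        ects     INTEGER NOT NULL
            CHECK (typeof(ects) = 'integer' AND ects > 0)
    )
    """
)


def try_insert(row):
    """Return True if the DBMS accepts the row, False if a rule rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO Courses VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(("DROIS201", "Commercial Law", 3))
missing_title = try_insert(("ECONS201", None, 6))
text_in_ects = try_insert(("GESTS203", "Analysis of financial statements", "Commercial Law"))
bad_mnemonic = try_insert(("INFO 202", "Introduction to computing", 5))
print(accepted, missing_title, text_in_ects, bad_mnemonic)
```

Only the first row respects all the rules; the other three are refused by the DBMS before they ever reach the table.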
One last detail, but an important one: to allow direct access, an index can be associated
with any field. Indexing a column allows the records of the table to be ordered by its
value, and the lines where the value of this field matches what we are looking for to be
reached directly.
When an index is added to a column, the DBMS requires us to specify whether the index
must be unique (in which case every value can appear in only one line of this column) or
whether duplicates are allowed (in which case the same value can appear several times in
the same column). The field which plays the role of primary key must obviously be indexed
uniquely. All DBMS can manage indexes on combinations of fields.
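The difference between a unique index and an index with duplicates can be sketched as follows, again with SQLite as an arbitrary example of a DBMS. The index names are ours, and the student with roll 999999 is a hypothetical namesake invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (roll INTEGER, name TEXT, first_name TEXT)")

# Unique index: a given roll may appear in one line only.
conn.execute("CREATE UNIQUE INDEX idx_students_roll ON Students (roll)")
# Ordinary index: duplicates are allowed.
conn.execute("CREATE INDEX idx_students_name ON Students (name)")
# Index on a combination of fields.
conn.execute("CREATE INDEX idx_students_full_name ON Students (name, first_name)")

conn.execute("INSERT INTO Students VALUES (298489, 'Marcel', 'Pascale')")
conn.execute("INSERT INTO Students VALUES (300487, 'Berzok', 'Nicolas')")
# Hypothetical namesake: accepted, because the index on name is not unique.
conn.execute("INSERT INTO Students VALUES (999999, 'Berzok', 'Anna')")

try:
    # Same roll as an existing student: refused by the unique index.
    conn.execute("INSERT INTO Students VALUES (300487, 'Pins', 'Robert')")
    duplicate_roll_accepted = True
except sqlite3.IntegrityError:
    duplicate_roll_accepted = False

berzoks = conn.execute(
    "SELECT COUNT(*) FROM Students WHERE name = 'Berzok'"
).fetchone()[0]
print(duplicate_roll_accepted, berzoks)
```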
Now that our two entities have been identified, described and structured, we can begin
to fill our tables with fictitious data. DBMS allow data to be entered into tables in the
manner of a spreadsheet like Excel. They can also display the contents as follows
(remembering that lines represent tuples and that columns represent attributes):

Courses

Mnemonic  Title                                  Teacher         Ects  Domain
DROIS201  Commercial Law                         Françoise Juge  3     Law
ECONS201  Macroeconomics I                       Rovert Ycroit   6     Economics
GESTS203  Analysis of financial statements       Faska Comte     3     Business
INFOS202  Introduction to computing              Nicolas Code    5     Computing
LANGS201  English I                              Ian Smith       4     Languages
MATHS201  Mathematics II                         Marie Cosinus   5     Mathematics
STATS202  Probability and statistical inference  Lise Peutet     10    Mathematics

Students

Roll number  Name     First name   Date of birth  Address
298489       Marcel   Pascale      21/05/1994     Street Hirondelles 2
300487       Berzok   Nicolas      14/09/1994     Street Brederode 14
310452       Pins     Robert       1/08/1995      Street Lilas 13
325637       David    Emilie       19/04/1995     Place Etoile 4
330777       Sineri   Hugues       13/11/1994     Square Tilleuls 12
380654       Piesnet  Marie-Paule  8/06/1995      Avenue Pasteur 35

Relationships
Currently, our two tables are totally independent of each other, and nothing allows us to
establish a correspondence between them. This is because, until now, we have left aside
the main objective of our database, i.e. to manage the students' enrollment in the
courses and the results they obtain in the corresponding examinations.
Before coming to this, let us briefly describe the three possible types of relationships
in a relational database. These three types differ only in the multiplicity (we sometimes
say the cardinality) of the possible correspondences. We speak of relationships from 1 to
1 (one to one), 1 to N (one to many) or N to M (many to many).

Relationships 1-1
Let's start by imagining that the university allows its students to access the parking
lots of its campus. To obtain this access, every student has to register his car with the
campus security service, and receives in exchange a sticker to put on the windshield of
his vehicle, the colour of which determines where he may or may not park. We could easily
store the corresponding information in a table called "Vehicles" containing the following
fields: plate (of registration), which will constitute the primary key, roll number (the
student's enrolment number), brand, model and access type.
As each student can get access for only one car, and as a car can be associated with only
a single access, and thus with only a single student, this establishes a unique match
between a student and a vehicle, that is, a 1-1 relationship. This mapping is possible
only because one of the two tables includes a field whose values can be found in a
similar field of the other table. Here, it is the student's roll number that plays this
role. In the table Students, the roll serves as primary key, which automatically implies
that every value is unique. In the table Vehicles, the field does not serve as primary
key (it is the license plate which plays this role), but it is no less unique for that
(when creating the table, we created a unique index on the column "Roll" of the table
Vehicles to guarantee that every student is associated with only one vehicle). It is
because the indexes are unique on both sides that we can speak of a 1-1 relationship, as
represented in Figure 6.


Figure 6: An example of a 1-1 relationship


The field which ensures the correspondence between two tables without constituting the
primary key is called the "foreign key" (FK). The qualifier "foreign" means that the
values of this column are borrowed from another table and are there to establish a
formal relationship.
This element is essential because it indicates the nature of a relationship between two
tables in a database. A relationship between two tables is in reality always a matching
of two fields. At data entry time, it means asking the database to make sure that when a
value is entered into a column which serves as foreign key in a table (here the column
"Roll" of the table Vehicles), this value exists in the corresponding field of the other
table (where it normally plays the role of primary key, which is the case for the field
"Roll" of the table Students). Both tables are connected because every vehicle contains a
reference to one and only one student. From then on, it is possible to find the address
and phone number of the student whom the security service has to contact to inform him
that the car he had carefully parked in the parking lot has unfortunately been dented by
another student who was rather distracted or tipsy: it simply searches for his vehicle in
the table Vehicles, using the license plate as index. Once the corresponding tuple is
found, it is easy to get the roll of the student. This roll is then used as an index to
find the address and phone number of the unfortunate student in the table Students and
tell him the good news.
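The mechanism just described can be sketched with SQLite. The column names, the plates and the vehicle data are hypothetical (the note gives no sample vehicles), and the Students table is reduced to the columns needed here. Note that SQLite only checks foreign keys when the corresponding PRAGMA is switched on, a peculiarity of that system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript(
    """
    CREATE TABLE Students (
        roll    INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT
    );
    CREATE TABLE Vehicles (
        plate       TEXT PRIMARY KEY,
        -- Foreign key, indexed uniquely: at most one vehicle per student.
        roll        INTEGER NOT NULL UNIQUE REFERENCES Students (roll),
        brand       TEXT,
        model       TEXT,
        access_type TEXT
    );
    INSERT INTO Students VALUES (298489, 'Marcel', 'Street Hirondelles 2');
    INSERT INTO Students VALUES (300487, 'Berzok', 'Street Brederode 14');
    -- Hypothetical vehicle, invented for this example.
    INSERT INTO Vehicles VALUES ('1-ABC-123', 300487, 'BrandX', 'ModelY', 'green');
    """
)


def owner_of(plate):
    """From a plate, follow the foreign key to the owner's name and address."""
    return conn.execute(
        """
        SELECT s.name, s.address
        FROM Vehicles AS v
        JOIN Students AS s ON s.roll = v.roll
        WHERE v.plate = ?
        """,
        (plate,),
    ).fetchone()


def rejected(row):
    """Return True if the DBMS refuses to insert this vehicle."""
    try:
        with conn:
            conn.execute("INSERT INTO Vehicles VALUES (?, ?, ?, ?, ?)", row)
        return False
    except sqlite3.IntegrityError:
        return True


owner = owner_of("1-ABC-123")
# Roll 111111 does not exist in Students: the foreign key refuses it.
unknown_student = rejected(("2-DEF-456", 111111, "BrandX", "ModelZ", "red"))
# Student 300487 already has a vehicle: the unique index refuses a second one.
second_vehicle = rejected(("3-GHI-789", 300487, "BrandX", "ModelZ", "red"))
print(owner, unknown_student, second_vehicle)
```

The address is stored once, in Students; the table Vehicles only holds the roll that leads to it.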
From this example, we can deduce one of the main benefits of linking information:
avoiding redundancy and duplication. Without the ability to link the tables Vehicles and
Students, we would have had to copy all the details of the student (name, first name,
address, etc.) into the table Vehicles, which would be not only disastrous in terms of
storage (a waste), but catastrophic in terms of maintenance and consistency of the data
(if the address and contact details of a student changed, we would be forced to modify
them in both tables instead of one, and if we forgot to do it on one side, we would have
conflicting information with no way of knowing which one is correct). It is up to the
designer of a database to make sure that no information is duplicated in it, except for
the foreign keys. It is the designer's job to structure the tables and their
relationships so as to avoid redundancy.
Let us now return to our 1-1 relationship between students and vehicles. Each of these
two tables represents a different conceptual entity (a student is not a car and vice
versa). But if we are certain that to every vehicle corresponds one and only one student,
and that to every student corresponds one and only one vehicle, we can merge the
information of these two tables into a single one without the risk of creating
redundancy or wasting space on empty fields. We say in this case that we "merge the
conceptual entities". That's what we've done in Figure 7.
Such a merger of conceptual entities is carried out in a large number of cases, so that
in practice 1-1 relationships are comparatively rare in real-life databases. Technically,
it is always possible to carry out such a merger without the risk of creating redundancy,
because of the uniqueness of the relationship.








Figure 7: Schema of a database containing two entities



Nevertheless, such an operation is not always beneficial. Two considerations must be
weighed. The first is performance, that is, the speed of access to the data. If the data
contained in the two tables has to be frequently sorted or combined, it is better to
merge them, because combining tables is a costly operation. If, on the other hand, the
information is generally consulted or modified separately, merging the two tables will
bring no tangible gain in performance (which is conceivable in our case, because
combining the information of both tables would be justified only if there were suddenly
a need to contact the owner of a car, a situation we hope is rare). On the contrary,
merging the tables could harm the speed of access, because it would increase the size of
the records, so that the DBMS would have to cover a greater distance to move from one
record to the next.
The second consideration is storage optimization. If the tables do not contain any
redundancy (as we hope), merging them saves a field (the foreign key is no longer
necessary once both tables are merged into one). So there is a theoretical saving in
storage (but limited to a column and an index). If the relationship is strictly of type
1-1 (i.e. to every line of one table corresponds one and only one line of the other),
then this saving will be a net gain. But let us imagine (it does not require a great deal
of effort) that some students do not have a vehicle. Let us even imagine, times being
hard, that a large proportion of students are pedestrians or cyclists. In this case, the
fields "Plate", "Brand", "Model" and "Access type" will remain desperately empty in the
merged table for all these students, resulting in a waste of space which could quickly
wipe out the thin gain obtained by removing the column that served as foreign key when
the tables were still separate. In this case, we would be far better advised to keep the
two tables separate, knowing that every roll which appears in the table Vehicles will
correspond to exactly one student in the table Students, but that some students in the
table Students will not be linked to any vehicle.
The relationship would then be slightly asymmetric, but would remain a 1-1 relationship
as long as a student cannot have more than one car and a car cannot be associated with
more than one student.
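This asymmetric situation can be sketched as follows, with SQLite as an arbitrary example and a hypothetical plate. With separate tables, the students without a vehicle simply have no line in Vehicles; an outer join shows the empty fields that a merged table would have had to store for them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (roll INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Vehicles (
        plate TEXT PRIMARY KEY,
        roll  INTEGER NOT NULL UNIQUE REFERENCES Students (roll)
    );
    INSERT INTO Students VALUES (298489, 'Marcel'), (300487, 'Berzok'), (310452, 'Pins');
    -- Hypothetical plate: only one of the three students owns a car.
    INSERT INTO Vehicles VALUES ('1-ABC-123', 300487);
    """
)

# LEFT JOIN keeps every student; the plate is NULL (None) for those without a car.
rows = conn.execute(
    """
    SELECT s.roll, s.name, v.plate
    FROM Students AS s
    LEFT JOIN Vehicles AS v ON v.roll = s.roll
    ORDER BY s.roll
    """
).fetchall()

pedestrians = [name for _, name, plate in rows if plate is None]
vehicle_lines = conn.execute("SELECT COUNT(*) FROM Vehicles").fetchone()[0]
print(rows, pedestrians, vehicle_lines)
```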

Relationships 1-N
Now, back to the table Courses. One piece of information there relates to the domain to
which each course belongs: economics, management, computing, mathematics and statistics,
languages, etc. Up to now, we have limited ourselves to storing this piece of information
as text in a dedicated column named "Domain". Let us now imagine that the university
manages these domains by assigning to each of them an academic person in charge and a
type (exact, human or social sciences). We could of course add these two elements to the
table Courses, which would give the result below:

Courses

Mnemonic  Title                                  Teacher         Ects  Domain       Person in charge  Type
DROIS201  Commercial Law                         Françoise Juge  3     Law          Prof. Juste       Human
ECONS201  Macroeconomic theories 1               Rovert Ycroit   6     Economics    Dr. Munt          Social
GESTS203  Analysis of financial statements       Faska Comte     3     Business     Dr. Picsou        Social
INFOS202  Introduction to computing              Nicolas Code    5     Computing    Prof. Apfel       Exact
LANGS201  English 1                              Ian Smith       4     Languages    Prof. Babel       Human
MATHS201  Mathematics 2                          Marie Cosinus   5     Mathematics  Prof. Chiffre     Exact
STATS202  Probability and statistical inference  Lise Peutet     10    Mathematics  Prof. Chiffre     Exact


The problem with this solution is rather obvious, especially if you look, for example, at
the courses of mathematics and probability: the information is redundant. Horresco
referens! Redundancy is the ultimate abomination in a database, the last thing you want,
for the reasons mentioned above: wasted space, difficult maintenance, risk of
inconsistency. In reality, the problem comes from a design error: obviously, the
redundant information (the last three columns of the table above) refers to a conceptual
entity which is different from the one that the table Courses represents (a course is not
a domain and vice versa). Because they are two conceptually different entities, we have
to separate them into two tables, which means creating a table Domains in addition to the
existing table Courses.
The new table will include three pieces of information: the name of the domain, the name
of its academic person in charge and its type. We could use the name of the domain as
primary key, assuming that two domains would never share the same name. But that would
not be very wise, because textual keys generally take more space in memory than numeric
keys.

We shall thus add a column "ID Domain" which will serve specifically as surrogate primary
key, in the form of an integer whose values will be assigned by the DBMS incrementally
(every new record automatically receives as identifier the number following the
identifier of the last inserted record). Depending on the software, this function may be
called AutoNumber or AUTO_INCREMENT. In the table Courses, we then have to replace the
names of the domains by the numbers assigned to them as primary keys in the table
Domains. The field "Domain" in the table Courses thus becomes a foreign key. The
resulting schema and the data contained in these two tables are represented in Figure 8.

Domains

ID Domain  Name         Person in charge  Type
1          Law          Prof. Juste       Human
2          Economics    Dr. Munt          Social
3          Business     Dr. Picsou        Social
4          Computing    Prof. Apfel       Exact
5          Languages    Prof. Babel       Human
6          Mathematics  Prof. Chiffre     Exact

Courses

Mnemonic  Title                                  Teacher         Ects  Domain
DROIS201  Commercial Law                         Françoise Juge  3     1
ECONS201  Macroeconomic theories 1               Rovert Ycroit   6     2
GESTS203  Analysis of financial statements       Faska Comte     3     3
INFOS202  Introduction to computing              Nicolas Code    5     4
LANGS201  English 1                              Ian Smith       4     5
MATHS201  Mathematics 2                          Marie Cosinus   5     6
STATS202  Probability and statistical inference  Lise Peutet     10    6

Figure 8: Two tables linked by a 1-N relationship


We can then easily understand that the relationship which connects the courses and the
domains is this time plural in nature: a domain can be linked to many courses, but a
course can be linked to only a single domain. The relationship is thus of type 1 to N.
This multiplicity and its direction are determined by the keys. As the relationship is
always established between two fields, we have already seen that it is the uniqueness or
otherwise of the key index which determines this dimension of the relationship. In our
current case, the relationship is materialized by linking the field "ID Domain" of the
table Domains with the field "Domain" of the table Courses. The first constitutes the
primary key and is thus necessarily unique. The second is not uniquely indexed (this is
clear conceptually, because several courses can obviously belong to the same domain, and
practically, because the courses of mathematics and of probability are both attached to
the same domain, which is why the number 6 shows up twice in the column "Domain" of the
table Courses). A uniquely indexed field, and a fortiori a primary key, can only be on
the "1" side of a relationship. A field which is not uniquely indexed is necessarily on
the "N" side.
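The two tables and their 1-N relationship can be sketched with SQLite as follows. The column names are ours; AUTOINCREMENT is the SQLite spelling of the automatic numbering function, and the Teacher column is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Domains (
        id_domain        INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, "1" side
        name             TEXT NOT NULL UNIQUE,
        person_in_charge TEXT,
        type             TEXT
    );
    CREATE TABLE Courses (
        mnemonic TEXT PRIMARY KEY,
        title    TEXT NOT NULL,
        ects     INTEGER,
        -- Foreign key without a unique index: "N" side of the relationship.
        domain   INTEGER NOT NULL REFERENCES Domains (id_domain)
    );
    """
)

# The DBMS assigns the identifiers 1 to 6 in insertion order.
conn.executemany(
    "INSERT INTO Domains (name, person_in_charge, type) VALUES (?, ?, ?)",
    [
        ("Law", "Prof. Juste", "Human"),
        ("Economics", "Dr. Munt", "Social"),
        ("Business", "Dr. Picsou", "Social"),
        ("Computing", "Prof. Apfel", "Exact"),
        ("Languages", "Prof. Babel", "Human"),
        ("Mathematics", "Prof. Chiffre", "Exact"),
    ],
)
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?, ?, ?)",
    [
        ("DROIS201", "Commercial Law", 3, 1),
        ("ECONS201", "Macroeconomic theories 1", 6, 2),
        ("GESTS203", "Analysis of financial statements", 3, 3),
        ("INFOS202", "Introduction to computing", 5, 4),
        ("LANGS201", "English 1", 4, 5),
        ("MATHS201", "Mathematics 2", 5, 6),
        ("STATS202", "Probability and statistical inference", 10, 6),
    ],
)

# One domain, several courses: count the courses attached to each domain.
per_domain = conn.execute(
    """
    SELECT d.name, d.person_in_charge, COUNT(c.mnemonic)
    FROM Domains AS d
    LEFT JOIN Courses AS c ON c.domain = d.id_domain
    GROUP BY d.id_domain
    ORDER BY d.id_domain
    """
).fetchall()
print(per_domain)
```

The person in charge of mathematics is now stored once, however many courses refer to domain 6.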
1-N relationships are excellent examples of the storage savings provided by a relational
database. By grouping and isolating pieces of information which "go together" and form a
whole within distinct conceptual entities, and by allowing them to be referred to
elsewhere simply by including an identifier that refers to this set, we can easily avoid
any redundancy in the data.

Relationships N to M
Finally, we come to the primary goal of our database: the management of course
registrations and grades. We can already suspect that it will be necessary to establish a
relationship between the students and the courses. But what would be the nature of this
relationship? Can a student attend several courses? Despite the successive reforms of
higher education, which encourage students not to overwork and to spread their efforts
over several years if necessary, we can reasonably hope that the answer is yes. Can the
same course be followed by several students? Fortunately for the teachers, the answer is
yes as well. A course is thus connected to several students, and a student can be linked
to several courses. We are facing a "many-to-many relationship", more simply called "N to
M".
But how can such a relationship be materialized? We saw that the only way of relating two
tables is to insert into one of them a column which will serve as foreign key by
referring to the primary key of the other table. So let us try to do that in our database
and see what it yields, assuming that the first two students are both registered for the
first two courses.

Students

Roll number  Name    First name  Date of birth  Address               Mnemonic
300487       Berzok  Nicolas     14/09/1994     Street Brederode 14   DROIS201
300487       Berzok  Nicolas     14/09/1994     Street Brederode 14   ECONS201
298489       Marcel  Pascale     21/05/1994     Street Hirondelles 2  DROIS201
298489       Marcel  Pascale     21/05/1994     Street Hirondelles 2  ECONS201

Courses

Mnemonic  Title                     Teacher         Ects  Domain  Roll
DROIS201  Commercial Law            Françoise Juge  3     1       300487
ECONS201  Macroeconomic theories 1  Rovert Ycroit   6     2       300487
DROIS201  Commercial Law            Françoise Juge  3     1       298489
ECONS201  Macroeconomic theories 1  Rovert Ycroit   6     2       298489

The result is a disaster. Including the mnemonic of the course in a new column of the
table Students is a disaster because it means repeating the whole line for every new
course with which a student is associated. Including the roll of a student in a new
column of the table Courses has the same result, because it means duplicating the course
for every new student who joins it.
The problem is that we should always put the foreign key on the "N" side and never on the
"1" side, yet in an N to M relationship every table is at the same time on the 1 side and
on the N side: ONE student is linked to SEVERAL courses, and ONE course is linked to
SEVERAL students. We cannot include the primary key of one of the tables as a foreign key
in the other. In reality, we simply cannot connect these tables directly. We need a kind
of artifice, which consists in creating a new table in which the relationships between
the two tables will be materialized.
So imagine we create an additional table which we shall call "Joint" and which contains
three columns: an identifier specific to every line of the table, which will serve as
primary key (i.e. a surrogate key), the roll number of the student, and the mnemonic of
the course. We can now record the registration of the first two students for the first
two courses by creating four new records in this new table.

Joint

ID joint  Student  Course
1         300487   DROIS201
2         300487   ECONS201
3         298489   DROIS201
4         298489   ECONS201


Creating this joint table lets us avoid modifying the tables we wish to connect in N to
M, and thus avoid creating the redundancy observed above. Let us reflect for a moment on
the relationships which this table implicitly creates in our database. We inserted into
the table Joint the unique identifier of the student (his roll), which plays the role of
foreign key. We now have a de facto relationship between the table Students and the table
Joint. As the relationship is materialized by the roll of a student, and as the latter is
unique on the student side but multiple on the joint side, we can say that this is a 1 to
N relationship in the direction Students to Joint: a student can appear only once in the
table Students, but multiple times in the table Joint. In a similar way, we have de facto
created a 1 to N relationship between Courses and Joint: the same mnemonic can appear
only once in the table Courses but several times in the table Joint. The result is
represented in Figure 9.
Thanks to this table Joint, a student can be linked to several courses, and a course to
several students, without creating the slightest redundancy in the data. The N-M
relationship to which we aspired has actually been decomposed into two 1-N relationships.
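The decomposition can be sketched with SQLite as follows. The Students and Courses tables are reduced to two columns each. The UNIQUE constraint on the pair (student, course), which prevents registering the same student twice for the same course, is our own addition and is not part of the design described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Students (roll INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Courses (mnemonic TEXT PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE Joint (
        id_joint INTEGER PRIMARY KEY AUTOINCREMENT,           -- surrogate key
        student  INTEGER NOT NULL REFERENCES Students (roll),    -- 1-N with Students
        course   TEXT    NOT NULL REFERENCES Courses (mnemonic), -- 1-N with Courses
        UNIQUE (student, course)  -- our assumption: no double registration
    );
    INSERT INTO Students VALUES (300487, 'Berzok'), (298489, 'Marcel');
    INSERT INTO Courses VALUES
        ('DROIS201', 'Commercial Law'),
        ('ECONS201', 'Macroeconomic theories 1');
    INSERT INTO Joint (student, course) VALUES
        (300487, 'DROIS201'), (300487, 'ECONS201'),
        (298489, 'DROIS201'), (298489, 'ECONS201');
    """
)


def courses_of(roll):
    """Titles of the courses followed by one student."""
    rows = conn.execute(
        """
        SELECT c.title
        FROM Joint AS j JOIN Courses AS c ON c.mnemonic = j.course
        WHERE j.student = ?
        ORDER BY c.mnemonic
        """,
        (roll,),
    )
    return [title for (title,) in rows]


def students_of(mnemonic):
    """Names of the students registered for one course."""
    rows = conn.execute(
        """
        SELECT s.name
        FROM Joint AS j JOIN Students AS s ON s.roll = j.student
        WHERE j.course = ?
        ORDER BY s.roll
        """,
        (mnemonic,),
    )
    return [name for (name,) in rows]


line_counts = tuple(
    conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("Students", "Courses", "Joint")
)
print(courses_of(300487), students_of("DROIS201"), line_counts)
```

Each student and each course is still stored in a single line; only the four registrations are repeated, and they contain nothing but keys.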
Direct N to M relationships definitely have no place in a relational schema, because
there is no way to implement them without creating redundancies. An N to M relationship
is thus a theoretical concept which is realized in practice by means of a joint table and
two 1-N relationships connecting three sets.

Figure 9: An N-M relationship materialized by a joint table

Very often, the joint table actually represents a perfectly identifiable conceptual
entity. This is the case in our example above, because every line of the table Joint
concretely represents a registration (that of a student in a course). This is why we
renamed the joint table "Registration" in the diagram above. In all honesty, we could
have identified from the start that this entity had to be represented somewhere in our
database. It may happen, however, that no meaningful conceptual entity (a tangible
object in the world which the database is supposed to represent) can be associated with
a joint table.
In any case, whether or not a meaningful conceptual entity is associated with a joint
table, the relationship which it materializes commonly needs to be completed by some
specific information, which implies that the joint table has to include a number of
attributes in addition to its primary and foreign keys. In the case of registrations, we
need a column to store the grades obtained by the students (a decimal number between 0
and 20, for example). We could also imagine that it is necessary to know whether the
course is compulsory or optional for a given student.
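A registration table carrying such attributes might be sketched as follows in SQLite. The column names and the grades are hypothetical, and the foreign key declarations are omitted to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Registrations (
        id_registration INTEGER PRIMARY KEY AUTOINCREMENT,
        student    INTEGER NOT NULL,  -- foreign key to Students in the full schema
        course     TEXT    NOT NULL,  -- foreign key to Courses in the full schema
        -- Empty until the exam has taken place, then between 0 and 20.
        grade      REAL CHECK (grade IS NULL OR grade BETWEEN 0 AND 20),
        -- 1 = compulsory for this student, 0 = optional.
        compulsory INTEGER NOT NULL DEFAULT 1 CHECK (compulsory IN (0, 1))
    )
    """
)

# Hypothetical grades, invented for this example.
conn.executemany(
    "INSERT INTO Registrations (student, course, grade, compulsory) VALUES (?, ?, ?, ?)",
    [
        (300487, "DROIS201", 14.5, 1),
        (300487, "ECONS201", 11.0, 1),
        (298489, "DROIS201", None, 0),
    ],
)

average = conn.execute(
    "SELECT AVG(grade) FROM Registrations WHERE student = ?", (300487,)
).fetchone()[0]

try:
    conn.execute(
        "INSERT INTO Registrations (student, course, grade) VALUES (298489, 'ECONS201', 25)"
    )
    impossible_grade_accepted = True
except sqlite3.IntegrityError:
    impossible_grade_accepted = False

print(average, impossible_grade_accepted)
```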
Let us take another example. Consider a database dedicated to the management of a hotel.
We can immediately imagine that it will include a table Customer, a table Room and a
joint table between the two which will represent the reservations (a customer can make
several reservations, and a room can be reserved several times, at different moments in
time obviously). We therefore have a 1 to N relationship between Customer and
Reservation, and another 1 to N relationship between Room and Reservation. In the end,
this set of three related tables materializes an N to M relationship between Customer
and Room: a customer can book several rooms, and a room may be booked by several
customers. In this case, the table Reservation will include not only the identifiers of
the room and of the customer as foreign keys, but also all kinds of information
describing the reservation: date of arrival, number of nights, price confirmed to the
customer, number of people staying in the room, etc.
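The hotel schema could be sketched as follows in SQLite. All the column names and all the data (customers, rooms, dates, prices) are hypothetical and serve only to illustrate the three tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Customer (id_customer INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Room (id_room INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE Reservation (
        id_reservation INTEGER PRIMARY KEY AUTOINCREMENT,
        customer INTEGER NOT NULL REFERENCES Customer (id_customer),
        room     INTEGER NOT NULL REFERENCES Room (id_room),
        arrival  TEXT    NOT NULL,               -- date of arrival (ISO format)
        nights   INTEGER NOT NULL CHECK (nights > 0),
        price    REAL,                           -- price confirmed to the customer
        guests   INTEGER                         -- number of people in the room
    );
    INSERT INTO Customer VALUES (1, 'Customer A'), (2, 'Customer B');
    INSERT INTO Room VALUES (101, 'Room 101'), (102, 'Room 102');
    INSERT INTO Reservation (customer, room, arrival, nights, price, guests) VALUES
        (1, 101, '2014-09-01', 2, 180.0, 2),
        (1, 102, '2014-10-15', 1,  95.0, 1),
        (2, 101, '2014-09-10', 3, 270.0, 2);
    """
)

# One customer, several rooms...
rooms_of_customer_a = [
    room
    for (room,) in conn.execute(
        "SELECT DISTINCT room FROM Reservation WHERE customer = 1 ORDER BY room"
    )
]
# ...and one room, several customers: an N to M relationship through Reservation.
customers_of_room_101 = [
    name
    for (name,) in conn.execute(
        """
        SELECT DISTINCT c.name
        FROM Reservation AS r JOIN Customer AS c ON c.id_customer = r.customer
        WHERE r.room = 101
        ORDER BY c.name
        """
    )
]
print(rooms_of_customer_a, customers_of_room_101)
```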
Joint tables can quickly contain a significant number of lines, simply through
combinatorial effects. For example, with 200 students who each follow the 20 compulsory
courses composing the program, we find ourselves with 4000 lines in the table
Registrations. But you can multiply this figure by one thousand or even by a million: it
won't really be a problem for a good DBMS, which excels at storage and indexing.
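The arithmetic is easy to check by generating all the (student, course) pairs. The identifiers below are placeholders invented for the count.

```python
from itertools import product

students = range(1, 201)                               # 200 placeholder rolls
courses = [f"COURSE{n:02d}" for n in range(1, 21)]     # 20 placeholder mnemonics

# One registration per (student, course) pair.
registrations = list(product(students, courses))
print(len(registrations))
```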

Relational schemas and normalized forms


The relational schema describes the structure of a database by describing its tables (or
entities) and its relationships. It is the standard presentation which allows a fast and
very complete understanding of a database, sufficient in any case to understand how the
data are organized and how to query them. The relational schema of our database can now
be represented in its entirety as in Figure 10.


Figure 10: Entire relational schema of the database

A relational schema is said to be normalized when a number of conditions are met,
guaranteeing the absence of redundancy and a certain ease of management. The first of
these rules requires that no field of any table contain information which can be further
decomposed into several pieces of information. For example, an address can be decomposed
into street, zip code and locality, three elements of information which must be stored in
three different fields (damn! Our diagram above is not yet fully normalized). We say that
the information stored in tables needs to be atomic.
But that is not enough: the data must also not be intrinsically redundant, meaning that
we should not be able to reconstruct one piece of information (for example, the age of a
student) from another piece of information in the database (for example, his date of
birth). Furthermore, all the values stored in tables must be invariable in time, except
for occasional modifications. It implies that between the age and the date of birth, we
shall always prefer to store the date of birth, which is fixed, rather than the age,
which would have to be updated at regular intervals.
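Storing only the date of birth costs nothing, because the age can be recomputed whenever it is needed. A minimal sketch in Python (the function name is ours):

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Age in completed years on the day `today`."""
    years = today.year - date_of_birth.year
    # One year less if this year's birthday has not yet been reached.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


birth = date(1994, 9, 14)  # date of birth of student 300487
day_before_birthday = age_on(birth, date(2014, 9, 13))
on_birthday = age_on(birth, date(2014, 9, 14))
print(day_before_birthday, on_birthday)
```

The stored value never changes, while the computed age is always up to date.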
Finally, every piece of information contained in a field has to depend functionally on
(that is, be determined only by) the primary key of its own table. For example, the name
of a student depends functionally on his roll, because to one roll can correspond only a
single name (remember that the roll is a primary key and thus necessarily unique),
whereas the same name can be matched with several rolls (which will be the case if
siblings or namesakes are enrolled at the same university). When a field depends
functionally on another one (or on a combination of others), the normalization rules
require that this field be stored in the table where the source field (the one on which
it depends) is defined. Following this rule, we enter the name of the student into the
table Students, where its source field (roll) is a primary key, but not into the table
Registrations, where the roll is only a foreign key. This rule therefore simply states
that, except for the foreign keys which are used to materialize relationships, the
information relating to an entity has no place in the fields of another entity.
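Functional dependency can be tested on sample data with a few lines of Python. The function below is our own sketch: it only checks the lines it is given and therefore proves nothing about future data. The student with roll 999999 is a hypothetical namesake.

```python
def depends_on(rows, source, target):
    """True if, in `rows`, each value of `source` goes with a single value of `target`."""
    seen = {}
    for row in rows:
        key, value = row[source], row[target]
        # Remember the first target seen for this source; any other value breaks the rule.
        if seen.setdefault(key, value) != value:
            return False
    return True


students = [
    {"roll": 298489, "name": "Marcel"},
    {"roll": 300487, "name": "Berzok"},
    {"roll": 999999, "name": "Berzok"},  # hypothetical namesake
]

name_depends_on_roll = depends_on(students, "roll", "name")
roll_depends_on_name = depends_on(students, "name", "roll")
print(name_depends_on_roll, roll_depends_on_name)
```

The name is determined by the roll, but the roll is not determined by the name: two students share the name Berzok.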
Together, these few basic rules are sufficient to ensure a minimum degree of
normalization (efficiency and consistency) of the relational schema. Other conditions
exist, which define more advanced normal forms, but they go beyond the objective of this
note, and we shall thus restrict ourselves to the rules above, which constitute the basis
of all the normal forms. The set above defines what is known as the third normal form.
In essence, it is sufficient to note that a relational schema can be considered
normalized from the moment it contains no more redundancy and all its information is
atomic and invariable in time.

The future of database management systems


Relational databases have been remarkably successful in the business world and
constitute the basis of most enterprise systems. They present, however, certain
weaknesses, which new types of solutions try hard to compensate for.
The first limitation of relational databases relates to the difficulty of synchronizing
them with an object-oriented model, whose relationships are richer than the simple 1-1,
1-N and N-M of the relational world. Database systems called "object-oriented" were thus
born to extend the types of relationships to the notions of inheritance and association
typical of the object world. These systems have had limited success, and software
solutions today solve this problem by synchronizing in real time the objects running in
a program with their relational equivalent in the database, as the Python Web framework
Django does, for example.
The second limitation of relational DBMS is more problematic and concerns at the same
time the volumes of data to be managed (firms today manage not just millions but often
billions of rows, such as the activity of Facebook members or the search queries of
Google users) and the heterogeneous structure of the information to be stored. These two
elements create a need for DBMS that allow the records of the same table to be stored on
different drives or systems (thus allowing the storage of big volumes) and that are
structured around associative tables (where only the physical address of every record is
stored). The records are then called "documents" and can have different structures, even
if they belong to the same table. The resulting systems are called "NoSQL databases",
which originally meant "without SQL", but should really be understood as
"non-relational". In any case, SQL is not the primary query mode of these new types of
DBMS, which include Google's BigTable, Amazon's Dynamo, Apache HBase (used notably at
Facebook), and the open source solution MongoDB.
