
WEEK 12

Dr. A. Brennan

NOTE:

Efficient physical DB design produces technical specifications to be used during the DB implementation phase.

For efficient physical DB design, certain info. needs to be gathered:

Normalised relations with estimates of table volume (number of rows in each table)
Attribute (field) definitions and possible max. length
Descriptions of data usage (when and where data are entered, deleted, retrieved, updated etc.)
Response time expectations
Data security needs
Backup/recovery needs
Integrity expectations
What DBMS technology will be used to implement the database
What DB architecture to use

Once this info. is gathered, the designer has to decide on a range of issues:

Suitable storage format (i.e. data types) for each attribute (in order to minimise storage space and maximise data integrity)
Grouping attributes from the logical model into physical records (denormalisation)
File organisation (arranging similarly structured records in secondary memory for the purpose of storage, fast and efficient retrieval and update, protection of data and its recovery after errors are found)
Query optimisation

Physical Design

What is it?

Translate the logical description of data into technical specifications for storing and retrieving data.

Why?

Good performance, database integrity, security and recoverability.

Input and Output for Physical Design

Inputs:

Normalised relations with estimates of table volume (number of rows in each table)
Attribute definitions (and possible maximum length)
Descriptions of data usage (when and where data are entered, retrieved, deleted, updated)
Response time expectations
Data security needs
Backup/recovery needs
Integrity expectations
Description of DBMS technology

Outputs (design decisions):

Suitable storage format (data type) for each attribute in the logical data model, in order to minimise storage space and maximise data integrity
Grouping attributes from the logical model into physical records
File organisation
Selection of indexes and database architectures for storing and connecting files to efficiently retrieve related data
Query optimisation

Data Types (Oracle examples)
CHAR      fixed-length character
VARCHAR2  variable-length character (memo)
LONG      large variable-length character data (up to 2 GB)
NUMBER    positive/negative number
DATE      actual date and time
BLOB      binary large object (good for graphics, sound clips, etc.)
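As a minimal sketch (the table and column names are illustrative, not from the slides), these Oracle data types might be used in a table definition as follows:

CREATE TABLE staff_member (
  staff_id   NUMBER(6)     PRIMARY KEY,   -- positive/negative number
  grade      CHAR(2),                     -- fixed-length character
  full_name  VARCHAR2(50),                -- variable-length character
  hired_on   DATE,                        -- date and time
  cv_text    LONG,                        -- large variable-length character data
  photo      BLOB                         -- binary large object
);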

Goals

Data type

Goals = minimise storage space, represent all possible values, improve data integrity, support all data manipulations

Data integrity controls

Default value, range control (constraints/validation rules), null value control (e.g. a PK cannot be null)
Referential integrity (FK values must match an existing PK value)

Integrity Controls

Default value - an assumed value if no explicit value is entered for an instance of the field (this reduces data entry time and helps prevent entry errors for the most common value)
Range control - imposes allowable value limitations (constraints or validation rules). This may be a numeric lower-to-upper bound, or a set of specific values. This approach should be used with caution, since the limits may change over time
Null value control - allows or prohibits empty fields (e.g. each primary key must have an integrity control that prohibits a null value)
Referential integrity - a form of range control (and null value allowance) for foreign-key to primary-key match-ups. It guarantees that only an existing cross-referencing value is used
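A minimal sketch of how these controls can be declared in SQL (the table and column names, and the referenced CUSTOMER table, are assumptions for illustration):

CREATE TABLE customer_order (
  order_id    NUMBER(8)  PRIMARY KEY,                        -- null value control: PK cannot be null
  order_date  DATE       DEFAULT SYSDATE,                    -- default value
  quantity    NUMBER(4)  CHECK (quantity BETWEEN 1 AND 999), -- range control
  cust_id     NUMBER(6)  NOT NULL
              REFERENCES customer(cust_id)                   -- referential integrity
);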

Physical Records
A physical record is a group of fields that are stored in adjacent secondary memory locations and are retrieved and written together as a unit by a particular DBMS.

Scope:
Efficient use of secondary storage (influenced by both the size of
the record and the structure of the secondary storage)
Data processing speed.
Computer operating systems read data from secondary memory in units called pages.
A page is the amount of data read or written by an operating system in one operation.

Blocking Factor is the number of physical records per page.
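For example (an illustrative calculation, not from the slides): with a 4,096-byte page and 400-byte physical records, the blocking factor is 4096 / 400, i.e. about 10 physical records per page.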

Normalization

Normalization produces a logical database design that is structurally consistent and has minimal redundancy.
Normalization forces us to understand completely each attribute that has to be represented in the database. This may be the most important factor that contributes to the overall success of the system.

What is Denormalization?
Denormalization is a process of transforming normalised relations into unnormalised physical record specifications.

Denormalization can also refer to a process in which we combine two relations into one new relation, and the new relation is still normalized but contains more nulls than the original relations.

Denormalization
In addition, the following factors have to be
considered:
Application specific;
Denormalization may speed up retrievals but it
slows down updates
Size of tables
Coding


Efficient data processing (the second goal of physical record design, after efficient use of storage space) in most cases dominates the design process.
The speed of data processing depends on how close together the related data are.

Benefits and Possible Problems



Benefits:
Can improve performance (speed)
Due to data duplication

Problems:
Wasted storage space
Data integrity/consistency threats

Denormalisation How?
Option one: Combine attributes from several logical relations together into one physical record in order to avoid doing joins (one-to-one, many-to-many, one-to-many)
Option two: Partition a logical relation into several physical records (multiple tables)
Option three: Data replication; or a combination of the two options above

Denormalisation Option 1
1. Two entities with a one-to-one relationship

Mapping

Logical Model: Normalised Relations

Select * from EMPLOYEE, PARKING
WHERE EMPLOYEE.Employee-ID = PARKING.Employee-ID
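A minimal sketch of the denormalised physical record that avoids this join (the PARKING attributes shown are assumptions; only Employee-ID appears in the slides):

CREATE TABLE employee_parking (
  employee_id    NUMBER(6) PRIMARY KEY,
  name           VARCHAR2(50),
  parking_space  VARCHAR2(10),    -- PARKING attributes folded into the EMPLOYEE record
  parking_fee    NUMBER(6,2)
);

SELECT * FROM employee_parking;   -- no join with PARKING is needed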

Denormalisation Option 1
1. Two entities with a one-to-one relationship

Try this!

[ER diagram: EMPLOYEE (EmployeePPS, Name, Address) - Manages - MANAGER (ManagerID, Expertise)]

EMPLOYEE(EmployeePPS, Name, Address, ManagerID)
MANAGER(ManagerID, Expertise)

Select * from EMPLOYEE, MANAGER
WHERE EMPLOYEE.ManagerID = MANAGER.ManagerID

Denormalised record: (EmployeePPS, Name, Address, ManagerID, Expertise)
Denormalisation Option 1
2. Many-to-many relationship (associative entity) with non-key attributes

Denormalisation Option 1
Physical Model: Denormalised Relation

Denormalisation Option 1
3. One-to-many relationship

Logical Model: Normalised Relations Resulting from One-to-Many (1:M) Relationship

Physical Model: Denormalised Relation

Denormalisation Option 2
Option 2: Partitioning of a logical relation into multiple tables
Horizontal partitioning - places different rows of a table into several physical files, based on common column values.
Vertical partitioning - distributes the columns of a table into several separate files, repeating the primary key in each one of them.

Vertical partitioning example:

CUSTOMER (CustID, FirstName, MiddleName, LastName, Address1, Address2, City, County, Country, Phone, CreditLimit, SalesTaxRate, Fax, Email)

is split by column into:

CUSTOMERA (CustID, FirstName, MiddleName, LastName, Address1, Address2, City, County, Country, Phone, Fax, Email)
CUSTOMERB (CustID, CreditLimit, SalesTaxRate)

Horizontal partitioning example:

CUSTOMER (CustID, FirstName, MiddleName, LastName, Address1, Address2, City, County, Country, Phone, CreditLimit, SalesTaxRate, Fax, Email)

is split by row into:

CUSTOMERA-M (same columns) - rows in the A to M range (e.g. by customer name)
CUSTOMERN-Z (same columns) - rows in the N to Z range
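A minimal sketch of how the horizontal partitions above could be populated (assuming the split is on LastName; the slides do not state the partitioning column):

CREATE TABLE customer_a_m AS
  SELECT * FROM customer
  WHERE UPPER(SUBSTR(LastName, 1, 1)) BETWEEN 'A' AND 'M';

CREATE TABLE customer_n_z AS
  SELECT * FROM customer
  WHERE UPPER(SUBSTR(LastName, 1, 1)) BETWEEN 'N' AND 'Z';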

Advantages and disadvantages of partitioning

Advantages:
Efficiency
Local optimisation
Recovery

Disadvantages:
Slow retrieval
Complexity
Extra space and time for updates

Denormalisation Option 3
Option 3: Data replication; or a combination of the other two options

Data replication - the same data is purposely stored in multiple locations of the database.

Data replication improves performance by allowing multiple users to access the same data at the same time with minimum contention.

Denormalisation Disadvantages

The potential for loss of integrity is considerable
Additional time is required to maintain consistency automatically every time a record is inserted, updated, or deleted
Increase in storage space resulting from the duplication

Whose responsibility?

DBMS
Database Designer

File Organisation

1. Sequential File Organisation

2. Indexed File Organisation


3. Hashed File Organisation


Sequential File Organisation

The records are stored in sequence according to a primary key value. To locate a particular record, a program must scan the file from its beginning until the desired record is located.

https://www.youtube.com/watch?v=zDzu6vka0rQ

Indexed File Organisation

The records are stored either sequentially or non-sequentially, and an INDEX is created allowing the application software to locate the individual records.

Indexed files have the capability of creating multiple indexes, e.g. a library where there are indexes on author, title, subject etc.

Therefore indexes are most useful for:

Larger tables
Attributes which are referenced in ORDER BY or GROUP BY clauses
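A minimal sketch (the index, table and column names are illustrative):

CREATE INDEX idx_customer_lastname ON customer (LastName);

SELECT LastName, COUNT(*)
FROM customer
GROUP BY LastName;   -- the optimiser can use the index instead of scanning the whole table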

https://www.youtube.com/watch?v=h2d9b_nEzoA

Hashed File Organisation

The address of each record is determined using a hashing algorithm
A hashing algorithm is a routine that converts a PK value into a record address
A hash index table uses hashing to map a key into a location in an index, where there is a pointer to the data record matching the hash key
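For example (an illustrative hash function, not from the slides): with a simple modulo hash over 100 buckets, a PK value of 10467 maps to bucket 10467 mod 100 = 67, the address at which the record is stored.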

DB Architecture

Note
De-normalisation should only take place after a satisfactory level of normalisation has taken place.

Goal of Physical DB Design

The goal of physical DB design is to create technical specifications from the logical descriptions of data that will provide adequate data storage and performance and will ensure database integrity, security and recoverability.

DATA AND DATABASE ADMINISTRATION

Data within the organisation

Data are a resource to be translated into information
Data is constantly being produced and analysed to create even more data

Database use in the organisation

Top management - strategic decision making, planning and policy
Middle management - tactical decisions and planning
Operational management - supports company operations

[Diagram: the database supports the TPS, MIS and DSS used at these levels]

Management of data - two recognised roles

Data/database administration

Data administration is responsible for:
a planning and analysis function responsible for setting data policy and standards
promoting the company's data as a competitive resource
providing liaison support to systems analysts during application development

Database administration is:
operationally oriented
responsible for day-to-day monitoring and management of the active database
liaison and support during application development

Data administrator

Data coordination - keep track of updates, responsibilities and interchange
Data standards - e.g. naming standards
Liaison with systems analysts and programmers, including design
Training managers, users, developers
Arbitration of disputes and usage authorization
Documentation and internal publicity
Promotion of data's competitive advantage

Database administrator

Responsible for the day-to-day administration of the database
Monitors performance to maximize efficiency
Provides central point for troubleshooting
Monitors security and usage (audit log)
Responsible for operational aspects of data
dictionary
Carries out data and software maintenance
Involved in database design

Database Administrator in DB Design

Define conceptual schema - what data to be held; what entities; what attributes
Define internal schema - decide physical database design
Liaise with users - ensure the data they need is available
Define security needs
Define backup and recovery
Monitor performance - respond to changing requirements

A Summary of DBA Activities

DB activities - planning, organising, testing, monitoring and delivering - applied to these DB services:

end-user support
policies, procedures and standards
data security, privacy and integrity
data backup and recovery
data distribution and use

Tools for Database Administration

Information is kept about all corporate resources, including data
This data about data is termed metadata
The database which holds this metadata is the data dictionary
Two types of data dictionary:
stand-alone or passive
integrated or active

Metadata in Access

Data Dictionary

Passive data dictionary
a self-contained database
all data about entities are entered into the dictionary
requests for metadata information are run as reports and queries as necessary

Active data dictionary

Data dictionary: relationships

Table construction - which attributes appear in which tables
Security - which people have access to which databases or tables
Program data requirements - which programs use which tables or files
Physical residence - which tables or files are on which disks
Impact of change - which programs might be affected by changes to which tables
Responsibility - who is responsible for updating which databases or tables

Introducing a Database: Considerations

Three important aspects:
technological: DBMS software and hardware
managerial: administrative functions
cultural: corporate resistance to change

Social impact of databases

Data collection is extensive - both voluntary and involuntary
Data is a commodity

DATABASE SECURITY

Security - types of threat (external and internal):

Loss or corruption of data due to sabotage
Loss or corruption of data due to error
Disclosure of sensitive data
Fraudulent manipulation of data

Threats to data security

Controlling unauthorised access

Physical access to building


Access to hardware

Monitor any unusual activity

Controlling unauthorised access

Developing user profiles
care over decisions on what data and resources can be accessed (and the type of access) for each end user
user training and education

Firewalls
Encryption
Plugging known security holes - using patches available for known problems

Developing user profiles

Every user is given an identifier for authentication
Users are given privileges to access data dependent on what is essential for their work:
insert
update
delete

Most DBMSs provide an approach called Discretionary Access Control (DAC)
The SQL standard supports DAC through the GRANT and REVOKE commands
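A minimal sketch of DAC with GRANT and REVOKE (the EMPLOYEE table and clerk_role are assumptions for illustration):

GRANT SELECT, INSERT, UPDATE ON employee TO clerk_role;   -- grant only what the job requires
REVOKE UPDATE ON employee FROM clerk_role;                -- withdraw a privilege that is no longer needed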

DAC and MAC

DAC has certain weaknesses in that an unauthorized user can trick an authorized user into disclosing sensitive data
An additional approach is required, called Mandatory Access Control (MAC)
MAC is based on system-wide policies that cannot be changed by individual users:
each database object is assigned a security class
each user is assigned a clearance for a security class
rules are imposed on the reading and writing of database objects by users

The SQL standard does not include support for MAC

Firewalls

A firewall controls network traffic

Encryption
Encryption: encoding or scrambling data to make it unintelligible to those without the key

Controlling loss of DP facilities

Redundancy
Virus protection
Disaster protection
Minimise error
Alert network managers to problems
Minor disruptions require on-going monitoring

Protect against error

Educate all employees
Reminders to save
Should you overwrite existing files?
Incorporate safety nets on deletion
Include integrity checks on data:
range checks
validation
check digits
hash totals
cross checking
batch totals

Software Invasion
Cruise virus
attacks for profit
waits to reach its target
reports successful penetration
delivers payload

Worm
makes copies of itself
exploits the network's weakest link - you
transmits copies to other machines
difficult to access to disable
attacks through the public domain

Trapdoor
simulates regular entry or bypasses normal security procedures

Trojan horse
looks like something else
difficult to detect that it has been run
once launched, too late!

Stealth viruses
encrypt and hide their tracks

Logic bomb
event driven

Protecting against virus attacks

Prepare a company policy on viruses
Educate on the destructive power of viruses
Control the source of software purchasing
Ensure new or upgraded software is installed by the system administrator on a quarantined machine
Control use of bulletin boards
Install anti-virus software where necessary
Make regular back-ups:
back up data and programs separately
store back-up copies off-site once the software is opened
Be aware of software holes in systems software

How security can be compromised

Poor security management


Poor connections to the outside world
Shoddy system control
Human folly
Lack of security ethic

And the answer is:

Education!!

DISTRIBUTED DATABASE MANAGEMENT SYSTEMS

Distributed databases

Distributed database
a logically interrelated collection of shared data (and a description of this data) physically distributed over a computer network

Distributed DBMSs (DDBMS)
the software system that permits the management of the distributed database and makes the distribution transparent to users
must perform all the functions of a centralized DBMS
must handle all necessary functions imposed by the distribution of data and processing

Distributed processing/database

Distributed Processing
Shares data processing chores over
sites using communications network
Database resides at one site only

Distributed Database
Each site has a data fragment
which might be replicated at
other sites
Requires distributed processing

DDBMS

Advantages
Reflects organisational
structure
Faster data access and
processing
Improved communications in
org.
Reduced operating costs
Improved share-ability and
local autonomy
Less danger of single-point
failure
Modular growth easier

Disadvantages
Complexity of management
and control
Security
Integrity control more
difficult
Lack of standard comms.
protocols for dbs
Increased training costs
Database design more
complex

Characteristics of a DDBMS

A collection of logically related shared data
The data is split into a number of fragments
Fragments may be replicated
Fragments/replicas are allocated to sites
Sites linked by a communications network
Data at each site is under control of a DBMS
DBMS at each site can handle local applications
autonomously
Each DBMS participates in at least one global
application

DDBMS features

Application interface to interact with end users or application programs and with other DBMSs
Validation to analyse data requests
Transformation to determine which data requests are distributed and which are local
Query optimization to find the best access strategy
Mapping to determine the location of fragments
I/O interface
Formatting to prepare data for presentation

Distributed database design

Data fragmentation (divide)
need to decide how to split the data into fragments

OR

Data replication (copy)
a copy of a fragment (or all fragments) may be held at several sites

THEN

Data allocation:
need to decide where to locate those fragments and replicas: each fragment is stored at the site with optimal distribution

Data fragmentation

Users work with views, so it is appropriate to work with subsets of data
Cheaper to store data closest to where it is used
May give reduced performance for global applications
Integrity control may be difficult if data and functional dependencies are at different sites
Data fragmentation must be done carefully

Data fragmentation

Breaks a single object into two or more segments or fragments
Each fragment can be stored at any site over a computer network
Information about data fragmentation is stored in the distributed data catalog (DDC), from which it is accessed by the transaction processor

Strategies for fragmentation

For successful fragmentation, must ensure:
completeness: each data item must appear in at least one fragment
reconstruction: should be able to define a relational operation that will reconstruct the relation from the fragments
disjointness: a data item appearing in one fragment should not appear in another

Strategies for data fragmentation

Horizontal fragmentation
division of a relation into subsets (fragments) based on tuples (rows)

Vertical fragmentation
division of a relation into attribute (column) subsets

Mixed fragmentation
a combination of horizontal and vertical fragmentation
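A minimal sketch of the two basic strategies over an assumed CUSTOMER relation (the column names and the region value are illustrative):

-- Horizontal fragment: a subset of the rows, e.g. one region's customers
CREATE TABLE customer_dublin AS
  SELECT * FROM customer WHERE region = 'Dublin';

-- Vertical fragments: subsets of the columns, each repeating the primary key
CREATE TABLE customer_contact AS
  SELECT CustID, Phone, Email FROM customer;
CREATE TABLE customer_credit AS
  SELECT CustID, CreditLimit FROM customer;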

Data replication

Storage of data copies at multiple sites served by a computer network
Fragment copies can be stored at several sites to serve specific information requirements
can enhance data availability and response time
can help to reduce communication and total query costs

Data replication

Fully replicated database:
stores multiple copies of each database fragment at multiple sites
can be impractical due to the amount of overhead

Partially replicated database:
stores multiple copies of some database fragments at multiple sites
most DDBMSs are able to handle the partially replicated database well

Unreplicated database:
stores each database fragment at a single site
no duplicate database fragments
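One way to hold a read-only copy of a fragment at a local site is an Oracle materialized view over a database link; this is a hedged sketch (the hq link name and refresh policy are assumptions), not the only replication mechanism:

CREATE MATERIALIZED VIEW customer_credit_replica
  REFRESH COMPLETE                   -- periodically recopy the fragment from the remote site
  AS SELECT CustID, CreditLimit
     FROM customer@hq;               -- @hq is a database link to the site holding the master copy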

Data allocation

Data allocation is closely related to the way the database is fragmented: it leads to decisions on which data is stored where
Centralized - the entire database is stored at one site
Partitioned/fragmented - the database is divided into several fragments and stored at several sites
Replicated - copies of one or more database fragments (selective replication) are stored at several sites

Strategies for data allocation

BIG DATA, SMALL DATA

Big Data

Big data is the term for data sets so large and complex that it
becomes difficult to process them using on-hand database
management tools or traditional data processing applications
We are collecting more data than ever

We have streamlined our processes through normal channels

electronics enables us to do so (RFID)


storage is cheap
computing has enabled us to improve what we do and
businesses are looking for new ways to have a competitive edge

By looking at patterns in this data we can find out useful things

From a McKinsey report

$600 will buy a disk which can store all of the world's music

Internet of Things
Ubiquitous broadband
Reduction in connectivity costs
RFID enables unique addressability

Increasingly, we are including sensors in everyday objects
These often have communicative capacity and link back to their source through the internet

Use of Big Data

We can gain additional information derivable from analysis of a single large set of related data (rather than a large number of small sets)
Correlations can be found which "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions"

The business case (McKinsey)

1. Big data can unlock significant value by making information transparent and usable at much higher frequency.
2. Organizations can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; others are using data for basic low-frequency forecasting to high-frequency nowcasting to adjust their business levers just in time.

The business case (McKinsey)

3. Big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services.
4. Sophisticated analytics can substantially improve decision-making.
5. Big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).

http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

Big Data in Health

Big data is enabling a new understanding of the molecular biology of cancer. The focus has changed over the last 20 years from the location of the tumor in the body (e.g., breast, colon or blood) to the effect of the individual's genetics, especially the genetics of that individual's cancer cells, on her response to treatment and sensitivity to side effects. For example, researchers have to date identified four distinct cell genotypes of breast cancer; identifying the cancer genotype allows the oncologist to prescribe the most effective available drug first.

http://strata.oreilly.com/2013/08/cancer-and-clinical-trials-the-roleof-big-data-in-personalizing-the-health-experience.html

Big Data in banking

IBM's Watson can do analysis with unstructured data such as that found in e-mails, news reports, books and websites. Citigroup has hired Watson to help it decide what new products and services to offer its customers, and to try to cut down on fraud and look for signs of customers becoming less creditworthy. In most financial institutions the immediate use of big data is in containing fraud and complying with rules on money-laundering and sanctions.
Big credit card companies are getting better at recognising patterns
Solutions are getting cheaper, even for smaller banks
Banks also use the data to sell products (e.g. insurance) by looking at the type of transactions customers make
http://www.economist.com/node/21554743

Some geospatial uses

The Climate Corporation, an insurance company, combines modern Big Data techniques, climatology and agronomics to analyse the weather's complex and multi-layered behaviour to help the world's farmers adapt to climate change.
McLaren's Formula One racing team uses Big Data to identify issues with its racing cars using predictive analytics and takes corrective actions pro-actively. They spend 5% of their budget on telemetry. An F1 car is fitted with about 130 sensors. In addition to the engine sensors, video and GPS are used to work out the best line to take through each bend. The sensor data helps with traffic smoothing, energy-optimising analysis and driver direction determination. E.g. new Pirelli tyres this year meant teams had to watch for tyre wear, grip and temperature under different weather conditions and tracks, relating all that to driver acceleration, braking and steering.

Some geospatial uses

Vestas Wind Systems is implementing a big data solution that is significantly reducing data processing time and helping to predict weather patterns at potential sites faster and more accurately, to increase turbine energy production. They currently store 2.8 petabytes in a wind library covering over 178 parameters, such as temperature, barometric pressure, humidity, precipitation, wind direction and wind velocity from ground level up to 300 feet.
Nokia needed a technology solution to support the collection, storage and analysis of virtually unlimited data types and volumes. They leverage data processing and complex analyses in order to build maps with predictive traffic and layered elevation models, to source information about points of interest around the world, to understand the quality of phones and more. www.geospatialworld.net

More geospatial uses

US Xpress, a transportation solutions provider, collects about a thousand data elements ranging from fuel usage to tyre condition to truck engine operations to GPS information, and uses this for optimal fleet management and to drive productivity, saving millions of dollars in operating costs. When an order is dispatched, it is tracked using an in-cab system installed on a DriverTech tablet with speech recognition capability. US Xpress constantly connects to the devices to monitor the progress of the lorry. The video camera on the device can be used to check if the driver is nodding off. All the data collected is analysed in real time using geospatial data, integrated with driver data and truck telematics. They can minimise delays and ensure trucks are not left waiting when they arrive at a depot for maintenance. www.geospatialworld.net

Big data in the university

Huddersfield University

Purdue University, Indiana
when a student logs into a course website, they see a traffic-light signal (and advice on how to move to green)

University of Derby
linked library data to identify learning styles; now including lecture attendance records, VLE use, sports, car parking

Loughborough University
analyses staff-student interaction

www.theguardian.com/education/2013/aug/05

Role of Cloud Computing

Enables easier gathering, storage and processing of Big Data
Cloud computing provides accessibility any time, any place
Large-scale data gathering is possible from multiple locations
Sharing of data is easier
Large-scale storage
Processing power is also available, with virtual machine provision to analyse data
Can be utilised on an ad-hoc basis

Analysing Big Data

Data mining - a blend of applied statistics and artificial intelligence
Analytics
Machine learning - neural networks, cluster analysis, genetic algorithms, decision trees, support vector machines
Visualisation - interactive rather than static graphs help to understand patterns

Shift of skills to digital analysis and visualisation techniques

Data mining

Who interprets?

A new set of tools makes it easier to do a variety of data analysis tasks. Some require no programming, while other tools make it easier to combine code, visuals, and text in the same workflow. They enable users who aren't statisticians or data geeks to do data analysis. While most of the focus is on enabling the application of analytics to data sets, some tools also help users with the often tricky task of interpreting results. In the process users are able to discern patterns and evaluate the value of data sources by themselves, and only call upon expert data analysts when faced with non-routine problems.
http://strata.oreilly.com/2013/08/data-analysis-tools-target-nonexperts.html

Issues

Problems with algorithms - can magnify misbehaviour (e.g. selection bias)
Privacy and security - anonymity; profiling of individuals
Over-reliance on technology
Need for skilled workers with deep analytics skills

www.internetofthings.eu

House Keeping

Groups and group names
Project distribution
Weighting (65% exam : 35% CA)
35% CA = 28% project, 7% SQL CAs (approx.)

FINI
