
In SQL, what are the differences between primary, foreign, and unique keys?

The one thing that primary, unique, and foreign keys all have in common is that each type of key
can consist of more than one column from a given table. In other words, foreign, primary, and
unique keys are not restricted to a single column: each type of key can cover multiple columns.
This is something that many people in software are not aware of.

Of course, the database programmer is the one who will actually define which columns are covered
by a foreign, primary, or unique key. That is one similarity all those keys share, but there are also
some major differences that exist between primary, unique, and foreign keys. We will go over
those differences in this article. But first, we want to give a thorough explanation of why foreign
keys are necessary in some situations.

What is the point of having a foreign key?


Foreign keys are used to reference unique columns in another table. So, for example, a foreign
key can be defined on one table A, and it can reference some unique column(s) in another table B.
Why would you want a foreign key? You would want one whenever it makes sense to have a
relationship between columns in two different tables.

An example of when a foreign key is necessary


Suppose that we have an Employee table and an Employee Salary table. Also assume that every
employee has a unique ID. The Employee table could be said to have the master list of all
employee IDs in the company. But if we want to store employees' salaries in another table, do
we want to maintain a second, independent master list of employee IDs in the Employee Salary
table as well? No, we don't want to do that, because it's inefficient. It makes a lot more sense to
define a relationship between an Employee ID column in the Employee Salary table and the
master Employee ID column in the Employee table, one where the Employee Salary table
just references the employee ID in the Employee table. This way, whenever someone's employee
ID is updated in the Employee table, it can also automatically get updated in the Employee Salary
table. Sounds good, right? Now nobody has to manually update the employee IDs in the
Employee Salary table every time an ID is updated in the master list inside the Employee table.
And if an employee is removed from the Employee table, he/she can also automatically be
removed (by the RDBMS) from the Employee Salary table. Of course, all of this behavior has to be
defined by the database programmer, but hopefully you get the point.
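
The relationship described above could be declared roughly as follows. This is a minimal sketch rather than a definitive schema: the table and column names (employee, employee_salary, employee_id, name, salary) are assumptions made for illustration, and support for the cascade options varies by RDBMS (Oracle, for example, supports ON DELETE CASCADE but not ON UPDATE CASCADE).

```sql
-- Hypothetical names, chosen to mirror the Employee / Employee Salary example.
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,       -- the master list of employee IDs
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE employee_salary (
    employee_id INTEGER NOT NULL,
    salary      DECIMAL(10, 2) NOT NULL,
    -- The foreign key points at the master list of IDs in employee.
    FOREIGN KEY (employee_id)
        REFERENCES employee (employee_id)
        ON UPDATE CASCADE                  -- ID changes in employee propagate here
        ON DELETE CASCADE                  -- deleting an employee deletes the salary rows
);
```

Without the two CASCADE options, most systems would instead reject an update or delete in employee that leaves salary rows pointing at a missing ID.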

Foreign keys and referential integrity


Foreign keys have a lot to do with the concept of referential integrity. What we discussed in the
previous paragraph are some of the principles behind referential integrity. You can and should read
a more in-depth article on that concept: Referential integrity explained.

Can a table have multiple unique, foreign, and/or primary keys?


A table can have multiple unique and foreign keys. However, a table can have only one primary
key.

Can a unique key have NULL values? Can a primary key have NULL values?

Unique key columns are allowed to hold NULL values. The values in a primary key column,
however, can never be NULL.
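
As an illustration of the last two answers, here is a sketch of a table with one multi-column primary key and two unique keys. The table and all of its column names are hypothetical. How many NULLs a unique key may hold also differs between products (SQL Server, for instance, allows only a single NULL under a UNIQUE constraint by default), so treat the INSERT as an assumption to verify on your own RDBMS.

```sql
-- Hypothetical table: all names here are illustrative only.
CREATE TABLE course_enrollment (
    student_id  INTEGER NOT NULL,
    course_id   INTEGER NOT NULL,
    seat_number INTEGER,              -- nullable: part of a unique key
    badge_code  VARCHAR(20),          -- nullable: a unique key on its own
    -- Exactly one primary key per table; this one spans two columns.
    PRIMARY KEY (student_id, course_id),
    -- A table may declare several unique keys, single- or multi-column.
    UNIQUE (course_id, seat_number),
    UNIQUE (badge_code)
);

-- Accepted: the unique key columns are left NULL.
INSERT INTO course_enrollment (student_id, course_id, seat_number, badge_code)
VALUES (1, 101, NULL, NULL);
```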

Can a foreign key reference a non-primary key?


Yes, a foreign key can actually reference a key that is not the primary key of a table. But, a
foreign key must reference a unique key.

Can a foreign key contain null values?


Yes, a foreign key can hold NULL values, as long as the foreign key column is not declared
NOT NULL. A NULL in a foreign key column simply means that the row does not reference any
row in the other table, so the RDBMS does not check it against the referenced key.
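
The last two answers can be sketched together: a foreign key that references a unique, non-primary key, and that is itself nullable. The department and project tables and their columns are hypothetical names used only for this example.

```sql
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_code CHAR(4) NOT NULL UNIQUE   -- unique, but not the primary key
);

CREATE TABLE project (
    project_id      INTEGER PRIMARY KEY,
    department_code CHAR(4),                  -- nullable foreign key column
    -- References the unique key, not the primary key, of department.
    FOREIGN KEY (department_code) REFERENCES department (department_code)
);

INSERT INTO department (department_id, department_code) VALUES (1, 'ENGR');

-- Accepted: references an existing department_code.
INSERT INTO project (project_id, department_code) VALUES (100, 'ENGR');

-- Accepted: a NULL foreign key references nothing, so nothing is checked.
INSERT INTO project (project_id, department_code) VALUES (101, NULL);
```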

In SQL, what's the difference between the having clause and the where clause?

The difference between the having and where clause is best illustrated by an example.
Suppose we have a table called emp_bonus as shown below. Note that the table has multiple
entries for employees A and B.

emp_bonus

Employee    Bonus
A           1000
B           2000
A           500
C           700
B           1250

If we want to calculate the total bonus that each employee received, then we would write a
SQL statement like this:

select employee, sum(bonus) from emp_bonus group by employee;

The Group By Clause


In the SQL statement above, you can see that we use the "group by" clause with the employee
column. What the group by clause does is allow us to find the sum of the bonuses
for each employee. Using the group by in combination with the sum(bonus) statement will
give us the sum of all the bonuses for employees A, B, and C.

Running the SQL above would return this:

Employee    Sum(Bonus)
A           1500
B           3250
C           700

Now, suppose we wanted to find the employees who received more than $1,000 in total
bonuses. You might think that we could write a query like this:

BAD SQL:
select employee, sum(bonus) from emp_bonus
group by employee where sum(bonus) > 1000;

The WHERE clause does not work with aggregates like SUM

The SQL above will not work, because the where clause doesn't work with aggregates like
sum, avg, max, etc. The where clause filters individual rows before they are grouped, so the
group totals do not exist yet when it is evaluated (it also has to be written before the group by
clause, not after it). Instead, we need to use the having clause. The having clause
was added to SQL just so we could compare aggregates to other values, just as the where
clause can be used with non-aggregates. The correct SQL looks like this:

GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;

Difference between having and where clause

So we can see that the difference between the having and where clause in SQL is that the where
clause cannot be used with aggregates, but the having clause can. One way to think of it is
that the having clause is an additional filter to the where clause.
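
To see the two filters working together, here is a small sketch that rebuilds the emp_bonus table from above and uses both clauses in one query. The column types and the 700 threshold are assumptions picked for the example, and the multi-row INSERT syntax is not accepted by every RDBMS version.

```sql
create table emp_bonus (employee varchar(10), bonus integer);

insert into emp_bonus (employee, bonus) values
    ('A', 1000), ('B', 2000), ('A', 500), ('C', 700), ('B', 1250);

-- where drops individual rows first (A's 500 bonus is removed here),
-- then group by builds the totals, then having filters the totals.
select employee, sum(bonus)
from emp_bonus
where bonus >= 700
group by employee
having sum(bonus) > 1000;

-- Result: only B (3250). A's remaining total is exactly 1000 and C's is 700,
-- so neither passes the having filter.
```

Without the where clause, the same query would return both A (1500) and B (3250).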

In the table below, how would you retrieve the unique values for the employee_location
without using the DISTINCT keyword?

employee

employee_name    employee_location
Joe              New York
Sunil            India
Alex             Russia
Albert           Canada
Jack             New York
Alex             Russia

We can actually accomplish this with the GROUP BY clause. Here's what the SQL would
look like:

SELECT employee_location FROM employee
GROUP BY employee_location;

Running this query will return the following results:

employee_location

New York

India

Russia
Canada

So, you can see that the duplicate values for "New York" and "Russia" are not returned in the
results.

This is a valid alternative to using the DISTINCT keyword. If you need a refresher on the
GROUP BY clause, then check out this question: Group By and Having. This question would
probably be asked just to see how good you are at coming up with alternative options for
SQL queries, although it probably doesn't prove much about your SQL skills.

In SQL, what's the difference between an inner and outer join?

Joins are used to combine the data from two tables, with the result being a new, temporary table.
The temporary table is created based on column(s) that the two tables share, which represent
meaningful column(s) of comparison. The goal is to extract meaningful data from the resulting
temporary table. Joins are performed based on something called a predicate, which specifies the
condition to use in order to perform a join. A join can be either an inner join or an outer join,
depending on how one wants the resulting table to look.

It is best to illustrate the differences between inner and outer joins by use of an example. Here we
have 2 tables that we will use for our example:

Employee

EmpID    EmpName
13       Jason
8        Alex
3        Ram
17       Babu
25       Johnson

Location

EmpID    EmpLoc
13       San Jose
8        Los Angeles
3        Pune, India
17       Chennai, India
39       Bangalore, India

It's important to note that the very last row in the Employee table (employee ID 25) has no
matching row in the Location table. Also, the very last row in the Location table (employee ID 39)
has no matching row in the Employee table. These facts will prove to be significant in the
discussion that follows.

Outer Joins

Let's start the explanation with outer joins. Outer joins can be further divided into left outer
joins, right outer joins, and full outer joins. Here is what the SQL for a left outer join would look
like, using the tables above:
select * from employee left outer join location
on employee.empID = location.empID;

In this SQL we are joining on the condition that the employee IDs match in the rows of the two
tables. So, we will essentially be combining 2 tables into 1, based on the condition that the
employee IDs match. Note that we can get rid of the "outer" in left outer join, which will give us
the SQL below. This is equivalent to what we have above.

select * from employee left join location
on employee.empID = location.empID;

A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. The SQL above will give us the result set shown below.

Employee.EmpID    Employee.EmpName    Location.EmpID    Location.EmpLoc
13                Jason               13                San Jose
8                 Alex                8                 Los Angeles
3                 Ram                 3                 Pune, India
17                Babu                17                Chennai, India
25                Johnson             NULL              NULL

The Join Predicate: a geeky term you should know

Earlier we had mentioned something called a join predicate. In the SQL above, the join
predicate is "on employee.empID = location.empID". This is the heart of any type of join,
because it determines what common column between the 2 tables will be used to "join" the 2
tables. As you can see from the result set, all of the rows from the left table are returned when we
do a left outer join. The last row of the Employee table (which contains the "Johnson" entry) is
displayed in the results even though there is no matching row in the Location table. As you can
see, the non-matching columns in the last row are filled with a "NULL". So, we have "NULL" as the
entry wherever there is no match.

A right outer join is pretty much the same thing as a left outer join, except that the rows that are
retained are from the right table. This is what the SQL looks like:

select * from employee right outer join location
on employee.empID = location.empID;

-- taking out the "outer", this also works:

select * from employee right join location
on employee.empID = location.empID;

Using the tables presented above, we can show what the result set of a right outer join would look
like:

Employee.EmpID    Employee.EmpName    Location.EmpID    Location.EmpLoc
13                Jason               13                San Jose
8                 Alex                8                 Los Angeles
3                 Ram                 3                 Pune, India
17                Babu                17                Chennai, India
NULL              NULL                39                Bangalore, India

We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.

Inner Joins

Now that we've gone over outer joins, we can contrast those with the inner join. The difference
between an inner join and an outer join is that an inner join will return only the rows that actually
match based on the join predicate. Once again, this is best illustrated via an example. Here's what
the SQL for an inner join will look like:

select * from employee inner join location
on employee.empID = location.empID;

This can also be written as:

select * from employee, location
where employee.empID = location.empID;

Now, here is what the result of running that SQL would look like:

Employee.EmpID    Employee.EmpName    Location.EmpID    Location.EmpLoc
13                Jason               13                San Jose
8                 Alex                8                 Los Angeles
3                 Ram                 3                 Pune, India
17                Babu                17                Chennai, India

Inner vs Outer Joins

We can see that an inner join will only return rows in which there is a match based on the join
predicate. In this case, what that means is anytime the Employee and Location table share an
Employee ID, a row will be generated in the results to show the match. Looking at the original
tables, one can see that those Employee IDs that are shared by those tables are displayed in the
results. But, with a left or right outer join, the result set will retain all of the rows from either the
left or right table.
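
Full outer joins were mentioned earlier but not shown. The sketch below rebuilds the two example tables and runs one. The column types are assumptions, and not every RDBMS accepts this syntax (MySQL, for example, has no FULL OUTER JOIN).

```sql
create table employee (empID integer, empName varchar(50));
create table location (empID integer, empLoc varchar(50));

insert into employee (empID, empName) values
    (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');

insert into location (empID, empLoc) values
    (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
    (17, 'Chennai, India'), (39, 'Bangalore, India');

-- A full outer join retains the unmatched rows from BOTH tables.
select * from employee full outer join location
on employee.empID = location.empID;

-- Six rows come back (in no guaranteed order): the four matching rows,
-- plus (25, Johnson, NULL, NULL) and (NULL, NULL, 39, Bangalore, India).
```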

What is the definition of a secondary key?


You may have heard the term secondary key in Oracle, MySQL, SQL Server, or whatever other
DBMS you are dealing with. What exactly is a secondary key? Let's start with a definition, and then
a simple example that will help you understand further.

A given table may have more than just one choice for a primary key. Basically, there may be
more than one column (or combination of columns, for a multi-column primary key) that qualifies
as a primary key. Any column or combination of columns that qualifies to be a primary key is
known as a candidate key, because it is considered a candidate for the primary key. The
candidates that are not selected to be the primary key are known as secondary keys (they are
also commonly called alternate keys).

Example of a Secondary Key in SQL


Let's go through an example of a secondary key. Consider a table called Managers that stores all
of the managers in a company. Each manager has a unique Manager ID Number, a physical
address, and an email address. Let's say that the Manager ID is chosen to be the primary key of
the Managers table. Assuming that no two managers share a physical address or an email address,
either of those could have been selected as the primary key, because they are both unique for
every manager row in the Managers table. But because the email address and physical address
were not selected as the primary key, they are considered to be secondary keys.
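
One common way to record secondary keys in a schema is to declare them with UNIQUE constraints. The sketch below assumes the column names and types shown, which are hypothetical, and it assumes that both addresses really are unique per manager.

```sql
CREATE TABLE managers (
    manager_id       INTEGER PRIMARY KEY,            -- the chosen primary key
    physical_address VARCHAR(255) NOT NULL UNIQUE,   -- secondary key
    email_address    VARCHAR(255) NOT NULL UNIQUE    -- secondary key
);
```

Declaring the columns NOT NULL as well as UNIQUE keeps them usable as full candidates for the primary key, since a primary key can never hold NULL.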

What is a simple key in a dbms?


In a database table, a simple key is just a single attribute (which is just a column) that can
uniquely identify a row. So, any single column in a table that can uniquely identify a row is a
simple key. It's called a simple key because it is composed of just one column (as opposed to
multiple columns), and that's it.

Example of a simple key


Let's go through an example of a simple key. Consider a table called Employees. If every
employee has a unique ID that is stored in a column called EmployeeID, then the EmployeeID
column would be considered a simple key, because it's a single column that can uniquely identify
every row in the table (where each row is a separate employee). Simple, isn't it?

Provide a definition and example of a superkey in SQL.


In SQL, the definition of a superkey is a set of columns in a table for which there are no two rows
that will share the same combination of values. So, the superkey is unique for each and every row
in the table. A superkey can also be just a single column.

Example of a superkey
Suppose we have a table that holds all the managers in a company, and that table is called
Managers. The table has columns called ManagerID, Name, Title, and DepartmentID. Every
manager has his/her own ManagerID, so that value is always unique in each and every row.

This means that if we combine the ManagerID column value for any given row with any other
column value, then we will have a unique set of values. So, for the combinations of (ManagerID,
Name), (ManagerID, Title), (ManagerID, DepartmentID), (ManagerID, Name, DepartmentID), etc.,
there will be no two rows in the table that share the exact same combination of values, because
the ManagerID will always be unique and different for each row. This means that pairing the
ManagerID with any other column(s) will ensure that the combination will also be unique across
all rows in the table.

And that is exactly what defines a superkey: it's any combination of column(s) for which that
combination of values will be unique across all rows in a table. So, all of those combinations of
columns in the Managers table that we gave earlier would be considered to be superkeys. Even the
ManagerID column by itself is considered to be a superkey, although a special type of superkey, as
you can read more about below.
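
A quick way to test whether a combination of columns is unique in the data a table currently holds is to group by those columns and look for groups with more than one row. This is only a sketch: the column types and sample rows are invented for the example, and an empty result shows uniqueness for the current data only, not that the combination is guaranteed to stay unique.

```sql
CREATE TABLE Managers (
    ManagerID    INTEGER PRIMARY KEY,
    Name         VARCHAR(100),
    Title        VARCHAR(100),
    DepartmentID INTEGER
);

INSERT INTO Managers (ManagerID, Name, Title, DepartmentID) VALUES
    (1, 'Ann', 'Director', 10),
    (2, 'Bob', 'Director', 10),
    (3, 'Ann', 'Lead',     20);

-- (ManagerID, Name): returns no rows, so the combination is unique here.
SELECT ManagerID, Name, COUNT(*) AS occurrences
FROM Managers
GROUP BY ManagerID, Name
HAVING COUNT(*) > 1;

-- (Title, DepartmentID): returns ('Director', 10, 2), so this combination
-- is NOT a superkey for this data.
SELECT Title, DepartmentID, COUNT(*) AS occurrences
FROM Managers
GROUP BY Title, DepartmentID
HAVING COUNT(*) > 1;
```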

What is a minimal superkey?


A minimal superkey is a superkey from which no column can be removed without losing
uniqueness. In other words, it is a set of columns that, when combined, gives a unique value for
every row in the table, and every column in the set is needed for that. Remember that we
mentioned earlier that a superkey can be just a single column. So, in our example above,
ManagerID would be a minimal superkey, since it is unique for each and every row in the Managers
table and cannot be reduced any further.

Can a table have multiple minimal superkeys?


Yes, a table can have multiple minimal superkeys. Let's use our example of a Managers table again.
Suppose we add another column for the Social Security Number (which, for our non-American
readers, is a 9-digit identification number issued in the USA) to the Managers table;
let's just call it SSN. Assuming that column has a unique value for every row in the table,
it will also be a minimal superkey, because it's only one column and it is unique for every
row.

Can a minimal superkey have more than one column?


Absolutely. If there is no single column that is unique for every row in a given table, but there is a
combination of columns that produces a unique value for every row in the table, then that
combination of columns would be a minimal superkey. This is of course provided that no column
can be dropped from the combination without losing uniqueness.

Why is it called a superkey?


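It's called a superkey because the name comes from RDBMS theory and its set terminology, as in
superset and subset. Every superkey is a superset of some minimal superkey: it contains all of the
columns of that minimal superkey, plus possibly some extra columns. Because the minimal
superkey already uniquely identifies a row in a table, so does any superset of it.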

Superkey versus candidate key


We discussed minimal superkeys and defined exactly what they are. Candidate keys are actually
minimal superkeys, so candidate key and minimal superkey mean exactly the same thing.

What are some tips on tuning SQL Indexes for better performance?
If you've already read our article on SQL indexes, then you know that indexes are used to make
queries run faster by reducing the time taken to look up data. But the tradeoff for that improved
performance is that indexes also take up space, and there must be some maintenance done on
indexes to ensure that they continue to perform well. Let's go over some tips, guidelines, and
suggestions on how to properly use and maintain indexes to improve performance.

Don't use too many indexes


As you know, indexes can take up a lot of space, and every index on a table also has to be kept
up to date. So, having too many indexes can actually be damaging to your performance. For
example, if you do an UPDATE or an INSERT on a table that has too many indexes, there could
be a big hit on performance because all of the affected indexes have to be updated as well. A
general rule of thumb is to not create more than 3 or 4 indexes on a table.

Try not to include columns that are repeatedly updated in an index


If you create an index on a column that is updated very often, then that means that every time
the column is updated, the index will have to be updated as well. This is done by the DBMS, of
course, so that the index stays current and consistent with the columns that belong to that index.
So, the number of writes is increased two-fold: one write to update the column itself and another
to update the index. You might want to consider avoiding the inclusion of columns that
are frequently updated in your index.

Creating indexes on foreign key column(s) can improve performance

Because joins are often done between primary and foreign key pairs, having an index on a foreign
key column can really improve the join performance. Not only that, but the index allows some
optimizers to use other methods of joining tables as well.
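
As a sketch of this tip, the statements below index the foreign key column of a child table. The table, column, and index names are hypothetical. Some systems (MySQL's InnoDB, for example) create an index on foreign key columns automatically, so check whether one already exists before adding your own.

```sql
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE employee_salary (
    employee_id INTEGER NOT NULL,
    salary      DECIMAL(10, 2) NOT NULL,
    FOREIGN KEY (employee_id) REFERENCES employee (employee_id)
);

-- Index the foreign key column that joins back to employee.
CREATE INDEX idx_employee_salary_employee_id
    ON employee_salary (employee_id);

-- A typical primary key / foreign key join that can benefit from the index.
SELECT e.name, s.salary
FROM employee e
JOIN employee_salary s ON s.employee_id = e.employee_id;
```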

Create indexes for columns that are repeatedly used in predicates of your SQL queries

Take a look at your queries and see which columns are used frequently in the WHERE predicate. If
those columns are not part of an index already, then you should add them to an index. This is of
course because an index on columns that are repeatedly used in predicates will help speed up your
queries.

Get rid of overlapping indexes


Overlapping indexes are indexes that share the same leading column (the first column), and that
is why they are called overlapping indexes. Remember that indexes can have multiple columns.
Almost all RDBMSs can use an index for a query even if that query's WHERE predicate uses
only the first column of that index and doesn't use any of the other columns in that index. In
other words, most RDBMSs can use indexes for queries even when there is just a partial match of
the query columns to the index columns. For this reason, overlapping indexes are usually not
necessary, but be sure to research this for your own particular RDBMS.
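
Here is a sketch of what an overlapping pair looks like and how it might be cleaned up. The staff table and the index names are made up for illustration, and the DROP INDEX syntax shown is an assumption that holds for PostgreSQL, Oracle, and SQLite (MySQL and SQL Server also require the table name).

```sql
CREATE TABLE staff (
    staff_id   INTEGER PRIMARY KEY,
    last_name  VARCHAR(50),
    first_name VARCHAR(50)
);

-- These two indexes overlap: both lead with last_name.
CREATE INDEX idx_staff_last       ON staff (last_name);
CREATE INDEX idx_staff_last_first ON staff (last_name, first_name);

-- This query filters on the leading column only, so most RDBMSs can
-- serve it from the two-column index as well.
SELECT staff_id FROM staff WHERE last_name = 'Smith';

-- The single-column index is therefore a candidate for removal.
DROP INDEX idx_staff_last;
```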

Consider deleting an index when loading huge amounts of data into a table

If you are loading a huge amount of data into a table, then you might want to think about deleting
some of the indexes on the table. Then, after the data is loaded into the table, you can recreate
the indexes. The reason you would want to do this is that the index will not have to be
updated during the load, which could save you a lot of time!
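
The drop, load, recreate sequence might look like the sketch below. Everything named here (the sales and sales_staging tables, their columns, and the index) is hypothetical, the INSERT ... SELECT stands in for whatever bulk-load tool your RDBMS provides, and the DROP INDEX syntax varies by vendor.

```sql
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount      DECIMAL(10, 2) NOT NULL
);
CREATE INDEX idx_sales_customer_id ON sales (customer_id);

-- Staging table assumed to already hold the rows to be loaded.
CREATE TABLE sales_staging (
    sale_id     INTEGER,
    customer_id INTEGER,
    amount      DECIMAL(10, 2)
);

-- 1. Drop the index so it is not maintained row by row during the load.
DROP INDEX idx_sales_customer_id;

-- 2. Load the data.
INSERT INTO sales (sale_id, customer_id, amount)
SELECT sale_id, customer_id, amount FROM sales_staging;

-- 3. Rebuild the index once, over the fully loaded table.
CREATE INDEX idx_sales_customer_id ON sales (customer_id);
```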

Ensure that the indexes you create have high selectivity


For reasons described in more detail in the discussion of index selectivity, you should create
indexes with a high selectivity. A general rule of thumb is that indexes should have a selectivity
that's higher than .33 (33%). Remember that indexes that are completely unique have a selectivity
of 1.0 (100%), which is the highest possible selectivity value.

What is cardinality in SQL?


The term cardinality actually has two different meanings depending on the context of its usage:
one meaning is in the context of data modeling and the other is in the context of SQL
statements. Let's go through the first meaning. When the word cardinality is used in the context
of data modeling, it simply refers to the relationship that one table can have with another table.
These relationships include many-to-many, many-to-one/one-to-many, and one-to-one.
Whichever one of these characteristics a table has in relationship with another table is said to be
the cardinality of that table. An example will help clarify further.

An example of cardinality in data modeling

Suppose we have three tables that are used by a company to store employee information: an
Employee table, an Employee_Salary table, and a Department table. The Department table will
have a one to many relationship with the Employee table, because every employee can belong to
only one department, but a department can consist of many employees. In other words, the
cardinality of the Department table in relationship to the employee table is one to many. The
cardinality of the Employee table in relationship to the Employee_Salary table will be one to one,
since an employee can only have one salary, and vice versa (yes, two employees can have the
same salary, but there will still be exactly one salary entry for each employee regardless of
whether or not someone else has the same salary).

Example of Cardinality in SQL


The other definition of cardinality is probably the more commonly used version of the term. In
SQL, the cardinality of a column in a given table refers to the number of unique values that appear
in the table for that column. So, remember that the cardinality is a number. For example, let's
say we have a table with a Sex column which has only two possible values, "Male" and
"Female". Then that Sex column would have a cardinality of 2, because there are only two
unique values that could possibly appear in that column: "Male" and "Female".

Cardinality of a primary key


Or, as another example, let's say that we have a primary key column on a table with 10,000 rows.
What do you think the cardinality of that column would be? Well, it is 10,000. Because it is a
primary key column, we know that all of the values in the column must be unique. And since there
are 10,000 rows, we know that there are 10,000 entries in the column, which translates to a
cardinality of 10,000 for that column. So, we can come up with the rule that the cardinality of a
primary key column will always be equal to the number of records in the same table.

What does a cardinality of zero mean?


Well, if a column has a cardinality of zero, it means that the column has no unique values. This
could potentially happen if the column holds nothing but NULLs, which means that the column was
never really used anyway.
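
The cardinality of a column can be measured with COUNT(DISTINCT ...), which ignores NULLs. The sketch below uses a hypothetical people table with made-up rows to show the three cases discussed above.

```sql
CREATE TABLE people (
    person_id INTEGER PRIMARY KEY,
    name      VARCHAR(100),
    sex       VARCHAR(6),
    nickname  VARCHAR(50)        -- never filled in: every value is NULL
);

INSERT INTO people (person_id, name, sex, nickname) VALUES
    (1, 'Joe',   'Male',   NULL),
    (2, 'Maria', 'Female', NULL),
    (3, 'Priya', 'Female', NULL),
    (4, 'Alex',  'Male',   NULL);

SELECT COUNT(DISTINCT sex)       AS sex_cardinality,       -- 2
       COUNT(DISTINCT person_id) AS id_cardinality,        -- 4, same as row_count
       COUNT(DISTINCT nickname)  AS nickname_cardinality,  -- 0, all NULLs
       COUNT(*)                  AS row_count              -- 4
FROM people;
```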

What is the difference between cardinality and selectivity?


In SQL, cardinality refers to the number of unique values in a particular column. So, cardinality is a
numeric value that applies to a specific column inside a table. An example will help clarify.

Example of cardinality

Suppose we have a table called People which has a Sex column that has only two possible values,
"Male" and "Female". Then that Sex column would have a cardinality of 2, because there are
only two unique values that could possibly appear in that column: "Male" and "Female".

Difference between cardinality and selectivity

In SQL, the term selectivity is used when discussing database indexes. The selectivity of a
database index is a numeric value that is calculated using a specific formula. That formula
uses the cardinality value, so the terms selectivity and cardinality are very much related to each
other. Here is the formula used to calculate selectivity:

Formula to calculate selectivity:

Selectivity of index = cardinality/(number of rows) * 100%

This means that if the People table we were talking about earlier has 10,000 rows, then the
selectivity of an index on the Sex column would be 2/10,000 * 100%, which equals 0.02%. This is
considered to be a very low selectivity.
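
The formula translates directly into a query. This sketch assumes a hypothetical people table with a sex column and only four sample rows, and it assumes the table is not empty (otherwise the division has no meaningful result).

```sql
CREATE TABLE people (
    person_id INTEGER PRIMARY KEY,
    sex       VARCHAR(6)
);

INSERT INTO people (person_id, sex) VALUES
    (1, 'Male'), (2, 'Female'), (3, 'Female'), (4, 'Male');

-- selectivity = cardinality / (number of rows) * 100%
-- Multiplying by 100.0 first avoids integer division.
SELECT COUNT(DISTINCT sex)                    AS cardinality,
       COUNT(*)                               AS row_count,
       COUNT(DISTINCT sex) * 100.0 / COUNT(*) AS selectivity_pct
FROM people;
```

With these four sample rows the query reports a selectivity of 50%. On the 10,000-row table from the text, the same query would report 0.02%.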

Why are the selectivity and cardinality used in databases?


The selectivity basically is a measure of how much variety there is in the values of a given table
column in relation to the total number of rows in a given table. The cardinality is just part of the
formula that is used to calculate the selectivity. Query optimizers use the selectivity to figure out if
it is actually worth using an index to find certain rows in a table. A general principle that is
followed is that it is best to use an index when the number of rows that need to be selected is
small in relation to the total number of rows. That is what the selectivity helps measure. You can
also read more about selectivity and cardinality in these links: cardinality and selectivity.

What is selectivity in SQL? How is selectivity calculated and how does it relate to a database index?

The terms selectivity and cardinality are closely related. In fact, the formula used to
calculate selectivity uses the cardinality value. The term selectivity is used when talking
about database indexes. This is the formula to use to calculate the selectivity of an index
(don't worry, we explain what it all means below):

How to calculate the selectivity of an index:


Selectivity of index = cardinality/(number of records) * 100%

Note that the number of records is equivalent to the number of rows in the table.

What does selectivity mean?

So, you see the formula and you are thinking "that's great, but what does this actually
mean?" Well, let's say we have a table with a Sex column which has only two possible
values, "Male" and "Female". Then that Sex column would have a cardinality of 2,
because there are only two unique values that could possibly appear in that column: "Male"
and "Female". If there are 10,000 rows in the table, then this means that the selectivity of an
index on that particular column will be 2/10,000 * 100%, which is 0.02%.

The key with the selectivity value is that it basically measures how selective the values
within a given column are, in other words how many different values are available in the
given sample set. A selectivity of 0.02% is considered to be really low, and means that given
the number of rows, there is a very small amount of variation in the actual values for that
column. In our example Sex column, there are only two distinct values across all 10,000 rows.

Why does the database actually care about the selectivity, and how does it use it? Well, let's
consider what a low selectivity means. A low selectivity basically means there is not a lot of
variation in the values in a column, that is, there are not a lot of possibilities for the values of a
column. Suppose, using the example table that we discussed earlier, that we want to find the
names of all the females in the table.

How does selectivity affect usage of a database index


Database query optimizers have to make a decision about whether it would actually make
sense to either use the index to find certain rows in a table or to not use the index. This is
because there are times when using the index is actually less efficient than just directly
scanning the table itself. This is something that you should remember: even if a column has
an index created for it, that does not mean the index will always be used, because scanning
the table directly without going through the index first could be a better, more efficient,
option.

When is it better to not use a database index?


So, when exactly is it better to not use a database index? Well, when there is a low selectivity
value! Why does a low selectivity mean that using the index is not a good idea? Well, think
about it. Let's say we want to run a query that will find the names of all the females in the
table (we are of course assuming that there is another column for Name in addition to the
Sex column). If we are searching for all the female rows in a table with 10,000 rows, then
there is a good chance that roughly 50% of the rows are females, because there really are just
two possible values: male and female. Assuming that 50% of the rows are indeed females,
this means that we would have to access the index 5,000 times to find all the female rows.
Accessing the index takes time and consumes resources. If we are accessing the index 5,000
times, it is actually faster to just directly access the table and do a full table scan. So, you can
see how the selectivity value is used by the query optimizer to determine whether it is
more efficient to use an index or to just read the table directly.

What selectivity value determines if an index will be used or not?

It's really hard to say, since that exact value varies from one database to another.
Of course, a high selectivity value means that the index is very likely to be worth using. For
example, if we are dealing with a column that has a selectivity of 100%, then all the values in
that column are unique. This means that if a query is searching for just one of those values,
then it makes much more sense to use the index, because it will be far more efficient than
risking a full table scan, which is the worst case scenario if the table is searched directly
without consulting the index first.

What is a database lock in the context of SQL? Provide an example and explanation.

A database lock is used to lock some data in a database so that only one database user/session
may update that particular data. So, database locks exist to prevent two or more database users
from updating the same exact piece of data at the same exact time. When data is locked,
that means that another database session can NOT update that data until the lock is released
(which unlocks the data and allows other database users to update that data). Locks are usually
released by either a ROLLBACK or COMMIT SQL statement.

What happens when another session tries to update the locked data?

Suppose database session A tries to update some data that is already locked by database session
B. What happens to session A? Well, session A will actually be placed in what's called a lock
wait state, and session A will be stopped from making further progress with any SQL transaction
that it's performing. Another way of saying this is that session A will be stalled until session B
releases the lock on that data.

If a session ends up waiting too long for some locked data, then some databases, like DB2 from
IBM, will actually time out after a certain amount of time and return an error instead of waiting
and then updating the data as requested. But some databases, like Oracle, may handle the
situation differently: Oracle can actually leave a session in a lock wait state for an indefinite
amount of time. So, there are a lot of differences between database vendors in terms of how they
choose to deal with locks and other sessions waiting for locks to be released.
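The lock wait and timeout behavior described above can be reproduced on a small scale. The following is a minimal sketch that assumes SQLite (through Python's built-in sqlite3 module) rather than DB2 or Oracle. SQLite locks the whole database file for writes, so the granularity differs from the vendors mentioned above, but the sequence (B holds the lock, A waits and times out, B commits, A succeeds) is the same idea. The account table and the session names are made up for the example.

```python
import os
import sqlite3
import tempfile

# A file-backed database, so that two separate connections (sessions) share it.
db_path = os.path.join(tempfile.mkdtemp(), "locks_demo.db")

# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
# timeout is how long (in seconds) a session waits for a lock before giving up.
session_a = sqlite3.connect(db_path, timeout=0.2, isolation_level=None)
session_b = sqlite3.connect(db_path, timeout=0.2, isolation_level=None)

session_b.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
session_b.execute("INSERT INTO account VALUES (1, 100)")

# Session B takes the write lock and holds it by not committing yet.
session_b.execute("BEGIN IMMEDIATE")
session_b.execute("UPDATE account SET balance = balance - 10 WHERE id = 1")

# Session A now has to wait for the lock, and gives up once the timeout expires.
try:
    session_a.execute("UPDATE account SET balance = balance + 5 WHERE id = 1")
    a_was_blocked = False
except sqlite3.OperationalError as exc:
    a_was_blocked = True
    print("session A gave up waiting:", exc)

# COMMIT releases the lock, so session A can retry successfully.
session_b.execute("COMMIT")
session_a.execute("UPDATE account SET balance = balance + 5 WHERE id = 1")
final_balance = session_a.execute(
    "SELECT balance FROM account WHERE id = 1"
).fetchone()[0]
print(final_balance)

session_a.close()
session_b.close()
```

A larger timeout makes session A wait longer before returning an error, which is roughly the knob that separates the "time out with an error" behavior from the "wait indefinitely" behavior.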

Database locking techniques


Database locks can actually be placed at different levels (also known as lock granularity) within
the database.

Here is a list of the usual lock levels and types supported, and more information on what each
technique means:

Database level locking


With database level locks, the entire database is locked, which means that only one database
session can apply any updates to the database. This type of lock is not often used, because it
obviously prevents all users except one from updating anything in the database. But, this lock can
be useful when some major support update is necessary, like upgrading the database to a new
version of the software. Oracle actually has an exclusive mode, which is used to allow just one
user session to use the database; this is basically a database lock.

File level locking


With a file level lock, an entire database file is locked. What exactly is a file in a database?
Well, a file can have a wide variety of data inside it: there could be an entire table, a part of
a table, or even parts of different tables. Because of the variety of data stored inside a file,
this type of lock level is less favored.

Table level locking


A table level lock is pretty straightforward: it means that an entire table is locked as a whole.
This lock level comes in handy when making a change that affects an entire table, like updating all
the rows in a table, or modifying the table to add or remove columns. In Oracle, this is known as
a DDL lock, because it's used with DDL (Data Definition Language) statements like CREATE,
ALTER, and DROP: basically statements that modify the entire table in some way.

Page or block level locking


Block, or page, level locking occurs when a block or page that is part of a database file is locked.
If you are not already familiar with pages and blocks, you can read more about them
here: Pages versus blocks.

Because the data that can be stored in blocks/pages can be wide and varied, page/block locking is
less favored in databases today.

Column level locking


A column level lock just means that some columns within a given row in a given table are locked.
This form of locking is not commonly used because it requires a lot of resources to enable and
release locks at this level. Also, there is very little support for column level locking among
database vendors.

Row level locking


A row level lock applies to a row in a table. This is also the most commonly used locking level,
and practically all major database vendors support row level locks.

Are locks automatically used by databases?


When data is either deleted or updated, locks are always used, even if a database user doesn't
write his/her SQL to explicitly say that a lock must be used. Many of the RDBMSs out there today
also support the FOR UPDATE OF clause combined with a normal SELECT statement.
The FOR UPDATE OF clause basically says that the database user intends to update some data,
although the database user is not required to actually make changes to that data. And, because
the intent to update data is declared, a lock will be placed on that data as well.

Example of database locking

As a simple example of when locking would be used by a database, suppose we have the following
SQL:

UPDATE some_table SET some_field = 'some_value'
WHERE some_column = 'XYZ';

The SQL statement above will lock the row or rows which have a value of XYZ for the column
named some_column. The locking of the row(s) happens behind the scenes as part of the
RDBMS software, and it prevents other database user sessions from updating the same row(s) at
the same exact time as well.

Can data be read when a lock is in place?


It depends on the lock, since some locks are read-exclusive, which means that other sessions in
the database cannot even read the locked data.

What is the point of database locking?


If it's not already clear to you, the reason we have database locks is to prevent the potential
loss of data that could happen if updates are applied concurrently, or at the same exact time. If
two different database users are allowed to update the same data at the same exact time, then the
results could be confusing and potentially disastrous. But if that same data were locked, then that
issue would not arise, since only one user could update the locked data at a time.

What is lock contention?


One problem that occurs with having locks is that locks can cause what's known as contention,
which means that because there are locks on the data, sessions that exist at the same time
(concurrent sessions) are essentially competing for the right to apply updates on the same data,
because that data may be locked by any given session. In the best case, lock contention means
that some user processes run slower because a session is waiting for a lock. In the worst case,
having sessions compete for locks can make sessions stall for an indefinite period of time.

When sessions do stall for an indefinite period of time, that is known as deadlock, which you can
read more about here: Database deadlock.

Lock Escalation
You should also read about the concept of Lock Escalation, which is a built-in feature of many
RDBMSs today.

Some other differences between foreign, primary, and unique keys

While unique and primary keys both enforce uniqueness on the column(s) of one table, foreign
keys define a relationship between two tables. A foreign key identifies a column or group of
columns in one (referencing) table that refers to a column or group of columns in another
(referenced) table. In our example above, the Employee table is the referenced table and the
Employee Salary table is the referencing table.

As we stated earlier, both unique and primary keys can be referenced by foreign keys.

What's the difference between data mining and data warehousing?

Data mining is the process of finding patterns in a given data set. These patterns can often provide
meaningful and insightful data to whoever is interested in that data. Data mining is used today in
a wide variety of contexts: in fraud detection, as an aid in marketing campaigns, and even
supermarkets use it to study their consumers.

Data warehousing can be said to be the process of centralizing or aggregating data from multiple
sources into one common repository.

Example of data mining


If you've ever used a credit card, then you may know that credit card companies will alert you
when they think that your credit card is being fraudulently used by someone other than you. This
is a perfect example of data mining: credit card companies have a history of your purchases from
the past and know geographically where those purchases have been made. If all of a sudden some
purchases are made in a city far from where you live, the credit card companies are put on alert to
a possible fraud, since their data mining shows that you don't normally make purchases in that
city. Then, the credit card company can disable your card for that transaction or just put a flag on
your card for suspicious activity.

Another interesting example of data mining is how one grocery store in the USA used the data it
collected on its shoppers to find patterns in their shopping habits. They found that when men
bought diapers on Thursdays and Saturdays, they also had a strong tendency to buy beer. The
grocery store could have used this valuable information to increase their profits. One thing they
could have done, odd as it sounds, is move the beer display closer to the diapers. Or, they could
have simply made sure not to give any discounts on beer on Thursdays and Saturdays. This is data
mining in action: extracting meaningful data from a huge data set.

Example of data warehousing: Facebook

A great example of data warehousing that everyone can relate to is what Facebook does. Facebook
basically gathers all of your data (your friends, your likes, who you stalk, etc.) and then stores
that data in one central repository. Even though Facebook most likely stores your friends, your
likes, etc., in separate databases, they do want to take the most relevant and important
information and put it into one central aggregated database. Why would they want to do this? For
many reasons: they want to make sure that you see the most relevant ads that you're most likely
to click on, they want to make sure that the friends that they suggest are the most relevant to
you, etc. Keep in mind that this is the data mining phase, in which meaningful data and patterns
are extracted from the aggregated data. But, underlying all these motives is the main motive: to
make more money. After all, Facebook is a business.

We can say that data warehousing is basically a process in which data from multiple
sources/databases is combined into one comprehensive and easily accessible database. Then this
data is readily available to any business professionals, managers, etc. who need to use the data to
create forecasts and who basically use the data for data mining.

Data warehousing vs. data mining

Remember that data warehousing is a process that must occur before any data mining can take
place. In other words, data warehousing is the process of compiling and organizing data into one
common database, and data mining is the process of extracting meaningful data from that
database. The data mining process relies on the data compiled in the data warehousing phase in
order to detect meaningful patterns.

In the Facebook example that we gave, the data mining will typically be done by business users
who are not engineers, but who will most likely receive assistance from engineers when they are
trying to manipulate their data. The data warehousing phase is a strictly engineering phase, where
no business users are involved. And this gives us another way of defining the 2 terms: data mining
is typically done by business users with the assistance of engineers, and data warehousing is
typically a process done exclusively by engineers.

What is a self join? Explain it with an example and tutorial.


Let's illustrate the need for a self join with an example. Suppose we have the following table that
is called employee. The employee table has 2 columns: one for the employee name (called
employee_name), and one for the employee location (called employee_location):

employee

employee_name    employee_location
Joe              New York
Sunil            India
Alex             Russia
Albert           Canada
Jack             New York

Now, suppose we want to find out which employees are from the same location as the employee
named Joe. In this example, that location would be New York. Let's assume for the sake of our
example that we cannot just directly search the table for people who live in New York (maybe
because we don't want to hardcode the city name in the SQL query) with a simple query like this:

SELECT employee_name
FROM employee
WHERE employee_location = 'New York'

So, instead of a query like that, what we could do is write a nested SQL query (basically a query
within another query, which is more commonly called a subquery) like this:

SELECT employee_name
FROM employee
WHERE employee_location IN
( SELECT employee_location
FROM employee
WHERE employee_name = 'Joe' )

Using a subquery for such a simple question can be inefficient. Is there a more efficient and
elegant solution to this problem?

It turns out that there is a more efficient solution: we can use something called a self join. A
self join is basically when a table is joined to itself. The way you should visualize a self join for
a given table is by imagining that a join is performed between two identical copies of that table.
And that is exactly why it is called a self join: it's just the same table being joined to another
copy of itself, rather than being joined with a different table.

How does a self join work?


Before we come up with a solution for this problem using a self join, we should go over some
concepts so that you can fully understand how a self join works. This will also make the SQL in our
self join tutorial a lot easier to understand, which you will see further below.

A self join must have aliases


In a self join we are joining the same table to itself by essentially creating two copies of that
table. But, how do we distinguish between the two different copies of the table, given that there
is only one table name after all? Well, when we do a self join, the table names absolutely must use
aliases, otherwise the column names would be ambiguous. In other words, we would not know which
table's columns are being referenced without using aliases for the two copies of the table. If you
don't already know what an alias is, it's simply another name given to a table, and that name is
then used in the SQL query to reference the table. So, we will just use the aliases e1 and e2 for
the employee table when we do a self join.

Self join predicate


As with any join, there must be a condition upon which a self join is performed: we cannot just
arbitrarily say "do a self join" without specifying some condition. That condition will be our join
predicate. If you need a refresher on join predicates (or just joins in general) then check this link
out: Inner vs. Outer joins.

Now, let's come up with a solution to the original problem using a self join instead of a subquery.
This will help illustrate how exactly a self join works. The key question that we must ask ourselves
is: what should our join predicate be in this example? Well, we want to find all the employees who
have the same location as Joe.
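One possible way to write that self join is sketched below. It uses the aliases e1 and e2 mentioned above and a join predicate on employee_location. Python's built-in sqlite3 module is used here only so that the SQL can actually be run, and the ORDER BY is added just to make the output order predictable. Treat it as an illustration rather than the only correct query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_name TEXT, employee_location TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Joe", "New York"), ("Sunil", "India"), ("Alex", "Russia"),
     ("Albert", "Canada"), ("Jack", "New York")],
)

# e1 and e2 are aliases for two logical copies of the same table.
# The join predicate pairs up rows that share a location, and the
# WHERE clause pins the e2 copy to Joe's row.
self_join_sql = """
    SELECT e1.employee_name
    FROM employee e1
    JOIN employee e2
      ON e1.employee_location = e2.employee_location
    WHERE e2.employee_name = 'Joe'
    ORDER BY e1.employee_name
"""
same_location = [row[0] for row in conn.execute(self_join_sql)]
print(same_location)
```

Note that Joe himself appears in the result, since his row matches his own location. Adding a condition such as e1.employee_name <> 'Joe' would leave only his co-located colleagues.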

What's referential integrity?

Referential integrity is a relational database concept in which multiple tables share a relationship
based on the data stored in the tables, and that relationship must remain consistent.

The concept of referential integrity, and one way in which it's enforced, is best illustrated by an
example. Suppose company X has 2 tables, an Employee table, and an Employee Salary table. In
the Employee table we have 2 columns: the employee ID and the employee name. In the
Employee Salary table, we have 2 columns: the employee ID and the salary for the given ID.

Now, suppose we wanted to remove an employee because he no longer works at company X.
Then, we would remove his entry in the Employee table. Because he also exists in the Employee
Salary table, we would have to manually remove him from there as well. Manually removing the
employee from the Employee Salary table can become quite a pain. And if there are other tables in
which Company X uses that employee, then he would have to be deleted from those tables as well,
which is an even bigger pain.

By enforcing referential integrity, we can solve that problem, so that we wouldn't have to manually
delete him from the Employee Salary table (or any others). Here's how: first we would define the
employee ID column in the Employee table to be our primary key. Then, we would define the
employee ID column in the Employee Salary table to be a foreign key that points to the primary
key, that is, the employee ID column in the Employee table. Once we define our foreign to primary
key relationship, we would need to add what's called a constraint to the Employee Salary table. The
constraint that we would add in particular is called a cascading delete. This would mean that any
time an employee is removed from the Employee table, any entries that employee has in the
Employee Salary table would also automatically be removed from the Employee Salary table.

Note in the example given above that referential integrity is something that must be enforced, and
that we enforced only one rule of referential integrity (the cascading delete). There are actually 3
rules that referential integrity enforces:

1. We may not add a record to the Employee Salary table unless the foreign key for that record
points to an existing employee in the Employee table.

2. If a record in the Employee table is deleted, all corresponding records in the Employee Salary
table must be deleted using a cascading delete. This was the example we had given earlier.

3. If the primary key for a record in the Employee table changes, all corresponding records in the
Employee Salary table must be modified using what's called a cascading update.
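The three rules above can be tried out with a small sketch. It assumes SQLite, run through Python's built-in sqlite3 module, and the table and column names are simply adapted from the Employee and Employee Salary example. One detail is specific to SQLite: it only enforces foreign keys once PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, employee_name TEXT)"
)
conn.execute("""
    CREATE TABLE employee_salary (
        employee_id INTEGER REFERENCES employee (employee_id)
            ON DELETE CASCADE
            ON UPDATE CASCADE,
        salary INTEGER
    )
""")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Joe"), (2, "Jack")])
conn.executemany("INSERT INTO employee_salary VALUES (?, ?)", [(1, 50000), (2, 60000)])

# Rule 1: a salary row that points to a non-existent employee is rejected.
try:
    conn.execute("INSERT INTO employee_salary VALUES (99, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Rule 2: deleting an employee also removes that employee's salary rows.
conn.execute("DELETE FROM employee WHERE employee_id = 1")

# Rule 3: changing a primary key value is propagated to the salary rows.
conn.execute("UPDATE employee SET employee_id = 20 WHERE employee_id = 2")

remaining = conn.execute("SELECT employee_id, salary FROM employee_salary").fetchall()
print(orphan_rejected, remaining)
```

After these statements only Jack's salary row is left, and it carries his new employee ID, without any manual cleanup of the Employee Salary table.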

It's worth noting that most RDBMSs (relational databases like Oracle, DB2, Teradata, etc.) can
automatically enforce referential integrity if the right settings are in place. But, a large part of the
burden of maintaining referential integrity is placed upon whoever designs the database schema,
basically whoever defined the tables and their corresponding structure/relationships in the
database that you are using. Referential integrity is an important concept and you simply must
know it for any programmer interview.

Provide an example and definition of a natural key in SQL.


You have probably come across the term natural key within the context of SQL and data
warehouses. What exactly is a natural key? A natural key is a key composed of columns that
actually have a logical relationship to other columns within a table. What does that mean in plain
English? Well, let's go through an example of a natural key.

Natural Key Example


Consider a table called People. If we use the columns First_Name, Last_Name, and Address
together to form a key, then that would be a natural key, because those columns are attributes
that are natural to people, and there is definitely a logical relationship between those columns and
any other columns that may exist in the table.

Why is it called a natural key?


The reason it's called a natural key is that the columns that belong to the key are just
naturally a part of the table and have a relationship with other columns in the table. So, a natural
key already exists within a table, and columns do not need to be added just to create an
artificial key.

Natural keys versus business keys


Natural keys are often also called business keys, so both terms mean exactly the same thing.

Natural keys versus domain keys


Domain keys also mean the same thing as natural keys.

Natural keys versus surrogate keys


Natural keys are often compared to surrogate keys. What exactly is a surrogate key? Well, first
consider the fact that the word surrogate literally means substitute. The reason a surrogate key is
like a substitute is that it's unnatural, in the sense that the column used for the surrogate key
has no logical relationship to other columns in the table.

In other words, the surrogate key really has no business meaning, i.e., the data stored in a
surrogate key has no intrinsic meaning to it.

Why are surrogate keys used?


A surrogate key could be considered to be the artificial key that we mentioned earlier. In most
databases, surrogate keys are only used to act as a primary key. Surrogate keys are usually
just simple sequential numbers, where each number uniquely identifies a row. For example,
Sybase and SQL Server both have what's called an identity column, specifically meant to hold a
unique sequential number for each row. MySQL allows you to define a column with the
AUTO_INCREMENT attribute, which just means that every time you add a new row, the value in
that column is automatically set to be 1 greater than the value in the most recent row added to
the table. You can also set the increment value to be whatever you want it to be.
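To make the contrast concrete, here is a minimal sketch of a People table that has both kinds of key. It assumes SQLite (run through Python's built-in sqlite3 module), whose AUTOINCREMENT keyword plays a role similar to the identity and AUTO_INCREMENT columns mentioned above. The person_id column name and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        person_id  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key: no business meaning
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        address    TEXT NOT NULL,
        UNIQUE (first_name, last_name, address)        -- natural key: columns already in the table
    )
""")
insert_sql = "INSERT INTO people (first_name, last_name, address) VALUES (?, ?, ?)"
conn.execute(insert_sql, ("Jane", "Doe", "1 Main St"))
conn.execute(insert_sql, ("John", "Roe", "2 High St"))

# The database assigned the surrogate key values; we never supplied them.
surrogate_ids = [row[0] for row in conn.execute(
    "SELECT person_id FROM people ORDER BY person_id"
)]

# The natural key still guards against the same person being entered twice.
try:
    conn.execute(insert_sql, ("Jane", "Doe", "1 Main St"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(surrogate_ids, duplicate_rejected)
```

The design choice shown here, a surrogate primary key plus a uniqueness constraint on the natural key columns, is one common pattern and not the only way to combine the two.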

In SQL, what's the difference between the having clause and the where clause?

The difference between the having and where clause is best illustrated by an example.
Suppose we have a table called emp_bonus as shown below. Note that the table has multiple
entries for employees A and B.

emp_bonus

Employee    Bonus
A           1000
B           2000
A           500
C           700
B           1250

If we want to calculate the total bonus that each employee received, then we would write a
SQL statement like this:

select employee, sum(bonus) from emp_bonus group by employee;

The Group By Clause


In the SQL statement above, you can see that we use the "group by" clause with the employee
column. What the group by clause does is allow us to find the sum of the bonuses
for each employee. Using the group by in combination with the sum(bonus) statement will
give us the sum of all the bonuses for employees A, B, and C.

Running the SQL above would return this:

Employee    Sum(Bonus)
A           1500
B           3250
C           700

Now, suppose we wanted to find the employees who received more than $1,000 in bonuses
for the year of 2007. You might think that we could write a query like this:

BAD SQL:
select employee, sum(bonus) from emp_bonus
group by employee where sum(bonus) > 1000;

The WHERE clause does not work with aggregates like SUM

The SQL above will not work, because the where clause doesn't work with aggregates like
sum, avg, max, etc. Instead, what we will need to use is the having clause. The having clause
was added to SQL just so we could compare aggregates to other values, just as the where
clause can be used with non-aggregates. Now, the correct SQL will look like this:

GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;
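Both statements can be checked against the emp_bonus data from the table above. The sketch below assumes SQLite, through Python's built-in sqlite3 module. Other databases will word the error differently, but they reject the where version as well. An order by is appended to the good query only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_bonus (employee TEXT, bonus INTEGER)")
conn.executemany(
    "INSERT INTO emp_bonus VALUES (?, ?)",
    [("A", 1000), ("B", 2000), ("A", 500), ("C", 700), ("B", 1250)],
)

bad_sql = ("select employee, sum(bonus) from emp_bonus "
           "group by employee where sum(bonus) > 1000")
good_sql = ("select employee, sum(bonus) from emp_bonus "
            "group by employee having sum(bonus) > 1000 "
            "order by employee")

# The where version is rejected before any rows are read.
try:
    conn.execute(bad_sql)
    bad_sql_failed = False
except sqlite3.OperationalError as exc:
    bad_sql_failed = True
    print("BAD SQL rejected:", exc)

# The having version filters the groups after the sums are computed.
big_bonuses = conn.execute(good_sql).fetchall()
print(big_bonuses)
```

Employee C drops out because the group total of 700 does not pass the having filter, while A and B stay with their summed bonuses.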

Difference between having and where clause


So we can see that the difference between the having and where clause in SQL is that the where
clause cannot be used with aggregates, but the having clause can. One way to think of it is
that the where clause filters individual rows before they are grouped, while the having clause is
an additional filter that is applied to the groups after the aggregates have been calculated.

How do database indexes work? And, how do indexes help? Provide a tutorial on database indexes.

Let's start out our tutorial and explanation of why you would need a database index by going
through a very simple example. Suppose that we have a database table called Employee with
three columns: Employee_Name, Employee_Age, and Employee_Address. Assume that the
Employee table has thousands of rows.

Now, let's say that we want to run a query to find all the details of any employees who are named
Jesus. So, we decide to run a simple query like this:

SELECT * FROM Employee
WHERE Employee_Name = 'Jesus'

What would happen without an index on the table?


Once we run that query, what exactly goes on behind the scenes to find employees who are
named Jesus? Well, the database software would literally have to look at every single row in the
Employee table to see if the Employee_Name for that row is Jesus. And, because we want every
row with the name Jesus inside it, we cannot just stop looking once we find one row with the
name Jesus, because there could be other rows with the name Jesus. So, every row up until the
last row must be searched, which means thousands of rows in this scenario will have to be
examined by the database to find the rows with the name Jesus. This is what is called a full table
scan.

How a database index can help performance


You might be thinking that doing a full table scan sounds inefficient for something so simple.
Shouldn't software be smarter? It's almost like looking through the entire table with the human
eye: very slow and not at all sleek. But, as you probably guessed by the title of this article, this
is where indexes can help a great deal. The whole point of having an index is to speed up
search queries by essentially cutting down the number of records/rows in a table that
need to be examined.

What is an index?

So, what is an index? Well, an index is a data structure (most commonly a B-tree) that stores the
values for a specific column in a table. An index is created on a column of a table. So, the key
points to remember are that an index consists of column values from one table, and that those
values are stored in a data structure. The index is a data structure; remember that.

What kind of data structure is an index?


B-trees are the most commonly used data structures for indexes. The reason B-trees are the
most popular data structure for indexes is that they are time efficient: look-ups, deletions, and
insertions can all be done in logarithmic time. And, another major reason B-trees are more
commonly used is that the data stored inside the B-tree can be sorted. The RDBMS typically
determines which data structure is actually used for an index. But, in some scenarios with certain
RDBMSs, you can actually specify which data structure you want your database to use when you
create the index itself.

How does a hash table index work?


Hash tables are another data structure that you may see being used as indexes. These indexes
are commonly referred to as hash indexes. The reason hash indexes are used is that hash
tables are extremely efficient when it comes to just looking up values. So, queries that compare
for equality to a string can retrieve values very fast if they use a hash index. For instance, the
query we discussed earlier (SELECT * FROM Employee WHERE Employee_Name = 'Jesus') could
benefit from a hash index created on the Employee_Name column. The way a hash index would
work is that the column value will be the key into the hash table, and the actual value mapped to
that key would just be a pointer to the row data in the table. Since a hash table is basically an
associative array, a typical entry would look something like 'Jesus' => 028939, where 028939
is a reference to the table row where Jesus is stored in memory. Looking up a value like Jesus in
a hash table index and getting back a reference to the row in memory is obviously a lot faster than
scanning the table to find all the rows with a value of Jesus in the Employee_Name column.
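The idea can be sketched in a few lines of Python, since a Python dict is itself a hash table. This is only a toy model and not how any particular RDBMS lays out its hash indexes: the "table" is a plain list, a row's position in that list stands in for the pointer, and all of the names and sample rows are made up for the example.

```python
from collections import defaultdict

# The "table": a list of rows. A row's position in the list stands in
# for the pointer (row address) that a real index would store.
employee_rows = [
    ("Jesus", 34, "12 Oak St"),
    ("Maria", 41, "9 Elm St"),
    ("Jesus", 52, "77 Pine Ave"),
    ("Chen", 29, "3 Lake Rd"),
]

def build_hash_index(rows, column):
    """Map each value in the given column to the positions of its rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

# Index on the name column (column 0).
name_index = build_hash_index(employee_rows, column=0)

# Equality lookup: one hash probe, then follow the stored "pointers".
matches = [employee_rows[pos] for pos in name_index.get("Jesus", [])]
print(matches)
```

The lookup touches only the two matching rows instead of walking the whole list, which is the saving that a hash index offers for equality comparisons.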

The disadvantages of a hash index


Hash tables are not sorted data structures, and there are many types of queries which hash
indexes cannot even help with. For instance, suppose you want to find out all of the employees
who are less than 40 years old. How could you do that with a hash table index? Well, it's not
possible, because a hash table is only good for looking up key value pairs, which means queries
that check for equality (like WHERE name = 'Jesus'). What is implied in the key value mapping in
a hash table is the concept that the keys of a hash table are not sorted or stored in any
particular order. This is why hash indexes are usually not the default type of data structure used
by database indexes: they aren't as flexible as B-trees when used as the index data
structure. Also see: Binary trees versus Hash Tables.
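To see why sorted order matters for the "less than 40 years old" query, here is a minimal Python sketch. A sorted list searched with the standard bisect module stands in for a B-tree; a real B-tree is a different structure, but it shares the property used here, namely that the keys are kept in order. The rows and the helper name are hypothetical.

```python
import bisect

employee_rows = [
    ("Jesus", 34, "12 Oak St"),
    ("Maria", 41, "9 Elm St"),
    ("Jesus", 52, "77 Pine Ave"),
    ("Chen", 29, "3 Lake Rd"),
]

# A sorted index on age: (age, row position) pairs kept in ascending order.
age_index = sorted((row[1], pos) for pos, row in enumerate(employee_rows))

def rows_younger_than(limit):
    """Return the rows whose age is strictly less than limit."""
    # Binary search for the first index entry with age >= limit;
    # every entry before that point satisfies the range condition.
    cutoff = bisect.bisect_left(age_index, (limit,))
    return [employee_rows[pos] for _, pos in age_index[:cutoff]]

under_40 = rows_younger_than(40)
print(under_40)
```

Because the entries are ordered, a single binary search finds where the qualifying range ends. A hash table has no such ordering, so every key would have to be inspected.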

What are some other types of indexes?


Indexes that use an R-tree data structure are commonly used to help with spatial problems. For
instance, a query like "Find all of the Starbucks within 2 kilometers of me" would be the type of
query that could show enhanced performance if the database table uses an R-tree index.

Another type of index is a bitmap index, which works well on columns that contain Boolean values
(like true and false) but many instances of those values; basically, columns with low selectivity.

How does an index improve performance?


Because an index is basically a data structure that is used to store column values, looking up
those values becomes much faster. And, if an index is using the most commonly used data
structure type (a B-tree), then the data structure is also sorted. Having the column values be
sorted can be a major performance enhancement; read on to find out why.

Let's say that we create a B-tree index on the Employee_Name column. This means that when we
search for employees named Jesus using the SQL we showed earlier, then the entire Employee
table does not have to be searched to find employees named Jesus. Instead, the database will
use the index to find employees named Jesus, because the index will presumably be sorted
alphabetically by the employee's name. And, because it is sorted, searching for a name is
a lot faster, because all names starting with a J will be right next to each other in the index! It's
also important to note that the index also stores pointers to the table rows so that other column
values can be retrieved; read on for more details on that.

What exactly is inside a database index?


So, now you know that a database index is created on a column in a table, and that the index
stores the values in that specific column. But, it is important to understand that a database index
does not store the values in the other columns of the same table. For example, if we create an
index on the Employee_Name column, this means that the Employee_Age and Employee_Address
column values are not also stored in the index. If we did store all the other columns in the
index, then it would be just like creating another copy of the entire table, which would take up
way too much space and would be very inefficient.

An index also stores a pointer to the table row


So, the question is: if the value that we are looking for is found in an index (like Jesus), how
does the database find the other values that are in the same row (like the address of Jesus and
his age)? Well, it's quite simple: database indexes also store pointers to the corresponding rows
in the table. A pointer is just a reference to the place on disk where the row data is stored. So,
in addition to the column value that is stored in the index, a pointer to the row in the table where
that value lives is also stored in the index. This means that one of the values (or nodes) in the
index for an Employee_Name could be something like ('Jesus', 082829), where 082829 is the
address on disk (the pointer) where the row data for Jesus is stored. Without that pointer all you
would have is a single value, which would be meaningless because you would not be able to
retrieve the other values in the same row, like the address and the age of an employee.

How does a database know when to use an index?


When a query like SELECT * FROM Employee WHERE Employee_Name = 'Jesus' is run, the
database will check to see if there is an index on the column(s) being queried. Assuming the
Employee_Name column does have an index created on it, the database will have to decide
whether it actually makes sense to use the index to find the values being searched, because
there are some scenarios where it is actually less efficient to use the database index, and more
efficient just to scan the entire table. Read this article to understand more about those
scenarios: Selectivity in SQL.

Can you force the database to use an index on a query?


Generally, you will not tell the database when to actually use an index; that decision will be made
by the database itself. It is worth noting, though, that in most databases (like Oracle and MySQL),
you can actually specify that you want the index to be used.
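Most databases let you inspect the decision that was made for a query. As a minimal sketch, the code below assumes SQLite (through Python's built-in sqlite3 module), where EXPLAIN QUERY PLAN reports whether a table is scanned or searched through an index, and where the INDEXED BY clause insists on one particular index. Oracle and MySQL have their own, different hint syntax, which is not shown here. The plan helper is a made-up convenience function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Employee_Name TEXT, Employee_Age INTEGER, Employee_Address TEXT)"
)
query = "SELECT * FROM Employee WHERE Employee_Name = 'Jesus'"

def plan(sql):
    """Return SQLite's description of how it intends to run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is the human-readable step.
    return " | ".join(row[-1] for row in rows)

# No index exists yet, so the only option is a full table scan.
plan_without_index = plan(query)

conn.execute("CREATE INDEX name_index ON Employee (Employee_Name)")
plan_with_index = plan(query)

# INDEXED BY tells SQLite that this specific index must be used.
forced_plan = plan(
    "SELECT * FROM Employee INDEXED BY name_index WHERE Employee_Name = 'Jesus'"
)

print(plan_without_index)
print(plan_with_index)
print(forced_plan)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name in the output than to compare whole strings.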

How to create an index in SQL:


Here's what the actual SQL would look like to create an index on the Employee_Name column from
our example earlier:

CREATE INDEX name_index
ON Employee (Employee_Name)

How to create a multi-column index in SQL:


We could also create an index on two of the columns in the Employee table, as shown in this SQL:

CREATE INDEX name_index
ON Employee (Employee_Name, Employee_Age)

What is a good analogy for a database index?


A very good analogy is to think of a database index as an index in a book. If you have a book
about dogs and you are looking for the section on Golden Retrievers, then why would you flip
through the entire book (which is the equivalent of a full table scan in database terminology)
when you can just go to the index at the back of the book, which will tell you the exact pages
where you can find information on Golden Retrievers? Similarly, just as a book index contains a
page number, a database index contains a pointer to the row containing the value that you are
searching for in your SQL.

What is the cost of having a database index?


So, what are some of the disadvantages of having a database index? Well, for one thing, it takes
up space, and the larger your table, the larger your index. Another performance hit with indexes
is the fact that whenever you add, delete, or update rows in the corresponding table, the same
operations will have to be done to your index. Remember that an index needs to contain the same
up-to-the-minute data as whatever is in the table column(s) that the index covers.

As a general rule, an index should only be created on a table if the data in the indexed column will
be queried frequently.

In SQL, how and when would you do a group by with multiple columns? Also provide an example.
In SQL, the group by statement is used along with aggregate functions like SUM, AVG, MAX, etc.
Using the group by statement with multiple columns is useful in many different situations and it
is best illustrated by an example. Suppose we have a table shown below called Purchases. The
Purchases table will keep track of all purchases made at a fictitious store.

Purchases

purchase_date            item            items_purchased
2011-03-25 00:00:00.000  Wireless Mouse  2
2011-03-25 00:00:00.000  Wireless Mouse  5
2011-03-25 00:00:00.000  MacBook Pro     1
2011-04-01 00:00:00.000  Paper Clips     20
2011-04-01 00:00:00.000  Stapler         3
2011-04-01 00:00:00.000  Paper Clips     15
2011-05-15 00:00:00.000  DVD player      3
2011-05-15 00:00:00.000  DVD player      8
2011-05-15 00:00:00.000  Stapler         5
2011-05-16 00:00:00.000  MacBook Pro     2

Now, let's suppose that the owner of the store wants to find out, for each date, how many of
each product was sold in the store. Then we would write this SQL in order to find that out:

select purchase_date, item, sum(items_purchased) as "Total Items"
from Purchases group by item, purchase_date;

Running the SQL above would return this:

purchase_date            item            Total Items
2011-03-25 00:00:00.000  Wireless Mouse  7
2011-03-25 00:00:00.000  MacBook Pro     1
2011-04-01 00:00:00.000  Paper Clips     35
2011-04-01 00:00:00.000  Stapler         3
2011-05-15 00:00:00.000  DVD player      11
2011-05-15 00:00:00.000  Stapler         5
2011-05-16 00:00:00.000  MacBook Pro     2

Note that in the SQL we wrote, the group by statement uses multiple columns: group by item,
purchase_date. This allows us to group the individual items for a given date. Basically, we are
dividing the results by the date the items were purchased, and then, for a given date, we are able
to find how many of each item were purchased on that date. This is why the group by statement with
multiple columns is so useful!
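To try this out, here is a small sketch that loads the Purchases rows into an in-memory SQLite database and runs the same query. Using SQLite, storing the dates as plain text without the time part, and adding an ORDER BY so the output order is predictable are all choices made for this example only.

```python
import sqlite3

PURCHASES = [
    ("2011-03-25", "Wireless Mouse", 2),
    ("2011-03-25", "Wireless Mouse", 5),
    ("2011-03-25", "MacBook Pro", 1),
    ("2011-04-01", "Paper Clips", 20),
    ("2011-04-01", "Stapler", 3),
    ("2011-04-01", "Paper Clips", 15),
    ("2011-05-15", "DVD player", 3),
    ("2011-05-15", "DVD player", 8),
    ("2011-05-15", "Stapler", 5),
    ("2011-05-16", "MacBook Pro", 2),
]


def totals_by_date_and_item():
    """Sum items_purchased for every (item, purchase_date) combination."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Purchases "
        "(purchase_date TEXT, item TEXT, items_purchased INTEGER)"
    )
    conn.executemany("INSERT INTO Purchases VALUES (?, ?, ?)", PURCHASES)
    rows = conn.execute(
        'SELECT purchase_date, item, SUM(items_purchased) AS "Total Items" '
        "FROM Purchases GROUP BY item, purchase_date "
        "ORDER BY purchase_date, item"  # ordering added only for readability
    ).fetchall()
    conn.close()
    return rows


for row in totals_by_date_and_item():
    print(row)
```

Each printed row is one (date, item) group, so the Stapler shows up twice: once for 2011-04-01 and once for 2011-05-15.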

What is the difference between a left outer join and a right outer join?

It is best to illustrate the differences between left outer joins and right outer joins by use of an
example. Here we have 2 tables that we will use for our example:

Employee                Location

EmpID  EmpName          EmpID  EmpLoc
13     Jason            13     San Jose
8      Alex             8      Los Angeles
3      Ram              3      Pune, India
17     Babu             17     Chennai, India
25     Johnson          39     Bangalore, India

For the purpose of our example, it is important to note that the very last employee in the
Employee table (Johnson, who has an ID of 25) is not in the Location table. Also, no one from the
Employee table is from Bangalore (the employee with ID 39 is not in the Employee table). These
facts will be significant in the discussion that follows.

A left outer join


Using the tables above, here is what the SQL for a left outer join would look like:

select * from employee left outer join location
on employee.empID = location.empID;

In the SQL above, we are joining on the condition that the employee IDs match in the tables
Employee and Location. So, we will be essentially combining 2 tables into 1, based on the
condition that the employee IDs match. Note that we can get rid of the "outer" in left outer join,
which will give us the SQL below. This is equivalent to what we have above.

select * from employee left join location
on employee.empID = location.empID;

What do left and right mean?

A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. What are the left and right tables? That's easy: the left table is
simply the table that comes first in the join statement. In this case it is the Employee table. It's
called the left table because it appears to the left of the keyword join. So, the right table in
this case would be Location. The SQL above will give us the result set shown below.

Employee.EmpID  Employee.EmpName  Location.EmpID  Location.EmpLoc
13              Jason             13              San Jose
8               Alex              8               Los Angeles
3               Ram               3               Pune, India
17              Babu              17              Chennai, India
25              Johnson           NULL            NULL

As you can see from the result set, all of the rows from the left table (Employee) are returned
when we do a left outer join. The last row of the Employee table (which contains the "Johnson"
entry) is displayed in the results even though there is no matching row in the Location table. As
you can see, the Location columns in that last row are filled with "NULL". So, we have
"NULL" as the entry wherever there is no match.
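Here is a small sketch that reproduces the left outer join with the two example tables, using an in-memory SQLite database through Python's sqlite3 module (an assumption for the sake of a runnable example; the helper name build_db is hypothetical). SQL NULLs come back as None in Python.

```python
import sqlite3


def build_db():
    """Create the Employee and Location tables from the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (empID INTEGER, empName TEXT)")
    conn.execute("CREATE TABLE location (empID INTEGER, empLoc TEXT)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [(13, "Jason"), (8, "Alex"), (3, "Ram"), (17, "Babu"), (25, "Johnson")],
    )
    conn.executemany(
        "INSERT INTO location VALUES (?, ?)",
        [
            (13, "San Jose"),
            (8, "Los Angeles"),
            (3, "Pune, India"),
            (17, "Chennai, India"),
            (39, "Bangalore, India"),
        ],
    )
    return conn


conn = build_db()
left_rows = conn.execute(
    "SELECT * FROM employee LEFT OUTER JOIN location "
    "ON employee.empID = location.empID"
).fetchall()

for row in left_rows:
    print(row)  # Johnson's row carries None for both Location columns
```

Notice that the Bangalore row never appears, because Location is the right table here and its unmatched rows are dropped.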

What is a right outer join?

A right outer join is pretty much the same thing as a left outer join, except that all the rows from
the right table are displayed in the result set, regardless of whether or not they have matching
values in the left table. This is what the SQL looks like for a right outer join:

select * from employee right outer join location
on employee.empID = location.empID;

-- taking out the "outer", this also works:

select * from employee right join location
on employee.empID = location.empID;

Using the tables presented above, we can show what the result set of a right outer join would look
like:

Employee.EmpID  Employee.EmpName  Location.EmpID  Location.EmpLoc
13              Jason             13              San Jose
8               Alex              8               Los Angeles
3               Ram               3               Pune, India
17              Babu              17              Chennai, India
NULL            NULL              39              Bangalore, India

We can see that the last row returned in the result set contains the row that was in the Location
table, but which had no matching empID in the Employee table (the "Bangalore, India" entry).
Because there is no row in the Employee table that has an employee ID of "39", we have NULLs in
that row for the Employee columns.

So, what is the difference between the right and left outer joins?

The difference is simple: in a left outer join, all of the rows from the left table will be displayed,
regardless of whether there are any matching rows in the right table. In a right outer
join, all of the rows from the right table will be displayed, regardless of whether there are any
matching rows in the left table. Hopefully the example that we gave above helped clarify this
as well.

Should I use a right outer join or a left outer join?

Actually, it doesn't matter. The right outer join does not add any functionality that the left outer
join didn't already have, and vice versa. All you would have to do to get the same results from a
right outer join and a left outer join is switch the order in which the tables appear in the SQL
statement. If that's confusing, just take a closer look at the examples given above.
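The table-swapping trick can be checked with a short sketch. It uses SQLite through Python's sqlite3 module, which is an assumption made for this example. Only a left join is used here, with Location written first, because older SQLite builds do not accept the right join keyword at all. The columns are listed explicitly so they come out in the same order as the right outer join result shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (empID INTEGER, empName TEXT)")
conn.execute("CREATE TABLE location (empID INTEGER, empLoc TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(13, "Jason"), (8, "Alex"), (3, "Ram"), (17, "Babu"), (25, "Johnson")],
)
conn.executemany(
    "INSERT INTO location VALUES (?, ?)",
    [
        (13, "San Jose"),
        (8, "Los Angeles"),
        (3, "Pune, India"),
        (17, "Chennai, India"),
        (39, "Bangalore, India"),
    ],
)

# "employee RIGHT OUTER JOIN location" rewritten as a left join with the
# tables swapped: location is now the left table, so all its rows are kept.
swapped_rows = conn.execute(
    "SELECT employee.empID, employee.empName, location.empID, location.empLoc "
    "FROM location LEFT OUTER JOIN employee "
    "ON employee.empID = location.empID"
).fetchall()

for row in swapped_rows:
    print(row)
```

The Bangalore row comes back with None in the Employee columns, and Johnson is gone, which matches the right outer join result set.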

In SQL, what is the difference between a left join and a left outer join?
There is actually no difference between a left join and a left outer join. They both refer to the
exact same operation in SQL. An example will help clear this up.

Here we have 2 tables that we will use for our example:


Employee                Location

EmpID  EmpName          EmpID  EmpLoc
13     Jason            13     San Jose
8      Alex             8      Los Angeles
3      Ram              3      Pune, India
17     Babu             17     Chennai, India
25     Johnson          39     Bangalore, India

It's important to note that the very last row in the Employee table has no matching row in the
Location table. Also, the very last row in the Location table has no matching row in the
Employee table. These facts will prove to be significant in the discussion that follows.

Left Outer Join


Here is what the SQL for a left outer join would look like, using the tables above:

select * from employee left outer join location
on employee.empID = location.empID;

In the SQL above, we can actually remove the "outer" in left outer join, which will give us the SQL
below. Running the SQL with the "outer" keyword would give us the exact same results as
running the SQL without the "outer". Here is the SQL without the "outer" keyword:

select * from employee left join location
on employee.empID = location.empID;

A left outer join (also known as a left join) retains all of the rows of the left table, regardless of
whether there is a row that matches on the right table. The SQL above will give us the result set
shown below.

Employee.EmpID  Employee.EmpName  Location.EmpID  Location.EmpLoc
13              Jason             13              San Jose
8               Alex              8               Los Angeles
3               Ram               3               Pune, India
17              Babu              17              Chennai, India
25              Johnson           NULL            NULL
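A quick way to convince yourself that the two spellings are the same operation is to run both and compare the results. The sketch below does that against an in-memory SQLite database (an assumption for the example), with an ORDER BY added to both queries so the comparison does not depend on row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (empID INTEGER, empName TEXT)")
conn.execute("CREATE TABLE location (empID INTEGER, empLoc TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(13, "Jason"), (8, "Alex"), (3, "Ram"), (17, "Babu"), (25, "Johnson")],
)
conn.executemany(
    "INSERT INTO location VALUES (?, ?)",
    [
        (13, "San Jose"),
        (8, "Los Angeles"),
        (3, "Pune, India"),
        (17, "Chennai, India"),
        (39, "Bangalore, India"),
    ],
)

JOIN_TAIL = "location ON employee.empID = location.empID ORDER BY employee.empID"

with_outer = conn.execute(
    "SELECT * FROM employee LEFT OUTER JOIN " + JOIN_TAIL
).fetchall()
without_outer = conn.execute(
    "SELECT * FROM employee LEFT JOIN " + JOIN_TAIL
).fetchall()

print(with_outer == without_outer)
```

Both queries return the same five rows, including the Johnson row with its empty Location columns.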


What is the difference between a right outer join and a right join?
Once again, a right outer join is exactly the same as a right join. This is what the SQL looks like:

select * from employee right outer join location
on employee.empID = location.empID;

-- taking out the "outer", this would give us
-- the same results:

select * from employee right join location
on employee.empID = location.empID;

Using the tables presented above, we can show what the result set of a right outer join would look
like:

Employee.EmpID  Employee.EmpName  Location.EmpID  Location.EmpLoc
13              Jason             13              San Jose
8               Alex              8               Los Angeles
3               Ram               3               Pune, India
17              Babu              17              Chennai, India
NULL            NULL              39              Bangalore, India

We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.

In SQL, what's the difference between the having clause and the group by statement?

In SQL, the having clause and the group by statement work together when using aggregate
functions like SUM, AVG, MAX, etc. This is best illustrated by an example. Suppose we have a table
called emp_bonus as shown below. Note that the table has multiple entries for employees A and B,
which means that both employees A and B have received multiple bonuses.

emp_bonus

Employee  Bonus
A         1000
B         2000
A         500
C         700
B         1250

If we want to calculate the total bonus amount that each employee has received, then we would
write a SQL statement like this:

select employee, sum(bonus) from emp_bonus group by employee;

The Group By Clause


In the SQL statement above, you can see that we use the "group by" clause with the employee
column. The group by clause allows us to find the sum of the bonuses for each employee,
because each employee is treated as his or her very own group. Using the group by in
combination with the sum(bonus) expression will give us the sum of all the bonuses for each of
employees A, B, and C.

Running the SQL above would return this:

Employee  Sum(Bonus)
A         1500
B         3250
C         700

Now, suppose we wanted to find the employees who received more than $1,000 in bonuses for the
year of 2012 (this is assuming, of course, that the emp_bonus table contains bonuses only for the
year of 2012). This is when we need to use the HAVING clause to add the additional check to see if
the sum of bonuses is greater than $1,000, and this is what the SQL looks like:

GOOD SQL:

select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;

And the result of running the SQL above would be this:

Employee  Sum(Bonus)
A         1500
B         3250

Difference between having clause and group by statement


So, from the example above, we can see that the group by clause is used to group rows by one or
more column(s) so that aggregates (like SUM, MAX, etc.) can be used to find the necessary
information. The having clause is used with the group by clause when comparisons need to be made
with those aggregate functions, like checking whether the SUM is greater than 1,000, as in our
example above. So, the having clause and group by statement are not really alternatives to each
other; they are used alongside one another!
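Both queries from this example can be run side by side with a short sketch. It uses an in-memory SQLite database through Python's sqlite3 module (an assumption for the example), and the helper name run is hypothetical. An ORDER BY is added so the rows come back in a predictable order.

```python
import sqlite3

BONUSES = [("A", 1000), ("B", 2000), ("A", 500), ("C", 700), ("B", 1250)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_bonus (employee TEXT, bonus INTEGER)")
conn.executemany("INSERT INTO emp_bonus VALUES (?, ?)", BONUSES)


def run(sql):
    """Execute a query and return all of its rows."""
    return conn.execute(sql).fetchall()


# GROUP BY alone: one total per employee.
totals = run(
    "SELECT employee, SUM(bonus) FROM emp_bonus "
    "GROUP BY employee ORDER BY employee"
)

# GROUP BY plus HAVING: keep only the groups whose total exceeds 1000.
over_1000 = run(
    "SELECT employee, SUM(bonus) FROM emp_bonus "
    "GROUP BY employee HAVING SUM(bonus) > 1000 ORDER BY employee"
)

print(totals)
print(over_1000)
```

The first query lists A, B, and C with their totals. The second drops C, whose total of 700 fails the HAVING check.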

Let's say that you are given a SQL table called Compare (the schema is shown below) with only
one column called Numbers.

Compare
{
Numbers INT(4)
}

Write a SQL query that will return the maximum value from the Numbers column, without using
a SQL aggregate like MAX or MIN.

This problem is difficult because you are forced to think outside the box, and use whatever SQL
you know to solve a problem without using the most obvious solution (doing a select MAX from
the table).

Probably the best way to start breaking this problem down is by creating a sample table with some
actual data that matches the schema given. Here is a sample table to start out with:

Compare

Numbers
30
70
-8
90

The value that we want to extract from the table above is 90, since it is the maximum value in the
table. How can we extract this value from the table in a creative way (it will have to be creative
since we can't use the max or min aggregates)? Well, what are the properties of the highest
number (90 in our example)? We could say that there are no numbers larger than 90, but that
doesn't sound very promising in terms of solving this problem.

We could also say that 90 is the only number that does not have a number that is greater than it.
If we can somehow return every value that does not have a value greater than it then we would
only be returning 90. This would solve the problem. So, we should try to design a SQL statement
that would return every number that does not have another number greater than it. Sounds fun
right?

Let's start out simple by figuring out which numbers do have any numbers greater than
themselves. This is an easier query. We can start by joining the Compare table with itself. This is
called a self join, which you can read more about in the article "Example of self join in SQL" in
case you are not familiar with self joins.

Using a self join, we can create all the possible pairs for which each value in one column is greater
than the corresponding value in the other column. This is exactly what the following query does:

SELECT Smaller.Numbers, Larger.Numbers
FROM Compare AS Larger JOIN Compare AS Smaller
ON Smaller.Numbers < Larger.Numbers

Now, let's use the sample table we created, and we end up with this table after running the query
above:

Smaller  Larger
-8       90
30       90
70       90
-8       70
30       70
-8       30

Now we have every value in the "Smaller" column except the largest value of 90. This means that
all we have to do is find the value that is not in the Smaller column (but is in the Compare table),
and that will give us the maximum value. We can easily do this using the NOT IN operator in SQL.

But before we do that we have to change the query above so that it only selects the "Smaller"
column - because that is the only column we are interested in. So, we can simply change our
query above to this in order to get the "Smaller" column:

SELECT Smaller.Numbers
FROM Compare as Larger JOIN Compare AS Smaller
ON Smaller.Numbers < Larger.Numbers

Now, all we have to do is apply the NOT IN operator to find the max value.

SELECT Numbers
FROM Compare
WHERE Numbers NOT IN (
SELECT Smaller.Numbers
FROM Compare AS Larger
JOIN Compare AS Smaller ON Smaller.Numbers < Larger.Numbers
)

This will give us what we want - the maximum value. But there is one small problem with the SQL
above - if the maximum value is repeated in the Compare table then it will return that value twice.
We can prevent that by simply using the DISTINCT keyword. So, here's what the query looks like
now:

SELECT DISTINCT Numbers
FROM Compare
WHERE Numbers NOT IN (
SELECT Smaller.Numbers
FROM Compare AS Larger
JOIN Compare AS Smaller ON Smaller.Numbers < Larger.Numbers
)

And there we have our final answer. Of course, some of you may be saying that there is a much
simpler solution to this problem. And you would be correct. Here is a simpler answer to the
problem using the SQL Top clause along with the SQL Order By clause. This is what it would look
like in SQL Server:

select TOP 1 -- select the very top entry in result set
Numbers
from
Compare
order by
Numbers DESC

And since MySQL does not have a TOP clause, this is what it would look like in MySQL using just
ORDER BY and LIMIT:

select
Numbers
from
Compare
order by
Numbers DESC -- order in descending order
LIMIT 1 -- retrieve only one value

So, even though there are a couple of much simpler answers, it is nice to know the more
complicated answer using a self join so that you can impress your interviewer with your
knowledge.
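If you want to check the self join answer yourself, here is a small sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. SQLite and the function name max_without_aggregate are assumptions made for this example. The column type is simplified to INTEGER.

```python
import sqlite3

# The final self join query from above, unchanged apart from layout.
MAX_WITHOUT_AGGREGATE = """
SELECT DISTINCT Numbers
FROM Compare
WHERE Numbers NOT IN (
    SELECT Smaller.Numbers
    FROM Compare AS Larger
    JOIN Compare AS Smaller ON Smaller.Numbers < Larger.Numbers
)
"""


def max_without_aggregate(values):
    """Load values into a Compare table and run the self join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Compare (Numbers INTEGER)")
    conn.executemany("INSERT INTO Compare VALUES (?)", [(v,) for v in values])
    rows = conn.execute(MAX_WITHOUT_AGGREGATE).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(max_without_aggregate([30, 70, -8, 90]))
print(max_without_aggregate([30, 70, -8, 90, 90]))  # maximum repeated
```

Both calls return a single value, which shows that the DISTINCT keyword takes care of the repeated maximum.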

What is a database deadlock? Provide an example and explanation of a deadlock in a database.
In a database, a deadlock is a situation that occurs when two or more different database
sessions have some data locked, and each database session requests a lock on the data that
another, different, session has already locked. Because the sessions are waiting for each other,
nothing can get done, and the sessions just waste time instead. This scenario where nothing
happens because of sessions waiting indefinitely for each other is known as deadlock.

If you are confused, some examples of deadlock should definitely help clarify what goes on during
deadlock. And, you should probably read our explanation of database locks before proceeding, since
that will help your understanding as well.

Database deadlock example


Suppose we have two database sessions called A and B. Let's say that session A requests and has
a lock on some data, and let's call the data Y. And then session B has a lock on some data that
we will call Z. But now, let's say that session A needs a lock on data Z in order to run another SQL
statement, but that lock is currently held by session B. And, let's say that session B needs a lock
on data Y, but that lock is currently held by session A. This means that session A is waiting on
session B's lock and session B is waiting on session A's lock. And this is what deadlock is all about!
Let's go through a more detailed (and less abstract) example of deadlock so that you can get a
more specific idea of how deadlock can arise.

Database deadlock example in banking


Let's use an example of two database users working at a bank. Let's call those database users X
and Y. Let's say that user X works in the customer service department and has to update the
database for two of the bank's customers, because one customer (call him customer A) incorrectly
received $5,000 in his account when it should have gone to another customer (call him customer
B). So, user X has to debit customer A's account by $5,000 and also credit customer B's account
by $5,000.

Note that the crediting of customer B and debiting of customer A will be run as a single transaction.
This is important for the discussion that follows.

Now, let's also say that the other database user, Y, works in the IT department and has to go
through the customers table and update the zip code of all customers who currently have a zip
code of 94520, because that zip code has now been changed to 94521. So, the SQL for this would
simply have a WHERE clause that would limit the update to customers with a zip code of 94520.

Also, both customers A and B currently have zip codes of 94520, which means that their
information will be updated by database user Y.

Here is a breakdown of the events in our fictitious example that lead to deadlock:

1. Database user X in the customer service department selects customer A's data and
updates A's bank balance to debit/decrease it by $5,000. However, what's important here
is that there is no COMMIT issued yet, because database user X still has to update
customer B's balance to increase/credit it by $5,000, and those 2 separate SQL statements
will run as a single SQL transaction. Most importantly, this means that database user X
still holds a lock on the row for customer A because his transaction is not fully
committed yet (he still has to update customer B). The lock on the row for customer
A will stay until the transaction is committed.
2. Database user Y then has to run his SQL to update the zip codes for customers with zip
codes of 94520. The SQL then updates customer B's zip code. But, because the SQL
statement from user Y must be run as a single transaction, the transaction has not
committed yet, because not all of the customers have had their zip codes changed yet. So,
this means that database user Y holds a lock on the row for customer B.
3. Now, database user X still has to run the SQL statement that will update customer B's
balance to increase it by $5,000. But, now the problem is that database user Y has a lock
on the row for customer B. This means that the request to update customer B's balance
must wait for user Y to release the lock on customer B. So, database user X is waiting for
user Y to release a lock on customer B.
4. Now, the SQL statement being run by user Y tries to update the zip code for customer
A. But, this update cannot happen because user X holds a lock on customer A's row. So,
user Y is waiting for a lock to be released by user X.
5. Now you can see that we have user X waiting for user Y to release a lock and user Y
waiting for user X to release a lock. This is the situation of deadlock, since neither user can
make any progress, and nothing happens because they are both waiting for each other.
So, in theory, these two database sessions will be stalled forever. But, read on to see how
some DBMSs deal with this situation.
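The five steps above can be acted out with a small simulation. This is only a sketch: two Python threads stand in for the sessions of users X and Y, and two threading.Lock objects stand in for the row locks on customers A and B. No real database is involved. A lock wait timeout is used purely so that the demo ends instead of hanging forever, and the function name simulate_deadlock is hypothetical.

```python
import threading


def simulate_deadlock(lock_wait_timeout=0.2):
    """Act out the banking example; return whether each session got its second lock."""
    row_a = threading.Lock()  # stands in for the row lock on customer A
    row_b = threading.Lock()  # stands in for the row lock on customer B
    both_hold_first = threading.Barrier(2)  # steps 1 and 2 are both done
    both_gave_up = threading.Barrier(2)     # keep first locks held until both have tried
    got_second_lock = {}

    def session(name, first_row, second_row):
        with first_row:                      # lock taken, transaction not committed
            both_hold_first.wait()
            # Steps 3 and 4: ask for the row the other session has locked.
            acquired = second_row.acquire(timeout=lock_wait_timeout)
            got_second_lock[name] = acquired
            if acquired:
                second_row.release()
            both_gave_up.wait()

    user_x = threading.Thread(target=session, args=("X", row_a, row_b))
    user_y = threading.Thread(target=session, args=("Y", row_b, row_a))
    user_x.start()
    user_y.start()
    user_x.join()
    user_y.join()
    return got_second_lock


print(simulate_deadlock())
```

Neither session ever obtains its second lock, which is step 5 in miniature: each one is waiting on a lock that the other refuses to give up.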

Database deadlock prevention


So now you have seen an example of deadlock. The question is: how do DBMSs deal with it? Well,
very few modern DBMSs can actually prevent or avoid deadlocks, because there's a lot of
overhead required in order to do so. This is because the DBMSs that do try to prevent deadlocks
have to try to predict what a database user will do next. The theory behind deadlock
prevention is that each lock request is inspected to see if it has the potential to cause contention.
If that is the case, then the lock is not allowed to be placed.

Database deadlock detection


Instead of deadlock prevention, the more popular approach to dealing with database deadlocks is
deadlock detection. What is deadlock detection? Well, deadlock detection is based on the principle
that one of the requests that caused the deadlock should be aborted.

How does deadlock detection work?


There are two common approaches to deadlock detection:

1. Whenever a session is waiting for a lock to be released, it is in what's known as a lock wait
state. One way deadlock detection is implemented is to simply set the lock wait time period to a
certain preset limit (like 5 seconds). So, if a session waits more than 5 seconds for a lock to
free up, then that session will be terminated.
2. The RDBMS can regularly inspect all the locks currently in place to see if there are
any two sessions that have locked each other out and are in a state of deadlock.

In either of the deadlock detection methods, one of the requests will have to be terminated to stop
the deadlock. This also means that any transaction changes which came before the request will
have to be rolled back so that the other request can make progress and finish.
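The second approach, inspecting who is waiting on whom, can be sketched as a search for a cycle. This is a minimal illustration and not how any specific RDBMS implements it. It assumes each blocked session waits on exactly one other session, and the function name find_deadlock and the dictionary layout are made up for the example.

```python
def find_deadlock(waits_for):
    """Return the sessions that form a waiting cycle, or None if there is none.

    waits_for maps a blocked session to the session holding the lock it wants,
    e.g. {"X": "Y"} means session X is waiting for session Y.
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of "is waiting for" until it ends or repeats.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            # The chain came back to a session already visited: a cycle.
            return seen[seen.index(current):]
    return None


# The banking example: X waits for Y (customer B's row), Y waits for X (customer A's row).
print(find_deadlock({"X": "Y", "Y": "X"}))

# X is merely waiting; Y is not blocked, so there is no deadlock.
print(find_deadlock({"X": "Y"}))
```

Once a cycle is found, the database would pick one of the sessions in it, abort its request, and roll back its changes so that the others can proceed.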

What is the difference between a Clustered and Non Clustered Index?
A clustered index determines the order in which the rows of a table are stored on disk. If a table
has a clustered index, then the rows of that table will be stored on disk in the same exact order as
the clustered index. An example will help clarify what we mean by that.

An example of a clustered index


Suppose we have a table named Employee which has a column named EmployeeID. Let's say we
create a clustered index on the EmployeeID column. What happens when we create this clustered
index? Well, all of the rows inside the Employee table will be physically sorted (on the actual
disk) by the values inside the EmployeeID column. What does this accomplish? Well, it means
that whenever a lookup/search for a sequence of EmployeeIDs is done using that clustered index,
the lookup will be much faster because the rows for that sequence of employee IDs are
physically stored right next to each other on disk. That is the advantage of the clustered index.
This is because the rows in the table are sorted in the exact same order as the clustered
index, and the actual table data is stored in the leaf nodes of the clustered index.

Remember that an index is usually a tree data structure and leaf nodes are the nodes that are at
the very bottom of that tree. In other words, a clustered index basically contains the actual
table level data in the index itself. This is very different from most other types of indexes as
you can read about below.

When would using a clustered index make sense?


Let's go through an example of when and why using a clustered index would actually make sense.
Suppose we have a table named Owners and a table named Cars. This is what the simple schema
would look like with the column names in each table:

Owners
Owner_Name
Owner_Age

Cars
Car_Type
Owner_Name

Let's assume that a given owner can have multiple cars, so a single Owner_Name can appear
multiple times in the Cars table. Now, let's say that we create a clustered index on the
Owner_Name column in the Cars table. What does this accomplish for us? Well, because a table
with a clustered index is stored physically on the disk in the same order as the index, it would
mean that a given Owner_Name would have all his/her car entries stored right next to each other on
disk. In other words, if there is an owner named "Joe Smith" or "Raj Gupta", then each owner would
have all of his/her entries in the Cars table stored right next to each other on the disk.

When is using a clustered index an advantage?


What is the advantage of this? Well, suppose that there is a frequently run query which tries to
find all of the cars belonging to a specific owner. With the clustered index, since all of the car
entries belonging to a single owner would be right next to each other on disk, the query will run
much faster than if the rows were being stored in some random order on the disk. And
that is the key point to remember!
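The benefit of having an owner's rows sitting next to each other can be sketched with a sorted Python list standing in for the rows on disk. This is a toy model under stated assumptions, not a real storage engine: the car types and the owner "Ana Lima" are invented sample data, and the function name cars_owned_by is hypothetical.

```python
import bisect

# Toy "disk": the Cars rows kept sorted by Owner_Name, the way a clustered
# index on that column would keep them. Each row is (Owner_Name, Car_Type).
cars_clustered = sorted([
    ("Joe Smith", "Civic"),
    ("Raj Gupta", "Model 3"),
    ("Joe Smith", "F-150"),
    ("Ana Lima", "Golf"),
    ("Raj Gupta", "Corolla"),
])

# The sorted key column, used here only to binary-search for an owner.
owner_names = [owner for owner, _ in cars_clustered]


def cars_owned_by(owner):
    """Return all rows for one owner as a single contiguous slice."""
    start = bisect.bisect_left(owner_names, owner)
    end = bisect.bisect_right(owner_names, owner)
    return cars_clustered[start:end]


print(cars_owned_by("Joe Smith"))
```

Because the rows are stored in index order, finding the first matching row is enough: every other car for that owner is in the very next positions, so one sequential read collects them all.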

Why is it called a clustered index?


In our example, all of the car entries belonging to a single owner would be right next to each other
on disk. This is the clustering, or grouping of similar values, which is referred to in the term
clustered index.

Note that an index on the Owner_Name column would not necessarily be unique, because there are
many people who share the same name. So, you might have to add another column to the
clustered index to make sure that it's unique.

What is a disadvantage to using a clustered index?

A disadvantage to using a clustered index is the fact that if a given row has a value updated in one
of its (clustered) indexed columns, what typically happens is that the database will have to move
the entire row so that the table will continue to be sorted in the same order as the clustered index
column. Consider our example above to clarify this. Suppose that someone named Rafael Nadal
buys a car (let's say it's a Porsche) from Roger Federer. Remember that our clustered index is
created on the Owner_Name column. This means that when we do an update to change the name
on that row in the Cars table, the Owner_Name will be changed from Roger Federer to Rafael
Nadal.

But, since a clustered index also tells the database in which order to physically store the rows on
disk, when the Owner_Name is changed the database will have to move the updated row so that it is
still in the correct sorted order. So, now the row that used to belong to Roger Federer will have to
be moved on disk so that it's grouped (or clustered) with all the car entries that belong to Rafael
Nadal. Clearly, this is a performance hit. This means that a simple UPDATE has turned into a
DELETE and then an INSERT, just to maintain the order of the clustered index. For this exact
reason, clustered indexes are usually created on primary keys or foreign keys, because
those values are less likely to change once they are already a part of a table.

A comparison of a non-clustered index with a clustered index with an example
As an example of a non-clustered index, let's say that we have a non-clustered index on the
EmployeeID column. A non-clustered index will store both the value of the
EmployeeID AND a pointer to the row in the Employee table where that value is actually stored.
But a clustered index, on the other hand, will actually store the row data for a particular
EmployeeID. So, if you are running a query that looks for an EmployeeID of 15, the data from
other columns in the table like EmployeeName, EmployeeAddress, etc. will all actually be stored in
the leaf node of the clustered index itself.

This means that with a non-clustered index extra work is required to follow that pointer to the
row in the table to retrieve any other desired values, as opposed to a clustered index which can
just access the row directly since it is being stored in the same order as the clustered index itself.
So, reading from a clustered index is generally faster than reading from a non-clustered index.

A table can have multiple non-clustered indexes


A table can have multiple non-clustered indexes because, unlike clustered indexes, they don't
affect the order in which the rows are stored on disk.

Why can a table have only one clustered index?


Because a clustered index determines the order in which the rows will be stored on disk, having
more than one clustered index on one table is impossible. Imagine if we had two clustered
indexes on a single table: which index would determine the order in which the rows will be
stored? Since the rows of a table can only be sorted to follow just one index, having more than
one clustered index is not allowed.

Summary of the differences between clustered and non-clustered indexes
Here's a summary of the differences:

- A clustered index determines the order in which the rows of the table will be stored on disk,
and it actually stores row-level data in the leaf nodes of the index itself. A non-clustered
index has no effect on the order in which the rows will be stored.
- Using a clustered index is an advantage when groups of data that can be clustered are
frequently accessed by some queries. This speeds up retrieval because the data lives close
to each other on disk. Also, if data is accessed in the same order as the clustered index,
the retrieval will be much faster because the physical data stored on disk is sorted in the
same order as the index.
- A clustered index can be a disadvantage because any time a change is made to a value of
an indexed column, the subsequent possibility of re-sorting rows to maintain order is a
definite performance hit.
- A table can have multiple non-clustered indexes. But, a table can have only one clustered
index.
- Non-clustered indexes store both a value and a pointer to the actual row that holds that
value. Clustered indexes don't need to store a pointer to the actual row, because
the rows in the table are stored on disk in the same exact order as the clustered index
and the clustered index actually stores the row-level data in its leaf nodes.

How do you tune database performance?


Every good Database Administrator (DBA) knows that tuning a database is a constant job. This is
because there is always something in the database that can be changed to make it run more
efficiently. But what are some of the more common things that you should keep an eye out for if
you are tuning an RDBMS? Let's go through those things without getting too specific to
any particular RDBMS (like MySQL, Oracle, etc.).

Tune database performance by designing tables efficiently


How you design your tables is very important in terms of database performance. In general terms
that are not specific to any RDBMS, here are some guidelines you should follow when designing tables:

Be careful using triggers. Using triggers on specific database tables can be a nice way to
solve a specific problem. For example, if you want to enforce a
difficult constraint, triggers can be a good solution. But, on the other hand, when a trigger
is fired (or triggered) by a given SQL statement, all the work that the trigger does
becomes a part of that SQL statement. This extra work can slow down the SQL statement,
and that can cause a big hit on performance if the trigger fires a lot.

Use the same data type for both primary and foreign key columns. This means that when
defining a primary and foreign key, it's not a good idea to define the primary key with a
VARCHAR data type and the foreign key with a CHAR data type; it's better to define both
keys as either VARCHAR or CHAR. This is because when tables are joined on columns whose
data types are different, the DBMS has to convert one of the data types to the same type as
the other one. This conversion from one data type to another is extra work for
the RDBMS, and of course it slows down any joins that are done between tables with
columns that have different data types.

When choosing a data type for a numeric column, you should always pick the smallest data
type that can hold your data. Choosing the smallest data type that still fits your data can
save a lot of space. For example, choosing the BIGINT data type when all of
your data for a given column could actually fit in the TINYINT or SMALLINT data types
would be a huge waste of space. Think of it this way: if your table has millions of rows,
then there are millions of entries for that BIGINT column, each of which has
extra space inside it that is never used. So, the unused and wasted space could add up very fast.

Another thing to consider is to use CHAR instead of VARCHAR when you know that the
entries will all have the same length. VARCHAR columns typically have an extra 1 to 3 bytes
per entry (the exact overhead depends on the RDBMS) so that they can also hold the length
of the actual data. In addition to those extra bytes, whenever the data in a VARCHAR column
changes, there is some more processing overhead because the length of the column data has to be
calculated yet again.
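
To put rough numbers on the wasted space described above, here is a small sketch. It assumes the integer sizes used by MySQL and SQL Server (TINYINT 1 byte, SMALLINT 2, INT 4, BIGINT 8); the row count and the function name wasted_bytes are made up for illustration, and the figures ignore per-row and per-page overhead.

```python
# Bytes per value for common integer types (sizes as in MySQL / SQL Server).
INT_TYPE_BYTES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def wasted_bytes(chosen: str, sufficient: str, rows: int) -> int:
    """Extra bytes used by storing `rows` values as `chosen` instead of `sufficient`."""
    return (INT_TYPE_BYTES[chosen] - INT_TYPE_BYTES[sufficient]) * rows


rows = 10_000_000  # hypothetical table size
print(wasted_bytes("BIGINT", "TINYINT", rows))   # 70000000 bytes, roughly 70 MB
print(wasted_bytes("BIGINT", "SMALLINT", rows))  # 60000000 bytes, roughly 60 MB
```

The same column usually appears again in every index that includes it, so the real cost tends to be higher than this single-column estimate.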

Tune database performance by reducing disk writes and reads


Reading from and writing to the disk (as in the actual hard drive on which the database is stored) is
extremely expensive in terms of database performance. Operations that read from and write to
the disk are also known as I/O operations. You have to do the best you can with
the available memory (the RAM, not to be confused with disk space), so that you reduce the
number of disk reads and writes to the absolute minimum necessary. Here are some approaches
to reducing the number of I/O operations, and also to minimizing the time spent waiting for those
operations to finish:

One way to reduce the time spent waiting on I/O operations is to use more than just one
physical disk drive. By having multiple disk drives,
you can spread out the database files on the different disk drives. By spreading out the
database files over multiple drives, you can have many I/O operations happening in
parallel, which can be a big performance advantage.

Another important approach to tuning your database for better performance is to
make sure that you allocate buffers of the right size. What exactly is a buffer?
It's a part of the memory (the RAM) that is used to temporarily store data that has
recently been retrieved from the database, and also to temporarily hold data that will
eventually be stored in permanent storage (like a disk). If you have a buffer of the
right size, then data that has already been read from the database will remain in the buffer
for a fair amount of time. That means that if a new query is run asking for the same
data, it can be retrieved directly from the buffer rather than through another
expensive (performance-wise) read from disk.

Buffers can also be used to store data that is written to the buffer by the RDBMS for
temporary storage. This data can later be copied to permanent storage at some
point in the future. This is also known as asynchronous I/O because the data is not written
immediately from the RDBMS to the permanent storage. All major RDBMSs have some
form of it, since it's quite important for performance.

Creating indexes helps greatly in tuning database performance


Database indexes are a major performance enhancer in databases. You can read more about some
best practices to follow for database indexes here: SQL Index best practices.

What are some tips on tuning SQL Indexes for better performance?
If you've already read our article on SQL indexes, then you know that indexes are used to make
queries run faster by reducing the time taken to look up data. But the tradeoff for that improved
performance is that indexes also take up space, and there must be some maintenance done on
indexes to ensure that they continue to run smoothly. Let's go over some tips, guidelines, and
suggestions on how to properly use and maintain indexes to improve performance.

Don't use too many indexes


As you know, indexes can take up a lot of space. Having too many indexes can also
hurt your performance because of the extra write work. For example, if you try to do an
UPDATE or an INSERT on a table that has too many indexes, then there could be a big hit on
performance due to the fact that all of the affected indexes will have to be updated as well. A general rule
of thumb is to not create more than 3 or 4 indexes on a table.

Try not to include columns that are repeatedly updated in an index


If you create an index on a column that is updated very often, then every time
the column is updated, the index will have to be updated as well. This is done by the DBMS, of
course, so that the index stays current and consistent with the columns that belong to that index.
So, the number of writes is doubled: one write to update the column itself and another
to update the index. You might want to consider keeping columns that
are frequently updated out of your indexes.

Creating indexes on foreign key column(s) can improve performance


Because joins are often done between primary and foreign key pairs, having an index on a foreign
key column can really improve the join performance. Not only that, but the index allows some
optimizers to use other methods of joining tables as well.

Create indexes for columns that are repeatedly used in predicates of your SQL queries


Take a look at your queries and see which columns are used frequently in the WHERE predicate. If
those columns are not part of an index already, then you should add them to an index. This is of
course because an index on columns that are repeatedly used in predicates will help speed up your
queries.
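
As a rough illustration of this tip, the sketch below uses Python's built-in sqlite3 module as a stand-in RDBMS and asks SQLite for its query plan before and after an index is created on the column used in the WHERE predicate. The table contents and the index name idx_employee_name are made up for the example, and other database systems expose their plans through different commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT, Salary INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee (EmployeeName, Salary) VALUES (?, ?)",
    [(f"emp{i}", 1000 + i) for i in range(1000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable description.
    return " | ".join(row[-1] for row in rows)


query = "SELECT * FROM Employee WHERE EmployeeName = 'emp42'"

plan_before = plan(query)  # no usable index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_employee_name ON Employee (EmployeeName)")
plan_after = plan(query)   # the new index is now used for the lookup

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name in the output than to compare whole strings.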

Get rid of overlapping indexes


Overlapping indexes are indexes that share the same leading column (the first column), and that
is why they are called overlapping indexes. Remember that indexes can have multiple columns.
Almost all RDBMSs can use an index for a query even if that query's WHERE predicate uses
only the first column of that index and doesn't use any of the other columns in that index. In
other words, most RDBMSs can use indexes for queries even when there is just a partial match of
the query columns to the index columns. For this reason, overlapping indexes are usually not
necessary, but be sure to research this for your own particular RDBMS.
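
Here is a minimal sketch of that partial match, again using Python's sqlite3 module as a stand-in RDBMS. The Person table and the index name idx_last_first are hypothetical. The query filters only on LastName, the leading column of a two-column index, and SQLite still picks that index, which is why a separate index on LastName alone would be redundant here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Person ("
    "ID INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT, City TEXT)"
)
# A composite index whose leading (first) column is LastName.
conn.execute("CREATE INDEX idx_last_first ON Person (LastName, FirstName)")

# The WHERE predicate uses only the leading column of the index.
rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Person WHERE LastName = 'Smith'"
).fetchall()
plan_detail = " | ".join(row[-1] for row in rows)
print(plan_detail)  # the plan mentions idx_last_first
```

A query that filtered only on FirstName could not use this index for a direct lookup in the same way, since FirstName is not the leading column.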

Consider deleting an index when loading huge amounts of data into a table


If you are loading a huge amount of data into a table, then you might want to think about deleting
some of the indexes on the table. Then, after the data is loaded into the table, you can recreate
the indexes. The reason you would want to do this is that the indexes will not have to be
updated for every row during the load, which could save you a lot of time!
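
The drop, load, and recreate sequence might look like the sketch below, which uses Python's sqlite3 module as a stand-in RDBMS. The table layout, the index name idx_purchases_item, and the helper bulk_load are made up for illustration; whether this actually saves time depends on the RDBMS and on how much data is loaded, so it is worth measuring.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchases (item TEXT, items_purchased INTEGER)")
conn.execute("CREATE INDEX idx_purchases_item ON Purchases (item)")


def bulk_load(rows):
    """Drop the index, load all rows, then rebuild the index once at the end."""
    with conn:  # commits when the block finishes without an error
        conn.execute("DROP INDEX idx_purchases_item")
        conn.executemany("INSERT INTO Purchases VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX idx_purchases_item ON Purchases (item)")


bulk_load((f"item{i % 100}", i) for i in range(50_000))

row_count = conn.execute("SELECT COUNT(*) FROM Purchases").fetchone()[0]
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(row_count, index_names)  # 50000 ['idx_purchases_item']
```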

Ensure that the indexes you create have high selectivity


For reasons described in more detail in our article on index selectivity, you should create indexes with a
high selectivity. A general rule of thumb is that indexes should have a selectivity that's higher than
.33. Remember that indexes that are completely unique have a selectivity of 1.0, which is the
highest possible selectivity value.
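
Selectivity is commonly computed as the number of distinct values in the indexed column divided by the total number of rows. The sketch below measures it with Python's sqlite3 module; the Employee data and the helper name selectivity are made up for illustration, and the helper builds SQL from its arguments, so it should only be called with trusted table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, Department TEXT)")
# 100 employees spread over only 4 departments.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(i, f"dept{i % 4}") for i in range(1, 101)],
)


def selectivity(table: str, column: str) -> float:
    """Distinct values divided by total rows for one column."""
    sql = f"SELECT COUNT(DISTINCT {column}) * 1.0 / COUNT(*) FROM {table}"
    return conn.execute(sql).fetchone()[0]


print(selectivity("Employee", "EmployeeID"))  # 1.0, every value is unique
print(selectivity("Employee", "Department"))  # 0.04, well below the .33 guideline
```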

What is the difference between correlated subqueries and uncorrelated subqueries?
Let's start out with an example of what an uncorrelated subquery looks like, and then we can
compare that with a correlated subquery.

Example of an Uncorrelated Subquery

Here is an example of some SQL that represents an uncorrelated subquery:

select Salesperson.Name from Salesperson
where Salesperson.ID NOT IN(
select Orders.salesperson_id from Orders, Customer
where Orders.cust_id = Customer.ID
and Customer.Name = 'Samsonic')

If the SQL above looks scary to you, don't worry: it's still easy to understand for our purposes
here. The subquery portion of the SQL above begins after the NOT IN operator. The reason that
the query above is an uncorrelated subquery is that the subquery can be run independently of the
outer query. Basically, the subquery does not reference anything from the outer query.
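
To see that independence in practice, the sketch below runs the subquery on its own and then as part of the full query, using Python's sqlite3 module. The table layouts and the sample rows (Abe, Bob, Chris, Acme) are assumptions invented for the example; only the table and column names used in the query come from the SQL above. An ORDER BY is added to the outer query purely to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salesperson (ID INTEGER, Name TEXT);
    CREATE TABLE Customer (ID INTEGER, Name TEXT);
    CREATE TABLE Orders (salesperson_id INTEGER, cust_id INTEGER);
    INSERT INTO Salesperson VALUES (1, 'Abe'), (2, 'Bob'), (3, 'Chris');
    INSERT INTO Customer VALUES (10, 'Samsonic'), (11, 'Acme');
    INSERT INTO Orders VALUES (1, 10), (2, 11);
""")

subquery = """
    SELECT Orders.salesperson_id FROM Orders, Customer
    WHERE Orders.cust_id = Customer.ID
      AND Customer.Name = 'Samsonic'
"""

# The subquery is valid SQL by itself: it never mentions the outer query.
inner_result = conn.execute(subquery).fetchall()

outer_result = conn.execute(
    "SELECT Salesperson.Name FROM Salesperson "
    f"WHERE Salesperson.ID NOT IN ({subquery}) "
    "ORDER BY Salesperson.Name"
).fetchall()

print(inner_result)  # [(1,)]
print(outer_result)  # [('Bob',), ('Chris',)]
```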

Now, a correlated subquery has the opposite property: the subquery cannot be run
independently of the outer query. You can take a look at this example of a correlated subquery
below and easily see the difference yourself:

Example of a correlated subquery

SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > Emp1.Salary)

What you will notice in the correlated subquery above is that the inner subquery uses
Emp1.Salary, but the alias Emp1 is created in the outer query. This is why it is called a correlated
subquery, because the subquery references a value in its WHERE clause (in this case, it uses a
column belonging to Emp1) that is used in the outer query.

How does a correlated subquery work exactly?


It's important to understand the order of operations in a correlated subquery. First, a row is
processed in the outer query. Then, for that particular row, the subquery is executed. So, for each
row processed by the outer query, the subquery will also be processed. In our example of a
correlated subquery above, every time a row is processed for Emp1, the subquery will take
that row's value for Emp1.Salary and run. Then the outer query will move on to the next row,
and the subquery will execute for that row's value of Emp1.Salary. This continues until every
row of the outer query has been processed, and each row for which the WHERE (1) = ( ) condition
is satisfied is included in the result.
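
The row-by-row evaluation can be imitated in plain Python, which may make the order of operations easier to follow. This is only a conceptual sketch: the three sample rows and the function name rows_matching are made up for illustration, and a real RDBMS is free to optimize the execution differently.

```python
# Sample Employee rows as (Employee ID, Salary) pairs.
employees = [(3, 200), (4, 800), (7, 450)]


def rows_matching(target_count: int):
    """Imitate: SELECT * FROM Employee Emp1 WHERE (target_count) = (correlated subquery)."""
    result = []
    for emp1_id, emp1_salary in employees:  # outer query: one row at a time
        # Subquery, re-run for this row: COUNT(DISTINCT Emp2.Salary)
        # over the rows where Emp2.Salary > Emp1.Salary.
        higher_salaries = {salary for _, salary in employees if salary > emp1_salary}
        if len(higher_salaries) == target_count:
            result.append((emp1_id, emp1_salary))
    return result


print(rows_matching(1))  # [(7, 450)]
```

Note that the loop visits every row even after a match is found, just as the outer query does.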

You can also read this particular SQL problem to get more detailed information on how the
correlated query above works: SQL Find nth highest salary

Suppose that you are given the following simple database table called Employee that has 2 columns named Employee ID and Salary:

Employee

Employee ID   Salary
3             200
4             800
7             450

Write a SQL query to get the second highest salary from the table above. Also write a query to find the nth highest salary in SQL, where n can be any number.
The easiest way to start with a problem like this is to ask yourself a simpler question first. So, let's
ask ourselves: how can we find the highest salary in a table? Well, you probably know that is
actually really easy. We can just use the MAX aggregate function:

select MAX(Salary) from Employee;

Remember that SQL is based on set theory


You should remember that SQL uses sets as the foundation for most of its queries. So, the
question is: how can we use set theory to find the 2nd highest salary in the table above? Think
about it on your own for a bit. Even if you do not remember much about sets, the answer is very
easy to understand and something that you might be able to come up with on your own.

Figuring out the answer to find the 2nd highest salary


What if we try to exclude the highest salary value from the result set returned by the SQL that we
run? If we remove the highest salary from a group of salary values, then we will have a new group
of values whose highest salary is actually the 2nd highest in the original Employee table.

So, if we can somehow select the highest value from a result set that excludes the highest value,
then we would actually be selecting the 2nd highest salary value. Think about that carefully and
see if you can come up with the actual SQL yourself before you read the answer that we provide
below. Here is a small hint to help you get started: you will have to use the NOT IN SQL
operator.

Solution to finding the 2nd highest salary in SQL

Now, here is what the SQL will look like:

SELECT MAX(Salary) FROM Employee
WHERE Salary NOT IN (SELECT MAX(Salary) FROM Employee )

Running the SQL above would return us 450, which is of course the 2nd highest salary in the
Employee table.

An explanation of the solution


The SQL above first finds the highest salary value in the Employee table using (select
MAX(Salary) from Employee). Then, adding the WHERE Salary NOT IN in front basically creates
a new set of Salary values that does not include the highest Salary value. For instance, if the
highest salary in the Employee table is 200,000 then that value will be excluded from the results
using the NOT IN operator, and all values except for 200,000 will be retained in the results.

This now means that the highest value in this new result set will actually be the 2nd highest value
in the Employee table. So, we then select the max Salary from the new result set, and that gives
us the 2nd highest Salary in the Employee table. And that is how the query above works.

An alternative solution using the not equals SQL operator


We can actually use the not equals operator (the <> operator) instead of the NOT IN operator as an
alternative solution to this problem. This is what the SQL would look like:

select MAX(Salary) from Employee
WHERE Salary <> (select MAX(Salary) from Employee )
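
Both versions can be checked against the three-row Employee table from the problem statement. The sketch below uses Python's sqlite3 module as a convenient stand-in RDBMS; the column name EmployeeID (without a space) is an assumption made so the column is easy to reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)", [(3, 200), (4, 800), (7, 450)]
)

# Version 1: exclude the highest salary with NOT IN.
second_highest_not_in = conn.execute(
    "SELECT MAX(Salary) FROM Employee "
    "WHERE Salary NOT IN (SELECT MAX(Salary) FROM Employee)"
).fetchone()[0]

# Version 2: exclude the highest salary with the <> operator.
second_highest_not_equal = conn.execute(
    "SELECT MAX(Salary) FROM Employee "
    "WHERE Salary <> (SELECT MAX(Salary) FROM Employee)"
).fetchone()[0]

print(second_highest_not_in, second_highest_not_equal)  # 450 450
```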


How would you write a SQL query to find the Nth highest salary?
What we did above was write a query to find the 2nd highest Salary value in the Employee table.
But another commonly asked interview question is how we can use SQL to find the Nth highest
salary, where N can be any number, whether it's the 3rd highest, 4th highest, 5th highest, 10th
highest, etc. This is also an interesting question, so try to come up with an answer yourself before
reading the one below.

The answer and explanation to finding the nth highest salary in SQL
Here we will present one possible answer to finding the nth highest salary first, and the
explanation of that answer after, since it's actually easier to understand that way. Note that the
first answer we present is actually not optimal from a performance standpoint since it uses a
subquery, but we think that it will be interesting for you to learn about because you might just
learn something new about SQL. If you want to see the more optimal solutions first, you can skip
down to the section that says Find the nth highest salary without a subquery instead.

The SQL below will give you the correct answer, but you will have to plug in an actual value for N
of course. This SQL to find the Nth highest salary should work in SQL Server, MySQL, DB2, Oracle,
Teradata, and almost any other RDBMS:

SELECT * /* This is the outer query part */
FROM Employee Emp1
WHERE (N-1) = ( /* Subquery starts here */
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > Emp1.Salary)

How does the query above work?


The query above can be quite confusing if you have not seen anything like it before. Pay special
attention to the fact that Emp1 appears in both the subquery (also known as an inner query)
and the outer query. The outer query is just the part of the query that is not the subquery/inner
query. Both parts of the query are clearly labeled in the comments.

The subquery is a correlated subquery


The subquery in the SQL above is actually a specific type of subquery known as
a correlated subquery. The reason it is called a correlated subquery is that the subquery
uses a value from the outer query in its WHERE clause. In this case that value comes from the
Emp1 table alias (Emp1.Salary), as we pointed out earlier. A normal subquery can be run independently of the outer query,
but a correlated subquery can NOT be run independently of the outer query. If you want to read
more about the differences between correlated and uncorrelated subqueries you can go
here: Correlated vs Uncorrelated Subqueries.

The most important thing to understand in the query above is that the subquery is evaluated each
and every time a row is processed by the outer query. In other words, the inner query cannot be
processed independently of the outer query since the inner query uses the Emp1 value as well.

Finding nth highest salary example and explanation


Let's step through an actual example to see how the query above will actually execute step by
step. Suppose we are looking for the 2nd highest Salary value in our table above, so our N is 2.
This means that the query will look like this:

SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > Emp1.Salary)

You can probably see that Emp1 and Emp2 are just aliases for the same Employee table. It's like
we just created 2 separate clones of the Employee table and gave them different names.

Understanding and visualizing how the query above works


Let's assume that we are using this data:

Employee

Employee ID   Salary
3             200
4             800
7             450

For the sake of our explanation, let's assume that N is 2, so the query is trying to find the 2nd
highest salary in the Employee table. The first thing that the query above does is process the very
first row of the Employee table, which has an alias of Emp1.

The salary in the first row of the Employee table is 200. Because the subquery is correlated to the
outer query through the alias Emp1, it means that when the first row is processed, the query will
essentially look like this (note that all we did is replace Emp1.Salary with the value of 200):

SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > 200)

So, what exactly is happening when that first row is processed? Well, if you pay special attention
to the subquery you will notice that it's basically searching for the count of distinct salary values in the
Employee table that are greater than 200. Then, that count is checked to see if it equals
1 in the outer query, and if so then everything from that particular row in Emp1 will be returned.

Note that Emp1 and Emp2 are both aliases for the same table, Employee. Emp2 is only being
used in the subquery to compare all the salary values to the current salary value chosen in Emp1.
This allows us to find the number of salary entries (the count) that are greater than 200. And if
this number is equal to N-1 (which is 1 in our case) then we know that we have a winner and
that we have found our answer.

But it's clear that the subquery will return a 2 when Emp1.Salary is 200, because there are
2 salaries greater than 200 in the Employee table. And since 2 is not equal to 1, the row with the salary of 200
will not be returned.

So, what happens next? Well, the SQL processor will move on to the next row which is 800, and
the resulting query looks like this:

SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > 800)

Since there are no salaries greater than 800, the subquery returns 0, which is not equal to 1, so
that row is not returned either. The query will then move on to the last row, whose salary is 450.
Exactly one salary (800) is greater than 450, so the count will be 1 and the condition is satisfied.
The entire row with the desired salary is returned, and this is what it would
look like:

EmployeeID   Salary
7            450

It's also worth pointing out that the reason DISTINCT is used in the query above is that there
may be duplicate salary values in the table. In that scenario, we only want to count repeated
salaries just once, which is exactly why we use the DISTINCT operator.

A high level summary of how the query works


Let's go through a high level summary of how someone would have come up with the SQL in the
first place, since we showed you the answer first without really going through the thought
process one would use to arrive at that answer.

Think of it this way: we are looking for a pattern that will lead us to the answer. One way to look
at it is that the 2nd highest salary would have just one salary that is greater than it. The 4th
highest salary would have 3 salaries that are greater than it. In more general terms, in order to
find the Nth highest salary, we just find the salary that has exactly N-1 distinct salaries greater
than itself. And that is exactly what the query above accomplishes: it simply finds the salary
that has N-1 salaries greater than itself and returns that row as the answer.
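
The N-1 pattern can be tried out for several values of N with the sketch below, which runs the correlated query through Python's sqlite3 module. Two things here are assumptions added for illustration: N is passed as a bound parameter (the ? placeholder), and an extra row (9, 800) duplicates the top salary so that the effect of DISTINCT is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(3, 200), (4, 800), (7, 450), (9, 800)],  # (9, 800) is an added duplicate
)

NTH_HIGHEST_SQL = """
    SELECT *
    FROM Employee Emp1
    WHERE (? - 1) = (
        SELECT COUNT(DISTINCT Emp2.Salary)
        FROM Employee Emp2
        WHERE Emp2.Salary > Emp1.Salary)
"""


def nth_highest(n: int):
    """Return every row whose salary has exactly n - 1 distinct salaries above it."""
    return sorted(conn.execute(NTH_HIGHEST_SQL, (n,)).fetchall())


print(nth_highest(1))  # [(4, 800), (9, 800)]
print(nth_highest(2))  # [(7, 450)]
print(nth_highest(3))  # [(3, 200)]
```

Because of DISTINCT, the duplicated 800 is counted once, so 450 is still reported as the 2nd highest salary. When several employees share the Nth highest salary, the query returns all of them.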

Find the nth highest salary using the TOP keyword in SQL Server
We can also use the TOP keyword (for databases that support the TOP keyword, like SQL Server)
to find the nth highest salary. Here is some fairly simple SQL that would help us do that:

SELECT TOP 1 Salary
FROM (
SELECT DISTINCT TOP N Salary
FROM Employee
ORDER BY Salary DESC
) AS Emp
ORDER BY Salary

To understand the query above, first look at the subquery, which simply finds the N highest
salaries in the Employee table and arranges them in descending order. Then, the outer query will
actually rearrange those values in ascending order, which is what the very last line ORDER BY
Salary does, because the default for ORDER BY is to sort values in ascending order.
Finally, that means the Nth highest salary will be at the top of the list of salaries, which means we
just want the first row, which is exactly what SELECT TOP 1 Salary will do for us!

Find the nth highest salary without using the TOP keyword
There are many other solutions to finding the nth highest salary that do not need to use the TOP
keyword, one of which we already went over. Keep reading for more solutions.

Find the nth highest salary in SQL without a subquery


The correlated subquery solution we gave above actually does not do well from a performance standpoint. This is
because the subquery is evaluated for every row of the outer query, which can really slow down the query. With that in mind, let's go
through some different solutions to this problem for different database vendors. Because each
database vendor (whether it's MySQL, Oracle, or SQL Server) has a different SQL syntax and
functions, we will go through solutions for specific vendors. But keep in mind that the solution
presented above using a subquery should work across different database vendors.

Find the nth highest salary in MySQL


In MySQL, we can just use the LIMIT clause along with an offset to find the nth highest salary. If
that doesn't make sense, take a look at the MySQL-specific SQL to see how we can do this:

SELECT Salary FROM Employee
ORDER BY Salary DESC LIMIT N-1, 1

Note that the DESC used in the query above simply arranges the salaries in descending order, so
from highest salary to lowest. Then, the key part of the query to pay attention to is the LIMIT N-1, 1.
The LIMIT clause takes two arguments in that query: the first argument specifies the offset
of the first row to return, and the second specifies the maximum number of rows to return. So, it's
saying that the offset of the first row to return should be N-1, and the max number of rows to
return is 1. What exactly is the offset? Well, the offset is just a numerical value that represents the
number of rows to skip from the very first row, and since the rows are arranged in descending order we
know that the row at an offset of N-1 will contain the Nth highest salary. Keep in mind that MySQL
expects literal numbers in the LIMIT clause, so N-1 has to be replaced with the actual value
(for example, LIMIT 2, 1 for the 3rd highest salary).
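
The offset idea can be sketched with Python's sqlite3 module, since SQLite also supports LIMIT with an OFFSET. This is a stand-in for MySQL rather than MySQL itself, and two details are assumptions added for the example: the offset is passed as a bound parameter, and DISTINCT is added so that duplicate salaries do not occupy more than one position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)", [(3, 200), (4, 800), (7, 450)]
)


def nth_highest_salary(n: int):
    """Skip the n - 1 highest distinct salaries and return the next one, if any."""
    row = conn.execute(
        "SELECT DISTINCT Salary FROM Employee "
        "ORDER BY Salary DESC LIMIT 1 OFFSET ?",
        (n - 1,),
    ).fetchone()
    return row[0] if row else None


print(nth_highest_salary(1))  # 800
print(nth_highest_salary(2))  # 450
print(nth_highest_salary(4))  # None, the table has only 3 salaries
```

Computing n - 1 in the application and binding it as a parameter avoids putting an arithmetic expression inside the LIMIT clause.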

Find the nth highest salary in SQL Server


In SQL Server, there is no such thing as a LIMIT clause. But we can still use an offset to find the
nth highest salary without using a subquery, just like the solution we gave above in MySQL
syntax. The SQL Server syntax is a bit different. Here is what it would look like:

SELECT Salary FROM Employee
ORDER BY Salary DESC
OFFSET N-1 ROWS
FETCH NEXT 1 ROWS ONLY

Note that the OFFSET ... FETCH syntax only works in SQL Server 2012 and up.

Find the nth highest salary in Oracle using row_number


Older versions of Oracle (before 12c) don't support using an offset like MySQL and SQL Server, but we can
use the row_number analytic function in Oracle to solve this problem. Here is what the Oracle-specific
SQL would look like to find the nth highest salary:

select * from (
select Emp.*,
row_number() over (order by Salary DESC) rownumb
from Employee Emp
)
where rownumb = n; /*n is nth highest salary*/

The first thing you should notice in the query above is that inside the subquery the salaries are
arranged in descending order. Then, the row_number analytic function is applied against the list of
descending salaries. Applying the row_number function against the list of descending salaries
means that each row will be assigned a row number starting from 1. And since the rows are
arranged in descending order the row with the highest salary will have a 1 for the row number.
Note that the row number is given the alias rownumb in the SQL above.

This means that in order to find the 3rd or 4th highest salary we simply look for the 3rd or 4th
row. The query above will then compare the rownumb to n, and if they are equal will return
everything in that row. And that will be our answer!

Find the nth highest salary in Oracle using RANK


Oracle also provides a RANK function that just assigns a ranking numeric value (with 1 being the
highest) for some sorted values. So, we can use this SQL in Oracle to find the nth highest salary
using the RANK function:

select * FROM (
select EmployeeID, Salary
,rank() over (order by Salary DESC) ranking
from Employee
)
WHERE ranking = N;

The rank function will assign a ranking to each row starting from 1. This query is actually quite
similar to the one where we used the row_number() analytic function, and works in the same way
when there are no duplicate salaries. Unlike row_number, rank gives tied salaries the same ranking
and then skips the following values, so if two employees share the highest salary there is no row
with a ranking of 2. Oracle's DENSE_RANK function does not leave such gaps.
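
The difference between the three numbering functions shows up once the data contains a tie. The sketch below runs them side by side through Python's sqlite3 module, which is used here as a stand-in for Oracle and needs SQLite 3.25 or newer for window functions. The extra row (9, 800) and the column aliases are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(3, 200), (4, 800), (7, 450), (9, 800)],  # two employees tie at 800
)

numbered = conn.execute("""
    SELECT Salary,
           ROW_NUMBER() OVER (ORDER BY Salary DESC) AS row_num,
           RANK()       OVER (ORDER BY Salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY Salary DESC) AS dense_rnk
    FROM Employee
    ORDER BY Salary DESC, row_num
""").fetchall()

for row in numbered:
    print(row)
# (800, 1, 1, 1)
# (800, 2, 1, 1)
# (450, 3, 3, 2)
# (200, 4, 4, 3)
```

With this data, filtering on row_num = 2 returns a second 800 row, filtering on rnk = 2 returns nothing, and filtering on dense_rnk = 2 returns the 450 row, which is usually what "2nd highest salary" is meant to say.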

We've now gone through many different solutions in different database vendors like Oracle,
MySQL, and SQL Server. Hopefully now you understand how to solve a problem like this, and you
have improved your SQL skills in the process!

In SQL, how and when would you do a group by with multiple columns? Also provide an example.
In SQL, the GROUP BY clause is used along with aggregate functions like SUM, AVG, MAX, etc.
Grouping by multiple columns is useful in many different situations, and it is best illustrated
by an example. Suppose we have a table called Purchases, shown below. The Purchases table
keeps track of all purchases made at a fictitious store.

Purchases

purchase_date            item            items_purchased
2011-03-25 00:00:00.000  Wireless Mouse  2
2011-03-25 00:00:00.000  Wireless Mouse  5
2011-03-25 00:00:00.000  MacBook Pro     1
2011-04-01 00:00:00.000  Paper Clips     20
2011-04-01 00:00:00.000  Stapler         3
2011-04-01 00:00:00.000  Paper Clips     15
2011-05-15 00:00:00.000  DVD player      3
2011-05-15 00:00:00.000  DVD player      8
2011-05-15 00:00:00.000  Stapler         5
2011-05-16 00:00:00.000  MacBook Pro     2

Now, let's suppose that the owner of the store wants to find out how many of each product was
sold on each date. Then we would write this SQL in order to find that out:

select purchase_date, item, sum(items_purchased) as "Total Items"
from Purchases
group by item, purchase_date;

Running the SQL above would return this:

purchase_date            item            Total Items
2011-03-25 00:00:00.000  Wireless Mouse  7
2011-03-25 00:00:00.000  MacBook Pro     1
2011-04-01 00:00:00.000  Paper Clips     35
2011-04-01 00:00:00.000  Stapler         3
2011-05-15 00:00:00.000  DVD player      11
2011-05-15 00:00:00.000  Stapler         5
2011-05-16 00:00:00.000  MacBook Pro     2

Note that in the SQL we wrote, the GROUP BY clause uses multiple columns: group by item,
purchase_date. Each distinct combination of item and purchase_date forms its own group, so the
rows are divided by the date the items were purchased and, within each date, by item. The sum
is then computed separately for each group, which tells us how many of each item were
purchased on each date. This is why GROUP BY with multiple columns is so useful!
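
For contrast, here is a sketch of the same query grouped by a single column. It uses the Purchases table above; only the GROUP BY list changes, so the totals are no longer split by date.

```sql
-- Group by item only: one row per item, summed across all dates.
select item, sum(items_purchased) as "Total Items"
from Purchases
group by item;

-- item            Total Items
-- Wireless Mouse  7
-- MacBook Pro     3
-- Paper Clips     35
-- Stapler         8
-- DVD player      11
-- (row order is not guaranteed without an ORDER BY)
```

MacBook Pro and Stapler each collapse from two rows into one here, because their purchases on different dates now fall into the same group.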

What is the difference between system and object privileges?


First off, let's define what a privilege is in a database. In a database, every user account can be
granted a number of privileges, which are also known as permissions. These privileges allow a
particular user account to do certain things, like DELETE and UPDATE certain tables, CREATE a
database, SELECT from a certain table, and many other things.

System and object privileges in Oracle, SQL Server, and Sybase


In Oracle, Microsoft's SQL Server, and Sybase Adaptive Server, privileges are further divided into
two different categories: 1. system privileges and 2. object privileges. What's the difference
between system and object privileges? Well, let's go through an explanation of each one, and then
we'll discuss the differences between the two.

System privileges
System privileges are privileges given to users to allow them to perform certain functions that deal
with managing the database and the server. Most of the different types of permissions supported
by the database vendors fall under the system privilege category. Let's go through some examples
of system privileges in Oracle and SQL Server.

Examples of Oracle system privileges


CREATE USER. The CREATE USER permission, when granted to a database user, allows
that database user to create new users in the database.
CREATE TABLE. The CREATE TABLE permission, when granted to a database user, allows
that database user to create tables in their own schema. This type of privilege is also
available for other object types like stored procedures and indexes.
CREATE SESSION. The CREATE SESSION permission, when granted to a database user,
allows that database user to connect to the database.
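
As a rough illustration, the Oracle system privileges listed above could be granted like this. The user names some_user and some_admin are made up for the example.

```sql
-- Oracle sketch; some_user and some_admin are hypothetical accounts.
GRANT CREATE SESSION TO some_user;   -- lets some_user connect
GRANT CREATE TABLE TO some_user;     -- lets some_user create tables in their own schema
GRANT CREATE USER TO some_admin;     -- lets some_admin create new users
```

Note that none of these statements has an ON clause, since system privileges are not tied to a particular object.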

Examples of Microsoft SQL Server System Privileges


BACKUP DATABASE. The BACKUP DATABASE permission, when granted to a database
user, allows that database user to create backups of the databases on the server.
CREATE DATABASE. The CREATE DATABASE permission, when granted to a database user,
allows that database user to create new databases on the server.
SHUTDOWN. The SHUTDOWN permission, when granted to a database user, allows that
database user to issue a command to shut down the server.
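
As a rough sketch, these permissions could be granted in T-SQL along the following lines. The names SomeDatabase, some_user, and some_login are made up, and the exact scoping rules (database-level versus server-level permissions) should be checked against the SQL Server documentation for your version.

```sql
-- T-SQL sketch; all names are hypothetical.
USE SomeDatabase;
GRANT BACKUP DATABASE TO some_user;   -- database-level permission

USE master;
GRANT CREATE DATABASE TO some_user;   -- assumes some_user also exists in master
GRANT SHUTDOWN TO some_login;         -- server-level permission, granted to a login
```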

Object privileges
Object privileges are privileges given to users so that they can perform certain actions upon
certain database objects, where database objects are things like tables, stored procedures,
indexes, etc. Some examples of object privileges include granting a particular database user the
right to DELETE and/or SELECT from a particular table. This is done using the GRANT statement,
which is covered below.

System versus Object privileges


So, now hopefully it's clear that the difference between system and object privileges is that system
privileges are used for server and database privileges, while object privileges are used to grant
privileges on database objects like tables, stored procedures, indexes, etc.

What is the SQL Grant statement, and how is it used to assign users privileges/permissions?
In SQL, the GRANT statement is used to give (or grant) one or more privileges/permissions to a
particular database user. Here's what the syntax for the GRANT statement looks like:

Syntax for the GRANT statement


GRANT privilege [, privilege ... ] /* can add more privileges here too */
[ON database object ] /*object can be tables, views, etc.. */
TO grantee [, grantee ...] /*user, PUBLIC, or role*/
[WITH GRANT OPTION | WITH ADMIN OPTION];

Let's go over some things that you should know if you plan on using the GRANT statement.

SQL GRANT privileges


When granting privileges, the list of privileges must be either all system privileges or all object
privileges. In other words, you cannot grant both system and object privileges in the same SQL
GRANT statement; you will need to use two different SQL GRANT statements in order to do this.
The differences between system and object privileges are covered in the section above.

SQL GRANT ON
The ON clause is only used to grant object privileges, not system privileges. This clause specifies
the object (a table, a view, etc.) on which the privileges are being granted.
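
A minimal sketch of the two rules above, using Oracle-style privilege names. The names some_table and some_user are placeholders.

```sql
-- System privilege: no ON clause.
GRANT CREATE TABLE TO some_user;

-- Object privileges: ON names the object they apply to.
GRANT SELECT, UPDATE ON some_table TO some_user;
```

Mixing the two kinds in one statement, for example listing CREATE TABLE and SELECT together, is what the rule above forbids, which is why the example uses two statements.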

The SQL GRANTEE list


It should be clear from the syntax of the GRANT statement shown above that you can specify
more than one database user or role that should receive the privilege(s) being granted.

SQL GRANT PUBLIC


In the majority of SQL implementations, you can actually grant privileges to PUBLIC. What does
PUBLIC mean in the SQL GRANT statement? Well, the PUBLIC keyword is used to mean all users of
a database, so all users would receive whatever privileges are being granted by that SQL
statement. This is typically not good practice because of the obvious security risks, but may be OK
depending on the situation.
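
Here is a short sketch of a grantee list and of PUBLIC. The names some_table, user_a, and user_b are placeholders, and PUBLIC is not available in every DBMS, so check your vendor's documentation.

```sql
-- Several grantees in one statement.
GRANT SELECT ON some_table TO user_a, user_b;

-- Every user of the database (where PUBLIC is supported).
GRANT SELECT ON some_table TO PUBLIC;
```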

The WITH ADMIN OPTION or WITH GRANT OPTION


The WITH ADMIN OPTION and WITH GRANT OPTION clauses let the grantee (the person receiving
the privileges) further grant those privileges to others. So, these options essentially grant the
grantee the ability to grant the same privileges to others (lovely sentence, right?). The exact
syntax of these clauses varies by DBMS. In Oracle, for example, WITH GRANT OPTION applies to
object privileges, while WITH ADMIN OPTION applies to system privileges and roles.
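
In Oracle, for instance, the two clauses could look like the sketch below. The names some_table and some_user are placeholders, and other databases may accept only one of the two clauses or use different wording.

```sql
-- Oracle sketch.
-- Object privilege: some_user may pass SELECT on some_table along to others.
GRANT SELECT ON some_table TO some_user WITH GRANT OPTION;

-- System privilege: some_user may pass CREATE SESSION along to others.
GRANT CREATE SESSION TO some_user WITH ADMIN OPTION;
```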

SQL GRANT Example


Here's an example of what an actual SQL GRANT statement would look like:

GRANT SELECT ON SOME_TABLE TO SOME_USER;

In the example above, the GRANT statement is used to give the privilege of being able to select
from SOME_TABLE to the SOME_USER user.

What is the SQL REVOKE statement? How does it work?


In SQL, the REVOKE statement is used to take back/remove (or revoke) one or more
privileges/permissions from a particular database user. Here's what the syntax for the REVOKE
statement looks like:

Syntax for the REVOKE statement

REVOKE privilege [, privilege ... ] /* can add more privileges here too */
[ON database object ] /*object can be tables, views, etc.. */
FROM grantee [, grantee ...] ; /*user, PUBLIC, or role*/

Let's go over some things that you should know if you plan on using the REVOKE statement.

SQL REVOKE privileges


When revoking privileges, the list of privileges must be either all system privileges or all object
privileges. In other words, you cannot revoke both system and object privileges in the same SQL
REVOKE statement; you will need to use two different SQL REVOKE statements in order to do
this. The differences between system and object privileges are covered in the section on system
and object privileges above.

SQL REVOKE ON
The ON clause is only used to revoke object privileges, not system privileges. This clause
specifies the object (a table, a view, etc.) on which the privileges are being revoked.

The SQL REVOKE GRANTEE list


It should be clear from the syntax of the REVOKE statement shown above that you can specify
more than one database user or role that should lose the privilege(s) being revoked. In this
context, the grantees are the users from whom the privileges are being revoked.

SQL REVOKE Example


Here's an example of what an actual SQL REVOKE statement would look like:

REVOKE SELECT ON SOME_TABLE FROM SOME_USER;

In the example above, the REVOKE statement is used to revoke the privilege of being able to
select from SOME_TABLE from the SOME_USER user.
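
To show how REVOKE mirrors GRANT, here is a small sketch that grants two object privileges to two users and then takes one of them back. The names some_table, user_a, and user_b are placeholders, and CREATE TABLE is used as an Oracle-style system privilege.

```sql
-- Grant two object privileges to two grantees.
GRANT SELECT, UPDATE ON some_table TO user_a, user_b;

-- Take back only UPDATE; both users keep SELECT.
REVOKE UPDATE ON some_table FROM user_a, user_b;

-- System privileges are revoked in a separate statement, without ON.
REVOKE CREATE TABLE FROM user_a;
```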

What does the CREATE USER Statement do in SQL? What is the syntax and other details?
It's pretty obvious what the CREATE USER statement does: it allows you to create a user in the
database. Most of the popular databases out there also provide some sort of graphical interface
that allows you to create users without actually typing in any SQL, like phpMyAdmin, which is a
PHP interface to the MySQL database. In any case, although CREATE USER is a vendor extension
rather than part of the ANSI/ISO SQL standard, most major databases support it in some form.

SQL CREATE USER Syntax


Here is what the syntax of the CREATE USER statement looks like:

CREATE USER username
[IDENTIFIED BY password]
[other options];

CREATE USER Identified By


The Identified By clause lets you say how the database should authenticate the user. The exact
syntax of the Identified by clause varies from one database to another.
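
As an illustration of how the syntax varies, here is what creating a password-authenticated user could look like in two databases. The user name and password are placeholders.

```sql
-- Oracle: create the user, then allow it to connect.
CREATE USER some_user IDENTIFIED BY some_password;
GRANT CREATE SESSION TO some_user;

-- MySQL: the account name includes the host the user connects from.
CREATE USER 'some_user'@'localhost' IDENTIFIED BY 'some_password';
```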

What is a role in a database?


A database role is a named collection of any number of permissions/privileges that can be
assigned to one or more users.

The majority of today's RDBMSs come with predefined roles that can be assigned to any user. But,
a database user can also create his/her own role if he or she has the CREATE ROLE privilege.

Advantages of Database Roles


Why are database roles needed? Well, let's go over some of the advantages of using database
roles and why they would be necessary:

Roles continue to live in database even after users are deleted/dropped

Many times a DBA (Database Administrator) has to drop user accounts for various reasons. Say,
for example, an employee quits the company, so his/her user account is removed from the system.
Now suppose that the same user account needs to be recreated later on, because that same
employee re-joins the company and needs the same account. That employee's user account
probably had a lot of specific permissions assigned to it. So, when the account was deleted, all of
those permissions were deleted as well, which creates a hassle for the DBA, who has to reassign
all of those permissions one by one. But, if a role had been used, then all of those permissions
could have been bundled into one role, and re-instating that employee would simply mean that
the DBA reassigns the role to the employee. And, of course, that role could also be used for other
users as well. So, this is a big advantage of using a database role.

Roles save DBAs time


Another advantage is the fact that a DBA can grant a lot of privileges with one simple command by
assigning a user to a role.

Database roles are present before user accounts are created


And finally, an advantage of database roles is that they can be used to assign a group of
permissions that can be re-used for new users who belong to a specific group of people who need
those permissions. For example, you may want to have a group of permissions in a role reserved
just for some advanced users who know what they are doing and assign that role to a user only
when a new advanced user needs that role. Or, you can have a group of privileges for users who
are all working on the same project and need the same type of access.

Disadvantages of Database Roles


The main disadvantage of using a database role is that a role may be granted to a user, but that
role may have more privileges than that user actually needs. This could cause a potential security
issue if that user abuses the extra privileges and ruins some part of the database.

An example of this is that in older versions of Oracle (before release 10.2), there was a role called
CONNECT, which included privileges like CREATE TABLE, CREATE VIEW, CREATE SESSION, ALTER
SESSION, and several other privileges. But, having all of these privileges is probably too much for
a normal business user. That is probably why in newer versions of Oracle (since version 10.2), the
CONNECT role has been changed so that it only has the CREATE SESSION privilege.

How to create a database role


Most RDBMSs use the CREATE ROLE syntax to define a role. And then, the GRANT statement is
used to give permissions to that database role. But, the exact details vary from one RDBMS to
another, so it's best to consult the documentation.

Example of a database role

Here is an example of what creating a database role could look like:

CREATE ROLE advancedUsers;

GRANT UPDATE ON SOMETABLE TO advancedUsers;
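
Once the role exists, assigning it to a user is a single statement. The sketch below uses Oracle-style syntax with a made-up user called some_user. Other databases differ; SQL Server, for example, adds members to a role with ALTER ROLE ... ADD MEMBER instead.

```sql
-- Oracle-style sketch; some_user is a hypothetical account.
GRANT advancedUsers TO some_user;      -- some_user now has every privilege in the role

-- If some_user is dropped and later re-created, one statement restores the access:
GRANT advancedUsers TO some_user;

-- Taking the role away is just as short:
REVOKE advancedUsers FROM some_user;
```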

How do you tune SQL queries to improve performance?


Tuning your SQL queries can have a significantly positive impact on performance. And
understanding how your specific RDBMS works can help tremendously as well. But here we will go
over some tips on how to tune SQL queries in general, non-RDBMS specific terms.

What is a query execution plan?


Understanding query execution plans is one of the first steps to properly tuning SQL queries. So,
what is a query execution plan (also known as an explain plan)? Well, a query execution plan lists
all the details of how a particular RDBMS plans on processing a particular query. Inside this plan
are details on which indexes will be used, how joins will be performed (and their associated logic),
and also an estimate of the resource cost. Understanding the explain plan utility for your particular
RDBMS is critical if you want to successfully tune SQL queries.
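
How you ask for a plan depends on the database. The sketch below shows two common forms against the Employee table used earlier. The Salary filter is just an example query, and the output format differs between vendors, so no sample output is shown.

```sql
-- MySQL and PostgreSQL: prefix the query with EXPLAIN.
EXPLAIN select * from Employee where Salary > 50000;

-- Oracle: store the plan first, then display it.
EXPLAIN PLAN FOR select * from Employee where Salary > 50000;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```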

Here are some non-RDBMS specific things to keep in mind when tuning your SQL queries to
improve performance:

Reduce the rows that are returned by a query


This is fairly obvious: if a query returns fewer rows, then there is less data to read, process, and
send back, so the query will generally be more efficient.

Get rid of unnecessary columns and tables


Even though this isn't a change made to your actual query, the less unused data there is in your
database, the less work your queries have to do. This is another pretty obvious one.

GROUP BY may be better to use than DISTINCT. In some DBMSs, GROUP BY is a more efficient
way of retrieving unique rows than DISTINCT. This is because, in those DBMSs, GROUP BY
performs the sort that finds duplicates earlier in the processing of a query than DISTINCT does.
The DISTINCT clause performs the sort at the very last step, against the final result set. Why does
it matter when the sort is performed? Well, if duplicate rows are eliminated earlier on in the
processing of a query, then the rest of the processing will be more efficient, because there will
presumably be fewer rows left to process. For your particular RDBMS, you should look at the
explain plans for running a query with GROUP BY or DISTINCT to see how they compare.
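
As a concrete pair to compare, both of the queries below return each item in the Purchases table exactly once. Whether one is faster than the other depends entirely on the RDBMS, which is why the explain plans are worth checking.

```sql
-- Unique items using DISTINCT.
select distinct item from Purchases;

-- The same unique items using GROUP BY.
select item from Purchases group by item;
```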
Hints might help you tune your SQL queries. What is a hint? A hint is special syntax that you can
put inside your SQL. What does a hint do? Well, it tells the query optimizer to perform a certain
action, such as using a certain method to join tables or using a certain index.
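
Hint syntax is vendor specific. As a sketch, here is how an index hint could look in Oracle and in MySQL, assuming a hypothetical index named emp_salary_idx on the Salary column of the Employee table.

```sql
-- Oracle: hints go in a specially formatted comment right after SELECT.
select /*+ INDEX(e emp_salary_idx) */ e.*
from Employee e
where e.Salary > 50000;

-- MySQL: the index hint follows the table name.
select *
from Employee USE INDEX (emp_salary_idx)
where Salary > 50000;
```

In both databases a hint is a request that the optimizer may be unable to honor, for example if the named index does not exist.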

Understand your optimizer to help you tune SQL queries


Knowing how the query optimizer for your particular RDBMS works can be a big help. This is
because every optimizer does things differently. Let's go through some things that you should
keep in mind when dealing with query optimizers:

Not enough database statistics


Suppose there are not enough statistics about the database. Since cost-based
optimizers rely on those statistics to perform their analysis, some RDBMSs may fall back
to a rule-based optimizer in that case. And other databases may decide not to
use an index at all and just do a full table scan instead.

Is the order of predicates taken into account?


You should know whether or not your optimizer takes the order of the predicates in a
WHERE clause into account, and whether that order has any effect on the order in which
the predicates are actually evaluated. What does that mean in plain English? Well, a
predicate is the comparison portion of the WHERE clause. So, for example, if we have
some SQL that says WHERE website_name = 'ProgrammerInterview.com', then in that
particular SQL query there is just one predicate, comparing the website_name column to
the text 'ProgrammerInterview.com'. But, if we have some SQL that says WHERE
website_name = 'ProgrammerInterview.com' AND website_subject = 'technical', then we
have two different predicates: one that checks for the website name and another that
checks for the website subject.

Now that we've cleared up what we mean by predicates, let's get back to the original topic.
So, we said that you should know if your optimizer takes the order of the predicates in a
WHERE clause into account, and if that order affects the order in which the predicates are
evaluated. But, why should the order of the predicates matter? Well, if your optimizer does
take the order into account, then you would want the predicate that eliminates the
higher number of rows to be evaluated first by the optimizer. So, for example, let's
say that we have a table called Websites which has columns called website_active and
website_subject. The website_active column is just a yes or a no entry, and let's
assume that most of the rows (something like 90%) in the table have a yes value for
website_active. But, let's also say that there are three possible subjects: technical,
self help, and cooking. And, the subjects are evenly distributed amongst the rows, so
1/3rd of the rows are technical, 1/3rd are self help, and 1/3rd are cooking.

Now, let's say we want to run a query with a WHERE clause like this: WHERE
website_subject = 'cooking' AND website_active = 'NO'. Which predicate of the WHERE
clause should be executed first: website_active = 'NO' or website_subject =
'cooking'? Well, think about that on your own for a second. Wouldn't it make more sense
to run the predicate which eliminates more rows first? That way, the second predicate has
fewer rows to process. With that in mind, let's ask ourselves which predicate will eliminate
more rows. Wouldn't it be the check for website_active = 'NO'? That check will eliminate
90% of the rows in the table, while the check for website_subject = 'cooking' would only
eliminate about 67% of the rows. Clearly 90% is better, which means that the predicate
that checks website_active = 'NO' should be run first.
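
The reasoning above can be sketched in SQL. The Websites table and its values come from the example; the percentages in the comments are the assumed distribution from the text, not measured results, and whether the written order matters at all depends on your optimizer (some evaluate predicates in a different order than they are written, or ignore the order entirely).

```sql
-- How many rows survive each predicate on its own?
SELECT COUNT(*) FROM Websites WHERE website_active = 'NO';        -- about 10% of the rows
SELECT COUNT(*) FROM Websites WHERE website_subject = 'cooking';  -- about 33% of the rows

-- For an optimizer that evaluates predicates in the order written,
-- the more selective predicate goes first:
SELECT *
FROM Websites
WHERE website_active = 'NO'
  AND website_subject = 'cooking';
```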

Is the order of table names taken into account?


Just as the order of the predicates being used in the WHERE clause can have a big effect
on the efficiency of the query, so can the order of the table names in the JOIN or FROM
clause. This is especially true with rule-based optimizers. Of course, the best thing for the
RDBMS to do is to choose the most selective table first. This way the largest number of rows
will be removed from the result set early, which means that fewer rows will have to be
processed and the query can run more efficiently. You should check to see what your particular
optimizer does to see if you need to tune your SQL queries accordingly.

Are queries being rewritten?


You should also check to see if your optimizer rewrites queries into more efficient formats.
One example of this is when optimizers will rewrite subqueries into their equivalent joins,
and that will make the processing that must follow much simpler. For some DBMSs, there
are certain options that have to be enabled so that the optimizer can actually rewrite
queries.
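
As a sketch of the kind of rewrite meant here, the two queries below return the same employees. The EmployeeSalary table and the column names are assumed for illustration, and the two forms are only equivalent if EmployeeSalary holds at most one row per EmployeeID; otherwise the join can return duplicate employees.

```sql
-- Subquery form.
select e.*
from Employee e
where e.EmployeeID in (
  select s.EmployeeID
  from EmployeeSalary s
  where s.Salary > 50000
);

-- Equivalent join form that an optimizer might rewrite it into.
select e.*
from Employee e
join EmployeeSalary s on s.EmployeeID = e.EmployeeID
where s.Salary > 50000;
```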

Tune SQL queries by ensuring that your indexes perform well


One very important thing that can help your SQL queries run better is making sure that your
database indexes also perform well.

Summary of tips on how to tune your SQL queries


In general, you should know the options available to you that may help you in tuning your SQL
queries. Of course, not everything we presented above will help make your SQL perform better,
because everyone's particular situation is different. But, knowing your options is critical. As the
saying goes, "To the man with a hammer, every problem looks like a nail." Make sure that you
have more than just a hammer in your toolbox!

How do transactions work in Microsoft's SQL Server?


Microsoft's SQL Server provides support for transactions in three different modes of operation.
Those three modes are known as explicit, implicit, and autocommit. If you are connected directly
to the database using some sort of client-side tool, then you can use any of those three modes as
you desire. But, if you are connecting to SQL Server through either a JDBC or an ODBC driver,
then you should read the documentation to see what kind of transaction support is provided.

What is explicit mode in SQL Server?


When explicit mode is used, every transaction starts with a BEGIN TRANSACTION statement and
ends with either a ROLLBACK TRANSACTION statement (for when a transaction does not
successfully complete) or a COMMIT TRANSACTION statement (for when a transaction completes
successfully). Explicit mode is most commonly used in triggers, stored procedures and application
programs.
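
Here is a minimal T-SQL sketch of an explicit transaction that moves an amount between two rows. The EmployeeSalary table, its columns, and the values are assumed for illustration. The TRY...CATCH wrapper is one common way to make sure a failure leads to a rollback, not the only one.

```sql
-- T-SQL sketch; table, columns, and values are hypothetical.
BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE EmployeeSalary SET Salary = Salary - 1000 WHERE EmployeeID = 1;
    UPDATE EmployeeSalary SET Salary = Salary + 1000 WHERE EmployeeID = 2;

    COMMIT TRANSACTION;         -- both updates become permanent together
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;   -- undo both updates if anything failed
    THROW;                      -- re-raise the original error (SQL Server 2012 and later)
END CATCH;
```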

What is implicit mode in SQL Server?


Implicit mode for transactions in SQL Server is turned on or off with the command SET
IMPLICIT_TRANSACTIONS ON or SET IMPLICIT_TRANSACTIONS OFF. If implicit mode is set to ON,
then a new transaction is implicitly started whenever one of a specific list of SQL statements is
executed and no transaction is already open. That list includes INSERT, SELECT, DELETE, and
UPDATE, along with other SQL statements. It is called implicit mode because once
IMPLICIT_TRANSACTIONS is set to ON, transactions are created implicitly for certain SQL
statements, without you having to say that you want to create a transaction each time.

If a transaction is implicitly started, then the transaction will continue until it's either
committed or rolled back. One scenario in which the transaction could
potentially be rolled back is if the user disconnects before having submitted a statement that
would end the transaction.
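
A short T-SQL sketch of implicit mode in action is shown below. The EmployeeSalary table and its columns are assumed names.

```sql
-- T-SQL sketch; table and column names are hypothetical.
SET IMPLICIT_TRANSACTIONS ON;

-- No BEGIN TRANSACTION needed: this UPDATE opens a transaction implicitly.
UPDATE EmployeeSalary SET Salary = Salary + 500 WHERE EmployeeID = 1;

SELECT @@TRANCOUNT AS open_transactions;   -- nonzero while the transaction is open

COMMIT TRANSACTION;                        -- or ROLLBACK TRANSACTION to discard the update

SET IMPLICIT_TRANSACTIONS OFF;             -- back to the default autocommit behavior
```

The point to remember is that the transaction still has to be ended explicitly; only the start is implicit.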

What is autocommit mode in SQL Server?

In autocommit mode, each SQL statement is treated as a separate transaction. This is
accomplished by automatically committing each SQL statement as it finishes, and is
why it's called autocommit mode. Autocommit is used by default in every connection to
SQL Server unless the implicit mode is set or an explicit transaction is started.
