
Derby, 06/29/2005

Direct I/O Explained

Abstract
The purpose of this article is to explain the nature of direct I/O from the perspective of Oracle
RDBMS on Unix. This article is, therefore, relevant only for those using Unix, Linux, AIX or
any other related operating system. There are many myths and legends surrounding direct I/O,
and the most common among them is that we can expect a fabulous performance boost merely
from making Oracle RDBMS do direct I/O instead of normal disk I/O. The legend is,
as is the case with most legends, just a legend. So, what is a “normal” disk I/O on Unix, as
opposed to “direct” disk I/O?
To understand the difference, we have to see how Unix does normal disk I/O. Unix is an
operating system remarkably similar to Oracle RDBMS. To speed things up, Unix uses its own
version of the SGA, called the buffer cache. The picture of what is happening in a typical Unix
process is shown below:

On a Unix system, the user does not transfer data from the disk device directly into the user
buffer when using a system service like read(). Unix interposes the buffer cache between the
user buffer and the disk device. Much as in Oracle RDBMS, when reading a file the user
essentially requests that a block be brought into the operating system buffer cache. This enables
Unix to share blocks efficiently and to use caching. The Unix buffer cache is not directly
accessible to user code; it can only be reached through system services. The nature of the
separating line between user address space and kernel address space depends on the underlying
Unix OS. The distinction between monolithic kernels and microkernel-based Unix varieties is
beyond the scope of this article and can be found in any introductory college text, like Andrew
Tanenbaum's “Modern Operating Systems”.
Also, when doing normal Unix I/O, the user process waits for the operating system to complete
the system call. In particular, the process stops running, relinquishes the CPU it had been using
up to the point where the call to the read or write system service was encountered, and waits on
one of the pre-defined kernel queues.
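As a concrete illustration, here is a minimal sketch of that normal, buffered read path (the file
name "xxx" is just a placeholder). The read() call copies data from the kernel buffer cache into
the user buffer, and the process sleeps until the kernel has completed the request:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    char buff[8192];                /* ordinary, unaligned user buffer   */
    int fd = open("xxx", O_RDONLY); /* no O_DIRECT: buffer cache is used */
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    /* The process blocks here; the kernel brings the blocks into its
       buffer cache (unless they are already cached) and copies them
       into buff. */
    ssize_t n = read(fd, buff, sizeof(buff));
    printf("read %zd bytes through the buffer cache\n", n);
    close(fd);
    return 0;
}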

Problems
There are several problems with the approach described above. The first problem originates
from the fact that both reads and writes go through the system buffers. In particular, the user
never knows whether an I/O ended up on disk or in the system buffer. The latter possibility is
bad if we want to be certain that Unix really wrote to disk what we ordered to be written, as is
the case with COMMIT processing. The database log writer must make sure that the commit
record is really written to disk. The log writer makes sure of that by using system services like
fsync, illustrated below:
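A minimal sketch, with a hypothetical log file name "redo.log": the write() may complete into
the OS buffer cache only, while fsync() does not return until the dirty buffers have actually
been pushed to disk, which is exactly the guarantee a log writer needs at commit time.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const char rec[] = "commit record";
    int fd = open("redo.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0) { perror("open"); exit(1); }

    /* write() may leave the data in the OS buffer cache only ... */
    if (write(fd, rec, sizeof(rec) - 1) < 0) { perror("write"); exit(1); }

    /* ... so force it down to the disk before declaring the commit done. */
    if (fsync(fd) < 0) { perror("fsync"); exit(1); }

    close(fd);
    return 0;
}

Opening the file with the O_SYNC flag gives the same guarantee on every individual write;
that flag is exactly what we will see later in the strace output of the database writer.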

The log writer will make sure that log records and commit records are written to disk, regardless
of the chosen I/O variety. That is not a problem. The database buffer writer (DBWR), however,
will not do so. If the machine crashes before your OS buffers have been written to disk, your
instance will need recovery.
Another problem with I/O through the Unix buffer cache is performance. Unix, similarly to
Oracle, employs several strategies to speed up I/O and maximize throughput. Unfortunately,
these strategies are not SQL based and are not aware of things like transactions, read
consistency and the like. Those strategies are based on the file paradigm and include file
prefetch and maintaining an LRU list, so that frequently used (“hot”) blocks do not get removed
from the buffer cache. File prefetch is a mechanism which brings into the buffer cache several
consecutive file buffers beyond the ones requested, in the hope that the next item the user
requests will lie in one of those buffers brought in ahead of time. Of course, those buffers are
pre-fetched from any file encountered, and Oracle database files are no different. The problem
with the above is that Oracle caching strategies are vastly different from the Unix ones and that
using both wastes valuable memory without real gain, as Oracle maintains its own cache, the
SGA, and its caching strategies are much more efficient than the Unix ones for database
purposes.
There is, however, one notable exception to the above rule, when the Unix strategies are
actually very helpful and desired. That is the case of a data warehouse database that is mostly
read and almost never updated. Furthermore, the percentage of full table scans is large, as the
database is used for producing large reports, so pre-fetching will actually speed things up
significantly. I've read reports of speedups of up to 20% from using normal, “cooked”
database files instead of direct I/O or raw devices.
Most modern Unix systems have dynamic buffer caches which will expand into any free
memory available on the system. If you observe memory usage on a busy Linux system where
direct I/O is not used, you will notice that the amount of free memory falls rapidly,
although the system is not overburdened. That means that for any new process, like the one
resulting from a dedicated server connection, the operating system will need to free up some
memory, usually by swapping something out. In other words, your system is losing time
handling its own buffer cache instead of doing what you bought it for. The effects of memory
shortages can absolutely cripple a busy Unix system running a file system based Oracle
RDBMS. It is not infrequent to see CPU time spent in kernel mode rise to 20% for minutes at a
time. When a Unix system is in kernel mode, it is maintaining itself instead of doing what the
owner bought it for.
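On Linux, for instance, the effect is easy to watch with the standard tools: the "cached" column
of free(1) grows as the buffer cache expands, and vmstat(8) shows whether the system has
started swapping.

$ free -m        # the "cached" column is the dynamic buffer cache
$ vmstat 5 5     # the "si"/"so" columns show swap-in/swap-out activity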

Unix systems have two tricks up their sleeve to fix those problems: direct I/O and
asynchronous I/O. Asynchronous I/O is easy to explain: the I/O is entrusted to specialized
kernel threads and the issuing process does not wait for the I/O to finish. The issuing process is
notified by a signal, usually SIGUSR1, that the I/O request is complete. If the system doesn't
support asynchronous I/O, or there are problems with the OS asynchronous I/O
implementation, Oracle has tools to simulate asynchronous I/O and alleviate, if not solve, the
problem. Oracle can launch multiple database writers, DBWR slaves and, in version 10,
multiple archivers. The trick is simple and adequate: instead of using system provided threads,
Oracle forks its own processes which perform the I/O requests.
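For the curious, here is a minimal sketch of asynchronous I/O through the POSIX AIO
interface, one of several asynchronous interfaces an OS may offer. The file name "yyy" is a
placeholder, and on Linux the program is linked with -lrt:

#include <aio.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t io_done = 0;

/* Invoked when the kernel signals completion of the request. */
static void on_io_done(int sig) { io_done = 1; }

int main(void) {
    static char buff[8192] = "some data";
    struct aiocb cb;
    int fd = open("yyy", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); exit(1); }

    signal(SIGUSR1, on_io_done);

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buff;
    cb.aio_nbytes = sizeof(buff);
    cb.aio_offset = 0;
    /* Ask to be notified by SIGUSR1 when the I/O completes. */
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo  = SIGUSR1;

    if (aio_write(&cb) < 0) { perror("aio_write"); exit(1); }

    /* The process could do useful work here instead of sleeping in the
       kernel; this sketch simply idles until the completion signal
       (a production program would use sigsuspend() to avoid the race). */
    while (!io_done)
        pause();

    printf("asynchronous write returned %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}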
Direct I/O is a much more important and complex beast which cannot be easily simulated. The
impact of direct I/O is also potentially much greater than the impact of asynchronous I/O.

Direct I/O
Direct I/O is frequently described by using the following phrase: “direct I/O means to use a
file just like a raw device”. The meaning of that phrase is the following: the Unix buffer cache is
used to cache file system blocks only. If we put database files on disk devices which don't have
a file system, the data is transferred directly from disk to the user buffer and back, thus
bypassing the complex Unix buffer cache. In other words, to utilize direct I/O means to transfer
data between the SGA and disk without having to store the data in the Unix buffer cache.
There are two conditions which impact our ability to use direct I/O:
• The file system must support direct I/O (a small probe for this condition is sketched after
the list).
• The buffer must be properly aligned in memory, usually to the page boundary or the file
system block boundary.
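Whether a given file system meets the first condition can be probed very simply: on Linux, an
open() with the O_DIRECT flag fails with EINVAL on file systems that do not support direct
I/O (the path below is, of course, a placeholder):

#define _GNU_SOURCE          /* exposes O_DIRECT on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/oradata/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        printf("this file system does not support direct I/O\n");
    else if (fd < 0)
        perror("open");
    else {
        printf("direct I/O open succeeded\n");
        close(fd);
    }
    return 0;
}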

The truth of the matter is that most modern file systems support direct I/O. This is certainly
true of Veritas VxFS, HP HPFS, IBM JFS and JFS2, Sun UFS and Linux Ext3. If Linux is
your operating system of choice, IBM JFS supports direct I/O on Linux as well. Net Appliance
with NFSv3 and NFSv4 also supports direct I/O for NFS based files, which is actually quite
remarkable. I used to work for a company that used Oracle 9.2.0.5 on Red Hat Linux with Net
Appliance, and the performance was very good, no different from local SCSI drives. To
tell the truth, things sometimes do not work out of the box; patch 2448994 needs to be
installed on top of Oracle 9.2.0.6 for Linux in order for direct I/O over NFS to work.

Given that the buffer to which we are transferring the data is the SGA, buffer alignment is
Oracle's problem. How is direct I/O actually done? It is done by opening database files with the
O_DIRECT flag, as in the following little program, which copies file “xxx” to file “yyy” using
direct I/O:

#define _GNU_SOURCE            /* exposes O_DIRECT and O_LARGEFILE on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BUFFSIZE 65536
#define ALIGN    4096          /* page size and JFS block size on Linux */

int main(void) {
    void *buff;
    int stat1 = 0, stat2 = 0;
    int fd1, fd2;

    /* Direct I/O requires a properly aligned user buffer. */
    if ((stat1 = posix_memalign(&buff, ALIGN, BUFFSIZE)) != 0) {
        fprintf(stderr, "ALIGN ERR:%s\n", strerror(stat1));
        exit(1);
    }

    fd1 = open("xxx", O_RDONLY | O_DIRECT | O_LARGEFILE);
    if (fd1 < 0) {
        fprintf(stderr, "OPEN ERR:%s\n", strerror(errno));
        exit(1);
    }
    fd2 = open("yyy", O_CREAT | O_WRONLY | O_DIRECT | O_LARGEFILE, S_IRWXU);
    if (fd2 < 0) {
        fprintf(stderr, "OPEN ERR:%s\n", strerror(errno));
        exit(1);
    }

    /* Copy the file; with O_DIRECT each transfer bypasses the buffer
       cache, so transfer sizes must be multiples of the block size. */
    while ((stat2 = read(fd1, buff, BUFFSIZE)) != 0) {
        if (stat2 < 0) {
            fprintf(stderr, "READ ERR:%s\n", strerror(errno));
            exit(1);
        }
        stat1 = write(fd2, buff, stat2);
        if (stat1 < 0) {
            fprintf(stderr, "WRITE ERR:%s\n", strerror(errno));
            exit(1);
        }
    }
    close(fd1);
    close(fd2);
    return 0;
}

The buffer is aligned to a 4096 byte boundary by using the posix_memalign() function. On
Linux, this boundary is equal to both the OS virtual memory page size and the JFS block size.
On other systems, you may need to experiment with the alignment. So, let's execute the
program and see what happens.

$ dd if=/dev/zero of=xxx bs=4096 count=128
128+0 records in
128+0 records out
$ ./dio
$ ls -l xxx yyy
-rw-r--r--  1 mgogala users 524288 May 30 00:58 xxx
-rwx------  1 mgogala users 524288 May 30 00:58 yyy

Not only is the buffer aligned to a 4096 byte boundary, the file xxx that is being copied to yyy
also has a size which is a multiple of our alignment size. Let's see what happens if the file size is
wrong:

$ rm -f xxx; dd if=/dev/zero of=xxx bs=1023 count=1
1+0 records in
1+0 records out
$ ls -l xxx
-rw-r--r--  1 mgogala users 1023 May 30 01:03 xxx
$ ./dio
WRITE ERR:Invalid argument

We're getting an “Invalid argument” error, reminiscent of the OS error which we get if Oracle
has a problem with direct I/O. The reason for this error is that the file has only 1023 bytes
instead of the expected multiple of 4096.

How can we check whether Oracle is actually using direct I/O? There are various utilities for
intercepting the system calls that a process makes. The most famous tool of that type can be
found on Solaris and HP-UX 11i and is called “truss”, for “TRace Used System Services”. IBM
AIX has a tool called “trace” and Linux has a tool called “strace”. Thanks to the invaluable
Rosetta Stone, named after the famous stone containing the same text in Egyptian hieroglyphs
and in Greek, which made it possible to translate Egyptian hieroglyphs, we can now translate
commands from one Unix variety into another. This precious tool, also known as “A
Sysadmin's Unixersal Translator”, can be found at: http://bhami.com/rosetta.html

I am writing this article on Linux, using OpenOffice 1.1.4, so the tool of choice will be strace,
attached to DBWR. The procedure will be the following:
• The database will be started in mount mode.
• The strace tool will be attached to the database writer, with its output directed to the file
/tmp/dbwr.txt.
• The database will be opened.
• We'll investigate the calls to the “open” system service recorded in /tmp/dbwr.txt.

The database is 10.1.0.4 on Fedora Core 3 Linux.

$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.1.0.4.0 - Production on Mon May 30 01:16:41 2005

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to an idle instance.

SQL> startup mount
ORACLE instance started.

Total System Global Area  201326592 bytes
Fixed Size                   778452 bytes
Variable Size              78651180 bytes
Database Buffers          121634816 bytes
Redo Buffers                 262144 bytes
Database mounted.
SQL>
$ ps -ef|grep dbw|grep -v grep
oracle 6839 1 0 01:16 ? 00:00:00 ora_dbw0_10g
$ strace -o /tmp/dbwr.txt -p 6839
Process 6839 attached - interrupt to quit

We have now established that the database writer is the process with PID 6839, and we have
attached strace to it. Now we can open the database, and then stop tracing by pressing
Ctrl-C in the window from which tracing is done.

SQL> alter database open;

Database altered.

SQL>

The result looks like this:

$ grep O_DIRECT /tmp/dbwr.txt
open("/oradata/10g/oracle/system01.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 18
open("/oradata/10g/oracle/system01.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 18
fcntl64(18, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/oracle/undotbs01.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 19
open("/oradata/10g/oracle/undotbs01.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 19
fcntl64(19, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/oracle/sysaux01.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 20
open("/oradata/10g/oracle/sysaux01.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 20
fcntl64(20, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/oracle/users01.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 21
open("/oradata/10g/oracle/users01.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 21
fcntl64(21, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/ORACLE/datafile/o1_mf_lobs_0szv1bfw_.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 22
open("/oradata/10g/ORACLE/datafile/o1_mf_lobs_0szv1bfw_.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 22
fcntl64(22, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/ORACLE/datafile/o1_mf_indx_0szv4dbg_.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 23
open("/oradata/10g/ORACLE/datafile/o1_mf_indx_0szv4dbg_.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 23
fcntl64(23, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/ORACLE/datafile/o1_mf_statspac_0t032zh4_.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 24
open("/oradata/10g/ORACLE/datafile/o1_mf_statspac_0t032zh4_.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 24
fcntl64(24, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/ORACLE/datafile/o1_mf_oratext_153wgh4b_.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 25
open("/oradata/10g/ORACLE/datafile/o1_mf_oratext_153wgh4b_.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 25
fcntl64(25, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
open("/oradata/10g/oracle/temp01.dbf", O_RDONLY|O_DIRECT|O_LARGEFILE) = 26
open("/oradata/10g/oracle/temp01.dbf", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = 26
fcntl64(26, F_GETFL) = 0xd002 (flags O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE)
$

By looking into the output, we can see that Oracle has opened each file with the O_DIRECT
flag, which means that the I/O requests issued for those files will bypass the Unix buffer cache
and go directly into the SGA, as desired.

How do we control whether Oracle will open its files with or without the O_DIRECT flag, i.e.,
whether direct I/O will be utilized or not? That is regulated by the Oracle parameter named
FILESYSTEMIO_OPTIONS, available as of Oracle 9.2. Here is what the Oracle manual
page says about the parameter:

FILESYSTEMIO_OPTIONS

Property        Description
Parameter type  String
Syntax          FILESYSTEMIO_OPTIONS = { none | setall | directIO | asynch }
Default value   There is no default value.
Modifiable      ALTER SESSION, ALTER SYSTEM
Basic           No
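A typical way to turn it on, assuming an spfile is in use, would be the following (SETALL
enables both direct and asynchronous I/O; despite the “Modifiable” entry above, plan for an
instance restart so that all data files are reopened with the new flags):

SQL> alter system set filesystemio_options = setall scope=spfile;

System altered.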

Direct I/O used to be available only as a commercial option from Veritas (it was known as
“Quick I/O”) and is the subject of numerous myths and legends. The most ubiquitous legend
tells us about immense performance gains. I haven't experienced anything like that and am
still waiting for the Oracle “make applications run faster” parameter. Whether direct I/O is
right for you can be decided only by careful benchmarking. If you have an OLTP system with
many concurrent users, direct I/O is most likely good for you. If you have a data warehouse
which is visited twice a week to produce one extensive report, you probably do not want direct
I/O and are better off without it.
What are the indications for looking into direct I/O? When should you start thinking about
it in the first place? The clearest indication is heavy paging activity combined with a large
percentage of CPU time spent in kernel mode. That can be diagnosed by using utilities like
“sar” or “top”. Never make changes which are not needed. If your system is not spending a
large amount of CPU time in the kernel, it is unlikely that the introduction of direct I/O will
have any positive effect on your system. You might instead be diagnosed with CTD
(Compulsive Tuning Disorder) by Dr. Gaja Vaidyanatha, a famous Oracle spinlock doctor and
a member of the Oak Table who has exorcised many Oracle daemons out of various Unix and
non-Unix systems.
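A quick check takes only a minute; watch the kernel-mode CPU percentage over a few
intervals, for example:

$ sar -u 5 5     # the %system column is CPU time spent in kernel mode
$ top            # the "sy" figure in the CPU states line shows the same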

This article has, hopefully, explained and illustrated the use of direct I/O. As with any
advanced option, the best advice is to use it with caution and to read the fine print before
setting your production system to use direct I/O.
