
NFS and Dirty Pages

Updated April 21 2016 at 12:02 PM - English


PROBLEM
Computers with lots of RAM and lots of processing power can quickly create many Dirty Pages (data to
be written eventually to a filesystem) in RAM. When the time comes to flush these Dirty Pages to the
respective filesystem, a step called Writeback, there can be a lot of congestion with NFS. The throughput of
data travelling over the network is significantly slower than writing to RAM. Picture the impact on road
traffic if a 10-lane road suddenly reduced down to 2 lanes.
One might expect this to only impact the NFS mount; however, the number of permitted Dirty Pages is a
system-wide value. Once this threshold is reached, every process on the system is responsible for
freeing up pages if it attempts to allocate memory. If there are only a few dirty pages, this is fine, but if
there are 40 GiB of dirty pages, all processes can be blocked for a long time.
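
To see where the current system-wide limits sit and how much dirty data is outstanding, something along these lines can be run (the exact fields and defaults vary by kernel version):

$ sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes
$ grep -E "^(Dirty|Writeback)" /proc/meminfo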

WORKAROUNDS
There are a number of ways to work around this issue. They range from solutions that only impact the
process and the file being written to, all the way to impacting all processes and all filesystems.

File-level impact
Direct I/O
When opening the file for writing, use the O_DIRECT flag to completely bypass the Page Cache. This
can also be achieved by using dd to copy a file to the NFS mount with the oflag=direct option.
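
As an illustration, a dd invocation along these lines copies a file onto the NFS mount while bypassing the Page Cache (the source and destination paths are placeholders):

$ dd if=/path/to/local_file of=/nfsmount/destination_file bs=1M oflag=direct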

Throttle I/O
The next option is to throttle the rate of reading the data to match the NFS WRITE rate, e.g. use
rsync with the --bwlimit option.
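
For example, a hypothetical rsync copy limited to roughly 20 MiB/s ( --bwlimit takes a value in KiB per second; choose a rate at or below what the NFS WRITE path can sustain):

$ rsync --bwlimit=20480 /path/to/local_file /nfsmount/destination_dir/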

Flush NFS Dirty Pages frequently
If you are able to recompile the source code, periodically call fsync() . If you are unable to
recompile the source code, run the following periodically:
ls -l /nfsmount/dir_containing_files
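
One rough way to run this periodically, using an arbitrary 5-second interval (adjust to suit the workload):

$ while true; do ls -l /nfsmount/dir_containing_files > /dev/null; sleep 5; done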

Write smaller files
If possible, try breaking up single large files into smaller files. The Dirty Pages associated with each file
will be flushed when that file is closed, which results in Dirty Pages being flushed more frequently.
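
For instance, an existing large file could be copied onto the NFS mount in smaller chunks with split ; each chunk is closed, and therefore flushed, before the next one is written. The chunk size and paths below are only examples:

$ split -b 512M /path/to/large_file /nfsmount/destination_dir/large_file.part-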

NFS mount impact
Use only synchronous I/O
Normally, I/O is done asynchronously on the NFS Client, meaning the application writes to the Page
Cache and the NFS Client sends the data to the NFS Server later.
I/O can be forced to be done synchronously, meaning the application does not consider a write
complete until the NFS Client has sent the data to the NFS Server, and the NFS Server has
acknowledged receiving the data.
Using the sync NFS Client mount option forces all writes to be synchronous. However, it will also
severely degrade the NFS Client WRITE performance.
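
A sketch of mounting with the sync option, using placeholder server and mount point names:

$ mount -t nfs -o sync nfsserver:/export /nfsmount

The equivalent /etc/fstab entry would simply carry sync in its options field:

nfsserver:/export  /nfsmount  nfs  sync  0 0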

rsize / wsize (NFS client mount options)


The rsize / wsize is the maximum number of bytes per network READ / WRITE request. Increasing
these values has the potential to increase the throughput depending on the type of workload and the
performance of the network.
The default rsize / wsize is negotiated with the NFS Server by the NFS Client. If your workload is a
streaming READ / WRITE workload, increasing rsize / wsize to 1048576 (1 MiB) could improve
throughput performance.
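
For example, mounting with a 1 MiB rsize / wsize (server and mount point names are placeholders) and then confirming what the NFS Client actually negotiated:

$ mount -t nfs -o rsize=1048576,wsize=1048576 nfsserver:/export /nfsmount
$ nfsstat -m

The NFS Server may cap these values, so check the effective settings rather than assuming the requested ones were accepted.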

System-wide impact
Limit the number of system-wide Dirty Pages
From RHEL 5.6 (kernel 2.6.18-238) onwards (including RHEL 6.0), the tunables
vm.dirty_background_bytes and vm.dirty_bytes are available. These tunables provide finer-grained
adjustments, particularly if the system has a lot of RAM. Prior to RHEL 5.6, the tunables
vm.dirty_background_ratio and vm.dirty_ratio can be used to achieve the same objective.
Set vm.dirty_expire_centisecs ( /proc/sys/vm/dirty_expire_centisecs ) to 500, down from the default of 3000.
Limit vm.dirty_background_bytes ( /proc/sys/vm/dirty_background_bytes ) to 500 MiB.
Limit vm.dirty_bytes ( /proc/sys/vm/dirty_bytes ) to not more than 1 GiB.
Ensure that /proc/sys/vm/dirty_background_bytes is always a smaller, non-zero value than
/proc/sys/vm/dirty_bytes , as in the example below.
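
For example, the values suggested above could be applied at runtime as follows (500 MiB = 524288000 bytes, 1 GiB = 1073741824 bytes); add the same settings to /etc/sysctl.conf to make them persistent across reboots:

# sysctl -w vm.dirty_expire_centisecs=500
# sysctl -w vm.dirty_background_bytes=524288000
# sysctl -w vm.dirty_bytes=1073741824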

Changing these values can impact throughput negatively while improving latency. To shift the balance
between throughput and latency, adjust these values slightly and measure the impact, in particular
dirty_bytes.
The behaviour of Dirty Pages and Writeback can be observed by running the following command:
$ watch -d -n 1 cat /proc/meminfo
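
If the full /proc/meminfo output is too noisy, the same idea can be narrowed down to just the dirty and writeback counters (field names vary slightly between kernel versions):

$ watch -d -n 1 'grep -E "^(Dirty|Writeback|NFS_Unstable)" /proc/meminfo'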

Documentation/sysctl/vm.txt:
dirty_expire_centisecs This tunable is used to define when dirty data is old enough
to be eligible for writeout by the kernel flusher threads. It is expressed in
100'ths of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.

dirty_bytes Contains the amount of dirty memory at which a process generating disk
writes will itself start writeback. Note: dirty_bytes is the counterpart of
dirty_ratio. Only one of them may be specified at a time. When one sysctl is written
it is immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read. Note: the minimum value allowed for dirty_bytes is two
pages (in bytes); any value lower than this limit will be ignored and the old
configuration will be retained.

dirty_ratio Contains, as a percentage of total available memory that contains free
pages and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data. The total available
memory is not equal to total system memory.

dirty_background_bytes Contains the amount of dirty memory at which the background
kernel flusher threads will start writeback. Note: dirty_background_bytes is the
counterpart of dirty_background_ratio. Only one of them may be specified at a time.
When one sysctl is written it is immediately taken into account to evaluate the
dirty memory limits and the other appears as 0 when read.

dirty_background_ratio Contains, as a percentage of total available memory that
contains free pages and reclaimable pages, the number of pages at which the
background kernel flusher threads will start writing out dirty data. The total
available memory is not equal to total system memory.

Environment-wide impact
Improve the Network Performance (iperf benchmarking)
The performance of the network has a significant bearing on NFS. Check that the network is performing
well by running iperf. It can be used to measure network throughput between the NFS Client and
another system, ideally the NFS Server, e.g.:

Receiver:
$ iperf -s -f M

Transmitter:
$ iperf -c RECEIVER-IP -f M -t 60

Do a few iterations and try to make each test run for at least 60 seconds. You should be able to get an
idea of baseline network throughput. NFS will not perform any faster than the baseline.
Also refer to How to begin Network debugging (https://access.redhat.com/articles/1311173).

Increase the performance of the NFS Server
Benchmark and determine whether there are any performance bottlenecks on the NFS Server, e.g.
determine whether the performance of the filesystem underlying the NFS Server's export can be improved.
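
As a rough, illustrative check of the export's underlying filesystem, a sequential write can be timed directly on the NFS Server (the path is a placeholder, the test file should be removed afterwards, and dd only provides a coarse baseline rather than a full benchmark):

# dd if=/dev/zero of=/path/to/export/testfile bs=1M count=4096 conv=fsync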
