Tuning Linux VM on Kernel 2.6
Oracle Tips by Burleson Consulting
By Mladen Gogala, author of "Easy
What does the phrase “tuning kernel” mean? It actually
conveys a dangerous message that it is possible to somehow make database to run
faster, by tweaking certain arcane and undocumented Linux parameters. The
knowledge of the charms giving the ability to perform such a dangerous black
magic is highly priced possession of the privileged few, the real stuff that
separates boys from men, or so the legend goes. The truth is that there is no
such thing as “tuning database” or “tuning Linux”. The usual meaning of the
phrase is to minimize the application response time, for our given application.
This document concentrates on the application systems that utilize Oracle RDBMS.
On Linux, there are very few things one can tune to make things run faster. The
purpose of this document is to provide the insight into relatively few tools and
parameters that can be tuned and adjusted to provide temporary relief to the OS.
Linux, in contrast to fully fledged Unix systems, is oriented to the mythical
“home user” and surprisingly poor when it comes to the OS diagnostics and
When tuning an application system, we must always start
from the application itself and see where it spends its time. Unfortunately
Oracle RDBMS doesn't have any statistics or wait event names “waiting for an OS
mess”. The first indication that there is a problem with the OS is a long
duration of other, normally encountered wait events like scattered or sequential
reads, free buffer search or IPC message waits. It is important to know and have
reference of the usual values of RDBMS statistics, in order to be able to notice
sudden spike. The best way to collect such statistics is to take regular
STATSPACK snapshots or use Activity & Workload History (AWH, available as of
Oracle 10g is a separately licensed option, available as a part of OEM
Diagnostics and Tuning pack). The information collected using these tools can
help us correlate the periods of strain on the system with the periods of
activity withing the Oracle RDBMS.
When such a spike is noticed, it is time to look into
OS. Causes of increased overhead are many and can be caused by all of the OS
subsystems: I/O, device management, virtual memory management, network or,
simply, a bad application. The rest of this document concentrates on the case of
virtual memory subsystem causing problems. Virtual memory was chosen because of
popular demand: several people that I know of asked me how to diagnose VM
problems on Linux and how to respond. The problem is that documentation is scant
and inaccurate, while rumors and recipes for “run my stuff faster button”
abound. Linux itself also isn't very helpful in such situations. The rest of
this document deals with the following questions:
How to diagnose VM shortage and what diagnostic tools do we have
at our disposal?
What should we do? What is the goal of tuning VM?
What means do we have at our disposal? This is, of course, the
part about what everybody wants to know the most – Linux parameters.
How to diagnose VM problems?
Before we start diagnosing the problem, let's clarify the
purpose and the basic features of the Linux virtual memory subsystem. The
purpose of any virtual memory subsystem, including the one on Linux, is to
emulate much larger memory then physically available by saving the least needed
parts of memory to disk. The award winning question is: what are the least
needed parts of memory? The answer to that question is simple: the least needed
parts of memory are those that will not be used for the longest time.
Unfortunately, Linux doesn't have a crystal ball which would allow it to look
into the future so guesses must be made about the parts of memory should be
written to the disk and replaced by new contents. Different operating systems
make different guesses. VMS has an implementation of the least recently used
(LRU) mechanism for page replacement. Linux does not have LRU mechanism, Linux
uses the standard “two handed clock” mechanism, found on all other Unix systems.
This mechanism works like this: special process, called page update daemon or
pdflush, goes through all eligible pages, that is the pages that are not locked
in memory, and turns the “in use” bit in the page header off. The process then
goes to sleep. It wakes up after period of time, determined by a configuration
parameter, and writes down to disk all modified pages which do not have the “in
use” bit turned back to on, as a result of access by some process. Those two
functions of the page replacement process are called “hands”. This is a high
level overview of Linux (and Unix) page replacement process.
Linux virtual memory system maintains certain quantity of
“free memory”. What is “free memory”? Free memory is comprised of pages having
valid backup on disk which are not being used by any process. The “in use” bit
for those pages is set to 0. Those pages can be re-allocated to another process
at any time. Process will then allocate a page from the “free memory” pool of
pages, discard its contents, which it can safely do, as the page has a valid
backup, and read the new contents from the disk. This process is called a “page
fault” and happens as a result of CPU hardware. Large number of page faults
results in system spending large parts of time in kernel mode and is one of the
best indicators of problems with the VM subsystem.
At this point, the further descriptions of the virtual
memory system would become too complicated for an article like this and I'll
have to direct the reader to one of the numerous classic texts like Linux
Kernel Development by Robert Love, UNIX Internals: The New Frontiers, by Uresh
Vahalia or Design of the UNIX Operating System by Maurice J. Bach.
The usual tools for diagnosing all kinds of performance
problems, including the ones with virtual memory, are top, sar, vmstat, iostat
and atop. The first four tools are very common and their descriptions can be
found in many places. Those tools can give us information about the memory and
CPU usage per process and various system statistics on the global level, but not
on the per process level. The last tool, atop, is an excellent and phenomenally
useful tool which is not that well known but is very much appreciated among the
connoisseurs. The reason for the relative obscurity and insufficient popularity
lies, quite paradoxically, in the problem of the Linux operating system. Linux
doesn't do per process accounting of disk requests or network requests, while
atop is useful for allowing the administrator to monitor precisely those
aspects: per process page faults, disk and network requests. So, how can it be?
Linux doesn't do per process accounting of I/O requests and atop monitors
precisely that? The answer is in the kernel patch. To monitor I/O and network
requests issued by various processes, the administrator has to install two
$ ls /usr/src
redhat linux kernels
$ ls /usr/src/atoppatch-kernel-2.6.16
After that, the new kernel
has to be linked and installed, which is not a very simple procedure, especially
not with packaged kernels like RH EL or SuSE Linux distributions which no longer
ship the current kernel source. When this is done, atop can be used pretty much
in same way the top is used. Atop, along with the necessary kernel patches and
documentation can be downloaded from:
On the beginning of the next
page is a screen shot of atop monitor, showing the processes responsible for
the most of I/O activity. It shows processes responsible for the disk activity,
together with percentages and the disk activity broken into the number of
self-explanatory calls. This is just one of the monitoring screens atop can
provide to system administrator. It can also break down paging and networking on
the per process basis. This is, by far, my favorite tool for investigating
Oracle performance problems on Linux. In case that atop proves to be too much of
an issue, there are also sar and vmstat which, unfortunately, do not allow
drilling to the process level and finding out the process responsible for all
the commotion. The “vmstat” tool is an archaic tool used primarily by the people
from the BSD world. On Linux, it has been mostly replaced by sar, originally a
System V tool for monitoring systems. The output looks like this:
$ sar -B 3
2.6.17-1.2142_FC4 (medo.noip.com) 08/13/2006
PM pgpgin/s pgpgout/s fault/s majflt/s
PM 0.00 18.67 187.33 0.00
PM 0.00 27.91 94.35 0.00
PM 0.00 16.00 88.67 0.00
PM 0.00 43.19 89.04 0.00
Average: 0.00 26.46 114.81 0.00
As you can see, this only shows
the overall picture, not enabling the administrator to drill down to the process
level. This report contains two columns: “fault/s” and “majflt/s”. What is the
difference between a “fault”, sometimes known as a “soft fault”, and a “major
fault” (aka “hard fault”)? Soft fault happens when the process needs a page
that is already in memory, but was freed by the page replacement process. Major
or “hard” fault happens when the page needs to be brought into memory from
disk. Major faults are, of course, much more expensive and take much longer to
complete then the soft ones. Large number of major page faults can slow the
system down to the crawl. On an average system, major page faults are
responsible for the vast majority of the CPU time spent in the kernel mode.
Overall picture of the system
memory can be obtained by using -r (“report”) option and looks like this:
$ sar -r 3
2.6.17-1.2142_FC4 (medo.noip.com) 08/13/2006
kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused
PM 58640 456864 88.62 8292 230496
1347644 225212 14.32 31352
PM 58640 456864 88.62 8300 230496
1347644 225212 14.32 31352
PM 58656 456848 88.62 8308 230496
1347644 225212 14.32 31352
Average: 58645 456859 88.62 8300
230496 1347644 225212 14.32 31352
The columns in the display are
more or less self-explanatory. There are also two other tools, used for
investigating processes in detail, by following their execution path through
exposing the system and library calls they make. Those tools are called strace
and ltrace. They trace system and library calls, respectively. Those tools are
not tools for monitoring virtual memory per se, but do deserve being honorary
mentioned as they are otherwise extremely useful.
Our problem at hand, however,
was virtual memory system misbehaving. How can we tell that virtual memory is
misbehaving? There are several strong indicators:
Our system is spending much of the time in kernel mode, executing
system calls. That, in itself, is not a sufficient indication, as there may be
many reasons for that. Another major reasons for system spending large amount of
time in kernel mode include faulty equipment which sends interrupt requests
uncontrollably, PIO (Programmed I/O, character by character is an extremely
expensive operation as each separate character requires a separate interrupt) or
a kernel bug. This is not a 100% certain sign of a virtual memory problems but
is a sure sign of some problem.
There are many page faults (“majflts” from the sar output above),
pages are being constantly brought in and out. This will cause significant
system strain and is a definite sign of a virtual memory problem. This is where
atop comes in handy: other tools will not help you in identifying the culprit,
they will only tell you that something is paging like crazy on the system.
The page stealing process “pdflush” is very active and constantly
shows up among the processes responsible for the most disk I/O activity.
Swap consumption continues to increase.
All of these OS aspects can be monitored and diagnosed
using atop. Other monitoring tools do not provide as clean and readable per
process information which could help the administrator ascertain that the
interaction of Oracle and VM is the problem. Of course, standard tools, usable
without kernel patches, can also help diagnose problems with virtual memory
subsystem. The best tool for monitoring the overall health of the VM subsystem
is sar, which is a part of the sysstat package for Red Hat Fedora and EL
operating systems. Programs in this package, sar and vmstat will provide you
with the global state of affairs as far as virtual memory is concerned, but will
not make it possible to drill down to the process level.
Now, the problem has been diagnosed as overly active
virtual memory subsystem. What should we do about it and how can we do it?
The best thing to do would be to tune the culprit program
and make it behave nicely with respect to virtual memory. That, alas, is not an
option with commercial programs like Oracle RDBMS which do not provide the
source code. If the program itself cannot be tuned, we have to “tune the
system”, which means to adjust the operating mode of the system in such a way
that our special program, in this case Oracle RDBMS, can operate as smoothly as
The goal of the tuning effort is to bring the system in
the stable state, where it will not spend time maintaining virtual memory but
will devote that time to work on user's problems. Please, note that this goal is
not equivalent to eliminating paging. Some people try with usual “locking the
SGA”, and that, in most of the cases, makes things worse. Why? Look at the
monitor above: there is web browser (firefox), word processor (swriter.bin) and
printing subsystem (eggcups), all active along with Oracle instance. If I allow
Oracle to take the SGA out of the control of virtual memory subsystem and
prevent paging and swapping on 50% of the system memory which comprises the SGA,
all these processes will have to make do with just 50% of memory, competing for
the resources and paging like crazy. Will my Oracle instance work faster as a
result? Definitely not. Page fault is implemented as a system trap and its
execution prevents other processes from running for a short period of time.
Oracle processes will have the whole SGA in memory, even with the parts that are
rarely used, but will be unable to run because the kernel will be constantly
running in the kernel mode, bringing pages needed by the word processor, printer
and web browser in and out of memory. The whole system will slow down to the
crawl. The solution, in this case, would be to refrain from doing things like
writing articles about virtual memory on database servers.
The other thing to do would be decreasing the SGA and
leaving more memory for everything else. If the system is busy paging, trimming
down the buffer cache will actually result in increased performance, despite the
lower BCHR (buffer Cache Hit Ratio, an old system performance metrics, rarely
Opposite of that is purchasing more memory, also known as
“throwing money at the problem”. That may or may not solve the problem,
depending on what uses the memory. If the memory is used for file system
buffers, which are allocated dynamically, and paged in and out, as needed, then
purchasing more memory will not be extremely helpful. Oracle RDBMS keeps data in
the OS files. If I/O operations referencing those files use system buffers, the
memory consumption can potentially be enormous, as database files may be
enormous. That is known as the double buffering. Oracle blocks are being cached
by both Oracle and operating system, in two separate memory areas, with two
different cache management systems. This situation is not without its benefits,
either, but , generally speaking, tends to stimulate exorbitant memory
consumption. Benefits for such situation come from pre-fetch features of most of
modern file systems, which will bring in many more blocks then just the
requested one, and make full table scans and fast full index scans perform
better. Unfortunately, server processes will have to compete for memory with the
file system, instead of just allocating free memory, thus increasing swap
consumption and slowing things down. Operations that will be badly affected are
operations which use lot of memory: hash joins, large sorts or PL/SQL objects
with the internal arrays and tables.
The solution here is to make Oracle do I/O bypassing the
system buffer cache. In other words, Oracle should use direct I/O or raw
What else, except SGA, can make Oracle start using large
quantities of memory? Oracle processes allocate PGA for doing sorts, hash
joins, storing session variables, PL/SQL tables and cursor statuses. Note that
“P” in PGA stands for “Private”, which means that every process allocates its
own PGA (except when using MTS, which is outside of the scope of this article)
and that means that locking SGA in memory is unlikely to help. That is the
classic situation every DBA faces from time to time. SGA is fixed in memory, it
size never increases and never decreases. If the DBA doesn't go overboard with
the initial allocation and doesn't use BCHR as the most important tuning
metrics, SGA is unlikely to ever cause a serious headache. Shared memory is
always the last one to be paged out, so it will almost never happen. The problem
is, therefore, aggressive allocation of PGA areas. How to speed it up? The best
way to do that is to ensure that there is always enough memory for an Oracle
process to allocate an average PGA and to control the overall PGA allocation
through the Oracle instance parameters. Automatic PGA allocation is described in
many papers and is quite easy to set up and control. What needs to be done is to
ensure that there is always enough of free memory for an average allocation to
be completed quickly and without performance problems. Now, that is a well
defined goal that Linux kernel parameters can help us achieve.
Of course, before we proceed with parameter changes, one
should always try tuning SQL and reduce the need for memory. That should always
be the first line of defense. Creating an index and turning hash join into a
nested loops join is definitely less painful option than changing VM parameters
for the whole system or purchasing more memory. Sometimes, though, tuning SQL
cannot help us. Big quarterly report just has to go through all of the data for
the quarter and summarize them, accessing several dimension tables along the
way. In this case, we can only help Oracle manage PGA.
Automated PGA management will keep overall memory
consumption under control, but for each process, PGA allocation can be a major
hurdle. Here is what needs to happen, so that the process can allocate enough
Process needs to allocate enough space on the swap partition to which the
newly allocated memory will be written, should the need arise.
Process needs to allocate sufficient number of clean memory pages. It
will first try to find sufficient number of pages which are not in use and have
a valid backup on disk (aka “free pages”) but if if fails, it will try to back
up some pages to disk, fill them with zeros (aka “clean them”) and allocate to
the requesting process. In doing so, the process will activate page stealing
process “pdflush” and, of course, wait for the process to free enough pages so
that the allocation can proceed.
Page tables will have to be extended and segment sizes updated. Again,
that might require some help from “pdflush” process.
Process statistics about memory usage has to be updated.
Backing pages up and cleaning them can result in several
I/O requests. The requesting process cannot proceed before those pages are
allocated, so there will be significant wait here. Time needed for disk access
is measured in milliseconds, while the time needed for memory access is measured
in nanoseconds. Memory access is approximately 100,000 times faster then the
disk access. Impact of hard page faults will severely impact all users of the
What we want to do is to instruct the Linux virtual
memory system to always maintain enough memory so that the PGA allocation,
should the need arise, can be completed quickly. Let's have Linux kernel do
preventive maintenance, so that Oracle never has to wait for pages to be
written down. Now, that is the well defined goal that can be achieved by
In order to help VM system to maintain enough free
memory, there are the following kernel parameters:
Min_free_kbytes – defines how much free memory should be
Dirty_background_ratio – defines the percentage of dirty pages
that will activate pdflush.
Dirty_expire_centisecs – defines how long should the data be in
memory before becoming eligible for being written out by pdflush.
Dirty_writeback_centisecs – defines the period between pdflush
Swapiness – composite parameter which defines “inclination of the
system for swapping”.
These parameters regulate several things:
When do pages get flushed to disk
Which pages will be flushed to disk.
How many pages will be written to disk.
The goal is to maintain sufficient quantity of free
memory while not causing page replacement process to become overly active and
the whole system to become I/O bound. .
So far, we have been talking about page replacement, page
stealing and page freeing, but we haven't mentioned the word “swapping”. What
is swapping? Swapping does not operate on pages, it operates on processes.
Swapper is the process which, when memory is at the very low point, writes the
address space of a whole process to the swap partition and frees the memory.
Usually, on other Unix systems, that is a desperate measure which kicks in only
when the amount of free memory falls beneath the number of pages defined by the
parameter minfree. Unfortunately, Linux doesn't have “minfree” parameter which
would even remotely resemble the usual culprits on Solaris or HP-UX. Linux 2.6
has “swappiness”, which is defined as “tendency of system to swap”. It is a
“dashboard indicator”, in the best Microsoft style, which can take values
between 0 and 100, with the default of 60.
[root@medo ~]# cat /proc/sys/vm/swappiness
Of course, this parameter is not defined in kernel
documentation that comes with the system and there are no good recommendations.
Personally, I turned it off by setting it to 0, without any ill effects, but
without any noticeable improvements, either. I want my system to start swapping
only in times of crisis, and Linux doesn't provide any type of control that
makes sense. Linux doesn't provide many means for system administrator to tune
VM, asking you instead to rely on the “intelligence of the system”. Despite all
the search there is no proof of intelligence within Linux kernel yet. This
undocumented parameter is rumored to be the famous “run faster button”. When I
set it to 100, my system started to swap frequently
[root@medo ~]# echo "100">/proc/sys/vm/swappiness
[root@medo ~]# cat /proc/sys/vm/swappiness
Below is the picture of my system, with Oracle instance
started and swappiness set to 100:
As is visible from the picture, swapper is active, but the
amount of free memory is not particularly large. This parameter has just
activated the swapper, without any beneficial effect on performance. I haven't
noticed any bad effects from setting it to 0, for as long as page replacement
mechanism is able to cope with the amount of available memory.
To maintain the necessary amount of free memory, it is much
better to set min_free_kbytes which will cause the page replacement process to
maintain at least that many free kilobytes of memory. It is also worth noticing
that Linux misses parameters for fine regulation of the virtual memory, defining
when will the page replacement process be activated, deactivated, sped up or
What I do not want to do is to give numbers and advise
maintaining at least 10% or 15% or 20% of your RAM as free memory, as the
number is very dependent on the behavior and type of the system. Linux
parameters are not very well documented and I've seen systems that behaved well
without a single one of the parameters above being changed. The parameters
described earlier are the tools that database and system administrators have at
their disposal in case of urgent problems. These parameters can be used to patch
situation quickly until the main problem is resolved. Sudden and rapid growth in
the memory consumption is always a sign of change and changing the virtual
memory parameters usually doesn't resolve the problem but it can alleviate the
There is no one size fit all best strategy, capable of
determining the values of those parameters in advance. It is important to
underline the fact that changing these parameters should be done as a reaction
to a problem, not as an preventive measure which will prevent demons from
possessing our little Linux box. As you can see, Linux is rather limited in its
tuning capabilities and tools and rather poorly instrumented. When an operating
system becomes fully instrumented and stable, it usually also becomes obsolete.
Judging by that, Linux is light years away from becoming obsolete.
By Mladen Gogala, author of "Easy
If you like Oracle tuning, see the book "Oracle
Tuning: The Definitive Reference", with 950 pages of tuning tips and
You can buy it direct from the publisher for 30%-off and get
instant access to the code depot of Oracle tuning scripts.