Welcome to My Nightmare - The Common Performance Errors in Oracle Databases
By Michael R. Ault
Improper Disk Setup
Under the heading of improper disk setup there are many subtopics, including:
· Interface issues
· Mount options
· Filesystem choices
· RAID setup
· Disk size and speed choices
Let's look at each of these in the Oracle
environment.
Interface Issues
Generally, interface issues resolve to bandwidth issues. A case in point: a major bakery had upgraded their system, putting in more and faster CPUs, higher-speed disks, and newer hardware overall. They calculated that on average they used only 75% of the bandwidth on the old system, so they reduced the number of HBAs from 12 dual-channel to 8 dual-channel.
After the upgrade performance looked great, until the end-of-month processing crunch, when performance suddenly dropped to half of what it had been before. Investigation showed that while on average they only needed 75% of the bandwidth of the 12 HBAs, during end-of-month, end-of-quarter, and end-of-year processing they actually required more. Luckily for them, the HBAs in the old system were compatible and they had the needed expansion slots to add the needed HBAs to the new system. With the 4 additional HBAs in place they quadrupled their performance.
The other major choice in interfaces is interface type: SCSI, Fibre, or Fabric. Unfortunately there is no simple answer; you need to examine your system and, if IO timing is bad, find out whether it is related to contention or bandwidth issues.
Note that when you monitor IO timing you need to look at it from Oracle's perspective, that is, from the time Oracle requests the IO to the time the IO is received by the Oracle system, not strictly at the operating system level. If you see a large difference between what Oracle says IO timings are and what the OS says, you need to track down where the time is being consumed between the disk packs and the Oracle database.
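A quick sanity check of the Oracle-side timings can be done with a query like the following (a minimal sketch; v$filestat reports READTIM in hundredths of a second, and only when timed statistics are enabled):

select f.file#, d.name, f.phyrds,
       round(f.readtim/greatest(f.phyrds,1),3) avg_read_cs
from v$filestat f, v$datafile d
where f.file# = d.file#;

Compare these per-file averages with what iostat or sar reports for the same devices; a large gap points to time being lost between the OS and Oracle.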
Mount Options
Essentially, Oracle doesn't like any mount option involving logging on its datafile mount points. Anything you can do to make the disk look RAW to Oracle is a good thing. This means nologging, noatime, async, aio, and many other mount options need to be considered, depending on your operating system. Dramatic improvements in performance have been reported from just changing the mount options on the drives or filesystems that support your Oracle datafiles. Generally it is not advised to switch to nologging-type options on filesystems where non-Oracle or binary files (executables) are stored.
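For example, on Linux these options are applied through /etc/fstab. This is only an illustrative sketch; the device name and mount point are hypothetical, and each option should be verified against your platform's mount man page before use:

/dev/sdb1  /u02/oradata  ext3  noatime,async  0 0

Per the caution above, filesystems holding executables or other non-Oracle files should keep their default options.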
In UNIX you can control whether a filesystem uses buffered or unbuffered IO. With Oracle, the use of a buffered filesystem is redundant and dangerous. An example of the dangers of a buffered filesystem with Oracle is when power is lost. The buffer in a buffered filesystem depends on the cache battery to provide enough power for the buffer to be written to disk before the disk spins down. However, many shops fail to monitor the cache battery lifetime limitations or fail to change the batteries at all. This can result in loss of data in a buffered filesystem on loss of power.
You can turn off buffered writes in several ways (buffered reads aren't an issue, but you should always use write-through caching). One is to mount the filesystems used with Oracle files as non-buffered using such options as:
· AIX: "dio", "rbrw", "nointegrity"
· SUN: "delaylog", "mincache=direct", "convosync=direct", "nodatainlog"
· LINUX: "async", "noatime"
· HP: Use VxFS with: "delaylog", "nodatainlog", "mincache=direct", "convosync=direct"
Using Direct IO at the Oracle Level
For information about Oracle direct I/O, refer to this URL by Steve Adams:
· http://www.ixora.com.au/notes/filesystemio_options.htm
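Direct IO can also be requested from the Oracle side with the filesystemio_options initialization parameter. A minimal sketch (the parameter is static, so the change takes effect at the next instance restart; SETALL asks for both direct and asynchronous IO where the platform supports them):

ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;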
Checking Your Server
Methods for configuring the OS will vary depending on the operating system and filesystem in use. Here are some examples of quick checks that anyone can perform to ensure that you are using direct I/O:
· Solaris - Look for a "forcedirectio" option. Oracle DBAs find this option often makes a huge difference in I/O speed for Sun servers; see the Sun documentation for details.
· AIX - Look for a "dio" option. Here is a great link for AIX direct I/O: http://www-106.ibm.com/developerworks/eserver/articles/DirectIO.html
· Veritas VxFS (including HP-UX, Solaris and AIX) - Look for "convosync=direct". It is also possible to enable direct I/O on a per-file basis using Veritas QIO; refer to the "qiostat" command and corresponding man page for hints. For HP-UX, see Oracle on HP-UX - Best Practices.
· Linux - Linux systems support direct I/O on a per-filehandle basis (which is much more flexible), and I believe Oracle enables this feature automatically. Someone should verify at what release Oracle started to support this feature (it is called O_DIRECT). See Kernel Asynchronous I/O (AIO) Support for Linux and this great OTN article: Talking Linux: OCFS Update.
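For the Solaris case above, a quick check and an example mount might look like the following (the device and mount point are hypothetical; adjust to your own layout):

mount -v | grep forcedirectio
mount -o forcedirectio /dev/dsk/c0t1d0s4 /u01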
I'm Using LINUX and ATA Arrays, no Stress, but
IO is slow!
Don't panic! Most LINUX kernels will take the
default ATA interface setpoints that were the "standard" when the
kernel was built (or even older ones). This can be corrected.
In LINUX there is the hdparm command, which allows you to reset how ATA drives are accessed by the operating system. Using hdparm is simple, and with it I have seen 300% improvements in access speeds on various ATA drives. Let's go through a quick tuning sequence.
First, we will use the hdparm command with no arguments but the full path to the disk device:
[root@aultlinux2 root]# hdparm /dev/hdb
/dev/hdb:
multcount = 16 (on)
IO_support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 0 (off)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 77557/16/63, sectors = 78177792, start = 0
The hdparm command with no arguments but the disk device gives the current settings for the drive. You should compare this to the specifications for your drive. You may find that direct memory access (DMA) is not being used, readahead is too small, you are only using 16-bit IO when you should be using 32-bit, etc.
Next, let's do a basic benchmark of the current performance of the drive using the hdparm -Tt option (for all options, do a "man hdparm" at the command line):
[root@aultlinux2 root]# hdparm -Tt /dev/hdb
/dev/hdb:
Timing buffer-cache reads: 128 MB in 1.63 seconds = 78.53 MB/sec
Timing buffered disk reads: 64 MB in 14.20 seconds = 4.51 MB/sec
Now let's adjust the settings. The -c option, when set to 1, enables 32-bit IO; the -u option is used to get or set the interrupt-unmask flag for the drive. A setting of 1 permits the driver to unmask other interrupts during processing of a disk interrupt, which greatly improves Linux's responsiveness and eliminates "serial port overrun" errors. Use this feature with caution on older kernels: some drive/controller combinations do not tolerate the increased I/O latencies possible when this feature is enabled, resulting in massive filesystem corruption. However, most versions of Linux (RedHat 2.1 and greater) using modern controllers don't have this issue. The -p option is used to autoset the PIO mode, and -d is used to set or unset the DMA mode.
[root@aultlinux2 root]# hdparm -c1 -u0 -p -d0 /dev/hdb
/dev/hdb:
attempting to set PIO mode to 0
setting 32-bit IO_support flag to 1
setting unmaskirq to 0 (off)
setting using_dma to 0 (off)
IO_support = 1 (32-bit)
unmaskirq = 0 (off)
using_dma = 0 (off)
So we turned on 32-bit IO support, set the PIO mode, and left DMA off. Let's see the resulting performance change using our previous -Tt option.
[root@aultlinux2 root]# hdparm -Tt /dev/hdb
/dev/hdb:
Timing buffer-cache reads: 128 MB in 1.63 seconds = 78.53 MB/sec
Timing buffered disk reads: 64 MB in 9.80 seconds = 6.53 MB/sec
So we didn't change the buffer-cache read timings; however, we improved the buffered disk reads by 45%. Let's tweak some more and see if we can do better. The -m option sets the multi-sector IO count for the drive, the -c option sets the 32-bit option, -X sets the access mode to mdma2, the -d1 option turns on direct memory access, the -a8 option improves readahead performance for large reads, and -u1 turns on the unmasking operation described above.
[root@aultlinux2 root]# hdparm -m16 -c3 -X mdma2 -d1 -a8 -u1 /dev/hdb
/dev/hdb:
setting fs readahead to 8
setting 32-bit IO_support flag to 3
setting multcount to 16
setting unmaskirq to 1 (on)
setting using_dma to 1 (on)
setting xfermode to 34 (multiword DMA mode2)
multcount = 16 (on)
IO_support = 3 (32-bit w/sync)
unmaskirq = 1 (on)
using_dma = 1 (on)
readahead = 8 (on)
So now let's see what we have done to
performance using the –Tt option.
[root@aultlinux2 root]# hdparm -Tt /dev/hdb
/dev/hdb:
Timing buffer-cache reads: 128 MB in 1.56 seconds = 82.05 MB/sec
Timing buffered disk reads: 64 MB in 4.29 seconds = 14.92 MB/sec
Not bad! We improved buffer-cache reads by 5% and buffered disk reads by 231%!
These options can then be loaded into a startup file
to make them part of the system startup.
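For example, on many Linux distributions you could append the final hdparm command from above to /etc/rc.local so the settings are reapplied at every boot (a sketch only; re-test with -Tt after any kernel or controller change before trusting saved settings):

/sbin/hdparm -m16 -c3 -X mdma2 -d1 -a8 -u1 /dev/hdb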
I'm Really Feeling SCSI About Disk Performance, What Then?
Sorry for the bad pun (well, actually I'm not). What can be done with SCSI interfaces? To tell the truth, not a lot; however, there are some items you may find useful. Most interfaces will buffer commands and issue them in batches; for example, most SCSI interfaces use a 32-command buffer that stacks commands until it has 32 of them and then fires them off. This can be reset in LINUX using options in the modules.conf file for the SCSI interface module.
In other UNIX flavors there are many settings which can be changed, but an exact understanding of the interface and its limitations, as well as of current system loads, is needed before changing any of the SCSI settings. If you feel you need to have them checked, ask your SA.
Disk Stress in a Nutshell
In summary, to determine if a disk or array is undergoing IO-related stress, perform an IO balance and an IO timing analysis. If the IO timing analysis shows excessive read or write times, investigate the causes. Generally speaking, poor IO timings will result when:
· A single disk exceeds 110-150 IOs per second
· An entire multi-read capable RAID10 array exceeds #MIRRORS*#DPM*110 IOs per second
· An entire non-multi-read capable RAID10 array exceeds #DPM*110 IOs per second
· A RAID5 array exceeds (#DISKS-1)*66 IOs per second
To avoid poor IO timings:
· Make sure Oracle is using direct IO at both the OS and Oracle levels
· Make sure your disk interface is tuned to perform optimally
*DPM = disks per mirror
Seeing Stress from the Oracle Side
Disk stress will show up on the Oracle side as excessive read or write times. Filesystem stress is shown by calculating the IO timings as shown in Figure 16.
For complete scripts, see my complete Oracle script collection at www.dba-oracle.com/oracle_scripts.htm.
rem Purpose: Calculate IO timing values for datafiles
col name format a65
col READTIM/PHYRDS heading 'Avg|Read Time' format 9,999.999
col WRITETIM/PHYWRTS heading 'Avg|Write Time' format 9,999.999
set lines 132 pages 45
start title132 'IO Timing Analysis'
spool rep_out\&db\io_time
select f.FILE#, d.name, PHYRDS, PHYWRTS, READTIM/PHYRDS, WRITETIM/PHYWRTS
from v$filestat f, v$datafile d
SEE CODE DEPOT FOR FULL SCRIPT
and phyrds>0 and phywrts>0
union
select a.FILE#, b.name, PHYRDS, PHYWRTS, READTIM/PHYRDS, WRITETIM/PHYWRTS
from v$tempstat a, v$tempfile b
where a.file#=b.file#
and phyrds>0 and phywrts>0
order by 5 desc
/
spool off
ttitle off
clear col
Figure 16: IO Timing Report
An example of the output from Figure 16 is shown in Figure 17.

Date: 11/20/05                                                   Page: 1
Time: 11:12 AM              IO Timing Analysis                   PERFSTAT
                            whoraw database

FILE# NAME             PHYRDS  PHYWRTS READTIM/PHYRDS WRITETIM/PHYWRTS
----- -------------- -------- -------- -------------- ----------------
   13 /dev/raw/raw19    77751   102092     76.8958599       153.461829
   33 /dev/raw/raw35    32948    52764     65.7045041       89.5749375
    7 /dev/raw/raw90   245854   556242     57.0748615       76.1539869
   54 /dev/raw/raw84   208916   207539     54.5494409       115.610912
   40 /dev/raw/raw38     4743    27065     38.4469745       47.1722889
   15 /dev/raw/raw41     3850     7216     35.6272727       66.1534091
   12 /dev/raw/raw4    323691   481471     32.5510193       100.201424
   16 /dev/raw/raw50    10917    46483     31.9372538       74.5476626
   18 /dev/raw/raw24     3684     4909     30.8045603       71.7942554
   23 /dev/raw/raw58    63517    78160     29.8442779       84.4477866
    5 /dev/raw/raw91   102783    94639     29.1871516       87.8867909

Figure 17: Example IO Timing Report
As you can see from Figure 17, we are looking at an example report from a RAW configuration using single disks. Notice how both read and write times exceed even the rather generous good-practice limits of 10-20 milliseconds for a disk read. In my experience, reads should not exceed 5 milliseconds, and usually with modern buffered reads, 1-2 milliseconds. Oracle is more tolerant of write delays since it uses a delayed write mechanism, so 10-20 milliseconds on writes will normally not cause significant Oracle waits. However, the smaller you can get read and write times, the better!
Filesystems
In UNIX or LINUX you have multiple filesystem options: RAW, JFS, ext2, ext3, reiserFS, OCFS. You need to use the best performing filesystem, usually RAW, ext3, or reiserFS. Oracle OCFS is also viable.
RAW Filesystems
RAW is probably the simplest filesystem to understand. In Unix systems, a disk can be configured to contain a single contiguous chunk of space, or it can be configured to contain multiple separate chunks of space. Each chunk of space is called a partition or section. Typically a partition would be formatted to contain a filesystem, allowing a hierarchical structure of directories and files to be created. A partition that does not contain a filesystem is called a raw partition. A database extent stored on a raw partition is called a raw extent.
Advantages
The primary advantage to using raw extents is
the possibility of increased performance compared to extents stored
in Unix files. The performance benefits occur because:
· File system overhead and address translation is eliminated. Database block addresses map directly to raw partition addresses.
· Read operations transfer data directly from the disk controller to shared memory, and write operations transfer directly from shared memory to the disk controller. The Unix buffer pool is not used, avoiding the need to copy data from a Unix page buffer to shared memory and vice versa.
Disadvantages
Although they may provide increased performance,
raw extents have several major disadvantages. You should not take
the decision to use them lightly. Among the disadvantages are:
· Raw disk partitions are cumbersome to configure and manage. You have to keep track of what is on each raw partition yourself. You can easily make mistakes and destroy the contents of a raw partition.
· The number of raw partitions available on a disk may be quite limited. For example, on Solaris 2.5, a disk can have at most 8 partitions.
· Moving raw partitions from one disk to another may be difficult or impossible.
· Raw partition sizes are fixed. Allocating more space to a partition requires backing up all the partitions on a disk, designing a new partition layout, and restoring all saved data.
· You cannot use the same operating system backup utilities to make backups of files and raw partitions.
· You may get worse performance with raw devices than you do with files. The performance difference is highly dependent on the operating system's implementation of the file system. Many modern Unix systems, like Digital Unix, AIX 4.2, HP-UX 10, Solaris 2.6, and others, have highly advanced filesystem implementations that include many performance optimizations. Older Unix systems, such as Unix System V Release 4, are somewhat less advanced, and raw partitions may provide better performance on these systems.
· You may not be able to use logical volume managers with raw partitions. Not all Unix systems support logical volumes composed of raw partitions.
The EXT2 Filesystem
Ext2fs is based on the Extfs code with many reorganizations and many improvements. It has been designed with evolution in mind and contains space for future improvements. The Second Extended File System was designed and implemented to fix some problems present in the first Extended File System.
· Ext2fs supports standard Unix file types: regular files, directories, device special files and symbolic links.
· Ext2fs is able to manage filesystems created on really big partitions. While the original kernel code restricted the maximal filesystem size to 2 GB, recent work in the VFS layer has raised this limit to 4 TB. Thus, it is now possible to use big disks without the need to create many partitions.
· Ext2fs provides long file names. It uses variable length directory entries. The maximal file name size is 255 characters. This limit could be extended to 1012 if needed.
· Ext2fs reserves some blocks for the super user (root). Normally, 5% of the blocks are reserved. This allows the administrator to recover easily from situations where user processes fill up filesystems.
"Advanced" Ext2fs features
In addition to the standard Unix features,
Ext2fs supports some extensions which are not usually present in
Unix filesystems.
· File attributes allow users to modify the kernel behavior when acting on a set of files. One can set attributes on a file or on a directory. In the latter case, new files created in the directory inherit these attributes.
· BSD or System V Release 4 semantics can be selected at mount time.
· BSD-like synchronous updates can be used in Ext2fs.
· Ext2fs allows the administrator to choose the logical block size when creating the filesystem. Block sizes can typically be 1024, 2048 and 4096 bytes. Using big block sizes can speed up I/O since fewer I/O requests, and thus fewer disk head seeks, need to be done to access a file.
· Ext2fs implements fast symbolic links.
· Ext2fs keeps track of the filesystem state.
· Always skipping filesystem checks may sometimes be dangerous, so Ext2fs provides two ways to force checks at regular intervals. A mount counter is maintained in the superblock. Each time the filesystem is mounted in read/write mode, this counter is incremented. When it reaches a maximal value (also recorded in the superblock), the filesystem checker forces the check even if the filesystem is "clean".
· Mount options can also be used to change the kernel error behavior.
· An attribute allows users to request secure deletion of files. When such a file is deleted, random data is written in the disk blocks previously allocated to the file. This prevents malicious people from gaining access to the previous content of the file by using a disk editor.
· Last, new types of files inspired by the 4.4 BSD filesystem have recently been added to Ext2fs. Immutable files can only be read: nobody can write or delete them. This can be used to protect sensitive configuration files. Append-only files can be opened in write mode but data is always appended at the end of the file. Like immutable files, they cannot be deleted or renamed. This is especially useful for log files, which can only grow.
EXT3 Filesystem
According to Dr. Stephen Tweedie:
The ext3 filesystem is a journaling extension to the standard ext2 filesystem on Linux. Journaling results in massively reduced time spent recovering a filesystem after a crash, and is therefore in high demand in environments where high availability is important, not only to improve recovery times on single machines but also to allow a crashed machine's filesystem to be recovered on another machine when we have a cluster of nodes with a shared disk.
… the real objective in EXT3 was this simple thing: availability. When something goes down in EXT3, we don't want to have to go through a fsck. We want to be able to reboot the machine instantly and have everything nice and consistent… You can take an existing EXT2 filesystem, throw a journal file onto it, and mount it as EXT3.
ReiserFS
According to Daniel Robbins (drobbins@gentoo.org), President/CEO:
The ReiserFS 3.6.x (the version included as part of Linux 2.4) was designed and developed by Hans Reiser and his team of developers at Namesys.
The ReiserFS uses a specially optimized b* balanced tree (one per
filesystem) to organize all of its filesystem data. This offers a
nice performance boost, as well as easing artificial restrictions on
filesystem layouts. It's now possible to have a directory that
contains 100,000 other directories, for example. Another benefit of
using a b*tree is that ReiserFS, like most other next-generation
filesystems, dynamically allocates inodes as needed rather than
creating a fixed set of inodes at filesystem creation time. This
helps the filesystem to be more flexible to the various storage
requirements that may be thrown at it, while at the same time
allowing for some additional space-efficiency.
ReiserFS also has a host of features aimed specifically at improving small-file performance. Unlike ext2, ReiserFS doesn't allocate storage space in fixed 1K or 4K blocks. Instead, it can allocate the exact size it needs. ReiserFS also includes some special optimizations centered around tails, a name for files, and end portions of files, that are smaller than a filesystem block. In order to increase performance, ReiserFS is able to store files inside the b*tree leaf nodes themselves, rather than storing the data somewhere else on the disk and pointing to it.
This does two things. First, it dramatically
increases small file performance. Since the file data and the
stat_data (inode) information are stored right next to each other,
they can normally be read with a single disk IO operation. Second,
ReiserFS is able to pack the tails together, saving a lot of space.
In fact, a ReiserFS filesystem with tail packing enabled (the
default) can store six percent more data than the equivalent ext2
filesystem, which is amazing in itself.
However, tail packing does cause a slight
performance hit since it forces ReiserFS to repack data as files are
modified. For this reason, ReiserFS tail packing can be turned off,
allowing the administrator to choose between good speed and space
efficiency, or opt for even more speed at the cost of some storage
capacity.
Other Filesystems
Of course, we also have many other filesystems, such as those from Veritas, PolyServe, and other specialized vendors.
Oracle and Filesystems
Generally, Oracle prefers filesystems that don't do logging or journaling for the filesystems that contain datafiles, so if you use EXT2, EXT3, or reiserFS you need to mount them with journaling or logging off. If RAW filesystems are used you avoid some overhead, but usually you can only obtain a 2-5 percent performance increase over modern optimized filesystems.
RAID—Redundant Arrays of Inexpensive Disks
The main strengths of RAID technology are its dependability and IO bandwidth. For example, in a RAID5 array, the data is stored along with checksums and other information about the contents of each disk in the array. If one disk is lost, the others can use this stored information to re-create the lost data. However, this on-the-fly rebuild of data causes a massive hit on performance. In RAID 1, RAID 10, and RAID 01, failed disks are immediately replaced by their mirror with no performance hit. This makes RAID very attractive. RAID 5 has the same advantages as shadowing and striping at a lower cost. It has been suggested that if the manufacturers would use slightly more expensive disks (RASMED, a redundant array of slightly more expensive disks) performance gains could be realized. A RAID system appears as one very large, reliable disk to the CPU. There are several levels of RAID to date:
· RAID 0. Known as disk striping.
· RAID 1. Known as disk shadowing or mirroring.
· RAID 1/0. A combination of RAID 0 and RAID 1. It may also be called RAID 10 or RAID 01 depending on whether the disks are striped then mirrored, or mirrored then striped. It is generally felt that RAID 10 performs better than RAID 01.
· RAID 2. Data is distributed in extremely small increments across all disks, and one or more disks are added that contain a Hamming code for redundancy. RAID 2 is not considered commercially viable due to the added disk requirements (10 to 20 percent must be added to allow for the Hamming disks).
· RAID 3. This also distributes data in small increments but adds only one parity disk. This results in good performance for large transfers, but small transfers show poor performance.
· RAID 4. In order to overcome the small-transfer performance penalties in RAID 3, RAID 4 uses large data chunks distributed over several disks and a single parity disk. This results in a bottleneck at the parity disk. Due to this performance problem RAID 4 is not considered commercially viable. RAID 3 and 4 are usually used for video streaming technology or large LOB storage.
· RAID 5. This solves the parity-disk bottleneck by distributing the parity data across the disk array. The major problem is that it requires several write operations to update parity data. The performance hit is only moderate, and the other benefits may outweigh this minor problem. However, the penalty for writes can be over 20% and must be weighed against the benefits.
· RAID 6. This adds a second redundancy disk that contains error-correction codes. Read performance is good due to load balancing, but write performance suffers because RAID 6 requires more writes than RAID 5 for data updates.
For the money, I would suggest RAID 0/1 or RAID 1/0, that is, striped and mirrored. It provides nearly all of the dependability of RAID 5 and gives much better write performance. You will usually take at least a 20 percent write performance hit using RAID 5. For read-only applications RAID 5 is a good choice, but in high-transaction/high-performance environments the write penalties may be too high. Figure 18 shows RAID 1-0 or 0-1, depending on whether you stripe and then mirror, or mirror first and then stripe. In most situations you get better performance from RAID 1-0 (mirroring then striping).
Figure 18: Mirroring and Striping
Table 1 shows how Oracle suggests RAID should be used with Oracle database files.

RAID  Type of RAID                   Control File  Database File                   Redo Log File  Archive Log File
----  -----------------------------  ------------  ------------------------------  -------------  ----------------
0     Striping                       Avoid         OK                              Avoid          Avoid
1     Shadowing                      Best          OK                              Best           Best
1+0   Striping and Shadowing         OK            Best                            Avoid          Avoid
3     Striping with static parity    OK            OK                              Avoid          Avoid
5     Striping with rotating parity  OK            Best if RAID 0-1 not available  Avoid          Avoid

Table 1: RAID Recommendations (From MOSC NOTE: 45635.1)
Disk Speed and Size Selection
This all points to the fact that in order to get the maximum performance from your disk system you must understand the IO characteristics (the profile) of your database system, be it Oracle, SQL Server, Informix, UDB, or MySQL. You must tune your disk architecture to support the expected IO profile, and you must tune the database system to take advantage of the disk architecture. For example, an Oracle database has different IO characteristics depending on whether it is reading or writing data and what type of read or write it is doing. Other databases have fixed read/write sizes.
You must determine the IO profile for your database and then use it to determine the maximum and minimum IO size. The IO profile will tell you what percentage of IO is large IO and what percentage is small IO; it will also give you the expected IO rate in IO/second.
Once you have the IO per second you can
determine the IO capacity (number of drives) needed to support your
database.
The first rule of tuning your disk system is:
Size first
for IO capacity, then for volume.
Some back-of-the-envelope calculations for the number of spindles needed to support a given IO rate are:
RAID10 with active read/write to all mirrors:
MAX(CEILING(IOR/(NSIOR*M),M),2*M)
Where:
· IOR is the expected maximum IO rate in IO/sec
· NSIOR is the average non-sequential IO rate of the disks in IO/sec (range of 90-100 for RAID10)
· M is the number of mirrors
(That is, the IO rate divided by the average non-sequential IO rate per disk times the number of mirrors, rounded up to the nearest multiple of M, with a floor of 2*M.)
RAID5 assuming 1 parity disk:
MAX((IOR/CNSIOR)+1,3)
Where:
· IOR is the expected maximum IO rate in IO/sec
· CNSIOR is the corrected average non-sequential IO rate of the disks in IO/sec (range of 60-90 for RAID5)
(That is, the IO rate divided by the corrected average non-sequential IO rate per disk, plus 1 disk for the parity disk, with a floor of 3 disks.)
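As a worked sketch of both formulas in SQL (all input values here are hypothetical examples, not recommendations: IOR = 2000 IO/sec, NSIOR = 95 IO/sec, M = 2 mirrors, CNSIOR = 75 IO/sec):

select greatest(ceil(2000/(95*2)/2)*2, 2*2) raid10_disks,
       greatest(ceil(2000/75)+1, 3)         raid5_disks
from dual;

This returns 12 disks for the RAID10 case (2000/190 = 10.5, rounded up to the next multiple of 2) and 28 for the RAID5 case (2000/75 = 26.7, rounded up, plus one parity disk).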
The correction to the non-sequential IO rate for RAID5 is due to the up to 400% penalty on writes (writes take 4 times longer than reads on the same drive). In some cases on RAID5 I have seen this go as high as 6400% (writes taking 64 times as long as reads for the same file) when combined with other problems such as fragmentation.
A case in point: early RAID architectures utilized the "stripe shallow and wide" mindset, where files were broken into small pieces and spread over a large number of disks. For example, stripe units per disk as small as 8K were common. Many systems read in IO sizes of 64K or larger. This means that to satisfy a single IO request, 8 disks of the RAID set were required; if there were fewer than 8 disks in the set, disks would have to undergo 2 or more IOs to satisfy the request. This sounds fine if you are talking about a single user wanting to read a large file from a large group of disks very quickly, but what happens when you have 10 or 100 or 1000 concurrent users all wanting to do the same thing?
Tune for Concurrency
This problem with concurrent access and RAID arrays is one of the most prevalent in the industry. The ubiquitous IO wait is usually the predominant wait event in any database system, simply because IO to memory is in the nanosecond range while IO to disk is in the millisecond range; when you add in blocked access due to multi-disk IO requests, you get a snowball effect that can cripple your IO subsystem.
Array manufacturers have begun to recognize this
concurrent access problem and have increased the base stripe unit
per disk to 64K, matching the IO unit for many systems. Of course
now systems such as SUN and Windows utilize maximum IO sizes of 1
megabyte or larger, so again the array manufacturers are playing
catch up to the server manufacturers.
So what is our second rule of tuning disks?
Based on the above information the rule is:
Always
ensure that the primary IO size for your database system is matched
to the IO size of the disk array system.
Of course the inverse also holds true:
Always
match the stripe unit per disk to the expected majority IO request
from your (database) application.
In the 1990s Peter Chen of the University of California at Berkeley published a series of papers on tuning disk array stripe unit size based on expected concurrency. In these papers, Chen and his associates determined that the IO speed (as measured by average seek time) and IO rate (as measured in megabytes per second) for a disk determine the stripe size for performance in an array, even when the number of concurrent accesses is not known. Three formulae were derived from these papers.
For non-RAID5 arrays when concurrency is known:
SU = (S*APT*DTR*(CON-1)*1.024)+0.5K
Where:
· SU - striping unit per disk
· S - concurrency slope coefficient (~0.25)
· APT - average positioning time (milliseconds)
· DTR - data transfer rate (megabytes/sec)
· CON - number of concurrent users
· 1.024 = 1s/1000ms * 1024K/1M (conversion factor for units)
So for a drive that has an average seek time of 5.6 ms and a transfer rate of 20 Mbytes/second, the calculated stripe unit for a 20-concurrent-user base would be:
(0.25*5.6*20*(19)*1.024)+0.5 = 545K (or ~512K)
For a system where you don't know the concurrency, the calculation becomes:
SU = (2/3*APT*DTR)
So for the same drive:
2/3*5.6*20*1.024 = 76.46K, so rounding up gives ~128K, or rounding down, 64K
And from Chen's final paper, the formula for RAID5 arrays is:
SU = (1/2*APT*DTR)
So for the same drive:
0.5*5.6*20*1.024 = 57.34K (rounding up, 64K)
The values for average access time and transfer rate used in these examples are actually fairly low when compared to more advanced drives, so the stripe sizes shown above are probably low by at least a factor of 2 or more. I say this because while average seek times drop, transfer rates increase; for example, on an Ultra3 SCSI 15K drive the spec for average seek may drop to 4.7 ms while the transfer rate leaps to 70 Mbytes per second. So the overall value of the combined factor goes from 112 to 329, nearly a threefold increase.
The 100% Myth
Many system administrators are guilty of perpetuating the 100% myth. This myth states that you don't need more assets (be it disk, CPU, or memory) until the existing asset is 100% utilized. This leads to performance issues in the area of disks. Due to disk physics, the best performance for a disk is at the outer edges; once you get toward the inner sectors, performance decreases because of the distance the head must travel to read the data and other factors. In the good old days, administrators spent much time positioning frequently used files on the outer edges of disks.
While physically positioning files on disks is difficult if not impossible in modern RAID systems, you should endeavor not to fill the disks to 100% of capacity. Some experts say don't use more than 30% if you want maximum performance, others 50%. I say it depends on how the system is used, the operating system, and the RAID array system. For example, the Clariion from EMC promises to tune the placement of files such that frequently used files are in the best locations.
So, what can we summarize about disk size and speed? Get the fastest drives you can, and plan capacity based on concurrency requirements as well as IO requirements. The more fast disks, the better.
Improper Initialization File Parameter Settings
For Oracle7, version 7.3, there are 154 initialization parameters; for Oracle8, version 8.0.5, there are 184. In Oracle8i there are 194. In Oracle9i version 9.0.1 there are 251, and in 9iR2, 257. In 10g the number of documented parameters actually dropped to 254, but the number of undocumented parameters increased, from 583 in 9iR2 to 918 in 10gR1.
Fortunately there are very few that you need to adjust to tune Oracle. Table 2 lists the major tuning parameters, but it is by no means a complete list.
Parameter                          Definition
---------------------------------  -----------------------------------------------------------------
create_bitmap_area_size            Sets the memory area for bitmap creation
bitmap_merge_area_size             Sets the memory area used for bitmap merges
create_stored_outlines             Allows Oracle to create stored outlines
cursor_sharing                     Sets automated literal replacement
db_file_multiblock_read_count      Sets the read size for full table and index scans
filesystemio_options               Sets direct or AIO options for filesystem reads
optimizer_index_caching            Used to tune index access
optimizer_index_cost_adj           Used to tune index access
query_rewrite_enabled              Allows queries to be rewritten to use materialized views or FBIs
query_rewrite_integrity            Sets the criteria for when MVs are used
session_cached_cursors             Sets the number of cached cursors at the session level
sga_max_size                       Sets the maximum SGA memory size
sga_target                         Sets the baseline SGA memory size
star_transformation_enabled        Allows Oracle to use star transformation
transactions_per_rollback_segment  Sets the number of transactions that will use a single rollback
                                   (undo) segment
pga_aggregate_target               Sets the total PGA memory usage limit
workarea_size_policy               Determines how work areas (sort and hash) are sized
buffer_pool_keep                   Sets the size of the keep buffer pool for tables and indexes
buffer_pool_recycle                Sets the size of the recycle buffer pool for tables and indexes
cursor_space_for_time              Sacrifices memory for cursor storage space
db_16k_cache_size                  Sets the size of the 16K block cache
db_2k_cache_size                   Sets the size of the 2K block cache
db_32k_cache_size                  Sets the size of the 32K block cache
db_4k_cache_size                   Sets the size of the 4K block cache
db_8k_cache_size                   Sets the size of the 8K block cache
db_block_size                      Sets the default block size for the database
db_cache_size                      Sets the default cache size

Table 2: Tuning Parameters
How to determine proper setpoints for all of these is beyond the scope of this paper. However, the Oracle tuning guides provide many good tips, as does the Burleson Consulting web site: www.remote-dba.net.
Improper PGA Setup
I don't believe anyone out there thinks disk-based sorts and hashes are good things. A disk-based operation will take anywhere from 17 to hundreds of times as long as a memory-based operation, depending on buffering, IO bandwidth, memory, and disk speeds.
Oracle provides AWRRPT or statspack reports to track and show the number of sorts. Unfortunately, hashes are not so easily tracked. Oracle tracks disk and memory sorts, the number of sort rows, and other sort-related statistics. Hashes, on the other hand, can usually only be tracked through execution plans for cumulative values and through various views for live values.
In versions prior to 9i the individual areas were set using the sort_area_size and hash_area_size parameters; from 9i onward the PGA_AGGREGATE_TARGET parameter allows automated setting of the sort and hash areas. For currently active sorts or hashes, the script in Figure 19 can be used to watch the growth of temporary areas.
For complete scripts, see my complete Oracle
script collection at
www.dba-oracle.com/oracle_scripts.htm.
column now format a14
column operation format a15
column dt new_value td noprint
set feedback off
select to_char(sysdate,'ddmonyyyyhh24miss') dt from dual;
set lines 132 pages 55
@title132 'Sorts and Hashes'
spool rep_out\&&db\sorts_hashes&&td
select sid, work_area_size, expected_size, actual_mem_used, max_mem_used, tempseg_size,
to_char(sysdate,'ddmonyyyyhh24miss') now,
operation_type operation
SEE CODE DEPOT FOR FULL SCRIPT
/
spool off
clear columns
set lines 80 feedback on
ttitle off
Figure 19: Live Sorts and Hashes Report
Figure 20 shows an example output from this report.

Date: 01/04/06                                                   Page: 1
Time: 01:27 PM              Sorts and Hashes                     SYS
                            whoville database

      Work Area  Expected  Actual Mem   Max Mem  Tempseg
 SID       Size      Size        Used      Used     Size  Now              Operation
---- ---------- --------- ----------- ---------  -------  ---------------  ---------------
1176    6402048   6862848           0         0           04jan2006132711  GROUP BY (HASH)
 582     114688    114688      114688    114688           04jan2006132711  GROUP BY (SORT)
 568    5484544   5909504      333824    333824           04jan2006132711  GROUP BY (HASH)
1306    3469312   3581952     1223680   1223680           04jan2006132711  GROUP BY (HASH)

Figure 20: Example Sorts and Hashes Report
As you can see, the whoville database had no hashes going to disk at the time the report was run. We can also look at the cumulative statistics in the v$sysstat view for cumulative sort data.

Date: 12/09/05                              Page: 1
Time: 03:36 PM         Sorts Report         PERFSTAT
                       sd3p database

Type Sort                 Number Sorts
-------------------- -----------------
sorts (memory)              17,213,802
sorts (disk)                       230
sorts (rows)             3,268,041,228

Figure 21: Cumulative Sorts
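The figure above is simply formatted output of the base statistics; you can pull the same three numbers directly with a query like:

select name, value from v$sysstat where name like 'sorts%';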
Another key indicator that hashes are occurring is excessive IO to the temporary tablespace when there are few or no disk sorts.
The PGA_AGGREGATE_TARGET is the target total amount of space for all PGA memory areas. However, only 5%, or a maximum of 200 megabytes, can be assigned to any single process. The limit for PGA_AGGREGATE_TARGET is (supposedly) 4 gigabytes; however, you can increase the setting above this point. The 200 megabyte limit is set by the _pga_max_size undocumented parameter; this parameter can be reset, but only under the guidance of Oracle support. But how big should PGA_AGGREGATE_TARGET be? The AWRRPT report in 10g provides a sort histogram which can help in this decision. Figure 22 shows an example of this histogram.
PGA Aggr Target Histogram                 DB/Inst: OLS/ols  Snaps: 73-74
-> Optimal Executions are purely in-memory operations

    Low     High
Optimal  Optimal     Total Execs  Optimal Execs  1-Pass Execs  M-Pass Execs
-------  -------  --------------  -------------  ------------  ------------
     2K       4K       1,283,085      1,283,085             0             0
    64K     128K           2,847          2,847             0             0
   128K     256K           1,611          1,611             0             0
   256K     512K           1,668          1,668             0             0
   512K    1024K          91,166         91,166             0             0
     1M       2M             690            690             0             0
     2M       4M             174            164            10             0
     4M       8M              18             12             6             0
-------------------------------------------------------------

Figure 22: Sort Histogram
In this case we are seeing 1-pass executions, indicating disk sorts are occurring, with the maximum size being in the 4M to 8M range. For an 8M sort area the PGA_AGGREGATE_TARGET should be set at 320 megabytes (sorts get 0.5*(0.05*PGA_AGGREGATE_TARGET)). For this system the setting was 160 megabytes, so 4 megabytes was the maximum sort size; as you can see, we were seeing 1-pass sorts in the 2-4M range as well, even at 160M.
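Based on that arithmetic, the adjustment for this system would be something like the following (the 320M figure is derived from the 8M maximum sort seen in the histogram above; pga_aggregate_target is dynamic, so no restart is needed):

ALTER SYSTEM SET pga_aggregate_target = 320M SCOPE=BOTH;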
By monitoring the real-time or live hashes and sorts and looking at the sort histograms from the AWRRPT reports, you can get a very good idea of the needed PGA_AGGREGATE_TARGET setting. If you need sort areas larger than 200 megabytes, you may need to get approval from Oracle support through the iTAR process to set the _pga_max_size parameter to greater than 200 megabytes.
Summary
This paper has presented the major tuning issues I have seen at many sites during tuning engagements. I have presented ways of determining if the issues exist and how to determine settings to help mitigate them in an active database.