Welcome to My Nightmare - The Common Performance Errors in Oracle Databases
By Michael R. Ault
Improper Disk Setup
Under the heading of improper disk setup there are many subtopics, including:
· Interface issues
· Mount options
· Filesystem choices
· RAID setup
· Disk size and speed choices
Let's look at each of these in the Oracle
environment.
Interface Issues
Generally, interface issues resolve to bandwidth issues. A case in point: a major bakery had upgraded their system, putting in more and faster CPUs, higher-speed disks, and newer hardware overall. They calculated that on average they used only 75% of the bandwidth on the old system, so they reduced the number of HBAs from 12 dual-channel to 8 dual-channel.
After the upgrade performance looked great, until the end-of-month processing crunch, when performance suddenly dropped to half of what it had been before. Investigation showed that while on average they only needed 75% of the bandwidth of the 12 HBAs, during end-of-month, end-of-quarter, and end-of-year processing they actually required more. Luckily for them, the HBAs in the old system were compatible and they had the needed expansion slots to add the needed HBAs to the new system. With the 4 additional HBAs in place they quadrupled their performance.
The other major choice in interfaces is interface type: SCSI, Fibre, or Fabric. Unfortunately there is no simple answer; you need to examine your system and, if IO timing is bad, find out whether it is related to contention or bandwidth issues.
Note that when you monitor IO timing you need to look at it from Oracle's perspective, that is, from the time Oracle requests the IO to the time the IO is received by the Oracle system, not strictly at the operating system level. If you see a large difference between what Oracle says IO timings are and what the OS says, you need to track down where the time is being consumed between the disk packs and the Oracle database.
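A quick sanity check of the Oracle-side timings can be done with a query like the following (a minimal sketch; v$filestat reports READTIM in hundredths of a second, and only when timed statistics are enabled):

select f.file#, d.name, f.phyrds,
       round(f.readtim/greatest(f.phyrds,1),3) avg_read_cs
from v$filestat f, v$datafile d
where f.file# = d.file#;

Compare these per-file averages with what iostat or sar reports for the same devices; a large gap points to time being lost between the OS and Oracle.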
Mount Options
Essentially, Oracle doesn't like any mount option involving logging on its datafile mount points. Anything you can do to make the disk look RAW to Oracle is a good thing. This means nologging, noatime, async, aio, and many other mount options need to be considered, depending on your operating system. Dramatic improvements in performance have been reported from just changing the mount options on the drives or filesystems that support your Oracle datafiles. Generally it is not advised to switch to nologging-type options on filesystems where non-Oracle or binary files (executables) are stored.
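For example, on Linux these options are applied through /etc/fstab. This is only an illustrative sketch; the device name and mount point are hypothetical, and each option should be verified against your platform's mount man page before use:

/dev/sdb1  /u02/oradata  ext3  noatime,async  0 0

Per the caution above, filesystems holding executables or other non-Oracle files should keep their default options.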
In UNIX you can control whether a filesystem uses buffered or unbuffered IO. With Oracle, the use of a buffered filesystem is redundant and dangerous. An example of the dangers of a buffered filesystem with Oracle is when power is lost. The buffer in a buffered filesystem depends on the cache battery to provide enough power for the buffer to be written to disk before the disk spins down. However, many shops fail to monitor the cache battery lifetime limitations or fail to change the batteries at all. This can result in loss of data in a buffered filesystem on loss of power.
You can turn off buffered writes in several ways (buffered reads aren't an issue, but you should always use write-through caching). One is to mount the filesystems used with Oracle files as non-buffered using such options as:
· AIX: "dio", "rbrw", "nointegrity"
· SUN: "delaylog", "mincache=direct", "convosync=direct", "nodatainlog"
· LINUX: "async", "noatime"
· HP: Use VxFS with: "delaylog", "nodatainlog", "mincache=direct", "convosync=direct"
Using Direct IO at the Oracle Level
For information about Oracle direct I/O, refer to this URL by Steve Adams:
· http://www.ixora.com.au/notes/filesystemio_options.htm
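Direct IO can also be requested from the Oracle side with the filesystemio_options initialization parameter. A minimal sketch (the parameter is static, so the change takes effect at the next instance restart; SETALL asks for both direct and asynchronous IO where the platform supports them):

ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;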
Checking Your Server
Methods for configuring the OS will vary depending on the operating system and filesystem in use. Here are some examples of quick checks that anyone can perform to ensure that you are using direct I/O:
· Solaris - Look for a "forcedirectio" option. Oracle DBAs find this option often makes a huge difference in I/O speed for Sun servers; see the Sun documentation for details.
· AIX - Look for a "dio" option. Here is a great link for AIX direct I/O: http://www-106.ibm.com/developerworks/eserver/articles/DirectIO.html
· Veritas VxFS (including HP-UX, Solaris and AIX) - Look for "convosync=direct". It is also possible to enable direct I/O on a per-file basis using Veritas QIO; refer to the "qiostat" command and corresponding man page for hints. For HP-UX, see Oracle on HP-UX - Best Practices.
· Linux - Linux systems support direct I/O on a per-filehandle basis (which is much more flexible), and I believe Oracle enables this feature automatically. Someone should verify at what release Oracle started to support this feature (it is called O_DIRECT). See Kernel Asynchronous I/O (AIO) Support for Linux and this great OTN article: Talking Linux: OCFS Update.
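For the Solaris case above, a quick check and an example mount might look like the following (the device and mount point are hypothetical; adjust to your own layout):

mount -v | grep forcedirectio
mount -o forcedirectio /dev/dsk/c0t1d0s4 /u01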
I'm Using LINUX and ATA Arrays, no Stress, but
IO is slow!
Don't panic! Most LINUX kernels will take the
default ATA interface setpoints that were the "standard" when the
kernel was built (or even older ones). This can be corrected.
In LINUX there is the hdparm command, which allows you to reset how ATA drives are accessed by the operating system. Using hdparm is simple, and with it I have seen 300% improvements in access speeds on various ATA drives. Let's go through a quick tuning sequence.
First, we will use the hdparm command with no arguments but the full path to the disk device:
[root@aultlinux2 root]# hdparm /dev/hdb
/dev/hdb:
multcount = 16 (on)
IO_support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 0 (off)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 77557/16/63, sectors = 78177792, start = 0
The hdparm command with no arguments but the disk device gives the current settings for the drive. You should compare this to the specifications for your drive. You may find that direct memory access (DMA) is not being used, readahead is too small, you are only using 16-bit IO when you should be using 32-bit, etc.
Next, let's do a basic benchmark of the current performance of the drive using the hdparm -Tt option (for all options, do a "man hdparm" at the command line):
[root@aultlinux2 root]# hdparm -Tt /dev/hdb
/dev/hdb:
Timing buffer-cache reads: 128 MB in 1.63 seconds = 78.53 MB/sec
Timing buffered disk reads: 64 MB in 14.20 seconds = 4.51 MB/sec
Now let's adjust the settings. The -c option, when set to 1, enables 32-bit IO; the -u option is used to get or set the interrupt-unmask flag for the drive. A setting of 1 permits the driver to unmask other interrupts during processing of a disk interrupt, which greatly improves Linux's responsiveness and eliminates "serial port overrun" errors. Use this feature with caution on older kernels: some drive/controller combinations do not tolerate the increased I/O latencies possible when this feature is enabled, resulting in massive filesystem corruption. However, most versions of Linux (RedHat 2.1 and greater) using modern controllers don't have this issue. The -p option is used to autoset the PIO mode, and -d is used to set or unset the DMA mode.
[root@aultlinux2 root]# hdparm -c1 -u0 -p -d0 /dev/hdb
/dev/hdb:
attempting to set PIO mode to 0
setting 32-bit IO_support flag to 1
setting unmaskirq to 0 (off)
setting using_dma to 0 (off)
IO_support = 1 (32-bit)
unmaskirq = 0 (off)
using_dma = 0 (off)
So we turned on 32-bit IO support, set the PIO mode, and left DMA off. Let's see the resulting performance change using our previous -Tt option.
[root@aultlinux2 root]# hdparm -Tt /dev/hdb
/dev/hdb:
Timing buffer-cache reads: 128 MB in 1.63 seconds = 78.53 MB/sec
Timing buffered disk reads: 64 MB in 9.80 seconds = 6.53 MB/sec
So we didn't change the buffer-cache read timings; however, we improved the buffered disk reads by 45%. Let's tweak some more and see if we can do better. The -m option sets the multi-sector IO count for the drive, the -c option sets the 32-bit option, -X sets the access mode to mdma2, the -d1 option turns on direct memory access, the -a8 option improves readahead performance for large reads, and -u1 turns on the unmasking operation described above.
[root@aultlinux2 root]# hdparm -m16 -c3 -X mdma2 -d1 -a8 -u1 /dev/hdb
/dev/hdb:
setting fs readahead to 8
setting 32-bit IO_support flag to 3
setting multcount to 16
setting unmaskirq to 1 (on)
setting using_dma to 1 (on)
setting xfermode to 34 (multiword DMA mode2)
multcount = 16 (on)
IO_support = 3 (32-bit w/sync)
unmaskirq = 1 (on)
using_dma = 1 (on)
readahead = 8 (on)
So now let's see what we have done to
performance using the –Tt option.
[root@aultlinux2 root]# hdparm -Tt /dev/hdb
/dev/hdb:
Timing buffer-cache reads: 128 MB in 1.56 seconds = 82.05 MB/sec
Timing buffered disk reads: 64 MB in 4.29 seconds = 14.92 MB/sec
Not bad! We improved buffer-cache reads by 5% and buffered disk reads by 231%!
These options can then be loaded into a startup file
to make them part of the system startup.
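For example, on many Linux distributions you could append the final hdparm command from above to /etc/rc.local so the settings are reapplied at every boot (a sketch only; re-test with -Tt after any kernel or controller change before trusting saved settings):

/sbin/hdparm -m16 -c3 -X mdma2 -d1 -a8 -u1 /dev/hdb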
I'm Really Feeling SCSI About Disk Performance, What Then?
Sorry for the bad pun (well, actually I'm not). What can be done with SCSI interfaces? To tell the truth, not a lot; however, there are some items you may find useful. Most interfaces will buffer commands and issue them in batches; for example, most SCSI interfaces use a 32-command buffer that stacks commands until it has 32 of them and then fires them off. This can be reset in LINUX using options in the modules.conf file for the SCSI interface module.
In other UNIX flavors there are many settings which can be changed, but an exact understanding of the interface and its limitations, as well as of current system loads, is needed before changing any of the SCSI settings. If you feel you need to have them checked, ask your SA.
Disk Stress in a Nutshell
In summary, to determine if a disk or array is undergoing IO-related stress, perform an IO balance and an IO timing analysis. If the IO timing analysis shows excessive read or write times, investigate the causes. Generally speaking, poor IO timings will result when:
· A single disk exceeds 110-150 IOs per second
· An entire multi-read capable RAID10 array exceeds #MIRRORS*#DPM*110 IOs per second
· An entire non-multi-read capable RAID10 array exceeds #DPM*110 IOs per second
· A RAID5 array exceeds (#DISKS-1)*66 IOs per second
To avoid poor IO timings:
· Make sure Oracle is using direct IO at both the OS and Oracle levels
· Make sure your disk interface is tuned to perform optimally
*DPM = disks per mirror
Seeing Stress from the Oracle Side
Disk stress will show up on the Oracle side as excessive read or write times. Filesystem stress is shown by calculating the IO timings as shown in Figure 16.
For complete scripts, see my complete Oracle script collection at www.dba-oracle.com/oracle_scripts.htm.
rem Purpose: Calculate IO timing values for datafiles
col name format a65
col READTIM/PHYRDS heading 'Avg|Read Time' format 9,999.999
col WRITETIM/PHYWRTS heading 'Avg|Write Time' format 9,999.999
set lines 132 pages 45
start title132 'IO Timing Analysis'
spool rep_out\&db\io_time
select f.FILE#, d.name, PHYRDS, PHYWRTS, READTIM/PHYRDS, WRITETIM/PHYWRTS
from v$filestat f, v$datafile d
SEE CODE DEPOT FOR FULL SCRIPT
and phyrds>0 and phywrts>0
union
select a.FILE#, b.name, PHYRDS, PHYWRTS, READTIM/PHYRDS, WRITETIM/PHYWRTS
from v$tempstat a, v$tempfile b
where a.file#=b.file#
and phyrds>0 and phywrts>0
order by 5 desc
/
spool off
ttitle off
clear col
Figure 16: IO Timing Report
An example of the output from Figure 16 is shown in Figure 17.

Date: 11/20/05                                                   Page: 1
Time: 11:12 AM              IO Timing Analysis                   PERFSTAT
                            whoraw database

FILE# NAME             PHYRDS  PHYWRTS READTIM/PHYRDS WRITETIM/PHYWRTS
----- -------------- -------- -------- -------------- ----------------
   13 /dev/raw/raw19    77751   102092     76.8958599       153.461829
   33 /dev/raw/raw35    32948    52764     65.7045041       89.5749375
    7 /dev/raw/raw90   245854   556242     57.0748615       76.1539869
   54 /dev/raw/raw84   208916   207539     54.5494409       115.610912
   40 /dev/raw/raw38     4743    27065     38.4469745       47.1722889
   15 /dev/raw/raw41     3850     7216     35.6272727       66.1534091
   12 /dev/raw/raw4    323691   481471     32.5510193       100.201424
   16 /dev/raw/raw50    10917    46483     31.9372538       74.5476626
   18 /dev/raw/raw24     3684     4909     30.8045603       71.7942554
   23 /dev/raw/raw58    63517    78160     29.8442779       84.4477866
    5 /dev/raw/raw91   102783    94639     29.1871516       87.8867909

Figure 17: Example IO Timing Report
As you can see from Figure 17, we are looking at an example report from a RAW configuration using single disks. Notice how both read and write times exceed even the rather generous good-practice limits of 10-20 milliseconds for a disk read. In my experience, reads should not exceed 5 milliseconds, and usually with modern buffered reads, 1-2 milliseconds. Oracle is more tolerant of write delays since it uses a delayed write mechanism, so 10-20 milliseconds on writes will normally not cause significant Oracle waits. However, the smaller you can get read and write times, the better!
Filesystems
In UNIX or LINUX you have multiple filesystem options: RAW, JFS, ext2, ext3, reiserFS, OCFS. You need to use the best performing filesystem, usually RAW, ext3, or reiserFS. Oracle OCFS is also viable.
RAW Filesystems
RAW is probably the simplest filesystem to understand. In Unix systems, a disk can be configured to contain a single contiguous chunk of space, or it can be configured to contain multiple separate chunks of space. Each chunk of space is called a partition or section. Typically a partition would be formatted to contain a filesystem, allowing a hierarchical structure of directories and files to be created. A partition that does not contain a filesystem is called a raw partition. A database extent stored on a raw partition is called a raw extent.
Advantages
The primary advantage to using raw extents is
the possibility of increased performance compared to extents stored
in Unix files. The performance benefits occur because:
· File system overhead and address translation is eliminated. Database block addresses map directly to raw partition addresses.
· Read operations transfer data directly from the disk controller to shared memory, and write operations transfer directly from shared memory to the disk controller. The Unix buffer pool is not used, avoiding the need to copy data from a Unix page buffer to shared memory and vice versa.
Disadvantages
Although they may provide increased performance,
raw extents have several major disadvantages. You should not take
the decision to use them lightly. Among the disadvantages are:
· Raw disk partitions are cumbersome to configure and manage. You have to keep track of what is on each raw partition yourself. You can easily make mistakes and destroy the contents of a raw partition.
· The number of raw partitions available on a disk may be quite limited. For example, on Solaris 2.5, a disk can have at most 8 partitions.
· Moving raw partitions from one disk to another may be difficult or impossible.
· Raw partition sizes are fixed. Allocating more space to a partition requires backing up all the partitions on a disk, designing a new partition layout, and restoring all saved data.
· You cannot use the same operating system backup utilities to make backups of files and raw partitions.
· You may get worse performance with raw devices than you do with files. The performance difference is highly dependent on the operating system's implementation of the file system. Many modern Unix systems, like Digital Unix, AIX 4.2, HP-UX 10, Solaris 2.6, and others, have highly advanced filesystem implementations that include many performance optimizations. Older Unix systems, such as Unix System V Release 4, are somewhat less advanced, and raw partitions may provide better performance on these systems.
· You may not be able to use logical volume managers with raw partitions. Not all Unix systems support logical volumes composed of raw partitions.
The EXT2 Filesystem
Ext2fs is based on the Extfs code with many reorganizations and many improvements. It has been designed with evolution in mind and contains space for future improvements. The Second Extended File System was designed and implemented to fix some problems present in the first Extended File System.
· Ext2fs supports standard Unix file types: regular files, directories, device special files and symbolic links.
· Ext2fs is able to manage filesystems created on really big partitions. While the original kernel code restricted the maximal filesystem size to 2 GB, recent work in the VFS layer has raised this limit to 4 TB. Thus, it is now possible to use big disks without the need to create many partitions.
· Ext2fs provides long file names. It uses variable length directory entries. The maximal file name size is 255 characters. This limit could be extended to 1012 if needed.
· Ext2fs reserves some blocks for the super user (root). Normally, 5% of the blocks are reserved. This allows the administrator to recover easily from situations where user processes fill up filesystems.
"Advanced" Ext2fs features
In addition to the standard Unix features,
Ext2fs supports some extensions which are not usually present in
Unix filesystems.
· File attributes allow users to modify the kernel behavior when acting on a set of files. One can set attributes on a file or on a directory. In the latter case, new files created in the directory inherit these attributes.
· BSD or System V Release 4 semantics can be selected at mount time.
· BSD-like synchronous updates can be used in Ext2fs.
· Ext2fs allows the administrator to choose the logical block size when creating the filesystem. Block sizes can typically be 1024, 2048 and 4096 bytes. Using big block sizes can speed up I/O since fewer I/O requests, and thus fewer disk head seeks, need to be done to access a file.
· Ext2fs implements fast symbolic links.
· Ext2fs keeps track of the filesystem state.
· Always skipping filesystem checks may sometimes be dangerous, so Ext2fs provides two ways to force checks at regular intervals. A mount counter is maintained in the superblock. Each time the filesystem is mounted in read/write mode, this counter is incremented. When it reaches a maximal value (also recorded in the superblock), the filesystem checker forces the check even if the filesystem is "clean".
· Mount options can also be used to change the kernel error behavior.
· An attribute allows users to request secure deletion of files. When such a file is deleted, random data is written in the disk blocks previously allocated to the file. This prevents malicious people from gaining access to the previous content of the file by using a disk editor.
· Last, new types of files inspired by the 4.4 BSD filesystem have recently been added to Ext2fs. Immutable files can only be read: nobody can write or delete them. This can be used to protect sensitive configuration files. Append-only files can be opened in write mode but data is always appended at the end of the file. Like immutable files, they cannot be deleted or renamed. This is especially useful for log files, which can only grow.
EXT3 Filesystem
According to Dr. Stephen Tweedie:
The ext3 filesystem is a journaling extension to the standard ext2 filesystem on Linux. Journaling results in massively reduced time spent recovering a filesystem after a crash, and is therefore in high demand in environments where high availability is important, not only to improve recovery times on single machines but also to allow a crashed machine's filesystem to be recovered on another machine when we have a cluster of nodes with a shared disk.
… the real objective in EXT3 was this simple thing: availability. When something goes down in EXT3, we don't want to have to go through a fsck. We want to be able to reboot the machine instantly and have everything nice and consistent… You can take an existing EXT2 filesystem, throw a journal file onto it, and mount it as EXT3.
ReiserFS
According to Daniel Robbins (drobbins@gentoo.org), President/CEO:
The ReiserFS 3.6.x (the version included as part of Linux 2.4) was designed and developed by Hans Reiser and his team of developers at Namesys.
The ReiserFS uses a specially optimized b* balanced tree (one per
filesystem) to organize all of its filesystem data. This offers a
nice performance boost, as well as easing artificial restrictions on
filesystem layouts. It's now possible to have a directory that
contains 100,000 other directories, for example. Another benefit of
using a b*tree is that ReiserFS, like most other next-generation
filesystems, dynamically allocates inodes as needed rather than
creating a fixed set of inodes at filesystem creation time. This
helps the filesystem to be more flexible to the various storage
requirements that may be thrown at it, while at the same time
allowing for some additional space-efficiency.
ReiserFS also has a host of features aimed specifically at improving small-file performance. Unlike ext2, ReiserFS doesn't allocate storage space in fixed 1K or 4K blocks. Instead, it can allocate the exact size it needs. ReiserFS also includes some special optimizations centered around tails, a name for files, and end portions of files, that are smaller than a filesystem block. In order to increase performance, ReiserFS is able to store files inside the b*tree leaf nodes themselves, rather than storing the data somewhere else on the disk and pointing to it.
This does two things. First, it dramatically
increases small file performance. Since the file data and the
stat_data (inode) information are stored right next to each other,
they can normally be read with a single disk IO operation. Second,
ReiserFS is able to pack the tails together, saving a lot of space.
In fact, a ReiserFS filesystem with tail packing enabled (the
default) can store six percent more data than the equivalent ext2
filesystem, which is amazing in itself.
However, tail packing does cause a slight
performance hit since it forces ReiserFS to repack data as files are
modified. For this reason, ReiserFS tail packing can be turned off,
allowing the administrator to choose between good speed and space
efficiency, or opt for even more speed at the cost of some storage
capacity.
Other Filesystems
Of course, we also have many other filesystems, such as those from Veritas, PolyServe, and other specialized vendors.
Oracle and Filesystems
Generally, Oracle prefers filesystems that don't do logging or journaling for the filesystems that contain datafiles, so if you use EXT2, EXT3, or reiserFS you need to mount them with journaling or logging off. If RAW filesystems are used you avoid some overhead, but usually you can only obtain a 2-5 percent performance increase over modern optimized filesystems.
RAID—Redundant Arrays of Inexpensive Disks
The main strengths of RAID technology are its dependability and IO bandwidth. For example, in a RAID5 array, the data is stored along with checksums and other information about the contents of each disk in the array. If one disk is lost, the others can use this stored information to re-create the lost data. However, this on-the-fly rebuild of data causes a massive hit on performance. In RAID 1, RAID 10, and RAID 01, failed disks are immediately replaced by their mirror with no performance hit. This makes RAID very attractive. RAID 5 has the same advantages as shadowing and striping at a lower cost. It has been suggested that if the manufacturers would use slightly more expensive disks (RASMED, a redundant array of slightly more expensive disks) performance gains could be realized. A RAID system appears as one very large, reliable disk to the CPU. There are several levels of RAID to date:
· RAID 0. Known as disk striping.
· RAID 1. Known as disk shadowing or mirroring.
· RAID 1/0. A combination of RAID 0 and RAID 1. It may also be called RAID 10 or RAID 01 depending on whether the disks are striped then mirrored, or mirrored then striped. It is generally felt that RAID 10 performs better than RAID 01.
· RAID 2. Data is distributed in extremely small increments across all disks, and one or more disks are added that contain a Hamming code for redundancy. RAID 2 is not considered commercially viable due to the added disk requirements (10 to 20 percent must be added to allow for the Hamming disks).
· RAID 3. This also distributes data in small increments but adds only one parity disk. This results in good performance for large transfers, but small transfers show poor performance.
· RAID 4. In order to overcome the small-transfer performance penalties in RAID 3, RAID 4 uses large data chunks distributed over several disks and a single parity disk. This results in a bottleneck at the parity disk. Due to this performance problem RAID 4 is not considered commercially viable. RAID 3 and 4 are usually used for video streaming technology or large LOB storage.
· RAID 5. This solves the parity-disk bottleneck by distributing the parity data across the disk array. The major problem is that it requires several write operations to update parity data. The performance hit is only moderate, and the other benefits may outweigh this minor problem. However, the penalty for writes can be over 20% and must be weighed against the benefits.
· RAID 6. This adds a second redundancy disk that contains error-correction codes. Read performance is good due to load balancing, but write performance suffers because RAID 6 requires more writes than RAID 5 for data updates.
For the money, I would suggest RAID 0/1 or RAID 1/0, that is, striped and mirrored. It provides nearly all of the dependability of RAID 5 and gives much better write performance. You will usually take at least a 20 percent write performance hit using RAID 5. For read-only applications RAID 5 is a good choice, but in high-transaction/high-performance environments the write penalties may be too high. Figure 18 shows RAID 1-0 or 0-1, depending on whether you stripe and then mirror, or mirror first and then stripe. In most situations you get better performance from RAID 1-0 (mirroring then striping).
Figure 18: Mirroring and Striping
Table 1 shows how Oracle suggests RAID should be used with Oracle database files.

RAID  Type of RAID                   Control File  Database File                   Redo Log File  Archive Log File
----  -----------------------------  ------------  ------------------------------  -------------  ----------------
0     Striping                       Avoid         OK                              Avoid          Avoid
1     Shadowing                      Best          OK                              Best           Best
1+0   Striping and Shadowing         OK            Best                            Avoid          Avoid
3     Striping with static parity    OK            OK                              Avoid          Avoid
5     Striping with rotating parity  OK            Best if RAID 0-1 not available  Avoid          Avoid

Table 1: RAID Recommendations (From MOSC NOTE: 45635.1)
Disk Speed and Size Selection
This all points to the fact that in order to get the maximum performance from your disk system you must understand the IO characteristics (the profile) of your database system, be it Oracle, SQL Server, Informix, UDB, or MySQL. You must tune your disk architecture to support the expected IO profile, and you must tune the database system to take advantage of the disk architecture. For example, an Oracle database has different IO characteristics depending on whether it is reading or writing data and what type of read or write it is doing. Other databases have fixed read/write sizes.
You must determine the IO profile for your database and then use it to determine the maximum and minimum IO size. The IO profile will tell you what percentage of IO is large IO and what percentage is small IO; it will also give you the expected IO rate in IO/second.
Once you have the IO per second you can
determine the IO capacity (number of drives) needed to support your
database.
The first rule of tuning your disk system is:
Size first
for IO capacity, then for volume.
Some back-of-the-envelope calculations for the number of spindles needed to support a given IO rate are:
RAID10 with active read/write to all mirrors:
MAX(CEILING(IOR/(NSIOR*M),M),2*M)
Where:
· IOR is the expected maximum IO rate in IO/sec
· NSIOR is the average non-sequential IO rate of the disks in IO/sec (range of 90-100 for RAID10)
· M is the number of mirrors
(That is, the IO rate divided by the average non-sequential IO rate per disk times the number of mirrors, rounded up to the nearest multiple of M, with a floor of 2*M.)
RAID5 assuming 1 parity disk:
MAX((IOR/CNSIOR)+1,3)
Where:
· IOR is the expected maximum IO rate in IO/sec
· CNSIOR is the corrected average non-sequential IO rate of the disks in IO/sec (range of 60-90 for RAID5)
(That is, the IO rate divided by the corrected average non-sequential IO rate per disk, plus 1 disk for the parity disk, with a floor of 3 disks.)
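As a worked sketch of both formulas in SQL (all input values here are hypothetical examples, not recommendations: IOR = 2000 IO/sec, NSIOR = 95 IO/sec, M = 2 mirrors, CNSIOR = 75 IO/sec):

select greatest(ceil(2000/(95*2)/2)*2, 2*2) raid10_disks,
       greatest(ceil(2000/75)+1, 3)         raid5_disks
from dual;

This returns 12 disks for the RAID10 case (2000/190 = 10.5, rounded up to the next multiple of 2) and 28 for the RAID5 case (2000/75 = 26.7, rounded up, plus one parity disk).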
The correction to the non-sequential IO rate for RAID5 is due to the up to 400% penalty on writes (writes take 4 times longer than reads on the same drive). In some cases on RAID5 I have seen this go as high as 6400% (writes taking 64 times as long as reads for the same file) when combined with other problems such as fragmentation.
A case in point: early RAID architectures utilized the "stripe shallow and wide" mindset, where files were broken into small pieces and spread over a large number of disks. For example, stripe units per disk as small as 8K were common. Many systems read in IO sizes of 64K or larger. This means that to satisfy a single IO request, 8 disks of the RAID set were required; if there were fewer than 8 disks in the set, disks would have to undergo 2 or more IOs to satisfy the request. This sounds fine if you are talking about a single user wanting to read a large file from a large group of disks very quickly, but what happens when you have 10 or 100 or 1000 concurrent users all wanting to do the same thing?
Tune for Concurrency
This problem with concurrent access and RAID arrays is one of the most prevalent in the industry. The ubiquitous IO wait is usually the predominant wait event in any database system, simply because IO to memory is in the nanosecond range while IO to disk is in the millisecond range; when you add in blocked access due to multi-disk IO requests, you get a snowball effect that can cripple your IO subsystem.
Array manufacturers have begun to recognize this
concurrent access problem and have increased the base stripe unit
per disk to 64K, matching the IO unit for many systems. Of course
now systems such as SUN and Windows utilize maximum IO sizes of 1
megabyte or larger, so again the array manufacturers are playing
catch up to the server manufacturers.
So what is our second rule of tuning disks?
Based on the above information the rule is:
Always
ensure that the primary IO size for your database system is matched
to the IO size of the disk array system.
Of course the inverse also holds true:
Always
match the stripe unit per disk to the expected majority IO request
from your (database) application.
In the 1990s Peter Chen of the University of California at Berkeley published a series of papers on tuning disk array stripe unit size based on expected concurrency. In these papers, Chen and his associates determined that the IO speed (as measured by average seek time) and IO rate (as measured in megabytes per second) for a disk determine the stripe size for performance in an array, even when the number of concurrent accesses is not known. Three formulae were derived from these papers.
For non-RAID5 arrays when concurrency is known:
SU = (S*APT*DTR*(CON-1)*1.024)+0.5K
Where:
· SU - striping unit per disk
· S - concurrency slope coefficient (~0.25)
· APT - average positioning time (milliseconds)
· DTR - data transfer rate (megabytes/sec)
· CON - number of concurrent users
· 1.024 = 1s/1000ms * 1024K/1M (conversion factor for units)
So for a drive that has an average seek time of 5.6 ms and a transfer rate of 20 Mbytes/second, the calculated stripe unit for a 20-concurrent-user base would be:
(0.25*5.6*20*(19)*1.024)+0.5 = 545K (or ~512K)
For a system where you don't know the concurrency, the calculation becomes:
SU = (2/3*APT*DTR)
So for the same drive:
2/3*5.6*20*1.024 = 76.46K, so rounding up gives ~128K, or rounding down, 64K
And from Chen's final paper, the formula for RAID5 arrays is:
SU = (1/2*APT*DTR)
So for the same drive:
0.5*5.6*20*1.024 = 57.34K (rounding up, 64K)
The values for average access time and transfer rate used in these examples are actually fairly low when compared to more advanced drives, so the stripe sizes shown above are probably low by at least a factor of 2 or more. I say this because while average seek times drop, transfer rates increase; for example, on an Ultra3 SCSI 15K drive the spec for average seek may drop to 4.7 ms while the transfer rate leaps to 70 Mbytes per second. So the overall value of the combined factor goes from 112 to 329, nearly a threefold increase.
The 100% Myth
Many system administrators are guilty of perpetuating the 100% myth. This myth states that you don't need more assets (be it disk, CPU, or memory) until the existing asset is 100% utilized. This leads to performance issues in the area of disks. Due to disk physics, the best performance for a disk is at the outer edges; once you get toward the inner sectors, performance decreases because of the distance the head must travel to read the data and other factors. In the good old days, administrators spent much time positioning frequently used files on the outer edges of disks.
While physically positioning files on disks is difficult if not impossible in modern RAID systems, you should endeavor not to fill the disks to 100% of capacity. Some experts say don't use more than 30% if you want maximum performance, others 50%. I say it depends on how the system is used, the operating system, and the RAID array system. For example, the Clariion from EMC promises to tune the placement of files such that frequently used files are in the best locations.
So, what can we summarize about disk size and speed? Get the fastest drives you can, and plan capacity based on concurrency requirements as well as IO requirements. The more fast disks, the better.
Improper Initialization File Parameter Settings
For Oracle7, version 7.3, there are 154 initialization parameters; for Oracle8, version 8.0.5, there are 184. In Oracle8i there are 194. In Oracle9i version 9.0.1 there are 251, and in 9iR2, 257. In 10g the number of documented parameters actually dropped to 254, but the number of undocumented parameters increased, from 583 in 9iR2 to 918 in 10gR1.
Fortunately there are very few that you need to adjust to tune Oracle. Table 2 lists the major tuning parameters, but it is by no means a complete list.
Parameter                          Definition
---------------------------------  -----------------------------------------------------------------
create_bitmap_area_size            Sets the memory area for bitmap creation
bitmap_merge_area_size             Sets the memory area used for bitmap merges
create_stored_outlines             Allows Oracle to create stored outlines
cursor_sharing                     Sets automated literal replacement
db_file_multiblock_read_count      Sets the read size for full table and index scans
filesystemio_options               Sets direct or AIO options for filesystem reads
optimizer_index_caching            Used to tune index access
optimizer_index_cost_adj           Used to tune index access
query_rewrite_enabled              Allows queries to be rewritten to use materialized views or FBIs
query_rewrite_integrity            Sets the criteria for when MVs are used
session_cached_cursors             Sets the number of cached cursors at the session level
sga_max_size                       Sets the maximum SGA memory size
sga_target                         Sets the baseline SGA memory size
star_transformation_enabled        Allows Oracle to use star transformation
transactions_per_rollback_segment  Sets the number of transactions that will use a single rollback
                                   (undo) segment
pga_aggregate_target               Sets the total PGA memory usage limit
workarea_size_policy               Determines how work areas (sort and hash) are sized
buffer_pool_keep                   Sets the size of the keep buffer pool for tables and indexes
buffer_pool_recycle                Sets the size of the recycle buffer pool for tables and indexes
cursor_space_for_time              Sacrifices memory for cursor storage space
db_16k_cache_size                  Sets the size of the 16K block cache
db_2k_cache_size                   Sets the size of the 2K block cache
db_32k_cache_size                  Sets the size of the 32K block cache
db_4k_cache_size                   Sets the size of the 4K block cache
db_8k_cache_size                   Sets the size of the 8K block cache
db_block_size                      Sets the default block size for the database
db_cache_size                      Sets the default cache size

Table 2: Tuning Parameters
How to determine proper setpoints for all of these is beyond the scope of this paper. However, the Oracle tuning guides provide many good tips, as does the Burleson Consulting web site: www.remote-dba.net.
Improper PGA Setup
I don't believe anyone out there thinks disk-based sorts and hashes are good things. A disk-based operation will take anywhere from 17 to hundreds of times as long as a memory-based operation, depending on buffering, IO bandwidth, memory, and disk speeds.
Oracle provides AWRRPT or statspack reports to track and show the number of sorts. Unfortunately, hashes are not so easily tracked. Oracle tracks disk and memory sorts, the number of sort rows, and other sort-related statistics. Hashes, on the other hand, can usually only be tracked through execution plans for cumulative values and through various views for live values.
In versions prior to 9i the individual areas were set using the sort_area_size and hash_area_size parameters; from 9i onward the PGA_AGGREGATE_TARGET parameter allows automated setting of the sort and hash areas. For currently active sorts or hashes, the script in Figure 19 can be used to watch the growth of temporary areas.
For complete scripts, see my complete Oracle
script collection at
www.dba-oracle.com/oracle_scripts.htm.
column now format a14
column operation format a15
column dt new_value td noprint
set feedback off
select to_char(sysdate,'ddmonyyyyhh24miss') dt from dual;
set lines 132 pages 55
@title132 'Sorts and Hashes'
spool rep_out\&&db\sorts_hashes&&td
select sid, work_area_size, expected_size, actual_mem_used, max_mem_used, tempseg_size,
to_char(sysdate,'ddmonyyyyhh24miss') now,
operation_type operation
SEE CODE DEPOT FOR FULL SCRIPT
/
spool off
clear columns
set lines 80 feedback on
ttitle off
Figure 19: Live Sorts and Hashes Report
Figure 20 shows an example output from this report.

Date: 01/04/06                                                   Page: 1
Time: 01:27 PM              Sorts and Hashes                     SYS
                            whoville database

      Work Area  Expected  Actual Mem   Max Mem  Tempseg
 SID       Size      Size        Used      Used     Size  Now              Operation
---- ---------- --------- ----------- ---------  -------  ---------------  ---------------
1176    6402048   6862848           0         0           04jan2006132711  GROUP BY (HASH)
 582     114688    114688      114688    114688           04jan2006132711  GROUP BY (SORT)
 568    5484544   5909504      333824    333824           04jan2006132711  GROUP BY (HASH)
1306    3469312   3581952     1223680   1223680           04jan2006132711  GROUP BY (HASH)

Figure 20: Example Sorts and Hashes Report
As you can see, the whoville database had no hashes going to disk at the time the report was run. We can also look at the cumulative statistics in the v$sysstat view for cumulative sort data.

Date: 12/09/05                              Page: 1
Time: 03:36 PM         Sorts Report         PERFSTAT
                       sd3p database

Type Sort                 Number Sorts
-------------------- -----------------
sorts (memory)              17,213,802
sorts (disk)                       230
sorts (rows)             3,268,041,228

Figure 21: Cumulative Sorts
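The figure above is simply formatted output of the base statistics; you can pull the same three numbers directly with a query like:

select name, value from v$sysstat where name like 'sorts%';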
Another key indicator that hashes are occurring is excessive IO to the temporary tablespace when there are few or no disk sorts.
The PGA_AGGREGATE_TARGET is the target total amount of space for all PGA memory areas. However, only 5%, or a maximum of 200 megabytes, can be assigned to any single process. The limit for PGA_AGGREGATE_TARGET is (supposedly) 4 gigabytes; however, you can increase the setting above this point. The 200 megabyte limit is set by the _pga_max_size undocumented parameter; this parameter can be reset, but only under the guidance of Oracle support. But how big should PGA_AGGREGATE_TARGET be? The AWRRPT report in 10g provides a sort histogram which can help in this decision. Figure 22 shows an example of this histogram.
PGA Aggr Target Histogram                 DB/Inst: OLS/ols  Snaps: 73-74
-> Optimal Executions are purely in-memory operations

    Low     High
Optimal  Optimal     Total Execs  Optimal Execs  1-Pass Execs  M-Pass Execs
-------  -------  --------------  -------------  ------------  ------------
     2K       4K       1,283,085      1,283,085             0             0
    64K     128K           2,847          2,847             0             0
   128K     256K           1,611          1,611             0             0
   256K     512K           1,668          1,668             0             0
   512K    1024K          91,166         91,166             0             0
     1M       2M             690            690             0             0
     2M       4M             174            164            10             0
     4M       8M              18             12             6             0
-------------------------------------------------------------

Figure 22: Sort Histogram
In this case we are seeing 1-pass executions, indicating disk sorts are occurring, with the maximum size being in the 4M to 8M range. For an 8M sort area the PGA_AGGREGATE_TARGET should be set at 320 megabytes (sorts get 0.5*(0.05*PGA_AGGREGATE_TARGET)). For this system the setting was 160 megabytes, so 4 megabytes was the maximum sort size; as you can see, we were seeing 1-pass sorts in the 2-4M range as well, even at 160M.
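Based on that arithmetic, the adjustment for this system would be something like the following (the 320M figure is derived from the 8M maximum sort seen in the histogram above; pga_aggregate_target is dynamic, so no restart is needed):

ALTER SYSTEM SET pga_aggregate_target = 320M SCOPE=BOTH;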
By monitoring the real-time or live hashes and sorts and looking at the sort histograms from the AWRRPT reports, you can get a very good idea of the needed PGA_AGGREGATE_TARGET setting. If you need sort areas larger than 200 megabytes, you may need to get approval from Oracle support through the iTAR process to set the _pga_max_size parameter to greater than 200 megabytes.
Summary
This paper has presented the major tuning issues I have seen at many sites during tuning engagements. I have presented ways of determining if the issues exist and how to determine settings to help mitigate them in an active database.