Tuning Disk Architectures for Databases
by Michael R. Ault
Introduction
When I began work in the computer industry, RAID (Redundant
Array of Inexpensive Disks) technology was still in its
infancy and Winchester hard drive technology wasn't available
except in huge frame-mounted drives of around 90 megabytes in
size that weighed in at a svelte 200 to 300 pounds. Disk
technology has progressed far in terms of capacity since those
days, and the many controllers and optimizers boggle the
imagination. Gone are the days when tuning a disk meant a
simple defragmentation and placing frequently accessed files
near the outer rim of the disk.
Disk Capacity
While disk volume capacity has been growing exponentially,
disk I/O capacity has grown only in direct proportion to
disk rotational speed. So while volume capacity has jumped
from 100 megabytes or less at 3600 rpm to 200 gigabytes at
10K or 15K rpm, the corresponding I/O rates have only
increased from about 30 IOPS to around 150 IOPS (for sequential
I/O only). In other words, capacity has increased by a factor
of two thousand or more, while I/O rates have increased by only
a factor of five. What this means to you as a system tuner is
that unless you build out (increase the number of drives) beyond
what is needed strictly for volume of data, your ability to
service I/O requests will actually decrease. This need to
increase I/O rates to keep up with ever-increasing capacities
is what drove the move to RAID technologies.
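To make the build-out point concrete, here is a minimal sketch
in Python (the per-drive figures are the ones quoted above; the
database volume and peak I/O rate are hypothetical):

    # Size for I/O capacity first, then for volume (example numbers)
    drive_capacity_gb = 200   # modern drive capacity, from the text
    drive_iops = 150          # sequential IOPS per drive, from the text
    db_volume_gb = 500        # hypothetical database size
    db_peak_iops = 3000       # hypothetical peak I/O rate

    drives_for_volume = -(-db_volume_gb // drive_capacity_gb)  # ceiling: 3 drives
    drives_for_iops = -(-db_peak_iops // drive_iops)           # ceiling: 20 drives
    print(max(drives_for_volume, drives_for_iops))             # I/O capacity governs: 20

Sized by volume alone, three drives would hold this database;
sized for its peak I/O rate, it needs twenty.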
Disk Types
Adding to the confusion over disk tuning are the many disk
types that are available: ATA, SCSI, USB, and IEEE 1394
(FireWire) are all examples of disk technologies in use today.
Each of the disk technologies has different tuning
requirements, and when you add RAID to the mix, it further
muddies the tuning waters. RAID can be based in software,
hardware, or a mix of both. The highest performance is usually
derived from hardware-based RAID solutions.
Each of the disk types is characterized by its interface type.
The interface types each have their own maximum bandwidth
capacities, and you must take these into account when
specifying your disk requirements. The following table shows
an example comparison for these interfaces. Note that there
are also limits on the number of devices that can be attached
to each type of interface, ranging, for example, from two on
most ATA buses to 63 on IEEE 1394.
Each of these, as indicated, may have multiple
implementations with new and improved features. Here
is a speed comparison of the various types of disk
interface.
Interface             Speed
---------             -----
Serial                115 kb/s
Parallel (standard)   115 kb/s
Parallel (ECP/EPP)    3.0 MB/s
SCSI                  5-320 MB/s
ATA                   3.3-133 MB/s
USB 1.1               1.5 MB/s
USB 2.x               60 MB/s
IEEE 1394(b)          50-400 MB/s
Original IDE          3.3-33 MB/s
SATA                  150 MB/s
Fibre Channel         2 Gb/s (~200 MB/s)
Memory Caches
In an effort to improve the I/O characteristics of their disks,
manufacturers added memory caches to allow read and write
caching. Unfortunately, these caches are usually too small to
be of much help with database-level performance as they are
usually a very small fraction of a percent of the total disk
capacity. Of what real use is a disk cache of a couple of
megabytes on a 200 gigabyte disk holding a database table that
is 60 gigabytes in size?
The cache soon becomes saturated, and
any I/O benefit is eliminated. Disk array manufacturers such
as EMC and Hitachi have also provided larger amounts of cache,
in the tens-of-gigabytes range, which in some situations does
help improve I/O rates, depending on how efficiently the
application uses the data stored in the database.
For example, if the application is poorly written and does multiple full
scans of multi-gigabyte tables, then again the cache becomes
saturated, and you fall back to disk I/O speeds. For that
matter, in some environments such as data warehouses and
decision support systems, the scanning of large tables,
sometimes hundreds of gigabytes in size, is commonplace. This
can quickly tax even the largest cache systems.
I/O Profiles
All of this points to the fact that to get the maximum
performance from your disk system, you must understand the I/O
characteristics (the profile) of your database system, be it
Oracle, SQL Server, Informix, UDB, or MySQL. You must tune
your disk architecture to support the expected I/O profile and
you must tune the database system to take advantage of the
disk architecture. For example, an Oracle database has
different I/O characteristics depending on whether it is
reading or writing data and what type of read or write it is
doing. Other databases have fixed read/write sizes.
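In Oracle, for instance, a full table scan reads
db_file_multiblock_read_count blocks per I/O, so with an 8K
block size and a multiblock read count of 16, large sequential
reads arrive as 16 × 8K = 128K requests, while single-block
index reads arrive as 8K requests; this is the kind of
variation the disk layout must accommodate.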
You must determine the I/O profile for your database and then
use the I/O profile of the database to determine the maximum
and minimum I/O size. The I/O profile will tell you what
percentage of I/O is large I/O and what percentage is small
I/O. It will also give you the expected I/O rate in I/Os per
second.
Once you have the I/O per second, you can determine the I/O
capacity (number of drives) needed to support your database.
The first rule of tuning your disk system is:
"Size first for I/O capacity, then for volume."
Some back-of-the-envelope calculations for the number of
spindles needed to support I/O rate (IOR) are:
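One common back-of-the-envelope form, consistent with the
write-penalty discussion that follows (a sketch, not
necessarily the exact original formulas), is:

    IOR = read IOPS + (write penalty × write IOPS)
    spindles needed = IOR / IOPS per spindle

where the write penalty is roughly 2 for mirrored arrays and 4
for RAID5. In Python, with hypothetical workload numbers:

    # Spindle count from I/O rate, corrected for the RAID5 write penalty
    read_iops = 2000       # hypothetical read rate
    write_iops = 500       # hypothetical write rate
    per_drive_iops = 150   # per-spindle rate, from the text
    write_penalty = 4      # RAID5: the ~400 percent write penalty described below

    ior = read_iops + write_penalty * write_iops   # effective rate: 4000 IOPS
    print(-(-ior // per_drive_iops))               # ceiling: 27 spindles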
The correction for the non-sequential I/O rate for RAID is due
to the up to 400 percent penalty on writes (writes take four
times longer than reads on the same drive). In some cases on
RAID5, I have seen this go as high as 6400 percent (writes
take 64 times as long as reads for the same file) when
combined with other problems such as fragmentation.
A case in point: Early RAID architectures utilized the "stripe
shallow and wide" mindset, in which files were broken into
small pieces and spread over a large number of disks. For
example, stripe units per disk as small as 8K were common.
Many systems read in I/O sizes of 64K or larger. This means
that to satisfy a single I/O request, eight disks of the RAID
set were required.
If there were fewer than eight disks in the
set, disks would have to undergo two or more I/Os to satisfy
the request. This sounds fine if you are talking about a
single user wanting to read a large file from a large group of
disks very quickly; however, what happens when you have 10, or
100, or 1000 concurrent users all wanting to do the same
thing?
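The arithmetic is simple enough to sketch (sizes from the
example above):

    # Disks touched by a single I/O request for a given stripe unit
    io_size_kb = 64      # typical system I/O size
    stripe_unit_kb = 8   # early "shallow and wide" stripe unit
    print(-(-io_size_kb // stripe_unit_kb))   # ceiling: 8 disks per request
    # With a 64K stripe unit instead, each request touches one disk, so
    # concurrent requests can land on different disks rather than all
    # queuing for the same eight.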
Tune for Concurrency
This problem with concurrent access and RAID arrays is one of
the most prevalent in the industry. The ubiquitous I/O wait
is usually the predominant wait event in any database system
simply due to the fact that I/O to memory is in the nanosecond
range, while I/O to disk is in the millisecond range. When you
add in blocked access due to multi-disk I/O requests, you get
a snowball effect that can cripple your I/O subsystem.
Array manufacturers have begun to recognize this concurrent
access problem and have increased the base stripe unit per
disk to 64K, matching the I/O unit for many systems. Of course
now systems such as Sun and Windows utilize maximum I/O sizes
of one megabyte or larger, so again the array manufacturers
are playing catch up to the server manufacturers.
So what is our second rule for tuning disks? Based on the
previous information, the rule is:
"Always ensure that the primary I/O size for your database
system is matched to the I/O size of the disk array system."
Of course the inverse also holds true:
"Always match the stripe unit per disk to the expected majority
I/O request from your (database) application."
In the 1990s, Peter M. Chen of the University of California at
Berkeley published a series of papers on tuning disk array
stripe unit size based on expected
concurrency. In these papers, Chen and his associates
determined that a disk's I/O speed (as measured by average seek
time) and I/O rate (as measured in megabytes per second)
together determine the stripe size for best performance in an
array, even when the number of concurrent accesses is not
known. Three formulae were derived from these papers:
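Reading the formulas back from the worked examples that follow
(seek time in milliseconds, transfer rate in megabytes per
second, with the factor 1.024 converting the result to
kilobytes):

    stripe unit (concurrency C known) = 0.25 × seek × transfer × (C - 1) × 1.024 + 0.5
    stripe unit (concurrency unknown) = 2/3 × seek × transfer × 1.024
    stripe unit (RAID5)               = 1/2 × seek × transfer × 1.024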
So for a drive that has an average seek time of 5.6 ms and a
transfer rate of 20 megabytes/second, the calculated stripe
unit for a 20 concurrent user base would be:
- (0.25 * 5.6 * 20 * 19 * 1.024) + 0.5 = 545K (or ~512K)
For a system for which you didn't know the concurrency, the
calculation becomes:
- 2/3 * 5.6 * 20 * 1.024 = 76.46K, so rounding up ~128K or rounding down 64K
And from Chen's final paper, a formula for RAID5 arrays is:
- 0.5 * 5.6 * 20 * 1.024 = 57.34K (rounding up, 64K)
The values for average seek time and transfer rate used in
these examples are actually fairly low compared to more
advanced drives, so the stripe sizes shown previously are
probably low by at least a factor of two. This is
because as average seek times drop, transfer rates
increase. For example, on an Ultra3 SCSI 15K drive, the
average seek time may drop to 4.7 milliseconds while the
transfer rate leaps to 70 megabytes per second.
So the combined seek-time-times-transfer-rate factor goes
from 112 to 329, nearly a threefold increase.
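Here is a minimal Python sketch of all three calculations,
using the formulas as read back from the examples above, with
both the original drive and the 15K drive just described:

    # Chen-style stripe unit estimates, in kilobytes
    def stripe_known(seek_ms, xfer_mbs, users):
        return 0.25 * seek_ms * xfer_mbs * (users - 1) * 1.024 + 0.5

    def stripe_unknown(seek_ms, xfer_mbs):
        return (2.0 / 3.0) * seek_ms * xfer_mbs * 1.024

    def stripe_raid5(seek_ms, xfer_mbs):
        return 0.5 * seek_ms * xfer_mbs * 1.024

    print(stripe_known(5.6, 20, 20))   # ~545K -> use ~512K
    print(stripe_unknown(5.6, 20))     # ~76K  -> 64K or 128K
    print(stripe_raid5(5.6, 20))       # ~57K  -> 64K
    print(stripe_unknown(4.7, 70))     # ~225K on the 15K drive -> ~256K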
The I/O Path
The other major factor to consider is the ability to transfer
the data from the disks to the CPU. Transferring the data from
the disk to the CPU is carried out by the controllers and
other interface components. Each controller in a system
usually has two or more paths to the disk arrays. The more
paths to and from the disk array, the more data that can be
transferred. This leads to the third rule for tuning disk
arrays:
"Always plan transfer paths for peak load, not average
load."
In one situation, I had a client who switched to a more advanced
disk array with faster disks, a new server with faster CPUs,
and yet the system performed at half the speed of the older
system. Investigation showed they had decided to reduce the
available disk paths to two-thirds of the previous number
(from 12 to 8) based on "average" I/O requirements. They soon
discovered that while this was fine for day-to-day processing,
it spelled real trouble for end-of-month, end-of-quarter, and
end-of-year processing as well as payday processing.
Make sure I/O is spread across the controller paths in a
load-balanced methodology, not a cascade-based
methodology. In a load-balanced methodology, the load is
spread evenly across all of the paths allocated to a given
server. In a cascade methodology, subsequent paths are not
utilized until the primary path reaches a certain percentage
of load.
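A minimal sketch of the difference (hypothetical path names and
threshold; in practice multipathing software makes this choice):

    # Load-balanced vs. cascade path selection (hypothetical values)
    paths = {"path0": 0.10, "path1": 0.05, "path2": 0.20}  # fractional load per path

    def pick_load_balanced():
        # Send the next I/O down the least-loaded path.
        return min(paths, key=paths.get)

    def pick_cascade(threshold=0.8):
        # Stay on the first path until it hits the threshold, then spill over.
        for name, load in paths.items():
            if load < threshold:
                return name
        return min(paths, key=paths.get)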
The 100 Percent Myth
Many system administrators are guilty of perpetuating the 100
percent myth. This myth states that you don't need more assets
(be they disk, CPU, or memory) until the existing asset is 100
percent utilized. This leads to performance issues in the area
of disks.
Because of the physics of disks, the best performance for a
disk is at the outer edges; once you get towards the inner
sectors, performance decreases because of the distance the
head must travel to read the data and other factors. In the
good old days, administrators spent much time positioning
frequently used files on the outer edges of disks.
While physically positioning files on disks is difficult if
not impossible in modern RAID systems, you should try not to
fill the disks to 100 percent of capacity. Some experts say
don't use more than 30 percent if you want maximum
performance; others say 50 percent. I say it
depends on how the system is used, the operating system and
the RAID array system.
Should I Worry About Fragmentation?
The answer here is, it depends. Windows-based environments
(NT, W2K) are prone to fragmentation due to the opportunistic
write algorithm that writes back blocks to the first available
sector instead of their original sector. This is even true for
the pre-allocated files used by most databases. Another
fragmentation-prone environment is AIX. And of course anyone
still using OpenVMS will also know all they ever want to about
the costs of fragmentation.
In Windows, AIX, and OpenVMS, you will need to monitor for
fragmentation and defragment as required. Be sure to verify
with your disk array vendor that defragmentation is possible
or allowed on their systems. If it becomes severe, you may
need to back up the array, rebuild it, and then repopulate it
to get around specific vendor limitations.
Summary
The entire underlying foundation of disk tuning, and for that
matter database tuning, is to open up the flow path for data.
Make sure there are no restrictions or kinks in the path from
the disks to the memory and CPUs. Then, if you follow up with
proper database tuning with the goal of reducing logical and
physical I/Os to a minimum, you will have an optimally
performing system.
References
Peter M. Chen and David A. Patterson. "Maximizing Performance
in a Striped Disk Array." In Proceedings of the 1990
International Symposium on Computer Architecture, pages
322–331, May 1990.
Peter M. Chen and Edward K. Lee. "Striping in a RAID Level 5
Disk Array." In Proceedings of the 1995 ACM SIGMETRICS
Conference on Measurement and Modeling of Computer Systems.
Drew Robb. "Hard Disks and How They Are Organized."
Enterprise Operations Management, Auerbach Publications,
CRC Press, 2015.