By Michael R. Ault
Welcome to My Nightmare - The Common Performance Errors in Oracle
Databases
RAID—Redundant Arrays of Inexpensive Disks
The main strengths of RAID technology are its
dependability and IO bandwidth. For example, in a RAID 5 array the
data is stored along with checksums and other information about the
contents of each disk in the array. If one disk is lost, the others
can use this stored information to re-create the lost data; however,
this on-the-fly rebuild causes a massive performance hit.
In RAID 1, RAID 10, and RAID 01, a failed disk is immediately replaced
by its mirror with no performance hit. This makes RAID very
attractive. RAID 5 provides the same advantages as shadowing and striping
at a lower cost. It has been suggested that if manufacturers
used slightly more expensive disks (RASMED—redundant array of
slightly more expensive disks) performance gains could be realized.
A RAID system appears as one very large, reliable disk to the CPU.
There are several levels of RAID to date:
· RAID 0. Known as disk striping.
· RAID 1. Known as disk shadowing or mirroring.
· RAID 1/0. A combination of RAID 0 and RAID 1. It may
also be called RAID 10 or RAID 01, depending on whether the disks are
striped and then mirrored or mirrored and then striped. It is generally
felt that RAID 10 performs better than RAID 01.
· RAID 2. Data is distributed in extremely small
increments across all disks, and one or more added disks contain
a Hamming code for redundancy. RAID 2 is not considered commercially
viable due to the added disk requirements (10 to 20 percent must be
added to allow for the Hamming disks).
· RAID 3. This also distributes data in small
increments but adds only one parity disk. This results in good
performance for large transfers, but small transfers show poor
performance.
· RAID 4. In order to overcome the small-transfer
performance penalties of RAID 3, RAID 4 uses large data chunks
distributed over several disks and a single parity disk. This
results in a bottleneck at the parity disk. Due to this performance
problem RAID 4 is not considered commercially viable. RAID 3 and RAID 4
are usually used for video streaming or large LOB storage.
· RAID 5. This solves the parity bottleneck by
distributing the parity data across the disk array. The major
problem is that it requires several write operations to update parity
data. The performance hit is only moderate, and the other benefits
may outweigh this problem; however, the penalty for writes can
be over 20 percent and must be weighed against the benefits.
· RAID 6. This adds a second redundancy disk that
contains error-correction codes. Read performance is good due to
load balancing, but write performance suffers because RAID 6
requires more writes than RAID 5 for each data update.
For the money, I would suggest RAID 0/1 or
RAID 1/0, that is, striped and mirrored. It provides nearly all of
the dependability of RAID 5 and gives much better write performance.
You will usually take at least a 20 percent write performance hit
using RAID 5. For read-only applications RAID 5 is a good choice, but
in high-transaction, high-performance environments the write
penalties may be too high. Figure 18 shows RAID 1-0 or 0-1, depending
on whether you stripe and then mirror or mirror first and then
stripe. In most situations you get better performance from RAID 1-0
(mirroring then striping).

Figure 18: Mirroring and Striping
Table 1 shows how Oracle suggests RAID should be
used with Oracle database files.
RAID  Type of RAID                   Control File  Database File                   Redo Log File  Archive Log File
----  -----------------------------  ------------  ------------------------------  -------------  ----------------
0     Striping                       Avoid         OK                              Avoid          Avoid
1     Shadowing                      Best          OK                              Best           Best
1+0   Striping and Shadowing         OK            Best                            Avoid          Avoid
3     Striping with static parity    OK            OK                              Avoid          Avoid
5     Striping with rotating parity  OK            Best if RAID 0-1 not available  Avoid          Avoid
Table 1: RAID Recommendations (From MOSC
NOTE: 45635.1)
Disk Speed and Size Selection
This all points to the fact that in order to get
the maximum performance from your disk system you must understand
the IO characteristics (the profile) of your database system, be it
Oracle, SQL Server, Informix, UDB or MySQL. You must tune your disk
architecture to support the expected IO profile and must tune the
database system to take advantage of the disk architecture. For
example, an Oracle database has different IO characteristics
depending on whether it is reading or writing data and what type of
read or write it is doing. Other databases have fixed read/write
sizes.
You must determine the IO profile for your
database and then use it to determine
the maximum and minimum IO sizes. The IO profile will tell you what
percentage of the IO is large and what percentage is small; it
will also give you the expected IO rate in IO/second.
Once you have the IO per second you can
determine the IO capacity (number of drives) needed to support your
database.
The first rule of tuning your disk system is:
Size first
for IO capacity, then for volume.
Some back-of-the-envelope calculations for the
number of spindles needed to support the IO rate are:
RAID10 with active read/write to all mirrors:
MAX(CEILING(IOR/(NSIOR*M),M),2*M)
Where:
· IOR is the expected maximum IO rate in IO/sec
· NSIOR is the average non-sequential IO rate of the
disks in IO/sec (a range of 90-100 for RAID 10)
· M is the number of mirrors
(That is, the IO rate divided by the average non-sequential
IO rate per disk times the number of mirrors, rounded up to the
nearest multiple of M, with a floor of 2*M.)
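As a quick sanity check, this arithmetic can be run straight
from SQL*Plus. The figures below (2,000 IO/sec peak rate, 95 IO/sec
per disk, 2 mirrors) are assumed values for illustration only, not
measurements from any particular system:

-- Hypothetical RAID 10 spindle estimate: IOR=2000, NSIOR=95, M=2
-- ceil(...)*2 rounds the disk count up to a multiple of M (here 2)
select greatest(ceil(2000/(95*2)/2)*2, 2*2) as raid10_spindles
from   dual;

This evaluates to 12 spindles under these assumptions.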
RAID 5, assuming one parity disk:
MAX((IOR/CNSIOR)+1,3)
Where:
· IOR is the expected maximum IO rate in IO/sec
· CNSIOR is the corrected average non-sequential IO rate
of the disks in IO/sec (a range of 60-90 for RAID 5)
(That is, the IO rate divided by the average non-sequential
IO rate per disk corrected for the RAID 5 penalties, plus one
disk for parity, with a floor of 3 disks.)
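The same back-of-the-envelope check for RAID 5, again with assumed
numbers (2,000 IO/sec and a corrected rate of 70 IO/sec per disk);
the CEIL simply rounds up to whole drives:

-- Hypothetical RAID 5 spindle estimate: IOR=2000, CNSIOR=70
select greatest(ceil(2000/70)+1, 3) as raid5_spindles
from   dual;

which evaluates to 30 drives under these assumptions.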
The correction to the non-sequential IO rate
for RAID is due to the up to 400 percent penalty on writes (writes take 4
times longer than reads on the same drive). In some cases on RAID 5 I
have seen this go as high as 6400 percent (writes taking 64 times as long as
reads for the same file) when combined with other problems such as
fragmentation.
A case in point: early RAID architectures
utilized the "stripe shallow and wide" mindset, where files were
broken into small pieces and spread over a large number of disks.
For example, stripe units per disk as small as 8K were common,
while many systems read in IO sizes of 64K or larger. This means that to
satisfy a single IO request 8 disks of the RAID set were required;
if there were fewer than 8 disks in the set, some disks would have to
undergo 2 or more IOs to satisfy the request. This sounds fine if
you are talking about a single user wanting to read a large file
from a large group of disks very quickly; however, what happens when
you have 10, 100, or 1,000 concurrent users all wanting to do the
same thing?
Tune for Concurrency
This problem with concurrent access and RAID
arrays is one of the most prevailing in the industry. The ubiquitous
IO wait is usually the predominant wait event in any database system
simply due to the fact that IO to memory is in the nanosecond range
while IO to disk is in the millisecond range, when you add in
blocked access due to multi-disk IO requests you get a snowball
effect that can cripple your IO subsystem.
Array manufacturers have begun to recognize this
concurrent-access problem and have increased the base stripe unit
per disk to 64K, matching the IO unit for many systems. Of course,
systems such as Sun and Windows now utilize maximum IO sizes of 1
megabyte or larger, so again the array manufacturers are playing
catch-up to the server manufacturers.
So what is our second rule of tuning disks?
Based on the above information the rule is:
Always
ensure that the primary IO size for your database system is matched
to the IO size of the disk array system.
Of course the inverse also holds true:
Always
match the stripe unit per disk to the expected majority IO request
from your (database) application.
In the 1990s Paul Chen of the University of
California at Berkeley computer center published a series of papers on tuning disk
array stripe unit size based on expected concurrency. In these
papers Chen and his associates determined that a disk's IO
speed (as measured by average seek time) and IO rate (as measured in
megabytes per second) determine the stripe size for best
performance in an array, even when the number of concurrent accesses
is not known. Three formulae were derived from these papers:
For non-RAID5 arrays when concurrency is known:
SU = (S*APT*DTR*(CON-1)*1.024)+.5K
Where:
· SU is the striping unit per disk
· S is the concurrency slope coefficient (~.25)
· APT is the average positioning time (milliseconds)
· DTR is the data transfer rate (megabytes/sec)
· CON is the number of concurrent users
· 1.024 = 1s/1000ms * 1024K/1M (conversion factor for units)
So for a drive that has an average seek time of
5.6 ms and a transfer rate of 20 Mbyte/second the calculated stripe
unit for a 20 concurrent user base would be:
(.25*5.6*20*(19)*1.024)+.5 = 545K (or ~512K)
For a system where you do not know the
concurrency, the calculation becomes:
SU = (2/3*APT*DTR)
So for the same drive:
2/3*5.6*20*1.024 = 76.46K, so round up to ~128K or down to 64K.
And from Chen's final paper, the formula for RAID 5
arrays is:
SU = (1/2*APT*DTR)
So for the same drive:
0.5*5.6*20*1.024 = 57.34K (round up to 64K).
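The three stripe-unit formulae can be compared side by side in
SQL*Plus; the drive figures (5.6 ms positioning time, 20 megabytes/sec
transfer rate, 20 concurrent users, slope of .25) are the same ones
used in the examples above:

-- Stripe unit (K) for known concurrency, unknown concurrency, and RAID 5
select round((.25*5.6*20*(20-1)*1.024)+.5, 2) as su_known_con_k,
       round((2/3)*5.6*20*1.024, 2)           as su_unknown_con_k,
       round(.5*5.6*20*1.024, 2)              as su_raid5_k
from   dual;

This returns roughly 545K, 76K, and 57K, which you would then round to
the nearest stripe size the array actually supports (~512K, 64K or 128K,
and 64K respectively).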
The values for average access time and transfer
rate used in these examples are actually fairly low compared to
more advanced drives, so the stripe sizes shown above are probably
low by at least a factor of 2 or more. I say this because while
average seek times drop, transfer rates increase; for example, on
an Ultra3 SCSI 15K drive the specified average seek may drop to 4.7
ms while the transfer rate leaps to 70 megabytes per second. So the
overall value of the combined factor (APT*DTR) goes from 112 to 329, a 293 percent
increase.
The 100% Myth
Many system administrators are guilty of
perpetuating the 100% myth. This myth states that you don't need
more assets (be it disk, CPU, or Memory) until the existing asset is
100% utilized. This leads to performance issues in the area of
disks. Due to disk physics the best performance for a disk is at the
outer edges, once you get towards the inner sectors performance
decreases because of the distance the head must travel to read the
data and other factors. In the good old days administrators spent
much time positioning frequently used files on the outer edges of
disks.
While physically positioning files on disks is
difficult if not impossible in modern RAID systems, you should
endeavor not to fill the disks to 100 percent of capacity. Some experts say
not to use more than 30 percent if you want maximum performance; others say 50 percent.
I say it depends on how the system is used, the operating system, and
the RAID array system. For example, the CLARiiON from EMC promises to
tune the placement of files such that frequently used files are in
the best locations.
So, what can we summarize about disk size and
speed?
Get the
fastest drives you can, and plan capacity based on concurrency
requirements as well as IO requirements. The more fast disks you
have, the better.
Improper Initialization File Parameter Settings
For Oracle7, version 7.3, there are 154
initialization parameters, for Oracle8, version 8.0.5, there are
184. In Oracle8i there are 194. In Oracle9i version 9.0.1
there are 251 and in 9iR2, 257. In 10g the number of parameters
actually dropped to 254 but the number of undocumented parameters
increased. In 9iR2 the number of undocumented parameters was 583 up
to 918 in 10gR1.
Fortunately there are very few that you need to
adjust to tune Oracle. Table 2 lists the major tuning parameters,
but is not supposed to be a complete list by any means.
Parameter                           Definition
----------------------------------  ------------------------------------------------------------
create_bitmap_area_size             Sets the memory area for bitmap creation
bitmap_merge_area_size              Sets the memory area used for bitmap merges
create_stored_outlines              Allows Oracle to create stored outlines
cursor_sharing                      Sets automated literal replacement
db_file_multiblock_read_count       Sets the read size for full table and index scans
filesystemio_options                Sets direct or asynchronous IO options for filesystem reads
optimizer_index_caching             Used to tune index access
optimizer_index_cost_adj            Used to tune index access
query_rewrite_enabled               Allows queries to be rewritten to use materialized views or FBIs
query_rewrite_integrity             Sets the criteria for when materialized views are used
session_cached_cursors              Sets the number of cached cursors at the session level
sga_max_size                        Sets the maximum SGA memory size
sga_target                          Sets the baseline SGA memory size
star_transformation_enabled         Allows Oracle to use star transformation
transactions_per_rollback_segment   Sets the number of transactions that will use a single rollback (undo) segment
pga_aggregate_target                Sets the total PGA memory usage limit
workarea_size_policy                Determines how work areas (sort and hash) are sized
buffer_pool_keep                    Sets the size of the keep buffer pool for tables and indexes
buffer_pool_recycle                 Sets the size of the recycle buffer pool for tables and indexes
cursor_space_for_time               Sacrifices memory for cursor storage space
db_16k_cache_size                   Sets the size of the 16K block buffer cache
db_2k_cache_size                    Sets the size of the 2K block buffer cache
db_32k_cache_size                   Sets the size of the 32K block buffer cache
db_4k_cache_size                    Sets the size of the 4K block buffer cache
db_8k_cache_size                    Sets the size of the 8K block buffer cache
db_block_size                       Sets the default block size for the database
db_cache_size                       Sets the default buffer cache size
Table 2: Tuning Parameters
How to determine proper setpoints for all of
these is beyond the scope of this paper. However, the Oracle tuning
guides provide many good tips, as does the Burleson Consulting web
site:
www.remote-dba.net.
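Before adjusting any of them, it helps to know where they currently
stand. A minimal sketch (the IN-list is just a sample; extend it with
whichever parameters from Table 2 you are reviewing):

-- Show current values and whether they are still at their defaults
select name, value, isdefault
from   v$parameter
where  name in ('pga_aggregate_target', 'sga_target',
                'db_file_multiblock_read_count', 'cursor_sharing',
                'optimizer_index_cost_adj', 'workarea_size_policy')
order  by name;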
Improper PGA setup
I don't believe there is anyone out there that
believes disk based sorts and hashes are good things. A disk based
operation will take anywhere from 17 to hundreds of times as long as
a memory based operation depending on buffering, IO bandwidth,
memory and disk speeds.
Oracle provides AWRRPT or statspack reports to
track and show the number of sorts. Unfortunately, hashes are not so
easily tracked. Oracle tracks disk and memory sorts, the number of sort
rows, and other sort-related statistics. Hashes, on the other hand,
can usually only be tracked through execution plans for cumulative
values and through various views for live values.
In versions prior to 9i the individual areas
were set using the sort_area_size and hash_area_size parameters;
from 9i onward the PGA_AGGREGATE_TARGET parameter allows
automated sizing of the sort and hash areas. For currently active
sorts or hashes, the script in Figure 19 can be used to watch the
growth of temporary areas.
For complete scripts, see my Oracle
script collection at
www.dba-oracle.com/oracle_scripts.htm.
column now format a14
column operation format a15
column dt new_value td noprint
set feedback off
select to_char(sysdate,'ddmonyyyyhh24miss') dt from dual;
set lines 132 pages 55
@title132 'Sorts and Hashes'
spool rep_out\&&db\sorts_hashes&&td
select
sid,work_area_size,expected_size,actual_mem_used,max_mem_used,tempseg_size,
to_char(sysdate,'ddmonyyyyhh24miss') now, operation_type
operation
from v$sql_workarea_active
/
spool off
clear columns
set lines 80 feedback on
ttitle off
Figure 19: Live Sorts and Hashes Report
Figure 20 shows an example output from this report.
Date: 01/04/06                                                       Page: 1
Time: 01:27 PM             Sorts and Hashes                              SYS
                           whoville database

      Work Area  Expected  Actual Mem  Max Mem  Tempseg
 SID       Size      Size        Used     Used     Size  Now              Operation
---- ---------- --------- ----------- -------- --------  ---------------  ---------------
1176    6402048   6862848           0        0           04jan2006132711  GROUP BY (HASH)
 582     114688    114688      114688   114688           04jan2006132711  GROUP BY (SORT)
 568    5484544   5909504      333824   333824           04jan2006132711  GROUP BY (HASH)
1306    3469312   3581952     1223680  1223680           04jan2006132711  GROUP BY (HASH)
Figure 20: Example Sorts and Hashes Report
As you can see, at the time the report was run the whoville
database had no hashes going to disk. We can also
look at the cumulative statistics in the v$sysstat view for
cumulative sort data.
Date: 12/09/05                                                       Page: 1
Time: 03:36 PM             Sorts Report                             PERFSTAT
                           sd3p database

Type Sort                 Number Sorts
--------------------    --------------
sorts (memory)               17,213,802
sorts (disk)                        230
sorts (rows)              3,268,041,228
Figure 21: Cumulative Sorts
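The cumulative numbers in Figure 21 come straight from v$sysstat;
stripped of the report formatting, the underlying query is
essentially:

-- Cumulative sort statistics since instance startup
select name                               "Type Sort",
       to_char(value, '999,999,999,999')  "Number Sorts"
from   v$sysstat
where  name like 'sorts%';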
Another key indicator that hashes are going to disk
is excessive IO to the temporary tablespace when few or no
disk sorts are being reported.
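One way to spot that pattern is to compare temporary-file IO against
the disk-sort count. This is a sketch only; it assumes tempfile-based
temporary tablespaces and access to the v$ views:

-- Physical reads and writes against each tempfile
select f.name, s.phyrds, s.phywrts
from   v$tempstat s, v$tempfile f
where  s.file# = f.file#;

-- Compare against the cumulative disk sort count
select value as disk_sorts
from   v$sysstat
where  name = 'sorts (disk)';

Heavy tempfile write activity alongside a low 'sorts (disk)' count
points at hash operations spilling to disk rather than sorts.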
The PGA_AGGREGATE_TARGET is the target
total amount of space for all PGA memory areas. However, only 5 percent,
up to a maximum of 200 megabytes, can be assigned to any single process.
The limit for PGA_AGGREGATE_TARGET is supposedly 4 gigabytes,
although you can increase the setting above this point.
The 200 megabyte limit is set by the _pga_max_size
undocumented parameter; this parameter can be reset, but only under
the guidance of Oracle support. But what should
PGA_AGGREGATE_TARGET be set to? The AWRRPT report in 10g provides a
sort histogram that can help in this decision. Figure 22 shows an
example of this histogram.
PGA Aggr Target Histogram                       DB/Inst: OLS/ols  Snaps: 73-74
-> Optimal Executions are purely in-memory operations

    Low     High
Optimal  Optimal     Total Execs  Optimal Execs  1-Pass Execs  M-Pass Execs
------- --------  --------------  -------------  ------------  ------------
     2K       4K       1,283,085      1,283,085             0             0
    64K     128K           2,847          2,847             0             0
   128K     256K           1,611          1,611             0             0
   256K     512K           1,668          1,668             0             0
   512K    1024K          91,166         91,166             0             0
     1M       2M             690            690             0             0
     2M       4M             174            164            10             0
     4M       8M              18             12             6             0
-------------------------------------------------------------
Figure 22: Sort Histogram
In this case we are seeing 1-pass executions,
indicating that disk sorts are occurring, with the maximum size falling in
the 4M to 8M range. For an 8M sort area the PGA_AGGREGATE_TARGET
should be set at 320 megabytes (each sort gets
0.5*(.05*PGA_AGGREGATE_TARGET)). For this system the setting was
160 megabytes, so 4 megabytes was the maximum sort size; as you can see,
we were seeing 1-pass sorts in the 2M-4M range as well, even at 160M.
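Turning the 0.5*(.05*PGA_AGGREGATE_TARGET) rule around gives a quick
sizing check: to keep an N-megabyte sort entirely in memory you need
roughly N/0.025, or 40*N, megabytes of PGA_AGGREGATE_TARGET. For the
8-megabyte case above:

-- Required PGA_AGGREGATE_TARGET (MB) to keep an 8 MB sort in memory
select 8/(0.5*0.05) as pga_aggregate_target_mb
from   dual;

which returns 320, matching the recommendation above.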
By monitoring the real-time (live) hashes and
sorts and looking at the sort histograms from the AWRRPT reports, you
can get a very good idea of the needed PGA_AGGREGATE_TARGET
setting. If you need sort areas larger than 200 megabytes, you may
need to get approval from Oracle support through the iTAR process
to set the _pga_max_size parameter to greater than 200
megabytes.
Summary
This paper has presented the major tuning issues
I have seen at many sites during tuning engagements. I have presented
ways of determining whether these issues exist and how to determine
settings to help mitigate them in an active database.