 |
|
Oracle disk access speeds for full scan operations
Oracle Tips by Burleson Consulting
October 28, 2007
|
Prerequisite background reading:
A quick history of Oracle storage
Today we see a radical disconnect
between the traditional "transparent disk access" where you
could measure the "real" physical access speed and today's
storage arrays.
The most popular disk of the 1980's was the
refrigerator-sized 3380 disks, which contained only 1.2 gig of storage at the
astronomical cost of over $200,000. In today's 2007 dollars, disk in the
1980's costs more than $4,000.00 per megabyte.
Today, you can buy 100
GB PC disks for $200, and 100 GB of RAM Disk (solid-state disk) for
$100,000.

In todays age of SAN, NAS and mass
storage devices, the "real" latency of a disk read/write is
hidden. For example, during a data block write, many disk
storage devices return a "false ack" (acknowledgement message)
to Oracle, when in reality, the data remains within an onboard
RAM cache and has not truly been written to the platters.
Introduction to Oracle full scan
I/O
To properly tune the Oracle I/O
subsystem we must carefully distinguish between Oracle's
perception of disk latency and the reality. The back-end
disk devices are "black boxes" and the only "real" information
we have about disk latency are the salient metrics collected by
Oracle. The Oracle metrics can easily be skewed by the
back-end devices but Oracle only knows the end-to-end latencies:
-
physical write waits, physical read waits
(physical reads/sec, physical reads direct, &c)
-
"db file scattered reads" waits
-
"db file sequential reads" waits
Traditionally, multiblock disk I/O
operations (e.g. table access full, index fast full scan, index
range scans) were considered faster because the access to
physically contiguous blocks required only a single seek delay,
the largest component of platter latency.
Once aligned under the proper cylinder, the read-write head
sits idle, sucking in many data blocks as the platter spins
beneath the read-write head. See my
notes on
Oracle db file sequential scattered read disk I/O speed tuning
for more details.
The Oracle optimizer is very
important because it is the component that decides between index
access vs. full-scan access, and there are several important
optimizer settings that influence this behavior.
It's all about the relative cost
of sequential vs. scattered read waits, and
analyzing system statistics provides the optimizer with
real-world empirical timings to help the CBO make better
decisions about the "best" access method. We also see
other important parameters such as the
db_file_multiblock_read_count and the
_optimizer_cost_model parms. To learn more, see my
notes on
global SQL optimization with parameters.
Read-ahead caching
The read ahead caching has many names on different operating
systems, and it started on IBM mainframes as "sequential
prefetch".
The concept of read-ahead caching is simple. Over 90% of I/O
latency is consumed in the read-write head movement, as the
heads are placed under the target cylinder. Once in-place, the
disk continues to spin and the read-write head can
simultaneously transmit back the original block request at the
same time as the next sequential block passes below the
read-write heads.
For scan operations (index range scans, index fast full scans,
and full-table scans), a read-ahead cache can be very useful for
speeding up these "scattered read" operations.
In sum, if your database is requesting adjacent data blocks, the
read-ahead cache may improve I/O throughput. However, using
RAID10 (SAME, stripe and mirror everywhere), like with ASM, the
blocks are not adjacent, and a read-ahead cache may not be as
useful as a database where the data blocks are laid-out
sequentially. Of course, the stripe width influences this
decision.
But not all disk configurations
are created equal. There are some Oracle
conditions that may adverse effect disk access speed for
full-scan operations (e.g. "db file scatter reads").
What factors
effect multi-block disk access speeds?
It generally accepted that
scattered read waits (multiblock reads from full-scans) will be
reported as higher latency than sequential read waits (single
block requests) because the multi-block I/O pulls-in many data
blocks as opposed to a single block. However, we must
remember that the first I/O that locates the read-write heads
under the proper cylinder can be 70%-80% of total access time,
and many databases can do multiblock I/O faster than
sequential I/O (See
Oracle
STATSPACK & AWR reports showing scattered read timings).
Why the big difference in DB file
wait times? Could this be due to internal
Oracle configurations (e.g. Automatic Storage Management (ASM)
of bitmap freelists (ASSM)? Oracle guru David Aldridge has a
great
write-up on Linux kernel I/O for large Oracle systems in
Linux.
David suggests that full-scan
access speed is aggravated by Oracle willy-nilly block placement
in Automated Storage Management (ASM) and using bitmap freelists
(Automated Segment Storage Management).
By doing a workload analysis of an Oracle database, we should
expect to see that any of these external and internal factors
will influence the speed of full-scan and range scan access:
- An unreasonable workload - As system
workload increases, it is possible to detect disk enqueues, especially where
the data blocks are not "randomized".
- Automatic Storage Management - ASM place data blocks in a
non-sequential fashion within the logical tablespace.
- Automatic segment Space Management - ASSM (bitmap freelists) have
been associated with slower multi0block access in large data warehouses (See
Aldridge link).
- The db_file_multiblock_read_count (MBRC) effects the number of
blocks accessed within a "db file scattered read" wait event.
- Blocksize - The OS block size and Oracle blocksize effect the bit
stream packet sizes and frequency during an I/O call.
- Platter disk vs. RAM disk - For example, solid-state disk has no spinning
platters and research shows no difference between single block and
multi-block access speeds (which can be up to 300x faster than antiquated
platter disk)
- RAID level - Oracle standard "Stripe and Mirror Everywhere" SAME
(a.k.a. RAID 10) allows you to define a stripe size, the number of physical
data blocks that are placed contiguously on disk. Other RAID levels
(RAID 5), may scramble the data blocks, negating the benefits of multi-block
reads, since logically contiguous bocks are not co-located on disk.
So, what can we assume? It would be interesting to review
some real-world databases and see if there are any correlations
between internal setup (ASM, ASSM), physical disk architecture,
and multi-block read performance.
Expert Observations on scattered
read I/O speed:
Steve Karam (OCP, OCM, Oracle ACE)
notes that the disk latency is primarily governed externally to
Oracle:
"In my experience, an I/O is an I/O…whether it’s a single
block read or a multiblock read, it is one I/O and will take the same amount
of time.
I suppose there may be some credible cases where multiblock reads were
slower…perhaps having too small a segment size in the RAID volume or
something like that.
The problem is that if databases do show slower full-scan I/O latency, the
problem could be at so many levels…it could be the disk array, the RAID
type, the amount of disks in the volume, the HBAs, the multipathing
software…you just can’t generalize something like that.
John Garmany, a West Point graduate in Electrical Engineering, notes the the
total number of blocks requested in a multi-block read will effect the overall
average disk latency:
"Another problem is that I can’t really tell
the level at which you can state that multiblock access are slower.
For instance, if the choice is 50 scattered reads of a table
(a full-table scan) or 3 sequential reads of an index/table (index range
scan or index unique scan)
then of course the sequential I/O is going to be faster than the scattered."
A closer look at Oracle scattered
read waits
10gr2 Note: Starting in
Oracle 10g release2, Oracle recommends not setting the
db_file_multiblock_read_count parameter, allowing Oracle to
empirically determine the optimal setting. For more details,
see my notes on
10gR2 automatically tuned multi-block reads.
Lets examine the effect of
the "db_file_multiblock_read_count"
(MBRC) of the reported speeds of "db file scattered read".
Jonathan Lewis also conducted an experiment on a PC to
demonstrate the effect of seek delay on the first of a series of
db file scattered reads. In this revealing test, Lewis
notes that with an MBRC of 32, 1,024 blocks can be read into the
SGA with only 32 waits for "db file scattered read" waits:
You didn't specify a db_file_multiblock_read value - let's
assume it's 32, and can we call it 1,024 blocks to make the arithmetic easy,
and forget about the segment header block.
If you could actually set up this experiment, I would expect to see Oracle
report 32 waits for "db file scattered read" (32 x 32 = 1024). I would also
expect the first read to be slower than the other 31.
Here's the content of a trace file that I've just generated from a test that
gets as close to what you've described as I can make it.PARSE #11:c=31250,e=33825,p=0,cr=18,cu=0,mis=1,r=0,dep=0,og=1,tim=1330376261792
EXEC #11:c=0,e=79,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=1330376262031
WAIT #11: nam='SQL*Net message to client' ela= 7 driver id=1111838976 #bytes=1 p3=0 obj#=-1 tim=1330376262092
WAIT #11: nam='db file scattered read' ela= 25727 file#=6 block#=7818 blocks=32 obj#=89657 tim=1330376288058
WAIT #11: nam='db file scattered read' ela= 3913 file#=6 block#=7850 blocks=32 obj#=89657 tim=1330376292455
WAIT #11: nam='db file scattered read' ela= 5070 file#=6 block#=7882 blocks=32 obj#=89657 tim=1330376297995
WAIT #11: nam='db file scattered read' ela= 3613 file#=6 block#=7914 blocks=31 obj#=89657 tim=1330376302233
WAIT #11: nam='db file scattered read' ela= 5006 file#=6 block#=7945 blocks=32 obj#=89657 tim=1330376307744
WAIT #11: nam='db file scattered read' ela= 3810 file#=6 block#=7977 blocks=32 obj#=89657 tim=1330376312053
. .
Note that the first read is obviously considerably slower
than the rest. There's also an interesting pattern in the rest of the reads,
which I can't explain, but might relate to small amounts of time due to
single track switches.
At the DMCL level, many systems
implement a "sequential prefetch" read-ahead mechanism to
suck-in data blocks fast. The Oracle Documentation notes
this on db file scattered reads:
"The db file scattered Oracle metric event
signifies that the user process is reading buffers into the SGA buffer cache
and is waiting for a physical I/O call to return.
A db file scattered read issues a scatter-read to read the data into
multiple discontinuous memory locations. A scattered read is usually a
multiblock read. It can occur for a fast full scan (of an index) in addition
to a full table scan.
The db file scattered read wait event identifies that a full table scan is
occurring. When performing a full table scan into the buffer cache, the
blocks read are read into memory locations that are not physically adjacent
to each other.
Such reads are called scattered read calls, because the blocks are scattered
throughout memory.
This is why the corresponding wait event is called 'db file scattered read'.
Multiblock (up to DB_FILE_MULTIBLOCK_READ_COUNT blocks) reads due to full
table scans into the buffer cache show up as waits for 'db file scattered
read'."
Jonathan Lewis notes that SAN's may distort the
Oracle disk I/O metrics:
"At the opposite extreme, many SANs have a
read-ahead mechanism that kicks in when you do large reads - with the effect
that follow-on reads in (say) a large tablescans have been pre-fetched by
the SAN, giving Oracle's run-time engine the impression
that multiblock reads (after the first one of a series) are inherently fast
- faster, even, than single block reads.
That's fine, of course, if all your work is
supposed to be multiblock reads and not too many of them; but under the
covers the discs are being hit very hard, and it doesn't really take much of
an overload to make your response times start to very wildly. That's a case
where it makes sense to create some system statistics that tell the
optimizer about the discs - and try to "hide" the effect of the SAN
read-ahead.
(SANs are like swans - on the surface they can seem serene and graceful
while under the surface they're working like mad) "
My References:
Also see my related notes on Oracle scattered read
access speeds: