We must remember that Oracle does not
run in a vacuum, and that there are important "external" events that take place
outside the scope of the Oracle instance. We see this frequently with the
Oracle disk I/O metrics.
I/O remains a critical part of Oracle
performance tuning. While everything else in Oracle runs in microseconds
and RAM moves in nanoseconds, the mechanical disk devices on some Oracle servers
become a major bottleneck.
When tuning Oracle, we often forget to look
outside the instance and examine the external influences on disk I/O speed.
These may include:
- Stripe size - The stripe size for raw disk partitions will influence I/O behavior, especially for multi-block read operations (full scans).
- TCP/IP - TCP/IP settings and Oracle TNS settings (e.g. tcp.nodelay) affect I/O timings (see the socket sketch after this list).
- RAID level - The RAID level used on the disk array can have a profound impact on end-to-end I/O timings. For example, up until 2015, RAID 5 with Oracle was sometimes inappropriate for high-update databases.
- Disk Controllers - A bottleneck in the disk farm, such as a lack of controller resources, can precipitate high I/O latency.
- Storage Array Internals - Many of today's storage arrays have specialized optimization software and on-board RAM caching to improve throughput.
- SAN/NAS - Dynamically attached storage has special issues.
- Server caching - Servers also cache data in RAM, both in an internal file buffer cache and in a journaled file system (JFS) cache.
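To illustrate the TCP/IP point above, here is a minimal Python sketch (not Oracle Net code) showing how a client program can disable Nagle's algorithm on a socket, which is roughly the behavior the tcp.nodelay parameter is meant to control:

import socket

# Create a TCP socket and disable Nagle's algorithm so that small
# packets are sent immediately instead of being coalesced.  This is
# an illustration of the concept only, not the Oracle Net implementation.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

print("TCP_NODELAY =", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
sock.close()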
When Oracle makes a physical I/O request, it is
handed off to the operating system as a native I/O operation. At this
layer, the device-media interface takes over, as Oracle patiently waits for the
block to be returned to the calling routine. During this time, many
external events can influence the observed I/O timings.
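As a rough illustration of what the database "sees" at this layer, the following hedged Python sketch (a Unix-style OS and a hypothetical file path are assumed) times a single block-sized read issued through the operating system. All the database-level timer can capture is the elapsed time of this call; controller queues, array caches and RAID overhead remain invisible to it:

import os
import time

PATH = "/tmp/sample.dbf"    # hypothetical file standing in for a data file
BLOCK_SIZE = 8192           # a typical Oracle block size

fd = os.open(PATH, os.O_RDONLY)
start = time.perf_counter()
block = os.pread(fd, BLOCK_SIZE, 0)   # hand the request to the OS and wait
elapsed_ms = (time.perf_counter() - start) * 1000
os.close(fd)

print(f"read {len(block)} bytes in {elapsed_ms:.3f} ms")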
Disk reads and duplicate RAM layers
As hardware evolved through the 1990s,
independent components of database systems started to employ their own RAM
caching tools as shown below:
Multiple RAM caches in an Oracle enterprise
In this figure, the Oracle database is not the
only component to utilize RAM caching. The disk array employs a RAM cache, the
servers have a Journal File System (JFS) RAM cache, and the front-end web server
also serves to cache Oracle data.
This concept is important because many enterprises may inadvertently double-cache
Oracle data. Even more problematic are the fake statistics reported by
Oracle when multiple cache levels are employed:
- Fake physical I/O times: If a disk array with a built-in RAM cache is in use, the disk I/O subsystem may acknowledge ("ack") a physical write to Oracle when, in reality, the data has not yet been written to the physical disk spindle. This "false ack" can skew the timing of disk read/write speeds.
- Wasted Oracle data buffer RAM: In a system that employs web servers, the Apache front end may cache frequently used data. Thus, significant Oracle resources may be wasted caching data blocks that are already cached on the web server tier.
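The "false ack" effect can be approximated outside of Oracle. In the hedged Python sketch below (the scratch file path is hypothetical), a buffered write returns almost immediately because it is acknowledged from cache, while an explicit fsync(), which pushes the data toward the media, takes noticeably longer:

import os
import time

PATH = "/tmp/ack_demo.dat"       # hypothetical scratch file
payload = os.urandom(8192)       # one 8K "block"

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)

start = time.perf_counter()
os.write(fd, payload)            # returns as soon as the write is cached
write_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
os.fsync(fd)                     # force the data out toward the physical media
fsync_ms = (time.perf_counter() - start) * 1000

os.close(fd)
print(f"cached write ack: {write_ms:.3f} ms   flush to media: {fsync_ms:.3f} ms")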
Disk writes and false "acks"
Many of today's disk arrays employ a delayed
write mechanism, improving performance by caching the data block in RAM and
writing it asynchronously. At write time, these devices lie to Oracle (by
"acknowledging" the disk write with an "ack") when, in reality, the data block is
still in RAM on the storage array. This skews the I/O timings, reporting
them as lower than those of a completed disk write.
Let's take a closer look and see why external
influences can bias performance tuning studies.
Oracle metrics and disk performance
As we have noted, the reported times that
Oracle gets back from the OS can be exaggerated (the false "ack"), and the root
cause of slow I/O is frequently hidden from Oracle by the machinations of the
operating system.
This issue is also aggravated by RAM caches (Disks cache, JFS cache) which hide the "real I/O timings. See
my notes on
multiple RAM levels for complete details.
Anytime an
Oracle data block is accessed from disk, we commonly see three sources of delay.
The first and most important source of delay is the read-write head movement
time. This is the time required for the read-write head to position itself under
the appropriate cylinder.
We also see rotational delay as the read-write head
waits for the desired block to pass beneath it, and the third source of delay
is the data transmission time from the disk back to the Oracle SGA.
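To put rough numbers on these three components, the sketch below estimates the cost of a single 8K block read using hypothetical but typical figures for a 7,200 RPM drive; the seek and rotational delays dwarf the transfer time:

# Hypothetical figures for a 7,200 RPM disk; adjust for your hardware.
avg_seek_ms = 8.0                            # read-write head movement
avg_rotational_ms = 0.5 * (60_000 / 7_200)   # half a rotation, about 4.17 ms
transfer_mb_per_sec = 100.0                  # sustained media transfer rate

def block_transfer_ms(block_bytes: int) -> float:
    return block_bytes / (transfer_mb_per_sec * 1_000_000) * 1000

total_ms = avg_seek_ms + avg_rotational_ms + block_transfer_ms(8192)
print(f"seek {avg_seek_ms:.2f} ms + rotation {avg_rotational_ms:.2f} ms "
      f"+ transfer {block_transfer_ms(8192):.3f} ms = {total_ms:.2f} ms")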
Large disks can suffer read-write head thrashing
Seek time (read-write head movement) remains the largest component of Linux I/O
latency. The Oracle professional can work around this issue by intelligently
placing high-I/O data files near the "middle" absolute track of the disk
spindle, so that "hot" data files require minimal read-write head movement.
If we accept the premise
that 99 percent of the latency is incurred prior to actually accessing the
desired data block, then it makes sense that the marginal cost for reading a 32K
block is not significantly greater than the cost of reading a 2K block. In other
words, the amount of disk delay is approximately the same regardless of the size
of the block. Therefore, it follows that the larger the block you can read in a
single I/O, the less total I/O the Oracle database will perform.
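Continuing the hypothetical disk figures from the earlier sketch, the comparison below shows why the marginal cost of a larger block is small: almost all of the latency is spent before the first byte is transferred, so a single 32K read costs barely more than a single 2K read, while fetching the same 32K as sixteen scattered 2K reads multiplies the fixed seek and rotational costs:

# Same hypothetical disk figures as before.
FIXED_MS = 8.0 + 4.17                 # seek + rotational delay per physical read
TRANSFER_MB_PER_SEC = 100.0

def read_ms(block_bytes: int) -> float:
    return FIXED_MS + block_bytes / (TRANSFER_MB_PER_SEC * 1_000_000) * 1000

print(f"one 2K read : {read_ms(2 * 1024):.2f} ms")
print(f"one 32K read: {read_ms(32 * 1024):.2f} ms")
print(f"32K as sixteen scattered 2K reads: {16 * read_ms(2 * 1024):.2f} ms")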
The principle behind
caching is not unique to Oracle databases. Access from RAM is measured in
nanoseconds, while access from disk is generally measured in milliseconds. This
amounts to an improvement of several orders of magnitude if we can get
the Oracle data block into a RAM buffer.
As Oracle grows more sophisticated and RAM becomes cheaper, we tend
to see Oracle databases with system global areas (SGA) that commonly exceed 10
GB. This has important ramifications for the performance of the Oracle database
because once read, the Oracle data blocks reside in RAM where they can be
accessed tens of thousands of times faster than having to go to disk in order to
retrieve the data block. For details, see my notes on
disk I/O concepts.
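The idea can be sketched in a few lines of Python. The toy least-recently-used cache below is loosely analogous to a database buffer cache (it is not Oracle's implementation): once a block has been read from the expensive "disk" fallback, repeat requests are served from RAM:

from collections import OrderedDict

class TinyBufferCache:
    """A toy LRU cache of data blocks, loosely analogous to a buffer cache."""
    def __init__(self, capacity: int, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk   # expensive fallback (disk read)
        self.blocks = OrderedDict()

    def get(self, block_id: int) -> bytes:
        if block_id in self.blocks:            # RAM hit: nanosecond-scale
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        data = self.read_from_disk(block_id)   # cache miss: millisecond-scale
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:   # evict the least recently used block
            self.blocks.popitem(last=False)
        return data

# Usage: cache = TinyBufferCache(1024, read_from_disk=my_block_reader)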
Perception vs. reality in disk I/O
Consider the compelling argument that
different block sizes should have no influence on total response time.
Its proponents correctly note that, in theory, four 8k block reads should take the same
amount of time as a single 32k read, and ergo, that block size should not
matter.
However, this does not explain
credible reports of different response times using different block sizes.
Could these "external" influences be to blame?