Troubleshooting an acute Oracle performance problem is not
always straightforward. Yesterday I got a call from a client that was
having a serious problem on their production server, an emergency. I
started by checking the server and noted that the CPU runqueue was through the
roof, far above the number of processors.
The end-users first called for emergency support when they
noticed that their SQL inserts were "hanging", forcing them to wait for long periods.
Slow response time is one of the most common initial complaints when Oracle has
encountered a bottleneck. This was a medium-sized Sun server with 16 CPU
cores and 32GB of RAM, running 64-bit Oracle release 10.2.0.4
on the Solaris 10 operating system.
Identifying the root cause of the hanging
In order to locate the problem, the first step was to
do a little fact finding:
- What was happening inside Oracle when the hanging started?
- Was there any other unusual activity on the server?
- What kinds of things happened in the past 24 hours that might have
predictive value?
After inspecting the environment I discovered that the night
before, a huge batch update job had been run inside a loop by mistake. The problem
was discovered by the computer room operators, who quickly killed the job,
but not before a large number of updates had already been applied.
The following morning things seemed to run fine at first, but
massive Oracle waits started piling up on inserts and deletes against a large
audit table. Audit tables are used for tracking purposes, and they can
become a bottleneck since all tasks write to them.
I used the following query to find the wait events that were
currently contributing to the bottleneck:
select event, count(*)
from   v$session_wait
group  by event;
The results of this query showed a great number of waits on library
cache pins and
cache buffers chains latches. A
library cache pin takes place when a process wants to pin an object in memory in
the library cache for examination, ensuring that no other process can update the
object at the same time. The cache buffers chains latches are used to protect the
buffer lists in the buffer cache while Oracle manages in-memory data blocks.
Contention on cache buffers chains can indicate a "hot" block.
Tip: Using the Oracle wait views
- This simple query is extremely powerful. For anyone who has
not used the v$session_wait view, let me give you an introduction.
This view contains each SID (session logged into the instance), a wait
event, and supporting information in the p1, p2, and p3 columns (you can
look up the meaning of these columns in v$event_name on 10g).
Also, the ASH tables contain this information, and I recommend the
Oracle poster to see more detail on the wait event tables and how they
relate to one another.
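To decode what p1, p2, and p3 mean for a given wait, you can join
v$session_wait to v$event_name, as the note above suggests. A minimal
sketch (the SQL*Net filter is only there to hide idle network waits):
select sw.sid, sw.event,
       en.parameter1, sw.p1,
       en.parameter2, sw.p2,
       en.parameter3, sw.p3
from   v$session_wait sw,
       v$event_name   en
where  sw.event = en.name
and    sw.event not like 'SQL*Net%'
order  by sw.event;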
Because of extensive hands-on experience as the primary DBA on
this system, I knew that these waits were "symptoms" of the root cause, and not
the true precipitator of the hanging. Suspecting a hidden root cause, I
searched for any waits that were less common and found these two:
- enq: US - contention (wait for stopper event to be increased)
- wait for a undo record
Could these less common waits be a clue into the root cause?
Putting together the clues
The biggest clue was knowing that there had been a runaway update
the night before, and I knew that Oracle had to roll back all of those updates, a
resource-intensive task that could take hours.
I had also seen the end-users complain of high DML waits on the
audit table, and this was enough to put together the assumption that undo was
the root cause of these critical performance issues.
A quick check of the top server resource consumers (using the
top utility) showed that the SMON process was consuming a fair amount of CPU on
the system, along with some parallel query (PX) processes, symptoms that can
indicate that Oracle is doing a fast-start parallel recovery. I did a quick
check on MOSC and found Note 464246.1 - DATABASE HANG DUE TO PARALLEL
TRANSACTION RECOVERY. My hunch was now confirmed and I was confident that I had
identified why the database was hanging. My next step was to correct the
problem, and quickly, since this was a mission critical database.
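Before taking action, you can confirm from inside the database that
fast-start parallel recovery is underway by checking the fast-start recovery
views; a minimal sketch (the exact columns may vary slightly by release):
select state, undoblocksdone, pid, xid
from   v$fast_start_servers;
Any rows returned indicate that SMON has spawned parallel recovery slaves
to roll back the dead transactions.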
Inside Parallel Transaction Recovery
When you perform DML in Oracle, your changes are logged in
an UNDO segment within the UNDO tablespace (among other places).
Conceptually, you can think of UNDO records as the "before" images of all
updates, and these before images are used to roll back any job that fails.
Please note that on RAC, each node has its own active UNDO
tablespace, while on single instance systems only one UNDO tablespace may be
active at any time (unless you are in the middle of switching to a new UNDO
tablespace). But what does UNDO actually accomplish? It's far more than just
maintaining database integrity:
- Transaction rollback - UNDO is the mechanism by
which you can undo your changes before a commit with the ROLLBACK command.
Also, if a process is abnormally terminated UNDO will be used to clean up
any uncommitted transactions.
- Flashback - Oracle 10g is sometimes called the
Flashback Database. Flashback Query and Flashback Table (but not dropped
table recovery) utilize UNDO, and large shops can create a larger UNDO tablespace
and be able to see the exact contents of any table for many days into the past
(see the short sketch after this list).
- Read consistency - Read consistency ensures that
you always get time consistent results, even for long queries that run while
the tables are being updated. For example, if you start a query at 9AM and
it takes 3 hours to finish, are the results you see from 9AM or noon? The
answer is 9AM, the moment you pressed the enter key. Oracle accomplishes
this through read consistency, which pulls the data you require either from the
current data blocks or from consistent-read copies built using UNDO. If you do
not have enough UNDO to satisfy your long-running queries, you may receive an
ORA-01555: snapshot too old, rollback segment too small error.
- Instance Recovery - If your instance is abnormally
terminated (kill -9 pmon, shutdown abort, janitor trips over cable), SMON
will handle recovery on instance startup. This involves four phases: mounting
the database, rolling forward all redo generated since the last checkpoint into the
datafiles, opening the database, and rolling back all uncommitted transactions.
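As a quick illustration of the Flashback bullet above, a Flashback Query simply
asks Oracle to rebuild a past image of the rows from UNDO; a minimal sketch
against a hypothetical emp table:
select * from emp
  as of timestamp (systimestamp - interval '30' minute);
If the required before images have already been overwritten in UNDO, this query
fails with the same ORA-01555 error mentioned above.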
This last bullet point, instance recovery, is the reason
things went bad for my client. When Oracle rolls back all uncommitted
transactions during an instance startup, SMON will spawn parallel processes to
do the work. In addition, the server-side processes for user connections
will also try to roll back uncommitted transactions if they stumble across
them while doing their own work.
During this time, if any processes are using excessive amounts
of UNDO, all updates (DML) will perform worse and worse. For instance, inserts
to a table by many connected users while jobs try to keep the data clean with
deletes on the backend could cause a lot of UNDO contention.
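If you suspect that particular sessions are burning through UNDO, you can join
v$transaction to v$session to see how many undo blocks each active transaction
is holding; a minimal sketch:
select s.sid, s.username, t.status,
       t.used_ublk, t.used_urec
from   v$transaction t,
       v$session     s
where  t.ses_addr = s.saddr
order  by t.used_ublk desc;
The used_ublk column shows the number of undo blocks each transaction is consuming.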
Now let's take a closer look at the Oracle UNDO recovery mechanism.
Inside Oracle Undo Recovery
Oracle instance recovery is controlled with several important
initialization parameters, most notably FAST_START_PARALLEL_ROLLBACK,
which controls how much juice SMON will take. The three possible values
are FALSE, LOW, and HIGH.
- False - A setting of FALSE will tell SMON to do
the recovery on its own without parallel assistance. This could work to
allow your app to remain online while SMON does its job, but could still
cause contention issues. In the end, it is better to let SMON finish the
work while other processes try to lay off UNDO as much as possible.
- Low - A setting of LOW will tell SMON to use 2 *
CPU_COUNT parallel processes to do the work.
- High - A setting of HIGH will use 4 * CPU_COUNT parallel processes,
giving the recovery the highest priority and allowing SMON to complete
its work quickly.
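For reference, this parameter is dynamic, so it can be inspected and changed
from SQL*Plus without a bounce; a minimal sketch (assuming you have ALTER
SYSTEM privileges):
show parameter fast_start_parallel_rollback
alter system set fast_start_parallel_rollback = 'HIGH';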
In order to allow SMON to finish its work as quickly as
possible, we shut down the application and set the FAST_START_PARALLEL_ROLLBACK
parameter to HIGH, allowing SMON all the resources it needed to finish. To
determine how long it would take to finish, we consulted the
V$FAST_START_TRANSACTIONS view (hey, a V$ view that's pluralized!).
select state, undoblocksdone, undoblockstotal,
       undoblocksdone / undoblockstotal * 100
from   v$fast_start_transactions;
STATE UNDOBLOCKSDONE UNDOBLOCKSTOTAL UNDOBLOCKSDONE/UNDOBLOCKSTOTAL
---------------- -------------- --------------- ------------------------------
RECOVERING 94160 95432 98.6671138
The result you see here was near the end of the recovery process, with only
1,272 blocks to go until completion. Once the work completed, we were able to
bring the application back up and make it available to the end-user community.
Undone by Undo
But after application startup, confusion struck! The vmstat utility
showed that the CPU runqueue was creeping up again, and all the while, the
question was asked: why didn't this problem affect us immediately? Why
did we not see it until the afternoon? Soon thereafter, contention was back
up on the system with the enq: US - contention wait in the forefront.
Even though the fast-start parallel rollback had finished, contention was still present.
A small amount of further investigation provided the answer: background jobs
deleting from the high-concurrency tables had overlapped. This meant that multiple jobs
were attempting to do the same deletes over and over again while the application
was still trying to insert into those tables, causing locks and UNDO contention!
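One quick way to see this kind of pile-up is to look for sessions that are
blocked by other sessions; a minimal sketch using the blocking_session column
available in 10g:
select sid, serial#, username, event,
       blocking_session, seconds_in_wait
from   v$session
where  blocking_session is not null;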
That also provided the answer to the question of why the
performance problem was not detected in the morning. The system was able
to provide reasonable response time because of the smaller amount of UNDO
being generated during the less busy morning hours. However, once
things ramped up in the afternoon (and with fast-start parallel rollback taking
so much horsepower), their system started breaking down at all
levels. Prior to calling me, the client had also done a shutdown abort and a
Data Guard failover, which unfortunately made the problem worse because
there was more UNDO to recover. The result was a catastrophic UNDO pile-up.
In the end, though, we can't blame UNDO for the performance problem. It was only
doing its job, trying to preserve the integrity of the system. However, we can
take away two things from this problem:
- Remember the fast_start_parallel_rollback parameter -
In the case of excessive CPU usage due to UNDO following an instance
failure, SMON can be throttled with the FAST_START_PARALLEL_ROLLBACK
parameter. You can monitor the progress of the rollback using the
V$FAST_START_TRANSACTIONS view.
- Put control mechanisms in your application jobs -
Any scripts that perform DML should include a check at the top
to see if the script is already running (ps -ef | grep [p]urge.sh, for
instance). This will take care of the overlapping-process issues that are bound
to cause contention. In addition, avoid unnecessary DML. Do you really need
to insert/update/delete as much as you are? Is there another way, perhaps
using memory or even flat files, to avoid a constant barrage on the database?
In the end, the big conclusion is this: anything in excess can
be problematic, especially when there are waits involved.