Oracle 911: Emergency Oracle Support
Oracle Tips by Burleson
I work in a worldwide Oracle emergency support center (www.remote-dba.net)
where I receive calls from panicked clients all over the
world, each experiencing a serious loss of performance on
their mission-critical Oracle databases. I have worked on
systems that shut down entire factories due to poor
performance, leaving thousands of workers sitting on their
hands waiting for me to do something magical to restart
their assembly line. I have worked with hospital image
delivery systems when patients were literally in surgery,
with their doctors waiting for the information they needed
to save lives. I have worked on financial systems for which
downtime is measured in tens of thousands of dollars per
minute.
Working late nights as an emergency support DBA comes with
huge stress levels. Most of these databases are brand new to
me, and I have only a few minutes to assess the situation
and devise a plan to quickly relieve their bottleneck. I
work with emergency Oracle performance problems every week,
and I only get called when the easy remedies have already
been tried and discarded. When I ask why their DBA is not
helping, I generally hear one of the following:
- We outsourced our DBA staff to India. They can't figure it out and suggested that we call you.
- Our DBA is on a three-week hiking vacation in the Rocky Mountains.
- Our DBA quit six months ago and we haven't had time to find a replacement.
- We hired a guy and got him OCP certified, but we don't think he knows what he is doing.
- We called MOSC support and we don't understand their instructions. They said that we could call you.
When a production database is in crisis, costs are never an
issue, but minimizing downtime is of the essence. The clients
demand a fast fix, and this often requires unconventional
methods.
Unconventional Methods of Emergency Support
Unlike academic and scientific Oracle DBAs who demand proofs
and reproducible results before making an Oracle change, the
emergency Oracle DBA has no such luxury. The emergency DBA
must use every weapon at their disposal to get the client
running as quickly as possible. These unconventional methods
are almost always driven by the client who does not appreciate
the benefits of an elegant, long-term fix for the root cause
of the problem.
Clients are impatient, and they often insist on symptom-treating, stop-gap remedies that are neither elegant nor comprehensive. In many cases, the client does not want to hear about the time-consuming tasks that are required to address the root cause of the problem.
Contrary to the pontifications of theoreticians and
ivory-tower academics, there are many silver bullets for
Oracle performance tuning. By silver bullet, I mean a small
set of commands that quickly relieves an acute performance
bottleneck. Some of these techniques of just-in-time tuning
have been codified in Oracle 10g with the Automatic Shared
Memory Management (ASMM) facility, in which the SGA regions are
resized dynamically to meet changing processing demands.
Let's take a close look at the real-world silver bullets that I have encountered over the past 12 years of emergency Oracle support. The following stories are true, verifiable accounts (I have witnesses!) of Oracle databases in which a fast fix was used to relieve an acute performance problem.
To those scientist DBAs who demand "proofs" that these methods worked: you are not going to find them in this article. Most of the production systems that I tune have hundreds of segments and thousands of concurrent users. Trying to test a hypothesis on a large running database is like trying to tune your car while it's flying down the freeway at 75 miles per hour. It is impossible to reproduce the conditions of a complex performance breakdown, and the emergency support DBA is forced to rely on experience and anecdotal evidence to guide their actions. (If you want proof that these techniques work, just send me an email. I have lots of witnesses!)
Anyway, let's examine each of these silver bullets and see how a well-placed silver bullet can save the day.
Fix Missing CBO Statistics
The call came in from a client in Florida who had just moved
their system into production and was experiencing a serious
performance problem. Upon inspection, we found
optimizer_mode=choose and only one table with statistics.
The DBA told me that she was running cost-based and she seemed
totally unaware of the requirement to analyze the schema for
CBO statistics. Here is how I recall the conversation:
DB: "How are you collecting
statistics?"
DBA: "We have BMC Patrol"
DB: "?No, No, how are you
getting SQL optimizer statistics?"
DBA: "The Oracle sales rep said that the CBO was
intelligent, so I assumed it was getting its own
statistics."
You know, in a way, she was right. The problem started when she wanted to know the average row length for a table. She did a Google search and discovered that it was in the dba_tables.avg_row_len column. When she found it null, she went to MOSC and learned that an analyze table command would fill in the avg_row_len column.
As we know, when using optimizer_mode=choose with only one table analyzed, any SQL that touches that table will be optimized as a cost-based query, and the CBO will dynamically estimate statistics for all tables with missing statistics. In this case, a multi-step silver bullet did the trick:
analyze table customer delete statistics;
The system immediately returned to an acceptable performance
level, and the DBA learned about the importance of providing
complete and timely statistics for the CBO using the
dbms_stats utility.
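For readers who have never run it, here is a minimal sketch of the kind of dbms_stats call that corrects this situation; the schema name CUSTAPP and the 10 percent sample are illustrative assumptions, not details from this client's system:

begin
   -- gather fresh optimizer statistics for every table in the schema,
   -- cascading to the indexes as well
   dbms_stats.gather_schema_stats(
      ownname          => 'CUSTAPP',                    -- hypothetical schema name
      estimate_percent => 10,                           -- 10 percent row sample
      method_opt       => 'for all columns size auto',  -- column histograms where useful
      cascade          => true                          -- also gather index statistics
   );
end;
/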
Repair Obsolete CBO Statistics Gathering
This shop called from Australia complaining about a serious degradation in SQL performance after implementing partitioned tables in a 16-CPU Solaris 64-bit Oracle 9.0.4 system. They said that they had thoroughly tested the change in their development and QA instances, and they could not understand why the system was grinding to a halt.
Upon inspection, it turned out that they were using analyze table and analyze index commands to gather their CBO statistics. As we may know, unlike the analyze command, the dbms_stats utility gathers partition-wise statistics. There was no time to pull a deep-sample collection, so dbms_stats was run with a 10 percent sample size. Note that I parallelized it with 15 parallel processes to speed up the statistics collection:
exec dbms_stats.gather_schema_stats( -
   ownname          => 'SAPR4', -
   options          => 'GATHER AUTO', -
   estimate_percent => 10, -
   method_opt       => 'for all columns size repeat', -
   degree           => 15 -
)
This took less than 30 minutes and the improved CBO statistics
tripled the performance of the entire database.
Initialize Missing Oracle Instance Parameters
I
got a call from a client in California who said that their
performance was getting progressively worse as more customers
accessed the Oracle database. Upon inspection I discovered
that their db_cache_size parameter was not present in
the init.ora file. A quick instance bounce (to set sga_max_size and db_cache_size) resulted in a
400 percent performance improvement.
In
another memorable case, I received a call from a data
warehouse client in California who complained that their
performance degraded as the database grew. A quick look
revealed that the sort_area_size parameter was missing
and defaulting to a tiny value. Again, a change of
sort_area_size=1048575, a quick bounce of the instance,
and overall database performance improved by more than 50
percent.
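As a hedged illustration only (the right values depend entirely on the server's RAM and workload), the two fixes boiled down to init.ora entries along these lines:

# init.ora -- values shown are examples, not recommendations
sga_max_size   = 1600M        # ceiling for the SGA on this server
db_cache_size  = 1200M        # previously missing, so it defaulted far too small
sort_area_size = 1048575      # per-session sort memory, roughly 1MB (the value used above)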
Add Missing Indexes
An
Oracle Financial application shop in New York called and said
that their performance degraded as more data was entered into
the tables. A quick check of v$sql_plan using my
plan9i.sql script looked like this:
Full table scans and counts

OWNER     NAME                        NUM_ROWS  C  K    BLOCKS   NBR_FTS
--------  ------------------------  ----------  -  -  --------  --------
APPLSYS   FND_CONC_RELEASE_DISJS         14,293  N        4,293   498,864
APPLSYS   FND_CONC_RELEASE_PERIODS      384,173  N       67,915   134,864
DONALD    PERSON_LOGON_ID            18,263,390  N      634,272    96,212
DONALD    SITE_AMDMNT                 2,371,232  N       51,020    50,719
DONALD    CLIN_PTCL_VIS_MAP          23,123,384  N      986,395    11,273
Here we see a huge number of large-table, full-table scans. A quick look into v$sql revealed that the number of rows returned by each query was small, and a common WHERE clause for many queries looked like this:
WHERE customer_status = :v1 and customer_age > :v2;
A quick creation of a concatenated index on customer_status and customer_age resulted in a 50x performance improvement and cut disk I/O more than six-fold.
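A hedged sketch of that fix (the table and index names are assumed for illustration):

-- concatenated index covering both predicates in the common WHERE clause
create index cust_status_age_idx
   on customer (customer_status, customer_age);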
In
another memorable case on an 8.1.6 database, my access.sql
script revealed suspect large-table, full-table scans:
Full table scans and counts

OWNER     NAME                        NUM_ROWS  C  K    BLOCKS   NBR_FTS
--------  ------------------------  ----------  -  -  --------  --------
APPLSYS   FND_CONC_RELEASE_DISJS      1,293,292  N  K    65,282   498,864
APPLSYS   FND_CONC_RELEASE_PERIODS    4,373,362  N  K    62,282   122,764
APPLSYS   FND_CONC_RELEASE_STATES       974,193  N  K     9,204    98,122
APPLSYS   FND_CONC_PP_ACTIONS           715,021  N        6,309    52,036
APPLSYS   FND_CONC_REL_CONJ_MEMBER       95,292  N  K     4,409    23,122
The DBA had created an index on the order_date column and was surprised that it was not being used. The queries wrapped the column in a to_char() function, which prevented the standard b-tree index from being considered, and his boss had been too cheap to pay for him to attend an Oracle8 new features class where he would have learned about function-based indexes. Creating a function-based index on to_char(order_date,'MON-DD') resulted in an immediate 5x performance improvement.
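For illustration only, the fix looked roughly like this (the table name orders is an assumption, and on 8i the session must have query_rewrite_enabled=true and the query rewrite privilege before the CBO will use a function-based index):

-- function-based index matching the to_char() expression used in the WHERE clauses
create index ord_date_char_idx
   on orders (to_char(order_date,'MON-DD'));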
Change CBO Optimizer Parameters
Another emergency situation involved an Oracle 9.0.2 client
from Phoenix who called complaining about steadily degrading
performance. A quick look into v$sql_plan view using my
plan9i.sql script revealed loads of suspected
unnecessary large-table, full-table scans.
Important Note: Prior to Oracle 10g, adjusting these optimizer parameters was the only way to compensate for sample size issues with dbms_stats. As of 10g, the use of dbms_stats.gather_system_stats and improved sampling within dbms_stats have made adjustments to these parameters far less important. Ceteris paribus, always adjust CBO statistics before adjusting optimizer parameters. For more details on optimizer parameters, see my latest book "Oracle Tuning: The Definitive Reference".
In
this case, the top SQL was extracted from v$sql and
timed as-is and with an index hint. The query with the index
hint ran almost 20x faster, but it was unclear why the CBO was
not choosing the index. This was a production emergency, and I
did not have the luxury of investigating the root cause of the
CBO issue. I had to act fast, so I ran a script against
v$bh and user_indexes and discovered that
approximately 65 percent of the indexes were currently inside
the data buffer cache. Based on similar systems, I decided to
lower optimizer_index_cost_adj to a value of 20,
hopefully forcing the CBO to lower the relative costs of index
access.
optimizer_index_cost_adj=20
optimizer_index_caching=65
If you are on Oracle9i or beyond, you can alter some of these parameters dynamically:
alter system set optimizer_index_cost_adj=20 scope=both;
This quick fix changed the execution plans for over 350 SQL statements and cut overall system response time in half. The client was elated, and I was then able to take my time and investigate the root cause of the problem.
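For reference, here is a hedged sketch of the kind of v$bh query I use to estimate how well the indexes are already cached; the exact script varies, and the percentage changes from minute to minute:

-- count cached blocks by object type; compare the INDEX figure against
-- sum(leaf_blocks) from dba_indexes to approximate index caching
select
   o.object_type,
   count(*)  cached_blocks
from
   v$bh         bh,
   dba_objects  o
where
   o.data_object_id = bh.objd
and
   o.owner not in ('SYS','SYSTEM')
group by
   o.object_type;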
Implement cursor_sharing=force
I worked on a 9.0.2.4 database in Toronto, Canada, where end users had complained of poor performance right after a new manufacturing plant was added to the existing database. A quick look at the STATSPACK top five timed events looked like this:
Top 5 Wait Events
~~~~~~~~~~~~~~~~~
                                                        Wait      % Total
Event                                      Waits      Time (cs)   Wt Time
---------------------------------- -------------  -------------  -------
enqueue                                   25,901        479,654    46.71
db file scattered read                10,579,442        197,205    29.20
db file sequential read                  724,325        196,583     9.14
latch free                             1,150,979         51,084     4.97
log file parallel write                  148,932         39,822     3.88
My first look was into the SQL section of the STATSPACK report, where I noted that almost all of the SQL used literals in the WHERE clause:
WHERE customer_state = 'Alabama' and customer_type = 'REDNECK';
This was a vendor package with dynamically generated SQL, so
cursor_sharing was the only fast solution. Setting
cursor_sharing=force greatly reduced the contention on the
library cache and reduced CPU consumption. The end users
reported a 75 percent improvement in overall performance.
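A hedged sketch of the change itself (scope=both assumes an spfile; on an init.ora-only instance the parameter goes into the file and takes effect at the next bounce):

-- replace literals with system-generated bind variables so that
-- statements differing only in literal values can share one cursor
alter system set cursor_sharing=force scope=both;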
Implement the KEEP Pool for Small-table, Full-scanned
Tables
I worked on a database just last month in New Zealand (running 9.2.0.4) on a 16-CPU Solaris server with 8GB of RAM. The complaint was that performance had been degrading since the last production change. A STATSPACK top five timed events report showed that over 80 percent of system waits related to "db file scattered read". A quick review of v$sql_plan using plan9i.sql showed lots of small-table, full-table scans, with many of the tables not assigned to the KEEP pool (as denoted by the "K" column in the listing below):
Full table scans and counts

OWNER     NAME                       NUM_ROWS  C  K   BLOCKS   NBR_FTS
--------  ------------------------  ---------  -  -  -------  --------
APPLSYS   FND_CONC_RELEASE_DISJS           39  N          44    98,864
APPLSYS   FND_CONC_RELEASE_PERIODS         39  N  K       21    78,232
APPLSYS   FND_CONC_RELEASE_STATES           1  N  K        2    66,864
APPLSYS   FND_CONC_PP_ACTIONS           7,021  N       1,262    52,036
APPLSYS   FND_CONC_REL_CONJ_MEMBER          0  N  K      322    50,174
APPLSYS   FND_FILE_TEMP                     0  N         544    48,611
APPLSYS   FND_RUN_REQUESTS                 99  N          98    48,606
INV       MTL_PARAMETERS                    6  N  K       16    21,478
APPLSYS   FND_PRODUCT_GROUPS                1  N          23    12,555
APPLSYS   FND_CONCURRENT_QUEUES_TL         13  N  K       10    12,257
AP        AP_SYSTEM_PARAMETERS_ALL          1  N  K        6     4,521
As you may know, blocks fetched into the db_cache_size by full-table scans are not placed at the most-recently-used (MRU) end of the data buffer. Running my buf_blocks.sql script confirmed that the FTS blocks were falling off the least-recently-used end of the buffer and had to be frequently reloaded into the buffer.
Contents of Data Buffers

                                             Number of  Percentage
                                             Blocks in   of object
Object   Object          Object                 Buffer      Buffer  Buffer    Block
Owner    Name            Type                    Cache      Blocks  Pool       Size
-------  --------------  ----------  ----------------  ----------  -------  ------
DW01     WORKORDER       TAB PART              94,856           6  DEFAULT   8,192
DW01     HOUSE           TAB PART              50,674           7  DEFAULT  16,384
ODSA     WORKORDER       TABLE                 28,481           2  DEFAULT  16,384
DW01     SUBSCRIBER      TAB PART              23,237           3  DEFAULT   4,096
ODS      WORKORDER       TABLE                 19,926           1  DEFAULT   8,192
DW01     WRKR_ACCT_IDX   INDEX                  8,525           5  DEFAULT  16,384
DW01     SUSC_SVCC_IDX   INDEX                  8,453          38  KEEP     32,768
In
this case, I ran my buf_keep_pool.sql script to
reassign all tables that experienced small-table, full-table
scans into the KEEP pool. The output looks like this, and can
be fed directly into SQL*Plus:
alter TABLE BOM.BOM_OPERATIONAL_ROUTINGS storage (buffer_pool keep);
alter INDEX BOM.CST_ITEM_COSTS_U1 storage (buffer_pool keep);
alter TABLE INV.MTL_ITEM_CATEGORIES storage (buffer_pool keep);
alter TABLE INV.MTL_ONHAND_QUANTITIES storage (buffer_pool keep);
alter TABLE INV.MTL_SUPPLY_DEMAND_TEMP storage (buffer_pool keep);
alter TABLE PO.PO_REQUISITION_LINES_ALL storage (buffer_pool keep);
alter TABLE AR.RA_CUSTOMER_TRX_ALL storage (buffer_pool keep);
alter TABLE AR.RA_CUSTOMER_TRX_LINES_ALL storage (buffer_pool keep);
alter INDEX WIP.WIP_REQUIREMENT_OPERATIONS_N3 storage (buffer_pool keep);
With more efficient buffer caching, I fixed the problem in less
than one hour and overall database performance more than doubled.
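One assumption behind this silver bullet is that a KEEP pool has actually been sized; a hedged sketch of the supporting parameter (the 300M figure is purely illustrative) looks like this:

-- allocate a KEEP pool large enough to hold every small, frequently
-- full-scanned table and index assigned to it
alter system set db_keep_cache_size=300M scope=both;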
Add Additional SGA RAM
One of the most common silver bullets involves databases with a "working set" of frequently referenced data that cannot fit into the data buffer cache. This used to be a huge problem for the 32-bit Oracle server, in which the total SGA size was difficult to grow beyond 1.7 gig without special tricks like AWE and NUMA. However, I still routinely see databases on dedicated servers with 8GB of RAM and an SGA size of less than 500MB. A quick increase in db_block_buffers or db_cache_size, and performance improves dramatically.
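Before adding buffers blindly, a hedged look at v$db_cache_advice (9i and later, with db_cache_advice=on) shows roughly how much a larger cache would cut physical reads:

-- estimated physical-read factor at each candidate cache size;
-- a factor well below 1 suggests the working set would finally fit
select
   size_for_estimate          cache_mb,
   estd_physical_read_factor  read_factor
from
   v$db_cache_advice
where
   name       = 'DEFAULT'
and
   block_size = (select value from v$parameter where name = 'db_block_size');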
Employ Materialized Views
Once there was a call from a point-of-sale data warehouse in
Germany. The IT manager spoke very little English and most of the
conversation was done using
Babelfish. The system was largely read-only with a short batch
window for nightly updates. Once I connected, I immediately
noticed that virtually every query in the system was performing a
sum() or avg() function against several key tables.
The v$sql_plan view (via plan9i.sql) showed loads of
very-large-table, full-table scans, and the system was crippled
with "db file scattered read" waits.
I was easily able to help by creating three materialized views and employing query rewrite, cutting physical disk I/O more than twenty-fold; this improved performance by more than 30x, a real silver bullet!
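As a hedged sketch (table, column, and view names are invented for illustration), one of those materialized views looked something like this:

-- pre-aggregate the sums and averages so query rewrite can answer the
-- repetitive sum()/avg() queries without touching the detail rows
create materialized view store_sales_mv
   build immediate
   refresh complete on demand
   enable query rewrite
as
select
   store_id,
   sum(sale_amount)  total_sales,
   avg(sale_amount)  avg_sale
from
   sales_fact
group by
   store_id;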
Implement Bitmap Indexes
I
was called upon to troubleshoot and fix a State Police query
system that was experiencing slow query performance. The system
was read-only except for a 30-minute window at night for data
loading. Upon inspection of the SQL, I noted complex combinational
WHERE clauses:
WHERE color = 'BLU' and make = 'CHEVY' and year = 2015 and doors = 2;
The number of distinct values in each of these columns was less than 200, and concatenated b-tree indexes had been employed. Replacing the b-tree indexes with bitmap indexes resulted in a stunning performance improvement for the entire system, taking queries from 3 seconds down to under one-tenth of a second.
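A hedged sketch of the change (the table name vehicle is assumed; only the columns from the WHERE clause above are indexed):

-- one bitmap index per low-cardinality column; the optimizer can
-- AND/OR the bitmaps together at run time for combinational predicates
create bitmap index vehicle_color_bmx on vehicle (color);
create bitmap index vehicle_make_bmx  on vehicle (make);
create bitmap index vehicle_year_bmx  on vehicle (year);
create bitmap index vehicle_doors_bmx on vehicle (doors);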
Add Freelists
A client called from Michigan once with a complaint that the company order processing center was unable to keep up with adding new orders into Oracle. The client had just expanded its telephone order processing department and had doubled the order processing staff to meet a surge in market interest. The VP was frantic, saying that 400 order-entry clerks were getting 30-second response times and were forced to manually write down order information.
I checked v$session and found 450 connected users, and a quick review of v$sql revealed that virtually all of the DML was inserts into a customer_order table. The top timed event was buffer busy waits, and it was clear that there were enqueues on the segment header blocks for the table and its indexes.
The "proper" fix for this issue is to create a new tablespace for the table and index using Automatic Segment Space Management (ASSM), also known as bitmap freelists. I could then reorganize the table online with the dbms_redefinition utility and use alter index cust_pk rebuild to move the index into the new tablespace. However, it would take me several hours to build and execute the jobs, and the VP said that he was losing over $500 per minute.
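For completeness, a hedged sketch of what that longer-term ASSM fix might have looked like (the tablespace name, file path, and size are illustrative):

-- locally managed tablespace with bitmap freelists (ASSM)
create tablespace order_assm
   datafile '/u02/oradata/prod/order_assm01.dbf' size 2000M
   extent management local
   segment space management auto;
-- the customer_order table would then be redefined online into order_assm
-- with dbms_redefinition, and the index moved with:
-- alter index cust_pk rebuild tablespace order_assm;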
The system was on release 9.2.0.4, so I was able to
immediately relieve the segment header contention with these
commands:
alter table customer_order storage (freelists 5);
alter index cust_pk storage (freelists 5);
(Note: I did not know the length of the enqueues on the segment
header, so I added the additional freelists, one at a time,
until the buffer busy waits disappeared).
The additional freelists did the trick and the segment header contention disappeared. However, I knew that this was only a stop-gap fix: as soon as they ran their weekly purge (a single process), only one of the five freelists would get the released blocks, causing the table to extend unnecessarily.
Windows Oracle Issues
Windows Oracle databases are always the most fun to tune because
they are often implemented by someone with a very limited
knowledge of Oracle. The following are my favorite Windows
Oracle Silver Bullets.
- Norton Anti-Virus: I got a call from a new client in England who said that their database had slowed to the point where sub-second queries were taking 15 minutes. A review of a STATSPACK report showed giant waits of up to 10 seconds on read I/O. A review of the external Windows environment revealed that a well-intentioned support person had been told to install Norton Antivirus on all Windows servers. Upon every block read, Norton was conducting a virus check!
- A really cool screen saver: Another case involved a Windows Oracle system that was experiencing sporadic CPU shortages, periods when over half of the CPU was being consumed by some external process. Because I was dialed in, I could not see the obvious. An on-site person informed me that the screen saver was called 3D Flowerbox. This was a hugely CPU-intensive screen saver that performed thousands of calculations per second to generate the cool display.
Conclusion
For adrenaline junkies, being an emergency Oracle support DBA is
great fun. As I have illustrated, this kind of support often
requires a unique set of techniques:
- Fix the symptom first: You can come back and address the root cause later.
- Time is of the essence: When you must provide a quick fix, instance-wide adjustments are often your best hope.
- Be creative: Traditional (i.e., time-consuming) tuning methods don't apply in an emergency.