 |
|
Automating histogram sampling with dbms_stats
Oracle Tips by Burleson Consulting
May 14, 2003
|
In Oracle we now see
11g extended optimizer statistics,
an alternative to dynamic_sampling for estimating result
set sizes.
One exciting
feature of dbms_stats is the ability to automatically look for
columns that should have histograms, and create the histograms.
Multi-bucket histograms add a huge parsing overhead to SQL statements,
and histograms should ONLY be used when the SQL will choose a different
execution plan based upon the column value.
To aid in
intelligent histogram generation, Oracle uses the method_opt
parameter of dbms_stats. There are also important new options
within the method_opt clause, namely skewonly, repeat
and auto.
method_opt=>'for all columns size skewonly'
method_opt=>'for all columns size repeat'
method_opt=>'for all columns size auto'
In practice, there is a specific order to use the
different options of dbms_stats.
See
this article for details. Let’s take a close look at each method option.
The method_opt=’SKEWONLY’ dbms_stats Option
The first is the “skewonly” option which
very time-intensive because it examines the distribution of values for
every column within every index. If dbms_stats discovers an
index whose columns are unevenly distributed, it will create histograms
for that index to aid the cost-based SQL optimizer in making a decision
about index vs. full-table scan access. For example, if an index has
one column that is in 50% of the rows, a full-table scan is faster than
and index scan to retrieve these rows.
Histograms are also used with SQL that has
bind variables and SQL with cursor_sharing enabled. In these
cases, the CBO determines if the column value could affect the execution
plan, and if so, replaced the bind variable with a literal and performs
a hard parse.
--**********************************************************
-- SKEWONLY option – Detailed analysis
--
-- Use this method for a first-time analysis for skewed
indexes
-- This runs a long time because all indexes are
examined
--**********************************************************
begin
dbms_stats.gather_schema_stats(
ownname => 'SCOTT',
estimate_percent => dbms_stats.auto_sample_size,
method_opt => 'for all columns size
skewonly',
degree => 7
);
end;
/
The
method_opt=’REPEAT’
dbms_stats Option
Following the one-time detailed analysis,
the re-analyze task will be less resource intensive with the REPEAT
option. Using the repeat option will only re-analyze indexes with
existing histograms, and will not search for other histograms
opportunities. This is the way that you will re-analyze you statistics
on a regular basis.
--************************************************************
-- REPEAT OPTION - Only re-analyze histograms for
indexes
--
that have histograms
--
-- Following the initial analysis, the weekly analysis
-- job will use the “repeat” option. The repeat option
-- tells dbms_stats that no indexes have changed, and
-- it will only re-analyze histograms for
--
indexes that have histograms.
--***********************************************************
begin
dbms_stats.gather_schema_stats(
ownname => 'SCOTT',
estimate_percent => dbms_stats.auto_sample_size,
method_opt => 'for all columns size
repeat',
degree => 7
);
end;
/
The method_opt=’AUTO’ dbms_stats Option
The auto option is used when
monitoring is implemented and
creates histograms based upon data distribution and the manner in which
the column is accessed by the application (e.g. the workload on the
column as determined by monitoring, especially foreign keys to determine
the cardinality of table join result sets). Using method_opt=>’auto’ is
similar to using the gather auto in the option parameter
of dbms_stats.
begin
dbms_stats.gather_schema_stats(
ownname => 'SCOTT',
estimate_percent => dbms_stats.auto_sample_size,
method_opt => 'for all columns size auto',
degree => 7
);
end;
/
Remember, analyzing for
histograms is time-consuming, and histograms are used under two
conditions:
-
Table join order –
The CBO must know the size of the intermediate result sets
(cardinality) to properly determine the correct join order the
multi-table joins.
-
Table access method
– The CBO needs to know about columns in SQL where clauses, where
the column value is skewed such that a full-table scan might be faster
than an index range scan. Oracle uses this skew information in
conjunction with the clustering_factor
columns of the dba_indexes
view.
Hence, this is the proper
order for using the dbms_stats package to locate proper columns for
histograms:
1. Skewonly option
- You want to use skewonly to do histograms for skewed columns, for
cases where the value will make a difference between a full-table scan
and an index scan.
2. Monitor - Next, turn-on monitoring. Issue an “alter table xx
monitoring” and “alter index yyy monitoring” command for all segments
in your schema. This will monitor workload against
3. Auto option - Once monitoring is in-place, you need to
re-analyze with the "auto" option to create histograms for join
columns within tables. This is critical for the CBO to determine the
proper join order for finding the driving table in multi-table joins.
4. Repeat option - Finally, use the "repeat" option to
re-analyze only the existing histograms.
Periodically you will
want to re-run the skewonly and auto option to identify any new
columns that require histograms. Once located, the repeat option will
ensure that they are refreshed with current values.
In Oracle we now see
11g extended
optimizer statistics, an alternative to
dynamic_sampling for estimating result set sizes.
Arup Nanda has a great article on extended statistics with dbms_stats,
specialty histogram analysis using function-based columnar data:
Next, re-gather statistics on the table and collect the
extended statistics on the expression upper(cust_name).
begin
dbms_stats.gather_table_stats (
ownname => 'ARUP',
tabname => 'CUSTOMERS',
method_opt => 'for all columns size skewonly for columns (upper(cust_name))'
);
end;
Alternatively you can define
the column group as part of the gather statistics command.
You do that by placing
these columns in the method_opt parameter of the gather_table_stats procedure in
dbms_stats as shown below:
begin
dbms_stats.gather_table_stats (
ownname => 'ARUP',
tabname => 'BOOKINGS',
estimate_percent=> 100,
method_opt => 'FOR ALL COLUMNS SIZE SKEWONLY FOR COLUMNS(HOTEL_ID,RATE_CATEGORY)',
cascade => true
Usage notes:
This advice from Greg Rahn, of the Oracle Real-world tuning
group:
Question: Will dbms_stats someday detect sub-optimal table join orders from a workload,
and create appropriate histograms?
If histograms exist, then they were either automatically created because the
columns met the criteria defined on page 10-11 of the document, or manually
created. If they were created automatically, then is probable they will
influence the plan for the better.
Sub-optimal join orders are generally the result of poor cardinality estimates.
Histograms are designed to help with cardinality estimates where data skew
exists.
Question: Keeping statistics: What is the current "official" policy regarding statistics retention?
The old CW was that the DBA should collect a deep, representative sample, and
keep it, only re-analyzing when it's "a difference that makes a difference"?
I don't know if there is an "official" policy per se, but I will offer my
professional opinion based on experience. Start with the dbms_stats defaults.
Modify as necessary based on plan performance. Use dynamic sampling and/or
dbms_sqltune or hints/outlines where appropriate (probably in that order). Understand the problem before attempting solutions.
There are a couple of cases that I would be mindful of:
1) Experience has shown that poor plans can be a result of under estimated NDV
with skewed data and DBMS_STATS.AUTO_SAMPLE_SIZE
(or too small of a sample). This has been addressed/enhanced in 11g. In 10g it
requires choosing a fixed sample size that yields an accurate enough NDV to get
the optimal plan(s). The sample size will vary case by case as it is data
dependent.
2) Low/High value issues on recently populated data, specifically with
partitioned tables. If the partition granule size is small (say daily or
smaller) the default 10% stale might be too little. It may be best to gather
partition stats immediately after loading, or set them manually. It's better to
have stats that are an over estimate on the number of rows/values than an under
estimate. For example, its better to have a hash join on a small set of data
than a nested loops on a large set.