 |
|
Oracle Regular expressions indexes as a tuning tool
Oracle Tips by Burleson Consulting |
Oracle Regular
Expression syntax has profound implications for Oracle tuning,
especially in the area of indexing where indexes can be created
on regular expressions, eliminating expensive full-table scans
in-favor of fast index access. Regular expressions are
extremely powerful for extracting facts from large text columns,
especially the
regexp_like syntax.
Oracle expert Jonathan Gennick
notes that
regular expression can be used in an Oracle index, a
powerful tool for improving the speed of complex SQL queries,
and notes that regular expression indexes (a type of
function-based index) can dramatically reduce database overhead
for pattern-matching queries. I've noted in my book
"Oracle
Tuning: The Definitive Reference", that function-based
indexes are one of the most powerful and underutilized tools for
Oracle professionals.
Here are some little know facts
about regular expressions:
Indexing on regular
expressions
In
Parsing with regular
expressions regexp_like Jonathan Gennick shows a great
example where we use Oracle regular expressions to extract
“acreage” references from inside a text string, ignoring
important factors such as case sensitivity and words stems
(acre, acres, acreage):
COLUMN park_name format a30
COLUMN acres format a13
SELECT
park_name,
REGEXP_SUBSTR(description,'[^ ]+[-
]acres?',1,1,'i') acres
FROM michigan_park
WHERE REGEXP_LIKE(description, '[^
]+[- ]acres?','i');
Here is the
output, where we see that the regular expression has parsed-out
the acreage figures, just as-if they were a discrete data column
with the table:
PARK_NAME ACRES
____________________________
___________
Mackinac Island State Park
1800 acres
Muskallonge Lake State Park 217-acre
Porcupine Mountains State Park 60,000 acres
Tahquamenon Falls State Park 40,000+ acres
The only problem with
this query is that it will always perform a large-table
full-table scan on the michigan_park table, causing unnecessary
overhead for Oracle.
However, using the powerful
function-based indexes we could eliminate the unnecessary
overhead by using the regular expression directly in the index.
CREATE INDEX
parks_acreage
ON
michigan_parks
(REGEXP_LIKE(description, '[^ ]+[- ]acres?','i'));
However,
Laurent Schneider notes that it is illegal to have an index on a
Boolean function, but you could have an index on a case
expression returning 1.
create index
i
on
michigan_parks
(case when
description like '_% acre%'
or
description like '_%-acre%'
then 1 end);
select
*
from
michigan_parks
where
(case when
description like '_% acre%'
or
description like '_%-acre%' then 1 end)
is not null;
This simple index definition would
create a yes/no index on all park records that contain a
reference to "acre", "acres", "acreage". The database
overhead would be once, when each rows is added to the table,
and not over-and-over again when queries are executed.
The rules for choosing a
function-based index on a complex expression (regular
expression, decode) is a trade-off between several factors:
-
The number of
blocks in the table - A full-table scan of a super-large
table can cause I/O contention.
-
The percentage
of rows returned - If the regular expression returns
only a small percentage of the total table rows, a regular
expression index will greatly reduce I/O.
-
The frequency
of the query - If the query is executed frequently,
Oracle may do millions of unnecessary full-table scans.
-
The tolerance
for slower row inserts - Parsing the text column at
insert time (to add the row to the regular expression index)
will slow-down inserts.
It's the age-old
quandary. If we build the regular expression once (at insert
time) it can be used over-and-over again with little overhead.
Conversely, using regular expressions in SQL without a
supporting index will cause repeated full-table scans.
For detailed
information on learning regular expression and using them in the
database see Jonathan Gennick and Peter Linsley's book
Oracle Regular Expressions Pocket Reference.