|
 |
|
Parsing with regular expressions regexp_like
Oracle Database Tips by Donald Burleson |
Also see:
Oracle Regular
expressions indexes as a tuning tool
The regexp_instr function is a very
powerful to extract patterns from inside a string. Let's take a
closer look at the prototype for the regexp_instr function:
regexp_instr (
string,
pattern,
position,
occurrence,
return-option,
parameters)
The last argument is the most important. The
"parameters" argument can be a combination of these formats:
i - Used to
introduce case-insensitively
c - Keeps case
(default value)
n - to make
the dot (.) match new lines as well
m - to make ^
and $ match beginning and end of a line in a multi-line string
The "i" parameter is used to force
case-insensitive parsing. Just like the common UNIX "grep –i"
command, we can use the lowercase "i" to extract both "Acre"
and "acres". The lowercase 'i" is also the last argument to
the regexp_like function.
Borrowing an example from Jonathan Gennick's
superb article "First Expressions", we see a parsing routine to extract "acreage"
references from inside a test string, ignoring important factors
such as case sensitivity and words stems (acre, acres, acreage):
COLUMN park_name format a30
COLUMN acres format a13
SELECT
park_name,
REGEXP_SUBSTR(description,'[^ ]+[- ]acres?',1,1,'i') acres
FROM michigan_park
WHERE REGEXP_LIKE(description, '[^ ]+[- ]acres?','i');
Here is the output, where we see that the
regular expression has parsed-out the acreage figures, just as-if
they were a discrete data column with the table:
PARK_NAME ACRES
____________________________ ___________
Mackinac Island State Park 1800 acres
Muskallonge Lake State Park 217-acre
Porcupine Mountains State Park 60,000 acres
Tahquamenon Falls State Park 40,000+ acres
As we see, the regexp_like function is
extremely powerful for extracting known pattenrs from inside larger
strings. Also, you can nest the regexp_instr function inside
a substr function, using the regexp_instr fifth
argument (the return-option) to compute the length of the
result column "acres":
COLUMN park_name format a30
COLUMN acres format a13
SELECT
park_name,
SUBSTR(
description,
REGEXP_INSTR(description, '[^ ]+ acres|[^ ]+-acre',1,1,0,'i'),
-- start
REGEXP_INSTR(description, '[^ ]+ acres|[^ ]+-acre',1,1,1,'i')
-
--
minus
REGEXP_INSTR(description, '[^ ]+ acres|[^ ]+-acre',1,1,0,'i')
-- nbr
) acres
FROM michigan_park
WHERE
REGEXP_LIKE(description, '[^ ]+ acres|[^ ]+-acre','i');
Here we see identical output, but we gain
insight into how the regexp_instr function is used to get the
start-position and string-length numbers for the regexp_substr
function:
PARK_NAME ACRES
____________________________ ___________
Mackinac Island State Park 1800 acres
Muskallonge Lake State Park 217-acre
Porcupine Mountains State Park 60,000 acres
Tahquamenon Falls State Park 40,000+ acres
Jonathan Gennick notes:
The arguments to SUBSTR are string, start
position, and length. The first call to REGEXP_INSTR searches for a
substring matching the pattern 'xxx acres'. I did that by searching
for any number of non-space characters followed by the word "acres",
as in:
[^ ]+ acres
The "[^ ]+" part of the pattern allows for the non-space characters.
I used alternation to also include the case where a hyphen appeared
between the number and the acreage:
[^ ]+-acre
To compute the number of characters to take using SUBSTR, I
subtracted the start position of the acreage string from the ending.
To get the ending position, I flipped the third argument to
REGEXP_INSTR from a 0 to a 1. (that may actually give one past the
end. look it up to be sure).
I see now that I did not anticipate the possibility of a "one acre"
("acre" as singular) national park. Ah, well. It'd be easy enough to
accommodate that possibility.
Now that we understand the basic concept of
data parsing, let's examine how we might use regular expressions to
find transposition errors within an Oracle database.
See also:
Oracle Regular
expressions indexes as a tuning tool
For expert Oracle data cleansing support and
data scrubbing consulting, use an expert from BC. We
understand the powerful Oracle data unification tools, and
we can aid in improving the data quality of any Oracle
database, large or small. |
 |
|