 |
|
Advanced Oracle SQL: Regular Expression Functions
Oracle Tips by Laurent Schneider
|
Laurent
Schneider is considered one of the top Oracle SQL experts, and
he is the author of the book "Advanced SQL Programming" by Rampant TechPress. The following
is an excerpt from the book.
Oracle 10g introduced regular expression functions in SQL
with the functions REGEXP_SUBSTR, REGEXP_REPLACE, REGEXP_INSTR and REGEXP_LIKE.
Oracle 11g extends the set of available expressions with REGEXP_COUNT.
SELECT
ENAME,
REGEXP_SUBSTR(ENAME,'DAM') SUBSTR,
REGEXP_INSTR(ENAME, 'T') INSTR,
REGEXP_REPLACE(ENAME,'AM','@') REPLACE,
REGEXP_COUNT(ENAME, 'A') COUNT
FROM
EMP
WHERE
REGEXP_LIKE(ENAME,'S');
ENAME SUBSTR INSTR
REPLACE COUNT
---------- ---------- ---------- ---------- ----------
SMITH 4 SMITH 0
JONES 0 JONES 0
SCOTT 4 SCOTT 0
ADAMS DAM 0 AD@S 2
JAMES 0 J@ES 1
REGEXP_SUBSTR returns the substring DAM if found,
REGEXP_INSTR returns the position of the first 'T', REGEXP_REPLACE replaces the
strings 'AM' with '@' and REGEXP_COUNT counts the occurrences of 'A'.
REGEXP_LIKE returns the strings that contain the pattern 'S'.
SELECT
REGEXP_SUBSTR('Programming','[[:alpha:]]+',1,2)
FROM
DUAL;
REGEXP
------
Oracle
'[[:alpha:]]' is a POSIX regular expression that matches any
letter. The second set of consecutive word characters is returned. The '+'
specifies that the number of characters to be matched is one or more. '.'
matches exactly one character; '.?' matches zero or one character; '.*' match
zero, one or more character; '.+' matches one or more character; '.{3}' matches
exactly three characters; '.{4,6}' matches 4, 5 or 6 characters; '.{7,}' matches
7 or more characters. The third argument is the starting position. The default 1
means the pattern will be searched from the beginning of the substring. The
fourth argument in 11g represents the occurrence of the substring.
SELECT
REGEXP_SUBSTR('Programming','\w+',1,2)
FROM
DUAL;
REGEXP
------
Oracle
Oracle 10gR2 introduced Perl-influenced regular expressions.
'\w' represents any letter, number and the underscore. Unfortunately, in
comparison to the old-style approach with INSTR and SUBSTR, the 10g regular
expressions perform poorly.
SET TIMING ON
DECLARE
X VARCHAR2(40);
BEGIN
FOR I IN 1..10000000 LOOP
X := 'Programming';
X := SUBSTR(X,
INSTR(X, ' ')+1,
INSTR(X, ' ', 1,2)-INSTR(X, ' ')-1);
END LOOP;
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:00:20.40
SET TIMING ON
DECLARE
X VARCHAR2(40);
BEGIN
FOR I IN 1..10000000 LOOP
X := 'Programming';
X := REGEXP_SUBSTR(X,'\w+',1,2);
END LOOP;
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:02:10.82
REPLACE replaces all occurrence of a string. REGEXP_REPLACE
has the same behavior by default, but when the fifth parameter, OCCURRENCE, is
set to a value greater than zero, the substitution is not global.
SELECT
REGEXP_REPLACE
(
'Programming',
'([[:alpha:]]+)[[:space:]]([[:alpha:]]+)',
'\2: \1',
1,
1
)
FROM
DUAL;
REGEXP_REPLACE('ADVANCEDORACLESQ
--------------------------------
Oracle: Advanced SQL Programming
The search pattern contains a group of one or more
alphabetic characters, followed by a space, then followed by a group of one or
more alphabetic characters. This pattern is present more than once in the
string, but only the first occurrence is affected. The replace pattern contains
a reference to the second word, followed by a column and a space, followed by
the first string.
SELECT
REGEXP_SUBSTR
(
'Programming',
'(\w).*?\1',
1,
1,
'i'
)
FROM
DUAL;
REGE
----
Adva
The search pattern contains any alphabetic character
followed by a non-greedy number of characters followed by the same character as
in the group. The search starts at the character one and looks for the first
match of the pattern. The modifier 'i' indicates a case insensitive search.
Non-greedy expressions appeared in 10gR2. The difference between a non-greedy
expression like '.*?', '.+?', '.??', '.{2}?', '.{3,5}?' or '.{6,}?' and a greedy
expression like '.*', '.+', '.?', '.{2}', '.{3,5}' or '.{6,}' is that the
non-greedy searches for the smallest possible string and the greedy for the
largest possible string.
SELECT
REGEXP_SUBSTR
(
'Oracle',
'.{2,4}?'
) NON_GREEDY,
REGEXP_SUBSTR
(
'Oracle',
'.{2,4}'
) GREEDY
FROM
DUAL;
NON_GREEDY GREEDY
---------- ------
Or Orac
Both patterns select from two to four characters. In this
case, it could be 'Or', 'Ora' or 'Orac'. The non-greedy pattern returns two and
the greedy four:
SELECT
ENAME,
REGEXP_SUBSTR(ENAME,'^K') "^K",
REGEXP_SUBSTR(ENAME,'T$') "T$",
REGEXP_SUBSTR(ENAME,'^[ABC]') "^[ABC]",
REGEXP_SUBSTR(ENAME,'^.?M') "^.?M",
REGEXP_SUBSTR(ENAME,'(RD|ES)$') "(RD|ES)$",
REGEXP_SUBSTR(ENAME,'(..R){2}') "(..R){2}",
REGEXP_SUBSTR(ENAME,'^.{4}[^A-E]')
"^.{4}[^A-E]"
FROM
EMP;
ENAME ^K T$ ^[ABC]
^.?M (RD|ES)$ (..R){2} ^.{4}[^A-E
---------- -- -- ------ ---- -------- -------- ----------
SMITH SM SMITH
ALLEN A ALLEN
WARD RD
JONES ES JONES
MARTIN M MARTI
BLAKE B
CLARK C CLARK
SCOTT T SCOTT
KING K
TURNER
TURNER
ADAMS A ADAMS
JAMES
ES JAMES
FORD RD
MILLER M
The function REGEXP_SUBSTR matches ENAME to a pattern and
returns the matched string. The first pattern checks if the name starts with
'K', the second checks if it ends with 'T', the third checks if it starts with
A, B or C, the fourth checks if the string start with one or zero characters
followed by M (which means the first or second character is a 'M'), the fifth
checks if it ends with either ES or RD, the sixth checks if the pattern ?one
character + one character + the letter R? is found twice consecutively
and the last pattern checks if the fifth character (the character following 4
characters at the beginning of the string) is not in the range A-E. Note that
KING is not matched because the fifth character is not a character different
from A-E. To test a string less than five characters, the pattern ^.{1,4}$ could
be used.