|
|
|
Using utl_match to compare string similarity
Oracle Database Tips by Donald BurlesonMarch 30, 2015
|
Hugo Cardoso asks: Given a column name
(word or small text) I want to choose from a set of column names the most seemed
(if it is not equal). I'm thinking to use 'soundex' function, but I do not know
if I can use it (and how use it) as a measured of proximity (choose the nearest)
in the case of the function return it is not exactly the same. I made some
tests and it seems to me that the 'soundex' function it is not good for me (for
example it is insensitive to the numbers). I think that the solution for
my problem is in utl_match built-in package. Is there any examples of using
utl_match to compare strings?
Answer: Soundex is commonly used in
genealogy research where you must locate 'similar sounding? names in databases,
where the original data was entered hundreds of years ago, often by someone who
used phonetics to 'sound-out? the way that a spoken name was heard.
The Oracle soundex function accepts a string and returns a number corresponding
to the 'sound? of the spoken string, an invaluable tool for locating
typographical errors in a name.
Oracle Corporation borrowed the soundex algorithm from the book The Art of
Computer Programming, Volume 3, by Donald Knuth, to generate the phonetic
representation of character strings.
Here are some of the values SOUNDEX generated and how they vary according to the
input string:
SOUNDEX ('feuerstein') ==> 'F623'
SOUNDEX ('feuerst') ==> 'F623'
From the Oracle documentation, we see these
limits of the soundex function.
The soundex value always begins with the
first letter in the input string. soundex only uses the first five
consonants in the string to generate the return value. Only consonants are
used to compute the numeric portion of the soundex value. Except for a
possible leading vowel, all vowels are ignored. soundex is not
case-sensitive. Upper- and lowercase letters return the same soundex value.
Using utl_match to compare strings
According to Oracle the
utl_match utility is new in Oracle 10g release 2, and there is no
documentation. The undocumented utl_match package is used to compare the
similarity of a target string and a examined string, embedded inside SQL. The
utl_match procedure has four string compare functions:
edit_distance - This utl_edit
function counts the number of character changes that are required to morph
the target string into the examined string.
edit_distance_similarity - This utl_edit function returns the
percentage of similarity between the target string and the examined string.
jaro_winkler - This utl_edit function is used to detect data entry
errors by measuring the degree that the strings match.
jaro_winkler_similarity - This utl_edit function is similar to
edit_distance_similiarity, computing the percentage of a string match.
This
OTN thread shows some great examples of utl_match:
SCOTT@orcl_11g> -- ordering by
utl_match.edit_distance:
SCOTT@orcl_11g> select z.*, UTL_MATCH.EDIT_DISTANCE ('hilton hotel', LOWER (operator_name))
edit_dist
2 from operator z
3 where contains (concat,'hilton hotel within {the_operator_name}',1 ) > 0
4 ORDER BY edit_dist
5 /
OPERATOR_ID OPERATOR_NAME EDIT_DIST
----------- ------------------------------ ----------
1000084 Hilton Hotel 0
1005660 Hilton Hotel 0
1001397 HILTON HOTEL AIRPORT 8
1004356 Hilton Hotel/Avendra Eugene 15
1001472 Hilton Hotel & Conference Ctr. 18
SCOTT@orcl_11g> -- ordering by utl_match.edit_distance_similarity:
SCOTT@orcl_11g> select z.*, UTL_MATCH.EDIT_DISTANCE_SIMILARITY ('hilton hotel',
LOWER (operator_name)) edit_dist_sim
2 from operator z
3 where contains (concat,'hilton hotel within {the_operator_name}',1 ) > 0
4 ORDER BY edit_dist_sim DESC
5 /
OPERATOR_ID OPERATOR_NAME EDIT_DIST_SIM
----------- ------------------------------ -------------
1000084 Hilton Hotel 100
1005660 Hilton Hotel 100
1001397 HILTON HOTEL AIRPORT 60
1004356 Hilton Hotel/Avendra Eugene 45
1001472 Hilton Hotel & Conference Ctr. 40
SCOTT@orcl_11g> -- ordering without selecting the column:
SCOTT@orcl_11g> select * from operator z
2 where contains (concat,'hilton hotel within {the_operator_name}',1 ) > 0
3 ORDER BY UTL_MATCH.EDIT_DISTANCE ('hilton hotel', LOWER (operator_name))
4 /
OPERATOR_ID OPERATOR_NAME
----------- ------------------------------
1000084 Hilton Hotel
1005660 Hilton Hotel
1001397 HILTON HOTEL AIRPORT
1004356 Hilton Hotel/Avendra Eugene
1001472 Hilton Hotel & Conference Ctr.
SCOTT@orcl_11g> select * from operator z
2 where contains (concat,'hilton hotel within {the_operator_name}',1 ) > 0
3 ORDER BY UTL_MATCH.EDIT_DISTANCE_SIMILARITY ('hilton hotel', LOWER (operator_name))
DESC
4 /
OPERATOR_ID OPERATOR_NAME
----------- ------------------------------
1000084 Hilton Hotel
1005660 Hilton Hotel
1001397 HILTON HOTEL AIRPORT
1004356 Hilton Hotel/Avendra Eugene
1001472 Hilton Hotel & Conference Ctr.
|
If you like Oracle tuning, you
might enjoy my book "Oracle
Tuning: The Definitive Reference", with 950 pages of tuning tips and
scripts.
You can buy it direct from the publisher for 30%-off and get instant
access to the code depot of Oracle tuning scripts. |
|