Search Engine word stemming and synonym expansion
Oracle Database Tips by Donald Burleson
Today we assume that the search engine can locate
all of the matching pages (recall) and return only relevant results (precision).
Google, considered the de facto engine by many techies, has rocketed
search into a billion-dollar industry, with over a dozen contenders
(Magellan, Alta Vista, MSN, Mama, etc.) nipping at Google's heels.
From an academic perspective, the "best" search
engine is the one that returns the "right" answer, the one that
derives the "meaning" of the query and returns on-point results.
Older research attempted to quantify the quality of search engines
using hard-to-measure metrics such as search engine precision and recall.
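As a rough illustration of the two metrics, the sketch below computes them for a single query. The function name and the page identifiers are hypothetical, and it assumes a human has already judged which pages are relevant, which is the hard part in practice.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: page ids the engine returned
    relevant:  page ids a human judged relevant (assumed to be known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: what fraction of the returned pages are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant pages were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 pages returned, 3 of them relevant,
# out of 6 relevant pages in the whole index.
p, r = precision_recall({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e", "f"})
print(p, r)  # 0.75 0.5
```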
A central part of improving the "relevance"
of any query's results is to "expand" the query into a more complex query.
For example, consider the query:
cheap condo Los Angeles no credit check
Word Stemming
"Word stemming" is defined as the
ability to include word variations. For example, any noun
includes variations such as its plural, and each variation is weighted
according to how far it departs from the root. With word stemming, we apply
quantified rules of grammar to add word stems and rank them according
to their degree of separation from the root word. For example,
we might see stems identified for "cheap", "condo" and "check":
(cheap or cheaper)
AND
(condo or condos)
AND
(check or checked or checking)
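One way to picture stemming is a toy suffix stripper that groups the words of an index by their root. The sketch below is only an illustration under stated assumptions: the suffix list, the `stem` and `stem_variants` helpers, and the sample vocabulary are all hypothetical, and "degree of separation" is approximated by how many characters a variant adds to the root. Production engines typically rely on far more careful stemmers or dictionaries.

```python
# Hypothetical suffix rules, tried in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def stem(word):
    """Strip one known suffix, keeping at least three letters of root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_variants(term, vocabulary):
    """Return the vocabulary words that share the term's root, closest first."""
    root = stem(term)
    matches = [w for w in vocabulary if stem(w) == root]
    # Rank by the number of characters added to the root, then alphabetically.
    return sorted(matches, key=lambda w: (len(w) - len(root), w))


# Hypothetical index vocabulary.
vocabulary = ["cheap", "cheaper", "condo", "condos",
              "check", "checked", "checking", "chess"]

for term in ("cheap", "condo", "check"):
    print(term, "->", stem_variants(term, vocabulary))
```

Note that "chess" is correctly left out of the "check" group, but a stripper this naive will mis-stem plenty of real words.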
Synonym Expansion
Synonym expansion is where we take synonyms of
the word and add them to the search engine query. Returning
to our example, the term "cheap" might indicate that the searcher is
also interested in similar terms for a low cost:
cheaper
or
inexpensive
or
"low cost"
or
bargain
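A synonym lookup of this kind can be sketched as a search through synonym sets. Everything named here (`SYNONYM_SETS`, `expand_synonyms`, `or_group`) is hypothetical, and the single set shown stands in for a real thesaurus.

```python
# Hypothetical synonym sets; a real engine would load these from a thesaurus.
SYNONYM_SETS = [
    {"cheap", "inexpensive", "low cost", "bargain"},
]


def expand_synonyms(term):
    """Return the term followed by its synonyms in alphabetical order."""
    term = term.lower()
    synonyms = set()
    for group in SYNONYM_SETS:
        if term in group:
            synonyms |= group - {term}
    return [term] + sorted(synonyms)


def or_group(words):
    """Render the alternatives as one OR group, quoting multi-word phrases."""
    rendered = ['"%s"' % w if " " in w else w for w in words]
    return "(" + " or ".join(rendered) + ")"


print(or_group(expand_synonyms("cheap")))
# (cheap or bargain or inexpensive or "low cost")
```

Storing whole sets rather than one list per word means the lookup works no matter which member of the set the searcher typed.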
Similarly, the term "condo" might indicate that
the searcher is also interested in similar types of housing:
condo
or
apartment
or
flat
or
"rental property"
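When we expand a query we develop a complex
word search expression for the base engine. The variants of a term are
alternatives, so they are joined with OR inside one group, and the groups
are joined with AND. In our case the
simple "cheap condo Los Angeles no credit check" is transformed into
a far more complex Boolean form:
(cheap or cheaper or inexpensive or "low cost" or bargain)
AND
(condo or condos or apartment or flat or "rental property")
AND
(check or checked or checking)
Oh, but what about adding stems of the
synonyms? Each one joins the group of the term it expands:
(cheap or cheaper or inexpensive or "low cost" or bargain or bargains or bargaining)
AND
(condo or condos or apartment or apartments or flat or "rental property")
AND
(check or checked or checking)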
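The whole expansion can be sketched in a few lines, assuming all variants of a term (its stems, its synonyms, and the stems of those synonyms) are merged into a single OR group so that any one of them satisfies that term. The `STEMS` and `SYNONYMS` tables and both function names are hypothetical stand-ins for a real stemmer and thesaurus.

```python
# Hypothetical lookup tables standing in for a real stemmer and thesaurus.
STEMS = {
    "cheap": ["cheaper"],
    "condo": ["condos"],
    "check": ["checked", "checking"],
    "apartment": ["apartments"],
    "bargain": ["bargains", "bargaining"],
}
SYNONYMS = {
    "cheap": ["inexpensive", "low cost", "bargain"],
    "condo": ["apartment", "flat", "rental property"],
}


def alternatives(term):
    """Return the term, its synonyms, and the stems of each, in first-seen order."""
    out = []
    for word in [term] + SYNONYMS.get(term, []):
        for variant in [word] + STEMS.get(word, []):
            if variant not in out:
                out.append(variant)
    return out


def expand_query(terms):
    """Build one OR group per term and join the groups with AND."""
    groups = []
    for term in terms:
        words = ['"%s"' % w if " " in w else w for w in alternatives(term)]
        groups.append("(" + " or ".join(words) + ")")
    return " AND ".join(groups)


print(expand_query(["cheap", "condo", "check"]))
```

Like the example in the text, this sketch expands only "cheap", "condo" and "check"; the remaining words of the query would pass through as single-word groups.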
Of course, we have not yet assigned weights to
the synonyms in the query. For example, the word "flat" is a
less common term for housing, and it would carry far less weight than the
original "condo".
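A minimal sketch of how such weights might feed into ranking follows. The weight values, the `WEIGHTED_GROUPS` table and the `score` function are all assumptions made for illustration; the idea is simply that a page matching the original term scores higher than one matching only a distant synonym.

```python
# Hypothetical weights: 1.0 for the original term, less for stems and synonyms.
WEIGHTED_GROUPS = [
    {"cheap": 1.0, "cheaper": 0.75, "inexpensive": 0.5,
     "low cost": 0.5, "bargain": 0.5},
    {"condo": 1.0, "condos": 0.75, "apartment": 0.5,
     "rental property": 0.5, "flat": 0.25},
]


def score(text, groups=WEIGHTED_GROUPS):
    """Sum, over the groups, the weight of the best variant found in the text.

    Returns 0.0 if any group has no match, mirroring the AND between groups.
    Matching is on whole, space-separated words; punctuation is not handled.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    total = 0.0
    for group in groups:
        found = [weight for variant, weight in group.items()
                 if " " + variant + " " in padded]
        if not found:
            return 0.0
        total += max(found)
    return total


print(score("cheap condo in Los Angeles"))      # 2.0
print(score("inexpensive flat near downtown"))  # 0.75
print(score("cheap hotel in Los Angeles"))      # 0.0
```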