|
 |
|
Assigning weights for a multi-column Oracle text index
Oracle Database Tips by Donald BurlesonJanuary 18, 2015
|
By
Travis Hein
There is a document table in our schema that is populated
when documents (text files, word documents, pdf files, etc) are uploaded into
the system. There is a conversion tool that extracts raw text for us, so
capturing the contents for indexing is working, and we are able to search on
this table.
It is desired that we allow the user is able to perform a full text search on
the document title AND the document content.
And it is desired that the 'weight' or influence on the score for keywords found
in the document title have more of an influence on the relevance ranking (Score)
than the keywords found in the document content.
We initially implemented the search for title and document
content by creating two indicies, one on the document content and a second one
on the document title column in our document table:
CODECREATE INDEX ft_documentContent ON document (documentContent) INDEXTYPE IS
CTXSYS.CONTEXT;
CREATE INDEX ft_documentTitle ON document(documentTitle) INDEXTYPE IS
CTXSYS.CONTEXT;
Then our search query would be
CODESELECT document.*, Score(1) + 2*Score(2) as Score
FROM document
WHERE (CONTAINS(documentContent, 'the_keyword', 1) > 0
OR CONTAINS(documentTitle, 'the_keyword', 2) > 0 )
ORDER BY Score DESC;
This would work to combine the score ranking from the CONTAINS on our indices to
promote the ones found in the documentTitle a little higher than the content
alone for most cases, except when the document contained a ton of the keyword
matches.
The problem with this above approach was that we have found the performance to
be unexpectedly slow when used with even 10,000 documents. In our environment a
typical search with the above query with CONTAINS(1) OR CONTAINS(2) would take
around 2 - 4 seconds to complete.
While simplifying the query to use only a single CONTAINS, as below, only takes
about 0.5 seconds.
CODESELECT document.*, Score(1) as Score
FROM document
WHERE (CONTAINS(documentContent, 'the_keyword', 1) > 0 )
ORDER BY Score DESC;
So something about doing CONTAINS(1) OR CONTAINS(2) is bad for performance.
What we then looked into doing (where we are now) is to create a full text index
by specifying the MULTI_COLUMN_DATASTORE, which allows us to create a single
full text index that will internally consider more than one column in the
document table at once.
We got this working by setting up the preference in CTXSYS schema, connecting to
the CTXSYS user in sqlplus and running
CODEbegin
Ctx_Ddl.Create_Preference (
preference_name => 'my_datastore',
object_name => 'MULTI_COLUMN_DATASTORE' );
Ctx_Ddl.Set_Attribute (
preference_name => 'my_datastore',
attribute_name => 'COLUMNS',
attribute_value => 'documentTitle,documentContent' );
end;
/
And according to the documentation, creating a dummy column that is not being
indexd, and is not in the multi column specification:
CODEALTER TABLE DOCUMENT ADD dummy VARCHAR(1);
And then adjusting our Oracle text indexes to use the new multi column datastore
preference we set up:
CODEDROP INDEX ft_documentContent;
DROP INDEX ft_documentTitle;
CREATE INDEX ft_document ON document(dummy) INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('DATASTORE CTXSYS.my_datastore');
And then adjusting our query string to search on this new dummy column:
CODESELECT document.*, Score(1) as Score
FROM document
WHERE (CONTAINS(dummy, 'the_keyword', 1) > 0 )
ORDER BY Score DESC;
And this seems to work for us, the keyword being found in the document title
and/or the content.
The issue we are now having is the ranking for the occurrence of a keyword in
the title is now the same weight as the occurrence of a keyword in the document
contents.
What we would like to be able to do is tell the MULTI_COLUMN_DATASTORE settings,
so that when it is internally looking at both of the columns to consider its
output for CONTAINS() and the Score(1), that it would consider the occurrence of
a keyword match in the documentTitle column to have twice as much relevance as
the occurrence of a keyword in the documentContent column.
Solution to multi-column Oracle*text search weighting
I just needed to set up 'sections' within our user schema
before we created our index.
begin
ctx_ddl.create_section_group ( group_name
=> 'my_section_group' ,
group_type => 'basic_section_group' );
ctx_ddl.add_field_section ( group_name => 'my_section_group'
,
section_name => 'documentTitle', tag =>
'documentTitle', visible => true );
ctx_ddl.add_field_section ( group_name => 'my_section_group'
,
section_name => 'documentText', tag => 'documentText',
visible => true );
end;/
And then change the contains in the where clause:
CODE
SELECT ....
FROM ....
WHERE CONTAINS(dummy, '(((the_keyword)
within documentTitle))*2 OR ((the_keyword) within documentText)',1) > 0)
ORDER BY Score Desc
So here it is applying the weight of the documentTitle
section (perhaps i should not have created the section name with the same name
as the column in my table) separately from the documentText column.
 |
If you like Oracle tuning, you
might enjoy my book "Oracle
Tuning: The Definitive Reference", with 950 pages of tuning tips and
scripts.
You can buy it direct from the publisher for 30%-off and get instant
access to the code depot of Oracle tuning scripts. |

|
|