On my current assignment I am dealing with slowly changing
dimensions, I found an interesting technique I would like to
share.
In data warehouses one item that can be difficult to deal with
are slowly changing dimensions. Slowly changing dimensions are
dimensions whose values may alter over time such as addresses,
deployments, etc. Often it is a desire of the data warehouse
manager to maintain a running history of these changes over
time, this makes the dimension a type 2 dimension.
The reason
these are difficult to deal with is because determining what has
changed becomes difficult if there are more than a few fields in
the dimension. The data warehouse manager has the choice of a
difficult, field by field verification of each field in the
record or using some other form of field verification to
determine if a record has changed in a dimension thus requiring
a new record insertion. When the new record is inserted into a
slowly changing dimension the old record is marked as invalid
either by a flag or via a date field or set of date fields
allowing for a bi-temporal dimension.
One method for finding out if a record has a changed field is to
generate a CRC or a hash for the entire pertinent sections of
the record, store this signature with the record and then on
each record that is to be inserted, a signature is calculated
for the important sections of the record. This new signature is
compared to the stored signature and if the signatures match,
the record is tossed, if the signature is changed, the new
record is inserted, the old is marked as invalid and the new
marked as the current record, if the new record matches none of
the existing signatures, the record is inserted but no existing
records are altered.
However, calculating a CRC or HASH can be difficult, unless you
know something about what Oracle provides in the various
packages contained in Oracle. For example the OUTLN_EDIT_PKG
contains the GENERATE_SIGNATURE package. The GENERATE_SIGNATURE
package creates a 32 byte hash signature for up to a 32k-1 sized
string value passed to it either internally or via a procedure
call. If the pass-in is from a procedure call from SQL then the
length would be limited to 4000 bytes. Providing your critical
columns are less than 4000 bytes concatenated length you can
generate a signature that can be used to compare records, in
this scenario, each date or number field should be converted
using the appropriate function to a character equivalent.
To use the OUTLN_EDIT_PKG.GENERATE_SIGNATURE package the first
thing to do is make sure your user that owns the schema for the
data warehouse (DWH) has EXECUTE privilege on the OUTLN_EDIT_PKG:
From the owner of the OUTLN_EDIT_PKG, the user SYS, grant the
proper privilege:
GRANT
EXECUTE ON outln_edit_pkg TO ;
In order to use the GENERATE_SIGNATURE package most effectively,
I suggest wrapping it into a deterministic function, for
example:
From the DWH user you could create a function similar to:
CREATE OR
REPLACE FUNCTION get_sign (text VARCHAR2)
RETURN RAW
DETERMINISTIC
AS
signature RAW(32);
text2 VARCHAR2(32767);
BEGIN
text2:=text;
IF text2 IS NULL THEN
text2:='1';
END IF;
sys.outln_edit_pkg.generate_signature(text2,signature);
RETURN signature;
END;
/
Note that this is just a bare-bones function, exception checking
should be added.
Notice I use a default value of '1' if the input value is null,
because the function generates a hash signature you cannot pass
it a null value, so in the case of a null value we change it to
a 1, you of course could use any string value you desire, just
be sure to document it!
Once you have a deterministic function you can either use it as
an INSERT into a table column, or, use it to generate a function
based index. By using a function based index and query re-write
you can eliminate the need to have a column in the table. This
allows you to regenerate the matching columns without having to
update the complete dimension each time there may be columns
added to the dimension. Also, there are performance benefits to
using a function based index or an internal column over doing
either a column by column match. Let's look at some examples:
First, we have our test table TEST, in this case I created it
from the DBA_OBJECTS view using a simple CTAS.
SQL> DESC
test
Name Null? Type
----------------------------------------- --------
-------------------
OWNER VARCHAR2(30)
OBJECT_NAME VARCHAR2(128)
SUBOBJECT_NAME VARCHAR2(30)
OBJECT_ID NUMBER
DATA_OBJECT_ID NUMBER
OBJECT_TYPE VARCHAR2(18)
CREATED DATE
LAST_DDL_TIME DATE
TIMESTAMP VARCHAR2(19)
STATUS VARCHAR2(7)
TEMPORARY VARCHAR2(1)
GENERATED VARCHAR2(1)
SECONDARY VARCHAR2(1)
SQL> SELECT COUNT(*) FROM test;
COUNT(*)
----------
32359
Once we have our test table, we create our function to call the
GENERATE_SIGNATURE and then use it to create a function based
index:
SQL>
CREATE INDEX test_fbi ON
test(get_sign(ownerobject_namesubobject_name));
Index created.
Before we can use query rewrite we must analyze the table and
index as well as set the proper initialization parameters, (query_rewrite_enabled
= true and query_rewrite_integrity = trusted).
SQL>
ANALYZE TABLE test COMPUTE STATISTICS;
Table analyzed.
Let's look at a query that will use this index and see what the
performance is like.
SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') =
get_sign(a.ownera.object_namea.subobject_name);
OBJECT_ID
----------
35567
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=324
Bytes=11016)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=324
Bytes=11016)
2 1 INDEX (RANGE SCAN) OF 'TEST_FBI' (NON-UNIQUE) (Cost=2
Card=129)
Statistics
----------------------------------------------------------
74 recursive calls
0 db block gets
14 consistent gets
1 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> /
OBJECT_ID
----------
35567
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=34)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=34)
2 1 INDEX (RANGE SCAN) OF 'TEST_FBI' (NON-UNIQUE) (Cost=2
Card=1)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> /
OBJECT_ID
----------
35567
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=34)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=34)
2 1 INDEX (RANGE SCAN) OF 'TEST_FBI' (NON-UNIQUE) (Cost=2
Card=1)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Now let?s drop the index and re-run our query to check
performance without it:
SQL> DROP
INDEX test_fbi;
Index dropped.
SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') =
get_sign(a.ownera.object_namea.subobject_name);
OBJECT_ID
----------
35567
Elapsed: 00:00:01.04
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=138
Card=324
Bytes=11016)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=138 Card=324
Bytes=11016)
Statistics
----------------------------------------------------------
243 recursive calls
0 db block gets
526 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> /
OBJECT_ID
----------
35567
Elapsed: 00:00:01.03
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=138 Card=324
Bytes=11016)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=138 Card=324
Bytes=11016)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') =
get_sign(a.ownera.object_namea.subobject_name);
OBJECT_ID
----------
35567
Elapsed: 00:00:01.03
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=132 Card=324
Bytes=11016)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=132 Card=324
Bytes=11016)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
446 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
As you can see the simple SELECT statement using the function
based index performs in too small a time interval for Oracle's
timing command to measure (the test was run several times, with
identical results), while the SELECT not able to use the index
takes a consistent time just over 1 second (the tests were run
several times, and the times averaged).
As you can see even when the table is read from memory (physical
reads = 0) the consistent gets (446) are much higher than those
from the index based selects ( consistent gets 4) and the index
based query performs better even though it has 74 recursive
calls. The recursive calls were only present on the first run,
after that the recursive calls were 0.
In situations where the call is repeated many times, such as for
loading dimension tables, then the performance improvement
should be dramatic.
In other tests, using an added column SIGNATURE, a RAW(32), and
an index on the column we can get identical performance to a
function based index and table combination:
First, using a modified SELECT to go against the SIGNATURE
column and no index:
SQL> alter
table test add signature raw(32);
Table altered.
SQL> update test a set a.signature=get_sign(a.ownera.object_namea.subobject_name);
32359 rows updated.
SQL> commit;
Commit complete.
SQL> set autotrace on
SQL> set timing on
SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') = a.signature;
OBJECT_ID
----------
35567
Elapsed: 00:00:00.08
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=116 Card=324
Bytes=7128)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=116 Card=324
Bytes=7128)
Statistics
----------------------------------------------------------
277 recursive calls
0 db block gets
534 consistent gets
3 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
4 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> /
OBJECT_ID
----------
35567
Elapsed: 00:00:00.07
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=116 Card=324
Bytes=7128)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=116 Card=324
Bytes=7128)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
So without an index the use of a special SIGNATURE column
performs worse than the table and function based index
combination. However, when we add an index, the performance is
the same from the timing and the statistics.
SQL>
create index test_ui on test(signature);
Index created.
Elapsed: 00:00:00.07
SQL> analyze table test compute statistics;
Table analyzed.
Elapsed: 00:00:02.05
SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') = a.signature;
OBJECT_ID
----------
35567
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=20)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=20)
2 1 INDEX (RANGE SCAN) OF 'TEST_UI' (NON-UNIQUE) (Cost=2
Card=1)
Statistics
----------------------------------------------------------
26 recursive calls
0 db block gets
8 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> /
OBJECT_ID
----------
35567
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=20)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=20)
2 1 INDEX (RANGE SCAN) OF 'TEST_UI' (NON-UNIQUE) (Cost=2
Card=1)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> /
OBJECT_ID
----------
35567
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=20)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=20)
2 1 INDEX (RANGE SCAN) OF 'TEST_UI' (NON-UNIQUE) (Cost=2
Card=1)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
However, if there is a chance that columns will be added
or removed using a column inside the dimension will require the
entire dimension to be updated a row at a time. In low row-count
dimensions this may be acceptable, in large dimensions with
thousands of rows it may be better to use the table and function
based index solution.
|
|
Get the Complete
Oracle SQL Tuning Information
The landmark book
"Advanced Oracle
SQL Tuning The Definitive Reference" is
filled with valuable information on Oracle SQL Tuning.
This book includes scripts and tools to hypercharge Oracle 11g
performance and you can
buy it
for 30% off directly from the publisher.
|