One of the questions that often comes up when I'm helping people with Oracle OLAP is: what's the most efficient way to load incremental data into a cube? Take, for example, a situation where you have an initial load of 8000 rows into a cube that has five dimensions, and subsequently you need to load another 2000-odd rows. Do you rebuild the cube from scratch when you need to add the further 2000 rows, or do you leave the cube as it is and add the 2000 rows on top of what's already there, either as a separate load on their own or mixed in with the data that was already loaded? If you're using AWM10g, do you select the "Aggregate the full cube" option or the "Aggregate just for incoming values" option, and if you choose the latter, what benefits does it bring?
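For anyone who hasn't looked under the covers, here's a rough sketch of what the two AWM options boil down to in OLAP DML, run through the DBMS_AW.EXECUTE procedure. To be clear, this is my own illustration rather than what AWM literally executes, and the analytic workspace, cube variable, dimension and aggmap names (sales_aw, sales_cube, time, sales_aggmap) are made up:

BEGIN
  DBMS_AW.EXECUTE('AW ATTACH sales_aw RW');

  -- "Aggregate the full cube": with every dimension member in status,
  -- AGGREGATE recalculates all of the summary cells
  DBMS_AW.EXECUTE('ALLSTAT');
  DBMS_AW.EXECUTE('AGGREGATE sales_cube USING sales_aggmap');

  -- "Aggregate just for incoming values": limit status to the members
  -- touched by the new rows first, so AGGREGATE only rebuilds the
  -- affected cells (swap these in for the two lines above)
  -- DBMS_AW.EXECUTE('LIMIT time TO ''2005.JAN''');
  -- DBMS_AW.EXECUTE('AGGREGATE sales_cube USING sales_aggmap');

  DBMS_AW.EXECUTE('UPDATE');
  DBMS_AW.EXECUTE('COMMIT');
  DBMS_AW.EXECUTE('AW DETACH sales_aw');
END;
/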

This is actually one of the questions I've addressed in the forthcoming article I've written for DBAZine on Oracle Database 10g OLAP performance tuning, but I don't think I'm giving too much away when I tell you that, with 9iR2 and 10gR1, it actually doesn't make any difference at all whether you choose to aggregate just for new values or for all values in the cube, or whether you separate out the new data and load it in on its own, or just reload the entire dataset, old data and new - the aggregation process takes about the same time whatever you do. However, one of the new things promised with 10gR2 is that the loading of incremental data is much faster than with 9iR2 or 10gR1, as this excerpt from the Oracle OLAP 10gR2 PDF shows:
[Slide from the Oracle OLAP 10gR2 PDF, suggesting around an 18-fold improvement for incremental loads.]
Quite an
impressive improvement. With this in mind, I thought I'd load up
the dataset that I used for the article and see what sort of
improvements I could find.
The cube consisted of two measures (order quantity and ship quantity) and five dimensions (product, customer, channel, time and promotion). The source table for the measures contained 11429 rows, which I split into two load batches: 8884 rows in the first and 2545 in the second (there's a SQL sketch of this split after the list below). I then put together four scenarios:
- Load the first 8884 rows into the cube and select "Aggregate the full cube", then load these 8884 rows plus the 2545 additional rows, again selecting "Aggregate the full cube".
- Load the first 8884 rows as before and select "Aggregate the full cube", then load these 8884 rows plus the 2545 additional rows, but this time select "Aggregate just for incoming values".
- Load the first 8884 rows as before, then additionally load just the 2545 new rows and select "Aggregate the full cube".
- Load the first 8884 rows as before, then additionally load just the 2545 new rows, but this time select "Aggregate just for incoming values".
These scenarios were tested on 10gR1 and 10gR2 on Windows XP SP2, with 10gR2 being the developer preview that's available on OTN. 10gR2 was running in a VMware virtual machine with 1GB of RAM allocated, whilst 10gR1 was running on the OS proper. All settings were left at their defaults, with no changes to SGA_TARGET, PGA_AGGREGATE_TARGET and so on.
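If you want to repeat the tests, you can confirm those two parameters are still at their defaults from SQL*Plus:

SHOW PARAMETER sga_target
SHOW PARAMETER pga_aggregate_target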
With 10g Release
1, the results looked like this:
[Table: 10g Release 1 load and aggregation timings for the four scenarios.]
As I said earlier, there's no real difference between any of the load scenarios; each of them took around the same time to process the additional data. Also, the rate at which incremental data is processed is about one-fifth of the rate of the original data load - my take on this is that the load routine within Oracle OLAP has to "merge" the incoming data in with the existing cells in the measures, which is a lot slower than straight "inserts". Nonetheless, there's no performance boost to be had either by selecting "Aggregate just for incoming values" or by separating out new data from old. In fact, if you look at the last scenario, doing a full load of all the 11k rows, you'll see that it's actually quicker to rebuild the cube from scratch than to try and add in an incremental load.
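To picture why the merge is slower, here's a rough SQL analogy (the fact and staging table names are hypothetical - the real load goes into the analytic workspace, not a relational table):

-- Initial load: straight inserts, no need to look at existing data
INSERT INTO sales_fact
SELECT product_id, time_id, order_qty FROM sales_staging;

-- Incremental load: every incoming row has to be matched against
-- what's already stored, then updated or inserted accordingly
MERGE INTO sales_fact f
USING sales_staging s
ON (f.product_id = s.product_id AND f.time_id = s.time_id)
WHEN MATCHED THEN
  UPDATE SET f.order_qty = s.order_qty
WHEN NOT MATCHED THEN
  INSERT (product_id, time_id, order_qty)
  VALUES (s.product_id, s.time_id, s.order_qty);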
With 10g Release 2 (remember, it's running in a VMware virtual machine, so we're not strictly comparing apples with apples, and it's beta code) the results look like this:
[Table: 10g Release 2 load and aggregation timings for the four scenarios.]
A few points to note here:
- The rate at which initial data is loaded is about 50% faster than 10gR1 - not bad.
- When the new data is mixed in with the old and reloaded into the cube, 10gR2 is still faster than 10gR1, but the benefit is smaller than with the initial data load.
- If the new data is split out from the old before it's loaded, 10gR2 is up to twice as fast as 10gR1 for loading incremental data. Strangely, it didn't seem to matter whether I selected the "Aggregate just for incoming values" option or not; in fact, in the tests I carried out the load was actually faster if I just selected "Aggregate the full cube".
- With these performance improvements, it's now worthwhile doing an incremental load rather than rebuilding your cube from scratch each time.
I ran the tests a few times to try and get some stable, averaged-out figures, but you should bear in mind that the dataset is comparatively small, and the incremental load "boosting" algorithm may provide more benefit the larger your cube gets.
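If you want to reproduce the timings yourself, one way to time an aggregation step from PL/SQL is along these lines (the cube and aggmap names are the same hypothetical ones as before):

-- Time an aggregation step; DBMS_UTILITY.GET_TIME returns
-- hundredths of a second
DECLARE
  t0 PLS_INTEGER;
BEGIN
  t0 := DBMS_UTILITY.GET_TIME;
  DBMS_AW.EXECUTE('AGGREGATE sales_cube USING sales_aggmap');
  DBMS_OUTPUT.PUT_LINE('Elapsed: ' ||
    (DBMS_UTILITY.GET_TIME - t0) / 100 || 's');
END;
/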
So, what this says then is that there are performance improvements in 10gR2 for loading incremental data, but (from my tests, anyway) they are found when you load just the new data into a cube, rather than reloading the entire dataset including your new data. I wasn't, however, able to see any performance benefit from choosing the "Aggregate just for incoming values" option - I seemed to get the benefit whether the option was selected or not. The performance benefit isn't huge - certainly not the 18-fold improvement that the Oracle slide suggests - but it is there, and it made the difference for me between reloading the cube from scratch each time and putting in place a system where we perform an incremental load of just the new rows.
Anyone else had a play around with this, perhaps got closer to the improvements in the slide? Let me know if you have.