Testing the Incremental Load Improvements in 10gR2 OLAP - Part II
December 27, 2005
Mark Rittman

Update 14-Sep-05:

Scott Powell dropped me a line this evening with some comments on the original article. They're very significant, as they explain why the incremental loads are taking almost as long as the full loads. Read on...

"I suspect this is what is also causing your performance problems / results with the incremental loads in the article you published on Sept 8th "Testing the Incremental Load Improvements in 10gR2 OLAP".

I've attached a quick script I wrote. This script creates three "tuple" tables that hold all of the composite tuples from the source data.

create table tuples_all as
select distinct
	   channelid,
	   promoid,
	   prodid,
	   custid
from   gs_sales;

create table tuples_load1 as
select distinct
	   channelid,
	   promoid,
	   prodid,
	   custid
from   gs_sales_load_1;

create table tuples_load2 as
select distinct
	   channelid,
	   promoid,
	   prodid,
	   custid
from   gs_sales_load_2;

select count(*) from tuples_all;
select count(*) from tuples_load1;
select count(*) from tuples_load2;

TUPLES_ALL loads from gs_sales, TUPLES_LOAD1 loads from gs_sales_load_1, and TUPLES_LOAD2 loads from gs_sales_load_2. Here are the results:

tuples_all  	1,688 tuples 
tuples_load1  	1,634 tuples 
tuples_load2  	1,135 tuples 

Basically, with time as the dense dimension, I've removed it as a consideration from the tuples (because any reference to any time period will have to load the entire page). As you can see, your "load1" and "load2" data is touching almost all of the tuples. I believe this is why your incremental loads are taking almost the same amount of time as a full load - because essentially you are updating almost every page with both the initial and the incremental loads.
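
By way of illustration, the tuple tables above can be used to quantify that overlap - a query along these lines (just a sketch using the tables created by the script) shows what fraction of the full tuple set each load touches:

-- What percentage of the full base-level tuple set does each load touch?
-- With the counts above, load 1 covers roughly 97% and load 2 roughly 67%
-- of the 1,688 tuples.
select round(100 * (select count(*) from tuples_load1)
                 / (select count(*) from tuples_all), 1) as pct_tuples_load1,
       round(100 * (select count(*) from tuples_load2)
                 / (select count(*) from tuples_all), 1) as pct_tuples_load2
from   dual;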

I suspect (but have not yet tested) that your performance results would be markedly different if you did the following:

a) redesign the variables to be dimensioned like this: <channel composite <time customer product promotion>>
b) split out the gs_sales_load_1 to include all data for time periods < 01-apr-2005
c) split out the gs_sales_load_2 to include all data for time periods >= 01-apr-2005

This gives a better representation of what an incremental load would look like - both because the variable now has TIME as a sparse dimension (meaning that when the engine tries to aggregate it, it doesn't have to touch every page), and because running incremental loads based on time periods is pretty common from a business perspective.
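
For illustration, the split in (b) and (c) might look something like the following - a sketch only, assuming GS_SALES holds its transaction date in a column called SALEDATE (the real column name may differ):

-- Hypothetical re-split of GS_SALES into an initial and an incremental set,
-- assuming a day-level SALEDATE column.
create table gs_sales_load_1 as
select * from gs_sales
where  saledate <  date '2005-04-01';

create table gs_sales_load_2 as
select * from gs_sales
where  saledate >= date '2005-04-01';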

I will do my best to clean up this documentation, but hopefully you'll find this useful. I'll try to check the performance of the incremental load in the next couple of days - if you beat me to it, let me know how it goes."

Thanks, Scott! I'll rerun the tests as soon as possible and report back on the results. Until then, treat the results in my test as unreliable.

Update 15-Sep-05: Scott sent a follow-up email:

"Mark, just FYI, I had the chance to redefine the cube and do some further testing. See the results immediately below - saw huge performance gains by choosing channel as the dense dim instead of time:

Your initial cube <time <customer product promotion channel>>
    Load data time: 65 seconds
    Aggregate data time: almost 1.5 hours
    Size of variables: approx 315 MB each

New cube <channel <time customer product promo>>
    Load data time: 1 second
    Aggregate data time: 33 seconds
    Size of variables: approx 10 MB each

Redefining sparsity obviously made a huge difference.

However, I then tried to do an incremental load, and did NOT see the results I expected. Basically, I broke up the gs_sales table into gs_sales_load_1 containing everything with a date < 01-apr-2005 (10,675 records), and gs_sales_load_2 with a date >= 01-apr-2005 (754 records). I expected to see a huge difference between the initial load and the incremental update, but didn't:

Step 1:     load data from gs_sales_load_1 and do a full cube aggregation:
                Load time: 1 sec
                Full aggregation time: 35 sec (note - a bit longer than the 33 seconds before, but not significant???)

Step 2:     incremental load from gs_sales_load_2 and do an incremental cube aggregation:
                Load time: 3 sec
                Incremental agg time: 21 secs

Obviously the 21 secs is better than the 35 seconds, but not by the orders of magnitude I expected to see (I thought it would take just 3 or 4 seconds to do the whole thing).

I'm not sure exactly what to make of this. With the solve times and disk size so small, it's hard to judge if these results are a true representation. I'm considering loading up my big play cube to try a more robust test.

What do you think?

Note also that I haven't tested anything yet with compressed composites turned on.

Thx,
Scott"

Update 16-Sep-05: I had a word with Paul Cannon, our resident OLAP/Express expert, who came back with the following observations:

"Mark,

This depends a bit on what Scott means by incremental load. Presumably he means loading data for one time period at a time. If this is the case, then having time inside the composite is forcing the creation of a new set of tuples for each load.

For example, if the combination of customer, product and promotion results in, say, 1000 tuples, then you will have 1000 'sets' of these tuples for each time period, but of course these tuples won't exist until data for each time period has been loaded. Therefore, when loading a new set of data for a new time period, as well as loading the data the AW has the internal overhead of creating the extra tuples. If time is outside the composite then the same single set of tuples will get used over again for each time period. (This is a very simplistic view - the actual number of tuples will usually vary for each load, but we'll ignore that for this example!)

To put this another way, with time outside the composite, loading a new set of data will result in variable size growing by one dimension value. With time inside the composite, the variable is growing by 1000 dimension values.
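
To make that concrete against the relational source, a query along these lines (again a sketch, with SALEDATE standing in for whatever the GS_SALES date column is actually called) shows how many composite tuples each per-period load would have to create when time sits inside the composite - whereas with time outside the composite, each load simply adds one new time dimension value and reuses the existing tuple set:

-- Hypothetical illustration: distinct <channel, promo, product, customer>
-- combinations per day, i.e. the composite tuples a per-day load must create
-- when TIME is inside the composite.
select saledate,
       count(distinct channelid || '~' || promoid || '~' || prodid || '~' || custid)
         as tuples_created_for_period
from   gs_sales
group  by saledate
order  by saledate;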

Whilst putting time inside the composite may result in gains in one area (aggregations, etc.), you have to bear in mind the negative effect this extra internal maintenance is having. Hence Scott won't get quite the performance improvement he's expecting.

Unless the time dimension is genuinely sparse (which seems unlikely, as data is being loaded on a per-time-period basis and presumably aggregated up a time hierarchy), perhaps Scott should try leaving both time and channel outside of the composite.

Paul."

Update 16-Sep-05: Scott has posted a question on this subject to the OTN Forum. You can check it out here.

Update 16-Sep-05: Scott subsequently commented on Paul's observations:

"I think Paul is referring to a typical OFA type setup, where almost every composite value (usually account & org) exists for just about every value of time (which is usually at the month level), making time a very dense dimension. However, on your GSW sample data, this is showing customer purchases at the day level spread out over 8 years. Unless the same customer purchases the same product through the same channel with the same promotion on just about every day over 8 years, time is going to be sparse.

If you look at the original .xls worksheet (copied here), the sparsity rating of 433 for TIME on the base-level data means that the same customer / product / channel / promo tuple only shows up once for every 433 time periods - meaning that if time is dense, the other 432 cells on a page would be empty. Note also that Paul is completely right that the data gets more dense after aggregation - in this case, the time sparsity rating goes from 433 down to 41, but that still means that, on average, the same customer / product / channel / promo tuple shows up only once every 41 time periods... This is why making time sparse instead of dense resulted in the variable sizes shrinking from 600+ MB down to about 20 MB, and reduced the solve times from 1.5+ hours to only 33 seconds.
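
As an aside, a figure like that 433 rating can presumably be reproduced from the source table as the number of possible day-level cells per tuple divided by the cells actually populated - a sketch only, assuming a SALEDATE column in GS_SALES and assuming that this is how the worksheet derives the rating:

-- Hypothetical base-level TIME sparsity rating:
-- (day-level time periods x base-level tuples) / populated fact rows.
select ( (max(saledate) - min(saledate) + 1)
         * (select count(*) from tuples_all) )
       / count(*) as time_sparsity_rating
from   gs_sales;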

One final thing to note is that I did re-split the data for the "initial" vs. "incremental" load to better highlight the incremental performance problem for Oracle. I split your GS_SALES table down to have everything prior to 6/28/2005 in gs_sales_load_1 (11,411 records), and only 6/28/2005 - 6/29/2005 in gs_sales_load_2 (18 records). This means that the incremental load should only be creating 18 new tuples plus the associated parent tuples (comes out to exactly 1,466 total tuples). Given that the entire solved cube contains 435,274 total tuples, I would expect the time required to aggregate to be (1,446 / 435,274) * the 33 seconds the full aggregation takes - or about 0.11 seconds. Obviously there may be some overhead in setting up the incremental loads (perhaps that is what is distorting the numbers on this small sample?).

So long and short - Paul's comments make sense in more of an "OFA" type setup, but not for a cube that focuses on customer purchases, etc. Also note that this is one of the big reasons I hate to see recommendations by Oracle that time always be made a dense dimension...because I have found, for MANY applications, that time is often one of the LEAST dense dimensions.

Let me know what you and Paul think! I'll let you know what I hear from the Oracle side (the thread is on OTN also)."

There's obviously a lot more to dealing with sparsity than saying "time is always the dense dimension", and Scott in particular has sent me some useful calculations and examples for working out sparsity both prior to, and after, a cube load. I'll post an article with my take on all this when I get back from Open World.


 

 