"Are there any sensible rules we should apply when
choosing aggregation levels. e.g. should you always aggregate
at the top and bottom levels? Should you only aggregate at the
levels that give you the most savings (e.g. aggregating from
period to year should in theory reduce by a factor of 12)?
Should we only consider aggregation at levels that support the
current reporting requirements. I guess we don't go crazy and
aggregate at all possible levels?"
When you load base-level data into an analytic workspace, if
you want data to appear at higher levels of aggregation you have
to have put an aggregation plan in place. In this respect
analytic workspaces are a bit different to relational tables -
if you want to sum up your sales figures to the regional and
product category levels, you just SUM the measures in the fact
table and do a GROUP BY to aggregate your data on the fly. With
analytic workspaces, data is only normally only loaded in at the
bottom level of a dimension hierarchy, and all the cells in the
measure that hold higher level data are empty until you specify
and execute an aggregation plan.
According to the Oracle By Example tutorial that comes with
Global Sample Schema,
"In analytic workspaces, aggregate data can be generated
at two distinct times:
- On the fly in response to a query. Calculated values
may be cached for use throughout the session, but they are
not shared among sessions.
- As part of the build procedure. Calculated values are
stored as part of the analytic workspace and shared by all
If your dimensions have multiple hierarchies or if the
hierarchies have many levels, then fully aggregating the
measures can increase the size of your analytic workspace (and
thus your database) geometrically. At the same time, much of
the intermediate level data may be accessed infrequently or
not at all.
The most effective method of summarizing data in an analytic
workspace is by storing some aggregates and calculating others
on the fly. A typical strategy for doing this is called
skip-level aggregation, because some levels are stored and
others are skipped until runtime."
The question of course is which levels to aggregate, and
which ones to skip. According to the tutorial:
"A good strategy for identifying levels for
pre-aggregation is to determine the ratio of dimension members
at each level, and to keep the ratio of members to be rolled
up on the fly at approximately 10:1. This ratio assures that
all answer sets can be returned quickly. Either the data is
stored in the analytic workspace, or it can be calculated by
rolling up 10 values at a time into each aggregate value.
This 10:1 rule is best applied with some judgment. You might
want to permit a higher ratio for levels that you know are
seldom accessed. Or you might want to pre-calculate levels at
a lower ratio if you know they have heavy use.
Slower varying dimensions take longer to aggregate because the
data is scattered throughout its storage space. If you are
optimizing for data maintenance, then fully aggregate the
faster varying dimensions and use skip-level aggregation on
the slower varying dimensions."
Which leaves the question as to whether you should always
aggregate the top and bottom levels. In terms of the top level,
in one respect it depends on whether it meets the criteria that
would cause any other level to be pre-aggregated: is it
frequently accessed, and does it have sufficient children to
warrant performing the aggregation in advance. One thing to bear
in mind here though is whether your reports all default to the
top level of the hierarchy when they are first loaded - if they
do, then it might be worth pre-aggregating at the top level
anyway, just to ensure that reports all come up straight away
and there is no perceived delay in bringing up your reporting
With regard to the bottom level, the reason you might (or
might not) specify this level of aggregation is when you are
considering aggregation across all the levels in a dimension.
When you think of it this way - what combinations of dimension
levels do you want to pre-summarize at - then you will need to
consider the base levels in any aggregation plan, as you might
want to presummarise your data at the product group, customer
region and month level combination if this would speed up a
common query. The important thing to realize here is that we're
not talking about summarizing your sales measure by time, or
products, or customers, in isolation - if we were,
pre-summarizing at the base level would be pointless, as there'd
be nothing to pre-summarize - we're talking about the "matrix"
of all possible dimension level combinations, and clearly in
this case we might want to pre-summarize sales by product
category, country (both of which are higher levels in their
respective dimensions) and month (which in our case is the
lowest level in the time dimension).