"In the context of an analytic workspace, what is
sparsity? what tests can we use to determine whether a
dimension is sparse? What impact does it have on the cube
generation if something is nominated as sparse/non-sparse?"
Sparsity refers to the extent to which a measure contains
null values, or "NA" values as they are referred to within
Express and analytic workspaces. An NA value can occur when
there is no measure value for a particular combination of the
measure's dimension, and just as with nulls in regular
relational Oracle, they do take up a bit of storage space and in
addition add to the time required to perform an aggregation.
When we talk about sparsity, what we're really talking about
is the sparsity of a measure, not individual measures, which is
where I think the confusion comes in sometimes. To take an
example, we might be building a cube for a supermarket, where
one dimension is geography (individual stores), another is time
(lowest level being a week), another is customers and the last
one is products. If we have 1000 stores, 52 weeks in a year,
500,000 customers and 10000 products, our measure has a
potential 1000x52x10000x500,000 = 260,000,000,000,000 cells, but
we might only have 1872000000 populated cells in our measure
(450,000 customers shopping on average 26 times a year, buying
40 products at just 2 stores) making our measure (936,000,000 /
260,000,000,000,000) x 100 = 0.00036% sparse.
If we create this measure and don't do anything to deal with
sparsity, and assuming each cell takes up 8 bytes, our measure
is going to be ( 260,000,000,000,000 x 8 ) / 1,024,000 =
2,031,250,000Gb in size, which is clearly impractical. To deal
with this issue of
database explosion there are a number of methods within
Oracle OLAP to deal with sparse and very sparse cubes, the main
one of which is the replacement of "sparse" dimensions with
something called a
composite.
This is where the concept of dimensions being "sparse" comes
in. In our example, it's the measure itself that's sparse (only
0.00036% of cells have a value), and looking at the factors that
make this up, we can observe that most customers shop on average
every two weeks, but only visit a small number of stores, and
only buy a small fraction of the total products on sale.
Therefore, we can say that time is dense (people shop once out
of every two weeks) and customers is dense (almost all customers
make a purchase) but products and geography are sparse (they
only buy a small selection of products from a small set of
stores). In this example then, we'd probably set time as dense,
products and geography as sparse, and probably set customers as
sparse as well (it's the interaction of products, customers and
geography that makes the cube sparse).
If we do this, and AWM creates a composite to replace the
sparse customer, product and geography dimensions, our composite
dimension will have 450,000 x 40 x 2 = 36, 000,000 "tuples"
(combinations of dimension members), which when we then multiply
it by 52 weeks gives us a measure with 1,872,000,000 cells,
936,000,000 of which will be populated, giving us a 50% measure
sparsity and taking up 14.625Gb - still big, but nowhere near as
big as our original measure. Having a smaller measure through
using a composite reduces the amount of time required to
aggregate a cube (you probably would never finish rolling up the
original one) and of course the disk space required to store it.
Composites do of course have drawbacks; although they are
populated automatically when initially defined they take time to
create, which is why in some cases conjoints, or more recently
with Oracle 9i and 10g concat dimensions, are sometimes used
instead, as with these you populate the object manually which is
useful if you have special knowledge over which tuples are
required to be created. However for most applications they are a
"no-brainer" and should be used whenever the cube is sparse,
which most of them are. One other point to note is that, when
using a composite, you should always leave at least one
dimension as dense (usually "time") as composites with
all-sparse dimensions seem to have performance problems
(experience from other colleagues in the field, don't know the
reason why but seems to be generally accepted).
One other point to note is with Oracle OLAP 10g, where you
get a new feature called
compressed
composites. These take the composite idea further and for
particularly sparse cubes, additionally remove redundant data
from the measure creating the smallest possible measure,
composite and composite index (the bit that links the composite
to the measure). Compressed composites currently have few
restrictions (no partial aggregation, SUM aggregation only)
which are likely to be lifted when 10g Release 2 comes out. If I
was building the cube described in the example above, I'd
definitely try a compressed composite as with that level of
sparsity, I'd expect a much smaller measure again.