"In the context of an analytic workspace, what is sparsity? what tests can we use to determine whether a dimension is sparse? What impact does it have on the cube generation if something is nominated as sparse/non-sparse?"

Sparsity refers to the extent to which a measure contains null values, or "NA" values as they are known within Express and analytic workspaces. An NA value occurs when there is no measure value for a particular combination of the measure's dimensions, and just as with nulls in the regular relational Oracle database, they take up a small amount of storage space and add to the time required to perform an aggregation.

When we talk about sparsity, what we're really talking about is the sparsity of a measure, not of individual dimensions, which is where I think the confusion sometimes comes in. To take an example, we might be building a cube for a supermarket, where one dimension is geography (individual stores), another is time (lowest level being a week), another is customers and the last one is products. If we have 1,000 stores, 52 weeks in a year, 500,000 customers and 10,000 products, our measure has a potential 1,000 x 52 x 10,000 x 500,000 = 260,000,000,000,000 cells, but we might only have 936,000,000 populated cells in our measure (450,000 customers shopping on average 26 times a year, buying 40 products at just 2 stores), making our measure (936,000,000 / 260,000,000,000,000) x 100 = 0.00036% dense (in other words, 99.99964% sparse).
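The arithmetic above can be checked with a quick sketch. The figures are the hypothetical supermarket numbers from the example, not real data:

```python
# Dimension sizes from the supermarket example (hypothetical figures).
stores, weeks, products, customers = 1_000, 52, 10_000, 500_000

# Every possible cell in the measure.
potential = stores * weeks * products * customers   # 260,000,000,000,000

# Cells that actually hold data: 450,000 active customers, 26 shopping
# trips a year, 40 products per trip pattern, spread over 2 stores.
populated = 450_000 * 26 * 40 * 2                   # 936,000,000

# Density as a percentage of the potential cells.
density_pct = populated / potential * 100           # 0.00036%
```

Seeing the ratio written out this way makes it obvious why doing nothing about sparsity is not an option for a cube like this.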

If we create this measure without doing anything to deal with sparsity, and assuming each cell takes up 8 bytes, our measure is going to be (260,000,000,000,000 x 8) / 1,024,000,000 = 2,031,250Gb in size (roughly two petabytes), which is clearly impractical. To address this issue of database explosion, Oracle OLAP provides a number of methods for handling sparse and very sparse cubes, the main one being the replacement of "sparse" dimensions with something called a composite.
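The storage estimate works out as follows, dividing by 1,024,000,000 bytes per Gb so that it is consistent with the 14.625Gb composite figure later in this article:

```python
# Naive (non-composite) storage for the example measure.
potential_cells = 260_000_000_000_000   # from the supermarket example
bytes_per_cell = 8

# Using 1Gb = 1,024,000,000 bytes, the convention the article's
# later 14.625Gb figure implies.
size_gb = potential_cells * bytes_per_cell / 1_024_000_000   # 2,031,250 Gb
```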

This is where the concept of dimensions being "sparse" comes in. In our example, it's the measure itself that's sparse (only 0.00036% of cells have a value), and looking at the factors that make this up, we can observe that most customers shop on average every two weeks, but only visit a small number of stores and only buy a small fraction of the total products on sale. Therefore, we can say that time is dense (people shop once every two weeks) and customers is dense (almost all customers make a purchase), but products and geography are sparse (each customer only buys a small selection of products from a small set of stores). In this example, then, we'd set time as dense, products and geography as sparse, and probably set customers as sparse as well (it's the interaction of products, customers and geography that makes the cube sparse).
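As a practical test for nominating a dimension as sparse or dense, one rough check you can run against the fact data is to measure, for each dimension, what fraction of its members are populated once the other dimensions are held fixed; dimensions with a low average fill ratio are candidates for sparse. This is a minimal sketch of my own, not an Oracle OLAP feature, with illustrative toy fact rows:

```python
from collections import defaultdict

def avg_fill_ratio(facts, dim_index, dim_size):
    """For each fixed combination of the other dimensions, count how many
    distinct members of the chosen dimension actually appear in the fact
    data, then average the resulting fill ratios. A low result suggests
    nominating that dimension as sparse."""
    seen = defaultdict(set)
    for row in facts:
        others = row[:dim_index] + row[dim_index + 1:]
        seen[others].add(row[dim_index])
    return sum(len(s) / dim_size for s in seen.values()) / len(seen)

# Toy fact rows: (store, week, product).
facts = [
    ("s1", "w1", "p1"),
    ("s1", "w2", "p1"),
    ("s1", "w1", "p2"),
]

# Week dimension (index 1), 2 members in total:
# ("s1","p1") sees both weeks (1.0), ("s1","p2") sees one (0.5).
week_fill = avg_fill_ratio(facts, 1, 2)   # 0.75, fairly dense
```

In the supermarket example, the same calculation over the real fact data would give time a ratio around 0.5 (26 weeks out of 52) but products and stores ratios near zero, matching the dense/sparse nominations above.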

If we do this, and AWM creates a composite to replace the sparse customer, product and geography dimensions, our composite will have 450,000 x 40 x 2 = 36,000,000 "tuples" (combinations of dimension members), which when multiplied by 52 weeks gives us a measure with 1,872,000,000 cells, 936,000,000 of which will be populated, making the measure 50% dense and taking up 14.625Gb - still big, but nowhere near as big as our original measure. Having a smaller measure through using a composite reduces the amount of time required to aggregate a cube (you probably would never finish rolling up the original one) and of course the disk space required to store it.
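The composite arithmetic, again as a sketch using the example's hypothetical figures:

```python
# One composite tuple per (customer, product, store) combination
# that actually occurs in the data.
tuples = 450_000 * 40 * 2               # 36,000,000 tuples

# The measure is now dimensioned by <time composite>.
cells = tuples * 52                     # 1,872,000,000 cells
populated = 936_000_000                 # half of them hold data

# Same 1Gb = 1,024,000,000 bytes convention as before, 8 bytes per cell.
size_gb = cells * 8 / 1_024_000_000     # 14.625 Gb
```

Going from roughly two petabytes to under 15Gb is the whole argument for composites in one calculation.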

Composites do of course have drawbacks. Although they are populated automatically, they take time to create when initially defined, which is why conjoints, or more recently (with Oracle 9i and 10g) concat dimensions, are sometimes used instead; with these you populate the object manually, which is useful if you have special knowledge of which tuples need to be created. However, for most applications composites are a "no-brainer" and should be used whenever the cube is sparse, which most of them are. One other point to note is that when using a composite you should always leave at least one dimension dense (usually time), as composites covering all of a measure's dimensions seem to suffer from performance problems (this is the experience of colleagues in the field; I don't know the reason why, but it seems to be generally accepted).

One final point to note concerns Oracle OLAP 10g, which introduces a new feature called compressed composites. These take the composite idea further and, for particularly sparse cubes, additionally remove redundant data from the measure, creating the smallest possible measure, composite and composite index (the structure that links the composite to the measure). Compressed composites currently have a few restrictions (no partial aggregation, and SUM aggregation only), which are likely to be lifted when 10g Release 2 comes out. If I was building the cube described in the example above, I'd definitely try a compressed composite, as with that level of sparsity I'd expect a much smaller measure again.