One important feature of Oracle data warehousing is the creation and
maintenance of complex classification hierarchies. Clients need the
ability to create custom-defined populations and compare these
populations to other arbitrary data groupings. In data warehouse
vernacular, these custom classifications are known as “cohorts” or
“clusters.
This ability to custom-define populations is the foundation of
advanced analytics, traditionally within the domain of data
warehousing, where correlations between populations are discovered
using multivariate statistics.
However, there is a push towards allowing the definition of cohorts
online. Sophisticated data warehouse users are requesting the
ability to define and compare “user defined populations” in
real-time using Oracle 10g Discoverer.
For a simple example, consider a point of sale data warehouse for a
grocery store chain. When customers use their “shoppers club” card
to receive discounts, the POS warehouse tracks their purchases. An
executive might want to create an abstract data type called “person
type” and then
instantiate user-defined diseases based on complex
Boolean rules. These classification groups are often given
arbitrary names:
-
“A YUPPIE is someone over the age of 18 with a
college degree who has a home valued at more than $300,000, a
gross income of at least $100,000 per year”.
-
“A DINK (double-income, no kids) is a married
couple with no children whose combined income is at least
150,000 per year, a pristine credit rating and net assets of
more than $200,000”.
Once defined, these arbitrary categorizations must be able to
take-on full status as a warehouse dimension such that they can be
used in OLAP queries:
-
Display the average monthly coffee expenses for
DINKS, grouped by MONTH
-
Display the average grocery expenses for
YUPPIES, grouped by NET INCOME and OCCUPATION
This ability to define and compare clusters is a important feature
of any data warehouse. In an article section titled “the
analytics of behavior”, Dr. Ralph Kimball notes that clustering
is critical for:
Classification is possible once you have clusters. If a new
customer prospect can be associated with one of your existing
clusters, you can reasonably infer that this customer will behave
like the other members of that cohort group. Note the word behave.
You have classified the customers by their behavior. We need some
more advanced analytics here in order to understand how close the
prospect is to the centroid of the existing cohort group, another
step up the analytic ladder.
Prediction is the highest art form. You can associate a
numeric metric with each known member of a cohort group and then use
that metric together with the "distance" to the new prospect to
derive a numeric prediction of lifetime value, or likelihood to
default.
From an operational perspective we must have the ability to define
these clusters and compare their behavior to other clusters.
The concept of “behavior tags”
Dr. Kimball coined the term “behavior
tags” to describe the process of identifying participation in
clusters. Kimball notes that participation on a cluster is
time-based, and someone can be a YUPPIE one year and then change
groups as their income or occupation changes. The trick is to be
able to define the cluster and then pre-define those members of the
cluster over time.
For our purposes we need to use Oracle Discoverer to allow our
sophisticated users to create user-defined populations and use these
arbitrary groupings within all areas of Discoverer, including OLAP
and decision support.
Modeling Clusters with Oracle Discoverer
These are suggestions from Mark Rittman. Take your two
classification groups:
-
“A YUPPIE is someone over the age of 18 with a
college degree who has a home valued at more than $300,000, a
gross income of at least $100,000 per year”.
-
“A DINK (double-income, no kids) is a married
couple with no children whose combined income is at least
150,000 per year, a pristine credit rating and net assets of
more than $200,000”.
Looking at Yuppies, the distinguishing characteristics are age,
education, home value and gross income. All of these distinguishing
factors could be considered attributes of a customer dimension. The
same goes for DINKS, with the attributes being marital status,
number of children, combined income, credit rating and assets.
Your mail says that these groupings would become full dimensions,
but in Discoverer and OLAP terms you could consider these groupings
as a "saved selection" of a number of dimension members. One saved
selection would be customers who met the Yuppie criteria, another
would be customers who met the DINK criteria. I don't think these
groupings of customers would be dimensions in themselves, what they
are is a selection of members of the customer dimension.
Where Discoverer 10g comes into this is that, when you use
Discoverer for OLAP, and the OLAP Option, you can use the Query
Wizard to create saved selections like these using an OLAP-aware GUI
tool. The user can come up with a subset of (for example) customers,
who have a shoe size of 5, who's income is above average, who have
spent more than the average on electric items, and who have not been
mailed before, and make a saved selection (i.e, a "cluster") of
these for use later. These saved selections can then be recalled to
compare one cluster with another.
In terms of Discoverer 10g, I think this is what you are looking at.
In terms of implementation, you would use Discoverer for OLAP
against either a relational OLAP schema (regular star schema with
additional metadata) or you'd move the data into an analytic
workspace (better query performance).
Take a look at this viewlet:
http://www.oracle.com/technology/products/bi/files/viewlets/Chapter4_viewlet_swf.html
Of course the real question is what attributes should I pick to
create my clusters - someone would have come up with the "Yuppie"
cluster as one that could be profitable, and the real skill is
putting these clusters together in the first place, which is of
course a task usually done using data mining software. All you're
doing with Discoverer is taking the attributes that describe the
cluster and selecting them using the tool.
Other than that, if you're after a more custom solution - perhaps
one that combines data mining and selection building, perhaps one
that more closely matches the customers requirements, then you're
looking at a custom BI application using Oracle BI Beans.
One thing to bear in mind with the Discoverer for OLAP solution is
that there's no EUL - you use the Discoverer Catalog instead, and
create the dimensions, measures, permissions and so on using
Analytic Workspace Manager.