When building an Oracle
data warehouse, it is important to properly organize all of the
classification information about the facts in the warehouse. This
article will examine the basic techniques for identifying and
organizing classification attributes for any Oracle warehouse.
Basic data warehouse table for Oracle8i
In a basic data warehouse, the goal is to capture information about
some relevant fact. These facts are commonly kept according to a
time dimension.
Examples of commonly used facts for a warehouse include:
- The number of units sold each day.
- The average sale price per month.
- The number of widgets manufactured per day.
In addition to the basic facts and dates, the data warehouse will
also collect information about attributes of the facts. It is these
attributes that make the data warehouse function to provide end
users with insights about patterns in the warehouse.
Figure A shows a sample fact table.
Figure A |
 |
A fact table is created with the basic fact and all keys relating to
the classification of the fact. From this fact table, we see the
basic fact is number_of_units_sold. In addition, we see keys to
other information about the sale, namely the date, the key to the
store where the sales were made, and the SKU (stock keeping unit)
number of the items. By itself, the fact table does not provide much
useful information about the sale. However, when combined with
additional attributes, we can see a great deal of information about
a sale.
Let's begin this discussion with a basic review of attribute
classification techniques.
Handling attribute classification
One of the foremost problems in a data warehouse is the creation and
maintenance of the nonfact data attributes. A nonfact attribute is a
table column that serves to describe the fact. In all cases,
attributes will either be choose-one, finite value, or hierarchical.
Let's discuss each of these data attributes and the design
techniques you can use to implement them within a data warehouse.
Choose-one attribute
You implement a choose-one attribute in the case of mutually
exclusive attributes, such as gender, level of education, category
of income, and so on. In most cases, a flag column is used to
represent these data attributes, and Oracle check constraints are
used to enforce the valid values for the flag. The check constraint
validates incoming columns at row insert time. For example, rather
than having an application verify that all occurrences of region
are North, South, East, or West, you can add a check constraint to
the table definition to ensure the validity of the region
column.
Finite value attributes
Finite value attributes contain specific values that can't be
categorized by range values. Examples include last_name,
street_address, and phone_number.
Hierarchical attributes
Hierarchical attributes are the most challenging to represent in a
data warehouse. Within a classification tree, each branch may have
many levels, and once an entity has been classified, it should
automatically participate in all other classes it participates in.
A practical example of attribute
classification
To illustrate, let's take an example of a data warehouse for a
supermarket.
In a supermarket warehouse, buyers are encouraged to get "VIP cards"
in order to get discounts on products. When they apply for the VIP
card, the supermarket collects their demographic information,
including their age, income, education, and housing data.
At sale time, the VIP card cross-references the purchases with the
demographics of the person who made the purchase. After recording
the sales of the items and the person who purchased the items, we're
ready to store and analyze the data in our warehouse.
The main purpose of supermarket data warehouses is to target
promotions and mail advertising. By knowing the propensity of a
specific class of buyers to purchase a specific class of product,
the supermarket can target potential customers with great accuracy.
But how do we get started?
Let's assume that I go to the supermarket and buy a jar of prune
juice. By itself, this sale is not terribly interesting, but when we
view the sale in the context of the attribute data, we see that
there are many opportunities for summarizing the data. In addition
to the single attribute items, such as age and gender, we also see
data warehouse attributes that can be used to create a hierarchical
structure:
- Person attribute hierarchy:
- Education:
College education
Bachelor's degree
Master's degree
- Job:
Professional
Computer Science
- Item hierarchy:
- Edible items
Fruits
Juices
Bottled juices
Classification hierarchies are very important for the Oracle data
warehouse because they allow the correlation of distinct classes of
data. It is quite easy to tabulate the prune juice buying habits for
people with master's degrees, but what if we want to make global
correlations on buying behaviors for entire classes of items?
For example, with this hierarchy, we could perform the following
analysis:
- Cross-reference buying preferences for fruit products by the
job description of the buyer
- Cross-reference the buying preferences for juices by the
education of the buyer
From these examples, it should be clear that just knowing the basic
attributes for a fact is not enoughwe must be able to classify the
facts into a hierarchy.
Returning to our example from the supermarket, we remember that
while the customer entity has many attributes, each attribute is
distinct and can be represented independently of other attributes.
The Oracle table definition for the fact table might include the
following customer attributes:
CREATE TABLE FACT (
. . .
highest_education number(2)
CONSTRAINT highest_education_check
CHECK (highest_education between 0 and 20);
size_of_immediate_family number(2)
CONSTRAINT family_check
CHECK (size_of_immediate_family between 0 and
20);
number_of_children number(2)
CONSTRAINT children_check
CHECK (number_of_children between 0 and 20);
own_or_rent_home char(1)
CONSTRAINT own_check
CHECK (own_or_rent in 'U','O','R');
yearly_income_class number(1)
CONSTRAINT income_check
CHECK (yearly_income_class between 1 and 5);
Here, we've used numeric abbreviations to reduce space in the fact
table. In this case, a total of 8 bytes would be added to the fact
table to provide fast reference to information about the type of
customer who made the transaction. Also note that there are
provisions for unknown values for the customer, as would be the case
when a customer did not use a VIP carda "U" for unknown character
values and a zero for unknown numeric values.
Now, queries from end users desiring transaction information
according to the type of customer are easy to respond to because the
fact table does not need to be joined with the customer table. But
how can we identify the product classifications? Because the product
classifications are hierarchical in nature, we need to develop
another mechanism for representing them in our fact table.