A look at decision-support systems, star schemas, data mining, relational
OLAP, and how to use Oracle 7.3 features with data warehouses.
As more companies begin to embrace the concept of a historical data
repository for online analytical processing (OLAP) and decision-support systems
(DSSs), new client/server issues are emerging as developers struggle to create
Oracle client/server applications that will perform acceptably. There's been a
great deal of interest in the application of data warehousing and
multidimensional databases (MDDBs) to advanced systems such as expert systems
and DSSs. These systems have been used to solve semi-structured and even
unstructured problems. Traditionally, these types of systems combined "inference
engines" and relational databases to store the knowledge processing components.
Unfortunately, little work has been done with the application of warehouse
databases for DSSs and expert systems.
Decision-Support Systems and Data Warehouses
DSSs are generally defined as systems that deal with semi-structured problems
- that is, problems with a structured component as well as a human intuition
component. The well-structured components are the decision rules, which are
stored as the problem-processing system, while the intuitive or creative
component is left to the user.
DSSs also let users create "what if" scenarios. Although Oracle claims that
DSSs do not allow the creation of what-if scenarios, the traditional definition
of decision support allows users to "model" a scenario and then vary the inputs
on that scenario to test their hypothesis. (See Foundations of Decision
Support Systems by Holsapple and Whinston [Academic Press of New York,
1981].) These are essentially modeling tools that let users define an
environment and then simulate the behavior of that environment under changing
conditions. For example, users of a DSS for finance could create a hypothetical
stock portfolio and then direct the DSS to model the behavior of that stock
portfolio under different market conditions. Once these behaviors are specified,
users may vary the contents of the portfolio and view the results. There has
been much debate about the role of intuition in this type of problem solving.
DSSs let users control the decision-making process by applying their own
decision-making rules and intuition. However, there is also much debate as to
whether artificial intelligence can be used to manage the intuitive component of
these systems.
There is another approach to the simulation of multiple dimensions: design
techniques used with Oracle 7.3. This push toward star schema design has been
somewhat successful, especially because the designers do not have to buy an MDDB
or invest in an expensive front-end tool. (See Figure 1.) Many vendors now have
tools to make a relational database appear to be a multidimensional
architecture.
Figure 1
In general, using a relational database for OLAP is achieved by any
combination of the following techniques:
1. Pre-joining tables together. This is an obtuse way of saying that a
denormalized table is created from the normalized online database. A large
pre-join of several tables may be called a fact table in a star schema.
2. Pre-summarization. This is performed to prepare the data for any
drill-down requests that may come from end users. Essentially, the different
levels of aggregation are identified, and the aggregate tables are computed and
populated when the data is loaded. (A sketch of these first two techniques
follows this list.)
3. Massive denormalization. The whole side effect of very inexpensive disk
storage has been the rethinking of the merits of third normal form (3NF).
Today, redundancy is widely accepted, as seen in the popularity of replication
tools, snapshot utilities, and non-first normal form databases. If you can
pre-create every possible result table at load time, then your end users will
get excellent response times when performing queries. The star schema is an
example of massive denormalization.
4. Controlled periodic batch updating. New detail data is rolled into the
aggregate table on a periodic basis, with all summarization recalculated as the
new data is introduced into the database.
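A minimal sketch of the first two techniques, using hypothetical table and
column names (not the example schema that appears later in this article):

-- Technique 1: pre-join the normalized online tables into one denormalized table
CREATE TABLE sales_prejoin AS
SELECT o.order_nbr,
       o.order_date,
       c.customer_name,
       i.item_desc,
       q.quantity_sold * i.list_price line_total
FROM   ord o, customer c, item i, quantity q
WHERE  o.cust_nbr  = c.cust_nbr
AND    q.order_nbr = o.order_nbr
AND    q.item_nbr  = i.item_nbr;

-- Technique 2: pre-summarize at load time for anticipated drill-down requests
CREATE TABLE sales_by_customer AS
SELECT customer_name, SUM(line_total) total_sales
FROM   sales_prejoin
GROUP  BY customer_name;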
Data Aggregation and Drill-Down
One of the most fundamental principles of an MDDB is the idea of aggregation.
As you know, managers at different levels require different levels of
summarization to make intelligent decisions. To allow each manager to choose the
level of aggregation, most warehouse offerings have a "drill-down" feature that
allows the user to adjust the level of detail, eventually reaching the original
transaction data. For obvious performance reasons, the aggregations are
precalculated and loaded into the warehouse during off-hours. (See Figure
2.)
Figure 2
There are several types of aggregation. The most common type is called a
roll-up aggregation. An example of this type of aggregation would be taking
daily sales totals and rolling them up into a monthly sales table. This type of
aggregation is relatively easy to compute and run. The more difficult type of
aggregation is the aggregation of Boolean and comparative operators. For
example, assume that a salesperson table contains a Boolean column called
"turkey." A salesperson is a turkey if his or her individual sales are below the
group average for the day. A salesperson may be a turkey on only 15 percent of
the individual days, but when the data is rolled up into a monthly summary, his
or her average may drop below the group average, making the salesperson a turkey
even though he or she had only a few, albeit very bad, sales days.
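A minimal sketch of a roll-up of this kind, assuming a hypothetical daily_sales
table with sale_date, salesperson_name, and sales_amount columns:

-- Roll daily sales totals up into a monthly sales table
INSERT INTO monthly_sales (month_year, salesperson_name, sales_amount)
SELECT TO_CHAR(sale_date, 'MM-YYYY'),
       salesperson_name,
       SUM(sales_amount)
FROM   daily_sales
GROUP  BY TO_CHAR(sale_date, 'MM-YYYY'), salesperson_name;

The turkey flag, by contrast, cannot simply be summed; it must be recomputed
against the monthly group average.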
The base rule is simple: if the data would fit well into a spreadsheet, then
it is probably well suited to an MDDB, or at least an MDDB representation.
Oracle's Relational Answer to MDDB
Dr. Ralph Kimball, founder of Red Brick Systems, popularized the term "star"
schema to describe a denormalization process that simulates the structure of an
MDDB. With a star schema, designers can simulate the functions of an MDDB
without purchasing expensive third-party software. Kimball describes
denormalization as the pre-joining of tables, such that the runtime application
does not have to join tables. At the heart of the star schema is a "fact" table.
The fact table is usually composed entirely of key values and raw data. A fact
table is generally very long and may have millions of rows. Surrounding the fact
table is a series of "dimension" tables that add value to the base information
in the fact table.
Figure 3
For example, consider the ER model for a sales database shown in Figure 3,
which uses a standard 3NF database to represent the sales of items. Because
there is no redundant information, you must compute salient data such as the
total for an order from the atomic items that make up the order. In this 3NF
database, you must create a list of line items, multiplying the quantity ordered
by the price, for all items that belong in order 123. In the following example,
an intermediate table called temp is created to hold the result list:

create table temp as
select (quantity.quantity_sold * item.list_price) line_total
from quantity, item
where
quantity.order_nbr = 123
and
quantity.item_nbr = item.item_nbr;

select SUM(line_total) from temp;
Also note that the region-state-city table hierarchy in this example is very
deliberate. In order to be truly in 3NF, you do not allow redundant information
(except, of course, foreign keys). Because this example has been fully
normalized into five tables, a query that would appear very simple to the end
user would have relatively complex SQL. For example, the SQL to calculate the
sum of all orders in the western region might look very complex and involve a
five-way table join. (See Listing 1.)

Listing 1
CREATE TABLE temp AS
SELECT (quantity.quantity_sold * item.list_price) line_total
FROM quantity, item, customer, city, state
WHERE
quantity.item_nbr = item.item_nbr /* join ITEM and QUANTITY */
AND
item.cust_nbr = customer.cust_nbr /* join ITEM and CUSTOMER */
AND
customer.city_name = city.city_name /* join CUSTOMER and CITY */
AND
city.state_name = state.state_name /* join CITY and STATE */
AND
state.region_name = 'WEST';

SELECT SUM(line_total) FROM temp;
In the real world, of course, you would introduce enough redundancy to
eliminate the city, state, and region tables. Regardless, the point is clear: A
manager who wants to analyze a series of complete order totals would require a
huge amount of runtime computation. Here you arrive at the basic tradeoff: If
you want true freedom from redundant data, you must pay the price at query
time.
Remember, the rules of database design have changed. Ten years ago,
normalization theory emphasized the need to control redundancy and touted the
benefits of a structure that was free of redundant data. Today, with disk prices
at an all-time low, the attitude toward redundancy has changed radically. The
relational vendors are offering a plethora of tools to allow snapshots and other
methods for replicating data, and other vendors such as UniSQL are offering
database products that allow for non-first normal form implementations. Today,
it is perfectly acceptable to create first normal form implementations of
normalized databases, thereby pre-joining the tables and avoiding the
high performance cost of runtime SQL joins. Traditionally, databases were
normalized to reduce redundant information. In a 3NF representation, information
only exists once, making it very simple to locate and change one piece of data.
Although denormalization will simplify queries, the introduction of redundancy
can make certain update operations cumbersome.
The basic principle behind the star query schema is the introduction of
highly redundant data for performance reasons. Let's evolve the 3NF database
into a star schema by creating a fact table to hold the quantity for each item
sold. Essentially, a fact table is a first normal form representation of the
database, with a very high degree of redundancy added into the tables. (See
Figure 4.)
Figure 4
It is hard to believe that Figure 4 contains the same data as the fully
normalized database. The new fact table will contain one row for each item on
each order, so there is a tremendous amount of redundant key information in the
fact table. Of course, the star query schema will require far more disk space
than the 3NF database from which it was created, and the star schema will most
likely be a read-only database because of the widespread redundancy that has
been introduced into the model. Also, the widespread redundancy will make
updating difficult, if not impossible. Also note the "dimension" tables
surrounding the fact table. Some of the dimension tables contain data that can
be added to queries with joins, while other dimensions such as region do not
contain any data.
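As a minimal sketch, the fact table in Figure 4 might be declared something
like this (the column list is an assumption, based on the columns used in the
queries and listings that follow):

CREATE TABLE fact (
   order_year       NUMBER(4),
   order_quarter    NUMBER(1),
   order_month      NUMBER(2),
   order_nbr        NUMBER(10),
   salesperson_name VARCHAR2(40),
   customer_name    VARCHAR2(40),
   customer_city    VARCHAR2(30),
   customer_state   VARCHAR2(20),
   region           VARCHAR2(10),
   item_nbr         NUMBER(10),
   quantity_sold    NUMBER(6),
   total_cost       NUMBER(12,2)   /* price * quantity_sold, precomputed */
);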
So, with the huge disk-space consumption and read-only restriction, what does
this star schema buy for you? Now that you have a star schema, you can formulate
SQL queries to quickly get the information that you desire. For example, by
performing some of the work up front, getting the total cost for an order is now
simple:

select sum(total_cost) order_total
from fact
where
fact.order_nbr = 123;
Consider what would happen if the user of this schema wanted to analyze
information by aggregate values. Assume your manager wants to know the breakdown
of sales by region. The data is not organized by region, but the fact table can
be queried easily to get the answer. Now getting the sum of all orders for the
western region becomes trivial:

select sum(total_cost)
from fact
where
region = 'west';
In addition to simplifying the query structure, all of the table joining has
been eliminated, and you can easily get the extracted information from your star
schema. A value such as region would be an ideal candidate for the use of Oracle
7.3 bitmapped indexes. Columns that have fewer than three distinct values can see
dramatic performance improvements by utilizing the bitmapped index
technique.
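For instance, the Oracle 7.3 syntax for such an index on the fact table
sketched earlier might look like this (the index name is illustrative):

CREATE BITMAP INDEX fact_region_bm
ON fact (region);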
The natural consequence of this approach is that many IS shops will keep two
copies of their production databases: one in 3NF for online transaction
processing and another denormalized version of the database for decision-support
and data warehouse applications.
Using Oracle's Distributed SQL to Populate Star Schemas
It should be self-evident at this point that it is desirable in some cases to
have several copies of the same database: a 3NF database for online transaction
processing and a denormalized star schema for use by data warehouse and OLAP
applications. The problem with this dual approach arises when attempting to keep
the star schema in sync with the operational database. Fortunately, Oracle
provides several mechanisms to assist in this synchronization. Because it is
safe to assume that the star schema will be used by executives for long-range
trend analysis, it is probably not imperative that the star schema be completely
up-to-date with the operational database, and you can, therefore, develop an
asynchronous method for updating the star schema.
If you make this assumption, then you can use a single SQL statement to
extract the data from the operational database and populate the new rows into
the star schema. (See Listing 2.) In Listing 2, assume that the star schema
resides at our corporate headquarters in London, and that the table is called
the fact_table. (Snapshots can also be used to get the same results.)

Listing 2
INSERT INTO fact_table@london
SELECT
order_year,
order_quarter,
order_month,
order_nbr,
salesperson_name,
customer_name,
customer_city,
customer_state,
customer_region,
item_nbr,
quantity_sold,
price*quantity_sold
FROM quantity, item, customer, city, state
WHERE
quantity.item_nbr = item.item_nbr /* join ITEM and QUANTITY */
AND
item.cust_nbr = customer.cust_nbr /* join ITEM and CUSTOMER */
AND
customer.city_name = city.city_name /* join CUSTOMER and CITY */
AND
city.state_name = state.state_name /* join CITY and STATE */
AND
TRUNC(order_date) = TRUNC(SYSDATE) /* get only today's transactions */
;
Listing 2 shows a very simple method for achieving the extraction,
normalization, and insertion of the operational data into the star schema. Note
that you can even handle computed values such as total_cost, which is price
times quantity_sold. By specifying the sysdate in the where clause, you ensure
that only the day's transactions are extracted and loaded into the star schema
fact_table. Of course, you are still undertaking a very large five-way table
join, but you would hope to run this extraction during off-hours when the
retrieval would not impact the production users.
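The fact_table@london reference in Listing 2 assumes that a database link named
london is already in place. A minimal sketch of creating one (the account and
connect string are hypothetical) might be:

CREATE DATABASE LINK london
CONNECT TO warehouse_owner IDENTIFIED BY warehouse_password
USING 'london_db';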
But what about rows that were deleted? While uncommon, you still need to
account for the possibility that some orders may be canceled, and you need a
mechanism for updating the star schema to reflect such deletions. The most
obvious method for removing deleted orders from the star schema is to create a
delete trigger on the order table of the operational system. This delete trigger
will fire off a remote delete from the trigger to delete all invalid rows from
the star schema:

CREATE TRIGGER delete_orders
AFTER DELETE ON order
FOR EACH ROW
BEGIN
   DELETE FROM fact_table@london
   WHERE order_nbr = :old.order_nbr;
END;
You now have a mechanism for keeping your data warehouse in relative
synchronization with the operational database. Note that in order for this to
work, the remote database must be available and the network must be up. In very
large distributed networks with hundreds of databases at remote locations, it is
not uncommon for one of the lines to experience sporadic failures. Many
sophisticated shops will test the return code from these operations and
automatically retry every five minutes until the connection is successful.
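A minimal PL/SQL sketch of such a retry loop, assuming access to the DBMS_LOCK
package for the five-minute wait and a hypothetical order number to remove:

DECLARE
   v_order_nbr NUMBER  := 123;       -- hypothetical canceled order
   v_done      BOOLEAN := FALSE;
BEGIN
   WHILE NOT v_done LOOP
      BEGIN
         DELETE FROM fact_table@london
         WHERE  order_nbr = v_order_nbr;
         COMMIT;
         v_done := TRUE;             -- the remote delete succeeded
      EXCEPTION
         WHEN OTHERS THEN            -- remote database or network unavailable
            ROLLBACK;
            DBMS_LOCK.SLEEP(300);    -- wait five minutes, then try again
      END;
   END LOOP;
END;
/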
What are you going to do as the fact_table expands beyond its initial table
capacity? Assume that your organization processes 20,000 orders per day, leading
to 7.3 million rows per year. With Oracle indexes, a table this large can create
unique performance problems, primarily because the index must span many levels
to index 7.3 million rows properly. Whereas a typical query might involve three
index reads, a query against a multimillion row table might involve five index
reads before the target row is fetched.
To alleviate this problem, many designers will partition the table into
smaller sub-tables, using the date as the distinguishing factor. As such, you
may have a table for each month, with names such as fact_table_1_96,
fact_table_2_96, and so on. Whenever you need to address multiple tables in a
single operation, you can use the SQL union all verb to merge the tables
together:

SELECT * FROM fact_table_1_96
UNION ALL
SELECT * FROM fact_table_2_96
UNION ALL
SELECT * FROM fact_table_3_96
ORDER BY order_year, order_month;
In addition to having the benefit of smaller table indexes, this type of
table partitioning combined with the union all verb has the added benefit of
allowing Oracle's parallel query engine to perform full-table scans
simultaneously on each of the sub-tables. In this case, a separate process is
invoked to process each of the three table scans. An Oracle query coordinator
would then gather the result data and sort it according to the order by clause.
In the preceding example, you could expect a 50 percent performance improvement
over a query against a single fact_table.
Aggregation, Roll-Ups, and Star Schemas
You have now defined and populated a star schema that contains the total
sales for each order for each day. While it is now easy to see the total for
each order, rarely do the users of a DSS require this level of detail. Most
managers would be more interested in knowing the sum of sales or units sold,
aggregated by month, quarter, region, and so on. Even with a star schema, these
types of aggregations are difficult to compute at runtime while still providing
the end user with acceptable response time. Essentially, you can either
aggregate at runtime or pre-aggregate the data offline, making the totals
available without runtime computation.
One simple alternative to runtime aggregation is to write SQL to
pre-aggregate the data according to the dimensions that the end user may want to
see. In the example, assume that management would want to aggregate monthly
sales by region, state, item type, and salesperson. Because you have four
possible dimensions, you can generate a list of the following six aggregate
tables to pre-create (assume that all of the tables would have a month_year
field as their primary key):
- Region by State - this table would have region_name, state_name,
and monthly_sales as columns.
- Region by item_type - this table would have region_name, item_type,
and monthly_sales as columns.
- Region by Salesperson - this table would have region_name,
salesperson_name, and monthly_sales as columns.
- State by item_type - this table would have state_name, item_type,
and monthly_sales as columns.
- State by Salesperson - this table would have state_name,
salesperson_name, and monthly_sales as columns.
- item_type by Salesperson - this table would have item_type,
salesperson_name, and monthly_sales as columns.
The SQL to produce these tables can easily be run as a batch task at
end-of-month processing. For example, the SQL to create the region_item_type
table might look like:

INSERT INTO region_item_type
SELECT '3', '1996', region_name,
item_type, SUM(total_cost)
FROM FACT_TABLE_3_96
GROUP BY region_name, item_type
;
The sample region_item_type table after formatting might look like Table
1.
Table 1
DATE   REGION   TYPE       MONTHLY_SALES
3/96   WEST     Clothes    $113,999
3/96   WEST     Hardware   $56,335
3/96   WEST     Food       $23,574
3/96   EAST     Clothes    $45,234
3/96   EAST     Hardware   $66,182
3/96   EAST     Food       $835,342
3/96   SOUTH    Clothes    $1,223
3/96   SOUTH    Hardware   $56,392
3/96   SOUTH    Food       $9,281
3/96   NORTH    Clothes    $826,463
3/96   NORTH    Hardware   $77,261
3/96   NORTH    Food       $43,383
These aggregate tables can be built in the middle of the night, right after
the master fact tables have been populated with the day's sales. The next
morning, the prior day's sales will have been rolled up into these summaries,
and management has an accurate, fast, and easy-to-use tool for decision
support.
Of course, these tables are two-dimensional, but they can easily be massaged
by an application to provide a tabular representation of the variables.
Following is a tabular form of the data in Table 1:

REGION vs TYPE
        Clothes     Food        Hardware
WEST    $113,999    $23,574     $56,335
EAST    $45,234     $835,342    $66,182
NORTH   $826,463    $43,383     $77,261
SOUTH   $1,223      $9,281      $56,392
This technique replicates the functionality of an MDDB, whereby an end user
can specify the axis of interest and the MDDB will build a tabular
representation of the data.
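One way an application might produce such a tabular form directly from the
region_item_type aggregate table is the classic DECODE pivot (a sketch; the
column names follow the aggregate table described earlier, and in practice a
predicate on the month would be added if the table holds more than one month):

SELECT region_name,
       SUM(DECODE(item_type, 'Clothes',  monthly_sales, 0)) clothes,
       SUM(DECODE(item_type, 'Food',     monthly_sales, 0)) food,
       SUM(DECODE(item_type, 'Hardware', monthly_sales, 0)) hardware
FROM   region_item_type
GROUP  BY region_name;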
But what if managers want to look at quarterly summaries instead of monthly
summaries? Or what about yearly summaries? Of course, this same technique can be
used to roll up the monthly summary tables into quarterly summaries, yearly
summaries, and so on, according to the demands of the end user.
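A sketch of such a roll-up, assuming hypothetical sale_month and sale_year
columns on the monthly aggregate table and a quarterly table to receive the
result:

-- Roll the monthly aggregates for the first quarter into a quarterly table
INSERT INTO region_item_type_qtr
SELECT '1Q', sale_year, region_name, item_type, SUM(monthly_sales)
FROM   region_item_type
WHERE  sale_month IN ('1', '2', '3')
GROUP  BY sale_year, region_name, item_type;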
The History of OLAP
Dr. E. F. Codd first used the term OLAP in a 1993 white paper sponsored by
Arbor Software. In this same paper, Codd also created 12 rules for OLAP. (See
the "OLAP Bibliography" for a summary of OLAP white papers and magazine
articles, including Dr. Codd's white paper.) Despite Codd's claims of new
technology, some offerings such as IRI Express, now called Oracle Express, date
to the early 1970s. (For more information on OLAP and Oracle Express Objects,
please see Dan Bulos' article, "OLAP Comes of Age: Oracle Express Objects.")
There is a popular Usenet newsgroup, comp.databases.olap, that discusses OLAP
issues.
Simulation of Cubic Databases (Dimensionality)
Assume that the denormalized sample customer table shown in Table 2 is
physically stored in data order. You can now imagine how this data might look as
a cubic table. (See Figure 5.) Of course, cubic representation requires that the
data be loaded into an MDDB or a spreadsheet that supports pivot tables. When
considering an MDDB, there are two arguments. Relational database vendors point
out that because MDDBs are proprietary and not open, the more open relational
databases should be used. MDDB vendors, meanwhile, point out some serious
inadequacies with SQL that make using a relational database very difficult.
Table 2
CUSTOMER-NAME            # Sales   YY-MM   City       State
Smith & Jones            300       91-01   NY         NY
Dewey Cheetham & Howe    400       91-01   San Fran   CA
Smith & Jones            120       91-02   NY         NY
Dewey Cheetham & Howe    300       91-02   San Fran   CA
Smith & Jones            145       91-03   NY         NY
Dewey Cheetham & Howe    334       91-03   San Fran   CA
Figure 5
Also, dimensions may be hierarchical in nature, and this may add further
confusion. A time dimension, for example, may be represented as a hierarchy with
Year, Quarter, Month, and Day. Each of these "levels" in the dimension hierarchy
may have its own values. In other words, a cubic representation with Time as a
dimension may be viewed in two ways:
1. A series of cubes: one for Year, another for Quarter, and another with full_date.
2. A five-dimension table.
MDDBs are most commonly used with data that is a natural fit for pivot
tables. It should come as no surprise that most MDDB sites are used with finance
and marketing applications. Unfortunately, most MDDBs do not scale up well for
warehouse applications. For example, the largest supported database for Arbor
Software's Essbase is about 20GB, whereas data warehouses so large that they are
measured in terabytes are not uncommon.
Note that defining the aggregation of an MDDB is no different from defining
aggregate tables to a relational database. At load time, the database must still
compute the aggregate values.
MDDBs also employ the concept of sparse data. Because data is aggregated and
presliced, there may be cells on a cube with no data. For example, consider a
cube that tracks sales of items across a large company. The cells representing
sales of thermal underwear are null for Hawaii, while the sales of surfboards in
Wyoming are also null. Almost all of the products offered maintain a mechanism
for compressing out these types of null values.
Oracle's Alternatives to Cubic Data Representation
You can use many traditional database designs to simulate a data cube. One
alternative to the cubic representation is to leave the table in linear form and
use SQL to join the table against itself to produce a result. (See Figure 6.)
Following are queries that might require a self-join of a table:

Figure 6

1. Show all customers in Hawaii who purchased our product more than 500 times.
2. Show all customers in Louisiana who purchase less than 10 times/month.
3. Show all large customers (buying more than 100 items per month) in Alaska
whose usage has dropped more than 10 percent in 1990.
4. Show all customers in New York whose usage in March 1990 deviated more than
20 percent from their usage in March 1991.
5. Show all customers in California whose company name contains "Widget" and
whose usage has dropped more than 20 percent in 1991.
In the example in Listing 3, compare all user sites in which usage of your
product has experienced a negative usage variance of greater than five percent
between January 1991 and January 1992. A subset of this data can easily be
extracted such that only California sites with more than 100 usages per month
are displayed. For display, the user chooses percentage variance, number of
requests, site number, zip code, and city. Note the sort order of the report. It
is sorted first by zip, followed by city, and then by percentage variance within
city. Note that the variance analysis is performed directly in the SQL
statement. In this case, California users whose usage has dropped by more
than five percent (comparing January 1991 to January 1992) are displayed.

Listing 3
SELECT TRUNC(((e.number_of_sales - s.number_of_sales) /
s.number_of_sales) * 100) ,
e.customer_name , e.city_name , e.zip , s.number_of_sales ,
e.number_of_sales
FROM detail s , detail e
WHERE
s.customer_name = e.customer_name
AND
e.state_abbr = 'ca'
AND
e.date_yymm = 9101
AND
s.date_yymm = 9201
AND
e.number_of_sales <
s.number_of_sales - (.05 * s.number_of_sales)
ORDER BY e.zip asc , e.city_name asc , 1 ;
But what if the user wants to compare one full year with another year? The
table is structured for simple comparison of two specific month dates, but the
SQL query could be modified slightly to aggregate the data and compare two
ranges of dates.
The query shown in Listing 4 will aggregate all sales for an entire year and
compare 1991 with 1992. Here you answer the question, "Show me all customers in
California whose sales have dropped by more than five percent between 1991 and
1992."

Listing 4
SELECT TRUNC(((e.number_of_sales - s.number_of_sales) /
s.number_of_sales) * 100) ,
e.customer_name , e.city_name , e.zip , s.number_of_sales ,
e.number_of_sales
FROM detail s , detail e
WHERE
s.customer_name = e.customer_name
AND
e.state_abbr = 'ca'
AND
substr(e.date_yymm,1,2) = '91'
AND
substr(s.date_yymm,1,2) = '92'
AND
e.number_of_sales <
s.number_of_sales - (.05 * s.number_of_sales)
ORDER BY e.zip asc , e.city_name asc , 1 ;
On the surface, it appears that SQL can be used against two-dimensional
tables to handle three-dimensional time-series problems. It also appears that
you can use SQL to roll up aggregations at runtime, alleviating the need to
perform a roll-up at load time, as with a traditional database. While this
implementation does not require any special MDDBs, there are two important
issues to resolve:
1. Joining a table against itself, especially when comparing ranges of dates,
may create several levels of nesting in the SQL optimization, resulting in poor
response time and performance.
2. End users would be incapable of formulating this type of sophisticated SQL
query.
If you strip away all of the marketing hype and industry jargon, you will see
that data warehouses and MDDBs can be easily simulated by pre-creating many
redundant tables, each with precalculated roll-up information. In fact, the base
issue is clear: Complex aggregation will need to be computed at runtime or when
the data is loaded.
Data Mining and Relational OLAP
The recent interest in data warehousing has created many new techniques and
tools for obtaining useful information from these behemoth databases. Data
mining is one area that holds a great deal of promise for users of data
warehouses.
In traditional DSSs, users formulated queries against the database and
deciphered any trends that were present in the data. Unfortunately, this
approach is only as good as the user of the system, and many statistically valid
associations between data items can be missed. This is especially true in data
warehouse systems in which unobtrusive trends are present. For example, at the
psychology department at the University of Minnesota, the developers of the
hugely popular Minnesota Multiphasic Personality Inventory (MMPI) have found
some startling patterns regarding how people with different psychological
diagnoses respond to seemingly ordinary questions. These patterns provide
unobtrusive measures of human personality. For example, the MMPI developers
found that people with low self-esteem tend to prefer baths to showers, and
while there is not a "reason" for this preference, there remains a statistically
valid correlation between self-concept and bathing preferences.
In the business world, there are similar types of unobtrusive trends. It is
the goal of data mining software to identify these trends for the users of the
warehouse. In addition to identifying trends, some data mining software analyzes
other data and determines the reasons for the identified trends. Although basic
statistical tools are adequate for performing correlations between a small
number of related variables, large databases with hundreds of data items are
quickly bogged down in a mire of multivariate "chi-square" techniques that are
hard to follow for even the most experienced statistician. As such, the new data
mining tools are meant to accept only general hints from the users, and then go
forth on their own into the data, probing for trends.
In other cases, data mining techniques are used to prove a hypothesis based
on existing data. For example, a marketing specialist may speculate that
customers with an income between $50,000 and $80,000 are likely to buy a certain
product. A quick verification of this hypothesis can be run, thereby either
confirming or disproving the hypothesis.
Yet another class of tools uses a relatively straightforward
exception-detection mechanism to cruise the database looking for "unexpected"
trends or unusual patterns. Many of the data mining tools use techniques
borrowed from artificial intelligence, including fuzzy logic, neural networks,
fractals, and various other statistical techniques.
Because many of these tools perform a huge amount of internal processing,
many of them read selected information from the relational database into a
proprietary, internal data representation for analysis. There are no widely used
data mining tools that run directly against the relational database. Although
there is still a great deal of interest in data mining applications, no single
vendor has stepped up to claim market leadership. It will probably be many years
before all owners of a data warehouse have tools that will be able to exploit
their data resources fully.
Using Oracle 7.3 Features with Data Warehouses
The new features of Oracle 7.2 and Oracle 7.3 will not be activated unless
the following server parameter is set:

COMPATIBLE = 7.3.0.0.0
With Oracle 7.3, there are several new features that can dramatically improve
the performance of Oracle data warehouses and DSSs.
Oracle Parallel Query for Data Warehouses. It is a common
misconception that it is necessary to have parallel processors (SMP or MPP) in
order to use and benefit from parallel processing. Even on the same processor,
multiple processes can be used to speed up queries. The most powerful approach
deals with the use of the SQL union verb in very large databases (VLDBs). In
most very large Oracle data warehouses, it is common to partition a single table
into many smaller tables in order to improve query throughput. For example, a
sales table that is ordered by date_of_sale may be partitioned into sales_1997,
sales_1998, and sales_1999 tables. This approach is very common with data
warehouse applications in which a single logical table might have millions of
rows. Although this "splitting" of a table according to a key value violates
normalization, it can dramatically improve performance for individual queries.
For large queries that may span many logical tables, the isolated tables can be
easily reassembled using Oracle's parallel query facility:

CREATE VIEW all_sales AS
SELECT * FROM sales_1997
UNION ALL
SELECT * FROM sales_1998
UNION ALL
SELECT * FROM sales_1999;
You can now query the all_sales view as if it were a single database table,
and Oracle's parallel query will automatically recognize the union all parameter
and fire off simultaneous queries against each of the three base tables. For
example, the following query will assemble the requested data from the three
tables in parallel, with each query being separately optimized. The result set
from each subquery is then merged by the query coordinator:

SELECT customer_name
FROM all_sales
WHERE
sales_amount > 5000;
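If more control is wanted, the degree of parallelism can also be suggested
explicitly. A sketch, using the partitioned sales tables from above (the degree
of 4 is an arbitrary choice):

-- Suggest a default degree of parallelism for each partitioned table
ALTER TABLE sales_1997 PARALLEL (DEGREE 4);
ALTER TABLE sales_1998 PARALLEL (DEGREE 4);
ALTER TABLE sales_1999 PARALLEL (DEGREE 4);

-- Or request it for a single query with hints against a base table
SELECT /*+ FULL(s) PARALLEL(s, 4) */ customer_name
FROM   sales_1997 s
WHERE  sales_amount > 5000;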
Star Query Hints and Star Joins with Oracle 7.3. As you have already
seen, the star schema design involves creating a main fact table that contains
all of the primary keys in the related tables. This massive denormalization of
the database structure means that just about any query against the star schema
will involve the joining of many large tables, including a large "fact" table
and many smaller reference tables. Oracle has provided a new feature with
release 7.3 that detects a star query join and invokes a special procedure to
improve performance of the query. Prior to release 7.3, this feature only worked
with up to five tables, but this restriction has been eliminated. Also, release
7.3 no longer requires the use of star query hints, but they are still allowed
in the SQL syntax and are generally a good idea for documentation purposes. The
star query requires that a single concatenated index exist on all of the keys in
the fact table.
Oracle follows a simple procedure for addressing these queries. Oracle will
first service the queries against the smaller reference tables and combine the
result set into a Cartesian product table in memory. Once all of the
reference tables have been addressed, Oracle will perform a nested-loop join of
the intermediate table against the fact table. This approach is much faster than
the traditional method of joining the smallest reference table against the fact
table and then joining each of the other reference tables against the
intermediate table.
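A hedged sketch of both requirements, using the fact and dimension tables from
the earlier star schema (the index name, the column list, and the dimension
columns are illustrative):

-- One concatenated index across the fact table's key columns
CREATE INDEX fact_keys_idx
ON fact (order_nbr, item_nbr, customer_name, region);

-- The STAR hint documents the intent; release 7.3 can detect the join itself
SELECT /*+ STAR */ SUM(f.total_cost)
FROM   fact f, item i, customer c
WHERE  f.item_nbr       = i.item_nbr
AND    f.customer_name  = c.customer_name
AND    i.item_type      = 'Clothes'
AND    c.customer_state = 'CA';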
Using Oracle's Bitmap Indexes. For data warehouse applications that
must attain maximum performance, there are some special situations in which
Oracle bitmapped indexes may be useful. As you know, with very large tables, the
sheer volume of rows can make even a trivial query run for a long time. Oracle
has introduced bitmapped indexes with release 7.3 in an attempt to improve index
lookup performance for queries, especially decision-support queries that may
have many conditions in the where clause. Histograms can also improve the
performance of certain queries. (Please see Steve Bobrowski's article "Tuning
Application Logic" for a discussion of histograms.)
The bitmap approach to indexing is very different from the traditional B-tree
style of indexes. In a traditional index, index keys are sorted and carried in
several tree nodes. In a bitmapped index, an array is created. This array has
all possible index values as one axis, while the other axis contains all rows in
the base table. For example, consider a bitmapped index on the region field of
the sales table, in which the regions are North, South, East, and West. If the
sales table contains 1,000,000 rows, then the bitmapped index would create an
array of 4 × 1,000,000 entries to store the possible key values. Within this
array, the index data is binary. If a value is true, it is assigned a binary "1,"
whereas a false reading is set to binary "0." (See Figure 7.)
Figure 7
In Figure 7, you can see how this query runs faster than a traditional query.
The Oracle optimizer will notice that the items in the where clause have
bitmapped indexes and will simply scan for non-zero values in the proper array
column, quickly returning the row IDs of the matching rows. A fast merge of the result
set will then quickly identify the rows that meet the query criteria. Although
it may appear that the index is very large, Oracle has developed a compression
method whereby all of the binary zeros are omitted from the bitmap, making it
very compact.
This may look reasonable at first glance. However, there are some drawbacks
to bitmapped indexing. The first and most obvious is that bitmapped indexes are
best suited for columns that have only small numbers of possible values. For columns
that have many values such as state_name or city_name, the index overhead would
probably exceed any performance gains. However, for columns such as sex, color,
and size that have a small number of finite values, bitmapped indexes will
greatly improve query retrieval speeds. Bitmapped indexes are especially useful
for DSSs in which many conditions are combined in a single where clause.
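As a sketch of that situation, two low-cardinality columns on a hypothetical
product table might each get a bitmapped index, and a DSS query can then
combine them in one where clause:

CREATE BITMAP INDEX product_color_bm ON product (color);
CREATE BITMAP INDEX product_size_bm  ON product (product_size);

-- Both predicates can be resolved by merging bitmaps before the table is touched
SELECT COUNT(*)
FROM   product
WHERE  color        = 'RED'
AND    product_size = 'LARGE';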
Using Oracle 7.3 Hash Joins. Oracle 7.3 also provides another method
for speeding up decision-support and warehouse queries. This method is called
the hash join. A hash join is a technique by which Oracle bypasses the
traditional sort-merge join technique and replaces it with an algorithm that
performs a full table scan, placing the rows into memory partitions. Partition
pairs that do not fit into memory are placed in the temp tablespace. Oracle then
builds a hash table on the smallest partition and uses the larger partition to
probe the newly created hash table. This technique alleviates the need for
in-memory sorting and does not require that indexes exist on the target
tables.
The following four server parameters must be set to use hash joins:
- optimizer_mode must be set so that the cost-based optimizer is invoked
- hash_join_enabled = true
- hash_multiblock_io_count = number of blocks read per I/O
- hash_area_size = memory size, in bytes, for the hash join
To execute a hash join, the hash join hint must be used. Following is an
example:

SELECT /*+ USE_HASH(customer) */ *
FROM customer, order
WHERE
customer.cust_no = order.cust_no
AND
credit_rating = 'GOOD';
GOING WAREHOUSE
Oracle is now offering a considerable number of tricks that can help you use
Oracle's proven relational technology to implement fast and robust warehouse
systems. Prior to Oracle 7.3, a user who wanted to create a data warehouse was
forced to deal with a vendor that did not have the same market presence as
Oracle, and there were concerns as to whether these smaller data warehouse
companies would survive. Oracle developers have long awaited these new
performance features for their data warehouse operations. High-speed
decision-support applications using Oracle's reliable relational database are
now possible.
During the 1980s, Don Burleson established himself as an authority on database
system architectures while serving on the faculty of several major universities.
He is currently a senior DBA for a Fortune 50 company. He is also the author of
Managing Distributed Databases (Wiley & Sons, 1995) and the
forthcoming Oracle Performance and Tuning (Que, due in September 1996).
You can email Don at 76443.2751@compuserve.com.