Developing Effective Oracle Data Warehouse and OLAP Applications
OTJ, Summer 1996
Donald Burleson

 

A look at decision-support systems, star schemas, data mining, relational OLAP, and how to use Oracle 7.3 features with data warehouses.


As more companies begin to embrace the concept of a historical data repository for online analytical processing (OLAP) and decision-support systems (DSSs), new client/server issues are emerging as developers struggle to create Oracle client/server applications that will perform acceptably. There's been a great deal of interest in the application of data warehousing and multidimensional databases (MDDBs) to advanced systems such as expert systems and DSSs. These systems have been used to solve semi-structured and even unstructured problems. Traditionally, these types of systems combined "inference engines" and relational databases to store the knowledge processing components. Unfortunately, little work has been done with the application of warehouse databases for DSSs and expert systems.

Decision-Support Systems and Data Warehouses

DSSs are generally defined as systems that deal with semi-structured problems - that is, problems with a structured component as well as a human intuition component. The well-structured components are the decision rules, which are stored as the problem-processing system, while the intuitive or creative component is left to the user.

DSSs also let users create "what if" scenarios. Although Oracle claims that DSSs do not allow the creation of what-if scenarios, the traditional definition of decision support allows users to "model" a scenario and then vary the inputs on that scenario to test their hypothesis. (See Foundations of Decision Support Systems by Holsapple and Whinston [Academic Press of New York, 1981].) These are essentially modeling tools that let users define an environment and then simulate the behavior of that environment under changing conditions. For example, users of a DSS for finance could create a hypothetical stock portfolio and then direct the DSS to model the behavior of that stock portfolio under different market conditions. Once these behaviors are specified, users may vary the contents of the portfolio and view the results. There has been much debate about the role of intuition in this type of problem solving. DSSs let users control the decision-making process by applying their own decision-making rules and intuition. However, there is also much debate as to whether artificial intelligence can be used to manage the intuitive component of these systems.

There is another approach to the simulation of multiple dimensions: design techniques used with Oracle 7.3. This push toward star schema design has been somewhat successful, especially because the designers do not have to buy an MDDB or invest in an expensive front-end tool. (See Figure 1.) Many vendors now have tools to make a relational database appear to be a multidimensional architecture.

Figure 1

In general, using a relational database for OLAP is achieved by any combination of the following techniques:

1. Pre-joining tables together. This is an obtuse way of saying that a denormalized table is created from the normalized online database. A large pre-join of several tables may be called a fact table in a star schema.
2. Pre-summarization. This is performed to prepare the data for any drill-down requests that may come from end users. Essentially, the different levels of aggregation are identified and the aggregate tables are computed and populated when the data is loaded.
3. Massive denormalization. One side effect of very inexpensive disk storage has been a rethinking of the merits of third normal form (3NF). Today, redundancy is widely accepted, as seen in the popularity of replication tools, snapshot utilities, and non-first normal form databases. If you can pre-create every possible result table at load time, then your end users will get excellent response times when performing queries. The star schema is an example of massive denormalization.
4. Controlled periodic batch updating. New detail data is rolled into the aggregate table on a periodic basis, with all summarization recalculated as the new data is introduced into the database.

Data Aggregation and Drill-Down

One of the most fundamental principles of an MDDB is the idea of aggregation. As you know, managers at different levels require different levels of summarization to make intelligent decisions. To allow each manager to choose the level of aggregation, most warehouse offerings have a "drill-down" feature that allows the user to adjust the level of detail, eventually reaching the original transaction data. For obvious performance reasons, the aggregations are precalculated and loaded into the warehouse during off-hours. (See Figure 2.)

Figure 2

There are several types of aggregation. The most common type is called a roll-up aggregation. An example of this type of aggregation would be taking daily sales totals and rolling them up into a monthly sales table. This type of aggregation is relatively easy to compute and run. The more difficult type of aggregation is the aggregation of Boolean and comparative operators. For example, assume that a salesperson table contains a Boolean column called "turkey." A salesperson is a turkey if his or her individual sales are below the group average for the day. A salesperson may be a turkey on only 15 percent of the individual days, but when the data is rolled up into a monthly summary, his or her average may drop below the group average, making the salesperson a turkey even though he or she had only a few, albeit very bad, sales days.
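
As a sketch of the easy case, a daily-to-monthly roll-up needs nothing more than a GROUP BY; the daily_sales and monthly_sales tables and their columns here are hypothetical:

INSERT INTO monthly_sales (sale_month, salesperson_name, sales_amount)
SELECT
    TO_CHAR(sale_date, 'YYYY-MM'),   /* collapse each day into its month */
    salesperson_name,
    SUM(sales_amount)                /* monthly total per salesperson */
FROM daily_sales
GROUP BY TO_CHAR(sale_date, 'YYYY-MM'), salesperson_name;

The "turkey" flag is the harder case precisely because it cannot simply be summed; it must be recomputed against the group average at each level of aggregation.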

The base rule is simple: if the data would fit well into a spreadsheet, then it is probably well suited to an MDDB, or at least an MDDB representation.

Oracle's Relational Answer to MDDB

Dr. Ralph Kimball, founder of Red Brick Systems, popularized the term "star" schema to describe a denormalization process that simulates the structure of an MDDB. With a star schema, designers can simulate the functions of an MDDB without purchasing expensive third-party software. Kimball describes denormalization as the pre-joining of tables, such that the runtime application does not have to join tables. At the heart of the star schema is a "fact" table. The fact table is usually composed entirely of key values and raw data. A fact table is generally very long and may have millions of rows. Surrounding the fact table is a series of "dimension" tables that add value to the base information in the fact table.

Figure 3

For example, consider the ER model for a sales database shown in Figure 3, which uses a standard 3NF database to represent the sales of items. Because there is no redundant information, you must compute salient data such as the total for an order from the atomic items that make up the order. In this 3NF database, you must create a list of line items, multiplying the quantity ordered by the price, for all items that belong in order 123. In the following example, an intermediate table called temp is created to hold the result list:

create table temp as
select (quantity.quantity_sold * 
 item.list_price) line_total
from quantity, item
where
quantity.order_nbr = 123
and
quantity.item_nbr = item.item_nbr;

select SUM(line_total) from temp;

Also note that the region-state-city table hierarchy in this example is very deliberate. In order to be truly in 3NF, you do not allow redundant information (except, of course, foreign keys). Because this example has been fully normalized into five tables, a query that would appear very simple to the end user would have relatively complex SQL. For example, the SQL to calculate the sum of all orders in the western region might look very complex and involve a five-way table join. (See Listing 1.)

Listing 1

CREATE TABLE temp AS
SELECT (quantity.quantity_sold * item.list_price) line_total
FROM quantity, item, customer, city, state
WHERE
quantity.item_nbr = item.item_nbr   /* join ITEM and QUANTITY */
AND
item.cust_nbr = customer.cust_nbr   /* join ITEM and CUSTOMER */
AND
customer.city_name = city.city_name /* join CUSTOMER and CITY */
AND
city.state_name = state.state_name  /* join CITY and STATE */
AND
state.region_name = 'WEST';

In the real world, of course, you would introduce enough redundancy to eliminate the city, state, and region tables. Regardless, the point is clear: A manager who wants to analyze a series of complete order totals would require a huge amount of runtime computation. Here you arrive at the basic tradeoff: If you want true freedom from redundant data, you must pay the price at query time.

Remember, the rules of database design have changed. Ten years ago, normalization theory emphasized the need to control redundancy and touted the benefits of a structure that was free of redundant data. Today, with disk prices at an all-time low, the attitude toward redundancy has changed radically. The relational vendors are offering a plethora of tools to allow snapshots and other methods for replicating data, and other vendors such as UniSQL are offering database products that allow for non-first normal form implementations. Today, it is perfectly acceptable to denormalize a database down to a first normal form implementation, thereby pre-joining the tables and avoiding the high performance cost of runtime SQL joins. Traditionally, databases were normalized to reduce redundant information. In a 3NF representation, information exists only once, making it very simple to locate and change one piece of data. Although denormalization will simplify queries, the introduction of redundancy can make certain update operations cumbersome.

The basic principle behind the star query schema is the introduction of highly redundant data for performance reasons. Let's evolve the 3NF database into a star schema by creating a fact table to hold the quantity for each item sold. Essentially, a fact table is a first normal form representation of the database, with a very high degree of redundancy added into the tables. (See Figure 4.)

Figure 4

It is hard to believe that Figure 4 contains the same data as the fully normalized database. The new fact table will contain one row for each item on each order, so there is a tremendous amount of redundant key information in the fact table. Of course, the star schema will require far more disk space than the 3NF database from which it was created, and it will most likely be a read-only database because of the widespread redundancy that has been introduced into the model, which makes updating difficult, if not impossible. Note the "dimension" tables surrounding the fact table. Some of the dimension tables contain data that can be added to queries with joins, while other dimensions, such as region, do not contain any data.
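
As a rough sketch, the fact table of Figure 4 might be declared as follows; the column names are taken from the queries and Listing 2 later in this article, while the datatypes and lengths are assumptions:

CREATE TABLE fact (
    order_year        NUMBER(4),
    order_quarter     NUMBER(1),
    order_month       NUMBER(2),
    order_nbr         NUMBER,
    salesperson_name  VARCHAR2(40),
    customer_name     VARCHAR2(40),
    customer_city     VARCHAR2(30),
    customer_state    VARCHAR2(30),
    region            VARCHAR2(10),
    item_nbr          NUMBER,
    quantity_sold     NUMBER,
    total_cost        NUMBER        /* price * quantity_sold, precomputed at load time */
);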

So, with the huge disk-space consumption and read-only restriction, what does this star schema buy for you? Now that you have a star schema, you can formulate SQL queries to quickly get the information that you desire. For example, by performing some of the work up front, getting the total cost for an order is now simple:

select sum(total_cost) order_total
from fact
where
fact.order_nbr = 123;

Consider what would happen if the user of this schema wanted to analyze information by aggregate values. Assume your manager wants to know the breakdown of sales by region. The data is not organized by region, but the fact table can be queried easily to get the answer. Now getting the sum of all orders for the western region becomes trivial:

select sum(total_cost)
from fact
where
region = 'west';

In addition to simplifying the query structure, all of the table joining has been eliminated, and you can easily get the extracted information from your star schema. A value such as region would be an ideal candidate for the use of Oracle 7.3 bitmapped indexes. Columns with only a small number of distinct values can see dramatic performance improvements from the bitmapped index technique.
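
A minimal sketch of such an index (the index name is arbitrary):

CREATE BITMAP INDEX fact_region_bmx
    ON fact (region);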

The natural consequence of this approach is that many IS shops will keep two copies of their production databases: one in 3NF for online transaction processing and another denormalized version of the database for decision-support and data warehouse applications.

Using Oracle's Distributed SQL to Populate Star Schemas

It should be self-evident at this point that it is desirable in some cases to have several copies of the same database: a 3NF database for online transaction processing and a denormalized star schema for use by data warehouse and OLAP applications. The problem with this dual approach arises when attempting to keep the star schema in sync with the operational database. Fortunately, Oracle provides several mechanisms to assist in this synchronization. Because it is safe to assume that the star schema will be used by executives for long-range trend analysis, it is probably not imperative that the star schema be completely up-to-date with the operational database, and you can, therefore, develop an asynchronous method for updating the star schema.

If you make this assumption, then you can use a single SQL statement to extract the data from the operational database and populate the new rows into the star schema. (See Listing 2.) In Listing 2, assume that the star schema resides at our corporate headquarters in London, and that the table is called the fact_table. (Snapshots can also be used to get the same results.)

Listing 2


INSERT INTO fact_table@london
SELECT
    order_year,
    order_quarter,	
    order_month,
    order_nbr,
    salesperson_name,
    customer_name,
    customer_city,
    customer_state,
    customer_region,
    item_nbr,
    quantity_sold,
    price*quantity_sold
FROM quantity, item, customer, city, state
WHERE
quantity.item_nbr = item.item_nbr   /* join ITEM and QUANTITY */
AND
item.cust_nbr = customer.cust_nbr   /* join ITEM and CUSTOMER */
AND
customer.city_name = city.city_name /* join CUSTOMER and CITY */
AND
city.state_name = state.state_name  /* join CITY and STATE */
AND
TRUNC(order_date) = TRUNC(SYSDATE)  /* get only today's transactions */
;

Listing 2 shows a very simple method for achieving the extraction, denormalization, and insertion of the operational data into the star schema. Note that you can even handle computed values such as total_cost, which is price times quantity_sold. By comparing order_date with the truncated sysdate in the where clause, you ensure that only the current day's transactions are extracted and loaded into the star schema fact_table. Of course, you are still undertaking a very large five-way table join, but you would hope to run this extraction during off-hours when the retrieval would not impact the production users.
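
The snapshot alternative mentioned above would be defined at the warehouse site and would pull the same join across a database link; in this sketch the snapshot name, the daily refresh interval, and the @hq link back to the operational database are all assumptions:

CREATE SNAPSHOT fact_table_snap
    REFRESH COMPLETE
    START WITH SYSDATE
    NEXT SYSDATE + 1                        /* re-extract once a day */
AS
SELECT
    order_year, order_quarter, order_month, order_nbr,
    salesperson_name, customer_name, customer_city,
    customer_state, customer_region, item_nbr,
    quantity_sold, price*quantity_sold total_cost
FROM quantity@hq, item@hq, customer@hq, city@hq, state@hq
WHERE
quantity.item_nbr = item.item_nbr   /* join ITEM and QUANTITY */
AND
item.cust_nbr = customer.cust_nbr   /* join ITEM and CUSTOMER */
AND
customer.city_name = city.city_name /* join CUSTOMER and CITY */
AND
city.state_name = state.state_name  /* join CITY and STATE */
;

A complete refresh is used in this sketch because a multi-table join cannot be fast-refreshed.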

But what about rows that were deleted? Although uncommon, some orders may be canceled, and you need a mechanism for updating the star schema to reflect such deletions. The most obvious method for removing deleted orders from the star schema is to create a delete trigger on the order table of the operational system. This trigger will fire off a remote delete to remove the corresponding rows from the star schema:

CREATE TRIGGER delete_orders
    AFTER DELETE ON order
    FOR EACH ROW
BEGIN
    DELETE FROM fact_table@london
    WHERE order_nbr = :old.order_nbr;
END;

You now have a mechanism for keeping your data warehouse in relative synchronization with the operational database. Note that in order for this to work, the remote database must be available and the network must be up. In very large distributed networks with hundreds of databases at remote locations, it is not uncommon for one of the lines to experience sporadic failures. Many sophisticated shops will test the return code from these operations and automatically retry every five minutes until the connection is successful.
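
A sketch of that retry approach: trap the failure inside the trigger and queue the key locally for a periodic re-try job. Here pending_deletes is a hypothetical local table, and the order entity is assumed to be implemented as a table named orders (ORDER itself is a reserved word):

CREATE OR REPLACE TRIGGER delete_orders
    AFTER DELETE ON orders
    FOR EACH ROW
BEGIN
    DELETE FROM fact_table@london
    WHERE order_nbr = :old.order_nbr;
EXCEPTION
    WHEN OTHERS THEN
        /* link or remote database unavailable - queue the key for retry */
        INSERT INTO pending_deletes (order_nbr, queued_at)
        VALUES (:old.order_nbr, SYSDATE);
END;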

What are you going to do as the fact_table expands beyond its initial table capacity? Assume that your organization processes 20,000 orders per day, leading to 7.3 million rows per year. With Oracle indexes, a table this large can create unique performance problems, primarily because the index must span many levels to index 7.3 million rows properly. Whereas a typical query might involve three index reads, a query against a multimillion-row table might involve five index reads before the target row is fetched.

To alleviate this problem, many designers will partition the table into smaller sub-tables, using the date as the distinguishing factor. As such, you may have a table for each month, with names such as fact_table_1_96, fact_table_2_96, and so on. Whenever you need to address multiple tables in a single operation, you can use the SQL union all verb to merge the tables together:

SELECT * FROM fact_table_1_96
UNION ALL
SELECT * FROM fact_table_2_96
UNION ALL
SELECT * FROM fact_table_3_96
ORDER BY order_year, order_month;

In addition to having the benefit of smaller table indexes, this type of table partitioning combined with the union all verb has the added benefit of allowing Oracle's parallel query engine to perform full-table scans simultaneously on each of the sub-tables. In this case, a separate process is invoked to process each of the three table scans. An Oracle query coordinator would then gather the result data and sort it according to the order by clause. In the preceding example, you could expect a 50 percent performance improvement over a query against a single fact_table.

Aggregation, Roll-Ups, and Star Schemas

You have now defined and populated a star schema that contains the total sales for each order for each day. While it is now easy to see the total for each order, rarely do the users of a DSS require this level of detail. Most managers would be more interested in knowing the sum of sales or units sold, aggregated by month, quarter, region, and so on. Even with a star schema, these types of aggregations are difficult to compute at runtime while still providing the end user with acceptable response time. Essentially, you can either aggregate at runtime or pre-aggregate the data offline, making the totals available without runtime computation.

One simple alternative to runtime aggregation is to write SQL to pre-aggregate the data according to the dimensions that the end user may want to see. In the example, assume that management would want to aggregate monthly sales by region, state, item type, and salesperson. Because you have four possible dimensions, you can generate a list of the following six aggregate tables to pre-create (assume that all of the tables would have a month_year field as their primary key):

 

  • Region by State - this table would have region_name, state_name, and monthly_sales as columns.
  • Region by item_type - this table would have region_name, item_type, and monthly_sales as columns.
  • Region by Salesperson - this table would have region_name, salesperson_name, and monthly_sales as columns.
  • State by item_type - this table would have state_name, item_type, and monthly_sales as columns.
  • State by Salesperson - this table would have state_name, salesperson_name, and monthly_sales as columns.
  • item_type by Salesperson - this table would have item_type, salesperson_name, and monthly_sales as columns.

The SQL to produce these tables can easily be run as a batch task at end-of-month processing. For example, the SQL to populate the region_item_type table for March 1996 might look like:

INSERT INTO region_item_type
SELECT '3/96', region_name,
 item_type, SUM(total_cost)
FROM fact_table_3_96
GROUP BY region_name, item_type
;

The sample region_item_type table after formatting might look like Table 1.

Table 1
DATE REGION TYPE MONTHLY_SALES
3/96 WEST Clothes $113,999
3/96 WEST Hardware $56,335
3/96 WEST Food $23,574
3/96 EAST Clothes $45,234
3/96 EAST Hardware $66,182
3/96 EAST Food $835,342
3/96 SOUTH Clothes $1,223
3/96 SOUTH Hardware $56,392
3/96 SOUTH Food $9,281
3/96 NORTH Clothes $826,463
3/96 NORTH Hardware $77,261
3/96 NORTH Food $43,383

These aggregate tables can be built in the middle of the night, right after the master fact tables have been populated with the day's sales. The next morning, the prior day's sales will have been rolled up into these summaries, and management has an accurate, fast, and easy-to-use tool for decision support.

Of course, these tables are two-dimensional, but they can easily be massaged by an application to provide a tabular representation of the variables. Following is a tabular form of the data in Table 1:

REGION vs TYPE
	Clothes	    Food	Hardware
WEST	$113,999    $23,574	$56,335
EAST	$45,234	    $835,342	$66,182
NORTH	$826,463    $43,383	$77,261
SOUTH	$1,223	    $9,281	$56,392

This technique replicates the functionality of an MDDB, whereby an end user can specify the axis of interest and the MDDB will build a tabular representation of the data.
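
One way to produce that crosstab directly in SQL is the classic DECODE-and-GROUP-BY pivot; a sketch against the region_item_type table above, assuming the month_year key is stored as a string such as '3/96':

SELECT
    region_name,
    SUM(DECODE(item_type, 'Clothes',  monthly_sales, 0)) clothes,
    SUM(DECODE(item_type, 'Food',     monthly_sales, 0)) food,
    SUM(DECODE(item_type, 'Hardware', monthly_sales, 0)) hardware
FROM region_item_type
WHERE month_year = '3/96'            /* one month's slice of the aggregate */
GROUP BY region_name;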

But what if managers want to look at quarterly summaries instead of monthly summaries? Or what about yearly summaries? Of course, this same technique can be used to roll up the monthly summary tables into quarterly summaries, yearly summaries, and so on, according to the demands of the end user.
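
A sketch of such a quarterly roll-up over the monthly aggregate; the region_item_type_qtr table and the '1Q96' label are illustrative:

INSERT INTO region_item_type_qtr (quarter_year, region_name, item_type, quarterly_sales)
SELECT
    '1Q96',                          /* the quarter being rolled up */
    region_name,
    item_type,
    SUM(monthly_sales)
FROM region_item_type
WHERE month_year IN ('1/96', '2/96', '3/96')
GROUP BY region_name, item_type;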

The History of OLAP

Dr. E. F. Codd first used the term OLAP in a 1993 white paper sponsored by Arbor Software. In this same paper, Codd also created 12 rules for OLAP. (See the "OLAP Bibliography" for a summary of OLAP white papers and magazine articles, including Dr. Codd's white paper.) Despite Codd's claims of new technology, some offerings such as IRI Express, now called Oracle Express, date to the early 1970s. (For more information on OLAP and Oracle Express Objects, please see Dan Bulos' article, "OLAP Comes of Age: Oracle Express Objects.") OLAP issues are also discussed in the popular comp.databases.olap newsgroup on the Internet.

Simulation of Cubic Databases (Dimensionality)

Assume that the denormalized sample customer table shown in Table 2 is physically stored in data order. You can now imagine how this data might look as a cubic table. (See Figure 5.) Of course, cubic representation requires that the data be loaded into an MDDB or a spreadsheet that supports pivot tables. When considering an MDDB, there are two arguments. Relational database vendors point out that because MDDBs are proprietary and not open, the more open relational databases should be used. MDDB vendors, meanwhile, point out some serious inadequacies with SQL that make using a relational database very difficult.

Table 2
CUSTOMER-NAME # Sales YY-MM City State
Smith & Jones 300 91-01 NY NY
Dewey Cheetham & Howe 400 91-01 San Fran CA
Smith & Jones 120 91-02 NY NY
Dewey Cheetham & Howe 300 91-02 San Fran CA
Smith & Jones 145 91-03 NY NY
Dewey Cheetham & Howe 334 91-03 San Fran CA

 
			
Figure 5

Also, dimensions may be hierarchical in nature, and this may add further confusion. A time dimension, for example, may be represented as a hierarchy with Year, Quarter, Month, and Day. Each of these "levels" in the dimension hierarchy may have its own values. In other words, a cubic representation with Time as a dimension may be viewed in two ways:

1. A series of cubes: one for Year, another for Quarter, and another with full_date.
2. A five-dimension table.

MDDBs are most commonly used with data that is a natural fit for pivot tables. It should come as no surprise that most MDDB installations are found in finance and marketing applications. Unfortunately, most MDDBs do not scale up well for warehouse applications. For example, the largest supported database for Arbor Software's Essbase is about 20GB, whereas data warehouses so large that they are measured in terabytes are not uncommon.

Note that defining the aggregation of an MDDB is no different from defining aggregate tables to a relational database. At load time, the database must still compute the aggregate values.

MDDBs also employ the concept of sparse data. Because data is aggregated and presliced, there may be cells on a cube with no data. For example, consider a cube that tracks sales of items across a large company. The cells representing sales of thermal underwear are null for Hawaii, while the sales of surfboards in Wyoming are also null. Almost all of the products offered maintain a mechanism for compressing out these types of null values.

Oracle's Alternatives to Cubic Data Representation

You can use many traditional database designs to simulate a data cube. One alternative to the cubic representation is to leave the table in linear form and use SQL to join the table against itself to produce a result. (See Figure 6.) Following are queries that might require a self-join of a table:

Figure 6

1. Show all customers in Hawaii who purchased our product more than 500 times.
2. Show all customers in Louisiana who purchase less than 10 times/month.
3. Show all large customers (buy more than 100 items per month) in Alaska whose usage has dropped more than 10 percent in 1990.
4. Show all customers in New York whose usage in March 1990 deviated more than 20 percent from their usage in March 1991.
5. Show all customers for which the company name contains "Widget" in California and whose usage has dropped more than 20 percent in 1991.

The example in Listing 3 compares all user sites in which usage of your product has experienced a negative usage variance of greater than five percent between January 1991 and January 1992. A subset of this data can easily be extracted such that only California sites with more than 100 usages per month are displayed. For display, the user chooses percentage variance, number of requests, site number, zip code, and city. Note the sort order of the report. It is sorted first by zip, followed by city, and then by percentage variance within city. Note that the variance analysis is performed directly in the SQL statement. In this case, California users whose usage has dropped by more than five percent (comparing January 1991 to January 1992) are displayed.

Listing 3


SELECT TRUNC(((e.number_of_sales - s.number_of_sales) /
 s.number_of_sales) * 100) ,
e.customer_name , e.city_name , e.zip , s.number_of_sales ,
 e.number_of_sales

FROM detail s , detail e

WHERE
s.customer_name = e.customer_name
 AND
e.state_abbr = 'ca'
 AND
e.date_yymm = 9101
 AND
s.date_yymm = 9201
 AND
e.number_of_sales <
 s.number_of_sales - (.05 * s.number_of_sales)

ORDER BY e.zip asc , e.city_name asc , 1 ;

But what if the user wants to compare one full year with another year? The table is structured for simple comparison of two specific month dates, but the SQL query could be modified slightly to aggregate the data and compare two ranges of dates.

The query shown in Listing 4 will aggregate all sales for an entire year and compare 1991 with 1992. Here you answer the question, "Show me all customers in California whose sales have dropped by more than five percent between 1991 and 1992."

Listing 4 


SELECT TRUNC(((e.number_of_sales - s.number_of_sales) /
 s.number_of_sales) * 100) ,
e.customer_name , e.city_name , e.zip , s.number_of_sales ,
 e.number_of_sales

FROM detail s , detail e

WHERE
s.customer_name = e.customer_name
 AND
e.state_abbr = 'ca'
 AND
substr(e.date_yymm,1,2) = '91'
 AND
substr(s.date_yymm,1,2) = '92'
 AND
e.number_of_sales <
 s.number_of_sales - (.05 * s.number_of_sales)

ORDER BY e.zip asc , e.city_name asc , 1 ;

On the surface, it appears that SQL can be used against two-dimensional tables to handle three-dimensional time-series problems. It also appears that you can use SQL to roll up aggregations at runtime, alleviating the need to perform a roll-up at load time, as with a traditional database. While this implementation does not require any special MDDBs, there are two important issues to resolve:

1. Joining a table against itself, especially when comparing ranges of dates, may create several levels of nesting in the SQL optimization, leading to poor response time and performance.
2. End users would be incapable of formulating this type of sophisticated SQL query.

If you strip away all of the marketing hype and industry jargon, you will see that data warehouses and MDDBs can be easily simulated by pre-creating many redundant tables, each with precalculated roll-up information. In fact, the base issue is clear: Complex aggregation will need to be computed at runtime or when the data is loaded.

Data Mining and Relational OLAP

The recent interest in data warehousing has created many new techniques and tools for obtaining useful information from these behemoth databases. Data mining is one area that holds a great deal of promise for users of data warehouses.

In traditional DSSs, users formulated queries against the database and deciphered any trends that were present in the data. Unfortunately, this approach is only as good as the user of the system, and many statistically valid associations between data items can be missed. This is especially true in data warehouse systems in which unobtrusive trends are present. For example, at the psychology department at the University of Minnesota, the developers of the hugely popular Minnesota Multiphasic Personality Inventory (MMPI) have found some startling patterns regarding how people with different psychological diagnoses respond to seemingly ordinary questions. These patterns provide unobtrusive measures of human personality. For example, the MMPI developers found that people with low self-esteem tend to prefer baths to showers, and while there is not a "reason" for this preference, there remains a statistically valid correlation between self-concept and bathing preferences.

In the business world, there are similar types of unobtrusive trends. It is the goal of data mining software to identify these trends for the users of the warehouse. In addition to identifying trends, some data mining software analyzes other data and determines the reasons for the identified trends. Although basic statistical tools are adequate for performing correlations between a small number of related variables, large databases with hundreds of data items are quickly bogged down in a mire of multivariate "chi-square" techniques that are hard to follow for even the most experienced statistician. As such, the new data mining tools are meant to accept only general hints from the users, and then go forth on their own into the data, probing for trends.

In other cases, data mining techniques are used to prove a hypothesis based on existing data. For example, a marketing specialist may speculate that customers with an income between $50,000 and $80,000 are likely to buy a certain product. A quick verification of this hypothesis can be run, thereby either confirming or disproving the hypothesis.
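
In SQL terms such a verification is usually just a pair of counts; a sketch, where the customer table (with an annual_income column), the purchase table, and the product code are all hypothetical:

/* How many customers in the income band bought the product? */
SELECT COUNT(DISTINCT c.cust_nbr) buyers_in_band
FROM customer c, purchase p
WHERE c.annual_income BETWEEN 50000 AND 80000
  AND p.cust_nbr = c.cust_nbr
  AND p.product_code = 'WIDGET-7';

/* ...out of how many customers in the band altogether? */
SELECT COUNT(*) customers_in_band
FROM customer
WHERE annual_income BETWEEN 50000 AND 80000;

Comparing the resulting buy rate with the same ratio for customers outside the band either supports or contradicts the hypothesis.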

Yet another class of tools uses a relatively straightforward exception-detection mechanism to cruise the database looking for "unexpected" trends or unusual patterns. Many of the data mining tools use techniques borrowed from artificial intelligence, including fuzzy logic, neural networks, fractals, and various other statistical techniques.

Because many of these tools perform a huge amount of internal processing, many of them read selected information from the relational database into a proprietary, internal data representation for analysis. There are no widely used data mining tools that run directly against the relational database. Although there is still a great deal of interest in data mining applications, no single vendor has stepped up to claim market leadership. It will probably be many years before all owners of a data warehouse have tools that will be able to exploit their data resources fully.

Using Oracle 7.3 Features with Data Warehouses

The new features of Oracle 7.2 and Oracle 7.3 will not be activated unless the following server parameter is used:

COMPATIBILITY=7.3.0.0.0

With Oracle 7.3, there are several new features that can dramatically improve the performance of Oracle data warehouses and DSSs.

Oracle Parallel Query for Data Warehouses. It is a common misconception that it is necessary to have parallel processors (SMP or MPP) in order to use and benefit from parallel processing. Even on a single processor, multiple processes can be used to speed up queries. The most powerful approach involves the SQL union all operator in very large databases (VLDBs). In most very large Oracle data warehouses, it is common to partition a single table into many smaller tables in order to improve query throughput. For example, a sales table that is ordered by date_of_sale may be partitioned into sales_1997, sales_1998, and sales_1999 tables. This approach is very common with data warehouse applications in which a single logical table might have millions of rows. Although this "splitting" of a table according to a key value violates normalization, it can dramatically improve performance for individual queries. For large queries that may span many logical tables, the isolated tables can be easily reassembled using Oracle's parallel query facility:

CREATE VIEW all_sales AS
  SELECT * FROM sales_1997
  UNION ALL
  SELECT * FROM sales_1998
  UNION ALL
  SELECT * FROM sales_1999;

You can now query the all_sales view as if it were a single database table, and Oracle's parallel query will automatically recognize the union all operator and fire off simultaneous queries against each of the three base tables. For example, the following query will assemble the requested data from the three tables in parallel, with each query being separately optimized. The result set from each subquery is then merged by the query coordinator:

SELECT customer_name
FROM all_sales
WHERE
sales_amount > 5000;
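
The degree of parallelism for those scans can be set on the underlying tables (or requested with a PARALLEL hint in the query); a sketch, with a degree of four chosen arbitrarily:

ALTER TABLE sales_1997 PARALLEL (DEGREE 4);
ALTER TABLE sales_1998 PARALLEL (DEGREE 4);
ALTER TABLE sales_1999 PARALLEL (DEGREE 4);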

Star Query Hints and Star Joins with Oracle 7.3. As you have already seen, the star schema design involves creating a main fact table that contains all of the primary keys in the related tables. This massive denormalization of the database structure means that just about any query against the star schema will involve the joining of many large tables, including a large "fact" table and many smaller reference tables. Oracle has provided a new feature with release 7.3 that detects a star query join and invokes a special procedure to improve performance of the query. Prior to release 7.3, this feature only worked with up to five tables, but this restriction has been eliminated. Also, release 7.3 no longer requires the use of star query hints, but they are still allowed in the SQL syntax and are generally a good idea for documentation purposes. The star query requires that a single concatenated index exist on all of the keys in the fact table.

Oracle follows a simple procedure for addressing these queries. Oracle will first service the queries against the smaller reference tables, combining the result sets into a Cartesian product table in memory. Once all of the reference tables have been addressed, Oracle will perform a nested-loop join of the intermediate table against the fact table. This approach is much faster than the traditional method of joining the smallest reference table against the fact table and then joining each of the other reference tables against the intermediate table.
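
A sketch of the concatenated fact-table index and a hinted star query follows; the dimension tables and their join columns here are assumptions based on the column names used earlier in this article:

CREATE INDEX fact_keys_idx
    ON fact (order_nbr, item_nbr, customer_name, region);

SELECT /*+ STAR */
       f.region, SUM(f.total_cost) region_sales
FROM   fact f, item i, customer c
WHERE  f.item_nbr = i.item_nbr
  AND  f.customer_name = c.customer_name
  AND  i.item_type = 'Hardware'
GROUP BY f.region;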

Using Oracle's Bitmap Indexes. For data warehouse applications that must attain maximum performance, there are some special situations in which Oracle bitmapped indexes may be useful. As you know, with very large tables, the sheer volume of rows can make even a trivial query run for a long time. Oracle has introduced bitmapped indexes with release 7.3 in an attempt to improve index lookup performance for queries, especially decision-support queries that may have many conditions in the where clause. Histograms can also improve the performance of certain queries. (Please see Steve Bobrowski's article "Tuning Application Logic" for a discussion of histograms.)

The bitmap approach to indexing is very different from the traditional B-tree style of indexes. In a traditional index, index keys are sorted and carried in several tree nodes. In a bitmapped index, an array is created. This array has all possible index values as one axis, while the other axis contains all rows in the base table. For example, consider a bitmapped index on the region field of the sales table, in which the regions are North, South, East, and West. If the sales table contains 1,000,000 rows, then the bitmapped index would create an array of 4 x 1,000,000 bits to store the possible key values. Within this array, the index data is binary: if a value is true, it is assigned a binary "1," whereas a false reading is set to binary "0." (See Figure 7.)

Figure 7

In Figure 7, you can see how this kind of query runs faster than a traditional indexed query. The Oracle optimizer will notice that the items in the where clause have bitmapped indexes and will simply scan for non-zero values in the proper array columns, quickly returning the row IDs of the matching rows. A fast merge of the result sets will then quickly identify the rows that meet the query criteria. Although it may appear that the index is very large, Oracle has developed a compression method whereby all of the binary zeros are omitted from the bitmap, making it very compact.

This may look reasonable at first glance. However, there are some drawbacks to bitmapped indexing. The first and most obvious is that bitmapped indexes are best suited for columns that have only a small number of possible values. For columns that have many values, such as state_name or city_name, the index overhead would probably exceed any performance gains. However, for columns such as sex, color, and size that have a small number of finite values, bitmapped indexes will greatly improve query retrieval speeds. Bitmapped indexes are especially useful for DSSs in which many conditions are combined in a single where clause.
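
A sketch of the kind of DSS predicate where several bitmapped indexes can be combined; the sales table is from the region example above, and the color and product_size columns are illustrative:

CREATE BITMAP INDEX sales_region_bmx ON sales (region);
CREATE BITMAP INDEX sales_color_bmx  ON sales (color);
CREATE BITMAP INDEX sales_size_bmx   ON sales (product_size);

SELECT COUNT(*)
FROM   sales
WHERE  region       = 'WEST'   /* each predicate maps to one bitmap */
  AND  color        = 'RED'
  AND  product_size = 'L';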

Using Oracle 7.3 Hash Joins. Oracle 7.3 also provides another method for speeding up decision-support and warehouse queries. This method is called the hash join. A hash join is a technique by which Oracle bypasses the traditional sort-merge join technique and replaces it with an algorithm that performs a full table scan, placing the rows into memory partitions. Partition pairs that do not fit into memory are placed in the temp tablespace. Oracle then builds a hash table on the smallest partition and uses the larger partition to probe the newly created hash table. This technique alleviates the need for in-memory sorting and does not require that indexes exist on the target tables.

The following four server parameters must be set to use hash joins:

 

  • optimizer_mode must be set to invoke the cost-based optimizer (for example, CHOOSE)
  • hash_join_enabled=true
  • hash_multiblock_io_count set to the number of blocks read in each I/O
  • hash_area_size set to the amount of memory (in bytes) available for the hash table

To execute a hash join, the hash join hint must be used. Following is an example:

SELECT /*+ USE_HASH */ *
FROM customer, order
WHERE
customer.cust_no = order.cust_no
AND
credit_rating = 'GOOD';

Going Warehouse

Oracle is now offering a considerable number of tricks that can help you use Oracle's proven relational technology to implement fast and robust warehouse systems. Prior to Oracle 7.3, a user who wanted to create a data warehouse was forced to deal with a vendor that did not have the same market presence as Oracle, and there were concerns as to whether these smaller data warehouse companies would survive. Oracle developers have long awaited these new performance features for their data warehouse operations. High-speed decision-support applications using Oracle's reliable relational database are now possible.


During the 1980s, Don Burleson established himself as an authority on database system architectures while serving on the faculty of several major universities. He is currently a senior DBA for a Fortune 50 company. He is also the author of Managing Distributed Databases (Wiley & Sons, 1995) and the forthcoming Oracle Performance and Tuning (Que, due in September 1996). You can email Don at 76443.2751@compuserve.com.



 


 

 

 
