Oracle predictions - Correlation vs. Causation

Oracle Database Tips by Donald Burleson

The problem with overgeneralizing from testing

There is a disturbing trend in Oracle whereby beginner DBAs are tempted to take a single "test case" and overgeneralize from it, drawing false conclusions and inferring causation where none exists.

A good Oracle DBA always performs full testing of system changes in a test environment, simulating real-world loads, before migrating any change into production. The Oracle best practices for system testing include:

Gathering a real-world SQL workload: You can extract high-impact SQL from v$sql, stats$sqlstat, dba_hist_sqlstat or by using the Oracle SQL Performance Analyzer workload capture utility.
Simulating a production workload by replicating the high water mark of transactions per second.
Performing the benchmark on identical OS and hardware.
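
As a rough illustration of the first step in the list above, the sketch below ranks captured statements by cumulative elapsed time. The query text is only an assumed starting point (sql_id, executions, elapsed_time and buffer_gets are v$sql columns), the sample rows are made up, and the helper name top_sql_by_elapsed is hypothetical. No database connection is made here.

```python
# Assumed query for pulling candidate statements; adjust the filter to your system.
TOP_SQL_QUERY = """
SELECT sql_id, executions, elapsed_time, buffer_gets
FROM   v$sql
WHERE  executions > 0
"""

def top_sql_by_elapsed(rows, limit=3):
    """Return the `limit` rows with the highest cumulative elapsed_time."""
    return sorted(rows, key=lambda row: row["elapsed_time"], reverse=True)[:limit]

# Made-up rows shaped like the query result (elapsed_time in microseconds).
sample_rows = [
    {"sql_id": "a1", "executions": 120, "elapsed_time": 4_500_000, "buffer_gets": 90_000},
    {"sql_id": "b2", "executions": 15, "elapsed_time": 9_800_000, "buffer_gets": 410_000},
    {"sql_id": "c3", "executions": 3_000, "elapsed_time": 2_100_000, "buffer_gets": 55_000},
    {"sql_id": "d4", "executions": 40, "elapsed_time": 7_200_000, "buffer_gets": 230_000},
]

for row in top_sql_by_elapsed(sample_rows, limit=2):
    print(row["sql_id"], row["elapsed_time"])
```

Ranking by elapsed time is only one possible definition of "high-impact"; buffer gets or executions may matter more for a given workload.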

However, instead of a full-blown benchmark using real SQL workloads, some DBAs misunderstand basic science and suggest that a single "test case" can be used to infer causation and explain how Oracle behaves.

For centuries, scientists have used statistically significant correlations for their predictive validity. If Event "A" correlates with Event "B" with 99% predictive validity, then you can use this fact to predict events, even if you don't know the root cause of the correlation.

While it might be nice to unravel the root cause of a correlation, it's the strength of the correlation that matters, not the root cause. Many DBAs make the common mistake of assuming causation between actions and results, when in reality there may be no causation at all.

In the Newsweek article titled Wanted: BS Detectors, Sharon Begley describes the human tendency to falsely ascribe causation from limited data:
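
To make "strength of correlation" concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The two samples are invented for illustration and the function name pearson_r is our own; nothing here comes from an Oracle tool.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Made-up data: concurrent sessions vs. average response time in milliseconds.
sessions = [10, 20, 30, 40, 50, 60]
response_ms = [12, 19, 33, 41, 48, 62]

r = pearson_r(sessions, response_ms)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

A value of r near 1 or -1 indicates a strong linear relationship that can be used for prediction. The number says nothing about why the two measurements move together.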

"What we need to teach is the ability to detect Bad Science—BS, if you will. . .

The brain stinks at distinguishing patterns from randomness (no wonder people can't tell that the climate change now underway is not just another turn in the weather cycle). For another, the brain overestimates causality"

False conclusions and causation

There is an old parable from India that describes this human tendency to make false conclusions. In the parable, two blind men are asked to describe an elephant. One blind man feels the trunk, and concludes that an elephant is pliable and narrow, while another blind man examines the leg and concludes that an elephant is stocky, like a tree.

This human tendency to falsely infer causation where none exists has been well demonstrated in science, and even lower mammals have a tendency to draw false conclusions based on a single sample. The global warming hoax is a perfect example of humans' predisposition to infer causation where none exists.

For example, according to Google Trends, Oracle and global warming are correlated. Does this correlation imply that Oracle has caused global warming? Of course not.

Correlation is good for predictive modeling, but it's dangerous to imply causation

In a recent scientific study, scientists asked teachers to apply reward and punishment and see which was more effective, rewarding punctual students while punishing those who were late. However, this test was "rigged" by the scientists, who randomly assigned students to be prompt or tardy.

Universally, the teachers concluded that punishment worked best, wrongly inferring causation between a random event and their reaction to the event.
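
The illusion is easy to reproduce. The sketch below is our own simplified model of the setup described above, not the study's actual protocol: each day a student is on time with probability 0.5, regardless of what the teacher did the day before. The function name and parameters are hypothetical.

```python
import random

def simulate_rigged_classroom(days=100_000, seed=42):
    """Return (share of punishments followed by improvement,
               share of rewards followed by a lapse)."""
    rng = random.Random(seed)
    punishments = improved_after_punishment = 0
    rewards = worsened_after_reward = 0
    on_time_today = rng.random() < 0.5
    for _ in range(days):
        # Tomorrow's punctuality is random and ignores the teacher's action.
        on_time_tomorrow = rng.random() < 0.5
        if on_time_today:
            rewards += 1  # teacher rewards a punctual student
            if not on_time_tomorrow:
                worsened_after_reward += 1
        else:
            punishments += 1  # teacher punishes a late student
            if on_time_tomorrow:
                improved_after_punishment += 1
        on_time_today = on_time_tomorrow
    return (improved_after_punishment / punishments,
            worsened_after_reward / rewards)

improved, worsened = simulate_rigged_classroom()
print(f"improved after punishment: {improved:.2%}")
print(f"worsened after reward:     {worsened:.2%}")
```

About half of all punishments are followed by improvement and about half of all rewards are followed by a lapse, simply because a late student can only get better and a punctual one can only get worse. An observer who sees only these sequences can easily conclude that punishment works and reward backfires, even though the teacher's action has no effect at all.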

In sum, artificial test cases cannot be used to infer the general behavior of Oracle. Remember, single test cases sample only a tiny portion of the Oracle population. A test case is only representative of a specific Oracle release on a specific platform, and Oracle performance varies wildly between operating systems and hardware environments.

To learn how to make a statistically valid benchmark test, see the book Oracle Benchmarking by Bert Scalzo of Quest Software.

Understanding Oracle science

When performing predictive modeling, it's the strength of the correlation that determines the validity of an observation. Understanding the root causation behind the correlation is interesting, but it is not required in order to make a prediction.

"Predictions are difficult, especially about the future." - Yogi Berra

Winston Churchill once said "The farther you can look into the past, the farther you can see into the future", and this applies to Oracle data as well.

Oracle knows this, and they have invested heavily in predictive modeling tools such as the "intelligent advisors" and the SGA and PGA advice tools.

To learn more about Oracle for the scientist, Dr. Carolyn Hamm goes into great detail on the use of correlation strength in her recommended book Oracle Data Mining.

Oracle sampling and Correlation

We have noted repeatedly that Oracle Data Mining and Oracle Data Warehouse tools have utilities for identifying significant correlations (e.g. consumer buying patterns), and that the predictive probability is more important than the underlying reason for the correlation.

For example, if we know that there is a 70% chance that people who bought Waldo's Widgets will also buy Cobb's Cogs, that's all we need to know to launch a marketing campaign, targeting consumers of Waldo's Widgets.
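
A figure like that 70% is just a conditional probability estimated from purchase history. The following sketch shows the arithmetic on invented baskets; the function name rule_confidence and the data are assumptions for illustration, not output from Oracle Data Mining.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [basket for basket in baskets if antecedent in basket]
    if not with_antecedent:
        raise ValueError(f"no basket contains {antecedent!r}")
    both = sum(1 for basket in with_antecedent if consequent in basket)
    return both / len(with_antecedent)

WIDGETS = "Waldo's Widgets"
COGS = "Cobb's Cogs"

# Made-up purchase history: each set is one customer's basket.
baskets = (
    [{WIDGETS, COGS}] * 7      # bought both
    + [{WIDGETS}] * 3          # widgets only
    + [{COGS}] * 5             # cogs only
    + [{"Other Item"}] * 5     # neither
)

confidence = rule_confidence(baskets, WIDGETS, COGS)
print(f"P(cogs | widgets) = {confidence:.0%}")
```

Seven of the ten widget buyers also bought cogs, so the estimate is 70%. The calculation never asks why the two products sell together.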

Many Oracle professionals misunderstand correlation and causation, but if something correlates, it can be extremely useful, regardless of the underlying root cause. If action A has a statistically significant correlation with result B, it can be used to predict the behavior of Oracle databases.

In medical research, there are many medicines that treat symptoms, while doctors do not understand how the medicines actually work.
In statistics, it's the strength of the correlation that determines its predictive validity, regardless of the root cause of the correlation.
In Oracle, correlations can be used in statistically valid forecasts, predicting changes to workloads before they occur!

Forecasters will tell you that the causation of a correlation is not as important as the strength of the correlation itself. There are many phenomena in the world that show a high correlation, and periodicals such as Poor Richard's Almanack served as a reporting vehicle for predictive modeling in past centuries.

Tour operators - Seafaring folks know that the bioluminescent breeding ritual of ocean glowworms always occurs on the 4th and 5th nights after a full moon. The cause of this behavior may be interesting, but all the tour operators need to know is that there is a 90% probability that their tour boat guests will observe acres of luminescent horny glowworms on these special nights.

Farmers - Farmers have known for centuries that you should always plant root crops (potatoes, turnips) right after the full moon (and always before the new moon) to achieve optimal growth. The causation is fascinating, of course, but the real value is the strong correlation of crop yield with lunar cycles.

Medicine - In pharmacology, scientists often do not understand the causation for a drug, and that's OK. For example, the anti-depressant Wellbutrin was found to help people stop smoking. Renamed Zyban, this drug has helped millions of people quit smoking, and scientists have no idea why. The only thing that matters is that there exists a statistically significant correlation between Zyban use and smoking cessation.

Marketers - In a modern example, today's point-of-sale (POS) data warehouses perform multivariate chi-square analysis to categorize groups of consumers and predict their propensity to buy a certain type of product. Large manufacturing companies spend hundreds of millions of dollars a year on advertising, and being able to target their messages to those with a higher probability to buy the product can save them millions of dollars each year.

Psychologists - In personality profiling statistics (MMPI), a database of millions of respondents has been created, and surprising scientific correlations have been found. For example, the true/false answer to the statement "I prefer a bath to a shower" has a very high correlation to the MMPI self-esteem scale. To date, no psychological researchers have discovered why people with a low self-concept prefer different bathing techniques, but that does not diminish this question's value in personality assessment.
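
For readers curious about the chi-square analysis mentioned in the marketing example, here is a minimal sketch of the simplest case: a Pearson chi-square test of independence on a 2x2 table. Real point-of-sale analysis is multivariate and far larger; the counts below are invented and the function name chi_square_2x2 is our own.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Made-up counts: rows are consumer segments, columns are bought / did not buy.
segment_a = (60, 40)
segment_b = (30, 70)

chi2 = chi_square_2x2(*segment_a, *segment_b)
CRITICAL_5_PERCENT_1_DF = 3.841  # standard critical value, 1 degree of freedom

print(f"chi-square = {chi2:.2f}")
print("significant at 5%" if chi2 > CRITICAL_5_PERCENT_1_DF else "not significant")
```

A statistic well above the critical value suggests that segment membership and purchase behavior are not independent, which is exactly the kind of correlation a marketer can act on without knowing its cause.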

In sum, predictive modeling lives in the world of probabilities, and it's the strength of the correlation itself that has value. While the causation behind the correlation may be interesting, it's the cold, hard numbers that drive management decisions.
