Oracle predictions - Correlation vs. Causation
Oracle Database Tips by Donald Burleson
The problem with overgeneralizing from test cases
There is a disturbing trend in Oracle whereby
beginner DBAs are tempted to take a single "test case" and
overgeneralize it, drawing false conclusions and inferring causation where
none exists.
A good Oracle DBA always performs full testing of
a system change in a test environment, simulating real-world loads,
before migrating any change into production. The Oracle best
practices for system testing include:
- Gathering a real-world SQL workload: You
  can extract high-impact SQL from v$sql, stats$sqlstat,
  dba_hist_sqlstat or by using the
  Oracle SQL Performance Analyzer workload capture utility.
- Simulating a production workload by replicating
  the high water mark of transactions per second.
- Performing the benchmark on identical OS and
  hardware.
However, instead of a full-blown benchmark using
real SQL workloads, some DBAs misunderstand basic science and suggest
that a single "test case" can be used to infer causation and explain
how Oracle behaves.
For centuries, scientists have used
statistically significant correlations for their predictive validity.
If Event "A" correlates with Event "B" with 99% predictive validity, then
you can use this fact to predict events, even if you don't know the
root cause of the correlation.
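As a rough illustration of measuring correlation strength, the sketch
below computes a Pearson correlation coefficient in plain Python. The
function name pearson_r and the sample figures (hourly sessions and
physical reads) are hypothetical, made up for this example; they are
not taken from any Oracle view.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical observations: concurrent sessions vs. physical reads.
sessions = [10, 20, 30, 40, 50]
physical_reads = [120, 210, 330, 390, 520]

r = pearson_r(sessions, physical_reads)
print(f"r = {r:.3f}")
```

A value of r near 1 or -1 indicates a strong linear relationship that
may be useful for prediction. It says nothing about which variable, if
either, causes the other.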
While it might be nice to unravel the root
cause of a correlation, it's the strength of the correlation that
matters, not the root cause. Many DBAs make the common mistake
of assuming causation between actions and results, when in reality
there may be no causation at all.
In the NEWSWEEK article titled
Wanted: BS Detectors, Sharon Begley
describes the human tendency to falsely ascribe causation from
limited data:
"What we need to teach is the ability to detect Bad
Science—BS, if you will. . .
The brain stinks at distinguishing patterns from
randomness (no wonder people can't tell that the climate change now
underway is not just another turn in the weather cycle). For another,
the brain overestimates causality"
False conclusions and causation
There is an old Indian parable that describes
this human tendency to make false conclusions. In the parable,
two blind men are asked to describe an elephant. One blind man
feels the trunk, and concludes that an elephant is pliable and narrow,
while another blind man examines the leg and concludes that an
elephant is stocky, like a tree.
This human tendency to falsely
infer causation where none exists has been well demonstrated in
science, and even lower mammals have a tendency to draw false
conclusions based on a single sample. Public arguments about global
warming offer many examples of humans' predisposition to infer
causation where none exists.
For example, according to Google Trends, search interest in
Oracle and in global warming is correlated. Does this
correlation imply that Oracle has caused global warming? Of
course not.
Correlation is good for predictive modeling,
but it's dangerous to imply causation
In a recent scientific study, scientists
asked teachers to
apply reward and punishment and see which was more effective,
rewarding punctual students while punishing those who were late.
However, this test was "rigged" by the scientists, who randomly
assigned students to be prompt or tardy.
Universally, the
teachers concluded that punishment worked best, wrongly inferring
causation between a random event and their reaction to the event.
In sum, artificial test cases cannot be used to infer the general
behavior of Oracle. Remember, single test cases sample only a
tiny portion of the Oracle population. A test case is only
representative of a specific Oracle release on a specific platform,
and Oracle performance varies wildly between operating systems and
hardware environments.
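To make the sampling point concrete, here is a minimal sketch that
compares one measurement against the spread of repeated runs. The
timings are hypothetical, the helper name mean_with_margin is made up
for this example, and the 1.96 multiplier is a normal approximation
that is only rough for so few samples.

```python
import math
import statistics


def mean_with_margin(samples, z=1.96):
    """Return (mean, margin), where margin is z * standard error."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    avg = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return avg, z * std_err


# Hypothetical elapsed times (seconds) for one query over eight runs.
elapsed_seconds = [1.9, 2.4, 2.1, 3.8, 2.0, 2.2, 2.6, 1.8]

# A one-off "test case" might have happened to observe only this run.
single_run = elapsed_seconds[3]

avg, margin = mean_with_margin(elapsed_seconds)
print(f"single run:    {single_run:.2f}s")
print(f"repeated runs: {avg:.2f}s +/- {margin:.2f}s")
```

In this made-up data the single run falls well outside the interval
estimated from all eight runs, which is the risk of generalizing from
one observation.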
To learn how to make a statistically
valid benchmark test, see the book
Oracle
Benchmarking by Bert Scalzo of Quest Software.
Understanding Oracle science
When performing predictive modeling, it's the
strength of the correlation that determines the validity of an
observation. Understanding the root cause behind the correlation
is interesting, but it is secondary.
"Predictions are
difficult, especially about the future".
Yogi Berra
Winston Churchill once said "The farther you
can look into the past, the farther you can see into the future",
and this applies to Oracle data as well.
Oracle knows this, and
it has invested heavily in
predictive modeling
tools such as the "intelligent advisors" and the SGA and PGA
advice tools.
To learn more about Oracle for the
scientist, see Dr. Carolyn Hamm's recommended book
Oracle Data Mining, which goes into great detail on the use of
correlation strength.
Oracle sampling and correlation
We have noted repeatedly that Oracle Data Mining
and Oracle Data Warehouse tools have utilities for identifying
significant correlations (e.g., consumer buying patterns), and that
the predictive probability is more important than the underlying
reason for the correlation.
For example, if we know that there is a 70% chance
that people who bought Waldo's Widgets will also buy
Cobb's Cogs, that's all we need to know to launch a marketing
campaign targeting consumers of Waldo's Widgets.
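A figure like that 70% can be estimated as a conditional probability
over purchase records. The sketch below assumes each purchase is a
set of product names; the basket data and the function name
rule_confidence are hypothetical and exist only for this example.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        raise ValueError("no basket contains the antecedent")
    with_both = sum(1 for b in with_antecedent if consequent in b)
    return with_both / len(with_antecedent)


# Hypothetical purchase history: 10 Waldo's Widgets buyers, 7 of whom
# also bought Cobb's Cogs, plus 5 customers who bought only the cogs.
baskets = (
    [{"Waldo's Widgets", "Cobb's Cogs"}] * 7
    + [{"Waldo's Widgets"}] * 3
    + [{"Cobb's Cogs"}] * 5
)

confidence = rule_confidence(baskets, "Waldo's Widgets", "Cobb's Cogs")
print(f"P(Cobb's Cogs | Waldo's Widgets) = {confidence:.0%}")
```

Note that the estimate needs no explanation of why the two products
sell together; it only counts how often they do.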
Many Oracle professionals misunderstand correlation and
causation, but if something correlates, it can be extremely useful,
regardless of the underlying root cause. If action A has
a statistically significant correlation with result B, it can be used
to predict the behavior of Oracle databases.
- In medical research, there are many medicines
  that treat symptoms, while doctors do not understand how the
  medicines actually work.
- In statistics, it's the strength of the
  correlation that determines its predictive validity, regardless of
  the root cause of the correlation.
- In Oracle, correlations can be used in
  statistically valid forecasts, predicting changes to workloads
  before they occur!
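One simple way to turn a correlation between time and workload into
a forecast is a least-squares trend line. The following is only a
sketch: the weekly peak transaction figures are hypothetical, fit_line
is a name made up for this example, and a real forecast would also
need to account for seasonality and report its uncertainty.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: peak transactions per second, by week.
weeks = [1, 2, 3, 4, 5, 6]
peak_tps = [110, 118, 131, 139, 152, 160]

slope, intercept = fit_line(weeks, peak_tps)

# Extrapolate the fitted trend two weeks past the last observation.
forecast_week_8 = slope * 8 + intercept
print(f"trend: {slope:.1f} tps/week, week 8 forecast: {forecast_week_8:.0f} tps")
```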
Forecasters will tell you that the causation of
a correlation is not as important as the strength of the correlation
itself. There are many phenomena in the world that show a high
correlation, and periodicals such as Poor Richard's Almanack
served as a reporting vehicle for predictive modeling in past
centuries.
- Seafaring folks know that the photoluminescent ritual of
ocean glowworm breeding always occurs on the 4th and
5th nights after a full moon. The cause of this
behavior may be interesting, but all the tour operators need to
know is that there is a 90% probability that their tour boat
guests will observe acres of luminescent horny glowworms on
these special nights.
- Farmers have known for centuries that you should always plant
root crops (potatoes, turnips) right after the full moon (and
always before the new moon) to achieve optimal growth. The
causation is fascinating, of course, but the real value is the
strong correlation of crop yield with lunar cycles.
- Medicine: In pharmacology,
scientists often do not understand how a drug works, and
that's OK. For example, the anti-depressant Wellbutrin was
found to help people stop smoking. Renamed Zyban, this
drug has helped millions of people quit smoking, even though
how it does so is not fully understood. The only thing that matters is that
there exists a statistically significant correlation between
Zyban use and smoking cessation.
- Marketers: In a modern example, today's point-of-sale (POS) data
warehouses perform
multivariate chi-square analysis to categorize groups of
consumers and predict their propensity to buy a certain type of
product. Large manufacturing companies spend hundreds of
millions of dollars a year on advertising, and being able to
target their messages to those with a higher probability to buy
the product can save them millions of dollars each year.
- Psychologists: In personality profiling statistics (MMPI),
a database of millions of respondents has been created and
surprising scientific correlations have been found. For
example, the true/false answer to the statement "I prefer a
bath to a shower" has a very high correlation to the MMPI
self-esteem scale. To date, no psychological researchers
have discovered why people with a low self-concept prefer
different bathing techniques, but that does not diminish this
question's value in personality assessment.
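The chi-square analysis mentioned in the marketing example can be
illustrated in its simplest form, a 2x2 table of consumer segment
versus purchase outcome. This is a sketch only: the counts are
hypothetical, chi_square_2x2 is a name made up for this example, and
a real POS analysis would involve many more categories.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if segment and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical counts. Rows: consumer segment A, segment B.
# Columns: bought the product, did not buy.
observed = [[30, 10],
            [10, 30]]

chi2 = chi_square_2x2(observed)

# Critical value for 1 degree of freedom at the 0.05 level.
CRITICAL_005_DF1 = 3.841
significant = chi2 > CRITICAL_005_DF1
print(f"chi-square = {chi2:.1f}, significant at 0.05: {significant}")
```

A statistic above the critical value suggests that segment and
purchase behavior are associated, which is enough to target
advertising without knowing why the segments differ.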
In sum, predictive modeling lives in the world
of probabilities, and it's the strength of the correlation itself
that has value. While the causation behind the correlation may
be interesting, it's the cold, hard numbers that drive management
decisions.