Linear Regression modeling for Oracle databases

Don Burleson

Oracle began its work on predictive modeling with the Oracle Data Mining (ODM) tools, and Oracle Corporation is now developing Automatic Maintenance Tasks (AMT), a new Oracle10g feature that will automatically detect and rebuild sub-optimal indexes.

There has been a great deal of discussion about using the scientific method with Oracle databases, and about how mathematical models are developed for Oracle. Predicting the future without historical justification is the realm of psychics, not scientists. Virtually every predictive model in Oracle software is built from historical data held in the database:

Data mining can sift through massive amounts of data and find hidden information — valuable information that can help you better understand your customers and anticipate their behavior.

So, do Oracle modeling rules have to make sense? No, of course not. The Oracle scientists who created the Oracle data mining tools do not make the mistake of demanding an explanation for every rule. They scan historical data, identify statistically significant correlations (conventionally tested against a threshold of about two standard deviations), and base their results on empirical truths, not theory.
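As a rough illustration of scanning historical data for a significant correlation, here is a minimal Python sketch. It is not how ODM works internally; it only shows the idea. The sample values, the names `fit_line` and `t_statistic`, and the choice of a 5% two-sided test are all assumptions made for this example. It fits a least-squares line to two hypothetical historical series and checks whether the correlation is large enough to be unlikely by chance.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two samples of equal length with n >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("each sample must contain some variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

def t_statistic(r, n):
    """t statistic for testing whether a correlation r differs from zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical history: daily transaction counts vs. physical disk reads.
transactions = [120, 150, 170, 200, 230, 260, 300, 310]
physical_reads = [1300, 1500, 1800, 2100, 2250, 2700, 3000, 3200]

slope, intercept, r = fit_line(transactions, physical_reads)
t = t_statistic(r, len(transactions))

# Critical t value for a two-sided 5% test with n - 2 = 6 degrees of freedom.
CRITICAL_T = 2.447
significant = abs(t) > CRITICAL_T

print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f} t={t:.2f}")
print("statistically significant" if significant else "not significant")
```

Note that the sketch never asks why transactions and disk reads move together. It only measures whether they do, which is the point of the empirical approach.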

For example, the popular MMPI test is a set of more than 500 true/false questions that assesses personality with remarkable validity, and its results are accepted in all U.S. courts. Its test base consists of hundreds of thousands of subjects with pre-diagnosed mental disorders (see DSM-IV). By comparing their responses to seemingly innocuous questions (e.g. "I read the editorials in the newspaper every day"), a proven predictive model was created. Federal courts have affirmed the MMPI as a scientifically valid and accepted procedure for personality assessment.

Likewise, a subject's preference for showers over baths is an extremely reliable measure of self-esteem. Do we know why? No. Do we care? Not really. All that is proven is that this correlation is a statistically reliable predictor of feelings of self-worth. We see the exact same scientific principle applied to Oracle data mining (ODM) tools. For example, we might find out that people with red hair buy a disproportionate amount of skin care products. Knowing "why" is not important. What's important is knowing that the data supports the assertion. Also useful is the book "Unobtrusive Measures", which shows creative techniques for finding "hidden" significant metrics.

In sum, rules don't have to be proven true to be statistically reliable, and exceptions do not make the rule invalid. For example, if two out of every 1,000 red-haired people don't buy skin care products, we still have a model with very high predictive quality.
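The arithmetic behind that claim is simple, and a short Python sketch makes it concrete. The helper name `rule_accuracy` is hypothetical, and "accuracy" here is just the share of cases the rule gets right, which is one of several ways to score a predictive rule.

```python
def rule_accuracy(total_cases, exceptions):
    """Fraction of cases for which the rule's prediction holds."""
    if total_cases <= 0 or not 0 <= exceptions <= total_cases:
        raise ValueError("exceptions must be between 0 and total_cases")
    return (total_cases - exceptions) / total_cases

# Two exceptions per 1,000 red-haired customers, as in the example above.
accuracy = rule_accuracy(1000, 2)
print(f"{accuracy:.1%}")  # prints 99.8%
```

A rule that holds for 998 of every 1,000 cases is not disproven by the two exceptions; it is simply a rule that holds 99.8% of the time.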
