This is an excerpt from Dr. Hamm's premier book "Oracle Data Mining: Mining Gold from your Warehouse".
But first, we'll build a new Adaptive Bayes Network model, and instead of using the Single Feature model type as we did previously, choose Multi Feature, maximize the Overall Accuracy, set the target attribute to forest cover = 3, and sample 20,000 rows.
Multi Feature will not give us the rules as before, but will build multiple features, each one improving on the model, and result in a more effective model.
The results of our new model show improved accuracy, from 21% to 45%: it correctly classified 42% of the ponderosa pines while keeping the overall accuracy around 60%. Lower cost is an indicator of improved prediction, and the cost of the Multi Feature model is 11,804 compared to 13,454 for the Single Feature model.
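ODMr reports this cost figure for us, but the computation behind it is simple: each error count is weighted by its entry in the cost matrix and summed, so with the default equal costs the figure is effectively the number of misclassified cases. A minimal sketch (the counts passed in below are hypothetical, not taken from our models):

import numpy as np  # not required here, but used in later sketches

# A minimal sketch (not ODMr's code) of the misclassification cost:
# each error type is weighted by its cost-matrix entry and summed.

def misclassification_cost(false_positives, false_negatives,
                           fp_cost=1.0, fn_cost=1.0):
    """Total cost = error counts weighted by the cost matrix."""
    return false_positives * fp_cost + false_negatives * fn_cost

# With equal unit costs, the cheaper model is simply the one with fewer
# total errors -- hypothetical counts, not the book's actual models:
print(misclassification_cost(false_positives=7000, false_negatives=4804))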
By changing the type of model used to classify our data, we were able to influence the predictive accuracy of our model. As mentioned earlier, there are two ways to nudge the model into producing results that are of more interest to us.
One of these methods is to introduce cost bias into our build model, and the methods for doing this were described in Chapter One. To review, go to the ROC tab in the result viewer of your Mining Activity.
The Receiver Operating Characteristic (ROC) metric shows how the predictions change as the Cost Matrix is modified. We want to predict more of the ponderosa pines and avoid false negative predictions. Under the ROC curve there are two boxes labeled False Positive Cost and False Negative Cost. Type 3 in the False Negative Cost box, telling the model that a false negative is three times more costly than a false positive error, and click Compute Cost. Note that the red line jumps to the right and, in the detail section, the line with probability threshold 0.216 is highlighted. The confusion matrix changes to show that there are 22 false negatives, 200 false positives, 215 true positives, and 3,556 true negatives.
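Conceptually, Compute Cost applies the chosen probability threshold to the test set's predicted probabilities, tallies the resulting confusion matrix, and weights the errors by the cost matrix; under that reading, the counts above with the 3:1 costs give a cost of 200 × 1 + 22 × 3 = 266. A minimal sketch, with random scores standing in for the real test-set output:

import numpy as np

# Sketch of applying a probability threshold and tallying the confusion
# matrix. The scores and labels are random stand-ins for the test set
# (3,993 cases, matching the counts quoted above: 22+200+215+3556).
rng = np.random.default_rng(0)
probs = rng.random(3993)            # predicted P(forest cover = 3)
actual = rng.random(3993) < 0.06    # ~6% true ponderosa pines

threshold = 0.216
predicted = probs >= threshold

tp = int(np.sum(predicted & actual))
fp = int(np.sum(predicted & ~actual))
fn = int(np.sum(~predicted & actual))
tn = int(np.sum(~predicted & ~actual))

# A false negative costs three times a false positive (the 3:1 matrix).
cost = fp * 1 + fn * 3
print(tp, fp, fn, tn, cost)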
To modify the model test results, return to the Mining Activity and click Select ROC Threshold in the Test Metrics section. The default costs for False Positive and False Negative are assumed to be equal and are set to 1.
Now we change the setting from a probability of 0.5 to 0.216 and notice that the False Negative Cost is now 3.63. Click OK to save the settings and see that the ROC Threshold has changed. The new cost bias will be used when the model is applied to a dataset.
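The 3.63 is consistent with the standard relationship between a probability threshold and an implied cost ratio: predicting positive when p ≥ t is cost-optimal when t = FP cost / (FP cost + FN cost), so a threshold of 0.216 implies a false negative cost of (1 − 0.216)/0.216 ≈ 3.63. A minimal sketch of that conversion (the function names are ours, not ODMr's):

# Sketch of the threshold/cost relationship: predicting positive when
# p >= t minimizes expected cost when t = fp_cost / (fp_cost + fn_cost).

def threshold_to_fn_cost(t, fp_cost=1.0):
    """False-negative cost implied by a probability threshold."""
    return fp_cost * (1.0 - t) / t

def costs_to_threshold(fp_cost, fn_cost):
    """Probability threshold implied by a pair of error costs."""
    return fp_cost / (fp_cost + fn_cost)

print(threshold_to_fn_cost(0.216))   # ~3.63, the value ODMr displays
print(costs_to_threshold(1.0, 3.0))  # 0.25; the nearest ROC row was 0.216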
We have now built two types of Adaptive Bayes Network models, the Single Feature and the Multi Feature. There is one more type in ODMr, the Pruned Naive Bayes, which is very similar to the Naïve Bayes that we used in Chapter One, although the results will not be exactly what you'd get using the Naïve Bayes Classification model directly. In the next section we'll look at Decision Trees, which, like the Adaptive Bayes Network Single Feature model, give rules for determining how cases are classified.
In many types of data the target values may comprise only a very small percentage of cases. In hospital data, for example, preventable hospitalizations are relatively rare events, accounting for only a very small percentage of admissions. Likewise, of all the users hitting a website, only a small number are hackers with malicious intent. A classification model built on a dataset containing only a few known positive cases will not be able to discriminate very effectively between the two classes.
The model may in fact predict that no cases are preventable hospitalizations, or hacker attempts, and it will be 98 to 100% correct: if only 2% of cases are positive, a model that always predicts the negative class is 98% accurate. However, we really have not learned anything about these cases, and the model is not very effective.
What we want to do is build the model on case data that has approximately equal numbers of positive and negative cases. However, the algorithm will treat this distribution as if it were realistic, so we need to supply the actual distribution of target values, called the Prior Distribution (Priors), so that the build process results in a more meaningful model. The ODMr classification models will use Priors when you specify stratified sampling in the Mining Activity advanced settings for sampling.
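ODMr handles this through its sampling settings, but the underlying idea can be sketched with scikit-learn's Naive Bayes, which also accepts explicit priors. This is an analogue, not ODMr's API, and the data below is made up:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# A scikit-learn analogue (not ODMr's API) of supplying Priors: train on
# a balanced sample, but tell the classifier the real-world distribution
# so its posterior probabilities aren't skewed by the sampling.

rng = np.random.default_rng(1)
# Balanced build sample: 500 negatives around 0, 500 positives around 1.
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(1, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

# True population distribution: positives are rare (2% of cases).
model = GaussianNB(priors=[0.98, 0.02]).fit(X, y)

naive = GaussianNB().fit(X, y)  # implicitly treats the 50/50 sample as real
point = [[0.8, 0.8]]
# The same point scores far lower once the real priors are applied.
print(naive.predict_proba(point)[0, 1], model.predict_proba(point)[0, 1])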
Enhanced Adaptive Bayes Network (ABN)
Oracle Data Mining (ODM) incorporates supervised and unsupervised learning models. Supervised learning models, sometimes called directed models, are used to predict a value. One supervised model, the Adaptive Bayes Network (ABN), is a data-mining algorithm that provides decision-tree-like functionality in the database.
Oracle Database 10g has made the following ABN enhancements:
- Enable access through a Java API to a prediction's supporting rules.
- Enable the user to control accuracy, performance, and output parameters on a model-by-model basis rather than through a configuration table setting.
- Automatically select a best Naive Bayes Model baseline.
Next, we'll try this using
the Decision Tree
classification model.