Running a Data Mining Activity
Data warehouse tips by Burleson Consulting
This is an excerpt from Dr. Ham's premier book "Oracle
Data Mining: Mining Gold from your Warehouse".
While the Activity runs, the status bar shows which steps in
sequence are being completed. As shown in the activity screen, all steps from
discretize to test metrics have been successful.
Even though the data may already be binned, the algorithm
will take any further steps needed to ensure that numerical data and categorical
data are divided into appropriate bins. The dataset is then separated into
training and validation sets by random selection of cases.
A classifier is built using the training dataset and then applied to the
validation set. Test metrics are summarized and written to the result section.
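The binning and random-split steps described above can be sketched in a few
lines of Python. This is only an illustration of the general idea, not the
procedure Oracle Data Miner runs internally: equal-width binning, the 40%
validation fraction, the seed, and the sample ages are all assumptions chosen
for the example.

```python
import random

def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins equal-width bins (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    # The maximum value lands on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def random_split(cases, test_fraction, seed):
    """Shuffle a copy of the cases and return (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

ages = [18, 25, 31, 40, 47, 52, 60, 78]   # hypothetical numeric attribute
bins = equal_width_bins(ages, 4)

# Hypothetical case IDs 0..99; 40% held out for validation (assumed fraction).
train, test = random_split(range(100), test_fraction=0.4, seed=1)
print(bins, len(train), len(test))
```

Every case ends up in exactly one of the two sets, which is what lets the
validation set give an unbiased view of the classifier.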
Viewing your Results
Now that we have built the classification model, we’ll take
a look at the results.
Click Result in the test metric step and view the Confusion
Matrix. Because the dataset is small and the
training data was randomly selected, your results may differ slightly if
you run this analysis yourself.
The first tab in the results window indicates that the
Predictive Confidence is “Good” in comparison to the naïve model.
A very simple method for classifying customers is to classify the record as a
member of the majority class. In our dataset MINING_DATA_BUILD_V_US, the
majority of cases (74.18%) have AFFINITY_CARD = 0.
Ignoring all the predictor information that we have, the naïve rule would
classify all customers as not having an affinity card because only 25.82% have a
card.
The naïve rule is commonly used as a baseline for evaluating the performance of
classification models. The predictive confidence of 57.83% indicates that the
model we built is about 58% better than the naïve rule.
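As a rough illustration of the naïve rule, the sketch below predicts the
majority class for every case and reports the resulting accuracy. The
10,000-case sample is hypothetical and is built only to match the class
proportions quoted above. The formula Oracle Data Miner uses for Predictive
Confidence is not reproduced here.

```python
from collections import Counter

def naive_rule_accuracy(labels):
    """Return (majority_class, accuracy) when every case is predicted
    to belong to the majority class."""
    majority_class, majority_count = Counter(labels).most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical sample: 74.18% of cases have AFFINITY_CARD = 0, 25.82% have 1.
labels = [0] * 7418 + [1] * 2582
majority, accuracy = naive_rule_accuracy(labels)
print(majority, accuracy)
```

Because the rule ignores every predictor, its accuracy simply equals the share
of the majority class. That makes it a convenient floor for comparison with
any classifier.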
The Accuracy tab takes us to the classification matrix, also called the
confusion matrix, which shows the results of applying the model to the
hold-out test sample. Click the More Detail
button to view the confusion matrix. The columns are the predictions made by
the classification model and the rows are the actual data.
As we see, the overall accuracy of the model is 77%. Of the customers without
an affinity card, 312 were correctly classified as not having one, while 26
cardholders were misclassified as not having one. Similarly, 105 cardholders
were correctly classified as having a card, and 127 customers without a card
were misclassified as having one. The misclassified cases are the
false-negative and false-positive predictions, respectively.
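The layout of the matrix (rows are actual classes, columns are predictions) is
easier to see on a tiny example. The following sketch uses ten made-up cases,
not the book's data, and the function names are hypothetical helpers rather
than anything provided by Oracle Data Miner.

```python
def confusion_matrix(actual, predicted):
    """2x2 matrix for 0/1 labels: rows are actual classes, columns predictions."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def overall_accuracy(matrix):
    """Fraction of cases on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy labels (assumed for illustration): 1 = has card, 0 = no card.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
false_positives = cm[0][1]   # actual 0, predicted 1
false_negatives = cm[1][0]   # actual 1, predicted 0
print(cm, overall_accuracy(cm), false_positives, false_negatives)
```

The off-diagonal cells are the two kinds of error: the top-right cell counts
false positives and the bottom-left cell counts false negatives.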
Oracle Data Mining Lift Curve
The "Lift" tab presents two graphical views of the results: the cumulative
lift chart and the cumulative positive cases chart.
We want our classification model to sift through the records and sort them
according to which customers are more likely to respond to our mailing.
The lift curve, also called a gains curve or gains chart, is a popular technique
in direct marketing.
The lift curve will help us discover how to effectively “skim the cream” of our
mailing list so we pick the smallest number of cases with the greatest
probability of answering our mailing campaign.
ODMr applies the lift model to the test data, sorts the predicted results by
probability, divides the ranked list into 10 equal parts (quantiles), and counts
the actual positive values in each quantile.
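The ranking-and-counting procedure just described can be sketched as follows.
This is a minimal illustration, not ODMr's implementation: the function name,
the twenty scored cases, and their labels are all invented for the example, and
the code assumes there are at least as many cases as quantiles.

```python
def cumulative_lift(scored_cases, n_quantiles=10):
    """scored_cases: list of (predicted_probability, actual_label), labels 0/1.
    Returns the cumulative lift after each quantile of the ranked list."""
    # Rank cases from most to least likely to respond.
    ranked = sorted(scored_cases, key=lambda case: case[0], reverse=True)
    total = len(ranked)              # assumed to be >= n_quantiles
    overall_rate = sum(label for _, label in ranked) / total
    lifts = []
    for q in range(1, n_quantiles + 1):
        cutoff = total * q // n_quantiles
        positives = sum(label for _, label in ranked[:cutoff])
        # Response rate in the top q quantiles relative to the overall rate.
        lifts.append((positives / cutoff) / overall_rate)
    return lifts

# Hypothetical test set: 20 cases, 6 actual responders, mostly ranked near the top.
scores = [(20 - i) / 20 for i in range(20)]
labels = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
lifts = cumulative_lift(list(zip(scores, labels)))
for q, lift in enumerate(lifts, start=1):
    print(f"top {q * 10:3d}%: cumulative lift {lift:.2f}")
```

A cumulative lift of 2 at a given depth means that mailing only that top slice
reaches twice as many responders as mailing a random sample of the same size.
The lift always falls to 1 once the whole list is included.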
The test results indicate that if we take the top 40%, we will have at least
twice the response expected from random sampling. A good classifier will give
us a high lift when we mail only a few customers.