Viewing Model Histograms
Data warehouse tips by Burleson Consulting
This is an excerpt from Dr. Ham's premier book "Oracle
Data Mining: Mining Gold from your Warehouse".
When you view the histogram for each of the attributes, keep in mind that the
sample count is less than the 581,012 cases in the entire dataset. If you want
to report the average, max, min and variance for the whole dataset, you can
change the sampling size to 582,000 by going to Tools, Preferences, Sampling.
As you can see when viewing the target forest cover, 85% of the forest trees
are spruce and pine.
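To make the sampling caveat concrete, here is a minimal Python sketch that computes the same summary statistics on a sample and on the full set of values. The data is randomly generated and purely hypothetical (it is not the forest cover dataset), and the `summarize` helper is an illustrative name, not part of ODMr.

```python
import random
import statistics


def summarize(values):
    """Return count, mean, min, max and sample variance for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "variance": statistics.variance(values),
    }


# Hypothetical stand-in for one numeric attribute (not real forest cover data).
rng = random.Random(0)
population = [rng.gauss(100.0, 15.0) for _ in range(10_000)]

# A histogram built from a sample only describes the sampled rows.
sample = rng.sample(population, 500)

sample_stats = summarize(sample)
full_stats = summarize(population)

print("sample:", sample_stats)
print("full:  ", full_stats)
```

The sample's minimum and maximum can never be more extreme than those of the full data, which is one reason to raise the sampling size before reporting min and max for the whole dataset.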
Do we actually need all 55 columns to build a model? If you look at some of the
Soil Type (ST) variables, you'll see that for STs 10, 11, 12, 17, 18 and 19 (to
name a few) there are not many samples. It is not likely that all attributes
will contribute to a predictive model. Some of them may in fact simply add
noise and detract from the model's value.
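One way to spot thinly populated columns outside the histogram viewer is to count the nonzero rows per column. The sketch below assumes the soil type columns are 0/1 flags; the column names, the rows and the `min_count` threshold are all hypothetical.

```python
def sparse_columns(rows, min_count):
    """Return {column: nonzero_count} for columns with fewer than min_count nonzero rows."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value else 0)
    return {name: n for name, n in counts.items() if n < min_count}


# Hypothetical rows with binary flag columns (illustrative names only).
rows = [
    {"ST_RARE": 0, "ST_COMMON": 1},
    {"ST_RARE": 0, "ST_COMMON": 1},
    {"ST_RARE": 1, "ST_COMMON": 0},
    {"ST_RARE": 0, "ST_COMMON": 1},
]

flagged = sparse_columns(rows, min_count=2)
print(flagged)  # {'ST_RARE': 1}
```

A low count does not prove a column is useless, so treat the flagged list as candidates to review rather than columns to drop automatically.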
Attribute Importance in the Model
ODMr has an Attribute Importance feature that ranks the attributes by
significance in determining the target value. Attribute Importance can be used
to reduce the size of a classification problem by eliminating some attributes,
which can in turn increase speed and accuracy when building models.
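The idea of ranking attributes against a target can be illustrated with a small Python sketch. Note that this is not ODMr's algorithm: the text here does not describe how ODMr scores attributes, so mutual information is swapped in as a plainly labeled stand-in, and the column names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length categorical sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_attributes(columns, target):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {name: mutual_information(vals, target) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: one attribute tracks the target, the other does not.
target = [1, 1, 0, 0]
columns = {
    "INFORMATIVE": ["a", "a", "b", "b"],
    "NOISE": ["x", "y", "x", "y"],
}

ranking = rank_attributes(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Attributes at the bottom of such a ranking are the natural candidates to leave out when rebuilding a model.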
Let's revisit the Naïve Bayes analysis we completed in the last chapter. We'll
use ODMr's Attribute Importance analysis to find the highest ranking attributes
and use these to build another model.
Pick Attribute Importance under Activity Build and choose MINING_DATA_BUILD_US
as the case table. Use Customer ID as the unique identifier, and keep the
default columns that the activity chooses to build the model. Finish and run
the Activity.
Upon completion, we can view the ranking results.
You can see that Attribute Importance ranked HOUSEHOLD_SIZE as the most
important attribute, followed by marital status and so on. Now we'll use this
information in a new Naïve Bayes model.
Using the new Naïve Bayes
Under Activity pick Build, then Classification as the function type and Naïve
Bayes as the algorithm.
1. Choose MINING_DATA_BUILD_V_US for the
case table, and customer ID (CUST_ID) as the unique identifier. De-select
BULK_PACK_DISKETTES, COUNTRY_NAME, CUST_INCOME_LEVEL, FLAT_PANEL_MONITOR,
OS_DOC_SET_KANJI, and PRINTER_SUPPLIES from the Select Columns box.
2. Click Next and check AFFINITY_CARD for
the target column. Keep the Preferred Target Value set to 1, and name the
activity MINING_DATA_BUILD_US_NB2.
3. When you click Finish, the Activity
Wizard will show the progress of sampling, discretizing, splitting, building and
testing the new model. Click on "Result", "Accuracy" and "More Detail" to view
the confusion matrix.
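For readers who want to see what a confusion matrix tabulates, here is a minimal Python sketch that counts (actual, predicted) pairs. The labels and predictions are hypothetical and are not output from the model built above.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; missing pairs read as 0."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


# Hypothetical target values and model predictions.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

cm = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# One row per actual class, one column per predicted class.
for a in labels:
    print(a, [cm[(a, p)] for p in labels])
```

The diagonal cells hold the correct predictions, and every off-diagonal cell is one kind of mistake.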
Table 1 shows the predictive accuracy, average accuracy, overall accuracy, and
total cost for the two models. The differences appear to be negligible, showing
that you can drop one third of the data columns without losing accuracy in the
model, possibly saving time and money.
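Two of these measures can be sketched from a confusion matrix in Python. This assumes that overall accuracy means the fraction of all cases predicted correctly and that average accuracy means the mean of the per-class accuracies; those definitions are assumptions here, and the counts are hypothetical rather than taken from Table 1.

```python
def overall_accuracy(cm):
    """Fraction of all cases on the diagonal of a nested {actual: {predicted: count}} matrix."""
    total = sum(sum(row.values()) for row in cm.values())
    correct = sum(row.get(label, 0) for label, row in cm.items())
    return correct / total


def average_accuracy(cm):
    """Mean of per-class accuracy, so each class counts equally regardless of size."""
    per_class = [row.get(label, 0) / sum(row.values()) for label, row in cm.items()]
    return sum(per_class) / len(per_class)


# Hypothetical counts: 100 cases of class 0 and 20 cases of class 1.
cm = {
    0: {0: 90, 1: 10},
    1: {0: 5, 1: 15},
}

overall = overall_accuracy(cm)
average = average_accuracy(cm)
print(f"overall accuracy: {overall:.3f}")
print(f"average accuracy: {average:.3f}")
```

When one class dominates, the two numbers can diverge noticeably, which is why it helps to compare models on both.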
Now let's return to the forest cover dataset. Pick Attribute Importance under
Activity Build to find the predictor attributes that may have the most effect in
our model.
1. Choose COVER_TYPE_IMP as the case table,
and Compound or None for the Unique Identifier.
2. Select target (forest cover) as the
target column and make sure that it is set properly as a categorical mining
type.
3. Type in a name for the Mining Activity
and view the advanced settings before running the activity. We will not change
any of the default values for this analysis.
4. Click "Finish" when you are ready to
create the activity. This model may take a while to run, since the dataset is
large and there is no unique identifier. You can view the progress of the steps
"Sample", "Discretize" and "Build" as they run.
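As a rough picture of what a discretize step does, the Python sketch below assigns numeric values to equal-width bins. The binning strategy ODMr actually applies is not stated here, so equal-width binning is an assumption chosen for simplicity, and the values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value lands in the last bin instead of one past it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute values.
values = [0.0, 2.5, 5.0, 7.5, 10.0]
bins = equal_width_bins(values, 4)
print(bins)  # [0, 1, 2, 3, 3]
```

Binning turns a continuous attribute into a small set of categories, which is the form that counting-based methods such as Naïve Bayes work with.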