This is an excerpt from Dr. Ham's premier book "Oracle
Data Mining: Mining Gold from your Warehouse".
The Decision Tree algorithm splits the case data by choosing, at each
branching point, the attribute and split value that best separate the target
classes. As the data is split, a homogeneity metric is applied so that the
records in each resulting node belong predominantly to one target value. By
default, branching stops when the algorithm has created 7 levels of branches
in the tree.
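To make the idea of a homogeneity metric concrete, here is a minimal sketch of two widely used measures, Gini impurity and entropy. The function names and the toy label lists are illustrative assumptions, not ODMr output, and the sketch does not claim to reproduce Oracle's internal calculation.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger when classes are mixed.
    Assumes a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: 0.0 for a pure node.
    Assumes a non-empty list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

# Toy nodes (invented): one pure node and one mixed node.
pure_node = [3, 3, 3, 3]
mixed_node = [3, 3, 1, 2]

print(gini(pure_node), gini(mixed_node))
print(entropy(pure_node), entropy(mixed_node))
```

Both measures score the pure node at zero and the mixed node higher, which is the property a tree builder relies on when it compares candidate splits.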
The Data Mining Build Activity is identical to those we created for Naïve
Bayes and Adaptive Bayes Network. Under Advanced Settings on the Final Step
page, the build settings have options for the homogeneity metric, maximum
depth, minimum records in a node, minimum percent of records in a node,
minimum records for a split, and minimum percent of records for a split.
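The sketch below shows one way these stopping settings could interact when a tree is grown. The field names are hypothetical, only the maximum depth of 7 comes from the text, the other numeric values are placeholders, and the way the record-count and percentage limits are combined is an assumption rather than a description of ODMr's internals.

```python
from dataclasses import dataclass

@dataclass
class TreeSettings:
    # Hypothetical names. Only max_depth = 7 is taken from the text;
    # the remaining values are placeholders for illustration.
    max_depth: int = 7
    min_records_in_node: int = 10
    min_pct_in_node: float = 0.05      # percent of all build records
    min_records_for_split: int = 20
    min_pct_for_split: float = 0.1     # percent of all build records

def may_split(node_size, depth, total_records, s):
    """Return True if a node is still eligible to be split."""
    if depth >= s.max_depth:
        return False
    if node_size < s.min_records_for_split:
        return False
    return 100.0 * node_size / total_records >= s.min_pct_for_split

def child_allowed(child_size, total_records, s):
    """Return True if a proposed child node is large enough to keep."""
    if child_size < s.min_records_in_node:
        return False
    return 100.0 * child_size / total_records >= s.min_pct_in_node

settings = TreeSettings()
print(may_split(node_size=1000, depth=3, total_records=100_000, s=settings))
print(may_split(node_size=1000, depth=7, total_records=100_000, s=settings))
print(child_allowed(child_size=5, total_records=100_000, s=settings))
```

Raising the minimum-record settings or lowering the maximum depth makes the tree smaller and the rule list shorter, at the cost of coarser rules.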
Using the forest cover data, we’ll construct a new Build Activity for
Classification using the Decision Tree algorithm, keeping the default settings
as shown. We’ll also use priors in the classification build to target
ponderosa pines. Set the sample size at 200,000 cases, using the stratified
sampling type.
When the build finishes, click on the result in the Test Metrics section. Here
we see that the predictive confidence is good at 38%, the average accuracy is
47%, and the overall accuracy is 71%. However, the accuracy in predicting
ponderosa pines (Target = 3) is greatly improved at 86.5% as a result of using
Priors in the model build.
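The gap between overall accuracy and average accuracy is easier to see with a small calculation. The sketch below computes both from a toy confusion matrix (invented numbers, not the forest cover results). The predictive confidence formula is an assumption: it compares the model's average per-class error with that of a naive model guessing among the classes. Under that assumption, and given that the forest cover data has seven cover types, an average accuracy of 47% works out to roughly 38%, which is consistent with the figures reported above.

```python
def overall_accuracy(cm):
    """Fraction of all cases predicted correctly (rows = actual class)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def average_accuracy(cm):
    """Mean of the per-class accuracies, so every class counts equally."""
    per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(per_class) / len(per_class)

def predictive_confidence(avg_accuracy, n_classes):
    """Assumed definition: improvement over a naive model whose average
    per-class error is 1 - 1/n_classes, floored at zero."""
    naive_error = 1.0 - 1.0 / n_classes
    return max(0.0, 1.0 - (1.0 - avg_accuracy) / naive_error)

# Toy confusion matrix (invented): a common class and a rare class.
cm = [[90, 10],
      [30, 20]]

print(overall_accuracy(cm))   # dominated by the common class
print(average_accuracy(cm))   # pulled down by the rare class
print(round(predictive_confidence(0.47, 7), 2))
```

When one class dominates the data, overall accuracy can look healthy while average accuracy stays low, which is why the two figures in the Test Metrics differ so much.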
If we look at the Results under the Build Activity section, we see the
classification tree and the set of rules for classifying forest cover. For
example, highlighting the 59th row shows one of the rules for predicting
ponderosa pines (Target = 3):
IF
Hillshade_am <= 213.5 AND
Elevation <= 2408.5 AND
ST2 is in 0 AND
Hz_dist_hyd <= 15.0 AND
Elevation <= 2513.5 AND
Elevation <= 3044.5
THEN Class = 3.
For this rule, there are 207 cases with 0.17% support and 38% confidence. The
predicted value (3) is the target value of the majority of records in that
node. Confidence is the percentage of records in the node having the predicted
target value. Support is the percentage of cases in the dataset satisfying the
rule for that node.
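As an illustration of these definitions, the sketch below encodes the rule above as a predicate and computes support and confidence over a handful of records. The records are invented for the example, and the function names are hypothetical. The three Elevation conditions are kept exactly as listed in the rule, although the last two are implied by the first.

```python
def ponderosa_rule(r):
    """The rule from the text, condition by condition."""
    return (r["Hillshade_am"] <= 213.5
            and r["Elevation"] <= 2408.5
            and r["ST2"] == 0            # "ST2 is in 0"
            and r["Hz_dist_hyd"] <= 15.0
            and r["Elevation"] <= 2513.5  # implied by Elevation <= 2408.5
            and r["Elevation"] <= 3044.5)  # implied by Elevation <= 2408.5

def support_and_confidence(records, rule, predicted):
    """Support: percent of all records that satisfy the rule.
    Confidence: percent of those records whose target is the predicted value."""
    matched = [r for r in records if rule(r)]
    support = 100.0 * len(matched) / len(records)
    if not matched:
        return support, 0.0
    hits = sum(1 for r in matched if r["Target"] == predicted)
    return support, 100.0 * hits / len(matched)

# Invented records for illustration only.
records = [
    {"Hillshade_am": 200, "Elevation": 2300, "ST2": 0, "Hz_dist_hyd": 10, "Target": 3},
    {"Hillshade_am": 210, "Elevation": 2400, "ST2": 0, "Hz_dist_hyd": 0, "Target": 6},
    {"Hillshade_am": 220, "Elevation": 2300, "ST2": 0, "Hz_dist_hyd": 10, "Target": 3},
    {"Hillshade_am": 200, "Elevation": 3100, "ST2": 0, "Hz_dist_hyd": 10, "Target": 1},
]

print(support_and_confidence(records, ponderosa_rule, predicted=3))
```

A rule can have low support and still be useful if its confidence is high, since support only says how many cases reach the node, not how reliably they are classified.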
Decision Tree Classification Rules
Decision tree classification is popular because of these easily understandable
classification rules. Scroll down to examine the 116 rules available for this
model.
The classifier chose Elevation for the first split, with a splitting value of
3044.5. The data is now divided into two sets, one with Elevation <= 3044.5
and the other with Elevation > 3044.5.
Each subset produced by the split is more homogeneous than the data before the
split, although this is difficult to see in this example due to the complexity
of the dataset.
The Decision Tree chose this attribute for the split after examining all the
possible split values for each variable.
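The search over all possible split values can be sketched as an exhaustive loop. This is a minimal illustration, assuming Gini impurity as the homogeneity metric and midpoints between consecutive distinct values as candidate thresholds. Both choices, the function names, and the toy data are assumptions; the text does not describe how Oracle performs the search internally.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, attributes):
    """Try every candidate threshold on every attribute and return
    (attribute, threshold, weighted_impurity) for the best binary split."""
    best = None
    n = len(rows)
    for attr in attributes:
        values = sorted({row[attr] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[attr] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[attr] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (attr, threshold, score)
    return best

# Toy data (invented): Elevation separates the classes cleanly, Slope does not.
rows = [
    {"Elevation": 2300, "Slope": 10},
    {"Elevation": 2500, "Slope": 30},
    {"Elevation": 3100, "Slope": 12},
    {"Elevation": 3300, "Slope": 28},
]
labels = [3, 3, 1, 1]

print(best_split(rows, labels, ["Elevation", "Slope"]))
```

In this toy data the split on Elevation produces two pure subsets, so it wins over every split on Slope. A full tree builder would apply the same search recursively to each subset until a stopping setting is reached.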
Check the box “Show Leaves Only” to display only the terminal nodes, or
leaves. These are the nodes used to make the prediction when the model is
applied to new data. Because the Decision Tree is sensitive to missing values
when applied to new data, ODMr will assign a surrogate attribute if the
attribute is missing in the apply data.
By highlighting the leaves and clicking the radio button for Surrogate, you
can see that ODMr will substitute HILLSHADE_PM or ASPECT in place of
HILLSHADE_AM, since these attributes are highly correlated with each other.
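The surrogate idea can be sketched as a fallback chain: try the primary split attribute first, and if its value is missing, use the next correlated attribute instead. In the sketch below, only the HILLSHADE_AM threshold of 213.5 comes from the rule shown earlier. The surrogate thresholds, the direction flags, and the function name are invented for illustration and are not what ODMr reports.

```python
def route(record, splits):
    """Decide which branch a record follows.

    splits is an ordered list of (attribute, threshold, low_goes_left):
    the primary split first, then its surrogates. The first attribute with
    a non-missing value decides the branch. Returns None if all are missing.
    """
    for attr, threshold, low_goes_left in splits:
        value = record.get(attr)
        if value is None:
            continue  # missing in the apply data, try the next surrogate
        is_low = value <= threshold
        return "left" if is_low == low_goes_left else "right"
    return None

# Primary split followed by two surrogates. Surrogate values are hypothetical.
splits = [
    ("HILLSHADE_AM", 213.5, True),
    ("HILLSHADE_PM", 140.0, False),  # assumed to vary inversely with HILLSHADE_AM
    ("ASPECT", 180.0, True),
]

print(route({"HILLSHADE_AM": 200}, splits))
print(route({"HILLSHADE_AM": None, "HILLSHADE_PM": 120}, splits))
print(route({"ASPECT": 90}, splits))
```

The direction flag matters because a surrogate may be correlated with the primary attribute in the opposite direction, in which case low surrogate values should follow the branch that high primary values would take.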