This is an excerpt from Dr. Ham's premier book "Oracle
Data Mining: Mining Gold from your Warehouse".
After importing the data using the ODMr Import Wizard, a Classification
Activity using SVM is initiated, setting CHAS as the target attribute. Note
that there are two classes: 1 if the tract bounds the Charles River and 0 if
not.
Of the total 506 properties in the dataset,
only 7% are next to the Charles River. The unique identifier in the case table
is OBS after changing the “OBS.” in the original dataset to “OBS” in order to
eliminate problems with having a “.” in the column headings. The preferred
target value is 1, and we’ll keep all the default advanced settings.
The results of the SVM classification activity show that the model's
predictive accuracy is in the best range at 69%, with 80% of the preferred
target class 1 correctly classified and 89% of class 0 correctly classified.
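The 69% figure is consistent with a confidence score derived from the two per-class accuracies. The following is a minimal sketch in Python, assuming the score compares the model's average per-class error with that of a naive model that guesses among the classes at random. The function name and the formula are illustrative assumptions, not ODMr's documented internals.

```python
def predictive_confidence(per_class_accuracy):
    """Relative improvement over a naive random guesser (assumed formula)."""
    k = len(per_class_accuracy)                 # number of target classes
    avg_accuracy = sum(per_class_accuracy) / k  # average per-class accuracy
    model_error = 1.0 - avg_accuracy
    naive_error = 1.0 - 1.0 / k                 # random guess among k classes
    return max(0.0, 1.0 - model_error / naive_error)


# Per-class accuracies reported above: 80% for class 1, 89% for class 0.
confidence = predictive_confidence([0.80, 0.89])
print(f"{confidence:.0%}")  # prints 69%
```

Under this assumption, a model that does no better than random guessing scores 0%, and a model that classifies every case in every class correctly scores 100%.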
Note that the SVM algorithm chose the Gaussian kernel function, so there are
no rules to examine. We’ll rebuild the model, pick the Linear kernel, and
compare the two results.
Now, we click on “Activity” and build another SVM classification model as
above. This time, however, after completing the New Activity Wizard, choose
Advanced Settings. Under the “Build” tab, select “Algorithm” settings and pick
“linear” as the kernel function, keeping the default settings for tolerance,
complexity factors, and Active Learning. Finally, click “OK” and then
“Finish”, completing the Build Activity.
Interpreting the SVM Results
Examination of the Test Metrics result shows that the Predictive Confidence is
good at 57%, slightly less than that of the model built using the Gaussian
kernel.
Click on Build Result to see the coefficients and values of the attributes
used to build the model. You can see that NOX, a measure of air pollution, was
the topmost attribute, followed by the towns of Dedham, Waltham, Dover,
Watertown, Newton, Wellesley and Boston. The positive values of the
coefficients mean that these towns are highly likely to have properties
bordering the Charles River, whereas towns like Brookline and Belmont, with
coefficients of -1, are very unlikely to be near the Charles.
A Google Map of the Dedham area shows that
indeed there are many residential areas bordering the Charles River.
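To make the role of the coefficient signs concrete, here is a minimal Python sketch of how a linear model turns town coefficients into a class prediction. Only the -1 values for Brookline and Belmont come from the Build Result described above. The Dedham value, the bias, and the `other_terms` parameter are hypothetical placeholders, and this is not ODMr's scoring code.

```python
# Town coefficients. Brookline and Belmont (-1) are from the text;
# the Dedham value is a hypothetical positive number for illustration.
TOWN_COEFFICIENTS = {
    "Dedham": 1.0,       # hypothetical: the text only says it is positive
    "Brookline": -1.0,
    "Belmont": -1.0,
}
BIAS = 0.0  # hypothetical intercept


def linear_score(town, other_terms=0.0):
    """Bias plus the town's coefficient plus the other attributes' contribution."""
    return BIAS + TOWN_COEFFICIENTS.get(town, 0.0) + other_terms


def predict_chas(town, other_terms=0.0):
    """Return 1 (bounds the Charles River) when the score is positive, else 0."""
    return 1 if linear_score(town, other_terms) > 0 else 0


for town in ("Dedham", "Brookline", "Belmont"):
    print(town, linear_score(town), predict_chas(town))
```

With all other attributes held equal, a positive town coefficient pushes the score toward class 1 and a negative one pushes it toward class 0, which is the reading applied to the towns above.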
Refining the SVM Model
But wait a minute! What if you were searching the area for housing for
yourself or a client? You are concerned about NOX having the highest
coefficient, 3.97. This model does not give any indication of whether the air
pollution index is higher or lower for properties around the Charles River,
only that it is an important factor. If you look at the Boston case dataset by
right-clicking the table name and choosing “Show Summary Single Record”,
you’ll find that NOX ranges from 0.38 to 0.87, with a mean of 0.55 and a
variance of 0.01.
NOX is a continuous, as opposed to categorical, variable, meaning that there
are an infinite number of possible values between the minimum and maximum.
ODMr shows the continuous variables as FLOAT data type, which is seen when you
click on data summary on Step 3 of the Activity Wizard. Discrete or
categorical variables can possess only exact values, and intermediate values
are not possible. To examine the effect of NOX on our Charles River target
attribute, we have a couple of options. One is to discretize NOX into High,
Low, and Medium values. ODMr has a discretize transformation that we’ll explore
in Chapter 5. For now, let’s look at the regression capabilities of the SVM
algorithm in modeling continuous variables.
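As a rough illustration of the discretization option mentioned above, the following Python sketch splits the reported NOX range into three equal-width bins. Equal-width binning and the function name are assumptions made for this example. ODMr's own discretize transformation is covered in Chapter 5 and may bin the data differently.

```python
NOX_MIN, NOX_MAX = 0.38, 0.87  # range reported in the data summary
LABELS = ["Low", "Medium", "High"]


def discretize_nox(value, lo=NOX_MIN, hi=NOX_MAX, labels=LABELS):
    """Map a continuous value to one of len(labels) equal-width bins."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    width = (hi - lo) / len(labels)
    # Clamp the index so the maximum value falls in the last bin.
    index = min(int((value - lo) / width), len(labels) - 1)
    return labels[index]


for nox in (0.38, 0.55, 0.87):
    print(nox, discretize_nox(nox))
```

With these bins, the minimum falls in Low, the mean of 0.55 falls in Medium, and the maximum falls in High.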