This is an excerpt from Dr. Ham's premier book "Oracle
Data Mining: Mining Gold from your Warehouse".
The complexity factor prevents overfitting.
If a model is built that exactly fits the dataset used in its construction, it
will predict the target attribute in that build dataset with 100% accuracy.
This sounds great until you apply the model to your test dataset and find that
the predictive accuracy is poor. In fact, the model is only useful for the build
dataset, because errant or extreme values that appear nowhere else prevent the
model from being generally applicable to new data.
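The effect can be reproduced with a small sketch. It does not use SVM or Oracle Data Mining; it assumes a deliberately over-fitted stand-in (a 1-nearest-neighbour model that memorizes the build data) and made-up data in which about 20% of the labels are flipped to play the role of errant values. All names here are illustrative.

```python
import random

def nearest_label(point, train):
    """Return the label of the build case closest to `point` (1-nearest neighbour)."""
    return min(train, key=lambda case: (case[0] - point) ** 2)[1]

def accuracy(cases, train):
    """Fraction of `cases` whose label matches the memorizing model's prediction."""
    hits = sum(1 for x, label in cases if nearest_label(x, train) == label)
    return hits / len(cases)

def make_cases(n, rng, noise=0.2):
    """Made-up data: label is 1 when x > 0.5, flipped with probability `noise`."""
    cases = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # an "errant" value the model will memorize
        cases.append((x, label))
    return cases

rng = random.Random(0)
build = make_cases(200, rng)
test = make_cases(200, rng)

# Each build case is its own nearest neighbour, so the build data is fit exactly.
build_accuracy = accuracy(build, build)
# New cases are judged against the memorized noise as well as the real pattern.
test_accuracy = accuracy(test, build)

print(f"build accuracy: {build_accuracy:.1%}")
print(f"test accuracy:  {test_accuracy:.1%}")
```

The build accuracy is perfect by construction, while the test accuracy will typically be noticeably lower, which is the gap that the complexity factor is meant to control.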
The SVM algorithm
will calculate an optimal complexity factor to prevent overfitting by
finding the best tradeoff between simplicity and complexity. If you like, you
may rebuild the model and specify a higher complexity factor than the one
chosen by SVM, especially if you find that the model is skewing (or favoring)
the prediction toward one class.
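To make the tradeoff concrete, here is a minimal sketch of the textbook soft-margin SVM objective, 0.5*||w||^2 + C * (sum of hinge losses). It assumes that the complexity factor plays the role of the constant C in that formulation; the function name and the three data points are hypothetical and are not taken from Oracle Data Mining.

```python
def svm_objective(w, b, cases, complexity):
    """Soft-margin objective: 0.5*||w||^2 + complexity * sum of hinge losses.

    `cases` is a list of (x, y) pairs with y equal to +1 or -1.
    `complexity` is assumed to correspond to the complexity factor (C).
    """
    simplicity_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in cases:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge_total += max(0.0, 1.0 - y * score)  # zero when the case is safely classified
    return simplicity_term + complexity * hinge_total

# Hypothetical data: two well-separated cases and one on the wrong side.
cases = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
w, b = (1.0, 0.0), 0.0

low = svm_objective(w, b, cases, complexity=0.5)
high = svm_objective(w, b, cases, complexity=10.0)
print(low, high)
```

With a low factor the misclassified case adds little to the objective, so a simple model is tolerated; with a high factor the same error dominates, pushing the optimizer toward a model that fits the build data more closely.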
Active Learning maintains accuracy while
speeding up the model build, and should not be disabled.
Sample SVM Activity
For this exercise, we will keep the default
settings. When the Build and Test Activity steps are completed, click on Result
in the Build section. There we see that ODMr used the Gaussian kernel function
to build the model. Click on Weights and note that all seasons had equal
weight. The test results show that the predictive confidence is in the good
range at 46.3%.
A look at the accuracy of the model indicates
that the spring season had the fewest correct predictions, and that
many spring days were actually classified as winter. A more accurate model
might be constructed by changing the months designated for the different
seasons, guided by better knowledge of Irish weather, but the point here is that
the model can differentiate calendar months simply by examining wind speed
data.
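The kind of per-season breakdown described above can be sketched as a confusion count. The labels below are made up purely for illustration and are not the book's results; the function names are hypothetical as well.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))

def per_class_accuracy(actual, predicted):
    """Fraction of cases in each actual class that were predicted correctly."""
    counts = confusion_counts(actual, predicted)
    totals = Counter(actual)
    return {label: counts[(label, label)] / totals[label] for label in totals}

# Made-up labels for illustration only.
actual = ["spring", "spring", "spring", "winter", "winter", "summer", "fall"]
predicted = ["winter", "winter", "spring", "winter", "winter", "summer", "fall"]

counts = confusion_counts(actual, predicted)
by_class = per_class_accuracy(actual, predicted)

print("spring classified as winter:", counts[("spring", "winter")])
for season, value in by_class.items():
    print(f"{season}: {value:.0%} correct")
```

Reading the off-diagonal pairs, such as spring predicted as winter, shows which classes the model confuses, which a single overall accuracy figure hides.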
Rebuilding the model and
forcing SVM to use the linear kernel resulted in a very poor model for
this dataset, and reduced predictive accuracy
to 11.4%. The cost of the
linear kernel was 598, compared to 502 for the Gaussian; the higher cost
confirms that the linear model was the less accurate of the two.
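For readers unfamiliar with the two kernels, the following sketch gives their textbook definitions. It assumes the common Gaussian form exp(-||x - z||^2 / (2 * sigma^2)); the exact parameterization used internally by Oracle Data Mining may differ, and the function names are illustrative.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the plain dot product of two attribute vectors."""
    return sum(xi * zi for xi, zi in zip(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel: similarity decays with squared distance between the vectors."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

a = (1.0, 2.0)
near = (1.1, 2.1)
far = (5.0, 9.0)

print(linear_kernel(a, (3.0, 4.0)))
print(gaussian_kernel(a, near), gaussian_kernel(a, far))
```

The linear kernel can only separate classes with a flat boundary in the original attributes, while the Gaussian kernel scores cases by how close they are to one another, which allows curved boundaries. That difference is one plausible reason the two kernels can perform so differently on the same data.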
Examining the coefficients of the attributes is not very
revealing for this dataset, except to point out that even though “month” was
explicitly built into the definition of “season”, it was not as important as
wind speeds in predicting the season in which the data was recorded.
Usually, when you derive a new
attribute from one of the existing attributes, you’ll want to exclude the
original attribute from the model; in this example, month is very highly
correlated with the target attribute, season, which was derived from it. Note
that there are different coefficients and rankings of attributes for
each season (fall, winter, spring and summer), and that the values of the
coefficients are very small.
To demonstrate a better model using the
linear kernel of SVM, we’ll import the Boston house price dataset from
http://lib.stat.cmu.edu/datasets.
This data is from the publication: Harrison, D. and Rubinfeld, D.L.,
'Hedonic prices and the demand for clean air', J. Environ. Economics &
Management, vol. 5, 81-102, 1978.
There are 20 attributes in this example:
OBS unique identifier for each case
TOWN town where area is located
TOWN# numeric identifier of the town
TRACT tract number
LON longitude
LAT latitude
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV median value of owner-occupied homes in $1000's
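As a rough sketch of how these attributes might be organized before building a model, the snippet below lists the 20 column names and separates out likely predictors. Treating MEDV as the target and treating OBS, TOWN, TOWN# and TRACT as identifiers to leave out are assumptions made for illustration; the helper name is hypothetical.

```python
# The 20 attribute names, in the order listed above.
COLUMNS = [
    "OBS", "TOWN", "TOWN#", "TRACT", "LON", "LAT", "CRIM", "ZN", "INDUS",
    "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B",
    "LSTAT", "MEDV",
]

TARGET = "MEDV"  # assumption: median home value is the attribute to predict
IDENTIFIERS = {"OBS", "TOWN", "TOWN#", "TRACT"}  # assumption: labels, not predictors

def predictor_columns(columns=COLUMNS, target=TARGET, identifiers=IDENTIFIERS):
    """Return the columns that remain after dropping the target and identifiers."""
    return [name for name in columns if name != target and name not in identifiers]

predictors = predictor_columns()
print(len(COLUMNS), "attributes,", len(predictors), "candidate predictors")
print(predictors)
```

Identifier-like columns take a different value for nearly every case, so leaving them in usually adds noise rather than predictive signal.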