This is an excerpt from Dr. Ham's premier book "Oracle Data Mining:
Mining Gold from your Warehouse".
The SVM algorithm is a useful method for predicting the value of a
continuous attribute. To build the regression model, choose Build
from the Activity tool, and pick Regression as the function type.
Note that Support Vector Machine is the only algorithm available for
regression.
We will continue with the wizard as before, choosing OBS as the
unique identifier and NOX as the target attribute. Under Advanced
Settings, the tabs are the same as for the other SVM algorithms, with
the exception of the Build settings. SVM will select and optimize all
parameters, such as kernel function and tolerance, so we’ll keep the
default settings and go ahead and build the model.
Building the New SVM Model
The Build results show that SVM chose the Gaussian kernel
for the algorithm, and the predictive confidence of the resulting
model is between good and best at 66%. There are several new
measures available in the results of the regression model that
indicate the “goodness of fit” of the model.
A good fit explains a high proportion of variability in the data,
and is able to predict new cases with high certainty.
ODMr provides both graphical and statistical estimates of goodness of
fit: a graphic plot of residuals and a calculation of the root mean
square error. Note that there is a residual plot available in the
Build Activity.
Residuals are the differences between the actual and predicted
values. If the residuals are randomly distributed around zero,
then the model is a good fit. Click on Result in the Residual
Plot box to see the graphic.
Dots on the red zero line mean that the value was an exact
prediction, whereas dots above and below the line show the relative
error of the prediction. You can see that the dots are randomly
scattered until around NOX = 0.55, where the error of the predictions
begins to vary considerably. This indicates that the model is much
more accurate for lower nitric oxide concentrations than for higher
concentrations. You can mouse over a data point to see the actual and
predicted values. For point 150, for example, the actual value
(x-axis) was 0.614 and the model predicted 0.7018.
If you were building regression models for air pollution, you might
want to build one model for lower levels of NOX and another one for
levels exceeding 0.5.
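
The residual calculation and the idea of separating low and high NOX
cases can be sketched in a few lines of Python. This is a minimal
illustration, not ODMr's implementation: the function names
`residual` and `split_by_actual` are hypothetical, the sign
convention (actual minus predicted) is an assumption, and only the
values for point 150 come from the text above.

```python
def residual(actual: float, predicted: float) -> float:
    """Return the difference between the actual and predicted value.

    The sign convention (actual - predicted) is an assumption;
    ODMr may plot the opposite sign.
    """
    return actual - predicted


def split_by_actual(pairs, threshold=0.5):
    """Split (actual, predicted) pairs at a threshold on the actual value.

    Pairs whose actual value exceeds the threshold go into the second
    list, so that each group could be modeled or evaluated separately.
    """
    low = [(a, p) for a, p in pairs if a <= threshold]
    high = [(a, p) for a, p in pairs if a > threshold]
    return low, high


# Point 150 from the residual plot: actual 0.614, predicted 0.7018.
print(round(residual(0.614, 0.7018), 4))  # -0.0878

# The same point falls into the "high" group for a 0.5 threshold.
low, high = split_by_actual([(0.614, 0.7018)])
print(len(low), len(high))  # 0 1
```

With the full test dataset, the two lists returned by
`split_by_actual` would show how many cases each of the two models
would have to cover.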
Linear Regression Analysis
Checking the Predicted circle at the bottom right of the residual
plot will toggle between the actual and predicted plots. The graph
shows the predicted values on the x-axis and shows which predictions
can be trusted the most. As in the actual residual plot, the graph
indicates that predicted values over 0.5 are inaccurate. For a
predicted value of 0.7, the actual value may be very close to 0.7, or
it could be 0.8 or 0.6. For predicted values of 0.6 or less, however,
the actual value will be very close to the prediction. Clicking on
residual plot data will show a listing of the actual and predicted
values for the test dataset.
The statistical measures of goodness of fit are found under the Test
Metric Result. Here we have the Root Mean Square Error (RMSE), also
known as the standard error of the regression. An RMSE closer to zero
means that the model is a better predictor. Compare this result to
that of using TAX as the target for a regression model. Here we see
that the majority of points are tightly clustered around the red zero
line, the RMSE is 0.0418, and the predictive confidence is very good
at 87%.
This is a highly accurate model for predicting tax rates for
properties in the Boston housing dataset.
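
For reference, RMSE is the square root of the mean of the squared
residuals. The sketch below is a generic Python illustration rather
than ODMr's own code; the function name `rmse` and the sample values
are hypothetical and are not taken from the Boston housing data.

```python
import math


def rmse(actual, predicted):
    """Root mean square error of paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    # Square each residual, average the squares, then take the root.
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values, for illustration only.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(rmse(actual, predicted))

# Perfect predictions give an RMSE of exactly zero.
print(rmse(actual, actual))  # 0.0
```

Because the residuals are squared before averaging, a few large
misses raise the RMSE more than many small ones, which is why the
widely scattered high NOX predictions weigh heavily on that model's
score.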