This is an excerpt from Dr. Ham's premier book "Oracle
Data Mining: Mining Gold from your Warehouse".
Now we’ll execute the Build Activity and
choose the Support Vector Machine algorithm
under the Classification Function. Select “season” as the target class, and
pick your favorite season as the preferred target value. All steps in the
Activity Wizard are the same as for Naïve Bayes,
Adaptive Bayes Network, and Decision Tree until we come to the Advanced Settings
dialog. For the SVM algorithm, there are new tabs for Outlier Treatment, Missing
Values, and Normalize.
SVM may be adversely affected by extreme or
“outlier” values in the case data table, so they need to be handled before the
model is built. ODMr lets you define what counts as an outlier by specifying the
number of standard deviations, the percentage of values in the upper and lower
tails of the distribution, or an explicit cutoff value that you type in. The
“Replace with” option gives you the choice of either replacing or discarding the
extreme values. The default is to use standard deviation.
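To make the standard deviation option concrete, here is a minimal Python sketch of the idea. It is not ODMr's implementation: the function name, the `n_std` and `discard` parameters, and the choice to replace an outlier with the cutoff value itself are assumptions made for illustration.

```python
from statistics import mean, pstdev


def treat_outliers(values, n_std=3.0, discard=False):
    """Handle values farther than n_std standard deviations from the mean.

    Hypothetical helper: with discard=True the outliers are dropped,
    otherwise they are replaced by the nearest cutoff (an assumption
    about what "replace" means, not a statement of ODMr's behavior).
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    lower, upper = mu - n_std * sigma, mu + n_std * sigma
    if discard:
        return [v for v in values if lower <= v <= upper]
    # Clip each value into the [lower, upper] range.
    return [min(max(v, lower), upper) for v in values]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
clipped = treat_outliers(readings, n_std=2.0)
kept = treat_outliers(readings, n_std=2.0, discard=True)
print(len(clipped), len(kept))
```

Replacing keeps the row count unchanged, while discarding shrinks the data set, which matters when only a few cases are available.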
Missing Values in SVM Analysis
Missing values must also be addressed, and
under the Missing Values tab you can replace numeric types of data with the
mean, minimum value, maximum value, a custom value that you type in, or simply
drop the attribute if the column is null. For categorical data you can replace
the value with the mode, which is the most frequently occurring value, or a
custom value that you type in. The default is to replace missing values with
the mean for numerical attributes and the mode for categorical attributes.
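The default replacement rule can be sketched in a few lines of Python. This is only an illustration of mean and mode imputation, not ODMr's code; the function name, the use of `None` to mark a missing value, and the `custom` parameter are assumptions.

```python
from statistics import mean, mode


def fill_missing(column, numeric=True, custom=None):
    """Replace None entries in a column (hypothetical helper).

    Numeric columns get the mean of the values that are present,
    categorical columns get the mode; a custom value overrides both.
    """
    present = [v for v in column if v is not None]
    if custom is not None:
        fill = custom
    elif numeric:
        fill = mean(present)
    else:
        fill = mode(present)  # most frequently occurring value
    return [fill if v is None else v for v in column]


print(fill_missing([1.0, None, 3.0]))
print(fill_missing(["winter", None, "winter", "summer"], numeric=False))
```

A column with no values at all cannot be filled this way, which is the case where dropping the attribute is the sensible choice.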
Sparse Data in SVM Analysis
What is the difference between missing values
and sparse data? In the Irish wind data, there is data for every row and every
attribute; the data is neither missing nor sparse. But let’s say that you are
analyzing patients who are hospitalized for reasons that may have been
preventable, such as a hospital admission for complications arising from a
chronic disease such as diabetes. Fortunately such admissions are rare compared
with most, so the target data is considered sparse in relation to the entire
universe of hospitalizations. Normally you won’t impose a missing value on
sparse data, but you can if you want to by un-checking the box at the top of the
Missing Values screen.
Normalization of SVM Data
SVM requires that all numerical data be
normalized, which further reduces the variability in the raw data. Min/Max is
the default method for normalization,
where all values are re-coded in the range of 0 to 1. Z-score is a good choice
for normalization if you have chosen to keep outliers in your dataset.
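The two normalization schemes are simple to express in code. The following Python sketch shows the standard formulas only; the function names are hypothetical, and the handling of a constant column (returning zeros) is an assumption rather than documented ODMr behavior.

```python
from statistics import mean, pstdev


def min_max(values):
    """Re-code values into the range 0 to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant column (assumed handling)
    return [(v - lo) / span for v in values]


def z_score(values):
    """Re-code values as distances from the mean in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column (assumed handling)
    return [(v - mu) / sigma for v in values]


wind_speeds = [2, 4, 6]  # made-up sample values
print(min_max(wind_speeds))
print(z_score(wind_speeds))
```

With min/max, a single extreme value sets the end of the scale and squeezes every other value toward the opposite end, which is one reason z-score is attractive when outliers are kept.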
Linear and Gaussian Normalization
In the Build options, the kernel used in the
algorithm can be determined by ODMr, or you can specify linear or Gaussian.
If linear is used for the kernel, the coefficients
for each attribute used to build the model will be rank-ordered and you can see
which ones contribute the most in determining the target class. The tolerance
value tells the algorithm when to stop building the model; increasing this
value will build the model faster, but the model may be less accurate.
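For readers who want to see what the two kernel choices compute, here is a small Python sketch using the textbook definitions. It is not taken from ODMr: the function names, the `sigma` parameter, the attribute names, and the decision to rank coefficients by absolute size are all assumptions for illustration.

```python
import math


def linear_kernel(x, y):
    """Dot product of two attribute vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma=1.0):
    """Similarity that decays with squared distance between the vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def rank_coefficients(coefficients):
    """Order attributes by the absolute size of their linear coefficient."""
    return sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)


print(linear_kernel([1, 2], [3, 4]))
print(gaussian_kernel([1, 2], [1, 2]))
# Hypothetical attribute names and coefficient values.
print(rank_coefficients({"attr_a": 0.2, "attr_b": -0.9, "attr_c": 0.5}))
```

Only the linear kernel yields one coefficient per attribute, which is why the rank-ordered list is available for linear models.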