This is an excerpt from Dr. Ham's premier book "Oracle Data Mining:
Mining Gold from your Warehouse".
Taking the high variability of the NOX data into account, we’re
still not sure whether settling near the Charles River would be
associated with higher air pollution or not. Running the
following SQL statements for the towns with the highest and lowest
coefficients in our linear SVM model reveals that towns more likely
to be near the river have, on average, lower nitric oxide
concentrations than those that are not.
select AVG(NOX) from "BOSTON_PRICE"
where TOWN IN ('Somerville', 'Arlington', 'Belmont', 'Brookline');
Result: 5.822

select AVG(NOX) from "BOSTON_PRICE"
where TOWN IN ('Dedham', 'Waltham', 'Dover', 'Watertown');
Result: 4.905
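The two averages can also be computed in a single pass by labelling each town with its group. The following is only a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Oracle, the function name and group labels are hypothetical, and the rows inserted at the bottom are invented placeholder values (not the Boston data) that exist solely to make the example runnable.

```python
import sqlite3

# Town lists are taken from the two queries in the text.
GROUP_A = ("Somerville", "Arlington", "Belmont", "Brookline")
GROUP_B = ("Dedham", "Waltham", "Dover", "Watertown")


def avg_nox_by_group(conn):
    """Return {group_label: average NOX} for the two town groups."""
    marks_a = ",".join("?" * len(GROUP_A))
    marks_b = ",".join("?" * len(GROUP_B))
    sql = f"""
        SELECT CASE WHEN TOWN IN ({marks_a}) THEN 'first query'
                    ELSE 'second query' END AS town_group,
               AVG(NOX) AS avg_nox
        FROM "BOSTON_PRICE"
        WHERE TOWN IN ({marks_a}) OR TOWN IN ({marks_b})
        GROUP BY town_group
    """
    # Parameter order follows the placeholders: CASE list, then both WHERE lists.
    params = GROUP_A + GROUP_A + GROUP_B
    return dict(conn.execute(sql, params).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "BOSTON_PRICE" (TOWN TEXT, NOX REAL)')
# Placeholder rows only; these are NOT the real Boston housing values.
conn.executemany(
    'INSERT INTO "BOSTON_PRICE" VALUES (?, ?)',
    [("Somerville", 6.0), ("Arlington", 4.0),
     ("Dedham", 5.0), ("Dover", 3.0), ("Boston", 9.0)],
)
averages = avg_nox_by_group(conn)
print(averages)
```

Grouping in one statement keeps both averages tied to the same snapshot of the table; against the real BOSTON_PRICE table the same SELECT should be usable in Oracle with literal town lists in place of the placeholders.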
Given these results, we can
conclude that lower levels of pollution appear to be associated with
properties that border the Charles River, and that higher levels of
pollution are highly variable and perhaps not well modeled with the
20 attributes in our case dataset.
Using Text Data in SVM Predictive Models
We have shown the usefulness of the SVM algorithm for modeling
categorical and continuous data; we’ll now examine how to use
text data in our predictive models. The dataset is found at
http://kdd.ics.uci.edu/summary.task.type.html and is the
Syskill Webert Web Data, which contains the HTML source of web pages
along with the ratings of a single user on these web pages.
The web pages are on four separate subjects: bands of recording
artists, goats, sheep, and biomedical. The user looked at each web
page and rated the content on a three-point scale (hot, medium,
cold). However, there were very few ratings for medium.
The web rating data is organized into four folders: bands,
biomedical, goats, and sheep. Each folder has a number of files
containing web page content and a single file named index, which
relates viewer ratings to each of the web pages. We will create a
table with the web content stored as CLOB type data, and then match
this with the index file so that we have an ID field, viewer rating,
category of web page, and the web page content.
The steps in arriving at this final SVM table are as follows:
1. Import the index table for each subject using the import wizard
in ODMr.
2. Create a table for importing the CLOB data.
3. Use sqlldr to import the web content as CLOB fields.
4. Create views for each category of web page by joining the index
and CLOB tables for each subject.
5. Union all four views together into a final table.
6. Create a unique identifier for the cases in this table.
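Steps 4 through 6 can be sketched as follows. This is a hedged illustration rather than the book's code: it runs on Python's built-in sqlite3 module instead of Oracle (so TEXT stands in for CLOB and the sqlldr load of step 3 is replaced by plain inserts), and the names web_content_*, web_view_*, web_svm, and the file_name join key are assumptions made for the example. Only the web_rating_* table names and the index columns come from the text.

```python
import sqlite3

SUBJECTS = ("bands", "biomed", "goats", "sheep")


def build_svm_table(conn):
    """Join index and content per subject, union the views, add an ID."""
    # Step 4: one view per subject, joining ratings to page content.
    for s in SUBJECTS:
        conn.execute(f"""
            CREATE VIEW web_view_{s} AS
            SELECT '{s}' AS category, r.rating, c.content
            FROM web_rating_{s} r
            JOIN web_content_{s} c ON c.file_name = r.file_name
        """)
    # Step 5: union all four views.
    union_sql = " UNION ALL ".join(
        f"SELECT category, rating, content FROM web_view_{s}" for s in SUBJECTS
    )
    # Step 6: the INTEGER PRIMARY KEY column supplies a unique identifier.
    conn.execute("""
        CREATE TABLE web_svm (
            id INTEGER PRIMARY KEY,
            category TEXT, rating TEXT, content TEXT
        )
    """)
    conn.execute(
        f"INSERT INTO web_svm (category, rating, content) {union_sql}"
    )


conn = sqlite3.connect(":memory:")
for s in SUBJECTS:
    conn.execute(
        f"CREATE TABLE web_rating_{s} "
        "(file_name TEXT, rating TEXT, url TEXT, date_rated TEXT, title TEXT)"
    )
    conn.execute(f"CREATE TABLE web_content_{s} (file_name TEXT, content TEXT)")
    # Placeholder rows standing in for the imported index and page content.
    conn.execute(
        f"INSERT INTO web_rating_{s} VALUES (?, ?, ?, ?, ?)",
        ("1", "hot", f"http://example.com/{s}", "", f"{s} page"),
    )
    conn.execute(
        f"INSERT INTO web_content_{s} VALUES (?, ?)",
        ("1", f"<html>{s}</html>"),
    )

build_svm_table(conn)
rows = conn.execute(
    "SELECT id, category, rating FROM web_svm ORDER BY id"
).fetchall()
print(rows)
```

Keeping a separate index and content table per subject means the join only has to match file names within one folder, which matters if the same file name appears in more than one folder.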
Using the Import Wizard in ODMr is straightforward. Rename the index
file with a “.dat” extension before attempting the import, and
specify Vertical Bar (|) as the delimiter. The field names are
file_name, rating, url, date_rated, and title. Import each file into
a separate table such as web_rating_goats, web_rating_sheep,
web_rating_bands, and web_rating_biomed.
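Before running the wizard, it can help to check that an index file really splits into five vertical-bar-delimited fields. The sketch below is an assumption-laden illustration, not part of ODMr: it uses Python's standard csv module, the function name read_index is hypothetical, and the two sample lines are invented placeholders rather than records from the dataset.

```python
import csv
import io

# Field names as listed in the text.
FIELDS = ["file_name", "rating", "url", "date_rated", "title"]


def read_index(handle):
    """Parse a vertical-bar-delimited index file into a list of dicts."""
    reader = csv.DictReader(
        handle, fieldnames=FIELDS, delimiter="|", quoting=csv.QUOTE_NONE
    )
    # Missing trailing fields come back as None; normalise them to "".
    return [{k: (row.get(k) or "").strip() for k in FIELDS} for row in reader]


# Invented sample content; a real run would open the renamed .dat file.
sample = io.StringIO(
    "1|hot|http://example.com/a.html|placeholder date|Example page A\n"
    "2|cold|http://example.com/b.html|placeholder date|Example page B\n"
)
records = read_index(sample)
print(records[0]["rating"], records[1]["title"])
```

For a real file, the same function could be called with an open file handle in place of the in-memory sample.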