This is an excerpt from Dr. Ham's premier book "Oracle Data Mining: Mining Gold from your Warehouse".
O-Cluster is a density-based algorithm that does not use distance formulas.
It is an Oracle proprietary algorithm. Technical details about the
O-Cluster algorithm can be found in Milenova and Campos's paper
“Clustering Large Databases with Numeric and Nominal Values Using Orthogonal Projections”
at
http://www.oracle.com/technology/products/bi/odm/pdf/ocluster_wnominal_data.pdf.
According to the Oracle Data Miner Tutorial, O-Cluster finds
“natural” clusters by identifying areas of density within the data,
up to the maximum number entered as a parameter. That is, the
algorithm is not forced into defining a user-specified number of
clusters, so the cluster membership is more clearly defined.
O-Cluster Sensitivity Settings
The Sensitivity setting determines how sensitive the algorithm is to differences
in the characteristics of the population. O-Cluster determines areas
of density by looking for a “valley” separating two “hills” of
density in the distribution curve of an attribute. A lower
sensitivity requires a deeper valley; a higher sensitivity allows a
shallow valley to define differences in density.
Thus, a higher sensitivity value
usually leads to a higher number of clusters. If the build
operation is very slow, you can increase the Maximum Buffer Size in an attempt to improve performance.
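The “valley between two hills” idea can be sketched in a few lines of Python. This is only an illustration of how a sensitivity value might relax the required valley depth. It is not Oracle's O-Cluster implementation, and the function name, the 0-to-1 sensitivity scale, the depth formula, and the sample histogram are all assumptions made for the example.

```python
def find_valleys(counts, sensitivity):
    """Return indices of histogram bins that separate two hills of density.

    Illustrative assumption: a bin is a valley when it is a local minimum
    and sits below the lower of the two surrounding hills by a relative
    depth of at least (1 - sensitivity), with sensitivity in [0, 1].
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0 and 1")
    required_depth = 1.0 - sensitivity  # low sensitivity -> deeper valley needed
    valleys = []
    for i in range(1, len(counts) - 1):
        left_peak = max(counts[:i])        # tallest bin to the left
        right_peak = max(counts[i + 1:])   # tallest bin to the right
        lower_hill = min(left_peak, right_peak)
        if lower_hill == 0:
            continue
        depth = (lower_hill - counts[i]) / lower_hill
        is_local_min = counts[i] <= counts[i - 1] and counts[i] <= counts[i + 1]
        if is_local_min and depth > 0 and depth >= required_depth:
            valleys.append(i)
    return valleys


# Hypothetical histogram of one attribute: two hills with a moderate dip.
histogram = [5, 20, 40, 22, 18, 25, 38, 15, 4]
low = find_valleys(histogram, 0.3)   # strict: the dip is not deep enough
high = find_valleys(histogram, 0.8)  # lenient: the dip splits the data
print(low, high)
```

With this made-up histogram, the lower sensitivity finds no split while the higher one accepts the shallow dip at bin 4, which mirrors the behavior described above: higher sensitivity tends to produce more clusters.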
For our example, we’ll use the ODMr
K-means algorithm to find clusters in the CoIL dataset, found at
http://kdd.ics.uci.edu/databases/tic/tic.html. The
build dataset used in the CoIL 2000 Challenge has
86 attributes and 5822 descriptions of customers of a Dutch
insurance company. The target attribute is #86, “CARAVAN”,
which is the number of mobile home policies.
Using K-Means for Clustering
As we start, be sure to name the file with the ‘.dat’ extension and
save it as comma delimited if you use the ODMr Import Wizard.
After importing the dataset, we will examine the histogram of the
target attribute by right-clicking the case table and choosing Show
Summary Single Record.
Examining the K-Means Data
You’ll see that there are 348 cases where CARAVAN = 1, approximately
6% of the total. In order to more clearly distinguish clusters
around the target value of interest, we’ll stratify the case table
so that we have a more even distribution of 1’s and 0’s for the
CARAVAN attribute.
Use the “Stratified Sample” transform wizard to create a new table in which
1/3 of the cases have the target attribute = 1 (having insurance) and 2/3 of the
cases are customers who don’t have mobile home insurance.
The new case table will have a total sample count of 1044 cases, 348
with insurance and a random sample of uninsured customers equaling
696 cases.
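The arithmetic behind the stratified sample can be checked with a short Python sketch. It is not what the ODMr wizard runs internally. The function name, its parameters, and the synthetic rows (which only reproduce the 5822 and 348 counts quoted above) are assumptions for illustration.

```python
import random


def stratified_sample(rows, target_key, positive_value, negatives_per_positive, seed=0):
    """Keep every positive case and draw a random sample of negatives.

    negatives_per_positive=2 gives the 1/3 to 2/3 split described above.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    positives = [r for r in rows if r[target_key] == positive_value]
    negatives = [r for r in rows if r[target_key] != positive_value]
    wanted = min(len(negatives), negatives_per_positive * len(positives))
    sampled = positives + rng.sample(negatives, wanted)
    rng.shuffle(sampled)  # avoid leaving all positives at the top
    return sampled


# Synthetic stand-in for the case table: 5822 rows, 348 with CARAVAN = 1.
rows = [{"CARAVAN": 1 if i < 348 else 0} for i in range(5822)]
sample = stratified_sample(rows, "CARAVAN", 1, negatives_per_positive=2)
n_pos = sum(r["CARAVAN"] for r in sample)
print(len(sample), n_pos, len(sample) - n_pos)  # 1044 348 696
```

Keeping all 348 insured customers and drawing two uninsured customers for each one yields 348 + 696 = 1044 cases, matching the counts above.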
Now let’s build a new cluster model using the stratified sample,
choosing K-means for the algorithm. There is no unique key for
the case data, so we choose “Compound” or “None” for the Unique
Identifier. Note that, in contrast to the classification
models, you do not choose a target variable in the Activity Wizard.
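To see why no target variable is needed, here is a minimal sketch of the textbook k-means loop in plain Python. It is not the algorithm ODMr runs internally. The function name, the evenly spaced starting centroids, and the six sample points are assumptions chosen to keep the example small and repeatable.

```python
def kmeans(points, k, iterations=20):
    """Textbook k-means on numeric tuples; no target column is involved."""
    # Assumption: start from evenly spaced rows so the run is deterministic.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical two-attribute cases forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The only inputs are the attribute values and the number of clusters. Any label such as CARAVAN would be used afterwards, to see how the target value is distributed across the clusters found.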