This is an excerpt from Dr. Ham's premier book
"Oracle
Data Mining: Mining Gold from your Warehouse".
O-Cluster,
a proprietary Oracle algorithm, has the advantage of handling large
numbers of attributes (high dimensionality), and is more appropriate
for large numbers of cases (more than 500). The number of leaf
clusters is determined automatically. The points where splits
occur can provide insight to the way data is structured and is
helpful in selecting features that help discriminate among cohortsof cases.
Be aware that O-Clusterdoes not necessarily use all the input data when it
builds a model. The algorithm reads 50,000 rows in a batch and
will only read in another batch if there is reason to suspect that
more clusters exist. Therefore, O-Cluster may stop building
the model before all the data is read in, and it is highly
recommended that the data be randomized, and discretized using equi-width
binning after clipping to handle outliers. Of course, the
Build ActivityWizard in ODMrtakes care of all data preparation for you.
Applying the Cohort Cluster
Now that we have defined cohortsof our population, we?re interested in applying the
cluster definition to a new dataset. We have downloaded the
COILtest dataset from http://kdd.ics.uci.edu/databases/tic/ticeval2000.txt.
Using the copy table Wizard in ODMr,
create an exact copy of the COIL build dataset. Then delete
all rows in the table using the following SQL statements in the SQL
Worksheet:
delete from coil_test;
commit;
Next, we create a control file
for SQLLDR with the following (note that the target attribute
CARAVAN is not included):
load data
infile '\\folder\shareddocs\coil_test.dat.csv'
replace
into table dmuser_book.coil_test
fields terminated by ','
(
MOSTYPE, MAANTHUI, MGEMOMV, MGEMLEEF, MOSHOOFD,
MGODRK, MGODRP, MGODOV, MGODGE, MRELGE, MRELSA, MRELOV, MFALLEEN,
MFGEKIND, MFWEKIND, MOPLHOOG, MOPLMIDD, MOPLLAAG,
MBERHOOG,
MBERZELF,
MBERBOER, MBERMIDD, MBERARBG, MBERARBO, MSKA,
MSKB1, MSKB2, MSKC, MSKD, MHHUUR,
MHKOOP, MAUT1, MAUT2,
MAUT0, MZFONDS, MZPART, MINKM30, MINK3045,
MINK4575, MINK7512,
MINK123M, MINKGEM, MKOOPKLA,
PWAPART, PWABEDR, PWALAND,
PPERSAIT,
PBESAUT, PMOTSCO, PVRAAUT, PAANHANG, PTRACTOR,
PWERKT, PBROM, PLEVEN, PPERSONG, PGEZONG, PWAOREG,
PBRAND,
PZEILPL, PPLEZIER, PFIETS, PINBOED, PBYSTAND,
AWAPART, AWABEDR,
AWALAND, APERSAUT, ABESAUT, AMOTSCO, AVRAAUT,
AAANHANG,
ATRACTOR, AWERKT, ABROM, ALEVEN, APERSONG, AGEZONG,
AWAOREG, ABRAND, AZEILPL, APLEZIER, AFIETS, AINBOED, ABYSTAND
)
Next, execute this line of code
at the command prompt in the directory where the SQLLDR control file
is located, substituting your password and database sid as
appropriate.
C:\scripts>sqlldr
dmuser /pswd@database control=coil.ctl log=coil.log
You
will now have 4000 records for 86 attributes for the apply dataset.
Choose Activity, Apply from the menu, and select the ?Cluster Build
Model?
that we built previously.
Select the table just created for the ?Apply Data Source?, and pick
any attributes that you want included in the result set. The
COIL
test data table does not
have a unique identifier, and ODMrwill
create one for you. In step 4 of 5 in the Apply Activity
Wizard, you have the choice of Most Probable
Cluster, Specific Cluster ID, and Number of Best Cluster ID?s as
output options.
The
Most Probable
Cluster ID will assign the cluster with highest probability to each
record, the default choice. These results are shown below.
Each case in the result table is assigned to a cluster with a
certain probability attached.
When
specific clusters are chosen, the output is shown below. For
each cluster chosen, ODMr
displays a column with the probability of that case fitting into the
cluster. Selecting the records with probability > .85, for
instance, will result in a cohort of customers who fit a particular
profile.
If
you choose the Number of Best Cluster ID?s, you will get a listing
of the number of best clusters with corresponding probabilities as
shown below.