This is an excerpt from Dr. Ham's premier book "Oracle Data Mining: Mining Gold from your Warehouse".
Clustering data is a very common
technique in data mining as well as many other fields, including
statistics, bioinformatics, pattern recognition, and machine
learning. Clustering is the unsupervised classification of
data into subsets whose members share common traits. In
previous chapters we have discussed supervised classification,
meaning that a target was identified and the accuracy of the
prediction followed from how many cases were correctly classified
according to the target values. With clustering algorithms no
target is specified; you simply see what patterns the technique
discovers.
For example, in a large group of hospital patients
you may find clusters composed of those
with the same diseases, such as coronary patients, pediatric
patients, and so on. Furthermore, certain cancer patients may
exhibit a type of tumor characterized by a certain gene that is
sensitive to a specific type of drug treatment. Clustering can
reveal the characteristics of drugs, genes and the disease that may
respond best to a specific therapy.
Oracle Data Miner (ODMr) has two
algorithms for performing cluster analysis: the k-Means technique and Orthogonal Partitioning Clustering (O-Cluster).
The enhanced k-means algorithm randomly defines initial centroids,
each of which approximates a “center of gravity”, and uses a distance measure
to calculate the distance between centroids and data objects.
ODMr uses the Euclidean, Cosine, or Fast Cosine distance
metric.
From the Oracle Data Mining Forum, in response to “How
does ODM cluster algorithm work?”, posted May 2, 2006:
“ODM k-means builds a hierarchical tree. When a new cluster is
added, the parent node is replaced with two new nodes. Both children
have the same centroid as the parent except for a small perturbation
in the dimension with most variability. Then a few k-means
iterations are run on the two children and the points belonging to
the parent are distributed among the two new nodes.
There are a couple of different strategies how to choose which node
to split (e.g., size, dispersion). Once the desired number of
leaf nodes is reached, we run k-means across all leaf nodes.
The advantage is that all clusters have reasonable initial
centroids and we are unlikely to get dead/empty clusters.
We explode categorical attributes into multiple binary dimensions
and compute distances using these new dimensions.”
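The splitting procedure described in the forum post can be sketched in a few lines of plain Python. This is only an illustrative approximation and not Oracle's implementation. The function names (hierarchical_kmeans, split_node, explode_categorical), the perturbation size eps, the iteration counts, and the choice to split the leaf with the largest dispersion are all assumptions made for the example. Fast Cosine is left out because the text does not define it.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine distance: 1 - cosine similarity (1.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

def explode_categorical(values):
    """Turn one categorical column into binary indicator dimensions."""
    levels = sorted(set(values))
    return [[1.0 if v == lev else 0.0 for lev in levels] for v in values]

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def assign(points, centroids, dist):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, dist, iters=10):
    """Plain k-means iterations starting from the given centroids."""
    for _ in range(iters):
        clusters = assign(points, centroids, dist)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else old for c, old in zip(clusters, centroids)]
    return centroids, assign(points, centroids, dist)

def split_node(points, centroid, dist, eps=1e-3):
    """Replace a parent with two children that differ from the parent's
    centroid by a small perturbation in the most variable dimension."""
    spread = [sum((p[d] - centroid[d]) ** 2 for p in points)
              for d in range(len(centroid))]
    d = spread.index(max(spread))
    left, right = list(centroid), list(centroid)
    left[d] -= eps
    right[d] += eps
    return kmeans(points, [left, right], dist, iters=5)

def hierarchical_kmeans(points, k, dist=euclidean):
    leaves = [(mean(points), list(points))]  # (centroid, members)
    while len(leaves) < k:
        candidates = [i for i, (_, m) in enumerate(leaves) if len(m) > 1]
        if not candidates:
            break
        # Assumed strategy: split the leaf with the largest dispersion.
        j = max(candidates, key=lambda i: sum(
            dist(p, leaves[i][0]) ** 2 for p in leaves[i][1]))
        centroid, members = leaves.pop(j)
        child_centroids, child_members = split_node(members, centroid, dist)
        leaves.extend(zip(child_centroids, child_members))
    # Final pass: run k-means across all leaf nodes.
    return kmeans(points, [c for c, _ in leaves], dist)

points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
          [9.0, 1.0], [9.2, 1.1], [8.9, 0.8]]
centroids, clusters = hierarchical_kmeans(points, k=3)
print(sorted(len(c) for c in clusters))  # [3, 3, 3]
```

On this toy data the three groups of points are recovered as three clusters of three points each. Passing dist=cosine switches the distance measure, and the rows returned by explode_categorical can be appended to the numeric attributes of each record before clustering.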