This is an excerpt from Dr. Ham's premier book
"Oracle
Data Mining: Mining Gold from your Warehouse".
The Missing Values, Normalize, Numeric and
Outlier Treatment wizards are useful for prepping the data prior to
applying data mining algorithms. Most algorithms have a preferred
method for handling missing values, normalizing, and outliers, so in
most data mining tasks you can relax and let the Activity wizard
take care of this. In certain situations you may have unusual
anomalies and if you wish you can take advantage of these wizards to
help prepare your data for analysis.
Using the Missing Values
Transformation Wizard
In the Missing Values Numerical
Strategy, you have many choices for replacing the missing values,
including None, Mean, Max, Min, and Custom Value. The Mean
treatment replaces a missing value with the average of the values
for that attribute; max substitutes missing values with the maximum
of the values, and min replaces missing values with the minimum of
the values. The default custom value is zero; you can replace this
with any appropriate value.
If the value is NULL, you can drop
the case entirely by specifying Drop attribute.
The SQL statement is shown below that is automatically generated by
the Missing Values Transformation Wizardfor clinical patient data where missing values for
attribute ACV_CODE is replaced with the mode (‘E’), ANGINA_PROCEDURE
is replaced by ‘99’, and rows are dropped when ADULT_ASTHMA ,
BACTERIAL_PNEUMONIA, CHF, and COPD are NULL.
CREATE VIEW "DMUSER_BOOK"."AHRQ_INPT_STRAT406981843"
AS
SELECT
"ADMISSION_COUNT",
"ADMISSION_TYPE_HIGHEST",
"ADULT_ASTHMA",
"BACTERIAL_PNEUMONIA",
"CHF",
"COPD",
"DIABETES_LONG_TERM_CX",
"ER_VISIT_COUNT",
"PATIENT_KEY",
DECODE ( "ACV_CODE" , NULL,
'E' , "ACV_CODE" ) "ACV_CODE" ,
DECODE ( "ANGINA_PROCEDURE" , NULL,
99 , "ANGINA_PROCEDURE" ) "ANGINA_PROCEDURE" ,
"ASTHMA_PATIENT",
"CHF_PATIENT",
"SLEEP_APNEA_PATIENT",
FROM "DMUSER_BOOK"."AHRQ_INPT_STRAT"
WHERE
"ADMISSION_COUNT" NOT IN (
SELECT
"ADMISSION_COUNT"
FROM "DMUSER_BOOK"."AHRQ_INPT_STRAT"
WHERE "ADULT_ASTHMA" IS NULL AND "BACTERIAL_PNEUMONIA" IS
NULL AND "CHF" IS NULL AND "COPD" IS NULL )
Using the Normalize
Transformation Wizard
The Normalize transform is used to normalize
data using a predefined scheme, or you can select a transformation
for any numeric attribute. The available transformationsinclude:
(x-MIN(x)) / (MAX(x)
- MIN(x)) * (new max – new min) + new min
(x – AVG(x)) /
SQRT(VARIANCE(x))
(x / MAX(ABS(MIN(x)),
ABS(MAX(x))))
For example, if the {(x-MIN(x)) / (MAX(x) -
MIN(x)) * (new max – new min) + new min} normalizationscheme is chosen, for an average value 253.5, standard deviation
146.21, with minimum value = 1 and maximum value = 506, the
transformed average value is 0.5, standard deviation 0.29, minimum
value = 0 and maximum value = 1.
For the {(x – AVG(x)) / SQRT(VARIANCE(x))}
scheme, the transformed values average 0.0 with standard deviation
1, minimum value = -1.73 and maximum = 1.73.
The {(x / MAX(ABS(MIN(x)), ABS(MAX(x))))}
transformation results in an average of 0.5 with standard deviation
0.29, minimum value = 0 and maximum = 1.