This is an excerpt from Dr. Ham's premier book
Data Mining: Mining Gold from your Warehouse".
To customize AGE cut-offs, choose
the Define button. Finding it easier to edit one of the default
settings, choose Generate Default Bins,
and enter ?4? as the number of bins. You may have an Application
Warning appear that tells you ?3? is a better binning choice; click
OK to proceed.
You can edit the lower bounds for
each of the 4 bins listed, so let?s make the bins <25, 25-40, 40-50,
and >50 by typing in 25, 40, and 50. The resulting histogramshows that there are a disproportionate number of cases
in the 25-40 range (41%).
Although not optimal for building
a model, you may need to use this binning strategy for your business
application. By changing the lower boundary to 30 the histogramis more uniformly distributed with each of the bins
having 29, 28, 23, and 21% of the total number of cases,
In Step 5 of 5 of
Wizard, you can change the bins of the categorical attributes. For
instance, to change the CUST_MARITAL_STATUS attribute, highlight the
name and click Define.
A window with
existing strategies (if any) for that attribute appears, showing the
Bin Category and Value. You can add definitions by choosing Add, or
for a shortcut choose ?Generate Default Bins?. Keep the default
number of bins = 10 and say OK to the warning that there are fewer
than 10 bins in the case data if you choose the generate option.
The bin categories are now shown, with Married, NeverM, Divorc,
Separ, Widowed, Mabsent, and Mar-AF corresponding to the category
Let?s say that we
decide to re-bin the attribute to Married and Not_Married.
Highlight Married and click the Edit button. The Married bin
category corresponds to Married, so choose Divorc, Mabsent, Mar_AF,
Separ to add to the Married category. Click OK to complete the
Married category. Now there are 3 categories: NeverM, Widowed, and
Married. Highlight NeverM and click Edit. Re-name NeverM in the
Bin Category to Not_Married, and add Widowed to the selected Bin
After clicking OK,
you?ll see that the bin categories are now Married with values
Divorc, Mabsent, Mar-AF, Married, and Separ; Not_Married with the
values NeverM and Widowed. Click OK and Next to finish.
Right click on the
new view created, choose Show Summary Single-Record, and view the
histograms for AGE and CUST_MARITAL_STATUS.
Now, we right click the new view
and choose ?Show Lineage? to see details of the disretizetransformation.
The wizard used a CASE and DECODE
statement to create the bins.
CASE WHEN "AGE" < 30 THEN ' < 30'
WHEN "AGE" <= 40 THEN '30 - 40'
WHEN "AGE" <= 50 THEN '40 - 50'
WHEN "AGE" > 50 THEN ' > 50'
else null end) "AGE", "BOOKKEEPING_APPLICATION", "BULK_PACK_DISKETTES",
"COUNTRY_NAME", "CUST_GENDER", "CUST_ID", "CUST_INCOME_LEVEL",
"EDUCATION", "FLAT_PANEL_MONITOR", "HOME_THEATER_PACKAGE", "HOUSEHOLD_SIZE",
"PRINTER_SUPPLIES",( CASE WHEN "YRS_RESIDENCE" < 3
THEN ' < 3'
WHEN "YRS_RESIDENCE" <= 5 THEN '3 - 5'
WHEN "YRS_RESIDENCE" > 5 THEN ' > 5'
else null end) "YRS_RESIDENCE", "Y_BOX_GAMES"
Using the Discretizewizard gives you the ability to easily create new
attributes and bin the data to your liking. The view created is
used as the case table for data mining activities. How many bins do
you choose when doing the data prep depends on the nature of the
data you are processing. It is sometimes easy to categorize
variables into ?High?, ?Medium? and ?Low? to simplify interpretation
of the data mining results.