This is an excerpt from Dr. Ham's premier book
"Oracle
Data Mining: Mining Gold from your Warehouse".
The stratified sample transformation
is useful when your target attribute constitutes a small percentage
of cases. This transformation was utilized in Chapter 4 to create a
dataset for clustering. In that example we used the COIL
dataset and increased the
percentage of the target CARAVAN from 6% to 33% of the sampled data.
We kept all attribute’s where
CARAVAN = 1 (348 cases) and randomly selected cases where CARAVAN =
0 which made up 2/3 of the total sample. An alternate method is
using percentage of cases in Step 3 of the stratified sample
wizard. Click Percent of Cases, leaving 100% as the default. In
step 4 of 4, click the Equal Distribution button to create a case
dataset with a 50% distribution of the target attribute. The total
sample count is 696 cases, which is 12% of the original COIL
dataset.
The SQL used to create this transformation is
shown below:
CREATE TABLE "DMUSER_BOOK"."COIL128895151"
AS
SELECT "MOSTYPE", "MAANTHUI", "MGEMOMV", "MGEMLEEF", "MOSHOOFD", "MGODRK",
"MGODRP", "MGODOV", "MGODGE", "MRELGE", "MRELSA", "MRELOV", "MFALLEEN",
"MFGEKIND", "MFWEKIND", "MOPLHOOG", "MOPLMIDD", "MOPLLAAG", "MBERHOOG",
"MBERZELF", "MBERBOER", "MBERMIDD", "MBERARBG", "MBERARBO", "MSKA",
"MSKB1", "MSKB2", "MSKC", "MSKD", "MHHUUR", "MHKOOP", "MAUT1",
"MAUT2", "MAUT0", "MZFONDS", "MZPART", "MINKM30", "MINK3045",
"MINK4575", "MINK7512", "MINK123M", "MINKGEM", "MKOOPKLA", "PWAPART",
"PWABEDR", "PWALAND", "PPERSAIT", "PBESAUT", "PMOTSCO", "PVRAAUT",
"PAANHANG", "PTRACTOR", "PWERKT", "PBROM", "PLEVEN", "PPERSONG", "PGEZONG",
"PWAOREG", "PBRAND", "PZEILPL", "PPLEZIER", "PFIETS", "PINBOED", "PBYSTAND",
"AWAPART", "AWABEDR", "AWALAND", "APERSAUT", "ABESAUT", "AMOTSCO",
"AVRAAUT", "AAANHANG", "ATRACTOR", "AWERKT", "ABROM", "ALEVEN", "APERSONG",
"AGEZONG", "AWAOREG", "ABRAND", "AZEILPL", "APLEZIER", "AFIETS", "AINBOED",
"ABYSTAND", "CARAVAN"
FROM (SELECT /*+ no_merge */ t.*, ROWNUM RNUM FROM "DMUSER_BOOK"."COIL" t)
WHERE
( "CARAVAN"='1' and ORA_HASH(RNUM,(348-1),12345) < 348 )
OR ( "CARAVAN"='0' and ORA_HASH(RNUM,(5474-1),12345) < 348 )
Using the Filter
Single-Record Transformation Wizard
In some analysis, you may want to
examine a subset of the case dataset. You may see some interesting
results in a cohort of diabetic patients, or want to explore more
about customers in a specific region of the country. The Filter
Single Record transformation creates a view for the filtered
dataset. Filter conditions are specified using a Where clause,
which is easy to construct using the expression builder.
The resulting SQL to create the view is shown
below:
CREATE VIEW "DMUSER_BOOK"."boston_price261754772"
AS
SELECT * FROM "DMUSER_BOOK"."boston_price"
WHERE "boston_price"."CRIM" < 8.0 And "boston_price"."NOX
" > 5.5
Inside the Sample
Transformation Wizard
The Sample Transformation wizard creates a
random sample of a table or view; the generated random sample can be
either a table or a view. You can specify the sample size either as
a number of records or as a percentage of records. You can also
specify a random seed. A random sample is created by choosing a
sample of cases picked at random.
The random sample is smaller than the set that
it is based on. Using the smaller sample can result in improved
performance in building a model or performing unsupervised data
mining activities. If the sample is representative of the whole set,
results generated using the sample will be compatible with the
results generated using the whole set.