|
 |
|
Creating an Oracle Data Cleansing Architecture
Oracle Database Tips by Donald Burleson |
As we noted, replicating the decision processes
of a knowledgeable professional can appear to be a daunting task,
but daily successive refinement of the data scrubbing program can
yield remarkable results.
There are several cardinal rules for data
cleansing:
1 - Keep a
complete audit table - For every row that was unified or
scrubbed, record the record ID (Using the new Oracle10g
row timestamp feature), and record the old value, the new value,
the date of the change and the person approving the change.
2 - Facilitate the
Data Quality Officer (DQO) - Successful data scrubbing system
create a simple online interface (using tools such as HTML-DB), and
have the interface allow then to implement or reject the proposed
changes.
There are two "attack" approaches to data
cleansing, each one appropriate for the type of database:
- Brute force bots - These are
background processes that run during low-usage period that
bubble through the data looking for "questionable" data. The
rules are programmed, and this is a rudimentary form of data
mining. These low-impact bots run quietly in the back ground,
constantly scouring your database, looking for "fishy" data.
The brute-force approach is only appropriate for Oracle
environments where excess RAM and CPU are available, (to
accommodate the PGA memory region and overhead of the compiled
PL/SQL logic).
- Targeted jobs - The targeted jobs
examine the data manually and start by recognizing the
characteristics of data anomalies.
We also have to decide "when" to scrub the
data, a row-at-a-time when it enters the database (via ETL), of in a
batch, say, once per day:
- Row-at-a-time mode - In this
approach we unify and Scrub the data during ETL, one record
at-a-time, not allowing any questionable data into the
database. The downside to this approach is the massive amount
of work and the possibility 6that a later records in the ETL
feed might resolve the anomaly.
- Batch mode - This approach
runs a periodic batch job to unify and cleanse the day's new
information. This is the preferred approach for many Oracle
database because it can be performed during idle midnight hours
and the nature of data scrubbing is well-suited to a batch
approach.
|