Oracle online systems for data scrubbing
Oracle Database Tips by Donald Burleson
A Sample System for Data Cleansing
The job of the Oracle DBA is to build the
regular expressions to locate data anomalies, but you must also
create a method for the Data Quality Officer to review your finding
and implement or reject changes. Most important of all, this system
allow the Data Quality Officer an opportunity to give you
more-refined decision rules, so that subsequent jobs will have
higher precision and recall.
Any front-end for data cleansing and scrubbing
should have these features:
1 - Extensible
- The system must have stubs (usually at the main menu) to allow for
2 - Easy to use
- The DQO must be able to quickly see and approve large volumes of
data, and at their discretion, have you automate the changes (once
you have successfully replicated their human decision rules).
3 - Full auditing
- As the DQO approves unifications and cleansing, a full audit trail
is written to an Oracle table, the data is marked as unified (a flag
with a value of "u") or scrubbed.
4 - Full reject
notification - Then the data cleansing program hits a "false
positive" (The data is OK, after all), the system write this
information to a file and send the details via e-mail to the DBA and
DQO. The DOQ then explains to the DBA why the data was OK, and the
DBA refines the decision rules within the scrubbing program.
Here is a sample main menu for a data scrubbing
system (Figure 2). Note how new menu options can be added at-will
as the system becomes more sophisticated:
Figure 2 - A sample main menu
Let's take a look at the lower level
unification and scrubbing screens. As the DQO drills-down into
their system, the detail screens will display the existing and
suggested values for each "suspect" data column (Figure 3):
Figure 3 - A data unification approval
As we see, the default value for the DSS is
"just do it", making it easy for the DQO to approve large volumes of
data unifications each day. These screens can be enhanced to allow
the DQO to drill-in to the actual data, and also notify the source
database that they are sending invalid data values.
As the DSS evolves (more sophisticated
filtering rules are added to the identification program), the DSS
screens will get more sophisticated. This is especially true when
we move out of the single column arena and start advanced data
scrubbing where multiple columns must be compared into a "does
this look like that?" mode.
If we use Oracle's OWB "data profiling"
approach, we assume that the decision rules are buried inside the
data itself. Using this approach we start by manually examining the
data and attempt to focus-in on those areas where known data
It's also important to remember that some forms
of data scrubbing cannot be done without external data. For
example, if our system has non-uniform medical diagnosis codes
(ICD-9 codes), then the DSS would need to incorporate specialized
"translation tables" to unify the values.
To use a simple example, let's see that a
screen to resolve transposition name error might look like Figure
Figure4 - A sample data cleaning screen
|For expert Oracle data cleansing support and
data scrubbing consulting, use an expert from BC. We
understand the powerful Oracle data unification tools, and
we can aid in improving the data quality of any Oracle
database, large or small.