Data Unification & Cleansing with Oracle
Oracle Database Tips by Donald Burleson
Introduction to Oracle Data Cleansing Systems
The Oracle DBA is the chief custodian of the
mission-critical corporate data, and the advent of
time-saving automation tools in Oracle 10g has freed up the
Oracle DBA to focus on issues of quality within their databases.
Data validation work falls into several levels of complexity:
- Simple character validation
- Complex column format validation
- Multi-column validation
- Complex multi-table data validation
Some data quality fixes are so easy that you
may wonder why you have not found them before. For example, we know
that transposition errors are very common in people's name
columns, and you can quickly run a PL/SQL program to locate possible
transposition victims. It is easier than you think, especially when
the key table has unique "candidate keys" that can be used to confirm
a match.
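As a minimal sketch only, assuming a hypothetical customer table with a
cust_id primary key, an ssn candidate key, and first_name and last_name
columns, a query like this flags pairs of rows whose names appear swapped:

-- Hypothetical table and column names - adjust to your own schema.
-- Two rows that share the same candidate key (ssn) but have their
-- first_name and last_name values swapped are likely transposition victims.
select a.cust_id,
       a.first_name,
       a.last_name,
       b.cust_id as suspect_id
from   customer a,
       customer b
where  a.ssn        = b.ssn           -- same candidate key
and    a.cust_id   <> b.cust_id       -- different rows
and    a.first_name = b.last_name     -- names appear transposed
and    a.last_name  = b.first_name;

The candidate key is what keeps the match reliable; without it, common
name pairs would generate far too many false positives.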
Also see how
Oracle data cleansing is performed in a Decision Support System
framework, and
understanding Oracle knowledge engineering.
Let's take a look at each type of data
validation activity. Most Oracle databases perform two types of
data cleansing, namely data unification and data scrubbing:
- Data Unification - This is the
relatively mundane process of unifying column items, such as
guaranteeing uniform values for salutations ("Mr.", "Mrs."), name
suffixes ("Jr.", "III") and street type abbreviations ("Blvd",
"Ave."); a translation-table sketch of this approach appears after
this list. Oracle 10g Release 2 Warehouse Builder ("Paris")
introduces an exciting new data cleansing feature dubbed "data
profiling," where you can do simple unification such as name
and address cleansing. Best of all, OWB data profiling allows the
system to be "self-learning," using existing valid data to derive
the data validation rules. Most Oracle data warehouse DBAs
argue that data unification belongs in the ETL process, but they
differ about whether the data cleansing should be done
record-at-a-time during ETL or daily with a batch job.
- Data Scrubbing - This is the more
complex process of creating rules to identify statistical
"outliers", suspicious rows that might contain invalid data. A
"suggestion" list is created along with recommended changes, and
a human data quality expert (the Data Quality Officer, or DQO)
evaluates each suspicious row and makes a decision:
· Reject suggestion - There are two reasons for
rejecting a data cleansing suggestion: either the data is just plain
bad (and is marked with "i" for invalid), or the decision
rules are incomplete. In the latter case, the DBA gathers the reason for
the rejection and adds the new rule or exception into the
identification program.
· Accept suggestion - The data is changed and a
complete audit log is created. The data is marked as having been
scrubbed.
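Returning to the data unification bullet above, here is a minimal
translation-table sketch. The table and column names (customer_addr,
street_type, street_type_xlate) are hypothetical; the point is that new
variants are handled by adding rows to the lookup table, not by changing
code:

-- Hypothetical lookup table mapping raw street-type variants to the
-- preferred standard form.
create table street_type_xlate (
   raw_value  varchar2(30) primary key,
   std_value  varchar2(30) not null
);

insert into street_type_xlate values ('BOULEVARD', 'Blvd');
insert into street_type_xlate values ('BLVD.',     'Blvd');
insert into street_type_xlate values ('AVENUE',    'Ave');
insert into street_type_xlate values ('AVE.',      'Ave');

-- Unify every address row that has a known variant.
update customer_addr ca
set    ca.street_type = (select x.std_value
                         from   street_type_xlate x
                         where  x.raw_value = upper(ca.street_type))
where  exists (select 1
               from   street_type_xlate x
               where  x.raw_value = upper(ca.street_type));

The same pattern extends to salutations and name suffixes.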
As the DSS evolves and the rules are refined to
100% reliability, some system components are automated, providing an
"expert system" component to reduce the manual workload of the DQO.
If we start by examining known data errors to
find common patterns, the Oracle team can design programs to detect
these types of errors, quickly clean up a large number of
transposition errors, and successively refine the model to identify
less obvious data anomalies. We can also search for statistical
"outliers," data that violates the norms of the database as a whole.
A successful data cleansing project requires an 8-week commitment of
these resources:
- Data Quality Officer (DQO) - This is
the resident "data expert", someone who understands the nature of
the data and can provide "decision rules" for incorporation into
the DSS.
- Data Analyst - This is the DSS expert
who works directly with the Data Quality Officer.
- Programmers - An Oracle professional
who is fluent in PL/SQL and Java writes the main PL/SQL routines that
specify new decision rules and exceptions to the rules; a sample
rule is sketched after this list. The program may also incorporate
"translation tables" for unifying items like ICD-9 diagnosis codes.
An online system programmer is also required to build the HTML-DB
front end for the interactive component of the DSS.
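As a sketch of what such a PL/SQL routine might look like, the
hypothetical rule below stages malformed ZIP codes into the dq_suggestion
table sketched earlier. The table, sequence, and rule number are all
assumptions for illustration, not a standard API:

-- Hypothetical decision rule: flag customer_addr rows whose zip_code is
-- missing or is not a five-digit value, and stage them for DQO review.
create or replace procedure dq_rule_bad_zip
as
begin
   insert into dq_suggestion
      (suggestion_id, table_name, row_key, column_name,
       old_value, suggested_value, rule_id, status)
   select dq_suggestion_seq.nextval,   -- hypothetical sequence
          'CUSTOMER_ADDR',
          to_char(ca.cust_id),
          'ZIP_CODE',
          ca.zip_code,
          null,                        -- no automatic suggestion for this rule
          101,                         -- arbitrary rule number
          'P'                          -- pending DQO review
   from   customer_addr ca
   where  ca.zip_code is null
      or  not regexp_like(ca.zip_code, '^[0-9]{5}$');
end;
/

Each new rule or exception from the DQO becomes another small procedure
of this form.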
By using well-understood best practices for
Oracle data cleansing, a robust and flexible system can be created to
dramatically reduce data anomalies. Through iterative cycles of
refining the decision rules from the DQO, the DSS evolves to become
increasingly accurate and powerful.

This is a DSS for Oracle data cleansing in a
nutshell. Note that we start by examining the "nature" of known
data errors and seek "fishy" data (statistical outliers) to
create the suggestion lists for the human expert (the Data Quality
Officer, or DQO).
The DQO then manually resolves the errors and
works with the DBA to refine the decision rules until they are 100%
complete and accurate, using the "feedback loop" of successive rule
refinement. At that point, that component of the Oracle data
cleansing is automated, becoming an "expert system" component of the
DSS.
For expert Oracle data cleansing support and
data scrubbing consulting, use an expert from BC. We
understand the powerful Oracle data unification tools, and
we can aid in improving the data quality of any Oracle
database, large or small.