Oracle Data cleansing anomalies
Oracle Database Tips by Donald Burleson
Oracle data cleansing is the science of quickly
locating and repairing data anomalies by comparing the universe of
values within the database. In other worlds, if a data item
"stands-out" (i.e. it is scientifically termed an "outliner", which
any statistically significant variance from a mean population), the
best-practice procedure is to "flag" the data as "questionable" and
forward it to the human "expert" who approves every suggestion.
The nature of data anomalies
Any Oracle application can have data-entry
errors, and it is not always the fault of poor-quality data entry.
For example, consider a database that tracks information about
elderly people. Many older folks forget their birthday's and some
even forget their names and data anomalies are guaranteed, no matter
how comprehensive the front-end.
Let's use a hospital clinic management database
to understand the common causes of data anomalies.
There are many sources of patient data quality
concerns, almost all relating to data entry inconsistencies. Names
can be mis-spelled or transposed, and important supplemental
information (e.g. SSN, age) might be entered incorrectly.
- Inconsistent Patient information -
One patient might appear to be many patients (as in the case
where patient ID, or the SSN contains typo's or transposition
errors), and conversely, many patients might appear to be one
patient. This anomaly is more rare, but possible, such as a
case where a typo relocates a clinic encounter into the
"history" of another patient.
- Inconsistent diagnosis codes - Many
clinics use a "check the box" approach for the physician to
enter their diagnoses. To minimize data entry errors, clinics
use well-defined diagnosis codes (ICD-9 and ICD-10), but these
codes have some limitations. The diagnosis codes are displayed
in a "choose all that apply" format, and the code range from
very general to very specific. It's not uncommon for a doctor
to under-diagnose a patient by choosing a broad disease code,
while failing to check-off the more descriptive codes.
This is an interesting area of Oracle data
cleansing because medical records are legal documents and there
could be serious ramifications to changing a physician's diagnosis.
Inconsistent or incorrect ICD-9 diagnosis codes can make it
impossible to track diseases or the efficacy of a treatment. There
are many variations on the ICD-9 diagnosis codes, over 50 disparate
clinics and hospitals in the data feed.
The first step is to identify the "outlier"
codes, figure-out how to standardize them, one at a time, using the
expert judgment of the Data Quality Officer. Once basic cleansing
rules are in-place, the uniqueness of bad ICD-9 code becomes easy to
spot (because they will not occur more than a few times in the whole
database). In the case of ICD-9 diagnosis codes, "outliers" can be
marked as invalid (using a Data Quality "status" flag, deleted or
changed), all at the discretion of the DQO.
Before we get into details, let's start with an
overview of the design approach for a Decision Support System (DSS)
for Oracle data quality assurance.
|For expert Oracle data cleansing support and
data scrubbing consulting, use an expert from BC. We
understand the powerful Oracle data unification tools, and
we can aid in improving the data quality of any Oracle
database, large or small.