Profiling and Cleansing Data Using OWB "Paris" - Part II
February 6, 2006
Mark Rittman

Then, when the table for correction has been selected, you need to select the data rules that will apply to the table of corrected data - this data rule translates to a check constraint on the table that ensures that only values allowed by the rule (the domain) can appear in the column. Note that this check constraint only gets put on the corrected table, not the original source one.
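
For example, a domain rule on a COUNTRY_DSC column might translate into a check constraint along these lines (the table name, constraint name and domain values here are just for illustration, not the DDL OWB actually generates):

alter table src_accounts
  add constraint country_dsc_domain
  check (country_dsc in ('United Kingdom', 'United States', 'Germany',
                         'France', 'USA', 'Spain', 'Australia', 'Canada',
                         'Singapore', 'Italy', 'Japan', 'Hong Kong'));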

Now comes the interesting bit. Once the data rule has been selected, you need to specify the correction method. To do this, you need to specify two things: the Action (cleanse, ignore, report) and the Cleanse Strategy (remove, similarity match, soundex match and custom).

In our case we want the action to be Cleanse, but what about the Cleanse Strategy? The docs are pretty vague on this at the moment, but I had a chat with Nick Goodman and it looks like the options translate to the following:

  • Remove - simple, just deletes the column value so that the offending entry is removed. I haven't tested this though, so there is the chance that it could in fact remove the entire row rather than just NULL the column.

  • Similarity Match, Soundex Match - again, not tested, but my take on this is that it will change the column value to the domain value that it most closely matches (there's a short SOUNDEX example after this list).

  • Custom - allows you to write custom code to make the correction. This is what we're going to use.
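
You can get a feel for how a soundex-style match might behave straight from SQL*Plus, using the standard Oracle SOUNDEX function (just an illustration - I don't know exactly which comparison OWB performs internally). A mis-keyed value often shares its soundex code with the domain value it should map to:

select soundex('Untied Kingdom') dirty, soundex('United Kingdom') clean
from   dual;

Both calls return 'U533', so a transposition typo like this one could still be matched to the correct domain value.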

We select Custom as the Cleanse Strategy, press Next, and the Correction Wizard then goes away and creates the correction mappings, and the supporting tables, to produce the corrected data. Note that with this example, we're just correcting one column, but in reality you could carry out multiple corrections to multiple tables using this wizard.

Now that the correction has been specified, you can take a look at the new correction module that's been created, and the objects that are inside it.

From this you can see that the Correction Wizard has created a number of objects:

  • A mapping to implement the correction

  • A function to implement the Custom cleansing process that we requested, and

  • Four tables: SRC_ACCOUNTS, which will contain the corrected version of SRC_ACCOUNTS from the source module, and other tables that will contain intermediate results from the correction process.

The first step then is to put the code into the Custom function to implement the correction. Note that this is just one way of doing it - you could use the matching (soundex, similarity) transformations to match the erroneous values to ones within the domain, which would be useful for mis-spellings - but in our case we want to translate all occurrences of UK countries into 'United Kingdom'.

To do this, the code needs to be added to the CUSTOM_1 function, which has COUNTRY_DSC already defined for it as the input parameter. The code used to implement the function is pretty simple and looks like this (remember that only data that needs correction will pass through this, not all records):

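Something along the following lines does the job (a sketch rather than the exact listing - the precise set of UK country names to map is an assumption based on the sample data shown later):

create or replace function custom_1 (
   country_dsc in varchar2
) return varchar2
is
begin
   -- Only rows that failed the data rule reach this function, so all
   -- we need to do is map the individual UK country names on to the
   -- single domain value, and pass anything else through unchanged.
   if country_dsc in ('England', 'Scotland', 'Wales', 'Northern Ireland') then
      return 'United Kingdom';
   end if;
   return country_dsc;
end custom_1;
/
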
You can also take a look at the mapping that the Correction Wizard has generated to implement the correction:

Working through the mapping, what appears to be going on here is that the first table, SRC_MAPPING, is the table from the GLOBAL_SRC source module that will subsequently be corrected by this mapping. The SRC_ACCOUNTS_STG table is a copy of SRC_ACCOUNTS with the data rule attached; note also the Error Group underneath it - this translates into another table that will contain the rows that fail the data rule we've specified, i.e. the ones that fail the check constraint.

The SRC_ACCOUNTS_1 and SRC_ACCOUNTS_2 tables are in fact aliases for the SRC_ACCOUNTS table within the GLOBAL_STG module, with the valid rows from SRC_ACCOUNTS_STG inserted into the SRC_ACCOUNTS_1 alias, and the records from the error group being passed through the ATTR_VALUE_2 pluggable mapping, corrected, and then inserted into the SRC_ACCOUNTS_2 alias.
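
To make that a bit more concrete, the generated mapping is doing something roughly equivalent to the following hand-written SQL (a considerable simplification - the real mapping routes rows through the staging, error and pluggable-mapping tables, and carries every column through rather than just COUNTRY_DSC):

-- Rows that passed the data rule load straight into the corrected table
insert into src_accounts (country_dsc)
select country_dsc
from   src_accounts_stg
where  country_dsc in ('United Kingdom', 'United States', 'Germany',
                       'France', 'USA', 'Spain', 'Australia', 'Canada',
                       'Singapore', 'Italy', 'Japan', 'Hong Kong');

-- Rows that failed the rule are run through the custom function first
insert into src_accounts (country_dsc)
select custom_1(country_dsc)
from   src_accounts_stg_err;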

Still with me? The ATTR_VALUE_2 pluggable mapping can itself be examined, using the View Child Graph toolbar button. When it's expanded out, it looks like this:

This sub-mapping contains an operator on the left-hand side, SRC_ACCOUNTS_STG_1, which is bound to the SRC_ACCOUNTS_STG_T1 table shown in the project tree earlier. It then splits the rows into two table operators, both of which are bound to the SRC_ACCOUNTS_STG_T2 table, with the second one using the custom PL/SQL transformation we set up earlier to convert the incorrect country names to correct ones. I'm not sure what the purpose of the split is - I can't see what the split logic is from examining the operator, and both operators are bound to the same physical table. If I work it out later I'll post an update.

If we want to implement this correction, the first thing to do is to deploy all of the objects (tables, mappings) in the Correction Schema, GLOBAL_STG, to the database. Once this is done, the following tables are created in the Correction Schema:

SQL> select table_name from user_tables;

TABLE_NAME
------------------------------
WB_RT_VERSION_FLAG
SRC_ACCOUNTS_STG_ERR
SRC_ACCOUNTS_STG
SRC_ACCOUNTS
SRC_ACCOUNTS_ERR
SRC_ACCOUNTS_STG_T2
SRC_ACCOUNTS_STG_T1

7 rows selected.

SQL> 

Note that this list includes the additional tables used by the pluggable mapping, not just those listed in the project tree.

Before the mapping is executed, you can take a look at how the data is held in the original SRC_ACCOUNTS table:

SQL> set pagesize 400
SQL> conn global_src/password@ora10g
Connected.
SQL> select country_dsc, count(*) from src_accounts
  2  group by country_dsc;

COUNTRY_DSC                      COUNT(*)
------------------------------ ----------
United Kingdom                          1
Scotland                                1
United States                          28
Germany                                 5
France                                  3
England                                 5
USA                                     5
Spain                                   1
Australia                               2
Canada                                  1
Singapore                               1
Italy                                   3
Japan                                   4
Hong Kong                               1

14 rows selected.

Now, when the mapping is executed, the corrected version of SRC_ACCOUNTS looks like this:

SQL> conn global_stg/password@ora10g
Connected.
SQL> select country_dsc, count(*) from src_accounts
  2  group by country_dsc;

COUNTRY_DSC      COUNT(*)
-------------- ----------
United Kingdom          7
United States          28
Germany                 5
France                  3
USA                     5
Spain                   1
Australia               2
Canada                  1
Singapore               1
Italy                   3
Japan                   4
Hong Kong               1

12 rows selected.

SQL> 

You can also examine the other tables in the correction schema for details of the rows that were corrected, the data rules that were broken and so forth.
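
For instance, a quick query against the error table should show which values fell foul of the rule (assuming the source column comes through to the error table under its original name, alongside whatever audit columns OWB adds):

select country_dsc, count(*)
from   src_accounts_stg_err
group by country_dsc;

In this example, that should list the England and Scotland rows that failed the check constraint and were routed through the correction function.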

So, how does this method of correcting data differ from a simple mapping you put together yourself, using the same custom PL/SQL function to correct the data? Well, in some respects it's a lot more complicated and has a lot more things that can go wrong, and it took me the best part of a couple of days to work out how it all fits together. From working with it, though, I can see two benefits.

Firstly, it creates metadata and graphical descriptions around your data cleansing processes, so that data cleansing is a distinct part of your ETL, your data clearly moves from dirty to cleansed, and it's much easier for a second person to work out what this part of the ETL process is for, compared to data cleansing being carried out "ad-hoc" as part of a general data movement from source to target.

Secondly (and I haven't had a chance to play around with this yet), the ability to match dirty data to its correct form, using the soundex and similarity transformations, sounds like a good new way of correcting spelling mistakes and mis-keyed customer data. One thing I haven't seen in it is the ability to combine this with the match-merge functionality (unless the similarity/soundex functions are match-merge under a different badge), but if these two functions were combined, it'd be a pretty good way of cleansing, deduplicating and householding customer data as part of the data profiling process.

Finally, one drawback, which Nick Goodman pointed out to me, is that as the Correction Wizard is essentially a mapping generator that works off your supplied parameters, if you need to change the correction in any way (perhaps switching from soundex to similarity matching, or to the remove or custom cleanse strategies) you've pretty much got to delete the mapping and all the tables it generated (including those used by the pluggable mapping) and do it all again. It would be good if you could just edit the correction as a whole, change the definition and have OWB make the changes, but without that you're stuck with deleting it and starting again.
