Call now: 252-767-6166  
Oracle Training Oracle Support Development Oracle Apps

 
 Home
 E-mail Us
 Oracle Articles
New Oracle Articles


 Oracle Training
 Oracle Tips

 Oracle Forum
 Class Catalog


 Remote DBA
 Oracle Tuning
 Emergency 911
 RAC Support
 Apps Support
 Analysis
 Design
 Implementation
 Oracle Support


 SQL Tuning
 Security

 Oracle UNIX
 Oracle Linux
 Monitoring
 Remote s
upport
 Remote plans
 Remote
services
 Application Server

 Applications
 Oracle Forms
 Oracle Portal
 App Upgrades
 SQL Server
 Oracle Concepts
 Software Support

 Remote S
upport  
 Development  

 Implementation


 Consulting Staff
 Consulting Prices
 Help Wanted!

 


 Oracle Posters
 Oracle Books

 Oracle Scripts
 Ion
 Excel-DB  

Don Burleson Blog 


 

 

 


 

 

 

 

 

Symmetric Difference

SQL Tips by Donald Burleson

Suppose there are two tables A and B with the same columns, and you would like to know if there is any difference in their contents.  

Relation Equality

A reader with an Object Oriented programming background might wonder why the Relational model in general, and SQL in particular, does not have an equality operator. You'd think the first operator you'd implement for a data type, especially a fundamental data type, would be equality! Simple: this operator is not relationally closed, as the result is Boolean value and not a relation.

The Symmetric Difference query provides a definitive answer:

(
 select * from A 
 minus
 select * from B
) union all ( 
 select * from B  
 minus
 select * from A
)

In practice, however, this query is a sluggish performer.  For my test I?ve created tables A and B with 100000 and 100010 rows correspondingly:

create table A as
select obj# id, name from sys.obj$
where rownum < 100000;

create table B as
select obj# id, name from sys.obj$
where rownum < 100010;


For such moderately sized tables the execution time about 2 sec is unacceptable. For comparison, it takes only 100 msec to scan an individual table A or B.

With a na?e evaluation strategy, the execution flow and the operators are derived verbatim from the SQL which has been written. First, each table has to be scanned twice. Then, four sort operators are applied in order to exclude duplicates. Next, the two set differences are computed, and finally, the two results are combined together with the union operator.

RDBMS implementations today, however, have pretty impressive query transformation capabilities. In our case, set difference can already be internally rewritten as anti-join. If not then, perhaps, the query optimizer can be influenced to transform the query into the desired form. Otherwise, has to be explicitly rewritten:

select * from A 
where (col1,col2,?) not in (select col1,col2,? from B)
   union all  
select * from B 
where (col1,col2,?) not in (select col1,col2,? from A)

 

Duality between Set and Join Operators

For two tables A and B with the same columns, set intersection

select * from A

intersect

select * from B

can be expressed as a semi-join

select distinct * from A

where (col1,col2,?) in (select col1,col2,? from B)

Likewise, set difference

select * from A

intersect

select * from B

can be expressed as an anti-join

select distinct * from A

where (col1,col2,?) not in (select col1,col2,? from B)

Transforming set into join operators expands optimizer search space. Optimizer could explore new, previously unavailable join order permutations.

Unfortunately, transforming set operators into joins did not have any significant effect, at least in my experiment.

The symmetric difference can be expressed via aggregation:

select * from (
  select id, name,
    sum(case when src=1 then 1 else 0 end) cnt1,
    sum(case when src=2 then 1 else 0 end) cnt2
  from (
    select id, name, 1 src from A
    union all
    select id, name, 2 src from B
  )
  group by id, name
)
where cnt1 <> cnt2


This appeared to be a rather elegant solution where each table has to be scanned only once, until it was discovered that it has about the same performance as the canonic symmetric difference query.

When comparing data in two tables there are actually two questions that one might want to ask:

1)      Is there any difference? The expected answer is Boolean.

2)    What are the rows that one table contains, and the other does not?

Question #1 can be answered faster than #2 with a hash value based technique.

The standard disclaimer of any hash-based technique is that it is theoretically possible to get a wrong result. The rhetorical question, however, is how often did the reader experience a problem due to hash value collision? I never did. Would a user be willing to accept a (rather negligible) risk of getting an incorrect result for a significant performance boost?

In a hash-based method each table is associated with a single aggregated hash value. This should be carried out in such a way that changing a single character or digit in any field must cause the aggregate value change. Aggregation via min and max functions wouldn't meet this requirement, while the sum would.

Here is one method to calculate the aggregated table value:

select sum( ora_hash(id||'|'||name, POWER(2,16)-1) )
from A


There all row fields were concatenated, and the string translated into a hash value. Alternatively, hash values could have been calculated for each field, and some asymmetric function applied in order for the resulting hash value to be sensitive in respect to column permutations:

select 1 * sum( ora_hash(id,   POWER(2,16)-1) )
     + 2 * sum( ora_hash(name, POWER(2,16)-1) )
from A


Row hash values are added together with ordinary sum aggregate, but we could have written a modulo 216-1 user-defined aggregate hash_sum in the spirit of the CRC (Cyclic Redundancy Check) technique.

Hash value calculation time is proportional to the table size. This is noticeably better than the symmetric difference query, where the bottleneck was in the sorting.

We can do even better, at the expense of introducing a materialized view.  The hash value behaves like an aggregate, and therefore, can be calculated incrementally. If a row is added into a table, then a new hash value is a function of the old hash value and the added row. Incremental evaluation means good performance, and the comparison between the two scalar hash values is done momentarily.

     

This is an excerpt from the new book SQL Design Patterns: The Expert Guide to SQL Programming by Vadim Tropashko

You can buy it direct from the publisher for 30%-off.


 

 
��  
 
 
Oracle Training at Sea
 
 
 
 
oracle dba poster
 

 
Follow us on Twitter 
 
Oracle performance tuning software 
 
Oracle Linux poster
 
 
 

 

Burleson is the American Team

Note: This Oracle documentation was created as a support and Oracle training reference for use by our DBA performance tuning consulting professionals.  Feel free to ask questions on our Oracle forum.

Verify experience! Anyone considering using the services of an Oracle support expert should independently investigate their credentials and experience, and not rely on advertisements and self-proclaimed expertise. All legitimate Oracle experts publish their Oracle qualifications.

Errata?  Oracle technology is changing and we strive to update our BC Oracle support information.  If you find an error or have a suggestion for improving our content, we would appreciate your feedback.  Just  e-mail:  

and include the URL for the page.


                    









Burleson Consulting

The Oracle of Database Support

Oracle Performance Tuning

Remote DBA Services


 

Copyright © 1996 -  2020

All rights reserved by Burleson

Oracle ® is the registered trademark of Oracle Corporation.