 |
|
Symmetric Difference
SQL Tips by Donald Burleson |
Suppose there are two
tables A and B with the same columns, and you would
like to know if there is any difference in their contents.
Relation Equality
A reader with an Object
Oriented programming background might wonder why the Relational
model in general, and SQL in particular, does not have an equality
operator. You'd think the first operator you'd implement for a
data type, especially a fundamental data type, would be equality!
Simple: this operator is not relationally closed, as the result is
Boolean value and not a relation.
The
Symmetric Difference
query provides a definitive answer:
(
select * from A
minus
select * from B
) union all (
select * from B
minus
select * from A
)
In practice, however,
this query is a sluggish performer. For my test I?ve created
tables A and B with 100000 and 100010 rows
correspondingly:
create
table A as
select obj# id, name from sys.obj$
where rownum < 100000;
create table B as
select obj# id, name from sys.obj$
where rownum < 100010;
For such moderately sized tables the execution time about 2 sec is
unacceptable. For comparison, it takes only 100 msec to scan an
individual table A or B.
With a na?e evaluation
strategy, the execution flow and the operators are derived verbatim
from the SQL which has been written. First, each table has to be
scanned twice. Then, four sort operators are applied in order to
exclude duplicates. Next, the two set differences are computed, and
finally, the two results are combined together with the union
operator.
RDBMS implementations
today, however, have pretty impressive query transformation
capabilities. In our case, set difference can already be internally
rewritten as anti-join. If not then, perhaps, the query
optimizer can be influenced to transform the query into the desired
form. Otherwise, has to be explicitly rewritten:
select *
from A
where (col1,col2,?) not in (select col1,col2,? from B)
union all
select * from B
where (col1,col2,?) not in (select col1,col2,? from A)
Duality between Set and
Join Operators
For two tables A and B
with the same columns, set intersection
select *
from A
intersect
select *
from B
can be expressed as a
semi-join
select
distinct * from A
where
(col1,col2,?) in (select col1,col2,? from B)
Likewise, set difference
select *
from A
intersect
select *
from B
can be expressed as an
anti-join
select
distinct * from A
where
(col1,col2,?) not in (select col1,col2,? from B)
Transforming set into join
operators expands optimizer search space. Optimizer could explore
new, previously unavailable join order permutations.
Unfortunately,
transforming set operators into joins did not have any significant
effect, at least in my experiment.
The symmetric difference
can be expressed via aggregation:
select *
from (
select id, name,
sum(case when src=1 then 1 else 0 end) cnt1,
sum(case when src=2 then 1 else 0 end) cnt2
from (
select id, name, 1 src from A
union all
select id, name, 2 src from B
)
group by id, name
)
where cnt1 <> cnt2
This appeared to be a rather elegant solution where each table has
to be scanned only once, until it was discovered that it has about
the same performance as the canonic symmetric difference query.
When comparing data in
two tables there are actually two questions that one might want to
ask:
1)
Is there any
difference? The expected answer is Boolean.
2)
What are the rows
that one table contains, and the other does not?
Question #1 can be
answered faster than #2 with a hash value based technique.
The standard disclaimer
of any hash-based technique is that it is theoretically possible to
get a wrong result. The rhetorical question, however, is how often
did the reader experience a problem due to hash value collision? I
never did. Would a user be willing to accept a (rather negligible)
risk of getting an incorrect result for a significant performance
boost?
In a hash-based method
each table is associated with a single aggregated hash value. This
should be carried out in such a way that changing a single character
or digit in any field must cause the aggregate value change.
Aggregation via min and max functions wouldn't meet
this requirement, while the sum would.
Here is one method to
calculate the aggregated table value:
select sum(
ora_hash(id||'|'||name, POWER(2,16)-1) )
from A
There all row fields were concatenated, and the string translated
into a hash value. Alternatively, hash values could have been
calculated for each field, and some asymmetric function applied in
order for the resulting hash value to be sensitive in respect to
column permutations:
select 1 *
sum( ora_hash(id, POWER(2,16)-1) )
+ 2 * sum( ora_hash(name, POWER(2,16)-1) )
from A
Row hash values are added together with ordinary sum aggregate, but
we could have written a modulo 216-1 user-defined
aggregate hash_sum in the spirit of the CRC (Cyclic Redundancy
Check) technique.
Hash value calculation
time is proportional to the table size. This is noticeably better
than the symmetric difference query, where the bottleneck was in the
sorting.
We can do even better, at
the expense of introducing a materialized view. The hash value
behaves like an aggregate, and therefore, can be calculated
incrementally. If a row is added into a table, then a new hash value
is a function of the old hash value and the added row. Incremental
evaluation means good performance, and the comparison between the
two scalar hash values is done momentarily.