 |
|
Relational Division
SQL Tips by Donald Burleson |
Relational Division
is truly an exotic operator. Search any popular database forum for
problems involving Relational Division. There certainly would be a
few, but chances are they might be not recognized as such. For
example, Chapter 1 problem 6 is almost a literal reproduction of the
message posted at the newsgroup
microsoft.public.sqlserver.programming. If nothing else, this
section would help you to quickly identify a problem as Relational
Division. After all, using the proper names is very important for
effective communication.
Why it is called
Division? It is the inverse of the (Cartesian) Product. Given a set
of JobApplicants
Note: Not all code
is included due to formatting issues,
See book for full equations.
and a set of
JobRequirements,
then the Cartesian
Product JobApplicants -
JobRequirements is
Inversely, given
the JobApplicants -
JobRequirements relation (called the dividend), it could be
divided by JobRequirements (the divisor) and JobApplicants
(the quotient) would be obtained.
The analogy between
relational algebra and arithmetic goes one step further. Relational
division is similar to integer division. If the integer dividend
x is not a multiple of integer divisor y, the quotient is
defined as the maximal number q that satisfies the
inequality:
x ≥ q y
or the equality
where r is called the remainder.
x = q y + r
In Relational
Algebra, given a dividend X and divisor Y, the quotient Q is defined
as a maximum relation that satisfies the inequality:
X
⊇ Q - Y
or the equality:
X = Q - Y
∪ R
In the following
example, the JobApplicants -
JobRequirements
relation is reduced to
which is
appropriately called ApplicantSkills. Then,
ApplicantSkills
=
QualifiedApplicants -
JobRequirements
∪
UnqualifiedSkills
where the quotient
QualifiedApplicants is
and the remainder
UnqualifiedSkills is
Informally,
division identifies attribute values of the dividend that are
associated with every member of the divisor, such as find job
applicants who meet all job requirements.
Relational Division
is not a fundamental operator. It can be expressed in terms of
projection, Cartesian product, and set difference. The critical
observation is that multiplying the projection
πName(ApplicantSkills)
and JobRequirements results in the familiar Cartesian
Product:
Subtract
ApplicantSkills, and project the result to get the Names
of all the applicants who are not qualified. Then, find all the
applicants who are qualified as a complement to the set of all
applicants. Formally:
QualifiedApplicants =
πName(ApplicantSkills)-
-
πName(
πName(ApplicantSkills)?JobRequirements - ApplicantSkills )
Translating it into
SQL is quite straightforward, although the resulting query gets
quite unwieldy:
select
distinct Name from ApplicantSkills
minus
select Name from (
select Name, Language from (
select Name from ApplicantSkills
), (
select Language from JobRequirements
)
minus
select Name, Language from ApplicantSkills
)
The minus operator
is essentially an anti-join, so it is not surprising that it can be
expressed in another form that is popular in the literature:
select
distinct Name from ApplicantSkills i
where not exists (
select * from JobRequirements ii
where not exists (
select * from ApplicantSkills iii
where iii.Language = ii.Language
and iii.Name = i.Name
)
)
Admittedly, this query is difficult to understand because of
the double-negative construction - not exists clause
inside another not exists. The reason for the double
negation is SQL's inability to express universal quantification in
the relational division query:
Name the
applicants such that for all
job requirements there exists a
corresponding entry in the applicant
skills
Mathematical logic
is notorious for formal transformations of one logical expression
into another. In this case
∀x(∃y
f(x,y) )
is re-written into
∄x(∄y
f(x,y) )
without too much
thinking. The Relational division query becomes:
Name the
applicants such that there is
no job requirement such that
there doesn't exists a corresponding
entry in the applicant skills
which is a sloppy
wording for the SQL query that was being analyzed.
As usual, the most
elegant solution requires insight. The critical observation is
expressing the informal Relational division query as:
Name the
applicants for which the set
of all job skills is a
subset of their skills
It would be nice if
SQL had a subset operator (also called set
containment) so that we could write
select
distinct Name from ApplicantSkills i
where (select Language from JobRequirements ii
where ii.Name = i.Name)
in (select Language from ApplicantSkills)
Here the SQL syntax
has been abused and the already existing in operator
leveraged to denote set containment. Without an explicit subset
relation available it is expressed as the emptiness of the
difference between the two sets. Formally,
A
⊆ B
is equivalent to
A \ B
= ∅
Applied to our case
it allows the transformation of the rough first attempt to a
legitimate SQL query:
select
distinct Name from ApplicantSkills i
where not exists (
select Language from ApplicantSkills
minus
select Language from JobRequirements ii
where ii.Name = i.Name
)
We are not done
yet. Instead of checking the sets containment they could be simply
counted instead!
select Name
from ApplicantSkills s, JobRequirements r
where s.Language = r.Language
group by Name
having count(*) = (select count(*) from JobRequirements)
First, ?irrelevant
skills? were excluded by joining ApplicantSkills and
JobRequirements. Now that the result is subset of the required
skills, they are both counted and compared.
Relational division
can have really bizarre applications. Consider finding the Greatest
Common Divisor (GCD) for a set of integers.
Given a
set of integers find the
maximal integer that divides all
of them.
The GCD problem is
remarkable in a sense that it is connected to many areas of SQL
programming. It is very tempting, for example, to define GCD as yet
another user defined aggregate function. In this section, however,
the GCD problem seems to be out of context, because the only thing
that relates GCD and relational division is the word ?division?. It
does not mean the same thing in both cases, however. Or does it?
It is the universal
quantifier ∀
in our informal GCD query which makes the connection to the
relational division. For illustration purposes we?ll find the GCD of
just three numbers 9, 15 and 42. Let's try making this query fit
formally into the relational division framework. First it is a good
idea to simplify a query in order to
exclude the
aggregation:
Given a
set of integers, e.g. 9, 15,
and 42 find another set of
integers such that each of
them divides all the numbers from
the first set.
Admittedly, this
query does not look like relational division at all. The relational
division operator has two inputs: the dividend relation and the
divisor relation. All there is so far is something that might
measure up to the divisor role. Let's suppress the lingering doubt,
and promote the relation formally to the divisor. In fact, why not
even keep the JobRequirements relation name from the previous
development?
Therefore, we have
JobRequirements:
What could be the
missing ApplicantSkills dividend? The ?Applicants? are really
the integers; only those which have a 'skill? to divide all the
numbers listed in the JobRequirements would qualify. There we
have ApplicantSkills!
This relation is
imaginary, of course. The chapter on integer generators presented a
way to generate all the integer pairs, and this method could in fact
be adopted for this case. Next, simply plug in
JobRequirements and ApplicantSkills into any of the
relational division queries that were written in this chapter, say
select Name
from ApplicantSkills s, JobRequirements r
where s.Language = r.Language
group by Name
having count(*) = (select count(*) from JobRequirements)
and we have the
answer! There is one little optimization that can
simplify the answer. The ApplicantSkills.Language column can
be eliminated. Projection of the ApplicantSkills relation to
the Name column is the familiar Integers relation. Now
that there is no longer any column for the equijoin predicate
s.Language = r.Language, the join condition
between the Integers and JobRequirements demands the
remainder of integer division to be 0. With the names that
were mindlessly borrowed from the sample relational division
problem, the revised query may enter an obfuscated SQL coding
contest.
select Name
from (
select num# as Name from Integers
where num# <= (select min(language) from JobRequirements)
), JobRequirements
where mod(Language,Name)=0
group by Name
having count(*) = (select count(*) from JobRequirements)
With one more step,
switching to more appropriate names, and we have the final GCD
query:
select
Divisor from (
select num# as Divisor from Integers
where num# <= (select min(Element) from NumberSet)
), NumberSet
where mod(Element, Divisor)=0
group by Divisor
having count(*) = (select count(*) from NumberSet)
The only remaining action is finding the maximum in the
resulting set.
Outer
Union
The union operator
definition in Relational Algebra has a problem. It can be applied
only to compatible relations that have identical attributes. The
Outer Union operator, invented by E.F.Codd, can be applied to any
pair of relations. Each relation is extended to the schema
that contains all the attributes from both relations. The newly
introduced columns are padded with NULLs. The resulting relations
have the same schema and their tuples can, therefore, be combined
together by an ordinary union.
Example. Suppose we
have the Personnel relation
and the
Department
The outer union
Personnel ⊎
Department is
This operator seems
to have an extremely limited practical scope. The only usage of the
outer union operator I?m aware of is the definition of (the full)
outer join. Formally, the outer join of the relations A and B is
defined in four steps:
1.
Join both relations: A
wv
B. In the example above we?ll
have
2.
Join both relations, project to the first relation schema and
subtract the result from the first relation: A \
psch(A)
(A wv
B)
3.
Join both relations, project to the first relation schema and
subtract the result from the first relation: B \
psch(B)
(A wv
B)
4.
Apply outer union to the results of 1-3.
This is actually
the relational algebra versions of the outer union and outer join
operations. Perhaps they should be properly named natural outer
union and natural outer join. In SQL the two relations are always
considered to have disjoint sets of attributes.