 |
|
Table Function
SQL Tips by Donald Burleson |
Either way, the previous
solution looks ridiculous to anybody with little programming
background. Integers can be easily created on the fly, so why does
one have to refer to some stored relation let alone a view over
several tables? The idea of relations which can be manufactured with
code as opposed to stored relations leads to the concept of Table
Function.
A table function is a
piece of procedural code that can produce a result which can be
understood by the SQL engine -- that is, a relation! A table
function can have a parameter, so that the output relation depends
on it. For all practical purposes it looks like a Parameterized
view, and it is even called this in the SQL Server world. The
Integers view is naturally parameterized with an upper bound
value.
The table function
concept evolved further to embrace the idea of pipelining.
From a logical perspective, pipelining does not change anything: the
table function output is still a relation. The function, however,
does not require materializing the whole relation at once; it
supplies rows one by one along with the consumer's demand. This
implementation difference, however, has a significant implication in
our case. The integers relation is infinite! By no means can
one hope to materialize it in all entirety, hence the upper bound
parameter in a non-pipelined version mentioned in the previous
paragraph.
The size of the pipelined
table function output does not necessarily have to be controlled by
purely logical means. A row's producer (table function) is in an
intricate relationship with a row's consumer (the enclosing SQL
query), and there is a way for the consumer to tell the producer to
stop.
Let's write such table
function producing an infinite list of integers and see what
happens:
CREATE TYPE
IntSet AS TABLE OF Integer;
CREATE or
replace FUNCTION Integers
RETURN IntSet PIPELINED IS
BEGIN
loop
PIPE ROW(0);
end loop;
END;
/
Each table function is
required to define the shape of the output it produces as an object
type. We declared IntSet as a list of integers. The function
implementation is unbelievably simple: the flow of control enters an
infinite loop, where it creates a new output row during each
iteration. Since the function is capable of producing a list of 0s
only, it is calling the SQL query's responsibility to have a pseudo
column expression that assigns integers incrementally.
Alternatively, we could have a slightly more verbose Integers
table function with a counter variable incremented during each loop
iteration which pipes that integer to the output.
How is the Integers
table function used? Simply use:
select
rownum from Table(Integers)
Be careful though, a
typical client programmatic interface allows opening a cursor and
fetching the result set row by row. Normally, the client would stop
when it exhausts the result set. In this case, the result set is
infinite, so the client has to decide by itself when to stop. Of
course, the decision when to stop can be moved to the server side
and made explicit such as:
select
rownum from Table(Integers)
where rownum < 1000
When designing a pipelined Integers function a rather popular
alternative was neglected. Many would find it natural for the
Integers function to have an argument specifying an upper limit in
the range of generated integers. For example, Integers(5) returns
the list 1,2,3,4,5; and the last could be reformulated in terms of
this new function without the predicate:
select
rownum from Table(Integers(1000))
Which style is better?
Suppose somebody unfamiliar with the Integers function
implementation asks:
What is the maximum
number in the list of integers produced by this query?
Predicate rownum <=
1000 gives the answer to the question immediately. While with
the function parameter it might be 1000, 999, or even
21000 - it is impossible to tell for sure without examining
the Integers function implementation.
Function Argument or
Predicate?
In general, I favor the
solution with predicates. Being at the high abstraction level,
programs with predicate expressions are shorter and cleaner. In
this particular case, my preference for clarity even goes as far
as compromising program safety. A carelessly written query against
the Integers function with no argument would not terminate.
The case when one would
need raw or only a slightly cooked list of integers is relatively
simple. If the integers relation is a part of a more complex
query, it would require a more elaborate analysis to decide whether
the query would terminate without an explicit server-side stop
condition.
Cube
One more approach
leverages the cube operator:
select
rownum from (
select 0
from dual
group by cube(1,1,1,1,1)
)
This solution had drawn the following eloquent comment from a reader
named Mikito Harakiri:
Most posters here seem to
have trouble seeing the difference between finite and infinite.
OK, exponentiate the
number, it is still produces a finite number (of rows). One day your
program with this "cool" cube solution would break just because you
have been lazy enough to code a trivial pipelined function.
Essentially, cube is as bad as selecting from obj$ (OK, col$ is
probably bigger). Well, except that you take your fellow
programmer's time, who has to understand your code. What is the
purpose of this group by cube, in general, and why there are 5
parameters (and, say, not 6), in particular.
Table function method
select
rownum from Table(Integers()) where rownum < 50
is much cleaner and
robust.
Hierarchical Query
I have mentioned already
DB2 integer generator leveraging recursive SQL. Oracle does not have
recursive SQL capabilities at the time of this writing, so users
have to use a non-standard hierarchical query extension. A contrast
between Oracle and DB2 solutions is often enlightening. One
fundamental difference between these two platforms is that Oracle
seems to be able to detect loops, while DB2 does not make such a
claim. Detecting loops in general is undecideable, which is the
basis for DB2's position. Does Oracle's loop detection work because
hierarchical extension has narrowed query expression capabilities
compared to recursive SQL? Can this be challenged?
Consider the following
typical hierarchical query:
select
ename from emp
connect by empno = prior mgr
start with empno = 1000
First, Oracle finds all the nodes in the graph satisfying the
start with condition. Then, for each batch of nodes found on a
previous step, it finds a batch of extra nodes that satisfy the
connect by condition. Any time the new node collides with the
nodes that have been already discovered, it signals the connect
by loop error. How does it identify the nodes? Should it compare
all the relation attributes, or only those in the select
clause or choose some other ones?
It is easy to see that
the attributes in the select clause should not matter.
Indeed, adding a rownum pseudo column would artificially make
the next node always appear different from its predecessors. The
loop, however, is a property of the graph. Graph either has a cycle
or it does not no matter what node labels exist. Therefore, the only
columns which should be relevant for loop detection are the ones in
the predicate with the prior prefix.
What hierarchical query
is written without the prior? This experiment reveals a
remarkably succinct integer generator:
select
rownum from dual
connect by 0=0
As an alternative to the
rownum pseudo column, level could be used:
select
level from dual
connect by 0=0
Both queries produce an
infinite result set. It is pipelined, however, so that the execution
can be stopped either on the client or explicitly on the server:
select
level from dual
connect by level < 100
Now that we have an ample
supply of integer generators, we can proceed with applications.