 |
|
String Decomposition
SQL Tips by Donald Burleson |
Given a relation A with a
single column LST the query should output a relation with tuples
containing individual substrings that are 'separated by commas? in
the original strings:
The clause 'separated by
commas? is decorated with quotation marks on purpose. While the
value 3 is indeed enclosed with commas on both sides, 1 and 7 are
delimited from one side only.
To convert this informal
requirement into a formal language, the first thing to do is fix
this nuisance. It is quite easy to accomplish using:
select
','||LST||',' LST from A
Now, each value is
enclosed from both sides. The answer is a substring between n-th
and n+1-th delimiter position. Getting a little bit ahead of
myself, I?ll mention that a very elegant recursive SQL solution
awaits us later. The following is one of Oracle's integer number
generators:
with B as
(select ','||LST||',' LST from A)
select substr(LST,instr(LST,',',1,num)+1,
instr(LST,',',1,num+1)-instr(LST,',',1,num)-1) PREF
from B,(
select rownum num from dual connect by rownum < 10
)
With this solution, it is
instructive to reflect back and notice that without the additional
commas in the beginning and at the end of the string the solution
would be quite messy. Separate cases would have to be made for the
values that are delimited from one side only. In one word,
rigorous means simple!
Why does the query insist
on producing exactly 9 integers? Well, 9 is a ?reasonably big?
number. Any number greater or equal to LENGTH(LST) would do. If we
have more than one row in the original relation A, then we could
simply take a maximum of the string lengths.
What about those rows
with NULLs? Easy: anything unwanted could be filtered out with a
proper predicate.
Admittedly, the last two
answers taste like kludges. The solution feels even less satisfying
if it is compared with the recursive SQL solution, which I promised
before:
with
decomposed( pref, post ) as (
select '', lst from A
union
select substr(lst, 1, instr(lst,',')),
substr(lst, instr(lst,',')+1, length(lst)) from decomposed
) select pref from decomposed where pref <> ''
The execution steps can
be easily visualized:
Interestingly, no longer
is it required to decorate strings with a starting and trailing
delimiter.
If your RDBMS of choice
supports table functions, you might argue that the task of string
parsing could be delegated to some procedural code. Here is
implementation in Oracle posted on the OTN SQL& PL/SQL forum by
Scott Swank:
CREATE TYPE
Strings AS TABLE OF varchar2(100);
FUNCTION
parse (
text IN
VARCHAR2,
delimiter IN VARCHAR2 DEFAULT ',')
RETURN Strings
IS
delim_len CONSTANT PLS_INTEGER
:= LENGTH (delimiter);
tokens
Strings := Strings ();
text_to_split
VARCHAR2 (32767);
delim_pos
PLS_INTEGER;
BEGIN
tokens.DELETE;
text_to_split := text;
WHILE text_to_split IS NOT NULL
LOOP
delim_pos := INSTR (text_to_split, delimiter);
IF delim_pos > 0
THEN
tokens.EXTEND;
tokens (tokens.LAST) :=
SUBSTR (text_to_split, 1, delim_pos - 1);
text_to_split :=
SUBSTR (text_to_split, delim_pos + delim_len);
ELSE
tokens.EXTEND;
tokens (tokens.LAST) := text_to_split;
text_to_split := NULL;
END IF;
END LOOP;
RETURN tokens;
END parse;
Even though the
implementation looks complicated, all that is visible from the SQL
query is a function call:
select
column_value
from table(parse('1,2,3,4'));
It might not seem obvious
how to call this table function in a context where the input data is
a set of strings. A table expression cannot be placed inside
some subquery within the select or where clause, since
it returns multiple rows. The only solution is keeping it
inside the where clause, although the function parameter has
to be correlated with the source table:
create
table sources (
num integer,
src varchar2(1000)
);
insert into sources values(1, '1.2.3,4.5,6,7');
insert into sources values(2, '8,9');
select num, t.*
from sources, table(parse(src)) t;
This syntax is reminiscent of ANSI SQL lateral views, where the
second table expression is allowed to refer to the first one. From a
theoretical perspective an expression with a lateral view is an
asymmetric join with the join condition pushed inside the second
table. This solution scales up nicely. For example we
can parse a comma separated list, first and then parse the result
once more
select
num,
t1.column_value a,
t2.column_value b
from sources t, table(parse(src)) t1,
table(parse(column_value,'.')) t2;
Enumerating Pairs
Enumerating pairs seems
easy. Just build a Cartesian product of the two integers
relations:
select
i1.num x, i2.num y
from Integers i1, Integers i2
With a finite integers
relation bounded by an arbitrary upper limit, there should not be
any surprises. What if it is infinite, is it possible to drop the
limit predicate on the server side and leverage the pipelining idea?
Pipelined Operators
When executing an SQL
statement, the RDBMS engine represents it internally as a tree of
operators. Each operator performs a relatively simple task;
the complexity lies in the way the operators are combined
together.
Each operator works as a
black box. It consumes a relation as an input, massages it, and
outputs the result to the other operator. If the operator is able
to start outputting rows as soon as it consumed one or several
rows, it is called pipelined. Pipelined operators do not
block the execution flow; whereas blocking operators have
to consume the whole input relation before outputting even a
single row.
At first, the answer
seems to depend on a join method employed by the SQL execution
engine for this particular query. Both hash join and sorted merge
join materialize their argument relations, therefore the execution
plan involving any of those methods is never pipelined.
Nested loops are
typically a pipelined method. It iterates via the outer relation,
and for each row finds matching rows from the inner relation. In
case of the Cartesian product of Integer relations, however, all the
rows from the inner relation match and there are an infinite number
of them! Therefore, the execution would be stuck scanning the inner
table, and would never be able to get past the first row in the
outer table. In other words, the nested loop join fails to deliver
pipelined set of integer pairs as well.
Let's approach the
problem from another angle. When a mathematician says ?enumerating?
she really means a one-to-one mapping some set of objects into the
integers. Figure 2.2 shows a nearly ubiquitous mapping of integer
pairs into integers.
Without further ado, here
are the formulas for the (x,y) -> n mapping
and the reverse n -> (x,y)
mapping
Proving these formulas is
left as an exercise for curious reader. Translating the above
formulas into SQL is straightforward:
select n,n-(xplusy+1)*xplusy/2
x,
xplusy-n+(xplusy+1)*xplusy/2 y
from (
select FLOOR((SQRT(8*(rownum-1)+1)-1)/2) xplusy,
rownum-1 n from dual connect by 1=1
)
Even though the
implementation side varies from vendor to vendor, it is instructive
to see the execution statistics:
Ignoring fancy operator
names in the execution statistics above will show that the very
first operator produces one row. This row performs a basis for the
connect by iteration that starts producing rows one-by-one.
Each row is pipelined through two more levels of processing to the
top. The client receives each row, and after getting the tenth row,
loses any interest in continuing further. The execution statistics
is a snapshot at that moment.
Admittedly, using the
above pipelined integer pairs implementation inside other queries as
an inner view or subquery would only raise eyebrows. Those square
root expressions are better be hidden behind a named view:
create view
IntegerPairs as
select n,n-(xplusy+1)*xplusy/2 x,
xplusy-n+(xplusy+1)*xplusy/2 y
from (
select FLOOR((SQRT(8*(rownum-1)+1)-1)/2) xplusy,
rownum-1 n from dual connect by 1=1
)
Was anything achieved with this pipelining idea using integer pairs
implementation? Consider the following query:
Find positive integers X
and Y satisfying both x + y = 10 and x - y = 2 equations
With the IntegerPairs
view the solution is immediate:
select x,y
from IntegerPairs
where x+y=10 and x-y=2
Would it actually work? A pair x=6 and y=5 is the unique
solution of the system of equations, but would the execution really
stop after we have it in the result set? Even if the client is not
interested in more than one row from the result set, it has no way
to communicate this to the server. If the result set is buffered,
then the server would continue producing more pairs, and never stop.
Ad-Hoc Constants
If pipelining doesn't
really work for constraint problems with two integer variables,
why mention it at all? In an alternative approach it would simply
be written
select
x,y from (
select num x from Integers where num < 234
), (
select num y from Integers where num < 567
) where
x+y=10 and x-y=2
which produces the correct
answer without any risk of getting stuck in the infinite loop.
Much left is to be desired for the code clarity, however. Those
'safety? constants are eyesore, and they have to be estimated in
advance. The values that work for one problem, might not work for
next problem. A cartoon description of pipelining would define it
as a programming style that avoids magical constants.