Home
E-mail Us
Oracle Articles
New Oracle Articles

Oracle Training
Oracle Tips
Oracle Forum
Class Catalog

Remote DBA
Oracle Tuning
Emergency 911
RAC Support
Apps Support
Analysis
Design
Implementation
Oracle Support

SQL Tuning
Security
Oracle UNIX
Oracle Linux
Monitoring
Remote support
Remote plans
Remote services
Application Server
Applications
Oracle Forms
Oracle Portal
App Upgrades
SQL Server
Oracle Concepts
Software Support
Remote Support
Development
Implementation

Consulting Staff
Consulting Prices
Help Wanted!

Oracle Posters
Oracle Books
Oracle Scripts
Ion
Excel-DB

Don Burleson Blog

Trees in SQL

SQL Tips by Donald Burleson

Graphs and trees are ubiquitous data structures. They do not easily fit into a Relational model; therefore querying them requires a little bit more ingenuity than the routine select-project-join.

Compared to graphs, trees are relatively simple creatures. They are easy to draw. Almost any problem involving a tree structure is easy to solve. Algorithms on trees are generally fast. Edges, which are very important in graph definition, can be almost completely ignored for a tree. The tree structure could be encrypted in the nodes alone, and those tree encodings could be invented almost on a daily basis.

Most of this chapter will focus on tree encodings. The rest is dedicated to smaller problems such as node ordering by ad-hoc criteria. Yet, several problems are postponed until the next chapter, wherein hierarchical aggregate queries and tree comparisons will be studied. A reader who is primarily looking to develop some intuition with vendor specific hierarchical SQL extensions (be that the connect by, or the recursive with operator) is advised to proceed to the next chapter.

Materialized Path

Tree is a subclass of graph. However, since graphs are more complex entities with their own set of problems, they will not be explored in depth in this chapter.

For all practical purposes a tree can be defined as a set of nodes arranged into a hierarchical structure via tree encoding. The purpose of tree encoding is to assign a special label to each node and to manipulate tree nodes - i.e. query and update - by means of those labels. Informally, each node is equipped with a global positioning device that transmits the node's coordinates. Once each node's geographical position is known, typical queries such as the following can be answered:

Count all the employees who are located south of the ?King?, in other words, who report directly or indirectly to him.

Without a doubt you are already familiar with at least one such encoding: a UNIX directory structure. Each file location in the hierarchy is defined by an absolute pathname -- a chain of directories that a user has to navigate from root to the leaf of the hierarchy. For example, /usr/bin/ls is an absolute pathname. On the top of the directory structure there is a directory called usr, which contains a directory called bin, which contains a file called ls.

This seemingly straightforward idea can be applied to any tree structure. First, discover or cook up some unique key, which would distinguish the node's children. Then, list all the ancestor unique keys as the node's encoding. This list can be represented as a string (if so, there must be agreement upon a string delimiter), or as a collection datatype. This encoding will be referred to as materialized path. The adjective materialized emphasizes the fact that the path is stored. If the path is built dynamically, then the adjective is omitted and thus this dynamically generated encoding is referred to as simply path.

At this moment, there is enough expressive power for basic queries:

An employee JONES and all his (indirect) subordinates:

select e1.ename from emp e1, emp e2
where e1.path like e2.path || '%'
and e2.ename = 'JONES'

An employee FORD and the chain of his supervisors:

select e1.ename from emp e1, emp e2
where e2.path like e1.path || '%'
and e2.ename = 'FORD'

Usually, query performance is unrelated to the form in which the query is written in SQL. In principle, a query optimizer has powerful techniques for transforming any query into an equivalent, better performing expression. Not in this case!

The first query is fine. The matching a string prefix is roughly equivalent to a range check where chr(255) is the last ASCII code:

select e1.ename from emp e1, emp e2
where e1.path between e2.path and e2.path || chr(255)
and e2.ename = 'JONES'

A reasonable execution strategy would be finding the unique employee record e2 matching ename='JONES', first. Finding a unique record is typically done via an index lookup, in other words, extremely fast. The first query execution step establishes the range of paths, which the e1.path encoding has to fall into. If this range of paths does not contain too many paths, then the best way to find them is to iterate via the index range scan. The more subordinates JONES has, the longer it will take to output them. In other words, the speed of this query is determined by the size of the output - there is hardly a more efficient way to express this query.

The equivalent range check rewriting is valid for the second query as well:

select e1.ename from emp e1, emp e2
where e2.path between e1.path and e1.path || chr(255)
and e2.ename = 'FORD'

Unlike the previous case, however, now not only is the interval of paths known, but also the path e2.path itself, which will be matched against all the intervals of the e1 table. Certainly, there would not be that many paths that match with e2.path, because the chain of ancestors in a balanced hierarchy is never too long.

Yet, there is no obvious index that could leverage this idea. The condition of a point belonging to an interval consists of the two predicates e2.path >= e1.path and e2.path <= e1.path || chr(255). A normal B-Tree index on the e1.path column could be leveraged while processing the first predicate only, and it would have to scan half of the records on average.

Finding a Set of Intervals Covering a Point

Querying ranges is asymmetric from a performance perspective. It is easy to answer if a point falls inside some interval, but it is hard to index a set of intervals that contain a given point. Applied to nested sets, there will be difficulty in answering queries about the node's ancestors.

The critical observation is that a chain of ancestors is encoded in the node's materialized path encoding. The database does not have to be accessed in order to tell that the ancestors of nodes 1.5.3.2 are nodes 1.5.3, 1.5, and 1. A simple function could parse the materialized path. This function's ?natural habitat? is the client side. There a dynamic SQL query is built:

select ename from emp
where path in ('1.5.3', '1.5', '1')

On the server side the implementation could be little bit more sophisticated. The list of ancestors can be implemented as a temporary table built by a table function. This sketchy idea will be developed in greater detail in later sections where more elegant encodings than the materialized path will be studied.

This section concludes with a materialized path tree encoding schema design:

table TreeNodes (
path varchar2(2000), -
)

This schema leaves the structure of the TreeNodes.path column unspecified. Ideally, some constraints could be added, but once again, a much nicer development that does not require string parsing techniques awaits us ahead.

This is an excerpt from the new book SQL Design Patterns: The Expert Guide to SQL Programming by Vadim Tropashko

You can buy it direct from the publisher for 30%-off.

��