| |
 |
|
Enterprise Solutions with Oracle ETL
Oracle Tips by V.J. Jain
|
Enterprise Solutions
with Oracle ETL
By V.J.
Jain, July 2007 (varun.jain@varunjaininc.com)
Varun Jain, Inc. (an Oracle Partner)
Abstract:
Oracle 9i and 10g have introduced tools that allow for the composition of an
enterprise ETL solution that is more preferable than off-the-shelf products.
The features include:
-
setup and use of an external table and analytic SQL which loads data from a
complex XML and text file, allowing 1,674 records to be loaded from the file into a
relational table in just 6.85 seconds
-
setup and use of a pipelined table function to vertically load more
complex files with dynamic columns. This allows as many as 31,233
records to be
inserted from the file into a relational table in only 0.96 seconds
-
application of an
itemized enterprise ETL solution after the functionality and efficiency of
the previous features have been achieved
-
an exhibition of the
dynamic properties of external tables, illustrate through altering file
source by DDL. This includes ETL procedures which design load files dynamically
to accept the file path as a parameter.
-
use of NFS drives for
space efficiency to achieve dynamic alterations
-
availability of a custom queue table, created
by using an Oracle directory and a custom procedure, which populates based
on the file listing by using a semaphore for external procedures
-
creation of DBMS_SCHEDULER
job that monitors and processes the queue
table.
It is the combination of
these components which result in a highly functional enterprise solution,
offering benefits that surpass off-the-shelf products such as Informatica,
Data Stage, or Ab Initio.
Furthermore, Oracle
9i and 10g have added some very powerful tools to support high performance ETL.
While there may be familiarity with Oracle external tables and pipelined table
functions, the full potential of Oracle can go unrealized.
The
demand for high performance ETL is growing as organizations realize the
importance of business intelligence and the need for a comprehensive data
warehouse. The majority of these organizations use expensive off-the-shelf
products without realizing that they have already licensed the necessary tools
to build a potentially better solution. This article discusses the use of
Oracle ETL features with Oracle Scheduler to provide a real-time, high scale,
and high performance ETL solution.
Familiarity with external
tables and pipelined table functions is necessary to fully understand the
solution:
External Tables
External tables are used to provide relational access to the contents of a file
from an Oracle directory. The external table script defines the rules for
reading the file, and once the table is created, the external table is
accessed just like a regular table. Here’s an example:
|
Create directory DAT_DIR as ‘/usr/tmp’;
|
File:
sample.csv saved to /usr/tmp
ID,
DATE, FIRST_NAME, LAST_NAME, STATUS
1,
17-JUL-07, John, Adams, Active
2,
17-JUL-07, Tyler, Howell, Active
3,
17-JUL-07, Jim, Lopez, Active
4,
17-JUL-07, Carlos, White, Inactive
5,
17-JUL-07, Scott, Tiger, Active
External table definition:
|
CREATE TABLE SAMPLE_EXT
(
ID NUMBER,
DATE VARCHAR2(10),
FIRST_NAME VARCHAR2(30),
LAST_NAME VARCHAR2(30),
STATUS VARCHAR2(10)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY DAT_DIR
ACCESS PARAMETERS
(
records delimited by newline LOAD WHEN (ID != ‘ID’)
badfile BAD_DIR:'SAMPLE_EXT%a_%p.bad'
logfile LOG_DIR:'SAMPLE_EXT%a_%p.log'
fields terminated by ',' optionally enclosed by '"' LRTRIM
MISSING FIELD VALUES ARE NULL
)
LOCATION (‘sample.csv')
)
PARALLEL 4
REJECT LIMIT 1000;
|
Note:
Several of the properties in this definition are optional. For example, if “skip 1”
could have been used in the table definition, but the method above can be a better
alternative if it cannot be guaranteed that all file sources will use a header
line.
Now
that the external table has been created, and the file has been placed on the
server, it is possible to access the data using a SQL such as:
|
Select * from SAMPLE_EXT
/
ID DATE FIRST_NAME LAST_NAME STATUS
1 17-JUL-07 John Adams Active
2 17-JUL-07 Tyler Howell Active
3 17-JUL-07 Jim Lopez Active
4 17-JUL-07 Carlos White Inactive
5 17-JUL-07 Scott Tiger Active
Select * from SAMPLE_EXT where STATUS = ‘Inactive’
/
ID DATE FIRST_NAME LAST_NAME STATUS
4 17-JUL-07 Carlos White Inactive
|
This exhibits the
relational access to this file
in completion, even though the
file is outside of the database. Loading the contents of this file into a
relational table is also easy to accomplish.
First,
create a matching relational table as exemplified below:
|
CREATE TABLE SAMPLE_TAB
(
ID NUMBER,
DATE VARCHAR2(10),
FIRST_NAME VARCHAR2(30),
LAST_NAME VARCHAR2(30),
STATUS VARCHAR2(10)
);
|
Next,
insert the data from the external table to the relational table:
|
Insert into sample_tab
Select * from sample_ext
/
|
It is a
common misconception that external tables are only useful for simple formats
such as comma or tab delimited files. In loading a CSV file easily
however, one can see that this is far from the truth.
External tables can also be
used for complex files. For example, because external tables are
relationally accessible, the full flexibility of the SQL language can be
utilized to get the data in the desired
format. If the file format is too complex for SQL to
structure appropriately, a pipelined table function can be used.
When
using external tables for more complex file formats, placing
each row from the file into single columned row in the external table may offer
great flexibility for transforming the data.
Consider the following file:
sample2.dat (actual file is several megabytes)
23:18:00 PST 18 Feb 2007
<Inventory name=" CORVETTE">
<AppId
id="1"/>
<Color>Yellow</Color>
</Inventory>
23:18:00 PST 18 Feb 2007
<Inventory name="CAMARO">
<AppId
id="2"/>
<Color>White</Color>
</Inventory>
23:18:00 PST 18 Feb 2007
<Inventory name="PROWLER">
<AppId
id="3"/>
<Color>Blue</Color>
</Inventory>
This
file is obviously a blend between text and XML. Though this is not a
preferred format, in ETL there is often no control over the data
sources. How could an external table to efficiently load a file
structured like this be used? The solution is probably easier than
assumed.
First,
create the external table:
|
CREATE TABLE SAMPLE2_EXT (
field1 varchar2(4000)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY DAT_DIR
ACCESS PARAMETERS
(
records delimited by newline
badfile DAT_DIR:'SAMPLE2_EXT%a_%p.bad'
logfile DAT_DIR:'SAMPLE2_EXT%a_%p.log'
fields LRTRIM REJECT ROWS WITH ALL NULL FIELDS (field1 char(4000))
)
LOCATION ('sample2.dat')
)
REJECT LIMIT 10;
|
Next,
create the relational table:
|
CREATE TABLE SAMPLE2
(
BEGIN_DATE DATE,
INVENTORY VARCHAR2(30),
APPID_ID VARCHAR2(30),
COLOR VARCHAR2(30)
);
|
Now extract, transform, and load data from external table into the relational
table:
|
SQL> insert /* append */ into sample2
2 select /*+ materialize */
3 max(BEGIN_DATE) BEGIN_DATE,
4 max(INVENTORY) INVENTORY,
5 max(APPID_ID) APPID_ID
6 ,max(COLOR) COLOR
7 FROM
8 (
9 select r, grp,
10 case when
(length(translate(substr(field1,1,1),' 0123456789',' '))) is null then
to_date(to_
char(to_timestamp_tz(field1, 'HH24:MI:SS TZD DD Mon YYYY'),
'DD-MON-YYYY HH:MI:SS AM'), 'DD-MON-YYYY
HH:MI:SS AM') end BEGIN_DATE,
11 case when
(instr(field1,'<Inventory')>0) then extractValue(XMLType(replace(field1,
'">', '"/
>')), '/Inventory/@name') end INVENTORY,
12 case when (instr(field1,'<AppId')>0)
then extractValue(XMLType(replace(field1, '">', '"/>')),
'/AppId/@id') end APPID_ID,
13 case when (instr(field1,'<Color')>0)
then extractValue(XMLType(field1), '/Color') end COLOR
14 from
15 (
16 select r, field1, nvl(max(grp) over (order by
r),0) grp FROM
17 (
18 select A.r, A.field1, case
when (length(translate(substr(field1,1,1),' 0123456789',' '))) i
s null then A.r end grp from
19 (
20 select rownum r,
field1 from SAMPLE2_EXT A
21 ) A
22 )
23 )
24 )
25 group by grp
26 /
1674 rows created.
Elapsed: 00:00:06.85
SQL>
|
The query above can extract, transform, and load 1,674
records into a relational table in under 7 seconds. This is good
performance considering the format has XML and non-XML content.
Here is a
sample of what is in our relational table:
|
select * from sample2
where rownum <=3
/
BEGIN_DATE INVENTORY
APPID_ID COLOR
2/18/2007 11:18:00 PM CORVETTE
1 Yellow
2/18/2007 11:18:00 PM CAMARO
2 White
2/18/2007 11:18:00 PM PROWLER
3 Blue
|
The example above demonstrates
a relatively simple use of a single columned external table transforming raw data into a relational format,
but what happens when the format is more complex?
Pipelined Table Functions
Though
they can get quite complicated, a table function is simply a function that
returns rows. Inside the table function, the row is prepared and each row
is piped back to the session when they are ready. The
output rows are based on a type definition.
Here is
a simple example of a table function that accepts a VARCHAR2 input and returns
rows with 1 character per row.
|
CREATE TYPE row_typ AS OBJECT
(p_row_output
varchar2(10));
/
CREATE TYPE row_typ_tab AS TABLE OF row_typ;
/
create or replace function example_tab_fnc (pi_value IN
varchar2)
return row_typ_tab PIPELINED AS
l_length number;
l_index PLS_INTEGER
:= 1;
PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
LOOP
l_length := length(pi_value);
EXIT WHEN l_index > l_length;
PIPE ROW ( row_typ(substr(pi_value, l_index, 1)));
l_index := l_index + 1;
END LOOP;
RETURN;
END;
select * from table (example_tab_fnc('ABCDE'))
/
P_ROW_OUTPUT
A
B
C
D
E
|
It is
obvious that this feature could certainly prove useful in an ETL environment.
While the demonstration above is simplified, pipelined table
functions can be used to transform complex files into a usable format. For
example, as input, a table function can accept a REF cursor pointing at an external table. Consider the following file:
Sample3.dat
[CLASS_XXX]
[COLUMN_LISTING]
COL1,COL2,COL3,COL4,COL5
[COLUMN_VALUES]
XXX 219/ 1/ 0/ 0/ 1,95,0,0,0,0
XXX 219/ 1/ 0/ 1/ 1,23,0,0,0,0
XXX 219/ 1/ 0/ 2/ 1,4,0,0,0,4
XXX 219/ 1/ 0/ 3/ 1,37,1,0,0,33
[CLASS_YYY]
[COLUMN_LISTING]
COL21,COL22,COL23,COL24
[COLUMN_VALUES]
YYY 219/ 1/ 1/ 1/ 1,11092,49.52,627,179827
YYY 219/ 1/ 2/ 1/ 1,11092,199.52,628,179828
YYY 219/ 1/ 3/ 1/ 1,11092,459.52,629,179829
[CLASS_ZZZ]
[COLUMN_LISTING]
COL31,COL32,COL33,COL34,COL35,COL36,COL37
[COLUMN_VALUES]
ZZZ 219/ 0/ 0/ 1/ 1,0,0.00,0,0,7201,0,0
ZZZ 219/ 0/ 1/ 2/ 1,0,1.00,0,1,7202,0,2
ZZZ 219/ 0/ 2/ 1/ 2,0,2.00,0,2,7203,0,3
This file details the
inventory of three different warehouses included in the same file, with each in its
own class. Every class has a dynamic listing of columns and column values.
Preceding the values line is a string such as “XXX 219/ 1/ 0/
3/ 1”, “YYY 219/ 1/ 1/ 1/ 1”, and “ZZZ 219/ 0/ 0/ 1/ 1” .
Each of the values in these strings map to source properties that form
an inventory locator, and must be stored for each row.
For
example, “ZZZ 219/ 0/ 0/ 1/ 1” maps to
Class:
ZZZ
Store
Number: 219
Pallet:
0
Lot: 0
Case: 1
Item: 1
It is not feasible
to have a separate table column for each possibility because there could be
thousands of different columns for each class. The best solution is
to store the data vertically, by using a standard label for column name and
column value. A desired relational storage would include the following
columns:
CLASS
STORE_NUMBER PALLET
LOT CASE_NUMBER ITEM
COL_NAME COL_VALUE
In order to have table function pipe rows with these columns, we need to define
an appropriate type:
|
CREATE TYPE sample3_typ AS OBJECT
(
CLASS varchar2(10),
STORE_NUMBER number(6),
PALLET number(6),
LOT number(6),
CASE_NUMBER number(6),
ITEM number(6),
COL_NAME varchar2(100),
COL_VALUE varchar2(100)
);
/
CREATE TYPE sample3_typ_tab AS TABLE OF sample3_typ;
/
|
If
the number of columns were fixed, we might try defining an external table that
leverages the comma delimiters in the file format. However, since the
number of columns is unknown, we will define our external table to have a single
column.
In the original file, the
column listing was over 16,000 characters long; to allow for this condition, we
must define the
column to be of type CLOB:
|
CREATE TABLE SAMPLE3_EXT (
field1 clob
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY DAT_DIR
ACCESS PARAMETERS
(
records delimited by newline
badfile DAT_DIR:'SAMPLE3_EXT%a_%p.bad'
logfile DAT_DIR:'SAMPLE3_EXT%a_%p.log'
fields LRTRIM REJECT ROWS WITH ALL NULL FIELDS (field1 char(32000))
)
LOCATION ('sample3.dat')
)
PARALLEL 10
REJECT LIMIT 10;
|
And create
the relational table:
|
CREATE TABLE SAMPLE3
(
CLASS
VARCHAR2(10 BYTE),
STORE_NUMBER NUMBER(6),
PALLET
NUMBER(6),
LOT
NUMBER(6),
CASE_NUMBER NUMBER(6),
ITEM
NUMBER(6),
COL_NAME VARCHAR2(100
BYTE),
COL_VALUE VARCHAR2(100 BYTE)
);
|
In order to transform raw
data into the relational format, it may be beneficial to try to using SQL instead of
a table function. However, since the number of columns is unknown, and
data is being stored vertically, the best solution is to use a pipelined table
function to prepare the rows.
Now
that there are external table and type definitions, table
function that accepts a REF cursor (pointing to the external table) as an input
parameter needs to be created. This function will start at the beginning of the CURSOR (the first row in the external table),
then loop through the rows of the external table, and simultaneously return rows to the session.
The
keywords CLASS, COLUMN_LISTING, and COLUMN_VALUES instruct the
actions of the function on the next loop. The CLASS keyword sets the local variable
for class, the COLUMN_LISTING keyword builds a collection to store the
column names, and the COLUMN_VALUES keyword causes the preparation and piping
of each row as they are parsed.
|
CREATE OR REPLACE function sample3_tab_fnc (pi_row IN
sys_refcursor)
return sample3_typ_tab PIPELINED AS
l_string LONG;
l_comma_index PLS_INTEGER;
l_index PLS_INTEGER
:= 1;
TYPE ListTyp IS TABLE OF VARCHAR2(100);
l_names ListTyp := ListTyp();
l_namesIter NUMBER :=1;
l_Iter NUMBER :=1;
l_cIter NUMBER :=1;
l_element varchar2(100);
l_field1 varchar2(32000);
pi_rownumber number := 1;
l_class varchar2(10);
l_store_number varchar2(6);
l_pallet varchar2(6);
l_lot varchar2(6);
l_case varchar2(6);
l_item varchar2(6);
l_action varchar2(10);
PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
LOOP
FETCH pi_row INTO l_field1;
EXIT WHEN pi_row%NOTFOUND;
if instr(l_field1, ',', -1) = length(l_field1) then
l_string := l_field1;
else
l_string := l_field1 || ',';
end if;
l_index :=1;
l_Iter :=1;
l_cIter :=1;
if l_field1 = '[CLASS_XXX]' then
l_class := 'XXX';
l_action := null;
l_names := ListTyp();
l_namesIter :=1;
GOTO end_loop;
elsif l_field1 = '[CLASS_YYY]' then
l_class := 'YYY';
l_action := null;
l_names := ListTyp();
l_namesIter :=1;
GOTO end_loop;
elsif l_field1 = '[CLASS_ZZZ]' then
l_class := 'ZZZ';
l_action := null;
l_names := ListTyp();
l_namesIter :=1;
GOTO end_loop;
elsif l_field1 = '[COLUMN_LISTING]' then
l_action := 'COLUMNS';
l_namesIter :=1;
GOTO end_loop;
elsif l_field1 = '[COLUMN_VALUES]' then
l_action := 'VALUES';
GOTO end_loop;
end if;
--Get Value Names into l_names
if l_action = 'COLUMNS' then
LOOP
l_comma_index := INSTR(l_string, ',', l_index);
EXIT WHEN l_comma_index = 0;
l_names.EXTEND;
l_names(l_namesIter) := SUBSTR(l_string, l_index, l_comma_index - l_index);
l_index := l_comma_index + 1;
l_namesIter := l_namesIter + 1;
END LOOP;
end if;
--Get Element and Pipe Rows
if l_action = 'VALUES' then
LOOP
l_comma_index := INSTR(l_string, ',', l_index);
EXIT WHEN l_comma_index = 0;
if l_Iter = 1 then
l_element := SUBSTR(l_string, l_index, l_comma_index - l_index);
if l_class IN ('XXX', 'YYY', 'ZZZ') then
l_store_number := trim(substr(l_element,4,instr(l_element, '/',1,1)-4));
l_pallet := trim(substr(l_element,instr(l_element,
'/',1,1)+1,instr(l_element, '/',1,2)-instr(l_element, '/',1,1)-1 ));
l_lot := trim(substr(l_element,instr(l_element, '/',1,2)+1,instr(l_element,
'/',1,3)-instr(l_element, '/',1,2)-1 ));
l_case := trim(substr(l_element,instr(l_element, '/',1,3)+1,instr(l_element,
'/',1,4)-instr(l_element, '/',1,3)-1 ));
l_item := trim(substr(l_element,instr(l_element, '/',1,4)+1,instr(l_element,
'/',1,5)-instr(l_element, '/',1,4)-1 ));
end if;
else
PIPE ROW ( sample3_typ(trim(l_class), l_store_number, l_pallet, l_lot,
l_case, l_item, trim(l_names(l_cIter)), trim(SUBSTR(l_string, l_index,
l_comma_index - l_index))) );
l_cIter := l_cIter + 1;
end if;
l_Iter := 2; --move out of element row
l_index := l_comma_index + 1;
END LOOP;
end if;
<<end_loop>>
null;
END LOOP;
RETURN;
END;
/
|
The
data can be selected by using SQL
|
select * from table (sample3_tab_fnc(CURSOR(select * from SAMPLE3_EXT)))
/
CLASS STORE_NUMBER PALLET
LOT CASE_NUMBER
ITEM COL_NAME
COL_VALUE
XXX 219
1 0
0
COL1 95
XXX 219
1 0
0
COL2 0
XXX 219
1 0
0
COL3 0
XXX 219
1 0
0
COL4 0
XXX 219
1 0
0
COL5 0
XXX 219
1 0
1
COL1 23
XXX 219
1 0
1
COL2 0
XXX 219
1 0
1
COL3 0
XXX 219
1 0
1
COL4 0
XXX 219
1 0
1
COL5 0
XXX 219
1 0
2
COL1 4
XXX 219
1 0
2
COL2 0
XXX 219
1 0
2
COL3 0
XXX 219
1 0
2
COL4 0
XXX 219
1 0
2
COL5 4
XXX 219
1 0
3
COL1 37
XXX 219
1 0
3
COL2 1
XXX 219
1 0
3
COL3 0
XXX 219
1 0
3
COL4 0
XXX 219
1 0
3
COL5 33
YYY
219 1
1 1
COL21 11092
YYY
219 1
1 1
COL22 49.52
YYY
219 1
1 1
COL23 627
YYY
219 1
1 1
COL24 179827
YYY
219 1
2 1
COL21 11092
YYY
219 1
2 1
COL22 199.52
YYY
219 1
2 1
COL23 628
YYY
219 1
2 1
COL24 179828
YYY
219 1
3 1
COL21 11092
YYY
219 1
3 1
COL22 459.52
YYY
219 1
3 1
COL23 629
YYY
219 1
3 1
COL24 179829
ZZZ 219
0 0
1
COL31 0
ZZZ 219
0 0
1
COL32 0.00
ZZZ 219
0 0
1
COL33 0
ZZZ 219
0 0
1
COL34 0
ZZZ 219
0 0
1
COL35 7201
ZZZ 219
0 0
1
COL36 0
ZZZ 219
0 0
1
COL37 0
ZZZ 219
0 1
2
COL31 0
ZZZ 219
0 1
2
COL32 1.00
ZZZ 219
0 1
2
COL33 0
ZZZ 219
0 1
2
COL34 1
ZZZ 219
0 1
2
COL35 7202
ZZZ 219
0 1
2
COL36 0
ZZZ 219
0 1
2
COL37 2
ZZZ 219
0 2
1
COL31 0
ZZZ 219
0 2
1
COL32 2.00
ZZZ 219
0 2
1
COL33 0
ZZZ 219
0 2
1
COL34 2
ZZZ 219
0 2
1
COL35 7203
ZZZ 219
0 2
1
COL36 0
ZZZ 219
0 2
1
COL37 3
51 rows
selected.
Elapsed: 00:00:00.76
SQL>
|
Using
the same procedure on a larger file, the query above
extracts, transforms, and loads 31,233 records into the relational table in
under 1 second.
|
SQL> insert /*+ append */
into sample3
2 select *
from table (sample3_tab_fnc(CURSOR(select * from SAMPLE3_EXT)))
| |