Enterprise Solutions with Oracle ETL
Oracle Tips by V.J. Jain
By V.J. Jain, July 2015 (varun.jain@varunjaininc.com)
Varun Jain, Inc. (an Oracle Partner)
Abstract:
Oracle 9i and 10g introduced tools that allow for the composition of an
enterprise ETL solution that is preferable to off-the-shelf products.
The features include:
- setup and use of an external table and analytic SQL to load data from a
complex mixed XML and text file, allowing 1,674 records to be loaded from
the file into a relational table in just 6.85 seconds
- setup and use of a pipelined table function to vertically load more
complex files with dynamic columns, allowing as many as 31,233 records to
be inserted from the file into a relational table in only 0.96 seconds
- application of an itemized enterprise ETL solution once the functionality
and efficiency of the previous features have been achieved
- an exhibition of the dynamic properties of external tables, illustrated
by altering the file source through DDL, including ETL procedures designed
to load files dynamically by accepting the file path as a parameter
- use of NFS drives for space efficiency when performing dynamic
alterations
- a custom queue table, created using an Oracle directory and a custom
procedure, which is populated from the file listing and uses a semaphore
for external procedures
- creation of a DBMS_SCHEDULER job that monitors and processes the queue
table.
It is the combination of these components that results in a highly
functional enterprise solution, offering benefits that surpass
off-the-shelf products such as Informatica, DataStage, or Ab Initio.
Furthermore, Oracle 9i and 10g have added some very powerful tools to
support high performance ETL. While many developers have some familiarity
with Oracle external tables and pipelined table functions, the full
potential of Oracle often goes unrealized.
The
demand for high performance ETL is growing as organizations realize the
importance of business intelligence and the need for a comprehensive data
warehouse. The majority of these organizations use expensive off-the-shelf
products without realizing that they have already licensed the necessary tools
to build a potentially better solution. This article discusses the use of
Oracle ETL features with Oracle Scheduler to provide a real-time, high scale,
and high performance ETL solution.
Familiarity with external
tables and pipelined table functions is necessary to fully understand the
solution:
External Tables
External tables are used to provide relational access to the contents of a file
from an Oracle directory. The external table script defines the rules for
reading the file, and once the table is created, the external table is
accessed just like a regular table. Here's an example:
Create directory DAT_DIR as '/usr/tmp';
File: sample.csv saved to /usr/tmp

ID, DATE, FIRST_NAME, LAST_NAME, STATUS
1, 17-JUL-07, John, Adams, Active
2, 17-JUL-07, Tyler, Howell, Active
3, 17-JUL-07, Jim, Lopez, Active
4, 17-JUL-07, Carlos, White, Inactive
5, 17-JUL-07, Scott, Tiger, Active
External table definition:
CREATE TABLE SAMPLE_EXT
(
  ID         NUMBER,
  "DATE"     VARCHAR2(10),  -- DATE is a reserved word, so it must be quoted
  FIRST_NAME VARCHAR2(30),
  LAST_NAME  VARCHAR2(30),
  STATUS     VARCHAR2(10)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY DAT_DIR
  ACCESS PARAMETERS
  (
    records delimited by newline LOAD WHEN (ID != 'ID')
    badfile BAD_DIR:'SAMPLE_EXT%a_%p.bad'  -- BAD_DIR and LOG_DIR are separate Oracle directories
    logfile LOG_DIR:'SAMPLE_EXT%a_%p.log'
    fields terminated by ',' optionally enclosed by '"' LRTRIM
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('sample.csv')
)
PARALLEL 4
REJECT LIMIT 1000;
Note: several of the properties in this definition are optional. For
example, SKIP 1 could have been used in the access parameters, but the
LOAD WHEN clause above can be a better alternative if it cannot be
guaranteed that all file sources will include a header line.
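For comparison, here is a minimal sketch of the SKIP 1 variant, assuming
every source file is guaranteed to include exactly one header line (this
alternate table is purely illustrative):

CREATE TABLE SAMPLE_EXT_SKIP
(
  ID         NUMBER,
  "DATE"     VARCHAR2(10),
  FIRST_NAME VARCHAR2(30),
  LAST_NAME  VARCHAR2(30),
  STATUS     VARCHAR2(10)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY DAT_DIR
  ACCESS PARAMETERS
  (
    records delimited by newline skip 1  -- unconditionally discards the first line
    fields terminated by ',' optionally enclosed by '"' LRTRIM
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('sample.csv')
)
REJECT LIMIT 1000;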
Now that the external table has been created and the file has been placed
on the server, the data can be accessed using SQL such as:
Select * from SAMPLE_EXT
/
ID  DATE       FIRST_NAME  LAST_NAME  STATUS
1   17-JUL-07  John        Adams      Active
2   17-JUL-07  Tyler       Howell     Active
3   17-JUL-07  Jim         Lopez      Active
4   17-JUL-07  Carlos      White      Inactive
5   17-JUL-07  Scott       Tiger      Active
Select * from SAMPLE_EXT where STATUS = 'Inactive'
/

ID  DATE       FIRST_NAME  LAST_NAME  STATUS
4   17-JUL-07  Carlos      White      Inactive
This demonstrates complete relational access to the file, even though the
file resides outside the database. Loading the contents of this file into
a relational table is also easy to accomplish.
First,
create a matching relational table as exemplified below:
CREATE TABLE SAMPLE_TAB
(
  ID         NUMBER,
  "DATE"     VARCHAR2(10),
  FIRST_NAME VARCHAR2(30),
  LAST_NAME  VARCHAR2(30),
  STATUS     VARCHAR2(10)
);
Next,
insert the data from the external table to the relational table:
Insert into sample_tab
Select * from sample_ext
/
It is a common misconception that external tables are only useful for
simple formats such as comma or tab delimited files. As the following
examples show, this is far from the truth.
External tables can also be used for complex files. Because external
tables are relationally accessible, the full flexibility of the SQL
language can be used to get the data into the desired format, and if the
file format is too complex for SQL to structure appropriately, a pipelined
table function can be used. When using external tables for more complex
file formats, placing each row from the file into a single-column row in
the external table offers great flexibility for transforming the data.
Consider the following file:

sample2.dat (actual file is several megabytes)

23:18:00 PST 18 Feb 2015
<Inventory name="CORVETTE">
<AppId id="1"/>
<Color>Yellow</Color>
</Inventory>
23:18:00 PST 18 Feb 2015
<Inventory name="CAMARO">
<AppId id="2"/>
<Color>White</Color>
</Inventory>
23:18:00 PST 18 Feb 2015
<Inventory name="PROWLER">
<AppId id="3"/>
<Color>Blue</Color>
</Inventory>
This file is obviously a blend of text and XML. Though this is not a
preferred format, in ETL there is often no control over the data sources.
How can an external table be used to efficiently load a file structured
like this? The solution is probably easier than assumed.
First,
create the external table:
CREATE TABLE SAMPLE2_EXT (
field1 varchar2(4000)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY DAT_DIR
ACCESS PARAMETERS
(
records delimited by newline
badfile DAT_DIR:'SAMPLE2_EXT%a_%p.bad'
logfile DAT_DIR:'SAMPLE2_EXT%a_%p.log'
fields LRTRIM REJECT ROWS WITH ALL NULL FIELDS (field1 char(4000))
)
LOCATION ('sample2.dat')
)
REJECT LIMIT 10;
Next,
create the relational table:
CREATE TABLE SAMPLE2
(
BEGIN_DATE DATE,
INVENTORY VARCHAR2(30),
APPID_ID VARCHAR2(30),
COLOR VARCHAR2(30)
);
Now, extract, transform, and load the data from the external table into
the relational table:
SQL> insert /*+ append */ into sample2
select /*+ materialize */
       max(BEGIN_DATE) BEGIN_DATE,
       max(INVENTORY) INVENTORY,
       max(APPID_ID) APPID_ID,
       max(COLOR) COLOR
from
(
  select r, grp,
         case when length(translate(substr(field1,1,1),' 0123456789',' ')) is null
              then to_date(to_char(to_timestamp_tz(field1, 'HH24:MI:SS TZD DD Mon YYYY'),
                   'DD-MON-YYYY HH:MI:SS AM'), 'DD-MON-YYYY HH:MI:SS AM')
         end BEGIN_DATE,
         case when instr(field1,'<Inventory') > 0
              then extractValue(XMLType(replace(field1, '">', '"/>')), '/Inventory/@name')
         end INVENTORY,
         case when instr(field1,'<AppId') > 0
              then extractValue(XMLType(replace(field1, '">', '"/>')), '/AppId/@id')
         end APPID_ID,
         case when instr(field1,'<Color') > 0
              then extractValue(XMLType(field1), '/Color')
         end COLOR
  from
  (
    select r, field1, nvl(max(grp) over (order by r), 0) grp
    from
    (
      select A.r, A.field1,
             case when length(translate(substr(field1,1,1),' 0123456789',' ')) is null
                  then A.r
             end grp
      from
      (
        select rownum r, field1 from SAMPLE2_EXT
      ) A
    )
  )
)
group by grp
/
1674 rows created.
Elapsed: 00:00:06.85
SQL>
The query above can extract, transform, and load 1,674 records into a
relational table in under 7 seconds. This is good performance considering
the file mixes XML and non-XML content. Here is a sample of what is in the
relational table:
select * from sample2
where rownum <= 3
/

BEGIN_DATE             INVENTORY  APPID_ID  COLOR
2/18/2015 11:18:00 PM  CORVETTE   1         Yellow
2/18/2015 11:18:00 PM  CAMARO     2         White
2/18/2015 11:18:00 PM  PROWLER    3         Blue
The example above demonstrates a relatively simple use of a single-column
external table to transform raw data into a relational format, but what
happens when the format is more complex?
Pipelined Table Functions
Though they can get quite complicated, a table function is simply a
function that returns rows. Inside the table function, each row is
prepared and piped back to the session as soon as it is ready. The output
rows are based on a type definition.
Here is
a simple example of a table function that accepts a VARCHAR2 input and returns
rows with 1 character per row.
CREATE TYPE row_typ AS OBJECT (p_row_output varchar2(10));
/
CREATE TYPE row_typ_tab AS TABLE OF row_typ;
/
create or replace function example_tab_fnc (pi_value IN varchar2)
return row_typ_tab PIPELINED AS
  l_length number;
  l_index  PLS_INTEGER := 1;
  PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
  LOOP
    l_length := length(pi_value);
    EXIT WHEN l_index > l_length;
    PIPE ROW ( row_typ(substr(pi_value, l_index, 1)) );
    l_index := l_index + 1;
  END LOOP;
  RETURN;
END;
/
select * from table (example_tab_fnc('ABCDE'))
/
P_ROW_OUTPUT
A
B
C
D
E
This feature could obviously prove useful in an ETL environment. While the
demonstration above is simplified, pipelined table functions can be used
to transform complex files into a usable format. For example, a table
function can accept as input a REF cursor pointing at an external table.
Consider the following file:
sample3.dat
[CLASS_XXX]
[COLUMN_LISTING]
COL1,COL2,COL3,COL4,COL5
[COLUMN_VALUES]
XXX 219/ 1/ 0/ 0/ 1,95,0,0,0,0
XXX 219/ 1/ 0/ 1/ 1,23,0,0,0,0
XXX 219/ 1/ 0/ 2/ 1,4,0,0,0,4
XXX 219/ 1/ 0/ 3/ 1,37,1,0,0,33
[CLASS_YYY]
[COLUMN_LISTING]
COL21,COL22,COL23,COL24
[COLUMN_VALUES]
YYY 219/ 1/ 1/ 1/ 1,11092,49.52,627,179827
YYY 219/ 1/ 2/ 1/ 1,11092,199.52,628,179828
YYY 219/ 1/ 3/ 1/ 1,11092,459.52,629,179829
[CLASS_ZZZ]
[COLUMN_LISTING]
COL31,COL32,COL33,COL34,COL35,COL36,COL37
[COLUMN_VALUES]
ZZZ 219/ 0/ 0/ 1/ 1,0,0.00,0,0,7201,0,0
ZZZ 219/ 0/ 1/ 2/ 1,0,1.00,0,1,7202,0,2
ZZZ 219/ 0/ 2/ 1/ 2,0,2.00,0,2,7203,0,3
This file details the inventory of three different warehouses in the same
file, each in its own class. Every class has a dynamic listing of columns
and column values. Preceding the values on each line is a string such as
'XXX 219/ 1/ 0/ 3/ 1', 'YYY 219/ 1/ 1/ 1/ 1', or 'ZZZ 219/ 0/ 0/ 1/ 1'.
Each of the values in these strings maps to a source property that forms
an inventory locator, and must be stored for each row.
For example, 'ZZZ 219/ 0/ 0/ 1/ 1' maps to:

Class: ZZZ
Store Number: 219
Pallet: 0
Lot: 0
Case: 1
Item: 1
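As a preview of the parsing that the table function will perform, here is
a sketch of splitting one locator string into its components with INSTR
and SUBSTR (the literal is just the example value above):

-- Parses 'ZZZ 219/ 0/ 0/ 1/ 1' into the six locator components.
select substr(loc, 1, 3)                                   class,
       trim(substr(loc, 4, instr(loc,'/',1,1) - 4))        store_number,
       trim(substr(loc, instr(loc,'/',1,1) + 1,
            instr(loc,'/',1,2) - instr(loc,'/',1,1) - 1))  pallet,
       trim(substr(loc, instr(loc,'/',1,2) + 1,
            instr(loc,'/',1,3) - instr(loc,'/',1,2) - 1))  lot,
       trim(substr(loc, instr(loc,'/',1,3) + 1,
            instr(loc,'/',1,4) - instr(loc,'/',1,3) - 1))  case_number,
       trim(substr(loc, instr(loc,'/',1,4) + 1))           item
from (select 'ZZZ 219/ 0/ 0/ 1/ 1' loc from dual);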
It is not feasible to have a separate table column for each possibility
because there could be thousands of different columns for each class. The
best solution is to store the data vertically, using a standard label for
column name and column value. A desired relational storage would include
the following columns:

CLASS, STORE_NUMBER, PALLET, LOT, CASE_NUMBER, ITEM, COL_NAME, COL_VALUE
In order to have the table function pipe rows with these columns, we need
to define an appropriate type:
CREATE TYPE sample3_typ AS OBJECT
(
CLASS varchar2(10),
STORE_NUMBER number(6),
PALLET number(6),
LOT number(6),
CASE_NUMBER number(6),
ITEM number(6),
COL_NAME varchar2(100),
COL_VALUE varchar2(100)
);
/
CREATE TYPE sample3_typ_tab AS TABLE OF sample3_typ;
/
If
the number of columns were fixed, we might try defining an external table that
leverages the comma delimiters in the file format. However, since the
number of columns is unknown, we will define our external table to have a single
column.
In the original file, the
column listing was over 16,000 characters long; to allow for this condition, we
must define the
column to be of type CLOB:
CREATE TABLE SAMPLE3_EXT (
field1 clob
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY DAT_DIR
ACCESS PARAMETERS
(
records delimited by newline
badfile DAT_DIR:'SAMPLE3_EXT%a_%p.bad'
logfile DAT_DIR:'SAMPLE3_EXT%a_%p.log'
fields LRTRIM REJECT ROWS WITH ALL NULL FIELDS (field1 char(32000))
)
LOCATION ('sample3.dat')
)
PARALLEL 10
REJECT LIMIT 10;
And create the relational table:
CREATE TABLE SAMPLE3
(
  CLASS        VARCHAR2(10 BYTE),
  STORE_NUMBER NUMBER(6),
  PALLET       NUMBER(6),
  LOT          NUMBER(6),
  CASE_NUMBER  NUMBER(6),
  ITEM         NUMBER(6),
  COL_NAME     VARCHAR2(100 BYTE),
  COL_VALUE    VARCHAR2(100 BYTE)
);
To transform the raw data into the relational format, it might seem
preferable to use SQL instead of a table function. However, since the
number of columns is unknown and the data is being stored vertically, the
best solution is to use a pipelined table function to prepare the rows.
Now that the external table and type definitions exist, a table function
that accepts a REF cursor (pointing to the external table) as an input
parameter needs to be created. This function starts at the beginning of
the cursor (the first row in the external table), loops through the rows
of the external table, and returns rows to the session as it goes.
The keyword lines direct the actions of the function on subsequent
iterations: a [CLASS_...] line sets the local variable for class, the
[COLUMN_LISTING] line causes a collection of column names to be built, and
the [COLUMN_VALUES] line causes each subsequent row to be parsed and piped.
CREATE OR REPLACE function sample3_tab_fnc (pi_row IN sys_refcursor)
return sample3_typ_tab PIPELINED AS
  l_string       LONG;
  l_comma_index  PLS_INTEGER;
  l_index        PLS_INTEGER := 1;
  TYPE ListTyp IS TABLE OF VARCHAR2(100);
  l_names        ListTyp := ListTyp();
  l_namesIter    NUMBER := 1;
  l_Iter         NUMBER := 1;
  l_cIter        NUMBER := 1;
  l_element      varchar2(100);
  l_field1       varchar2(32000);
  l_class        varchar2(10);
  l_store_number varchar2(6);
  l_pallet       varchar2(6);
  l_lot          varchar2(6);
  l_case         varchar2(6);
  l_item         varchar2(6);
  l_action       varchar2(10);
  PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
  LOOP
    FETCH pi_row INTO l_field1;
    EXIT WHEN pi_row%NOTFOUND;
    -- Ensure every line ends with a comma so the parsing loops terminate cleanly
    if instr(l_field1, ',', -1) = length(l_field1) then
      l_string := l_field1;
    else
      l_string := l_field1 || ',';
    end if;
    l_index := 1;
    l_Iter  := 1;
    l_cIter := 1;
    -- Keyword lines set up state for the lines that follow
    if l_field1 = '[CLASS_XXX]' then
      l_class := 'XXX'; l_action := null;
      l_names := ListTyp(); l_namesIter := 1;
      GOTO end_loop;
    elsif l_field1 = '[CLASS_YYY]' then
      l_class := 'YYY'; l_action := null;
      l_names := ListTyp(); l_namesIter := 1;
      GOTO end_loop;
    elsif l_field1 = '[CLASS_ZZZ]' then
      l_class := 'ZZZ'; l_action := null;
      l_names := ListTyp(); l_namesIter := 1;
      GOTO end_loop;
    elsif l_field1 = '[COLUMN_LISTING]' then
      l_action := 'COLUMNS'; l_namesIter := 1;
      GOTO end_loop;
    elsif l_field1 = '[COLUMN_VALUES]' then
      l_action := 'VALUES';
      GOTO end_loop;
    end if;
    --Get column names into l_names
    if l_action = 'COLUMNS' then
      LOOP
        l_comma_index := INSTR(l_string, ',', l_index);
        EXIT WHEN l_comma_index = 0;
        l_names.EXTEND;
        l_names(l_namesIter) := SUBSTR(l_string, l_index, l_comma_index - l_index);
        l_index := l_comma_index + 1;
        l_namesIter := l_namesIter + 1;
      END LOOP;
    end if;
    --Parse the locator element, then pipe one row per value
    if l_action = 'VALUES' then
      LOOP
        l_comma_index := INSTR(l_string, ',', l_index);
        EXIT WHEN l_comma_index = 0;
        if l_Iter = 1 then
          l_element := SUBSTR(l_string, l_index, l_comma_index - l_index);
          if l_class IN ('XXX', 'YYY', 'ZZZ') then
            l_store_number := trim(substr(l_element, 4, instr(l_element,'/',1,1) - 4));
            l_pallet := trim(substr(l_element, instr(l_element,'/',1,1)+1,
                             instr(l_element,'/',1,2) - instr(l_element,'/',1,1) - 1));
            l_lot    := trim(substr(l_element, instr(l_element,'/',1,2)+1,
                             instr(l_element,'/',1,3) - instr(l_element,'/',1,2) - 1));
            l_case   := trim(substr(l_element, instr(l_element,'/',1,3)+1,
                             instr(l_element,'/',1,4) - instr(l_element,'/',1,3) - 1));
            -- item is the remainder after the fourth (last) slash
            l_item   := trim(substr(l_element, instr(l_element,'/',1,4)+1));
          end if;
        else
          PIPE ROW ( sample3_typ(trim(l_class), l_store_number, l_pallet, l_lot,
                     l_case, l_item, trim(l_names(l_cIter)),
                     trim(SUBSTR(l_string, l_index, l_comma_index - l_index))) );
          l_cIter := l_cIter + 1;
        end if;
        l_Iter := 2; --move past the locator element
        l_index := l_comma_index + 1;
      END LOOP;
    end if;
    <<end_loop>>
    null;
  END LOOP;
  RETURN;
END;
/
The data can now be selected using SQL:
select * from table (sample3_tab_fnc(CURSOR(select * from SAMPLE3_EXT)))
/
CLASS  STORE_NUMBER  PALLET  LOT  CASE_NUMBER  ITEM  COL_NAME  COL_VALUE
XXX    219           1       0    0            1     COL1      95
XXX    219           1       0    0            1     COL2      0
XXX    219           1       0    0            1     COL3      0
XXX    219           1       0    0            1     COL4      0
XXX    219           1       0    0            1     COL5      0
XXX    219           1       0    1            1     COL1      23
XXX    219           1       0    1            1     COL2      0
XXX    219           1       0    1            1     COL3      0
XXX    219           1       0    1            1     COL4      0
XXX    219           1       0    1            1     COL5      0
XXX    219           1       0    2            1     COL1      4
XXX    219           1       0    2            1     COL2      0
XXX    219           1       0    2            1     COL3      0
XXX    219           1       0    2            1     COL4      0
XXX    219           1       0    2            1     COL5      4
XXX    219           1       0    3            1     COL1      37
XXX    219           1       0    3            1     COL2      1
XXX    219           1       0    3            1     COL3      0
XXX    219           1       0    3            1     COL4      0
XXX    219           1       0    3            1     COL5      33
YYY    219           1       1    1            1     COL21     11092
YYY    219           1       1    1            1     COL22     49.52
YYY    219           1       1    1            1     COL23     627
YYY    219           1       1    1            1     COL24     179827
YYY    219           1       2    1            1     COL21     11092
YYY    219           1       2    1            1     COL22     199.52
YYY    219           1       2    1            1     COL23     628
YYY    219           1       2    1            1     COL24     179828
YYY    219           1       3    1            1     COL21     11092
YYY    219           1       3    1            1     COL22     459.52
YYY    219           1       3    1            1     COL23     629
YYY    219           1       3    1            1     COL24     179829
ZZZ    219           0       0    1            1     COL31     0
ZZZ    219           0       0    1            1     COL32     0.00
ZZZ    219           0       0    1            1     COL33     0
ZZZ    219           0       0    1            1     COL34     0
ZZZ    219           0       0    1            1     COL35     7201
ZZZ    219           0       0    1            1     COL36     0
ZZZ    219           0       0    1            1     COL37     0
ZZZ    219           0       1    2            1     COL31     0
ZZZ    219           0       1    2            1     COL32     1.00
ZZZ    219           0       1    2            1     COL33     0
ZZZ    219           0       1    2            1     COL34     1
ZZZ    219           0       1    2            1     COL35     7202
ZZZ    219           0       1    2            1     COL36     0
ZZZ    219           0       1    2            1     COL37     2
ZZZ    219           0       2    1            2     COL31     0
ZZZ    219           0       2    1            2     COL32     2.00
ZZZ    219           0       2    1            2     COL33     0
ZZZ    219           0       2    1            2     COL34     2
ZZZ    219           0       2    1            2     COL35     7203
ZZZ    219           0       2    1            2     COL36     0
ZZZ    219           0       2    1            2     COL37     3

53 rows selected.

Elapsed: 00:00:00.76
SQL>
Using the same procedure on a larger file, the insert below extracts,
transforms, and loads 31,233 records into the relational table in under
1 second.
SQL> insert /*+ append */ into sample3
select * from table (sample3_tab_fnc(CURSOR(select * from SAMPLE3_EXT)))
/

31233 rows created.

Elapsed: 00:00:00.96
SQL>
This design allows the content of the data file to change dynamically
without requiring modifications to the code. For example, if a new class
(warehouse) is added, or columns are added or removed, the data will
continue to load without interruption.
Building an enterprise ETL solution
The previous examples provide convincing evidence that Oracle external
tables and pipelined table functions offer an efficient, highly flexible
method for performing ETL. The remainder of this article walks through the
building blocks of an enterprise solution that applies these features in a
dynamic, high scale, real-time loading environment.
Dynamic External Table Definitions
The first element of the solution is the ability to dynamically control
the data source location for an external table. The following command
changes SAMPLE3_EXT from using sample3.dat to using sample3a.dat, meaning
the file source can be changed dynamically from PL/SQL:

SQL> ALTER TABLE SAMPLE3_EXT LOCATION ('sample3a.dat');

Table altered.
Keep this in mind while the other elements of the solution are explained.
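Because ALTER TABLE is DDL, issuing it from PL/SQL requires dynamic SQL.
A minimal sketch, using a hypothetical helper procedure (a production
version should validate the file name before concatenating it into DDL):

-- Hypothetical helper: repoints SAMPLE3_EXT at a new source file.
CREATE OR REPLACE PROCEDURE set_ext_table_source (pi_filename IN VARCHAR2) AS
BEGIN
  EXECUTE IMMEDIATE
    'ALTER TABLE SAMPLE3_EXT LOCATION (''' || pi_filename || ''')';
END;
/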
NFS - Network File Systems
While a network drive is not necessary, it can help because the incoming
directories can reside on the network drive instead of on the database
server's local storage. This minimizes the risk of disk space problems
affecting the database and can improve the security model, because the NFS
drive can be written to without access to the Oracle server.
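For example, assuming /incoming is an NFS mount point visible to the
database server, the incoming Oracle directory simply points at it:

CREATE DIRECTORY VENDOR_DIR AS '/incoming/vendor';  -- /incoming assumed to be an NFS mount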
Create and Populate Custom Queue
The second element of the solution is a table to track all of the ETL
processes. This queue table should capture the usual information: file
path, queue date, start and end times, status, and so on.
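A minimal sketch of such a queue table (the table and column names are
illustrative, not prescriptive):

CREATE TABLE ETL_FILE_QUEUE
(
  QUEUE_ID   NUMBER PRIMARY KEY,
  FILE_PATH  VARCHAR2(400) NOT NULL,        -- directory plus file name
  FILE_NAME  VARCHAR2(200) NOT NULL,
  QUEUE_DATE DATE DEFAULT SYSDATE,          -- when the file was first seen
  START_DATE DATE,                          -- when processing began
  END_DATE   DATE,                          -- when processing finished
  STATUS     VARCHAR2(20) DEFAULT 'QUEUED'  -- QUEUED / RUNNING / COMPLETE / ERROR
);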
There are a couple of options for populating a custom queue table. The
first is to monitor the incoming folders for new files using a process on
the operating system. Another is to define the incoming directories as
Oracle directories and use a custom directory-listing procedure: a Java
procedure provides relational access to the listing of files, and the
filenames in the directory are copied to a temporary table that can be
queried. Tom Kyte gives a great explanation of how to accomplish this.
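In outline, that approach pairs a temporary table with a Java stored
procedure that lists a directory into it. Only the shape is sketched here;
the Java class itself (DirList in Kyte's example) is omitted:

-- Session-private table holding the most recent directory listing.
CREATE GLOBAL TEMPORARY TABLE DIRECTORY_LIST_TMP
( FILENAME VARCHAR2(255) )
ON COMMIT DELETE ROWS;

-- PL/SQL wrapper published over the Java method that performs the listing.
CREATE OR REPLACE PROCEDURE get_directory_list (pi_directory IN VARCHAR2)
AS LANGUAGE JAVA
NAME 'DirList.getList(java.lang.String)';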
For example, consider an incoming folder /incoming/vendor and a
corresponding Oracle directory VENDOR_DIR. New files file1.dat, file2.dat,
and file3.dat are placed in the folder, and the directory listing
procedure is executed from a procedure scheduled in DBMS_SCHEDULER.

Begin
  Get_directory_list('/incoming/vendor');
End;
/
Select filename from directory_list_tmp
/

FILENAME
file1.dat
file2.dat
file3.dat
The scheduled process selects this file listing, compares it with the
files already queued, and inserts any unrecognized files into the queue.
This simple process can add all incoming files to the custom queue table
on a near real-time basis.
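A sketch of that comparison step, reusing the hypothetical ETL_FILE_QUEUE
table from above and assuming a sequence ETL_QUEUE_SEQ exists:

-- Queue any file in the listing that has not been queued already.
INSERT INTO etl_file_queue (queue_id, file_path, file_name)
SELECT etl_queue_seq.NEXTVAL,
       '/incoming/vendor/' || t.filename,
       t.filename
FROM   directory_list_tmp t
WHERE  NOT EXISTS (SELECT 1
                   FROM   etl_file_queue q
                   WHERE  q.file_name = t.filename);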
ETL Procedures
As previously explained, external tables and pipelined table functions can
be used in PL/SQL procedures to load the data from the files into the
relational tables. There is a different procedure for each file format,
each accepting the file path as a parameter. Before initiating the load,
the procedure should dynamically change the file source using the syntax
shown above. It may also be helpful to change the log and bad files to be
file specific. At the end of the procedure, the corresponding queue record
should be updated appropriately.
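A skeleton of one such procedure, tying the queue updates to the dynamic
location change (the names are the hypothetical ones from the earlier
sketches, and error handling is reduced to the essentials):

CREATE OR REPLACE PROCEDURE load_sample3_file (pi_queue_id IN NUMBER,
                                               pi_filename IN VARCHAR2) AS
BEGIN
  UPDATE etl_file_queue
  SET    status = 'RUNNING', start_date = SYSDATE
  WHERE  queue_id = pi_queue_id;
  COMMIT;

  -- Repoint the external table (log/bad files could be made file specific here too)
  EXECUTE IMMEDIATE
    'ALTER TABLE SAMPLE3_EXT LOCATION (''' || pi_filename || ''')';

  INSERT /*+ append */ INTO sample3
  SELECT * FROM TABLE (sample3_tab_fnc(CURSOR(SELECT * FROM SAMPLE3_EXT)));

  UPDATE etl_file_queue
  SET    status = 'COMPLETE', end_date = SYSDATE
  WHERE  queue_id = pi_queue_id;
  COMMIT;
EXCEPTION
  WHEN OTHERS THEN
    ROLLBACK;
    UPDATE etl_file_queue
    SET    status = 'ERROR', end_date = SYSDATE
    WHERE  queue_id = pi_queue_id;
    COMMIT;
END;
/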
Concurrency (Multi-threaded external tables)
Because the Oracle ETL procedures are so fast, a single thread (single
concurrency per external table) should be sufficient for the majority of
requirements. However, if multi-threading an external table is necessary,
it is fairly simple to achieve by using a semaphore.
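One way to implement the semaphore is DBMS_LOCK: each loader session must
hold a named lock while it repoints and reads the shared external table.
A sketch (the lock name and the call to the earlier hypothetical loader
are illustrative):

DECLARE
  l_lock_handle VARCHAR2(128);
  l_status      INTEGER;
BEGIN
  -- Allocate a named lock to act as a semaphore for SAMPLE3_EXT.
  DBMS_LOCK.ALLOCATE_UNIQUE('SAMPLE3_EXT_LOCK', l_lock_handle);
  -- Wait up to 60 seconds for exclusive access.
  l_status := DBMS_LOCK.REQUEST(l_lock_handle, DBMS_LOCK.X_MODE, timeout => 60);
  IF l_status = 0 THEN
    -- Safe to repoint and load while holding the lock.
    load_sample3_file(pi_queue_id => 1, pi_filename => 'sample3a.dat');
    l_status := DBMS_LOCK.RELEASE(l_lock_handle);
  END IF;
END;
/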
Putting the pieces together
There should be a scheduled job that scans and processes the queue based
on particular business rules. If files in the queue have not been started,
this process submits a new job to DBMS_SCHEDULER, which runs the
appropriate ETL procedure for that file and updates the record in the
queue appropriately. Because the file path is passed to the ETL procedure,
the procedure can dynamically redefine the external table's data source to
point at that file. Rules for handling queued records in error or complete
status can be added to this job, as can rules defining concurrency.
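A sketch of the monitoring job, using the hypothetical names from the
earlier sketches:

-- Recurring job: scans the queue once per minute.
BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'ETL_QUEUE_MONITOR',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'PROCESS_ETL_QUEUE',  -- hypothetical queue-scanning procedure
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=1',
    enabled         => TRUE);
END;
/

-- Inside PROCESS_ETL_QUEUE, each unstarted file would get its own one-off job:
--   DBMS_SCHEDULER.CREATE_JOB (
--     job_name   => 'ETL_LOAD_' || l_queue_id,
--     job_type   => 'PLSQL_BLOCK',
--     job_action => 'BEGIN load_sample3_file(' || l_queue_id || ', ''' ||
--                   l_filename || '''); END;',
--     enabled    => TRUE);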
At any time, the queue table can provide a full, real-time view of the
system's entire ETL activity. Having all elements of the ETL architecture
within the database has many benefits, including reporting, cost
management, error handling, and maintenance. Together, the elements above
form an efficient alternative to an off-the-shelf ETL tool.