SQL*Loader: Dealing with delimiter characters in data

Question

I am loading some data to Oracle via SQLLDR. The source file is "pipe delimited".

FIELDS TERMINATED BY '|'

But some records contain pipe character in data, and not as separator. So it breaks correct loading of records as it understands indata pipe characters as field terminator.

Can you point me a direction to solve this issue?

Data file is about 9 GB, so it is hard to edit manually.

For example,

Loaded row:

ABC|1234567|STR 9 R 25|98734959,32|28.12.2011

Rejected Row:

DE4|2346543|WE| 454|956584,84|28.11.2011

Error:

Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month

DATE_N column is the last one.

Are the fields with pipes `|` enclosed in quotes ? for example `DATA1|DATA2|"DATAWITH|PIPE"|DATA3`. In this case you can use `FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'` — Vincent Malgrat, Jan 20 '12 at 13:07
@VincentMalgrat It would be easier but fields are not enclosed by quotes. — bonsvr, Jan 20 '12 at 14:41
Then how is SQL*Loader supposed to know when a field ends? How do you differentiate the "true" pipes from the "false" pipes ? :) — Vincent Malgrat, Jan 20 '12 at 14:51
if the pipe appears only in one field I can help you with an awk parse of the text. So, pipes can appear only in the 3rd field? — Florin Ghita, Jan 20 '12 at 15:05
Generally, that's what the `OPTIONALLY ENCLOSED BY` clause is for; data fields which include the delimiter are to be enclosed by different delimiter. — Adam Musch, Jan 20 '12 at 15:50
Going forward you should tell the person who handed you this file not to include the delimiter in any of the fields unless it is enclosed, this is bad data and should be discouraged before you are given any more : ) — Matt Donnan, Jan 20 '12 at 18:03

Benoit · Accepted Answer · 2012-01-20T16:37:15.627

You could not use any separator, and do something like:

field FILLER,
col1 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\1')",
col2 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\2')",
col3 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\3')",
col4 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\4')",
col5 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\5')",
col6 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\6')"

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

The FILLER absorbs the whole line of data, and the regular expression splits it into its 6 meaningful parts. If you don't know what a regular expression is, lookup the regex tag here. — Benoit, Jan 20 '12 at 16:02
Ok, I got it, +1 :) . I know what is a regular expression, but I never used Oracle REGEXP functions. It is like what I did with awk. You did it with regexps. — Florin Ghita, Jan 20 '12 at 16:04
+1 fantastic. I always loved REGEXPs, however I'm not an expert about them. Can you please explain a little further the regexp code. — bonsvr, Jan 20 '12 at 16:11
@bonsvr, expanded answer. Note that I have not tested anything. — Benoit, Jan 20 '12 at 16:37

score 2 · Answer 2 · answered Jan 20 '12 at 15:14

It looks to me that it's not really possible for SQL*Loader to handle your file because of the third field which: can contain the delimiter, is not surrounded by quotes and is of a variable length. Instead, if the data you provide is an accurate example then I can provide a sample workaround. First, create a table with one column of VARCHAR2 with length the same as the maximum length of any one line in your file. Then just load the entire file into this table. From there you can extract each column with a query such as:

with CTE as
       (select 'ABC|1234567|STR 9 R 25|98734959,32|28.12.2011' as CTETXT
          from dual
        union all
        select 'DE4|2346543|WE| 454|956584,84|28.11.2011' from dual)
select substr(CTETXT, 1, instr(CTETXT, '|') - 1) as COL1
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 1) + 1
             ,instr(CTETXT, '|', 1, 2) - instr(CTETXT, '|', 1, 1) - 1)
         as COL2
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 2) + 1
             ,instr(CTETXT, '|', -1, 1) - instr(CTETXT, '|', 1, 2) - 1)
         as COL3
      ,substr(CTETXT, instr(CTETXT, '|', -1, 1) + 1) as COL4
  from CTE

It's not perfect (though it may be adaptable to SQL*Loader) but would need a bit of work if you have more columns or if your third field is not what I think it is. But, it's a start.

Florin Ghita · Answer 3 · 2012-01-20T16:07:15.173

OK, I recomend you to parse the file and replace the delimiter. In command line in Unix/linux you should do:

cat current_file | awk -F'|' '{printf( "%s,%s,", $1, $2); for(k=3;k<NF-2;k++) printf("%s|", $k); printf("%s,%s,%s", $(NF-2),$(NF-1),$NF);print "";}' > new_file

This command will not change your current file. Will create a new file, comma delimited, with five fields. It splits the input file on "|" and take first, second, anything to antelast, antelast, and last chunk.

You can try to sqlldr the new_file with "," delimiter.

UPDATE: The command can be put in a script like (and named parse.awk)

#!/usr/bin/awk
# parse.awk
BEGIN {FS="|"}
{
printf("%s,%s,", $1, $2);

for(k=3;k<NF-2;k++)
        printf("%s|", $k);

printf("%s,%s,%s\n", $(NF-2),$(NF-1),$NF);
}

and you can run in this way:

cat current_file | awk  -f parse.awk > new_file

wow. anyway, if you have an unix/linux at hand it can be faster to correct the file in this way :) — Florin Ghita, Jan 20 '12 at 15:50

SQL*Loader: Dealing with delimiter characters in data

3 Answers3