
I have very dirty data in a csv file I need to periodically dump into a table. This csv file contains all the previous records from the previous days (thanks third party vendor!), but I am content to let the unique constraint keep those from duplicating.

However, the nature of this dirty data has nincompoops fat-fingering extra commas into the data itself, and double-quoting the fields is just too difficult (don't strain yourself third party vendor!). So, while the ideal record has 7 fields, some of these are clocking in at up to 12 fields.

And the field I really need is the date attached. It's always the 7th field (unless it's the 8th, 9th, 10th, 11th, or 12th).

Is there a way to specify that this column is the last field, always?

The control file seems to allow fields-by-position, but only for fixed length records (characters 45-68, etc.). Is there some way to just tell it "the last one" and work from that?

Note: Having dug through the documentation and learned that every interesting/useful thing I need to do is impossible, I already know the answers, but I have to ask anyway.

John O
  • Do you have the option of pre-processing the file, maybe using a script to move the last field to the start of the line? – Alex Poole Apr 11 '16 at 21:39
  • @AlexPoole I have that option, I suppose. Also whining to my boss to see if the vendor can fix this. I honestly don't want to have to go through a bunch of different steps with this though, starting to feel like I'd have been better off rolling my own (even if I know that's a mistake). Mostly needed confirmation that what I'm asking is impossible. – John O Apr 11 '16 at 21:45
  • Assuming that maximal efficiency isn't a major concern, I'd be tempted to load the data into a single `varchar2(4000)` (or 32000 if you're on 12.1 with extra-long strings) and then write your own parsing logic (i.e. everything to the right of the 6th comma is the field you're looking for). – Justin Cave Apr 11 '16 at 22:32
  • Please show some sample good and bad records so we can see the complexity of the data. First step is to lean on the vendor to give scrubbed data. That said, I have to deal with the same crap :-/ Maybe read the entire line into a BOUNDFILLER, then parse using REGEXP_SUBSTR() calls if the data is not too complex or performance is not a concern. I'm basically echoing what Alex and Justin said, but it is not impossible, just a pain to parse on patterns in the data, not necessarily by a delimiter when you can't trust it will be in the right place. – Gary_W Apr 12 '16 at 14:42
  • Please show the format of that date data in the last field. – Gary_W Apr 12 '16 at 15:24
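The pre-processing route suggested in the comments could be sketched as below. This is a hypothetical Python filter (not from the original thread) that moves the last comma-separated field to the front of each line, so sqlldr can then read it as field 1; it assumes the date is always the final field and that the broken quoting means a plain `split(",")` is the only safe way to tokenize:

```python
def move_last_field_first(line):
    """Move the last comma-separated field to the front of the line."""
    fields = line.rstrip("\n").split(",")
    return ",".join([fields[-1]] + fields[:-1])

# Example: a record where extra commas pushed the date out to field 12
print(move_last_field_first("a,b,c,d,e,f,g,h,i,j,k,04/11/2016"))
# -> 04/11/2016,a,b,c,d,e,f,g,h,i,j,k
```

You would run the whole file through this (e.g. reading stdin line by line) before handing it to sqlldr, at the cost of the extra step the asker was hoping to avoid.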

1 Answer


With a .csv looking like this (variable-length records with the date last, per your problem):

Lance,Link,07/28/1968
Mata,M,Hairi,11/22/1969
Ba,ron,von But,cher,,,02/28/1966

Construct your control file like this:

load data
infile 'x_test.csv'
into table x_test
truncate
FIELDS TERMINATED BY x'0D0A' TRAILING NULLCOLS
(
  x     BOUNDFILLER,
  col1  EXPRESSION  "TO_DATE(REGEXP_SUBSTR(:x, '.*,(.*)$', 1, 1, NULL, 1), 'MM/DD/YYYY')"
)

SQL*Loader reads each whole line as a single field (because the field terminator is the line terminator, x'0D0A'), and since that field is defined as BOUNDFILLER it does not try to load it directly, but remembers it for later use as ':x'. Next it sees 'col1', which matches a column in the table, so it evaluates the EXPRESSION: the regex extracts from the remembered buffer ':x' everything after the last comma up to the end of the line, TO_DATE converts that to a date, and the date is loaded into col1.
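The key to the expression is that the greedy `.*` before the comma consumes as much as possible, so the comma in the pattern is forced to match the *last* comma on the line. A quick Python sketch (the `re` module's greedy matching behaves the same way here as Oracle's `REGEXP_SUBSTR` does for this pattern):

```python
import re

lines = [
    "Lance,Link,07/28/1968",
    "Mata,M,Hairi,11/22/1969",
    "Ba,ron,von But,cher,,,02/28/1966",
]
for line in lines:
    # Greedy .* eats everything up to the LAST comma;
    # group 1 captures whatever follows it.
    m = re.search(r".*,(.*)$", line)
    print(m.group(1))
# -> 07/28/1968
#    11/22/1969
#    02/28/1966
```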

SQL> select col1
     from x_test;

COL1
---------
28-JUL-68
22-NOV-69
28-FEB-66

SQL>

I believe that takes care of the issue. You will most likely need to tweak the date format, and really the regex should be stricter about the date format you are actually dealing with, rather than just accepting whatever happens to sit at the end of the line.
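A stricter pattern might anchor on the MM/DD/YYYY shape itself, so that a line with no trailing date fails the match instead of loading garbage into TO_DATE. Sketched in Python (Oracle's regex engine also supports `\d` in recent versions, so the same pattern should carry over to the control-file expression, though test it against your own data):

```python
import re

# Require a literal MM/DD/YYYY-shaped field at the end of the line
STRICT = r",(\d{2}/\d{2}/\d{4})$"

print(re.search(STRICT, "Ba,ron,von But,cher,,,02/28/1966").group(1))
# -> 02/28/1966

# A line with no trailing date simply doesn't match
print(re.search(STRICT, "junk,line,with,no,date"))
# -> None
```

This only validates the shape, not the values (13/45/9999 would still pass), but it rejects the obviously malformed tail-end fields before TO_DATE ever sees them.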

Gary_W