
I have some very large files (millions of records) that I need to load into a database. They are of the form:

word1\tblahblahblah
word2\tblahblah
word3\tblahblah
word4
word5\tblahblah
...

My problem is that I want to ignore the lines that have no second field (the 'blahblah' part), like word4.

I'm currently using the following query to load the file:

LOAD DATA LOCAL INFILE 'file' 
IGNORE INTO TABLE tablename 
COLUMNS TERMINATED BY '\t' 
LINES TERMINATED BY '\n'
(col1, col2);

This has the functionality I want, except that it still loads the rows with empty second columns. Is there a way to skip the word4-type lines and not add them to the DB?

Intuitively, I thought a WHEN or WHERE clause would be perfect, but looking through the documentation and other info online, I can't find any examples where this would work. Am I forced to filter these records out beforehand, or can I do it within this query?

Any help is greatly appreciated!

Thomp

1 Answer


I would do this by filtering the file with grep or awk and then piping it into MySQL (via /dev/stdin). Something like this:

awk '/\t.+/' your_file.txt |
  mysql -u your_username -pyour_password \
    -e "LOAD DATA LOCAL INFILE '/dev/stdin' \
        IGNORE INTO TABLE tablename         \
        COLUMNS TERMINATED BY '\t'          \
        LINES TERMINATED BY '\n'            \
        (col1, col2);" \
    your_database_name
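One caveat: depending on how your server and client are configured, LOCAL loading may be disabled by default. If the client refuses the LOAD DATA LOCAL, enabling it explicitly with the mysql client's --local-infile option should help:

# Enable LOCAL INFILE support on the client if it's off by default
mysql --local-infile=1 -u your_username -pyour_password ...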

The regular expression given to awk just matches any line that has a tab character followed by one or more characters. You might want to tweak it to fit your needs.
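If you'd rather test fields than a raw regex, an equivalent awk filter (just a sketch, assuming the same tab-separated input) would be:

# Keep only lines that have a non-empty second tab-separated field
awk -F'\t' 'NF >= 2 && $2 != ""' your_file.txt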

Edit: One other possibility occurred to me. You could use SET to assign a magic value to columns that are blank and put a BEFORE INSERT trigger on the table that bails on a row when it sees that value. I don't have much experience with triggers, but I think something like this should work:

DELIMITER $$
CREATE TRIGGER skip_magic_rows
  BEFORE INSERT ON tablename
  FOR EACH ROW
BEGIN
  IF NEW.col2 = 'IDSPISPOPD4815162342' THEN  # Some unlikely magic string
    # Trigger an error, which will cause the INSERT of that row to fail†
    # (pick ONE of the two statements below)

    # If you have MySQL < 5.5 this is kludgy -- see Note 1
    DROP TABLE `Skipped row`;

    # OR

    # In MySQL >= 5.5 you can send a signal -- '45000' is a generic error
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Skipped row';  # See Note 2
  END IF;
END$$
DELIMITER ;

†: Per the docs:

An error during either a BEFORE or AFTER trigger results in failure of the entire statement that caused trigger invocation.

Then, since the LOAD DATA statement uses IGNORE, the error raised by the trigger should be downgraded to a warning for that row only, and loading continues:

LOAD DATA LOCAL INFILE 'file'
  IGNORE INTO TABLE tablename
  COLUMNS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
  (col1, @var2)
  SET col2 = IF(@var2 IS NULL OR @var2 = '', 'IDSPISPOPD4815162342', @var2)
;
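If you go the trigger route, it may be worth a sanity check afterward; the names below are just the ones from the example above:

SHOW WARNINGS;  # errors downgraded to warnings by IGNORE should show up here

# Should return 0 if the trigger rejected every magic-value row
SELECT COUNT(*) FROM tablename WHERE col2 = 'IDSPISPOPD4815162342';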

Hope it's helpful!

Note 1: Relevant blog post and comments
Note 2: Relevant SO thread

Jordan Running
  • Thanks Jordan, this is similar to the solution that I ended up using. That being said, this is also what I was trying to avoid (filtering the data before feeding it to the LOAD DATA INFILE command), I wanted to incorporate this filter into the loader somehow. But if I don't get any other hits on this in the next few days, I'll mark this as correct, since as far as I can tell, it's the best way to handle this. – Thomp Sep 29 '11 at 21:19
  • @Thomp: Added an alternate solution that might work out for you, take a look above. – Jordan Running Sep 29 '11 at 23:47
  • The only thing I worry about with this second solution is time. Does it slow down the loading to perform these extra operations? I suppose the best way to know is to try out both implementations and time them, but I've already moved ahead in my development. Perhaps when I get some free time. At any rate, thanks for the solutions. A variation of the first one ended up working well for me. – Thomp Oct 06 '11 at 20:21
  • Glad you were able to move forward, Thomp. I'd be interested to hear about the performance implications as well. – Jordan Running Oct 06 '11 at 20:27