I'm trying to load this CSV of Russian troll tweets into a MySQL database.

I'm trying to use LOAD DATA LOCAL INFILE like this:

LOAD DATA LOCAL INFILE '/path/to/csv/data.csv'
INTO TABLE mytable
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

It works for a small sample of the data, but when I try to load the full CSV, I get this error:

Error 1300 (HY000): Invalid utf8mb4 character string: 'Those who studied history know this is not even considered histo'

The line throwing the error is this one:

4036537452,4MYSQUAD,Those who studied history know this is not even considered history b\с it was pretty recent. #BlackHistoryMonth [shortened link omitted here],United States,English,2/8/2016 23:18,2/8/2016 23:20,4836,2802,1053,,left,0,0,LeftTroll

If I use CHARACTER SET latin1, it imports just fine, but I lose the emojis from the tweets as well as the tweets in Russian.

The CSV contains tweets in Russian, German, and Swedish, as well as emojis. Is there a way to get all of these into my database?

Thank you, and let me know if there is any more information I should include in this question.

georgedum
  • Why is `b\с` there with a backslash? That's causing the trouble. You need to escape your input. – marekful Aug 06 '18 at 16:14
  • Hmm, looks like the content of that particular tweet had a backslash. Maybe I could set NO_BACKSLASH_ESCAPES for the imports? Not sure if that would have some unwanted consequences. – georgedum Aug 06 '18 at 16:25
  • Welp, that just breaks the LINES TERMINATED BY '\n' part. Maybe there's a workaround for this. Thanks for putting me on the right track anyway. – georgedum Aug 06 '18 at 16:49
  • Haven't tested, but I would try escaping the \ like marekful suggested, so any \ would become \\. – Freddythunder Aug 06 '18 at 16:55

1 Answer

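I ended up doing a massive find/replace on the CSV to replace every '\' with '\\', so that LOAD DATA reads each backslash literally instead of treating it as an escape character.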

Worked like a charm. Thanks, marekful and Freddythunder for putting me on the right track.
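
For anyone who would rather script the find/replace than do it in an editor, here is a minimal Python sketch of the same idea. The function names and file paths are hypothetical, and it assumes the CSV is UTF-8 and that every backslash in it is meant literally (no intentional escape sequences), so treat it as a starting point rather than a definitive fix.

```python
def double_backslashes(text: str) -> str:
    """Replace every backslash with two, so it is read back as one literal backslash."""
    return text.replace("\\", "\\\\")


def escape_csv_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path line by line, doubling every backslash.

    Both paths are placeholders chosen for this sketch.
    """
    # newline="" leaves the original line endings untouched on read and write.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(double_backslashes(line))


if __name__ == "__main__":
    # Hypothetical paths; point LOAD DATA at the escaped copy afterwards.
    escape_csv_file("/path/to/csv/data.csv", "/path/to/csv/data_escaped.csv")
```

Writing to a separate file keeps the original CSV intact in case the import needs to be retried with different settings.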

georgedum