
I'm loading a CSV file from S3 into Redshift. The file contains analytics data, including a PageUrl field (which may contain user search info inside a query string, for example).

The load chokes on rows that contain a single double-quote character. For example, if there is a page for a 14" toy, the PageUrl would contain:

http://www.mywebsite.com/a-14"-toy/1234.html

Redshift understandably can't handle this as it is expecting a closing double quote character.
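For reference, the COPY I'm running is roughly of this shape (the table name, bucket path and IAM role below are placeholders, not my real values):

    -- Sketch of the failing load; table, path and role ARN are placeholders
    COPY analytics_pageviews
    FROM 's3://my-bucket/analytics/pageviews.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV;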

The way I see it my options are:

  1. Pre-process the input and remove these characters
  2. Configure the COPY command in Redshift to ignore these characters but still load the row
  3. Set MAXERRORS to a high value and sweep up the errors using a separate process

Option 2 would be ideal, but I can't find it!

Any other suggestions if I'm just not looking hard enough?

Thanks

Duncan


3 Answers

16

It's 2017 and I ran into the same problem. I'm happy to report there is now a way to get Redshift to load CSV files with the odd " in the data.

The trick is to use the ESCAPE keyword and NOT to use the CSV keyword. I don't know why, but having the CSV and ESCAPE keywords together in a COPY command resulted in failure with the error message "CSV is not compatible with ESCAPE;". However, with no change to the data being loaded, I was able to load successfully once I removed the CSV keyword from the COPY command.
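To give a rough idea, the working command ended up looking something like this (the table name, bucket path and IAM role are placeholders):

    -- ESCAPE without the CSV keyword; table, path and role ARN are placeholders
    COPY analytics_pageviews
    FROM 's3://my-bucket/analytics/pageviews.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER ','
    ESCAPE;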

You can also refer to this documentation for help: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-escape

ayeletd
11

Unfortunately, there is no way to fix this. You will need to pre-process the file before loading it into Amazon Redshift.

The closest options you have are CSV [ QUOTE [AS] 'quote_character' ] to wrap fields in an alternative quote character, and ESCAPE if the quote character is preceded by a backslash. Alas, both require the file to be in a particular format before loading.
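For illustration, the QUOTE AS variant would look roughly like this (the table name, path, role ARN and the '%' quote character are placeholders, and the file would already need to use that character for quoting). The ESCAPE variant is the same idea, with DELIMITER ',' ESCAPE in place of the CSV clause and backslash-escaped quotes in the data:

    -- CSV with an alternative quote character; all names here are placeholders
    COPY analytics_pageviews
    FROM 's3://my-bucket/analytics/pageviews.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV QUOTE AS '%';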

See the COPY data conversion parameters documentation: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html

John Rotenstein
    Thanks John. Yeah I've resigned myself to writing a little PIG job to precede all " with \" which should do the trick. – Duncan Aug 08 '15 at 12:13
  • Depending upon the size of your data, you could just run things through `sed`. You could even do it as a streaming job through `aws s3 cp s3://b/f - | sed xxx | aws s3 cp - s3://b/f2`. See streaming on https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html – John Rotenstein Aug 09 '15 at 05:55
  • @ayeletd 's answer below provides a solution without pre-processing – MyStackRunnethOver Oct 07 '20 at 20:15
-1

I have done this using DELIMITER ',' IGNOREHEADER 1 as the replacement for CSV at the end of the COPY command. It's working really fine.
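In other words, a COPY of roughly this shape (the table name, path and role ARN are placeholders):

    -- DELIMITER and IGNOREHEADER instead of the CSV keyword; names are placeholders
    COPY analytics_pageviews
    FROM 's3://my-bucket/analytics/pageviews.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER ','
    IGNOREHEADER 1;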