
I have datasets in HDFS stored in Parquet format, with Snappy as the compression codec. As far as my research goes, Redshift currently accepts only plain text, JSON, and Avro formats, with gzip or LZO compression codecs.

As a workaround, I am converting the Parquet files to plain text and changing the codec from Snappy to gzip using a Pig script.

Is there currently a way to load data directly from parquet files to Redshift?

Amelia N Chu
cloudninja
  • Is there a question that you wanted to ask in the post? – rahulbmv Mar 10 '16 at 08:32
  • Sorry, yes. I am looking for a way to load Parquet files into Redshift without converting them first. – cloudninja Mar 10 '16 at 14:39
  • You can use Scala and Spark to do this programmatically. [See this question](http://stackoverflow.com/questions/36635241/can-you-copy-straight-from-parquet-s3-to-redshift-using-spark-sql-hive-presto). – ratchet Dec 26 '16 at 03:35

1 Answer


No, there is currently no way to load Parquet format data directly into Redshift.

EDIT: Since April 19, 2017, you can use Redshift Spectrum to query Parquet data on S3 directly. You can therefore now "load" from Parquet with `INSERT INTO x SELECT * FROM parquet_data`. See http://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html
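
A rough sketch of the Spectrum route, written as a small Python helper that only builds the SQL text so you can review it before running it with whatever client you use. The schema name `spectrum`, the catalog database `spectrumdb`, the column list, the bucket path, and the IAM role ARN are all placeholders I made up for illustration; only `x` and `parquet_data` come from the statement above.

```python
def spectrum_load_statements(target_table, external_schema, external_table,
                             columns, s3_location, iam_role, catalog_database):
    """Build the SQL for loading Parquet data through Redshift Spectrum.

    All identifiers and paths are supplied by the caller; nothing is executed here.
    """
    # Column definitions for the external table, e.g. "id BIGINT, name VARCHAR(256)".
    column_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)

    # 1. External schema backed by the data catalog.
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {external_schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_database}' "
        f"IAM_ROLE '{iam_role}' "
        "CREATE EXTERNAL DATABASE IF NOT EXISTS"
    )
    # 2. External table pointing at the Parquet files on S3.
    create_table = (
        f"CREATE EXTERNAL TABLE {external_schema}.{external_table} ({column_defs}) "
        f"STORED AS PARQUET LOCATION '{s3_location}'"
    )
    # 3. Copy the rows into a regular Redshift table.
    insert = (
        f"INSERT INTO {target_table} "
        f"SELECT * FROM {external_schema}.{external_table}"
    )
    return [create_schema, create_table, insert]


statements = spectrum_load_statements(
    target_table="x",
    external_schema="spectrum",              # hypothetical schema name
    external_table="parquet_data",
    columns=[("id", "BIGINT"), ("name", "VARCHAR(256)")],  # hypothetical columns
    s3_location="s3://my-bucket/parquet_data/",            # hypothetical path
    iam_role="arn:aws:iam::123456789012:role/MySpectrumRole",  # hypothetical role
    catalog_database="spectrumdb",           # hypothetical catalog database
)
for statement in statements:
    print(statement + ";")
```

The external table's column names and types have to match what is actually in your Parquet files, so treat the column list as something to adapt rather than copy.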

EDIT 2: Since May 17, 2018 (for clusters on version 1.0.2294 or later), you can load Parquet and ORC files directly into Redshift with `COPY`. See https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-columnar.html
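
For the direct load, the statement is a plain `COPY` with `FORMAT AS PARQUET`. Here is a minimal sketch that just assembles the statement as a string; the table name, S3 prefix, and IAM role ARN are placeholders, not values from the question.

```python
def copy_parquet_statement(table, s3_prefix, iam_role):
    """Build a COPY statement that loads Parquet files from S3 into a Redshift table.

    Nothing is executed here; the caller runs the returned SQL with their own client.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET"
    )


copy_sql = copy_parquet_statement(
    table="x",                                   # hypothetical target table
    s3_prefix="s3://my-bucket/parquet_data/",    # hypothetical S3 prefix
    iam_role="arn:aws:iam::123456789012:role/MyRedshiftRole",  # hypothetical role
)
print(copy_sql + ";")
```

The target table has to exist already with columns that line up with the Parquet files; the usage notes linked above cover the restrictions for columnar formats.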

Joe Harris
  • 13,671
  • 4
  • 47
  • 54
  • Can we unload data files from Redshift to S3 in Parquet format? – Teja Feb 07 '18 at 21:12
  • Not at the moment. Use a Glue "crawler" to convert them for you. Spectrum performance is still very good with CSV though. Use `MAXFILESIZE 128MB` in your `UNLOAD`. – Joe Harris Feb 07 '18 at 23:59
  • How do I convert the CSV files that are already sitting on S3 into Parquet format? Is there a way to do it? – Teja Feb 08 '18 at 18:51
  • Have a look at the Glue FAQ here: https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md or the example in the docs here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html – Joe Harris Feb 09 '18 at 20:59
  • The [Redshift documentation](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-load-listing-from-parquet) shows an example of `copy`ing Parquet files; however, I too recall that Redshift does not support Parquet format data. – Lim May 15 '18 at 14:31
  • That's a new feature that was released on the day of your comment. :) – Joe Harris May 21 '18 at 19:06
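
For the `UNLOAD` tip in the comments above, a sketch of what the statement could look like, again built as a string in Python. The query, S3 prefix, and IAM role ARN are placeholders, and the helper assumes the query contains no single quotes, since it does no escaping.

```python
def unload_statement(query, s3_prefix, iam_role, max_file_size_mb=128):
    """Build an UNLOAD statement that caps the size of each output file.

    Assumes `query` contains no single quotes (no escaping is done here).
    """
    if "'" in query:
        raise ValueError("this sketch does not escape single quotes in the query")
    return (
        f"UNLOAD ('{query}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"MAXFILESIZE {max_file_size_mb} MB"
    )


unload_sql = unload_statement(
    query="SELECT * FROM x",                     # hypothetical query
    s3_prefix="s3://my-bucket/unload/x_",        # hypothetical output prefix
    iam_role="arn:aws:iam::123456789012:role/MyRedshiftRole",  # hypothetical role
)
print(unload_sql + ";")
```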