
Using PySpark I'm trying to save an Avro file with compression (preferably snappy).

This line of code successfully saves a 264MB file:

df.write.mode('overwrite').format('com.databricks.spark.avro').save('s3n://%s:%s@%s/%s' % (access_key, secret_key, aws_bucket_name, output_file))

When I add the codec option .option('codec', 'snappy') the code successfully runs but the file size is still 264MB:

df.write.mode('overwrite').option('codec', 'snappy').format('com.databricks.spark.avro').save('s3n://%s:%s@%s/%s' % (access_key, secret_key, aws_bucket_name, output_file))

I've also tried 'SNAPPY' and 'Snappy'; the job runs successfully either way, but the file size is unchanged.

I've read the documentation, but it focuses on Java and Scala. Is this not supported in PySpark, is Snappy the default and simply undocumented, or am I not using the correct syntax? There's also a related question (I assume), but it focuses on Hive and has no answers.

TIA

Frank B.
  • The documentation is also there https://docs.databricks.com/spark/latest/data-sources/read-avro.html (exactly the same but with a nicer display) and it's quite explicit: **AVRO does not support the `.option()` syntax at DataFrame level, you must set a global Spark property.** – Samson Scharfrichter Feb 28 '17 at 17:21
  • 1
    Also, the source code here https://github.com/databricks/spark-avro/blob/branch-3.2/src/main/scala/com/databricks/spark/avro/DefaultSource.scala shows that AVRO supports only "snappy" and "deflate" codecs. And **the default is "snappy"** so why do you care??? – Samson Scharfrichter Feb 28 '17 at 17:22
  • If you really can't guess how to change the Spark conf in Python, then look at that example: http://www.programcreek.com/python/example/83823/pyspark.SparkConf *(note that the `SparkConf` object appears to be a clone, you cannot just update it, you have to re-create the `SparkContext` to enforce the new properties)* – Samson Scharfrichter Feb 28 '17 at 17:27
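The global-property approach suggested in the comments above can be sketched in PySpark roughly as follows. This is an untested sketch: the conf keys are the ones documented for the Databricks spark-avro 3.x package, and `sqlContext` is assumed to already exist in the session.

```python
# Sketch: set the Avro codec as a global Spark SQL property, since
# spark-avro 3.x ignores .option('codec', ...) on the DataFrameWriter.
# Conf keys per the Databricks spark-avro documentation linked above.
sqlContext.setConf("spark.sql.avro.compression.codec", "deflate")
sqlContext.setConf("spark.sql.avro.deflate.level", "5")  # deflate only: 1-9

df.write.mode('overwrite') \
    .format('com.databricks.spark.avro') \
    .save('s3n://%s:%s@%s/%s' % (access_key, secret_key, aws_bucket_name, output_file))
```

Equivalently, the same properties can be passed at submit time with `--conf spark.sql.avro.compression.codec=deflate`, avoiding any need to re-create the context in code.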

1 Answer


I believe Spark's Avro output is compressed with Snappy by default, which is why adding the option changes nothing. If you compare the size against output written with compression explicitly disabled, you should see the difference.
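One way to check this claim is to write the same DataFrame twice with different codecs and compare the on-disk sizes. A hedged sketch (assumes a live `sqlContext` and the spark-avro package; `"uncompressed"` is accepted by spark-avro 3.x, though a given version may only take `"snappy"`/`"deflate"`; the `/tmp` paths are placeholders):

```python
# Sketch (needs a Spark cluster): write once uncompressed, once with
# snappy, then compare the sizes of the two output directories.
sqlContext.setConf("spark.sql.avro.compression.codec", "uncompressed")
df.write.mode('overwrite').format('com.databricks.spark.avro').save('/tmp/avro_uncompressed')

sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")
df.write.mode('overwrite').format('com.databricks.spark.avro').save('/tmp/avro_snappy')

# If snappy is indeed the default, the snappy output should match the
# original 264MB file, and the uncompressed one should be noticeably larger.
```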