Using PySpark, I'm trying to save an Avro file with compression (preferably snappy).
This line of code successfully saves a 264MB file:
df.write.mode('overwrite').format('com.databricks.spark.avro').save('s3n://%s:%s@%s/%s' % (access_key, secret_key, aws_bucket_name, output_file))
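As an aside, the bucket and key names here are placeholders. One thing worth noting about this style of path is that AWS secret keys can contain `/` characters, which break the URL's authority section unless they are percent-encoded first. A minimal sketch of the path construction with that quoting applied:

```python
from urllib.parse import quote

def s3n_path(access_key, secret_key, bucket, key):
    """Build an s3n:// URL with embedded credentials.

    Secret keys may contain '/' (and '+'), which must be
    percent-encoded or the URL parser misreads the authority part.
    """
    return 's3n://%s:%s@%s/%s' % (
        quote(access_key, safe=''), quote(secret_key, safe=''), bucket, key)

print(s3n_path('AKIAEXAMPLE', 'abc/def+ghi', 'my-bucket', 'out/data.avro'))
# s3n://AKIAEXAMPLE:abc%2Fdef%2Bghi@my-bucket/out/data.avro
```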
When I add the codec option .option('codec', 'snappy'), the code still runs successfully, but the file size is unchanged at 264MB:
df.write.mode('overwrite').option('codec', 'snappy').format('com.databricks.spark.avro').save('s3n://%s:%s@%s/%s' % (access_key, secret_key, aws_bucket_name, output_file))
I've also tried 'SNAPPY' and 'Snappy'; both run successfully but produce the same file size.
I've read the documentation, but it focuses on Java and Scala. Is this not supported in PySpark, is snappy the default and just not documented, or am I not using the correct syntax? There's also a related question (I assume), but it's focused on Hive and has no answers.
TIA