
We are using PySpark 1.6 and are trying to convert text to other file formats (like JSON, CSV, etc.) with compression (gzip, lz4, snappy, etc.), but we are unable to get compression working.

Please find the code we tried below. Please help us pinpoint the issue in our code, or suggest a workaround. To add to the question: none of the compression codecs work in 1.6, but they work fine in Spark 2.x.

Option 1:

from pyspark import SparkContext, SparkConf
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').save('hdfs:///user/U1/parquet_json_snappy')

Option 2:

df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').option('codec','com.apache.hadoop.io.compress.SnappyCodec').save('hdfs:///user/U1/parquet_json_snappy_4')

Option 3:

df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').option('compression','snappy').save('hdfs:///user/U1/parquet_json_snappy')
  • The class name looks wrong in the second one. Can you try with Bzip2 or Gzip? Class names are listed here: [org.apache.hadoop.io.compress](https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/io/compress/package-summary.html) – philantrovert Jul 21 '17 at 14:27
  • @philantrovert: Thanks for the quick response. I have tried as suggested but no luck; it's not getting compressed. df.write.format('json').option('codec','org.apache.hadoop.io.compress.BZip2Codec').save('hdfs:///user/U1/parquet_json_bzip1') – Kalyan P Jul 21 '17 at 14:37

1 Answer


For Spark 1.6, to save compressed text/JSON output, try using the

spark.hadoop.mapred.output.compression.codec parameter

There are four parameters to be set. This has been answered already and more details are in this link.
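A minimal sketch of that approach for Spark 1.6 is below. The helper name, app name, and output path are illustrative; the codec mapping mirrors the class names in org.apache.hadoop.io.compress (note the question's `com.apache...` prefix does not exist). Because the 1.6 DataFrame JSON writer ignores compression options, the sketch drops to the RDD API, whose saveAsTextFile accepts a codec class directly:

```python
# Correct Hadoop codec class names, from org.apache.hadoop.io.compress:
HADOOP_CODECS = {
    "gzip":   "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2":  "org.apache.hadoop.io.compress.BZip2Codec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}

def write_compressed_json(input_path, output_path, codec="gzip"):
    # Imports kept inside the function so the module loads without PySpark.
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    # Set the mapred output-compression parameters via the
    # spark.hadoop.* prefix (illustrative app name):
    conf = (SparkConf().setAppName("compress-json")
            .set("spark.hadoop.mapred.output.compress", "true")
            .set("spark.hadoop.mapred.output.compression.codec",
                 HADOOP_CODECS[codec])
            .set("spark.hadoop.mapred.output.compression.type", "BLOCK"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    df = sqlContext.read.parquet(input_path)
    # In 1.6 the DataFrame writer ignores compression options for JSON,
    # so convert to an RDD of JSON strings and pass the codec explicitly:
    df.toJSON().saveAsTextFile(
        output_path, compressionCodecClass=HADOOP_CODECS[codec])
    sc.stop()
```

The snappy codec additionally requires the native Snappy libraries on the cluster nodes, so gzip or bzip2 is the easier first test.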

With Spark 2.x, the API is simpler and you can use

df.write.option("compression", "gzip")
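For completeness, a sketch of the 2.x path (the helper name and paths are illustrative); here the writer honours the "compression" option for JSON directly, with short codec names such as gzip, bzip2, or snappy:

```python
def write_json_gzip(input_path, output_path):
    # Import kept inside the function so the module loads without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compress-json").getOrCreate()
    df = spark.read.parquet(input_path)
    # Short codec name goes straight into the writer option:
    df.write.option("compression", "gzip").json(output_path)
```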