
To write a Parquet file compressed with the LZO codec, I wrote the following code -

df.coalesce(1).write.option("compression","lzo").option("header","true").parquet("PARQUET.parquet")

But I am getting this error -

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.lzo.LzoCodec

According to the Spark documentation, Brotli requires BrotliCodec to be installed, but no installation steps are given. The same error occurs when compressing with the Brotli codec.

How can I install/add the required codecs to run this on PySpark?


EDIT - LZO compression works with ORC but not with Parquet

2 Answers


To write with LZO, you need the steps below:

  • sudo apt-get install -y lzop
  • Add the jar to pyspark's jars directory (change the path according to your PySpark environment): wget https://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.20/hadoop-lzo-0.4.20.jar -P /usr/local/lib/python3.7/dist-packages/pyspark/jars/
  • Set this config option in the SparkSession: ("spark.sql.parquet.compression.codec", "lzo")

Now you should be able to write Parquet with LZO compression, as in the sketch below.
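
Putting the steps together, here is a minimal sketch. The app name and sample data are placeholders, and it assumes the hadoop-lzo jar from step 2 is already in your pyspark jars directory (jars placed there are on the classpath automatically):

from pyspark.sql import SparkSession

# Only the codec setting from step 3 is needed here; the hadoop-lzo jar
# copied into pyspark's jars/ directory is already on the classpath.
spark = (
    SparkSession.builder
    .appName("lzo-parquet-demo")  # hypothetical app name
    .config("spark.sql.parquet.compression.codec", "lzo")
    .getOrCreate()
)

# Hypothetical sample data standing in for the asker's df.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.coalesce(1).write.parquet("PARQUET.parquet")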


Copy the jar files to <python environment name>/lib/python3.9/site-packages/pyspark/jars.
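
If you are not sure where that directory is, here is a quick way to locate it (a small sketch; it assumes pyspark is importable in the same environment):

import os
import pyspark

# The jars directory sits inside the installed pyspark package.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(jars_dir)  # e.g. <python environment name>/lib/python3.9/site-packages/pyspark/jars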