
To write a Parquet file compressed with the LZO codec, I wrote the following code -

df.coalesce(1).write.option("compression","lzo").option("header","true").parquet("PARQUET.parquet")

But I am getting this error -

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.lzo.LzoCodec

According to the Spark documentation, Brotli requires BrotliCodec to be installed, but no installation steps are given. The same error occurs when compressing with the Brotli codec.

How can I install/add the required codecs to run this on PySpark?


EDIT - LZO compression works with ORC but not with Parquet

2 Answers


To write with LZO, you need the steps below:

  • sudo apt-get install -y lzop
  • Add the jar to pyspark's jars directory (change the path according to your PySpark environment): wget https://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.20/hadoop-lzo-0.4.20.jar -P /usr/local/lib/python3.7/dist-packages/pyspark/jars/
  • Set this config option in the SparkSession: ("spark.sql.parquet.compression.codec", "lzo")

Now you should be able to write Parquet with LZO compression, as in the sketch below.
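
Putting the steps together, here is a minimal sketch. The app name and sample data are placeholders, and it assumes the hadoop-lzo jar from step 2 is already in your pyspark jars directory (jars placed there are on the classpath automatically):

from pyspark.sql import SparkSession

# Only the codec setting from step 3 is needed here; the hadoop-lzo jar
# copied into pyspark's jars/ directory is already on the classpath.
spark = (
    SparkSession.builder
    .appName("lzo-parquet-demo")  # hypothetical app name
    .config("spark.sql.parquet.compression.codec", "lzo")
    .getOrCreate()
)

# Hypothetical sample data standing in for the asker's df.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.coalesce(1).write.parquet("PARQUET.parquet")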


Copy the jar files to <python environment name>/lib/python3.9/site-packages/pyspark/jars.
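
If you are not sure where that directory is, here is a quick way to locate it (a small sketch; it assumes pyspark is importable in the same environment):

import os
import pyspark

# The jars directory sits inside the installed pyspark package.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(jars_dir)  # e.g. <python environment name>/lib/python3.9/site-packages/pyspark/jars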