I am seeking your help with an issue I have been having for the last couple of days with bzip2 compression. We need to compress our output text files in bzip2 format.
The problem is that we only go from 5 GB uncompressed to 3.2 GB compressed with bzip2. Seeing other projects compress their 5 GB files down to only 400 MB makes me wonder whether I am doing something wrong.
Here is my code:
iDf
  .repartition(iNbPartition)
  .write
  .option("compression", "bzip2")
  .mode(SaveMode.Overwrite)
  .text(iOutputPath)
I am also importing this codec:
import org.apache.hadoop.io.compress.BZip2Codec
Besides that, I am not setting any configs in my spark-submit, because I have tried many with no luck.
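To rule out a measurement mistake, here is a minimal sketch of how the compressed output size could be checked. It assumes the output directory sits on a local filesystem (on HDFS something like `hdfs dfs -du` would be needed instead), and `OutputSizeCheck`, `partFileBytes` and `compressionRatio` are hypothetical names used for illustration, not Spark APIs.

```scala
import java.io.File

object OutputSizeCheck {
  // Hypothetical helper: total bytes of the part files in a local output
  // directory whose names end with the given suffix (e.g. ".bz2").
  def partFileBytes(dir: File, suffix: String): Long = {
    // listFiles returns null when dir is not a readable directory
    val files = Option(dir.listFiles).getOrElse(Array.empty[File])
    files
      .filter(f => f.isFile && f.getName.startsWith("part-") && f.getName.endsWith(suffix))
      .map(_.length)
      .sum
  }

  // Compressed size as a fraction of the uncompressed size
  def compressionRatio(compressedBytes: Long, uncompressedBytes: Long): Double =
    compressedBytes.toDouble / uncompressedBytes.toDouble
}
```

With the sizes from this post, `compressionRatio` gives about 0.64 for 3.2 GB out of 5 GB, against about 0.08 for 400 MB out of 5 GB. If `partFileBytes` returns 0 for the ".bz2" suffix, the part files were probably not written with the bzip2 extension at all.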
I would really appreciate your help with this.