
Today I'm seeking your help with an issue I've been having for the last couple of days with bzip2 compression. We need to compress our output text files into bzip2 format.

The problem is that we only go from 5 GB uncompressed to 3.2 GB compressed with bzip2. Seeing other projects compress their 5 GB files down to only 400 MB makes me wonder if I'm doing something wrong.

Here is my code:

iDf
  .repartition(iNbPartition)
  .write
  .option("compression","bzip2")
  .mode(SaveMode.Overwrite)
  .text(iOutputPath)
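As a side note, the `compression` option on the text writer also appears to accept a fully qualified Hadoop codec class name instead of the short name `"bzip2"`. A sketch, assuming the same `iDf`, `iNbPartition`, and `iOutputPath` as above:

```scala
import org.apache.hadoop.io.compress.BZip2Codec

// Equivalent to .option("compression", "bzip2"), but passes the codec
// class name explicitly -- this also gives the BZip2Codec import a purpose
// (by itself, the import has no effect on the writer).
iDf
  .repartition(iNbPartition)
  .write
  .option("compression", classOf[BZip2Codec].getName)
  .mode(SaveMode.Overwrite)
  .text(iOutputPath)
```

Either way, the output part files should carry a `.bz2` suffix, which is a quick sanity check that the codec was actually applied.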

I am also importing this codec:

import org.apache.hadoop.io.compress.BZip2Codec

Besides that, I'm not setting any configs in my spark-submit, because I've tried many with no luck.

Would really appreciate your help with this.

  • Have you tried to compress the very same data with another bzip2 tool? Only if another implementation gives a better ratio can you conclude there is an issue with the current one. Compression depends on too many things to conclude anything without a comparison. – cchantep May 04 '22 at 16:35
  • Thanks for your answer. Can you please tell me what the other bzip2 tools are? I'm trying to compress the same data that the other team is compressing (5 GB of logs); they get 400 MB, I get 3.2 GB. The only difference is that I read the uncompressed data from Hive, and they read it from a JSON file. – KhribiHamza May 04 '22 at 20:22
  • Use `xz`, `zpaq`, or `paq8` – TTho Einthausend Aug 31 '22 at 17:15

1 Answer


Thanks for your help, everyone. The solution was in the bzip2 algorithm itself: because my data is anonymized in a random way, it has very high entropy, so the algorithm is no longer effective on it.
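The poor ratio follows directly from this: randomized data has high entropy, and no general-purpose compressor can shrink it much, while repetitive text (like raw logs) compresses dramatically. A small sketch using the JDK's gzip as a stand-in for bzip2 (which is not in the standard library; both behave the same way on this kind of input):

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream
import scala.util.Random

// Compress a byte array with gzip and return the compressed size in bytes.
def gzipSize(data: Array[Byte]): Int = {
  val bos = new ByteArrayOutputStream()
  val gz  = new GZIPOutputStream(bos)
  gz.write(data)
  gz.close()
  bos.size()
}

val n = 1 << 20 // 1 MiB of input in both cases

// Repetitive text (like raw, un-anonymized logs): compresses extremely well.
val repetitive = Array.fill[Byte](n)('a'.toByte)

// Uniformly random bytes (like randomly anonymized fields): barely compresses.
val rng    = new Random(42)
val random = Array.fill[Byte](n)(rng.nextInt(256).toByte)

println(s"repetitive: ${gzipSize(repetitive)} bytes") // a few KB at most
println(s"random:     ${gzipSize(random)} bytes")     // at least as large as the input
```

The same 1 MiB shrinks to a few kilobytes when it is repetitive, but stays essentially uncompressed when it is random, which matches the 3.2 GB result in the question.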

Thank you again

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jun 17 '22 at 10:45