I have raw binary files compressed with the LZMA algorithm (`.xz` extension), and I want to read them natively with Spark.

I am testing this library: https://github.com/yongtang/hadoop-xz. It seems to work only for compressed text files, and I am wondering whether it can also work for compressed binary files.

In my build.gradle, I have added this dependency:

implementation 'io.sensesecure:hadoop-xz:1.4'

Then in my code, I have:

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("myApp")
    .getOrCreate()

   spark.
     sparkContext.
     hadoopConfiguration.
     set("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec")


  val binaryDataDF = spark.
    read.
    format("binaryFile").
    option("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec").
    load("binaryRawFile.xz")

When I try to process the content with binaryDataDF.select("content"), I see that the content is not decompressed.

However, when I test with a text file, for example:

  val textDataDF = spark.
    read.
    format("text").
    option("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec").
    load("test.txt.xz")

the above code works correctly, and textDataDF.select("value") returns the decompressed text.

The other workaround I have found is to read the files as PortableDataStream with spark.sparkContext.binaryFiles, then map over the returned RDD, opening each file as an XZInputStream and reading the binary content with IOUtils.toByteArray. This also works, but I am wondering whether it is possible to do it natively with Spark using hadoop-xz.
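For reference, here is a sketch of that workaround. The input path is a placeholder, and it assumes XZInputStream from the xz-java library (a transitive dependency of hadoop-xz) and IOUtils from commons-io are on the classpath:

```scala
import org.apache.commons.io.IOUtils
import org.tukaani.xz.XZInputStream

// Read each .xz file as a (path, PortableDataStream) pair,
// then decompress it manually in a map over the RDD.
val decompressedRDD = spark.sparkContext
  .binaryFiles("binaryRawFile.xz") // placeholder path
  .map { case (path, portableStream) =>
    val xzIn = new XZInputStream(portableStream.open())
    try {
      // Fully decompressed bytes of one file
      (path, IOUtils.toByteArray(xzIn))
    } finally {
      xzIn.close()
    }
  }
```

This gives an RDD[(String, Array[Byte])] with the decompressed content per file, but it bypasses the DataFrame reader entirely, which is what I was hoping to avoid.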
