I have raw binary files compressed with LZMA algorithm (xz extension), and I want to read them natively with Spark.
I am testing this library: https://github.com/yongtang/hadoop-xz, and it seems that it works only for compressed text files, and I am wondering if it could also work for compressed binary files.
In my build.gradle, I have added this dependency:
implementation 'io.sensesecure:hadoop-xz:1.4'
Then in my code, I have:
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("myApp")
  .getOrCreate()

spark.sparkContext.hadoopConfiguration
  .set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")

val binaryDataDF = spark.read
  .format("binaryFile")
  .option("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")
  .load("binaryRawFile.xz")
When I try to process the content with binaryDataDF.select("content"), I see that the content is not decompressed.
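One way to check this (a diagnostic sketch, assuming the binaryDataDF defined above; the variable names are mine) is to look at the first bytes of the content column, since the xz container format always starts with the magic bytes FD 37 7A 58 5A 00:

// If the first row's content still begins with the xz magic bytes,
// the codec was not applied and the bytes are still compressed.
val firstBytes = binaryDataDF.select("content").head.getAs[Array[Byte]](0).take(6)
val xzMagic = Array(0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00).map(_.toByte)
println(firstBytes.sameElements(xzMagic)) // true would mean still compressed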
However, when I test with a text file, for example:
val textDataDF = spark.read
  .format("text")
  .option("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")
  .load("test.txt.xz")
the above code works correctly and textDataDF.select("value") returns the decompressed text.
The other workaround I have found is to read the files as PortableDataStream with spark.sparkContext.binaryFiles, then map over the returned RDD, opening each file as an XZInputStream and reading the decompressed bytes with IOUtils.toByteArray. This also works, but I am wondering whether it is possible to do it natively with Spark using hadoop-xz.
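For reference, the workaround looks roughly like this (a sketch, not my exact code; the input path is a placeholder, and it assumes XZInputStream from the org.tukaani xz library, which hadoop-xz depends on, plus IOUtils from commons-io):

import org.apache.commons.io.IOUtils
import org.tukaani.xz.XZInputStream

// Read each .xz file as (path, PortableDataStream), then decompress manually.
val decompressedRDD = spark.sparkContext
  .binaryFiles("path/to/*.xz")  // placeholder path
  .map { case (path, stream) =>
    val xzIn = new XZInputStream(stream.open())
    try {
      (path, IOUtils.toByteArray(xzIn))  // fully decompressed bytes per file
    } finally {
      xzIn.close()
    }
  }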