5

I have tried the API spark.read.csv to read compressed CSV files with the bz2 or gzip extension, and it worked. But in the source code I can't find any option parameter where we can declare the codec type.

Even in this link, there is only a setting for the codec on the writing side. Could anyone tell me, or point me to the source code showing, how Spark 2.x deals with compressed CSV files?
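
For reference, this is roughly the call I mean (the file name is just an example); it worked without any codec-related option:

val df = spark.read.csv("data.csv.gz")  // gzip is inferred from the file extension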

    Note that Spark will read the compressed CSV with a single task as opposed to parallelising the read across multiple tasks when reading an uncompressed CSV. – Chris May 27 '21 at 11:09
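
A quick way to see the effect Chris describes (file names are hypothetical): a gzipped CSV is not splittable, so it lands in a single partition, while a plain CSV can be split.

spark.read.csv("file.csv.gz").rdd.getNumPartitions  // typically 1
spark.read.csv("file.csv").rdd.getNumPartitions     // can be > 1 for a large file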

2 Answers

4

All text-related data sources, including CSVDataSource, use the Hadoop File API to deal with files (as was the case with Spark Core's RDDs too).

You can find the relevant lines in readFile, which leads to HadoopFileLinesReader and the following lines:

val fileSplit = new FileSplit(
  new Path(new URI(file.filePath)),
  file.start,
  file.length,
  // TODO: Implement Locality
  Array.empty)

That uses Hadoop's org.apache.hadoop.fs.Path, which deals with compression of the underlying file(s).
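
If you want to see that extension-based lookup in isolation, here is a minimal sketch, using the plain Hadoop API rather than Spark source, of how CompressionCodecFactory resolves a codec from a file name; the file names are made up:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

object CodecLookup extends App {
  val factory = new CompressionCodecFactory(new Configuration())
  // getCodec returns null when no registered codec matches the extension
  Seq("data.csv.gz", "data.csv.bz2", "data.csv").foreach { name =>
    val codec = Option(factory.getCodec(new Path(name)))
    println(s"$name -> ${codec.map(_.getClass.getName).getOrElse("no codec (plain text)")}")
  }
}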


After some quick googling, I found the Hadoop property that deals with compression: mapreduce.output.fileoutputformat.compress.

That led me to Spark SQL's CompressionCodecs with the following compression configuration:

"none" -> null,
"uncompressed" -> null,
"bzip2" -> classOf[BZip2Codec].getName,
"deflate" -> classOf[DeflateCodec].getName,
"gzip" -> classOf[GzipCodec].getName,
"lz4" -> classOf[Lz4Codec].getName,
"snappy" -> classOf[SnappyCodec].getName)

Further down in the code, you can find setCodecConfiguration, which uses "our" option:

  def setCodecConfiguration(conf: Configuration, codec: String): Unit = {
    if (codec != null) {
      conf.set("mapreduce.output.fileoutputformat.compress", "true")
      conf.set("mapreduce.output.fileoutputformat.compress.type", CompressionType.BLOCK.toString)
      conf.set("mapreduce.output.fileoutputformat.compress.codec", codec)
      conf.set("mapreduce.map.output.compress", "true")
      conf.set("mapreduce.map.output.compress.codec", codec)
    } else {
      // This infers the option `compression` is set to `uncompressed` or `none`.
      conf.set("mapreduce.output.fileoutputformat.compress", "false")
      conf.set("mapreduce.map.output.compress", "false")
    }
  }

The other method, getCodecClassName, is used to resolve the compression option for the JSON, CSV, and text formats.
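
For completeness, this is how that option surfaces in user code on the write side; the df DataFrame and output path are hypothetical:

df.write
  .option("compression", "gzip")  // any key from the map above: none, bzip2, deflate, ...
  .csv("/tmp/out")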

  • Thank you, man. I checked the `Path` package file and am still a little confused. It would be very kind if you could give a bit more detail, like which part of the `Path` package deals with compression. Thanks again. – G_cy Jun 29 '17 at 03:15
  • Added a few additional links to Spark SQL's code where it deals with compression. Since I know nothing about Hadoop's source code, I'll leave exploring it as a homework exercise for you. – Jacek Laskowski Jun 29 '17 at 03:29
  • 2
    Thank you so much for your patience and kindness. I read through the `getCodecClassName` code along the call chain and found that this part of the code is called only on the writing side; I didn't find any usage on the reading side. I thought this work might be done by the filesystem, but couldn't find evidence of it. – G_cy Jun 29 '17 at 06:28
  • 1
    All the pieces you reference deal with writing files, which is why all the mentioned options have 'output' in their name. The question is about reading files. – mvherweg Nov 15 '17 at 09:23
  • 2
    Interesting information, but like some commenters noted it's only about the write side, not the read side. [This answer](https://stackoverflow.com/a/44374539/877069) doesn't show the relevant internals for how Spark picks a codec on read, but it does at least demonstrate how to specify a custom read codec. – Nick Chammas Sep 19 '19 at 23:47
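
For the read side, the idea in that linked answer amounts to registering a codec class through Hadoop's io.compression.codecs property so that reads resolve it by file extension; a hedged sketch, with a hypothetical codec class:

spark.sparkContext.hadoopConfiguration.set(
  "io.compression.codecs",
  "com.example.MyCustomCodec")  // hypothetical codec, matched by its declared extension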
3

You don't have to do anything special for a gz-compressed CSV or TSV file to be read by Spark 2.x. The code below was tried with Spark 2.0.2:

val options = Map("sep" -> ",")
val csvRDD = spark.read.options(options).csv("file.csv.gz")

I have done the same for tab-separated gz files:

val options = Map("sep" -> "\t")
val csvRDD = spark.read.options(options).csv("file.tsv.gz")

You can also specify a folder to read multiple .gz files mixed with uncompressed files:

val csvRDD = spark.read.options(options).csv("/users/mithun/tsvfilelocation/")