All text-related data sources, including CSVDataSource, use the Hadoop File API to deal with files (just as Spark Core's RDDs do).
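To see the RDD side of that remark, here is a minimal sketch for spark-shell (the output path is made up): saveAsTextFile accepts a Hadoop compression codec class directly.

import org.apache.hadoop.io.compress.GzipCodec

// Writes gzip-compressed part files under the given directory
sc.parallelize(Seq("a", "b", "c"))
  .saveAsTextFile("/tmp/rdd-gzip", classOf[GzipCodec])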
Back in Spark SQL, you can find the relevant lines in readFile, which leads to HadoopFileLinesReader with the following lines:
val fileSplit = new FileSplit(
  new Path(new URI(file.filePath)),
  file.start,
  file.length,
  // TODO: Implement Locality
  Array.empty)
That uses Hadoop's org.apache.hadoop.fs.Path, and it is this Hadoop layer that takes care of compression of the underlying file(s).
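That is also why reading compressed files requires no extra configuration: the codec is picked based on the file extension. A quick sketch (the file path is made up):

// spark is the SparkSession available in spark-shell
// gzip is detected from the .gz extension automatically
val people = spark.read
  .option("header", "true")
  .csv("/tmp/people.csv.gz")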
After some quick googling, I was able to find the Hadoop property that deals with compression: mapreduce.output.fileoutputformat.compress.
That led me to Spark SQL's CompressionCodecs with the following compression configuration:
"none" -> null,
"uncompressed" -> null,
"bzip2" -> classOf[BZip2Codec].getName,
"deflate" -> classOf[DeflateCodec].getName,
"gzip" -> classOf[GzipCodec].getName,
"lz4" -> classOf[Lz4Codec].getName,
"snappy" -> classOf[SnappyCodec].getName)
Just below in the code, you can find setCodecConfiguration, which applies "our" option to the Hadoop Configuration.
def setCodecConfiguration(conf: Configuration, codec: String): Unit = {
  if (codec != null) {
    conf.set("mapreduce.output.fileoutputformat.compress", "true")
    conf.set("mapreduce.output.fileoutputformat.compress.type", CompressionType.BLOCK.toString)
    conf.set("mapreduce.output.fileoutputformat.compress.codec", codec)
    conf.set("mapreduce.map.output.compress", "true")
    conf.set("mapreduce.map.output.compress.codec", codec)
  } else {
    // This infers the option `compression` is set to `uncompressed` or `none`.
    conf.set("mapreduce.output.fileoutputformat.compress", "false")
    conf.set("mapreduce.map.output.compress", "false")
  }
}
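Note the else branch: it does not simply leave the properties alone but explicitly sets them to false, so compression set to none or uncompressed also overrides any compression defaults coming from the cluster's Hadoop configuration.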
The other method, getCodecClassName, is used to resolve the compression option for the JSON, CSV, and text formats.
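The per-format options classes resolve the option along these lines (a sketch based on JSONOptions in the Spark sources; parameters holds the options passed to the reader or writer):

val compressionCodec = parameters.get("compression")
  .map(CompressionCodecs.getCodecClassName)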