
Apache Spark's DataFrameReader.json() can handle gzipped JSON Lines files automatically, but there doesn't seem to be a way to get DataFrameWriter.json() to write compressed JSON Lines files. The extra network I/O is very expensive in the cloud.

Is there a way around this problem?
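
For context, a minimal Scala sketch of the read side that already works out of the box (the path is illustrative and assumes a SQLContext named sqlContext):

// Reading gzipped JSON Lines needs no extra options; the codec is
// inferred from the .gz extension (the path is hypothetical).
val events = sqlContext.read.json("/data/events.json.gz")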

Sim

3 Answers


With Spark 2.x (and maybe earlier; I did not test) there is a simpler way to write compressed JSON which does not require changing the configuration:

val df: DataFrame = ...
df.write.option("compression", "gzip").json("/foo/bar")

This also works for CSV and for Parquet: just use .csv() or .parquet() instead of .json() to write the file after setting the compression option.
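
For instance, a minimal sketch of the same pattern for the other formats (the paths are illustrative, and codec support can vary by format):

// Same option, different writers (paths are hypothetical); Parquet typically
// uses snappy or gzip rather than the text-format codecs.
df.write.option("compression", "gzip").csv("/foo/bar_csv")
df.write.option("compression", "snappy").parquet("/foo/bar_parquet")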

The possible codecs are: none, bzip2, deflate, gzip, lz4 and snappy.

nsantos

Setting compression options on a SparkConf, as the accepted answer does, is NOT good practice. It changes the behaviour globally instead of specifying the setting on a per-file basis. The truth is, explicit is always better than implicit. There are also cases where users cannot easily manipulate the context configuration, such as in spark-shell or in code designed as a submodule of another project.

The correct way

Writing a DataFrame with compression has been supported since Spark 1.4. There are several ways to achieve it:

One

df.write.json("filename.json", compression="gzip")

That's it! Just use DataFrameWriter.json() as you wish.

The magic is hidden in pyspark/sql/readwriter.py:

@since(1.4)
def json(self, path, mode=None, compression=None, dateFormat=None, timestampFormat=None):
    """Saves the content of the :class:`DataFrame` in JSON format
    (`JSON Lines text format or newline-delimited JSON <http://jsonlines.org/>`_) at the
    specified path.

    :param path: the path in any Hadoop supported file system
    :param mode: ...

    :param compression: compression codec to use when saving to file. This can be one of the
                        known case-insensitive shorten names (none, bzip2, gzip, lz4,
                        snappy and deflate).
    :param dateFormat: ...
    :param timestampFormat: ...

    >>> df.write.json(os.path.join(tempfile.mkdtemp(), 'data'))
    """
    self.mode(mode)
    self._set_opts(
        compression=compression, dateFormat=dateFormat, timestampFormat=timestampFormat)
    self._jwrite.json(path)

Supported compression formats are bzip2, gzip, lz4, snappy and deflate, case insensitive.

The Scala API should be the same, though there the codec is passed via option()/options() rather than as a named argument.

Another

df.write.options(compression="gzip").json("filename.json")

Similar to the above; more options can be supplied as keyword arguments. Available since Spark 1.4.
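
In Scala, a sketch of the equivalent call (options() takes a Map of string key/value pairs; the dateFormat value is just an illustration):

// options() accepts a Map of string key/value pairs in Scala.
df.write
  .options(Map("compression" -> "gzip", "dateFormat" -> "yyyy-MM-dd"))
  .json("filename.json")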

Third

df.write.option("compression", "gzip").json("filename.json")

DataFrameWriter.option() has been available since Spark 1.5. Only one key/value pair can be set per call, so chain several calls to set multiple options.

ttimasdf

The following solutions use pyspark, but I assume the code in Scala would be similar.

The first option is to set the following when you initialise your SparkConf:

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

With the code above, any file you produce using that SparkContext is automatically compressed using gzip.
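
Since the Scala code should be similar, here is a sketch of the equivalent setup there (same Hadoop property keys; the variable names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Same Hadoop output-compression keys as the PySpark snippet above; every
// file written through this SparkContext will then come out gzip-compressed.
val conf = new SparkConf()
  .set("spark.hadoop.mapred.output.compress", "true")
  .set("spark.hadoop.mapred.output.compression.codec",
       "org.apache.hadoop.io.compress.GzipCodec")
  .set("spark.hadoop.mapred.output.compression.type", "BLOCK")
val sc = new SparkContext(conf)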

The second option is for when you want to compress only selected files within your context. Let's say df is your DataFrame and filename is your destination:

df_rdd = df.toJSON()
df_rdd.saveAsTextFile(filename, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
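
A sketch of the Scala counterpart; as Sim's comment below notes, the RDD API there takes the codec as a class rather than a string (this assumes Spark 2.x, where toJSON returns a Dataset[String], hence the .rdd):

import org.apache.hadoop.io.compress.GzipCodec

// toJSON yields one JSON document per record; saveAsTextFile takes the codec class directly.
df.toJSON.rdd.saveAsTextFile(filename, classOf[GzipCodec])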
giorgioca
  • The Scala RDD API is `def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec])`, so the codec class should be passed directly and not as a string. – Sim Sep 22 '15 at 15:56
  • Wondering whether it is possible to avoid the Hadoop-ish format when storing data to a file. I can't use the directory with the `_SUCCESS` and `part-*` files; I just need a single, specifically named file... – lisak May 19 '16 at 20:45
  • Sorry about the resurrection, but I find it hard to believe that `conf.set("spark.hadoop.mapred.output.compression.codec", "true")` is necessary – oarsome Dec 14 '16 at 23:04
  • A DataFrame is not an RDD. Changing the compression settings globally, making them implicit, is NOT good practice either. – ttimasdf Dec 19 '18 at 05:45