I am using the following code to save a Spark DataFrame to a JSON file:

unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")

The output is:

part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
_SUCCESS
._SUCCESS.crc
  1. How do I generate a single JSON file and not a file per partition?
  2. How can I avoid the .crc files?
  3. How can I avoid the _SUCCESS file?
– Eran Witkon

3 Answers

If you want a single file, you need to coalesce to a single partition before calling write, so:

unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")

Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write - especially if you do a write with a partitionBy - but as far as I know, there is currently no other way.
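
If what you actually need is a single file with a specific name (rather than a directory containing one part file), a common workaround is to coalesce, write to a temporary directory, and rename the lone part file via the Hadoop FileSystem API. A minimal sketch - the temporary path here is made up for the example:

import org.apache.hadoop.fs.{FileSystem, Path}

// Write everything into a single partition first (temporary directory).
unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/tmpJson")

// Locate the single part file and rename it to the desired target name.
val fs = FileSystem.get(sc.hadoopConfiguration)
val partFile = fs.globStatus(new Path("/home/eranw/Workspace/JSON/output/tmpJson/part-*"))(0).getPath
fs.rename(partFile, new Path("/home/eranw/Workspace/JSON/output/unCompressedJson.json"))
fs.delete(new Path("/home/eranw/Workspace/JSON/output/tmpJson"), true)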

I don't know of a way to disable the .crc files, but you can disable the _SUCCESS file by setting the following on the Hadoop configuration of the Spark context:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Note that you may also want to disable generation of the metadata files with:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Apparently, generating the metadata files takes some time (see this blog post), but they aren't actually that important (according to this). Personally, I always disable them and have had no issues.
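
Putting the two settings together before the write, the _SUCCESS marker (and, for Parquet output, the _metadata files) should no longer show up - a sketch using the names from the question:

// Disable the _SUCCESS marker and the Parquet summary metadata on the Spark context.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// The same write now leaves out _SUCCESS; the .crc files are addressed in the next answer.
unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")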

– Glennie Helles Sindholt
  • The question is: why do we need the `CRC` and `_SUCCESS` files at all? Spark (worker) nodes write data simultaneously, and these files act as checksums for validation. Writing to a single file takes away the idea of distributed computing, and this approach may fail if your resulting file is too large. – CᴴᴀZ Dec 21 '16 at 14:48
  • `_metadata` files are no longer created since [this Spark JIRA ticket](https://issues.apache.org/jira/browse/SPARK-15719) was merged into Spark master and spark-2.0; see the [related PR](https://github.com/apache/spark/pull/13455). – zhongjiajie May 15 '19 at 07:29

Ignore the .crc files on .write by disabling write checksums on the filesystem:

val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
fs.setWriteChecksum(false) // disable checksum (.crc) creation for files written through this FileSystem
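
After that, a subsequent write - using the DataFrame and path from the question - should produce the part files without .crc siblings:

// With checksums disabled on this FileSystem, the part files are written bare.
unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
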
– gunship

Just a little update to the answers above: to disable the _SUCCESS file, simply set the property on the Spark session as follows (example):

from pyspark.sql import SparkSession

# example value; point this at the default location for managed databases and tables
warehouse_location = "spark-warehouse"

spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

– Hafiz Muhammad Shafiq