
I am running a job on 9 nodes.

All of them write some logging information to files, using simple writes like the one below:

dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)

However, I am receiving this exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o106.save. : java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error: file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0 exp: 1179219224 got: -1020415797

It looks to me as though Spark is somehow failing because of concurrency and generating checksum errors.

Is there any known scenario that may be causing it?

Flavio Pegas
  • Does it work without the coalesce? It’s very easy to cause memory problems with coalesce, and the errors aren’t often very helpful. – Bob Swain Jul 12 '19 at 20:17
    Try repartitioning instead of coalescing. `dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)` – Rob Jul 14 '19 at 14:52
  • @BobSwain Same error, even without coalesce. @Rob That actually worked, but I didn't understand why. Could you post this as an answer with more details? – Flavio Pegas Jul 15 '19 at 18:21
  • @FlavioDiasPs have posted the answer and hope that helps you. – Rob Jul 15 '19 at 19:34

1 Answer


There are a couple of things going on here, and they should explain why coalesce may not work.

  1. What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can call coalesce(3), which would consolidate the partitions locally on each worker.

  2. What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, since you have more than one worker and you need a single output, you have to use repartition(1), because you want all of the data on a single partition before writing it out.

Why would coalesce not work? Spark limits shuffling during coalesce, so you cannot perform a full shuffle (across different workers) when you use coalesce. With repartition you can perform a full shuffle, although it is a more expensive operation.
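
To make the difference concrete, here is a minimal sketch (not part of the original answer) comparing the resulting partition counts; the SparkSession setup and the toy DataFrame are assumptions for illustration only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('coalesce-vs-repartition').getOrCreate()

# Toy DataFrame spread across 9 partitions, mirroring the 9 nodes in the question.
df = spark.range(0, 1000).repartition(9)
print(df.rdd.getNumPartitions())                # 9

# coalesce(1) only merges existing partitions; there is no full shuffle across workers.
print(df.coalesce(1).rdd.getNumPartitions())    # 1

# repartition(1) performs a full shuffle, moving all rows onto one clean partition
# before the write.
print(df.repartition(1).rdd.getNumPartitions()) # 1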

Here is the code that would work:

dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
Rob