
I am running on the following config. Cluster type: E64_v3 (1 driver + 3 workers). Other Spark configs:

spark.shuffle.io.connectionTimeout 1200s 
spark.databricks.io.cache.maxMetaDataCache 40g 
spark.rpc.askTimeout 1200s 
spark.databricks.delta.snapshotPartitions 576 
spark.databricks.optimizer.rangeJoin.binSize 256 
spark.sql.inMemoryColumnarStorage.batchSize 10000 
spark.sql.legacy.parquet.datetimeRebaseModeInWrite CORRECTED 
spark.executor.cores 16 
spark.executor.memory 54g 
spark.rpc.lookupTimeout 1200s 
spark.driver.maxResultSize 220g 
spark.databricks.io.cache.enabled true 
spark.rpc.io.backLog 256 
spark.sql.shuffle.partitions 576 
spark.network.timeout 1200s 
spark.sql.inMemoryColumnarStorage.compressed true 
spark.databricks.io.cache.maxDiskUsage 220g 
spark.storage.blockManagerSlaveTimeoutMs 1200s 
spark.executor.instances 12 
spark.sql.windowExec.buffer.in.memory.threshold 524288 
spark.executor.heartbeatInterval 100s 
spark.default.parallelism 576 
spark.core.connection.ack.wait.timeout 1200s
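
For reference, only the session-level settings among these can be changed at runtime; a rough sketch of how such settings would be applied from a notebook (illustrative, values copied from the list above — static settings like `spark.executor.memory` or `spark.executor.instances` only take effect from the cluster's Spark config at startup):

```python
# Rough sketch, not my actual notebook code: dynamic, session-level settings
# can be applied at runtime via spark.conf.set on the existing SparkSession.
spark.conf.set("spark.sql.shuffle.partitions", "576")
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
```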

and this is my error stack:

---> 41     df.write.format("delta").mode("overwrite").save(path) 
/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
825             self._jwrite.save()
826         else:
--> 827             self._jwrite.save(path)

Py4JJavaError: An error occurred while calling o784.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:230)
.
.
.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 13 (execute at DeltaInvariantCheckerExec.scala:88) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.179....

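For context, the write in the traceback is a plain Delta overwrite; a cut-down version of it (the `.limit()` sample and the test path below are purely illustrative, not what I actually run) looks like this:

```python
# Cut-down sketch of the failing write; `df` and `path` are the real names from
# the traceback, the limited sample and "_sample" suffix are only illustrative.
sample_path = path + "_sample"           # hypothetical test location
(df.limit(1000)                          # small slice to check whether any Delta write succeeds
   .write.format("delta")
   .mode("overwrite")
   .save(sample_path))
```
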
Any idea how to mitigate this?

user3868051
  • Your question is missing a lot of information... Are you able to save any sample DataFrame, or just this particular `df`? How big is your data? Typically with Databricks, you don't have to specify `spark.executor.cores` or `spark.executor.memory`, not to mention you have 3 workers but `spark.executor.instances` is 12? – pltc Oct 11 '21 at 03:04
  • @pltc the dataframe logical plan mentions 'sizeInBytes=1.26E+52 B' – user3868051 Oct 12 '21 at 03:50
  • there is more than one question above :) – pltc Oct 12 '21 at 15:53

0 Answers