
I have code that runs well on a small dataset (a few million rows) but fails on a larger dataset (> 1 billion rows). The error it throws is:

Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 

I have gone over the executor and driver logs with a fine-tooth comb. There is nothing in there to suggest what is going on differently between the two dataset sizes. The code I am using is:

spark_df = spark_df.repartition([KEY COLUMNS])
rdd = spark_df.rdd.mapPartitions(lambda x: process_partition(x))
final_df = spark.createDataFrame(rdd, schema=schema, verifySchema=True)
final_df.write.format("delta").mode([MODE]).save([SAVE_LOCATION])

I have tried so many things:

  1. Changed the groupby to make the groups smaller
  2. Increased the resources of the machines in the cluster
  3. Commented out all but one of the "transformations" in the code base.
  4. Changed or added the following cluster configuration options (a sketch of how these might be set follows the list):
    • spark.network.timeout 10000000
    • spark.executor.heartbeatInterval 10000000
  5. Added a timeout to the job: 10000000
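
For reference, a minimal sketch of how those two options might be set when building the session (on Databricks they would normally go in the cluster's Spark config instead); the values simply mirror the ones above:

from pyspark.sql import SparkSession

# Assumed standalone PySpark session; the option values mirror those tried above.
spark = (
    SparkSession.builder
    .config("spark.network.timeout", "10000000")
    .config("spark.executor.heartbeatInterval", "10000000")
    .getOrCreate()
)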

Throughout it all, the error hasn't changed, and the logs don't seem to contain any useful information to help me understand what is going on.

Garet Jax

1 Answer

The solution ended up being very simple, although neither the logs nor the documentation were much help in linking it to the problem.

By default, Databricks/Spark uses 200 shuffle partitions (spark.sql.shuffle.partitions), which is also the number repartition falls back to when no partition count is given. For the smaller dataset, that works fine; for the larger dataset, it is far too few. The solution is to pass the desired number of partitions explicitly in the repartition call:

spark_df = spark_df.repartition(1000, [KEY_COLUMNS])
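
If it helps with debugging, a minimal sketch (reusing the variable names from the question) for confirming the partition count, together with the equivalent global setting:

# Check how many partitions the DataFrame actually has after the repartition.
print(spark_df.rdd.getNumPartitions())

# Alternatively, raise the shuffle default that produces the 200 partitions.
spark.conf.set("spark.sql.shuffle.partitions", 1000)
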
Garet Jax