
Out-of-memory issues caused by collecting a Spark DataFrame into an R data.frame have been discussed here several times (e.g. here or here). However, none of the answers seems to be usable in my environment.

Problem:

I'm trying to collect some transactional data using inSample <- collect(read.parquet(inSamplePath)) (read.parquet from the SparkR library). My driver has 256 GB RAM. The sample data is smaller than 4 GB as uncompressed CSV (the parquet is much smaller), but the command fails with a java.lang.OutOfMemoryError: Requested array size exceeds VM limit error.
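
For completeness, a minimal sketch of the failing pattern (the inSamplePath value below is a placeholder, not the real ADLS Gen2 location):

```r
library(SparkR)

# On Databricks the SparkR session normally already exists in a notebook;
# calling sparkR.session() simply returns the existing session in that case
sparkR.session()

# Placeholder path - the real input lives in Azure Data Lake Storage Gen2
inSamplePath <- "abfss://container@account.dfs.core.windows.net/path/to/sample"

# read.parquet succeeds; collect() into a local R data.frame is what throws
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
inSample <- collect(read.parquet(inSamplePath))
```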

Environment and Data:

  • Azure Databricks cluster with Standard_E32s_v3 driver node (256 GB RAM) and 2 - 8 Standard_D16s_v3 workers
  • Input data is stored as parquet files in Azure Data Lake Storage Gen2
  • Raw data size ~4 GB (a mix of integers, strings, and doubles; the data originates in an upstream transactional SQL DB, so the structure is quite clean and no string exceeds 256 characters)

Attempted Resolution:

  • Cluster Spark config: spark.driver.maxResultSize 128g
  • Cluster Environment variable: JAVA_OPTS="-Xms120G -Xmx120G -XX:MaxPermSize=8G -XX:PermSize=4G"
  • Script configuration (tested when using sparklyr): config$spark.driver.memory set to 120G (see the sketch below)
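
Roughly what the sparklyr attempt looked like (a sketch; the memory values are restated from the bullets above, and the "databricks" connection method is an assumption for running inside a Databricks notebook):

```r
library(sparklyr)
library(dplyr)

# Driver-side settings that were attempted (values from the bullets above)
conf <- spark_config()
conf$spark.driver.memory        <- "120g"
conf$spark.driver.maxResultSize <- "128g"

# Assumed connection method for a Databricks notebook; adjust if connecting differently
sc <- spark_connect(method = "databricks", config = conf)

# Same read-then-collect pattern as the SparkR version
inSample <- spark_read_parquet(sc, name = "in_sample", path = inSamplePath) %>%
  collect()
```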

None of the attempts helped. I must be missing some configuration property that allocates memory to the local R process. Please recommend which server configuration(s) or environment variable(s) should be changed.


EDIT:

  • Another option tested w/o success:
  • Capacity validation: the input can be collected (val x = myInput.collect) when using Spark Scala, even with maxResultSize 16g; the issue is restricted to the R side only
  • did you find out how to solve this? – Dipperman Feb 20 '21 at 08:00
  • @Dipperman no - gave up some accuracy to partition the input to fit. Not a great solution. – Dan Feb 26 '21 at 22:33
  • Are you using the `databricks-connect` client library to handle the back-end? I am working through a similar issue currently with the azure/databricks support team, I think our problems have the same root cause – Matt Summersgill Mar 17 '21 at 20:00
  • @MattSummersgill no, just keeping code in the Databricks notebook(s). I believe the behavior will be the same (issue is happening when cluster is executing the code, code origin hopefully doesn't matter). I'll try to test databricks-connect as soon as I'm out of current fire-drill and share results. Please keep me posted about your findings too. – Dan Mar 17 '21 at 20:20
  • 1
    @Dan - I'll keep you posted if my investigation bears any fruit. I was able to get this down to a relatively minimal example that generates it's own data here: https://gist.github.com/msummersgill/fb61204b73c2bebcaf5a1fe299172b45 – Matt Summersgill Mar 17 '21 at 21:19
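
For reference, a rough sketch of the chunked-collect workaround Dan mentions above (splitting the input and collecting it piece by piece so no single result hits the JVM array limit); the partitioning column part_id and the exact calls are illustrative assumptions, not part of the original question:

```r
library(SparkR)

sdf <- read.parquet(inSamplePath)

# Hypothetical column used to split the input into manageable chunks
part_ids <- collect(distinct(select(sdf, "part_id")))$part_id

# Collect one chunk at a time, then bind the pieces back together locally
chunks <- lapply(part_ids, function(p) {
  collect(filter(sdf, sdf$part_id == p))
})
inSample <- do.call(rbind, chunks)
```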

0 Answers