Out-of-memory issues caused by collecting a Spark DataFrame into an R data.frame have been discussed here several times (e.g. here or here). However, none of the answers seems to be usable in my environment.
Problem:
I'm trying to collect some transactional data using inSample <- collect(read.parquet(inSamplePath)) (read.parquet from the SparkR library). My driver has 256 GB RAM. The sample data is smaller than 4 GB (as uncompressed CSV; the parquet is much smaller), but the command fails with a java.lang.OutOfMemoryError: Requested array size exceeds VM limit error.
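For context, a minimal sketch of the failing step (the path below is a placeholder; the real inSamplePath points at the parquet folder in the Data Lake):

```r
library(SparkR)
sparkR.session()   # on Databricks a SparkR session normally already exists

# placeholder path; the real inSamplePath points at the ADLS Gen2 parquet folder
inSamplePath <- "abfss://container@storageaccount.dfs.core.windows.net/in_sample"

inSampleSDF <- read.parquet(inSamplePath)   # distributed SparkR DataFrame
inSample    <- collect(inSampleSDF)         # fails: Requested array size exceeds VM limit
```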
Environment and Data:
- Azure Databricks cluster with a Standard_E32s_v3 driver node (256 GB RAM) and 2 to 8 Standard_D16s_v3 workers
- Input data is stored as parquet files in Azure Data Lake Storage Gen2
- Raw data size is ~4 GB (a mix of integers, strings, and doubles; the data originates in an upstream transactional SQL DB, so the structure is quite clean and no string exceeds 256 characters)
Attempted Resolution:
- Cluster Spark config:
  spark.driver.maxResultSize 128g
- Cluster Environment variable:
JAVA_OPTS="-Xms120G -Xmx120G -XX:MaxPermSize=8G -XX:PermSize=4G"
- Script configuration (tested when using sparklyr): config$spark.driver.memory set to 120G (see the sketch after this list)
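The sparklyr attempt looked roughly like this (a sketch under the settings described above; the exact connection details are assumed):

```r
library(sparklyr)
library(dplyr)

config <- spark_config()
config$spark.driver.memory        <- "120G"   # attempted driver memory
config$spark.driver.maxResultSize <- "128g"   # mirrors the cluster Spark config above

# on Databricks the connection attaches to the running cluster
sc <- spark_connect(method = "databricks", config = config)

inSample <- spark_read_parquet(sc, name = "in_sample", path = inSamplePath) %>%
  collect()
```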
None of the attempts helped. I must be missing some configuration property that allocates memory to local R. Please recommend which server configuration(s) or environment variable(s) should be changed.
EDIT:
- Other options tested without success:
- Updated to Spark 3.0
- spark.driver.memory 130g
- spark.driver.memoryOverhead 62g (added based on Spark 3.0 configuration description)
- spark.driver.maxResultSize 64g
- Capacity validation: the input can be collected (val x = myInput.collect) from Spark Scala even with maxResultSize 16g; the issue is restricted to the R side only
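For reference, the effective driver settings can be read back from the SparkR side to confirm they were actually applied (a verification sketch only, not one of the attempts above):

```r
library(SparkR)

# read back the values the driver actually picked up
sparkR.conf("spark.driver.memory")
sparkR.conf("spark.driver.maxResultSize")
sparkR.conf("spark.driver.memoryOverhead", "not set")
```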