
Out-of-memory issues caused by collecting a Spark DataFrame into an R data.frame have been discussed here several times (e.g. here or here). However, none of the answers seems to be usable in my environment.

Problem:

I'm trying to collect some transactional data using inSample <- collect(read.parquet(inSamplePath)) (read.parquet from the SparkR library). My driver has 256 GB RAM. The sample data is smaller than 4 GB as uncompressed CSV (the parquet is much smaller), but the command fails with a java.lang.OutOfMemoryError: Requested array size exceeds VM limit error.
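
For completeness, a minimal sketch of the failing pattern (the inSamplePath value below is a placeholder, not the real ADLS Gen2 location):

```r
library(SparkR)

# On Databricks the SparkR session normally already exists in a notebook;
# calling sparkR.session() simply returns the existing session in that case
sparkR.session()

# Placeholder path - the real input lives in Azure Data Lake Storage Gen2
inSamplePath <- "abfss://container@account.dfs.core.windows.net/path/to/sample"

# read.parquet succeeds; collect() into a local R data.frame is what throws
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
inSample <- collect(read.parquet(inSamplePath))
```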

Environment and Data:

  • Azure Databricks cluster with Standard_E32s_v3 driver node (256 GB RAM) and 2 - 8 Standard_D16s_v3 workers
  • Input data is stored as parquet files in Azure Data Lake Storage Gen2
  • Raw data size ~4 GB (a mix of integers, strings, and doubles; the data originates in an upstream transactional SQL DB, so the structure is quite clean and no string exceeds 256 characters)

Attempted Resolution:

  • Cluster Spark config: spark.driver.maxResultSize 128g
  • Cluster Environment variable: JAVA_OPTS="-Xms120G -Xmx120G -XX:MaxPermSize=8G -XX:PermSize=4G"
  • Script configuration (tested when using sparklyr): config$spark.driver.memory set to 120G (see the sketch below)
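
Roughly what the sparklyr attempt looked like (a sketch; the memory values are restated from the bullets above, and the "databricks" connection method is an assumption for running inside a Databricks notebook):

```r
library(sparklyr)
library(dplyr)

# Driver-side settings that were attempted (values from the bullets above)
conf <- spark_config()
conf$spark.driver.memory        <- "120g"
conf$spark.driver.maxResultSize <- "128g"

# Assumed connection method for a Databricks notebook; adjust if connecting differently
sc <- spark_connect(method = "databricks", config = conf)

# Same read-then-collect pattern as the SparkR version
inSample <- spark_read_parquet(sc, name = "in_sample", path = inSamplePath) %>%
  collect()
```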

None of the attempts helped. I must be missing some configuration property that allocates memory to the local R process. Please recommend which server configuration(s) or environment variable(s) should be changed.


EDIT:

  • Another option tested w/o success:
  • Capacity validation: the input can be collected (val x = myInput.collect) when using Spark Scala, even with maxResultSize 16g; the issue is restricted to the R side only
  • did you find out how to solve this? – Dipperman Feb 20 '21 at 08:00
  • @Dipperman no - gave up some accuracy to partition the input to fit. Not a great solution. – Dan Feb 26 '21 at 22:33
  • Are you using the `databricks-connect` client library to handle the back-end? I am working through a similar issue currently with the azure/databricks support team, I think our problems have the same root cause – Matt Summersgill Mar 17 '21 at 20:00
  • @MattSummersgill no, just keeping code in the Databricks notebook(s). I believe the behavior will be the same (issue is happening when cluster is executing the code, code origin hopefully doesn't matter). I'll try to test databricks-connect as soon as I'm out of current fire-drill and share results. Please keep me posted about your findings too. – Dan Mar 17 '21 at 20:20
  • 1
    @Dan - I'll keep you posted if my investigation bears any fruit. I was able to get this down to a relatively minimal example that generates it's own data here: https://gist.github.com/msummersgill/fb61204b73c2bebcaf5a1fe299172b45 – Matt Summersgill Mar 17 '21 at 21:19
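
For reference, a rough sketch of the chunked-collect workaround Dan mentions above (splitting the input and collecting it piece by piece so no single result hits the JVM array limit); the partitioning column part_id and the exact calls are illustrative assumptions, not part of the original question:

```r
library(SparkR)

sdf <- read.parquet(inSamplePath)

# Hypothetical column used to split the input into manageable chunks
part_ids <- collect(distinct(select(sdf, "part_id")))$part_id

# Collect one chunk at a time, then bind the pieces back together locally
chunks <- lapply(part_ids, function(p) {
  collect(filter(sdf, sdf$part_id == p))
})
inSample <- do.call(rbind, chunks)
```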

0 Answers