
I am running the gapply function on a SparkR DataFrame, as shown below:

df <- gapply(sp_Stack, function(key, e) {

      # Force C collation so string sorting is deterministic on the workers
      Sys.setlocale("LC_COLLATE", "C")

      # Load the packages the UDF needs on each executor
      suppressPackageStartupMessages({
        library(Rcpp)
        library(Matrix)
        library(reshape)
        library(parallel)
        library(lubridate)
        library(plyr)
        library(reticulate)
        library(stringr)
        library(data.table)
      })

      calcDecsOnly(e, RequestNumber = RequestNumber,
                   ...)
    }, cols = "udim", schema = schema3)

The above code runs without any errors if we set spark.sql.execution.arrow.sparkr.enabled = "false", but if I set spark.sql.execution.arrow.sparkr.enabled = "true" the Spark job fails with the error below:

Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:154)
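
For context, here is how I toggle the flag at session startup (a minimal sketch; the app name is a placeholder and the creation of sp_Stack is omitted):

    library(SparkR)

    # Start the SparkR session with the Arrow optimization enabled;
    # flipping this value to "false" is what makes the job succeed.
    sparkR.session(
      appName = "gapply-arrow-repro",  # placeholder
      sparkConfig = list(
        spark.sql.execution.arrow.sparkr.enabled = "true"
      )
    )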

Environment: Google Cloud Dataproc
Spark version: 3.1.1
Dataproc version: custom image built on 2.0.9-debian10

Any help here is appreciated, thanks in advance.

Benak Raj
  • Did you try a non-custom image and other image versions? – Dagang Jul 15 '21 at 20:09
  • Yes, I tried with the latest image (Dataproc version 2.0.12-debian10), with both arrow-4.0.1 and arrow-3.0.0. – Benak Raj Jul 16 '21 at 08:05
  • Did you install Arrow by yourself? – Dagang Jul 16 '21 at 08:10
  • Yes, I did it as part of the initialisation script of the custom Dataproc image. Command used: /opt/conda/miniconda3/bin/conda install r-arrow==3.0.0 – Benak Raj Jul 16 '21 at 08:56
  • Could you try these instructions? https://spark.apache.org/docs/3.1.1/sparkr.html#apache-arrow-in-sparkr – Dagang Jul 16 '21 at 18:00
  • I did try this; in fact, I am using both installation methods when building my image: /opt/conda/miniconda3/bin/conda install r-arrow==3.0.0 as well as the one mentioned in the doc above. If I only use Rscript -e 'install.packages("arrow", repos="https://cloud.r-project.org/")', arrow does not get installed correctly on the Dataproc executors. – Benak Raj Jul 19 '21 at 13:04
  • Did you guys try increasing the memory for non-JVM processes on the executors? This looks like an error on the JVM executor read side, where the streaming Arrow data is halting early. This could be due to a crash of the R sidecar process from something as trivial as an OOM. I'd recommend bumping the value of spark.executor.memoryOverhead, which defaults to 10% of spark.executor.memory. Do note that Dataproc defaults spark.executor.memory based on VM size for executor bin packing, so if you bump memoryOverhead you may want an equal decrease in memory to ensure good resource utilization (see the sketch after this thread). – KoopaKing Aug 27 '21 at 20:28
  • Hey, I had tried with increased overhead memory, since the R process gets its memory from the overhead; it did not help though. Our R process is CPU-intensive, so I also think CPU may be a bottleneck here, but I did not find a way to give a dedicated CPU to the backend (tried numbackendthreads). – Benak Raj Aug 28 '21 at 06:26
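
A minimal sketch of the memoryOverhead change suggested above (the sizes are placeholders and should be tuned to the VM shape):

    library(SparkR)

    # Trade some JVM heap for off-heap headroom so the R worker process
    # is less likely to be killed; memoryOverhead defaults to
    # max(384 MiB, 10% of spark.executor.memory).
    sparkR.session(
      sparkConfig = list(
        spark.executor.memory = "10g",          # decreased from the Dataproc default (placeholder)
        spark.executor.memoryOverhead = "4g"    # increased from the ~10% default (placeholder)
      )
    )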

0 Answers