Sparklyr: sdf_copy_to fails with 350 MB dataset

Question

I'm having trouble writing two datasets with sparklyr::spark_write_csv(). This is my configuration:

library(sparklyr)
library(DBI)  # provides dbGetQuery()

# Configure cluster
config <- spark_config()
config$spark.yarn.keytab <- "mykeytab.keytab"
config$spark.yarn.principal <- "myyarnprincipal"
config$sparklyr.gateway.start.timeout <- 10
config$spark.executor.instances <- 2
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.driver.memory <- "4G"

config$spark.kryoserializer.buffer.max <- "1G"

Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")
Sys.setenv(HADOOP_CONF_DIR = '/etc/hadoop/conf.cloudera.hdfs')
Sys.setenv(YARN_CONF_DIR = '/etc/hadoop/conf.cloudera.yarn')

# Connect to the cluster
sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.0')

Once the Spark context is created, I try to save the two datasets to HDFS using spark_write_csv(). As an intermediate step, I need to convert each data frame into a tbl_spark. Unfortunately, I can only save the first one correctly; the second one (which is bigger, but at about 350 MB by no means big by Hadoop standards) takes a long time and finally crashes.

# load datasets
tmp_small <- read.csv("first_one.csv", sep = "|") # 13 MB
tmp_big <- read.csv("second_one.csv", sep = "|") # 352 MB

tmp_small_Spark <- sdf_copy_to(sc, tmp_small, "tmp_small", memory = FALSE, overwrite = TRUE)
tables_preview <- dbGetQuery(sc, "SHOW TABLES")

tmp_big_Spark <- sdf_copy_to(sc, tmp_big, "tmp_big", memory = FALSE, overwrite = TRUE) # fails here
tables_preview <- dbGetQuery(sc, "SHOW TABLES")

It is probably a configuration problem but I can't figure it out. This is the error:

|================================================================================| 100% 352 MB

Error in invoke_method.spark_shell_connection(sc, TRUE, class, method,  : 
No status is returned. Spark R backend might have failed.

Thanks

score 0 · Answer 1 · answered Jul 04 '17 at 21:49

I was also having issues loading larger files. Try adding this to the Spark connection config object, before calling spark_connect():

config$spark.rpc.message.maxSize <- 512

It's a workaround, though.
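
A minimal sketch of where the setting would go, reusing the yarn-client connection from the question. One assumption that is not in the original post: as far as I know, spark.rpc.message.maxSize is the Spark 2.x name for this limit, and Spark 1.x called it spark.akka.frameSize, so on a 1.6.0 cluster the older key may be the one that matters. Setting both is only a guess at what is safe, and the value 512 (in MB) is simply the one suggested above.

```r
library(sparklyr)

config <- spark_config()

# From the answer above: raise the RPC message size limit (value in MB).
config$spark.rpc.message.maxSize <- 512

# Assumption: Spark 1.x used this older key for the same limit, so a
# 1.6.0 cluster like the one in the question may need it instead.
config$spark.akka.frameSize <- 512

# Spark reads these properties when the context is created, so they
# have to be set before connecting.
sc <- spark_connect(master = "yarn-client", config = config, version = "1.6.0")
```

If the copy still fails, the limit may not be the cause; the setting only changes how large a single message between the driver and the executors can be.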

Pedro Fernández
