I am trying to load about 40 large JSON files (150-200GB each on average) into Spark using sparklyr. Some of the files would fit entirely in the cluster's RAM; some of them would be too big.
Unfortunately, the command

spark_read_json(sc, name = "mydata", path = "mypath/files_json*", memory = FALSE)

(where sc is my Spark connection and "mydata" is just a placeholder table name) creates about 500k jobs and takes forever, no matter what my cluster allocation is (number of cores allocated, number of executors, RAM, etc.).
I kept playing with config <- spark_config(), config$spark.executor.memory, config$spark.executor.cores, config$spark.default.parallelism, and others, but the number of tasks does not change. I do have a large cluster available, though.
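
For reference, this is roughly how I set those options before connecting; the exact values and the master = "yarn" setting below are just placeholders, since I have tried many combinations:

library(sparklyr)

# Example configuration -- the values and the master are placeholders, not my real allocation
config <- spark_config()
config$spark.executor.memory <- "32g"      # memory per executor
config$spark.executor.cores <- 4           # cores per executor
config$spark.default.parallelism <- 400    # default number of partitions
sc <- spark_connect(master = "yarn", config = config)

# followed by the spark_read_json() call above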
I feel there is a serious optimization problem here. Any idea what I should change in the Spark options or elsewhere? I have desperately tried every option I could think of.
Thanks!!!