In the Java/Scala/Python APIs of Spark, one can simply call the foreach method of the RDD or DataFrame types to iterate over a dataset in parallel.
In SparkR I can't find such an operation. What would be the proper way to iterate over the rows of a DataFrame?
I could only find the gapply and dapply functions, but I don't want to calculate new column values; I just want to do something with each element of a list, in parallel.
My previous attempt was with spark.lapply:
inputDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "")
createOrReplaceTempView(inputDF,'inputData')
distinctM <- sql('SELECT DISTINCT(ID_M) FROM inputData')
collected <- collect(distinctM)[[1]]
problemSolver <- function(idM) {
  filteredDF <- filter(inputDF, inputDF$ID_M == idM)
}
spark.lapply(c(collected), problemSolver)
but I'm getting this error:
Error in handleErrors(returnStatus, conn) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 1 times, most recent failure: Lost task 1.0 in stage 5.0 (TID 207, localhost, executor driver): org.apache.spark.SparkException: R computation failed with
Error in callJMethod(x@sdf, "col", c) :
Invalid jobj 3. If SparkR was restarted, Spark operations need to be re-executed.
Calls: compute ... filter -> $ -> $ -> getColumn -> column -> callJMethod
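To illustrate the constraint I seem to be hitting (a toy sketch, not my real code): spark.lapply works as long as the worker function only touches plain R values, but it fails as soon as the function references a JVM-backed object such as the SparkDataFrame inputDF, which I suspect is why the filter call above breaks:

# Works: only plain R objects are used, so they can be shipped to the workers.
squares <- spark.lapply(1:4, function(k) k * k)

# Fails with the "Invalid jobj" error shown above: inputDF is just a handle to an
# object living in the driver's JVM, so it cannot be used inside a function that
# runs in the worker R processes.
# spark.lapply(collected, function(idM) filter(inputDF, inputDF$ID_M == idM))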
What would be the proper way to solve this kind of problem in R/SparkR?