
In the Java/Scala/Python APIs of Spark, one can simply call the foreach method of the RDD or DataFrame types to parallelize iteration over a dataset.

In SparkR I can't find such an instruction. What is the proper way to iterate over the rows of a DataFrame?

I could only find the gapply and dapply functions, but I don't want to calculate new column values; I just want to take one element from a list at a time and do something with it, in parallel.

My previous attempt was with spark.lapply:

inputDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "")
createOrReplaceTempView(inputDF, 'inputData')

distinctM <- sql('SELECT DISTINCT(ID_M) FROM inputData')

collected <- collect(distinctM)[[1]]

problemSolver <- function(idM) {
  filteredDF <- filter(inputDF, inputDF$ID_M == idM)
}

spark.lapply(c(collected), problemSolver)

but I'm getting this error:

Error in handleErrors(returnStatus, conn) : 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 1 times, most recent failure: Lost task 1.0 in stage 5.0 (TID 207, localhost, executor driver): org.apache.spark.SparkException: R computation failed with
 Error in callJMethod(x@sdf, "col", c) : 
  Invalid jobj 3. If SparkR was restarted, Spark operations need to be re-executed.
Calls: compute ... filter -> $ -> $ -> getColumn -> column -> callJMethod

What would be the R/SparkR way to solve this kind of problem?

Vektor88

1 Answer


I had a similar problem as well. Collecting a SparkDataFrame pulls it into R as a regular data.frame, and from there you can work with each row just as you would in plain old R. In my opinion, though, that is a poor pattern for processing data, because you lose the parallel processing Spark provides. It is also why your spark.lapply attempt fails: the function you pass gets shipped to the workers, but inputDF is only a handle to a DataFrame living in the driver's JVM, so the workers cannot use it (hence the "Invalid jobj" error).

Instead of collecting the data and then filtering, use the built-in SparkR functions: select, filter, etc. If you want row-wise operations, the built-in SparkR functions will generally cover them; otherwise, I have found selectExpr or expr to be very useful when the original Spark functions are designed to work on a single value (think: from_unixtime).
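For example, here is a minimal selectExpr sketch (the ts column of Unix epoch seconds is purely an assumption for illustration; it is not in your data):

# Sketch only: "ts" is a hypothetical column of Unix epoch seconds.
# selectExpr evaluates Spark SQL expressions on every row, in parallel, without collecting.
withTimes <- SparkR::selectExpr(inputDF, "ID_M", "from_unixtime(ts) AS event_time")
head(withTimes)  # brings back just a few rows; the full result stays distributed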

So, to get what you want, I would try something like this (I'm on SparkR 2.0+):

First, read in the data as you have done:

inputDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "")

A note on that step: read.df already gives you a distributed SparkDataFrame, so there is nothing to convert; SparkR::createDataFrame is only needed when you start from a local R data.frame. You can keep working with inputDF directly.
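If you ever do start from a local R data.frame (say, a hypothetical localDF), that is when createDataFrame comes in:

# Hypothetical: only needed to ship a *local* R data.frame out to Spark.
sdf <- SparkR::createDataFrame(localDF)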

Next, isolate only the distinct/unique ID values (I'm using magrittr for piping, which also works with SparkR):

library(magrittr)  # for %>%
distinctSparkDF <- SparkR::select(inputDF, "ID_M") %>% SparkR::distinct()

From here, you can apply your filtering while still living in Spark's world (the "some_id" value below is just a placeholder for one of your ID_M values):

filteredSparkDF <- SparkR::filter(inputDF, inputDF$ID_M == "some_id")

After Spark has filtered that data for you, it makes sense to collect the subset into base R as the last step in the workflow:

myRegularRDataframe <- SparkR::collect(filteredSparkDF)
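One more thought, since the question mentions gapply: if you genuinely need to run your own R code once per ID_M group, in parallel, gapply is the built-in SparkR way to do that. The sketch below is only an illustration; the output schema and the row count are placeholders, and I'm assuming ID_M is a string column (adjust the structField type if it is not):

# Placeholder schema: one output row per ID_M with that group's row count.
resultSchema <- SparkR::structType(SparkR::structField("ID_M", "string"),
                                   SparkR::structField("n_rows", "integer"))

perGroup <- SparkR::gapply(inputDF, "ID_M",
                           function(key, x) {
                             # x is a plain R data.frame holding every row for this ID_M;
                             # do the per-group work here and return rows matching resultSchema
                             data.frame(key, nrow(x), stringsAsFactors = FALSE)
                           },
                           resultSchema)

head(perGroup)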

I hope this helps. Best of luck. --nate

nate