In SparkR I have a DataFrame data that also contains an id column. I also have a vector liste = 2 9 12 102 154 ... 1451, where length(liste) = 3001. I want the rows of data whose id appears in liste. In SparkR I do this:

# Start with the first two ids, then union in one filtered DataFrame per remaining id
newdata <- unionAll(filter(data, data$id == liste[1]), filter(data, data$id == liste[2]))
for (j in 3:10) {
  newdata <- unionAll(newdata, filter(data, data$id == liste[j]))
}

Even these 10 iterations take a long time, about 5 minutes. When I try to run all 3001 iterations, SparkR fails with "error returnstatus==0 is not true". How can I solve this?

Ole Petersen

1 Answer

I have not yet checked whether %in% is supported in Spark 1.5, but it is always possible to filter via a join:

DF <- createDataFrame(sqlContext,
                      data.frame(id = c(1,1,2,3,3,4),
                                 value = c(1,2,3,4,5,6)))

goodID <- createDataFrame(sqlContext, data.frame(goodID = c(1,3)))

newData <- join(DF, goodID, DF$id == goodID$goodID)
newData$goodID <- NULL
collect(newData)
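
For comparison, the %in% route mentioned at the top of this answer could look roughly like the sketch below. It assumes a SparkR version in which %in% is defined for a Column (to my knowledge that is the case from 1.5 on, but please verify against your installation) and a local Spark context created with the 1.x-style sparkR.init / sparkRSQL.init calls. The toy data and the names liste and result are only illustrative.

```r
library(SparkR)

# Assumption: SparkR 1.x-style initialisation with a local master
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Same toy data as in the join example
DF <- createDataFrame(sqlContext,
                      data.frame(id = c(1, 1, 2, 3, 3, 4),
                                 value = c(1, 2, 3, 4, 5, 6)))

# Illustrative id list; in the question this would be the 3001-element liste
liste <- c(1, 3)

# One filter over the whole list instead of one filter plus unionAll per id
newData <- filter(DF, DF$id %in% liste)
result <- collect(newData)
```

Either way, the point is to express the selection as a single operation on the DataFrame rather than chaining thousands of unionAll calls.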
Wannes Rosiers