See two relevant references: "Extracting Class Probabilities from SparkR ML Classification Functions" and "sparkR 1.6: How to predict probability when modeling with glm (binomial family)".
I'm wondering whether there is any way to do this without converting the SparkDataFrame back to an R data.frame via either as.data.frame or collect, because that seems impossible when there are millions of rows.