I found this example of calling spark.mllib functions directly from the underlying Scala/JVM library. I don't fully understand everything here, but in any case: is it possible to call any MLlib function (one that is not exposed through, let's say, sparklyr) this way? In particular I am interested in
org.apache.spark.mllib.stat.Statistics.kolmogorovSmirnovTest
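To make the question concrete, here is roughly the call I imagine, using sparklyr's low-level invoke_static(). This is my own unverified sketch, not working code: kolmogorovSmirnovTest is a static method on the Statistics object and expects an RDD[Double], and I don't know how to produce that RDD from R (the rdd below is an RDD of Rows, which is presumably the missing piece):

library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
sdf <- sdf_copy_to(sc, data.frame(x = rnorm(100)), "samples")

# select the sample column and drop down to the underlying RDD;
# invoke("rdd") yields an RDD[Row], not the RDD[Double] the test wants,
# and I don't see how to express the Row -> Double mapping via invoke()
rdd <- sdf %>%
  spark_dataframe() %>%
  invoke("select", "x", list()) %>%
  invoke("rdd")

# test against a standard normal: distName = "norm", params = (mean 0, sd 1)
result <- invoke_static(
  sc,
  "org.apache.spark.mllib.stat.Statistics",
  "kolmogorovSmirnovTest",
  rdd, "norm", list(0, 1)
)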
The code I found:
# (excerpt: x, features and ml.options come from the enclosing sparklyr function)
envir <- new.env(parent = emptyenv())   # scratch environment the helpers fill in
df <- spark_dataframe(x)                # underlying Spark DataFrame (a jobj)
sc <- spark_connection(df)              # recover the Spark connection from the jobj
df <- ml_prepare_features(df, features) # cast the feature columns into shape
tdf <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)
sparklyr ensures that you have a proper connection to the Spark DataFrame and prepares the features in a convenient form, with a consistent naming convention. At the end it produces a Spark DataFrame ready for Spark ML routines.
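If I read it correctly, the side effect of ml_prepare_dataframe() is that envir now carries the generated column names, so later invoke() calls can refer to them. My assumption about the exact fields (only envir$features is confirmed by the code below):

envir$features   # name of the assembled vector column of features
envir$response   # name of the label column, when one was supplied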
You can construct a simple model by instantiating a Spark ML class and configuring it with invoke() calls, like this:
envir$model <- "org.apache.spark.ml.clustering.KMeans"
kmeans <- invoke_new(sc, envir$model)   # new org.apache.spark.ml.clustering.KMeans()

model <- kmeans %>%
  invoke("setK", centers) %>%           # number of clusters
  invoke("setMaxIter", iter.max) %>%    # maximum number of iterations
  invoke("setTol", tolerance) %>%       # convergence tolerance
  invoke("setFeaturesCol", envir$features)
# features were set in ml_prepare_dataframe
fit <- model %>% invoke("fit", tdf)
# reminder:
# tdf <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)
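Continuing in the same spirit, I would expect the fitted KMeansModel to be usable through further invoke() calls as well. A hedged sketch (method names taken from the Spark ML Scala API; untested from R):

# cluster centers: Array[Vector] on the JVM side
centers_jobj <- fit %>% invoke("clusterCenters")

# score the training data and pull the predictions back into R
predictions <- fit %>%
  invoke("transform", tdf) %>%
  sdf_register("predictions") %>%
  collect()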
Source: http://r-addict.com/DataScienceWarsaw25/show/#/preparing-spark-ml-algorithm