
I found this example of calling spark.mllib functions directly from the Scala library. I don't follow every detail of it, but is it possible to call any MLlib function (one that is not exposed through, say, sparklyr) this way? In particular I am interested in

org.apache.spark.mllib.stat.Statistics.kolmogorovSmirnovTest

The code I found:

envir <- new.env(parent = emptyenv())   # environment the ml_* helpers write metadata into
df    <- spark_dataframe(x)             # underlying Spark DataFrame behind the tbl
sc    <- spark_connection(df)           # Spark connection that owns it
df    <- ml_prepare_features(df, features)
tdf   <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)

sparklyr makes sure that you have a proper connection to the Spark DataFrame and prepares the features in a convenient form and with a consistent naming convention. In the end it produces a Spark DataFrame ready for Spark ML routines.

You can then construct a simple model by calling a Spark ML class like this:

envir$model <- "org.apache.spark.ml.clustering.KMeans"  # fully qualified name of the Spark ML class
kmeans      <- invoke_new(sc, envir$model)               # instantiate the estimator on the JVM

model <- kmeans %>%
    invoke("setK", centers) %>%
    invoke("setMaxIter", iter.max) %>%
    invoke("setTol", tolerance) %>%
    invoke("setFeaturesCol", envir$features)
# features were set by ml_prepare_dataframe

fit <- model %>% invoke("fit", tdf)
# reminder: 
# tdf   <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)

Source: http://r-addict.com/DataScienceWarsaw25/show/#/preparing-spark-ml-algorithm
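
If I read this correctly, fit is just a reference to a JVM KMeansModel object, so I assume the predictions can be pulled back into a regular sparklyr tbl roughly like this (my own untested sketch, not from the blog post; the table name is arbitrary):

predictions <- fit %>%
    invoke("transform", tdf) %>%         # KMeansModel.transform() adds a prediction column
    sdf_register("kmeans_predictions")   # expose the resulting JVM DataFrame as a dplyr tbl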

Alexey Burnakov
  • Short answer is no (or at least not in an acceptable manner), as `mllib` package is not compatible with `Dataset` API. Long answer is - are you ready to learn Scala? :) That being said, you should avoid `mllib` package in general - it is frozen, and slowly deprecated. With `ml` package calling unimplemented methods is usually not that hard. – zero323 Feb 07 '18 at 17:40
  • With the latest `sparklyr` release, most, if not all, of the `spark.ml` stuff you may need should already be available. Feel free to file an issue if something is blocking you https://github.com/rstudio/sparklyr/issues – kevinykuo Feb 08 '18 at 09:02
  • @kevinykuo, thank you. I got it. I am a statistician and most of my methods rely on either direct hypothesis testing (Chi-sq, KS-test, T-test) or using the distribution functions to get p-values. Neither is available via sparklyr, while Kolmogorov-Smirnov is supported in sparkR at least. That is why I was wondering if I could employ the ml functions directly. – Alexey Burnakov Feb 08 '18 at 09:06
  • @AlexeyBurnakov thanks. It looks like `ChiSquareTest` is the only one available in the `Dataset` API. We'll implement it in `sparklyr` soon. Tracking at https://github.com/rstudio/sparklyr/issues/1247 – kevinykuo Feb 08 '18 at 09:27
  • @kevinykuo, am I right then that checking this documentation on stat methods tells me what is supported in the Dataset API? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.stat.package – Alexey Burnakov Feb 08 '18 at 09:39
  • @AlexeyBurnakov that's correct – kevinykuo Feb 08 '18 at 09:40
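
Following up on the comments: since ChiSquareTest lives in the Dataset-based org.apache.spark.ml.stat package, my guess is that, until sparklyr wraps it (issue 1247 above), it can be reached directly with invoke_static. An untested sketch, assuming Spark 2.2+; the column and table names below are my own, and the features column must be a Vector column (e.g. built with ft_vector_assembler):

chisq_result <- invoke_static(
    sc,
    "org.apache.spark.ml.stat.ChiSquareTest",
    "test",
    tdf,          # a Spark DataFrame (spark_jobj)
    "features",   # name of the Vector-typed features column
    "label"       # name of the label column
) %>%
    sdf_register("chisq_test_result")  # result row holds pValues, degreesOfFreedom, statistics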

0 Answers