How to achieve the same result with sparklyr on a Spark DataFrame as with dplyr on an R data frame?

Question

The following code calculates a set of regression coefficients for each of three dependent variables, regressed on a set of six independent variables, within each of two groups. It works fine.

library(tidyverse)
library(broom)
n  <- 20
df4  <- data.frame(groupingvar= sample(1:2, size = n, replace = TRUE),
                   y1 = rnorm(n,10,1), y2=rnorm(n,100,10), y3=rnorm(n,1000,100),
                   x1=  rnorm(n,10,1), x2=rnorm(n,10,1), x3=rnorm(n,10,1),
                   x4=rnorm(n,10,1), x5=rnorm(n,10,1), x6=rnorm(n,10,1))
df4 <- arrange(df4,groupingvar)

regs <- df4 %>% group_by(groupingvar) %>%
  do(fit = lm(cbind(y1,y2,y3) ~ . -groupingvar, data = .))
coeffs <- tidy(regs, fit)

I would like to replicate the same logic using a Spark DataFrame instead of an R data frame. For example, something along the lines of:

library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
sparkdf4ref <- sdf_copy_to(sc, df4, "sparkdf4", overwrite = TRUE)

sparkdf4refregs <- sparkdf4ref %>% group_by(groupingvar) %>%
  do(sparkfit = lm(cbind(y1,y2,y3) ~ . -groupingvar, data = .))
coeffs <- tidy(sparkdf4refregs, sparkfit)

This code fails primarily because I need to use 'ml_linear_regression' instead of 'lm', but it fails even if I make that substitution. If I keep 'ml_linear_regression' but remove the cbind() and keep only one dependent variable, then some coefficients are calculated, although broom::tidy is not able to pick them up.

Is there a way to produce this result in the sparklyr framework, or with another method if need be?

Follow https://github.com/rstudio/sparklyr/issues/670, I'll try to send a PR in the next few days. — kevinykuo, May 25 '17 at 02:30
Tidy has been implemented for `ml_linear_regression` in latest dev version. — kevinykuo, May 27 '17 at 15:53
Thank you! As for 'cbind()' not working with 'ml_linear_regression()', is there a way to write the above code with a loop and pass the loop results on to 'tidy()'? — user1689945, Jun 01 '17 at 13:36
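
One possible shape for such a loop, offered as an untested sketch rather than a confirmed answer: fit one 'ml_linear_regression' per (group, response) pair and stack the 'tidy()' output. It assumes a sparklyr version in which 'ml_linear_regression()' accepts a formula and has a 'tidy()' method (the dev version mentioned above or later). The helper name 'fit_by_group' and the added 'response' column are hypothetical, introduced only for illustration, and the grouping column is assumed to be called 'groupingvar' as in the question.

```r
library(sparklyr)
library(dplyr)
library(purrr)
library(broom)

# Hypothetical helper (not part of sparklyr): one Spark linear regression
# per (group, response) pair, with the tidied coefficients stacked.
# Assumes `sdf` is a tbl_spark with a column named `groupingvar`.
fit_by_group <- function(sdf, responses, predictors) {
  # Collect only the distinct group labels to the driver.
  groups <- sdf %>% distinct(groupingvar) %>% pull(groupingvar)

  # Every combination of group and dependent variable.
  combos <- expand.grid(groupingvar = groups, response = responses,
                        stringsAsFactors = FALSE)

  fits <- map2(combos$groupingvar, combos$response, function(g, y) {
    # Build e.g. y1 ~ x1 + x2 + ... + x6 for the current response.
    fml <- reformulate(predictors, response = y)
    sdf %>%
      filter(groupingvar == !!g) %>%  # restrict to one group inside Spark
      ml_linear_regression(fml) %>%   # assumes the formula interface exists
      tidy() %>%                      # assumes a tidy() method for the model
      mutate(groupingvar = g, response = y)
  })
  bind_rows(fits)
}

# Usage with the objects from the question (requires a live Spark connection):
# coeffs <- fit_by_group(sparkdf4ref, c("y1", "y2", "y3"), paste0("x", 1:6))
```

This launches a separate Spark job for each of the 2 x 3 fits, so it is only reasonable when the number of groups is small. Each response gets its own formula because 'cbind()' on the left-hand side is not accepted.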

