The following code calculates a set of regression coefficients for each of three dependent variables, regressed on a set of six independent variables, for each of two groups. It works fine.
library(tidyverse)
library(broom)
n <- 20
df4 <- data.frame(groupingvar = sample(1:2, size = n, replace = TRUE),
                  y1 = rnorm(n, 10, 1), y2 = rnorm(n, 100, 10), y3 = rnorm(n, 1000, 100),
                  x1 = rnorm(n, 10, 1), x2 = rnorm(n, 10, 1), x3 = rnorm(n, 10, 1),
                  x4 = rnorm(n, 10, 1), x5 = rnorm(n, 10, 1), x6 = rnorm(n, 10, 1))
df4 <- arrange(df4,groupingvar)
regs <- df4 %>% group_by(groupingvar) %>%
  do(fit = lm(cbind(y1, y2, y3) ~ . - groupingvar, data = .))
coeffs <- tidy(regs, fit)
I would like to replicate the same logic using a Spark DataFrame instead of an R data frame. For example, something along the lines of:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
sparkdf4ref <- sdf_copy_to(sc, df4, "sparkdf4", overwrite = TRUE)
sparkdf4refregs <- sparkdf4ref %>% group_by(groupingvar) %>%
  do(sparkfit = lm(cbind(y1, y2, y3) ~ . - groupingvar, data = .))
coeffs <- tidy(sparkdf4refregs, sparkfit)
This code fails primarily because I need to use 'ml_linear_regression' instead of 'lm', but it fails even if I do the substitution. If I keep 'ml_linear_regression' but remove the cbind() and keep only one dependent variable, then some coefficients are calculated, although broom::tidy is not able to pick up the coefficients.
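As a side note on dropping cbind(): for ordinary least squares, a multi-response fit gives the same coefficients as fitting each response separately on the same predictors, so fitting one dependent variable at a time should not change the numbers. The base-R sketch below checks this on toy data. It does not use Spark, and the data frame `toy`, the reduced set of two responses and two predictors, and the helper names `multi`, `single`, and `same` are assumptions made only for illustration.

```r
# Illustrative base-R check only; no Spark involved.
# The toy data and all object names here are assumptions for the sketch.
set.seed(42)
n <- 40
toy <- data.frame(groupingvar = rep(1:2, each = n / 2),
                  y1 = rnorm(n, 10, 1), y2 = rnorm(n, 100, 10),
                  x1 = rnorm(n, 10, 1), x2 = rnorm(n, 10, 1))

responses  <- c("y1", "y2")
predictors <- c("x1", "x2")

# Multi-response fit per group: one coefficient matrix per group
# (rows = terms, columns = responses).
multi <- lapply(split(toy, toy$groupingvar), function(d) {
  coef(lm(cbind(y1, y2) ~ x1 + x2, data = d))
})

# One single-response fit per (group, response) pair, bound into
# a matrix with the same layout as above.
single <- lapply(split(toy, toy$groupingvar), function(d) {
  sapply(responses, function(y) {
    coef(lm(reformulate(predictors, response = y), data = d))
  })
})

# TRUE when both approaches agree up to numerical tolerance.
same <- isTRUE(all.equal(multi, single, check.attributes = FALSE))
print(same)
```

If that holds, looping over the response names and fitting one model per response within each group would be one way to reproduce the cbind() result, assuming the per-group fitting itself can be made to work on the Spark side.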
Is there a way to produce this result in the sparklyr framework or with another method if need be?