I am using Spark from R via the sparklyr package to run a regression on a huge dataset (>500 million observations), but I need a weighted regression and I can't find the correct syntax/function to do that.
Currently I am doing
sparklyr::ml_linear_regression(
  data_spark,
  response = "y",
  features = c("x1", "x2")
)
Using base R I would simply do:

lm(y ~ x1 + x2, weights = wt, data = data)
But of course base R can't handle data of this size.
How can I do the same with Spark from R, using the sparklyr package to interface with Spark?
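In case it helps, this is the kind of call I'm hoping exists. Note that `weight_col` here is my guess, modeled on Spark MLlib's `weightCol`; I haven't confirmed that my installed sparklyr version actually exposes it:

```r
library(sparklyr)

# Sketch only: assumes sparklyr exposes MLlib's weightCol
# as a `weight_col` argument. Column names y, x1, x2, wt
# match those used above.
fit <- ml_linear_regression(
  data_spark,
  formula = y ~ x1 + x2,
  weight_col = "wt"  # per-observation weights, like lm()'s `weights`
)
summary(fit)
```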
(I've tried doing all this with the SparkR package bundled with Spark; SparkR::spark.glm() has exactly what I need, the weightCol argument, but I can't get Spark to work with that package because I could not copy the data to Spark: it always fails with "Error: memory exhausted (limit reached?)", even after tweaking the sparkConfig parameters.)
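For completeness, one thing I have considered but not yet tried: reading the file on the Spark side instead of copying a data frame from the R session, which might sidestep the memory error. A sketch, assuming the data sits in a CSV file (the path is a placeholder):

```r
library(SparkR)

sparkR.session()

# Read the file directly into Spark rather than building the
# data frame in R first. The path and format are placeholders
# for wherever the data actually lives.
data_spark <- read.df("path/to/data.csv",
                      source = "csv",
                      header = "true",
                      inferSchema = "true")
```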