
I am using Spark from R, via the sparklyr package, to run a regression on a huge dataset (>500 million observations). But I want a weighted regression, and I can't seem to find the correct syntax / function to do that.

Currently I am doing:

sparklyr::ml_linear_regression(
    data_spark, 
    response = "y", 
    features = c("x1", "x2"))

Using base R, I would simply do:

lm(y ~ x1 + x2, weights = wt, data = data)

But of course base R can't handle data this large.

How can I do the same from R, using the sparklyr package to interface with Spark?

(I've tried to do all this with the SparkR package bundled with Spark; SparkR::spark.glm() has just what I need, the weightCol argument, but I can't make Spark work with that package because I could not copy the data to Spark; it always hits "Error: memory exhausted (limit reached?)", even though I tweaked the sparkConfig parameters.)
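
For reference, this is roughly the SparkR route I was aiming for, assuming the data could be read straight into Spark instead of being copied from an R data.frame (the file path and source are placeholders here; wt is the weight column):

    library(SparkR)
    sparkR.session()

    # Read the data directly into Spark rather than loading it into R first
    df <- read.df("path/to/data.parquet", source = "parquet")

    # spark.glm exposes weightCol, which is what I can't find in ml_linear_regression
    fit <- spark.glm(df, y ~ x1 + x2, family = "gaussian", weightCol = "wt")
    summary(fit)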

Hernando Casas
  • watch https://github.com/rstudio/sparklyr/issues/217 -- I hope to look into it soon – kevinykuo Apr 29 '17 at 16:13
  • thanks @kevinykuo for the answer. Yeah, for the moment I did it by modifying that function: I added a `weightCol = NULL` argument and, right after the model is assigned, these couple of lines: `if (!is.null(weightCol)) model %>% invoke("setWeightCol", as.character(weightCol))`. It works and the results mirror those of base `R` `lm` (a fuller sketch follows the comment thread). – Hernando Casas Apr 29 '17 at 18:30
  • thanks for the experiment! For clarification, I didn't write the function, just trying to help since I ran into the same problem with glm ;) – kevinykuo Apr 29 '17 at 19:17
  • regarding the `Error: memory exhausted (limit reached?)`, I think this is not related to the sparkConfig, but to `R` memory. How did you try to load the data into `Spark`? If you use `SparkR::read.df` to load a data file directly, you don't need to load it into R first. – Janna Maas May 03 '17 at 11:08
  • @JannaMaas yeah I think that was R. But loading the data directly to Spark, using `SparkR::read.df` seems to work at first, but later hits errors like `java.lang.IllegalArgumentException: requirement failed: Decimal precision 6 exceeds max precision 5` or `scala.MatchError: [458552.068965517,348,(52,[5,49],[1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) at org.apache.spark.ml.regression.GeneralizedLinearRegression$$anonfun$5.apply(GeneralizedLinearRegression.scala:260)` – Hernando Casas May 07 '17 at 15:58
  • admittedly I am very new to Spark and I've been playing around with it for just a few days. But even though I followed every tutorial I found on the web (starting with the instructions on the official page http://spark.apache.org/docs/latest/sparkr.html), I couldn't make it work using `SparkR`. – Hernando Casas May 07 '17 at 16:29
  • Which version of Spark are you using? It could be you're running into this bug: https://issues.apache.org/jira/browse/SPARK-18877 – Janna Maas May 08 '17 at 06:54
  • thanks. I was using 2.1.0 (spark-2.1.0-bin-hadoop2.7.tgz); I've just updated to 2.1.1 released just a few days ago and the decimal precision issue is gone (which is weird 'cause it seems 2.1.0 was not an affected version, but anyways). – Hernando Casas May 08 '17 at 10:24
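
A rough sketch, based on the workaround described in the comments above: instead of patching ml_linear_regression itself, the same setWeightCol call can be made through sparklyr's lower-level invoke API. Here data_spark, y, x1, x2 and wt come from the question; the RFormula step is my own assumption to build the label/features columns and is not part of the original workaround, and I haven't tested this exact code on the 500-million-row data:

    library(sparklyr)
    library(dplyr)  # for %>%

    # data_spark is the Spark table from the question; wt is the weight column
    sc  <- spark_connection(data_spark)
    sdf <- spark_dataframe(data_spark)

    # Use Spark's RFormula to create the usual label/features columns
    prepared <- invoke_new(sc, "org.apache.spark.ml.feature.RFormula") %>%
      invoke("setFormula", "y ~ x1 + x2") %>%
      invoke("fit", sdf) %>%
      invoke("transform", sdf)

    # LinearRegression with the weight column set, mirroring the
    # setWeightCol call mentioned in the comments
    lr <- invoke_new(sc, "org.apache.spark.ml.regression.LinearRegression") %>%
      invoke("setWeightCol", "wt")

    # Fit and pull out the coefficients and intercept
    fit <- invoke(lr, "fit", prepared)
    invoke(invoke(fit, "coefficients"), "toArray")
    invoke(fit, "intercept")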

0 Answers