Nonlinear Regressions on large datasets using SparkR (or other methods?)

Question

I'm trying to run a nonlinear regression (NLR) on a very large dataset. For smaller test datasets I have working code in R, and I am trying to port it to SparkR.

I'm new to Spark (R and otherwise).

R (my working code):

After a bit of manipulation I obtain the R-DataFrame df and run the following NLR:

nls(y1 ~ b0/(1+exp(b1+b2*y2+b3*y3)),df)

sparkR:

After starting SparkR with the CSV package ($ sparkR --packages com.databricks:spark-csv_2.11:1.3.0), I managed to create the SparkR DataFrame and run a linear regression as a test case:

customSchema <- structType(...)
spk_df = read.df(sqlContext, path, header='true', source = "com.databricks.spark.csv", schema=customSchema)
test_linear_model <- glm(y1 ~ y2 + y3, data = spk_df)
summary(test_linear_model)

(Side note: I had to create customSchema because inferSchema always cast the columns to strings instead of doubles.)

How do you run an NLR in SparkR? Is it possible, or does the nonlinearity necessarily exclude Spark's parallelizing magic?
I'm assuming there's no benefit in just collecting the Spark DataFrame and fitting locally:

nls(y1 ~ b0/(1+exp(b1+b2*y2+b3*y3)),collect(spk_df))

back to R:

If the nonlinearity is going to prevent me from using Spark in a useful way, how should I approach NLRs for large datasets?

I've tried using the R ff package, specifically its ffdf data frames, but I'm having trouble, I imagine for the same reason SparkR is failing.
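One direction that might work, as a rough sketch rather than a tested solution: a least-squares fit only needs sums over rows at each iteration, so the data can be processed chunk by chunk (or partition by partition) and the per-chunk results added up. The code below does this with a plain Gauss-Newton iteration plus step halving for the model above. The function names chunk_sums, total_sums and fit_chunked, the simulated data, and the starting values are all made up for illustration. It is not what nls does internally, and I have not checked which SparkR call would play the role of the lapply over chunks.

```r
# Per-chunk pieces of the Gauss-Newton normal equations for
# y1 ~ b0 / (1 + exp(b1 + b2*y2 + b3*y3)), with b = c(b0, b1, b2, b3).
chunk_sums <- function(chunk, b) {
  e <- exp(b[2] + b[3] * chunk$y2 + b[4] * chunk$y3)
  r <- chunk$y1 - b[1] / (1 + e)            # residuals
  d <- -b[1] * e / (1 + e)^2                # derivative w.r.t. the linear predictor
  J <- cbind(1 / (1 + e), d, d * chunk$y2, d * chunk$y3)  # Jacobian of the model
  list(JtJ = crossprod(J), Jtr = crossprod(J, r), rss = sum(r^2))
}

# Add the per-chunk pieces; only these small sums ever need to be combined.
total_sums <- function(chunks, b) {
  parts <- lapply(chunks, chunk_sums, b = b)
  list(JtJ = Reduce(`+`, lapply(parts, `[[`, "JtJ")),
       Jtr = Reduce(`+`, lapply(parts, `[[`, "Jtr")),
       rss = sum(vapply(parts, `[[`, numeric(1), "rss")))
}

fit_chunked <- function(chunks, start, max_iter = 100, tol = 1e-10) {
  b <- start
  cur <- total_sums(chunks, b)
  for (i in seq_len(max_iter)) {
    step <- as.numeric(solve(cur$JtJ, cur$Jtr))
    fac <- 1
    repeat {  # halve the step until the residual sum of squares stops increasing
      cand <- total_sums(chunks, b + fac * step)
      if (isTRUE(cand$rss <= cur$rss) || fac < 1e-6) break
      fac <- fac / 2
    }
    converged <- isTRUE(abs(cur$rss - cand$rss) <= tol * (cur$rss + tol))
    b <- b + fac * step
    cur <- cand
    if (converged) break
  }
  b
}

# Simulated data (made up) split into four chunks standing in for files or partitions.
set.seed(42)
n <- 2000
truth <- c(10, 0.5, -1, 2)
y2 <- rnorm(n)
y3 <- rnorm(n)
y1 <- truth[1] / (1 + exp(truth[2] + truth[3] * y2 + truth[4] * y3)) + rnorm(n, sd = 0.1)
df <- data.frame(y1 = y1, y2 = y2, y3 = y3)
chunks <- split(df, rep(1:4, length.out = n))

b_hat <- fit_chunked(chunks, start = c(9, 0.3, -0.8, 1.8))
```

Each iteration touches every chunk at least once, so this trades memory for repeated passes over the data. Like nls, it depends on reasonable starting values.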

I could in principle work with a random sample of rows, much like this SO question, but my data frame is actually created by manipulating several files/data frames, so I'll need to select the same random rows from each file. I've been able to generate these sampled files with:

$ dd if=/dev/random of=rsource count=150000
$ N=500000
$ gshuf -n $N --random-source rsource first.csv > first_sample.csv
$ gshuf -n $N --random-source rsource second.csv > second_sample.csv
$ gshuf -n $N --random-source rsource third.csv > third_sample.csv

Are there better solutions? This makes me a little nervous: although in principle the files should all have the same number of lines in the same order, I am worried that sometimes there might be a bad image.
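One alternative I can imagine, sketched in R under some assumptions: draw the row indices once, check that every file has the same number of lines, and only then write the samples. The helper name sample_same_rows and the demo files are hypothetical. The sketch assumes each CSV starts with a header row and is small enough to read with readLines, which the real files may not be.

```r
# Hypothetical helper: keep the same randomly chosen data rows from every file.
sample_same_rows <- function(paths, n, seed = 1) {
  lines <- lapply(paths, readLines)
  counts <- vapply(lines, length, integer(1))
  # Fail early if the files are not aligned line for line.
  stopifnot(length(unique(counts)) == 1)
  set.seed(seed)
  # Line 1 is assumed to be the header; sample data rows only, keeping file order.
  keep <- sort(sample(seq_len(counts[1] - 1), n)) + 1
  out <- sub("\\.csv$", "_sample.csv", paths)
  for (i in seq_along(paths)) {
    writeLines(lines[[i]][c(1, keep)], out[i])
  }
  out
}

# Demo on three small made-up files that share an id column.
demo_dir <- tempfile("demo")
dir.create(demo_dir)
paths <- file.path(demo_dir, c("first.csv", "second.csv", "third.csv"))
for (p in paths) {
  writeLines(c("id,value", paste(1:100, runif(100), sep = ",")), p)
}
samples <- sample_same_rows(paths, n = 10)
```

The line-count check catches files of different length, but it would not notice a file whose rows are merely out of order. Comparing a shared key column across the sampled files, if there is one, would be a stronger check.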

Ideas?

Thanks!!!
