I'm trying to run a nonlinear regression (NLR) on a very large dataset. For smaller test datasets I have working code in R, and I am trying to port it over to SparkR.
I'm new to Spark (R and otherwise).
R (my working code):
After a bit of manipulation I obtain the R data.frame df and run the following NLR:
nls(y1 ~ b0/(1+exp(b1+b2*y2+b3*y3)),df)
SparkR:
After starting SparkR with the csv package ($ sparkR --packages com.databricks:spark-csv_2.11:1.3.0), I have managed to create the SparkR DataFrame and run a linear regression as a test case:
customSchema <- structType(...)
spk_df = read.df(sqlContext, path, header='true', source = "com.databricks.spark.csv", schema=customSchema)
test_linear_model <- glm(y1 ~ y2 + y3, data = spk_df)
summary(test_linear_model)
(side note: I had to create the customSchema because inferSchema always cast the columns to strings instead of doubles)
- How do you run an NLR in SparkR? Is it possible, or does the nonlinearity necessarily rule out Spark's parallelizing magic?
- I'm assuming there's no benefit in just collecting the Spark DataFrame and running nls on it locally:
nls(y1 ~ b0/(1+exp(b1+b2*y2+b3*y3)),collect(spk_df))
back to R:
If the nonlinearity is going to prevent me from using Spark in a useful way, how should I approach NLRs for large datasets?
I've tried using the R ff package, specifically its ffdf data frames, but I'm having trouble, I imagine for the same reason SparkR is failing.
I could in principle work with randomly selected rows of the data, much like this SO question, but my data frame is actually created by manipulating several files/data frames, and I'll need to select the same random rows from each file. I've been able to generate these random files with
$ dd if=/dev/random of=rsource count=150000
$ N=500000
$ gshuf -n $N --random-source rsource first.csv > first_sample.csv
$ gshuf -n $N --random-source rsource second.csv > second_sample.csv
$ gshuf -n $N --random-source rsource third.csv > third_sample.csv
Are there better solutions? This makes me a little nervous: although in principle the files should all have the same number of lines in the same order, I am worried that sometimes there might be a bad image that throws one of them out of alignment.
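One variant of the sampling above, as a minimal sketch rather than a tested solution: draw the line numbers once and pull those same lines out of every file, after first checking that the line counts agree. The function names (check_same_length, draw_line_numbers, extract_lines) and the picks.txt file are made up for illustration, and the sketch assumes each CSV has a single header line and no quoted fields containing newlines.

```shell
#!/bin/sh
# Sketch: sample the SAME rows from several CSVs that should be row-aligned.
# Assumption: every file has one header line followed by data rows.

# GNU shuf is installed as "gshuf" by Homebrew on macOS; use whichever exists.
SHUF=$(command -v shuf || command -v gshuf) || { echo "shuf not found" >&2; exit 1; }

# Return non-zero unless every file has as many lines as the first one.
check_same_length() {
  local expected n f
  expected=$(( $(wc -l < "$1") ))
  for f in "$@"; do
    n=$(( $(wc -l < "$f") ))
    if [ "$n" -ne "$expected" ]; then
      echo "line count mismatch: $f has $n lines, expected $expected" >&2
      return 1
    fi
  done
}

# Print N distinct line numbers of data rows (the header is line 1).
# usage: draw_line_numbers N FILE > picks.txt
draw_line_numbers() {
  local total
  total=$(( $(wc -l < "$2") ))
  "$SHUF" -i "2-$total" -n "$1" | sort -n
}

# Print the header plus every line of FILE whose number is listed in PICKS.
# usage: extract_lines PICKS FILE > sample.csv
extract_lines() {
  awk 'NR == FNR { keep[$1]; next } FNR == 1 || (FNR in keep)' "$1" "$2"
}
```

With the files above this would be called as check_same_length first.csv second.csv third.csv, then draw_line_numbers 500000 first.csv > picks.txt, then extract_lines picks.txt first.csv > first_sample.csv (and likewise for the other two). The length check only reports a file with a missing or extra line; it cannot tell whether the rows are in the same order.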
Ideas?
Thanks!!!