
I am trying to load a dataset with a million rows and 1,000 columns with sparklyr. I am running Spark on a very large cluster at work, yet the data still seems to be too big. I have tried two different approaches, roughly as sketched below:

This is the dataset: (train_numeric.csv) https://www.kaggle.com/c/bosch-production-line-performance/data

1) Put the .csv into HDFS and read it with `spark_read_csv(spark_context, path)`

2) Read the .csv file into a regular R data frame and copy it with `spark_frame <- copy_to(sc, R_dataframe)`
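For reference, a minimal sketch of the two approaches (the connection settings and file paths below are hypothetical, since they are not shown in the post):

library(sparklyr)

sc <- spark_connect(master = "yarn-client")  # hypothetical cluster connection

# Approach 1: read the CSV straight from HDFS into Spark
train_tbl <- spark_read_csv(sc, name = "train_numeric",
                            path = "hdfs:///data/train_numeric.csv")

# Approach 2 (alternative): read into R first, then copy the R data frame to Spark
train_df  <- read.csv("train_numeric.csv")
train_tbl <- copy_to(sc, train_df, name = "train_numeric", overwrite = TRUE)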

Both ways work perfectly fine on a subset of the dataset, but fail when I try to read the entire dataset.

Is anybody aware of a method that is suitable for large datasets?

Thanks, Felix


1 Answer


The question is: do you need to read the entire data set into memory?

First of all, note that Spark evaluates transformations lazily. Setting the `memory` parameter of `spark_read_csv()` to `FALSE` makes Spark map the file without copying it into memory; the whole calculation takes place only once `collect()` is called.

spark_read_csv(sc, "flights_spark_2008", "2008.csv.bz2", memory = FALSE)

So consider cutting down on the rows and columns before doing any calculations and collecting the results back to R, as in the example linked below:

http://spark.rstudio.com/examples-caching.html#process_on_the_fly
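For instance, a minimal sketch of that pattern on the Bosch file (the connection and HDFS path are hypothetical; `Id` and `Response` are columns of train_numeric.csv):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")  # hypothetical connection

# Map the file without caching it in Spark's memory
train_tbl <- spark_read_csv(sc, name = "train_numeric",
                            path = "hdfs:///data/train_numeric.csv",
                            memory = FALSE)

# Reduce rows and columns inside Spark, then collect only the small result into R
failures <- train_tbl %>%
  filter(Response == 1) %>%
  select(Id, Response) %>%
  collect()

Only the filtered rows ever travel back to the R session; the full 1,000-column table stays in Spark.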

michalrudko
  • I understand that, but I actually need to read in the whole data frame – Felix May 30 '17 at 20:05
  • But why? What are you going to do with this data? I'd still suggest setting memory to FALSE and piping the operations you want to perform. – michalrudko May 30 '17 at 23:31
  • small caveat: setting `memory` to `TRUE` means your data gets cached _in spark_, which is the way to go if you want to perform more than one operation with that `Spark Dataframe`. – Janna Maas May 31 '17 at 08:14
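A small sketch of that caching pattern, under the same hypothetical connection and path as above:

# memory = TRUE (the default) caches the mapped data in Spark, so repeated
# queries against the same Spark DataFrame do not re-read the file
train_tbl <- spark_read_csv(sc, name = "train_numeric",
                            path = "hdfs:///data/train_numeric.csv",
                            memory = TRUE)

n_rows   <- train_tbl %>% count() %>% collect()
pos_rate <- train_tbl %>% summarise(rate = mean(Response, na.rm = TRUE)) %>% collect()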