
I am trying to load a dataset with a million rows and 1,000 columns with sparklyr. I am running Spark on a very large cluster at work, yet the data still seems to be too big. I have tried two different approaches, roughly as sketched below:

This is the dataset: (train_numeric.csv) https://www.kaggle.com/c/bosch-production-line-performance/data

1) Put the .csv into HDFS and read it with `spark_read_csv(spark_context, path)`

2) Read the .csv file into a regular R data frame and copy it with `spark_frame <- copy_to(sc, R_dataframe)`
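For reference, a minimal sketch of the two approaches (the connection settings and file paths below are hypothetical, since they are not shown in the post):

library(sparklyr)

sc <- spark_connect(master = "yarn-client")  # hypothetical cluster connection

# Approach 1: read the CSV straight from HDFS into Spark
train_tbl <- spark_read_csv(sc, name = "train_numeric",
                            path = "hdfs:///data/train_numeric.csv")

# Approach 2 (alternative): read into R first, then copy the R data frame to Spark
train_df  <- read.csv("train_numeric.csv")
train_tbl <- copy_to(sc, train_df, name = "train_numeric", overwrite = TRUE)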

Both ways work perfectly fine on a subset of the dataset, but fail when I try to read the entire dataset.

Is anybody aware of a method that is suitable for large datasets?

Thanks, Felix


1 Answer


The question is: do you need to read the entire data set into memory?

First of all, note that Spark evaluates transformations lazily. Setting the `memory` parameter of `spark_read_csv()` to `FALSE` makes Spark map the file without copying it into memory; the whole calculation takes place only once `collect()` is called.

spark_read_csv(sc, "flights_spark_2008", "2008.csv.bz2", memory = FALSE)

So consider cutting down on the rows and columns before doing any calculations and collecting the results back to R, as in the example linked below:

http://spark.rstudio.com/examples-caching.html#process_on_the_fly
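For instance, a minimal sketch of that pattern on the Bosch file (the connection and HDFS path are hypothetical; `Id` and `Response` are columns of train_numeric.csv):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")  # hypothetical connection

# Map the file without caching it in Spark's memory
train_tbl <- spark_read_csv(sc, name = "train_numeric",
                            path = "hdfs:///data/train_numeric.csv",
                            memory = FALSE)

# Reduce rows and columns inside Spark, then collect only the small result into R
failures <- train_tbl %>%
  filter(Response == 1) %>%
  select(Id, Response) %>%
  collect()

Only the filtered rows ever travel back to the R session; the full 1,000-column table stays in Spark.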

michalrudko
  • I understand that, but I actually need to read in the whole data frame – Felix May 30 '17 at 20:05
  • But why? What are you going to do with this data? I'd still suggest setting memory to FALSE and piping the operations you want to perform. – michalrudko May 30 '17 at 23:31
  • small caveat: setting `memory` to `TRUE` means your data gets cached _in spark_, which is the way to go if you want to perform more than one operation with that `Spark Dataframe`. – Janna Maas May 31 '17 at 08:14
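A small sketch of that caching pattern, under the same hypothetical connection and path as above:

# memory = TRUE (the default) caches the mapped data in Spark, so repeated
# queries against the same Spark DataFrame do not re-read the file
train_tbl <- spark_read_csv(sc, name = "train_numeric",
                            path = "hdfs:///data/train_numeric.csv",
                            memory = TRUE)

n_rows   <- train_tbl %>% count() %>% collect()
pos_rate <- train_tbl %>% summarise(rate = mean(Response, na.rm = TRUE)) %>% collect()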