I am trying to load a dataset with a million rows and 1,000 columns into Spark using sparklyr. I am running Spark on a very large cluster at work, but the data still seems to be too big to load. I have tried two different approaches:
This is the dataset (train_numeric.csv): https://www.kaggle.com/c/bosch-production-line-performance/data
1) Put the .csv into HDFS and read it directly with spark_read_csv(sc, path)
2) Read the .csv into a regular R data frame first and copy it to Spark with spark_frame <- copy_to(sc, r_dataframe), as shown in the sketch below
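For reference, this is roughly the code I am running for both approaches. It is only a minimal sketch: the connection master, the HDFS path, and the local file location are placeholders, not my actual cluster configuration.

    library(sparklyr)
    library(readr)

    # assumed cluster manager; my real connection settings differ
    sc <- spark_connect(master = "yarn-client")

    # Approach 1: read the CSV straight from HDFS into Spark
    train_tbl <- spark_read_csv(
      sc,
      name = "train_numeric",
      path = "hdfs:///data/bosch/train_numeric.csv"  # placeholder path
    )

    # Approach 2: read into R first, then copy the data frame to Spark
    train_df  <- read_csv("train_numeric.csv")       # local copy of the file
    train_tbl <- copy_to(sc, train_df, name = "train_numeric")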
Both approaches work perfectly fine on a subset of the dataset, but they fail when I try to read the entire dataset.
Is anybody aware of a method that works for datasets of this size?
Thanks, Felix