
I have a 15 GB dataset: 72 million records and 26 features. I would like to compare 7 supervised ML models on a classification problem: SVM, random forest, decision tree, naive Bayes, ANN, KNN and XGBoost. I created a sample of 7.2 million records (10% of the full set), but even on that sample, running the models (and even feature selection) takes a very long time. I only use RStudio at the moment.

I've been looking for an answer to my question for days. I tried the following:

  • data.table – still not enough to reduce the processing time
  • sparklyr – I can't copy my dataset into Spark because it's too large

I am looking for a solution that doesn't cost anything. Can someone please help me?

  • What is the source of the data? Is it .csv, a database connection, etc? If we know where the data is coming from we can think about how to get it into Spark. – Raphael K Nov 14 '19 at 13:28
  • Hi Raphael, it's a csv file. No database connection. I just downloaded from a website. – Yuliya Khan Nov 15 '19 at 15:21

3 Answers


If you have access to Spark, you can use sparklyr to read the CSV file directly.

install.packages('sparklyr')
library(sparklyr)

## You'll have to connect to your Spark cluster, this is just a placeholder example
sc <- spark_connect(master = "spark://HOST:PORT")

## Read large CSV into Spark
sdf <- spark_read_csv(sc, 
                      name = "my_spark_table", 
                      path = "/path/to/my_large_file.csv")

## Take a look
head(sdf)
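
If the in-memory copy was what failed with sparklyr before, one variation to try (a sketch; `memory = FALSE` is an existing argument of `spark_read_csv` that skips caching the table in Spark memory):

## Same read, but without caching the table in memory (may help with a 15 GB file)
sdf <- spark_read_csv(sc,
                      name = "my_spark_table",
                      path = "/path/to/my_large_file.csv",
                      memory = FALSE)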

You can use dplyr verbs to manipulate the data (docs). For the machine learning itself, you'll need sparklyr's functions for Spark MLlib (docs), which cover most of the models you listed (linear SVM, random forest, decision tree, naive Bayes, multilayer perceptron, gradient-boosted trees); KNN is the main one without a built-in equivalent.
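
As a rough sketch of how the modelling step could look (the label column and formula below are placeholders, not from the question):

## Split the Spark table into training and test sets
splits <- sdf_random_split(sdf, training = 0.8, test = 0.2, seed = 42)

## Fit one of the MLlib classifiers, e.g. a random forest ("label" is a placeholder column)
rf_model <- ml_random_forest_classifier(splits$training, label ~ .)

## Score the held-out set and compute accuracy
pred <- ml_predict(rf_model, splits$test)
ml_multiclass_classification_evaluator(pred, metric_name = "accuracy")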

Raphael K

Try Google Colab. It gives you free hosted compute and supports R, which should make working with a dataset of this size easier.
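
A minimal sketch, assuming you start an R runtime in Colab and upload the file (the path and column names below are placeholders, not from the question); reading only the columns you need with data.table keeps memory usage down:

library(data.table)

## Read just the columns needed for modelling (placeholder names)
dt <- fread("/content/my_large_file.csv",
            select = c("feature_1", "feature_2", "target"))

## Quick sanity check
str(dt)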

Running Rabbit

You should look into the disk.frame package, which works on larger-than-RAM data by splitting it into chunks on disk and letting you process it with dplyr verbs.
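
A rough sketch of how that might look (the file path, output directory, and column names are placeholders):

library(disk.frame)
library(dplyr)

## Use several workers for chunk-wise parallel processing
setup_disk.frame(workers = 4)
options(future.globals.maxSize = Inf)

## Convert the large CSV into a disk.frame stored as chunks on disk
df <- csv_to_disk.frame("/path/to/my_large_file.csv",
                        outdir = "my_large_file.df")

## dplyr verbs run per chunk; collect() brings the result into memory
df %>%
  filter(feature_1 > 0) %>%   # placeholder column names
  select(feature_1, target) %>%
  collect()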

hello_friend