I have a 15 GB dataset with 72 million records and 26 features. I would like to compare 7 supervised ML models on a classification problem: SVM, random forest, decision tree, naive Bayes, ANN, KNN and XGBoost. I created a sample of 7.2 million records (10% of the full dataset), but even running the models on this sample, or just feature selection, takes a very long time. At the moment I only use RStudio.
I've been looking for an answer to my questions for days. I tried the following things:
- data.table: still not sufficient to reduce the processing time
- sparklyr: can't copy my dataset, because it's too large
I am looking for a free solution to my problem. Can someone please help me?