
I have to run some regression models and descriptive statistics on a big dataset. I have a folder of around 500 txt files, totalling 250 GB, which I would like to merge.

I know how to merge all files from a folder, but even though I am running it on a server with 128 GB of RAM, I keep running out of memory.

I am looking for any tips/advice on how to load in/merge these files in a manageable way (if possible) using R. I have been looking into packages such as "ff" and "bigmemory"; will these offer me a solution?

research111
    I don't know what you are using to read your data, but read_csv (from readr) and fread (from data.table) are usually faster than read.csv or read.table – MLavoie Dec 24 '15 at 09:48
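    For what it's worth, a minimal sketch of what this comment suggests, with a hypothetical folder path and file pattern. Note that this still reads everything into RAM, so with 250 GB of data it speeds up reading but does not by itself solve the memory problem:

        library(data.table)

        # Hypothetical folder path; fread() is typically much faster than read.table().
        files <- list.files("path/to/folder", pattern = "\\.txt$", full.names = TRUE)

        # Read each file and stack them; this requires the combined data to fit in RAM.
        merged <- rbindlist(lapply(files, fread))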

1 Answer


I would suggest the ff and biglm packages. The latter allows you to run a regression on the entire dataset stored on disk (using ff) by loading smaller chunks of it into RAM at a time. Use read.table.ffdf() to convert the separate txt files into an ff file on disk. See the example in the help file for chunk.ffdf() for how to run a regression using biglm().
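A rough sketch of that workflow is below. The folder path, the tab separator, and the formula y ~ x1 + x2 (and its column names) are placeholders for illustration; it also assumes all txt files share the same columns.

    library(ff)
    library(biglm)

    # List the ~500 txt files (hypothetical folder path).
    files <- list.files("path/to/folder", pattern = "\\.txt$", full.names = TRUE)

    # Read the first file into an ffdf stored on disk, then append the rest.
    # Passing the existing ffdf as 'x' makes read.table.ffdf() append to it.
    dat <- read.table.ffdf(file = files[1], header = TRUE, sep = "\t")
    for (f in files[-1]) {
      dat <- read.table.ffdf(x = dat, file = f, header = TRUE, sep = "\t")
    }

    # Fit the regression chunk by chunk: biglm() on the first chunk,
    # then update() the fit with each subsequent chunk pulled into RAM.
    fit <- NULL
    for (i in chunk(dat)) {
      piece <- dat[i, ]          # only this chunk is held in memory
      if (is.null(fit)) {
        fit <- biglm(y ~ x1 + x2, data = piece)
      } else {
        fit <- update(fit, piece)
      }
    }
    summary(fit)

Only one chunk of rows is ever in RAM at a time, so the 250 GB never has to fit in the 128 GB of memory.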

Han de Vries