I am trying to rbind a large number of CSV files that have many overlapping columns.
library(plyr)

# full.names = TRUE returns paths with the "output/" prefix already attached
filenames <- list.files("output/", pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(filenames, read.csv, header = TRUE)

# rbind.fill pads missing columns with NA before stacking
input <- do.call(rbind.fill, frames)
write.csv(input, file = "stacked.csv", quote = FALSE, na = "", row.names = FALSE)
The combined frame will have around 300,000 rows, the union of the columns is about 3,000, and there are 104 files. Doing this, my computer blows past its memory limit (32 GB) and R crashes. I've also tried splitting the process into chunks to keep the peak memory down. Again, no luck:
library(plyr)
filenames <- list.files("output/", pattern = "\\.csv$", full.names = TRUE)
input <- lapply(filenames, read.csv, header = TRUE)

# build the result in chunks so no single rbind.fill sees all 104 files
part1 <- do.call(rbind.fill, input[1:30])
part2 <- do.call(rbind.fill, input[31:70])
part3 <- do.call(rbind.fill, input[71:104])
save(part1, part2, part3, file = "p1")

# ...later, in a fresh session, load("p1") and combine the parts
stacked <- rbind.fill(part1, part2, part3)
write.table(stacked, file = "stacked.csv", sep = ",", quote = FALSE, na = "", row.names = FALSE)
The above is just a sample of what I did. Once I load part1, part2, and part3 back into memory they are ~6 GB total. Before saving and loading, my memory usage is around 20 GB. I then try rbind.fill(part1, part2) and the memory usage spikes past 32 GB again.
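One thing I haven't tried yet is data.table. My understanding is that rbindlist(fill = TRUE) does the same NA-filling as rbind.fill with less copying, so something along these lines might squeeze into memory (untested on my data, and fread's type guesses can differ from read.csv's):

library(data.table)
filenames <- list.files("output/", pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(filenames, fread)                            # fread is faster and leaner than read.csv
stacked <- rbindlist(frames, use.names = TRUE, fill = TRUE)   # fills missing columns with NA, like rbind.fill
fwrite(stacked, "stacked.csv", quote = FALSE, na = "")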
Any suggestions on how to approach this? I'm open to other solutions (Python, SQL, etc.).
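The other idea I've been toying with is to never hold more than one file in memory: make a first pass just to collect the union of column names, then read each file, pad it with NA columns, and append it to the output CSV. A rough, untested sketch (it assumes every file has a header row and that any consistent column order is acceptable):

filenames <- list.files("output/", pattern = "\\.csv$", full.names = TRUE)

# first pass: union of column names, reading only the header row of each file
all_cols <- unique(unlist(lapply(filenames, function(f) names(read.csv(f, nrows = 1)))))

# second pass: pad each file to the full column set and append it to the output
first <- TRUE
for (f in filenames) {
  d <- read.csv(f)
  missing_cols <- setdiff(all_cols, names(d))
  if (length(missing_cols)) d[missing_cols] <- NA   # add the columns this file lacks
  d <- d[all_cols]                                  # same column order for every chunk
  write.table(d, "stacked.csv", sep = ",", quote = FALSE, na = "",
              row.names = FALSE, col.names = first, append = !first)
  first <- FALSE
}

If I'm thinking about this right, peak memory should stay near the size of the largest single file, at the cost of an extra pass over the headers.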