
I have 16 GB of RAM, running Windows 10 64-bit with a 64-bit version of R. I'm trying to merge a bunch of CSVs from this link (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml), specifically the yellow-cab files. Edit: only for one year at the moment, but I would want to import more data once this works.

Here's the code I'm running:

library(readr)

# All CSV files in the working directory
FList <- list.files(pattern = "*.csv")
for (i in 1:length(FList)) {
  print(i)
  # Read each file into its own data frame, named after the file
  assign(FList[i], read_csv(FList[i]))
  if (i == 2) {
    # Start the combined data frame from the first two files,
    # then drop the originals to free memory
    DF <- rbind(get(FList[1]), get(FList[2]))
    rm(list = c(FList[1], FList[2]))
  }
  if (i > 2) {
    # Append each subsequent file and drop it afterwards
    DF <- rbind(DF, get(FList[i]))
    rm(list = FList[i])
  }
  gc()
}

I get the error ("cannot allocate vector of size ...") on the 6th iteration. Task Manager shows memory usage in the 90% range during the rbind operation, but it drops to around 60% once it's done.

Running `gc()` after the error gives the following:

> gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells    3821676   204.1   10314672   550.9   13394998   715.4
Vcells 1363034028 10399.2 3007585511 22946.1 2058636792 15706.2

I do not have a lot of experience with this, so any help optimizing the code would be appreciated. P.S. Would running it with read.csv help? I'm assuming the datetime format in a few of the columns might be resource-hungry. I haven't tried it yet because I need those columns in datetime format.
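One way I could test that assumption, if I understand readr correctly, would be to force every column to character so that no datetime parsing happens during the read, e.g. for a single file:

library(readr)

# Read everything as plain character: no datetime parsing occurs,
# which isolates whether the parsing itself is the memory hog.
# (Sketch only; the columns would still need converting later.)
raw <- read_csv(FList[1], col_types = cols(.default = col_character()))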

Zain Gill
  • Do not call `gc` manually in a loop. You won't achieve anything other than bypassing optimizations and potentially slowing down execution severely. Your problem is that you are growing an object in a loop. Don't do that. It's not only slow but also fragments your memory. – Roland Apr 09 '18 at 12:55

1 Answer


You can try `lapply` instead of a loop; collecting the pieces in a list and binding them once avoids repeatedly growing a data frame:

# glob2rx converts the glob "*.csv" into the proper regex for `pattern`
files <- list.files(pattern = glob2rx("*.csv"))

# Read every file into a list, then bind the list together in one call
df <- lapply(files, function(x) read.csv(x))
df <- do.call(rbind, df)

Another way is to append the files on the command line instead of in R, which should be less memory-intensive. Search for "append CSV" together with the appropriate command-line tool for your OS; a rough example for Windows is sketched below.
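For example, on Windows (which the question uses), something along these lines would concatenate the files with cmd.exe; note the caveats in the comments, this is a sketch rather than a polished solution:

:: Concatenate every CSV in the folder into one file one level up,
:: so the output is not picked up by the wildcard itself.
:: Caveat: each file's header row is included, not just the first one.
copy *.csv ..\combined.csv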

Roland
  • The problem is that if I try to read all the files and save them to variables, it gives the same error: "cannot allocate vector of size ...". I tried to resolve that by deleting data frames once they were merged. – Zain Gill Apr 09 '18 at 12:56
  • I have changed your code to ensure that the correct regex is passed to `pattern`. – Roland Apr 09 '18 at 12:58
  • 2
    @ZainGill Try if [this equivalent data.table solution](https://stackoverflow.com/a/32841176/1412059) doesn't exhaust the RAM. If it does, you simply don't have sufficient memory to import the data. – Roland Apr 09 '18 at 13:01
  • @Roland, I've never had problems with not using glob2rx. Why would it be strictly necessary here? – Roberto Moratore Apr 09 '18 at 13:02
  • I agree with @Roland about you possibly not having enough memory. Each CSV is around 1.8 GB, so they add up fast. You could perhaps sample the data before appending it all together. – Roberto Moratore Apr 09 '18 at 13:07
  • 1
    Do you know what this regex does? Try `grepl("*.csv", "acsverified.dat")`. – Roland Apr 09 '18 at 13:11
  • @Roland that regex shows me why you added glob2rx to my post :) Long story short, I've never had an issue because I've been lucky. I have to say, that was a great example :) – Roberto Moratore Apr 09 '18 at 13:18
  • 1
    `rbindlist` from `data.table` is best way to go. See this post for more information http://winvector.github.io/Accumulation/ – Tung Apr 09 '18 at 14:28
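A minimal sketch of the data.table route suggested in the comments, assuming the package is installed: `fread` is a fast, memory-frugal reader, and `rbindlist` combines the whole list in a single allocation instead of growing an object block by block.

library(data.table)

# List the CSVs and read each with fread ...
files <- list.files(pattern = glob2rx("*.csv"))
dt_list <- lapply(files, fread)

# ... then combine them all in one allocation.
DT <- rbindlist(dt_list)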