
I'm trying to build one data frame by reading two datasets, but the approach I'm using is extremely slow: it can take as long as 10 hours to read and process 600 MB of data. I believe there must be a much faster way to do this, but I cannot see what is slowing the process down. Below is a reproducible example of the steps.

Required packages:

library(tidyverse)

The first set is a .csv file. A sample can be recreated with the following:

info <- data.frame(identification = c("a", "b", "c", "d", "e"), attr = c(0:4))
info %>% write_csv("folder/info.csv") 

The second is a zip file. A sample can be recreated with the following:

a <- data.frame(var = c(41:50), val = c(31:40))
a %>% write_csv("folder/file/a_df.csv")  

b <- data.frame(var = c(41:50), val = c(31:40))
b %>% write_csv("folder/file/b_df.csv")

c <- data.frame(var = c(41:50), val = c(31:40))
c %>% write_csv("folder/file/c_df.csv")

d <- data.frame(var = c(41:50), val = c(31:40))
d %>% write_csv("folder/file/d_df.csv")

e <- data.frame(var = c(41:50), val = c(31:40))
e %>% write_csv("folder/file/e_df.csv")

files2zip <- dir('folder/file/', full.names = TRUE)
zip(zipfile = 'folder/testZip.zip', files = files2zip)

The methodology I use is the following:

 data1 <- read_csv("folder/info.csv")

read_from_zip <- function(identification) {
  # path of the file inside the zip (matches how the sample files were written)
  fn <- paste0("folder/file/", identification, "_df.csv")
  # extract that single file from the zip, then read it
  zip_file <- "./folder/testZip.zip"
  id_2_zip <- unzip(zip_file, files = fn)
  read_csv(id_2_zip)
}

df <- data1 %>% group_by(identification) %>% nest() %>%
  mutate(trj = map(identification, read_from_zip)) 

df <- df %>% select(identification, trj) %>% unnest()
adl
  • Is there a particular reason you are not `unzip`ing the files you need in one pass? My guess is that you are effectively unzipping the same (rather large) zipfile multiple times to get to each individual file. I've always found `utils::unzip` to be a small convenience function that provides no huge advantage, since it saves the file to the filesystem anyway. (If it allowed piped read or even write access to the internal file without saving externally, that might be more useful. This can be done somewhat with `unzip -p zipfile.zip some/file.csv`, though less conveniently using `system`; a sketch of this piped approach follows these comments.) – r2evans Mar 28 '18 at 06:25
  • First unzipping and then doing the operation is currently not what I'm looking for, but thank you for your answer – adl Mar 28 '18 at 07:02
  • Unzipping is an expensive operation; I would guess it takes 90% of the time. Hence, if you insist on unzipping in a loop, I doubt you could improve performance much. Regardless, I personally found R's unzipping functionality limited and slow. I'm usually using the 7zip CLI or `unzip` and calling them using `system` (as r2evans suggested). – David Arenburg Mar 28 '18 at 07:28
  • @DavidArenburg I believe you're right, but can you please suggest a full answer using the methodology you described? It would help a lot – adl Mar 28 '18 at 07:38
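
A minimal sketch of the piped `unzip -p` idea mentioned in the comments, assuming a command-line `unzip` is on the PATH and the example layout from the question (`folder/testZip.zip` containing `folder/file/a_df.csv`, and so on); the helper name is made up for illustration:

# hypothetical helper: read one CSV straight out of the zip via a pipe,
# without extracting it to disk (needs the `unzip` CLI on the PATH)
read_one_piped <- function(identification, zip_file = "folder/testZip.zip") {
  entry <- paste0("folder/file/", identification, "_df.csv")
  read.csv(pipe(paste("unzip -p", shQuote(zip_file), shQuote(entry))))
}

a_df <- read_one_piped("a")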

1 Answer


I'd guess something like this would work:

tmpdir <- tempfile()
dir.create(tmpdir)

A convenience vector, if you desire:

filesvec <- paste0("folder/file/", letters[1:5], "_df.csv")

Note that this needs to be "verbatim" as listed in the zipfile, including any leading directories. (You can use junkpaths=TRUE for unzip(), or system('unzip -j ...'), to drop the leading paths.) In the past, I've created this vector of filenames from a quick call to unzip(zipfile, list=TRUE) and grep-ing the output. That way, if you are careful, you will (a) always know before extraction that a file is missing, and (b) not cause an exception within unzip() or a non-zero return code from system('unzip ...'). You might do:

zipfile  <- "folder/testZip.zip"               # path to the archive from the question
filesvec <- unzip(zipfile, list = TRUE)$Name   # list = TRUE returns a data.frame; keep the Name column
filesvec <- filesvec[grepl("\\.csv$", filesvec)]
# some logic to ensure you have some or all of what you need (see the sketch below)
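
For instance, one way to fill in that placeholder check, as a sketch assuming the five expected files from the question's example:

# hypothetical sanity check: confirm every expected entry is present in the archive
expected <- paste0("folder/file/", letters[1:5], "_df.csv")
missing  <- setdiff(expected, filesvec)
if (length(missing) > 0) stop("missing from zip: ", paste(missing, collapse = ", "))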

And then one of the following:

unzip(zipfile, files=filesvec, exdir=tmpdir)
system(paste(c("unzip -d", shQuote(c(tmpdir, zipfile, filesvec))), collapse = " "))

From here, you can access the files with:

alldata <- sapply(file.path(tmpdir, filesvec), read.csv, simplify=FALSE)

where the names of the list are the filenames (including leading path?), and the contents should all be data.frames.
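
Since the goal in the question is a single data frame keyed by identification, one way to finish from here, as a sketch assuming the "a_df.csv" naming pattern from the example and the data1 read earlier:

# name each list element by its identification letter (assumed "<id>_df.csv" pattern),
# stack the pieces, then join the per-id attributes from info.csv
names(alldata) <- sub("_df\\.csv$", "", basename(names(alldata)))
combined <- dplyr::left_join(dplyr::bind_rows(alldata, .id = "identification"),
                             data1, by = "identification")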

When done, whether you clean up the temp files or not is dependent on how OCD you are with temp files. Your OS might clean them up for you after some time. If you are tight on space or just paranoid, you could do a cleanup with:

ign <- sapply(file.path(tmpdir, filesvec), unlink) 
unlink(tmpdir, recursive=TRUE) # remove the temp dir we created

(You could just use the second command, but in case you are using a different temp-directory method, I thought I'd be careful.)

r2evans