I'm trying to build one data frame by reading two datasets, but the approach I'm using is extremely slow: it can take as long as 10 hours to read and process 600 MB of data. I believe there must be a much faster way to do this, but I cannot see what is slowing the process down. Below is a reproducible example of the steps.
Required packages:
library(tidyverse)
The first dataset is a .csv file. A sample can be recreated with the following:
# create the folder structure used throughout the example
dir.create("folder/file", recursive = TRUE, showWarnings = FALSE)

info <- data.frame(identification = c("a", "b", "c", "d", "e"), attr = 0:4)
info %>% write_csv("folder/info.csv")
The second dataset is a zip file. A sample can be recreated with the following:
a <- data.frame(var = c(41:50), val = c(31:40))
a %>% write_csv("folder/file/a_df.csv")
b <- data.frame(var = c(41:50), val = c(31:40))
b %>% write_csv("folder/file/b_df.csv")
c <- data.frame(var = c(41:50), val = c(31:40))
c %>% write_csv("folder/file/c_df.csv")
d <- data.frame(var = c(41:50), val = c(31:40))
d %>% write_csv("folder/file/d_df.csv")
e <- data.frame(var = c(41:50), val = c(31:40))
e %>% write_csv("folder/file/e_df.csv")
files2zip <- dir('folder/file/', full.names = TRUE)
# write the archive inside 'folder/' so it matches the path read below
zip(zipfile = 'folder/testZip', files = files2zip)
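To check what paths the files actually have inside the archive (the stored paths are what the `files` argument of `unzip()` must match), the contents can be listed without extracting anything; `unzip(list = TRUE)` returns a data frame with a `Name` column:

```r
# list the archive contents without extracting; Name holds the stored paths,
# which here should look like "folder/file/a_df.csv" and so on
unzip("folder/testZip.zip", list = TRUE)$Name
```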
The methodology I use is the following:
data1 <- read_csv("folder/info.csv")
read_from_zip <- function(identification) {
  # paths inside the archive carry the 'folder/file/' prefix and '_df' suffix
  fn <- paste0("folder/file/", identification, "_df.csv")
  # extract the single file for this identification, then read it
  zip_file <- "./folder/testZip.zip"
  id_2_zip <- unzip(zip_file, files = fn)
  read_csv(id_2_zip)
}
df <- data1 %>% group_by(identification) %>% nest() %>%
  mutate(trj = map(identification, read_from_zip))
df <- df %>% select(identification, trj) %>% unnest(cols = trj)