0

I have a folder with multiple *.rar and *.zip files. Each *.rar and *.zip file contains one folder, and that folder contains multiple folders.

I would like to generate a dataset with the names of these multiple folders.

How can I do this using R?

I tried:

temp <- list.files(pattern = "\\.zip$")
lapply(temp, function(x) unzip(x, list = TRUE))

But it returns:

[screenshot of the output; image not available]

I would like to get just the names: "Nova pasta1" and "Nova pasta2"

Thanks

Wilson Souza
  • Why not just unzip and use `list.dirs()`? – dcsuka Aug 03 '22 at 19:09
  • I have a lot of zip files with huge folders and files inside them. I don't think unzipping everything is a good approach. I just need to get the names of the folders inside them. – Wilson Souza Aug 03 '22 at 19:14
  • Here are some helpful links: https://stackoverflow.com/questions/22099468/getting-zip-rar-structure-without-full-downloading https://osxdaily.com/2013/06/17/view-zip-archive-contents-without-extracting-mac-os-x/ . If you want to run this in R, you can use system(command, intern = TRUE) and work with the output as text. – dcsuka Aug 03 '22 at 19:18
  • `unzip("my_zipped.zip", list = TRUE)` returns a data frame (metadata) without uncompressing anything. – Chris Aug 03 '22 at 19:26
  • Your edit to the question just changed the substance of the question pretty substantially. See my answer below, which both addresses the original question and your new one. – socialscientist Aug 03 '22 at 19:47

2 Answers

1

Let's create a simple set of directories/files that are representative of your own. You described having a single .zip file that contains multiple zipped directories, which may contain unzipped files and/or sub-directories.

# Example main directory
dir.create("main_dir")

# Example directory with 1 file and a subdirectory with 1 file
dir.create("main_dir/example_dir1")
write.csv(data.frame(x = 5), file = "main_dir/example_dir1/example_file.csv")
dir.create("main_dir/example_dir1/example_subdir")
write.csv(data.frame(x = 5), file = "main_dir/example_dir1/example_subdir/example_subdirfile.csv")

# Example directory with 1 file
dir.create("main_dir/example_dir2")
write.csv(data.frame(x = "foo"), file = "main_dir/example_dir2/example_file2.csv")

# NOTE: I was having issues with using `zip()` to zip each directory
# then the main (top) directory, so I manually zipped them below.

# Manually zip example_dir1 and example_dir2, then zip main_dir at this point.

Given this structure, we can get the paths to all of the directories within the highest level directory (main_dir) using unzip(list = TRUE) since we know the name of the single zipped directory containing all of these additional zipped sub-directories.

# Unzip the highest level directory available, get all of the .zip dirs within
ex_path <- "main_dir"
all_zips <- unzip(zipfile = paste0(ex_path, ".zip"), list = TRUE)
all_zips

# We can remove the main_path string if we want, so that we only see
# the zip files within our main directory instead of the full path.
library(dplyr)

all_zips %>%
  filter(Name != paste0(ex_path, "/")) %>%
  mutate(Name = sub(paste0(ex_path, "/"), "", Name))
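If only the names of the folders directly under the top-level folder are needed, the listing can be reduced with a regular expression in base R. The following is a minimal sketch, not part of the original answer: `second_level_dirs` is a hypothetical helper, and `example_names` is a made-up vector that mimics what the `Name` column of `unzip(list = TRUE)` might contain for the example structure above.

```r
# Hypothetical helper: given entry names such as those in the `Name` column of
# unzip(list = TRUE), return the folders directly under the top-level folder.
second_level_dirs <- function(entry_names) {
  # Capture the second path component, but only when a "/" follows it,
  # i.e. when it is a folder rather than a file sitting in the top folder.
  matches <- regmatches(entry_names, regexec("^[^/]+/([^/]+)/", entry_names))
  dirs <- vapply(
    matches,
    function(m) if (length(m) == 2) m[2] else NA_character_,
    character(1)
  )
  unique(dirs[!is.na(dirs)])
}

# Made-up listing that mimics the example archive (an assumption, not real output)
example_names <- c(
  "main_dir/",
  "main_dir/example_dir1/",
  "main_dir/example_dir1/example_file.csv",
  "main_dir/example_dir1/example_subdir/example_subdirfile.csv",
  "main_dir/example_dir2/example_file2.csv",
  "main_dir/notes.txt"
)

second_level_dirs(example_names)
```

Because the pattern requires a slash after the second component, files stored directly in the top folder (like the hypothetical `notes.txt`) are skipped, and folders are still found when the archive has no explicit directory entries.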

If you had multiple zipped directories with nested directories similar to main_dir, you could just put their paths in a list and apply the function to each element of the list. Below I reproduce this.

# Example of multiple zip directory paths in a list
ziplist  <- list(ex_path, ex_path, ex_path)

lapply(ziplist, function(x) {
  temp <- unzip(zipfile = paste0(x, ".zip"), list = TRUE)
  # Use x (not ex_path) so that each archive's own top-level prefix is removed
  temp %>%
    mutate(main_path = x) %>%
    filter(Name != paste0(x, "/")) %>%
    mutate(Name = sub(paste0(x, "/"), "", Name, fixed = TRUE))
})

If all of the .zip files in the current working directory are files you want to do this for, you can get ziplist above via:

list.files(pattern = "\\.zip$") %>% as.list()
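To end up with a single dataset of folder names across many archives, the per-archive listings can be stacked into one data frame. This is a rough sketch under some assumptions: `folders_in_zips` is a hypothetical helper, every archive is a .zip readable by `unzip()`, and the folders of interest sit exactly one level below each archive's top-level folder. It only lists the archives and does not extract anything.

```r
# Hypothetical helper: one row per (archive, folder) pair, where "folder" is a
# directory directly under the archive's top-level folder.
folders_in_zips <- function(zip_paths) {
  rows <- lapply(zip_paths, function(path) {
    # list = TRUE returns the archive's table of contents without extracting
    entry_names <- unzip(path, list = TRUE)$Name
    # Keep entries that have a second path component followed by "/"
    hits <- grepl("^[^/]+/[^/]+/", entry_names)
    folders <- unique(sub("^[^/]+/([^/]+)/.*$", "\\1", entry_names[hits]))
    data.frame(
      archive = rep(basename(path), length(folders)),
      folder = folders,
      stringsAsFactors = FALSE
    )
  })
  # Start from an empty frame so the result is well-formed even with no archives
  empty <- data.frame(
    archive = character(0),
    folder = character(0),
    stringsAsFactors = FALSE
  )
  do.call(rbind, c(list(empty), rows))
}

# Usage: every .zip file in the current working directory
# folders_in_zips(list.files(pattern = "\\.zip$"))
```

The helper does not handle .rar files, since `unzip()` only reads zip archives.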
socialscientist
  • Thanks a lot. It is almost it. But I would like to get just the names, in your amazing example, of directories into the main_dir (example_dir1 and example_dir2). – Wilson Souza Aug 03 '22 at 19:58
  • Where I wrote "can remove the main path if we want" it will do that. just throw that into the lapply. I'll update to show – socialscientist Aug 03 '22 at 19:59
  • @WilsonSouza The above will produce an object with `example_dir1` and `example_dir2`. You can extract that specific vector of values with `all_zips$Name`. If you want to drop the ".zip" from the directory name, just use the same string substitution approach I employed above: `all_zips$Name %>% sub(".zip", "", ., fixed = TRUE)` – socialscientist Aug 03 '22 at 20:06
0

I appreciate all the help, but I think I found a shorter way to solve my problem.

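library(stringr)

# Anchored patterns so that only real .zip/.rar extensions match
temp.zip <- list.files(pattern = "\\.zip$")
temp.rar <- list.files(pattern = "\\.rar$")

# untar(list = TRUE) only lists the archive contents; whether it can read
# .zip and .rar files depends on the tar program that untar() uses on your system
mydata <- lapply(c(temp.rar, temp.zip), function(x) {
  entries <- unlist(untar(tarfile = x, list = TRUE))
  # Keep the text between the first and the last "/" of each entry
  unique(c(na.omit(str_extract(entries, "(?<=/).*(?=/)"))))
})

unlist(mydata)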
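One thing to watch with the pattern `(?<=/).*(?=/)`: the `.*` is greedy, so for entries nested more than one level deep it returns the whole inner path rather than a single folder name. The sketch below illustrates this in base R, with `regexpr(perl = TRUE)` standing in for `str_extract`. The `paths` vector is made up for illustration, and `first_match` and the stricter pattern `(?<=/)[^/]+(?=/)` are suggestions, not part of the code above.

```r
# Made-up archive entries for illustration (an assumption, not real output)
paths <- c(
  "main_dir/",
  "main_dir/example_dir1/example_file.csv",
  "main_dir/example_dir1/example_subdir/example_subdirfile.csv"
)

# Hypothetical helper: first regex match per element, NA where nothing matches
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Greedy: everything between the first and the last "/"
greedy <- first_match(paths, "(?<=/).*(?=/)")
# NA, "example_dir1", "example_dir1/example_subdir"

# Stricter: only the first path component that sits between two slashes
strict <- first_match(paths, "(?<=/)[^/]+(?=/)")
# NA, "example_dir1", "example_dir1"

unique(strict[!is.na(strict)])
```

If the archives never contain folders nested deeper than one level, both patterns give the same result.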

Thanks all

Wilson Souza