New member here. Trying to download a large number of files from a website in R (but open to suggestions as well, such as wget.)

From this post, I understand I must create a vector with the desired URLs. My initial problem is to write this vector, since I have 27 states and 34 agencies within each state. I must download one file for each agency for all states. Whereas the state codes are always two characters, the agency codes are 2 to 7 characters long. The URLs would look like this:

http://website.gov/xx_yyyyyyy.zip

where xx is the state code and yyyyyyy is the agency code, between 2 and 7 characters long. I am lost as to how to build one such loop.

I assume I can then download this URL list with the following loop:

for (i in seq_along(urls)) {
  download.file(urls[i], destinations[i], mode = "wb")
}

Does that make sense?

(Disclaimer: an earlier version of this post was uploaded earlier but incomplete. My mistake, sorry!)

  • This simple example may help: `paste0(rep(letters[1:4], 4), rep(1:4, each=4))`. Without more information as to the names of the agencies, it will not be possible to say much more. – lmo Dec 16 '16 at 13:45
  • Thanks for your input. Agency names are acronyms: FAA, DEA, NTSB, and such. I've created a vector `agency` with these acronyms, as well as a `states` vector with the 27 states I need. Will try your suggestion and post back. – questionMarc Dec 16 '16 at 13:55
  • Thanks, @lmo! Your input helped me a lot. Now I understand the dynamics of adding string variables. – questionMarc Dec 16 '16 at 14:22

3 Answers

This will download them in batches and take advantage of the speedier simultaneous downloading capabilities of download.file() if the libcurl option is available on your installation of R:

library(purrr)

states <- state.abb[1:27]
agencies <- c("AID", "AMBC", "AMTRAK", "APHIS", "ATF", "BBG", "DOJ", "DOT",
              "BIA", "BLM", "BOP", "CBFO", "CBP", "CCR", "CEQ", "CFTC", "CIA",
              "CIS", "CMS", "CNS", "CO", "CPSC", "CRIM", "CRT", "CSB", "CSOSA",
              "DA", "DEA", "DHS", "DIA", "DNFSB", "DOC", "DOD", "DOE", "DOI")

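# For each state, build the URLs for all agencies and hand the whole batch
# to download.file(), which fetches multiple URLs simultaneously via libcurl.
walk(states, function(x) {
  urls <- sprintf("http://website.gov/%s_%s.zip", x, agencies)
  download.file(urls, basename(urls), method = "libcurl")
})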
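
To see what each pass of the loop above would request, here is a minimal sketch that builds one state's batch of URLs without downloading anything. It assumes the placeholder host `website.gov` from the question and a shortened, illustrative `agencies` vector. `sprintf()` is vectorised, so a single state code is recycled across all agency codes.

```r
# Illustrative subset of agency codes (assumption: the real list is longer).
agencies <- c("AID", "AMBC", "AMTRAK")

# sprintf() recycles the length-one state code across every agency code.
urls <- sprintf("http://website.gov/%s_%s.zip", "AL", agencies)

# basename() strips everything up to the last "/", giving the local file
# names that download.file() would write to in the working directory.
destfiles <- basename(urls)

print(urls)
print(destfiles)
```

Inspecting `urls` this way before running the real loop is a cheap check that the state and agency codes end up in the right order.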
hrbrmstr

This should do the job:

agency <- c("FAA", "DEA", "NTSB")
states <- c("AL", "AK", "AZ", "AR")

URLs <-
paste0("http://website.gov/",
       rep(agency, length(agency)),
       "_",
       rep(states, length(states)),
       ".zip")

Then loop through the URLs vector to pull the zip files. It will be faster if you use an apply function.

epo3
  • This works, but only because the lengths of agency and states are relatively prime. To make it more general, one of the rep commands should specify `each`, and the other should specify `times`. Will `apply` really make it faster in this case? – Miff Dec 16 '16 at 14:22
  • Thanks, @epo3. It worked perfectly. Now I suppose I can use `download.file()` and just cycle through all urls. Thanks again! – questionMarc Dec 16 '16 at 14:25
  • @Miff: I suppose I need to specify each state one `time` and `each` agency 27 times? – questionMarc Dec 16 '16 at 14:33
  • @questionMarc `paste0("http://website.gov/", rep(agency, each=length(states)), "_", rep(states, times=length(agency)), ".zip")` – Miff Dec 16 '16 at 15:14
  • Excellent, @Miff. Makes perfect sense, thank you so much. – questionMarc Dec 16 '16 at 18:06
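
To make the `each`/`times` idea from the comments concrete, here is a small sketch using the toy vectors from this answer. The key point is that each `rep()` call repeats by the length of the other vector, so both results have `length(states) * length(agency)` elements and every pair occurs exactly once. The state code is placed first here to match the `xx_yyyyyyy.zip` pattern in the question; `website.gov` is the question's placeholder host.

```r
agency <- c("FAA", "DEA", "NTSB")
states <- c("AL", "AK", "AZ", "AR")

urls <- paste0("http://website.gov/",
               rep(states, each = length(agency)),   # AL AL AL AK AK AK ...
               "_",
               rep(agency, times = length(states)),  # FAA DEA NTSB FAA DEA ...
               ".zip")

# Sanity checks: one URL per state/agency pair, and no duplicates.
stopifnot(length(urls) == length(states) * length(agency))
stopifnot(anyDuplicated(urls) == 0)
```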

If every state has the same set of agency codes, you can use the code below to create the vector of URLs to loop through. (You will also need a vector of destinations of the same length.)

# Getting all combinations
States <- c("AA", "BB")
Agency <- c("ABCDEFG", "HIJKLMN")
AllCombinations <- expand.grid(States, Agency)
AllCombinationsVec <- paste0("http://website.gov/", AllCombinations$Var1, "_", AllCombinations$Var2, ".zip")
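
The `destinations` vector mentioned above is not defined in this answer. One possible way to build it, as a sketch: take the file name from each URL with `basename()` and place it in a local folder. The folder name `downloads` (created here under `tempdir()`) is purely an assumption for illustration.

```r
# Rebuild the URL vector as above (placeholder host and codes).
States <- c("AA", "BB")
Agency <- c("ABCDEFG", "HIJKLMN")
AllCombinations <- expand.grid(States, Agency)
AllCombinationsVec <- paste0("http://website.gov/", AllCombinations$Var1,
                             "_", AllCombinations$Var2, ".zip")

# Hypothetical target folder; swap in any directory you can write to.
dest_dir <- file.path(tempdir(), "downloads")
dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

# One destination path per URL, in the same order as the URLs.
destinations <- file.path(dest_dir, basename(AllCombinationsVec))
```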

You can then loop over the files, something like this:

# loop method

for (i in seq_along(AllCombinationsVec)) {
  download.file(AllCombinationsVec[i], destinations[i], mode = "wb")
}

Another way of looping through the items is the apply family of functions, which apply a function to every item in a list or vector.

# mapply method

mapply(function(x, y) download.file(x, y, mode = "wb"), x = AllCombinationsVec, y = destinations)
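
If some state/agency files do not exist on the server, `download.file()` raises an error and the loop stops there. As a hedged sketch of one way around that, the wrapper below catches the error and moves on. `safe_download` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` only simulates a missing file so the example runs without a network connection.

```r
# Wrap one download so that a failure is reported and skipped, not fatal.
# `fetch` defaults to download.file(); it is a parameter only so the
# wrapper can be tried out with a stand-in function and no network.
safe_download <- function(url, dest, fetch = download.file) {
  tryCatch({
    fetch(url, dest, mode = "wb")
    TRUE                                   # download succeeded
  }, error = function(e) {
    message("Skipping ", url, ": ", conditionMessage(e))
    FALSE                                  # download failed
  })
}

# Stand-in downloader (assumption, for demonstration only): pretends that
# any URL containing "BB" does not exist on the server.
fake_fetch <- function(url, dest, mode) {
  if (grepl("BB", url)) stop("file not found")
  0L
}

urls <- c("http://website.gov/AA_ABCDEFG.zip",
          "http://website.gov/BB_ABCDEFG.zip")
destinations <- basename(urls)

# One TRUE/FALSE per URL, in the same order as `urls`.
ok <- mapply(safe_download, urls, destinations,
             MoreArgs = list(fetch = fake_fetch))
```

With the real `download.file()`, drop the `MoreArgs` argument; `urls[!ok]` then lists the files that could not be fetched.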
DataJack