I've been trying to download PDFs embedded in a map using the code below (the original can be found here). Each PDF corresponds to a Brazilian municipality (5,570 files).

library(XML)
library(RCurl)
url <- "http://simec.mec.gov.br/sase/sase_mapas.php?uf=RJ&tipoinfo=1"
page   <- getURL(url)
parsed <- htmlParse(page)
links  <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
inds   <- grep("\\.pdf$", links)  # escape the dot; "*.pdf" is a glob, not a regex
links  <- links[inds]
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
for(i in seq_along(links)){
  # mode = "wb" keeps binary files such as PDFs intact on Windows
  download.file(links[i], destfile=destination[i], mode="wb")
  Sys.sleep(runif(1, 1, 5))
}

I have used this code in other projects several times and it worked, but for this specific case it doesn't. I've tried many things to scrape these files without success. Recently I came across the following link, which makes it possible to combine uf (state) and muncod (municipal code) to download a file, but I don't know how to incorporate it into the code.

http://simec.mec.gov.br/sase/sase_mapas.php?uf=MT&muncod=5100102&acao=download

Thanks in advance!

  • do you know what the values are for the possible states (i.e., what are the two-character codes that refer to each state)? you have one that is RJ...what are the others? – Chris Aug 27 '18 at 22:47
  • nevermind, I found them – Chris Aug 27 '18 at 22:48

1 Answer

devtools::install_github("ropensci/RSelenium")

library(rvest)
library(httr)
library(RSelenium)

# connect to selenium server from within r (REPLACE SERVER ADDRESS)
rem_dr <- remoteDriver(
  remoteServerAddr = "192.168.50.25", port = 4445L, browserName = "firefox"
)

rem_dr$open()

# get the two-letter state abbreviations for brazil by scraping the webpage below
# (the table index may shift if the wikipedia page layout changes)
tables <- "https://en.wikipedia.org/wiki/States_of_Brazil" %>%
  read_html() %>%
  html_table(fill = TRUE)
states <- tables[[4]]$Abbreviation

# for each state, navigate to the map of that state using selenium, then
# scrape the list of possible municipality codes from the drop down menu
# present in the map
get_munip_codes <- function(state) {
  url <- paste0("http://simec.mec.gov.br/sase/sase_mapas.php?uf=", state)
  rem_dr$navigate(url)
  # have to wait until the drop down menu loads. 8 seconds will be enough time
  # for each state
  Sys.sleep(8)
  src <- rem_dr$getPageSource()

  # html_attr() returns just the "value" attribute of each option as a
  # character vector
  out <- read_html(src[[1]]) %>%
    html_nodes(xpath = "//select[@id='muncod']/option[boolean(@value)]") %>%
    html_attr("value")

  print(state)
  out
}

state_munip <- sapply(
  states, get_munip_codes, USE.NAMES = TRUE, simplify = FALSE
)

# now you can download each pdf. first create a directory for each state, where
# the pdfs for that state will go:
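lapply(names(state_munip), function(x) {
  # recursive = TRUE also creates "brazil-pdfs" itself if it is missing
  dir.create(file.path("brazil-pdfs", x), recursive = TRUE, showWarnings = FALSE)
})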

# ...then loop over each state/municipality code and download the pdf
lapply(
  names(state_munip), function(state) {
    lapply(state_munip[[state]], function(munip) {
      url <- sprintf(
        "http://simec.mec.gov.br/sase/sase_mapas.php?uf=%s&muncod=%s&acao=download",
        state, munip
      )
      file <- file.path("brazil-pdfs", state, paste0(munip, ".pdf"))
      this_one <- paste0("state ", state, ", munip ", munip)
      tryCatch({
        resp <- GET(url, write_disk(file, overwrite = TRUE))
        # GET() does not raise an error on HTTP 4xx/5xx, so check explicitly
        stop_for_status(resp)
        print(paste0(this_one, " downloaded"))
      },
      error = function(e) {
        print(paste0("couldn't download ", this_one))
        try(unlink(file, force = TRUE))
      }
      )
    })
  }
)
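
If the server answers a request with something other than a PDF (for example an HTML error page), the saved file would still carry a `.pdf` name. A minimal sketch of a check to run after the downloads, assuming a valid PDF begins with the bytes `%PDF`; the helper name `is_pdf` is hypothetical and not part of httr:

```r
# Hypothetical helper: TRUE if the file starts with the PDF magic bytes "%PDF".
is_pdf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  header <- readBin(path, what = "raw", n = 4L)
  identical(header, charToRaw("%PDF"))
}

# List every downloaded file and collect the ones that do not look like PDFs.
files <- list.files("brazil-pdfs", pattern = "\\.pdf$", recursive = TRUE,
                    full.names = TRUE)
bad <- files[!vapply(files, is_pdf, logical(1))]
print(bad)
```

Any path printed here is a candidate for deleting and downloading again.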

STEPS:

  1. Get the IP address of your Windows machine (see https://www.digitalcitizen.life/find-ip-address-windows)

  2. Start the Selenium server Docker container by running: docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1

  3. Start the rocker/tidyverse Docker container by running: docker run -v `pwd`/brazil-pdfs:/home/rstudio/brazil-pdfs -dp 8787:8787 rocker/tidyverse

  4. Open your preferred browser and go to http://localhost:8787. This takes you to the login screen for RStudio Server. Log in with the username "rstudio" and password "rstudio".

  5. Copy/paste the code shown above into a new RStudio .R document. Replace the value of remoteServerAddr with the IP address you found in step 1.

  6. Run the code. This should write the PDFs to a directory "brazil-pdfs" that is both inside the container and mapped to your Windows machine (in other words, the PDFs will show up in the brazil-pdfs dir on your local machine as well). Note that the code takes a while to run because there are a lot of PDFs.
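
Because the full run is long, an interrupted session does not have to start over. A minimal sketch, assuming `state_munip` is the named list built by the code above and assuming that any file already on disk was downloaded completely; the helper name `pending_downloads` is hypothetical:

```r
# Hypothetical helper: one row per municipality whose pdf is not on disk yet.
pending_downloads <- function(state_munip, root = "brazil-pdfs") {
  jobs <- data.frame(
    state = rep(names(state_munip), lengths(state_munip)),
    munip = unlist(state_munip, use.names = FALSE),
    stringsAsFactors = FALSE
  )
  jobs$url <- sprintf(
    "http://simec.mec.gov.br/sase/sase_mapas.php?uf=%s&muncod=%s&acao=download",
    jobs$state, jobs$munip
  )
  jobs$file <- file.path(root, jobs$state, paste0(jobs$munip, ".pdf"))
  # keep only the rows whose target file does not exist yet
  jobs[!file.exists(jobs$file), , drop = FALSE]
}

# Toy input for illustration only; replace with the real state_munip.
example_codes <- list(MT = c("5100102", "5100201"), RJ = "3300100")
todo <- pending_downloads(example_codes, root = tempdir())
nrow(todo)
```

Looping over `todo$url` and `todo$file` instead of the nested `lapply` would then fetch only the missing files.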

Chris
  • Hi Chris, thank you very much for your help!! I just couldn't test the code yet because it seems like RSelenium is having issues (https://github.com/ropensci/RSelenium/issues/172). I'll get back to you as soon as possible. – claudiacerqn Aug 28 '18 at 15:23
  • RSelenium isn't on CRAN...You can install it by using the devtools package though: devtools::install_github("ropensci/RSelenium") – Chris Aug 28 '18 at 15:35
  • I did that, and then I got this message: "Installation failed: Command failed (1)". According to the discussion on the ropensci GitHub, it seems RSelenium has been removed because it depends on `binman` and `wdman`, both of which have check problems. – claudiacerqn Aug 28 '18 at 17:07
  • Users suggested to install the following sequence, but RSelenium didn't work. `library(devtools)` `install_version("binman", version = "0.1.0", repos = "https://cran.uni-muenster.de/")` `install_version("wdman", version = "0.2.2", repos = "https://cran.uni-muenster.de/")` `install_version("RSelenium", version = "1.7.1", repos = "https://cran.uni-muenster.de/")` – claudiacerqn Aug 28 '18 at 17:07
  • what operating system are you on? – Chris Aug 28 '18 at 20:27
  • I am using Windows. – claudiacerqn Aug 28 '18 at 23:10
  • can you give me the output of running this command: devtools::install_github("ropensci/RSelenium") – Chris Aug 30 '18 at 16:48
  • Error: unexpected input in "C:\" // Execution halted // ERROR: loading failed for 'i386', 'x64' // * removing 'C:/Users/Claudia Cerqueira/Documents/R/win-library/3.4/RSelenium' // In R CMD INSTALL // Installation failed: Command failed (1) – claudiacerqn Aug 31 '18 at 16:42
  • Hmm, unfortunately that doesn't help. If you can install Docker (see https://docs.docker.com/docker-for-windows/install/#about-windows-containers), I can show you how to run the code inside a container – Chris Sep 01 '18 at 01:02
  • Well, I'd like to. Docker is installed. Thank you! – claudiacerqn Sep 02 '18 at 21:51
  • it worked perfectly. the entire process took 36h [due to heavy files], but it went very nice! – claudiacerqn Sep 10 '18 at 17:55