I'm trying to loop through different pages of this website https://burnsville.civicweb.net/filepro/documents/25657/ and download all the PDFs to a folder. Because of the way the website is set up, my usual download.file solution won't work. Any other suggestions?
- Does this answer your question? [Problems with Downloading pdf file using R](https://stackoverflow.com/questions/9280243/problems-with-downloading-pdf-file-using-r) – Mohamed Desouky Jun 08 '22 at 19:46
- Unfortunately not! The website I'm trying to gather from doesn't have a .pdf URL for each file, so it doesn't seem I can use download.file in this situation – scotiaboy Jun 08 '22 at 19:51
- In the source of that page there are 6 hrefs that start with `href="/document` – IRTFM Jun 08 '22 at 19:59
- Thanks @IRTFM, you're right! So I guess I could go about it by scraping the hrefs and then using download.file? – scotiaboy Jun 08 '22 at 20:12
- Yes, assuming your goal is to automate this action. The hrefs are partial URLs, so you would also need to extract the "base" URL from the page and concatenate those character values (a sketch of this idea follows the thread). If you just want the files, it will be a lot faster to do it by hand. – IRTFM Jun 08 '22 at 20:25
- It is to automate the process, as I have several other such websites I'll be needing to grab PDFs from. I'll try to do this – thanks very much! – scotiaboy Jun 08 '22 at 20:32
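A minimal base-R sketch of the approach the comments describe: read the page source, pull out the hrefs that begin with /document, prepend the base URL, and download each one. The regex pattern and output filenames here are my own assumptions, not from the thread.

page_src <- readLines("https://burnsville.civicweb.net/filepro/documents/25657/", warn = FALSE)
# Find every href that points at a document (there may be several per source line)
matches <- gregexpr('href="/document[^"]*"', page_src)
hrefs <- unlist(regmatches(page_src, matches))
hrefs <- gsub('^href="|"$', "", hrefs)  # strip the href="..." wrapper
# The hrefs are relative, so prepend the site's base URL
urls <- paste0("https://burnsville.civicweb.net", hrefs)
for (i in seq_along(urls)) {
  download.file(urls[i], destfile = sprintf("document_%02d.pdf", i), mode = "wb")
}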
2 Answers
You probably have found a solution by now, but here is my suggestion with rvest and purrr's map-style loop. This should work across the Burnsville database; just replace the page variable.
library(tidyverse)
library(rvest)

page <-
  "https://burnsville.civicweb.net/filepro/documents/25657/" %>%
  read_html()

df <- tibble(
  # Visible link text, cleaned of carriage returns and stray whitespace
  names = page %>%
    html_nodes(".document-link") %>%
    html_text2() %>%
    str_remove_all("\r") %>%
    str_squish(),
  # The hrefs are relative (they start with "/document..."), so prepend
  # the base URL without a trailing slash to avoid a double "//"
  links = page %>%
    html_nodes(".document-link") %>%
    html_attr("href") %>%
    paste0("https://burnsville.civicweb.net", .)
)
# A tibble: 6 × 2
names links
<chr> <chr>
1 Parks & Natural Resources Commission - 06 Dec 2021 Work Session - M… http…
2 Parks & Natural Resources Commission - 15 Nov 2021 - Minutes - Pdf http…
3 Parks & Natural Resources Commission - 04 Oct 2021 - Minutes - Pdf http…
4 Parks & Natural Resources Commission - 07 Jun 2021 - Minutes - Pdf http…
5 Parks & Natural Resources Commission - 19 Apr 2021 - Minutes - Pdf http…
6 Parks & Natural Resources Commission - 04 Jan 2021 - Minutes - Pdf http…
# walk2() pairs each link with its name; download.file() is called for its
# side effect, with mode = "wb" so the PDFs aren't corrupted on Windows
walk2(df$links, df$names,
      ~ download.file(.x, destfile = paste0(.y, ".pdf"), mode = "wb"))
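If you are pointing this at several such sites, a slightly more defensive loop may help. This variant is my own addition, not part of the answer above: it makes the filenames filesystem-safe (the scraped names contain "&" and spaces), skips failed downloads instead of stopping, and pauses between requests.

# possibly() returns NULL on error instead of aborting the whole loop
safe_download <- possibly(download.file, otherwise = NULL)
walk2(
  df$links,
  str_replace_all(df$names, "[^A-Za-z0-9 _-]", "_"),  # sanitize names for use as filenames
  ~ {
    safe_download(.x, destfile = paste0(.y, ".pdf"), mode = "wb")
    Sys.sleep(1)  # be polite to the server between requests
  }
)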

– Chamkrai
This worked for me:
download.file("https://burnsville.civicweb.net/filepro/documents/36906", "a1.pdf", mode="wb")

– Mohamed Desouky
- So your advice is that he look at the source and manually "scrape" the document numbers? I don't see that that would be any easier than just clicking on the icons that have links. – IRTFM Jun 08 '22 at 20:01