I'm trying to loop through different pages of this website https://burnsville.civicweb.net/filepro/documents/25657/ and download all the PDFs to a folder. Because of the way the website is set up, my usual download.file solution won't work. Any other suggestions?
- Does this answer your question? [Problems with Downloading pdf file using R](https://stackoverflow.com/questions/9280243/problems-with-downloading-pdf-file-using-r) – Mohamed Desouky Jun 08 '22 at 19:46
- Unfortunately not! The website I'm trying to gather from doesn't have a .pdf URL for each file, so it doesn't seem I can use download.file in this situation – scotiaboy Jun 08 '22 at 19:51
- In the source of that page there are 6 hrefs that start with `href="/document` – IRTFM Jun 08 '22 at 19:59
- Thanks @IRTFM, you're right! So I guess I could go about it by scraping the hrefs and then using download.file? – scotiaboy Jun 08 '22 at 20:12
- Yes, assuming your goal is to automate this action. The hrefs are partial URLs, so you would also need to extract the "base" URL from the page and concatenate those character values (a sketch of this idea follows the thread). If you just want the files, it will be a lot faster to do it by hand. – IRTFM Jun 08 '22 at 20:25
- It is to automate the process, as I have several other such websites I'll be needing to grab PDFs from. I'll try to do this – thanks very much! – scotiaboy Jun 08 '22 at 20:32
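A minimal base-R sketch of the approach the comments describe: read the page source, pull out the hrefs that begin with /document, prepend the base URL, and download each one. The regex pattern and output filenames here are my own assumptions, not from the thread.

page_src <- readLines("https://burnsville.civicweb.net/filepro/documents/25657/", warn = FALSE)
# Find every href that points at a document (there may be several per source line)
matches <- gregexpr('href="/document[^"]*"', page_src)
hrefs <- unlist(regmatches(page_src, matches))
hrefs <- gsub('^href="|"$', "", hrefs)  # strip the href="..." wrapper
# The hrefs are relative, so prepend the site's base URL
urls <- paste0("https://burnsville.civicweb.net", hrefs)
for (i in seq_along(urls)) {
  download.file(urls[i], destfile = sprintf("document_%02d.pdf", i), mode = "wb")
}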
2 Answers
You probably have found a solution by now, but here is my suggestion with rvest and purrr's map-style loop. This should work across the Burnsville database; just replace the page variable.
library(tidyverse)
library(rvest)

page <-
  "https://burnsville.civicweb.net/filepro/documents/25657/" %>%
  read_html()

df <- tibble(
  # Visible link text, cleaned of carriage returns and stray whitespace
  names = page %>%
    html_nodes(".document-link") %>%
    html_text2() %>%
    str_remove_all("\r") %>%
    str_squish(),
  # The hrefs are relative (they start with "/document..."), so prepend
  # the base URL without a trailing slash to avoid a double "//"
  links = page %>%
    html_nodes(".document-link") %>%
    html_attr("href") %>%
    paste0("https://burnsville.civicweb.net", .)
)
# A tibble: 6 × 2
names links
<chr> <chr>
1 Parks & Natural Resources Commission - 06 Dec 2021 Work Session - M… http…
2 Parks & Natural Resources Commission - 15 Nov 2021 - Minutes - Pdf http…
3 Parks & Natural Resources Commission - 04 Oct 2021 - Minutes - Pdf http…
4 Parks & Natural Resources Commission - 07 Jun 2021 - Minutes - Pdf http…
5 Parks & Natural Resources Commission - 19 Apr 2021 - Minutes - Pdf http…
6 Parks & Natural Resources Commission - 04 Jan 2021 - Minutes - Pdf http…
# walk2() pairs each link with its name; download.file() is called for its
# side effect, with mode = "wb" so the PDFs aren't corrupted on Windows
walk2(df$links, df$names,
      ~ download.file(.x, destfile = paste0(.y, ".pdf"), mode = "wb"))
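If you are pointing this at several such sites, a slightly more defensive loop may help. This variant is my own addition, not part of the answer above: it makes the filenames filesystem-safe (the scraped names contain "&" and spaces), skips failed downloads instead of stopping, and pauses between requests.

# possibly() returns NULL on error instead of aborting the whole loop
safe_download <- possibly(download.file, otherwise = NULL)
walk2(
  df$links,
  str_replace_all(df$names, "[^A-Za-z0-9 _-]", "_"),  # sanitize names for use as filenames
  ~ {
    safe_download(.x, destfile = paste0(.y, ".pdf"), mode = "wb")
    Sys.sleep(1)  # be polite to the server between requests
  }
)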

– Chamkrai
This worked for me:
download.file("https://burnsville.civicweb.net/filepro/documents/36906", "a1.pdf", mode="wb")

– Mohamed Desouky
- So your advice is that he look at the source and manually "scrape" the document numbers? I don't see that that would be any easier than just clicking on the icons that have links. – IRTFM Jun 08 '22 at 20:01