
A friend of mine has written over 800 articles for a food blog, and I want to save each of them as a PDF so that I can bind them nicely and gift them to him. There are far too many articles to use Chrome's "Save as PDF" manually, so I am looking for the cleanest possible way to loop through the article URLs and save each page in that format. I have a working solution, but the final PDFs have ugly ads and cookie-warning banners on every single page, which I don't see when I manually print a page to PDF from Chrome. Is there a way to pass settings to Chromium through pagedown so that it prints without these elements? I've pasted my code below, with the website in question.

library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(downloader)

#Specifying the url for desired website to be scraped

url1 <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', '1', '/')

#Reading the HTML code from the website
webpage1 <- read_html(url1)

# Pull the links for all articles on George's initial author page

dat <- html_attr(html_nodes(webpage1, 'a'), "href") %>%
  as_tibble() %>%
  filter(str_detect(value, "([0-9]{4})")) %>%
  unique() %>%
  rename(link=value)

# Pull the links for all articles on George's 2nd-89th author page

for (i in 2:89) {

  url <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', i, '/')

  #Reading the HTML code from the website
  webpage <- read_html(url)

  links <- html_attr(html_nodes(webpage, 'a'), "href") %>%
    as_tibble() %>%
    filter(str_detect(value, "([0-9]{4})")) %>%
    unique() %>%
    rename(link=value)

  dat <- bind_rows(dat, links) %>%
    unique()

}

dat <- dat %>%
  arrange(link)

# form 1-link vector to test with

tocollect <- dat$link[1]

pagedown::chrome_print(input = tocollect,
                       wait = 20,
                       format = "pdf",
                       verbose = 0,
                       timeout = 300)
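
For context, this is roughly the full loop I intend to run over all of the links once the output looks right; the sequential output filenames are just placeholders I picked, not something I have settled on:

# Intended full run (untested): one PDF per article link,
# numbered sequentially so nothing gets overwritten.
for (i in seq_along(dat$link)) {
  pagedown::chrome_print(input = dat$link[i],
                         output = sprintf("article_%03d.pdf", i),
                         wait = 20,
                         format = "pdf",
                         verbose = 0,
                         timeout = 300)
}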

1 Answer


I would rather strip the page of all the elements you do not need (especially the scripts, while keeping the stylesheets), save the result as a temporary HTML file, and then print that. The written HTML file looks nice in the browser; I could not test the printing, though:

library(xml2)   # needed for xml_find_all(), xml_remove() and write_html()

# articleUrls is the vector of article links, e.g. dat$link from the question
for (l in articleUrls) {
  a <- read_html(l)

  # remove everything that should not end up in the printed page
  xml_remove(a %>% xml_find_all("//aside"))
  xml_remove(a %>% xml_find_all("//footer"))
  xml_remove(a %>% xml_find_all("//script"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-related mb20')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'tags')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'ad box')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'newsletter-signup')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer-sidebar')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-footer')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'sticky-newsletter')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-header')]"))

  write_html(a, file = "currentArticle.html")

  pagedown::chrome_print(input = "currentArticle.html")
}
Martin Schmelzer
  • This is very helpful. It would allow me to get nice looking HTML natively. I just realized that you said you did not test the printing, however, and that is giving me an error: Error in force(expr) : Failed to generate output. Reason: Cannot navigate to invalid URL – chrisrogers37 Jul 29 '20 at 03:43
  • Just kidding, I was able to get it converted from HTML to PDF! It almost looks as great as on the browser. However, there is one Facebook Like/Share button and a "Get The Latest!" tag on the bottom footer of every page. How would I inspect the HTML and find elements I might want to remove? Thanks again. – chrisrogers37 Jul 29 '20 at 04:08
  • And finally, is there a way that I can get the title out of the HTML, or the last section of the URL, and make that be the HTML file name? – chrisrogers37 Jul 29 '20 at 04:32
  • Right-click on the element in the browser, click "Inspect element", look for the tag, class or id of the element, and add it to my list in the same fashion. And yes, you could; do you want to keep all 800 HTML files? Just use `gsub` to extract the part after the last `/` and create the filename using `paste0(filename, ".html")` (a sketch of this is shown after these comments). – Martin Schmelzer Jul 29 '20 at 06:30
  • Awesome. When I do that on the buttons, it highlights this: ==$0. Do I just duplicate one of your lines and replace 'site-header', for example, with '_8f1i'? – chrisrogers37 Jul 29 '20 at 16:13
  • The FB stuff gets added by scripts that weren't removed. I fixed the line where we remove all script tags. In your example, replacing site-header with _8f1i is correct. Always move up the tree until you find the parent element and remove that one. If you are satisfied, please accept the answer :) – Martin Schmelzer Jul 29 '20 at 16:30
  • I've successfully removed the FB links, and the "Get The Latest!" tag using your guidance and the Inspect function. Thank you! Last question. When writing from HTML -> PDF it repeats the href portion of hyperlinked words as links. Is there a way to clean this up? I'll try deleting class href in the meantime. – chrisrogers37 Jul 29 '20 at 16:38
  • Actually, it appears I should try and remove all href components while leaving the anchor? – chrisrogers37 Jul 29 '20 at 16:39
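
A minimal sketch of the naming approach described in the comments above, applied to the answer's loop. It assumes each article URL ends in a slug like .../some-article-title/ (that example slug is purely illustrative):

# Name each temporary HTML file (and therefore the resulting PDF) after the
# last path segment of the article URL, as suggested in the comments.
for (l in articleUrls) {
  a <- read_html(l)

  # ... the same xml_remove() calls as in the answer above ...

  slug     <- gsub(".*/([^/]+)/?$", "\\1", l)   # e.g. "some-article-title"
  htmlfile <- paste0(slug, ".html")
  write_html(a, file = htmlfile)

  # chrome_print() defaults its output to the input name with a .pdf extension,
  # so this writes <slug>.pdf next to the HTML file.
  pagedown::chrome_print(input = htmlfile)
}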