I am completely new to scraping, using a Windows 10 PC. I am trying to run this code from class to scrape the content of the party platforms from the URLs below:

library(RCurl)

years <- c(1968, 1972, 1976)
urlsR <- paste("https://maineanencyclopedia.com/republican-party-platform-",
               years, "/", sep = '')
urlsD <- paste("https://maineanencyclopedia.com/democratic-party-platform-",
               years, "/", sep = '')
urls <- c(urlsR, urlsD)
scraped_platforms <- getURL(urls)

When I print scraped_platforms, the result is the empty strings shown below rather than the content of the party platforms from the website:

https://maineanencyclopedia.com/republican-party-platform-1968/ 
                                                             "" 
https://maineanencyclopedia.com/republican-party-platform-1972/ 
                                                             "" 
https://maineanencyclopedia.com/republican-party-platform-1976/ 
                                                             "" 
https://maineanencyclopedia.com/democratic-party-platform-1968/ 
                                                             "" 
https://maineanencyclopedia.com/democratic-party-platform-1972/ 
                                                             "" 
https://maineanencyclopedia.com/democratic-party-platform-1976/ 
                                                             ""

I've seen that Windows 10 might be incompatible with getURL (re: How to get getURL to work on R on Windows 10? [tlsv1 alert protocol version]). Even after looking online, though, I'm still unclear on how to fix my specific code.
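If it helps, the workaround I keep seeing for that error amounts to forcing libcurl to negotiate TLS 1.2 through RCurl's curl options, roughly like this (6L is libcurl's code for CURL_SSLVERSION_TLSv1_2; I have not been able to confirm that this is the right fix for my case):

library(RCurl)

# Explicitly request TLS 1.2; the assumption is that the server rejects
# the older TLS version RCurl negotiates by default on Windows.
scraped_platforms <- getURL(urls, .opts = curlOptions(sslversion = 6L))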

List of links used here:

https://maineanencyclopedia.com/republican-party-platform-1968/
https://maineanencyclopedia.com/republican-party-platform-1972/ 
https://maineanencyclopedia.com/republican-party-platform-1976/
https://maineanencyclopedia.com/democratic-party-platform-1968/  
https://maineanencyclopedia.com/democratic-party-platform-1972/  
https://maineanencyclopedia.com/democratic-party-platform-1976/

1 Answer


I don't know the getURL() function, but R has one very handy package for scraping: rvest.

You can just use your urls object, which holds all the URLs, and loop over it:

library(rvest)
library(dplyr)

# Start with an empty tibble and append one block of rows per page.
df <- tibble(Title = character(),
             Text = character())
for (url in urls){
  page <- read_html(url)  # fetch and parse each page only once
  t <- page %>% html_nodes(".entry-title") %>% html_text2()
  p <- page %>% html_nodes("p") %>% html_text2()
  tp <- tibble(Title = t,  # the single title is recycled across all paragraphs
               Text = p)
  df <- bind_rows(df, tp)
}

df

This output is a bit disorganized, but you can adjust the for loop to get something a bit nicer.
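For instance, the same loop can be written more compactly with lapply() and bind_rows() (a sketch that should produce the same df as above):

library(rvest)
library(dplyr)

# Build one tibble per page, then stack them into a single data frame.
df <- bind_rows(lapply(urls, function(url) {
  page <- read_html(url)
  tibble(Title = page %>% html_nodes(".entry-title") %>% html_text2(),
         Text  = page %>% html_nodes("p") %>% html_text2())
}))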

Here is also a slightly nicer presentation of the data:

df2 <- df %>% group_by(Title) %>%
  slice(-1) %>%                                         # drop the first <p> on each page
  mutate(Text_all = paste0(Text, collapse = "\n")) %>%  # collapse paragraphs into one string
  dplyr::select(-Text) %>%
  distinct()                                            # keep one row per platform

df2
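To pull out the full text of a single platform, something like this should work (the grepl() pattern is an assumption; adjust it to whatever the actual page titles look like):

df2 %>%
  filter(grepl("1968", Title)) %>%  # assumed title format containing the year
  pull(Text_all) %>%
  cat(sep = "\n\n")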