The reason it isn't working is that xml2 nodes are "xptr" ("external pointer") objects, and external pointers aren't serialized when they get saved to an R data file. The xml2 package repository and various other places in the R docs have cautionary guidance about this, but nobody RTFMs anymore. #sigh
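Here's a minimal sketch of the failure mode (the exact error text may vary by xml2 version):

doc <- xml2::read_html("<html><body><p>hi</p></body></html>")
tmp <- tempfile(fileext = ".rds")
saveRDS(doc, tmp)     # saves "fine", but the external pointer isn't serialized
doc2 <- readRDS(tmp)  # the restored object holds a stale pointer
# xml2::xml_find_all(doc2, "//p")
# ^^ errors along the lines of "external pointer is not valid"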
One way to tackle your problem (and stop yourself from DoSing the site again in the future) is to extract the data from the nodes rather than trying to save the raw nodes, and to keep a copy of the source page so you can re-scrape that instead of going back to the site and wasting their bandwidth (again).
We'll need some packages:
library(rvest)
library(httr)
library(tidyverse)
You should always start by checking out the site's robots.txt and terms of service/terms and conditions. This site has a robots.txt but no ToS/T&C, so let's see if they allow what you're trying to do:
robotstxt::get_robotstxt(urltools::domain("http://www.gbig.org/buildings/")) %>%
  cat()
## # See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
## #
## # To ban all spiders from the entire site uncomment the next two lines:
## # User-Agent: *
## # Disallow: /
## User-Agent: *
## Crawl-delay: 10
## Disallow: /beta_invites
## Disallow: /admin
## Disallow: /search
## Disallow: /green_schools
## Disallow: /api
## Disallow: /places/8194/activities
## Disallow: /places/935/activities
So, we need to use a 10s crawl delay between page requests, and you'd better hope you didn't violate that technical control by using the /search or /api paths to get your list of pages.
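If you want to double-check specific paths against those rules programmatically, the robotstxt package can do that (given the Disallow rules above, both of these should come back FALSE):

robotstxt::paths_allowed(
  paths  = c("/search", "/api"),
  domain = "www.gbig.org"
)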
Also, we'll need this later since we're going to take an alternate approach to getting the nodes you want:
c(
  ".badge-info .cert-badge", ".event", ".date", ".media-heading a",
  ".truncated", ".location", ".buildings-type"
) -> target_nodes
And we'll need a small helper to clean ^^ up into usable column names later, too:
clean_node_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)  # punctuation/whitespace -> underscores
  x <- gsub("_+", "_", x)                     # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                 # trim leading/trailing underscores
  x <- make.unique(x, sep = "_")              # de-dupe, just in case
  x
}
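As a quick sanity check, running it over the selectors gives the column names you'll see in the glimpse() output further down:

clean_node_names(target_nodes)
# expected: "badge_info_cert_badge", "event", "date", "media_heading_a",
#           "truncated", "location", "buildings_type"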
For this example (since you didn't provide any data) we'll need some URLs, so we'll grab the first 12 from this page:
pg <- read_html("http://www.gbig.org/buildings/")
html_nodes(pg, "a.cell") %>%
  html_attr("href") %>%
  sprintf("http://www.gbig.org%s", .) -> building_urls
Now, set up a progress bar, since a 10s delay between pages is going to make this seem slow. I realize you (and many others) may be unlikely to follow the robots.txt rules, but that doesn't mean you shouldn't.
pb <- progress_estimated(length(building_urls))
Finally, iterate over those URLs and:
- pause
- read the page
- build a data frame by extracting the node text for each CSS selector path; the results are uneven in length, so we make them all list() columns
- save the character source of the HTML page
NOTE: you may be able to make a nicer data frame with more individual/deliberate node extraction than this smash-and-grab approach.
map_df(building_urls, ~{

  pb$tick()$print()

  Sys.sleep(10)  # obey the crawl-delay directive

  x <- read_html(.x)

  # extract the text for each CSS selector, name the columns cleanly,
  # and wrap each (uneven-length) result in a list() column
  map(target_nodes, html_nodes, x = x) %>%
    map(html_text) %>%
    set_names(clean_node_names(target_nodes)) %>%
    map(~list(.x)) %>%
    as_data_frame() -> tmpdf

  # keep the source of *this* building page (not the index page `pg`)
  tmpdf$src_html <- as.character(x)

  tmpdf

}) -> xdf
And, after a bit of waiting:
glimpse(xdf)
## Observations: 12
## Variables: 8
## $ badge_info_cert_badge <list> [<"Case Study", "Case Study", "Case Stu...
## $ event <list> [<"Whole Building Design Guide Case Stu...
## $ date <list> [<"06/20/2014", "08/13/2013", "08/13/20...
## $ media_heading_a <list> [<"The Mutual Building Christman Compa...
## $ truncated <list> ["\nThe Christman Building LEED-EB, The...
## $ location <list> ["208 N Capitol Ave, Lansing, MI, USA",...
## $ buildings_type <list> ["\n\nThe Christman Building\n", "\n\nS...
## $ src_html <chr> "<!DOCTYPE html>\n<html lang=\"en\">\n<h...
Because we store src_html, you can re-process it with read_html() if you do need to get more/different info from each building.
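For example, if you later want something that wasn't in target_nodes, you can pull it straight from the stored page source (the selector here is just an illustration, taken from the node output shown further down):

read_html(xdf$src_html[[1]]) %>%
  html_node("h2.buildings-page-title") %>%
  html_text(trim = TRUE)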
NOTE: There is an alternate method that uses xml2::xml_serialize():
pb <- progress_estimated(length(building_urls))
map(building_urls, ~{

  pb$tick()$print()

  Sys.sleep(10)

  read_html(.x) %>%
    html_nodes(
      '.badge-info .cert-badge , .event , .date , .media-heading a , .truncated , .location , .buildings-type'
    ) %>%
    xml_serialize(connection = NULL)  # a NULL connection returns a raw vector

}) -> bldg_lst
Now, it's a list of raw vectors:
str(bldg_lst)
## List of 12
## $ : raw [1:4273] 58 0a 00 00 ...
## $ : raw [1:4027] 58 0a 00 00 ...
## $ : raw [1:3164] 58 0a 00 00 ...
## $ : raw [1:7718] 58 0a 00 00 ...
## $ : raw [1:2996] 58 0a 00 00 ...
## $ : raw [1:2908] 58 0a 00 00 ...
## $ : raw [1:4506] 58 0a 00 00 ...
## $ : raw [1:4127] 58 0a 00 00 ...
## $ : raw [1:2982] 58 0a 00 00 ...
## $ : raw [1:3034] 58 0a 00 00 ...
## $ : raw [1:1800] 58 0a 00 00 ...
## $ : raw [1:1877] 58 0a 00 00 ...
You can save that list out as-is.
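For example, with saveRDS()/readRDS() (the file name is just a placeholder):

saveRDS(bldg_lst, "gbig_building_nodes.rds")
bldg_lst <- readRDS("gbig_building_nodes.rds")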
When read back in, you would do:
map(bldg_lst, xml_unserialize)
## [[1]]
## {xml_nodeset (65)}
## [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
## [2] <p class="location">208 N Capitol Ave, Lansing, MI, USA</p>
## ...
##
## [[2]]
## {xml_nodeset (62)}
## [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
## [2] <p class="location">3825 Wisconsin Ave NW, Washington, DC, USA</p>
## ...
##
## [[3]]
## {xml_nodeset (54)}
## [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
## [2] <p class="location"> San Francisco, CA, USA</p>
## ...
##
## [[4]]
## {xml_nodeset (127)}
## [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
## [2] <p class="location"> Washington, DC, USA</p>
## ...
##
## [[5]]
## {xml_nodeset (50)}
## [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
## [2] <p class="location">4940 N 118th St, Omaha, NE, USA</p>
## ...
##
## [[6]]
## {xml_nodeset (47)}
## [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
## [2] <p class="location"> Dallas, TX, USA</p>
## ...
##
### (etc)
I still think the first suggested method is better.