
I am very new to scraping using R and xml, and I have a question about saving and loading the dataset.

I scraped a fairly large dataset using code like the following:

library(rvest)

data <- list()
for (i in page[1:10]) {
  pages <- read_html(paste0("http://www.gbig.org/buildings/", i))
  nodes <- html_nodes(
    pages,
    '.badge-info .cert-badge , .event , .date , .media-heading a , .truncated , .location , .buildings-type'
  )
  data[[i]] <- nodes
}

I thought I could save data and load it again for future use with

save(data, file="trials.RData")

When I load it and try to use it again, I get the error message below. What have I done wrong, and what would be the best way to save and load XML nodes?

{xml_nodeset (10)}
Error in node_write_character(x$node, options = options, encoding = encoding) : 
  external pointer is not valid

EDIT

My attempted load command is:

load("trials.RData")

Thank you

wyatt
    Yep. This is _technically_ a duplicate, but see the answer provided for both an alternate approach and one additional approach that uses the serialization alluded to in the dup'd answer. – hrbrmstr Apr 22 '18 at 03:58

1 Answer


The reason it isn't working is that the nodes are "xptr" (external pointer) objects, and those pointers aren't serialized when they get saved to an R data file. The xml2 package repository and various other places in the R docs have cautionary guidance about this, but nobody RTFM anymore. #sigh
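You can see the failure mode for yourself without hitting any website. This is just a sketch (the file name is a placeholder), but it shows that the pointers don't survive a serialize/unserialize round trip:

library(xml2)

# nodesets are thin wrappers around pointers into libxml2's C-level memory
doc   <- read_xml("<root><item>a</item><item>b</item></root>")
nodes <- xml_find_all(doc, "//item")

saveRDS(nodes, "nodes.rds")    # the external pointers themselves are not serialized
nodes2 <- readRDS("nodes.rds")

as.character(nodes2)
## Error in node_write_character(...) : external pointer is not valid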

One way to tackle your problem, and to stop yourself from DoSing the site again in the future, is to extract the data from the nodes rather than trying to save the raw nodes, and to keep a copy of each source page so you can re-scrape that instead of going back to the site and wasting their bandwidth (again).

We'll need some packages:

library(rvest)
library(httr)
library(tidyverse)

You should always start by checking the site's robots.txt and terms of service/terms and conditions. This site has a robots.txt but no ToS/T&C, so we'll see if they allow what you're trying to do:

robotstxt::get_robotstxt(urltools::domain("http://www.gbig.org/buildings/")) %>%
  cat()
## # See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
## #
## # To ban all spiders from the entire site uncomment the next two lines:
## # User-Agent: *
## # Disallow: /
## User-Agent: *
## Crawl-delay: 10
## Disallow: /beta_invites
## Disallow: /admin
## Disallow: /search
## Disallow: /green_schools
## Disallow: /api
## Disallow: /places/8194/activities
## Disallow: /places/935/activities

So, we need to use a 10s crawl delay between page requests, and you'd better hope you didn't violate that technical control by using the /search or /api paths to get your list of pages.
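If you'd rather check paths programmatically, the robotstxt package also has paths_allowed(); the output shown is what I'd expect given the rules above:

robotstxt::paths_allowed(
  paths  = c("/buildings/", "/search", "/api"),
  domain = "www.gbig.org"
)
## [1]  TRUE FALSE FALSE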

Also, we'll need this later since we're going to take an alternate approach to getting the nodes you want:

c(
  ".badge-info .cert-badge", ".event", ".date" , ".media-heading a",
  ".truncated", ".location" , ".buildings-type"
) -> target_nodes

And we'll need a helper to clean up those selector names later, too:

clean_node_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}
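A quick sanity check of what that helper does to the selectors above (these become the data frame column names later on):

clean_node_names(target_nodes)
## [1] "badge_info_cert_badge" "event"                 "date"                 
## [4] "media_heading_a"       "truncated"             "location"             
## [7] "buildings_type"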

For this example --- since you didn't provide any data --- we'll need some URLs so we'll grab the first 12 from this page:

pg <- read_html("http://www.gbig.org/buildings/")

html_nodes(pg, "a.cell") %>%
  html_attr("href") %>%
  sprintf("http://www.gbig.org%s", .) -> building_urls

Now, set up a progress bar, since a 10s delay between pages is going to make this seem slow. I realize you (and many others) may be unlikely to follow the robots.txt rules, but that doesn't mean you shouldn't.

pb <- progress_estimated(length(building_urls))

Finally, iterate over those URLs and:

  • pause
  • read the page
  • build a data frame by extracting the node text from each CSS selector path; they're uneven in length so we make them all list() columns
  • save the character source of the HTML page

NOTE: you may be able to make a nicer data frame with more individual/deliberate node extraction than this smash-and-grab approach.

map_df(building_urls, ~{

  pb$tick()$print()

  Sys.sleep(10)

  x <- read_html(.x)

  map(target_nodes, html_nodes, x=x) %>%
    map(html_text) %>%
    set_names(clean_node_names(target_nodes)) %>%
    map(~list(.x)) %>%
    as_data_frame() -> tmpdf

  # keep the source of *this* building page (x), not the listing page (pg)
  tmpdf$src_html <- as.character(x)

  tmpdf

}) -> xdf

And, after a bit of waiting:

glimpse(xdf)
## Observations: 12
## Variables: 8
## $ badge_info_cert_badge <list> [<"Case Study", "Case Study", "Case Stu...
## $ event                 <list> [<"Whole Building Design Guide Case Stu...
## $ date                  <list> [<"06/20/2014", "08/13/2013", "08/13/20...
## $ media_heading_a       <list> [<"The Mutual Building  Christman Compa...
## $ truncated             <list> ["\nThe Christman Building LEED-EB, The...
## $ location              <list> ["208 N Capitol Ave, Lansing, MI, USA",...
## $ buildings_type        <list> ["\n\nThe Christman Building\n", "\n\nS...
## $ src_html              <chr> "<!DOCTYPE html>\n<html lang=\"en\">\n<h...

Because we store src_html you can process that with read_html() if you do need to get more/different info from each building.
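For example, to re-parse the first building's stored page later without touching the site again (the .location selector here is just for illustration):

first_pg <- read_html(xdf$src_html[[1]])

html_nodes(first_pg, ".location") %>%
  html_text()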

NOTE: There is an alternate method using xml2::xml_serialize():

pb <- progress_estimated(length(building_urls))

map(building_urls, ~{

  pb$tick()$print()

  Sys.sleep(10)

  read_html(.x) %>%
    html_nodes(
      '.badge-info .cert-badge , .event , .date , .media-heading a , .truncated , .location , .buildings-type'
    ) %>%
    xml_serialize(NULL) -> nodes

  nodes

}) -> bldg_lst

Now, it's a list of raw vectors:

str(bldg_lst)
## List of 12
##  $ : raw [1:4273] 58 0a 00 00 ...
##  $ : raw [1:4027] 58 0a 00 00 ...
##  $ : raw [1:3164] 58 0a 00 00 ...
##  $ : raw [1:7718] 58 0a 00 00 ...
##  $ : raw [1:2996] 58 0a 00 00 ...
##  $ : raw [1:2908] 58 0a 00 00 ...
##  $ : raw [1:4506] 58 0a 00 00 ...
##  $ : raw [1:4127] 58 0a 00 00 ...
##  $ : raw [1:2982] 58 0a 00 00 ...
##  $ : raw [1:3034] 58 0a 00 00 ...
##  $ : raw [1:1800] 58 0a 00 00 ...
##  $ : raw [1:1877] 58 0a 00 00 ...

That you can save out, since raw vectors serialize just fine.
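For example (the file name is just a placeholder):

# save the serialized nodesets for later
saveRDS(bldg_lst, "bldg_lst.rds")

# later, possibly in a fresh R session
bldg_lst <- readRDS("bldg_lst.rds")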

When read back in, you would do:

map(bldg_lst, xml_unserialize)
## [[1]]
## {xml_nodeset (65)}
##  [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
##  [2] <p class="location">208 N Capitol Ave, Lansing, MI, USA</p>
## ...
## 
## [[2]]
## {xml_nodeset (62)}
##  [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
##  [2] <p class="location">3825 Wisconsin Ave NW, Washington, DC, USA</p>
## ...
## 
## [[3]]
## {xml_nodeset (54)}
##  [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
##  [2] <p class="location"> San Francisco, CA, USA</p>
## ...
## 
## [[4]]
## {xml_nodeset (127)}
##  [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
##  [2] <p class="location"> Washington, DC, USA</p>
## ...
## 
## [[5]]
## {xml_nodeset (50)}
##  [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
##  [2] <p class="location">4940 N 118th St, Omaha, NE, USA</p>
## ...
## 
## [[6]]
## {xml_nodeset (47)}
##  [1] <h2 class="buildings-page-title buildings-type"><img alt="Building" ...
##  [2] <p class="location"> Dallas, TX, USA</p>
## ...
## 
### (etc)

I still think the first suggested method is better.

hrbrmstr