
I'm attempting to learn web scraping using rvest and am trying to reproduce the example given here:

https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/

Having installed rvest, I simply copy-pasted the code given in the article:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
population <- population[[1]]

The only difference is that I use read_html() rather than html(), since the latter is deprecated.

Rather than the output reported in the article, this code yields the familiar:

Error in population[[1]] : subscript out of bounds

The origin of this is that running the code without the final two lines gives population a value of {xml_nodeset (0)}.
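Counting the matches directly confirms this (a minimal check against the same URL):

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
page <- read_html(url)
# the article's XPath matches nothing on the current page
length(html_nodes(page, xpath = '//*[@id="mw-content-text"]/table[1]'))
# but the page does contain tables
length(html_nodes(page, xpath = '//table'))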

All of the previous questions regarding this error suggest that it is caused by the table being dynamically generated in JavaScript. But that is not the case here (unless Wikipedia has changed its formatting since the R-bloggers article in 2015).

Any insight would be much appreciated since I'm at a loss!


1 Answer


The HTML has changed, so that XPath is no longer valid. You could do the following:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_node(xpath='//table') %>%
  html_table()

Since I have switched to html_node(), which returns only the first match, I no longer need to index with [[1]].
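To illustrate the difference (a minimal sketch against the same page): html_nodes() returns a node set you must index into, while html_node() returns a single node.

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
page <- read_html(url)
html_nodes(page, xpath = '//table')  # node set of every matching table
html_node(page, xpath = '//table')   # just the first match, no [[1]] needed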

The full XPath now includes a div that your original path lacked:

//*[@id="mw-content-text"]/div/table[1]

That is the path you get if you right-click the table in the browser and choose Copy XPath.
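Plugging that corrected path into your original code (a sketch; it works only as long as Wikipedia keeps this structure):

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table()
population <- population[[1]]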

You want to avoid long XPaths, as they are fragile and, as seen here, break easily when the HTML of the page changes.
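If you do want to stay with XPath, a more robust option is to match on a stable attribute rather than document position; a sketch, assuming the table keeps the wikitable class that Wikipedia applies to these tables:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_node(xpath = '//table[contains(@class, "wikitable")]') %>%
  html_table()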

You could also use a CSS selector and grab the table by class, for example:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_node(css='.wikitable') %>%
  html_table()
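Either way, a quick sanity check on the result (assuming one of the blocks above has just run; the exact column names depend on the current revision of the page):

dim(population)   # number of rows and columns parsed
head(population)  # first few rows of the table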
  • Thanks so much! Both of those worked. While I have you, I'm now trying to scrape the first table from this URL: https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal This time, the code `population <- url %>% read_html() %>% html_node(xpath='//*[@id="top-player-stats-summary-grid"]')` gives `population` a value of `{xml_missing}`. Your CSS approach with '.grid' as the class yields the same. Do you have any idea what's going on here? Googling suggests a JavaScript issue, but I'm not sure the table is in dynamic format. – natedjurus Aug 18 '19 at 19:29
  • Indeed, the table appears after the JS runs (you can verify this by disabling JS in the browser and refreshing the page). If you open a new question and drop a link here, I will look into it. – QHarr Aug 18 '19 at 19:31
  • Thanks very much for the offer - I've posted a new question here: https://stackoverflow.com/questions/57547825/xml-nodeset-0-issue-when-webscraping-table I have seen some answers mentioning RSelenium, but that seems to be very slow for me, so I'm not sure if there's a quicker way? Sorry, I really don't know anything about JavaScript/HTML/anything but R, really. – natedjurus Aug 18 '19 at 19:45