
I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info.

Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored.

The saving grace should be that the relevant table is always in a section with the word "Standings".

Is there some way I can grep a section name and only extract the table node(s) there?

Here are some sample pages to demonstrate the structure:

  • 1922 season - Only one division, one table; table is found under heading "Standings" and has xpath //*[@id="mw-content-text"]/table[2] and CSS selector #mw-content-text > table.wikitable.

  • 1950 season - Two divisions, two tables; both found under heading "Final standings". First has xpath //*[@id="mw-content-text"]/div[2]/table / CSS #mw-content-text > div:nth-child(20) > table, second has xpath //*[@id="mw-content-text"]/div[3]/table and selector #mw-content-text > div:nth-child(21) > table.

  • 2000 season - Two conferences, 6 divisions, two tables; both found under heading "Final regular season standings". First has xpath //*[@id="mw-content-text"]/div[2]/table and selector #mw-content-text > div:nth-child(16) > table, second has xpath //*[@id="mw-content-text"]/div[3]/table and selector #mw-content-text > div:nth-child(17) > table

In summary:

# season |                                   xpath |                                          css
-------------------------------------------------------------------------------------------------
#   1922 |     //*[@id="mw-content-text"]/table[2] |           #mw-content-text > table.wikitable
#   1950 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(20) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(21) > table
#   2000 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(16) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(17) > table

Scraping, e.g., 1922 would be easy:

output <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>% whatever_else(...)

But I didn't see any pattern in the xpaths or the CSS selectors that I could use to generalize this, so I don't end up writing 80 individual scraping exercises.

Is there any robust way to try and scrape all these tables, especially given the crucial insight that all the tables are located below a heading which would return TRUE from grepl("standing", tolower(section_title))?
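(For what it's worth, the section titles are reachable: Wikipedia wraps each heading title in a `span.mw-headline` whose `id` mirrors the heading text, so the `grepl` check above can be run against those ids. A small sketch on toy HTML standing in for a real season page, assuming `rvest` is installed:)

```r
library(rvest)

# Toy HTML mimicking Wikipedia's heading markup (not a real season page)
page <- read_html('<h2><span class="mw-headline" id="Standings">Standings</span></h2>
                   <h2><span class="mw-headline" id="Schedule">Schedule</span></h2>')

# pull every section id, then grep for the standings one
section_titles <- html_attr(html_nodes(page, "span.mw-headline"), "id")
section_titles[grepl("standing", tolower(section_titles))]  # "Standings"
```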

MichaelChirico
  • what about scraping all the tables on the page using `tbls <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>% html_table(fill = T)`, and then down-selecting by those that contain one/some/all of `c("W", "L", "PCT", "PF", "PA")` in a row? – SymbolixAU Apr 11 '16 at 01:44
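(A sketch of that down-selection idea, using toy stand-in tables rather than a live page; on a real page, `tbls` would come from `read_html(url) %>% html_nodes("table") %>% html_table(fill = TRUE)`:)

```r
# Two stand-in tables: one standings-like, one schedule-like (toy data)
tbls <- list(
  data.frame(Team = "Team A", W = 10, L = 2, PCT = 0.833),  # standings-like
  data.frame(Week = 1, Date = "Oct 1")                      # schedule-like
)

# keep only tables whose headers include the standings columns
is_standings <- function(tb) all(c("W", "L", "PCT") %in% names(tb))
standings <- Filter(is_standings, tbls)
length(standings)  # 1
```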
  • Good, but not necessarily perfect: `html_nodes(xpath = '(//abbr[@title="Winning percentage"]|//th[text()="PCT"])/ancestor::table')` – alistaire Apr 11 '16 at 04:04
  • @alistaire oh wow, i didn't know xpath could perform such tricks. can you recommend any reference for picking up the basics? – MichaelChirico Apr 11 '16 at 04:38
  • also, is there no way to extend that `ancestor` logic to incorporate the bit about the section title? – MichaelChirico Apr 11 '16 at 04:39
  • I've just been using W3Schools; the [Syntax](http://www.w3schools.com/xsl/xpath_syntax.asp) page is a good baseline, and the [Axes](http://www.w3schools.com/xsl/xpath_axes.asp) and [XSLT/XPath Functions](http://www.w3schools.com/xsl/xsl_functions.asp) pages are useful. – alistaire Apr 11 '16 at 04:42
  • ...and you can match it with `/preceding::span[contains(@id,"tandings")]`, but I'm not sure how to fit it into the logic usefully. – alistaire Apr 11 '16 at 04:47
  • @alistaire I think that's certainly the right idea... xpath `//span[contains(@id,"tandings")]/following::table` seems like progress; is there a way to get siblings between `span`s? I'm seeing some related `xpath` questions but still don't have the syntax down. – MichaelChirico Apr 11 '16 at 16:18
  • I think I got it: `'//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table'` – alistaire Apr 11 '16 at 16:36
  • @alistaire that worked beautifully! Feel free to add as answer – MichaelChirico Apr 11 '16 at 17:29

1 Answer


You can scrape everything at once by looping over the season URLs with lapply and pulling the tables with a carefully chosen XPath selector:

library(rvest)

lapply(paste0('https://en.wikipedia.org/wiki/', 1920:2015, '_NFL_season'), 
       function(url){ 
           url %>% read_html() %>% 
               html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>% 
               html_table(fill = TRUE)
       })

The XPath selector looks for

  • //span[contains(@id, "tandings")]
    • all spans with an id containing tandings (e.g. "Standings", "Final_standings")
  • /following::*[@title="Winning percentage" or text()="PCT"]
    • with a node after it in the HTML that has
      • either a title attribute of "Winning percentage"
      • or text of exactly "PCT"
  • /ancestor::table
    • and selects the table node up the tree from that node.
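(To see the selector in action without hitting the network, here it is run against toy HTML that mimics the relevant page structure; the markup below is an illustrative stand-in, not a real Wikipedia page:)

```r
library(rvest)

# Toy page: a standings section whose table has a PCT column,
# plus an unrelated schedule section/table that should NOT match
toy <- read_html('<div id="mw-content-text">
    <h2><span id="Standings">Standings</span></h2>
    <table><tr><th>Team</th><th>W</th><th>L</th><th>PCT</th></tr>
           <tr><td>Team A</td><td>10</td><td>2</td><td>.833</td></tr></table>
    <h2><span id="Schedule">Schedule</span></h2>
    <table><tr><th>Week</th><th>Date</th></tr></table>
  </div>')

hits <- html_nodes(toy, xpath =
  '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table')

length(hits)           # 1: only the standings table matches
html_table(hits[[1]])  # Team / W / L / PCT
```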
alistaire
  • Thanks! Especially for making me realize that learning `xpath` syntax is definitely worthwhile if I'm going to be scraping more often. Powerful stuff. – MichaelChirico Apr 11 '16 at 17:51
  • Also noting that this isn't _perfect_ perfect because, e.g., in 2015, it will also pull the [conference table](https://en.wikipedia.org/wiki/2015_NFL_season#Conference), but I don't think it's crucial to deal with that, and this certainly answers the question as I asked it. – MichaelChirico Apr 11 '16 at 17:52
  • Yeah. You could write some carve-outs in the XPath, or just chop them out later. Also, I realized that the `paste0` missed the AFL pages, if you want them. – alistaire Apr 11 '16 at 18:00
  • I don't, I was `grep`ping for `"NFL"` before, thanks again – MichaelChirico Apr 11 '16 at 18:04