RSelenium: How to scrape data from Pro Football Reference.com?

Question

I'm working with one of my students to help him scrape data from the "Full Play-By-Play" table provided in game box scores by Pro Football Reference.com. (He's a Sports Studies major, so this is more than just having fun for him.)

As the box scores are generated dynamically, I'm using the RSelenium package, and can apparently read the data, but can't seem to parse it out properly. I've tried working with the rvest and XML packages to do this, but so far, no luck.

The code that seems to work:

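    library(RSelenium)

    # Start a Selenium server and browser session (Firefox here)
    rD <- rsDriver(browser = "firefox") # My Chrome browser has an issue...I'll fix it later
    remDr <- rD[["client"]]
    remDr$navigate("http://www.pro-football-reference.com/boxscores/201609110rav.htm")
    # Locate the container that holds the Full Play-By-Play table
    webElem <- remDr$findElement("xpath", "//*[@id='all_pbp']")
    # getPageSource() returns a list; its first element is the HTML as a string
    page_source <- remDr$getPageSource()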

Everything I've tried after this fails to work as I expect. Looking at what is in page_source and comparing it to the website, I can see all the appropriate data there. I could, I suppose, write a C++ app to parse it all, but surely there is a way within R. How can I parse page_source to get the data out in some reasonable format?

BTW, I'm not 100% positive about the XPath; inspecting the source indicates that the id could be all_pbp, div_pbp, pbp, or even //*[@id="all_pbp"]/div[3], but I've tried each of those with the same result. (They all give the same Full Play-by-Play table, but some have additional header information, etc. in them.)
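
For reference, here is a minimal sketch of one way the page source could be turned into a data frame with rvest. It is not a confirmed solution for this site: the inline HTML string is a hypothetical stand-in for page_source[[1]] (the real markup may differ), the column names are invented for illustration, and html_element() assumes rvest 1.0 or later. It also shows that the table id and the outer container id can both lead to the same table.

```r
library(rvest)

# Hypothetical stand-in for page_source[[1]]; with a live session the string
# would come from remDr$getPageSource()[[1]] and its markup may differ.
page_html <- '<html><body>
<div id="all_pbp"><div id="div_pbp">
<table id="pbp">
<thead><tr><th>Quarter</th><th>Time</th><th>Detail</th></tr></thead>
<tbody>
<tr><td>1</td><td>15:00</td><td>Kickoff</td></tr>
<tr><td>1</td><td>14:55</td><td>Pass complete</td></tr>
</tbody>
</table>
</div></div>
</body></html>'

# Parse the HTML string into a document
doc <- read_html(page_html)

# Select the table directly by its id (CSS selector) and convert it
pbp <- html_table(html_element(doc, "#pbp"))

# Alternative: start from the outer container and descend to the table (XPath)
pbp_xpath <- html_table(html_element(doc, xpath = "//*[@id='all_pbp']//table"))
```

Selecting the table element itself, rather than a surrounding div, avoids pulling in the extra header information that the outer containers carry.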

Thanks!

_"Except as specifically provided in this paragraph, you agree not to use or launch any automated system, including without limitation, robots, spiders, offline readers, or like devices, that accesses the Site in a manner which sends more request messages to the Site server in any given period of time than a typical human would normally produce in the same period by using a conventional on-line Web browser to read, view, and submit materials"_ Perhaps encourage ethical behaviour in students vs "if you can do it, it's cool". The site clearly does not want scrapers. - hrbrmstr, Feb 16 '17 at 22:19
Thanks for the comment; I agree that ethics are important. I do, however, believe that we are within the terms of the agreement: we could simply use a mouse and copy the data from the site, so a single scrape does not send more request messages than a human would. (Additionally, as the student does not intend to distribute the data, etc., this should fall within fair use guidelines, but that's a broader issue.) Using data for evaluation, etc., is permitted. - Barney Ricca, Feb 17 '17 at 01:44
