I'm working with one of my students to help him scrape data from the "Full Play-By-Play" table provided in game box scores by Pro Football Reference.com. (He's a Sports Studies major, so this is more than just having fun for him.)
As the box scores are generated dynamically, I'm using the RSelenium package, and I can apparently read the data, but I can't seem to parse it properly. I've tried working with the rvest and XML packages to do this, but so far, no luck.
The code that seems to work:
library(RSelenium)

rD <- rsDriver(browser = "firefox") # My Chrome browser has an issue... I'll fix it later
remDr <- rD[["client"]]
remDr$navigate("http://www.pro-football-reference.com/boxscores/201609110rav.htm")
webElem <- remDr$findElement("xpath", "//*[@id='all_pbp']")
page_source <- remDr$getPageSource()
Everything I've tried after this fails to behave as I expect. Looking at what is in page_source and comparing it to the website, I can see all the appropriate data there. I could, I suppose, write a C++ app to parse it all, but surely there is a way within R. How can I parse page_source to get the data out in some reasonable format?
BTW, I'm not 100% positive about the XPath; inspecting the source indicates that it could be all_pbp, div_pbp, pbp, or even //*[@id="all_pbp"]/div[3], but I've tried each of those with the same result. (They all give the same Full Play-by-Play table, but some have additional header information, etc. in them.)
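For reference, here is a minimal sketch of the kind of parsing step that might apply here. It rests on assumptions: that rvest 1.0 or later is installed (for html_element), that pbp is the id of the table element itself while all_pbp and div_pbp are wrapper divs, and that getPageSource() returns a list whose first element is the HTML string. The helper name parse_table and the inline demo HTML are hypothetical, made up for illustration so the example runs without a browser session.

```r
library(rvest)

# Hypothetical helper: extract one <table> from an HTML string by CSS selector.
# html_table() expects a <table> node, so the selector should target the table
# itself (assumed here to be table#pbp), not a wrapper div around it.
parse_table <- function(html, selector = "table#pbp") {
  doc <- read_html(html)
  node <- html_element(doc, selector)
  if (inherits(node, "xml_missing")) {
    stop("no node matches selector: ", selector)
  }
  html_table(node)
}

# Made-up stand-in for the rendered page, mimicking the assumed nesting.
demo_html <- '<div id="all_pbp"><div id="div_pbp"><table id="pbp">
<tr><th>Quarter</th><th>Detail</th></tr>
<tr><td>1</td><td>Kickoff</td></tr>
<tr><td>1</td><td>Pass complete</td></tr>
</table></div></div>'

pbp <- parse_table(demo_html)
print(pbp)

# With the live session, the equivalent call would presumably be:
# pbp <- parse_table(page_source[[1]])
```

If the selector has to stay on a wrapper such as all_pbp, selecting the table inside it (for example "#all_pbp table") should lead to the same node, under the same assumptions about the page structure.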
Thanks!