
I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player are in this format:

http://www.sports-reference.com/cfb/players/ryan-aplin-1.html

I cannot find an aggregate of all players, so I need to go page by page and pull out the bottom row of each HTML table (Passing, Scoring, Rushing & Receiving, etc.).

Each player is categorized by last name, with links for each letter of the alphabet here:

http://www.sports-reference.com/cfb/players/

For instance, every player whose last name starts with A is found here:

http://www.sports-reference.com/cfb/players/a-index.html

This is my first time really getting into data scraping, so I tried to find similar questions with answers. The closest answer I found was this question.

I believe I could use something very similar, swapping the page number for the player's name. However, I'm not sure how to change it to look up a player's name instead of a page number.
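For instance, I imagine something along these lines, where a player-name slug takes the place of the page number (just a sketch; I'm assuming the `-1.html` suffix is the same for every player):

    # hypothetical: build a player's stats-page URL from a name slug
    player <- "ryan-aplin"
    url <- paste0("http://www.sports-reference.com/cfb/players/", player, "-1.html")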

Samuel L. Ventura also recently gave a talk about data scraping for NFL data, which can be found here.

EDIT:

Ben was really helpful and provided some great code. The first part works really well; however, when I attempt to run the second part I run into this:

> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it! 
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") : 
 no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) : 
  unexpected '<' occurred on line 1
 use a command like
 x <- edit()
 to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
  deparse may be incomplete

This could be an error due to insufficient memory (only 4 GB on the computer I am currently using), although I do not understand the error:

    > all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") : 
 no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

Looking through my other datasets, my players really only go back to 2007. If there were some way to pull only players from 2007 onwards, that might help shrink the data. If I had a list of the players whose names I wanted to pull, could I just replace `lnk` in

 links[[i]] <- paste0("http://www.sports-reference.com", lnk)

with only the players that I need?
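For example, instead of changing `lnk` inside the loop, could I filter the combined `links` vector afterwards? Something like this is what I have in mind (just a sketch; `my_players` is a hypothetical vector of name slugs):

    # keep only the URLs for the players I actually need
    my_players <- c("ryan-aplin", "neli-aasa")
    links <- links[grepl(paste(my_players, collapse = "|"), links)]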

Steve Bronder
  • You may be better off using a specialized web scraping tool. I tried to do something similar in R, but gave up and ended up using [Scrapy](http://doc.scrapy.org/en/latest/intro/overview.html) to dump the data into CSV and then analyzing it in R. Scrapy is written in Python, so it may not be usable for you. There are other similar frameworks in other languages as well: [iRobot Visual Scraping](http://irobotsoft.com), various Ruby [gems](http://stackoverflow.com/questions/15037392/web-page-scraping-gems-tools-available-in-ruby), etc. – Alex Popov Dec 02 '13 at 01:39
  • The error you got probably resulted from a glitch in your internet connection or the sports website's server. I've updated my answer to handle errors; it will skip over URLs that give an error and carry on. I haven't run it to completion, but it's been going well for the last few hours. If you run into further problems, you should accept the answer you've got here and post a new question to get some fresh eyes on it. In the code I posted, subsetting only the 2007-and-later data is possible only once you've got all the tables to start with. There may be other ways, though. – Ben Dec 03 '13 at 07:35
  • If you have a list of players, then that will save a *lot* of time, since we can subset the list of URLs before scraping them all. That might be the way forward for refining your method. Or try Python's `scrapy` as @aseidlitz suggested; there are some experts on that here at SO as well. I've used it too with success, but am currently an R monoglot. – Ben Dec 03 '13 at 07:40
  • I've now completed a full run of this code (it took about 5 h with 6 GB of RAM, never more than 40% used) and it appears to work just fine. The RData file is here: http://www.fileswap.com/dl/tNJYJ9yrN/ (9 MB) – Ben Dec 03 '13 at 23:58

1 Answer


Here's how you can easily get all the data in all the tables on all the player pages...

First make a list of the URLs for all the players' pages...

require(RCurl); require(XML)
n <- length(letters) 
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
  print(i) # keep track of what the function is up to
  # get all html on each page of the a-z index pages
  inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
  # scrape URLs for each player from each index page
  lnk <- unname(xpathSApply(inx_page, "//a/@href"))
  # skip first 63 and last 10 links as they are constant on each page
  lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
  # only keep links that go to players (exclude schools)
  lnk <- lnk[grep("players", lnk)]
  # now we have a list of all the URLs to all the players on that index page
  # but the URLs are incomplete, so let's complete them so we can use them from 
  # anywhere
  links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)

Now we have a vector of some 67,000 URLs (that seems like a lot of players; can it be right?).
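Before committing to the full run, you might want to sanity-check the scraping on a small sample of URLs, for example:

    # quick test on a handful of URLs before the full scrape
    test_tables <- lapply(links[1:10], readHTMLTable, stringsAsFactors = FALSE)
    str(test_tables, max.level = 1)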

Second, scrape all the tables at each URL to get their data, like so:

# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = (length(links)))
for(i in 1:length(links)){
  print(i)
  # error handling - skips to next URL if it gets an error
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
  ); if(class(result) == "try-error") next;
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique(gsub(paste(toMatch, collapse = "|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
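Since the full scrape takes a few hours, it is probably worth saving the result to disk so it doesn't have to be repeated, for example:

    # save the scraped tables so the long scrape doesn't have to be repeated
    save(all_tables, file = "all_tables.RData")
    # later, restore with: load("all_tables.RData")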

The result looks like this (this is just a snippet of the output):

all_tables
$`neli-aasa`
$`neli-aasa`$defense
   Year School Conf Class Pos Solo Ast Tot Loss  Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007   Utah  MWC    FR  DL    2   1   3  0.0 0.0   0   0      0  0  0   0  0  0
2 *2010   Utah  MWC    SR  DL    4   4   8  2.5 1.5   0   0      0  1  0   0  0  0

$`neli-aasa`$kick_ret
   Year School Conf Class Pos Ret Yds  Avg TD Ret Yds Avg TD
1 *2007   Utah  MWC    FR  DL   0   0       0   0   0      0
2 *2010   Utah  MWC    SR  DL   2  24 12.0  0   0   0      0

$`neli-aasa`$receiving
   Year School Conf Class Pos Rec Yds  Avg TD Att Yds Avg TD Plays Yds  Avg TD
1 *2007   Utah  MWC    FR  DL   1  41 41.0  0   0   0      0     1  41 41.0  0
2 *2010   Utah  MWC    SR  DL   0   0       0   0   0      0     0   0       0

Finally, let's say we just want to look at the passing tables...

# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)

And we end up with a data frame that is ready for further analyses (also just a snippet)...

             Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
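If you only need seasons from 2007 onwards (as mentioned in the question), one way to subset at this point might be as follows (note that `Year` comes back as character and sometimes carries a leading `*`, so strip non-digits before comparing):

    # keep only seasons from 2007 onwards
    passing$Year <- as.numeric(gsub("[^0-9]", "", passing$Year))
    passing_recent <- passing[!is.na(passing$Year) & passing$Year >= 2007, ]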
Ben
  • This was really helpful!! However, I ran into some problems in the second part and made an edit to my original post. – Steve Bronder Dec 03 '13 at 02:04
  • @user2269255 I've updated the code to handle errors without stopping the scraping. – Ben Dec 03 '13 at 07:41
  • This code would not work on the computer I originally had because of my memory limit. I spoiled myself and upgraded, and can now say this works like a charm! – Steve Bronder Jan 03 '14 at 16:23