I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format.
http://www.sports-reference.com/cfb/players/ryan-aplin-1.html
I cannot find an aggregate of all players so I need to go page by page and pull out the bottom row of each passing scoring Rushing & receiving etc. html table
Each player is catagorized by their last name with links to each alphabet going here.
http://www.sports-reference.com/cfb/players/
For instance, each player with the last name A is found here.
http://www.sports-reference.com/cfb/players/a-index.html
This is my first time really getting into data scraping so I tried to find similar questions with answers. The closest answer I found was this question
I believe I could use something very similar where I switch page number with the collected player's name. However, I'm not sure how to change it to look for player name instead of page number.
Samuel L. Ventura also gave a talk about data scraping for NFL data recently that can be found here.
EDIT:
Ben was really helpful and provided some great code. The first part works really well, however when I attempt to run the second part I run into this.
> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it!
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) :
unexpected '<' occurred on line 1
use a command like
x <- edit()
to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
deparse may be incomplete
This could be an error due to not having sufficient memory (only 4gb on computer I am currently using). Although I do not understand the error
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
Looking through my other datasets my players really only go back to 2007. If there would be some way to pull just people from 2007 onwards that may help shrink the data. If I had a list of people whose names I wanted to pull could I just replace the lnk in
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
with only the players that I need?