On basketball-reference.com, there is an injury page that shows all of the current injuries in the NBA. I'd like to begin archiving this data to keep a daily record of who's injured in the NBA. Apart from simply being a basketball stat nut, this will be an input to a Bayesian model that predicts a player's playing time from his teammates' injuries.
Now, I could simply go to the page once a day, click the "Get Table as a CSV" button, and copy and paste the result into a file, but this seems like a job for a cron job.
I could grab the raw HTML and parse it, but the web page already has a get_csv_output(e) function readily available in its sr-min.js file. In fact, if I open up the developer console and type get_csv_output("injuries"), I get the entire CSV dumped out as a string. It feels an awful lot like reinventing the wheel when I could simply use this function.
Somehow there is a disconnect in my mind, though. I don't grok how I can visit a page, run a JS function, and save the output without spinning up a full ChromeDriver instance through Selenium or something. This feels like a simple problem with a simple solution that I just don't know.
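For reference, the only way I currently know to do this is something like the sketch below (Selenium driving headless Chrome; untested), which feels like a lot of machinery just to call one function:

# Untested sketch of the heavyweight route I'm hoping to avoid: headless Chrome
# via Selenium, just to call one JS function that the page already defines.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.basketball-reference.com/friv/injuries.cgi")
    # execute_script needs an explicit return to pass the value back to Python.
    csv_text = driver.execute_script('return get_csv_output("injuries");')
    print(csv_text)
finally:
    driver.quit()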
I don't particularly care what language the solution is in, although I'd prefer Python, Bash, or some other lightweight solution.
Please let me know if I'm being naive.
Edit: The page is https://www.basketball-reference.com/friv/injuries.cgi
Edit 2: The accepted answer is an excellent solution for future reference.
I ended up doing
curl https://www.basketball-reference.com/friv/injuries.cgi | python3 convert_injury_html_to_csv.py > "$(date +'%Y%m%d')".tsv
where the Python script is:
import sys

from bs4 import BeautifulSoup


def parse_injury_html(html_doc):
    soup = BeautifulSoup(html_doc, "html.parser")
    injuries_table = soup.find(id="injuries")
    for row in injuries_table.tbody.find_all("tr"):
        # Skip the repeated header rows embedded in the table body.
        # BeautifulSoup returns the class attribute as a list of strings.
        if "thead" in row.get("class", []):
            continue
        name = row.th
        team, update, description = row.find_all("td")
        yield (name.string, team.string, update.string, description.string)


def main():
    for name, team, update, description in parse_injury_html(sys.stdin.read()):
        print(f"{name}\t{team}\t{update}\t{description}")


if __name__ == "__main__":
    main()
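For anyone who would rather skip curl entirely, the same thing can be collapsed into pure Python. This is an untested sketch that assumes the script above is saved as convert_injury_html_to_csv.py alongside it and that the requests library is installed:

# Untested sketch: fetch the page with requests and reuse parse_injury_html
# from the script above (assumed to live in convert_injury_html_to_csv.py).
import datetime

import requests

from convert_injury_html_to_csv import parse_injury_html

URL = "https://www.basketball-reference.com/friv/injuries.cgi"

def main():
    html_doc = requests.get(URL, timeout=30).text
    out_path = datetime.date.today().strftime("%Y%m%d") + ".tsv"
    with open(out_path, "w") as f:
        for name, team, update, description in parse_injury_html(html_doc):
            f.write(f"{name}\t{team}\t{update}\t{description}\n")

if __name__ == "__main__":
    main()

Either version drops a dated .tsv in the working directory, so the daily cron job only has to run one command.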