On basketball-reference.com, there is an injury page that shows all of the current injuries in the NBA. I'd like to begin archiving this data to keep a daily record of who's injured in the NBA. Apart from simply being a basketball stat nut, this will be an input to a Bayesian model that predicts a player's playing time from his teammates' injuries.
Now, I could simply go to the page once a day, click the "Get Table as a CSV" button, and copy and paste the result into a file, but this seems like a job for a cron job.
I could grab the raw HTML and parse it, but the web page already has a get_csv_output(e) function readily available in its sr-min.js file. In fact, if I open up the developer console and type get_csv_output("injuries"), I get the entire CSV dumped out as a string. It feels an awful lot like reinventing the wheel when I could simply use this function.
Somehow there is a disconnect in my mind, though. I don't grok how I can visit a page, run a JS function, and save the output without spinning up a full ChromeDriver instance through Selenium or something. This feels like a simple problem with a simple solution that I just don't know.
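For reference, the only way I currently know to do this is something like the sketch below (Selenium driving headless Chrome; untested), which feels like a lot of machinery just to call one function:

# Untested sketch of the heavyweight route I'm hoping to avoid: headless Chrome
# via Selenium, just to call one JS function that the page already defines.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.basketball-reference.com/friv/injuries.cgi")
    # execute_script needs an explicit return to pass the value back to Python.
    csv_text = driver.execute_script('return get_csv_output("injuries");')
    print(csv_text)
finally:
    driver.quit()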
I don't particularly care what language the solution is in, although I'd prefer Python, Bash, or some other lightweight solution.
Please let me know if I'm being naive.
Edit: The page is https://www.basketball-reference.com/friv/injuries.cgi
Edit 2: The accepted answer is an excellent solution for future reference.
I ended up doing
curl https://www.basketball-reference.com/friv/injuries.cgi | python3 convert_injury_html_to_csv.py > "$(date +'%Y%m%d')".tsv
where the Python script is:
import sys

from bs4 import BeautifulSoup


def parse_injury_html(html_doc):
    soup = BeautifulSoup(html_doc, "html.parser")
    injuries_table = soup.find(id="injuries")
    for row in injuries_table.tbody.find_all("tr"):
        # Skip the repeated header rows embedded in the table body.
        # BeautifulSoup returns the class attribute as a list of strings.
        if "thead" in row.get("class", []):
            continue
        name = row.th
        team, update, description = row.find_all("td")
        yield (name.string, team.string, update.string, description.string)


def main():
    for name, team, update, description in parse_injury_html(sys.stdin.read()):
        print(f"{name}\t{team}\t{update}\t{description}")


if __name__ == "__main__":
    main()
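For anyone who would rather skip curl entirely, the same thing can be collapsed into pure Python. This is an untested sketch that assumes the script above is saved as convert_injury_html_to_csv.py alongside it and that the requests library is installed:

# Untested sketch: fetch the page with requests and reuse parse_injury_html
# from the script above (assumed to live in convert_injury_html_to_csv.py).
import datetime

import requests

from convert_injury_html_to_csv import parse_injury_html

URL = "https://www.basketball-reference.com/friv/injuries.cgi"

def main():
    html_doc = requests.get(URL, timeout=30).text
    out_path = datetime.date.today().strftime("%Y%m%d") + ".tsv"
    with open(out_path, "w") as f:
        for name, team, update, description in parse_injury_html(html_doc):
            f.write(f"{name}\t{team}\t{update}\t{description}\n")

if __name__ == "__main__":
    main()

Either version drops a dated .tsv in the working directory, so the daily cron job only has to run one command.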