How to download and query HTML pages where JS processing is necessary?

Question

I often compile informal datasets by running some kind of XPath/XQuery on publicly available web pages. Usually the structure of the HTML is regular enough that useful information can be extracted easily.

But today I've come across tunefind.com. This website makes extensive use of the React framework, so most of the page structure is built client-side by JavaScript. The pages, when initially downloaded, are very basic and missing a lot of information. They are populated by a script that uses a hopelessly messy blob of JSON data at the bottom of the page.

The only way I can think of to deal with this would be to use some kind of GUI-based web engine and just not display the GUI part. But that is a preposterous amount of work for these casual little CLI tools that I use to gather information.

Is there any way to perform the JavaScript preprocessing without dealing with unnecessary graphics?

Maybe use headless chrome, give it a chance to run the js code and then parse the dom — Tiago Coelho, Apr 28 '18 at 21:29

Patrick Mead · Accepted Answer · 2018-04-28T22:08:04.657

Even if you process the page without graphics, the React JavaScript is geared towards running in a browser context. At the very least it will expect a functioning DOM to exist, and the application itself may require clicks or transitions to happen before you can see some of the data.

Your best bet, then, is to load the page in a browser. To keep this simple, there are plenty of good browser automation frameworks designed for exactly this.

I've used a fair few libraries over the years, including PhantomJS, and recently I've gotten the most mileage out of Nightmare (nightmarejs).

It runs an Electron browser for you and gives you a useful promise-based JavaScript API to control it, with common browser actions such as clicking, following links, etc.

You can configure it to hide the browser window, which is useful for making a CLI tool. However, it's a bit of a pseudo-headless mode and will still require a windowing/graphical context (e.g. the X Window System).
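As a rough illustration of the kind of script this leads to, here is a minimal sketch using Nightmare's chained API with the window hidden. The helper names `collectText` and `scrapeText` are made up for this example, and the timeout value is an arbitrary choice. It assumes `nightmare` is installed from npm and that the data you want ends up in elements matching a CSS selector once the page has rendered.

```javascript
// Runs inside the page: Nightmare serializes this function and executes it
// in the browser, so it may only use browser globals and its own arguments.
function collectText(selector) {
  return Array.from(document.querySelectorAll(selector), (el) =>
    el.textContent.trim()
  );
}

// Hypothetical helper: load `url`, wait until the client-side code has
// rendered at least one element matching `selector`, then resolve with the
// trimmed text of every matching element.
function scrapeText(url, selector) {
  const Nightmare = require("nightmare"); // assumes: npm install nightmare
  const browser = Nightmare({
    show: false, // keep the Electron window hidden
    waitTimeout: 15000, // give up if the selector never appears (ms)
  });
  return browser
    .goto(url)
    .wait(selector)
    .evaluate(collectText, selector)
    .end() // shut down the Electron process once the queue finishes
    .then((texts) => texts); // then() starts the queued actions
}
```

Calling something like `scrapeText(someUrl, "li a").then(console.log)` would print the collected strings. The URL and selector are placeholders you would replace after inspecting the rendered page. If the data only appears after an interaction, a `.click(selector)` step can be added to the chain before `.wait(...)`.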

Hope this helps.

PS - If you're at all used to Docker, it's not hard to run this in a container!

Super cool tools! I always forget that JS isn’t just for browsers. — William Rosenbloom, May 01 '18 at 16:41
