0

I am trying to use Scrapy to crawl through stream pages on Twitch. The problem is that the HTML response contains no useful URLs. For example, fetching the twitch.tv main page with wget, I get an essentially empty body tag:

<body>
    <!-- some stuff -->
    <div id='flyout'>
        <div class='point'>
        </div>
        <div class='content'>
        </div>
    </div>
</body>

I understand the content is somehow loaded afterwards, but I couldn't figure out how it was done. Any ideas or suggestions? Thanks!

Bolt
  • 11
  • you need to use selenium + scrapy – whale_steward Mar 16 '17 at 01:34
  • @whale_steward not sure if the selenium/scrapy combo is the way to go; you will lose the advantages of async request processing with selenium, not to mention that, depending on your setup, it may not be convenient to need a full-fledged browser – Verbal_Kint Mar 16 '17 at 02:08
  • selenium renders the page the way a web browser fetches it, so it is one way to get it. But if twitch provides an API, then just accessing the API is enough, without the need for selenium. – whale_steward Mar 16 '17 at 02:49

1 Answer

0

Open up a browser with the dev tools open as well. Click the Network tab, then go to twitch.tv and dig into the requests to see which ones provide which parts of the content, and narrow it down to the content you want (given the example below, the request URL will most likely be some form of https://api.twitch.tv/{path to endpoint}/{name of endpoint}?{endpointarg=value}). For example:

If you want all the data for the featured content on the homepage, you may find that instead of starting your crawl at twitch.tv, you should go straight to https://api.twitch.tv/kraken/streams/featured?limit=6&geo=US&lang=en&on_site=1, which returns nicely formatted JSON like so:

{"_links":
    {"self":"https://api.twitch.tv/kraken/streams/featured?geo=US&lang=en&limit=6&offset=0",
    "next":"https://api.twitch.tv/kraken/streams/featured?geo=US&lang=en&limit=6&offset=6"},
    "featured":[
        {"text":"<p>SNES Super Stars is a 11-day speedrun marathon devoted to the Super Nintendo Entertainment System. From March 10th-20th, watch over 200 games being beaten amazingly fast and races between some of the top speedrunners in the world!</p>\n\n<br>\n\n\n<p><a href=\"/speedgaming\">Click here</a> to watch and chat!</p>\n\n<p><a href=\"communitysuccess,speedrun\"></a></p>\n",
        "title":"SNES Super Stars Marathon",
        "sponsored":false,
        "priority":5,
        "scheduled":true,
...

And you could just follow links from there. You will also have to emulate the headers for that request: the example above won't work unless you specify a Client-ID in your request header, which you can probably pull from the headers of the original request in the Network tab. Every section or feature of the site probably has its own API endpoint, which you may be able to access directly. It is also a bit easier on Twitch's servers, since they don't have to serve up all those pictures and videos; kind of a win-win. Also, notice the query arguments at the end of the URL: you can probably manipulate how many items you get back (limit=6).
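A quick way to sanity-check an endpoint and its headers before wiring it into a Scrapy spider is a standard-library-only fetch. This is just a sketch: the Client-ID value is a placeholder you would copy from the Network tab, and featured_titles assumes the payload shape shown above.

```python
import json
from urllib.request import Request, urlopen

# The featured-streams endpoint discovered via the browser's Network tab.
API_URL = ("https://api.twitch.tv/kraken/streams/featured"
           "?limit=6&geo=US&lang=en&on_site=1")


def featured_titles(payload):
    """Pull the stream titles out of the JSON structure shown above."""
    return [entry["title"] for entry in payload.get("featured", [])]


def fetch_featured(url=API_URL, client_id="YOUR_CLIENT_ID"):
    # "YOUR_CLIENT_ID" is a placeholder -- without a real Client-ID header
    # copied from the original browser request, the API will reject you.
    req = Request(url, headers={"Client-ID": client_id})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Once that works, the same headers dict goes into your `scrapy.Request(url, headers={"Client-ID": ...})` and the parse callback just does `json.loads(response.text)` instead of CSS/XPath selection.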

That should get you what you want, although you will have to dig around for the endpoints. But if, for whatever reason, you really need to process JavaScript dynamically and don't want to automate a browser with Selenium while staying within the Scrapy ecosystem, there is also Scrapinghub's Splash project, which integrates quite well with Scrapy via the scrapy-splash package.
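If you go the Splash route, the integration is mostly configuration. A sketch of the settings.py additions, assuming a Splash instance listening on localhost:8050 (e.g. started via Docker); the middleware entries and order values here follow the scrapy-splash README:

```python
# settings.py -- scrapy-splash wiring (sketch; assumes Splash on localhost:8050)
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make the dupefilter aware of Splash-specific request arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

In the spider you then yield `scrapy_splash.SplashRequest(url, self.parse, args={"wait": 0.5})` in place of a plain `scrapy.Request`, and `response.text` contains the rendered HTML.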

Verbal_Kint
  • 1,366
  • 3
  • 19
  • 35