3

I'm trying to scrape the viewers on www.twitch.tv/directory using Python. I have tried the basic BeautifulSoup script:

url= 'https://www.twitch.tv/directory'
html= urlopen(url)
soup = BeautifulSoup(url, "html5lib") #also tried using html.parser, lxml
soup.prettify()

This gives me html without the actual viewer numbers shown.

Then I tried using param ajax data. From this thread

param = {"action": "getcategory",
        "br": "f21",
        "category": "dress",
        "pageno": "",
        "pagesize": "",
        "sort": "",
        "fsize": "",
        "fcolor": "",
        "fprice": "",
        "fattr": ""}

url = "https://www.twitch.tv/directory"
# Also tried with the headers parameter headers={"User-Agent":"Mozilla/5.0...
js = requests.get(url,params=param).json()

But I get a JSONDecodeError: Expecting value: line 1 column 1 (char 0) error.

From then I moved on to selenium

driver = webdriver.Edge()
url = 'https://www.twitch.tv/directory'
driver.get(url)
#Also tried driver.execute_script("return document.documentElement.outerHTML") and innerHTML
html = driver.page_source
driver.close()
soup = BeautifulSoup(html, "lxml")

These just yield the same result I get from the standard BeautifulSoup call.

Any help on scraping the view count would be appreciated.

BBBBInTown
  • 33
  • 3
  • I just checked the page and viewed the source - it appears that all the data is gotten via javascript, there's no "normal" HTML in there. So it's not going to be possible to scape that data from the HTML, which is all things like BeatifulSoup do - they parse HTML, they can't run Javascript as well. – Robin Zigmond Oct 06 '18 at 17:36
  • @RobinZigmond Hi Robin. Is there an alternate way to get this data that I can look into? Thanks. – BBBBInTown Oct 06 '18 at 17:37
  • I'm afraid I really don't know, I don't use twitch. It appears that twitch has an API, as I expected it to: https://dev.twitch.tv/api - I guess you can get the information you need from that, but I can't help with how to use it. – Robin Zigmond Oct 06 '18 at 17:42
  • @RobinZigmond I'll look into it. It might end up being the easiest way to do this – BBBBInTown Oct 06 '18 at 21:57

1 Answers1

4

The stats are not present in the page when its first loaded. The page makes a graphql request to https://gql.twitch.tv/gql to fetch the game data. When a user isn't logged in the graphql request asks for the query AnonFrontPage_TopChannels.

Here is a working request in python:

import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "AnonFrontPage_TopChannels",
            "variables": {"platformType": "all", "isTagsExperiment": True},
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "d94b2fd8ad1d2c2ea82c187d65ebf3810144b4436fbf2a1dc3af0983d9bd69e9",
                }
            },
        }
    ),
    headers = {'Client-Id': 'kimne78kx3ncx6brgo4mv6wki5h1ko'},
)

print(json.loads(resp.content))

I've included the Client-Id in the request. The id doesn't seem to be unique to the session, but I imagine Twitch expires them, so this likely won't work forever. You'll have to inspect future graphql requests and grab a new Client-Id in the future or figure out how to programmatically scrape one from the page.

This request actually seems to be the Top Live Channels section. Here's how you can get the view counts and titles:

edges = json.loads(resp.content)["data"]["streams"]["edges"]
games = [(f["node"]["title"], f["node"]["viewersCount"]) for f in edges]

# games:
[
    ("Let us GAME", 78250),
    ("(REBROADCAST) Worlds Play-In Knockouts: Cloud9 vs. Gambit Esports", 36783),
    ("RuneFest 2018 - OSRS Reveals !schedule", 35042),
    (None, 25237),
    ("Front Page of TWITCH + Fortnite FALL SKIRMISH Training!", 22380),
    ("Reckful - 3v3 with barry and a german", 20399),
]

You'll need to check the chrome network inspector and figure out the structure of the other requests to get more data.

And here's an example for the directory page:

import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "BrowsePage_AllDirectories",
            "variables": {
                "limit": 30,
                "directoryFilters": ["GAMES"],
                "isTagsExperiment": True,
                "tags": [],
            },
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "75fb8eaa6e61d995a4d679dcb78b0d5e485778d1384a6232cba301418923d6b7",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)

edges = json.loads(resp.content)["data"]["directoriesWithTags"]["edges"]
games = [f["node"] for f in edges]
maxm
  • 3,412
  • 1
  • 19
  • 27
  • Thanks for the quick response! How would you modify the operationName to give the "twitch.tv/directory" views by top game? EDIT: I see I gotta go to the network inspector – BBBBInTown Oct 06 '18 at 17:59
  • Thanks so much! This is exactly what I was looking for – BBBBInTown Oct 06 '18 at 21:57