
I'm attempting to scrape the table data from the following website: https://fantasyfootball.telegraph.co.uk/premier-league/statscentre/

The objective is to get all the player data and store it in a dictionary.

I'm using BeautifulSoup and I'm able to locate the table from the html content, however the table body that is returned is empty.

From reading other posts I saw this may be related to the way the website loads the table data after the initial page load, but I could not find a way around the problem.

My code is as follows:

from bs4 import BeautifulSoup
import requests

url = "https://fantasyfootball.telegraph.co.uk/premier-league/statscentre/"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")

# Find the Title Data within the website
player_table = soup.find("table", attrs={"class": "player-profile-content"})

print(player_table)

The result I get is this:

<table class="playerrow playlist" id="table-players">
    <thead>
        <tr class="table-head"></tr>
    </thead>
    <tbody></tbody>
</table>

The actual HTML code on the website is quite long, as they pack a lot of data into each <tr> and the subsequent <td> elements, so I won't post it here unless someone asks. Suffice it to say that there are several <td> lines within the header row, as well as several <tr> lines within the body.

  • You need to look at the network monitor in the browser's webdev tool as you load the page. Then you find which request will load the data as a json, and then you can use the URL used in that request in your scraping. – Abang F. Aug 07 '20 at 10:18

2 Answers


This script will print all player stats (the data is loaded from an external URL as JSON):

import ssl
import json
import requests
from urllib3 import poolmanager

# workaround to avoid SSL errors:
class TLSAdapter(requests.adapters.HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        """Create and initialize the urllib3 PoolManager."""
        ctx = ssl.create_default_context()
        ctx.set_ciphers('DEFAULT@SECLEVEL=1')
        self.poolmanager = poolmanager.PoolManager(
                num_pools=connections,
                maxsize=maxsize,
                block=block,
                ssl_version=ssl.PROTOCOL_TLS,
                ssl_context=ctx)

url = 'https://fantasyfootball.telegraph.co.uk/premier-league/json/getstatsjson'

session = requests.session()
session.mount('https://', TLSAdapter())
data = session.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for s in data['playerstats']:
    for k, v in s.items():
        print('{:<15} {}'.format(k, v))
    print('-'*80)

Prints:

SUSPENSION      None
WEEKPOINTS      0
TEAMCODE        MCY
SXI             34
PLAYERNAME      de Bruyne, K
FULLCLEAN       -
SUBS            3
TEAMNAME        Man City
MISSEDPEN       0
YELLOWCARD      3
CONCEED         -
INJURY          None
PLAYERFULLNAME  Kevin de Bruyne
RATIO           40.7
PICKED          36
VALUE           5.6
POINTS          228
PARTCLEAN       -
OWNGOAL         0
ASSISTS         30
GOALS           14
REDCARD         0
PENSAVE         -
PLAYERID        3001
POS             MID
--------------------------------------------------------------------------------

...and so on.
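To meet the original goal of storing everything in a dictionary, the playerstats list can be re-keyed by player. A minimal sketch, assuming PLAYERID is unique (field names taken from the output above; the sample record is abbreviated):

```python
def stats_by_player(playerstats):
    """Index the list of stat records by their PLAYERID field."""
    return {p['PLAYERID']: p for p in playerstats}

# Abbreviated sample record matching the output above; the real list
# comes from data['playerstats'] in the script.
sample = [{'PLAYERID': 3001, 'PLAYERNAME': 'de Bruyne, K', 'POINTS': 228}]

players = stats_by_player(sample)
print(players[3001]['POINTS'])  # 228
```

PLAYERNAME would also work as a key, but numeric IDs are safer against duplicate or reformatted names.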
Andrej Kesely

A simple solution is to monitor the network traffic and understand how the data is exchanged. You would see that the data comes from a GET call to Request URL: https://fantasyfootball.telegraph.co.uk/premier-league/json/getstatsjson It is beautiful JSON, so we do not need BeautifulSoup; requests alone will do the job.

import requests
import pandas as pd

URI = 'https://fantasyfootball.telegraph.co.uk/premier-league/json/getstatsjson'
r = requests.get(URI)

data = r.json()
df = pd.DataFrame(data['playerstats'])

print(df.head())  # head() shows the first 5 rows

Results: (screenshot of the first five rows of the DataFrame)
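If a plain dictionary is preferred over a DataFrame (as the question asks), pandas can produce one directly. A sketch using a small stand-in for the JSON payload (the real data comes from the URL above; PLAYERID is assumed unique):

```python
import pandas as pd

# Stand-in for r.json(); field names match the real payload.
data = {'playerstats': [
    {'PLAYERID': 3001, 'PLAYERNAME': 'de Bruyne, K', 'POINTS': 228},
    {'PLAYERID': 3002, 'PLAYERNAME': 'Salah, M', 'POINTS': 233},
]}

df = pd.DataFrame(data['playerstats'])

# Re-key the rows by PLAYERID: {3001: {'PLAYERNAME': ..., 'POINTS': ...}, ...}
players = df.set_index('PLAYERID').to_dict('index')
print(players[3001]['PLAYERNAME'])  # de Bruyne, K
```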

Prayson W. Daniel