
I'm attempting to scrape the table data from the following website: https://fantasyfootball.telegraph.co.uk/premier-league/statscentre/

The objective is to get all the player data and store it in a dictionary.

I'm using BeautifulSoup and I'm able to locate the table from the html content, however the table body that is returned is empty.

From reading other posts I saw this may be related to the way the website loads the table data after the initial page load, but I could not find a way around the problem.

My code is as follows:

from bs4 import BeautifulSoup
import requests

url = "https://fantasyfootball.telegraph.co.uk/premier-league/statscentre/"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")

# Find the Title Data within the website
player_table = soup.find("table", attrs={"class": "player-profile-content"})

print(player_table)

The result I get is this:

<table class="playerrow playlist" id="table-players">
    <thead>
        <tr class="table-head"></tr>
    </thead>
    <tbody></tbody>
</table>

The actual HTML code on the website is quite long, as they pack a lot of data into each <tr> and the subsequent <td> elements, so I won't post it here unless someone asks. Suffice it to say that there are several <td> lines within the header row, as well as several <tr> lines within the body.

  • You need to look at the network monitor in the browser's webdev tool as you load the page. Then you find which request will load the data as a json, and then you can use the URL used in that request in your scraping. – Abang F. Aug 07 '20 at 10:18

2 Answers


This script will print all player stats (the data is loaded from an external URL as JSON):

import ssl
import json
import requests
from urllib3 import poolmanager

# workaround to avoid SSL errors:
class TLSAdapter(requests.adapters.HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        """Create and initialize the urllib3 PoolManager."""
        ctx = ssl.create_default_context()
        ctx.set_ciphers('DEFAULT@SECLEVEL=1')
        self.poolmanager = poolmanager.PoolManager(
                num_pools=connections,
                maxsize=maxsize,
                block=block,
                ssl_version=ssl.PROTOCOL_TLS,
                ssl_context=ctx)

url = 'https://fantasyfootball.telegraph.co.uk/premier-league/json/getstatsjson'

session = requests.session()
session.mount('https://', TLSAdapter())
data = session.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for s in data['playerstats']:
    for k, v in s.items():
        print('{:<15} {}'.format(k, v))
    print('-'*80)

Prints:

SUSPENSION      None
WEEKPOINTS      0
TEAMCODE        MCY
SXI             34
PLAYERNAME      de Bruyne, K
FULLCLEAN       -
SUBS            3
TEAMNAME        Man City
MISSEDPEN       0
YELLOWCARD      3
CONCEED         -
INJURY          None
PLAYERFULLNAME  Kevin de Bruyne
RATIO           40.7
PICKED          36
VALUE           5.6
POINTS          228
PARTCLEAN       -
OWNGOAL         0
ASSISTS         30
GOALS           14
REDCARD         0
PENSAVE         -
PLAYERID        3001
POS             MID
--------------------------------------------------------------------------------

...and so on.
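To meet the original goal of storing everything in a dictionary, the playerstats list can be re-keyed by player. A minimal sketch, assuming PLAYERID is unique (field names taken from the output above; the sample record is abbreviated):

```python
def stats_by_player(playerstats):
    """Index the list of stat records by their PLAYERID field."""
    return {p['PLAYERID']: p for p in playerstats}

# Abbreviated sample record matching the output above; the real list
# comes from data['playerstats'] in the script.
sample = [{'PLAYERID': 3001, 'PLAYERNAME': 'de Bruyne, K', 'POINTS': 228}]

players = stats_by_player(sample)
print(players[3001]['POINTS'])  # 228
```

PLAYERNAME would also work as a key, but numeric IDs are safer against duplicate or reformatted names.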
Andrej Kesely

A simple solution is to monitor the network traffic and understand how the data is exchanged. You would see that the data comes from a GET call to Request URL: https://fantasyfootball.telegraph.co.uk/premier-league/json/getstatsjson It is beautiful JSON, so we do not need BeautifulSoup; requests alone will do the job.

import requests
import pandas as pd

URI = 'https://fantasyfootball.telegraph.co.uk/premier-league/json/getstatsjson'
r = requests.get(URI)

data = r.json()
df = pd.DataFrame(data['playerstats'])

print(df.head())  # head() shows the first 5 rows

Results: (screenshot of the first five rows of the DataFrame)
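If a plain dictionary is preferred over a DataFrame (as the question asks), pandas can produce one directly. A sketch using a small stand-in for the JSON payload (the real data comes from the URL above; PLAYERID is assumed unique):

```python
import pandas as pd

# Stand-in for r.json(); field names match the real payload.
data = {'playerstats': [
    {'PLAYERID': 3001, 'PLAYERNAME': 'de Bruyne, K', 'POINTS': 228},
    {'PLAYERID': 3002, 'PLAYERNAME': 'Salah, M', 'POINTS': 233},
]}

df = pd.DataFrame(data['playerstats'])

# Re-key the rows by PLAYERID: {3001: {'PLAYERNAME': ..., 'POINTS': ...}, ...}
players = df.set_index('PLAYERID').to_dict('index')
print(players[3001]['PLAYERNAME'])  # de Bruyne, K
```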

Prayson W. Daniel