
This is my first project with pandas and Selenium, so I may be making a dumb mistake. I've written this function to go through a list of NBA players and scrape their game logs into data frames. It mostly works, but occasionally, at some random point while iterating over the list of players, it just stops working and gives me this error:

Traceback (most recent call last):
  File "/Users/arslanamir/PycharmProjects/nba/main.py", line 154, in <module>
    Game_Log_Scraper(players, x)
  File "/Users/arslanamir/PycharmProjects/nba/main.py", line 48, in Game_Log_Scraper
    tables = pd.read_html(html, flavor='lxml')
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 1085, in read_html
    return _parse(
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 913, in _parse
    raise retained
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 893, in _parse
    tables = p.parse_tables()
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 213, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 684, in _parse_tables
    raise ValueError(f"No tables found matching regex {repr(pattern)}")
ValueError: No tables found matching regex '.+'

Process finished with exit code 1

This is the function:

def Game_Log_Scraper(players):
    for name in players:
        first = name.split()[0]
        last = name.split()[1]
        if not Path(f'/Users/arslanamir/PycharmProjects/nba/{first} {last}').is_file():
            driver = webdriver.Chrome(executable_path='/Users/arslanamir/PycharmProjects/chromedriver')
            driver.get(f'https://www.nba.com/stats/players/boxscores/?CF=PLAYER_NAME*E*{first}%20{last}&Season=2020-21'
                       f'&SeasonType=Regular%20Season')
            html = driver.page_source

            tables = pd.read_html(html, flavor='lxml')
            data = tables[1]

            driver.close()

            not_needed = ['Match\xa0Up', 'Season', 'FGM', 'FGA', '3PM', '3PA', '3P%', 'FTM', 'FTA',
                          'FT%', 'STL', 'BLK', 'TOV', '+/-', 'FP', 'FG%', 'OREB', 'DREB', 'PF']

            for item in not_needed:
                data.drop(item, axis=1, inplace=True)

            data.dropna(axis=0, inplace=True)
            data.drop('W/L', axis=1, inplace=True)

            with open(f'{first} {last}', 'w+') as f:
                f.write(data.to_string())

    return players

I've tried changing the read_html flavor to html5lib and bs4 as well, and neither works. Here is an example of the webpage: https://www.nba.com/stats/players/boxscores/?CF=PLAYER_NAME*E*Malik%20Beasley&Season=2020-21&SeasonType=Regular%20Season

  • First check what you get in `page_source` - maybe the server sends a `Captcha` to block bots/scripts/spammers/hackers. Or maybe the page needs more time to generate the table, and then you need `time.sleep()` or [Selenium Waits](https://selenium-python.readthedocs.io/waits.html) – furas Feb 09 '21 at 23:37
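A minimal sketch of the explicit-wait approach furas suggests, assuming the stats table renders as a plain <table> element (inspect the page for the real locator before relying on this):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www.nba.com/stats/players/boxscores/'
       '?CF=PLAYER_NAME*E*Malik%20Beasley&Season=2020-21&SeasonType=Regular%20Season')

driver = webdriver.Chrome(executable_path='/Users/arslanamir/PycharmProjects/chromedriver')
driver.get(url)

# Wait up to 15 seconds for at least one <table> to be present before
# reading page_source; raises TimeoutException if it never appears.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table'))
)
html = driver.page_source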

1 Answer


A couple of things right off the bat:

  1. No need to loop through columns to drop them. You can just use the list.

So change

for item in not_needed:
    data.drop(item, axis=1, inplace=True)

to

data.drop(not_needed, axis=1, inplace=True)
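Equivalently, if you'd rather avoid inplace, pandas also accepts the columns keyword (same result, just a different spelling):

data = data.drop(columns=not_needed)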
  2. You're not doing anything to the players list inside the function, so there's no need to return it (or anything, really). All the function does is check whether the file is already there and write it if it isn't.

  3. Selenium is overkill here (and slow, since everything has to go through the browser). The NBA stats API can get you all the season and player data in one request. Then you simply filter that table instead of doing it through the browser.

  4. To filter that data from the API, we need an exact match between the player name you provide and what's in the data, and it's case-sensitive too. To account for typos and name differences between your players list and the data (e.g. 'Glenn Robinson' won't return anything from the table, since the data has 'Glenn Robinson III'), I added one extra step using a package called fuzzywuzzy. Make sure to pip install fuzzywuzzy to make it work.

  5. I didn't do anything more with your code, but keep in mind: if you need to update your files (so if you run it today and then again next week), they won't include any new games, since you are only checking whether the file is present, not whether it's up to date. A possible staleness check is sketched right after this list.
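For point 5, a minimal sketch of such a staleness check based on the file's modification time; the needs_refresh helper and the one-day default are hypothetical additions, not part of the code below:

import time
from pathlib import Path

def needs_refresh(path: Path, max_age_days: float = 1.0) -> bool:
    # True if the file is missing or was last modified more than
    # max_age_days ago (86400 seconds per day); hypothetical helper.
    if not path.is_file():
        return True
    return time.time() - path.stat().st_mtime > max_age_days * 86400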

Code:

import requests
import pandas as pd
from pathlib import Path

# pip install fuzzywuzzy
from fuzzywuzzy import process

def get_data():
    url = 'https://stats.nba.com/stats/leaguegamelog'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
        'Referer': 'http://stats.nba.com'}
    
    payload = {
        'Counter': '1000',
        'DateFrom': '',
        'DateTo': '',
        'Direction': 'DESC',
        'LeagueID': '00',
        'PlayerOrTeam': 'P',
        'Season': '2020-21',
        'SeasonType': 'Regular Season',
        'Sorter': 'DATE'}
    
    jsonData = requests.get(url, headers=headers, params=payload).json()
    
    cols = jsonData['resultSets'][0]['headers']
    data = jsonData['resultSets'][0]['rowSet']
    df = pd.DataFrame(data, columns=cols)
    return df



def Game_Log_Scraper(players):
    data = get_data()

    # Build the list of valid player names once, outside the loop
    choices = list(data['PLAYER_NAME'].unique())

    for name in players:
        # Use fuzzywuzzy to match each input name to the closest real name
        player = process.extractOne(name, choices)[0]
        
        #if not Path(f'/Users/arslanamir/PycharmProjects/nba/{first} {last}').is_file():
        if not Path(f'/Users/arslanamir/PycharmProjects/nba/{player}.csv').is_file():
            # .copy() so the inplace drops below don't raise SettingWithCopyWarning
            player_df = data[data['PLAYER_NAME'] == player].copy()
            
            not_needed = ['MATCHUP', 'SEASON_ID', 'FGM', 'FGA', 'FG3M', 'FG3A',
                          'FG3_PCT', 'FTM', 'FTA', 'WL', 'FT_PCT', 'STL', 'BLK',
                          'TOV', 'PLUS_MINUS', 'FANTASY_PTS', 'FG_PCT', 'OREB',
                          'DREB', 'PF', 'VIDEO_AVAILABLE']

            player_df.drop(not_needed, axis=1, inplace=True)
            player_df.dropna(axis=0, inplace=True)
            
            player_df.to_csv(f'/Users/arslanamir/PycharmProjects/nba/{player}.csv', index=False)
            print(f'{player} file saved.')
            
        else:
            print(f'{player} file already present.')

    
players = ['Zach LaVine', 'ZaCk LeViNE', 'LeBron James', 'Labron james Jr.', 'le brn jame Jr.']
Game_Log_Scraper(players)

Output:

Zach LaVine file saved.
Zach LaVine file already present.
LeBron James file saved.
LeBron James file already present.
LeBron James file already present.

LeBron James.csv (screenshot of the saved file)
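To check a saved file beyond the screenshot, you can read it back with pandas (the path matches the one the code above writes to):

import pandas as pd

df = pd.read_csv('/Users/arslanamir/PycharmProjects/nba/LeBron James.csv')
print(df.head())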

– chitown88