I am trying to parse player-level NBA box score data from ESPN. The following is the initial portion of my attempt:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date
request = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(request.text,'html.parser')
table = soup.find_all('table')
BeautifulSoup seems to be giving me a strange result. The last 'table' in the page source contains the player data, and that is what I want to extract. In the page source as viewed in the browser, this table is closed at line 421, which is AFTER both teams' box scores. In 'soup', however, there is an added closing tag that ends the table BEFORE the Miami stats, at the point corresponding to line 350 of the page source.
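One way to check where the table is opened and closed in the raw response, independently of BeautifulSoup, is to walk the markup with the standard library's `HTMLParser` and record the source line of every `<table>` and `</table>` tag. The sketch below is only a diagnostic under that assumption; `TableTracer`, `trace_tables` and the `sample` string are hypothetical names and data, not part of the original code.

```python
from html.parser import HTMLParser


class TableTracer(HTMLParser):
    """Record the source line of every <table> start tag and end tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <table> nesting depth
        self.events = []  # (line number, 'open' or 'close', depth after the tag)

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.depth += 1
            self.events.append((self.getpos()[0], 'open', self.depth))

    def handle_endtag(self, tag):
        if tag == 'table':
            self.depth -= 1
            self.events.append((self.getpos()[0], 'close', self.depth))


def trace_tables(html):
    """Return the list of table open/close events found in html."""
    tracer = TableTracer()
    tracer.feed(html)
    tracer.close()
    return tracer.events


# Made-up markup with one stray closing tag, for illustration only.
sample = "<table>\n<tr><td>BOS</td></tr>\n</table>\n<tr><td>MIA</td></tr>\n</table>\n"
for line, kind, depth in trace_tables(sample):
    print(line, kind, depth)
# 1 open 1
# 3 close 0
# 5 close -1
```

Calling `trace_tables(request.text)` on the downloaded page would list the line of each table tag in the raw response. A depth that drops below zero points at a closing tag with no matching opening tag, and the line numbers can be compared against lines 350 and 421 of the source shown in the browser.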
The output from the parser 'html.parser' is:
Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »
1 2 3 4 T
BOS 25 29 22 31107MIA 31 31 31 27120
Boston Celtics
STARTERS
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS
Kevin Garnett, PF324-80-01-11111220254-49
Brandon Bass, PF286-110-03-4651110012-815
Paul Pierce, SF416-152-49-905552003-1723
Rajon Rondo, PG449-140-22-4077130044-1320
Courtney Lee, SG245-61-10-001110015-711
BENCH
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS
Jared Sullinger, PF81-20-00-001100001-32
Jeff Green, SF230-40-03-403301010-73
Jason Terry, SG252-70-34-400011033-108
Leandro Barbosa, SG166-83-31-201110001+416
Chris Wilcox, PFDNP COACH'S DECISION
Kris Joseph, SFDNP COACH'S DECISION
Jason Collins, CDNP COACH'S DECISION
Darko Milicic, CDNP COACH'S DECISIONTOTALS
FGM-A
3PM-A
FTM-A
OREB
As you can see, it ends mid-table at 'OREB' and it never makes it to the Miami Heat section. The output using 'lxml' parser is:
Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »
1 2 3 4T
BOS 25 29 22 31107MIA 31 31 31 27120
This doesn't include the box scores at all. The complete code I'm using (due to Daniel Rodriguez) looks something like:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date
games = pd.read_csv('games_13.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'
request = requests.get(BASE_URL.format(games.index[0]))
table = BeautifulSoup(request.text,'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers
players = pd.DataFrame(columns=columns)
def get_players(players, team_name):
    array = np.zeros((len(players), len(headers)+1), dtype=object)
    array[:] = np.nan
    for i, player in enumerate(players):
        cols = player.find_all('td')
        array[i, 0] = cols[0].text.split(',')[0]
        for j in range(1, len(headers) + 1):
            if not cols[1].text.startswith('DNP'):
                array[i, j] = cols[j].text
    frame = pd.DataFrame(columns=columns)
    for x in array:
        line = np.concatenate(([index, team_name], x)).reshape(1,len(columns))
        new = pd.DataFrame(line, columns=frame.columns)
        frame = frame.append(new)
    return frame
for index, row in games.iterrows():
    print(index)
    request = requests.get(BASE_URL.format(index))
    table = BeautifulSoup(request.text, 'html.parser').find('table', class_='mod-data')
    heads = table.find_all('thead')
    bodies = table.find_all('tbody')
    team_1 = heads[0].th.text
    team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
    team_1_players = get_players(team_1_players, team_1)
    players = players.append(team_1_players)
    team_2 = heads[3].th.text
    team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
    team_2_players = get_players(team_2_players, team_2)
    players = players.append(team_2_players)
players = players.set_index('id')
print(players)
players.to_csv('players_13.csv')
A sample of the output I'd like is:
,id,team,player,MIN,FGM-A,3PM-A,FTM-A,OREB,DREB,REB,AST,STL,BLK,TO,PF,+/-,PTS
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
...
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26
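Since 'html.parser' and 'lxml' give different results above, it may help to count how many table elements each parser actually finds in the same markup. The following is a minimal sketch, assuming `bs4` is installed; `count_by_parser` and the `sample` markup are hypothetical, and any parser that is not installed is simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_by_parser(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Count table-related tags as seen by each available parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser library is not installed; leave it out of the result.
            continue
        counts[name] = {tag: len(soup.find_all(tag))
                        for tag in ('table', 'thead', 'tbody', 'tr')}
    return counts


# Small, well-formed stand-in for the box score markup (not the real ESPN page).
sample = ('<table><thead><tr><th>MIN</th></tr></thead>'
          '<tbody><tr><td>32</td></tr></tbody></table>')
for name, found in count_by_parser(sample).items():
    print(name, found)
```

With the real page, `count_by_parser(request.text)` would show whether a parser sees all of the `thead` and `tbody` sections that `heads[3]` and `bodies[4]` in the loop above rely on, or whether it stops before the Miami section.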