I am trying to extract the data from a table on a webpage, but I keep receiving the error in the title (AttributeError: 'ResultSet' object has no attribute 'find_all'). I have looked at the examples on this site, as well as on others, but none of them deal directly with my problem. Please see my code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table', class_='dataframe')
rows = table.find_all('tr')[2:]
data = {
    'RK': [],
    'PLAYER': [],
    'TEAM': [],
    'GP': [],
    'G': [],
    'A': [],
    'PTS': [],
    '+/-': [],
    'PIM': [],
    'PTS/G': [],
    'SOG': [],
    'PCT': [],
    'GWG': [],
    'G1': [],
    'A1': [],
    'G2': [],
    'A2': []
}
for row in rows:
    cols = row.find_all('td')
    data['RK'].append(cols[0].get_text())
    data['PLAYER'].append(cols[1].get_text())
    data['TEAM'].append(cols[2].get_text())
    data['GP'].append(cols[3].get_text())
    data['G'].append(cols[4].get_text())
    data['A'].append(cols[5].get_text())
    data['PTS'].append(cols[6].get_text())
    data['+/-'].append(cols[7].get_text())
    data['PIM'].append(cols[8].get_text())
    data['PTS/G'].append(cols[9].get_text())
    data['SOG'].append(cols[10].get_text())
    data['PCT'].append(cols[11].get_text())
    data['GWG'].append(cols[12].get_text())
    data['G1'].append(cols[13].get_text())
    data['A1'].append(cols[14].get_text())
    data['G2'].append(cols[15].get_text())
    data['A2'].append(cols[16].get_text())
df = pd.DataFrame(data)
df.to_csv("NHL_Players_Stats.csv")
I have got rid of the error by working out that it refers to table (i.e. the ResultSet returned by find_all) not having the method find_all, and I got the code running by commenting out the following line:
#rows = table.find_all('tr')[2:]
and by changing this line:
for row in rows:
This, however, does not extract any data from the webpage and simply creates a .csv file containing only the column headers.
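For reference, my understanding of the error (just a sketch, assuming the class name 'dataframe' actually matches a table on the page) is that soup.find_all returns a ResultSet, essentially a list of tags, so the row lookup has to be done on a single Tag rather than on the whole ResultSet, something like:

# find_all returns a ResultSet (a list of Tag objects); only an individual Tag has find_all
tables = soup.find_all('table', class_='dataframe')
if tables:  # the ResultSet behaves like a list and may be empty
    rows = tables[0].find_all('tr')[2:]  # call find_all on the first Tag, not on the ResultSet

Is that the right way to think about it?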
I have also tried to extract some data directly into rows using soup.find_all, but I get the following error:
data['GP'].append( cols[3].get_text() )
IndexError: list index out of range
which I have not been able to resolve.
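One workaround I have considered (only a sketch, assuming the offending rows are header or section rows that contain fewer than the 17 td cells of a stat row) is to skip rows that are too short:

for row in rows:
    cols = row.find_all('td')
    if len(cols) < 17:  # assumption: header/section rows carry fewer td cells
        continue        # skip them so that cols[3] and the later indices exist
    data['RK'].append(cols[0].get_text())
    # ... remaining columns appended exactly as in the code above

but I am not sure whether this is a proper fix or whether it just hides the underlying problem.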
Therefore, any help would be very much appreciated.
Also, out of curiosity, is there any way to achieve the desired outcome using:
dataframe = pd.read_html(url)
because I have tried this as well, but I keep getting:
FeatureNotFound: Couldn't find a tree builder with the features you
requested: html5lib. Do you need to install a parser library?
Ideally this is the method I would prefer, but I can't find any examples online.
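For completeness, this is roughly what I am attempting with read_html (a minimal sketch, assuming the missing parser libraries are installed first, e.g. pip install lxml html5lib beautifulsoup4, and assuming the stats table is the first table on the page):

import pandas as pd

url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'
tables = pd.read_html(url)  # returns a list of DataFrames, one per <table> found
df = tables[0]              # assumption: the player stats table is the first one on the page
df.to_csv("NHL_Players_Stats.csv", index=False)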