
I am trying to extract the data from a table on a webpage, but keep receiving the error above (AttributeError: 'ResultSet' object has no attribute 'find_all'). I have looked at the examples on this site, as well as others, but none deal directly with my problem. Please see the code below:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table', class_='dataframe')
rows = table.find_all('tr')[2:]

data = {
    'RK' : [],
    'PLAYER' : [],
    'TEAM' : [],
    'GP' : [],
    'G' : [],
    'A' : [],
    'PTS' : [],
    '+/-' : [],
    'PIM' : [],
    'PTS/G' : [],
    'SOG' : [],
    'PCT' : [],
    'GWG' : [],
    'G1' : [],
    'A1' : [],
    'G2' : [],
    'A2' : []
}

for row in rows:
    cols = row.find_all('td')
    data['RK'].append( cols[0].get_text() )
    data['PLAYER'].append( cols[1].get_text() )
    data['TEAM'].append( cols[2].get_text() )
    data['GP'].append( cols[3].get_text() )
    data['G'].append( cols[4].get_text() )
    data['A'].append( cols[5].get_text() )
    data['PTS'].append( cols[6].get_text() )
    data['+/-'].append( cols[7].get_text() )
    data['PIM'].append( cols[8].get_text() )
    data['PTS/G'].append( cols[9].get_text() )
    data['SOG'].append( cols[10].get_text() )
    data['PCT'].append( cols[11].get_text() )
    data['GWG'].append( cols[12].get_text() )
    data['G1'].append( cols[13].get_text() )
    data['A1'].append( cols[14].get_text() )
    data['G2'].append( cols[15].get_text() )
    data['A2'].append( cols[16].get_text() )

df = pd.DataFrame(data)

df.to_csv("NHL_Players_Stats.csv")

I have eliminated the error, having realised that it referred to the table (i.e. the ResultSet) not having a find_all method, and got the code running by commenting out the following line:

#rows = table.find_all('tr')[2:]

and changing this:

for row in rows:

to iterate over the ResultSet directly:

for row in table:

This, however, does not extract any data from the webpage; it simply creates a .csv file containing only the column headers.

I have tried to extract some data directly into rows using soup.find_all, but get the following error:

    data['GP'].append( cols[3].get_text() )
IndexError: list index out of range

which I have not been able to resolve.

Therefore, any help would be very much appreciated.

Also, out of curiosity, are there any ways to achieve the desired outcome using:

dataframe = pd.read_html(url)

because I have tried this also, but keep getting:

FeatureNotFound: Couldn't find a tree builder with the features you
requested: html5lib. Do you need to install a parser library?

Ideally, this is the method that I would prefer, but I can't find any examples online.

aLoHa
    You couldn't find [html5lib](https://pypi.python.org/pypi/html5lib)? Well, there ya go :) – TemporalWolf Dec 09 '16 at 18:34
  • @TemporalWolf, yes that would seem to be the case. But, it seems to find it when I use it with the BeautifulSoup() method. Any suggestions? :) – aLoHa Dec 09 '16 at 20:31

2 Answers


find_all returns a ResultSet, which is basically a list of elements. For this reason it has no find_all method of its own; that method belongs to an individual element.

If you only want one table, use find instead of find_all to look for it.

table = soup.find('table', class_='dataframe')

Then, getting its rows should work as you have already done:

rows = table.find_all('tr')[2:]

The second error you got is because, for some reason, one of the table's rows seems to have only 3 cells, so your cols variable becomes a list with only indexes 0, 1 and 2. That's why cols[3] gives you an IndexError.
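Putting both fixes together, here is a minimal sketch of the corrected scraping loop, with a length guard so short rows are skipped instead of raising an IndexError. It keeps the class name 'dataframe' from the question, which I have not verified against ESPN's actual markup:

from bs4 import BeautifulSoup
import requests

url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'
soup = BeautifulSoup(requests.get(url).text, "lxml")

# find returns a single Tag (or None), not a ResultSet,
# so calling find_all on it works
table = soup.find('table', class_='dataframe')
rows = table.find_all('tr')[2:]

for row in rows:
    cols = row.find_all('td')
    if len(cols) < 17:  # skip sub-header/short rows instead of raising IndexError
        continue
    # append cols[0].get_text(), cols[1].get_text(), ... as in the question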

lucasnadalutti
  • Thanks for the comments. I did try your suggestion, but the second line, rows = table.find_all, is still throwing up that error, even when the first line is changed to table = soup.find( .....). With regard to the second error, that seems to make sense. Would it, therefore, be better if I start the extraction from row(1) rather than row(0), which is what is causing the problem? – aLoHa Dec 09 '16 at 20:23

In terms of achieving the same outcome using dataframe = pd.read_html(url):

It is achieved using just that, or something similar: dataframe = pd.read_html(url, header=1, index_col=None)
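For reference, pd.read_html returns a list of DataFrames, one per <table> found on the page, so you still have to pick one out. A minimal sketch, assuming the stats table is the first table on the page:

import pandas as pd

url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'

# read_html returns a list of DataFrames, one per <table> on the page;
# header=1 takes the second row of the table as the column names
tables = pd.read_html(url, header=1, index_col=None)
df = tables[0]  # assumes the stats table is the first table on the page
df.to_csv("NHL_Players_Stats.csv")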

The reason I was receiving errors previously is that I had not set Spyder's IPython console backend to 'automatic' in Preferences.

I am still, however, trying to resolve this problem using BeautifulSoup, so any useful comments would be appreciated.

aLoHa