Using pandas to download/load an xls file from a URL

Question

I am trying to load the Excel file from the following URL into a dataframe using Python 3.5 and Pandas:

link = "https://hub.coursera-notebooks.org/user/ejquqxfjajkufidbixxvkx/notebooks/Energy%20Indicators.xls"

First I tried to download the file manually using urllib.request in order to read it right after:

import urllib.request
urllib.request.urlretrieve(link, "Energy Indicators.xls")

I got the file "Energy Indicators.xls", yes, but it is not a valid xls file. It looks more like an HTML file with the extension changed to xls.
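One way to confirm that suspicion is to look at the first bytes of the downloaded file. The sketch below is only an illustration: the helper name `sniff_download` is made up here, and it relies on the fact that legacy .xls workbooks are OLE2 compound files that begin with a fixed 8-byte signature, while an HTML page typically begins with `<!DOCTYPE` or `<html`.

```python
# Signature found at the start of every OLE2 compound file (legacy .xls).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_download(data: bytes) -> str:
    """Guess whether downloaded bytes are a legacy .xls workbook or HTML.

    Hypothetical helper for illustration; returns "xls", "html" or "unknown".
    """
    if data.startswith(OLE2_MAGIC):
        return "xls"
    # HTML pages may start with blank lines, so strip leading whitespace.
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doc") or head.startswith(b"<html"):
        return "html"
    return "unknown"


# Usage: bytes shaped like the start of the file described above.
print(sniff_download(b"\n\n\n<!DOCTYPE html><html>"))
```

To check the real download, the first 512 bytes of "Energy Indicators.xls" opened in binary mode can be passed to the same function.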

Then I tried to load the file directly using read_csv:

import pandas as pd
energy = pd.read_csv(link, skiprows=16, header=0, skipfooter=38)

But I got a traceback: "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2". If I read it without the arguments skiprows, header, etc., I get a different error: "ValueError: Expected 1 fields in line 41, saw 3".

Any ideas? By the way, I am using macOS Sierra and PyCharm Community Edition 2016.3.

It seems you need `read_excel`: `energy = pd.read_excel(link, skiprows = 16, header = 0, skipfooter = 38)` - jezrael, Dec 18 '16 at 20:23
Almost. I got a new error: "xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\n\n\n<!DOC'" - Antonio Serrano, Dec 18 '16 at 20:26
In that case, any idea how to specify my credentials? - Antonio Serrano, Dec 18 '16 at 20:37
It's going to depend on how Coursera does things. Here is an example using http.client that may work: http://stackoverflow.com/a/7000784/642070. The [requests](http://docs.python-requests.org/en/master/) module may be sufficient (`requests.get(theurl, auth=('user', 'passwd'))`). There are also Coursera downloaders, including a Python module on PyPI: [coursera](https://pypi.python.org/pypi/coursera). - tdelaney, Dec 18 '16 at 20:50
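The `requests` suggestion in the last comment could look roughly like the sketch below. It is a guess, not a confirmed recipe: whether Coursera accepts HTTP basic auth on this URL is unknown, and the names `excel_buffer_from_response`, `load_energy`, `user` and `password` are hypothetical. The `skiprows`, `header` and `skipfooter` values are taken from the question. The sketch also refuses to hand an HTML login page to `read_excel`, which is what the "Expected BOF record; found b'\n\n\n<!DOC'" error points to.

```python
import io


def excel_buffer_from_response(content: bytes, content_type: str = "") -> io.BytesIO:
    """Wrap response bytes for pd.read_excel, rejecting HTML pages.

    Hypothetical helper: raises ValueError if the body looks like HTML.
    """
    starts_like_html = content.lstrip()[:5].lower() in (b"<!doc", b"<html")
    if "text/html" in content_type.lower() or starts_like_html:
        raise ValueError("Server returned HTML (probably a login page), not an Excel file")
    return io.BytesIO(content)


def load_energy(url: str, user: str, password: str):
    """Download with HTTP basic auth and parse (assumes the server accepts it)."""
    # Imported here so the helper above works without these packages installed.
    import pandas as pd
    import requests

    resp = requests.get(url, auth=(user, password), timeout=30)
    resp.raise_for_status()
    buf = excel_buffer_from_response(resp.content, resp.headers.get("Content-Type", ""))
    # Reading legacy .xls files also requires the xlrd package.
    return pd.read_excel(buf, skiprows=16, header=0, skipfooter=38)
```

If the site uses cookie-based or token-based login rather than basic auth, the check in `excel_buffer_from_response` will raise instead of producing the xlrd error, which at least makes the cause explicit.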

score 2 · Accepted Answer · answered Sep 18 '17 at 02:43

For this specific Coursera exercise (not as a general solution), you don't need to pass the whole URL to the read_excel function. Pass just the file name 'Energy Indicators.xls', which is resolved relative to the notebook's working directory:

energy = pd.read_excel('Energy Indicators.xls',...)
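A slightly more defensive version of the same call is sketched below. The function name `load_energy_local` is hypothetical, and it assumes the file sits in the working directory of the process, as it does for the Coursera notebook according to this answer. The arguments in the usage comment are the ones from the question, not something this answer specifies.

```python
from pathlib import Path


def load_energy_local(path="Energy Indicators.xls", **read_kwargs):
    """Read a local Excel file, failing early with a clear message if it is missing.

    Hypothetical wrapper around pd.read_excel; extra keyword arguments
    are forwarded unchanged.
    """
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"{file} not found in {Path.cwd()}")
    # Deferred import: pandas (and xlrd for .xls) are only needed when reading.
    import pandas as pd

    return pd.read_excel(file, **read_kwargs)


# Usage, with the arguments from the question:
# energy = load_energy_local(skiprows=16, header=0, skipfooter=38)
```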

Eduard3192993