
I am trying to load the Excel file from the following URL into a dataframe using Python 3.5 and Pandas:

link = "https://hub.coursera-notebooks.org/user/ejquqxfjajkufidbixxvkx/notebooks/Energy%20Indicators.xls"

First I tried to download the file manually using urllib.request in order to read it right after:

import urllib.request
urllib.request.urlretrieve(link, "Energy Indicators.xls")

I did get a file named "Energy Indicators.xls", but it is not a valid xls file. It looks more like an HTML file with the extension changed to xls.
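One way to confirm that suspicion is to look at the first bytes of the download. This is a minimal sketch, not part of the original attempt: the helper names `sniff_download` and `sniff_file` are hypothetical, and it relies only on the fact that legacy .xls files are OLE2 compound documents that start with the signature D0 CF 11 E0 A1 B1 1A E1.

```python
# Signature that opens every OLE2 compound document (legacy .xls included).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_download(head: bytes) -> str:
    """Classify the leading bytes of a file as 'xls', 'html', or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    # HTML pages often begin with blank lines, so strip whitespace first.
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype") or text.startswith(b"<html"):
        return "html"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first bytes of a file on disk and classify them (hypothetical helper)."""
    with open(path, "rb") as fh:
        return sniff_download(fh.read(64))
```

Calling `sniff_file("Energy Indicators.xls")` on the downloaded file would be expected to return 'html' if the server sent a web page (for example a login page) instead of the spreadsheet.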

Then I tried to load the file directly using read_csv:

energy = pd.read_csv(link, skiprows = 16, header = 0, skipfooter = 38)

But I got a traceback: "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2". When I read it without the skiprows, header, etc. arguments, I got a different error: "ValueError: Expected 1 fields in line 41, saw 3".

Any ideas? BTW, I am using macOS Sierra and PyCharm Community Edition 2016.3.

Antonio Serrano
  • It seems you need `read_excel`: `energy = pd.read_excel(link, skiprows = 16, header = 0, skipfooter = 38)` – jezrael Dec 18 '16 at 20:23
  • Almost. I got a new error: "xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\n\n\n<!DOC'" – Antonio Serrano Dec 18 '16 at 20:26
  • Hmmm, it seems complicated, because it needs authentication. – jezrael Dec 18 '16 at 20:31
  • In that case, any idea how to specify my credentials? – Antonio Serrano Dec 18 '16 at 20:37
  • It's going to depend on how Coursera does things. Here is an example using http.client that may work: http://stackoverflow.com/a/7000784/642070. The [requests](http://docs.python-requests.org/en/master/) module may be sufficient (`requests.get(theurl, auth=('user', 'passwd'))`). There are also Coursera downloaders, including a Python module on PyPI: [coursera](https://pypi.python.org/pypi/coursera). – tdelaney Dec 18 '16 at 20:50
  • For me it works without password... Maybe firewall? – jezrael Dec 19 '16 at 08:55
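The credential idea from the comments can be sketched with the standard library alone. This uses `urllib.request` with HTTP Basic authentication rather than the `requests` call quoted above, and it is only an assumption that the server accepts Basic auth; Coursera may well use a cookie or token login instead, in which case this will not work. The function names `build_basic_auth_opener` and `download` are hypothetical.

```python
import urllib.request


def build_basic_auth_opener(url, user, password):
    """Return an opener that answers HTTP Basic auth challenges for `url`.

    Assumption: the server uses Basic auth. This is not confirmed for Coursera.
    """
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URL".
    mgr.add_password(None, url, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(mgr)
    return urllib.request.build_opener(handler)


def download(url, dest, user, password):
    """Fetch `url` with credentials and write the raw bytes to `dest`."""
    opener = build_basic_auth_opener(url, user, password)
    with opener.open(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())
```

If the saved file still begins with HTML instead of spreadsheet data, the site is most likely redirecting to a login page, and Basic auth is not the mechanism it expects.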

1 Answer


For this specific Coursera exercise (this is not a general solution), you do not need to pass the whole URL to read_excel. Pass just the file name, 'Energy Indicators.xls':

energy = pd.read_excel('Energy Indicators.xls',...)
Eduard3192993