Pandas read_html taking way too long in Anaconda Spyder--why?

Question

I'm pretty sure this is basically an Anaconda/Spyder question, and probably a dumb newbie one at that. Basically: why is my code taking SO LONG in Spyder?

Background--I've been using pandas to read from Excel files (read_excel), manipulate the data, etc. Now I have a very large data table that was generated in Business Objects (BO) and saved as an HTML file. (I couldn't save the BO file as Excel because that would generate an .xls file, not .xlsx, and these files often contain too many rows for .xls to handle.)

So I ran the code below in Anaconda Spyder, and it generated a list containing 1 item, and that item is a DataFrame representing my table (in this case, 47K rows x 5 columns). That's fine, but the problem is it took 1.5 hours to do it! Clearly there's something wrong, because I accomplished the same thing in VBA (from Excel) and it took about 1 minute. So what am I doing wrong?

import pandas as pd
import datetime as dt
print('start', dt.datetime.now())
fn = r'C:\[filepath goes here]\my_file.htm'
xx = pd.read_html(fn)
print('end',dt.datetime.now())
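
To compare the Spyder and PowerShell runs on equal terms, the elapsed time can be measured directly instead of read off two timestamps. This is only a minimal sketch: the helper name `timed` is hypothetical (not part of pandas or Spyder), and it relies solely on `time.perf_counter` from the standard library.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the call from the question (fn is the file path defined above):
# import pandas as pd
# tables, seconds = timed(pd.read_html, fn)
# print(f"read_html took {seconds:.1f} s and returned {len(tables)} table(s)")
```

Running the same snippet in both environments gives one number per environment, which makes the difference easier to report.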

I should mention that if I open PowerShell (I'm on Windows 10) and run the same code in python, it takes maybe 20 seconds, and I can get at the dataframe with this:

df = pd.DataFrame(xx[0])

but obviously PowerShell doesn't have the functionality of Spyder (e.g., the Variable Explorer) and I want to do some additional data manipulation. Furthermore, if I try to write to Excel from PowerShell with

df.to_excel(<filepath>.xlsx, index = False, engine = io.excel.xlsx.writer)

I get NameError: name 'io' is not defined. So that's pretty much a roadblock at my beginner's level, too.
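
On the NameError: `io.excel.xlsx.writer` is the name of a pandas option (reachable as `pd.options.io.excel.xlsx.writer`), not a Python object, so writing it bare makes Python look for a variable called `io`. The `engine` parameter of `to_excel` expects a string such as `"openpyxl"` or `"xlsxwriter"`. A minimal sketch, assuming one of those packages is installed; the helper name `save_to_excel` and the output file name are hypothetical:

```python
def save_to_excel(df, path, engine="openpyxl"):
    """Write a DataFrame to an .xlsx file.

    `engine` must be a string naming an installed Excel writer package,
    e.g. "openpyxl" or "xlsxwriter" (assumed to be installed here).
    """
    df.to_excel(path, index=False, engine=engine)


# Usage with the DataFrame from the question (output path is hypothetical):
# import pandas as pd
# df = xx[0]  # read_html already returns a list of DataFrames
# save_to_excel(df, "my_file.xlsx")
```

Omitting `engine` altogether should also work for .xlsx paths, since pandas then picks a writer based on the file extension.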

(*Spyder maintainer here*) Could you share `my_file.htm` file (privately, if required) so we can take a look at it? If not possible, could you at least share a small portion of it? Without looking at your data it's not possible for us to understand what's happening in your case. — Carlos Cordoba, Oct 15 '20 at 15:23
Yes, of course, and thank you...there are some sensitive data so I need to anonymise some values first, will get to it as soon as possible. Thanks again. — P E, Oct 16 '20 at 12:41
Thanks, I really appreciate it! I'll follow up this question then for updates. Or if you prefer to send me your data directly, you can find my email [here](https://github.com/ccordoba12). — Carlos Cordoba, Oct 16 '20 at 16:44
