I'm trying to read several .html files with pd.read_html(). Each .html file is in a different directory, so I've iterated over the directories and collected each file's full path in a list called html_paths. I want to iterate over this list and read each file in html_paths with pd.read_html().

I've tried to iterate over the html_paths like this:

for i in range(len(html_paths)):
    html_files = pd.read_html(html_paths[i])

I also tried to glob the original html_paths I set up with this:

for i in p.glob('**/*.html'):
    html_files = pd.read_html(i)

Whichever way I iterate over my pathlib list, I get an error similar to TypeError: Cannot read object type 'WindowsPath'

So far I've written:

from pathlib import Path

# initialize path (raw string, so '\t' is not read as a tab)
p = Path(r'C:\path\to\mother\directory')

# iterate over all directories within mother directory
# glob all html files
html_paths = [file for file in p.glob('**/*.html')]

Now I want to iterate over each file in html_paths and read it with pd.read_html().

1 Answer

Your html_paths list contains Path objects, not the strings that read_html is expecting. Try converting each one to a string:

for I in range(len(html_paths)):
    html_files = pd.read_html(str(html_paths[I]))
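
Note that the loop above rebinds html_files on every pass, so only the last file's tables are left when it finishes. A minimal sketch of one way to keep every result is below. The function name read_html_tables and its reader parameter are hypothetical (they are not part of pandas); reader exists only so the loop can be swapped out or tried without pandas, and by default it falls back to pd.read_html. The sketch also assumes you would rather record a file that raises ValueError (which pandas does with "No tables found") than stop at the first one.

```python
from pathlib import Path


def read_html_tables(paths, reader=None):
    """Read each HTML file in `paths` and return (tables, failures).

    `reader` is a hypothetical hook for this sketch; when omitted,
    pandas.read_html is used.
    """
    if reader is None:
        import pandas as pd  # imported lazily so the hook can replace it
        reader = pd.read_html

    tables = {}    # Path -> whatever the reader returned for that file
    failures = {}  # Path -> error message for files that could not be parsed
    for path in paths:  # iterate directly: no index variable to mistype
        try:
            # str() converts the Path object to the string the answer recommends
            tables[path] = reader(str(path))
        except ValueError as exc:
            # pandas raises ValueError("No tables found") for table-less files
            failures[path] = str(exc)
    return tables, failures
```

With the html_paths list from the question, this would be called as tables, failures = read_html_tables(html_paths). Each value in tables is then the list of DataFrames that pd.read_html returned for that file, and failures shows which files, if any, had no parsable table.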
Alex
  • It looks like that is now resulting in a `ValueError` that says "no tables found" – trombonebraveheart Mar 28 '19 at 00:01
  • Are you sure the HTML file you are trying to parse has a table in it? I'm not a pandas user. It sounds like your first problem has been resolved and you have a new error/question to work through. – Alex Mar 28 '19 at 00:05
  • Oh wait, I see that I can use pd.read_html() if I cast html_paths as a string one index at a time (e.g. pd.read_html(str(html_paths[0])) gives me the result I want). However, when I try and iterate over them like in your example it throws the "No tables found" error. – trombonebraveheart Mar 28 '19 at 00:07
  • Part of my problem was resolved, in that I now know to cast each item in html_paths to a string in order to read it with pd.read_html(). But the main point of my question was to be able to iterate over each file and read it into pandas. Also, I'm sure each html file has a table in it. I can read them individually; I'm just having problems iterating over them. – trombonebraveheart Mar 28 '19 at 00:13
  • I had a typo in my answer, the loop variable `I` was capitalized in the `for I in` part but not in the `html_paths[I]` part. I fixed it in the answer. – Alex Mar 28 '19 at 00:36