I have a test PDF file containing just a 3x3 table that is marked up properly with table headings and so on. What I want to do is extract the structure of the table. For example:
left | center | right |
---|---|---|
One | Two | Three |
If that table were in the PDF, I want to be able to determine programmatically that it has three header cells and one row of data.
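To make the goal concrete, here is a minimal sketch of the kind of result I am after. It works on the Markdown form of the table above rather than on the PDF, and `describe_table` is a hypothetical helper name I made up for illustration, not part of fitz:

```python
def describe_table(markdown):
    """Summarise a pipe-delimited Markdown table (hypothetical helper)."""

    def split_row(line):
        # Drop leading/trailing pipes, then split and trim the cells.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [ln for ln in markdown.splitlines() if ln.strip()]
    headers = split_row(lines[0])
    # lines[1] is the ---|---|--- separator; everything after it is data.
    rows = [split_row(ln) for ln in lines[2:]]
    return {"headers": headers, "rows": rows}


table = """\
left | center | right |
---|---|---|
One | Two | Three |
"""

info = describe_table(table)
print(len(info["headers"]), "headers:", info["headers"])
print(len(info["rows"]), "data row(s):", info["rows"])
```

I would like to get the same kind of summary (header cells and data rows) directly from the PDF page.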
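I am using fitz (PyMuPDF), and when I run this code:

    import fitz  # PyMuPDF

    doc = fitz.open("test.pdf")  # placeholder path to my test file
    for page in doc:
        tp = page.get_textpage()  # TextPage holding the page's extracted text
        html = tp.extractHTML()   # page content as HTML
        print(html)
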
The output seems to drop all of the actual table markup and contains only paragraph and div tags. What am I doing wrong?