Extracting html structure from PDF

Asked Nov 03 '21 at 17:24

Active Nov 03 '21 at 17:24

Viewed 375 times

I have a test PDF file containing just a 3x3 table that is properly marked up with table headings and the like. What I want to do is extract the structure of the table, like so:

left	center	right
One	Two	Three

If that table were in the PDF, I want to be able to determine programmatically that the table has three headers and one row of data.

I am using fitz (PyMuPDF), and when I run this code:

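import fitz  # PyMuPDF

doc = fitz.open("test.pdf")  # placeholder path to the test PDF
for page in doc:
    tp = page.get_textpage()   # text page built from this page's content
    html = tp.extractHTML()    # HTML rendering of the page text
    print(html)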

The output seems to drop all of the actual table markup and contain only paragraph and div tags. What am I doing wrong?
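
To make the symptom concrete, here is a minimal sketch that counts which tags an HTML string contains, so the absence of table markup can be checked programmatically. It uses only the standard library. The `sample` string is a hypothetical stand-in for the extractHTML() output described above, not the real output for the test file, and `TagCounter` and `count_tags` are illustrative names.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> number of opening tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample resembling the div/p output described above.
sample = '<div id="page0"><p>left</p><p>center</p><p>right</p></div>'
counts = count_tags(sample)
has_table_markup = any(counts[t] for t in ("table", "tr", "th", "td"))
print(dict(counts), has_table_markup)
```

If `has_table_markup` is False for the real output, the table tags are not being emitted at all, as opposed to being present but nested somewhere unexpected.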

asked Nov 03 '21 at 17:24

Mat


Could you share the file? – Jackson H Nov 03 '21 at 17:26
https://www.yumpu.com/en/document/view/65955160/sample – Mat Nov 03 '21 at 18:40
@KJ when I deflate the PDF and open it in an editor, I can see TH and TR headings in the file; they are just really obscured – Mat Nov 03 '21 at 18:41
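
The manual check described in the comment above could be automated roughly like this. This is a sketch under assumptions: it expects the PDF bytes to be already decompressed, and it assumes the structure elements carry the standard type names (Table, TR, TH, TD) in an /S entry. A file that maps custom names through a role map would not match. The `fragment` bytes are a made-up example, not taken from the linked file.

```python
import re
from collections import Counter

# Matches a structure-type entry such as "/S /TH" or "/S/TR".
STRUCT_TYPE = re.compile(rb"/S\s*/(Table|THead|TBody|TR|TH|TD)(?![A-Za-z0-9])")


def count_struct_types(pdf_bytes):
    """Count table-related structure types in uncompressed PDF bytes."""
    return Counter(
        m.group(1).decode("ascii") for m in STRUCT_TYPE.finditer(pdf_bytes)
    )


# Hypothetical fragment of a decompressed, tagged PDF.
fragment = (
    b"<< /Type /StructElem /S /Table >>\n"
    b"<< /Type /StructElem /S /TR >>\n"
    b"<< /Type /StructElem /S /TH >> << /Type /StructElem /S /TH >>\n"
    b"<< /Type /StructElem /S/TH >>\n"
)
print(count_struct_types(fragment))
```

A count like three TH entries would correspond to the three headers in the example table, though it says nothing about how they are nested.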


0 Answers