Convert PDF to HTML via PyMuPDF

Asked Apr 09 '22 at 23:40

For pages with tabular data in landscape format, the words in the HTML output overlap. For pages in portrait format, the conversion is successful. Any ideas on how to fix that?

Here is an example of the PDF converted to HTML in landscape format:

[1]: https://i.stack.imgur.com/twbzw.png
[2]: https://i.stack.imgur.com/Ln56P.png

import fitz  # PyMuPDF

in_path = "input.pdf"  # placeholder: path to the source PDF

doc = fitz.open(in_path)  # open document
with open(in_path + ".html", "wb") as out:  # open HTML output (closed automatically)
    for page in doc:  # iterate the document pages
        text = page.get_text("html").encode("utf8")  # page content as HTML
        out.write(text)  # write HTML of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
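
One thing that may help narrow this down is checking whether the landscape pages are stored with a rotation flag or simply have a wide page rectangle. The sketch below is only a diagnostic, not a confirmed fix. `describe_page` and `report_pages` are hypothetical helper names, and the idea that rotation is related to the overlap is an assumption.

```python
def describe_page(width, height, rotation):
    """Summarize how a page is displayed.

    width, height: displayed page size in points (e.g. page.rect.width / page.rect.height).
    rotation: page rotation in degrees (e.g. page.rotation).
    Returns (orientation, rotated).
    """
    orientation = "landscape" if width > height else "portrait"
    rotated = rotation % 360 != 0  # True if the page carries a rotation flag
    return orientation, rotated


def report_pages(pdf_path):
    """Yield (page_number, orientation, rotated) for every page of a PDF."""
    import fitz  # PyMuPDF; imported here so describe_page works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            orientation, rotated = describe_page(
                page.rect.width, page.rect.height, page.rotation
            )
            yield page.number, orientation, rotated
```

Running `for info in report_pages(in_path): print(info)` would show whether the overlapping pages are exactly the ones reported as rotated. If they are, comparing the HTML output after `page.set_rotation(0)` might be a reasonable next experiment, though that is a guess rather than a known solution.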

Nick Tsagkarakis
