
When I convert a PDF to HTML with PyMuPDF, the words in the HTML output overlap on pages that contain tabular data in landscape format. Pages in portrait format convert successfully. Any ideas how to fix that?

Here is an example of a landscape page converted from PDF to HTML:
[1]: https://i.stack.imgur.com/twbzw.png
[2]: https://i.stack.imgur.com/Ln56P.png

import fitz  # PyMuPDF

in_path = "input.pdf"  # placeholder; set this to the path of the PDF to convert

doc = fitz.open(in_path)  # open document
with open(in_path + ".html", "wb") as out:  # open HTML output, closed automatically
    for page in doc:  # iterate the document pages
        text = page.get_text("html").encode("utf8")  # page content as HTML
        out.write(text)  # write HTML of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
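
To narrow the problem down, a small diagnostic along these lines might help. It is only a sketch: it assumes, without confirmation, that the overlap is related to how the landscape pages are stored (for example through a page rotation value), and the helper names `describe_orientation` and `report` are made up for illustration. It prints, for each page, whether it is displayed as landscape and whether it carries a non-zero rotation.

```python
def describe_orientation(width, height, rotation):
    """Classify a page from its displayed size and its rotation in degrees.

    Returns a tuple (orientation, is_rotated).
    """
    orientation = "landscape" if width > height else "portrait"
    is_rotated = rotation % 360 != 0
    return orientation, is_rotated


def report(pdf_path):
    """Print orientation and rotation status for every page of a PDF."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependencies

    with fitz.open(pdf_path) as doc:
        for page in doc:
            # page.rect is the page rectangle as displayed (rotation applied)
            orientation, is_rotated = describe_orientation(
                page.rect.width, page.rect.height, page.rotation
            )
            status = "rotated" if is_rotated else "unrotated"
            print(page.number, orientation, status)
```

Calling `report(in_path)` on the document above would show whether the pages with overlapping words are exactly the ones reported as rotated, which would be a useful detail to include here.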
