I am using the Python library pdftotext to extract the text of a PDF file. That works well, but I need the equivalent of the "-layout" option that the command-line tool offers (pdftotext -layout pdf_file.pdf). I am not sure whether that is possible without explicitly calling the command from my code.
Current code:
pdf = pdftotext.PDF(file)
plain_text = "\n\n".join(pdf)
Ideal code with the layout option for better scraping:
pdf = pdftotext.PDF(file, "-layout")
plain_text = "\n\n".join(pdf)
Workaround I would like to avoid (calling the command-line tool from the Python program):
cmd = ['pdftotext', '-f', str(1), '-l', str(1), str(pdf_file), '-layout', '-']
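For completeness, this is roughly how that workaround would be run. It is only a minimal sketch, assuming the pdftotext command-line tool (from Poppler) is installed and on the PATH and that its output is UTF-8. The helper names build_pdftotext_cmd and extract_layout_text are placeholders of my own, not part of any library.

```python
import subprocess


def build_pdftotext_cmd(pdf_file, first_page=1, last_page=1):
    """Build the argument list for the pdftotext CLI with -layout enabled."""
    return [
        "pdftotext",
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-layout",               # keep the original physical layout
        str(pdf_file),
        "-",                     # write the text to stdout instead of a file
    ]


def extract_layout_text(pdf_file, first_page=1, last_page=1):
    """Run pdftotext as a subprocess and return its output as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_file, first_page, last_page),
        capture_output=True,  # requires Python 3.7+
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # Assumption: the tool emits UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

It would be called as plain_text = extract_layout_text("pdf_file.pdf"). This starts a separate process for every call, which is exactly what I would like to avoid.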
Thank you!