
I am using the Python library pdftotext to extract the text of a PDF file. That works well, but I need the equivalent of the -layout option that the command-line tool offers (pdftotext -layout pdf_file.pdf). I'm not sure whether that is possible without explicitly calling the command from my code.

Current code:

pdf = pdftotext.PDF(file)
plain_text = "\n\n".join(pdf)

What I would like to write, with a layout option for better scraping (this call does not work as written):

pdf = pdftotext.PDF(file, "-layout")
plain_text = "\n\n".join(pdf)

Workaround I would like to avoid in the Python program:

cmd = ['pdftotext', '-f', str(1), '-l', str(1), str(pdf_file), '-layout', '-']
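
For reference, a minimal sketch of how that workaround could be wrapped, assuming the pdftotext command-line tool is installed and on the PATH. The helper names build_layout_command and extract_with_layout are made up for illustration, and the options are placed before the file name to follow the tool's documented usage:

```python
import subprocess


def build_layout_command(pdf_file, first_page=1, last_page=1):
    """Build the argument list for `pdftotext -layout`, writing to stdout."""
    return [
        "pdftotext",
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-layout",               # keep the physical layout of the page
        str(pdf_file),
        "-",                     # "-" sends the text to stdout instead of a file
    ]


def extract_with_layout(pdf_file, first_page=1, last_page=1):
    """Run the command and return the extracted text (hypothetical helper)."""
    cmd = build_layout_command(pdf_file, first_page, last_page)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_with_layout("pdf_file.pdf") would start one external process per call, which is the overhead I would like to avoid.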

Thank you!

Alexandre

1 Answer

import pdftotext

# physical=True keeps the text in the order it appears on the page
with open("file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, physical=True)
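
Put together with the joining step from the question, this might look like the sketch below. It assumes that physical=True gives output comparable to the command-line -layout option, which I have not verified for every PDF. The function name extract_text and its keep_layout parameter are made up for illustration:

```python
def extract_text(path, keep_layout=True):
    """Return the text of all pages of a PDF, joined by blank lines."""
    # Imported here so the sketch only needs the pdftotext package when called.
    import pdftotext

    with open(path, "rb") as f:
        # physical=True asks for text in the order it appears on the page.
        pdf = pdftotext.PDF(f, physical=keep_layout)

    # A PDF object can be iterated page by page; each page is a string.
    return "\n\n".join(pdf)
```

Usage would then be plain_text = extract_text("file.pdf"), with no external command involved.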

I found this in the library's source code:
    "    raw: If True, page text is output in the order it appears in the\n"
    "        content stream.\n"
    "    physical: If True, page text is output in the order it appears
nazrigue