
How can I detect the different "blocks" of text extracted from a PDF so that I can split them into paragraphs? Could I use their position on the page to do this?

PyMuPDF puts only one newline character between blocks, and also one newline after each line within a block, so it is not possible to tell a new block apart from a new line.


Anm
  • Not quite true: `page.get_text(opt, ...)` offers several levels of detail for the extracted text, depending on the value of `opt`. The default (`opt="text"`) delivers naive plain text as coded in the PDF. `opt="blocks"` delivers text lines aggregated by paragraph. `"dict"` delivers position detail down to each text span, including font, font size, text color and more. `"rawdict"` does a similar thing, but down to each character. So it is best to try them out and see which fits your needs. – Jorj McKie Apr 15 '23 at 21:59
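
For anyone landing here, a minimal sketch of the `"blocks"` route from the comment above. It assumes each entry returned by `page.get_text("blocks")` is a tuple `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type == 0` for text, and that the page is single-column so sorting by vertical position gives reading order. The helper names `blocks_to_paragraphs` and `extract_paragraphs` are made up for illustration and are not part of PyMuPDF.

```python
from typing import Iterable, List, Tuple

# Assumed shape of one entry from page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 = text, 1 = image.
Block = Tuple[float, float, float, float, str, int, int]


def blocks_to_paragraphs(blocks: Iterable[Block]) -> List[str]:
    """Turn block tuples into paragraphs, one paragraph per text block."""
    # Keep text blocks only; image blocks carry no paragraph text.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Order top-to-bottom, then left-to-right (assumes a single-column page).
    text_blocks.sort(key=lambda b: (round(b[1]), b[0]))
    paragraphs = []
    for b in text_blocks:
        # Newlines inside a block are treated as soft line wraps and
        # collapsed to single spaces.
        text = " ".join(b[4].split())
        if text:
            paragraphs.append(text)
    return paragraphs


def extract_paragraphs(pdf_path: str) -> List[str]:
    """Hypothetical wrapper: collect paragraphs from every page of a PDF."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    paragraphs: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            paragraphs.extend(blocks_to_paragraphs(page.get_text("blocks")))
    return paragraphs
```

Because each block arrives as its own tuple with its own bounding box, the block boundary no longer has to be guessed from newline characters. Words hyphenated across a line wrap would still need separate handling, and if the block grouping turns out too coarse or too fine for a given PDF, the `"dict"` output exposes per-line bounding boxes that could be grouped by vertical gap instead.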

0 Answers