Detecting paragraphs in a PDF

Asked Apr 15 '23 at 06:29

Active Apr 15 '23 at 06:29

Viewed 398 times

How can I detect the different "blocks" of text extracted from a PDF so that I can split them into paragraphs? Could I use their position on the page to do this?

PyMuPDF's extracted text has only one newline character between blocks, and also one newline after each line, so it is not possible to distinguish the start of a separate block from an ordinary line break.

asked Apr 15 '23 at 06:29

Anm

Not quite true: `page.get_text(opt, ...)` has a handful of different levels of detail for extracted text information, depending on the "opt" value: default (opt="text") delivers naive plain text as coded in the PDF. opt="blocks" delivers text lines aggregated by paragraphs, "dict" delivers position detail down to each text span, including font, font size, text color and more. "rawdict" does a similar thing but down to each character. So best try them out to see what fits your needs. – Jorj McKie Apr 15 '23 at 21:59
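As a minimal sketch of the "blocks" option mentioned in the comment above: `page.get_text("blocks")` returns one tuple per block in the form `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` is 0 for text and 1 for images. The helper names `blocks_to_paragraphs` and `page_paragraphs` are made up for this illustration, and treating each block as one paragraph is an assumption that depends on how the PDF was laid out.

```python
def blocks_to_paragraphs(blocks):
    """Turn PyMuPDF "blocks" tuples into a list of paragraph strings.

    Assumes each text block corresponds to one paragraph, which may not
    hold for every PDF.
    """
    paragraphs = []
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        # Lines inside a block are separated by newlines; join them with
        # spaces so each block becomes a single paragraph string.
        lines = (line.strip() for line in text.splitlines())
        paragraph = " ".join(line for line in lines if line)
        if paragraph:
            paragraphs.append(paragraph)
    return paragraphs


def page_paragraphs(pdf_path, page_number=0):
    """Hypothetical convenience wrapper: paragraphs of one page of a PDF."""
    import fitz  # PyMuPDF, imported here so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        return blocks_to_paragraphs(page.get_text("blocks"))
```

Calling `page_paragraphs("example.pdf")` (the file name is a placeholder) would give one string per text block on the first page. Words hyphenated across line ends are not rejoined here. If the blocks turn out too coarse or too fine, the coordinates in each tuple, or the "dict" output, can be used to merge or split them by vertical gap.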


0 Answers