Paragraph extraction in PyMuPDF

Question

I'm using PyMuPDF to extract text from PDFs in block units. In many cases, "blocks" seem to default to newline-separated units rather than logical paragraphs.

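import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
# Each block is a tuple; index 4 holds the block's text.
# Newer PyMuPDF releases spell this method get_text().
blocks = [x[4] for x in doc[0].getText("blocks")]
print(blocks)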

(example.pdf can be found here)

I could live with this, were it not for the fact that straight copy/pasting from Mac's bog-standard Preview app beautifully retains the paragraphs. What is Preview doing that PyMuPDF isn't? The rest of my pipeline is pretty much locked into PyMuPDF, so I can't really use Preview for extraction.

score 0 · Answer 1 · answered Nov 12 '20 at 12:59

I wish there were a way to call the engine that Preview uses. It's much better than anything I've found for Python. But to answer your question, it looks to me like PyMuPDF inserts a string containing a single space (' ') between paragraphs.

For example, between the first and second paragraphs, you have:

...ontspannen. ', ' ', 'Kunnen...

You can replace all single space strings with a newline ('\n') like this:

for i in range(len(blocks)):
    if blocks[i] == ' ':
        blocks[i] = '\n'

Since each line of text is returned as a separate string, you might also want to join the strings that form a paragraph.
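A minimal sketch of that joining step, assuming (as observed above for this PDF, not as documented PyMuPDF behaviour) that a whitespace-only string marks a paragraph break. The function name join_paragraphs and the sample list are hypothetical, and words hyphenated across lines are not rejoined:

```python
def join_paragraphs(blocks):
    """Group consecutive line strings into paragraphs.

    Assumption: a whitespace-only string (such as ' ') separates
    paragraphs, as seen in the output for this particular PDF.
    """
    paragraphs = []
    current = []
    for text in blocks:
        if not text.strip():
            # Whitespace-only entry: close the paragraph collected so far.
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            # Collapse internal newlines and repeated spaces in the line.
            current.append(" ".join(text.split()))
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical sample shaped like the list printed in the question.
sample = ["Eerste regel ", "van de alinea. ", " ", "Kunnen we ", "verder?"]
print(join_paragraphs(sample))
```

With the real data you would call join_paragraphs(blocks) on the list built in the question. If other PDFs separate paragraphs differently, the whitespace test is the part to adjust.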

Yeah, I also wonder what Preview's secret sauce is. For now, I'm just extracting full pages and then performing sentence tokenization on the lot. Kinda works, but I wish I could get it right from the source. — Guy De Pauw, Nov 13 '20 at 04:14
