I'm using PyMuPDF to extract text from PDFs in block units. In many cases, "blocks" seem to default to newline-separated units rather than logical paragraphs.

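import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
# Each block is a tuple; index 4 holds its text.
# Older PyMuPDF releases spell this method getText.
blocks = [x[4] for x in doc[0].get_text("blocks")]
print(blocks)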

(example.pdf can be found here)

I could live with this, were it not for the fact that straight copy/pasting from the Mac's bog-standard Preview app retains the paragraphs beautifully. What is Preview doing that PyMuPDF isn't? The rest of my pipeline is pretty much locked into PyMuPDF, so I can't really use Preview for extraction.

Guy De Pauw

1 Answer

I wish there was a way to call the engine that Preview uses. It's much better than anything I've found for Python. But to answer your question, it looks to me that PyMuPDF inserts a string containing a single space (' ') between paragraphs.

For example, between the first and second paragraphs, you have:

...ontspannen. ', ' ', 'Kunnen...

You can replace every single-space string with a newline ('\n') like this:

for i in range(len(blocks)):
    if blocks[i] == ' ':
        blocks[i] = '\n'

Since each line of text is returned as a separate string, you might also want to join the strings that form a paragraph.
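As a rough sketch of that joining step: the helper below (join_paragraphs is a made-up name, not part of PyMuPDF) assumes that a whitespace-only string marks a paragraph break, as in the output above, and that every other string is one line of the current paragraph. I haven't checked that assumption against other PDFs.

```python
def join_paragraphs(blocks):
    """Merge line-level strings into paragraph strings.

    Assumption: a whitespace-only entry (such as ' ') separates paragraphs.
    """
    paragraphs = []
    current = []  # lines collected for the paragraph being built
    for text in blocks:
        if text.strip():
            # Normalise trailing spaces and embedded newlines in the line.
            current.append(" ".join(text.split()))
        elif current:
            # Separator reached: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Flush the last paragraph if the list did not end on a separator.
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    sample = ["First line ", "second line. ", " ", "Next paragraph. "]
    print(join_paragraphs(sample))
```

You would call it as paragraphs = join_paragraphs(blocks) on the unmodified list from the question, in place of the newline replacement. It does not rejoin words that were hyphenated across a line break.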

Mark Turner
  • Yeah, I also wonder what Preview's secret sauce is. For now, I'm just extracting full pages and then performing sentence tokenization on the lot. Kinda works, but I wish I could get it right from the source. – Guy De Pauw Nov 13 '20 at 04:14