I'm using PyMuPDF to extract text from PDFs in block units. In many cases, the "blocks" seem to default to newline-separated units rather than logical paragraphs.
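import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
# Each block is a tuple (x0, y0, x1, y1, text, block_no, block_type); index 4 is the text.
# get_text is the current name of the method; older releases spelled it getText.
blocks = [x[4] for x in doc[0].get_text("blocks")]
print(blocks)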
(example.pdf can be found here)
I could live with this, were it not for the fact that copy/pasting straight from Mac's bog-standard Preview app retains the paragraphs beautifully. What is Preview doing that PyMuPDF isn't? The rest of my pipeline is pretty much locked into PyMuPDF, so I can't really use Preview for extraction.
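For context, the fallback I'm considering is to merge the blocks into paragraphs myself, using the bounding boxes that come with each block. This is only a minimal sketch: I'm assuming that lines of one paragraph sit closer together vertically than separate paragraphs do, and I have no idea whether Preview does anything like this. The function name merge_blocks and the max_gap threshold (in points) are my own hypothetical choices, and the sample tuples are made up to mimic the shape of the "blocks" output.

```python
def merge_blocks(blocks, max_gap=5.0):
    """Merge vertically adjacent text blocks into paragraphs.

    Each block is a tuple shaped like PyMuPDF's "blocks" output:
    (x0, y0, x1, y1, text, block_no, block_type).
    max_gap is a guessed threshold in points, not a PyMuPDF setting.
    """
    paragraphs = []
    prev_bottom = None
    # Read top to bottom, then left to right (single-column layout assumed).
    for block in sorted(blocks, key=lambda b: (b[1], b[0])):
        x0, y0, x1, y1, text = block[:5]
        block_type = block[6] if len(block) > 6 else 0
        if block_type != 0:
            continue  # skip non-text (image) blocks
        text = " ".join(text.split())  # collapse newlines and extra spaces
        if not text:
            continue
        if prev_bottom is not None and y0 - prev_bottom <= max_gap:
            # Small vertical gap: treat as a continuation of the paragraph.
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_bottom = y1
    return paragraphs


# Made-up sample data in the same shape as page.get_text("blocks").
sample = [
    (72, 100, 300, 112, "First line\n", 0, 0),
    (72, 113, 300, 125, "second line\n", 1, 0),
    (72, 150, 300, 162, "New paragraph\n", 2, 0),
]
print(merge_blocks(sample))
```

With real data I'd pass doc[0].get_text("blocks") instead of the sample list. The threshold would need tuning per document, and this would break on multi-column pages, which is why I'd prefer a built-in way to get proper paragraphs.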