How to extract only paragraphs from a PDF using Python or Java?

Question

We are able to extract the entire text from a PDF using PyPDF2 and PDFBox, but we are not able to extract only the paragraphs.

What is a paragraph? Well, ok, I have an idea what a paragraph is when I see it, but in a PDF there doesn't need to be a structure marking a paragraph as such. Or do you happen to only deal with tagged PDFs marking paragraphs? — mkl, Aug 09 '19 at 14:36

score 0 · Answer 1 · answered Feb 25 '23 at 21:42


Extract the text and split it on blank lines (\n\n).
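A minimal sketch of that idea in Python, assuming the extracted text really does contain blank lines between paragraphs. As the comment on the question points out, a PDF need not mark paragraphs at all, so this depends on the file and the extractor. The function names split_paragraphs and extract_paragraphs are made up for illustration, and the PDF step assumes the pypdf package is installed (with PyPDF2 2.x or later, the same PdfReader and extract_text calls should be available under the PyPDF2 import name).

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split already-extracted text into paragraphs on blank lines."""
    # Normalise Windows and old Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is taken to be two newlines, possibly with
    # whitespace or further newlines in between.
    blocks = re.split(r"\n\s*\n", text)
    # Re-join hard-wrapped lines inside each block and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def extract_paragraphs(pdf_path: str) -> list[str]:
    """Hypothetical helper: extract text with pypdf, then split it."""
    from pypdf import PdfReader  # imported here so split_paragraphs works without pypdf

    reader = PdfReader(pdf_path)
    # Assumption: a page break is treated as a paragraph break, so a
    # paragraph that continues onto the next page will be cut in two.
    text = "\n\n".join((page.extract_text() or "") for page in reader.pages)
    return split_paragraphs(text)
```

If the extractor emits only single newlines, every page comes back as one block, and you would need layout information (font size, indentation, vertical gaps) rather than a plain split.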


Martin Thoma
