3

We are able to extract the entire text from a PDF using PyPDF2 and PDFBox, but we are not able to fetch only the paragraphs.

Ashok Kuramdasu
  • 1
    What is a paragraph? Well, ok, I have an idea what a paragraph is when I see it, but in a PDF there doesn't need to be a structure marking a paragraph as such. Or do you happen to only deal with tagged PDFs marking paragraphs? – mkl Aug 09 '19 at 14:36
  • Why would you want to do it? What have you tried? – Martin Thoma Jul 30 '22 at 10:35

1 Answer

0

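Extract the text and split it on blank lines (\n\n). This is a heuristic: it only works if the extracted text actually contains blank lines between paragraphs.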
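A minimal sketch of that approach, assuming a recent pypdf (`PdfReader` and `page.extract_text()`). The helper names `split_paragraphs` and `extract_paragraphs` are made up for illustration, and joining pages with a blank line is an assumption that may split a paragraph running across a page break:

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]


def extract_paragraphs(pdf_path):
    """Extract text from every page and split it into paragraph-like chunks."""
    # Imported here so split_paragraphs can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Assumption: a page break also ends a paragraph.
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    return split_paragraphs(text)
```

Calling `extract_paragraphs("document.pdf")` (a placeholder path) would return a list of strings. If the result is one big chunk, the extractor probably did not emit blank lines for this PDF, and splitting on `\n\n` will not help.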

If you want to extract text from a specific region, use a visitor function: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#using-a-visitor
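A rough sketch of the visitor idea, assuming pypdf's `extract_text(visitor_text=...)` callback, which is called with `(text, cm, tm, font_dict, font_size)`. The names `make_region_visitor` and `extract_region_text` and the default bounds are illustrative only. Which matrix carries the useful vertical position can depend on the PDF and the pypdf version, so check the linked documentation for your version:

```python
def make_region_visitor(parts, y_min, y_max):
    """Build a visitor that keeps text whose vertical position lies in (y_min, y_max)."""

    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: the vertical offset is read from the text matrix (tm[5]);
        # for some PDFs or pypdf versions, cm[5] may be the relevant value.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor


def extract_region_text(pdf_path, page_index=0, y_min=50, y_max=720):
    """Return the text of one page, restricted to a vertical band."""
    from pypdf import PdfReader

    page = PdfReader(pdf_path).pages[page_index]
    parts = []
    page.extract_text(visitor_text=make_region_visitor(parts, y_min, y_max))
    return "".join(parts)
```

Cutting off a band at the top and bottom of the page is a common way to skip headers and footers before splitting the remaining text into paragraphs.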

Martin Thoma