We are able to extract the entire text from a PDF using PyPDF2 and PDFBox, but we are not able to fetch only the paragraphs.
- What is a paragraph? Well, ok, I have an idea what a paragraph is when I see it, but in a PDF there doesn't need to be a structure marking a paragraph as such. Or do you happen to only deal with tagged PDFs marking paragraphs? – mkl Aug 09 '19 at 14:36
- Why would you want to do it? What have you tried? – Martin Thoma Jul 30 '22 at 10:35
1 Answer
Extract the text and split by \n\n.
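A minimal sketch of the split, assuming the extracted text really does contain blank lines between paragraphs (many PDFs do not, so treat the result as a heuristic). The helper names `split_paragraphs` and `extract_paragraphs` are made up for illustration; `PdfReader` and `extract_text()` are the pypdf calls.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and collapse whitespace inside each chunk."""
    # A "blank line" here is a newline, optional whitespace, then another newline.
    chunks = re.split(r"\n\s*\n", text)
    # Drop empty chunks and join the wrapped lines of each paragraph with spaces.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def extract_paragraphs(pdf_path):
    """Hypothetical helper: extract all page text with pypdf, then split it."""
    from pypdf import PdfReader  # imported lazily; requires `pip install pypdf`

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty string for pages without a text layer.
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    return split_paragraphs(text)
```

Joining pages with a blank line means a paragraph that continues across a page break will be split in two; whether that is acceptable depends on the documents.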
If you want to extract text from a specific region, use a visitor function: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#using-a-visitor
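A rough sketch of the visitor approach, following the pattern in the linked pypdf documentation. The factory `make_region_visitor`, the helper `extract_region`, and the y-limits 50 and 720 are illustrative assumptions. The sketch reads the vertical position from the text matrix (`tm[5]`); depending on the PDF and the pypdf version, the offset may sit in the current transformation matrix (`cm[5]`) instead, so check against your own files.

```python
def make_region_visitor(parts, y_min=50, y_max=720):
    """Return a visitor that keeps text whose y position lies in (y_min, y_max)."""

    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: tm[5] holds the vertical offset in user space units.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor


def extract_region(pdf_path, page_index=0, y_min=50, y_max=720):
    """Hypothetical helper: extract only the text inside a vertical band of a page."""
    from pypdf import PdfReader  # imported lazily; requires `pip install pypdf`

    parts = []
    page = PdfReader(pdf_path).pages[page_index]
    # pypdf calls the visitor for each text fragment it encounters on the page.
    page.extract_text(visitor_text=make_region_visitor(parts, y_min, y_max))
    return "".join(parts)
```

This can be used to drop headers and footers before splitting the remaining text into paragraphs.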

Martin Thoma