Extract text from pdf ignoring cropped content

Asked Mar 13 '18 at 00:34

Active Aug 23 '22 at 03:28

Viewed 629 times

I'm trying to extract text from a PDF file that has been cropped, i.e. it has a defined crop box that displays only a portion of the page.

The problem is that the cropped part still exists in the PDF file; it's just not visible.

I've tried PyPDF2, pdfquery and pdfminer. They all read the entire content including the cropped portion.

PyPDF2 lets me access the dimensions of the cropbox using:

import PyPDF2

pdfFileObj = open(path, 'rb')  # path points to the cropped PDF
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.getPage(0).cropBox)  # crop box of the first page

But I'm not sure what I can do with it. The files are being cropped in Java using Apache PDFBox. I'd prefer to read only the uncropped part of the files in Python, but I can also change the Java code that crops the files if that's the only solution.
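
To illustrate the kind of filtering I have in mind, here is a minimal sketch. It assumes each extracted text fragment comes with a bounding box (x0, y0, x1, y1) expressed in the same coordinate system as the crop box, which may not hold for every library without an offset. The names `inside` and `visible_text` and the sample data are hypothetical, not part of PyPDF2, pdfquery or pdfminer.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1): lower-left corner, then upper-right corner
Box = Tuple[float, float, float, float]


def inside(box: Box, crop: Box) -> bool:
    """Return True if the centre of `box` lies within `crop`."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return crop[0] <= cx <= crop[2] and crop[1] <= cy <= crop[3]


def visible_text(fragments: Iterable[Tuple[Box, str]], crop: Box) -> List[str]:
    """Keep only the fragments whose centre falls inside the crop box."""
    return [text for box, text in fragments if inside(box, crop)]


# Hypothetical sample data: two fragments on a 612x792 page
# that has been cropped to its top half.
fragments = [
    ((72, 700, 200, 712), "visible heading"),
    ((72, 100, 200, 112), "cropped footer"),
]
crop_box = (0, 396, 612, 792)
print(visible_text(fragments, crop_box))
```

Testing the centre of each fragment is only one possible rule; requiring full containment or any overlap would be equally plausible, depending on how text that straddles the crop boundary should be treated.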

Any help is appreciated.

doddy

It's trivial to use PDFBox text extraction restricted to the crop box, if that would be a solution for you. – mkl Mar 13 '18 at 08:58
This question is a duplicate of [How can extract just the visible text from a PDF, ignoring cropped parts?](https://stackoverflow.com/q/19109465/562769) – Martin Thoma Feb 25 '23 at 21:41
