I want to extract text from a cropped PDF document.

I tried pdfminer, but it also returned the text that had been cropped out. I need only the text in the visible area.

Martin Thoma
Umesha D
  • What do you mean by cropped PDF text? Which part? You have already tagged this question with PyPDF, which is an existing library for extracting parts of a PDF, so why are you still asking? What exactly is your problem? – justhalf Oct 01 '13 at 06:32
  • Hi, thanks for your reply. Sorry, my question was not clear. I cropped the PDF using pypdf and I want to extract only the text in the cropped area. If I use the pdfminer tool to extract the text, it returns the text of the entire page. I need only the cropped-area text, and I don't want to print the PDF and remove the other objects. Please let me know if this is possible. – Umesha D Oct 01 '13 at 10:01
  • Can you try saving the cropped PDF and then running the pdfminer tool on it? – justhalf Oct 01 '13 at 16:15
  • Thanks, I tried that, but it still returns the text of the entire page, not just the cropped text. When I print the PDF to PS, convert the PS back to PDF, and then run the pdfminer tool, only the cropped text comes out. The problem is that I also need the coordinates, and printing the cropped document loses them. – Umesha D Oct 03 '13 at 04:54
  • You can try converting it into an image, then using an OCR tool to get the text only in the specified area. Since the image is generated digitally, I'm sure the OCR accuracy will be very high, if not 100%. In my experience with this method, the OCR accuracy has so far always been 100%. – justhalf Oct 03 '13 at 05:12
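
The comments above suggest why the whole page comes back: cropping with pypdf most likely only changes the page's crop box, while the text outside it stays in the file. A possible alternative to printing or OCR that keeps the coordinates is to filter the extracted text by its bounding box. The following is only a minimal sketch under assumptions, not a tested pdfminer recipe: it assumes each piece of text comes with an `(x0, y0, x1, y1)` box in PDF points (pdfminer's layout objects expose a `bbox` of that shape) and that the crop rectangle has been read from the page's crop box. The helper names `inside` and `visible_text`, the `min_overlap` threshold, and the sample data are all hypothetical.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1) in PDF points, origin at the bottom-left of the page.
Box = Tuple[float, float, float, float]


def inside(box: Box, crop: Box, min_overlap: float = 0.5) -> bool:
    """Return True if at least `min_overlap` of the box area lies inside `crop`."""
    x0, y0, x1, y1 = box
    cx0, cy0, cx1, cy1 = crop
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return False  # degenerate box, nothing visible to keep
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    w = max(0.0, min(x1, cx1) - max(x0, cx0))
    h = max(0.0, min(y1, cy1) - max(y0, cy0))
    return (w * h) / area >= min_overlap


def visible_text(
    items: Iterable[Tuple[str, Box]], crop: Box, min_overlap: float = 0.5
) -> List[Tuple[str, Box]]:
    """Keep only the (text, bbox) pairs that fall inside the crop rectangle."""
    return [(text, box) for text, box in items if inside(box, crop, min_overlap)]


# Hypothetical sample data standing in for text lines extracted from one page.
items = [
    ("Visible heading", (100.0, 700.0, 300.0, 720.0)),
    ("Cropped footer", (100.0, 20.0, 300.0, 40.0)),
    ("Partly visible", (40.0, 600.0, 130.0, 620.0)),
]
crop = (90.0, 500.0, 400.0, 750.0)  # assumed crop box of the page

for text, box in visible_text(items, crop):
    print(text, box)
```

Because the original boxes are passed through unchanged, the coordinates survive the filtering. The overlap threshold decides what happens to text that straddles the crop edge; lowering it keeps partly visible lines.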

0 Answers