Is there a PDF parsing library that can extract text from given coordinates?

Question

Good morning, fellas. I have been assigned a task in which I need to extract text from a PDF file (a bank invoice) according to a given specification of fields and sections. The specification is a YAML file. Each field is described by its name and two coordinates: the top-left and bottom-right corners of the rectangle in which the text resides. I am using SnakeYAML to load this info into objects, and that part works. For the next part, where I have to extract text from the PDF using this data, I am stuck. For one, I have not yet been able to decide which PDF parsing library to use. Can you suggest a PDF parsing library suited to this task, and how I should go about extracting the text for each rectangle? Thanks!

score 2 · Answer 1 · answered Sep 02 '11 at 09:09


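PDFBox is able to extract text from a given area. Have a look at PDFTextStripperByArea! You register each rectangle as a named region, run the stripper over a page, and then ask for the text of each region by name.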
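PDFTextStripperByArea takes its regions as java.awt.geom.Rectangle2D objects, so the two corners from the YAML spec first have to be turned into an x/y/width/height rectangle. Here is a minimal sketch of that step using only the JDK. The class name FieldRegion and the sample field are made up for illustration, and it assumes the spec's coordinates are in points with the origin at the top-left of the page, which is what the stripper appears to expect; check that against your own spec.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical holder for one field from the YAML spec:
// a name plus the top-left and bottom-right corners of its rectangle.
public class FieldRegion {
    public final String name;
    public final double left, top, right, bottom;

    public FieldRegion(String name, double left, double top, double right, double bottom) {
        this.name = name;
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }

    // Convert the two corners into the x/y/width/height form of Rectangle2D.
    // Math.min and Math.abs keep the result valid even if the corners are swapped.
    public Rectangle2D toRectangle() {
        double x = Math.min(left, right);
        double y = Math.min(top, bottom);
        double width = Math.abs(right - left);
        double height = Math.abs(bottom - top);
        return new Rectangle2D.Double(x, y, width, height);
    }

    public static void main(String[] args) {
        // Sample values only; real ones would come from the SnakeYAML objects.
        FieldRegion field = new FieldRegion("accountNumber", 50, 100, 250, 120);
        System.out.println(field.name + " -> " + field.toRectangle());
    }
}
```

With PDFBox on the classpath, each rectangle would then go to stripper.addRegion(field.name, field.toRectangle()), followed by stripper.extractRegions(page) for the page in question and stripper.getTextForRegion(field.name) to read the result. How you load the document and obtain the page differs between PDFBox versions, so that part is left out here.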


Vlad

