Good morning, fellas. I have been assigned a task in which I have to extract text from a PDF file (a bank invoice) according to a given specification of fields and sections. The specification is provided as a YAML file. Each field is described by its name and two coordinates: the top-left and bottom-right corners of the rectangle that contains the text. I am using SnakeYAML to load this information into objects, and that part works. The next part, extracting text from the PDFs using this data, is where I am stuck. For one, I have not yet decided which PDF parsing library to use. Can you suggest a PDF parsing library suited to this task, and how I should go about it? Thanks!
1 Answer
Apache PDFBox can extract text from a given area of a page. Have a look at PDFTextStripperByArea!
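PDFTextStripperByArea takes each region as a `java.awt.geom.Rectangle2D` (x, y, width, height), so the two-corner fields from the YAML need converting first. A minimal sketch of that step, using only the JDK: the class `FieldRegions` and its methods are hypothetical names, not part of PDFBox or SnakeYAML, and it assumes the YAML coordinates are in points with the origin at the top-left of the page and y growing downward. If your specification measures y from the bottom of the page, flip it against the page height before converting.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or SnakeYAML.
class FieldRegions {

    // Converts a top-left and a bottom-right corner into a rectangle.
    // Assumption: units are points, origin is the top-left of the page.
    static Rectangle2D toRegion(double left, double top, double right, double bottom) {
        if (right < left || bottom < top) {
            throw new IllegalArgumentException(
                "bottom-right corner must not lie above or to the left of top-left");
        }
        return new Rectangle2D.Double(left, top, right - left, bottom - top);
    }

    // Builds a field-name -> rectangle map from rows of {left, top, right, bottom},
    // keeping the order in which the fields were specified.
    static Map<String, Rectangle2D> toRegions(Map<String, double[]> corners) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : corners.entrySet()) {
            double[] c = e.getValue();
            regions.put(e.getKey(), toRegion(c[0], c[1], c[2], c[3]));
        }
        return regions;
    }
}
```

Each entry can then be registered with `addRegion(name, rect)` on a `PDFTextStripperByArea`. After calling `extractRegions(page)` for the page in question, `getTextForRegion(name)` returns the text found inside that rectangle.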

Vlad