Good morning, fellas. I have been assigned a task in which I have to extract text from a PDF file (a bank invoice) according to a given specification of fields and sections. The specification is given in a YAML file. Each field is described by its name and two coordinates: the top-left and bottom-right corners of the rectangle in which the text resides. I am using SnakeYAML to load this information into objects, and that part works. For the next part, where I have to extract text from the PDFs using this data, I am stuck. For one, I have not yet decided which PDF parsing library to use. Can you please suggest a PDF parsing library suited to this task, and how I should go about accomplishing it? Thanks!

Vlad
Jim

1 Answer

PDFBox can extract text from a given rectangular area of a page. Have a look at PDFTextStripperByArea!
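To connect this to the YAML fields from the question, here is a minimal sketch of turning a field's top-left and bottom-right corners into the `java.awt.geom.Rectangle2D` that `PDFTextStripperByArea.addRegion` expects. The class name `RegionSketch` and the method `toRegion` are made up for illustration. It also assumes the coordinates are in PDF points with the origin at the top-left of the page, which is how I understand the stripper to match regions, so check that against your specification. The PDFBox calls are shown only in comments so that the block compiles without the library on the classpath.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox or SnakeYAML.
final class RegionSketch {

    // Builds a rectangle from two opposite corners.
    // Math.min/Math.abs make it tolerant of swapped corners.
    static Rectangle2D toRegion(double x1, double y1, double x2, double y2) {
        double x = Math.min(x1, x2);
        double y = Math.min(y1, y2);
        double width = Math.abs(x2 - x1);
        double height = Math.abs(y2 - y1);
        return new Rectangle2D.Double(x, y, width, height);
    }

    // With PDFBox on the classpath, the rectangle would be used roughly like this
    // ("total" is an example field name, page is a PDPage from the loaded document):
    //
    //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    //   stripper.setSortByPosition(true);
    //   stripper.addRegion("total", RegionSketch.toRegion(100, 50, 300, 80));
    //   stripper.extractRegions(page);
    //   String text = stripper.getTextForRegion("total");
}
```

You can register one region per field before calling `extractRegions`, then read each field back by its name with `getTextForRegion`.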

Vlad