We are planning a project in which we (unfortunately) have to work with PDF files and parse their content. Over the last few days we have researched a number of libraries and tried a few of them.
Even so, we still don't know whether we will be able to accomplish the task. Basically, every page of our PDF documents will contain 6-7 questions, possibly with images, each with 5 multiple-choice answers. We need to segment out the individual questions and then further segment each question's multiple-choice answers.
We have found PDFBox (Java) and PDFMiner (Python) to be the most reliable libraries for parsing PDFs, but I personally still think that building a reliable system that satisfies our requirements will be very difficult. This is not a "which library is the best?" question; rather, I am asking whether such tasks are doable and whether such advanced requirements are currently realistic in the PDF parsing world.
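To make the requirement concrete, here is a minimal sketch of the kind of segmentation we have in mind. It assumes the page text has already been extracted as a plain string by one of the libraries above, and it assumes a layout where each question starts a line with a number such as "1." and each answer starts a line with a letter "A)" to "E)". Those markers, and the helper names `split_by_markers` and `segment_page`, are hypothetical; our real documents may well not be this clean, and images are not handled at all.

```python
import re

# Assumed layout (hypothetical): questions begin a line with "1.", "2.", ...
# and answers begin a line with "A)" through "E)".
QUESTION_START = re.compile(r"^\s*(\d+)\.\s+", re.MULTILINE)
ANSWER_START = re.compile(r"^\s*([A-E])\)\s+", re.MULTILINE)


def split_by_markers(text, pattern):
    """Split text at every marker match.

    Returns the text before the first marker and a list of
    (marker_label, body) pairs, one per marker found.
    """
    matches = list(pattern.finditer(text))
    head = text[: matches[0].start()] if matches else text
    parts = []
    for i, match in enumerate(matches):
        # Each body runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts.append((match.group(1), text[match.end():end].strip()))
    return head.strip(), parts


def segment_page(page_text):
    """Segment one page of extracted text into questions and their answers."""
    _, questions = split_by_markers(page_text, QUESTION_START)
    result = []
    for number, body in questions:
        stem, answers = split_by_markers(body, ANSWER_START)
        result.append(
            {"number": int(number), "stem": stem, "answers": dict(answers)}
        )
    return result
```

Calling `segment_page` on the text of one page would return one dictionary per question, with the question text under `"stem"` and the answer options keyed by letter. This only works if the extracted text preserves reading order and line starts, which is exactly the part we are unsure about.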
Of course, I am open to any other advice (image processing, cropping software, manual cropping, etc.) that might help us accomplish our task.
Ex: 6 of those on a page: