
We have a project in which we need to deal with PDF files (unfortunately) and parse their content. For the last few days we have been researching different libraries, and we have tried a few of them.

Even so, we still don't know whether we will be able to accomplish the task. Basically, every page in our PDF documents will contain 6-7 questions, possibly with images, each with 5 multiple-choice answers. We will need to segment those questions out and then further segment the multiple-choice answers of each question.

We have found PDFBox (Java) and PDFMiner (Python) to be the most reliable libraries for parsing PDFs, but I personally think that creating a reliable system that satisfies our requirements will be very difficult. This is not a "which library is the best?" question; it is more about whether such tasks are doable and whether such advanced requirements are currently realizable in the PDF parsing world.

Of course, I am open to any other advice (image processing, cropping software, manual cropping, etc.) that might help us accomplish our task.

Ex: 6 of those on a page:

question format

ralzaul
  • First of all, not all PDFs are parseable **by definition**, but if all of your files come from a single source, it may be possible. Some libraries have problems with PDFs of one kind, others ... with others. There is a vast number of ways to store text in a PDF. If you're planning to dive into the format, make sure to read the references mentioned in [tag:pdf]. Then come back when you have a concrete question. – Jongware Apr 28 '15 at 09:02
  • Actually, the PDF files will come from different suppliers, so I expect them to have separate formats of their own. More important are the images and how we are going to match them to the corresponding text. I added an example to the question to make it clearer. – ralzaul Apr 28 '15 at 09:12
  • It would be easier to come up with useful advice if you provided (a link to) a sample of representative, typical PDF files of yours (with and without images). – Kurt Pfeifle Apr 28 '15 at 09:59
  • I'm not sure if the problem is the parsing or the interpretation of the data. I'll assume that the text is parseable into a meaningful form. Taking your example, you'll have blocks of text and images, and the location of both is available after parsing. Look for the question number; below it is an image, more text, and another image. Next come the letters for the choices, each with an image and text on the same line or running until the next choice. Doesn't look too difficult, but without an actual PDF I'm guessing. – Paulo Soares Apr 28 '15 at 10:26
  • @PauloSoares Hmm, considering the OP said *possibly with images*, I think it might not be too easy. Without a representative selection of PDFs I think it is really hard to tell. – mkl Apr 28 '15 at 10:41
  • @mkl Let's wait for the PDF. – Paulo Soares Apr 28 '15 at 11:02
  • I will attach an example PDF file sometime this evening. So a tactic in which you search for the question number, get everything up to "a-)", then everything up to "b-)", then everything up to "c-)", and so on until you reach question number + 1, is theoretically realizable? – ralzaul Apr 28 '15 at 11:03
  • I attached the part of the PDF that is difficult to parse. Please consider the case that there might also be images in the answers, or that the answers may consist entirely of images. – ralzaul Apr 28 '15 at 11:15
  • I am afraid the PDF part has been attached as an image. Stack Overflow does not provide means to attach anything but some bitmap image formats. To analyze, though, we'd need the PDF. – mkl Apr 28 '15 at 11:37
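
The tactic discussed in the comments (find the question number, collect everything up to "a-)", then up to "b-)", and so on until the next question number) could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: it presumes the page content has already been extracted into a list of text and image elements in reading order, which is the hard part and depends on the PDF. The names `Item`, `Question`, and `segment`, as well as the two regular expressions, are hypothetical and not part of PDFMiner, PDFBox, or any other library.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Item:
    """One page element in reading order (hypothetical structure)."""
    kind: str     # "text" or "image"
    content: str  # the text itself, or some identifier for the image


@dataclass
class Question:
    number: int
    stem: list = field(default_factory=list)     # items before the first choice
    choices: dict = field(default_factory=dict)  # choice letter -> list of items


# Assumed marker formats: "12." or "12)" for questions, "a-)" or "a)" for choices.
QUESTION_RE = re.compile(r"^\s*(\d+)[.)]\s*(.*)$")
CHOICE_RE = re.compile(r"^\s*([a-eA-E])\s*-?\)\s*(.*)$")


def segment(items):
    """Group items into questions and their multiple-choice answers."""
    questions = []
    current = None   # question being filled
    choice = None    # choice letter being filled, or None while in the stem
    expected = None  # next question number we are willing to accept
    for item in items:
        if item.kind == "text":
            q = QUESTION_RE.match(item.content)
            # Only accept a number that continues the sequence, to reduce
            # false positives from numbers inside the question text.
            if q and (expected is None or int(q.group(1)) == expected):
                current = Question(number=int(q.group(1)))
                questions.append(current)
                expected = current.number + 1
                choice = None
                if q.group(2):
                    current.stem.append(Item("text", q.group(2)))
                continue
            c = CHOICE_RE.match(item.content)
            if c and current is not None:
                choice = c.group(1).lower()
                current.choices[choice] = []
                if c.group(2):
                    current.choices[choice].append(Item("text", c.group(2)))
                continue
        if current is None:
            continue  # skip anything before the first question (headers etc.)
        # Plain text and images go to the current choice, or to the stem.
        (current.choices[choice] if choice else current.stem).append(item)
    return questions
```

Because images are treated like any other item, an answer that consists only of an image ends up as a choice whose list holds a single image item. With PDFs from different suppliers, the marker patterns and the reading-order assumption would most likely have to be adapted per source.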

0 Answers