Identify and extract specific sections of a PDF document

Question

I have several exams in PDF format. I want to programmatically extract each question as a separate image/document. OCR is not ideal because it does not preserve code/equation formatting well. The end goal is to make flash cards, with each card containing an image of an entire question. Several questions can share a page, and questions can also be multi-part (e.g. 1a, 2f, etc.).

Currently, I'm considering using OCR to extract question tags (e.g. 1, 2, 3, etc.), then finding their positions in the PDF and extracting an image from the start of one question to the start of the next. Is there any framework or software that can do this, or an alternative approach that would make this easier?

For the OCR and image recognition part, you could always try the [Azure Cognitive Services](https://azure.microsoft.com/en-gb/services/cognitive-services/) (if it's OK to have an online connection). It's free to try anyway, and I've written a [blog post](http://digitalpage.blog/2017/09/28/improving-search-in-pdf-documents-using-azure-and-ai/) about my experiences, if it helps. — Andy, Nov 07 '17 at 09:43

score 5 · Accepted Answer · answered Nov 20 '17 at 02:43

Have a look at Science-Parse by Allen AI. It does a pretty decent job of extracting metadata from PDF documents. Often, it's better than other text-extraction software such as textract and pdfplumber.

Accurately extracting mathematical formulae from PDFs has been a research topic for many years now. I have not found any open-source projects or packages that extract mathematical formulae precisely, although a number of research papers describe methods for doing so, such as this and this. (More research has been done on recognizing mathematical formulae or converting them to a proper markup such as LaTeX, MathML, etc.) Most of these papers use information about the font, baseline, glyph bounding boxes, line spacing, etc. to correctly recognize mathematical formulae and extract them.

For OCR, you can always use Infty. This is what the description for InftyReader says:

InftyReader recognizes scanned images of printed scientific documents including Math formulae, and outputs the recognition results in various formats: XML format for InftyEditor, LaTeX, MathML, Human-Readable TeX for the blinds, etc.
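
If the exams have a text layer (i.e. they are not pure scans), the approach described in the question may not need OCR at all: the question tags can be located from word bounding boxes and each question cropped as an image, which keeps code and equations exactly as rendered. Below is a minimal sketch of just the splitting logic, under some assumptions: word positions are already available (for example, pdfplumber's `extract_words()` reports `text`, `x0` and `top` for each word), question numbers look like `1.` or `2)` and sit near the left margin, and the last question runs to the end of the document. The names `Word`, `Region`, `QUESTION_TAG`, `find_tags`, `question_regions` and the `max_x0` threshold are all made up for illustration.

```python
import re
from dataclasses import dataclass

# Assumed tag format: a number followed by "." or ")", e.g. "1." or "12)".
QUESTION_TAG = re.compile(r"^(\d+)[.)]$")


@dataclass
class Word:
    text: str
    page: int    # zero-based page index
    x0: float    # left edge of the word, in points
    top: float   # distance from the top of the page, in points


@dataclass
class Region:
    label: str   # question number as text
    page: int
    top: float
    bottom: float


def find_tags(words, max_x0=80.0):
    """Return words that look like question numbers near the left margin."""
    tags = [w for w in words if w.x0 <= max_x0 and QUESTION_TAG.match(w.text)]
    return sorted(tags, key=lambda w: (w.page, w.top))


def question_regions(words, page_heights, max_x0=80.0):
    """Split pages into vertical bands; a question that crosses a page
    break yields one band per page it touches."""
    tags = find_tags(words, max_x0)
    regions = []
    for i, tag in enumerate(tags):
        label = QUESTION_TAG.match(tag.text).group(1)
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        # Assumption: the last question extends to the end of the document.
        last_page = nxt.page if nxt else len(page_heights) - 1
        for page in range(tag.page, last_page + 1):
            top = tag.top if page == tag.page else 0.0
            bottom = nxt.top if nxt and page == nxt.page else page_heights[page]
            if bottom > top:  # skip empty bands
                regions.append(Region(label, page, top, bottom))
    return regions
```

Each `Region` can then be turned into a bounding box `(0, top, page_width, bottom)` and rendered with whatever PDF library you use. Sub-parts such as 1a or 2f stay inside their parent question's band, which suits one flash card per question; a small margin above `top` may be needed so the first line is not clipped.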
