I have several exams in PDF format. I want to programmatically extract each question as a separate image/document. OCR is not ideal because it does not preserve code or equation formatting well. The end goal is to make flash cards, with each card containing an image of an entire question. Several questions can share a page, and questions can also be multi-part (e.g. 1a, 2f, etc.).
Currently, I'm considering using OCR to locate the question labels (e.g. 1, 2, 3, etc.), finding their positions in the PDF, and then extracting an image that spans from the start of one question to the start of the next. Is there any framework or software that can do this, or an alternative approach that would make this easier?
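To make the idea concrete, here is a rough sketch of the slicing step I have in mind. It assumes I already have each word's page index and top-left coordinates, with y increasing down the page. Those could come from OCR or, if the PDFs have a text layer, from something like PyMuPDF's `page.get_text("words")`. The names `Word`, `Region`, `question_regions`, and `left_margin` are all made up for illustration, and the label pattern (a number followed by `.` or `)` near the left margin) is only a guess at how my exams are laid out. Rendering and cropping the actual images is not shown.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    page: int    # zero-based page index
    x0: float    # left edge of the word
    y0: float    # top edge of the word (y grows downward)
    text: str


class Region(NamedTuple):
    label: str       # question number, e.g. "2"
    page: int
    y_top: float
    y_bottom: float


# Assumption: top-level questions look like "1." or "2)"; sub-parts such as
# "(a)" do not match, so they stay inside their parent question.
LABEL_RE = re.compile(r"^(\d+)[.)]$")


def question_regions(words, page_heights, left_margin=80.0):
    """Return one vertical band per page for each question."""
    starts = []
    for w in words:
        m = LABEL_RE.match(w.text)
        # Only accept labels near the left margin, to skip numbers in body text.
        if m and w.x0 <= left_margin:
            starts.append((w.page, w.y0, m.group(1)))
    starts.sort()  # reading order: by page, then top to bottom

    regions = []
    for i, (page, y, label) in enumerate(starts):
        if i + 1 < len(starts):
            end_page, end_y = starts[i + 1][0], starts[i + 1][1]
        else:
            # Last question: run to the bottom of its own page.
            end_page, end_y = page, page_heights[page]
        # A question that spills onto later pages yields one band per page.
        for p in range(page, end_page + 1):
            top = y if p == page else 0.0
            bottom = end_y if p == end_page else page_heights[p]
            if bottom > top:
                regions.append(Region(label, p, top, bottom))
    return regions
```

Each `Region` would then be rendered as a clipped image of its page, and a question with bands on two pages would need those images stitched together. Headers, footers, and two-column layouts are not handled here.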