I have thousands of exams in PDF, and I'd like to extract their questions into a standard format (JSON, YAML or XML).
They are multiple choice:
Question 1
Who was the first man to walk on the moon?
a) Yuri Gagarin
b) Ellen Ripley
c) Neil Armstrong
d) Shepard
Question 2
How many planets are in the solar system ?
a) 10
b) 12
c) 14
d) 15
(...)
In JSON:
{
  "number": 1,
  "wording": "Who was the first man to walk on the moon?",
  "alternatives": {
    "a": "Yuri Gagarin",
    "b": "Ellen Ripley",
    "c": "Neil Armstrong",
    "d": "Shepard"
  }
}
The caveat is that the exams were written by different teachers, so their formatting differs slightly. That means that even after extracting plain text, I can't reliably match questions with regular expressions. (I've tried, and the number of combinations of wording structure and alternative structure is huge.)
For example:
"Question X (...)".
"Question (X) (...)".
"Question X - (...)".
"X) (...)".
"X- (...)".
The alternatives may also vary:
a) (...)
a. (...)
a- (...)
1) (...)
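To make the difficulty concrete, here is a rough sketch of the kind of regex-based parsing I mean. It covers only the variants listed above and is not a working solution for the real exams. The names (parse_questions, KEYWORD_HEADER, BARE_NUMBER, LETTER_ALT) and the sequence heuristic for numbered alternatives are my own assumptions for illustration.

```python
import json
import re

# "Question X", "Question (X)", "Question X -": the keyword makes these unambiguous.
KEYWORD_HEADER = re.compile(r"^Question\s*\(?(\d+)\)?\s*[-.:]?\s*(.*)$", re.IGNORECASE)
# "X)" or "X-": may be a question header OR a numbered alternative.
BARE_NUMBER = re.compile(r"^(\d+)\s*[-)]\s*(.*)$")
# "a)", "a.", "a-": lettered alternatives.
LETTER_ALT = re.compile(r"^([a-e])\s*[-).]\s*(.+)$", re.IGNORECASE)


def parse_questions(text):
    questions = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = KEYWORD_HEADER.match(line)
        bare = BARE_NUMBER.match(line)
        letter = LETTER_ALT.match(line)
        if bare and current and current["wording"]:
            # Heuristic (an assumption): a bare number that continues the
            # 1, 2, 3... sequence of alternatives is treated as an alternative.
            # This misfires when, e.g., question 5 follows four numbered options.
            if int(bare.group(1)) == len(current["alternatives"]) + 1:
                current["alternatives"][bare.group(1)] = bare.group(2)
                continue
        if header or bare:
            number, rest = (header or bare).groups()
            current = {"number": int(number), "wording": rest, "alternatives": {}}
            questions.append(current)
        elif letter and current:
            current["alternatives"][letter.group(1).lower()] = letter.group(2)
        elif current and not current["alternatives"]:
            # Wording on a separate line from its header.
            current["wording"] = (current["wording"] + " " + line).strip()
    return questions


SAMPLE = """Question 1
Who was the first man to walk on the moon?
a) Yuri Gagarin
b. Ellen Ripley
c- Neil Armstrong
d) Shepard
Question (2) How many planets are in the solar system?
1) 10
2) 12
3- 14
4) 15
"""

if __name__ == "__main__":
    print(json.dumps(parse_questions(SAMPLE), indent=2))
```

The "X)" form is the core problem: the same pattern can start a question or an alternative, so every new layout needs another special case like the sequence heuristic above.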
I guess I need some sort of machine learning tool to "teach" the program what a question looks like and have it find them.
As an alternative, since the questions are physically separated from one another on the printed page, I thought I could convert the PDFs into images and use some sort of image recognition.
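To illustrate the layout idea, here is a minimal sketch that splits a page into blocks wherever there is a tall enough band of blank rows. It assumes the page has already been rasterized and binarized by some other tool (not shown) into a list of rows of 0/1 pixels, where 1 means ink. The function name split_on_blank_bands and the min_gap threshold are hypothetical, and each block would still need OCR afterwards.

```python
def split_on_blank_bands(rows, min_gap=3):
    """Return (start, end) row ranges of content blocks separated by at
    least `min_gap` consecutive blank rows; `end` is exclusive."""
    blocks = []
    start = None     # first inked row of the block being built
    last_ink = None  # most recent inked row seen in that block
    for i, row in enumerate(rows):
        if not any(row):
            continue  # blank row: nothing to do until ink reappears
        if start is None:
            start = i
        elif i - last_ink > min_gap:
            # The blank band was tall enough: close the previous block.
            blocks.append((start, last_ink + 1))
            start = i
        last_ink = i
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks


# Toy "page": two inked rows close together, a tall gap, then two more inked rows.
page = (
    [[0, 0]] * 2 + [[1, 0]] + [[0, 0]] + [[0, 1]]
    + [[0, 0]] * 4 + [[1, 1]] * 2 + [[0, 0]]
)
blocks = split_on_blank_bands(page, min_gap=3)
```

The single blank row between the first two inked rows is below the threshold, so they stay in one block, while the four-row gap starts a new one. Whether real exams have gaps consistent enough for a fixed threshold is exactly what I'm unsure about.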
Is this feasible? Is there a tool (package, library, algorithm) for identifying these questions?