
I have thousands of exams in PDF, and I'd like to extract their questions into a standard format (JSON, YAML or XML).

They are multiple choice:

Question 1

Who was the first man to walk on the moon?

a) Yuri Gagarin

b) Ellen Ripley

c) Neil Armstrong

d) Shepard

Question 2

How many planets are in the solar system?

a) 10

b) 12

c) 14

d) 15

(...)

In JSON:

{
  "number": 1,
  "wording": "Who was the first man to walk on the moon?",
  "alternatives": {
    "a": "Yuri Gagarin",
    "b": "Ellen Ripley",
    "c": "Neil Armstrong",
    "d": "Shepard"
  }
}

The caveat is that those exams were made by different teachers, so their formats differ slightly. That means that even after extracting them to plain text, I cannot reliably match the questions with regular expressions. (I've tried, and the number of combinations of wording structure and alternative structure is huge.)

For example:

"Question X (...)".

"Question (X) (...)".

"Question X - (...)".

"X) (...)".

"X- (...)".

The alternatives also might change:

a) (...)

a. (...)

a- (...)

1) (...)

I guess I need some sort of machine learning tool to "teach" the program what a question looks like and have it find the questions.

As an alternative, since the questions are physically separated from one another on the printed page, I thought I could convert those PDFs into images and use some sort of image recognition.

Is it feasible? Is there a tool (package, library, algorithm) for identifying those questions?

Victor Ribeiro

1 Answer


There is no straightforward machine learning solution to your problem. If you have thousands of PDFs but only tens of formats, you are better off writing a string parser for each format. If you take the machine learning path, finding a working solution will probably take longer. Python should help.
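As a rough illustration of the per-format parser idea, here is a minimal Python sketch. It assumes the PDF text has already been extracted to a plain string and covers only the header variants listed in the question ("Question X", "Question (X)", "Question X -", "X)", "X-") with lettered alternatives ("a)", "a.", "a-"). The names `QUESTION_RE`, `ALTERNATIVE_RE` and `parse_questions` are made up for this example, and exams that number their alternatives ("1)") would need a separate parser, because that form is ambiguous with the "X)" question header.

```python
import json
import re

# Question headers: "Question 1", "Question (1)", "Question 1 -", "1)", "1-".
QUESTION_RE = re.compile(
    r"^\s*(?:Question\s*\(?(\d+)\)?\s*[-.:)]?|(\d+)\s*[-)])\s*(.*)$",
    re.IGNORECASE,
)
# Lettered alternatives: "a)", "a.", "a-".
ALTERNATIVE_RE = re.compile(r"^\s*([a-e])\s*[).\-]\s*(.+)$", re.IGNORECASE)


def parse_questions(text):
    """Parse plain exam text into a list of question dictionaries."""
    questions = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        header = QUESTION_RE.match(line)
        alternative = ALTERNATIVE_RE.match(line)
        if header:
            # Start a new question; the wording may follow on the same line.
            number = int(header.group(1) or header.group(2))
            current = {
                "number": number,
                "wording": header.group(3).strip(),
                "alternatives": {},
            }
            questions.append(current)
        elif current is None:
            # Text before the first question header is ignored.
            continue
        elif alternative:
            label = alternative.group(1).lower()
            current["alternatives"][label] = alternative.group(2).strip()
        elif not current["alternatives"]:
            # Wording lines that come before the first alternative.
            current["wording"] = (current["wording"] + " " + line).strip()
    return questions


sample = """Question 1

Who was the first man to walk on the moon?

a) Yuri Gagarin
b. Ellen Ripley
c- Neil Armstrong
d) Shepard

2) How many planets are in the solar system?
a) 10
b) 12
"""
print(json.dumps(parse_questions(sample), indent=2))
```

Each additional format family would get its own pair of patterns (or its own function), and you would pick the parser per exam, for instance by trying each one and keeping the result that yields the most complete questions.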

Haja Maideen
  • It is actually 100 thousand. But I guess you are right. I'll try a semi-automatic approach: parse automatically, then verify and correct manually. – Victor Ribeiro Jul 20 '14 at 16:57
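
One way the semi-automatic approach mentioned in the comment could be organized is to run simple sanity checks on every parsed question and send only the suspicious ones to a human. The sketch below is an assumption about how that might look, not something from the thread: it expects questions as dictionaries with "number", "wording" and "alternatives" keys (the JSON shape from the question), and the function names and the specific checks are hypothetical.

```python
def needs_review(question, min_alternatives=2):
    """Return the reasons why a parsed question looks suspicious."""
    reasons = []
    if not question.get("wording", "").strip():
        reasons.append("empty wording")
    labels = sorted(question.get("alternatives", {}))
    if len(labels) < min_alternatives:
        reasons.append("too few alternatives")
    # Labels are expected to run a, b, c, ... without gaps.
    expected = [chr(ord("a") + i) for i in range(len(labels))]
    if labels != expected:
        reasons.append("alternative labels are not consecutive")
    return reasons


def split_for_review(questions):
    """Split parsed questions into accepted ones and ones to check by hand."""
    accepted, flagged = [], []
    previous_number = 0
    for question in questions:
        reasons = needs_review(question)
        number = question.get("number")
        if not isinstance(number, int):
            reasons.append("missing question number")
        else:
            if number != previous_number + 1:
                reasons.append("question number out of sequence")
            previous_number = number
        if reasons:
            flagged.append((question, reasons))
        else:
            accepted.append(question)
    return accepted, flagged


parsed = [
    {"number": 1, "wording": "Who was the first man to walk on the moon?",
     "alternatives": {"a": "Yuri Gagarin", "b": "Ellen Ripley"}},
    {"number": 3, "wording": "", "alternatives": {"a": "10", "c": "14"}},
]
accepted, flagged = split_for_review(parsed)
for question, reasons in flagged:
    print(question["number"], reasons)
```

With around 100 thousand exams, the share of questions that end up flagged also gives a quick signal of which formats still need their own parser.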