Evaluation metrics for multiple correct answers in a QA system

Question

I'm building a QA system and have my own data for this task. My problem is that one question can have two or more correct answers. For example:

Question: "What does A have to do?"

Correct answers:

"A has to clean the floor"
"A has to hang up the laundry"

My QA model returns the k best answers. However, in some cases k does not equal the number of correct answers, and some of the k answers are incorrect.

Most public datasets, such as SQuAD and TriviaQA, pair each question with a single answer. In my case, a question can have multiple answers. What kind of evaluation metric should I use? Can I use the F1 score?

score 1 · Answer 1 · answered Sep 29 '20 at 06:59

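The evaluation metric should always depend on how the system you are developing will be used. The F1 score is certainly a reasonable statistic: it combines precision (how many of the returned answers are correct) and recall (how many of the correct answers are returned), so it tells you a lot about the distribution of correct and wrong answers.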
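As a rough illustration of F1 for a question with several correct answers, here is a minimal sketch that treats the predicted and correct answers as sets. It assumes exact string matching, and the function name `precision_recall_f1` and the example predictions are made up for illustration; a real system would probably need normalization or fuzzy matching.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for a single question.

    Assumption: an answer counts as correct only on an exact string match.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: two correct answers, three returned answers (k = 3).
gold = ["clean the floor", "hang up the laundry"]
predicted = ["clean the floor", "cook dinner", "walk the dog"]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Averaging the per-question F1 over the whole dataset (macro-averaging) is one common way to get a single number, though that choice is also an assumption here.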

If you are going to present a single best answer from your system, you should also measure the 1-best accuracy. If you are going to present multiple answers, you should measure the precision at n, i.e., the proportion of correct answers among the n best answers. When a question has more than one correct answer, recall at n (the proportion of the correct answers that appear among the n best) is a useful complement.
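Both measures can be sketched in a few lines. This assumes each question comes with a ranked list of answers (best first) and a set of correct answers, compared by exact match; the helper names and the sample data are hypothetical.

```python
def one_best_accuracy(examples):
    """Fraction of questions whose top-ranked answer is correct.

    `examples` is a list of (ranked_answers, gold_answers) pairs.
    """
    hits = sum(
        1 for ranked, gold in examples
        if ranked and ranked[0] in gold
    )
    return hits / len(examples)


def precision_at_n(ranked, gold, n):
    """Proportion of correct answers among the n best answers.

    Assumption: we divide by n even if fewer than n answers were returned.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(1 for answer in ranked[:n] if answer in gold) / n


# Hypothetical ranked outputs (best first) with their correct answers.
examples = [
    (["clean the floor", "cook dinner", "hang up the laundry"],
     {"clean the floor", "hang up the laundry"}),
    (["walk the dog", "clean the floor"],
     {"clean the floor"}),
]

accuracy = one_best_accuracy(examples)
p_at_2 = precision_at_n(examples[0][0], examples[0][1], n=2)
p_at_3 = precision_at_n(examples[0][0], examples[0][1], n=3)
print(accuracy, p_at_2, p_at_3)
```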

If you are not sure what a suitable number of answers to present is, you might want to plot the ROC curve and compute the AUC score.
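For the AUC, one way to compute it without plotting is the pairwise definition: the probability that a randomly chosen correct answer is scored higher than a randomly chosen incorrect one, with ties counted as half. The sketch below assumes the model assigns a confidence score to every candidate answer; the `roc_auc` helper and the scores are illustrative only.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for a correct answer and 0 for an incorrect one.
    Quadratic in the number of candidates, which is fine for a sketch.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical model confidences for five candidate answers.
scores = [0.9, 0.8, 0.6, 0.4, 0.3]
labels = [1, 0, 1, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

If scikit-learn is available, `sklearn.metrics.roc_curve` and `sklearn.metrics.roc_auc_score` provide the curve points and the same score.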
