I'm building a QA system using my own dataset. My problem is that one question can have two or more correct answers. For example:
Question: "What does A have to do?"
Correct answers:
- "A has to clean the floor"
- "A has to hang up the laundry"
My QA model returns the k best answers. However, in some cases k does not equal the number of correct answers, and some of the k answers are incorrect.
Most public datasets, such as SQuAD and TriviaQA, pair each question with a single answer. In my case, a question can have multiple answers. So what kind of evaluation metric should I use? Can I use the F1 score?
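
To make concrete what I mean by F1 here, this is a minimal sketch of a set-level precision/recall/F1 between the k predicted answers and the correct answers for one question. It assumes answers are matched by exact string equality after lowercasing and whitespace normalization, which is my own simplification; `normalize` and `set_prf` are hypothetical helper names, not from any library.

```python
def normalize(text: str) -> str:
    # Assumption: two answers match if they are equal after
    # lowercasing and collapsing whitespace.
    return " ".join(text.lower().split())


def set_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Precision, recall and F1 between a set of predicted answers and a set of gold answers."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    true_pos = len(pred & ref)  # predicted answers that are actually correct
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["A has to clean the floor", "A has to hang up the laundry"]
predicted = [  # k = 3, only one of them is correct
    "A has to clean the floor",
    "A has to cook dinner",
    "A has to walk the dog",
]
precision, recall, f1 = set_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With this definition, incorrect extra answers lower precision and missed correct answers lower recall, so it would cover both of the cases I described, but I'm not sure whether it is the standard way to evaluate this setting.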