
I am trying AllenNLP reading comprehension with the Transformer QA model to get the answer to the question "Who is CEO of ABB?" from the passage "ABB opened its first dedicated global healthcare research center for robotics in October 2019.".

As expected, the UI demo returns no answer, and the API response in the network tab confirms this: in the JSON response, best_span_str is empty, but best_span_scores is 9.9. (Screenshot of the demo UI and the network-tab API response attached.)

When I execute similar code via the Python library, I get a different result.

from allennlp.predictors.predictor import Predictor
import allennlp_models.rc  # registers the reading-comprehension models, including transformer_qa

def allen_nlp_demo_1():
    # Load the pretrained Transformer QA archive and run it on the passage/question pair.
    predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/transformer-qa-2020-05-26.tar.gz")
    data = predictor.predict(
        passage="ABB opened its first dedicated global healthcare research center for robotics in October 2019.",
        question="Who is CEO of ABB?"
    )
    print(data)

if __name__ == '__main__':
    allen_nlp_demo_1()

It produces the following JSON output:

{
  "span_start_logits": [...],
  "best_span": [
    7,
    15
  ],
  "best_span_scores": -10.418445587158203,
  "loss": 0,
  "best_span_str": "healthcare research center for robotics in October 2019",
  "context_tokens": [...],
  "id": "1",
  "answers": []
}

Here best_span_str is populated, and best_span_scores is -10.418445587158203. (Python code and output snippet attached.)
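Instead of printing the whole dictionary, I can pull out just the fields I am comparing with the demo's network-tab response (values taken from the run above):

print(data["best_span"])         # [7, 15]
print(data["best_span_str"])     # "healthcare research center for robotics in October 2019"
print(data["best_span_scores"])  # -10.418445587158203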

Why does the output differ between the UI demo and the Python library? Also, what is the range of best_span_scores, and how can I decide a threshold to discard false results?

sujoysett
1 Answer

  1. Regarding the discrepancy between the demo's output and your run: the demo actually uses a different archive file. The usage code on the demo page has now been updated to reflect the new file path (transformer-qa-2020-10-03.tar.gz).

  2. For finding the best_span, the model treats a prediction of the [CLS] token as meaning the question is not answerable. This shows up as a best_span of [-1, -1]. When the question is answerable, the span scores are only relative to each other; we pick the span with the highest score. So there isn't a fixed threshold that can be used in all cases. A sketch combining both points follows below.
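A minimal sketch of both points, assuming the newer archive follows the same storage.googleapis.com path pattern as the one in your question (the exact URL may differ):

from allennlp.predictors.predictor import Predictor
import allennlp_models.rc  # registers the transformer_qa model

# Assumed URL: the demo lists the archive name transformer-qa-2020-10-03.tar.gz;
# the host/path below just mirrors the pattern from the question.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/transformer-qa-2020-10-03.tar.gz"
)
output = predictor.predict(
    passage="ABB opened its first dedicated global healthcare research center for robotics in October 2019.",
    question="Who is CEO of ABB?"
)

# An unanswerable question is signalled by best_span == [-1, -1] (a [CLS] prediction),
# not by the value of best_span_scores, so no fixed score threshold is needed here.
if list(output["best_span"]) == [-1, -1]:
    print("Model says the question is not answerable from this passage.")
else:
    print(output["best_span_str"], output["best_span_scores"])

The check is on best_span rather than best_span_scores because, as noted above, the scores are only comparable relative to each other within a single prediction.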

akshitab