
I'm using an NER system that outputs a text file containing a list of named entities that are instances of the concept Speaker. I'm looking for a tool that can compute the system's precision, recall, and F1 by taking as input this list and a gold standard in which the instances are correctly annotated with <speaker> tags.

I have two text files, instances.txt and GoldStandard.txt, and I need to compare the extracted instances against the gold standard in order to calculate these metrics. For example, according to the second file, the first three sentences in the first file are true positives and the last sentence is a false positive.

instances.txt contains:

is sponsoring a lecture by <speaker> Antal Bejczy from
announces a talk by <speaker> Julia Hirschberg
His name is <speaker> Toshiaki Tsuboi He will
to produce a schedule by <speaker> 50% for problems

GoldStandard.txt contains:

METC is sponsoring a lecture by <speaker> Antal Bejczy from Stanford university
METC announces a talk by <speaker> Julia Hirschberg
The speaker is from USA His name is <speaker> Toshiaki Tsuboi He will
propose a solution to these problems
It led to produce a schedule by 50% for problems
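
To make the comparison concrete, this is the kind of computation I have in mind. A rough sketch (assuming every speaker name is exactly the two tokens following a <speaker> tag, as in the files above):

import re

def extract_speakers(path):
    """Collect the speaker names annotated in a file (two tokens each)."""
    speakers = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for match in re.finditer(r"<speaker>\s+(\S+\s+\S+)", line):
                speakers.add(match.group(1))
    return speakers

predicted = extract_speakers("instances.txt")
gold = extract_speakers("GoldStandard.txt")

tp = len(predicted & gold)   # extracted and annotated in the gold standard
fp = len(predicted - gold)   # extracted but not in the gold standard
fn = len(gold - predicted)   # annotated in the gold standard but missed

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

On the files above this yields 3 true positives, 1 false positive ("50% for"), and no false negatives, i.e. precision 0.75 and recall 1.0.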
  • Your title only says *"compute automatically the accuracy"*, but your question body says *"I'm looking for a tool that can compute the system's precision, recall and F1 from given input"*. Presumably you'd prefer one from standard packages, and/or one that is open-source licensed. – smci Feb 27 '19 at 23:47
  • Your instances.txt looks different. This is usually the result of an NLP pipeline, not just the NER model; that is, you also had sentence detection, tokenization, etc. So you need to take the accuracy of those steps into account as well, because the end results are not aligned. Or, if you don't care about that, manually map the results back to the gold standard and use conlleval. – Maziyar Aug 14 '19 at 11:11

2 Answers


For NER results, people usually measure precision, recall, and F1-score rather than accuracy, and conlleval is probably the most common way to calculate these metrics: https://github.com/spyysalo/conlleval.py. (It reports accuracy as well.)

The conlleval script takes CoNLL-format files as input. Take your first sentence as an example:

METC    O   O
is  O   O
sponsoring  O   O
a   O   O
lecture O   O
by  O   O
Antal   B-speaker   B-speaker
Bejczy  I-speaker   I-speaker
from    O   O
Stanford    O   O
university  O   O

where the first column is the word, the second column is the gold label, and the third column is the system output (conlleval reads the last two columns as the correct tag followed by the guessed tag). O indicates that a token belongs to no chunk. The prefixes B- and I- mark the beginning of a chunk and a token inside a chunk, respectively. Sentences are separated by an empty line.
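
Your two files aren't in this format yet, so you'd need to generate it, for example by mapping the extracted names back onto the gold tokens (as suggested in the comments on the question). Here is a rough sketch of one way to do that; it assumes speaker names are always exactly two tokens, since the <speaker> tags have no closing counterpart to mark where a name ends:

def bio_tags(line):
    """Turn a line with inline <speaker> tags into (token, BIO tag) pairs."""
    words = line.split()
    tokens, tags = [], []
    i = 0
    while i < len(words):
        if words[i] == "<speaker>":
            # Assumption: the two tokens after the tag are the full name.
            tokens.extend(words[i + 1:i + 3])
            tags.extend(["B-speaker", "I-speaker"])
            i += 3
        else:
            tokens.append(words[i])
            tags.append("O")
            i += 1
    return list(zip(tokens, tags))

def conll_rows(gold_line, predicted_names):
    """Produce (token, gold tag, system tag) rows for one gold sentence.

    The system's extractions are mapped back onto the gold tokens by
    matching two-token windows, so a spurious extraction such as
    "50% for" still shows up as a false positive.
    """
    pairs = bio_tags(gold_line)
    tokens = [t for t, _ in pairs]
    pred = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        if " ".join(tokens[i:i + 2]) in predicted_names:
            pred[i], pred[i + 1] = "B-speaker", "I-speaker"
    return [(tok, g, p) for (tok, g), p in zip(pairs, pred)]

# Names the system extracted from instances.txt, e.g. via the
# extract_speakers() sketch in the question.
predicted_names = {"Antal Bejczy", "Julia Hirschberg",
                   "Toshiaki Tsuboi", "50% for"}

with open("GoldStandard.txt", encoding="utf-8") as gold_file, \
        open("merged.conll", "w", encoding="utf-8") as out:
    for line in gold_file:
        if not line.strip():
            continue
        for tok, g, p in conll_rows(line, predicted_names):
            out.write(f"{tok}\t{g}\t{p}\n")
        out.write("\n")  # empty line between sentences

The resulting merged.conll can then be passed to conlleval.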

gnetmil
  • Thank you for your reply. I have edited my question to explain the target task. Is it possible to perform the latter with conlleval? – Fasun Feb 27 '19 at 23:33
  • I'm a little bit confused. Usually an NER system only annotates the sentence with pre-defined tag(s), as in your example, and won't modify the text, while the sentences in `instances.txt` and `GoldStandard.txt` look different... For example, "METC" and "Stanford university" are missing from `instances.txt`. – gnetmil Mar 01 '19 at 19:56
  • @J.Fatine I edited my answer. Also, please check my last comment; I forgot to @ you. Let me know if you have any questions. – gnetmil Mar 02 '19 at 16:58

It depends entirely on your use-case and how much work you do on cleaning up/disambiguating the output from the NER. There is also the weighted F1 score; you presumably care more about missed references (i.e., you want higher recall) than about false positives (higher precision), except in certain other use-cases where you don't (issuing subpoenas or warrants, banning users for chat abuse).

sklearn.metrics.f1_score() implements weighted F1 via its average='weighted' parameter.
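
A minimal sketch on token-level BIO labels (hypothetical lists here; note that this scores individual tokens, whereas conlleval scores whole entity spans):

from sklearn.metrics import f1_score

# Gold vs. system tags for one tokenized sentence (token-level BIO labels).
gold = ["O", "O", "O", "O", "O", "B-speaker", "I-speaker", "O"]
pred = ["O", "O", "O", "O", "O", "B-speaker", "I-speaker", "B-speaker"]

print(f1_score(gold, pred, average="weighted"))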

Tell us more about your application: how bad is it if you mistake, misidentify, or confuse a speaker name (false positive), versus missing a valid one (false negative)?

smci