I have a dataset in CoNLL NER format which is basically a TSV file with two fields. The first field contains tokens from some text - one token per line (each punctuation symbol is also considered a token there) and the second field contains named entity tags for tokens in BIO format.
I would like to load this dataset into spaCy, infer new named entity tags for the text with my model and write these tags into the same TSV file as the new third column. All I know is that I can infer named entities with something like this:
import spacy
nlp = spacy.load("some_spacy_ner_model")
text = "text from conll dataset"
doc = nlp(text)
I also managed to convert the CoNLL dataset into spaCy's JSON format with this CLI command:
python -m spacy convert conll_dataset.tsv /Users/user/docs -t json -c ner
But I don't know where to go from here. I could not find how to load this JSON file into a spaCy Doc object. I tried this piece of code (found in spaCy's documentation):
from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = Doc(Vocab()).from_disk("sample.json")
but it throws an error saying ExtraData: unpack(b) received extra data.
I also don't know how to write the NER tags from the doc object back into the same TSV file so that the tokens and predicted tags stay aligned with the existing lines of the file.
And here's an extract from the TSV file as an example of the data I am dealing with:
The O
epidermal B-Protein
growth I-Protein
factor I-Protein
precursor O
. O
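
To make the goal concrete, here is a rough sketch of the round trip I have in mind, skipping the JSON conversion entirely. The helper names (read_conll, predict_tags, write_with_predictions) are made up for illustration. The file handling is plain Python; the spaCy part rests on assumptions I have not verified against my model: that a Doc can be built from the file's own tokens with Doc(nlp.vocab, words=tokens), that each pipeline component can be called on it in turn, and that token.ent_iob_ and token.ent_type_ can be combined into a BIO tag.

```python
from typing import Iterable, List, Tuple


def read_conll(path: str) -> List[Tuple[str, str]]:
    """Read (token, gold_tag) pairs; blank lines (sentence breaks) become ("", "")."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()  # splits on tabs or spaces
            if not parts:
                rows.append(("", ""))  # remember the blank line so output stays aligned
            else:
                rows.append((parts[0], parts[-1]))
    return rows


def predict_tags(nlp, tokens: List[str]) -> List[str]:
    """Assumption: `nlp` is a loaded spaCy pipeline with an NER component."""
    from spacy.tokens import Doc  # imported lazily; requires spaCy to be installed

    # Build the Doc from the file's tokens so spaCy does not re-tokenize the text.
    doc = Doc(nlp.vocab, words=tokens)
    for _, component in nlp.pipeline:
        doc = component(doc)
    # Combine IOB code and entity label into a BIO tag; anything else becomes "O".
    return [
        f"{t.ent_iob_}-{t.ent_type_}" if t.ent_iob_ in ("B", "I") else "O"
        for t in doc
    ]


def write_with_predictions(path: str, rows: List[Tuple[str, str]],
                           predicted: Iterable[str]) -> None:
    """Rewrite the TSV with the predicted tag as a third column, one tag per token line."""
    predicted = iter(predicted)
    with open(path, "w", encoding="utf-8", newline="") as f:
        for token, gold in rows:
            if not token:
                f.write("\n")  # keep sentence breaks where they were
            else:
                f.write(f"{token}\t{gold}\t{next(predicted)}\n")


# Hypothetical usage (model name and file path are placeholders):
# import spacy
# nlp = spacy.load("some_spacy_ner_model")
# rows = read_conll("conll_dataset.tsv")
# predicted = predict_tags(nlp, [tok for tok, _ in rows if tok])
# write_with_predictions("conll_dataset.tsv", rows, predicted)
```

Note that this sketch passes all tokens to the model as a single Doc and ignores sentence boundaries, which may or may not be acceptable for a large file.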