I am trying to convert my IOB (token-per-line NER) files (train/test) to the spaCy 3 binary format.
Example of the input format (separator "\t", no blank lines, UTF-8 encoding):
Département B-LOCATION
des I-LOCATION
Bouches-du-Rhône I-LOCATION
. O
Port B-INSTALLATION
de I-INSTALLATION
la I-INSTALLATION
Ciotat I-INSTALLATION
. O
Avant-projet O
du O
môle B-INSTALLATION
Bérouard I-INSTALLATION
au O
port B-INSTALLATION
de I-INSTALLATION
La I-INSTALLATION
Ciotat I-INSTALLATION
. O
When I run:
!python -m spacy convert -c iob -s -n 10 -b fr_core_news_sm /content/ner4archives_v0_train.iob .
!python -m spacy convert -c iob -s -n 10 -b fr_core_news_sm /content/ner4archives_v0_test.iob .
I get this error:
ValueError: [E903] The token-per-line NER file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert
I looked at the example data in the spaCy repository (https://github.com/explosion/spaCy/tree/master/extra/example_data/ner_example_data), but I cannot find any difference between my data and the examples.
I tried reformatting my file with different separators ("\t", " ", "|"), but I always get the same error. I also checked whether any tokens or labels are empty, and none are.
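A check of this kind can be sketched in Python as follows. This is only a minimal sketch under assumptions, not the exact script used here: `find_malformed_lines` is a hypothetical helper name, and it assumes that columns are split on any whitespace, which may or may not match exactly how the converter reads the file. It reports every non-blank line that does not split into exactly two columns, for example a token that consists only of a non-breaking space or a token that contains a space.

```python
from typing import Iterable, List, Tuple


def find_malformed_lines(
    lines: Iterable[str], expected_columns: int = 2
) -> List[Tuple[int, str]]:
    """Return (line_number, line) for each non-blank line whose number of
    whitespace-separated columns differs from expected_columns.

    Hypothetical helper for illustration; not part of spaCy.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank lines are skipped rather than reported.
            continue
        # str.split() with no argument splits on any run of whitespace,
        # including tabs and non-breaking spaces (U+00A0).
        if len(line.split()) != expected_columns:
            problems.append((number, line))
    return problems


if __name__ == "__main__":
    # Small in-memory sample; the last two lines are deliberately broken.
    sample = [
        "Département\tB-LOCATION\n",
        "des\tI-LOCATION\n",
        "\u00a0\tO\n",                # token is only a non-breaking space
        "La Ciotat\tB-LOCATION\n",    # token contains a space
    ]
    for number, line in find_malformed_lines(sample):
        print(number, repr(line))
```

To run it against a real file, the sample list can be replaced with an open file handle, for example `open("/content/ner4archives_v0_train.iob", encoding="utf-8")`.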
Does anyone have any leads? Thanks in advance.