In my spaCy project, I would like to initialize a Doc object with text, labels and whitespaces. However, spaCy doesn't appreciate the way I provide the labels, and shows its lack of appreciation in the following error message:
    doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
  File "spacy\tokens\doc.pyx", line 297, in spacy.tokens.doc.Doc.__init__
ValueError: [E177] Ill-formed IOB input detected: ('', 'O')
The code:
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")
token_texts = ["I", "like", "potatoes", "!"]
labels = [("", "O"), ("", "O"), ("food", "I"), ("", "O")]
whitespaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
Does anyone know exactly how to serve spaCy the entities on a silver platter?
The spaCy Doc documentation states
ents: A list of strings, of the same length of words, to assign the token-based IOB tag. Defaults to None. Optional[List[str]]
The type hint List[str] made me attempt ["", "", "food", ""], which, however, results in the same error message.
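For reference, here is a minimal sketch of what the quoted documentation and the E177 message seem to point at. It assumes that each entry of ents has to be a complete IOB tag string, meaning either "O" or a prefix joined to the entity type with a hyphen, such as "B-food". That reading is an assumption and not something the quoted docs spell out. The helper names to_iob_tags and build_doc are hypothetical and not part of spaCy.

```python
from typing import List, Tuple


def to_iob_tags(labels: List[Tuple[str, str]]) -> List[str]:
    """Turn (entity_type, iob) pairs into tag strings such as "O" or "B-food"."""
    tags = []
    prev_type = ""
    for ent_type, iob in labels:
        # Tokens outside any entity become a plain "O".
        if iob == "O" or not ent_type:
            tags.append("O")
            prev_type = ""
            continue
        # An "I" that does not continue an entity of the same type
        # is treated as the start of a new entity.
        if iob == "I" and prev_type != ent_type:
            iob = "B"
        tags.append(f"{iob}-{ent_type}")
        prev_type = ent_type
    return tags


def build_doc(vocab, words, spaces, labels):
    """Hypothetical wrapper: build a Doc from tuple-style labels."""
    # Imported lazily so to_iob_tags can be used without spaCy installed.
    from spacy.tokens import Doc

    # Assumption: ents accepts one IOB tag string per word.
    return Doc(vocab, words=words, spaces=spaces, ents=to_iob_tags(labels))
```

With the variables from the code above, this would be called as build_doc(nlp.vocab, token_texts, whitespaces, labels), and under the stated assumption the ents passed to Doc would be ["O", "O", "B-food", "O"].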
Stack Overflow links that do not have the answer:
Convert NER SpaCy format to IOB format
Convert list of IOB formatted data to simple IOB formatted data