
I am trying to upgrade my spaCy version to the nightly release, mainly to use spacy-transformers.

So I converted spaCy's simple training data, in a format like

```
td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
      ["I like London.", {"entities": [(7, 13, "LOC")]}]]
```

into this token-level format:

```
[[{"head": 0, "dep": "", "tag": "", "orth": "Who", "ner": "O", "id": 0},
  {"head": 0, "dep": "", "tag": "", "orth": "is", "ner": "O", "id": 1},
  {"head": 0, "dep": "", "tag": "", "orth": "Shaka", "ner": "B-FRIENDS", "id": 2},
  {"head": 0, "dep": "", "tag": "", "orth": "Khan", "ner": "L-FRIENDS", "id": 3},
  {"head": 0, "dep": "", "tag": "", "orth": "?", "ner": "O", "id": 4}],
 [{"head": 0, "dep": "", "tag": "", "orth": "I", "ner": "O", "id": 0},
  {"head": 0, "dep": "", "tag": "", "orth": "like", "ner": "O", "id": 1},
  {"head": 0, "dep": "", "tag": "", "orth": "London", "ner": "U-LOC", "id": 2},
  {"head": 0, "dep": "", "tag": "", "orth": ".", "ner": "O", "id": 3}]]
```

using the following script:

```
import json
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

sentences = []
for text, annotations in td:
    doc = nlp(text)
    # Convert character offsets to per-token BILUO tags
    tags = offsets_to_biluo_tags(doc, annotations["entities"])
    tokens = []
    for n, (tok, tag) in enumerate(zip(doc, tags)):
        tokens.append({
            "head": 0,
            "dep": "",
            "tag": "",
            "orth": tok.orth_,
            "ner": tag,
            "id": n,
        })
    sentences.append(tokens)

with open("train_data.json", "w") as js:
    json.dump(sentences, js)
```


Then I tried to convert this train_data.json using spaCy's convert command:

```python -m spacy convert train_data.json converted/```


but the result in the converted folder is

```✔ Generated output file (0 documents): converted/train_data.spacy``` 

which means it didn't create any documents in the dataset.

Can anybody help with what I am missing?

I am trying to do this with spacy-nightly.
shahid khan

1 Answer


You can skip the intermediate JSON step and convert the annotations directly to a DocBin.

```
import spacy
from spacy.training import Example
from spacy.tokens import DocBin

td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
      ["I like London.", {"entities": [(7, 13, "LOC")]}]]

nlp = spacy.blank("en")
db = DocBin()

for text, annotations in td:
    # Example.from_dict aligns the character offsets to the tokenization
    example = Example.from_dict(nlp.make_doc(text), annotations)
    db.add(example.reference)

db.to_disk("td.spacy")
```

See: https://nightly.spacy.io/usage/v3#migrating-training-python
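To sanity-check the result, you can read the DocBin back from disk and inspect the entities. A quick self-contained sketch (it rebuilds the same `td` data before round-tripping through `td.spacy`):

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
      ["I like London.", {"entities": [(7, 13, "LOC")]}]]

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in td:
    db.add(Example.from_dict(nlp.make_doc(text), annotations).reference)
db.to_disk("td.spacy")

# Read it back and confirm the entity annotations survived
for doc in DocBin().from_disk("td.spacy").get_docs(nlp.vocab):
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])
```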

(If you do want to use the intermediate JSON format, the specs are here: https://spacy.io/api/annotation#json-input . You can include just orth and ner in the tokens and leave the other features out, but you need the full structure with paragraphs, raw, and sentences. An example is here: https://github.com/explosion/spaCy/blob/45c9a688285081cd69faa0627d9bcaf1f5e799a1/examples/training/training-data.json)
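As a rough sketch of that nested layout, the token lists from the question would need to be wrapped in the document → paragraphs → sentences structure. The key names follow the linked spec; the surrounding values (document id, raw text, single sentence) are illustrative:

```python
import json

# One token list per sentence, as built in the question (first sentence only here)
sentences = [
    [{"head": 0, "dep": "", "tag": "", "orth": "Who", "ner": "O", "id": 0},
     {"head": 0, "dep": "", "tag": "", "orth": "is", "ner": "O", "id": 1},
     {"head": 0, "dep": "", "tag": "", "orth": "Shaka", "ner": "B-FRIENDS", "id": 2},
     {"head": 0, "dep": "", "tag": "", "orth": "Khan", "ner": "L-FRIENDS", "id": 3},
     {"head": 0, "dep": "", "tag": "", "orth": "?", "ner": "O", "id": 4}],
]

# Each document wraps its sentences in paragraphs, each carrying the raw text
train_json = [{
    "id": 0,
    "paragraphs": [{
        "raw": "Who is Shaka Khan?",
        "sentences": [{"tokens": tokens, "brackets": []} for tokens in sentences],
    }],
}]

with open("train_data.json", "w") as js:
    json.dump(train_json, js)
```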

aab
  • @aab thanks for the response, but I have a doubt: when I print(example.reference) I can see only "Who is Shaka Khan?" and "I like London.", but no entities. Is that expected, or is there something wrong? – shahid khan Nov 04 '20 at 10:33
  • `example.reference` is a `Doc`, so `print(doc)` just shows `doc.text`. Look at `doc.ents` to see the entities. – aab Nov 04 '20 at 16:54
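To illustrate the point in the comment above, a minimal sketch using one example from the question:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
example = Example.from_dict(nlp.make_doc("Who is Shaka Khan?"),
                            {"entities": [(7, 17, "FRIENDS")]})

ref = example.reference          # a Doc object
print(ref)                       # prints only ref.text: "Who is Shaka Khan?"
print([(ent.text, ent.label_) for ent in ref.ents])  # [('Shaka Khan', 'FRIENDS')]
```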