Is it possible to train two trainable components in spaCy with two different datasets? I would like to use both the NER and the text classifier, but the training datasets for these two components are annotated differently, so I don't know how to train both components at once.

Should I train each task in a separate pipeline and assemble both pipelines at the end? Or should I train the NER, package that pipeline, and then use the package as input when training the text classifier?

Many thanks in advance for your help

JulienD

1 Answer

You won't be able to train these at the same time if the two components don't share the same dataset.

If you're working with spaCy v3, it should be relatively straightforward to combine the two training steps into one final pipeline. For instance, create a config that trains the NER first, and save the trained pipeline to disk. Then, create a new config where you source the NER from the previously trained pipeline, and define this NER component as frozen:

[nlp]
pipeline = ["ner", "textcat"]
...

[training]
frozen_components = ["ner"]
...

[components.ner]
source = "your_trained_ner_location"
component = "ner"

[components.textcat]
factory = "textcat"
...

Now run training on your textcat.

FYI: this kind of multi-step workflow can easily be set up with spaCy projects.

Sofie VL
  • Thank you, Sofie, for your answer. In the meantime, I tried this solution, but it seems that freezing the NER while training the textcat decreases the performance of the NER. I found on your GitHub that I should use `replace_listeners` in the config file to overcome this issue. After training, I tried to load the pipeline, but the object could not be deserialized (I use a transformer-based pipeline). Could you help me with that? – JulienD Jun 14 '21 at 13:07
  • The performance of the NER will degrade if you're attempting to use the same transformer for both tasks, because the transformer will only be updated for the `textcat` if `ner` is frozen. We're currently fixing the `replace_listeners` functionality for transformers though, as it was broken (but it would indeed have been the correct solution here): https://github.com/explosion/spacy-transformers/pull/277 – Sofie VL Jun 14 '21 at 14:26
  • Alternatively, you could have two transformers in the pipeline, have NER and textcat listen to a distinct one (with `upstream_name` something specific and not just `"*"`), and then freeze the NER and its corresponding transformer. But your pipeline will be slower, I'm afraid. – Sofie VL Jun 14 '21 at 14:27
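
For reference, the `replace_listeners` setting discussed in the comments goes in the block of the sourced component. The fragment below is only a sketch: it reuses the placeholder location from the answer and assumes the NER's listener sits at `model.tok2vec`, which is where spaCy's default NER architecture puts it; a custom architecture may use a different path. As the comments note, this was still being fixed for transformer-based pipelines at the time, so whether it works depends on the installed spacy-transformers version.

```ini
[training]
frozen_components = ["ner"]

[components.ner]
source = "your_trained_ner_location"
component = "ner"
# Assumption: the listener lives at "model.tok2vec" (the default NER layout).
# Listing it here gives the frozen NER its own copy of the embedding layer,
# so updates made while training the textcat no longer affect it.
replace_listeners = ["model.tok2vec"]
```

The intent is that the frozen NER keeps the embedding weights it was trained with, while the shared layer remains free to be updated for the `textcat`.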