
In a Spark NLP PipelineModel, all stages have to be of type AnnotatorModel. But what if one of those AnnotatorModels requires a certain column of the dataset as input, and that input column is the output of an AnnotatorApproach?

For instance, I have a trained NER model (the last stage of the pipeline) which requires tokens and POS tags as two of its inputs. The tokens are also required by the POS tagger. But the Tokenizer is an AnnotatorApproach, so I cannot add it to the PipelineModel.

This is how the Tokenizer is instantiated (in Java):

AnnotatorApproach<TokenizerModel> tokenizer = new Tokenizer();

This works:

Pipeline pipeline = new Pipeline().setStages( new PipelineStage[]{tokenizer} );

But this doesn't work, because Tokenizer is not a Transformer:

List<Transformer> list = new ArrayList<>();
list.add(tokenizer); // compile error: Tokenizer is not a Transformer
PipelineModel pipelineModel = new PipelineModel("ID42", list);
martin_wun
  • A workaround seems to be to construct a Pipeline instead of a PipelineModel and then call `fit(data).transform(data)` on this pipeline. This works, but seems counterintuitive somehow. Maybe I am missing some important conceptual point here. – martin_wun Nov 19 '21 at 08:05
  • PS: The other issue is that I would like to use a `LightPipeline` for predictions, for performance reasons. However, I am not able to construct a `LightPipeline` from a `Pipeline`, only from a `PipelineModel`. – martin_wun Nov 19 '21 at 09:13

1 Answer


Fitting the pipeline always returns a pipeline ready for inference, even when you fit it on an empty dataset. If you only depend on annotators that don't require training, that's fine. That's the recommended usage; manipulating the individual stages is typically unnecessary, hacky, and can lead to errors.
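A minimal Java sketch of this approach, assuming Spark NLP is on the classpath and the usual `text`/`document`/`token` column names (the annotator setup here is illustrative, not your exact pipeline):

```java
import java.util.Collections;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.johnsnowlabs.nlp.DocumentAssembler;
import com.johnsnowlabs.nlp.LightPipeline;
import com.johnsnowlabs.nlp.annotators.Tokenizer;

public class FitOnEmptyExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("fit-on-empty")
            .master("local[*]")
            .getOrCreate();

        DocumentAssembler documentAssembler = new DocumentAssembler();
        documentAssembler.setInputCol("text");
        documentAssembler.setOutputCol("document");

        Tokenizer tokenizer = new Tokenizer();
        tokenizer.setInputCols(new String[]{"document"});
        tokenizer.setOutputCol("token");

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{documentAssembler, tokenizer});

        // Fitting on a one-row placeholder dataset turns each
        // AnnotatorApproach into its AnnotatorModel counterpart
        // without any real training taking place.
        Dataset<Row> emptyData = spark
            .createDataset(Collections.singletonList(""), Encoders.STRING())
            .toDF("text");
        PipelineModel pipelineModel = pipeline.fit(emptyData);

        // A PipelineModel (unlike a Pipeline) can back a LightPipeline.
        LightPipeline lightPipeline = new LightPipeline(pipelineModel, false);
    }
}
```

The `fit` call is cheap here because none of the stages actually learns anything from the data; it only resolves each approach into its model form.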

AlbertoAndreotti
  • Thanks a lot for the help, @Alberto. This makes it clearer. So, a solution would be to create a preprocessing pipeline and fit it on an empty dataframe to receive a PipelineModel. This PipelineModel could then be combined with the pre-trained model into a two-stage PipelineModel, from which one can construct the LightPipeline. Just to note: the empty dataset still has to have the schema of the expected data. This can make it a bit tricky to construct it upfront for complex data types. – martin_wun Nov 22 '21 at 09:42
  • That's true, but most of the time NLP pipelines start from a string column, so `empty_data = spark.createDataFrame([[""]]).toDF("text")` is enough in most situations. – AlbertoAndreotti Nov 22 '21 at 19:47
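For the trickier case mentioned above (a schema with more than a plain string column), a zero-row dataset with an explicit schema can be built in Java roughly like this, assuming an existing `SparkSession spark` (the column names are illustrative):

```java
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Zero rows, explicit schema: fit() only needs the columns to exist
// with the right types, not any actual data.
StructType schema = new StructType()
    .add("text", DataTypes.StringType)
    .add("id", DataTypes.LongType);

Dataset<Row> emptyData =
    spark.createDataFrame(Collections.<Row>emptyList(), schema);
```

This avoids having to fabricate placeholder values for complex column types, since `createDataFrame` accepts an empty row list as long as the schema is supplied.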