Where to add SentencePiece tokenization in an AllenNLP pipeline?

Asked Aug 03 '18 at 06:26

Active Aug 22 '19 at 19:02

Viewed 265 times

I am new to AllenNLP, and I use SentencePiece for subword tokenization in my pipeline.

SentencePiece needs a training step to generate a subword model, which can then be used for tokenization.
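For context, the two phases described above (train once, then tokenize) can be sketched roughly as follows. This is a minimal sketch, not part of AllenNLP: the helper names `train_subword_model`, `load_processor`, and `tokenize`, the file paths, and the default `vocab_size` are hypothetical, and it assumes a recent release of the `sentencepiece` Python package that accepts the keyword-argument forms shown.

```python
from pathlib import Path
from typing import List


def train_subword_model(corpus_path: str, model_prefix: str, vocab_size: int = 8000) -> Path:
    """Training step: fit a subword model on a raw-text corpus (one sentence per line).

    Hypothetical helper; writes <model_prefix>.model and <model_prefix>.vocab.
    """
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=str(model_prefix),
        vocab_size=vocab_size,
    )
    return Path(f"{model_prefix}.model")


def load_processor(model_path: str):
    """Load a previously trained subword model for tokenization."""
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=str(model_path))


def tokenize(processor, text: str) -> List[str]:
    """Tokenization step: split text into subword pieces with a loaded model."""
    return processor.encode(text, out_type=str)
```

With a corpus file available, the intended flow would be `model = train_subword_model("corpus.txt", "subword")`, then `sp = load_processor(model)`, then `tokenize(sp, "some text")`. The training step runs once, outside the main pipeline, and only the saved model file is needed afterwards.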

Is implementing a custom Vocabulary class the right way to do it? I am a little confused about whether that is the right place, given that there are TokenIndexers for character tokenization, etc.

edited Aug 22 '19 at 19:02

MBT

asked Aug 03 '18 at 06:26

Sai Prasanna

AllenNLP maintainer here: we don't really check Stack Overflow at this point, but if you open this as an issue on GitHub, we'd be happy to help. We've recently gotten byte pair encoding to work with some of our models, so some of our team (not me) should have some ideas for what you should do. – mattg Aug 03 '18 at 15:09
@mattg Sure, thanks. – Sai Prasanna Aug 03 '18 at 17:30

0 Answers