
I am new to AllenNLP, and I use SentencePiece for subword tokenization in my pipeline.

SentencePiece needs a training step to generate a subword model, which can then be used for tokenization.
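To make the train-then-tokenize split concrete, here is a minimal sketch in plain Python. It uses a toy byte pair encoding as a stand-in. It is not SentencePiece's algorithm or API, and the names `train_bpe`, `apply_merge`, and `tokenize` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Training step: learn an ordered list of merge rules from raw text."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Take the most frequent pair; ties are broken by the pair itself
        # so that training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(apply_merge(symbols, best))] += freq
        words = new_words
    return merges


def tokenize(text, merges):
    """Tokenization step: replay the learned merges, in order, on new text."""
    tokens = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


# The "model" is just the learned merge list; it must exist before tokenizing.
model = train_bpe(["low lower lowest", "low low"], num_merges=2)
pieces = tokenize("lowest slow", model)
```

The point of the sketch is that the subword model is data learned from a corpus ahead of time, separate from the act of tokenizing, which is why it is unclear to me where it should live in the pipeline.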

Is implementing a subclass of the Vocabulary class the right way to do this? I am a little confused about whether that is the right place, given that there are TokenIndexers for character tokenization, etc.

MBT
Sai Prasanna
  • AllenNLP maintainer here - We don't really check Stack Overflow at this point, but if you open this as an issue on GitHub, we'd be happy to help. We've recently gotten byte pair encoding to work with some of our models, so some of our team (not me) should have some ideas for what you should do. – mattg Aug 03 '18 at 15:09
  • @mattg Sure, thanks. – Sai Prasanna Aug 03 '18 at 17:30

0 Answers