
I am tokenizing my text corpus, which is in German, using spaCy's German model. Since spaCy currently only offers a small German model, I am unable to extract word vectors with spaCy itself, so I am using fastText's pre-trained word embeddings from here: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning

Now, Facebook used the ICU tokenizer for tokenization before training these word embeddings, whereas I am using spaCy. Can someone tell me whether this is okay? I suspect spaCy and the ICU tokenizer may behave differently, and if so, many tokens in my text corpus would not have a corresponding word vector.
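To make the setup concrete, here is a rough sketch of what I am doing (the model name and the vectors filename below are placeholders, not necessarily the exact ones I use):

    import io
    import spacy

    # spaCy's small German model, used here only for tokenization.
    nlp = spacy.load("de_core_news_sm")

    def load_fasttext_vectors(path):
        """Load a fastText .vec (text format) file into a dict: token -> list of floats."""
        vectors = {}
        with io.open(path, "r", encoding="utf-8", newline="\n", errors="ignore") as f:
            f.readline()  # header line: "<vocab size> <dimension>"
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = [float(x) for x in parts[1:]]
        return vectors

    # Pre-trained German vectors (placeholder filename).
    ft_vectors = load_fasttext_vectors("wiki.de.vec")

    # Tokenize with spaCy and look up each token in the fastText vocabulary.
    doc = nlp("Das ist ein Beispielsatz.")
    for token in doc:
        vec = ft_vectors.get(token.text)  # None if the token has no pre-trained vector
        print(token.text, "found" if vec is not None else "MISSING")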

Thanks for your help!

shasvat desai
  • What is your question? "Is this okay?" is not a question; if this approach helps you achieve your goals, then it's okay... – shahaf Jun 18 '18 at 15:39
  • Can't you find this out with a loop? Loop through all the tokens and try to access model[token], and see how many misses you get for each tokenizer. – Sam H. Jul 09 '18 at 06:26

1 Answer


UPDATE:

I tried the approach above, and after extensive testing I found that it works well for my use case. Most (almost all) of the tokens in my data matched tokens present in fastText, and I was able to obtain word vector representations for them.
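For reference, this is roughly the coverage check I ran, along the lines of the loop suggested in the comments (it assumes `nlp` and `ft_vectors` are loaded as in the question; the corpus list below is a placeholder):

    # Coverage check: how many spaCy tokens are missing from the fastText vocabulary?
    # Assumes `nlp` (spaCy German model) and `ft_vectors` (token -> vector dict)
    # are set up as shown in the question.

    def coverage(texts, nlp, vectors):
        total, misses = 0, 0
        for text in texts:
            for token in nlp(text):
                total += 1
                if token.text not in vectors:
                    misses += 1
        return total, misses

    corpus_texts = ["Das ist ein Beispielsatz.", "Noch ein kurzer Satz."]  # replace with your own documents
    total, misses = coverage(corpus_texts, nlp, ft_vectors)
    print("%d of %d tokens have no fastText vector (%.2f%% missing)"
          % (misses, total, 100.0 * misses / total))

The miss rate directly answers how much the two tokenizers' differences actually matter for a given corpus.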

shasvat desai