I am tokenizing my German-language text corpus with spaCy's German model. Since spaCy currently only ships a small German model, I cannot extract word vectors from spaCy itself, so I am using fastText's pre-trained word embeddings from here: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning
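For context, here is a minimal sketch of my setup. The model name `de_core_news_sm` and the path `wiki.de.vec` are placeholders for whatever German model and pre-trained vector file you actually have; the loader just follows the plain-text `.vec` format (header line with vocabulary size and dimension, then one word and its vector per line):

```python
import io
import spacy

# Small German model (no built-in word vectors); name may differ by spaCy version.
nlp = spacy.load("de_core_news_sm")

def load_vectors(path):
    """Read a fastText .vec file into a dict mapping word -> list of floats."""
    vectors = {}
    with io.open(path, "r", encoding="utf-8", newline="\n", errors="ignore") as f:
        n, dim = map(int, f.readline().split())  # header: vocab size, vector dimension
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

vectors = load_vectors("wiki.de.vec")  # placeholder path to the downloaded German vectors

doc = nlp("Das ist ein Beispielsatz.")
tokens = [t.text for t in doc]
```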
Facebook used the ICU tokenizer to tokenize the corpus before training these embeddings, while I am tokenizing with spaCy. Is that okay? I suspect spaCy and the ICU tokenizer behave differently, and if so, many tokens in my corpus would not have a corresponding word vector.
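One way I thought of quantifying this is to count how many spaCy token occurrences are actually present in the fastText vocabulary. A rough sketch, reusing `nlp` and `vectors` from the snippet above:

```python
from collections import Counter

def coverage(docs, vectors):
    """Return the fraction of token occurrences found in the vector vocabulary,
    plus the most common missing tokens."""
    total, found = 0, 0
    missing = Counter()
    for doc in docs:
        for tok in doc:
            total += 1
            if tok.text in vectors:
                found += 1
            else:
                missing[tok.text] += 1
    return found / total, missing.most_common(20)

docs = nlp.pipe(["Das ist ein Beispielsatz.", "Noch ein Satz."])  # my corpus goes here
ratio, top_missing = coverage(docs, vectors)
print(f"coverage: {ratio:.1%}")
print("most common missing tokens:", top_missing)
```

Is checking the coverage like this a reasonable approach, or is there a better way to deal with the tokenizer mismatch?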
Thanks for your help!