
I am tokenizing my text corpus, which is in German, using spaCy's German model. Since spaCy currently only offers a small German model, I am unable to extract word vectors with spaCy itself, so I am using fastText's pre-trained word embeddings from here: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning

Now, Facebook used the ICU tokenizer for tokenization before training these word embeddings, whereas I am using spaCy. Can someone tell me whether this is okay? I suspect spaCy and the ICU tokenizer may behave differently, and if so, many tokens in my text corpus would not have a corresponding word vector.
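To make the setup concrete, here is a rough sketch of what I am doing (the model name and the vectors filename below are placeholders, not necessarily the exact ones I use):

    import io
    import spacy

    # spaCy's small German model, used here only for tokenization.
    nlp = spacy.load("de_core_news_sm")

    def load_fasttext_vectors(path):
        """Load a fastText .vec (text format) file into a dict: token -> list of floats."""
        vectors = {}
        with io.open(path, "r", encoding="utf-8", newline="\n", errors="ignore") as f:
            f.readline()  # header line: "<vocab size> <dimension>"
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = [float(x) for x in parts[1:]]
        return vectors

    # Pre-trained German vectors (placeholder filename).
    ft_vectors = load_fasttext_vectors("wiki.de.vec")

    # Tokenize with spaCy and look up each token in the fastText vocabulary.
    doc = nlp("Das ist ein Beispielsatz.")
    for token in doc:
        vec = ft_vectors.get(token.text)  # None if the token has no pre-trained vector
        print(token.text, "found" if vec is not None else "MISSING")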

Thanks for your help!

shasvat desai
  • What is your question? "Is this okay?" is not a question; if this approach helps you achieve your goals, then it's okay... – shahaf Jun 18 '18 at 15:39
  • Can't you find this out with a loop? Loop through all the tokens and try to access model[token], and see how many misses you get for each tokenizer. – Sam H. Jul 09 '18 at 06:26

1 Answer


UPDATE:

I tried the approach above, and after extensive testing I found that it works well for my use case. Most (almost all) of the tokens in my data matched tokens present in fastText, and I was able to obtain word vector representations for them.
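For reference, this is roughly the coverage check I ran, along the lines of the loop suggested in the comments (it assumes `nlp` and `ft_vectors` are loaded as in the question; the corpus list below is a placeholder):

    # Coverage check: how many spaCy tokens are missing from the fastText vocabulary?
    # Assumes `nlp` (spaCy German model) and `ft_vectors` (token -> vector dict)
    # are set up as shown in the question.

    def coverage(texts, nlp, vectors):
        total, misses = 0, 0
        for text in texts:
            for token in nlp(text):
                total += 1
                if token.text not in vectors:
                    misses += 1
        return total, misses

    corpus_texts = ["Das ist ein Beispielsatz.", "Noch ein kurzer Satz."]  # replace with your own documents
    total, misses = coverage(corpus_texts, nlp, ft_vectors)
    print("%d of %d tokens have no fastText vector (%.2f%% missing)"
          % (misses, total, 100.0 * misses / total))

The miss rate directly answers how much the two tokenizers' differences actually matter for a given corpus.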

shasvat desai