
I have a pre-trained word2vec model that I load into spaCy to vectorize new words. Given new text, I run nlp('hi').vector to obtain the vector for the word 'hi'.

Eventually I need to vectorize a word that is not present in the vocabulary of my pre-trained model. In this scenario spaCy defaults to a vector filled with zeros. I would like to be able to set this default vector for OOV terms.

Example:

import spacy
path_model = '/home/bionlp/spacy.bio_word2vec.model'
nlp = spacy.load(path_model)
print(nlp('abcdef').vector, '\n',nlp('gene').vector)

This code outputs a dense vector for the word 'gene' and a vector full of 0s for the word 'abcdef' (since it's not present in the vocabulary).


My goal is to be able to specify the vector for missing words, so instead of getting a vector full of 0s for the word 'abcdef' you can get (for instance) a vector full of 1s.

Ferran
  • Do you want to specify the vector for *all* out-of-vocabulary (OOV) words to be a single new vector of your choice? Or set a different vector, that you supply, for each new OOV? Or want a way to calculate a new, compatible-with-the-model vector for a new OOV word (perhaps by using subword correlations or some set of new usage examples)? – gojomo Aug 26 '19 at 15:21
  • I was trying to apply a single vector to all OOV words – Ferran Aug 26 '19 at 15:45

1 Answer


If you simply want your plug-vector instead of the SpaCy default all-zeros vector, you could just add an extra step where you replace any all-zeros vectors with yours. For example:

words = ['words', 'may', 'by', 'fehlt']
my_oov_vec = ...  # whatever you like
spacy_vecs = [nlp(word).vector for word in words]
fixed_vecs = [vec if vec.any() else my_oov_vec
              for vec in spacy_vecs]
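Since the snippet above needs a loaded spaCy model to run, here's a self-contained sketch of the same zero-check pattern using made-up numpy vectors (the array values are purely illustrative):

```python
import numpy as np

# Stand-ins for vectors a model might return; the values are made up.
spacy_vecs = [
    np.array([0.2, -0.1, 0.5]),   # in-vocabulary word
    np.zeros(3),                  # OOV word: spaCy's all-zeros default
    np.array([0.7, 0.3, -0.4]),   # in-vocabulary word
]
my_oov_vec = np.ones(3)           # the plug vector of your choice

# Replace any all-zeros vector with the plug vector.
fixed_vecs = [vec if vec.any() else my_oov_vec for vec in spacy_vecs]

print(fixed_vecs[1])  # [1. 1. 1.]
```

`vec.any()` is true as soon as a single component is nonzero, so only vectors that are all zeros get swapped out.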

I'm not sure why you'd want to do this. Lots of work with word-vectors simply elides out-of-vocabulary words; using any plug value, including SpaCy's zero-vector, may just be adding unhelpful noise.

And if better handling of OOV words is important, note that some other word-vector models, like FastText, can synthesize better-than-nothing guess-vectors for OOV words, by using vectors learned for subword fragments during training. That's similar to how people can often work out the gist of a word from familiar word-roots.

gojomo
  • This works and it's indeed a simple post-processing step. But I am limited by computational time and was interested to know if there is any way to apply the 'my_oov_vec' directly when calling 'nlp(word)', so the whole vectorization is quicker? Thanks for the detailed answer; FastText or BERT might indeed be very useful for our scenario, in which many new words appear after training – Ferran Aug 27 '19 at 08:51
  • Are you sure this approach is too slow? From a quick glance at the SpaCy source, it looks like new zero-vectors are composed each time (the `if` there is the relevant place), so it doesn't look like there's an easy way to assign a new "fallback" value for all not-present words. Rather, you'll have to do it heuristically, outside the object/SpaCy code – and any such way of doing it would likely have similar complexity as above, checking for the all-zeros value & replacing it. – gojomo Aug 27 '19 at 10:39
  • I agree with your comment; I have not found any way to assign this 'fallback' value beforehand, so perhaps the best solution is to download and modify the source code or look for extra computational resources. In terms of difference in speed, I agree the difference is not huge. See the example in https://stackoverflow.com/questions/57672043/ignore-out-of-vocabulary-words-when-averaging-vectors-in-spacy?noredirect=1#comment101828789_57672043. Maybe when dealing with a very large corpus I might just hire more computational resources. – Ferran Aug 30 '19 at 10:29