I have generated word vectors from a corpus, but I am facing out-of-vocabulary issues for many words. How can I generate word vectors for OOV words on the fly using the existing word embeddings?
- See the answer to a similar question at: https://stackoverflow.com/a/48029141/130288 - I don't know of any library that does this on the fly from existing full-word vectors. Some research with ideas is mentioned in the blog post here: http://ruder.io/word-embeddings-2017/index.html#oovhandling – gojomo Dec 29 '17 at 22:40
- What percentage of your vocabulary / text tokens are unknown? Are you using particular vectors downloaded from somewhere? – Ivan Dec 31 '17 at 11:44
- @Ivan More than 20% of tokens are unknown. We are trying to construct word vectors for words built from a given stem. For example, "Green-Mango" is in the vocabulary and we are trying to build "New-Green-Mango", "Fresh-Green-Mango", etc. I am getting words like "New" and "Fresh" from a third party, so I can't have them in my vocabulary. In a few cases I have both "New" and "Green-Mango" in my vocabulary, but not "New-Green-Mango". – Navin Kumar Jan 04 '18 at 07:23
- Check whether the token is in the vocabulary list and return an empty (all-zeros) vector if it is not. That is the same as saying the token carries no information, and it can also be useful in a larger ML framework. – Nathan McCoy Jan 06 '18 at 10:20
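A minimal sketch of the zero-vector fallback described in the comment above, assuming a Gensim 4.x `KeyedVectors` lookup table; the file path and token are placeholders, and any word-vector store with a membership test works the same way.

```python
# Zero-vector fallback for OOV tokens (sketch; assumes Gensim 4.x KeyedVectors).
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: load whatever word2vec-format vectors you already have.
wv = KeyedVectors.load_word2vec_format("vectors.txt")

def vector_or_zeros(token, wv):
    """Return the stored vector for in-vocabulary tokens, else an all-zeros vector."""
    if token in wv.key_to_index:
        return wv[token]
    return np.zeros(wv.vector_size, dtype=np.float32)

vec = vector_or_zeros("new-green-mango", wv)  # all zeros if the token is OOV
```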
1 Answer
A very late answer (and probably not the answer you are looking for), but with skip-gram models what you ask is nearly impossible, because each word is a distinct entity in and of itself.
What you ask for is available out of the box with FastText: it generates OOV word vectors from the character n-grams of the word.
Gensim has a high-level API for FastText.
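A minimal sketch of that, assuming Gensim 4.x; the toy corpus, tokens, and hyperparameters are illustrative only.

```python
# OOV lookup with Gensim's FastText (sketch; assumes Gensim 4.x).
from gensim.models import FastText

corpus = [
    ["green-mango", "is", "ripe"],
    ["fresh", "green-mango", "tastes", "good"],
    ["new", "fruit", "arrived"],
]

# Character n-grams (min_n..max_n) are what let FastText build vectors for OOV words.
model = FastText(
    sentences=corpus,
    vector_size=50,
    window=3,
    min_count=1,
    min_n=3,
    max_n=6,
    epochs=20,
)

# In-vocabulary lookup works as usual.
print("green-mango" in model.wv.key_to_index)   # True

# "new-green-mango" never appeared in the corpus, yet FastText synthesizes a
# vector for it from the vectors of its character n-grams.
oov_vector = model.wv["new-green-mango"]
print(oov_vector.shape)                         # (50,)
```

If you already have a pretrained FastText .bin model, Gensim's gensim.models.fasttext.load_facebook_vectors can load it, and the same bracket lookup returns n-gram-based vectors for OOV tokens.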

ozgur