Implementing Word to vector model using Gensim

Question

We are trying to implement a word vector model for the set of words given below.

stemmed = ['data', 'appli', 'scientist', 'mgr', 'microsoft', 'hire', 'develop', 'mentor', 'team', 'data', 'scientist', 'defin', 'data', 'scienc', 'prioriti', 'deep', 'understand', 'busi', 'goal', 'collabor', 'across', 'multipl', 'group', 'set', 'team', 'shortterm', 'longterm', 'goal', 'act', 'strateg', 'advisor', 'leadership', 'influenc', 'futur', 'direct', 'strategi', 'defin', 'partnership', 'align', 'efficaci', 'broad', 'analyt', 'effort', 'analyticsdata', 'team', 'drive', 'particip', 'data', 'scienc', 'bi', 'commun', 'disciplin', 'microsoftprior', 'experi', 'hire', 'manag', 'run', 'team', 'data', 'scientist', 'busi', 'domain', 'experi', 'use', 'analyt', 'must', 'experi', 'across', 'sever', 'relev', 'busi', 'domain', 'util', 'critic', 'think', 'skill', 'conceptu', 'complex', 'busi', 'problem', 'solut', 'use', 'advanc', 'analyt', 'larg', 'scale', 'realworld', 'busi', 'data', 'set', 'candid', 'must', 'abl', 'independ', 'execut', 'analyt', 'project', 'help', 'intern', 'client', 'understand']

We are using this code:

import gensim
model = gensim.models.FastText(stemmed, size=100, window=5, min_count=1, workers=4, sg=1)
model.wv.most_similar(positive=['data'])

However, we are getting this error:

KeyError: 'all ngrams for word data absent from model'

petezurich · Accepted Answer · score 2 · 2018-08-23T16:09:13.347

You need to provide your training data not as a flat list of words but as an iterable of sentences, each of which is a list of words. One way to do this is with a generator.

Try:

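# Written against gensim 3.x (`size`, `model.iter`); newer releases rename these.
import gensim
from gensim.models.fasttext import FastText as FT_gensim

stemmed = ['data', 'appli', 'scientist', ... ]

def gen_words(stemmed):
    # Yield the whole token list once, so the model sees one sentence of many words.
    yield stemmed

model = FT_gensim(size=100, window=5, min_count=1, workers=4, sg=1)
model.build_vocab(gen_words(stemmed))

# A generator is exhausted after one pass, so a fresh one is created for training.
model.train(gen_words(stemmed), total_examples=model.corpus_count, epochs=model.iter)
model.wv.most_similar(positive=['data'])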

This prints out:

[('busi', -0.043828580528497696)]
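For readers on gensim 4.x, here is a sketch of the same steps. It assumes that the constructor parameter is `vector_size` rather than `size`, that the epoch count is read from `model.epochs` rather than `model.iter`, and that `build_vocab` and `train` accept a `corpus_iterable` argument. The function name `train_fasttext` and the shortened token list are hypothetical. It passes a plain list containing one token list instead of a generator, so the data can be re-read on every epoch.

```python
def train_fasttext(sentences):
    """Train a small skip-gram FastText model on an iterable of token lists.

    Assumes gensim 4.x naming (vector_size, epochs, corpus_iterable).
    """
    from gensim.models import FastText  # imported lazily; requires gensim

    model = FastText(vector_size=100, window=5, min_count=1, workers=4, sg=1)
    model.build_vocab(corpus_iterable=sentences)
    model.train(
        corpus_iterable=sentences,
        total_examples=model.corpus_count,
        epochs=model.epochs,
    )
    return model

# Shortened stand-in for the token list in the question.
stemmed = ['data', 'appli', 'scientist', 'mgr', 'microsoft', 'data', 'scienc']

# One sentence made of all tokens: a list of lists, which can be iterated repeatedly.
sentences = [stemmed]
```

Calling `train_fasttext(sentences).wv.most_similar(positive=['data'])` should then return neighbours for 'data', though with a single short sentence the similarity scores are unlikely to be meaningful.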

See also this notebook from the gensim documentation and this excellent gensim tutorial on all things iterable:

In gensim, it’s up to you how you create the corpus. Gensim algorithms only care that you supply them with an iterable of sparse vectors (and for some algorithms, even a generator = a single pass over the vectors is enough).
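The difference between a generator and a general iterable matters once training takes more than one pass over the data. The following pure-Python sketch (no gensim involved) illustrates it; `gen_sentences` and `SentenceCorpus` are hypothetical names, and the token list is a shortened stand-in.

```python
def gen_sentences(tokens):
    # Generator: yields the token list as a single sentence, then is exhausted.
    yield tokens


class SentenceCorpus:
    """Hypothetical re-iterable wrapper: every call to iter() starts over."""

    def __init__(self, tokens):
        self.tokens = tokens

    def __iter__(self):
        yield self.tokens


tokens = ['data', 'appli', 'scientist']  # shortened stand-in

gen = gen_sentences(tokens)
first_pass = list(gen)
second_pass = list(gen)   # empty: the generator has nothing left to yield

corpus = SentenceCorpus(tokens)
pass_one = list(corpus)
pass_two = list(corpus)   # same content again, because __iter__ restarts

print(len(first_pass), len(second_pass))
print(len(pass_one), len(pass_two))
```

A plain list of lists behaves like `SentenceCorpus` here: it can be iterated as many times as the training loop needs.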

edited Aug 23 '18 at 16:09

answered Aug 23 '18 at 15:19

petezurich


@MurthyRouthula I am glad that this helped. I suggest you accept this answer so that others can see that the matter is solved. – petezurich Aug 23 '18 at 16:50

Can you tell me how to apply a normal word-to-vector implementation to a bag of words after stemming? – Murthy Routhula Aug 23 '18 at 18:23
@MurthyRouthula Please post this as a new question. – petezurich Aug 23 '18 at 19:59

score 1 · Answer 2 · answered Aug 23 '18 at 15:35

The fundamental problem is that the FastText model expects sentences as training data instead of words. If you provide it with a list of words, it won't work very well, since it creates the vector embeddings based on the relative positions of the words in the sentences.

The actual error in the code arises because the gensim.models.FastText constructor expects an iterable of lists of strings as its first argument (e.g. a 2D list of strings), but you give it a flat list of strings.
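To see why the flat list fails, a minimal pure-Python sketch (no gensim involved) can mimic how a sentence-based trainer reads a corpus: each item of the outer iterable is treated as a sentence, and each item of a sentence as a word. The helper `count_vocab` is hypothetical and only illustrates the shape problem; the short `stemmed` list is a stand-in for the one in the question.

```python
from collections import Counter


def count_vocab(corpus):
    """Count 'words' the way a sentence-based trainer would see them.

    Hypothetical helper: every item of `corpus` is treated as one sentence,
    and every item of a sentence as one word.
    """
    vocab = Counter()
    for sentence in corpus:
        for word in sentence:
            vocab[word] += 1
    return vocab


# Shortened stand-in for the token list in the question.
stemmed = ['data', 'appli', 'scientist', 'mgr', 'data']

# Flat list: each string is a "sentence", so its characters become the "words".
flat_vocab = count_vocab(stemmed)

# Nested list: one sentence whose words are the tokens themselves.
nested_vocab = count_vocab([stemmed])

print('data' in flat_vocab)    # the token is missing; only characters were counted
print('data' in nested_vocab)  # the token is present
```

With the flat list, the vocabulary holds only single characters, which is consistent with the KeyError reported for 'data'.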

Maybe you could use a pretrained FastText model instead of training your own model?

answered Aug 23 '18 at 15:35

Agost Biro


Good one! Then can you suggest a normal word-to-vector model? – Murthy Routhula Aug 23 '18 at 16:35

Could you share more details about the problem that you are trying to solve? It's hard to make a recommendation without knowing your ultimate goal. – Agost Biro Aug 23 '18 at 16:46
Actually, I have a bag of words after stopword removal, tokenizing, and normalizing. Now I want to apply a word-to-vector model to those words. – Murthy Routhula Aug 23 '18 at 16:52
