We are trying to implement a word vector model for the set of words given below.

stemmed = ['data', 'appli', 'scientist', 'mgr', 'microsoft', 'hire', 'develop', 'mentor', 'team', 'data', 'scientist', 'defin', 'data', 'scienc', 'prioriti', 'deep', 'understand', 'busi', 'goal', 'collabor', 'across', 'multipl', 'group', 'set', 'team', 'shortterm', 'longterm', 'goal', 'act', 'strateg', 'advisor', 'leadership', 'influenc', 'futur', 'direct', 'strategi', 'defin', 'partnership', 'align', 'efficaci', 'broad', 'analyt', 'effort', 'analyticsdata', 'team', 'drive', 'particip', 'data', 'scienc', 'bi', 'commun', 'disciplin', 'microsoftprior', 'experi', 'hire', 'manag', 'run', 'team', 'data', 'scientist', 'busi', 'domain', 'experi', 'use', 'analyt', 'must', 'experi', 'across', 'sever', 'relev', 'busi', 'domain', 'util', 'critic', 'think', 'skill', 'conceptu', 'complex', 'busi', 'problem', 'solut', 'use', 'advanc', 'analyt', 'larg', 'scale', 'realworld', 'busi', 'data', 'set', 'candid', 'must', 'abl', 'independ', 'execut', 'analyt', 'project', 'help', 'intern', 'client', 'understand']

We are using this code:

import gensim
model = gensim.models.FastText(stemmed, size=100, window=5, min_count=1, workers=4, sg=1)
model.wv.most_similar(positive=['data'])

However, we are getting this error:

KeyError: 'all ngrams for word data absent from model'
Venkatachalam

2 Answers

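You need to provide your training data as an iterable of sentences (each sentence being a list of tokens), not as a flat list of words. One simple way to do this is a generator that yields the whole list as a single sentence.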

Try:

import gensim
from gensim.models.fasttext import FastText as FT_gensim

stemmed = ['data', 'appli', 'scientist', ... ]

def gen_words(stemmed):
    yield stemmed   

model = FT_gensim(size=100, window=5, min_count=1, workers=4, sg=1)
model.build_vocab(gen_words(stemmed))

model.train(gen_words(stemmed), total_examples=model.corpus_count, epochs=model.iter)
model.wv.most_similar(positive=['data'])

This prints out:

[('busi', -0.043828580528497696)]

See also this notebook from the gensim documentation, and this excellent gensim tutorial on all things iterable:

In gensim, it’s up to you how you create the corpus. Gensim algorithms only care that you supply them with an iterable of sparse vectors (and for some algorithms, even a generator = a single pass over the vectors is enough).
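To make the iterable-versus-generator distinction in that quote concrete, here is a small standard-library sketch (no gensim involved). The names `gen_sentences` and `SentenceCorpus` are hypothetical and exist only for this illustration. It shows that a generator is used up after one pass, whereas an object that defines `__iter__` can be iterated again, which matters whenever something needs more than one pass over the data.

```python
def gen_sentences(sentences):
    # A generator: yields each sentence once, then is exhausted.
    for sentence in sentences:
        yield sentence


class SentenceCorpus:
    # Hypothetical restartable iterable: every call to __iter__
    # starts a fresh pass over the underlying sentences.
    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        return iter(self.sentences)


sentences = [['data', 'scientist'], ['busi', 'goal']]

gen = gen_sentences(sentences)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left to yield

corpus = SentenceCorpus(sentences)
restart_first = list(corpus)
restart_second = list(corpus)  # a fresh pass, same content
```

This is why the code above calls `gen_words(stemmed)` separately for `build_vocab` and for `train`: each call creates a new generator.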

petezurich

The fundamental problem is that the FastText model expects sentences as training data instead of words. If you provide it with a list of words, it won't work very well, since it creates the vector embeddings based on the relative positions of the words in the sentences.

The actual error arises because the gensim.models.FastText constructor expects an iterable of lists of strings as its first argument (e.g. a 2D list of strings), but you are giving it a flat list of strings.
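A rough way to see why the flat list fails, using only the standard library: the sketch below is not gensim's code, just an assumed simplification of how any consumer of "an iterable of sentences" walks its input. The helper `count_tokens` and the shortened word list are hypothetical. When each element is a string rather than a list, iterating over it yields single characters, so 'data' never becomes a vocabulary entry.

```python
from collections import Counter


def count_tokens(corpus):
    # Treat every element of `corpus` as one sentence and
    # every element of a sentence as one token.
    counts = Counter()
    for sentence in corpus:
        for token in sentence:
            counts[token] += 1
    return counts


# Shortened stand-in for the list in the question.
stemmed = ['data', 'appli', 'scientist', 'data']

# Flat list: each word is taken as a "sentence" of characters.
flat_vocab = count_tokens(stemmed)

# Nested list: one sentence containing the words as tokens.
nested_vocab = count_tokens([stemmed])
```

With the flat list the vocabulary consists of letters such as 'd' and 'a', while wrapping the list once (`[stemmed]`) gives the word-level vocabulary the model needs.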

Maybe you could use a pretrained FastText model instead of training your own model?

Agost Biro