A FastText pre-trained model works great for finding similar words:

from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)

[('dogs', 0.8463464975357056),
 ('puppy', 0.7873005270957947),
 ('pup', 0.7692237496376038),
 ('canine', 0.7435278296470642),
 ...

However, it seems to fail for multi-word phrases, e.g.:

model.nearest_neighbors('Gone with the Wind', k=2000)

[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
  0.71047443151474),

or

model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
 ('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
  0.5197194218635559),

Is it a limitation of FastText pre-trained models?

dzieciou

2 Answers

I'm not aware of FastText having any special ability to handle multi-word phrases.

So I expect your query is being interpreted as one long word that's not in the model, one which includes many character n-grams containing ' ' space characters.

And, as I don't expect the training data had any such n-grams with spaces, all such n-grams' vectors will be arbitrarily-random collisions in the model's n-gram buckets. Thus any vector synthesized for such an out-of-vocabulary 'word' is likely to be even noisier than the usual OOV vectors.
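To make the bucket-collision point concrete, here is a toy sketch of how a phrase with spaces shatters into character n-grams that land in arbitrary hash buckets. The minn=3/maxn=6 n-gram range and the 2,000,000-bucket table match fastText's defaults, and fastText does use an FNV-1a-style hash, but the exact hashing details differ from this simplified version:

```python
def char_ngrams(word, minn=3, maxn=6):
    # fastText wraps each token in '<' and '>' before extracting n-grams
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def fnv1a_32(s):
    # Simplified FNV-1a hash; fastText uses a similar scheme to map
    # n-grams into its fixed-size bucket table
    h = 2166136261
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

phrase = "Gone with the Wind"
ngrams = char_ngrams(phrase)
# Many n-grams contain spaces, e.g. 'e wi' -- none of these were ever
# seen in training, so their bucket entries hold arbitrary noise.
spacey = [ng for ng in ngrams if " " in ng]
buckets = {ng: fnv1a_32(ng) % 2_000_000 for ng in spacey}
```

The OOV vector for the whole phrase is built by summing whatever happens to sit in those buckets, which is why the nearest neighbors come back as garbage.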

But also: the pyfasttext wrapper is an abandoned, unofficial interface to FastText that hasn't been updated in over 2 years, and carries this message on its PyPI page:

Warning! pyfasttext is no longer maintained: use the official Python binding from the fastText repository: https://github.com/facebookresearch/fastText/tree/master/python

You may find better results using it instead. See its doc/examples folder for examples of how it can be queried for nearest neighbors, and also consider its get_sentence_vector() as a way to split a string into words whose vectors are then averaged, rather than treating the whole string as one long OOV word.
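For illustration, here is a minimal pure-Python sketch of the averaging idea behind get_sentence_vector(). The toy 4-dimensional vectors are made up; in the official binding you would instead call model.get_sentence_vector('Gone with the Wind') on a loaded cc.en.300.bin model, which works on the real 300-dimensional embeddings and also folds in subword information:

```python
import math

# Toy 4-dimensional stand-ins for the real 300-dim embeddings; the
# numbers are made up purely for illustration.
vecs = {
    "gone": [0.9, 0.1, 0.0, 0.2],
    "with": [0.1, 0.8, 0.1, 0.0],
    "the":  [0.0, 0.7, 0.2, 0.1],
    "wind": [0.3, 0.0, 0.9, 0.1],
}

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def sentence_vector(tokens, table=vecs):
    """Average of L2-normalized word vectors -- roughly what the
    official binding's get_sentence_vector() computes."""
    normed = [normalize(table[t]) for t in tokens]
    return [sum(col) / len(normed) for col in zip(*normed)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

phrase_vec = sentence_vector("gone with the wind".split())
# phrase_vec can now be compared against word vectors with cosine(),
# instead of hashing the whole phrase as one noisy OOV token.
```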

gojomo

As described in the documentation, official fastText unsupervised embeddings are built after a phase of tokenization, in which the words are separated.
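A simplified sketch of that whitespace tokenization (the real tokenizer also treats tab, vertical tab, etc. as separators, and only emits the end-of-sentence token when it actually reads a newline):

```python
def tokenize(text):
    """Rough approximation of fastText's tokenizer: split on
    whitespace; a newline additionally yields the '</s>' token."""
    tokens = []
    for line in text.split("\n"):
        tokens.extend(line.split())
        tokens.append("</s>")
    return tokens

print(tokenize("Gone with the Wind"))
# ['Gone', 'with', 'the', 'Wind', '</s>']
# Each whitespace-separated word becomes its own token; the phrase
# never survives as a single vocabulary entry.
```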

If you look at your model vocabulary (model.words in the official python binding), you won't find multi-word phrases containing spaces.

Therefore, as pointed out by gojomo, the generated vectors are synthetic, artificial, and noisy; you can see this from the results of your queries.

In essence, official fastText embeddings are not suitable for this task. In my experience this does not depend on the version / wrapper used.