
Just out of curiosity: I was stepping through gensim's FastText code to replicate how it builds vectors for Out-of-Vocabulary (OOV) words, but I haven't been able to reproduce it. The process I'm following is training a tiny model on a toy corpus and then comparing the resulting vectors of a word that is in the vocabulary. That means that if the whole process is OK, the output arrays should be the same.

Here is the code I've used for the test:

from gensim.models import FastText
import numpy as np
# Default gensim's function for hashing ngrams
from gensim.models._utils_any2vec import ft_hash_bytes

# Toy corpus
sentences = [['hello', 'test', 'hello', 'greeting'],
             ['hey', 'hello', 'another', 'test']]

# Instantiate gensim's FastText class
ft = FastText(sg=1, size=5, min_count=1,
              window=2, hs=0, negative=20,
              seed=0, workers=1, bucket=100,
              min_n=3, max_n=4)

# Build vocab
ft.build_vocab(sentences)

# Fit model weights (vectors_ngram)
ft.train(sentences=sentences, total_examples=ft.corpus_count, epochs=5)

# Save model
ft.save('./ft.model')
del ft

# Load model
ft = FastText.load('./ft.model')

# Generate ngrams for test-word given min_n=3 and max_n=4
encoded_ngrams = [b"<he", b"<hel", b"hel", b"hell", b"ell", b"ello", b"llo", b"llo>", b"lo>"]
# Hash each ngram to its corresponding bucket index, just as gensim does
ngram_hashes = [ft_hash_bytes(n) % 100 for n in encoded_ngrams]
word_vec = np.zeros(5, dtype=np.float32)
for nh in ngram_hashes:
    word_vec += ft.wv.vectors_ngrams[nh]

# Compare both arrays
print(np.isclose(ft.wv['hello'], word_vec))

The output of this script is False for every dimension of the compared arrays.

It would be nice if someone could point out whether I'm missing something or doing something wrong. Thanks in advance!

threepwood

1 Answer


A full word's FastText word-vector is not calculated from just the sum of its character n-gram vectors: for in-vocabulary words, a raw full-word vector (trained alongside the n-grams) is combined in as well.

The full-word vectors you get back from ft.wv[word] for known-words have already had this combination pre-calculated. See the adjust_vectors() method for an example of this full calculation:

https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/keyedvectors.py#L2282

The raw full-word vectors are in a .vectors_vocab array on the model.wv object.
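If it helps, here's a rough sketch of that reconstruction for the question's setup. It assumes the average-over-(full word + n-grams) convention that adjust_vectors() uses, and the gensim 3.x attribute names from the question (vocab lookup via .vocab[word].index); treat the exact names as version-dependent:

import numpy as np

# Sketch (not the library's exact code): rebuild ft.wv['hello'] by hand.
# `ft` and `ngram_hashes` are the objects from the question's script.
word = 'hello'
word_index = ft.wv.vocab[word].index                 # gensim 3.x vocab lookup
word_vec = np.copy(ft.wv.vectors_vocab[word_index])  # raw trained full-word vector
for nh in ngram_hashes:
    word_vec += ft.wv.vectors_ngrams[nh]             # add each ngram bucket's vector
word_vec /= len(ngram_hashes) + 1                    # +1 accounts for the full-word vector itself

print(np.isclose(ft.wv[word], word_vec))             # ideally True in every dimension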

(If this isn't enough to reconcile matters: ensure you're using the latest gensim, as there have been many recent FT fixes. And, ensure your list of ngram-hashes matches the output of the library's ft_ngram_hashes() helper – if not, your manual ngram-list creation and subsequent hashing may be doing something different.)
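For instance, a sanity check along these lines – the import path below is an assumption for gensim 3.x (in gensim 4.x the helper lives in gensim.models.fasttext):

# Sketch: let the library generate the ngram hashes and compare with the manual list.
from gensim.models.utils_any2vec import ft_ngram_hashes

lib_hashes = ft_ngram_hashes('hello', 3, 4, 100)   # word, min_n, max_n, bucket
print(sorted(lib_hashes) == sorted(ngram_hashes))  # ngram_hashes from the question's code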

gojomo
  • Thanks for the reply @gojomo, I succeeded in replicating the expected behaviour. Just for clarification, and sorry if it's a silly question, but what is the reason for this "disaggregation" into two matrices, one for `.vectors_vocab` and another for `.vectors_ngrams`? Because the original paper only refers to "we represent a word by the sum of the vector representations of its n-grams". – threepwood Mar 05 '20 at 12:05
  • Also from the original 'Enriching Word Vectors with Subword Information' (FastText) paper: "Each word w is represented as a bag of character n-gram. We add special boundary symbols < and > at the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences. **We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams).**" [emphasis added] – gojomo Mar 05 '20 at 20:16
  • The original Facebook C++ implementation, IIRC, included both the full-word vectors, and the n-gram vectors, in one contiguous array (with one group at the front, and the other at the back) – & recalculated the final (combined) vector every time it was requested. I believe the implementor of gensim's approach keeps the raw/trained full-word vectors as needed for continued training, but caches the combined vectors for all in-vocabulary words as a performance optimization. (Having those full-word final vectors in an array lets bulk vectorized ops go much faster for `most_similar` etc.) – gojomo Mar 05 '20 at 20:19
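A quick way to see the split discussed in these comments, using the toy model and the gensim 3.x attribute names from the question (exact names vary by version):

print(ft.wv.vectors_vocab.shape)   # (vocabulary size, size): one trained row per full in-vocab word
print(ft.wv.vectors_ngrams.shape)  # (bucket, size): one trained row per ngram hash bucket
print(ft.wv.vectors.shape)         # (vocabulary size, size): cached combined vectors served by ft.wv[word]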