
I am using the Word2Vec module of the Gensim library to train a word embedding; the dataset is 400k sentences with 100k unique words (it's not English).

I'm using this code to monitor and calculate the loss:

from gensim.models.callbacks import CallbackAny2Vec

class MonitorCallback(CallbackAny2Vec):
    def __init__(self, test_words):
        self._test_words = test_words

    def on_epoch_end(self, model):
        print("Model loss:", model.get_latest_training_loss())  # print loss
        for word in self._test_words:  # show how the neighbours of the monitored words evolve
            print(model.wv.most_similar(word))


monitor = MonitorCallback(["MyWord"])  # monitor with demo words

w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE, window=W2V_WINDOW, min_count=W2V_MIN_COUNT, callbacks=[monitor])

w2v_model.build_vocab(tokenized_corpus)

words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

print("[*] Training...")

# Train Word Embeddings
w2v_model.train(tokenized_corpus, total_examples=len(tokenized_corpus), epochs=W2V_EPOCH)

The problem is that from epoch 1 the loss is 0, and the vectors of the monitored words don't change at all!

[*] Training...
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0

So what is the problem here? Is this normal? The tokenized corpus is a list of lists, something like `tokenized_corpus[0] = ["word1", "word2", ...]`.

I googled, and it seems some older versions of Gensim had problems calculating the loss, but those reports are from almost a year ago, so it seems like it should be fixed by now?

I tried the code provided in the answer to this question as well, but the loss is still 0:

Loss does not decrease during training (Word2Vec, Gensim)

EDIT 1: After adding `compute_loss=True`, the loss shows up, but it keeps climbing, and the top similar words and their similarities don't change at all:

Model loss: 2187903.5
Model loss: 3245492.0
Model loss: 4103624.5
Model loss: 4798541.0
Model loss: 5413940.0
Model loss: 5993822.5
Model loss: 6532631.0
Model loss: 7048384.5
Model loss: 7547147.0
OneAndOnly

1 Answer


The top issue with your code is that you haven't used the Word2Vec initialization parameter needed to toggle loss-tracking on: `compute_loss=True`.

(See 'parameters' section of https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec )
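With that parameter added, the model construction from the question becomes (reusing the question's W2V_* constants and monitor callback):

w2v_model = gensim.models.word2vec.Word2Vec(
    size=W2V_SIZE,
    window=W2V_WINDOW,
    min_count=W2V_MIN_COUNT,
    compute_loss=True,  # enables the internal loss tally read by get_latest_training_loss()
    callbacks=[monitor],
)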

Even with that fix, the loss-reporting is still quite buggy (as of gensim-3.8.3 & this writing in August 2020):

  • it's not the per-epoch total or per-example average one might expect. (So if you need that, as a workaround, your callback should remember the last value and compute the delta, or reset the internal counter to 0.0, at each epoch's end; see the sketch after this list.)
  • it definitely loses precision in larger training runs, eventually becoming useless. (This may not be an issue for you.)
  • it might lose some tallies due to multithreaded value-overwriting. (This may not be a practical issue for you, depending on why you're consulting the loss value.)
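As a rough workaround for the first point, the callback can difference gensim's running total to recover a per-epoch figure (a minimal sketch; `EpochLossCallback` is an illustrative name, not gensim API):

from gensim.models.callbacks import CallbackAny2Vec

class EpochLossCallback(CallbackAny2Vec):
    def __init__(self):
        self._cumulative_loss = 0.0  # last value of gensim's running tally

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()  # cumulative across all epochs so far
        print("Loss this epoch:", total - self._cumulative_loss)
        self._cumulative_loss = total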
gojomo
  • Thanks! That fixed the 0-loss problem, but the top similar words and their similarities still don't change at all; for example, the top similar one is always ('word20', 0.9581440091133118) at every epoch report, and neither the number nor the word changes. Is this normal? – OneAndOnly Aug 20 '20 at 18:41
  • `most_similar()` is designed for use *after* training, and needs (in current released gensim versions) to use a cached set of unit-normalized vectors – and that cache isn't automatically cleared upon more training. So, if you're using it mid-training, you should clear the cache manually after each epoch, before checking the current results, with something like `model.wv.vectors_norm = None`. – gojomo Aug 20 '20 at 19:09
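Putting the two fixes together, the question's monitor callback might look like this (a sketch against gensim 3.x, where `wv.vectors_norm` holds the cached unit-normalized vectors that `most_similar()` consults):

from gensim.models.callbacks import CallbackAny2Vec

class MonitorCallback(CallbackAny2Vec):
    def __init__(self, test_words):
        self._test_words = test_words
        self._cumulative_loss = 0.0

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()
        print("Model loss this epoch:", total - self._cumulative_loss)
        self._cumulative_loss = total
        model.wv.vectors_norm = None  # drop the stale similarity cache so most_similar() reflects the latest epoch
        for word in self._test_words:
            print(word, model.wv.most_similar(word))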