
I am training a word2vec model with gensim on 800k browser user-agent strings. My vocabulary size is between 300 and 1000, depending on the word-frequency threshold. I am looking at a few embedding vectors and similarities to see whether the algorithm has converged. Here is my code:

import gensim
import numpy as np
from copy import copy

wv_sim_min_count_stat = {}
window = 7; min_count = 50; workers = 10; size = 128
total_iterate = 1000

for min_count in [50, 100, 500]:
    print(min_count)

    wv_sim_min_count_stat[min_count] = {}
    model = gensim.models.Word2Vec(size=size, window=window, min_count=min_count,
                                   workers=workers, iter=1, sg=1)
    model.build_vocab(ua_parsed)

    wv_sim_min_count_stat[min_count]['vocab_counts'] = [
        len(ua_parsed), len(model.wv.vocab), len(model.wv.vocab) / len(ua_parsed)]
    wv_sim_min_count_stat[min_count]['test'] = []

    # linearly decaying learning-rate schedule from 0.025 down to 0.001
    alphas = np.arange(0.025, 0.001, (0.001 - 0.025) / (total_iterate + 1))
    for i in range(total_iterate):
        # one pass over the corpus per call, with explicit alpha bounds
        model.train(ua_parsed, total_examples=model.corpus_count,
                    epochs=model.iter, start_alpha=alphas[i], end_alpha=alphas[i + 1])

        # snapshot a few embedding vectors and one similarity after each pass
        wv_sim_min_count_stat[min_count]['test'].append(
            (copy(model.wv['iphone']), copy(model.wv['(windows']),
             copy(model.wv['mobile']), copy(model.wv['(ipad;']),
             copy(model.wv['ios']), model.similarity('(ipad;', 'ios')))

Unfortunately, even after 1000 epochs there is no sign of convergence in the embedding vectors. For example, I plot one dimension of the '(ipad;' embedding vector against the number of epochs below:

import matplotlib.pyplot as plt

for min_count in [50, 100, 500]:
    # index 3 of each snapshot tuple is the '(ipad;' vector; plot dimension 1
    plt.plot(np.stack(list(zip(*wv_sim_min_count_stat[min_count]['test']))[3])[:, 1],
             label=str(min_count))

plt.legend()

[Plot: one dimension of the '(ipad;' embedding vs. number of epochs, for each min_count]

I have looked at many blogs and papers, and it seems nobody trains word2vec beyond 100 epochs. What am I missing here?

dani d

1 Answer


Your dataset, user-agent strings, may be odd for word2vec. It's not natural language; it might not have the same variety of co-occurrences that causes word2vec to do useful things for natural language. (Among other things, a dataset of 800k natural-language sentences/docs would tend to have a much larger vocabulary than just ~1,000 words.)

Your graphs do look like they're roughly converging, to me. In each case, as the learning-rate alpha decreases, the dimension magnitude is settling towards a final number.

There is no reason to expect the magnitude of a particular dimension, of a particular word, would reach the same absolute value in different runs. That is: you shouldn't expect the three lines you're plotting, under different model parameters, to all tend towards the same final value.

Why not?

The algorithm includes random initialization, randomization during training (in negative sampling and frequent-word downsampling), and, with multi-threading, some arbitrary re-ordering of training examples due to OS thread-scheduling jitter. As a result, even with exactly the same metaparameters and the same training corpus, a single word could land at different coordinates in subsequent training runs. But its distances and orientation with regard to other words in the same run should be about as useful.
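
For illustration, a minimal sketch (using the same pre-4.0 gensim API and ua_parsed corpus as the question's code; not part of the original code): two runs with identical metaparameters but different seeds put the same word at different raw coordinates, while within-run similarities stay roughly comparable.

import gensim
import numpy as np

def train_once(corpus, seed):
    m = gensim.models.Word2Vec(size=128, window=7, min_count=50, sg=1,
                               iter=5, seed=seed, workers=1)
    m.build_vocab(corpus)
    m.train(corpus, total_examples=m.corpus_count, epochs=m.iter)
    return m

m1 = train_once(ua_parsed, seed=1)
m2 = train_once(ua_parsed, seed=2)

# raw coordinates land in different places across runs...
print(np.allclose(m1.wv['ios'], m2.wv['ios']))   # almost surely False
# ...but within-run relationships should be comparably useful
print(m1.wv.similarity('(ipad;', 'ios'),
      m2.wv.similarity('(ipad;', 'ios'))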

With different metaparameters like min_count, and thus a different ordering of surviving words during initialization, and then wildly different random initialization, the final coordinates per word could be especially different. There is no inherent set of best final coordinates for any word, even with regard to a particular fixed corpus or initialization. There are just coordinates that work increasingly well, through a particular randomized initialization/training session, balanced against all the other co-trained words/examples.
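
If the goal is just to check whether a single run is settling, one possible diagnostic (a sketch of my own, reusing model, alphas, total_iterate, and ua_parsed from the question's loop) is to track how much a word's whole vector moves per pass, e.g. the cosine similarity of successive snapshots, which should drift toward 1.0 as alpha decays, rather than watching one raw coordinate:

from copy import copy
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

prev = copy(model.wv['ios'])
for i in range(total_iterate):
    model.train(ua_parsed, total_examples=model.corpus_count,
                epochs=model.iter, start_alpha=alphas[i], end_alpha=alphas[i + 1])
    cur = copy(model.wv['ios'])
    print(i, cos(prev, cur))   # should approach 1.0 if the run is settling
    prev = cur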

gojomo
  • Thanks @gojomo for your reply. I don't expect the coordinates in different runs to converge to the same values, but I do expect that within each individual training run the embedding coordinates stop oscillating and settle down. – dani d Aug 23 '17 at 18:40
  • Your graphs seem to show early oscillations, ending anywhere from 100 to 300 passes before the end. So isn't that what you expect? Noting that your data may be quite different from typical natural-language word2vec corpuses – smaller vocab & less contextual variation – you may want to try a similar graph on real natural-language data, to see if the same patterns are evident (see the sketch below). If so – similar oscillations early, then settling – then you're at least getting the same behavior as in classic word2vec applications. If not, it may be an issue unique to your data. – gojomo Aug 23 '17 at 18:54
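
A rough sketch of that experiment (assumptions: gensim's downloader module is available to fetch text8, and 'computer' is just an arbitrary frequent token to track; the loop mirrors the question's code):

import gensim
import gensim.downloader as api
import matplotlib.pyplot as plt
import numpy as np

text8 = api.load('text8')  # ~17M tokens of cleaned English Wikipedia text

model = gensim.models.Word2Vec(size=128, window=7, min_count=50, sg=1, iter=1)
model.build_vocab(text8)

total_iterate = 200  # fewer passes should suffice on a large corpus
alphas = np.arange(0.025, 0.001, (0.001 - 0.025) / (total_iterate + 1))
trace = []
for i in range(total_iterate):
    model.train(text8, total_examples=model.corpus_count,
                epochs=model.iter, start_alpha=alphas[i], end_alpha=alphas[i + 1])
    trace.append(model.wv['computer'][1])  # track one dimension, as in the question

plt.plot(trace)
plt.xlabel('epoch'); plt.ylabel("dim 1 of 'computer'")
plt.show()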