I am training a word2vec model with gensim on 800k browser user agent strings. My vocabulary size is between 300 and 1,000 words, depending on the word-frequency limit. I am looking at a few embedding vectors and similarities to see whether the algorithm has converged. Here is my code:
import gensim
import numpy as np
from copy import copy

window = 7
workers = 10
size = 128
total_iterate = 1000

wv_sim_min_count_stat = {}
for min_count in [50, 100, 500]:
    print(min_count)
    wv_sim_min_count_stat[min_count] = {}
    model = gensim.models.Word2Vec(size=size, window=window, min_count=min_count,
                                   workers=workers, iter=1, sg=1)
    model.build_vocab(ua_parsed)
    wv_sim_min_count_stat[min_count]['vocab_counts'] = [
        len(ua_parsed), len(model.wv.vocab), len(model.wv.vocab) / len(ua_parsed)]
    wv_sim_min_count_stat[min_count]['test'] = []
    # learning rate decays linearly from 0.025 down to 0.001 over all epochs
    alphas = np.arange(0.025, 0.001, (0.001 - 0.025) / (total_iterate + 1))
    for i in range(total_iterate):
        model.train(ua_parsed, total_examples=model.corpus_count,
                    epochs=model.iter, start_alpha=alphas[i], end_alpha=alphas[i + 1])
        # snapshot a few embedding vectors plus one pairwise similarity after each epoch
        wv_sim_min_count_stat[min_count]['test'].append(
            (copy(model.wv['iphone']), copy(model.wv['(windows']),
             copy(model.wv['mobile']), copy(model.wv['(ipad;']),
             copy(model.wv['ios']),
             model.wv.similarity('(ipad;', 'ios')))
Unfortunately, even after 1000 epochs there is no sign of convergence in the embedding vectors. For example, below I plot one dimension (index 1) of the '(ipad;' embedding vector against the number of epochs:
import matplotlib.pyplot as plt

for min_count in [50, 100, 500]:
    # index 3 in each snapshot is the '(ipad;' vector; plot its dimension 1
    plt.plot(np.stack(list(zip(*wv_sim_min_count_stat[min_count]['test']))[3])[:, 1],
             label=str(min_count))
plt.legend()
Embedding of '(ipad;' vs. number of epochs
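To quantify this beyond eyeballing a single coordinate, here is a rough sketch of how I would measure convergence as the L2 norm of the change in a tracked vector between consecutive epochs (it reuses the wv_sim_min_count_stat structure above; snapshot_drift is just an illustrative helper name, not something I have in my pipeline):

def snapshot_drift(snapshots, idx=3):
    # stack the idx-th tracked vector from every epoch snapshot -> shape (epochs, size)
    vecs = np.stack([snap[idx] for snap in snapshots])
    # L2 norm of the change between consecutive epochs -> shape (epochs - 1,)
    return np.linalg.norm(np.diff(vecs, axis=0), axis=1)

for min_count in [50, 100, 500]:
    # idx=3 corresponds to the '(ipad;' vector in each snapshot tuple
    plt.plot(snapshot_drift(wv_sim_min_count_stat[min_count]['test']),
             label=str(min_count))
plt.legend()

If training were converging, I would expect this drift to shrink toward zero, but a single coordinate bouncing around (as in the plot above) suggests it does not.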
I have looked at many blogs and papers, and it seems nobody trains word2vec beyond 100 epochs. What am I missing here?