I iteratively apply the...
bigram.add_vocab(<List of List with Tokens>)
method in order to update a...
bigram = gensim.models.phrases.Phrases(min_count=bigramMinFreq, threshold=10.0)
Gensim Phrases model. Each iteration adds up to ~10'000 documents, so I expect the Phrases model to grow with every added document set. I check this by looking at the size of the bigram vocabulary with...
len(bigram.vocab)
I also check the number of phrasegrams in the frozen Phrases model with...
bigram_freezed = bigram.freeze()
len(bigram_freezed.phrasegrams)
A resulting output looks as follows:
Data of directory: 000 is loaded
Num of Docs: 97802
Updated Bigram Vocab is: 31819758
Amount of phrasegrams in freezed bigram model: 397554
-------------------------------------------------------
Data of directory: 001
Num of Docs: 93368
Updated Bigram Vocab is: 17940420
Amount of phrasegrams in freezed bigram model: 429162
-------------------------------------------------------
Data of directory: 002
Num of Docs: 87265
Updated Bigram Vocab is: 36120292
Amount of phrasegrams in freezed bigram model: 661023
-------------------------------------------------------
Data of directory: 003
Num of Docs: 55852
Updated Bigram Vocab is: 20330876
Amount of phrasegrams in freezed bigram model: 604504
-------------------------------------------------------
Data of directory: 004
Num of Docs: 49390
Updated Bigram Vocab is: 31101880
Amount of phrasegrams in freezed bigram model: 745827
-------------------------------------------------------
Data of directory: 005
Num of Docs: 56258
Updated Bigram Vocab is: 19236483
Amount of phrasegrams in freezed bigram model: 675705
-------------------------------------------------------
...
As can be seen, neither the bigram vocab count nor the phrasegram count of the frozen bigram model increases monotonically. I expected both counts to grow as more documents are added.
Am I misunderstanding what bigram.vocab and bigram_freezed.phrasegrams refer to? (If needed, I can add the whole corresponding Jupyter Notebook cell.)