
While building its vocabulary, Doc2Vec allows you to set, via the min_count parameter, the minimum number of occurrences a word must have in the documents to be included in the vocabulary.

model = gensim.models.doc2vec.Doc2Vec(vector_size=200, min_count=3, epochs=100, workers=8)

Is it possible to exclude words which appear far too often, via some parameter?

I know that one way is to do this in a preprocessing step, by counting each word's occurrences and manually deleting the overly frequent ones, but it would be nice to know if there is a built-in method, as that leaves more room for experimentation. Many thanks for the answer.

Igor sharm
  • Why do you want to remove words that appear far too often? What kind of testing are you planning to do after excluding them? – vb_rises Jun 06 '19 at 13:36
  • @Vishal I am performing clustering of texts, focused on high similarity between texts. The problem is that in this specific dataset there are some words which I know appear very often without actual contextual meaning and may mislead the system into clustering documents based on those words (plus some such words I don't know about). I also tried TF-IDF vectorization for a BOW approach and it showed pretty good results, but now the goal is to compare it with Doc2Vec. The thing is, TF-IDF has parameters such as maximum occurrence of a word (a float parameter representing the ratio of documents in which a word occurs). – Igor sharm Jun 06 '19 at 14:08
  • 2
    You can write your own trim rule. Check this [link](https://github.com/RaRe-Technologies/gensim/issues/824) to write your own trim rule and pass the documents as well as this trim rule during initialization. Also you could check the `ns_exponent` parameter of Doc2Vec whose negative values samples low frequency words more than high frequency words. Read more [here](https://radimrehurek.com/gensim/models/doc2vec.html) – vb_rises Jun 06 '19 at 14:31
  • @Vishal wow, especially ns_exponent looks exactly like what I was looking for, thanks a lot, will try it right away – Igor sharm Jun 06 '19 at 14:39

1 Answer


There's no explicit max_count parameter in gensim's Word2Vec.

If you're sure some tokens are meaningless, you should preprocess your text to eliminate them.
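
For example, a minimal preprocessing sketch (the toy documents and the 0.8 document-frequency cutoff are placeholders of my own, not values from the question): count how many documents each word appears in, drop the ones above your chosen ratio, and only then build the TaggedDocument corpus:

```python
from collections import Counter

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_docs: your corpus as lists of tokens (toy placeholder data here)
tokenized_docs = [
    ["alpha", "common", "beta"],
    ["common", "gamma", "alpha"],
    ["common", "delta", "beta"],
]

# document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))

# drop words present in more than 80% of documents (cutoff is arbitrary here)
max_doc_ratio = 0.8
too_common = {w for w, df in doc_freq.items() if df / len(tokenized_docs) > max_doc_ratio}

filtered_docs = [[w for w in doc if w not in too_common] for doc in tokenized_docs]
train_corpus = [TaggedDocument(words, [i]) for i, words in enumerate(filtered_docs)]

# then train as usual; min_count=1 only because the toy corpus is tiny
model = Doc2Vec(vector_size=200, min_count=1, epochs=100, workers=8)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```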

There is also a trim_rule option that can be passed at model instantiation or to build_vocab(), where your own function can discard some words; see the gensim docs at:

https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
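
A minimal sketch of such a rule (the MAX_COUNT cutoff is an illustrative value of my own, not a gensim parameter; train_corpus is an iterable of TaggedDocuments as in the earlier sketch):

```python
from gensim import utils
from gensim.models.doc2vec import Doc2Vec

MAX_COUNT = 1000  # illustrative absolute-count cutoff, not a built-in gensim setting

def drop_frequent(word, count, min_count):
    """Trim rule gensim calls for each vocabulary candidate."""
    if count > MAX_COUNT:
        return utils.RULE_DISCARD   # throw away over-frequent words
    return utils.RULE_DEFAULT       # otherwise fall back to normal min_count handling

model = Doc2Vec(vector_size=200, min_count=3, epochs=100, workers=8)
model.build_vocab(train_corpus, trim_rule=drop_frequent)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```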

Similarly, you could avoid calling build_vocab() directly, and instead call its substeps, editing the discovered raw-counts dictionary before the vocabulary is finalized. You would want to consult the source code to do this, and could use the code that discards too-infrequent words as a model for your own additional code.

The classic sample parameter of Word2Vec also controls a downsampling of high-frequency words, to prevent the model from spending too much relative effort on redundantly training abundant words. The more aggressive (smaller) this value is, the more instances of high-frequency words will be randomly skipped during training. The default of 1e-03 (0.001) is very conservative; in very large natural-language corpora I've seen good results up to 1e-07 (0.0000001) or 1e-08 (0.00000001) – so in another domain where some lower-meaning tokens are very frequent, similarly aggressive downsampling is worth trying.
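
For example, you can pass a smaller sample value at instantiation (the exact value below is just something to tune against, not a recommendation):

```python
from gensim.models.doc2vec import Doc2Vec

# more aggressive downsampling of very frequent words than the 1e-3 default
model = Doc2Vec(
    vector_size=200, min_count=3, epochs=100, workers=8,
    sample=1e-5,  # on very large corpora, values toward 1e-7 or 1e-8 may help
)
```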

The newer ns_exponent option changes negative sampling to adjust the relative favoring of less-frequent words. The original word2vec work used a fixed value of 0.75, but some research since has suggested other domains, like recommendation systems, might benefit from other values that are more or less sensitive to actual token frequencies. (The relevant paper is linked in the gensim docs for the ns_exponent parameter.)
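
For example (the 0.5 below is arbitrary, shown only to illustrate where the parameter goes; it requires a gensim version recent enough to expose ns_exponent):

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=200, min_count=3, epochs=100, workers=8,
    negative=5,       # ns_exponent only matters when negative sampling is used
    ns_exponent=0.5,  # 0.75 is the classic default; lower values flatten the
                      # frequency-weighted choice of negative examples
)
```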

gojomo