Does sample=0 in Gensim word2vec mean that no downsampling is used during training? The documentation says only that

"useful range is (0, 1e-5)"

However, setting the threshold to 0 would cause P(wi) to be equal to 1, meaning that no word would be discarded. Am I understanding this correctly?

I'm working on a relatively small dataset of 7597 Facebook posts (18945 words), and my embeddings perform far better with sample=0 than with any value in the recommended range. Is there a particular reason for this? Could it be the size of the text?

1 Answer

That seems an incredibly tiny dataset for Word2Vec training. (Is that only 18945 unique words, or 18945 words total, so hardly more than 2 words per post?)

Sampling is most useful on larger datasets, where there are so many examples of common words that more training examples of them aren't adding much, but they are stealing training time from, and overweighting those words' examples compared to, other less-frequent words.

Yes, sample=0 means no down-sampling.
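To make the effect concrete, here is a minimal sketch of how the threshold turns into a per-word keep probability. It assumes the formula used in the original word2vec C code, which Gensim appears to follow, and it treats sample=0 as "disabled" rather than plugging 0 into the formula. The function name `keep_probability` and the corpus counts are hypothetical, chosen only for illustration.

```python
import math


def keep_probability(count, total_words, sample):
    """Chance that one occurrence of a word is kept during training.

    Assumed formula (from the original word2vec C code):
        p = (sqrt(count / threshold) + 1) * (threshold / count)
    with threshold = sample * total_words, capped at 1.0.
    """
    if not sample:
        # sample=0 is treated as "no downsampling": every occurrence is kept.
        return 1.0
    threshold = sample * total_words
    p = (math.sqrt(count / threshold) + 1) * (threshold / count)
    return min(p, 1.0)


# Hypothetical corpus of one million tokens with a very common word,
# a mid-frequency word, and a rare word.
total_words = 1_000_000
for count in (50_000, 1_000, 10):
    for sample in (0, 1e-3, 1e-5):
        p = keep_probability(count, total_words, sample)
        print(f"count={count:>6} sample={sample:<6} keep_probability={p:.4f}")
```

Under these assumptions only words whose count is well above `sample * total_words` are thinned out, and a smaller `sample` thins them more aggressively.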

gojomo
  • 18945 unique words. I'm doing my PhD (digital humanities) studying word embedding applications in qualitative research, and they actually perform quite well on small datasets of this size too, at least for my research goal. In this case the goal was a first experimental exploration of the data and of my hypothesis. However, as a linguist, I imagined that the proportions wouldn't change that much, since articles and prepositions do not add any semantic information to my context. Yet using no downsampling yields accurate vectors, while using it yields very sparse and "noisy" vectors. – Leonardo Sanna Mar 31 '20 at 07:05
  • Interesting! If you enable logging at `INFO` level, some of the output will indicate how much your `sample` value affects the actual number of words trained. I suppose that some word distributions, or other parameter/data contributions (maybe a small `window` and many tiny individual training texts which each have a lot of downsampled frequent words?), might create an outsized effect from `sample`. But the default value, `1e-03`, should have only a mild effect, and the logging output will show more about how many words are affected by it. – gojomo Mar 31 '20 at 18:19
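
For anyone wanting to try the logging suggestion, a minimal sketch using Python's standard `logging` module follows. The helper `train_with_sample` is hypothetical, and `sentences` is assumed to be your own list of tokenized posts; the Gensim import is kept inside the function so that the logging setup can run on its own.

```python
import logging

# Send INFO-level messages to the console, including those emitted by Gensim.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
logging.getLogger("gensim").setLevel(logging.INFO)


def train_with_sample(sentences, sample):
    """Train a Word2Vec model with the given `sample` threshold.

    `sentences` is assumed to be a list of token lists, one per post.
    All other parameters are left at the library defaults.
    """
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return Word2Vec(sentences=sentences, sample=sample)
```

Training once with `sample=0` and once with `sample=1e-3` on the same `sentences`, then comparing the logged word counts, should show how much of the corpus the threshold actually removes.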