
I'm trying to understand what the -wordNgrams parameter in fastText does.

Let's take the following text as an example:

The quick brown fox jumps over the lazy dog

Now, with a context window of size 2 at the word 'brown', we would get the following samples (a small sketch of this windowing follows the list):

  • (brown, the)
  • (brown, quick)
  • (brown, fox)
  • (brown, jumps)
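
To make how I picture the plain window concrete, here is a small sketch (the function and variable names are mine, not fastText's API):

    def context_pairs(tokens, center_index, window=2):
        """Return (center, context) pairs for one position of the sliding window."""
        center = tokens[center_index]
        lo = max(0, center_index - window)
        hi = min(len(tokens), center_index + window + 1)
        return [(center, tokens[i]) for i in range(lo, hi) if i != center_index]

    tokens = "The quick brown fox jumps over the lazy dog".lower().split()
    print(context_pairs(tokens, tokens.index("brown")))
    # [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]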

If we set -wordNgrams 2, would we find the word 'brown_fox' in our vocabulary? And would our training samples then be:

  • (brown_fox, the)
  • (brown_fox, quick)
  • (brown_fox, jumps)
  • (brown_fox, over)

Is that correct?

I couldn't find any explanation about this anywhere.

Kleyson Rios
  • If you try it, do you see those bigrams in the model's vocabulary? (I'm not sure, but vaguely recall that `-wordNgrams` may only have effect in `supervised` mode, and may use the same sort of shared collision-tolerating set of vector-buckets as are used for character n-grams, so you wouldn't necessarily see exactly-remembered bigrams in results. But, bigrams which had enough impact during training – influencing their bucket's vector more than the noise from other bigrams – would continue to have impact during post-training classifications.) – gojomo Sep 12 '19 at 16:07
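
For what it's worth, here is a rough sketch of the collision-tolerating bucketing described in the comment above, the way I understand it from fastText's dictionary.cc (the constant and formula are my recollection, so treat them as an assumption):

    BUCKET = 2_000_000  # default value of -bucket

    def word_ngram_buckets(word_ids, n=2):
        """Map each word n-gram of a tokenized line to a shared bucket id."""
        for i, h in enumerate(word_ids):
            acc = h
            for j in range(i + 1, min(i + n, len(word_ids))):
                acc = acc * 116049371 + word_ids[j]
                # N-grams are never stored as vocabulary entries; they only
                # select a vector bucket, and different n-grams may collide.
                yield acc % BUCKET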

1 Answer


I was wondering about the same question.

I found an issue which said 'word n-grams are only used in supervised mode', so setting wordNgrams=2 does not work in unsupervised mode.

Then I tested it myself:

./fasttext skipgram -input data.txt -output test -dim 50 -wordNgrams 2 -loss hs

cut -d' ' -f1 test.vec > vocab.txt

The result is that there are only single words and subwords in vocab.txt; no bigram tokens such as 'brown_fox' appear.
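
As an extra cross-check, you can inspect the vocabulary from Python with the official fasttext bindings (a small sketch; data.txt stands for the same corpus as above, and the underscore/space test is only a heuristic for joined-bigram tokens):

    import fasttext

    # Same settings as the CLI run above: skip-gram, 50 dims,
    # hierarchical softmax, wordNgrams=2.
    model = fasttext.train_unsupervised(
        'data.txt', model='skipgram', dim=50, wordNgrams=2, loss='hs')

    # The vocabulary contains only single words; no joined bigram tokens show up.
    print(len(model.words), 'vocabulary entries')
    print([w for w in model.words if '_' in w or ' ' in w])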

siberiawolf61