
I'm trying to understand what the -wordNgrams parameter in fastText does.

Let's take the following text as an example:

The quick brown fox jumps over the lazy dog

Now, with a context window of size 2 at the word 'brown', we would get the following samples (a small sketch of this windowing follows the list):

  • (brown, the)
  • (brown, quick)
  • (brown, fox)
  • (brown, jumps)
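
To make how I picture the plain window concrete, here is a small sketch (the function and variable names are mine, not fastText's API):

    def context_pairs(tokens, center_index, window=2):
        """Return (center, context) pairs for one position of the sliding window."""
        center = tokens[center_index]
        lo = max(0, center_index - window)
        hi = min(len(tokens), center_index + window + 1)
        return [(center, tokens[i]) for i in range(lo, hi) if i != center_index]

    tokens = "The quick brown fox jumps over the lazy dog".lower().split()
    print(context_pairs(tokens, tokens.index("brown")))
    # [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]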

If we set -wordNgrams 2, would we find the word 'brown_fox' in our vocabulary? And would our training samples then be:

  • (brown_fox, the)
  • (brown_fox, quick)
  • (brown_fox, jumps)
  • (brown_fox, over)

Is that correct?

I couldn't find any explanation about this anywhere.

Kleyson Rios
  • If you try it, do you see those bigrams in the model's vocabulary? (I'm not sure, but vaguely recall that `-wordNgrams` may only have effect in `supervised` mode, and may use the same sort of shared collision-tolerating set of vector-buckets as are used for character n-grams, so you wouldn't necessarily see exactly-remembered bigrams in results. But, bigrams which had enough impact during training – influencing their bucket's vector more than the noise from other bigrams – would continue to have impact during post-training classifications.) – gojomo Sep 12 '19 at 16:07
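
For what it's worth, here is a rough sketch of the collision-tolerating bucketing described in the comment above, the way I understand it from fastText's dictionary.cc (the constant and formula are my recollection, so treat them as an assumption):

    BUCKET = 2_000_000  # default value of -bucket

    def word_ngram_buckets(word_ids, n=2):
        """Map each word n-gram of a tokenized line to a shared bucket id."""
        for i, h in enumerate(word_ids):
            acc = h
            for j in range(i + 1, min(i + n, len(word_ids))):
                acc = acc * 116049371 + word_ids[j]
                # N-grams are never stored as vocabulary entries; they only
                # select a vector bucket, and different n-grams may collide.
                yield acc % BUCKET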

1 Answer


I was wondering about the same question.

I found an issue which said 'word n-grams are only used in supervised mode', so setting wordNgrams=2 does not work in unsupervised mode.

Then I tested it myself:

./fasttext skipgram -input data.txt -output test -dim 50 -wordNgrams 2 -loss hs

cut -d' ' -f1 test.vec > vocab.txt

The result is that there are only single words and subwords in vocab.txt; no bigram tokens such as 'brown_fox' appear.
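
As an extra cross-check, you can inspect the vocabulary from Python with the official fasttext bindings (a small sketch; data.txt stands for the same corpus as above, and the underscore/space test is only a heuristic for joined-bigram tokens):

    import fasttext

    # Same settings as the CLI run above: skip-gram, 50 dims,
    # hierarchical softmax, wordNgrams=2.
    model = fasttext.train_unsupervised(
        'data.txt', model='skipgram', dim=50, wordNgrams=2, loss='hs')

    # The vocabulary contains only single words; no joined bigram tokens show up.
    print(len(model.words), 'vocabulary entries')
    print([w for w in model.words if '_' in w or ' ' in w])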

siberiawolf61