
It seems that Gensim's FastText implementation produces a much smaller model file than Facebook's native implementation. With a corpus of 1 million words, the native fastText model is 6GB, while the Gensim FastText model is only 68MB.

Is there any information stored in Facebook's implementation not present in Gensim's implementation?

Jinhua Wang

1 Answer


Please show which models generated this comparison, or what process was used to measure them; it probably involves a bug or misunderstanding.

The size of a model is influenced far more by the number of unique words (and character n-gram buckets) than by the corpus size.

The saved sizes of a Gensim-trained FastText model and a native Facebook fastText-trained model should be roughly in the same ballpark. Be sure to include all subsidiary raw numpy files (ending in .npy, alongside the main save file) created by Gensim's .save() - all such files are required to re-.load() the model!

Similarly, if you were to load a Facebook FastText model into Gensim, then use Gensim's .save(), the total disk space taken in both alternate formats should be quite close.
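A sketch of that round-trip, using Gensim's `load_facebook_model()` (the file name `native_model.bin` is just a placeholder for whatever native fastText model you have):

```python
from gensim.models.fasttext import load_facebook_model

# Placeholder path: substitute your own native fastText .bin file
model = load_facebook_model("native_model.bin")

# Gensim's save() writes the main file plus sibling .npy arrays;
# all of them count toward the model's total disk footprint
model.save("converted_model")
```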

gojomo
    Thanks @gojomo! It seems that I forgot to count the size of the npy files. By the way, could you check my GitHub issue here: https://github.com/RaRe-Technologies/gensim/issues/3228 and SO question here: https://stackoverflow.com/questions/69127120/gensim-fasttext-cannot-get-latest-training-loss – Jinhua Wang Sep 10 '21 at 04:04
  • That seems to be a bug. – Jinhua Wang Sep 10 '21 at 04:04