Please show which models generated this comparison, or the process used to create it; it probably has bugs or misunderstandings.
The size of a model is influenced more by the number of unique words (and character n-gram buckets) than by the corpus size.
The saved sizes of a Gensim-trained FastText model and a native Facebook FastText-trained model should be roughly in the same ballpark. Be sure to include all subsidiary raw numpy files (ending in `.npy`, alongside the main save-file) created by Gensim's `.save()`, as all such files are required to re-`.load()` the model!
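Since Gensim's `.save()` writes those large arrays to sibling files that share the save path as a prefix, the full on-disk size can be totaled with a small stdlib helper (a sketch; the helper name and example path are hypothetical):

```python
import glob
import os

def total_saved_size(save_path):
    """Sum the main Gensim save-file plus any subsidiary .npy files.

    Gensim's .save() may spill large arrays into sibling files named
    with `save_path` as a prefix (e.g. 'model.wv.vectors_ngrams.npy');
    all of them are needed for a later .load().
    """
    return sum(os.path.getsize(p) for p in glob.glob(glob.escape(save_path) + "*"))

# Hypothetical usage:
# print(total_saved_size("fasttext_model"))
```

That total, not the size of the main save-file alone, is the number to compare against the single Facebook-format `.bin` file.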
Similarly, if you were to load a Facebook FastText model into Gensim, then use Gensim's `.save()`, the total disk space taken by the two alternate formats should be quite close.