17

Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?

Matthew Haugen
  • 12,916
  • 5
  • 38
  • 54
Idriss Brahimi
  • 171
  • 1
  • 1
  • 5
  • I just wanted to add a link to other pretrained gensim models: http://nilc.icmc.usp.br/embeddings – xxx May 09 '23 at 11:34

2 Answers

9

I don't know of any good one. There's one linked from this project, but:

  • it's based on a custom fork of an older gensim, so it won't load in recent gensim code
  • it's not clear what parameters or data it was trained with, and the associated paper may have made uninformed choices about the effects of parameters
  • it doesn't appear to be the right size to include actual doc-vectors for either Wikipedia articles (4-million-plus) or article paragraphs (tens-of-millions), or a significant number of word-vectors, so it's unclear what's been discarded

While it takes a long time and a significant amount of working RAM, there is a Jupyter notebook included in gensim that demonstrates the creation of a Doc2Vec model from Wikipedia:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

So, I would recommend fixing the mistakes in your attempt. (And, if you succeed in creating a model, and want to document it for others, you could upload it somewhere for others to re-use.)

gojomo
  • 52,260
  • 14
  • 86
  • 115
  • I know this is a very old answer but do you think it is possible to train a Doc2Vec model on Google colab? – Dani Jun 13 '21 at 14:29
  • I'm not a user of Google Colab, but if I understand correctly that it lets you run Python code, in a notebook, with enough RAM to do common ML tasks – sure, why not? – gojomo Jun 13 '21 at 18:06
8

Yes! I found two pre-trained doc2vec models at this link,

but I still could not find any pre-trained doc2vec model trained on tweets.

Moniba
  • 789
  • 10
  • 17