
I'm fairly new to this (and not a native English speaker), so I'm having some trouble understanding Gensim's word2vec and doc2vec.

As far as I understand, both can give me the words most similar to a query word via most_similar() (after training).

How can I tell when I should use word2vec versus doc2vec?

Could someone explain the difference in a few words, please?

Thanks.

user3595632
    Modelling-wise, it is really nothing different, except for an additional input cell which carries information about the paragraph, document, etc. that the input sequence was selected from. Read the actual paper proposing it: https://cs.stanford.edu/~quocle/paragraph_vector.pdf – user3639557 Mar 16 '17 at 10:28
    doc2vec captures similarities between documents. [wikimark](https://github.com/amirouche/wikimark/) is a project of mine that tries to compute the similarity of a document against Wikipedia's vital articles. It is another example use of doc2vec (in this case the doc2vec vectors are fed into a scikit-learn regression). – amirouche Mar 29 '18 at 21:26

1 Answer


In word2vec, you train word vectors and then run similarity queries between words. In doc2vec, you tag your texts, and you also get tag vectors. For instance, suppose you have documents from different authors and use the author names as tags on the documents. Then, after doc2vec training, you can use the same vector arithmetic to run similarity queries on the author tags: i.e. who are the most similar authors to AUTHOR_X? If two authors generally use the same words, their vectors will be closer.

AUTHOR_X is not a real word that is part of your corpus, just a tag you choose, so you don't need to insert it into your text manually. Gensim also allows you to train doc2vec with or without word vectors (i.e. if you only care about the similarities between tags).

Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).

If you tell me what problem you are trying to solve, maybe I can suggest which method will be more appropriate.

pembeci
    For text classification, e.g. sentiment classification, does it make a difference whether I use word2vec or doc2vec? In both cases it is going to be the input. – user697911 Aug 29 '17 at 21:23
    @user697911 You can see the Doc2Vec whitepaper here: https://cs.stanford.edu/~quocle/paragraph_vector.pdf In the experiments section, they talk about sentiment analysis. Since you are classifying documents as either positive or negative, Doc2Vec is the preferred approach because it also vectorizes documents, not just words. – vasia Feb 05 '18 at 16:48
    @pembeci What would you recommend for authorship classification: doc2vec or word2vec? And is English the only language supported by Gensim's pre-trained models? – Daniel Vilas-Boas Mar 25 '20 at 01:47
    @DanielVilas-Boas, doc2vec will be better, since it will aggregate the docs for a particular author and summarize them in a vector. For an unknown doc you can directly test the similarity between that doc's vector and the author vectors, or use the vectors as features for other ML algorithms. Second question: no, you can train them on your own corpus. – pembeci Mar 26 '20 at 13:46
    @pembeci Thanks for your suggestion. I am already using doc2vec, but another question that came to my mind is the number of features (vector dimensions) to train with. I started with an arbitrary value of 10; what would you suggest? My dataset is really small (70 documents for 11 authors). – Daniel Vilas-Boas Mar 27 '20 at 01:12