Doc2Vec is an unsupervised algorithm that converts documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library, among others. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, roughly analogous to the skip-gram mode of Word2Vec) or a "Distributed Memory" mode (PV-DM, more analogous to Word2Vec's CBOW mode).
Questions tagged [doc2vec]
556 questions
8 votes, 2 answers
How does Pyspark Calculate Doc2Vec from word2vec word embeddings?
I have a pyspark dataframe with a corpus of ~300k unique rows, each with a "doc" containing a few sentences of text.
After processing, I have a 200 dimension vectorized representation of each row/doc. My NLP Process:
Remove Punctuation…

whs2k (741 rep)
8 votes, 1 answer
Doc2vec and word2vec with negative sampling
My current doc2vec code is as follows.
# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4, iter=20)
I also have a word2vec code as below.
# Train word2vec model
model =…
user8566323
8 votes, 1 answer
What is the difference between gensim LabeledSentence and TaggedDocument
Please help me understand the difference between how TaggedDocument and LabeledSentence of gensim work. My ultimate goal is text classification using a Doc2Vec model and any classifier. I am following this blog!
class…

Rashmi Singh (519 rep)
7 votes, 1 answer
What does epochs mean in Doc2Vec and train when I have to manually run the iteration?
I am trying to understand the epochs parameter in the Doc2Vec function and the epochs parameter in the train function.
In the following code snippet, I manually set up a loop of 4000 iterations. Is it required or passing 4000 as epochs parameter in the…

Suhail Gupta (22,386 rep)
7 votes, 1 answer
What is the difference between doc2vec models when dbow_words is set to 1 or 0?
I read this page, but I do not understand the difference between models built with the following code.
I know when dbow_words is 0, training of doc-vectors is faster.
First model
model = doc2vec.Doc2Vec(documents1, size=100,…

user3092781 (313 rep)
7 votes, 1 answer
creating word2vec model syn1neg.npy extension
When creating the model, files with the following extensions are no longer produced:
.syn1neg.npy
.syn0.npy
My code is below:
corpus = x + y
tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]
model = gensim.models.Word2Vec(tok_corp,…

Tomas Ukasta (170 rep)
7 votes, 3 answers
Is there any way to get the vocabulary size from doc2vec model?
I am using gensim doc2vec. I want to know if there is an efficient way to get the vocabulary size from doc2vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more) this won't be efficient.

Rashmi Singh (519 rep)
6 votes, 1 answer
Measure similarity between two documents using Doc2Vec
I have already trained a gensim doc2Vec model, which finds the most similar documents to an unknown one.
Now I need to find the similarity value between two unknown documents (which were not in the training data, so they can not be referenced by doc…

Borislav Stoilov (3,247 rep)
6 votes, 2 answers
How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?
I'm trying to get the text with its punctuation, as the punctuation is important to my doc2vec model. However, wikicorpus retrieves only the text. After searching the web I found these pages:
Page from gensim github issues section. It…
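One route (hedged: check your gensim version supports it) is the `tokenizer_func` parameter of `WikiCorpus`, which receives `(content, token_min_len, token_max_len, lower)` and can be written to keep punctuation as separate tokens:

```python
import re

def tokenize_keep_punct(content, token_min_len=1, token_max_len=100, lower=True):
    # Keep words AND punctuation marks as separate tokens.
    text = content.lower() if lower else content
    return re.findall(r"\w+|[^\w\s]", text)

# Usage sketch (the dump path is hypothetical):
# from gensim.corpora import WikiCorpus
# wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2",
#                   tokenizer_func=tokenize_keep_punct)

print(tokenize_keep_punct("Hello, world!"))  # ['hello', ',', 'world', '!']
```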

Ghaliamus (101 rep)
6 votes, 1 answer
NLP: Pre-processing in doc2vec / word2vec
A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the…

Simon Hessner (1,757 rep)
6 votes, 1 answer
Doc2vec: Only 10 docvecs in gensim doc2vec model?
I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get doc vectors for all training docs, but only 10 vectors can be found in model.docvecs.
The example of training data (length>10)
docs = ['This is…

GemOfRoe (125 rep)
6 votes, 1 answer
How much data is actually required to train a doc2Vec model?
I have been using gensim's libraries to train a doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what the ideal training data size for a doc2Vec model should be.
I will be sharing my understanding…

Shalabh Singh (360 rep)
6 votes, 1 answer
Does Doc2Vec learn representations for the tags?
I'm using the Doc2Vec tags as unique identifiers for my documents; each document has a different tag with no semantic meaning. I'm using the tags to find specific documents so I can calculate the similarity between them.
Do the tags influence the…

Stanko (4,275 rep)
6 votes, 1 answer
Doc2Vec: Differentiate Sentence and Document
I am just playing around with Doc2Vec from gensim, analysing a Stack Exchange dump to measure the semantic similarity of questions and identify duplicates.
The tutorial on Doc2Vec-Tutorial seems to describe the input as tagged sentences.
But the original…

Vikash Balasubramanian (2,921 rep)
6 votes, 2 answers
doc2vec How to cluster DocvecsArray
I've patched together the following code from examples I've found around the web:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
# random
from…

Shlomi Schwartz (8,693 rep)