Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and is implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag Of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
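As a rough illustration of the difference between the two modes, the sketch below builds the kind of (input, target) training examples each mode conceptually works with. It is a simplified toy, not Gensim's implementation: the helper names `pv_dbow_examples` and `pv_dm_examples`, the tag `"doc_0"`, and the sample sentence are all hypothetical, and real implementations add details such as negative sampling and subsampling.

```python
from typing import List, Tuple


def pv_dbow_examples(tag: str, words: List[str]) -> List[Tuple[str, str]]:
    """PV-DBOW (toy view): the document tag alone is the input
    used to predict each word of the document."""
    return [(tag, word) for word in words]


def pv_dm_examples(
    tag: str, words: List[str], window: int = 2
) -> List[Tuple[str, Tuple[str, ...], str]]:
    """PV-DM (toy view): the document tag plus the surrounding
    context words jointly predict the centre word."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        examples.append((tag, tuple(left + right), target))
    return examples


# Hypothetical one-document corpus, used only for illustration.
doc = ["the", "movie", "was", "great"]
dbow = pv_dbow_examples("doc_0", doc)
dm = pv_dm_examples("doc_0", doc, window=1)

for example in dbow:
    print("PV-DBOW:", example)
for example in dm:
    print("PV-DM:  ", example)
```

In Gensim, the mode is selected through the `dm` parameter of `Doc2Vec` (`dm=1` for PV-DM, `dm=0` for PV-DBOW).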
Questions tagged [doc2vec]
556 questions
0
votes
1 answer
Group by and aggregate problems for numpy arrays over word vectors
My pandas data frame looks something like this:
Movieid review movieRating wordEmbeddingVector
1 "text" 4 [100 dimensional vector]
I am trying to run a doc2vec implementation and I want to be able to group by movie ids and…

Roshini
-1
votes
1 answer
What would be the best way to compare different parts of a document in just one doc2vec embedding?
Let's say I have many documents with a question and an answer. I want to build an embedding where I can find the most similar documents based on just a new question without an answer but still be able to find similar documents based on the whole…

Red Boraley
-1
votes
1 answer
Reverse TF-IDF vector (vec2text)
Given a doc2vec vector generated from some document, is it possible to reverse the vector back to the original document?
If so, does there exist any hash algorithm that would make the vector irreversible but still comparable to other vectors of the…
-1
votes
1 answer
Tokenization of unbalanced dataset
I'm working with a dataset of emails' content which I want to transform with doc2vec. This is a labeled dataset (spam/not-spam) and it is unbalanced (90-10 ratio).
My question is: when tokenizing the emails' content, should I first oversample (using…

Efrat Magidov
-1
votes
1 answer
Why is doc2vec giving different and unreliable results?
I have a set of 20 small documents that talk about a particular kind of issue (training data). Now I want to identify the docs, out of 10K documents, that talk about the same issue.
For this purpose I am using the doc2vec…

Shivam Agrawal
-1
votes
1 answer
Is there any way to train a doc2vec model in multiple batches?
I don't know how to train a model in multiple batches with doc2vec, since I load all my data into RAM and it cannot be loaded.
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import…
-1
votes
1 answer
How to do supervised learning with Gensim/Word2Vec/Doc2Vec having large corpus of text documents?
I have a set of text documents (2000+) with labels (Liked/Disliked). Each document consists of 200+ words.
I am trying to do supervised learning with these documents.
My approach would be:
Vectorize each document in the corpus. Say we have 2347…

afghani
-1
votes
1 answer
How to get the words of clusters
How can I get the words of each cluster?
I divided them into groups
LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
all_content_train = []
j=0
for em in train['KARMA'].values:
…

N.K
-1
votes
1 answer
How to approach a project about analyzing call records and getting meaningful results about the topic
I am analyzing call records and trying to use doc2vec, but I can't find an appropriate way to apply it.
I tried converting words to their roots; later I will try to get rid of stop words (which are also reduced to roots).
I want to understand what each conversation is…

N.K
-1
votes
1 answer
Get all similar documents with doc2vec
I am working with doc2vec from the gensim library and I want to get all similarities with their probabilities, not only the top 10 similarities provided by model.docvecs.most_similar()
Once my model is trained
In [1]: print(model)
Out [1]:…

Oussama Jabri
-1
votes
1 answer
Computing a similarity score for a set of sentences
My team does a lot of chatbot training, and I'm trying to come up with some tools to improve the quality of our work. In chatbot training, it is really important to train intents with diverse utterances that phrase the same intent in very different…

SymphonyTomorrow
-1
votes
2 answers
How does doc2vec create a vector for a sentence?
I am working with Doc2vec for text classification. It creates a vector of a given size for a sentence (e.g. 100, the length of the vector). I am not able to understand how it creates a vector of that length.
I am following this link. In it they are…

Naveen Meka
-2
votes
2 answers
How do I input doc2vec vectors of multiple text columns?
I have a dataset which has 3 different columns of relevant text information which I want to convert into doc2vec vectors and subsequently classify using a neural net. My question is how do I convert these three columns into vectors and input into a…

anmol narang
-3
votes
0 answers
Comparing Similarity Between Two Texts with Doc2Vec
I'm working on a Machine Learning project. I have some user data from an e-commerce website and I'm predicting future purchases. Actually my model is complete but I want to add a new feature to my dataframe.
I haven't used search terms data of users…

XPrime
-3
votes
1 answer
How to find similarity between two lists of strings using doc2vec?
I have lists of strings like the ones below. I would like to see the similarity between list1 and list2 using Doc2Vec.
list1 = [['i','love','machine','learning','its','awesome'],['i', 'love', 'coding', 'in', 'python'],['i', 'love', 'building',…

Praveenkumar