Doc2Vec is an unsupervised algorithm that converts documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and is implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
Questions tagged [doc2vec]
556 questions
0
votes
1 answer
Cannot align graph because multiple tag doc2vec returning more items in doctag_syn0 than there are in the training data
I am training a doc2vec model with multiple tags, so it includes the typical doc "ID" tag and also a label tag, "Category 1." I'm trying to graph the results so that I get the doc distribution in 2D (using LargeVis) but am able…

seeiespi
0
votes
3 answers
Doc2Vec: Similarity Between Coded Documents and Unseen Documents
I have a sample of ~60,000 documents. We've hand coded 700 of them as having a certain type of content. Now we'd like to find the "most similar" documents to the 700 we already hand-coded. We're using gensim doc2vec and I can't quite figure out…

Academic Researcher
0
votes
1 answer
Why does Gensim most similar in doc2vec give the same vector as the output?
I am using the following code to get the ordered list of user posts.
model = doc2vec.Doc2Vec.load(doc2vec_model_name)
doc_vectors = model.docvecs.doctag_syn0
doc_tags = model.docvecs.offset2doctag
for w, sim in…

J Cena
0
votes
1 answer
I want to classify some sentences on the basis of their semantic meaning. How can I use Doc2Vec for this? Or is there a better approach?
I want to apply doc2vec to various reviews that we extracted from a source, and I want to classify these reviews into different classes defined by the user. How can I do this?
0
votes
1 answer
MemoryError using Python and Doc2Vec
I'm trying to train a Doc2Vec model on massive data. I have 20k files totaling 72GB, and wrote this code:
def train():
    onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    data = []
    random.shuffle(onlyfiles)
    …

Dimmy Magalhães
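The excerpt above is truncated, so the actual cause of the MemoryError is not visible. If the problem is that all file contents are accumulated in the `data` list, one common approach is to pass a restartable iterable that reads one file at a time. The sketch below assumes that situation. `StreamingCorpus` and `TaggedDoc` are hypothetical names: `TaggedDoc` is a local stand-in with the same `words` and `tags` fields as Gensim's `TaggedDocument`, used here only to keep the example dependency-free, and the whitespace tokenization is a placeholder.

```python
import os
from collections import namedtuple

# Hypothetical stand-in with the same fields as gensim's TaggedDocument;
# in real use, yield gensim.models.doc2vec.TaggedDocument instead.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


class StreamingCorpus:
    """Restartable iterable that holds only one file's text in memory at a time."""

    def __init__(self, directory):
        self.directory = directory

    def __iter__(self):
        # A fresh pass starts on every iteration, so the corpus can be
        # scanned once per training epoch.
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as handle:
                # Naive whitespace tokenization (an assumption for this sketch).
                words = handle.read().lower().split()
            yield TaggedDoc(words=words, tags=[name])
```

Because `__iter__` restarts from the first file each time, the same object can be iterated repeatedly, which matters for training that makes several passes over the data. Files are visited in sorted order here; shuffling the file names, as the excerpt does, would be a small change inside `__iter__`.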
0
votes
1 answer
Using Doc2Vec to find salience score for resumes based on job description
Here is my use case:
The HR department provides a job description (free text) and a set of resumes (plain text), and the ask is to come up with a salience score based on job description relevance.
The job description consists of skills required and minimum…

Madhur Telang
0
votes
1 answer
Doc2vec - About getting document vector
I'm very new to doc2vec and have some questions about document vectors.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So far, I've tried the code below, using a pre-trained doc2vec model
import…

Chhyun
0
votes
1 answer
Gensim tagging documents with big numbers
I want to label my documents with tags mapped to the id attribute in a database.
The ids can, for example, look like this:
documents[0] is, for example,
TaggedDocument(words=['blabla', 'request'], tags=[225616076])
For some reason, it is not able to…

xdaniel
0
votes
0 answers
Sent2Vec or Doc2Vec Testing
How can I test a sent2vec or doc2vec model that I've trained on a specific dataset? The process is all unsupervised, so I have no labels to help in the testing. My interest is in how the semantic similarity measure is computed. Thanks in advance.

Hummer
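On the point about how the similarity measure is computed: Gensim's similarity queries, such as `most_similar`, rank vectors by cosine similarity. A minimal dependency-free sketch of that measure follows. The function name and any vectors passed to it are illustrative, not taken from the question.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The result ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), and it ignores vector length, so only the direction of the document vectors matters.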
0
votes
1 answer
AttributeError: 'Tree' object has no attribute 'words'. Doc2Vec error
I am trying to train a Doc2Vec word embedding on preprocessed paragraphs. I have removed punctuation and carried out tokenization, POS tagging, and chunking.
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from gensim.models.doc2vec…

Nuc
0
votes
1 answer
How to check via callbacks if alpha is decreasing? + How to load all cores during training?
I'm training doc2vec and using callbacks to try to see whether alpha is decreasing over training time, with this code:
class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch.'''
    def __init__(self, path_prefix):
        …

Dasha
0
votes
1 answer
Gensim Doc2vec trained, but not saved
While training d2v on a large text corpus, I got these 3 files:
doc2vec.model.trainables.syn1neg.npy
doc2vec.model.vocabulary.cum_table.npy
doc2vec.model.wv.vectors.npy
But the final model was not saved, because there was not enough free space…

Dasha
0
votes
0 answers
Cannot figure out format needed to make predictions on dataset trained with doc2vec and random forest classifier
I am trying to make predictions on a dataset based on some pre-defined data (tweets and the categories that the tweets belong to, labeled 1-16), for which I have built a model with doc2vec and trained a random forest classifier. I am confused about what…

Natalie
0
votes
1 answer
Doc2Vec gensim with supervised data predefined labels
I am trying to use gensim's doc2vec to create a model which will be trained on a set of documents and a set of labels. The labels were created manually and need to be put into the program to be trained on. So far I have 2 lists: a list of sentences,…

Natalie
0
votes
1 answer
How do I get the similarity between a word and a document in gensim
So I have started to learn gensim for both word2vec and doc2vec and it works. The similarity scores actually work really well. For an experiment, however, I wanted to optimize a keyword-based search algorithm by comparing a single word and getting…

Julian Kurz