Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag Of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
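
To make the difference between the two modes concrete, here is a minimal conceptual sketch of the (input, target) training examples each mode would form from one document. It is not Gensim's implementation: the helper names `dbow_examples` and `dm_examples`, the `DOC_0` tag, and the symmetric context window are assumptions made purely for illustration.

```python
from typing import List, Tuple

def dbow_examples(tag: str, words: List[str]) -> List[Tuple[str, str]]:
    """PV-DBOW sketch: the document tag alone is the input,
    and each word of the document is a prediction target."""
    return [(tag, word) for word in words]

def dm_examples(
    tag: str, words: List[str], window: int = 2
) -> List[Tuple[Tuple[str, Tuple[str, ...]], str]]:
    """PV-DM sketch: the document tag plus nearby context words form the
    input, and the center word is the prediction target.
    A symmetric window is assumed here for simplicity."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        examples.append(((tag, tuple(left + right)), target))
    return examples

# Hypothetical toy document, used only to show the shape of the examples.
doc_words = ["the", "cat", "sat", "down"]
dbow = dbow_examples("DOC_0", doc_words)
dm = dm_examples("DOC_0", doc_words, window=1)

print(dbow[1])  # ('DOC_0', 'cat')
print(dm[1])    # (('DOC_0', ('the', 'cat'... see note below

```

In PV-DBOW the tag is the only input, much like the center word in skip-gram; in PV-DM the tag is combined with context words to predict a single target word, much like CBOW.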

556 questions
4
votes
1 answer

Multiple tags for single document in doc2vec. TaggedDocument

Is it possible to train a doc2vec model where a single document has multiple tags? For example, in movie reviews, doc0 = doc2vec.TaggedDocument(words=review0,tags=['UID_0','horror','action']) doc1 =…
unknown_jy
4
votes
1 answer

Difference between TaggedDocument and TaggedLineDocument in gensim, and how to work with files in a directory?

I am new to doc2vec and I wish to classify a set of texts using it. I am confused about TaggedDocument and TaggedLineDocument. 1) What is the difference between the two? Is it that TaggedLineDocument is a collection of TaggedDocuments? 2) If I have a…
dfault
4
votes
1 answer

scikit-learn classification using doc2vec representation

I want to classify text documents using doc2vec representation and scikit-learn models. My problem is that I'm lost on how to get started. Can someone explain the general steps usually taken to use doc2vec with scikit-learn?
4
votes
2 answers

How to get word vectors from a gensim Doc2Vec?

I trained a gensim.models.doc2vec.Doc2Vec model: d2v_model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4) and I can get document vectors with: docvec = d2v_model.docvecs[0] How can I get word vectors from the trained model?
V Y
3
votes
1 answer

My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?

I'm training a Doc2Vec model using the below code, where tagged_data is a list of TaggedDocument instances I set up before: max_epochs = 40 model = Doc2Vec(alpha=0.025, min_alpha=0.001) model.build_vocab(tagged_data) for epoch in…
gojomo
3
votes
1 answer

How to perform efficient queries with Gensim doc2vec?

I’m working on a sentence similarity algorithm with the following use case: given a new sentence, I want to retrieve its n most similar sentences from a given set. I am using Gensim v.3.7.1, and I have trained both word2vec and doc2vec models. The…
3
votes
1 answer

Doc2vec beyond beginner guidance

I've been using doc2vec in the most basic way so far, with limited success. I'm able to find similar documents; however, I often get a lot of false positives. My primary goal is to build a classification algorithm for user requirements. This is to…
3
votes
3 answers

Doc2Vec & classification - very poor results

I have a dataset of 6000 observations; a sample of it is the following: job_id job_title job_sector 30018141 Secondary Teaching Assistant Education 30006499 Legal Sales…
Outcast
3
votes
2 answers

Cosine Similarity between Lists of Sentences using Doc2Vec

I'm new to NLP but I'm trying to match a list of sentences to another list of sentences in Python based on their semantic similarity. For example, list1 = ['what they ate for lunch', 'height in inches', 'subjectid'] list2 = ['food eaten two days…
m13op22
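
For the list-matching question above, one possible approach, once every sentence has a vector, is to compare vectors by cosine similarity and keep the best match for each item. The sketch below uses only the standard library. The three-dimensional vectors are made-up placeholders rather than Doc2Vec output, and `best_matches` is a hypothetical helper name; in practice the vectors would come from a trained model, for example via Gensim's `infer_vector`.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.
    Returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_matches(
    vecs1: List[Sequence[float]], vecs2: List[Sequence[float]]
) -> List[int]:
    """For each vector in vecs1, return the index of the most
    similar vector in vecs2 (by cosine similarity)."""
    return [
        max(range(len(vecs2)), key=lambda j: cosine_similarity(v, vecs2[j]))
        for v in vecs1
    ]

# Placeholder vectors standing in for sentence embeddings (assumed values).
vecs1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
vecs2 = [[0.0, 2.0, 0.1], [3.0, 0.1, 0.0]]

matches = best_matches(vecs1, vecs2)
print(matches)
```

Cosine similarity ignores vector length and compares direction only, which is why it is the usual choice for comparing embeddings.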
3
votes
1 answer

Doc2Vec Clustering with kmeans for a new document

I have a corpus trained with Doc2Vec as follows: d2vmodel = Doc2Vec(vector_size=100, min_count=5, epochs=10) d2vmodel.build_vocab(train_corpus) d2vmodel.train(train_corpus, total_examples=d2vmodel.corpus_count, epochs=d2vmodel.epochs) Using the…
kami
3
votes
1 answer

doc2vec: measurement of performance and 'workers' parameter

I have an awfully large corpus as input to my doc2vec training, around 23 million documents streamed using an iterable function. I was wondering if it were at all possible to see the development of my training progress, possibly through finding out…
apgsov
3
votes
2 answers

Doc2vec predictions - do we average the words or what is the paragraph ID for a new paragraph?

I understand that you treat the paragraph ID as a new word in doc2vec (DM approach, left on the figure) during training. The training output is the context word. After a model is trained, suppose I want to get 1 embedding given a new document. Do I…
dorien
3
votes
1 answer

How to find the most decisive sentences or words in a document via Doc2Vec?

I've trained a Doc2Vec model in order to do a simple binary classification task, but I would also love to see which words or sentences weigh more in terms of contributing to the meaning of a given text. So far I've had no luck finding anything relevant…
Farhood ET
3
votes
1 answer

Paragraph Vector or Doc2vec model size

I am using the deeplearning4j Java library to build a paragraph vector model (doc2vec) of dimension 100. I am using a text file. It has around 17 million lines, and the size of the file is 330 MB. I can train the model and calculate paragraph vector which…
3
votes
1 answer

Gensim Doc2Vec getting the doc tags from the Concatenated model

I'm trying to replicate Mikolov's work of PV-DM + PV-DBOW. He says that both algorithms should be used in order to get better results. For this reason I'm trying to train the model and then give the document tags to t-SNE. Using Gensim's Doc2Vec I…