Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
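As a rough illustration of how the two modes differ, the sketch below builds the (input, target) training pairs each mode would conceptually learn from. It is a simplified toy in plain Python, not Gensim's implementation: the function names `pv_dbow_pairs` and `pv_dm_pairs`, the tag `DOC_1`, and the fixed `window` are assumptions made for this example, and real implementations add subsampling, randomized windows, and negative sampling or hierarchical softmax.

```python
from typing import List, Tuple


def pv_dbow_pairs(tag: str, words: List[str]) -> List[Tuple[str, str]]:
    """PV-DBOW (skip-gram-like): the document tag alone predicts each word."""
    return [(tag, word) for word in words]


def pv_dm_pairs(
    tag: str, words: List[str], window: int = 2
) -> List[Tuple[List[str], str]]:
    """PV-DM (CBOW-like): the tag plus nearby context words predict the center word."""
    pairs = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]      # up to `window` words before the target
        right = words[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append(([tag] + left + right, target))
    return pairs


# Hypothetical toy document with a hypothetical tag.
doc = ["red", "shoes", "with", "heels"]
dbow = pv_dbow_pairs("DOC_1", doc)
dm = pv_dm_pairs("DOC_1", doc, window=1)

for inputs, target in dm:
    print(inputs, "->", target)
```

In PV-DBOW the only input is the document tag, so word order and context are ignored. In PV-DM the tag vector is combined with the context word vectors, which is why that mode also trains word vectors as part of learning the document vectors.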

556 questions
0
votes
1 answer

Hierarchical training for doc2vec: how would assigning same labels to sentences of the same document work?

What is the effect of assigning the same label to a bunch of sentences in doc2vec? I have a collection of documents for which I want to learn vectors using gensim, for a "file" classification task where file refers to a collection of documents for a given…
HMK
  • 578
  • 2
  • 9
  • 24
0
votes
1 answer

Doc2Vec input format

Running gensim Doc2Vec on Ubuntu: Doc2Vec rejects my input with the error AttributeError: 'list' object has no attribute 'words'. import gensim from gensim.models import doc2vec as dtv from nltk.corpus import brown documents =…
Lcat
  • 857
  • 1
  • 8
  • 16
0
votes
1 answer

Can doc2vec be useful if training on Documents and inferring on sentences only

I am training with some documents with gensim's Doc2vec. I have two types of inputs: Whole English Wikipedia: Each article of Wikipedia text is considered as one document for doc2vec training. (Total around 5.5 million articles or…
DK818
  • 135
  • 6
0
votes
0 answers

Doc2vec most_similar method returns similarity score higher than 1

I have trained a doc2vec model on 500,000 documents by following this tutorial: https://github.com/abtpst/Doc2Vec/blob/master/trainDoc2Vec.py However, when I try to find most_similar documents for a given document, the results have similarity higher…
akoksal
  • 11
  • 4
0
votes
1 answer

Gensim Doc2Vec trims and deletes the vocabulary

I tried creating a simple Doc2Vec model: sentences = [] sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'rosse', u'con', u'tacco'], tags=[1])) sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'blu'], tags=[2])) …
Nicolò Gasparini
  • 2,228
  • 2
  • 24
  • 53
0
votes
0 answers

Vector representation for token and compound word

I have a corpus of sentences. Each of them may contain marked compound words. For example: This is an example_sentence followed by another awesome_paragraph . I want to get embedding vector for all tokens and compound words (this, is, an,…
Brody
  • 1
0
votes
1 answer

Shape ValueError in LSTM network using Tensorflow

I want to train an LSTM model with TensorFlow. I have text data as input; I compute the doc2vec vector of each paragraph of the text and pass it to the LSTM layers, but I get a ValueError because of an inconsistency in shape rank. I've searched through Stack Overflow…
Mina smz
  • 55
  • 1
  • 7
0
votes
1 answer

Normalize the similarity between word vectors and document vectors?

Cosine similarity is widely used for measuring the similarity between two vectors, which could be word vectors or document vectors. Other measures, such as Manhattan, Euclidean, and Minkowski distance, are also popular. Cosine similarity gives a number between…
Isaac Sim
  • 539
  • 1
  • 7
  • 23
0
votes
1 answer

semantic and syntactic performance of Doc2vec model

I am trying to check the semantic and syntactic performance of a doc2vec model with doc2vec_model.accuracy(questions-words), but it doesn't seem to work, since models.deprecated.doc2vec – Deep learning with paragraph2vec says it has been deprecated…
Dela
  • 115
  • 2
  • 12
0
votes
1 answer

How are vectors calculated in doc2vec and what does the size parameter depict?

If I pass a sentence containing 5 words to the Doc2Vec model and the size is 100, there are 100 vectors. I don't understand what those vectors are. If I increase the size to 200, there are 200 vectors for just a simple sentence. Please tell me how…
Yash Ghorpade
  • 607
  • 1
  • 7
  • 16
0
votes
0 answers

Matching between two separate documents using gensim doc2vec

I have two separate data sets: one is resumes and the other is demands. Using gensim doc2vec, I created a model for each and I am able to query similar words in each data set, but now I need to merge these two models into one and query for resumes…
krits
  • 68
  • 1
  • 9
0
votes
2 answers

How to measure the word weight using doc2vec vector

I'm using the word2vec algorithm to detect the most important words in a document. My question is about how to compute the weight of an important word using the vector obtained from doc2vec. My code looks like this: model =…
ucmou
  • 79
  • 1
  • 1
  • 8
0
votes
0 answers

Document tags in vectorization models

I am a little new to Python and unsupervised learning methods, but I have a quick question. Whereas the doc2vec model has a docvecs property holding all trained vectors for the 'document tags' seen during training, are there similar properties that…
Dela
  • 115
  • 2
  • 12
0
votes
1 answer

Cosine similarity is 0.7 for exactly the same sentences

The cosine similarity for two exactly identical sentences is 0.7. Is my doc2vec model correct? I am using the Quora question pairs dataset available on Kaggle. In the code below, train1 is the list of first questions and train2 is the list of second…
Gautam Kumar
  • 71
  • 1
  • 5
0
votes
1 answer

How to get most similar words to a document in gensim doc2vec?

I have built a gensim Doc2vec model. Let's call it doc2vec. Now I want to find the most relevant words to a given document according to my doc2vec model. For example, I have a document about "java" with the tag "doc_about_java". When I ask for…
aburkov
  • 13
  • 2
  • 5