Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag Of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).

556 questions
3
votes
1 answer

Document similarity in production environment

We have n documents. When a user submits a new document, our goal is to inform them about possible duplication of an existing document (just as Stack Overflow suggests questions that may already have an answer). In our system, new document…
user2578525
3
votes
1 answer

Gensim Doc2Vec most_similar() method not working as expected

I am struggling with Doc2Vec and I cannot see what I am doing wrong. I have a text file with sentences. I want to know, for a given sentence, what is the closest sentence we can find in that file. Here is the code for model creation: sentences =…
Yann Droy
3
votes
1 answer

Gensim's Doc2vec - inferred vector isn't similar

When I train Doc2vec (using Gensim's Doc2vec in Python) on a corpus of about 10k documents (each a few hundred words long) and then infer document vectors using the same documents, they are not at all similar to the trained document vectors. I would…
awa993
3
votes
0 answers

Embedding Gensim Doc2Vec Tensorboard

I have a set of documents in a df. I am transforming those documents to vectors with gensim Doc2Vec: def read_corpus(documents): for i, plot in enumerate(documents): yield…
OverflowingTheGlass
3
votes
1 answer

how to use build_vocab in gensim?

Does build_vocab extend my old vocabulary? For example, my understanding is that when I use doc2vec to train a model, it just builds the vocabulary from the dataset. If I want to extend it, I need to use build_vocab(). Where should I use it? Should I put it…
Cherrymelon
3
votes
1 answer

Updating training documents for gensim Doc2Vec model

I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set and, by extension, the model. I take the new documents and perform preprocessing as normal: stoplist =…
Brian O'Halloran
3
votes
1 answer

What are doc2vec training iterations?

I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code, which uses Gensim. As intended, I get a trained model and document vectors for the two documents. However, I would like to know the benefits of retraining…
user8566323
3
votes
2 answers

User2Vec? representing a user based on the docs they consume

I'd like to form a representation of users based on the last N documents they have liked. So I'm planning on using doc2vec to form this representation of each document, but I'm just trying to figure out what would be a good way to essentially place…
andrewm4894
3
votes
1 answer

Why are almost all cosine similarities positive between word or document vectors in gensim doc2vec?

I have calculated document similarities using Doc2Vec.docvecs.similarity() in gensim. Now, I would either expect the cosine similarities to lie in the range [0.0, 1.0] if gensim used the absolute value of the cosine as the similarity metric, or…
Sami Liedes
3
votes
2 answers

Gensim docvecs.most_similar returns IDs that don't exist

I'm trying to create an algorithm that's capable of showing the top n documents similar to a specific document. For that I used the gensim doc2vec. The code is below: model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers = 11,…
JoaoSilva
3
votes
2 answers

load pre-trained word2vec model for doc2vec

I'm using gensim to extract a feature vector from a document. I've downloaded the pre-trained model from Google named GoogleNews-vectors-negative300.bin and I loaded that model using the following command: model =…
lenhhoxung
3
votes
1 answer

Doc2Vec model Python 3 compatibility

I trained a doc2vec model with Python 2 and I would like to use it in Python 3. When I try to load it in Python 3, I get: Doc2Vec.load('my_doc2vec.pkl') UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in…
Bernard
3
votes
0 answers

Is there any way to validate the performance of a Doc2Vec/Word2Vec Deep Learning model?

I am working with the Doc2Vec and Word2Vec deep learning algorithms (Doc2Vec API description from Gensim). More description here. Currently I am interested in using the model.n_similarity(wordSet1, wordSet2) method which basically computes the …
Uther Pendragon
2
votes
2 answers

doc2vec infer words from vectors

I am clustering comments. After preprocessing and vectorizing the text, I have inferred vectors from my doc2vec model and applied k-means. After that I want to convert cluster centroid vectors to words to kinda look at the semantic cores of the…
frogseer
2
votes
1 answer

Run a model that needs an older gensim version

I need to run a model, but it needs an older version of gensim with the DocvecsArray attribute. How can I run it? AttributeError: Can't get attribute 'DocvecsArray' on