Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm for converting documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and is implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
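
As a rough illustration of the difference between the two modes, the sketch below lists the (input, target) prediction pairs each mode would train on for a single tagged document. It is a simplified, library-free toy and not Gensim's implementation: the function names `pv_dbow_pairs` and `pv_dm_pairs`, the tag `DOC_0`, and the symmetric context window are assumptions made for this example, and real implementations add sampling, negative sampling or hierarchical softmax, and vector updates.

```python
from typing import List, Tuple

Pair = Tuple[Tuple[str, ...], str]  # (input tokens, target word)


def pv_dbow_pairs(tag: str, words: List[str]) -> List[Pair]:
    """PV-DBOW (skip-gram-like): the document tag alone predicts each word."""
    return [((tag,), target) for target in words]


def pv_dm_pairs(tag: str, words: List[str], window: int = 2) -> List[Pair]:
    """PV-DM (CBOW-like): the tag plus nearby context words predict the center word.

    A symmetric window is assumed here purely for illustration.
    """
    pairs = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        pairs.append(((tag, *left, *right), target))
    return pairs


doc = ["the", "cat", "sat", "down"]
dbow = pv_dbow_pairs("DOC_0", doc)
dm = pv_dm_pairs("DOC_0", doc, window=1)

print(dbow[1])  # (('DOC_0',), 'cat')
print(dm[1])    # (('DOC_0', 'the', 'sat'), 'cat')
```

In PV-DBOW the document vector is the only input, so word order inside the document plays no role; in PV-DM the document vector is combined with context word vectors, much as CBOW combines context words alone.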

556 questions
5 votes, 2 answers

AttributeError: 'Word2Vec' object has no attribute 'most_similar' (Word2Vec)

I am using Word2Vec with a wiki-trained model that returns the most similar words. I ran this before and it worked, but now it gives me this error even after rerunning the whole program. I tried removing return_path=True but I'm still…
RSB
5 votes, 0 answers

How to get document embeddings using GPT-2?

I'm curious if using GPT-2 might yield a higher accuracy for document vectors (with greatly varying length) or not (would it surpass the state of the art?) Really I'm most interested in document embeddings that are as accurate as possible. I'm…
5 votes, 1 answer

Which method, dm or dbow, works well for document similarity using Doc2Vec?

I'm trying to find the similarity between two documents. I'm using Gensim's Doc2Vec to train on around 10k documents. There are around 10 string-type tags. Each tag consists of a unique word and covers some set of documents. Model is trained…
iNikkz
5 votes, 1 answer

Gensim Doc2Vec generating huge file for model

I am trying to run the doc2vec library from the gensim package. My problem is that when I train and save the model, the model file is rather large (2.5 GB). I tried using this line: model.estimate_memory(), but it didn't change anything. I also have…
ida
5 votes, 2 answers

Doc2Vec Sentence Clustering

I have multiple documents that contain multiple sentences. I want to use doc2vec to cluster (e.g. k-means) the sentence vectors by using sklearn. As such, the idea is that similar sentences are grouped together in several clusters. However, it is…
Boyos123
5 votes, 1 answer

What is gensim's 'docvecs'?

The above picture is from Distributed Representations of Sentences and Documents, the paper introducing Doc2Vec. I am using Gensim's implementation of Word2Vec and Doc2Vec, which are great, but I am looking for clarity on a few issues. For a given…
Michael Davidson
5 votes, 0 answers

How to train a new text with gensim doc2vec

sentences=gensim.models.doc2vec.TaggedLineDocument("raw_docs.txt") model=gensim.models.Doc2Vec(sentences,min_count=1,iter=100) sentence=TaggedDocument(words=[u'为了'],tags=[u'T1']) sentences1=[sentence] model.build_vocab(sentences1,update=True) model.t…
Jeffery
5 votes, 2 answers

doc2vec - How to infer vectors of documents faster?

I have trained paragraph vectors for around 2300 paragraphs (between 2000 and 12000 words each), each with a vector size of 300. Now I need to infer paragraph vectors for around 100,000 sentences, which I have treated as paragraphs (each sentence is around…
Dreams
5 votes, 1 answer

How to get the Document Vector from Doc2Vec in gensim 0.11.1?

Is there a way to get the document vectors of unseen and seen documents from Doc2Vec in the gensim 0.11.1 version? For example, suppose I trained the model on 1000 documents. Can I get the doc vector for those 1000 docs? Is there a way to get…
silent_dev
4 votes, 2 answers

Gensim Doc2Vec visualization issue when using t-SNE and/or PCA

I am trying to familiarize myself with Doc2Vec results by using a public dataset of movie reviews. I have cleaned the data and run the model. There are, as you can see below, 6 tags/genres. Each is a document with its vector representation. doc_tags =…
4 votes, 2 answers

ModuleNotFoundError: No module named 'numpy.random._pickle'

I have a doc2vec model which drives my recommendation app. I have built the doc2vec model and saved it into an S3 bucket. Now when I open the web app, the model should be loaded back from S3, but this is not happening. I used AWS Elastic Beanstalk to deploy my…
Praneeth Sai
4 votes, 1 answer

Use Spacy to find most similar sentences in doc

I'm looking for a solution to use something like most_similar() from Gensim but using Spacy. I want to find the most similar sentence in a list of sentences using NLP. I tried to use similarity() from Spacy (e.g. https://spacy.io/api/doc#similarity)…
Heraknos
4 votes, 2 answers

What is the appropriate distance metric when clustering paragraph/doc2vec vectors?

My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters of semantic and textual duplicates. To do this I am using gensim to generate document vectors. The elements of the resulting docvecs are…
fluffet
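
One relationship that is often relevant to this kind of question, offered here as a hedged aside rather than as the answer given on the question page: for vectors scaled to unit length, squared Euclidean distance equals 2 × (1 − cosine similarity), so Euclidean-based clustering on L2-normalized vectors ranks neighbours the same way cosine similarity does. The sketch below checks that identity in plain Python; the helper names and the two toy vectors are made up for illustration and are not Doc2Vec output.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared straight-line distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for two document embeddings (hypothetical values).
u = [3.0, 4.0]
v = [4.0, 3.0]

cos = cosine_similarity(u, v)
d2 = squared_euclidean(unit(u), unit(v))

# The two quantities agree up to floating-point error.
print(math.isclose(d2, 2 * (1 - cos), abs_tol=1e-12))
```

Whether normalizing first is appropriate for a given clustering setup depends on the data and the library's supported metrics, which the excerpt above does not specify.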
4 votes, 1 answer

gensim - Doc2Vec: Difference iter vs. epochs

When reading the Doc2Vec documentation of gensim, I get a bit confused about some options. For example, the constructor of Doc2Vec has a parameter iter: iter (int) – Number of iterations (epochs) over the corpus. Why does the train method then…
Simon Hessner
4 votes, 2 answers

Does gensim Doc2Vec distinguish between the same sentence with positive and negative context?

While learning the Doc2Vec library, I got stuck on the following question. Does gensim Doc2Vec distinguish between the same sentence with positive and negative context? For example: Sentence A: "I love Machine Learning" Sentence B: "I do not love Machine…
DK818