Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm that converts documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and is implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
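To make the difference between the two modes concrete, here is a minimal, library-free sketch of the (input, target) prediction examples each mode could be thought of as training on. The function names `pv_dbow_examples` and `pv_dm_examples`, the `window` parameter, and the toy document are hypothetical illustrations, not part of Gensim or any other implementation; real implementations also add sampling, negative sampling or hierarchical softmax, and vector averaging or concatenation. In Gensim itself, the mode is selected with the `dm` parameter of `Doc2Vec` (`dm=1` for PV-DM, `dm=0` for PV-DBOW).

```python
from typing import List, Tuple

# An example is (inputs, target): the inputs are used to predict the target word.
Example = Tuple[Tuple[str, ...], str]


def pv_dbow_examples(tag: str, words: List[str]) -> List[Example]:
    """PV-DBOW sketch: the document tag alone predicts each word of the document."""
    return [((tag,), target) for target in words]


def pv_dm_examples(tag: str, words: List[str], window: int = 2) -> List[Example]:
    """PV-DM sketch: the document tag plus nearby context words predict the target."""
    examples: List[Example] = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]      # up to `window` words before the target
        right = words[i + 1:i + 1 + window]     # up to `window` words after the target
        examples.append(((tag, *left, *right), target))
    return examples


# Toy document with a hypothetical tag.
doc = ["the", "cat", "sat", "down"]
dbow = pv_dbow_examples("doc_0", doc)
dm = pv_dm_examples("doc_0", doc, window=1)

print(dbow[1])  # (('doc_0',), 'cat')
print(dm[1])    # (('doc_0', 'the', 'sat'), 'cat')
```

In the PV-DBOW examples the document tag is the only input, much as a single word is the input in skip-gram; in the PV-DM examples the tag joins a window of context words, much as the context window is the input in CBOW.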

556 questions
2
votes
1 answer

How to put maximum vocabulary frequency in doc2vec

While creating the vocabulary, Doc2vec lets you set the minimum number of occurrences a word needs in the documents to be included in the vocabulary, via the parameter min_count. model = gensim.models.doc2vec.Doc2Vec(vector_size=200, min_count=3,…
Igor sharm
2
votes
1 answer

gensim Doc2Vec word not in vocabulary

I am training a doc2vec gensim model with a txt file 'full_texts.txt' that contains ~1600 documents. Once I have trained the model, I wish to use similarity methods over words and sentences. However, since this is my first time using gensim, I am…
Shoaibkhanz
2
votes
1 answer

What text processing does WikiCorpus perform in gensim?

I have trained a doc2vec model on the Wikipedia corpus using gensim and I wish to retrieve vectors from different documents. I was wondering what text processing the WikiCorpus function did when I used it to train my model e.g. removed punctuation,…
OultimoCoder
2
votes
0 answers

Doc2Vec error: need at least one array to concatenate

I am running into an error trying to apply a doc2vec model to some text. The tutorial I am following is here. However, I cannot seem to "replicate" the results on some new text information. I have read other SO posts about this issue and it's because…
user113156
2
votes
0 answers

Doc2vec on a corpus of novels: how do I assign to each sentence of a novel one tag for the ID of the sentence and one tag for the ID of the book?

I am trying to train a doc2vec model on a corpus of six novels and I need to build the corpus of Tagged Documents. Each novel is a txt file, already preprocessed and read into python using the read() method, so that it appears as a "long string".…
2
votes
1 answer

Doc2Vec: get text of the label

I've trained a Doc2Vec model and I'm trying to get predictions. I use test_data = word_tokenize("Филип Моррис Продактс С.А.".lower()) model = Doc2Vec.load(model_path) v1 = model.infer_vector(test_data) sims =…
Petr Petrov
2
votes
1 answer

I get more vectors than my documents size - gensim doc2vec

I have protein sequences and want to do doc2vec. My goal is to have one vector for each sentence/sequence. I have 1612 sentences/sequences and 30 classes so the label is not unique and many documents share the same labels. So when I first tried…
2
votes
1 answer

Gensim Doc2vec – KeyError: "tag not seen in training corpus/invalid"

I am using gensim's Doc2vec to learn features from news articles. I can successfully train my documents. However, I struggle to retrieve the document vectors from the model for further processing. Example code (directly taken from gensim's…
petezurich
2
votes
1 answer

Python Calculating similarity between two documents using word2vec, doc2vec

I am trying to calculate the similarity between two documents, each of which comprises thousands of sentences. The baseline would be calculating cosine similarity using BOW. However, I want to capture more of the semantic difference between…
ChanKim
2
votes
1 answer

Gensim Doc2vec model: how to compute similarity on a corpus obtained using a pre-trained doc2vec model?

I have a model based on doc2vec trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the most similar sentence to one I…
2
votes
1 answer

Unsupervised sentiment Analysis using doc2vec

Folks, I have searched Google for different types of papers/blogs/tutorials etc. but haven't found anything helpful. I would appreciate it if anyone could help me. Please note that I am not asking for step-by-step code but rather an idea/blog/paper or some…
Saurabh Gokhale
2
votes
1 answer

GridSearch for doc2vec model built using gensim

I am trying to find the best hyperparameters for my trained doc2vec gensim model, which takes a document as input and creates its document embedding. My training data consists of text documents but it doesn't have any labels, i.e. I just have 'X' but not…
Rajat
2
votes
1 answer

How to classify text documents in legal domain

I've been working on a project which is about classifying text documents in the legal domain (Legal Judgment Prediction class of problems). The given data set consists of 700 legal documents (well balanced in two classes). After the preprocessing,…
hey_rey
2
votes
0 answers

Doc2Vec with Keras

Following the Mikolov paper, I want to compute Doc2Vec using Keras. I'm new to Keras so I need your help. There is a corpus of documents, each with an ID, and I want to get two embedding matrices: one for words and one for paragraphs, right? Is it…
2
votes
1 answer

Doc2vec output data for only a single document and not two documents vectors

I am trying to build a simple program to test my understanding of Doc2Vec, and it seems I still have a long way to go before I fully understand it. I understand that each sentence in the document is first labeled with its own label and for doc2vec…
JJson