Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
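To make the difference between the two modes concrete, here is a minimal pure-Python sketch of the kind of (input, target) training examples each mode is built around. It does not train anything and is not Gensim's implementation; the helper names `pv_dbow_examples` and `pv_dm_examples`, the tag `DOC_0`, and the symmetric `window` are assumptions made for illustration only.

```python
from typing import List, Tuple

Example = Tuple[Tuple[str, ...], str]  # (input tokens, target word)


def pv_dbow_examples(tag: str, words: List[str]) -> List[Example]:
    """PV-DBOW sketch: the document tag alone is the input for predicting each word."""
    return [((tag,), target) for target in words]


def pv_dm_examples(tag: str, words: List[str], window: int = 2) -> List[Example]:
    """PV-DM sketch: the document tag plus nearby context words predict the center word.

    A symmetric window is assumed here purely for illustration.
    """
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        examples.append(((tag, *left, *right), target))
    return examples


doc_words = ["the", "cat", "sat", "down"]  # hypothetical tokenized document
dbow = pv_dbow_examples("DOC_0", doc_words)
dm = pv_dm_examples("DOC_0", doc_words, window=1)

for inputs, target in dm:
    print(inputs, "->", target)
```

In PV-DBOW the input never includes neighbouring words, which is the sense in which it resembles skip-gram; in PV-DM the tag is combined with a context window, which is the sense in which it resembles CBOW.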

556 questions
2
votes
1 answer

gensim doc2vec train more documents from pre-trained model

I am trying to train on new labelled documents (TaggedDocument) using a pre-trained model. The pre-trained model was trained on documents whose unique IDs have the form label1_index, for instance Good_0, Good_1 to Good_999. And the total size of…
Isaac Sim
  • 539
  • 1
  • 7
  • 23
2
votes
1 answer

doc2vec/gensim - issue with shuffling sentences in the epochs

I am trying to get started with word2vec and doc2vec using the excellent tutorials here and here, and trying to use the code samples. I only added a line_clean() method to remove punctuation, stopwords etc. But I am having trouble with the…
Santino
  • 776
  • 2
  • 11
  • 29
2
votes
1 answer

Doc2vec: clustering resulting vectors

In the doc2vec model, can we cluster on the vectors themselves? Should we cluster each resulting model.docvecs[1] vector? How do we implement the clustering model? model = gensim.models.doc2vec.Doc2Vec(size= 100, min_count = 5,window=4, iter = 50,…
Hackerds
  • 1,195
  • 2
  • 16
  • 34
2
votes
1 answer

Doc2vec: model.docvecs is only of length 10

I am trying doc2vec for 600000 rows of sentences and my code is below: model = gensim.models.doc2vec.Doc2Vec(size= 100, min_count = 5,window=4, iter = 50, workers=cores) model.build_vocab(res) model.train(res, total_examples=model.corpus_count,…
Hackerds
  • 1,195
  • 2
  • 16
  • 34
2
votes
2 answers

How to use Gensim Doc2vec infer_vector() for large DataFrame?

I have created document vectors for a large corpus using Gensim's doc2vec. sentences=gensim.models.doc2vec.TaggedLineDocument('file.csv') model = gensim.models.doc2vec.Doc2Vec(sentences,size = 10, window = 800, min_count = 1, workers=40, iter=10,…
CMM
  • 31
  • 2
2
votes
2 answers

How to access document details from Doc2Vec similarity scores in gensim model?

I have been given a doc2vec model built with gensim which was trained on 20 million documents. The 20 million documents it was trained on were also given to me, but I have no idea how or in which order the documents from the folder were trained. I am supposed…
User54211
  • 121
  • 2
  • 11
2
votes
1 answer

How to obtain document vectors in doc2vec in gensim

I know to obtain a document vector for a given tag in doc2vec using print(model.docvecs['recipe__11']). My document vectors are either recipes (tags start with recipe__), newspapers (tags start with news__) or ingredients (tags start with…
user8566323
2
votes
2 answers

How to load the pre-trained doc2vec model and use its vectors

Does anyone know which function I should use if I want to use the pre-trained doc2vec models from this website https://github.com/jhlau/doc2vec? I know we can use the Keyvectors.load_word2vec_format() to load the word vectors from pre-trained word2vec…
Vera
  • 75
  • 2
  • 6
2
votes
0 answers

Doc2Vec from gensim to deeplearning4j

Is there any way to load doc2vec model saved using gensim into deeplearning4j's ParagraphVectors? My gensim model is valid - I am able to load it using gensim with no problems. When I call WordVectorSerializer.readParagraphVectors on my model from…
dkaras
  • 195
  • 2
  • 12
2
votes
1 answer

applying the Similar function in Gensim.Doc2Vec

I am trying to get the doc2vec function to work in Python 3. I have the following code: tekstdata = [[ index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()] def prep (x): low =…
Niels Helsø
  • 45
  • 1
  • 4
2
votes
0 answers

TypeError while using infer_vector on a gensim Doc2Vec model loaded from memory

I am a little new to the doc2vec algorithm and am using gensim for its implementation in Python. Following the gensim tutorial "Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset" I have built the vocab and trained a doc2vec model, and stored it on the disc…
cvipul
  • 120
  • 1
  • 9
2
votes
1 answer

gensim doc2vec "intersect_word2vec_format" command

Just reading through the doc2vec commands on the gensim page. I am curious about the command "intersect_word2vec_format". My understanding of this command is that it lets me inject vector values from a pretrained word2vec model into my doc2vec model…
pete the dude
  • 139
  • 3
  • 7
2
votes
1 answer

How to train word2vec with your own vocab

I am getting an error while training word2vec with my own vocabulary. I also do not understand why it is happening. Code: from gensim.models import word2vec import logging logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',…
Manish Kumar
  • 1,419
  • 3
  • 17
  • 36
2
votes
1 answer

Can I create a topic model (such as LDA) from the output of doc2vec model?

I computed document similarity on my corpus using Doc2Vec and the similarities it outputs are not very good. I was wondering if I could build a topic model from what Doc2Vec is giving me to increase the accuracy of my model in order to get better…
2
votes
1 answer

How do I find cosine similarity between two text documents using Java?

I need to compare a large number of tweets containing a particular hashtag to display the tweet which has the highest content in it. To do this, I need to find the pair-wise cosine similarity between each one of them and display the tweet with highest…
Manan Kalra
  • 41
  • 1
  • 4
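For the pair-wise cosine similarity mentioned in the last question, a minimal sketch may help. It is written in Python to match the rest of this tag rather than in Java as the question asks, and it assumes plain whitespace tokenization and raw term counts as the vector representation; the sample tweets and the helper name `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty text as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical sample tweets, tokenized by lowercasing and splitting on whitespace.
tweets = ["doc2vec is great", "doc2vec is slow", "hello world"]
vectors = [Counter(t.lower().split()) for t in tweets]

# Similarity for every unordered pair of tweets, keyed by their indices.
pairwise = {
    (i, j): cosine_similarity(vectors[i], vectors[j])
    for i, j in combinations(range(len(tweets)), 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 3))
```

The same structure carries over to Java with a `Map<String, Integer>` per tweet; the term-count representation could also be swapped for TF-IDF weights or inferred document vectors without changing the similarity function.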