Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
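
A minimal sketch of how the two modes are typically selected in gensim (parameter names assume gensim 4.x, where the dm argument switches between PV-DM and PV-DBOW; the toy corpus is purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets its own unique tag.
docs = [
    TaggedDocument(words=["machine", "learning", "with", "text"], tags=["doc_0"]),
    TaggedDocument(words=["vectors", "for", "whole", "documents"], tags=["doc_1"]),
]

# PV-DM ("Distributed Memory"), roughly analogous to Word2Vec's CBOW mode.
model_dm = Doc2Vec(docs, dm=1, vector_size=100, window=5, min_count=1, epochs=20)

# PV-DBOW ("Distributed Bag of Words"), roughly analogous to skip-gram.
model_dbow = Doc2Vec(docs, dm=0, vector_size=100, window=5, min_count=1, epochs=20)
```
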

556 questions
8
votes
2 answers

How does Pyspark Calculate Doc2Vec from word2vec word embeddings?

I have a pyspark dataframe with a corpus of ~300k unique rows, each with a "doc" that contains a few sentences of text. After processing, I have a 200-dimension vectorized representation of each row/doc. My NLP process: Remove Punctuation…
whs2k
  • 741
  • 2
  • 10
  • 19
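
For context on the question above: Spark ML's Word2Vec produces one vector per document by averaging the vectors of the words in each row, which is how a "Doc2Vec-like" output arises from word embeddings. A rough sketch, with illustrative column names and vector size:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.getOrCreate()

# Illustrative dataframe with one tokenized "doc" per row.
df = spark.createDataFrame(
    [(["some", "cleaned", "tokens"],), (["another", "short", "doc"],)],
    ["tokens"],
)

w2v = Word2Vec(vectorSize=200, minCount=1, inputCol="tokens", outputCol="doc_vec")
model = w2v.fit(df)

# transform() attaches the average of the word vectors in each row's "tokens".
model.transform(df).select("doc_vec").show(truncate=False)
```
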
8
votes
1 answer

Doc2vec and word2vec with negative sampling

My current doc2vec code is as follows. # Train doc2vec model model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4, iter = 20) I also have word2vec code as below. # Train word2vec model model =…
user8566323
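
A hedged sketch of how negative sampling is enabled in both classes in gensim (4.x parameter names assumed; the excerpt above uses the older size/iter names):

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [["negative", "sampling", "example"], ["another", "sentence"]]
docs = [TaggedDocument(words=s, tags=[str(i)]) for i, s in enumerate(sentences)]

# negative=5 draws 5 noise words per positive example; hs=0 disables
# hierarchical softmax so only negative sampling is used.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
               negative=5, hs=0, epochs=20, workers=4)
d2v = Doc2Vec(docs, vector_size=100, window=5, min_count=1,
              negative=5, hs=0, epochs=20, workers=4)
```
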
8
votes
1 answer

What is the difference between gensim LabeledSentence and TaggedDocument

Please help me understand the difference between how TaggedDocument and LabeledSentence in gensim work. My ultimate goal is text classification using a Doc2Vec model and any classifier. I am following this blog! class…
Rashmi Singh
  • 519
  • 1
  • 8
  • 20
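
A short sketch relating to the question above: TaggedDocument is the current class (LabeledSentence was an older, since-deprecated alias); both pair a token list with a list of tags. Illustrative only:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [["text", "classification", "example"],
            ["another", "training", "document"]]

# TaggedDocument pairs a token list with one or more tags identifying the doc.
corpus = [TaggedDocument(words=tokens, tags=[str(i)])
          for i, tokens in enumerate(raw_docs)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)
vector = model.infer_vector(["text", "classification"])  # vector for a new doc
```
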
7
votes
1 answer

What does epochs mean in Doc2Vec and train when I have to manually run the iteration?

I am trying to understand the epochs parameter in the Doc2Vec constructor and the epochs parameter in the train function. In the following code snippet, I manually set up a loop of 4000 iterations. Is it required, or is passing 4000 as the epochs parameter in the…
Suhail Gupta
  • 22,386
  • 64
  • 200
  • 328
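
A sketch of the two options discussed above, assuming current gensim: pass epochs once and call train() a single time, rather than looping manually:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["some", "words"], tags=["0"]),
          TaggedDocument(words=["more", "words"], tags=["1"])]

# Preferred: let gensim run all passes internally.
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# A manual loop like "for epoch in range(40): model.train(...)" is usually
# unnecessary and makes correct learning-rate management harder.
```
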
7
votes
1 answer

What is the difference between doc2vec models when dbow_words is set to 1 or 0?

I read this page but I do not understand the difference between models built with the following code. I know that when dbow_words is 0, training of doc-vectors is faster. First model model = doc2vec.Doc2Vec(documents1, size = 100,…
user3092781
  • 313
  • 2
  • 16
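
An illustrative sketch of the dbow_words switch (it only applies in PV-DBOW mode, dm=0); parameter names assume gensim 4.x:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["example", "tokens", "here"], tags=["0"]),
        TaggedDocument(words=["second", "tiny", "document"], tags=["1"])]

# PV-DBOW, doc-vectors only: word vectors are left untrained, training is faster.
model_fast = Doc2Vec(docs, dm=0, dbow_words=0,
                     vector_size=100, min_count=1, epochs=20)

# PV-DBOW with interleaved skip-gram word training: slower, but the word
# vectors become meaningful as well.
model_words = Doc2Vec(docs, dm=0, dbow_words=1,
                      vector_size=100, min_count=1, epochs=20)
```
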
7
votes
1 answer

Creating word2vec model: no syn1neg.npy / syn0.npy files

When creating the model, files with the .syn1neg.npy and .syn0.npy extensions are no longer produced. My code is below: corpus= x+y tok_corp= [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus] model = gensim.models.Word2Vec(tok_corp,…
Tomas Ukasta
  • 170
  • 1
  • 7
7
votes
3 answers

Is there any way to get the vocabulary size from doc2vec model?

I am using gensim doc2vec. I want to know if there is an efficient way to get the vocabulary size from doc2vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more) then this won't be an efficient way.
Rashmi Singh
  • 519
  • 1
  • 8
  • 20
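
A short sketch of reading the vocabulary size directly from the model instead of re-counting words; the attribute name differs between gensim versions, as noted in the comments:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["count", "the", "vocabulary"], tags=["0"])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=10)

# gensim 4.x: the word vocabulary lives in model.wv.key_to_index.
print(len(model.wv.key_to_index))

# gensim 3.x would instead use: len(model.wv.vocab)
```
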
6
votes
1 answer

Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which finds the documents most similar to an unknown one. Now I need to find the similarity value between two unknown documents (which were not in the training data, so they can not be referenced by doc…
Borislav Stoilov
  • 3,247
  • 2
  • 21
  • 46
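
One common approach to the question above is to infer vectors for both unseen documents and compare them with cosine similarity; a sketch assuming a previously trained and saved model (the file name is illustrative):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Assumes a trained model saved earlier; the path is illustrative.
model = Doc2Vec.load("my_doc2vec.model")

tokens_a = "first unseen document text".split()
tokens_b = "second unseen document text".split()

# infer_vector produces a vector for a document that was not in training data.
vec_a = model.infer_vector(tokens_a)
vec_b = model.infer_vector(tokens_b)

# Cosine similarity between the two inferred vectors.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(similarity)
```
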
6
votes
2 answers

How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, the wikicorpus retrieves only the text. After searching the web I found these pages: Page from gensim github issues section. It…
Ghaliamus
  • 101
  • 1
  • 4
6
votes
1 answer

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences: The corpora were lemmatized and POS-tagged with the…
Simon Hessner
  • 1,757
  • 1
  • 22
  • 49
6
votes
1 answer

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get doc vectors for all training docs, but only 10 vectors can be found in model.docvecs. An example of the training data (length > 10): docs = ['This is…
GemOfRoe
  • 125
  • 5
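
A frequent cause of seeing only 10 doc-vectors is supplying tags as a bare string, which gensim iterates character by character; a sketch of the wrong and the corrected usage (illustrative data):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["doc", "number", str(i)] for i in range(100)]

# Wrong: tags=str(i) is iterated character by character, so only the
# characters '0'-'9' end up as tags (10 doc-vectors total).
bad = [TaggedDocument(words=t, tags=str(i)) for i, t in enumerate(texts)]

# Right: wrap the tag in a list so each document keeps its own tag.
good = [TaggedDocument(words=t, tags=[str(i)]) for i, t in enumerate(texts)]

model = Doc2Vec(good, vector_size=50, min_count=1, epochs=10)
print(len(model.dv))  # gensim 4.x: one vector per unique tag (100 here)
```
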
6
votes
1 answer

How much data is actually required to train a doc2Vec model?

I have been using gensim's libraries to train a doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what the ideal training data size for a doc2Vec model should be. I will be sharing my understanding…
Shalabh Singh
  • 360
  • 1
  • 3
  • 10
6
votes
1 answer

Does Doc2Vec learn representations for the tags?

I'm using the Doc2Vec tags as a unique identifier for my documents; each document has a different tag with no semantic meaning. I'm using the tags to find specific documents so I can calculate the similarity between them. Do the tags influence the…
Stanko
  • 4,275
  • 3
  • 23
  • 51
6
votes
1 answer

Doc2Vec: Differentiate Sentence and Document

I am just playing around with Doc2Vec from gensim, analysing a stackexchange dump to assess the semantic similarity of questions and identify duplicates. The tutorial at Doc2Vec-Tutorial seems to describe the input as tagged sentences. But the original…
Vikash Balasubramanian
  • 2,921
  • 3
  • 33
  • 74
6
votes
2 answers

doc2vec How to cluster DocvecsArray

I've patched together the following code from examples I found on the web: # gensim modules from gensim import utils from gensim.models.doc2vec import LabeledSentence from gensim.models import Doc2Vec from sklearn.cluster import KMeans # random from…
Shlomi Schwartz
  • 8,693
  • 29
  • 109
  • 186
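
A sketch of one way to cluster the trained doc-vectors with scikit-learn's KMeans (gensim 4.x attribute names, illustrative corpus):

```python
import numpy as np
from sklearn.cluster import KMeans
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["some", "tokens", str(i)], tags=[str(i)])
        for i in range(20)]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=10)

# Collect all trained doc-vectors into one array (gensim 4.x: model.dv).
vectors = np.array([model.dv[str(i)] for i in range(20)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)
print(labels)
```
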