Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
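
A minimal sketch of how the two modes are typically selected in gensim (parameter names assume gensim 4.x, where the dm argument switches between PV-DM and PV-DBOW; the toy corpus is purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets its own unique tag.
docs = [
    TaggedDocument(words=["machine", "learning", "with", "text"], tags=["doc_0"]),
    TaggedDocument(words=["vectors", "for", "whole", "documents"], tags=["doc_1"]),
]

# PV-DM ("Distributed Memory"), roughly analogous to Word2Vec's CBOW mode.
model_dm = Doc2Vec(docs, dm=1, vector_size=100, window=5, min_count=1, epochs=20)

# PV-DBOW ("Distributed Bag of Words"), roughly analogous to skip-gram.
model_dbow = Doc2Vec(docs, dm=0, vector_size=100, window=5, min_count=1, epochs=20)
```
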

556 questions
8
votes
2 answers

How does Pyspark Calculate Doc2Vec from word2vec word embeddings?

I have a pyspark dataframe with a corpus of ~300k unique rows, each with a "doc" that contains a few sentences of text. After processing, I have a 200-dimension vectorized representation of each row/doc. My NLP process: Remove Punctuation…
whs2k
  • 741
  • 2
  • 10
  • 19
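
For context on the question above: Spark ML's Word2Vec produces one vector per document by averaging the vectors of the words in each row, which is how a "Doc2Vec-like" output arises from word embeddings. A rough sketch, with illustrative column names and vector size:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.getOrCreate()

# Illustrative dataframe with one tokenized "doc" per row.
df = spark.createDataFrame(
    [(["some", "cleaned", "tokens"],), (["another", "short", "doc"],)],
    ["tokens"],
)

w2v = Word2Vec(vectorSize=200, minCount=1, inputCol="tokens", outputCol="doc_vec")
model = w2v.fit(df)

# transform() attaches the average of the word vectors in each row's "tokens".
model.transform(df).select("doc_vec").show(truncate=False)
```
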
8
votes
1 answer

Doc2vec and word2vec with negative sampling

My current doc2vec code is as follows. # Train doc2vec model model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4, iter = 20) I also have word2vec code as below. # Train word2vec model model =…
user8566323
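
A hedged sketch of how negative sampling is enabled in both classes in gensim (4.x parameter names assumed; the excerpt above uses the older size/iter names):

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [["negative", "sampling", "example"], ["another", "sentence"]]
docs = [TaggedDocument(words=s, tags=[str(i)]) for i, s in enumerate(sentences)]

# negative=5 draws 5 noise words per positive example; hs=0 disables
# hierarchical softmax so only negative sampling is used.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
               negative=5, hs=0, epochs=20, workers=4)
d2v = Doc2Vec(docs, vector_size=100, window=5, min_count=1,
              negative=5, hs=0, epochs=20, workers=4)
```
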
8
votes
1 answer

What is the difference between gensim LabeledSentence and TaggedDocument

Please help me understand the difference between how TaggedDocument and LabeledSentence in gensim work. My ultimate goal is text classification using a Doc2Vec model and any classifier. I am following this blog! class…
Rashmi Singh
  • 519
  • 1
  • 8
  • 20
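
A short sketch relating to the question above: TaggedDocument is the current class (LabeledSentence was an older, since-deprecated alias); both pair a token list with a list of tags. Illustrative only:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [["text", "classification", "example"],
            ["another", "training", "document"]]

# TaggedDocument pairs a token list with one or more tags identifying the doc.
corpus = [TaggedDocument(words=tokens, tags=[str(i)])
          for i, tokens in enumerate(raw_docs)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)
vector = model.infer_vector(["text", "classification"])  # vector for a new doc
```
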
7
votes
1 answer

What does epochs mean in Doc2Vec and train when I have to manually run the iteration?

I am trying to understand the epochs parameter in the Doc2Vec constructor and the epochs parameter in the train function. In the following code snippet, I manually set up a loop of 4000 iterations. Is it required, or is passing 4000 as the epochs parameter in the…
Suhail Gupta
  • 22,386
  • 64
  • 200
  • 328
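
A sketch of the two options discussed above, assuming current gensim: pass epochs once and call train() a single time, rather than looping manually:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["some", "words"], tags=["0"]),
          TaggedDocument(words=["more", "words"], tags=["1"])]

# Preferred: let gensim run all passes internally.
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# A manual loop like "for epoch in range(40): model.train(...)" is usually
# unnecessary and makes correct learning-rate management harder.
```
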
7
votes
1 answer

What is the difference between doc2vec models when dbow_words is set to 1 or 0?

I read this page but I do not understand the difference between models built with the following code. I know that when dbow_words is 0, training of doc-vectors is faster. First model model = doc2vec.Doc2Vec(documents1, size = 100,…
user3092781
  • 313
  • 2
  • 16
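
An illustrative sketch of the dbow_words switch (it only applies in PV-DBOW mode, dm=0); parameter names assume gensim 4.x:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["example", "tokens", "here"], tags=["0"]),
        TaggedDocument(words=["second", "tiny", "document"], tags=["1"])]

# PV-DBOW, doc-vectors only: word vectors are left untrained, training is faster.
model_fast = Doc2Vec(docs, dm=0, dbow_words=0,
                     vector_size=100, min_count=1, epochs=20)

# PV-DBOW with interleaved skip-gram word training: slower, but the word
# vectors become meaningful as well.
model_words = Doc2Vec(docs, dm=0, dbow_words=1,
                      vector_size=100, min_count=1, epochs=20)
```
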
7
votes
1 answer

Creating word2vec model: no syn1neg.npy / syn0.npy files

When creating the model, files with the .syn1neg.npy and .syn0.npy extensions are no longer produced. My code is below: corpus= x+y tok_corp= [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus] model = gensim.models.Word2Vec(tok_corp,…
Tomas Ukasta
  • 170
  • 1
  • 7
7
votes
3 answers

Is there any way to get the vocabulary size from doc2vec model?

I am using gensim doc2vec. I want to know if there is an efficient way to get the vocabulary size from doc2vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more) then this won't be an efficient way.
Rashmi Singh
  • 519
  • 1
  • 8
  • 20
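
A short sketch of reading the vocabulary size directly from the model instead of re-counting words; the attribute name differs between gensim versions, as noted in the comments:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["count", "the", "vocabulary"], tags=["0"])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=10)

# gensim 4.x: the word vocabulary lives in model.wv.key_to_index.
print(len(model.wv.key_to_index))

# gensim 3.x would instead use: len(model.wv.vocab)
```
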
6
votes
1 answer

Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which finds the documents most similar to an unknown one. Now I need to find the similarity value between two unknown documents (which were not in the training data, so they can not be referenced by doc…
Borislav Stoilov
  • 3,247
  • 2
  • 21
  • 46
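
One common approach to the question above is to infer vectors for both unseen documents and compare them with cosine similarity; a sketch assuming a previously trained and saved model (the file name is illustrative):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Assumes a trained model saved earlier; the path is illustrative.
model = Doc2Vec.load("my_doc2vec.model")

tokens_a = "first unseen document text".split()
tokens_b = "second unseen document text".split()

# infer_vector produces a vector for a document that was not in training data.
vec_a = model.infer_vector(tokens_a)
vec_b = model.infer_vector(tokens_b)

# Cosine similarity between the two inferred vectors.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(similarity)
```
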
6
votes
2 answers

How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, the wikicorpus retrieves only the text. After searching the web I found these pages: Page from gensim github issues section. It…
Ghaliamus
  • 101
  • 1
  • 4
6
votes
1 answer

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences: The corpora were lemmatized and POS-tagged with the…
Simon Hessner
  • 1,757
  • 1
  • 22
  • 49
6
votes
1 answer

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get doc vectors for all training docs, but only 10 vectors can be found in model.docvecs. An example of the training data (length > 10): docs = ['This is…
GemOfRoe
  • 125
  • 5
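
A frequent cause of seeing only 10 doc-vectors is supplying tags as a bare string, which gensim iterates character by character; a sketch of the wrong and the corrected usage (illustrative data):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["doc", "number", str(i)] for i in range(100)]

# Wrong: tags=str(i) is iterated character by character, so only the
# characters '0'-'9' end up as tags (10 doc-vectors total).
bad = [TaggedDocument(words=t, tags=str(i)) for i, t in enumerate(texts)]

# Right: wrap the tag in a list so each document keeps its own tag.
good = [TaggedDocument(words=t, tags=[str(i)]) for i, t in enumerate(texts)]

model = Doc2Vec(good, vector_size=50, min_count=1, epochs=10)
print(len(model.dv))  # gensim 4.x: one vector per unique tag (100 here)
```
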
6
votes
1 answer

How much data is actually required to train a doc2Vec model?

I have been using gensim's libraries to train a doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what the ideal training data size for a doc2Vec model should be. I will be sharing my understanding…
Shalabh Singh
  • 360
  • 1
  • 3
  • 10
6
votes
1 answer

Does Doc2Vec learn representations for the tags?

I'm using the Doc2Vec tags as a unique identifier for my documents; each document has a different tag with no semantic meaning. I'm using the tags to find specific documents so I can calculate the similarity between them. Do the tags influence the…
Stanko
  • 4,275
  • 3
  • 23
  • 51
6
votes
1 answer

Doc2Vec: Differentiate Sentence and Document

I am just playing around with Doc2Vec from gensim, analysing a stackexchange dump to assess the semantic similarity of questions and identify duplicates. The tutorial at Doc2Vec-Tutorial seems to describe the input as tagged sentences. But the original…
Vikash Balasubramanian
  • 2,921
  • 3
  • 33
  • 74
6
votes
2 answers

doc2vec How to cluster DocvecsArray

I've patched together the following code from examples I found on the web: # gensim modules from gensim import utils from gensim.models.doc2vec import LabeledSentence from gensim.models import Doc2Vec from sklearn.cluster import KMeans # random from…
Shlomi Schwartz
  • 8,693
  • 29
  • 109
  • 186
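
A sketch of one way to cluster the trained doc-vectors with scikit-learn's KMeans (gensim 4.x attribute names, illustrative corpus):

```python
import numpy as np
from sklearn.cluster import KMeans
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["some", "tokens", str(i)], tags=[str(i)])
        for i in range(20)]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=10)

# Collect all trained doc-vectors into one array (gensim 4.x: model.dv).
vectors = np.array([model.dv[str(i)] for i in range(20)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)
print(labels)
```
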