Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm that converts documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and is implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
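
The difference between the two modes can be illustrated by the training examples each one builds from a single document. The following is a simplified, standard-library-only sketch, not Gensim's implementation: the function names, the `DOC_0` tag, and the symmetric context window are assumptions made for illustration. In PV-DBOW the document tag alone is used to predict each word; in PV-DM the document tag is combined with the neighbouring words to predict the word at the centre of the window.

```python
def pv_dbow_examples(tag, words):
    """PV-DBOW style: the doc tag alone predicts every word in the document."""
    return [((tag,), word) for word in words]


def pv_dm_examples(tag, words, window=2):
    """PV-DM style: the doc tag plus nearby context words predict the centre word."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]      # up to `window` words before the target
        right = words[i + 1:i + 1 + window]     # up to `window` words after the target
        examples.append(((tag, *left, *right), target))
    return examples


# Hypothetical toy document, for illustration only.
doc_words = ["the", "cat", "sat"]
dbow = pv_dbow_examples("DOC_0", doc_words)
dm = pv_dm_examples("DOC_0", doc_words, window=1)

for inputs, target in dbow:
    print("PV-DBOW", inputs, "->", target)
for inputs, target in dm:
    print("PV-DM  ", inputs, "->", target)
```

In a real implementation each input is a trainable vector and the prediction is made by a shallow neural network; the sketch only shows which inputs are paired with which target word.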

556 questions
0
votes
1 answer

Cannot align graph because multiple tag doc2vec returning more items in doctag_syn0 than there are in the training data

I am training a doc2vec model with multiple tags, so it includes the typical doc "ID" tag and also a label tag, "Category 1." I'm trying to graph the results so that I get the doc distribution in 2D (using LargeVis) but am able…
seeiespi
0
votes
3 answers

Doc2Vec: Similarity Between Coded Documents and Unseen Documents

I have a sample of ~60,000 documents. We've hand coded 700 of them as having a certain type of content. Now we'd like to find the "most similar" documents to the 700 we already hand-coded. We're using gensim doc2vec and I can't quite figure out…
0
votes
1 answer

Why does Gensim's most similar in doc2vec give the same vector as the output?

I am using the following code to get the ordered list of user posts. model = doc2vec.Doc2Vec.load(doc2vec_model_name) doc_vectors = model.docvecs.doctag_syn0 doc_tags = model.docvecs.offset2doctag for w, sim in…
J Cena
0
votes
1 answer

I want to classify some sentences on the basis of their semantic meaning. How can I use Doc2Vec for this? Or is there a better approach?

I want to apply doc2vec to various reviews that we extracted from a source, and I want to classify these reviews into different classes defined by the user. How can I do this?
0
votes
1 answer

MemoryError using Python and Doc2Vec

I'm trying to train a Doc2vec model on massive data. I have 20k files totaling 72GB, and wrote this code: def train(): onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))] data = [] random.shuffle(onlyfiles) …
Dimmy Magalhães
0
votes
1 answer

Using Doc2Vec to find salience score for resumes based on job description

Here is my use case: the HR department provides a job description (free text) and a set of resumes (plain text), and the ask is to come up with a salience score based on relevance to the job description. The job description consists of skills required and minimum…
0
votes
1 answer

Doc2vec - About getting document vector

I'm very new to doc2vec and have some questions about document vectors. What I'm trying to get is a vector for a phrase like 'cat-like mammal'. So far, using a pre-trained doc2vec model, I tried the code below import…
Chhyun
0
votes
1 answer

Gensim tagging documents with big numbers

I want to label my documents with tags mapped to the id attribute in a database. The ids can look like this, for example: documents[0] is TaggedDocument(words=['blabla', 'request'], tags=[225616076]) For some reason, it is not able to…
xdaniel
0
votes
0 answers

Sent2Vec or Doc2Vec Testing

How can I test a sent2vec or doc2vec model that I've trained on a specific dataset? The process is entirely unsupervised, so I have no labels to help with testing. My interest is in how the semantic similarity measure is computed. Thanks in advance.
Hummer
0
votes
1 answer

AttributeError: 'Tree' object has no attribute 'words'. Doc2Vec error

I am trying to train a Doc2Vec word embedding on preprocessed paragraphs. I have removed punctuation and carried out tokenization, POS tagging, and chunking. import nltk from nltk import word_tokenize, pos_tag, ne_chunk from gensim.models.doc2vec…
Nuc
0
votes
1 answer

How to check via callbacks if alpha is decreasing? + How to load all cores during training?

I'm training doc2vec and using callbacks to try to see whether alpha is decreasing over training time, using this code: class EpochSaver(CallbackAny2Vec): '''Callback to save model after each epoch.''' def __init__(self, path_prefix): …
Dasha
0
votes
1 answer

Gensim Doc2vec trained, but not saved

When I trained d2v on a large text corpus, I received these 3 files: doc2vec.model.trainables.syn1neg.npy doc2vec.model.vocabulary.cum_table.npy doc2vec.model.wv.vectors.npy But the final model was not saved, because there was not enough free space…
Dasha
0
votes
0 answers

Cannot figure out format needed to make predictions on dataset trained with doc2vec and random forest classifier

I am trying to make predictions on a dataset based on some pre-defined data (tweets and the categories that the tweets belong to, labeled 1-16), for which I have built a doc2vec model and trained a random forest classifier. I am confused about what…
Natalie
0
votes
1 answer

Doc2Vec gensim with supervised data predefined labels

I am trying to use gensim's doc2vec to create a model which will be trained on a set of documents and a set of labels. The labels were created manually and need to be put into the program to be trained on. So far I have 2 lists: a list of sentences,…
Natalie
0
votes
1 answer

How do I get the similarity between a word and a document in gensim

I have started to learn gensim for both word2vec and doc2vec, and it works. The similarity scores actually work really well. For an experiment, however, I wanted to optimize a keyword-based search algorithm by comparing a single word and getting…
Julian Kurz