This is my first time using Doc2Vec. I'm trying to classify works by author. I have trained a model with Labeled Sentences (paragraphs, or strings of a specified length), with words = the list of words in the paragraph and tags = the author's name. In my case I only have two authors. I tried accessing the docvecs attribute of the trained model, but it only contains two elements, corresponding to the two tags I used when I trained the model. I'm trying to get the Doc2Vec numpy representation of each paragraph I fed into training so I can use those as training data later on. How can I do this? Thanks.
1 Answer
Bulk training only creates vectors for the tags you supplied. If you want to read out a bulk-trained vector per paragraph (as if by model.docvecs['paragraph000']), you have to give each paragraph a unique tag during training (like 'paragraph000'). You can give docs other tags as well, but bulk training only creates and remembers doc-vectors for the supplied tags.
After training, you can infer vectors for any other texts you supply to infer_vector(), and of course you could supply the same paragraphs that were used during training.

gojomo
- I actually figured that out and I'm using paragraph numbers as tags, as you said (10,000 vectors per author). I do this for both authors and then train an SVM model (sklearn) with them. But when I use those numeric docvec arrays as vectors I get horrible accuracy, ~50%. I got 73% with nltk pos_tag, so I must be doing something wrong... - Eric Han Nov 08 '17 at 05:49
- Thank you so much for your help. I used infer_vector on my paragraphs and am now getting 93.28% accuracy in my binary classification task!! ;) - Eric Han Nov 08 '17 at 22:14