This is my first time using Doc2Vec. I'm trying to classify works by author. I have trained a model on labeled sentences (paragraphs, or strings of a specified length), with words = the list of words in the paragraph and tags = the author's name. In my case I only have two authors. I tried accessing the docvecs attribute of the trained model, but it contains only two elements, corresponding to the two tags I used when training. I want the Doc2Vec numpy representation of each paragraph I fed into training so that I can use those vectors as training data later on. How can I do this? Thanks.

Eric Han

1 Answer

Bulk training only creates vectors for the tags you supplied. If you want to read out a bulk-trained vector per paragraph (as if by model.docvecs['paragraph000']), you have to give each paragraph a unique tag during training (like 'paragraph000'). You can give docs other tags as well, but bulk training only creates and remembers doc-vectors for the supplied tags.

After training, you can infer vectors for any other texts you supply to infer_vector() - and of course you could supply the same paragraphs that were used during training.

gojomo
  • I actually figured that out and I'm now using the paragraph number as the tag, as you said (10,000 vectors per author). I do this for both authors and then train an SVM (sklearn) on the vectors. But when I use those docvec arrays as features I get horrible accuracy, about 50%. I got 73% with nltk pos_tag, so I must be doing something wrong... – Eric Han Nov 08 '17 at 05:49
  • Thank you so much for your help. I used infer_vector on my paragraphs and am now getting 93.28% accuracy in my binary classification task!! ;) – Eric Han Nov 08 '17 at 22:14