I have protein sequences and want to run doc2vec on them. My goal is to get one vector for each sentence/sequence.
I have 1612 sentences/sequences and 30 classes, so the labels are not unique and many documents share the same label.
When I first tried doc2vec, it gave me just 30 vectors, which is the number of unique labels. I then decided to use multiple tags per document to get a vector for each sentence.
When I did this, I ended up with more vectors than sentences. Any explanation of what might have gone wrong?
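from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# each document gets two tags: its class label and its own id
tagged = data.apply(lambda r: TaggedDocument(words=r["A"], tags=[r.label, r.id]), axis=1)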
print(len(tagged))
1612
sents = tagged.values
model = Doc2Vec(sents, size=5, window=5, iter=20, min_count = 0)
sents.shape
(1612,)
model.docvecs.vectors_docs.shape
(1643,5)
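
A possible way to sanity-check that number, as a sketch only: Doc2Vec learns one vector per distinct tag rather than one per document, and, as far as I understand gensim's behavior, plain integer tags are used as direct indexes, so slots 0 through the largest integer are all allocated. The helper name `expected_vector_count` and the sample ids and labels below are hypothetical, not taken from the real data; they only illustrate how the count could exceed the number of documents.

```python
import numbers


def expected_vector_count(tag_lists):
    """Estimate how many doc-vectors a tagging scheme would produce.

    Assumption (not verified against every gensim version): integer tags
    reserve slots 0..max_int, and each distinct non-integer tag adds one
    more slot.
    """
    max_int = -1
    other_tags = set()
    for tags in tag_lists:
        for tag in tags:
            if isinstance(tag, numbers.Integral):
                max_int = max(max_int, tag)
            else:
                other_tags.add(tag)
    return (max_int + 1) + len(other_tags)


# Hypothetical data: 1612 documents with integer ids 1..1612 and
# 30 string class labels assigned round-robin.
tag_lists = [[f"class_{i % 30}", i] for i in range(1, 1613)]
n_vectors = expected_vector_count(tag_lists)
print(n_vectors)  # 1643 under these assumed inputs
```

If the ids really are integers starting at 1, this would account for the extra rows: 1613 index slots plus 30 label vectors. Using string ids, or tagging each document with its id only, should give exactly one vector per document under the same assumption.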