I am training a gensim Doc2Vec model on a collection of documents.
I have two types of inputs:
- The whole English Wikipedia: each article is treated as one document for Doc2Vec training (around 5.5 million articles/documents in total).
- Documents related to my project, manually prepared and collected from some websites (around 15,000 documents). Each of these documents is roughly 100 sentences long.
Further, I want to use this model to infer vectors for new sentences of about 10~20 words.
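For reference, here is a minimal sketch of my current pipeline (`wiki_texts` and `project_texts` are placeholders for my two corpora, each element being one document already tokenized into a list of words):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each whole document (Wikipedia article or project document) becomes
# one TaggedDocument, tagged with its index in the combined corpus.
corpus = [TaggedDocument(words=doc_tokens, tags=[str(i)])
          for i, doc_tokens in enumerate(wiki_texts + project_texts)]

model = Doc2Vec(vector_size=300, min_count=5, epochs=10, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Later: infer a vector for a short, unseen sentence (10~20 words)
new_sentence = "some short sentence related to my project".split()
vector = model.infer_vector(new_sentence)
```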
I would like some clarification on this approach.
Is it correct to train over whole documents (each approx. 100 sentences long) and then infer vectors for new, much shorter sentences?
Or should I instead train over individual sentences rather than whole documents, and then infer over the new sentences?