I am building a Doc2Vec model from 1000 documents using Gensim. Each document consists of several sentences, and each sentence is a list of words.
Example:
Doc1: [[word1, word2, word3], [word4, word5, word6, word7],[word8, word9, word10]]
Doc2: [[word7, word3, word1, word2], [word1, word5, word6, word10]]
Initially, to train the Doc2Vec model, I split each document into sentences and tagged every sentence with its document's tag using "TaggedDocument". As a result, the final training input for Doc2Vec looks like this:
TaggedDocument(words=[word1, word2, word3], tags=['Doc1'])
TaggedDocument(words=[word4, word5, word6, word7], tags=['Doc1'])
TaggedDocument(words=[word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2], tags=['Doc2'])
TaggedDocument(words=[word1, word5, word6, word10], tags=['Doc2'])
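For reference, a minimal sketch of how this sentence-level input could be built. The `docs` dict and the `sentence_level` name are hypothetical, chosen only for illustration; the fallback namedtuple is there only so the snippet runs without Gensim installed.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical container: document tag -> list of tokenized sentences.
docs = {
    "Doc1": [["word1", "word2", "word3"],
             ["word4", "word5", "word6", "word7"],
             ["word8", "word9", "word10"]],
    "Doc2": [["word7", "word3", "word1", "word2"],
             ["word1", "word5", "word6", "word10"]],
}

# One TaggedDocument per sentence; sentences from the same
# document all carry that document's tag.
sentence_level = [
    TaggedDocument(words=sentence, tags=[tag])
    for tag, sentences in docs.items()
    for sentence in sentences
]

for item in sentence_level:
    print(item)
```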
However, would it be okay to train the model with the document as a whole without splitting sentences?
TaggedDocument(words=[word1, word2, word3, word4, word5, word6, word7, word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2, word1, word5, word6, word10], tags=['Doc2'])
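A minimal sketch of how the whole-document input could be produced, by concatenating each document's sentences in order. The `docs` dict and the `whole_document` name are hypothetical, and the fallback namedtuple is there only so the snippet runs without Gensim installed.

```python
from collections import namedtuple
from itertools import chain

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical container: document tag -> list of tokenized sentences.
docs = {
    "Doc1": [["word1", "word2", "word3"],
             ["word4", "word5", "word6", "word7"],
             ["word8", "word9", "word10"]],
    "Doc2": [["word7", "word3", "word1", "word2"],
             ["word1", "word5", "word6", "word10"]],
}

# One TaggedDocument per document: flatten its sentences into a
# single word list, keeping the original word order.
whole_document = [
    TaggedDocument(words=list(chain.from_iterable(sentences)), tags=[tag])
    for tag, sentences in docs.items()
]

for item in whole_document:
    print(item)
```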
Thank you in advance :)