I have multiple text files and I am trying to find a way to identify similar bodies of text. Each file consists of a single average-sized paragraph. I also have some metadata that could serve as labels if I were to go down the route of a neural network approach such as a Siamese network.
While that is one option, another possibility I was considering is using something like doc2vec to embed all of the paragraphs (after removing stopwords and the like) and then finding similar files based on the cosine similarity of the resulting document vectors.
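For concreteness, here is a minimal sketch of what I had in mind for the doc2vec route, using gensim. The filenames, corpus, and hyperparameters are just placeholders, not a tuned setup:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess

# Hypothetical corpus: one paragraph per file, keyed by filename.
paragraphs = {
    "file_a.txt": "The quick brown fox jumps over the lazy dog.",
    "file_b.txt": "A fast auburn fox leaped over a sleepy hound.",
    "file_c.txt": "Quarterly revenue grew despite supply chain issues.",
}

# Tokenize and strip stopwords, tagging each document with its filename.
corpus = [
    TaggedDocument(words=simple_preprocess(remove_stopwords(text)), tags=[name])
    for name, text in paragraphs.items()
]

# Train a small Doc2Vec model (parameters here are illustrative only).
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Rank the files most similar to file_a.txt by cosine similarity.
print(model.dv.most_similar("file_a.txt"))
```

The idea would be to train on all of the paragraphs at once and then query `most_similar` per file to get a ranked list of its nearest neighbours.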
How do the two approaches outlined above generally compare in terms of the quality of their results, and is doc2vec robust and accurate enough to be considered a viable option? I may also be overlooking a better method for this task, so suggestions are welcome.