
I have multiple text files and I am trying to find a way to identify similar bodies of text. The files themselves each consist of an average-sized paragraph. On top of this I also have some data that could be used as labels if I were to go down the route of a neural network such as a Siamese network.

While that was one option, another possibility I was wondering about was using something such as doc2vec to process all of the paragraphs (after removing stopwords and the like) and then finding similar files of text based on the cosine similarity of the doc2vec vectors.
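For reference, the cosine comparison in question is just the normalized dot product of two document vectors. A minimal numpy sketch (the vectors below are placeholders, not real doc2vec output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for doc2vec output
v1 = [0.2, 0.8, 0.1]
v2 = [0.25, 0.75, 0.05]
print(cosine_similarity(v1, v2))  # near 1.0 for vectors pointing the same way
```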

How do the methods outlined above generally compare in terms of the results they produce, and is doc2vec robust and accurate enough to be considered a viable option? I may also be overlooking a good method for this.

  • What do you mean by "similar bodies of text"? Taking baseball as an example, do you want to 1. tell if two documents are both about baseball, 2. tell if two documents are about the same baseball game, 3. tell if two documents are mostly the same text, or 4. something else? – polm23 Jul 03 '17 at 03:56

1 Answer


The 'Paragraph Vectors' algorithm, which goes by the name Doc2Vec in the gensim library, can work for this. You don't necessarily have to remove stop-words. Results may be a little erratic for very short documents (fewer than 10-20 words) or small corpora (fewer than 100,000 documents).

Given that you have labels, Facebook's FastText refinement of word2vec also includes a 'classifier' mode, which optimizes word-vectors not just to predict their neighbors, but to work well for predicting the known labels when all the word-vectors for a run of text are averaged together. It'd be worth trying, too.

With any set of word-vectors, a calculation called "Word Mover's Distance" gives an interesting measure of the similarity between texts. But it's expensive to calculate against all candidate matches.
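Full WMD requires solving an optimal-transport problem, but its cheap nearest-neighbor lower bound (the "relaxed" WMD from the original paper) conveys the idea. A sketch using random toy word vectors (an assumption for self-containment; real use would load trained word2vec or FastText vectors):

```python
import numpy as np

# Random toy word vectors; in practice, load trained word2vec/FastText vectors
rng = np.random.default_rng(42)
vocab = {w: rng.standard_normal(50) for w in ["cat", "dog", "pet", "car", "engine"]}

def relaxed_wmd(doc_a, doc_b):
    """Lower bound on Word Mover's Distance: each word in doc_a
    'travels' to its nearest word in doc_b; average the travel costs."""
    costs = [min(np.linalg.norm(vocab[w] - vocab[v]) for v in doc_b)
             for w in doc_a]
    return float(np.mean(costs))

print(relaxed_wmd(["cat"], ["cat"]))  # 0.0: identical documents
```

gensim exposes the exact calculation as `KeyedVectors.wmdistance` on a loaded set of word-vectors.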

There are many other techniques – FastSent, Sent2Vec, Skip-Thought Vectors, and further refinements. Which works best often depends on your corpus and specific end-goals, how much you can tune the corpus and algorithm, and which aspects of 'similarity' matter most to your users. You really have to try them and then perform a rigorous evaluation against your project's aims.

gojomo