I have an index with documents that are basically scraped website content. I need to be able to detect documents that are nearly identical. This requirement arises when one website copies content from another: some words are changed, but the text is typically 80%-90% the same, and I need to group such content, i.e. find its near duplicates. So the requirement is to find and group documents that are more than 75% similar to one another.
I was experimenting with Solr MLT (MoreLikeThis), and I'm pleased with the overall results, but I can't find a nice, efficient way to get normalized scores.
The closest I got to what I need is to send the document content via stream.body (for a document that is already in the index) to the /mlt request handler, and then look at the score returned for that same already-indexed document. With that self-score I can calculate how similar the other documents are.
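For reference, here is a minimal sketch of that workaround in Python with requests. The Solr URL, core name, and the content field are placeholders, and stream.body must be enabled in your Solr configuration (enableStreamBody) for this to work:

```python
import requests

# Hypothetical core and field names; adjust to your setup.
SOLR_MLT_URL = "http://localhost:8983/solr/mycore/mlt"

def mlt_scores(doc_content, rows=20):
    """POST document text as stream.body to the /mlt handler and
    return the matched documents with their raw similarity scores."""
    params = {
        "mlt.fl": "content",  # field the MLT comparison is based on
        "fl": "id,score",
        "rows": rows,
        "wt": "json",
    }
    resp = requests.post(SOLR_MLT_URL, params=params,
                         data={"stream.body": doc_content})
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def near_duplicates(doc_id, doc_content, threshold=0.75):
    """Normalize each MLT score by the score of the source document
    itself (it should come back as a hit, since it is already
    indexed), then keep matches above the similarity threshold."""
    docs = mlt_scores(doc_content)
    self_score = next((d["score"] for d in docs if d["id"] == doc_id), None)
    if self_score is None:
        raise ValueError("source document not found in MLT results")
    return [(d["id"], d["score"] / self_score)
            for d in docs
            if d["id"] != doc_id and d["score"] / self_score >= threshold]
```

This effectively treats the self-match score as the 100% mark and expresses every other score as a fraction of it, which is how I get the 75% cutoff.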
But this seems very wasteful of resources, and I feel there has to be a better way to achieve this.
So my question is: can MLT produce such results, or am I stretching what MLT can achieve?