I have stored a large number of news articles from the RSS feeds of different sources in an Elasticsearch index. At the moment, a single search query returns many similar articles, because the same news topic is covered by many RSS sources.
Instead, I would like to return only one article from each group of articles on the same topic. So I need to recognize which articles are about the same topic, cluster those documents, and return only the "best" article from each cluster.
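To make the desired behaviour concrete, here is a minimal client-side sketch of roughly what I have in mind, applied to the hits of one query. Everything in it is an assumption for illustration only: the hit layout (`title`, `score`), the helper names `tokens`, `jaccard` and `dedupe_hits`, the word-overlap (Jaccard) similarity on titles, and the 0.5 threshold. It is not an Elasticsearch feature, and it is a greedy near-duplicate filter rather than real clustering.

```python
import re


def tokens(text):
    """Return the lowercased word set of a title (hypothetical, crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe_hits(hits, threshold=0.5):
    """Greedily group hits by title similarity; keep the best-scored hit per group."""
    representatives = []  # (token_set, hit) for each group found so far
    # Visit hits from highest to lowest score, so the first hit seen in a
    # group is also the one that gets kept as its representative.
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        toks = tokens(hit["title"])
        if any(jaccard(toks, rep) >= threshold for rep, _ in representatives):
            continue  # similar enough to an already kept article: drop it
        representatives.append((toks, hit))
    return [hit for _, hit in representatives]


# Made-up example hits (assumed layout, not real Elasticsearch output).
hits = [
    {"title": "Central bank raises interest rates again", "score": 2.1},
    {"title": "Central bank raises interest rates", "score": 1.9},
    {"title": "Local team wins championship final", "score": 1.5},
    {"title": "Bank raises interest rates again", "score": 1.2},
]
unique_hits = dedupe_hits(hits)
for hit in unique_hits:
    print(hit["title"])
```

In this sketch "best" simply means the highest-scored hit in each group, and the filter only sees the hits of a single query, not the whole index.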
What would be the most convenient way to approach this problem? Can I somehow make use of the Elasticsearch more-like-this API? Or is the https://github.com/carrot2/elasticsearch-carrot2 plugin the way to go? Or is there simply no convenient way, so that I have to implement my own version of http://en.wikipedia.org/wiki/K-means_clustering or http://en.wikipedia.org/wiki/Non-negative_matrix_factorization to cluster my documents?