In Elasticsearch, is it possible to group documents that have the most similar text, without supplying an initial query to compare against?
I know it is possible to query for similar documents with MLT ("more like this"), but is it possible to cluster the documents within an index according to a field's values?
For instance:
document 1: The quick brown fox jumps over the lazy dog
document 2: Barcelona is a great city
document 3: The fast orange fox jumps over the lazy dog
document 4: Lotus loft Room - Bear Mountains Neighbourhood
document 5: I do not like to eat fish
document 6: "Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4
document 7: Lotus Loft
Now I want to perform some kind of aggregation that, without being given a search query, groups them like this:
Group 1: document 1 and document 3
Group 2: document 2
Group 3: document 4 and document 6 and document 7
Group 4: document 5
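To make the expected grouping concrete, here is a minimal sketch in plain Python (no Elasticsearch involved) that reproduces the groups above. It is only an illustration of the behaviour I am after, not a proposed solution: the helper names `tokens`, `overlap` and `group_similar` are made up for this example, the similarity measure is the overlap coefficient on word sets (shared words divided by the size of the smaller set), and the 0.5 threshold is an assumption chosen by hand for these seven texts. It compares every pair of documents, so it would not scale to a large index.

```python
import re
from itertools import combinations

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Barcelona is a great city",
    3: "The fast orange fox jumps over the lazy dog",
    4: "Lotus loft Room - Bear Mountains Neighbourhood",
    5: "I do not like to eat fish",
    6: '"Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4',
    7: "Lotus Loft",
}


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Overlap coefficient: shared words / size of the smaller word set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def group_similar(docs, threshold=0.5):
    """Link every pair at or above the threshold, then return the linked groups."""
    token_sets = {doc_id: tokens(text) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}  # union-find forest

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if overlap(token_sets[a], token_sets[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(sorted(group) for group in groups.values())


groups = group_similar(docs)
print(groups)  # [[1, 3], [2], [4, 6, 7], [5]]
```

Note that documents 4 and 6 are not directly similar enough under this measure; they end up in the same group only because both are linked to document 7.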
Alternatively, please let me know about other ways to cluster similar documents or find near-duplicate documents, e.g. using Apache Spark, KNN, unsupervised learning, or any other algorithm.
I just want to cluster my documents based on fields of my Elasticsearch documents such as country, city, latlng, property name, or description.
Basically, I want to know:
How can I cluster similar documents (e.g. JSON/CSV) or find duplicate documents using Python text analysis, unsupervised learning with KNN, PySpark with MLlib, or any other document clustering algorithm? Hints, open source projects, or other resource links would help. I just need some concrete examples or tutorials for this task.