In Elasticsearch, is it possible to group documents that share the most similar texts, without giving an initial query to compare to?

I know it is possible to query and get MLT ("more like this") documents, but is it possible to cluster documents within an index according to a field's values?

For instance:

document 1: The quick brown fox jumps over the lazy dog

document 2: Barcelona is a great city

document 3: The fast orange fox jumps over the lazy dog

document 4: Lotus loft Room - Bear Mountains Neighbourhood

document 5: I do not like to eat fish

document 6: "Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4

document 7: Lotus Loft

Now, I would like to perform some kind of aggregation that, without being given a search query, groups them like this:

Group 1: document 1 and document 3

Group 2: document 2 

Group 3: document 4 and document 6 and document 7

Group 4: document 5

OR

Please let me know of other ways to cluster documents, e.g. using Apache Spark, KNN, unsupervised learning, or any other algorithm that finds near-duplicate documents or clusters similar ones.

I just want to cluster my documents based on fields such as country, city, latlng, property name, or description in my Elasticsearch documents.

Basically, I want to know:

How can I cluster similar documents (e.g. JSON/CSV) or find duplicate documents using Python text analysis, unsupervised learning with KNN, PySpark with MLlib, or any other document clustering algorithm? Hints, open source projects, or other resource links are welcome. I just need some concrete examples or tutorials for this task.

A l w a y s S u n n y

1 Answer

Yes, it's possible. There is an Elasticsearch plugin named Carrot2. The clustering plugin automatically groups similar "documents" together and assigns human-readable labels to these groups. It has 4 built-in clustering algorithms (3 free, 1 requiring a license). If you want to cluster all documents in an ES index, you can use a match_all query.

Here is my ES 6.6.2 client code example for clustering in Python 3:

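import json
import requests

# Clustering endpoint exposed by the plugin for the index "b2c_index"
REQUEST_URL = 'http://localhost:9200/b2c_index/_search_with_clusters'
HEADER = {'Content-Type': 'application/json; charset=utf-8'}

requestDict = {
  # Regular search request; the hits it returns are the documents that get clustered
  "search_request": {
    "_source": ["title", "content", "lang"],
    "query": {"match_all": {}},
    "size": 100
  },

  # Query terms hint for the clusterer; empty because match_all has no terms
  "query_hint": "",
  # Map document fields to the logical fields used for clustering
  "field_mapping": {
    "title": ["_source.title"],
    "content": ["_source.content"],
    "language": ["_source.lang"],
  }
}

resp = requests.post(REQUEST_URL, data=json.dumps(requestDict), headers=HEADER)
resp.raise_for_status()  # fail early on HTTP errors
print(resp.json())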
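
The raw JSON is hard to read, so here is a minimal sketch of turning it into a label-to-documents mapping. It assumes the response carries a top-level "clusters" list whose entries have "label" and "documents" (document id) keys. Check this against the output of your plugin version, since the field names here are an assumption. `summarize_clusters` and `sample_response` are hypothetical names and data used only for illustration.

```python
def summarize_clusters(response):
    """Map each cluster label to the list of document ids in that cluster.

    Assumes `response` has a top-level "clusters" list and that every
    cluster has "label" and "documents" keys (an assumption, see above).
    """
    groups = {}
    for cluster in response.get("clusters", []):
        label = cluster.get("label", "<unlabeled>")
        groups[label] = list(cluster.get("documents", []))
    return groups


# Hand-written stand-in for resp.json(); not real plugin output.
sample_response = {
    "hits": {"hits": []},
    "clusters": [
        {"id": 0, "label": "Lotus Loft", "documents": ["4", "6", "7"]},
        {"id": 1, "label": "Lazy Dog", "documents": ["1", "3"]},
    ],
}

for label, doc_ids in summarize_clusters(sample_response).items():
    print(f"{label}: {', '.join(doc_ids)}")
```

With a live cluster you would pass `resp.json()` instead of `sample_response`.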

By the way, Solr also uses Carrot2 to cluster documents.
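
If you only need a rough grouping outside Elasticsearch, here is a small pure-Python sketch using the seven example documents from the question. This is not what Carrot2 does. It is a swapped-in, much simpler technique: Jaccard similarity on word sets, with documents linked into one group whenever their similarity reaches a threshold (single linkage via union-find). The threshold of 0.1 and the helper names are assumptions chosen for this toy data. Real data would need tuning, or a proper approach such as TF-IDF vectors with a clustering library.

```python
import re
from itertools import combinations

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Barcelona is a great city",
    3: "The fast orange fox jumps over the lazy dog",
    4: "Lotus loft Room - Bear Mountains Neighbourhood",
    5: "I do not like to eat fish",
    6: '"Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4',
    7: "Lotus Loft",
}


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similar(docs, threshold=0.1):
    """Link documents whose similarity is >= threshold (single linkage)."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    token_sets = {doc_id: tokens(text) for doc_id, text in docs.items()}
    for a, b in combinations(docs, 2):
        if jaccard(token_sets[a], token_sets[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(groups.values())


groups = group_similar(docs)
print(groups)  # [[1, 3], [2], [4, 6, 7], [5]]
```

Comparing every pair is quadratic in the number of documents, so this only suits small sets. For a whole index you would want something like MinHash or the plugin above.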

derek.z