
I have a minhash field generated for some text (based on the minhash algorithm). My question is: is it possible to somehow complement the prefix query with wildcards? The problem is that the hashed string values depend on the content (text) position of the shingles/tokens, so the first few characters (the prefix) might not always exactly match for similar content. Would it be possible to add a wildcard in front of the prefix for a query, e.g. *3AF8659GJ?

EDIT: I guess I wasn't thinking hard enough about the problem. The hash differences can be anywhere in the hash string (depending on where the text differences occur in the content). So I guess the best (and probably only) way would be edit distance with some threshold.

E.g. put all hashes into an array and sort them in lexical order (or how would you sort hex strings?), then compare each one only to the next k documents until the edit-distance threshold is exceeded, and put the duplicates into a separate array.
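Something like this minimal sketch in TypeScript (since I'd implement it in Node.js anyway; the Doc shape, the window size k, and the threshold value are just placeholder assumptions):

interface Doc {
  id: string;
  minhash: string; // hex string, so plain lexicographic sorting works
}

// Classic single-row dynamic-programming Levenshtein distance.
function editDistance(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp[i-1][0]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // dp[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,                              // deletion
        dp[j - 1] + 1,                          // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Sort by minhash, then compare each document only to its next k
// neighbors; pairs within the threshold are collected as duplicates.
function findDuplicates(docs: Doc[], k = 5, threshold = 10): Array<[string, string]> {
  const sorted = [...docs].sort((x, y) => x.minhash.localeCompare(y.minhash));
  const duplicates: Array<[string, string]> = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j <= i + k && j < sorted.length; j++) {
      if (editDistance(sorted[i].minhash, sorted[j].minhash) <= threshold) {
        duplicates.push([sorted[i].id, sorted[j].id]);
      }
    }
  }
  return duplicates;
}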

MMMM
  • So, your idea is to compare only a suffix? If yes, are you ready to reindex your data? – Val Mar 28 '19 at 11:50
  • See my comment below. I was thinking of comparing the prefix first, but text differences may appear not only at the beginning but also at the end or anywhere in between, so edit distance is, I guess, the best approach. But Elasticsearch's fuzzy search is limited to an edit distance of 2, which is far too small here. I would have to implement a custom search in Node.js based on edit distance. – MMMM Mar 28 '19 at 13:03

1 Answer


Searching by suffix is highly discouraged for performance reasons, as explained in the official documentation:

In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?
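Just to make it concrete, the leading-wildcard query you were considering would look like the request below, and it is exactly the kind of query this warning is about (using the example pattern from your question and the minhash field/index defined further down):

POST minhash-index/_search
{
  "query": {
    "wildcard": {
      "minhash": {
        "value": "*3AF8659GJ"
      }
    }
  }
}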

There's still a way to achieve what you want by using a cleverly crafted analyzer. The idea is to index only the end of the minhash. You can achieve it as described below.

First, create an index with the following analyzer:

PUT minhash-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "suffix": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "reverse",
              "substring",
              "reverse"
            ]
          }
        },
        "filter": {
          "substring": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "minhash": {
          "type": "text",
          "analyzer": "suffix",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

The idea of the suffix analyzer is that it will index all suffixes of length 1 to 10 (you can decide to index longer suffixes) of each minhash that you throw into your index.

So for instance, for the minhash C50FD711C2C43287351892A4D82F44B055F048C46D2C54197AC1D1E921F11E6699C4057C4B93907518E6DCA51A672D3D3E419160DAE276CB7716D11B94D8C3BB2E4A591329B7AF973D17A7F9336342FFAAFD4D, it will index all the following suffixes:

  • d
  • 4d
  • d4d
  • fd4d
  • afd4d
  • aafd4d
  • faafd4d
  • ffaafd4d
  • 2ffaafd4d
  • 42ffaafd4d
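
If you want to double-check which tokens actually get indexed, you can run the minhash through the _analyze API with the suffix analyzer defined above; it should return exactly the ten tokens listed:

POST minhash-index/_analyze
{
  "analyzer": "suffix",
  "text": "C50FD711C2C43287351892A4D82F44B055F048C46D2C54197AC1D1E921F11E6699C4057C4B93907518E6DCA51A672D3D3E419160DAE276CB7716D11B94D8C3BB2E4A591329B7AF973D17A7F9336342FFAAFD4D"
}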

Then you can easily search for and find the above minhash with the following query. Note that the search_analyzer is set to standard so that the query string is not split into suffixes itself, but is still lowercased to match the indexed tokens:

POST minhash-index/_search
{
  "query": {
    "match": {
      "minhash": "42FFAAFD4D"
    }
  }
}
Val
  • Thanks a lot, that's a nice approach. However, the problem is that it cannot be determined in advance where the content-based differences/similarities between two documents are. The minhash hash-key positions of the individual hex hashes are simply concatenated, and they are calculated from the shingles of the text. So where the hash-key differences end up depends on the position of the text differences in the content: at the beginning, at the end, or somewhere in the middle. Another approach would be the edit distance of a document compared to other documents. – MMMM Mar 28 '19 at 12:59
  • So I was thinking about another approach where I first pull all documents and put them into an array. Then I sort the documents in minhash order and compare only the next k documents with some edit-distance threshold to detect the duplicates... – MMMM Mar 28 '19 at 13:00