
Currently I've indexed my MongoDB collection into Elasticsearch, which is running in a Docker container. I can query a document by its exact name, but Elasticsearch fails to match the query when it contains only part of the name. Here is an example:

>>> es = Elasticsearch('0.0.0.0:9200')
>>> es.indices.get_alias('*')
{'mongodb_meta': {'aliases': {}}, 'sigstore': {'aliases': {}}, 'my-index': {'aliases': {}}}
>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS'}}})
>>> x
{'took': 198, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 1, 'relation': 'eq'}, 'max_score': 8.062855, 'hits': [{'_index': 'sigstore', '_type': 'sigs', '_id': '5d66c23228144432307c2c49', '_score': 8.062855, '_source': {'id': 1, 'name': 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS', 'description': 'http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GLYCOLYSIS_GLUCONEOGENESIS', 'members': ['ACSS2', 'GCK', 'PGK2', 'PGK1', 'PDHB', 'PDHA1', 'PDHA2', 'PGM2', 'TPI1', 'ACSS1', 'FBP1', 'ADH1B', 'HK2', 'ADH1C', 'HK1', 'HK3', 'ADH4', 'PGAM2', 'ADH5', 'PGAM1', 'ADH1A', 'ALDOC', 'ALDH7A1', 'LDHAL6B', 'PKLR', 'LDHAL6A', 'ENO1', 'PKM2', 'PFKP', 'BPGM', 'PCK2', 'PCK1', 'ALDH1B1', 'ALDH2', 'ALDH3A1', 'AKR1A1', 'FBP2', 'PFKM', 'PFKL', 'LDHC', 'GAPDH', 'ENO3', 'ENO2', 'PGAM4', 'ADH7', 'ADH6', 'LDHB', 'ALDH1A3', 'ALDH3B1', 'ALDH3B2', 'ALDH9A1', 'ALDH3A2', 'GALM', 'ALDOA', 'DLD', 'DLAT', 'ALDOB', 'G6PC2', 'LDHA', 'G6PC', 'PGM1', 'GPI'], 'user': 'naji.taleb@medimmune.com', 'type': 'public', 'level1': 'test', 'level2': 'test2', 'time': '08-28-2019 14:03:29 EDT-0400', 'source': 'File', 'mapped': [''], 'notmapped': [''], 'organism': 'human'}}]}}

When using the full name of the document, Elasticsearch successfully finds it. But this is what happens when I attempt to search with part of the name or a wildcard:

>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG'}}})
>>> x
{'took': 17, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}}



>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG*'}}})
>>> x
{'took': 3, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}}
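As an aside, a match query treats the * as a literal character, so it will never expand the pattern. If the name field is mapped as keyword (the default dynamic mapping also stores the raw string under name.keyword), a wildcard query is the query type that matches patterns against the whole untokenized value. A sketch of the request body, assuming that mapping:

```python
# Assumption: `name` is keyword-mapped, so the whole value is indexed as one
# term. A `wildcard` query pattern-matches against that term; `match` with
# 'KEGG*' just analyzes the text and looks for the literal token 'kegg*'.
body = {"query": {"wildcard": {"name": {"value": "KEGG*"}}}}
# x = es.search(index='sigstore', body=body)
```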

In addition to the default index settings, I also tried creating an index that uses an nGram token filter to enable partial search, but that didn't work either. These are the settings I used for that index:

{
  "sigstore": {
    "aliases": {},
    "mappings": {},
    "settings": {
      "index": {
        "max_ngram_diff": "99",
        "number_of_shards": "1",
        "provided_name": "sigstore",
        "creation_date": "1579200699718",
        "analysis": {
          "filter": {
            "substring": {
              "type": "nGram",
              "min_gram": "1",
              "max_gram": "20"
            }
          },
          "analyzer": {
            "str_index_analyzer": {
              "filter": [
                "lowercase",
                "substring"
              ],
              "tokenizer": "keyword"
            },
            "str_search_analyzer": {
              "filter": [
                "lowercase"
              ],
              "tokenizer": "keyword"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "3nf915U6T9maLdSiJozvGA",
        "version": {
          "created": "7050199"
        }
      }
    }
  }
}

and this is the corresponding Python command that created it:

es.indices.create(index='sigstore', body={
    "mappings": {},
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "str_search_analyzer": {
                        "tokenizer": "keyword",
                        "filter": ["lowercase"]
                    },
                    "str_index_analyzer": {
                        "tokenizer": "keyword",
                        "filter": ["lowercase", "substring"]
                    }
                },
                "filter": {
                    "substring": {
                        "type": "nGram",
                        "min_gram": 1,
                        "max_gram": 20
                    }
                }
            }
        },
        "max_ngram_diff": "99"
    }
})
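Since a match for 'KEGG' returns nothing even with these settings, it helps to check what the substring filter should be emitting. This is a pure-Python sketch (not Elasticsearch code) of what an nGram filter with min_gram 1 and max_gram 20 produces on the lowercased keyword token:

```python
# Pure-Python sketch of the `substring` nGram filter above: every substring
# of length min_gram..max_gram of the single lowercased keyword token.
def ngrams(token, min_gram=1, max_gram=20):
    token = token.lower()  # the `lowercase` filter runs before `substring`
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }

grams = ngrams("KEGG_GLYCOLYSIS_GLUCONEOGENESIS")
print("kegg" in grams)  # → True
```

So the index-time analyzer, if it were actually applied to the field, would emit 'kegg' as a gram; the failure suggests the field isn't using this analyzer at all.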

I use mongo-connector as the pipeline between my MongoDB collection and Elasticsearch. This is the command I use to start it:

mongo-connector -m mongodb://username:password@xx.xx.xxx.xx:27017/?authSource=admin -t elasticsearch:9200 -d elastic2_doc_manager -n sigstore.sigs

I'm unsure why Elasticsearch can't find a partial match, and wonder whether there is a setting I'm missing or a crucial mistake I've made somewhere. Thanks for reading.

Versions

MongoDB 4.0.10

elasticsearch==7.1.0

elastic2-doc-manager[elastic5]

najitaleb
  • Hi, you set str_search_analyzer in the settings but didn't set the mapping. Can you provide the mapping of your index? I suspect all your fields are mapped as keyword, which is why only exact matches work. More about how to set the mapping: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/indices-put-mapping.html – Gabriel Jan 17 '20 at 05:06
  • Hi Gabriel. Thanks for taking a look. Here is a gist of the mappings of the index. https://gist.github.com/najitaleb/5c778b098d10ffd69e4eb36de2b6947b I'm looking through the page you posted and I'm not sure which of those options I would need to change. Would it be expand_wildcards? – najitaleb Jan 17 '20 at 13:33
  • Actually, I think I see what you're saying now. Should I switch the type of the 'name' field from keyword to text? Would that allow me to search for words with just a part of the word? – najitaleb Jan 17 '20 at 16:37
  • I've changed some settings but I'm still not getting a match. Here is my current mappings and settings: https://gist.github.com/najitaleb/11798f1b6cc95112c35aadd33fe42eb7 – najitaleb Jan 17 '20 at 17:26

1 Answer


Updated after checking your gist:

You need to apply the mapping to your field as described in the docs (cf. the first link I shared in the comments).

You need to do it after applying the settings on your index; according to your gist, that's line 11.

Something like:

PUT /your_index/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "ignore_above": 256,
      "fields": {
        "str_search_analyzer": {
          "type": "text",
          "analyzer": "str_search_analyzer"
        }
      }
    }
  }
}

After you set the mapping, you need to apply it to your existing documents using update_by_query:

https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-update-by-query.html
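Both steps can be scripted from Python. A sketch, assuming `es` is a connected elasticsearch-py client and the index/field names match the question (the helper name is hypothetical):

```python
# Mapping from the answer above: keep `name` as keyword for exact matches and
# add a text sub-field analyzed with the custom analyzer for partial matches.
NAME_MAPPING = {
    "properties": {
        "name": {
            "type": "keyword",
            "ignore_above": 256,
            "fields": {
                "str_search_analyzer": {
                    "type": "text",
                    "analyzer": "str_search_analyzer",
                }
            },
        }
    }
}

def apply_partial_search_mapping(es, index="sigstore"):
    """Hypothetical helper: add the sub-field, then re-save existing docs."""
    es.indices.put_mapping(index=index, body=NAME_MAPPING)
    # update_by_query with no script rewrites each document in place, so
    # existing data gets analyzed into the new name.str_search_analyzer field.
    es.update_by_query(index=index, conflicts="proceed")
```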

You can then keep using a term search on the name field, as it is indexed with a keyword mapping (exact match), and search the sub-field name.str_search_analyzer with part of the word.

your_keyword = 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS'  # or 'KEGG*'

x = es.search(index='sigstore', body={
    'query': {
        'bool': {
            'should': [
                {'term': {'name': your_keyword}},
                {'match': {'name.str_search_analyzer': your_keyword}}
            ]
        }
    }
})
Gabriel
  • Thanks for the response, Gabriel. I tried the settings you suggested and searched again, but I'm still getting 0 matches and unsure why. I've attached a gist with the steps I took to implement your answer. https://gist.github.com/najitaleb/cdf6a8aa0363580f0841e2dacec859d2 – najitaleb Jan 21 '20 at 14:35
  • 1
    I updated the answer and reply in your gist. You need to update the mapping and reindex your data to apply the changes OR you need to delete your data and index again so your data will be indexed with the correct mapping. – Gabriel Jan 22 '20 at 01:35
  • I've decided to make a few changes since the last post, but I've followed your advice on remapping and using update_by_query. The update_by_query command works successfully and I've also decided to add an analyzer. I'm now able to get exact matches with the new settings after reindexing, but I am still not able to get partial matches even when I use the edge_ngram tokenizer. I've attached a gist of my new settings if you're able to help. Thanks. https://gist.github.com/najitaleb/6ac352191fdaa132a40efce9c32c59c2 – najitaleb Jan 22 '20 at 20:24
  • 1
    Hi, great that you can make it work, and sure if I can help, I'll help : ) You can check the analyze api it will help you to debug and build your analyzer to make it work with your search case. https://www.elastic.co/guide/en/elasticsearch/reference/7.5/indices-analyze.html#indices-analyze – Gabriel Jan 23 '20 at 04:53
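The analyze API mentioned in the last comment can also be called from Python to see exactly which tokens an analyzer emits. A sketch, assuming `es` is a connected elasticsearch-py client and the index defines str_index_analyzer (the function name is hypothetical):

```python
# Hypothetical debugging helper: run a string through a named analyzer on the
# index and return the tokens it emits. If 'kegg' is missing from the result,
# the index-time analyzer is not producing the grams partial search needs.
def analyzer_tokens(es, index="sigstore", analyzer="str_index_analyzer",
                    text="KEGG_GLYCOLYSIS_GLUCONEOGENESIS"):
    resp = es.indices.analyze(index=index, body={
        "analyzer": analyzer,
        "text": text,
    })
    return [t["token"] for t in resp["tokens"]]
```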