I have a collection hosted on Atlas.

I have declared an Atlas Search index with the default configuration, but I am unable to use it to find documents that partially match the search text.

For instance, I have the following documents:

[
  {
    _id: 'ABC123',
    designation: 'ENPHASE IQ TERMINAL CABLE 3PH-1 UD',
    supplierIdentifier: 205919
  },
  {
    _id: 'DEF456',
    designation: 'ENPHASE CABLE VERT IQ 60/72CELLS 400VAC',
    supplierIdentifier: 205919
  },
  {
    _id: 'GHI789',
    designation: 'P/SOLAR PC ASTROENERGY 275W 60 CELULAS',
    supplierIdentifier: 206382
  }
]

If I use the text operator to search for "EN", nothing is returned:

[{ "$search" : { "index" : "default", "text" : { "query" : "EN", "path" : { "wildcard" : "*"}}, "count": {"type": "total"}}}]
No result

But if I use the regex operator, my documents are correctly returned:

db.testproducts.aggregate([{ "$search" : { "index" : "default", "regex" : { "query" : "(.*)EN(.*)", "allowAnalyzedField" : true, "path" : { "wildcard" : "*"}}, "count": {"type": "total"}}}])
[
  {
    _id: 'ABC123',
    designation: 'ENPHASE IQ TERMINAL CABLE 3PH-1 UD',
    supplierIdentifier: 205919
  },
  {
    _id: 'DEF456',
    designation: 'ENPHASE CABLE VERT IQ 60/72CELLS 400VAC',
    supplierIdentifier: 205919
  },
  {
    _id: 'GHI789',
    designation: 'P/SOLAR PC ASTROENERGY 275W 60 CELULAS',
    supplierIdentifier: 206382
  }
]

As the regex operator is pretty slow, how can I achieve the same result with the text operator?

gfyhser

1 Answer

gfyhser, you have a few options, and I'm not sure which one you will like best, as they all have limitations.

Option 1, you can specify a path. As you can imagine, wildcard paths combined with leading and trailing regex wildcards can be expensive. If you know the path you want to search is designation, performance will be better if you change your existing query to:

db.testproducts.aggregate([{ "$search" : { "index" : "default", "regex" : { "query" : "(.*)EN(.*)", "allowAnalyzedField" : true, "path" : "designation" }, "count": {"type": "total"}}}])
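
If the search term comes from user input, a small helper can build the same Option 1 pipeline for any term. This is only a sketch: `buildRegexSearch` and `escapeLuceneRegex` are hypothetical names of my own, not part of any MongoDB API, and the list of escaped characters is my assumption about which characters act as operators in Lucene regular expressions.

```javascript
// Hypothetical helpers (not part of any MongoDB API): they build the $search
// stage from Option 1 for one explicit path instead of a wildcard path.

// Backslash-escape characters assumed to be operators in Lucene regex syntax.
function escapeLuceneRegex(term) {
  return term.replace(/[.?+*|{}\[\]()"\\#@&<>~]/g, "\\$&");
}

// Returns an aggregation pipeline with a single $search stage.
function buildRegexSearch(term, path, index = "default") {
  return [
    {
      $search: {
        index,
        regex: {
          // Same "contains" pattern as the original query.
          query: `(.*)${escapeLuceneRegex(term)}(.*)`,
          allowAnalyzedField: true,
          path,
        },
        count: { type: "total" },
      },
    },
  ];
}

const pipeline = buildRegexSearch("EN", "designation");
console.log(JSON.stringify(pipeline));
// In mongosh, the result could then be passed to:
//   db.testproducts.aggregate(pipeline)
```

Escaping matters here because a term such as `3PH-1.` would otherwise have its `.` interpreted as "any character" rather than a literal dot.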

Option 2, you can refine your search. Ask yourself if you are truly looking for Enphase and Energy wherever they appear in the same result.

Option 3, the final option, is somewhat experimental for me because I need to spend more time on it. I simply want to help. It might be the best performing: it involves reversing your tokens with a custom analyzer, both when indexing and when querying, because that can speed up leading wildcard queries. If you don't mind a bit of complexity, here is how it would look. Let me know if it works out, as I don't use regular expressions much these days.

I created a custom analyzer on the sample_airbnb.listings_and_reviews dataset to search with leading wildcard characters. The index looks like:

{
  "analyzer": "lucene.keyword",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": [
        {
          "dynamic": true,
          "type": "document"
        },
        {
          "type": "string"
        }
      ],
      "summary": {
        "analyzer": "fastRegex",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "fastRegex",
      "tokenFilters": [
        {
          "type": "reverse"
        }
      ],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}
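
To illustrate why a reverse token filter can help, here is a rough plain-JavaScript sketch of the idea. It is not how Atlas Search is implemented internally. The function names are hypothetical, and it assumes the query pattern gets reversed in the same way as the indexed tokens. Once both sides are reversed, a leading-wildcard pattern such as `*400VAC` becomes a prefix test, which a sorted term index can typically answer without scanning every term.

```javascript
// Plain-JavaScript illustration only; all names here are hypothetical.
const reverse = (s) => [...s].reverse().join("");

// Mimics a keyword tokenizer plus a "reverse" token filter:
// the whole field value is kept as one token, stored reversed.
function indexReversed(values) {
  return values.map((v) => ({ original: v, token: reverse(v) }));
}

// A leading-wildcard pattern "*suffix" turns into a prefix test on the
// reversed token (assumes the pattern is reversed like the tokens are).
function matchLeadingWildcard(entries, pattern) {
  if (!pattern.startsWith("*")) {
    throw new Error("expected a pattern like '*suffix'");
  }
  const reversedSuffix = reverse(pattern.slice(1));
  return entries
    .filter((e) => e.token.startsWith(reversedSuffix))
    .map((e) => e.original);
}

const entries = indexReversed([
  "ENPHASE IQ TERMINAL CABLE 3PH-1 UD",
  "ENPHASE CABLE VERT IQ 60/72CELLS 400VAC",
  "P/SOLAR PC ASTROENERGY 275W 60 CELULAS",
]);

console.log(matchLeadingWildcard(entries, "*400VAC"));
// prints [ 'ENPHASE CABLE VERT IQ 60/72CELLS 400VAC' ]
```

Note that a pattern with wildcards on both ends, such as `*EN*`, still has a leading wildcard after reversal, so the benefit sketched here applies mainly to "ends with" style patterns.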

And a query that exploits this speed and has the flexibility to potentially match both of your desired terms would look like this:

[
  {
    '$search': {
      'index': 'reviews_search', 
      'compound': {
        'should': [
          {
            'wildcard': {
              'query': '*cated*', 
              'path': 'summary', 
              'allowAnalyzedField': true
            }
          }
        ]
      }
    }
  }
]
Nice-Guy
  • When I try to use this custom analyzer in json editor, I get an error: "Expected double-quoted property name in JSON at position 500" – Matt Mar 02 '23 at 09:11