0

With a MongoDB collection structured like this:

{
    title: {
      lang1: "Title in lang1",
      lang2: "Title in lang2"
    },
    description: {
      lang1: "Description in lang1",
      lang2: "Description in lang2"
    },
    keywords: ['keyword1','keyword2']
}

What is the best index structure for this query :

{$or: [ 
    {"title.lang1": /searchTerm/i},
    {"title.lang2": /searchTerm/i},
    {"description.lang1": /searchTerm/i},
    {"description.lang2": /searchTerm/i},
    {"keywords": /searchTerm/i},
]}

What I have tested so far (using MongoDB Compass):

  • Option 1 : no index. I got a warning No index available for this query..
  • Option 2 : create a compound index on all these fields. I got a warning No index available for this query..
  • Option 3 : create an index per field, I got a warning Shard results differ. Query is slower than options 1 or 2.

Is there an efficient option for this use case?

PS: Using Text Index is not an option as regex searches are required for fuzzy search (search terms are tokenized).

EDIT: In this video from MongoDB Atlas team you can see some quite fast fuzzy search in action performed with a feature called Full Text Search available on MongoDB Atlas (paid hosted solution). I want to achieve a similar feature with the free Community Server. The presenter Eliot Horowitz is talking about embedding Lucene in MongoDB. Maybe that's the only way?

Ludovic
  • 223
  • 3
  • 9
  • Regex + Index could let to poor performance. Because MongoDB could only made a GOOD use of the indexes, if you regex is a “prefix expression” ( you expression start with ^). https://docs.mongodb.com/manual/reference/operator/query/regex/#index-use – Haniel Baez Jan 04 '21 at 21:20
  • The regex I'm using implements i18n fuzzy searches, leading to complex expressions such as `/.*s.{0,2}[e,é,ë].{0,2}[a,á,à,ä].{0,2}r.{0,2}[c,ç].{0,2}h.{0,2} .{0,2}t.{0,2}[e,é,ë].{0,2}r.{0,2}m.*/i`. Fuzzy search feature is available on Mongo Atlas (paid hosted service, very likely using an index for performance) but not provided with on-premises community server. Are you aware of any other option to implement fuzzy searches with a good performance level? – Ludovic Jan 04 '21 at 22:10
  • Have you tried using a [collation](https://docs.mongodb.com/manual/reference/collation/index.html#collation) to allow case- and diacritic-insensitive searches? It may serious simplify the regex if the index already considers a,á,à, and ä to be equivalent. – Joe Jan 05 '21 at 01:02
  • I'm not sure if this will impact performance, the leading and trailing `.*` are not necessary, and are implied if the regex is not anchored with `^` or `$`. – Joe Jan 05 '21 at 01:04
  • What are you actually trying to accomplish? That regex looks overly broad. It looks like you were intending to match "search term" with case and accent variations, but that regex would also match "Salem Arches term rental" and any number of other strings as well. – Joe Jan 05 '21 at 01:17
  • @Joe : (1) I've tried using `fr` collation to simplify the regex by removing all `[e,é,ë]` patterns, but I didn't manage to get it working correctly. For exemple, searching "glacon" was not returning "glaçon". – Ludovic Jan 05 '21 at 13:21
  • @Joe : (2) Your point on leading and trailing `.*` is good, previous tests I forgot to get rid of, it's fixed. – Ludovic Jan 05 '21 at 13:22
  • @Joe : (3) I'm trying to get a simple fuzzy search feature to accept minor typing errors, for exemple searching "sarchTerm" or "searcchTerm" should return "searchTerm". Ideally searching "searchzTerm" would also but didn't find a way. – Ludovic Jan 05 '21 at 13:22
  • It is the `strength` of the collation, not the `locale` that determines whether case and diacritics are considered difference. – Joe Jan 05 '21 at 18:42
  • @Joe I set `strength` value to 1, which performs comparisons of the base characters only, ignoring other differences such as diacritics and case. But for any reason, `ç` was not working. – Ludovic Jan 05 '21 at 19:28
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226893/discussion-between-joe-and-ludovic). – Joe Jan 05 '21 at 20:24

0 Answers0