MongoDB multilingual search: which schema gives faster search results, nested fields or language-specific fields at the top level?

Question

We are implementing fuzzy search on products using an Atlas Search index, and we query it through Mongoose. The search has to be multilingual, so we are currently using the following schema for the product:

{
  language: "de",
  name: String,
  description: String,
  translation: {
     en: {
        name: String,
        description: String
     },
     fr: {
        name: String,
        description: String
     }
  }
}

Is the schema above a good fit in terms of search performance? There will be thousands or more read hits, and since this is an e-commerce system, the number of search queries may eventually reach the millions. Is this nested structure good for querying, or should we opt for one of the following alternatives?

Option 1: language-specific fields at the top level, with the language shorthand as a suffix:

    {
         name_de: String,
         description_de: String,
         name_en: String,
         description_en: String,
         name_fr: String,
         description_fr: String
    }

Option 2: language-specific values nested under each field name, with the language as the key:

    {
         name: {
            en: String,
            de: String,
            fr: String
         },
         description: {
            en: String,
            de: String,
            fr: String
         }
    }

Option 3: the language as the top-level key, with the field names nested inside that object:

    {
         en: {
            name: String,
            description: String
         },
         fr: {
            name: String,
            description: String
         }
    }

Or any other schema that will be suitable for this scenario?

The search is performed in the language selected by the user. So if a user picks French as their preferred language, we look for the keyword they typed in French.

P.S. - There are more fields than just name and description which are also language specific.

score 1 · Answer 1 · answered Dec 13 '20 at 11:33

I would opt for option 1 (flat, suffixed fields such as name_de) because of the limited support for nested fields in Atlas Search, though option 2 would work as well. Here is how I would define the index in your case:

{
  "mappings": {
    "fields": {
      "name_de": {
        "analyzer": "lucene.german",
        "type": "string"
      },
      "name_fr": {
        "analyzer": "lucene.french",
        "type": "string"
      },
      "name_en": {
        "analyzer": "lucene.english",
        "type": "string"
      },
      "description_de": {
        "analyzer": "lucene.german",
        "type": "string"
      },
      "description_fr": {
        "analyzer": "lucene.french",
        "type": "string"
      },
      "description_en": {
        "analyzer": "lucene.english",
        "type": "string"
      }
    }
  }
}

This way, you get the benefits of highlighting, which could be extra helpful if your description field is long. You will also get better stop-word and diacritics handling out of the box. If you have any trouble, let me know here and I will help.
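To make the querying side concrete, here is a minimal sketch of how a search against the index above could be assembled for the language the user selected. It is an illustration under assumptions, not a tested production query: the helper name `buildSearchPipeline`, the `SUPPORTED_LANGUAGES` list, the index name `default`, the result limit, and `maxEdits: 1` are all made up for this example.

```javascript
// Hypothetical helper: the name, the language list and the defaults are assumptions.
const SUPPORTED_LANGUAGES = ["de", "fr", "en"];

function buildSearchPipeline(keyword, language, limit = 20) {
  if (!SUPPORTED_LANGUAGES.includes(language)) {
    throw new Error(`Unsupported language: ${language}`);
  }
  // Only the fields of the selected language are searched.
  const nameField = `name_${language}`;
  const descriptionField = `description_${language}`;

  return [
    {
      // $search has to be the first stage of the pipeline.
      $search: {
        index: "default", // assumed index name
        text: {
          query: keyword,
          path: [nameField, descriptionField],
          fuzzy: { maxEdits: 1 }, // tolerate one typo per term
        },
        highlight: { path: descriptionField },
      },
    },
    { $limit: limit },
    {
      $project: {
        [nameField]: 1,
        [descriptionField]: 1,
        score: { $meta: "searchScore" },
        highlights: { $meta: "searchHighlights" },
      },
    },
  ];
}
```

With Mongoose, the result could be passed to the model's aggregate call, for example `Product.aggregate(buildSearchPipeline("chaussure", "fr"))`, where `Product` stands for whatever your product model is called.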

With your suggestion I was a little concerned about the Atlas Search index size, but instead of growing, it shrank with the modified mapping, which makes this option good to go. One thing I didn't understand in your answer, though: how does it support diacritics better? The schema we are currently using also has language-specific fields, so it supports stop words and diacritics as well. — Avani Khabiya, Dec 14 '20 at 11:46
Stop words are in language analyzers by default. You can specify how to handle diacritics in the index definition. — Nice-Guy, Feb 10 '21 at 00:29
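
As a rough illustration of the comment above, here is one way diacritics handling could be spelled out in the index definition, and how the mapping could be generated when there are more language-specific fields than name and description. This is a sketch under assumptions: the helper `buildIndexDefinition`, the custom analyzer name `diacriticFolder`, and the `folded` multi name are invented for the example, and whether you need a separate folding analyzer at all depends on what the built-in language analyzers already do for your data.

```javascript
// Built-in language analyzers, as used in the answer's index definition.
const LANGUAGE_ANALYZERS = {
  de: "lucene.german",
  fr: "lucene.french",
  en: "lucene.english",
};

// Hypothetical helper: generates "<field>_<lang>" mappings for every combination.
function buildIndexDefinition(baseFields, languages) {
  const fields = {};
  for (const base of baseFields) {
    for (const lang of languages) {
      const analyzer = LANGUAGE_ANALYZERS[lang];
      if (!analyzer) {
        throw new Error(`No analyzer configured for language: ${lang}`);
      }
      fields[`${base}_${lang}`] = {
        type: "string",
        analyzer,
        // Alternate analysis of the same field with diacritics folded away.
        multi: {
          folded: { type: "string", analyzer: "diacriticFolder" },
        },
      };
    }
  }
  return {
    // Custom analyzer (assumed name) that lowercases and folds diacritics.
    analyzers: [
      {
        name: "diacriticFolder",
        tokenizer: { type: "standard" },
        tokenFilters: [{ type: "lowercase" }, { type: "icuFolding" }],
      },
    ],
    mappings: { dynamic: false, fields },
  };
}
```

Calling `buildIndexDefinition(["name", "description"], ["de", "fr", "en"])` yields the same six fields as the answer's definition, each with an extra `folded` variant that a query could target with a path such as `{ value: "name_fr", multi: "folded" }`.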
