3

I'm doing a text search using the $text / $search operator in the aggregation below. However, I'd like to get the number of occurrences of the matched keywords (not only the textScore that MongoDB provides).

myCollection = await MyCollection.aggregate([
    {
        $match: {
            $text: {
                $search: `${keyword}`
            }
        }
    },
    { $sort: { score: { $meta: "textScore" } } },
    {
        $project: {
            _id: 1,
            score: { $meta: "textScore" },
            // other fields here
        }
    },
    { $match: { score: { $gt: 1.0 } } },
    { $limit: 100 },
])

My collection looks like this (the text index is only on description):

var myCollectionSchema = mongoose.Schema(
    {
        name: String,
        email: String,
        description: String
    })

myCollectionSchema.index({ 'description': 'text' }, { background: true })

mongoose.model('myCollection', myCollectionSchema);

So how can I achieve this using MongoDB only? Thanks.

Abdelmlak Dhif
  • Please provide some document samples from the collection. – Vikas Keskar Feb 22 '21 at 08:27
  • F.Y.I. the score algorithm is complex. We cannot reverse-engineer the `textScore` value to get the frequency of occurrence. Check here: https://stackoverflow.com/questions/43041494/mongodb-full-text-search-score-what-does-score-means – Dheemanth Bhat Feb 22 '21 at 11:28
  • @DheemanthBhat F.Y.I. the score algorithm is complex indeed, yet it is open source. There is no need to reverse engineer anything. Check here: https://github.com/mongodb/mongo/blob/8f11dec55/src/mongo/db/fts/fts_spec.cpp#L173 – Alex Blex Feb 22 '21 at 12:02
  • The text index itself won't store duplicates, therefore there is no way to get the count from the stored index; it won't help you here. The split + size solution works, but please know that this will have an impact on the DB at scale. – Gandalf the White Feb 23 '21 at 07:02

3 Answers

2

Try this:

let keyword = new RegExp("Hi", "i");

db.myCollection.aggregate([
    {
        $addFields: {
            // count the whitespace-separated words in "description"
            // that match the keyword regex
            num_of_occ: {
                $size: {
                    $filter: {
                        input: { $split: ["$description", " "] },
                        as: "text",
                        cond: {
                            $regexMatch: { input: "$$text", regex: keyword }
                        }
                    }
                }
            }
        }
    }
]);

Output:

[
  {
    "_id" : ObjectId("6033a3b0406aaa04445434a1"),
    "name" : "Dheemanth",
    "description" : "Everything has hI in it.",
    "num_of_occ" : 2
  },
  {
    "_id" : ObjectId("6033a3b0406aaa04445434a2"),
    "name" : "Heath",
    "description" : "shushi ends with Hi.",
    "num_of_occ" : 2
  },
  {
    "_id" : ObjectId("6033a3b0406aaa04445434a3"),
    "name" : "Abdelmlak",
    "description" : "Hi i am Abdelmlak.",
    "num_of_occ" : 1
  },
  {
    "_id" : ObjectId("6033a3b0406aaa04445434a4"),
    "name" : "Alex",
    "description" : "missing keyword",
    "num_of_occ" : 0
  }
]
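To restrict the count to documents matched by the text search, the same $size/$filter expression can be plugged into the pipeline from the question. A sketch (it assumes the same `keyword` string from the question, containing no regex metacharacters):

// sketch: the counting expression from this answer inside the question's pipeline
myCollection = await MyCollection.aggregate([
    { $match: { $text: { $search: `${keyword}` } } },
    { $sort: { score: { $meta: "textScore" } } },
    {
        $project: {
            _id: 1,
            score: { $meta: "textScore" },
            num_of_occ: {
                $size: {
                    $filter: {
                        input: { $split: ["$description", " "] },
                        as: "text",
                        cond: { $regexMatch: { input: "$$text", regex: new RegExp(keyword, "i") } }
                    }
                }
            }
            // other fields here
        }
    },
    { $match: { score: { $gt: 1.0 } } },
    { $limit: 100 },
])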
Dheemanth Bhat
  • This answer is correct too :) However I'm accepting Alex's answer as he was first. Thank you – Abdelmlak Dhif Feb 22 '21 at 15:28
  • @AbdelmlakDhif This one is better. Both mine and Dheemanth's are flawed in the sense that they will give you wrong numbers in edge cases. Mine doesn't account for word separators: when you search for "pen" it will also count "pencil" and make 2 for "penopen". Dheemanth's is the opposite - it assumes whitespace is the only word separator. That's almost always the case in English/ASCII and a bit more complex with UTF strings (see the word-boundary sketch after these comments). – Alex Blex Feb 22 '21 at 23:43
  • I used this and got "InvalidPipelineOperator" with code 168. – gentur ariyadi Sep 09 '22 at 03:54
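For the edge cases described in the comment above, one alternative (not from either answer) is $regexFindAll with word boundaries, available from MongoDB 4.2. A minimal sketch, assuming the keyword is a plain string with no regex metacharacters:

let searchWord = "Hi";

db.myCollection.aggregate([
    { $match: { $text: { $search: searchWord } } },
    {
        $addFields: {
            num_of_occ: {
                $size: {
                    $regexFindAll: {
                        input: "$description",
                        regex: `\\b${searchWord}\\b`,   // whole words only: "pen" no longer matches "pencil"
                        options: "i"
                    }
                }
            }
        }
    }
]);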
1

The score should be higher for strings with more matches. If you need the exact number you can use $split in your $project stage:

myCollection = await MyCollection.aggregate([
    {
        $match: {
            $text: {
                $search: `${keyword}`
            }
        }
    },
    { $sort: { score: { $meta: "textScore" } } },
    {
        $project: {
            _id: 1,
            score: { $meta: "textScore" },
            // splitting on the keyword yields n + 1 parts for n occurrences
            cnt: { $add: [ { $size: { $split: ["$description", `${keyword}` ] } }, -1 ] }
            // other fields here
        }
    },
    { $match: { score: { $gt: 1.0 } } },
    { $limit: 100 },
])
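One caveat not covered by the answer itself: $split is case sensitive while the $text match is not, so occurrences that differ only in case are missed. A sketch of a case-insensitive variant of the $project stage, lowercasing both sides before splitting:

{
    $project: {
        _id: 1,
        score: { $meta: "textScore" },
        // lowercase both the field and the keyword so "Pen", "pen" and "PEN" all count
        cnt: { $add: [ { $size: { $split: [ { $toLower: "$description" }, `${keyword}`.toLowerCase() ] } }, -1 ] }
    }
}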
Alex Blex
0

When performing a text search, MongoDB assigns a score to each document that contains the search term in the indexed fields. The score represents the relevance of a document to a given text search query.

We can use the $function operator to define a custom function that implements this behavior by tokenizing the strings and removing (English) stop words.

db.restaurants.insert(
    [
        { _id: 1, subject: "coffee", author: "xyz", views: 50 },
        { _id: 2, subject: "Coffee Shopping", author: "efg", views: 5 },
        { _id: 3, subject: "Baking a cake", author: "abc", views: 90 },
        { _id: 4, subject: "baking", author: "xyz", views: 100 },
        { _id: 5, subject: "Café Con Leche", author: "abc", views: 200 },
        { _id: 6, subject: "Сырники", author: "jkl", views: 80 },
        { _id: 7, subject: "coffee and cream", author: "efg", views: 10 },
        { _id: 8, subject: "Cafe con Leche cafe", author: "xyz", views: 10 },
        { _id: 9, subject: "Just milk", author: "efg", views: 23 },
        { _id: 10, subject: "Milk with coffee", author: "xyz", views: 16 },
        { _id: 11, subject: "One dollar coffe", author: "efg", views: 23 },
        { _id: 12, subject: "$cafe and leche", author: "xyz", views: 16 },
        { _id: 13, subject: "cafe, CAFE, CaFe...", author: "xyz", views: 16 }
    ]
)

Query:

let search_keyword = "Keep calm and take a cup of Cafe (coffee)";

db.restaurants.aggregate([
    {
        "$match": {
            "$text": {
                "$search": search_keyword
            }
        }
    },
    {
        "$addFields": {
            "matches": {
                "$function": {
                    // the body runs server side and cannot see shell variables, so the
                    // search string and the stop word list (defined further down) are
                    // passed in through "args"
                    body: function (subject, search_keyword, STOP_WORDS_ENGLISH) {

                        //tokenize a string using whitespace and most punctuation as delimiters
                        //https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/tokenizer.cpp
                        function tokenizer(str) {
                            return str.split(/[^\w]+/).map(keyword => keyword.toLowerCase()).filter(token => token);
                        }

                        //tokens
                        let keywords_tokens = tokenizer(search_keyword);
                        let subject_tokens = tokenizer(subject);

                        //remove the stop words from "keywords_tokens"
                        let stop_words_removed = keywords_tokens.filter(word => !STOP_WORDS_ENGLISH.includes(word));

                        //return Array<String>
                        return subject_tokens.filter((word) => stop_words_removed.includes(word));

                    },
                    args: ["$subject", search_keyword, STOP_WORDS_ENGLISH],
                    "lang": "js"
                }
            }
        }
    }
]);

Output:

{ "_id" : 12, "subject" : "$cafe and leche", "author" : "xyz", "views" : 16, "matches" : [ "cafe" ] }
{ "_id" : 13, "subject" : "cafe, CAFE, CaFe...", "author" : "xyz", "views" : 16, "matches" : [ "cafe", "cafe", "cafe" ] }
{ "_id" : 10, "subject" : "Milk with coffee", "author" : "xyz", "views" : 16, "matches" : [ "coffee" ] }
{ "_id" : 1, "subject" : "coffee", "author" : "xyz", "views" : 50, "matches" : [ "coffee" ] }
{ "_id" : 7, "subject" : "coffee and cream", "author" : "efg", "views" : 10, "matches" : [ "coffee" ] }
{ "_id" : 8, "subject" : "Cafe con Leche cafe", "author" : "xyz", "views" : 10, "matches" : [ "cafe", "cafe" ] }
{ "_id" : 5, "subject" : "Café Con Leche", "author" : "abc", "views" : 200, "matches" : [ ] }
{ "_id" : 2, "subject" : "Coffee Shopping", "author" : "efg", "views" : 5, "matches" : [ "coffee" ] }

List of English stop words:

// https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/stop_words_english.txt
const STOP_WORDS_ENGLISH = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"];

Finally you could count the number of matches (using $size) and get the desired result.
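For example, the counting could be done by appending one more stage to the pipeline above (a sketch):

// appended as the last stage of the aggregation above:
// turn the "matches" array into a count
{
    $addFields: {
        num_of_occ: { $size: "$matches" }
    }
}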

NOTE:

  1. This could be improved by replacing diacritics with their ASCII equivalents (Café !== Cafe).
  2. As @AlexBlex points out in the comments, keywords_tokens and stop_words_removed should be calculated at the application level (see the sketch below).
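A minimal sketch of point 2, reusing search_keyword and STOP_WORDS_ENGLISH from above (the variable names are only illustrative): tokenize the search string and strip stop words once in the application, then pass only the resulting array to the server-side function.

// client side: tokenize once and remove stop words
function tokenizer(str) {
    return str.split(/[^\w]+/).map(word => word.toLowerCase()).filter(word => word);
}

const search_tokens = tokenizer(search_keyword)
    .filter(word => !STOP_WORDS_ENGLISH.includes(word));

db.restaurants.aggregate([
    { $match: { $text: { $search: search_keyword } } },
    {
        $addFields: {
            matches: {
                $function: {
                    body: function (subject, search_tokens) {
                        // same tokenizer as above, applied only to the document field
                        return subject.split(/[^\w]+/)
                            .map(word => word.toLowerCase())
                            .filter(word => word)
                            .filter(word => search_tokens.includes(word));
                    },
                    args: ["$subject", search_tokens],
                    lang: "js"
                }
            }
        }
    }
]);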
Haniel Baez
  • Please be careful with this one. JavaScript in aggregation has a very high computation cost and MongoDB recommends avoiding it, to the degree of disabling it on shared tiers on Atlas. This particular implementation is sub-optimal: `let keywords_tokens = tokenizer(search_keyword);` and `let stop_words_removed` are re-calculated for each document. It is enough to tokenize the search words only once at the application level. – Alex Blex Feb 23 '21 at 09:42
  • Hi @AlexBlex, you are **right**: `keywords_tokens` and `stop_words_removed` should be calculated at the application level. Saying that "JavaScript in aggregation has very high computation cost" is not the case for this operation; what the doc actually says is: Executing JavaScript inside an aggregation expression may decrease performance. Only use the $function operator if the provided pipeline operators cannot fulfill your application's needs. – Haniel Baez Feb 23 '21 at 15:17
  • The docs cannot say anything stronger about JavaScript functions in queries without hurting sales, but it is really bad. The whole aggregation framework is built to avoid using JavaScript in queries, map-reduce functions, and aggregation itself. It's still an option though. If the limitations are clear to readers they can test/decide if it is something they can tolerate. – Alex Blex Feb 23 '21 at 15:29