1

Let's say I have a collection with only one field BlogText. When a user searches for a word and if that word is present in BlogText, I want to:

  1. Retrieve only 10 words before the matched word and 10 words after the matched query, led and followed by an ellipsis.
  2. Also, I want to replace Matched word by <b>Matched word</b>

For example, if the searched query is 1500, I want to retrieve following:

... has been the industry's standard dummy text ever since the <b>1500<b>s, when an unknown printer took a galley of type and ...

given that original text in BlogText is:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

I know this can by done on the server as well, but I want to avoid retrieving data that I don't need(referring to 1st point).

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
dasfdsa
  • 7,102
  • 8
  • 51
  • 93
  • Your only hope is the aggregation framework, but I am not sure if it's advanced enough. – Sergio Tulentsev Aug 30 '17 at 10:33
  • *"but I want to avoid retrieving data that I don't need."* -- So then why are you proposing that the database add additional markup to results which would increase the number of bytes that it needs to return to your application client??? That would be what we call a contradiction. It's not the databases job to do this kind of thing ( part of the reason the wall of functions available on Oracle etc is not present in MongoDB ). Mark up your matches **after** the results are returned from the databse server. You should remember what terms you asked for, so simply match them and add the markup. – Neil Lunn Aug 30 '17 at 10:34
  • @SergioTulentsev ["No can do" on regex matching and replacement](https://jira.mongodb.org/browse/SERVER-11947). And forcing this through mapReduce would just be downright silly. – Neil Lunn Aug 30 '17 at 10:35
  • @NeilLunn I agree. By saying that I was referring to my 1st point. A BlogText may contain 1000s of words while I need only 11 or 13 even after b tag added. I will edit. – dasfdsa Aug 30 '17 at 10:38

1 Answers1

2

You can return a substring of a long text using aggregation.

Assuming you need a substring around first occurrence of the matched term, and a space is used as a word delimiter, the pipeline can be like this:

db.collection.aggregate([
    { $match: { BlogText:/1500/ } },
    { $project: {
        match: {
            $let: {
                vars: { pos: { $indexOfCP: [ "$BlogText", "1500" ] }},
                in: { $concat: [
                    { $reduce: {
                        input: { $slice: [ 
                            { $split: [ 
                                { $substrCP: [ "$BlogText", 0, "$$pos" ] }, 
                                " " 
                            ]}, 
                            -10 
                        ]},
                        initialValue: "",
                        in: { $concat : [ "$$value", " ", "$$this" ] }
                    }},
                    { $reduce: {
                        input: { $slice: [ 
                            { $split: [ 
                                { $substrCP: [  "$BlogText", "$$pos", { $strLenCP: "$BlogText" } ] }, 
                                " " 
                            ]}, 
                            10 
                        ]},
                        initialValue: "",
                        in: { $concat : [ "$$value", " ", "$$this" ] }
                    }}            
                ]}
            }
        } 
    }}
]);
Alex Blex
  • 34,704
  • 7
  • 48
  • 75