How to count number of documents that have a maximum unique value considering every subfield within a given field?

Question

Problem

Given this structure:

{
    "_id": ObjectId("56n606c39q49b80adfe6d17b"),
    "data": {
        "field1": [
            {
                "subfield1.1": [
                    {
                        "val1.1.1": a
                    },
                    {
                        "val1.1.2": b
                    }
                ]
            },
            {
                "subfield1.2": [
                    {
                        "val1.2.1": c
                    },
                    {
                        "val1.2.2": d
                    }
                ]
            }
        ]
    }
}

How can I write a query that counts the number of documents that have a maximum unique "val" considering all the "val"s in every "subfield" within a given field?

Facts to consider:

"val"s are elements in an array
"subfield"s are also elements in an array
The "field", "subfield" and "val" field names are the same for all documents
There could be 1 or many "val"s

I am somewhat new to NoSQL. In normal SQL I'd approach this problem maybe with a self join, but here, even though it's possible to use Aggregation, I couldn't find a way to even get close to the real solution.

Case Examples

Given that a is the maximum value... This document SHOULD be counted:

{
    "_id": ObjectId("56n606c39q49b80adfe6d17b"),
    "data": {
        "field1": [
            {
                "subfield1.1": [
                    {
                        "val1.1.1": a
                    },
                    {
                        "val1.1.2": b
                    }
                ]
            },
            {
                "subfield1.2": [
                    {
                        "val1.2.1": c
                    },
                    {
                        "val1.2.2": d
                    }
                ]
            }
        ]
    }
}

This document SHOULDN'T be counted:

{
    "_id": ObjectId("56n606c39q49b80adfe6d17b"),
    "data": {
        "field1": [
            {
                "subfield1.1": [
                    {
                        "val1.1.1": a
                    },
                    {
                        "val1.1.2": a
                    }
                ]
            },
            {
                "subfield1.2": [
                    {
                        "val1.2.1": b
                    },
                    {
                        "val1.2.2": c
                    }
                ]
            }
        ]
    }
}

This document SHOULDN'T be counted either:

{
    "_id": ObjectId("56n606c39q49b80adfe6d17b"),
    "data": {
        "field1": [
            {
                "subfield1.1": [
                    {
                        "val1.1.1": a
                    },
                    {
                        "val1.1.2": b
                    }
                ]
            },
            {
                "subfield1.2": [
                    {
                        "val1.2.1": a
                    },
                    {
                        "val1.2.2": c
                    }
                ]
            }
        ]
    }
}

This document SHOULD be counted (even if b is repeated):

{
    "_id": ObjectId("56n606c39q49b80adfe6d17b"),
    "data": {
        "field1": [
            {
                "subfield1.1": [
                    {
                        "val1.1.1": a
                    },
                    {
                        "val1.1.2": b
                    }
                ]
            },
            {
                "subfield1.2": [
                    {
                        "val1.2.1": b
                    },
                    {
                        "val1.2.2": c
                    }
                ]
            }
        ]
    }
}

Any idea would be welcomed. Thank you!

score 1 · Accepted Answer · answered Aug 06 '19 at 23:23

Aggregation is absolutely what you need here. It may need some modification, but something like this might (hopefully) work for you:

db.collection.aggregate([
    /*
        Step 1: We need to unravel the multi-dimensional array first, otherwise we can't efficiently search for globally unique maximums.
    */

    // Unravel the outer array first.
    {$unwind: "$data.field1"},

    // Simplifies the representation of the unwind result so that we have a flat field path rather than a nested one.
    {$project: {
        vals: "$data.field1.subfield1"
    }},

    // Now unravel the inner array.
    {$unwind: "$vals"},

    // Another simplification step for the same reason as the previous projection.
    {$project: {
        val: "$vals.val1"
    }},

    /*
        Step 2: We want to create counts for array elements that are the same value from the same source document.
    */

    // Creating the counts is easy--simply group by documents with the same source document ID and the same value, adding 1 to our total for each entry.
    {$group: {
        _id: {
            _id: "$_id",
            val: "$val"
        },
        count: {$sum: 1}
    }},

    /*
        Step 3: Once we have our counts, we can retrieve the maximum value for each source document.
    */

    // First, sort by descending value so that the maximum value is the first we encounter.
    {$sort: {
        "_id.val": -1
    }},

    // With the entries in descending order, we can grab the first entry for each source document, which will give us all of the maximums.
    {$group: {
        _id: "$_id._id",
        max: {
            $first: {
                val: "$_id.val",
                count: "$count"
            }
        }
    }},

    // Just for kicks, let's simplify once more by unnesting the data.
    {$project: {
        val: "$max.val",
        count: "$max.count"
    }},

    /*
        Step 4: Now we just need to limit our results.
    */

    // Any result with a count of 1 is a unique maximum.
    {$match: {
        count: 1
    }}
])

This is admittedly a complicated query, and it is difficult to guarantee that it works without knowing your actual document structure. That said, the comments should give you enough information to adapt it to your needs. If you run into any problems, let me know and I'll do what I can to help you track down the issues.
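To sanity-check the logic before running it against real data, the four pipeline steps can be mirrored in plain JavaScript on in-memory documents. This is only a sketch: `hasUniqueMax` and `makeField` are hypothetical helpers, the key names `subfield` and `val` are assumed placeholders for your real field names, and the numbers 4, 3, 2, 1 stand in for a, b, c, d from the question.

```javascript
// Hypothetical helper mirroring the pipeline on one document's field array.
// Assumes each element has a "subfield" array whose entries carry a "val".
function hasUniqueMax(fieldArray) {
    // Step 1: flatten both array levels into one list of values.
    const vals = [];
    for (const element of fieldArray || []) {
        for (const entry of element.subfield || []) {
            vals.push(entry.val);
        }
    }
    if (vals.length === 0) return false;

    // Step 2: count how often each value occurs in this document.
    const counts = new Map();
    for (const v of vals) counts.set(v, (counts.get(v) || 0) + 1);

    // Step 3: find the maximum value.
    let max = vals[0];
    for (const v of vals) if (v > max) max = v;

    // Step 4: the maximum is unique when it occurs exactly once.
    return counts.get(max) === 1;
}

// Builds a field array from groups of values (illustration only).
const makeField = (...groups) =>
    groups.map(group => ({ subfield: group.map(val => ({ val })) }));

// The four case examples from the question, with a=4, b=3, c=2, d=1.
const cases = [
    makeField([4, 3], [2, 1]), // should be counted
    makeField([4, 4], [3, 2]), // maximum repeated in one subfield
    makeField([4, 3], [4, 2]), // maximum repeated across subfields
    makeField([4, 3], [3, 2])  // should be counted, only b repeats
];

const counted = cases.filter(fieldArray => hasUniqueMax(fieldArray)).length;
console.log(`Counted: ${counted}`);
```

If the pipeline and this check disagree on a sample document, the field paths in the `$project` stages are the first thing to look at.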

Thank you very much for your response, it works very well. What I wanted to achieve in the beginning actually goes one small step further: instead of counting the records considering just one field with its subfields, I wanted to count the records that fulfill these conditions for four fields at the same time. So, let's say there are also three more fields called "field2", "field3" and "field4" with their respective subfields/values. Maybe I am being too ambitious, and I don't want to ask for more because you've helped a lot already. Still, do you have any suggestions? — feran, Aug 07 '19 at 20:38
This is certainly achievable. Please take a look at the `$concatArrays` operator. Using this operator, you can merge the arrays for `field1`, `field2`, `field3`, and `field4`. Make this the very first aggregation pipeline stage. The result will be a single field with all of the array elements of all four of those fields. This new field is the one you will work with in the rest of the aggregation outlined in the answer. — B. Fleming, Aug 07 '19 at 22:12
Thanks again for your response. And sorry, I think I didn't explain myself correctly. What I meant is that it should count the documents that not only fulfill these criteria in "field1", but also fulfill the same criteria separately in the other fields. In programming terms, what I am trying to say is "if (validate(field1) && validate(field2)) { count++; }", whereas the problem that you helped me solve was only "if (validate(field1)) { count++; }". — feran, Aug 08 '19 at 15:23
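One possible way to express the "validate(field1) && validate(field2)" condition inside the aggregation framework is to evaluate each field per document with expression operators instead of `$unwind`, and then combine the checks with `$and`. The sketch below is untested against a real deployment and rests on several assumptions: a MongoDB version that supports `$expr` in `$match` (3.6 or later), a layout of `data.<field>[].subfield[].val` with placeholder key names, and the hypothetical names `uniqueMaxExpr` and `matching`.

```javascript
// Hypothetical helper: builds an expression that is true when the given field
// of the current document contains its largest "val" exactly once.
function uniqueMaxExpr(field) {
    return {
        $let: {
            vars: {
                // The dotted path should resolve to an array of arrays of values,
                // so it is flattened here; a missing field becomes an empty array.
                vals: {
                    $reduce: {
                        input: { $ifNull: ["$data." + field + ".subfield.val", []] },
                        initialValue: [],
                        in: { $concatArrays: ["$$value", "$$this"] }
                    }
                }
            },
            in: {
                // Keep only the values equal to the maximum and require exactly one.
                $eq: [
                    { $size: { $filter: {
                        input: "$$vals",
                        as: "v",
                        cond: { $eq: ["$$v", { $max: "$$vals" }] }
                    } } },
                    1
                ]
            }
        }
    };
}

// Every listed field must pass its own check; $count then reports the total.
const pipeline = [
    { $match: { $expr: { $and: ["field1", "field2"].map(uniqueMaxExpr) } } },
    { $count: "matching" }
];

// In the mongo shell this would be run as: db.collection.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

Adding "field3" and "field4" to the list extends the check to all four fields. The final `$count` stage returns a single document holding the number of matches rather than the matching documents themselves.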

score 0 · Answer 2 · answered Aug 09 '19 at 16:15

Just wanted to post a different solution because this one ran slightly faster (in the DB I am using) than the one that used the Aggregation framework; this is a JavaScript solution.

use myDB;

// Function that determines if a "field" has a unique maximum value.
function validate(list){
    let len = list.length;
    let isGood = false;
    if(len == 0){
        isGood = false;
    }
    else if (len == 1){
        isGood = true;
    }
    else{
        isGood = list[0] != list[1];
    }
    return isGood;
}

// This function iterates over all the "values" in every "subfield"
//  within a "field" of a document.
// It adds each candidate maximum value to the front of a list, which is
//  then checked in "validate()".
function verifySubfields(field){
    let list = [];
    field.forEach(fieldElement => {
        // Check that the subfield exists within the element and
        //  that it is not empty
        if (fieldElement.subfield && fieldElement.subfield[0]){
            let subfield = fieldElement.subfield;
            subfield.forEach(subfieldElement => {
                let val = subfieldElement.val;

                if (list.length == 0){
                    list.push(val);
                }
                else{
                    if (val >= list[0]){
                        list.unshift(val);
                    }
                }
            });
        }
    });

    return validate(list);
}

function verifyField(doc){
    return verifySubfields(doc.data.field);
};

let cont = 0, iterations = 0;
db.myCollection.find({ "data.field.subfield": {$exists: true} }).forEach(doc => { 
   iterations++; 
   if(verifyField(doc)){
      cont++;
   } 
});
print(`\nTotal: ${iterations} \nPositives: ${cont} \n`);

Note: Run using mongo < myFile.js .

The problem mentioned in the comments of the accepted solution could be solved by making more calls to "verifySubfields()", one per field, and combining those results in "verifyField()", which could then be renamed "verifyFields()".
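A minimal sketch of that multi-field idea follows. It is a variant rather than the exact code above: `fieldHasUniqueMax` is a hypothetical single-pass replacement for "verifySubfields()" that tracks the running maximum and how often it was seen, and `verifyFields`, `makeField` and the sample documents are made up for illustration. The key names `subfield` and `val` are assumed placeholders.

```javascript
// Hypothetical single-pass check: true when the largest "val" in the field
// occurs exactly once across all of its subfields.
function fieldHasUniqueMax(field) {
    let max;
    let timesSeen = 0;
    (field || []).forEach(fieldElement => {
        (fieldElement.subfield || []).forEach(subfieldElement => {
            const val = subfieldElement.val;
            if (timesSeen === 0 || val > max) {
                // New maximum found: restart the occurrence counter.
                max = val;
                timesSeen = 1;
            } else if (val === max) {
                timesSeen++;
            }
        });
    });
    return timesSeen === 1;
}

// A document passes only if every listed field passes on its own.
function verifyFields(doc, fieldNames) {
    const data = doc.data || {};
    return fieldNames.every(name => fieldHasUniqueMax(data[name]));
}

// Builds a field array from groups of values (illustration only).
const makeField = groups =>
    groups.map(group => ({ subfield: group.map(val => ({ val })) }));

const docs = [
    // Both fields have a unique maximum.
    { data: { field1: makeField([[4, 3], [2, 1]]), field2: makeField([[9, 1]]) } },
    // field2 repeats its maximum, so the document fails.
    { data: { field1: makeField([[4, 3], [2, 1]]), field2: makeField([[9, 9]]) } }
];

const fieldNames = ["field1", "field2"];
const positives = docs.filter(doc => verifyFields(doc, fieldNames)).length;
console.log(`Total: ${docs.length}, Positives: ${positives}`);
```

In the shell script, the same `verifyFields(doc, fieldNames)` call would replace `verifyField(doc)` inside the `forEach` loop over the cursor.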
