Current Processing is mapReduce
If you need to execute this on the server, sort the results, and keep just the top 100, then you could use mapReduce like so:
db.test.mapReduce(
function() {
var input = [0.1,0.3,0.4];
var value = Array.sum(this.vals.map(function(el,idx) {
return Math.abs( el - input[idx] )
}));
emit(null,{ "output": [{ "_id": this._id, "value": value }]});
},
function(key,values) {
var output = [];
values.forEach(function(value) {
value.output.forEach(function(item) {
output.push(item);
});
});
output.sort(function(a,b) {
// Descending by value; the comparator must return a number, not a boolean
return b.value - a.value;
});
return { "output": output.slice(0,100) };
},
{ "out": { "inline": 1 } }
)
So the mapper function does the calculation and outputs everything under the same key, so that all results are sent to the reducer. The end output is going to be contained in an array in a single output document, so it is important both that all results are emitted with the same key value and that the output of each emit is itself an array, so that mapReduce can work properly.
The sorting and reduction are done in the reducer itself: as each emitted document is inspected, the elements are put into a single temporary array, sorted, and the top results are returned.
That is important, and it is exactly the reason why the emitter produces this as an array, even if it holds only a single element at first. MapReduce works by processing results in "chunks", so even if all emitted documents have the same key, they are not all processed at once. Rather, the reducer puts its results back into the queue of emitted results to be reduced until there is only a single document left for that particular key.
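To make the "chunks" point concrete, here is a minimal sketch in plain JavaScript (no MongoDB involved) that feeds a reducer of the same shape its own partial results. The names `TOP_N`, `scoresOf` and the made-up scores are purely illustrative, and splitting the emits into two halves is an arbitrary stand-in for whatever chunking the server actually does:

```javascript
// Illustrative only: TOP_N and the sample scores are assumptions, not from the question.
var TOP_N = 3;

// Same shape as the reducer above: flatten, sort descending, keep the top N.
function reducer(key, values) {
  var output = [];
  values.forEach(function (value) {
    value.output.forEach(function (item) {
      output.push(item);
    });
  });
  output.sort(function (a, b) {
    return b.value - a.value;
  });
  return { output: output.slice(0, TOP_N) };
}

// Eight fake emits, each wrapped in a single-element array like the mapper does.
var scores = [5, 1, 9, 3, 7, 2, 8, 4];
var emitted = scores.map(function (score, i) {
  return { output: [{ _id: i, value: score }] };
});

// One pass over everything at once.
var onePass = reducer(null, emitted);

// Two partial passes whose results are fed back into the same reducer.
var partA = reducer(null, emitted.slice(0, 4));
var partB = reducer(null, emitted.slice(4));
var reReduced = reducer(null, [partA, partB]);

function scoresOf(result) {
  return result.output.map(function (item) { return item.value; });
}

console.log(scoresOf(onePass));   // [ 9, 8, 7 ]
console.log(scoresOf(reReduced)); // [ 9, 8, 7 ]
```

Because the reducer returns the same structure it receives, its output can be reduced again without losing any of the top results, which is what the array wrapping in the emit is there to guarantee.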
I'm restricting the "slice" output here to 10 for brevity of listing, and including the stats to make a point: you can see the 100 reduce cycles called on this 10000 document sample:
{
"results" : [
{
"_id" : null,
"value" : {
"output" : [
{
"_id" : ObjectId("56558d93138303848b496cd4"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d96138303848b49906e"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d93138303848b496d9a"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d93138303848b496ef2"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497861"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497b58"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497ba5"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497c43"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d95138303848b49842b"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d96138303848b498db4"),
"value" : 2.1
}
]
}
}
],
"timeMillis" : 1758,
"counts" : {
"input" : 10000,
"emit" : 10000,
"reduce" : 100,
"output" : 1
},
"ok" : 1
}
So this is a single document output, in the specific mapReduce format, where the "value" contains an element which is an array of the sorted and limited result.
Future Processing is Aggregate
At the time of writing, the latest stable release of MongoDB is 3.0, which lacks the functionality to make your operation possible. But the upcoming 3.2 release introduces new operators that make this possible:
db.test.aggregate([
{ "$unwind": { "path": "$vals", "includeArrayIndex": "index" }},
{ "$group": {
"_id": "$_id",
"result": {
"$sum": {
"$abs": {
"$subtract": [
"$vals",
{ "$arrayElemAt": [ { "$literal": [0.1,0.3,0.4] }, "$index" ] }
]
}
}
}
}},
{ "$sort": { "result": -1 } },
{ "$limit": 100 }
])
Also limiting to the same 10 results for brevity, you get output like this:
{ "_id" : ObjectId("56558d96138303848b49906e"), "result" : 2.2 }
{ "_id" : ObjectId("56558d93138303848b496cd4"), "result" : 2.2 }
{ "_id" : ObjectId("56558d96138303848b498e31"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497c43"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497861"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499037"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b498db4"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496ef2"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496d9a"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499182"), "result" : 2.1 }
This is made possible largely due to $unwind being modified to project a field in results that contains the array index, and also due to $arrayElemAt, which is a new operator that can extract an array element as a singular value from a provided index.
This allows the "look-up" of values by index position from your input array in order to apply the math to each element. The input array is wrapped in the existing $literal operator so that $arrayElemAt does not complain and recognizes it as an array (this seems to be a small bug at present, as other array functions don't have the problem with direct input), and it gets the appropriate matching index value by using the "index" field produced by $unwind for comparison.
The math is done by $subtract and of course another new operator in $abs to meet your functionality. Also, since it was necessary to unwind the array in the first place, all of this is done inside a $group stage, accumulating all array members per document and adding up the entries via the $sum accumulator.
Finally all result documents are processed with $sort and then the $limit is applied to just return the top results.
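If it helps to see what each stage is doing to the data, the following is a rough plain JavaScript emulation of the pipeline, not something MongoDB itself runs. The three documents and their `_id` values are hypothetical, and the limit is cut to 2 to keep the listing short:

```javascript
var input = [0.1, 0.3, 0.4];

// Hypothetical documents, not taken from the question.
var docs = [
  { _id: "a", vals: [1, 1, 1] },
  { _id: "b", vals: [0.1, 0.3, 0.4] },
  { _id: "c", vals: [0, 0.5, 0.9] }
];

// $unwind with includeArrayIndex: one row per array element, plus its position.
var unwound = [];
docs.forEach(function (doc) {
  doc.vals.forEach(function (val, index) {
    unwound.push({ _id: doc._id, vals: val, index: index });
  });
});

// $group: $sum of $abs of $subtract, with $arrayElemAt looking up input[index].
var totals = {};
unwound.forEach(function (row) {
  var diff = Math.abs(row.vals - input[row.index]);
  totals[row._id] = (totals[row._id] || 0) + diff;
});

// $sort descending on "result", then $limit.
var results = Object.keys(totals)
  .map(function (id) { return { _id: id, result: totals[id] }; })
  .sort(function (a, b) { return b.result - a.result; })
  .slice(0, 2);

console.log(results.map(function (r) { return r._id; })); // [ 'a', 'c' ]
console.log(unwound.length); // 9
```

Note that three input documents turn into nine rows after the unwind step, one per array element, before they are grouped back together.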
Summary
Even with the new functionality about to be available to the aggregation framework for MongoDB, it is debatable which approach is actually more efficient for results. This is largely due to there still being a need to $unwind the array content, which effectively produces a copy of each document per array member in the pipeline to be processed, and that generally causes an overhead.
So whilst mapReduce is the only present way to do this until a new release, it may actually outperform the aggregation statement depending on the amount of data to be processed, and despite the fact that the aggregation framework works on native coded operators rather than translated JavaScript operations.
As with all things, testing is always recommended to see which case suits your purposes better and which gives the best performance for your expected processing.
Sample
Of course the expected result for the sample document provided in the question is 0.9 by the math applied. But just for my testing purposes, here is a short listing used to generate some sample data, so that I could at least verify the mapReduce code was working as it should:
var bulk = db.test.initializeUnorderedBulkOp();
var x = 10000;
while ( x-- ) {
var vals = [0,0,0];
vals = vals.map(function(val) {
// Math.round takes a single argument: scale up, round, then divide for one decimal place
return Math.round(Math.random()*10)/10;
});
bulk.insert({ "vals": vals });
if ( x % 1000 == 0) {
bulk.execute();
bulk = db.test.initializeUnorderedBulkOp();
}
}
The arrays are totally random values with a single decimal place, so there is not a lot of distribution in the listed results I gave as sample output.
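As a rough illustration of why the top of the listing is crowded with 2.2 and 2.1, this plain JavaScript sketch enumerates every array the generator above can produce (values 0 to 1 in steps of 0.1) and counts how many land on each score. The variable names are my own, and scores are rounded to one decimal place only to sidestep floating point noise:

```javascript
var input = [0.1, 0.3, 0.4];

// Every value the generator can produce: 0, 0.1, ..., 1.
var steps = [];
for (var i = 0; i <= 10; i++) {
  steps.push(i / 10);
}

// Count how many of the 11 * 11 * 11 possible arrays land on each score.
var counts = {};
steps.forEach(function (a) {
  steps.forEach(function (b) {
    steps.forEach(function (c) {
      var score = [a, b, c].reduce(function (sum, el, idx) {
        return sum + Math.abs(el - input[idx]);
      }, 0);
      var key = score.toFixed(1);
      counts[key] = (counts[key] || 0) + 1;
    });
  });
});

// Distinct scores, highest first.
var best = Object.keys(counts).sort(function (x, y) {
  return parseFloat(y) - parseFloat(x);
});

console.log(best[0], counts[best[0]]); // 2.2 1
console.log(best[1], counts[best[1]]); // 2.1 3
```

So 2.2 is the ceiling for this input (only [1,1,1] reaches it), and just three arrays give 2.1, which is why the top results differ so little.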