11

I have access logs such as below stored in a mongodb instance:

Time                           Service                      Latency
[27/08/2013:11:19:22 +0000]    "POST Service A HTTP/1.1"    403
[27/08/2013:11:19:24 +0000]    "POST Service B HTTP/1.1"    1022 
[27/08/2013:11:22:10 +0000]    "POST Service A HTTP/1.1"    455 

Is there an analytics function like PERCENTILE_DISC in Oracle to calculate the percentile?

I would like to calculate latency percentiles over a period of time.

Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
user2574093
  • 111
  • 1
  • 4

4 Answers4

9

There still appears to be no native way to calculate percentiles but by combining a few aggregate operators you can get the same result.

db.items.aggregate([
        {'$group': {
            '_id': {
                'league': '$league',
                'base': '$base',
                'type': '$type'
            },
            'value': {'$push': '$chaosequiv'}
        }},
        {'$unwind': '$value'},
        {'$sort': {'value': 1}},
        {'$group': {'_id': '$_id', 'value': {'$push': '$value'}}},
        {'$project': {
            '_id': 1,
            'value': {'$arrayElemAt': ['$value', {'$floor': {'$multiply': [0.25, {'$size': '$value'}]}}]}
        }}
    ], allowDiskUse=True)

Note I wrote my original code in pymongo for a problem that needed to group on 3 fields in the first group so this may be more complex than necessary for a single field. I would write a solution specific to this question but I don't think there is enough specific information.

Jeremy Parks
  • 101
  • 1
  • 5
4

Starting Mongo 4.4, the $group stage has a new aggregation operator $accumulator allowing custom accumulations of documents as they get grouped, via javascript user defined functions.

Thus, in order to find the 20th percentile:

// { "a" : 25, "b" : 12 }
// { "a" : 89, "b" : 73 }
// { "a" : 25, "b" : 7  }
// { "a" : 25, "b" : 17 }
// { "a" : 89, "b" : 14 }
// { "a" : 89, "b" : 17 }
// { "a" : 25, "b" : 24 }
// { "a" : 25, "b" : 15 }
// { "a" : 25, "b" : 22 }
// { "a" : 25, "b" : 94 }
db.collection.aggregate([
  { $group: {
    _id: "$a",
    percentile: {
      $accumulator: {
        accumulateArgs: ["$b"],
        init: function() { return []; },
        accumulate: function(bs, b) { return bs.concat(b); },
        merge: function(bs1, bs2) { return bs1.concat(bs2); },
        finalize: function(bs) {
          bs.sort(function(a, b) { return a - b });
          return bs[Math.floor(bs.length*.2) + 1];
        },
        lang: "js"
      }
    }
  }}
])
// { "_id" : 89, "percentile" : 17 }
// { "_id" : 25, "percentile" : 15 }

The accumulator:

  • accumulates on the field b (accumulateArgs)
  • is initialised to an empty array (init)
  • accumulates b items in an array (accumulate and merge)
  • and finally performs the percentile calculation on b items (finalize)
Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
0

Calculating percentile from 0 to 100 (step 5)

// { "session": 1, "date" : 2013-08-27 11:00::00, "latency" : 403 }
// { "session": 1, "date" : 2013-08-27 11:00::01, "latency" : 1022 }
// { "session": 1, "date" : 2013-08-27 11:00::02, "latency" : 455 }
// { "session": 1, "date" : 2013-08-27 11:00::02, "latency" : 307 }
// { "session": 2, "date" : 2013-08-27 12:00::00, "latency" : 403 }
// { "session": 2, "date" : 2013-08-27 12:00::01, "latency" : 1022 }
// { "session": 2, "date" : 2013-08-27 12:00::02, "latency" : 455 }
// { "session": 2, "date" : 2013-08-27 12:00::02, "latency" : 307 }

var interval = 5

db.getCollection.aggregate([
    {
        $group:{
            _id: { $toString: "$session" },
            value: { $push: "$atency" }
        }
    },
    {
        $addFields:{
            index : {$range: [0, 100, interval]}
        }
    },
    {$unwind: "$index"},
    {
        $addFields:{
            p_index : {$divide: ["$index", 100]}
        }
    },
    {
        $addFields:{
            percentile_x : {$toDouble: {'$arrayElemAt': ['$value', {'$floor': {'$multiply': ["$p_index", {'$size': '$value'}]}}]}}
        }
    },
    {   $project: {
           'percentile': "$index",
           'value': "$percentile_x"
        }
    },
        {$sort: {
          percentile: 1
        }
    }
])
0

Considering there is still no native way to do this, I'll be using an:

  • $aggregate as well with a $push to keep track of all the values.
  • Then I'll just use np.percentile(values, xth_percentile) to get the xth percentile. e.g. np.percentile([0,1,10,30,100], 25) to get the 25th percentile of that array.

For reference, there have been for many, many years a JIRA card and an open pull request for this feature. Go show your support.

Akaisteph7
  • 5,034
  • 2
  • 20
  • 43