I have a collection of decision documents in a form similar to:

    { 
    _id: ObjectId("23de23802fe925b6ef7162a4"),
    userId: 6de4, 
    decision: true, 
    datetime:ISODate("2016-07-27T08:22:47.169Z")
    },
    { 
    _id: ObjectId("507f1f77bcf86cd799439011"),
    userId: 23f4, 
    decision: true, 
    datetime:ISODate("2016-02-03T11:48:50.456Z")
    },
.
.
.

I'm struggling to find a way of grouping these documents into groups of contiguous datetimes, i.e. a document should belong to a particular group if it is less than, say, 5 minutes from at least one other document in the group.

The aim is to obtain groups of decisions that were made in a "session". Further insights could then be drawn from these "session" groups using aggregation (such as average time per decision, etc.).

If this is not possible with MongoDB's aggregation framework, can it be done with map-reduce or by some other means? I am open to suggestions.

Clarification

Another way of picturing the problem is by applying the following algorithm to the collection of documents.

  1. Start by putting the documents in datetime order.
  2. Put the earliest document in its own group and move to the next document.
  3. If the next document's datetime is less than a specified time (say 5 minutes) after the immediately preceding one, place it in the same group as before. If not, create a new group and place this document in it.
  4. Repeat step 3 until all documents have been traversed.
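
To make the steps concrete, here is a minimal sketch in plain JavaScript that works on an in-memory array rather than inside MongoDB. The helper name `groupIntoSessions` and the sample data are made up for illustration, and it assumes each document carries a `datetime` Date field as in the sample above.

```javascript
// Hypothetical helper (not part of any library): splits documents into sessions.
// Assumes each document has a `datetime` Date field, as in the sample above.
function groupIntoSessions(docs, maxGapMs) {
  // Step 1: sort a copy chronologically so the input array is left untouched.
  var sorted = docs.slice().sort(function (a, b) {
    return a.datetime.getTime() - b.datetime.getTime();
  });

  var sessions = [];
  var previousTime = null;

  sorted.forEach(function (doc) {
    var time = doc.datetime.getTime();
    // Steps 2 and 3: the first document, or one that is not less than
    // maxGapMs after its predecessor, opens a new session.
    if (previousTime === null || time - previousTime >= maxGapMs) {
      sessions.push([]);
    }
    sessions[sessions.length - 1].push(doc);
    // The comparison is always made against the immediately preceding document.
    previousTime = time;
  });

  return sessions;
}

var FIVE_MINUTES = 5 * 60 * 1000;
// Made-up sample data for illustration only.
var sample = [
  { userId: "a", datetime: new Date("2016-07-27T08:22:47.169Z") },
  { userId: "b", datetime: new Date("2016-07-27T08:26:00.000Z") },
  { userId: "c", datetime: new Date("2016-07-27T08:40:00.000Z") }
];
var sessions = groupIntoSessions(sample, FIVE_MINUTES);
console.log(sessions.map(function (s) { return s.length; }));
```

With this sample, the first two documents end up in one session and the third in a session of its own. This only illustrates the rule; it does not address doing the same thing inside MongoDB or on a collection too large to hold in memory.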

This would leave the collection with the required "session" groupings. Of course, this is just one way of picturing the problem. I am not aware of any way to traverse an ordered collection while grouping in this way using MongoDB.

Can it be done this way? Is there another way to achieve the same result using MongoDB?

Chris
  • Related http://stackoverflow.com/questions/26814427/group-result-by-15-minutes-time-interval-in-mongodb – chridam Sep 05 '16 at 16:13
  • @chridam this is not the same as your referenced question. I do **not** want to group the documents into fixed time intervals but want to group by proximity in time to other documents in the same group _(contiguous time points)_ . – Chris Sep 05 '16 at 16:37
  • What do the *time intervals* look like? Is it an array? What is the expected output here? – styvane Sep 05 '16 at 20:29
  • @Styvane the point is the intervals are not fixed and homogeneous. Rather I want to group documents that are close to each other in time. The threshold for closeness is determined by a single scalar value (e.g. 5 minutes) – Chris Sep 05 '16 at 21:10
  • Is it possible to modify your current objects (i.e. adding a new field)? About how many objects do you have? – Nahuel Sep 06 '16 at 11:36
  • If it helps with the solution, yes I could. I'd be interested to understand how that would help. – Chris Sep 06 '16 at 23:15
  • The problem here is that your *intervals are not fixed and homogeneous*; if not, adding a field as mentioned in the comment will help, but this implies that you need to update all the documents in the collection every time the intervals change. Also, this might be possible with the aggregation framework if we could group the documents and put them in one array field, thus a single document. But I think the BSON size limit will probably not help. MapReduce may help, but I will need to try this first when I have a little time. – styvane Sep 07 '16 at 08:05
  • What is the size of each document in the collection? How large is the collection? How often do you need to run this query? – styvane Sep 07 '16 at 08:07
  • The query would only run once every month. The collection contains ~200,000 documents each roughly ~2KB – Chris Oct 06 '16 at 12:11

1 Answer

In the algorithm you described, the grouping decision for each document always depends on another document. I don't see a way of doing this using map-reduce, aggregation or a single MongoDB query. The only solution I see is to follow your algorithm strictly, that is, read each document and decide whether it belongs to the current group or should start a new one.

Loading all documents into memory is not recommended, since the collection could be very large. So I used a stream to load the documents one by one.

Create a cursor that finds all documents and orders them by date, then use cursor.on('data', function(document){ ... }); to read each document individually.

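var groups = {}; // maps the first timestamp of each group to that group's documents
var currentGroupKey; // timestamp of the first document in the current group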
var groupInterval = 5 * 60 * 1000; // Five minutes in milliseconds

var cursor = db.collection("documents").find({}).sort({date: 1});

cursor.on('data', function(doc) {
  var timestamp = doc.date.getTime();

  if (currentGroupKey != null && currentGroupKey + groupInterval >= timestamp) {
    // add it to current group
    groups[currentGroupKey].push(doc);
  } else {
    // create a new group
    groups[timestamp] = [doc];
    currentGroupKey = timestamp;
  }
});
cursor.once('end', function() {
  // This is called after last document is read
  console.log(groups); // print your grouped documents
  db.close();
});

For these documents:

[ { _id: 57f59acb8e73d9634ac8c7b0,
    index: 3,
    date: Wed Oct 05 2016 21:02:29 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7ae,
    index: 1,
    date: Wed Oct 05 2016 21:04:02 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7b3,
    index: 6,
    date: Wed Oct 05 2016 21:07:43 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7b4,
    index: 7,
    date: Wed Oct 05 2016 21:10:26 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7b2,
    index: 5,
    date: Wed Oct 05 2016 21:14:23 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7b5,
    index: 8,
    date: Wed Oct 05 2016 21:17:39 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7b6,
    index: 9,
    date: Wed Oct 05 2016 21:21:07 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7ad,
    index: 0,
    date: Wed Oct 05 2016 21:24:19 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7af,
    index: 2,
    date: Wed Oct 05 2016 21:25:50 GMT-0300 (BRT) },
  { _id: 57f59acb8e73d9634ac8c7b1,
    index: 4,
    date: Wed Oct 05 2016 21:28:13 GMT-0300 (BRT) } ]

the final group object is:

{ '1475712149573':
   [ { _id: 57f59acb8e73d9634ac8c7b0,
       index: 3,
       date: Wed Oct 05 2016 21:02:29 GMT-0300 (BRT) },
     { _id: 57f59acb8e73d9634ac8c7ae,
       index: 1,
       date: Wed Oct 05 2016 21:04:02 GMT-0300 (BRT) } ],
  '1475712463238':
   [ { _id: 57f59acb8e73d9634ac8c7b3,
       index: 6,
       date: Wed Oct 05 2016 21:07:43 GMT-0300 (BRT) },
     { _id: 57f59acb8e73d9634ac8c7b4,
       index: 7,
       date: Wed Oct 05 2016 21:10:26 GMT-0300 (BRT) } ],
  '1475712863890':
   [ { _id: 57f59acb8e73d9634ac8c7b2,
       index: 5,
       date: Wed Oct 05 2016 21:14:23 GMT-0300 (BRT) },
     { _id: 57f59acb8e73d9634ac8c7b5,
       index: 8,
       date: Wed Oct 05 2016 21:17:39 GMT-0300 (BRT) } ],
  '1475713267412':
   [ { _id: 57f59acb8e73d9634ac8c7b6,
       index: 9,
       date: Wed Oct 05 2016 21:21:07 GMT-0300 (BRT) },
     { _id: 57f59acb8e73d9634ac8c7ad,
       index: 0,
       date: Wed Oct 05 2016 21:24:19 GMT-0300 (BRT) },
     { _id: 57f59acb8e73d9634ac8c7af,
       index: 2,
       date: Wed Oct 05 2016 21:25:50 GMT-0300 (BRT) } ],
  '1475713693672':
   [ { _id: 57f59acb8e73d9634ac8c7b1,
       index: 4,
       date: Wed Oct 05 2016 21:28:13 GMT-0300 (BRT) } ] }

EDIT

Since the grouping decision always depends on the last document read, I modified the algorithm to fit that. It now also updates each document with a group key, so it doesn't load all documents into memory.

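var lastDocumentTimestamp = null; // no document has been read yet
var groupIndex = 0;
var groupInterval = 5 * 60 * 1000; // Five minutes in milliseconds
var pendingUpdates = []; // promises for updates that are still in flight

var cursor = db.collection("documents").find({}).sort({date: 1});

cursor.on('data', function(doc) {
  var timestamp = doc.date.getTime();

  // Start a new group when the gap since the previous document exceeds the interval
  if (lastDocumentTimestamp !== null && lastDocumentTimestamp + groupInterval < timestamp) {
    groupIndex++;
  }
  lastDocumentTimestamp = timestamp;
  // Called without a callback, update() returns a promise (Node.js driver 2.x)
  pendingUpdates.push(db.collection("documents").update({ _id: doc._id}, { $set: {group: groupIndex}}));
});
cursor.once('end', function() {
  // This is called after the last document is read; wait for the updates before closing
  Promise.all(pendingUpdates).then(function() {
    db.close();
  });
});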

After that, you can use aggregation to group the documents by their group key:

db.collection("documents").aggregate([{
  $group: {
    _id: "$group",
    count: { $sum: 1 },
    docs: { $push: "$date" }
  }
}])

This produces a result like:

[ { _id: 0,
    count: 1,
    docs: [ Thu Oct 06 2016 22:00:20 GMT-0300 (BRT) ] },
  { _id: 1,
    count: 4,
    docs:
     [ Thu Oct 06 2016 22:20:31 GMT-0300 (BRT),
       Thu Oct 06 2016 22:22:52 GMT-0300 (BRT),
       Thu Oct 06 2016 22:25:34 GMT-0300 (BRT),
       Thu Oct 06 2016 22:27:15 GMT-0300 (BRT) ] },
  { _id: 2,
    count: 5,
    docs:
     [ Thu Oct 06 2016 22:33:27 GMT-0300 (BRT),
       Thu Oct 06 2016 22:35:45 GMT-0300 (BRT),
       Thu Oct 06 2016 22:38:45 GMT-0300 (BRT),
       Thu Oct 06 2016 22:40:02 GMT-0300 (BRT),
       Thu Oct 06 2016 22:44:20 GMT-0300 (BRT) ] } ]
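
For the per-session statistics mentioned in the question (such as average time per decision), a pipeline along the following lines might be a starting point. This is a sketch I have not run against a live server. It assumes every document already carries the `group` and `date` fields used above, it takes "average time per decision" to mean the average gap between consecutive decisions in a session, and the variable name and output field names (`count`, `start`, `end`, `durationMs`, `avgMsPerDecision`) are made up for illustration.

```javascript
// Sketch of a pipeline for per-session statistics. Assumes documents have
// the `group` and `date` fields used above; output field names are made up.
var sessionStatsPipeline = [
  // One result per session, with its size and first/last timestamps.
  { $group: {
      _id: "$group",
      count: { $sum: 1 },
      start: { $min: "$date" },
      end: { $max: "$date" }
  } },
  // Subtracting two dates yields the difference in milliseconds.
  { $project: {
      count: 1, start: 1, end: 1,
      durationMs: { $subtract: ["$end", "$start"] }
  } },
  // Average gap between consecutive decisions; null for one-document sessions,
  // which also avoids dividing by zero.
  { $project: {
      count: 1, start: 1, end: 1, durationMs: 1,
      avgMsPerDecision: {
        $cond: [
          { $gt: ["$count", 1] },
          { $divide: ["$durationMs", { $subtract: ["$count", 1] }] },
          null
        ]
      }
  } },
  { $sort: { _id: 1 } }
];

// Usage (needs a live connection, so it is left commented out here):
// db.collection("documents").aggregate(sessionStatsPipeline)
//   .toArray(function (err, stats) { console.log(err || stats); });
console.log(JSON.stringify(sessionStatsPipeline, null, 2));
```

Keeping the duration and the average in separate `$project` stages lets the second stage reuse `durationMs` instead of repeating the subtraction.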
Marcelo Risse
  • Thanks very much for the detailed explanation. This looks very promising although this would group the docs which fit into the interval starting from the time-stamp of the **first** document in each group. I was looking to group the document in a **contiguous** fashion by comparing the time-stamp of the **document immediately preceding**. I think the key for each group would need to be updated every time a new doc is added. This way the time comparison is made with the immediately preceding time. – Chris Oct 06 '16 at 11:53
  • Also, wouldn't the group object still be loaded in memory, containing every single document of the collection by the end? Could the group object be replaced with a new collection which would hold the groupings? – Chris Oct 06 '16 at 12:13
  • You can keep the reference for the last document timestamp and make the comparison with it. Regarding the group object in memory, you are right, it contains all documents at the end of the execution and it absolutely could be in a collection that holds all groups. You can keep the group document instance in memory and for each document you make the comparison and push the document inside the group document. When a new group is needed, you save the current group document in the database and create a new object for the new group. – Marcelo Risse Oct 06 '16 at 14:25
  • Sounds great. Thanks! I'll give this a go and let you know if it works. – Chris Oct 06 '16 at 14:35
  • Great, let me know if it works so I can update my answer to add these changes. – Marcelo Risse Oct 06 '16 at 14:38