I am trying to port an existing SQL schema to MongoDB.
We have document tables that sometimes contain the same document several times, with the same reference but a different revision. I want to get only the latest revision of each document.

Sample input data:

{
    "Uid" : "xxx",
    "status" : "ACCEPTED",
    "reference" : "DOC305",
    "code" : "305-D",
    "title" : "Document 305",
    "creationdate" : ISODate("2011-11-24T15:13:28.887Z"),
    "creator" : "X"
},
{
    "Uid" : "xxx",
    "status" : "COMMENTED",
    "reference" : "DOC306",
    "code" : "306-A",
    "title" : "Document 306",
    "creationdate" : ISODate("2011-11-28T07:23:18.807Z"),
    "creator" : "X"
},
{
    "Uid" : "xxx",
    "status" : "COMMENTED",
    "reference" : "DOC306",
    "code" : "306-B",
    "title" : "Document 306",
    "creationdate" : ISODate("2011-11-28T07:26:49.447Z"),
    "creator" : "X"
},
{
    "Uid" : "xxx",
    "status" : "ACCEPTED",
    "reference" : "DOC501",
    "code" : "501-A",
    "title" : "Document 501",
    "creationdate" : ISODate("2011-11-19T06:30:35.757Z"),
    "creator" : "X"
},
{
    "Uid" : "xxx",
    "status" : "ACCEPTED",
    "reference" : "DOC501",
    "code" : "501-B",
    "title" : "Document 501",
    "creationdate" : ISODate("2011-11-19T06:40:32.957Z"),
    "creator" : "X"
}

Given this data, I want this result set (sometimes I want only the last revision, sometimes I want all revisions with an attribute telling me whether it's the latest):

{
    "Uid" : "xxx",
    "status" : "ACCEPTED",
    "reference" : "DOC305",
    "code" : "305-D",
    "title" : "Document 305",
    "creationdate" : ISODate("2011-11-24T15:13:28.887Z"),
    "creator" : "X",
    "lastrev" : true
},
{
    "Uid" : "xxx",
    "status" : "COMMENTED",
    "reference" : "DOC306",
    "code" : "306-B",
    "title" : "Document 306",
    "creationdate" : ISODate("2011-11-28T07:26:49.447Z"),
    "creator" : "X",
    "lastrev" : true
},
{
    "Uid" : "xxx",
    "status" : "ACCEPTED",
    "reference" : "DOC501",
    "code" : "501-B",
    "title" : "Document 501",
    "creationdate" : ISODate("2011-11-19T06:40:32.957Z"),
    "creator" : "X",
    "lastrev" : true
}

I already have a bunch of filters, sorting, and skip/limit (for pagination of data), so the final query must still honor these constraints.

The current "find" query (built with the .NET driver) filters fine, but gives me all revisions of each document:

coll.find(
    { "$and" : [
        { "$or" : [
            { "deletedid" : { "$exists" : false } },
            { "deletedid" : null }
        ] },
        { "$or" : [
            { "taskid" : { "$exists" : false } },
            { "taskid" : null }
        ] },
        { "objecttypeuid" : { "$in" : ["xxxxx"] } }
    ] },
    { "_id" : 0, "Uid" : 1, "lastrev" : 1, "title" : 1, "code" : 1, "creator" : 1, "owner" : 1, "modificator" : 1, "status" : 1, "reference": 1, "creationdate": 1 }
).sort({ "creationdate" : 1 }).skip(0).limit(10);

Based on another question, I was able to build this aggregation, which gives me the latest revision of each document, but without enough attributes in the result:

coll.aggregate([
    { $sort: { "creationdate": 1 } },
    {
        $group: {
            "_id": "$reference",
            result: { $last: "$creationdate" },
            creationdate: { $last: "$creationdate" }
        }
    }
]);

I would like to integrate the aggregation with the find query.

thomasb

2 Answers

I have found the way to mix aggregation and filtering:

coll.aggregate(
[
    { $match: {
            "$and" : [
                { "$or" : [
                    { "deletedid" : { "$exists" : false } },
                    { "deletedid" : null }
                ] },
                { "$or" : [
                    { "taskid" : { "$exists" : false } },
                    { "taskid" : null }
                ] },
                { "objecttypeuid" : { "$in" : ["xxx"] } }
            ]
        }
    },
    { $sort: { "creationdate": 1 } },
    { $group: {
            "_id": "$reference",
            "doc": { "$last": "$$ROOT" }
        }
    },
    { $sort: { "doc.creationdate": 1 } },
    { $skip: skip },
    { $limit: limit }
],
    { allowDiskUse: true }
);

For each result node, this gives me a "doc" node with the document data. It still returns too much data (the projections are missing), but it's a start.
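The missing projection could be added with two extra stages at the end of the pipeline. The following is only a sketch: `buildProjectionStages` and its `fields` parameter are hypothetical names, and `$replaceRoot` assumes a server that supports it (MongoDB 3.4 or later, as far as I know).

```javascript
// Hypothetical helper: builds the stages that flatten { _id, doc } back into
// a plain document and keep only the requested fields.
function buildProjectionStages(fields) {
  const projection = { _id: 0 };
  for (const name of fields) {
    projection[name] = 1;
  }
  return [
    // Promote the grouped "doc" sub-document to the top level
    // (assumption: the server supports $replaceRoot).
    { $replaceRoot: { newRoot: "$doc" } },
    // Same kind of projection as the second argument of find().
    { $project: projection }
  ];
}

const projectionStages = buildProjectionStages([
  "Uid", "status", "reference", "code", "title", "creationdate", "creator"
]);
```

These stages would go after `$limit`, so that only the documents of the current page are reshaped.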

Translated to .NET:

FilterDefinitionBuilder<BsonDocument> filterBuilder = Builders<BsonDocument>.Filter;
FilterDefinition<BsonDocument> filters = filterBuilder.Empty;

filters = filters & (filterBuilder.Not(filterBuilder.Exists("deletedid")) | filterBuilder.Eq("deletedid", BsonNull.Value));
filters = filters & (filterBuilder.Not(filterBuilder.Exists("taskid")) | filterBuilder.Eq("taskid", BsonNull.Value));
foreach (var f in fieldFilters) {
    filters = filters & filterBuilder.In(f.Key, f.Value);
}

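// Sort by creation date before grouping so that $last picks the latest revision.
var revisionSort = Builders<BsonDocument>.Sort.Ascending("creationdate");
// After $group the fields live under "doc", so the final sort needs that prefix.
var resultSort = Builders<BsonDocument>.Sort.Ascending("doc." + orderby);

var group = new BsonDocument {
    { "_id", "$reference" },
    { "doc", new BsonDocument("$last", "$$ROOT") }
};

var aggregate = coll.Aggregate(new AggregateOptions { AllowDiskUse = true })
    .Match(filters)
    .Sort(revisionSort)
    .Group(group)
    .Sort(resultSort)
    .Skip(skip)
    .Limit(rows);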

return aggregate.ToList();

I'm pretty sure there are better ways to do this, though.
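
The question also mentions a second variant: all revisions, each with a `lastrev` attribute. One possible way to express it is sketched below. `buildAllRevisionsPipeline` is a hypothetical helper, and the sketch assumes a server that supports `$addFields` and `$replaceRoot`. It also assumes each group of revisions is small enough to be pushed into a single document, and that two revisions of the same reference never share the same `creationdate` (otherwise both would be flagged).

```javascript
// Hypothetical helper: returns every revision, flagged with lastrev = true
// when its creationdate is the highest one for its reference.
function buildAllRevisionsPipeline(match, skip, limit) {
  return [
    { $match: match },
    { $group: {
        _id: "$reference",
        latest: { $max: "$creationdate" },   // newest date per reference
        revisions: { $push: "$$ROOT" }       // keep every revision
    } },
    { $unwind: "$revisions" },
    // Flag the revision whose date equals the newest date of its group.
    { $addFields: {
        "revisions.lastrev": { $eq: ["$revisions.creationdate", "$latest"] }
    } },
    { $replaceRoot: { newRoot: "$revisions" } },
    { $sort: { creationdate: 1 } },
    { $skip: skip },
    { $limit: limit }
  ];
}

const pipeline = buildAllRevisionsPipeline(
  { objecttypeuid: { $in: ["xxx"] } }, 0, 10
);
```

The result would be passed to `coll.aggregate(pipeline, { allowDiskUse: true })`, as in the pipeline above.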

thomasb

Your answer is pretty close, but $max is a better fit than $last here.

About the $last operator, the documentation says:

Returns the value that results from applying an expression to the last document in a group of documents that share the same group by a field. Only meaningful when documents are in a defined order.

The following mongo shell code gets the last revision in each group:

db.collection.aggregate([
  {
    $group: {
      _id: '$reference',
      doc: {
        $max: {
          "creationdate" : "$creationdate",
          "code" : "$code",
          "Uid" : "$Uid",
          "status" : "$status",
          "title" : "$title",
          "creator" : "$creator"
        }
      }
    }
  },
  {
    $project: {
      _id: 0,
      Uid: "$doc.Uid",
      status: "$doc.status",
      reference: "$_id",
      code: "$doc.code",
      title: "$doc.title",
      creationdate: "$doc.creationdate",
      creator: "$doc.creator"
    }
  }
]).pretty()

The output is as you expect:

{
    "Uid" : "xxx",
    "status" : "ACCEPTED",
    "reference" : "DOC501",
    "code" : "501-B",
    "title" : "Document 501",
    "creationdate" : ISODate("2011-11-19T06:40:32.957Z"),
    "creator" : "X"
}
{
    "Uid" : "xxx",
    "status" : "COMMENTED",
    "reference" : "DOC306",
    "code" : "306-B",
    "title" : "Document 306",
    "creationdate" : ISODate("2011-11-28T07:26:49.447Z"),
    "creator" : "X"
}
{
    "Uid" : "xxx",
    "status" : "ACCEPTED",
    "reference" : "DOC305",
    "code" : "305-D",
    "title" : "Document 305",
    "creationdate" : ISODate("2011-11-24T15:13:28.887Z"),
    "creator" : "X"
}
Shawyeok
  • In your example, how does `$max` determine that it has to take the max value of the `creationdate` field, and not of another one? – thomasb Dec 22 '16 at 14:10
  • Because `creationdate` is the first field of the `$max` argument object; the `$max` operator compares the fields one by one, in the order they are defined. But that's not documented, so we would need to confirm it in the source code. – Shawyeok Dec 22 '16 at 14:23
  • OK, thanks, but I'm not very fond of relying on undocumented features... – thomasb Dec 22 '16 at 14:25
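
To illustrate the field-order behaviour discussed in these comments, here is a plain JavaScript emulation. It is not MongoDB code: `compareDocs` and `maxDoc` are hypothetical helpers, and the sketch assumes both documents have the same keys in the same order, with values of the same type under each key. My understanding is that MongoDB's BSON comparison order treats embedded documents in a similar pair-by-pair way, but that is worth checking against the manual for your server version.

```javascript
// Compares two documents key by key, in the order the keys were defined.
// Assumption: same keys, same order, comparable values (Date, number, string).
function compareDocs(a, b) {
  for (const key of Object.keys(a)) {
    const left = a[key] instanceof Date ? a[key].getTime() : a[key];
    const right = b[key] instanceof Date ? b[key].getTime() : b[key];
    if (left < right) return -1;
    if (left > right) return 1;
  }
  return 0;
}

// Returns the document that compares highest under compareDocs.
function maxDoc(docs) {
  return docs.reduce((best, doc) => (compareDocs(doc, best) > 0 ? doc : best));
}

// creationdate comes first, so the later date decides the result.
const latest = maxDoc([
  { creationdate: new Date("2011-11-28T07:26:49.447Z"), code: "306-B" },
  { creationdate: new Date("2011-11-28T07:23:18.807Z"), code: "306-A" }
]);

// code comes first, so the higher code wins even though that revision is older.
const codeFirst = maxDoc([
  { code: "306-Z", creationdate: new Date("2011-11-01T00:00:00.000Z") },
  { code: "306-A", creationdate: new Date("2011-11-28T00:00:00.000Z") }
]);
```

Under these assumptions, `latest` is the `306-B` revision, while `codeFirst` is the older `306-Z` one, which shows why the position of `creationdate` in the `$max` argument matters.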