
Goal

  • Zero-conflict system: making this an insert-only (append-only) system would save us from conflicts. People are creating and updating documents both offline and online, and being able to figure out which update trumps which is important.
  • Deep historical reference: I want to know what a document looked like at any point in time. On top of that, I need a deep historical analysis of how each item changed over time.

I was thinking of the following architecture:

Reference Document

_id: "u12345",
type: "user",
createdAt: 1584450565 //UNIX TIMESTAMP

{
  _id: "<random>",
  type: "user-name-revision", // {type}-{key}-revision
  referenceId: "u12345",
  value: "John Doe Boy",
  updatedAt: 1584450565
}

{
  _id: "<random>",
  type: "user-name-revision",
  referenceId: "u12345",
  value: "John Doe",
  updatedAt: 1584450566 // 1 second later than the above
}

{
  _id: "<random>",
  type: "user-email-revision",
  referenceId: "u12345",
  value: "john@gmail.com",
  updatedAt: 1584450565
}

If you want to get the user, you would:

  • Get all documents with a referenceId of u12345.
  • Keep only the most recent document of each type.
  • Combine them and output the user like so (a sketch of this read path follows the example):

_id: "u12345",
type: "user",
createdAt: 1584450565,
name: "John Doe"
email: "john@gmail.com"
updatedAt: 1584450566 // highest timestamp
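
For illustration, here is a minimal sketch of that read path in the mongo shell; the collection name `revisions` and the client-side fold are my assumptions, not part of the schema above:

// Fetch all revisions for the user, newest first (hypothetical collection name).
const revisions = db.revisions.find({ referenceId: "u12345" })
                              .sort({ updatedAt: -1 })
                              .toArray();

// Keep only the newest revision per type, then fold into a single user object.
const user = { _id: "u12345", type: "user", updatedAt: 0 };
const seen = new Set();
for (const rev of revisions) {
  if (seen.has(rev.type)) continue;   // an older revision of a key we already have
  seen.add(rev.type);
  const key = rev.type.split("-")[1]; // "user-name-revision" -> "name"
  user[key] = rev.value;
  user.updatedAt = Math.max(user.updatedAt, rev.updatedAt);
}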

The only issue I see is sorting: if I wanted to sort all users by name, say, and I have 1,000 users, I don't see a clean way of doing it.

I was wondering if anyone had any suggestions for a pattern I could use. I'm using MongoDB, so I have the power of that at my disposal.

bryan
  • Is your question to come up with an event-driven design, or just to address the sorting part? How did you get the output shown in the post? Could you add the query you have to the post? Maybe someone can build an answer based on your query. – s7vr Mar 20 '20 at 16:50

3 Answers


You can try the aggregation below:

  • Project the key out of the type field, sort by updatedAt, and group to pick the latest value per key, keeping the referenceId and updatedAt.
  • Group all documents per reference, merge the different key/value pairs, keep the updatedAt, and post-process to format the document.
  • $lookup to pull in the reference document, followed by $replaceRoot to merge the main document with the looked-up document.
  • Sort the documents by name.

db.collectionname.aggregate([
  // Derive the key ("name", "email", ...) from the middle part of the type field.
  {"$addFields":{"key":{"$arrayElemAt":[{"$split":["$type","-"]},1]}}},
  // Newest first, so $first picks the latest value per (referenceId, key).
  {"$sort":{"updatedAt":-1}},
  {"$group":{
    "_id":{"referenceId":"$referenceId","key":"$key"},
    "value":{"$first":"$value"},
    "updatedAt":{"$first":"$updatedAt"}
  }},
  {"$sort":{"updatedAt":-1}},
  // Merge the per-key latest values into one document per referenceId.
  {"$group":{
    "_id":"$_id.referenceId",
    "data":{
      "$mergeObjects":{"$arrayToObject":[[["$_id.key","$value"]]]}
    },
    "updatedAt":{"$first":"$updatedAt"}
  }},
  {"$addFields":{
    "data.referenceId":"$_id",
    "data.updatedAt":"$updatedAt"
  }},
  {"$project":{"data":1}},
  // Pull in the reference document and merge it with the assembled data.
  {"$lookup":{
    "from":"othercollectionname",
    "localField":"data.referenceId",
    "foreignField":"_id",
    "as":"reference"
  }},
  {"$replaceRoot":{
    "newRoot":{
      "$mergeObjects":[{"$arrayElemAt":["$reference",0]},"$data"]}
  }},
  {"$project":{"_id":0}},
  {"$sort":{"name":1}}
])
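
One tuning note, my addition rather than part of the answer: the pipeline sorts on updatedAt early, and the optimizer can often move a $sort ahead of an $addFields when the sort key is not a computed field, letting it be served by an index instead of an in-memory sort:

// Lets the early $sort walk an index rather than sorting in memory.
db.collectionname.createIndex({ "updatedAt": -1 })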

Alternate approach:

With all these transformations, the query will be a little slower. You can make a few tweaks.

Input

{
  _id: "<random>",
  type: "user",
  key: "name",
  referenceId: "u12345",
  value: "John Doe Boy",
  updatedAt: 1584450565
}

Query

db.collectionname.aggregate([
  {"$sort":{"updatedAt":-1}},
  // Latest revision per (referenceId, key).
  {"$group":{
    "_id":{"referenceId":"$referenceId","key":"$key"},
    "top":{"$first":"$$ROOT"}
  }},
  {"$sort":{"top.updatedAt":-1}},
  {"$group":{
    "_id":"$_id.referenceId",
    // Lift the name value out so the final $sort can use it.
    "max":{"$max":{"$cond":[{"$eq":["$_id.key","name"]},"$top.value",null]}},
    "key-values":{"$push":{"k":"$_id.key","v":"$top.value"}},
    "updatedAt":{"$first":"$top.updatedAt"}
  }},
  {"$lookup":{
    "from":"othercollectionname",
    "localField":"_id",
    "foreignField":"_id",
    "as":"reference"
  }},
  {"$project":{"_id":0}},
  {"$sort":{"max":1}}
])
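
The key-values array can then be flattened on the client, which is the kind of client-side transformation mentioned in the comments below; a small JavaScript sketch of my own, where `doc` is one result of the aggregation:

// Turn [{ k: "name", v: "John Doe" }, ...] into { name: "John Doe", ... }.
const user = Object.fromEntries(doc["key-values"].map(({ k, v }) => [k, v]));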

We can refine the schema further to remove a few more stages, by making sure we always append the latest value at the end of an updates array. Something like

Input

{
  _id: "<random>",
  type: "user",
  key: "name",
  referenceId: "u12345",
  updates: [
    {"value": "John Doe Boy", "updatedAt": 1584450565},
    {"value": "John Doe", "updatedAt": 1584450566}
  ]
}

Query

db.collectionname.aggregate([
  // The latest value is always the last element of the updates array.
  {"$addFields":{"latest":{"$arrayElemAt":["$updates",-1]}}},
  {"$group":{
    "_id":"$referenceId",
    "max":{"$max":{"$cond":[{"$eq":["$key","name"]},"$latest.value",null]}},
    "key-values":{"$push":{"k":"$key","v":"$latest.value"}},
    "updatedAt":{"$max":"$latest.updatedAt"}
  }},
  {"$lookup":{
    "from":"othercollectionname",
    "localField":"_id",
    "foreignField":"_id",
    "as":"reference"
  }},
  {"$project":{"_id":0}},
  {"$sort":{"max":1}}
])
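
For completeness, a sketch of the write side, which is my addition rather than part of the answer: $push appends to the end of the array, and the optional $sort modifier keeps the array ordered by timestamp even if offline updates sync out of order:

db.collectionname.updateOne(
  { referenceId: "u12345", type: "user", key: "name" },
  { "$push": { "updates": {
      "$each": [ { "value": "Johnny Doe", "updatedAt": 1584450567 } ],
      "$sort": { "updatedAt": 1 } // newest ends up last even on late arrivals
  } } }
)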
s7vr
  • Thanks for taking the time, Sagar. COVID-19 has gotten me a bit busy, so sorry for the late response. This seems like exactly what I'm looking for. Did you mean `collection.aggregate()` here? If not, how is it a different aggregation? – bryan Mar 25 '20 at 16:24
  • You're welcome. It was a typo. Fixed now. No worries. Please stay safe. – s7vr Mar 25 '20 at 16:52
  • Thanks! You too! Do you think this will have an unnecessary impact on database performance? Is it worth even considering going through this route if you were me? – bryan Mar 26 '20 at 12:28
  • It's hard to say, but if we can keep the formatting ($mergeObjects, $split) out of the queries, we would be better off. Adjust the structure to store the type value in different fields. You could also perform the lookup as a separate query, followed by all the transformation on the client side. This would make the query faster with proper indexing. I have added a few alternatives to the post. You can compare both approaches and decide. – s7vr Mar 27 '20 at 12:57
  • Thanks Sagar, I'm really sorry but it looks like the bounty expired and I wasn't able to award it to you. I'm going to make another one and make sure you get it. – bryan Mar 28 '20 at 19:44
  • No worries, appreciate the response – bryan Mar 28 '20 at 21:35

Your question does not have enough requirements for a specific answer, so I'll try to give an answer that should cover many cases.

I doubt you'll find detailed published use cases; however, I can give you a few tips from my personal experience.

High throughput:

If you are using high-throughput event streaming, it would be better to store your data in an event log, where IDs are not unique and there are no updates, only inserts. This could be done, for instance, with Kafka, which is meant to be used for event streaming. You could then process the events in bulk into a searchable database, e.g. MongoDB.
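
A minimal sketch of the MongoDB side of that hand-off, assuming a Node.js worker that has already consumed a batch of Kafka messages (the names `consumedMessages` and `events` are my assumptions):

const crypto = require("crypto");

// Each event is an immutable fact: insert-only, never updated.
const batch = consumedMessages.map(msg => ({
  _id: crypto.randomUUID(),     // random, unique per event
  docId: msg.key,               // the logical document this event belongs to
  eventTimestamp: msg.timestamp,
  payload: JSON.parse(msg.value)
}));

// Unordered insert: one duplicate or bad event doesn't abort the rest.
await db.collection("events").insertMany(batch, { ordered: false });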

Low throughput:

For lower throughput, you could insert documents directly into MongoDB; however, still only insert data, never update it.

Storing data event-log style in MongoDB:

In both cases, within MongoDB, you'll want a random _id (e.g. a UUID), so each event has a unique _id. To access a logical document, you'll need another field, e.g. docId, which along with eventTimestamp will be indexed (with eventTimestamp sorted descending for faster access to the latest version).
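
In the mongo shell that might look like this (the collection name `events` is my assumption):

// One immutable event per write; _id is random, docId identifies the logical document.
db.events.insertOne({
  _id: UUID(),
  docId: "u12345",
  key: "name",
  value: "John Doe",
  eventTimestamp: 1584450566
})

// Compound index: fetch a document's events with the newest first.
db.events.createIndex({ "docId": 1, "eventTimestamp": -1 })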

Searching:

To search by other fields, you can use additional indexes as necessary; however, if your searches take significant CPU time, make sure you only run them against secondary instances of MongoDB (read preference `secondary`), so that the event inserts won't get delayed. Make yourself familiar with MongoDB's aggregation pipeline.
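
For example, a heavier query can be routed to a secondary from the shell (a sketch, reusing the `events` collection from above):

// Read from a secondary so the primary stays free for event inserts.
db.events.find({ docId: "u12345" })
         .sort({ eventTimestamp: -1 })
         .readPref("secondary")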

To prevent invalid states due to out-of-order updates:

Since you want to enable updates, you should consider saving only the changes in each event, e.g. +1 to field A, set value to x for field B. In this case you will need an index on docId with ascending eventTimestamp instead, and every now and then aggregate your events into summary documents in a different collection, to enable faster reading of the latest state. Use the eventTimestamp of the latest event per docId for the aggregated document, plus an aggregationTimestamp and a versionCount. If at any point you receive an event with an eventTimestamp lower than the latest eventTimestamp in the aggregated collection, you'll need to partially recalculate that collection. In other cases, you can update the aggregated collection incrementally.
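
A sketch of such a change event, with field names following the conventions above but otherwise my own:

// A change event records a delta, not the full state, so late-arriving
// events can be replayed deterministically into the summary collection.
db.events.insertOne({
  _id: UUID(),
  docId: "u12345",
  eventTimestamp: 1584450567,
  changes: [
    { op: "inc", field: "loginCount", amount: 1 },
    { op: "set", field: "name", value: "John Doe" }
  ]
})

// Index for replaying a document's events oldest-first.
db.events.createIndex({ "docId": 1, "eventTimestamp": 1 })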

Danny Varod
  • Hey Danny, do you know what would be considered High Throughput? I'm wondering at what point do I need to look at Kafka or if I can stick with Mongo for the time being – bryan Apr 26 '20 at 13:51
  • Best to test this yourself, although the more data you accumulate, the worse this may get. I think that for document-editing data manually sent by humans, Mongo should do, since the events aren't frequent or large; for rapid events such as sensor data, I'd go with Kafka. However, this depends on the size of events, frequency of events, collection size, number of indexes, RAM size, storage throughput, number of shards, etc. – Danny Varod Apr 26 '20 at 18:33

Use this and you will get the desired output; make sure you have an index on referenceId and updatedAt, and enough memory to sort.

db.collectionName.aggregate([
    {
        $match: {
            referenceId: "u12345"
        }
    },
    {
        // Note: this projection assumes the revision documents carry
        // name/email fields directly, rather than a generic value field.
        $project: {
            type: { $arrayElemAt: [ { $split: [ "$type", "-" ] }, 0 ] },
            referenceId: true,
            createdAt: true,
            name: true,
            email: true,
            updatedAt: true
        }
    },
    {
        $sort: {
            updatedAt: -1
        }
    },
    {
        $group: {
            _id: "$referenceId",
            type: { $first: "$type" },
            createdAt: { $last: "$updatedAt" },  // oldest timestamp = creation time
            name: { $first: "$name" },
            email: { $first: "$email" },
            updatedAt: { $first: "$updatedAt" }  // newest timestamp
        }
    }
])