0

One problem I commonly solve is that of keeping immutable versions of a document rather than editing the document. When asked for the document, retrieve the most recent version.

One way to do this is with timestamps:

doc 0:

{
   id: "e69e0bea-77ea-4d97-bedf-d3cca27ae4b6",
   correlationId: "d00be916-10e3-415c-aaf6-9acb7c70cf4f",
   created: "11/17/2018 2:20:25 AM",
   value: "foo"
}

doc 1:

{
   id: "37ef6f99-bc87-45bb-87ae-a1b81070cc91",
   correlationId: "d00be916-10e3-415c-aaf6-9acb7c70cf4f",
   created: "11/17/2018 2:20:44 AM",
   value: "bar"
}

doc 2:

{
   id: "93fc913e-5ecc-4c59-a130-0e577ed4f2fb",
   correlationId: "d00be916-10e3-415c-aaf6-9acb7c70cf4f",
   created: "11/17/2018 2:21:51 AM",
   value: "baz"
}

The downside of using timestamps is you have to order by the timestamp (O(n*log(n))) to get the Nth most recent version.

I desire to make this O(n) by storing pointers to the previous version, like

{
   id: "e69e0bea-77ea-4d97-bedf-d3cca27ae4b6",
   previousId: null,
   correlationId: "d00be916-10e3-415c-aaf6-9acb7c70cf4f",
   created: "11/17/2018 2:20:25 AM",
   value: "foo"
}

doc 1:

{
   id: "37ef6f99-bc87-45bb-87ae-a1b81070cc91",
   previousId: "e69e0bea-77ea-4d97-bedf-d3cca27ae4b6",
   correlationId: "d00be916-10e3-415c-aaf6-9acb7c70cf4f",
   created: "11/17/2018 2:20:44 AM",
   value: "bar"
}

doc 2:

{
   id: "93fc913e-5ecc-4c59-a130-0e577ed4f2fb",
   previousId: "37ef6f99-bc87-45bb-87ae-a1b81070cc91",
   correlationId: "d00be916-10e3-415c-aaf6-9acb7c70cf4f",
   created: "11/17/2018 2:21:51 AM",
   value: "baz"
}

so it is a linked list like

NULL <- doc0 <- doc1 <- doc2

The only thing stopping me from doing this is that for creating a new version I would need some locking mechanism, like (in pseudo-code)

lock correlationId
   get latest
   new.previousId = latest.id
   insert new

but I'm not sure if it's possible at the database level.

See Sharp
  • 379
  • 1
  • 4
  • 11

2 Answers2

0

There's no concept of locking, but in your case, you can take advantage of unique key constraints:

  • Create a partitioned collection, with correlationId as your logical partition key
  • Add a unique key constraint, with the key based on previousId

At this point, for a given correlationId, if you try to create a new link in the list, and somehow another one was create just before, you'd run into a collision on previousId, and you'd then be able to re-do your operation using the just-created document's id for previousId.

Note: There is an ETag for each document, which helps with concurrency when updating a document, in case you decide to utilize updates at some point.

David Makogon
  • 69,407
  • 21
  • 141
  • 189
-1

Did you consider Cosmos DB Graph API. Linked list is effectively a very basic form of graph.

What you are doing looks good, but updating the correlation id can be a mess. With graph API that problem won’t be there.

Updating the answer following the first comment:

This what we can do using SQL API.

The chain can be modeled as:

NULL <- Doc1 <- Doc2 <- Doc3 <- Head.

The Head has same correlationId as other versions documents. Also, correlationId needs to be the partition key of the collection, so that all version of the same documents are placed in the same physical partition.

Now, we can use a stored procedure to update the version of the document. Note that stored procedures are transactional within the scope of a partition key (the reason we wanted correlationId to be the partition key).

Below is the pseudocode of the stored procedure.

Add New version: 
    Read the Head(H) Document 
    save the _etag of the Head Document 
    Follow H to read the current most recent version (CMRV)
    Add a document for the new most recent version (NMRV)
    Point H to NRMV and NMRV to CMRV
    Update H with some dummy information (say number of version) using the _etag saved before

This entire piece is atomic. If another concurrent thread has successfully updated H, the the current stored proc will fail with "Precondition" failed error (due to _etag mismatch), and the entire stored proc will be rolled back.

Jayanta Mondal
  • 436
  • 2
  • 5
  • How does going to a graph solve the problem? The same problem exists, whether using core (SQL) API or Gremlin (graph) API. `CorrelationID` isn't the issue (and I don't see why that would be a mess). If two writers try adding a vertex to the graph, connected via edge to the latest-known version of a vertex, you have the same concurrency issue as with the Core API. – David Makogon Nov 17 '18 at 04:33
  • David, I have updated response. I suggested graph API primarily for the ease of creating the chain. I misread the question thinking that the chain-creation was the primary problem, and concurrency issue was just an artifact of using SQL API to create the chain. – Jayanta Mondal Nov 18 '18 at 21:12