
I am implementing my first syncing code. In my case I will have 2 types of iOS clients per user that will sync records to a server using a lastSyncTimestamp, a 64-bit integer holding the time of the last sync in milliseconds since the Unix epoch. Records can be created on the server or on the clients at any time, and the records are exchanged as JSON over HTTP.

I am not worried about conflicts, as there are few updates and they always come from the same user. However, I am wondering whether there are common things that can go wrong with a timestamp-based approach, such as syncing during daylight saving time, syncs conflicting with one another, or other gotchas.

I know that git and some other version control systems eschew syncing with timestamps in favor of a content-based negotiation approach. I could imagine such an approach for my apps too: using the uuid or hash of the objects, both peers announce which objects they own, and then exchange them until both peers have the same sets.

If anybody knows any advantages or disadvantages of content-based syncing versus timestamp-based syncing in general that would be helpful as well.

Edit - Here are some of the advantages/disadvantages that I have come up with for timestamp-based and content-based syncing. Please challenge/correct.

Note - I am defining content-based syncing as a simple negotiation between 2 sets of objects. Picture 2 kids who are each given part of a jumbled pile made from 2 identical sets of baseball cards. As they look through their cards, they announce what they have and hand over any duplicates to the other, until both hold identical sets.

  • Johnny - "I got this card."
  • Davey - "I got this bunch of cards. Give me that card."
  • Johnny - "Here is your card. Gimme that bunch of cards."
  • Davey - "Here are your bunch of cards."
  • ....
  • Both - "We are done"
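
The exchange above could be sketched roughly as follows. This is only a minimal in-memory illustration: the `negotiate` function and the card ids are hypothetical names, and it assumes each record is identified by a unique id and never changes once created.

```python
def negotiate(a: dict, b: dict) -> None:
    """Exchange records until both peers hold the same set of ids.

    `a` and `b` map record id -> record content (hypothetical layout).
    """
    a_ids, b_ids = set(a), set(b)   # each peer announces what it owns
    for rid in b_ids - a_ids:       # records only b has are handed to a
        a[rid] = b[rid]
    for rid in a_ids - b_ids:       # records only a has are handed to b
        b[rid] = a[rid]


johnny = {"card-1": "Ruth", "card-2": "Mays"}
davey = {"card-2": "Mays", "card-3": "Aaron"}
negotiate(johnny, davey)
# Both peers now hold card-1, card-2 and card-3.
```

Over HTTP the two set differences would be computed from id lists sent by each side rather than from shared memory, but the end condition is the same: both sets of ids are equal.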

Advantages of timestamp-based syncing

  • Easy to implement
  • Single property used for syncing.

Disadvantages of timestamp-based syncing

  • Time is relative to the observer, and different machines' clocks can be out of sync. There are a couple of ways to solve this. One is to generate timestamps on a single machine, which doesn't scale well and represents a single point of failure. The other is to use logical clocks such as vector clocks. For the average developer building their own system, vector clocks might be too complex to implement.
  • Timestamp-based syncing works for client-to-master syncing but doesn't work as well for peer-to-peer syncing or where syncing can occur with 2 masters.
  • Whatever generates the timestamp is a single point of failure.
  • Time is not really related to the content of what is being synced.

Advantages of content-based syncing

  • No per-peer timestamp needs to be maintained. 2 peers can start a sync session and start syncing based on the content.
  • Well-defined endpoint for a sync: when both parties have identical sets.
  • Allows a peer-to-peer architecture, where any peer can act as client or server, provided it can host an HTTP server.
  • Sync works with the content of the sets, not with the abstract concept of time.
  • Since sync is built around content, sync can be used to do content verification if desired. E.g. a SHA-1 hash can be computed on the content and used as the uuid. It can be compared to what is sent during syncing.
  • Even further, SHA-1 hashes can be based on previous hashes to maintain a consistent history of content.
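
The last two points could look something like the sketch below. The function names are made up for illustration, and it assumes records are JSON-serializable dictionaries that are serialized canonically (sorted keys) so that both peers compute the same hash for the same content.

```python
import hashlib
import json


def content_id(record: dict) -> str:
    """SHA-1 of a canonical JSON encoding, usable as the record's id."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def verify(record: dict, claimed_id: str) -> bool:
    """Check that a received record matches the id it was announced under."""
    return content_id(record) == claimed_id


def chained_id(record: dict, parent_id: str) -> str:
    """Id that also depends on the previous id, giving a verifiable history."""
    data = parent_id + content_id(record)
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


first = {"title": "note", "body": "hello"}
first_id = content_id(first)
second_id = chained_id({"title": "note", "body": "hello again"}, first_id)
```

Changing any earlier record changes its id, and therefore every chained id after it, which is what makes the history checkable.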

Disadvantages of content-based syncing

  • Extra properties on your objects may be needed to implement it.
  • More logic on both sides compared to timestamp-based syncing.
  • Slightly more chatty protocol (this could be tuned by syncing content in clusters).
John Wright

3 Answers


Part of the problem is that time is not an absolute concept. Whether something happens before or after something else is a matter of perspective, not of compliance with a wall clock.

Read up a bit on relativity of simultaneity to understand why people have stopped trying to use wall time for figuring these things out and have moved to constructs that represent actual causality using vector clocks (or at least Lamport clocks).

If you want to use a clock for synchronization, a logical clock will likely suit you best. You will avoid all of your clock sync issues and stuff.
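
For reference, a Lamport clock is small enough to sketch in a few lines. The class and method names below are illustrative rather than taken from any library: each node keeps a counter, increments it on every local event, and on receiving a message jumps past the sender's counter.

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative sketch)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance for a local event, such as creating or sending a record."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Merge the counter carried by an incoming message."""
        self.time = max(self.time, remote_time) + 1
        return self.time


client, server = LamportClock(), LamportClock()
t1 = client.tick()        # client creates a record
t2 = server.receive(t1)   # server receives it
t3 = server.tick()        # server creates its own record
t4 = client.receive(t3)   # client pulls it during the next sync
```

The counters order events that are causally related (t1 < t2 < t3 < t4 here) without either machine consulting a wall clock.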

Dustin
  • Thanks Dustin. I realize vector clocks are often used for synchronizing a set of machines, but are logical clocks commonly used for simpler sync-since-last-timestamp scenarios like the one I am considering? Do you know of any examples or how I would do that? – John Wright Nov 15 '10 at 17:52
  • I'd imagine if your server acted as a central truth (i.e., not peer-to-peer syncing), its time could be the only one that matters. The time a transaction for a specific sync session completed, for example. – Joshua Nozzi Nov 15 '10 at 17:57
  • So if I generate the timestamp on the server what if there are a few servers handling requests in a cluster and their clocks differ? Would always syncing a minute or 2 before the timestamp handle these inconsistencies (as long as the clients can easily disregard duplicate records)? Is that not something to worry about, or is there a better way? – John Wright Nov 15 '10 at 23:11
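
The overlap idea in the last comment could be sketched like this. It is a rough illustration, not a recommendation: the 2-minute window, the field names `uuid` and `modified_ms`, and both function names are assumptions, and it treats records as immutable so that a uuid already held locally can simply be skipped.

```python
OVERLAP_MS = 2 * 60 * 1000  # assumed upper bound on clock skew between servers


def records_since(store: list, last_sync_ms: int) -> list:
    """Return records modified after the last sync, minus a safety window."""
    cutoff = last_sync_ms - OVERLAP_MS
    return [r for r in store if r["modified_ms"] >= cutoff]


def merge(local: dict, incoming: list) -> int:
    """Add unseen records to `local` (uuid -> record); return how many were new."""
    added = 0
    for record in incoming:
        if record["uuid"] not in local:   # duplicates from the overlap are dropped
            local[record["uuid"]] = record
            added += 1
    return added


server_store = [
    {"uuid": "a", "modified_ms": 1_000_000},
    {"uuid": "b", "modified_ms": 1_090_000},
    {"uuid": "c", "modified_ms": 1_200_000},
]
client = {"b": server_store[1]}
incoming = records_since(server_store, last_sync_ms=1_150_000)
new_count = merge(client, incoming)
```

Here the window re-sends record `b`, which the client already has and ignores, and only `c` is added.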

I don't know if it applies in your environment, but you might consider whose time is "right", the client's or the server's (or whether it even matters). If all clients and all servers are not sync'd to the same time source, there is a possibility, however slight, of a client getting an unexpected result when syncing to (or from) the server using the client's "now" time.

Our development organization actually ran into some issues with this several years ago. Developer machines were not all sync'd to the same time source as the server where the SCM resided (and might not have been sync'd to any time source, thus the developer machine time could drift). A developer machine could be several minutes off after a few months. I don't recall all of the issues, but it seems like the build process tried to get all files modified since a certain time (the last build). Files could have been checked in, since the last build, that had modification times (from the client) that occurred BEFORE the last build.

It could be that our SCM procedures were just not very good, or that our SCM system or build process were unduly susceptible to this problem. Even today, all of our development machines are supposed to sync time with the server that has our SCM system on it.

Again, this was several years ago and I can't recall the details, but I wanted to mention it on the chance that it is significant in your case.

wageoghe
  • Good point. Gotcha #1, always use one machine to compute timestamps. I hadn't considered this but I should probably always produce the lastSyncTimestamp on the server. – John Wright Nov 15 '10 at 17:43
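
A server-issued lastSyncTimestamp might look roughly like the sketch below. `SyncServer`, its methods, and the injectable clock are hypothetical and only there to make the idea concrete; the client never reads its own clock, it just stores whatever timestamp the server hands back.

```python
import time


class SyncServer:
    """Toy server that stamps records and sync sessions with its own clock."""

    def __init__(self, now_ms=None) -> None:
        # The clock is injectable so the behaviour can be tested deterministically.
        self._now_ms = now_ms or (lambda: int(time.time() * 1000))
        self.records = []

    def put(self, uuid: str) -> None:
        self.records.append({"uuid": uuid, "modified_ms": self._now_ms()})

    def sync(self, last_sync_ms: int):
        """Return records changed since `last_sync_ms` and the new timestamp."""
        now = self._now_ms()  # read once, before querying
        changed = [r for r in self.records
                   if last_sync_ms < r["modified_ms"] <= now]
        return changed, now


ticks = iter([100, 200, 300, 400])          # fake clock for the example
server = SyncServer(now_ms=lambda: next(ticks))
server.put("a")                              # stamped 100
first_batch, last_sync = server.sync(0)      # server time 200
server.put("b")                              # stamped 300
second_batch, last_sync = server.sync(last_sync)  # server time 400
```

With a real millisecond clock, a record stamped in the same millisecond as a sync would be skipped by the strict `<` comparison, so some overlap plus duplicate handling on the client is still worth considering.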

You could have a look at Unison. It's file-based, but you might find some of its ideas interesting.

Karl