
In a message-passing distributed system, we take checkpoints based on a synchronized clock, storing the state of each process at that point.

Now I want to know how to do this in practice.

Say my system is a request/response client-server system. At some point I would like to take checkpoints so that I can roll back if a failure occurs.

In that case, what information do I need to store? I would like to know the practical considerations. I have gone through several articles about rollback recovery and am now trying to build an implementation for a PoC.

Could anybody who has tried a checkpoint mechanism in their system give me some clues?

Edit

I'm trying to do rollback for non-deterministic events (e.g. receiving requests to a web service). There are two approaches I have in mind: one is checkpoint-based, the other is log-based. I chose Apache Axis2 as my web service platform. It already has a logging facility, so logging will be easier in this case.

So, when we do checkpoint-based or log-based recovery, do we need to store the whole data?

  • Is there any difference in what data is stored in the two cases?

  • In this type of recovery, we need to roll back both the client and the server. The client can be rolled back based on the information we stored. How can we roll back the server in that case? Is that necessary, or have I misunderstood the protocol?

Ratha

1 Answer


The data you need to save depends heavily on your system, so without knowing a lot more about it, we can't tell you specifically what to store.

However, in general, you need to store whatever data is critical to your application. Some applications store the entire contents of memory and reload it on recovery. That is heavy-handed and very expensive; it is sometimes necessary, but usually not. More often, you can store just the internal status data (for example, the clients you are connected to, or the status of those connections), and when you restart your application after a failure, you reload that data and re-establish any connections to your clients.
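As a rough illustration of saving only a small piece of internal state, here is a minimal Python sketch. The function names `save_checkpoint` and `load_checkpoint`, the JSON file format, and the example fields are all assumptions made for illustration; they are not part of Axis2 or any particular framework. The write goes to a temporary file first and is then renamed, so a crash in the middle of a checkpoint should not leave a half-written file behind.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write `state` (a JSON-serializable dict) to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        # Atomically swap the new checkpoint in place of the old one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)
```

On restart, the application would call `load_checkpoint` once, rebuild its connections from whatever it finds there, and continue from that point. Which fields go into the saved dict (for example, a last processed request ID or a list of connected clients) depends entirely on the application.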

In another implementation, you may not care about the connections to your clients; you only need to save some small internal state and let the clients reconnect themselves and restart whatever they were doing when they made the connection. It all depends on how much of your execution you're willing to lose vs. how much overhead you want to introduce by creating these checkpoints.
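To make the lost-work vs. overhead trade-off concrete, here is a small in-memory simulation in Python. Everything in it (the `CheckpointedCounter` class, the `checkpoint_every` parameter, the toy state) is hypothetical and only meant as a sketch: a real server would persist the checkpoint to stable storage rather than keep a copy in memory.

```python
import copy


class CheckpointedCounter:
    """Toy server whose state is checkpointed every N handled requests."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.state = {"handled": 0, "total": 0}
        self.saved = copy.deepcopy(self.state)  # stands in for stable storage
        self.checkpoints_taken = 0

    def handle(self, amount):
        self.state["handled"] += 1
        self.state["total"] += amount
        if self.state["handled"] % self.checkpoint_every == 0:
            self.saved = copy.deepcopy(self.state)
            self.checkpoints_taken += 1

    def crash_and_recover(self):
        """Roll back to the last checkpoint; return how many requests were lost."""
        lost = self.state["handled"] - self.saved["handled"]
        self.state = copy.deepcopy(self.saved)
        return lost


def simulate(checkpoint_every, requests):
    """Handle all requests, then crash; return (checkpoints taken, requests lost)."""
    server = CheckpointedCounter(checkpoint_every)
    for amount in requests:
        server.handle(amount)
    lost = server.crash_and_recover()
    return server.checkpoints_taken, lost


for n in (1, 4, 100):
    print(n, simulate(n, range(10)))
# 1 (10, 0)
# 4 (2, 2)
# 100 (0, 10)
```

Checkpointing after every request loses nothing but pays the checkpoint cost ten times; checkpointing every 100 requests costs nothing here but loses all ten requests on a crash.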

Wesley Bland
  • Thanks for the clear explanation. I added further questions to my original post; please check those as well. – Ratha Jun 25 '13 at 13:48