Very fast virtual machine rollback

Question

Summary: Is there a VM that's optimized for fast in-memory rollback to a previous state?

I've tried in vain on StackOverflow and SuperUser — maybe this is the right place to ask this question. :)

I'm looking for the fastest way to roll back a virtual machine to a known previous state. I'll be doing a lot of rollbacks to the same state. Taking a snapshot can be slow, but the rollback needs to be fast.

Ideally, I'd like the performance of the rollback to be as close to the performance of a single memcpy() as possible.

I'm currently testing things on my portable machine (Mac OS X), but will later deploy this to a beefy Linux server. Portability would be nice, but only running on Linux is fine. Commercial solutions are fine, if they get the job done.

For reference's sake, here's how long a memcpy() takes on my machine: (Code available here.)

     Size      Time
 ========= ========= 
    64 MB   0.030 s
   128 MB   0.085 s
   256 MB   0.140 s
   512 MB   0.290 s
  1024 MB   0.600 s
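
For readers without access to the linked code, here is a minimal sketch of how such a bulk-copy timing could be taken. It is not the code behind the table: the function name `time_copy` and the best-of-three approach are assumptions, and it uses a same-length `bytearray` slice assignment as a stand-in for a raw memcpy().

```python
import time


def time_copy(size_mb, repeats=3):
    """Return the best wall-clock time in seconds for one copy of size_mb MiB."""
    size = size_mb * 1024 * 1024
    src = bytearray(size)  # zero-filled source buffer
    dst = bytearray(size)  # destination allocated up front, outside the timed region
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment, done as one bulk copy
        best = min(best, time.perf_counter() - start)
    return best
```

Calling `time_copy(1024)` gives a figure comparable in spirit to the last row of the table. Taking the best of several runs reduces the effect of first-touch page faults on the destination buffer.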

So far I've tried VirtualBox and VMware Fusion on my machine. Performance was comparable for both.

Ideally, I'd also like to roll back the disk state. VirtualBox seems to have support for "immutable" disks, where all disk I/O is logged to a differential file and thrown away on rollbacks. That seems like a natural fit.
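
The differential-file idea can be illustrated with a toy model. This is only a sketch of the general copy-on-write overlay technique, not VirtualBox's implementation; the class `OverlayDisk` and its block granularity are made up for illustration.

```python
class OverlayDisk:
    """Read-only base image plus an in-memory overlay of written blocks."""

    def __init__(self, base_blocks):
        self._base = tuple(base_blocks)  # never modified after creation
        self._overlay = {}               # block index -> contents written since the base

    def _check(self, index):
        if not 0 <= index < len(self._base):
            raise IndexError("block index out of range")

    def read(self, index):
        self._check(index)
        # Prefer the overlay; fall back to the untouched base block.
        return self._overlay.get(index, self._base[index])

    def write(self, index, data):
        self._check(index)
        self._overlay[index] = data  # the base image is left alone

    def rollback(self):
        """Discard every write made since the base image; no base data is copied."""
        self._overlay = {}
```

Because a rollback only drops the overlay, its cost does not depend on the size of the base image.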

With VirtualBox, rolling back a VM with 2.5 GB of assigned RAM, with the state saved on disk (an SSD), took ~35 seconds.

Since disk I/O is expected to be slow, I next created a RAM disk (formatted as HFS+) and moved the snapshots there. The rollback of the otherwise unchanged VM to the same state then took ~16.5 seconds.

The snapshot data takes up 1.11 GB, so VirtualBox takes ~15 s per GB of data when rolling back from a RAM disk. Going by the 1024 MB memcpy() time above (0.600 s), that's roughly 25x the time of a memcpy(). (Does going through Mac OS X's I/O system cause that much overhead?)

Since I know of no way to roll back a VirtualBox VM without first shutting it down and later starting it up again, it could be that some of this is caused by other factors.

Ideally, the hypervisor would keep all snapshot state in main memory and use memcpy() for a rollback, letting us roll back 2.5 GB in ~1.5 seconds at the memcpy() rate measured above. (An even smarter hypervisor might implement copy-on-write on pages and only roll back memory pages that actually changed.)
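
To make the "only roll back pages that changed" idea concrete, here is a toy model in user space. It is a sketch under assumptions, not how any particular hypervisor works: the class `RollbackMemory`, the 4096-byte page size, and the explicit `write` method standing in for hardware dirty-page tracking are all invented for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


class RollbackMemory:
    """Guest memory with a baseline snapshot and a set of dirty page numbers."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.baseline = None
        self.dirty = set()

    def snapshot(self):
        """Slow path: take a full copy of memory once."""
        self.baseline = bytes(self.mem)
        self.dirty.clear()

    def write(self, offset, data):
        end = offset + len(data)
        if offset < 0 or end > len(self.mem):
            raise ValueError("write outside memory")
        self.mem[offset:end] = data
        if data:
            # Record every page the write touched.
            self.dirty.update(range(offset // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1))

    def rollback(self):
        """Fast path: restore only dirty pages; returns how many were restored."""
        if self.baseline is None:
            raise RuntimeError("no snapshot taken")
        for page in self.dirty:
            start = page * PAGE_SIZE
            end = min(start + PAGE_SIZE, len(self.mem))
            self.mem[start:end] = self.baseline[start:end]
        restored = len(self.dirty)
        self.dirty.clear()
        return restored
```

In this model the rollback cost scales with the number of pages touched since the snapshot rather than with total memory size, which is the property asked for above.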

Are there any hypervisors which are optimized for rollback-frequently workloads? If not: What might be the most viable way of implementing such a thing myself?

(I've also done a bit of research into process-level rollback, and there seem to be some interesting solutions for that on Linux; but having rollbacks work for arbitrary applications and being able to roll back I/O are big pluses.)

(What I really want to do is roll back to a checkpoint in arbitrary Java applications, ideally together with their I/O. I've only found dead research projects for rolling back a JVM, but maybe somebody is aware of more current work? It would, however, need to work with pretty much arbitrary Java applications.)

When a snapshot is made, it starts a new differential "disk" on the drive. At least in VirtualBox, if you stop the server and delete the changes, it's done instantly. — Nathan C, Jun 27 '13 at 11:53
tbh situations where 'rollback-frequently workloads' are truly appropriate are exceptional, and in most cases they would indicate that you are doing it wrong. Not saying you are, but it explains why there is little support for it. — JamesRyan, Jun 27 '13 at 12:02
@JamesRyan Not necessarily. Especially automated build machines, unit tests etc. may require machines in a known state (install/uninstall testing). Start, run test, roll back to known state for next test. Lab setups have this requirement at times. — TomTom, Jun 27 '13 at 12:08
Do they have it where it is required to do it every 30 seconds though? Time seems to be a considerable concern with this instance. — Travis, Jun 27 '13 at 12:11
@TomTom: Absolutely. What I'm actually doing here is to automatically come up with tests for an application. It makes sense to be able to start from a sane state for that, and throwing away a differential disk file together with memory changes would perfectly address my use case. — Florian Groß, Jun 28 '13 at 10:12

score 1 · Answer 1 · answered Jun 27 '13 at 11:33


I do not know what your actual needs are. However, is it possible to run multiple clones and do some kind of round-robin load balancing? Then you could roll back those machines in the background and have more time at hand to do so.

answered Jun 27 '13 at 11:33

Reiner Rottmann


Absolutely. However, cheaper rollbacks would mean I require fewer clones, so minimizing the rollback time is still interesting to me. – Florian Groß Jun 28 '13 at 10:13

Travis · Answer 2 · 2013-06-28T11:49:42.560

Not really. The purpose of a snapshot is to take a "shot" of the current system state so that if something goes wrong in the immediate future (software update or install), the system can quickly be returned to the state it was in before the installation. That way you don't have to do a full recovery from backup. Snapshots are meant to be retained for only a short period of time and should be cleaned out afterwards, per VMware.

Something similar to what you are looking for is VMware Site Recovery Manager, which allows you to have a completely replicated VM infrastructure, but it will only partially work for what you need. I do not believe that it exactly fits your purpose, because it is more for a mirrored copy of your VM, not a "rollback" state of your VM.

The other option I know of with VMware, and possibly VirtualBox, is to create a scheduled task that automates the snapshot and reversion from the command line (PowerCLI). It will then create the snapshot and revert to it on a schedule.
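
For the VirtualBox side, a scripted revert could look roughly like the sketch below. It swaps in VirtualBox's `VBoxManage` command-line tool instead of PowerCLI, and the VM name `testvm` and snapshot name `clean` in the usage note are hypothetical. It also assumes the VM is running when the rollback starts, since powering off a stopped VM makes `VBoxManage` exit with an error.

```python
import subprocess


def rollback_commands(vm, snapshot):
    """Build the VBoxManage calls for power-off, snapshot restore, and restart."""
    return [
        ["VBoxManage", "controlvm", vm, "poweroff"],
        ["VBoxManage", "snapshot", vm, "restore", snapshot],
        ["VBoxManage", "startvm", vm, "--type", "headless"],
    ]


def rollback(vm, snapshot, run=subprocess.run):
    """Run the commands in order; check=True stops at the first failing step."""
    for cmd in rollback_commands(vm, snapshot):
        run(cmd, check=True)
```

Calling `rollback("testvm", "clean")` from a scheduled job or at the end of a test run would perform the revert. The `run` parameter exists only so the sequence can be exercised without VirtualBox installed.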

Also, if you are having to snapshot and roll back that frequently, then maybe something is wrong in your process. Explaining exactly why you are doing it might garner better responses.

Edit

I think the better approach may be to simply have multiple VMs running, all with the same baseline. Basically, clone the first one to 3 or 4 additional VMs and use them for your testing. As soon as the first round of testing is done on VM1, start the testing on VM2 and roll back VM1 to the snapshot while that is going on. Once VM2 is done, move to VM3 while rolling back VM2. Then go back to VM1 when VM3 is done, and roll back VM3. If VM1 has not completed its rollback by the time you finish with VM3, add an additional VM. That way, when you reach the end of the "line", so to speak, VM1 will be done with its rollback. Since it is automated, you can use PowerCLI or some other command-line tool to revert the snapshot upon completion of the test.
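
How many VMs the rotation needs can be estimated with a little arithmetic. With n VMs and tests run back to back, a VM that has just finished a test is not needed again until the other n - 1 VMs have each run one, so its rollback must fit into (n - 1) test durations. The sketch below assumes every test takes the same time and every rollback takes the same time; the function name `min_pool_size` is made up for illustration.

```python
import math


def min_pool_size(test_seconds, rollback_seconds):
    """Smallest n with (n - 1) * test_seconds >= rollback_seconds."""
    if test_seconds <= 0 or rollback_seconds < 0:
        raise ValueError("test time must be positive and rollback time non-negative")
    return 1 + math.ceil(rollback_seconds / test_seconds)
```

For example, with a hypothetical 30-second test, the ~35-second rollback measured in the question gives `min_pool_size(30, 35) == 3`, while the ~16.5-second RAM-disk rollback gives `min_pool_size(30, 16.5) == 2`.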

See above. This is absolutely an unusual use case. (I plan to use this for returning a system to a sane state so I can automatically create good system tests for it.) It could be that a rollback needs to cost 20-30 s, but if so, I'd like to understand why. (I only see throwing away differential I/O plus returning parts of the memory to a known state. Shouldn't that be much cheaper, if we can avoid touching the disk as much as possible?) — Florian Groß, Jun 28 '13 at 10:16
I don't think what you are asking for is possible. The restoration process of the snapshot can depend on too many factors. I will edit my answer to explain another alternative. — Travis, Jun 28 '13 at 11:46
