
I have a backup procedure for ec2 instances with lvm spanned volumes that does the following:

1) ssh to the box as root, using an SSH forced command to run dmsetup suspend on the spanned volume.
2) Take an EBS snapshot of each of the volumes.
3) ssh to the box as root, using an SSH forced command to run dmsetup resume on the spanned volume.
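For readers unfamiliar with the mechanism in steps 1 and 3: an SSH forced command is attached to a key in root's `authorized_keys`, and sshd runs that command no matter what the client asks for (the requested command is only exposed as `SSH_ORIGINAL_COMMAND`). Below is a minimal sketch that generates such entries. The device-mapper name `vg_data-lv_data`, the `/sbin/dmsetup` path and the key strings are hypothetical placeholders and do not come from the question.

```python
# Sketch: authorized_keys entries that pin each key to a single command.
# Hypothetical: the device-mapper name, the dmsetup path, the placeholder keys.

RESTRICTIONS = "no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding"


def forced_command_entry(command: str, public_key: str) -> str:
    """Return one authorized_keys line that only ever runs `command`."""
    if '"' in command:
        # Keep the sketch simple: refuse quotes instead of escaping them.
        raise ValueError("command must not contain double quotes")
    return f'command="{command}",{RESTRICTIONS} {public_key}'


if __name__ == "__main__":
    dm_name = "vg_data-lv_data"  # hypothetical device-mapper name
    print(forced_command_entry(f"/sbin/dmsetup suspend {dm_name}",
                               "ssh-ed25519 <suspend-public-key> backup-suspend"))
    print(forced_command_entry(f"/sbin/dmsetup resume {dm_name}",
                               "ssh-ed25519 <resume-public-key> backup-resume"))
```

With one key per action, the backup host chooses suspend or resume only by picking which private key to use.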

This has been working fine for a while, but last night something went wrong. It appears that the volume was suspended and never came back to the active state. I could ssh into the instance, but very few commands worked (ls did, top did, ps did not). I could run dmsetup info to see that it was suspended, but attempts to run dmsetup resume did nothing. I eventually rebooted, and it forced a disk check on that volume, which will take a very long time. I've restored from a previous snapshot instead.
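Because `dmsetup info` still worked on the stuck box, its `State` field is a cheap thing to check after every backup run. Below is a minimal sketch that parses the plain-text output. It assumes the usual `Key: value` layout with one block per device, and the device names are hypothetical.

```python
from typing import Dict, List


def dm_states(info_output: str) -> Dict[str, str]:
    """Map each device's Name to its State from plain `dmsetup info` output."""
    states: Dict[str, str] = {}
    name = None
    for line in info_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between device blocks
        key, value = key.strip(), value.strip()
        if key == "Name":
            name = value
        elif key == "State" and name is not None:
            states[name] = value
    return states


def suspended_devices(info_output: str) -> List[str]:
    """Sorted names of devices whose State starts with SUSPENDED."""
    return sorted(n for n, s in dm_states(info_output).items()
                  if s.startswith("SUSPENDED"))
```

On the instance you could pass in `subprocess.run(["dmsetup", "info"], capture_output=True, text=True, check=True).stdout` and raise an alert whenever the list is non-empty.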

What might have gone wrong here, and are there any steps I can take to prevent this from happening in the future?
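One defensive step, which does not explain the original hang: make the orchestration always attempt the resume, even when the suspend call or the snapshot step fails or times out, and put a bound on how long each ssh call may take. Below is a minimal sketch. It assumes a hypothetical host name and key paths, and it assumes that resuming a device that is not suspended is harmless.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Hypothetical values: adjust to your own host and key layout.
HOST = "root@db1.example.internal"
SUSPEND_KEY = "/etc/backup/suspend_key"  # key whose forced command suspends
RESUME_KEY = "/etc/backup/resume_key"    # key whose forced command resumes

Runner = Callable[[Sequence[str], float], None]


def run_with_timeout(cmd: Sequence[str], timeout: float) -> None:
    """Run a local command; raise if it exits non-zero or outlives `timeout`."""
    subprocess.run(list(cmd), check=True, timeout=timeout)


def ssh_cmd(key: str, host: str) -> List[str]:
    # No remote command: the forced command tied to `key` decides what runs.
    return ["ssh", "-o", "BatchMode=yes", "-i", key, host]


def backup(snapshot_all: Callable[[], None],
           run: Runner = run_with_timeout,
           host: str = HOST,
           timeout: float = 60.0,
           resume_attempts: int = 3) -> None:
    try:
        # Inside the try: a suspend that "fails" locally (e.g. ssh timed out)
        # may still have suspended the device remotely, so resume regardless.
        run(ssh_cmd(SUSPEND_KEY, host), timeout)
        snapshot_all()  # e.g. request the EBS snapshots of every PV
    finally:
        last_error: Optional[BaseException] = None
        for _ in range(resume_attempts):
            try:
                run(ssh_cmd(RESUME_KEY, host), timeout)
                break
            except (subprocess.SubprocessError, OSError) as exc:
                last_error = exc
        else:
            raise RuntimeError(
                "resume never succeeded; device may still be suspended"
            ) from last_error
```

The timeout only kills the local ssh client. It cannot unstick a remote dmsetup that is itself blocked, as happened here, but it does keep the backup job from hanging silently, so the failure surfaces right away.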

fields
  • After the whole discussion, I have the feeling that an LV snapshot would be more appropriate. If I understand correctly, an LVM spanned volume is nothing but an LV created in a VG consisting of several PVs that are EC2 "volumes". So an EBS snapshot freezes the PVs, right or wrong? – Nils Jun 02 '11 at 20:44
  • An EBS snapshot does not in itself freeze the PVs - it's lower level than that. – fields Jun 03 '11 at 01:38
  • Does "lower" mean even further toward the hardware layer? Anyway, are the rest of my assumptions correct? If so, why not use a higher-level "lvcreate -s SNAPSHOT..." instead? – Nils Jun 04 '11 at 20:56
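Nils' suggestion would look roughly like this on the instance. Below is a minimal sketch that only builds the command lines. It assumes hypothetical VG/LV names and a 1G copy-on-write size, and the VG needs that much free space. On its own, it does not address how the EBS snapshots of the individual PVs stay consistent with one another.

```python
from typing import List


def lv_snapshot_cmd(vg: str, lv: str, snap: str, cow_size: str = "1G") -> List[str]:
    """argv for `lvcreate` making a snapshot LV `snap` of /dev/<vg>/<lv>.

    `cow_size` is the copy-on-write area; it must fit in the VG's free
    extents and hold every block changed while the snapshot exists.
    """
    return ["lvcreate", "--snapshot", "--size", cow_size,
            "--name", snap, f"/dev/{vg}/{lv}"]


def lv_remove_cmd(vg: str, snap: str) -> List[str]:
    """argv for removing the snapshot LV once the backup is done."""
    return ["lvremove", "--force", f"/dev/{vg}/{snap}"]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Removing the snapshot afterwards matters, because a snapshot whose copy-on-write area fills up becomes invalid.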

1 Answer


What is the point in suspending before snapshotting?

I read "force" three times in your procedure.

If you have to "force" something, you should know exactly what you are doing. If not, try to avoid the "force".

So what warnings do you get if you do not force? That might give you some clues about what went wrong.

Nils
  • Thanks, but this is a misreading of my question and is not helpful. Suspending the volume prior to snapshot ensures that the lvm volume will be in sync across the snapshots of the multiple ebs volumes containing the lvm volume. An "ssh forced command" is unrelated to the snapshot except in the way that it allows remote execution of specific commands as root on a box that doesn't allow ssh logins for the root account. – fields May 31 '11 at 20:04
  • More specifically, the problem here is that my lvm volumes are stuck in a suspended state, and I can't figure out how to get them unsuspended or what's blocking the resume from working. – fields May 31 '11 at 20:06
  • So you should clarify the "force" in your description. "Suspend" does not sound like "sync" to me. It sounds more like stopping all I/O mid-action. From your description of "ssh force", it seems that you will not see any warnings. Is that correct? – Nils May 31 '11 at 20:17
  • No - it flushes i/o before suspending. From the man page: "Suspends a device. Any I/O that has already been mapped by the device but has not yet completed will be flushed. Any further I/O to that device will be postponed for as long as the device is suspended.". I've tried doing the resume via a regular root login, and it just hangs - no errors or warnings. – fields May 31 '11 at 20:26
  • I googled around a little. I found that an LV snapshot uses dmsetup suspend internally. But I also found that this sync has to be supported by the filesystem on top of the device. So what filesystem do you use here? – Nils Jun 01 '11 at 12:19
  • It's ext4, which should be fine. – fields Jun 01 '11 at 14:13
  • ext4 is quite new. Judging by this [release note](http://kernelnewbies.org/LinuxChanges#head-706f12db62b146bde4c701ae220faf2ea16aa467), it is still under heavy development. I found some [lockup bugs](http://www.fedoraforum.org/forum/showthread.php?t=231380) for ext4 in conjunction with "suspend", but these date from 2009. Still, sometimes the little buggers come back... – Nils Jun 02 '11 at 20:29
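On the question of what is blocking the resume: according to the man page quoted above, I/O to a suspended device is postponed. Processes waiting on it, such as ext4's journal thread, typically sit in uninterruptible sleep (state D). Below is a minimal sketch that lists such tasks by parsing `/proc/<pid>/stat`. It is hedged: on the affected box even `ps` hung, so reading `/proc` there may block as well.

```python
import os
from typing import Iterable, List, Tuple


def parse_stat(line: str) -> Tuple[int, str, str]:
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so take everything up to the *last* ") ".
    """
    pid_part, _, rest = line.partition(" (")
    comm, _, after = rest.rpartition(") ")
    return int(pid_part), comm, after.split()[0]


def blocked_tasks(stat_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """(pid, comm) for every task in uninterruptible sleep (state D)."""
    result = []
    for line in stat_lines:
        pid, comm, state = parse_stat(line)
        if state == "D":
            result.append((pid, comm))
    return result


def read_stat_lines() -> List[str]:
    """Read /proc/<pid>/stat for all current processes (Linux only)."""
    lines = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                lines.append(f.read())
        except OSError:
            continue  # the process exited between listdir() and open()
    return lines


if __name__ == "__main__":
    for pid, comm in blocked_tasks(read_stat_lines()):
        print(pid, comm)
```

Seeing which tasks are stuck in D state, and in particular whether the `dmsetup resume` process itself is one of them, narrows down whether the hang is in the filesystem or in device-mapper.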