
We've been running some production services on Amazon EC2 for a while, mainly on m1.large and m1.xlarge instances (non-EBS). Every so often, one of the attached (ephemeral) disks gets into a state of 100% util (as reported by iostat -xtc).

When a disk gets into this state, it is essentially unusable. A reboot fixes the issue, seemingly without any corruption. Occurrences appear to be random and happen every few weeks.

I'm not sure whether any software is related, but we're running up-to-date Ubuntu 10.04 (Lucid). These ephemeral disks currently operate under LVM (RAID0). Previously we were using mdadm in conjunction with LVM.

Has anyone else seen this behavior before (I'm not sure it is specific to EC2), and does anyone have ideas on how to avoid it or recover from it without rebooting?

yegg
  • 100% disk utilization as a statistic by itself is almost completely meaningless. Are you tracking the number of IOPS (reads and writes), the average request size, and svctm for those operations? The next step for solving this is to determine whether your application is causing a significant increase in the number or size of operations (your issue), or whether the operations are starting to take longer (Amazon issue) when you see the problem. – polynomial Oct 09 '11 at 01:48
  • Sorry for not being more clear. When this occurs, I can shut down everything such that nothing is happening on the machine and the drive is still in this state. Any operation to the drive doesn't complete. Also, when I was running mdadm it did not detect a degraded state for it. – yegg Oct 09 '11 at 14:56
  • Can you update with iops r/w counts then? iostat -x 1 output for several seconds perhaps? – polynomial Oct 10 '11 at 01:39
  • I don't have one in this state now, but will update once I do. – yegg Oct 10 '11 at 11:54
  • It happened again, and here is the output from iostat -x 1 for a few seconds, as per your suggestion: https://gist.github.com/1318739 – yegg Oct 27 '11 at 03:48
  • Is the raid used for swap? – Jason Martin Jan 06 '17 at 15:55
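
The breakdown polynomial asks for in the comments above (IOPS, average request size, time per operation, utilization) can be approximated from two snapshots of /proc/diskstats. The following is only a minimal sketch, not what iostat does internally: the function names (parse_diskstats, summarize, is_stalled, watch) and the 99% threshold are illustrative assumptions, and the field layout is assumed to follow the Linux kernel's documented diskstats format with 512-byte sectors.

```python
import time
from collections import namedtuple

# Subset of the per-device counters exposed in /proc/diskstats.
Sample = namedtuple(
    "Sample",
    "reads sectors_read ms_reading writes sectors_written ms_writing in_flight ms_busy",
)


def parse_diskstats(text):
    """Parse the text of /proc/diskstats into {device_name: Sample}."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:  # skip blank lines and short legacy partition rows
            continue
        stats[f[2]] = Sample(
            reads=int(f[3]), sectors_read=int(f[5]), ms_reading=int(f[6]),
            writes=int(f[7]), sectors_written=int(f[9]), ms_writing=int(f[10]),
            in_flight=int(f[11]), ms_busy=int(f[12]),
        )
    return stats


def summarize(before, after, interval_s):
    """Turn two Samples taken interval_s seconds apart into rates."""
    ios = (after.reads - before.reads) + (after.writes - before.writes)
    sectors = (after.sectors_read - before.sectors_read) + (
        after.sectors_written - before.sectors_written
    )
    ms_io = (after.ms_reading - before.ms_reading) + (
        after.ms_writing - before.ms_writing
    )
    return {
        "iops": ios / interval_s,
        # Sectors are assumed to be 512 bytes, as in the kernel's accounting.
        "avg_request_kib": sectors * 512 / 1024 / ios if ios else 0.0,
        "await_ms": ms_io / ios if ios else 0.0,
        "util_pct": 100.0 * (after.ms_busy - before.ms_busy) / (interval_s * 1000),
        "in_flight": after.in_flight,
    }


def is_stalled(summary, util_threshold=99.0):
    """Heuristic (an assumption): busy device, requests queued, none completing."""
    return (
        summary["iops"] == 0
        and summary["in_flight"] > 0
        and summary["util_pct"] >= util_threshold
    )


def watch(device, interval_s=1.0, path="/proc/diskstats"):
    """Sample one device twice and return its summary."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())[device]
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_diskstats(fh.read())[device]
    return summarize(before, after, interval_s)
```

Calling watch("sdb") (the device name is a placeholder) on an affected instance gives one summary per interval. A device that shows 100% utilization with many completed operations is simply saturated; one that shows 100% utilization with zero completions and requests still in flight matches the "any operation to the drive doesn't complete" symptom described in the comments.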

1 Answer


Even the ephemeral storage on EC2 instances is subject to the typical problems of multi-tenancy. Instead of just rebooting the server, [if your configuration permits] fully stop and start the instance so that your instance ends up on a different hypervisor.

Here is an article about Netflix's strategy for dealing with multi-tenancy issues on EC2.

mh.
  • Thx, but since we're not using EBS, we cannot stop and start the instance. Additionally, the reboot actually works fine, which leads me to believe it may not be a multi-tenancy issue. – yegg Oct 08 '11 at 21:47