
I have just started a large instance using the ami-fa01f193 AMI. When I run ps aux, a number of seemingly random processes show HUGE values for CPU time used. It looks like some sort of overflow. Has anyone seen this before, and how do I fix it?

Here is a sample output:

  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /sbin/init
    2 ?        S      0:00 [kthreadd]
    3 ?        S      0:00 [migration/0]
    4 ?        S    17179869:11 [ksoftirqd/0]
    5 ?        S      0:00 [watchdog/0]
    6 ?        S    17179869:11 [events/0]
    7 ?        S      0:00 [cpuset]
    8 ?        S      0:00 [khelper]
    9 ?        S      0:00 [netns]
   10 ?        S      0:00 [async/mgr]
   11 ?        S      0:00 [xenwatch]
   12 ?        S      0:00 [xenbus]
   14 ?        S      0:00 [migration/1]
   15 ?        S    17179869:11 [ksoftirqd/1]
   16 ?        S      0:00 [watchdog/1]
   17 ?        S    17179869:11 [events/1]
   18 ?        S      0:00 [sync_supers]
   19 ?        S      0:00 [bdi-default]
Mad Wombat

2 Answers


TL;DR: Known issue with Ubuntu 10.04 LTS on Amazon EC2 Nehalem instances


According to Mike Heffner (of Librato's Silverline):

During conversations with other tech companies we learned of an issue when running the Ubuntu 10.04 LTS release on certain Amazon EC2 servers -- the same environment as our backend servers. The issue appeared to be triggered when launching the Ubuntu 10.04 LTS release on hypervisors running on Intel Xeon Series 55xx (Nehalem) CPUs. For example, some Cassandra users were reporting that nodes would completely freeze up for extended periods of time. We identified that we only saw the large CPU spikes in our backend system CPU graphs when we had launched an E5507 backed instance.

Mike recommends the following workarounds while a kernel patch for Ubuntu 10.04 is pending:

There are a number of approaches users can take to avoid being impacted by this:

  1. Update to a newer Ubuntu release, for example, Ubuntu 10.10. Since Ubuntu 10.04, the Xen patches are better integrated into the kernel avoiding the requirement to backport them to 2.6.32. Users have reported that the original process lockups don’t occur with the Ubuntu 10.10 images.

  2. For users with environments currently dependent on the Ubuntu 10.04 environment (we still have some ourselves) we have modified our OPS scripts to throw out instances that boot with the Nehalem CPUs and reprovision until we get an E5430 machine. We have noticed that in some AZs we see more Nehalem’s than in others which likely points to AZs with more recent hardware deployments. Obviously this approach is not sustainable on a whole as more users seek out the older E5430 CPUs and Amazon further invests in the Nehalem architecture, so we are actively working to migrate our 10.04 systems to 10.10.

  3. For advanced users, building a custom 2.6.32 kernel that contains the patchset from the bug report is an option. There are also some custom kernels and AMIs in this bug report that users have reported success with.
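The CPU check behind workaround 2 could look roughly like the sketch below. This is not Librato's actual OPS script, just an illustration: it assumes the guest exposes the host CPU in the "model name" line of /proc/cpuinfo, and that matching a 55xx model number (such as the E5507 mentioned above) is enough to identify the affected Nehalem parts. The function names and the regular expression are made up for this example.

```python
import re

# Assumption: affected Nehalem Xeons carry a 55xx model number (e.g. E5507),
# while the older, unaffected parts look like E5430.
NEHALEM_55XX = re.compile(r"\b[ELWX]55\d\d\b")


def cpu_model(cpuinfo_text):
    """Return the first 'model name' value from /proc/cpuinfo text, or None."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def should_reprovision(cpuinfo_text):
    """True if the instance appears to be running on a Xeon 55xx (Nehalem) host."""
    model = cpu_model(cpuinfo_text)
    return bool(model and NEHALEM_55XX.search(model))


if __name__ == "__main__":
    # On a running Linux instance, read the real file.
    with open("/proc/cpuinfo") as f:
        print("reprovision" if should_reprovision(f.read()) else "keep")
```

A provisioning script could run this right after boot and terminate and relaunch the instance whenever it reports "reprovision".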

KnipSter

A similar thing happened to me on a CentOS server. A full, cold reboot fixed the problem, although I don't know how you would go about a cold reboot on a virtual machine...

On a freshly-rebooted server why would the CPU running time of processes be huge?
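That sanity check can be automated: no process can have accumulated more CPU time than the machine's uptime multiplied by the number of CPUs. Here is a minimal sketch, assuming ps output in the PID/TTY/STAT/TIME/COMMAND layout shown in the question; the function names are made up for this example.

```python
def cpu_time_seconds(time_field):
    """Convert a ps TIME field ([[DD-]HH:]MM:SS) to whole seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return days * 86400 + seconds


def implausible_rows(ps_output, uptime_seconds, cpus=1):
    """Return (pid, command, seconds) for rows whose CPU time exceeds uptime * cpus."""
    flagged = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        # Skip the header and anything that is not a five-column process row.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, _tty, _stat, time_field, command = fields
        secs = cpu_time_seconds(time_field)
        if secs > uptime_seconds * cpus:
            flagged.append((int(pid), command, secs))
    return flagged
```

On Linux, uptime_seconds can be taken from the first field of /proc/uptime. With the output from the question, the 17179869:11 entries work out to about 1,030,792,151 seconds (over 32 years), so they are flagged for any realistic uptime.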

ChrisW