Can this be caused by a kernel bug, specifically a spinlock problem?

Question

We are running Ubuntu 11.04 with the 2.6.38-13-generic kernel on a dedicated server with an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48 GB of RAM, and hardware RAID.

The output of the top command shows many kernel threads running on different cores.

thread - number of instances

ksoftirqd - 16 (one on each core)
kworker - 35
migration - 16 (one on each core)
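
For reference, counts like these can be reproduced by grouping kernel thread names by their base name. The sketch below is a minimal illustration, not a diagnostic tool: the function name `count_kernel_threads` and the sample names are hypothetical, and it assumes the usual per-CPU naming such as `ksoftirqd/0` or `kworker/0:1`, where everything before the first `/` is the thread type.

```python
from collections import Counter


def count_kernel_threads(names):
    """Group thread names such as 'ksoftirqd/3' by the part before the first '/'."""
    return Counter(name.split("/", 1)[0] for name in names)


# Hypothetical sample for a 2-core machine; real names would come from
# something like the COMMAND column of top or ps.
sample = ["ksoftirqd/0", "ksoftirqd/1", "migration/0", "migration/1",
          "kworker/0:0", "kworker/0:1", "kworker/1:0"]
counts = count_kernel_threads(sample)
for thread_type, count in sorted(counts.items()):
    print(f"{thread_type} - {count}")
```

One ksoftirqd and one migration thread per logical CPU is the normal layout, so 16 of each on a machine with 16 logical CPUs is not by itself a sign of trouble.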

We have already experienced two freezes and were forced to restart the machine. Both happened after we modified .htaccess and then reloaded Apache.

A general protection fault was the last message logged in syslog.

After the restart, most files on the hard disk had become 0 bytes. 2.5 GB of data shrank to 30 MB soon after the restart. :(

Is this caused by a kernel bug? 2.6.38-13 is not listed as a stable release on kernel.org. Does this mean we need to change from the current kernel to a stable one? If so, which kernel should we choose?
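
One point that may help with the kernel.org question: distribution kernels carry their own version strings. The sketch below splits a release string like the one above into its parts, assuming Ubuntu's usual `<upstream version>-<ABI number>-<flavour>` layout; the function name `split_ubuntu_release` is hypothetical. Under that assumption, `2.6.38-13-generic` is a distribution build based on upstream 2.6.38, so the `-13` suffix would not be expected to appear in kernel.org's release list.

```python
def split_ubuntu_release(release):
    """Split a release string, assuming '<upstream>-<abi>-<flavour>' (an assumption)."""
    upstream, abi, flavour = release.split("-", 2)
    return {"upstream": upstream, "abi": int(abi), "flavour": flavour}


parts = split_ubuntu_release("2.6.38-13-generic")
print(parts)
```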

syslog output

Isn't this a kernel spinlock case?

May 2 22:34:01 416831 CRON[19206]: (root) CMD (bash /home/admin/log-children)

May 2 22:34:11 416831 kernel: [3715446.033031] general protection fault: 0000 [#1] SMP

May 2 22:34:11 416831 kernel: [3715446.054726] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map

May 2 22:34:11 416831 kernel: [3715446.097404] CPU 5

May 2 22:34:11 416831 kernel: [3715446.097869] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_LOG xt_tcpudp ipt_REDIRECT xt_conntrack iptable_mangle nf_conntrack_ftp ipt_REJECT ipt_LOG xt_limit xt_multiport xt_state ip6table_filter ip6_tables iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables vesafb snd_hda_intel snd_hda_codec psmouse ioatdma snd_hwdep i7core_edac ghes edac_core lp hed dca joydev snd_pcm serio_raw parport snd_timer snd soundcore snd_page_alloc usbhid hid e1000e

May 2 22:34:11 416831 kernel: [3715446.279465]

May 2 22:34:11 416831 kernel: [3715446.303429] Pid: 19118, comm: apache2 Not tainted 2.6.38-13-generic #56-Ubuntu Supermicro X8DTL/X8DTL

May 2 22:34:11 416831 kernel: [3715446.355544] RIP: 0010:[] [] task_rq_lock+0x4a/0xa0

May 2 22:34:11 416831 kernel: [3715446.411635] RSP: 0018:ffff88060b853da8 EFLAGS: 00010082

May 2 22:34:11 416831 kernel: [3715446.440241] RAX: 010021b86505c7ff RBX: 0000000000013d00 RCX: 00000001162d8937

May 2 22:34:11 416831 kernel: [3715446.497492] RDX: 0000000000000282 RSI: ffff88060b853df0 RDI: 00007fdac0088280

May 2 22:34:11 416831 kernel: [3715446.559362] RBP: ffff88060b853dc8 R08: 0000000000000040 R09: 001fc00000000000

May 2 22:34:11 416831 kernel: [3715446.625144] R10: 0000000000000000 R11: dead000000100100 R12: 00007fdac0088280

May 2 22:34:11 416831 kernel: [3715446.695569] R13: ffff88060b853df0 R14: 0000000000013d00 R15: 0000000000000005

May 2 22:34:11 416831 kernel: [3715446.770654] FS: 00007fdac0023760(0000) GS:ffff880c3fc20000(0000) knlGS:0000000000000000

May 2 22:34:11 416831 kernel: [3715446.849786] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

May 2 22:34:11 416831 kernel: [3715446.889882] CR2: 00007fdac187ca80 CR3: 000000058cda1000 CR4: 00000000000006e0

May 2 22:34:11 416831 kernel: [3715446.968627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

May 2 22:34:11 416831 kernel: [3715447.049676] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

May 2 22:34:11 416831 kernel: [3715447.130842] Process apache2 (pid: 19118, threadinfo ffff88060b852000, task ffff88058c11c4a0)

May 2 22:34:11 416831 kernel: [3715447.212160] Stack:

May 2 22:34:11 416831 kernel: [3715447.251311] 00007fdac0088280 ffff880be1ca5ec8 000000000000000f 0000000000000000

May 2 22:34:11 416831 kernel: [3715447.331017] ffff88060b853e28 ffffffff8105f2e1 0000000000000000 0000000081a4c270

May 2 22:34:11 416831 kernel: [3715447.412179] ffff88060b853e38 0000000000000282 0000000000000021 ffff880b92505ec8

May 2 22:34:11 416831 kernel: [3715447.493302] Call Trace:

May 2 22:34:11 416831 kernel: [3715447.533014] [] try_to_wake_up+0x31/0x3e0

May 2 22:34:11 416831 kernel: [3715447.573262] [] wake_up_process+0x15/0x20

May 2 22:34:11 416831 kernel: [3715447.612669] [] wake_up_sem_queue_do+0x37/0x60

May 2 22:34:11 416831 kernel: [3715447.651327] [] freeary+0x1c6/0x200

May 2 22:34:11 416831 kernel: [3715447.689083] [] semctl_down.clone.5+0xbb/0x110

May 2 22:34:11 416831 kernel: [3715447.726360] [] ? sys_kill+0x7e/0x90

May 2 22:34:11 416831 kernel: [3715447.762833] [] ? fput+0x25/0x30

May 2 22:34:11 416831 kernel: [3715447.798362] [] sys_semctl+0x7e/0xd0

May 2 22:34:11 416831 kernel: [3715447.833126] [] system_call_fastpath+0x16/0x1b

May 2 22:34:11 416831 kernel: [3715447.867350] Code: 00 48 c7 c3 00 3d 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 49 8b 44 24 08 49 89 de <8b> 40 18 4c 03 34 c5 80 c8 aa 81 4c 89 f7 e8 53 4e 57 00 49 8b

May 2 22:34:11 416831 kernel: [3715447.970388] RIP [] task_rq_lock+0x4a/0xa0

May 2 22:34:11 416831 kernel: [3715448.004042] RSP

May 2 22:34:11 416831 kernel: [3715448.083219] ---[ end trace 244a1ec2d6f912fa ]---

May 2 22:35:01 416831 CRON[19243]: (root) CMD (bash /home/admin/log-children)
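
One way to read the register dump above: on x86-64 with 48-bit virtual addresses, touching a non-canonical address (one whose bits 63..47 are not all equal) raises a general protection fault rather than a page fault. The sketch below checks a few values from the log; the helper name `is_canonical` is hypothetical and the 48-bit width is an assumption about this CPU. It is only a sanity check on the addresses, not a diagnosis of the root cause.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) of a 64-bit address are all 0 or all 1."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


# Register values copied from the oops output above.
registers = {
    "RAX": 0x010021B86505C7FF,
    "RSP": 0xFFFF88060B853DA8,
    "RDI": 0x00007FDAC0088280,
}
results = {name: is_canonical(value) for name, value in registers.items()}
for name, ok in results.items():
    print(f"{name}: {'canonical' if ok else 'non-canonical'}")
```

RAX is the only one of the three that fails the check, which is consistent with a fault on a corrupted pointer. Whether that corruption comes from software or hardware cannot be determined from the address alone.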

score 4 · Answer 1 · answered May 15 '12 at 13:38


This sounds like a hardware fault. Today's Linux has no bugs left of the severity "I reloaded Apache, my server crashed and I lost my data"; you have some kind of hardware problem: overheating, or a bad RAM/CPU/motherboard/RAID controller/HDD/something else.

The reason your post has received a couple of downvotes is that it lacks details. We can't possibly guess what's wrong (other than my guess about a hardware problem).

answered May 15 '12 at 13:38

Janne Pikkarainen


The kernel we are using is not listed on kernel.org as a stable release. Does that indicate that the kernel we are using is not stable? – ananthan May 15 '12 at 13:52

...why are you using an unstable kernel on a production server? If you suspect the kernel is at fault, roll back to an older version and see if you can reproduce the error. – Bart Silverstrim May 15 '12 at 14:08
The fact that the kernel isn't a stable release doesn't mean it will crash like that. Generally it means some of the new features may not work or still have bugs. It's incredibly unlikely that a bug of that magnitude would make it into any release, stable or not. – Grant May 15 '12 at 14:28
It is the default version that comes with Ubuntu 11.04. Do these kernel threads indicate a problem? A quick Google search reveals that some are interrupt handlers. It doesn't seem normal to me to have this many interrupt handlers. – ananthan May 15 '12 at 14:32
We consulted a kernel developer about the issue, and based on his observation it's a spinlock condition, which is indeed a kernel bug. – ananthan May 18 '12 at 06:53

score 1 · Answer 2 · answered May 15 '12 at 13:45


This is extremely unlikely to be because of a kernel bug. As Janne says, hardware fault is more likely. Your speediest route to remediation is likely to be to replace faulty hardware and reinstall/recover data from backup.

answered May 15 '12 at 13:45

Sirch


The kernel we are using is not listed on kernel.org as a stable release. Does that indicate that the kernel we are using is not stable? – ananthan May 15 '12 at 13:53
What is your question? If you are trying to find a root cause of your data loss, you will have to look to your crash dump. Is your data loss due to a kernel bug? It is possible, albeit incredibly unlikely. A hardware fault is more likely. – Sirch May 15 '12 at 14:08
It is the default version that comes with Ubuntu 11.04. Do these kernel threads indicate a problem? A quick Google search reveals that some are interrupt handlers. It doesn't seem normal to me to have this many interrupt handlers. – ananthan May 15 '12 at 14:27
