1

My custom hardware is based on imax6q processor. Kernel I'm using - Linux-boundary 4.1.15. I'm getting kernel panics while running linux application on my custom hardware design. So how can I debug this to clarify whether this is a hardware related or software problem ? Or how to address which memory is corrupting ?

Presently I'm using addr2line -e vmlinux_with_debug_info 804a47e8 to decode kernel panic message. But it only shows the kernel source when the panic occur(Where program counter stopped) .

1) As an example from the following kernel log when I decode [804a47e8] it only give the kernel source. So do you know how to identify the which memory is going to corrupt ? Or else can we get more details on this ?

Ex :

Using addr2line>>

root@osboxes:/home/debug# addr2line -e vmlinux 804a47e8
/yocto/fsl-community-bsp/build/workspace/sources/linux- 
boundary/lib/rbtree.c:457

So the source is rbtree.c and line 457 is going to crash.

if (node->rb_right) {
node = node->rb_right; 
while (node->rb_left)    **//Line 457**
node=node->rb_left;
return (struct rb_node *)node;
}

In file entry-armv.S line 1219 is pointing from addr2line

vectors_start:
>>W(b)    vector_rst    **//Line 1219**
W(b)    vector_und
W(ldr)    pc, __vectors_start + 0x1000
W(b)    vector_pabt
W(b)    vector_dabt
W(b)    vector_addrexcptn
W(b)    vector_irq
W(b)    vector_fiq

Like this I could identify some file like;

entry-armv.S
timerqueue.c
thread_info.h
spinlock.c

and their lines. With this how I get an idea what is going wrong with the kernel or memory ?

2) If the kernel panic log point the above files when the panic happen what we can figure-out from this ? Can we find why the memory is going to corrupt and the reason with this information ?

Kernel Panic Log

[ 6814.239517] Unable to handle kernel NULL pointer dereference at virtual 
address 00000408
[ 6814.247646] pgd = 80004000
[ 6814.250364] [00000408] *pgd=00000000
[ 6814.253979] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[ 6814.259387] Modules linked in: mxc_v4l2_capture ipu_bg_overlay_sdc tw6869 
videobuf2_dma_contig ipu_still ipu_prp_enc videobuf2_memops ipu_csi_enc 
adv7610_video ipu_fg_overlay_sdc v4l2_int_device galcore(O)[ 6814.276790] 
DEBUG: ADV7610 HDMI cable detected TENGRI-RC
[ 6814.280811] DEBUG: ADV7610 read 1280x720p
[ 6814.282531] DEBUG: ADV7610 returned 1280x720p@61

[ 6814.291626]
[ 6814.293327] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G O 4.1.15-2.0.0- 
ga+yocto+gff4e28b #1
[ 6814.302558] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 6814.309096] task: ce11b200 ti: ce14a000 task.ti: ce14a000
[ 6814.314520] PC is at rb_next+0x2c/0x6c
[ 6814.318282] LR is at timerqueue_del+0x48/0x80
[ 6814.322647] pc : [<804a47e8>] lr : [<804a7128>] psr: 20030193
[ 6814.322647] sp : ce14bd80 ip : ce14bd90 fp : ce14bd8c
[ 6814.334128] r10: 00000002 r9 : 00000001 r8 : ce14be40
[ 6814.339356] r7 : d0f0f650 r6 : 00000000 r5 : d0f0f404 r4 : d0f0f650
[ 6814.345887] r3 : 00000400 r2 : d0f0f650 r1 : d0f0f650 r0 : 00000400
[ 6814.352419] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment 
kernel
[ 6814.359819] Control: 10c5387d Table: 5e96404a DAC: 00000015
[ 6814.365569] Process swapper/1 (pid: 0, stack limit = 0xce14a210)
[ 6814.371580] Stack: (0xce14bd80 to 0xce14c000)
[ 6814.375945] bd80: ce14bdac ce14bd90 804a7128 804a47c8 d0f0f650 d0f0f3f8 
00000000 d0f0f650
[ 6814.384128] bda0: ce14bddc ce14bdb0 8018fa6c 804a70ec 00000000 d0f0f650 
00000632 d0f0f3f8
[ 6814.392311] bdc0: d0f0f3c0 ce14be40 00000001 d0f0f4d8 ce14be04 ce14bde0 
8019025c 8018fa30
[ 6814.400494] bde0: 8a653461 00000632 d0f0f3f8 00000000 d0f0f3c0 00000001 
ce14be7c ce14be08
[ 6814.408677] be00: 80190894 80190210 8a653461 00000632 80e02508 d0f0f4d0 
d0f0f498 d0f0f460
[ 6814.416860] be20: 00000003 d0f0f3f8 ffffffff 7fffffff 8a653461 00000632 
8a653461 00000632
[ 6814.425042] be40: 8a653461 00000632 ce14be6c dc8ba30f 00008648 80d98648 
80d99c34 00000001
[ 6814.433225] be60: 80e88b48 00000000 00000001 80e02508 ce14be94 ce14be80 
801a061c 80190768
[ 6814.441407] be80: 80d99c34 80d99c34 ce14bec4 ce14be98 8010ff50 801a05e0 
ce14bee8 f4a0010c
[ 6814.449590] bea0: 80e02f7c ce14bee8 f4a00100 d0f11ed0 80e08b14 80e02508 
ce14bee4 ce14bec8
[ 6814.457772] bec0: 80101594 8010fd60 806fb0b0 60030013 ffffffff ce14bf1c 
ce14bf8c ce14bee8
[ 6814.465955] bee0: 8010d240 80101538 00000000 80ed4c20 dc8ba30f dc8ba30f 
89d010b1 00000632
[ 6814.474137] bf00: 80e89160 00000004 d0f11ed0 80e08b14 80e02508 ce14bf8c 
ce14bed0 ce14bf30
[ 6814.482320] bf20: 80944994 806fb0b0 60030013 ffffffff 8a652374 00000632 
89d010b1 00000632
[ 6814.490502] bf40: 00000004 00000000 0094f441 00000001 8a652374 00000632 
00000000 dc8ba30f
[ 6814.498685] bf60: 80d97320 ce14a000 80e025e8 80e89160 d0f11ed0 80e08a74 
80a02970 ce14bfa0
[ 6814.506868] bf80: ce14bf9c ce14bf90 806fb390 806faffc ce14bfdc ce14bfa0 
80171a74 806fb378
[ 6814.515050] bfa0: 80d9b880 80d9b880 00000000 80e02e00 80e025f0 80e888ea 
00000001 80c137fc
[ 6814.523233] bfc0: 80d9aec8 80d97300 10c0387d 80e8e35c ce14bff4 ce14bfe0 
8010fae0 801717d8
[ 6814.531416] bfe0: 5e13006a 00000015 00000000 ce14bff8 1010162c 8010f994 
5b3c779b 85ac6ed7
[ 6814.539594] Backtrace:
[ 6814.542068] [<804a47bc>] (rb_next) from [<804a7128>] 
(timerqueue_del+0x48/0x80)
[ 6814.549393] [<804a70e0>] (timerqueue_del) from [<8018fa6c>] 
(__remove_hrtimer+0x48/0xdc)
[ 6814.557487] r7:d0f0f650 r6:00000000 r5:d0f0f3f8 r4:d0f0f650
[ 6814.563218] [<8018fa24>] (__remove_hrtimer) from [<8019025c>] 
(__run_hrtimer+0x58/0x284)
[ 6814.571310] r10:d0f0f4d8 r9:00000001 r8:ce14be40 r7:d0f0f3c0 r6:d0f0f3f8 
r5:00000632
[ 6814.579220] r4:d0f0f650 r3:00000000
[ 6814.582837] [<80190204>] (__run_hrtimer) from [<80190894>] 
(hrtimer_interrupt+0x138/0x344)
[ 6814.591104] r9:00000001 r8:d0f0f3c0 r7:00000000 r6:d0f0f3f8 r5:00000632 
r4:8a653461
[ 6814.598940] [<8019075c>] (hrtimer_interrupt) from [<801a061c>] 
(tick_receive_broadcast+0x48/0x60)
[ 6814.607813] r10:80e02508 r9:00000001 r8:00000000 r7:80e88b48 r6:00000001 
r5:80d99c34
[ 6814.615721] r4:80d98648
[ 6814.618283] [<801a05d4>] (tick_receive_broadcast) from [<8010ff50>] 
(handle_IPI+0x1fc/0x2f8)
[ 6814.626723] r5:80d99c34 r4:80d99c34
[ 6814.630337] [<8010fd54>] (handle_IPI) from [<80101594>] 
(gic_handle_irq+0x68/0x6c)
[ 6814.637909] r10:80e02508 r9:80e08b14 r8:d0f11ed0 r7:f4a00100 r6:ce14bee8 
r5:80e02f7c
[ 6814.645818] r4:f4a0010c r3:ce14bee8
[ 6814.649431] [<8010152c>] (gic_handle_irq) from [<8010d240>] 
(__irq_svc+0x40/0x74)
[ 6814.656917] Exception stack(0xce14bee8 to 0xce14bf30)
[ 6814.661974] bee0: 00000000 80ed4c20 dc8ba30f dc8ba30f 89d010b1 00000632
[ 6814.670157] bf00: 80e89160 00000004 d0f11ed0 80e08b14 80e02508 ce14bf8c 
ce14bed0 ce14bf30
[ 6814.678338] bf20: 80944994 806fb0b0 60030013 ffffffff
[ 6814.683392] r7:ce14bf1c r6:ffffffff r5:60030013 r4:806fb0b0
[ 6814.689129] [<806faff0>] (cpuidle_enter_state) from [<806fb390>] 
(cpuidle_enter+0x24/0x28)
[ 6814.697395] r10:ce14bfa0 r9:80a02970 r8:80e08a74 r7:d0f11ed0 r6:80e89160 
r5:80e025e8
[ 6814.705304] r4:ce14a000
[ 6814.707865] [<806fb36c>] (cpuidle_enter) from [<80171a74>] 
(cpu_startup_entry+0x2a8/0x43c)
[ 6814.716137] [<801717cc>] (cpu_startup_entry) from [<8010fae0>] 
(secondary_start_kernel+0x158/0x164)
[ 6814.725184] r7:80e8e35c
[ 6814.727743] [<8010f988>] (secondary_start_kernel) from [<1010162c>] 
(0x1010162c)
[ 6814.735142] r5:00000015 r4:5e13006a
[ 6814.738754] Code: e5923004 e3530000 0a000004 e1a00003 (e5933008)
[ 6814.744855] ---[ end trace 4a16766634a9d08f ]---
[ 6814.749478] Kernel panic - not syncing: Fatal exception in interrupt
[ 6814.755842] CPU0: stopping
[ 6814.758566] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D O 4.1.15-2.0.0- 
ga+yocto+gff4e28b #1
[ 6814.767789] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 6814.774320] Backtrace:
[ 6814.776799] [<8010c358>] (dump_backtrace) from [<8010c5d4>] 
(show_stack+0x20/0x24)
[ 6814.784372] r7:80e88b48 r6:80e45d94 r5:00000000 r4:80e45d94
[ 6814.790107] [<8010c5b4>] (show_stack) from [<8093ee58>] 
(dump_stack+0x7c/0xbc)
[ 6814.797338] [<8093eddc>] (dump_stack) from [<80110038>] 
(handle_IPI+0x2e4/0x2f8)
[ 6814.804737] r7:80e88b48 r6:00000005 r5:80d99c34 r4:80d99c34
[ 6814.810469] [<8010fd54>] (handle_IPI) from [<80101594>] 
(gic_handle_irq+0x68/0x6c)
[ 6814.818041] r10:80e02508 r9:80e08b14 r8:d0f04ed0 r7:f4a00100 r6:80e01e98 
r5:80e02f7c
[ 6814.825955] r4:f4a0010c r3:80e01e98
[ 6814.829571] [<8010152c>] (gic_handle_irq) from [<8010d240>] 
(__irq_svc+0x40/0x74)
[ 6814.837057] Exception stack(0x80e01e98 to 0x80e01ee0)
[ 6814.842114] 1e80: 00000000 d0f084c0
[ 6814.850299] 1ea0: dc8ba30f dc8ba30f a8cb74e4 00000632 80e89160 00000004 
d0f04ed0 80e08b14
[ 6814.858483] 1ec0: 80e02508 80e01f3c 80e01e80 80e01ee0 80944994 806fb0b0 
600f0013 ffffffff
[ 6814.866662] r7:80e01ecc r6:ffffffff r5:600f0013 r4:806fb0b0
[ 6814.872399] [<806faff0>] (cpuidle_enter_state) from [<806fb390>] 
(cpuidle_enter+0x24/0x28)
[ 6814.880665] r10:80e01f50 r9:80a02970 r8:80e08a74 r7:d0f04ed0 r6:80e89160 
r5:80e025e8
[ 6814.888581] r4:80e00000
[ 6814.891140] [<806fb36c>] (cpuidle_enter) from [<80171a74>] 
(cpu_startup_entry+0x2a8/0x43c)
[ 6814.899418] [<801717cc>] (cpu_startup_entry) from [<8093cb20>] 
(rest_init+0x98/0x9c)
[ 6814.907165] r7:80e02500
[ 6814.909731] [<8093ca88>] (rest_init) from [<80d00d9c>] 
(start_kernel+0x404/0x424)
[ 6814.917216] r5:80e8e000 r4:80e8e04c
[ 6814.920833] [<80d00998>] (start_kernel) from [<1000807c>] (0x1000807c)
[ 6814.927366] CPU3: stopping
[ 6814.930085] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G D O 4.1.15-2.0.0- 
ga+yocto+gff4e28b #1
[ 6814.939309] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 6814.945839] Backtrace:
[ 6814.948315] [<8010c358>] (dump_backtrace) from [<8010c5d4>] 
(show_stack+0x20/0x24)
[ 6814.955888] r7:80e88b48 r6:80e45d94 r5:00000000 r4:80e45d94
[ 6814.961621] [<8010c5b4>] (show_stack) from [<8093ee58>] 
(dump_stack+0x7c/0xbc)
[ 6814.968852] [<8093eddc>] (dump_stack) from [<80110038>] 
(handle_IPI+0x2e4/0x2f8)
[ 6814.976250] r7:80e88b48 r6:00000005 r5:80d99c34 r4:80d99c34
[ 6814.981983] [<8010fd54>] (handle_IPI) from [<80101594>] 
(gic_handle_irq+0x68/0x6c)
[ 6814.989555] r10:80e02508 r9:80e08b14 r8:d0f2bed0 r7:f4a00100 r6:ce14fee8 
r5:80e02f7c
[ 6814.997471] r4:f4a0010c r3:ce14fee8
[ 6815.001087] [<8010152c>] (gic_handle_irq) from [<8010d240>] 
(__irq_svc+0x40/0x74)
[ 6815.008572] Exception stack(0xce14fee8 to 0xce14ff30)
[ 6815.013631] fee0: 00000000 d0f2f4c0 dc8ba30f dc8ba30f a8cc0d3c 00000632
[ 6815.021815] ff00: 80e89160 00000004 d0f2bed0 80e08b14 80e02508 ce14ff8c 
ce14fed0 ce14ff30
[ 6815.029996] ff20: 80944994 806fb0b0 60010013 ffffffff
[ 6815.035050] r7:ce14ff1c r6:ffffffff r5:60010013 r4:806fb0b0
[ 6815.040786] [<806faff0>] (cpuidle_enter_state) from [<806fb390>] 
(cpuidle_enter+0x24/0x28)
[ 6815.049052] r10:ce14ffa0 r9:80a02970 r8:80e08a74 r7:d0f2bed0 r6:80e89160 
r5:80e025e8
[ 6815.056967] r4:ce14e000
[ 6815.059525] [<806fb36c>] (cpuidle_enter) from [<80171a74>] 
(cpu_startup_entry+0x2a8/0x43c)
[ 6815.067798] [<801717cc>] (cpu_startup_entry) from [<8010fae0>] 
(secondary_start_kernel+0x158/0x164)
[ 6815.076846] r7:80e8e35c
[ 6815.079404] [<8010f988>] (secondary_start_kernel) from [<1010162c>] 
(0x1010162c)
[ 6815.086803] r5:00000015 r4:5e13006a
[ 6815.090414] CPU2: stopping
[ 6815.093135] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D O 4.1.15-2.0.0- 
ga+yocto+gff4e28b #1
[ 6815.102358] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 6815.108888] Backtrace:
[ 6815.111364] [<8010c358>] (dump_backtrace) from [<8010c5d4>] 
(show_stack+0x20/0x24)
[ 6815.118937] r7:80e88b48 r6:80e45d94 r5:00000000 r4:80e45d94
[ 6815.124669] [<8010c5b4>] (show_stack) from [<8093ee58>] 
(dump_stack+0x7c/0xbc)
[ 6815.131901] [<8093eddc>] (dump_stack) from [<80110038>] 
(handle_IPI+0x2e4/0x2f8)
[ 6815.139300] r7:80e88b48 r6:00000005 r5:80d99c34 r4:80d99c34
[ 6815.145032] [<8010fd54>] (handle_IPI) from [<80101594>] 
(gic_handle_irq+0x68/0x6c)
[ 6815.152604] r10:80e02508 r9:80e08b14 r8:d0f1eed0 r7:f4a00100 r6:ce14dee8 
r5:80e02f7c
[ 6815.160519] r4:f4a0010c r3:ce14dee8
[ 6815.164134] [<8010152c>] (gic_handle_irq) from [<8010d240>] 
(__irq_svc+0x40/0x74)
[ 6815.171620] Exception stack(0xce14dee8 to 0xce14df30)
[ 6815.176679] dee0: 00000000 d0f224c0 dc8ba30f dc8ba30f a87f549c 00000632
[ 6815.184863] df00: 80e89160 00000004 d0f1eed0 80e08b14 80e02508 ce14df8c 
ce14ded0 ce14df30
[ 6815.193044] df20: 80944994 806fb0b0 60010013 ffffffff
[ 6815.198099] r7:ce14df1c r6:ffffffff r5:60010013 r4:806fb0b0
[ 6815.203834] [<806faff0>] (cpuidle_enter_state) from [<806fb390>] 
(cpuidle_enter+0x24/0x28)
[ 6815.212100] r10:ce14dfa0 r9:80a02970 r8:80e08a74 r7:d0f1eed0 r6:80e89160 
r5:80e025e8
[ 6815.220015] r4:ce14c000
[ 6815.222574] [<806fb36c>] (cpuidle_enter) from [<80171a74>] 
(cpu_startup_entry+0x2a8/0x43c)
[ 6815.230847] [<801717cc>] (cpu_startup_entry) from [<8010fae0>] 
(secondary_start_kernel+0x158/0x164)
[ 6815.239895] r7:80e8e35c
[ 6815.242453] [<8010f988>] (secondary_start_kernel) from [<1010162c>] 
(0x1010162c)
[ 6815.249852] r5:00000015 r4:5e13006a
[ 6815.253467] ---[ end Kernel panic - not syncing: Fatal exception in 
interrupt

Regards, Peter.

Peter_Amond
  • 51
  • 1
  • 9
  • 1
    I'm not good at kernel but it looks like your node->rb_left is point to forbidden memory area, first thing try to check it for NULL before access it. – Nick S Jun 04 '18 at 11:05
  • 1
    As @NickS pointed, it could be null pointer. Do Kernel print the value of the pointer. addr2line is surely the way to go, and do some kprintfs to debug. – Vinay P Jun 04 '18 at 11:08

2 Answers2

1

From the source code, rb_next is passed a pointer to a tree node. From your panic log:

[ 6814.345887] r3 : 00000400 r2 : d0f0f650 r1 : d0f0f650 r0 : 00000400

On ARM, r0 holds the first parameter, and it does not look like a valid address. See the dereference causing the OOPS is 0x00000408, so - as expected - it is in

while (node->rb_left) /* Here node == 0x00000408, which is `passed_node->rb_right` */

Your next step, get a coredump and examine the timerqueue in question (coming from timerqueue_del).

Perhaps the program that triggers this could tell us more.

Michael Foukarakis
  • 39,737
  • 6
  • 87
  • 123
  • 1
    my 2¢. I have looked into timerqueue_del funcion in 4.14. it is doing sanity check for argument passed to rb_next `WARN_ON_ONCE(RB_EMPTY_NODE(&node->node));` also in rb_next simmilar test `if (RB_EMPTY_NODE(node))`. I have checked the definition not checking for NULL value. I feel issue is in corrupt linked list. Either Not terminated properly or some node has changed some invalid address. – Devidas Jun 04 '18 at 12:28
  • @Michael addr2line -e vmlinux timerqueue_del+0x48/0x8 This point to /linux-boundary/arch/arm/kernel/entry-armv.S:1219 And this line included vectors_start: W(b) vector_rst ****// Line 1219 W(b) vector_und W(ldr) pc, __vectors_start + 0x1000 W(b) vector_pabt W(b) vector_dabt W(b) vector_addrexcptn W(b) vector_irq W(b) vector_fiq – Peter_Amond Jun 05 '18 at 09:39
  • @Michael addr2line -e vmlinux 804a7128 this point to linux-boundary/lib/timerqueue.c:80 And This line include /* update next pointer */ if (head->next == node) { struct rb_node *rbn = rb_next(&node->node); head->next = rbn ? >>>>>>This is line 80<<<<<<<<< rb_entry(rbn, struct timerqueue_node, node) : NULL; } rb_erase(&node->node, &head->head); RB_CLEAR_NODE(&node->node); } EXPORT_SYMBOL_GPL(timerqueue_del); – Peter_Amond Jun 05 '18 at 09:41
  • @Michael If the above code which I took from source kernel files are not clear you can find all the source codes from here. timerqueue.c >> https://pastebin.com/DhX6qFJT entry-armv.S >> https://pastebin.com/C4kMbEGK I must be thankful to you if you can give me an idea whether this is hardware or kernel problem. I have another same custom hardware which work properly without kernel panics (From 50 units 30 are working and 20 are getting kernel panics, but those kernel panic boards passed memory hardware verification stress tests). So I just want to pinpoint what is the exact problem is. – Peter_Amond Jun 05 '18 at 10:00
  • @ Devidas Many thanks for your reply. What did you mean from this ? " I feel issue is in corrupt linked list. Either Not terminated properly or some node has changed some invalid address." Can you give me some clue to clarify this kernel panic or any source to decode and get clear understanding about kernel panics. – Peter_Amond Jun 05 '18 at 10:03
0

I think it's not hardware issue because dereference happens in virtual memory not physical.

You can decode your Ooops to get more information:

17 its value in decimal you need to convert it to binary after it you get your 10001

You can find more detailed information at arch/arch/mm/fault.c

start decode it from right to left:

bit 0 ==    0: no page found                 1: protection fault
bit 1 ==    0: read access                   1: write access
bit 2 ==    0: kernel-mode access            1: user-mode access
bit 3 ==    0: reserved bit not detect       1: use of reserved bit detected
bit 4 ==    0: No fault at instuction fetch  1: fault was an instruction fetch

In your case its:

Protection fault | read access | kernel | bit not set | fault on instruction

So add some debug printing and check for NULL pointer.

Nick S
  • 1,299
  • 1
  • 11
  • 23
  • @ Nick S Thank you for your feedback. I have 50 units of custom hardware designs and some boards are working properly without kernel panics. From hardware verification side I did all verification like memory calibration and stress test. But I didn't get any issue with hardware side. But still these boards are generating kernel panics. So I have a doubt that some thing went wrong with kernel. But how other units are working ? That is why I'm up to debug each and every kernel panic to get some useful information regarding each and every kernel panic boards. – Peter_Amond Jun 05 '18 at 10:40