First of all, I am looking for debugging tips. If someone can point out the one line of code to change or the one peripheral config bit to set to fix the problem, that would be terrific. But that's not what I'm hoping for; I'm looking more for how to go about debugging it.

Googling "msleep hang linux kernel site:stackoverflow.com" yields 13 answers and none of them is on point, so I think I'm safe to ask.

I rebuilt an ARM Linux kernel for an embedded TI AM1808 ARM processor (Sitara/DaVinci?). I see the whole boot log, up to the login: prompt, coming out of the serial port, but trying to log in gets no response; it doesn't even echo what I typed.

After lots of debugging I arrived at the kernel and added debugging code between lines 828 and 830 (yes, the kernel version is 2.6.37). This point is still in kernel mode, before '/sbin/init' is called:

http://lxr.linux.no/linux+v2.6.37/init/main.c#L815

Right before line 830 I added a forever loop with a printk, and I see the results. I let it run for a couple of hours and it counted to about 2 million. Sample line:

dbg:init/main.c:1202: 2088430

So it has spit out 60 million bytes without problem.

However, if I add msleep(1000) in the loop, it prints only once, i.e. msleep() does not return.

Details: I added a conditional printk at line 4073 in the scheduler, conditioned on a flag that gets set at the start of the forever test loop described above. It shows that schedule() is no longer called once the hang occurs:

http://lxr.linux.no/linux+v2.6.37/kernel/sched.c#L4064
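The flag-gated printk described above can be sketched roughly as follows. This is only an illustration, not the code that was actually added: `dbg_trace_enabled`, `dbg_trace()` and `DBG_TRACE()` are hypothetical names, and the `printk` fallback exists only so the snippet also compiles outside the kernel. The output format imitates the sample line shown earlier.

```c
#include <stdio.h>

/* Outside the kernel, map printk to stderr so the sketch can be compiled
 * and tried in user space. Inside the kernel the real printk is used. */
#ifndef __KERNEL__
#define printk(...) fprintf(stderr, __VA_ARGS__)
#endif

/* Hypothetical flag: set it at the start of the test loop in init/main.c.
 * In a real kernel it would be declared "extern" wherever it is read,
 * for example in kernel/sched.c. */
int dbg_trace_enabled;

/* Print one trace line if tracing is on; returns 1 if a line was printed. */
static int dbg_trace(const char *file, int line, unsigned long count)
{
    if (!dbg_trace_enabled)
        return 0;
    printk("dbg:%s:%d: %lu\n", file, line, count);
    return 1;
}

/* Convenience wrapper that records the call site automatically. */
#define DBG_TRACE(count) dbg_trace(__FILE__, __LINE__, (count))
```

Gating on a flag keeps the console quiet during normal boot, so the only scheduler output is what happens after the test loop starts.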

The only selections under 'Device Drivers' in .config are: Block devices, I2C support, SPI support.

The kernel and its ramdisk are loaded using uboot/TFTP. I don't believe it tries to use the Ethernet. Since all of this happens before '/sbin/init', very little should be happening.

More details: I have a very similar board with the same CPU. I can run the same uImage and the same ramdisk and it works fine there. I can login and do the usual things.

I have run a memory test (64 MB total; I limited the kernel to 32 MB and tested the other 32 MB; it's a single-chip DDR2) and found no problem. One board uses UART0 and the other UART2, but the boot log comes out of both, so that should not be the problem.

Any debugging tips are greatly appreciated. I don't have an appropriate JTAG, so I can't use that.

TheCodeArtist
user1261470
  • could it be that the scheduler depends on some hardware timer? which is maybe broken? or using a different io address? – Willem Hengeveld Mar 10 '12 at 21:27
  • As far as I know, everything should be on chip (I guess it's worth double checking), so they should see an identical environment except for the serial ports (all 3 should be available; you just choose which one is active). I guess I'll look for the timer tick IRQ and add a printk there (if I can find it :) – user1261470 Mar 11 '12 at 05:32

2 Answers

0

If msleep doesn't return or doesn't make it to schedule, then in order to debug we can follow the call stack.

msleep calls schedule_timeout_uninterruptible(timeout), which calls schedule_timeout(timeout), which in the default case exits without calling schedule if the timeout in jiffies passed to it is < 0, so that is one thing to check.

If the timeout is not negative, then setup_timer_on_stack(&timer, process_timeout, (unsigned long)current); is called, followed by __mod_timer(&timer, expire, false, TIMER_NOT_PINNED); before calling schedule.

If we aren't getting to schedule then something must be happening in either setup_timer_on_stack or __mod_timer.

The call trace for setup_timer_on_stack is as follows: setup_timer_on_stack calls setup_timer_on_stack_key, which calls init_timer_on_stack_key. That function is either external, if CONFIG_DEBUG_OBJECTS_TIMERS is enabled, or it calls init_timer_key(timer, name, key), which calls debug_init followed by __init_timer(timer, name, key).

__mod_timer first calls timer_stats_timer_set_start_info(timer); then a whole lot of other function calls.

I would advise starting by putting a printk or two in schedule_timeout, probably on either side of the setup_timer_on_stack call or on either side of the __mod_timer call.
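To make the placement of those printk calls concrete, here is a small user-space model of the control flow described above. It is a sketch and not kernel source: `model_schedule_timeout`, `fake_schedule`, `fake_jiffies`, `fake_timer_expires` and the `trace` helpers are all made-up stand-ins, the timer setup calls are reduced to comments, and the special case for an infinite timeout is left out. In the real function each `trace_mark()` would be a printk.

```c
#include <stdio.h>
#include <string.h>

#define TRACE_MAX 8

/* Records the markers in order, so the last one reached can be inspected. */
struct trace {
    const char *step[TRACE_MAX];
    int count;
};

/* Stand-ins for kernel state (assumptions for this model only). */
static unsigned long fake_jiffies = 1000;
static unsigned long fake_timer_expires;

/* In the kernel this would be a printk at the same spot. */
static void trace_mark(struct trace *t, const char *label)
{
    if (t->count < TRACE_MAX)
        t->step[t->count++] = label;
    fprintf(stderr, "dbg:schedule_timeout: %s\n", label);
}

/* Returns 1 if the most recent marker equals label. */
static int trace_last_is(const struct trace *t, const char *label)
{
    return t->count > 0 && strcmp(t->step[t->count - 1], label) == 0;
}

/* Pretend the task slept until the timer fired. */
static void fake_schedule(void)
{
    fake_jiffies = fake_timer_expires;
}

/* Models only the default path described in the answer. */
static long model_schedule_timeout(long timeout, struct trace *t)
{
    unsigned long expire;

    if (timeout < 0) {
        /* Early exit: schedule is never called on this path. */
        trace_mark(t, "wrong timeout");
        goto out;
    }

    expire = (unsigned long)timeout + fake_jiffies;

    trace_mark(t, "before setup_timer_on_stack");
    /* setup_timer_on_stack(&timer, process_timeout, ...) would go here */
    trace_mark(t, "before __mod_timer");
    fake_timer_expires = expire;    /* stands in for __mod_timer() */
    trace_mark(t, "before schedule");
    fake_schedule();
    trace_mark(t, "after schedule");

    timeout = (long)(expire - fake_jiffies);
out:
    return timeout < 0 ? 0 : timeout;
}
```

On real hardware, the last marker that reaches the console shows which call never returned: if "before schedule" is the final line, the hang is inside schedule or in whatever task it switched to.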

Appleman1234
  • That's a ton for me to chew. Thanks, and will report back. – user1261470 Mar 11 '12 at 06:33
  • added "before" and "after" around 'schedule();': http://lxr.linux.no/linux+v2.6.37/kernel/timer.c#L1477 also printk at the entry and exit of schedule();: http://lxr.linux.no/linux+v2.6.37/kernel/sched.c#L4073 http://lxr.linux.no/linux+v2.6.37/kernel/sched.c#L4152 – user1261470 Mar 11 '12 at 07:36
  • seen "before" on both boards. seen the schedule() entry/exit pair twice on the bad board and thrice on the good board. also see: http://lxr.linux.no/linux+v2.6.37/kernel/sched.c#L4133 The context switch has flipped the stack from under us, so I guess schedule() returns to somewhere else, and only the third time does it return to my msleep call. So I guess the question is why the task that calls msleep() is not ready to be scheduled again, or whether the CPU has crashed (I thought there would at least be a panic message). – user1261470 Mar 11 '12 at 07:38
  • How many "before"s did you see? 2 for the bad board and 3 for the good one? You may want to update your question with the additional information rather than placing it in the comments. – Appleman1234 Mar 11 '12 at 07:46
0

This problem has been solved.

With liberal use of printk, it was determined that schedule() does indeed switch to another task, the idle task. In this instance, this being an embedded Linux, the original code base I copied from installed an idle task. That idle task seems inappropriate for my board and locked up the CPU, thus causing the crash. Commenting out the call to the idle task

http://lxr.linux.no/linux+v2.6.37/arch/arm/mach-davinci/cpuidle.c#L93

works around the problem.

Flexo
user1261470
  • I am getting the same problem as you: my board hangs on the msleep call, but I did not understand your fix. Did you comment out schedule() under "schedule_timeout"? – Nishith Goswami Mar 23 '16 at 23:00
  • As I said, I commented out the call to the idle task. The Linux browser has a new URL: http://lxr.free-electrons.com/source/arch/arm/mach-davinci/cpuidle.c?v=2.6.37#L93 and the line that I commented out was: `cpu_do_idle();` –  Apr 05 '16 at 16:51