The gist of the problem is: what can cause a user-land app to become corrupted while it is running, other than hardware failure?

Hardware rig: ARM9 (at91sam9xe), with NAND flash holding the Linux kernel, the FS and the userland app.

We had an app running on embedded Linux on an ARM9 (at91sam9xe). There were no problems for a couple of months, but then one ARM unit suddenly reported being unable to execute the app.

When the app was executed, it crashed with the following dump:

pgd = c16b8000
[00000020] *pgd=215a0031, *pte=00000000, *ppte=00000000

Pid: 349, comm:              console
CPU: 0    Not tainted  (2.6.30.4-uc0 #280)
PC is at 0x4e000
LR is at 0x673e0
pc : [<0004e000>]    lr : [<000673e0>]    psr: 60000010
sp : bec6a728  ip : bec6acb4  fp : bec6ac9c
r10: 000bd9f8  r9 : 00000000  r8 : 00000000
r7 : 00000000  r6 : bec6acb4  r5 : 00000000  r4 : fbad2084
r3 : ffffffff  r2 : bec6acb4  r1 : 00000025  r0 : 0009eab0
Flags: nZCv  IRQs on  FIQs on  Mode USER_32  ISA ARM  Segment user
Control: 0005317f  Table: 216b8000  DAC: 00000015
[<c02ec3b0>] (show_regs+0x0/0x50) from [<c02f11a8>] (__do_user_fault+0x9c/0xa8)
 r5:0000000b r4:c1696360
[<c02f110c>] (__do_user_fault+0x0/0xa8) from [<c02f1344>] (do_page_fault+0x114/0x244)
 r7:00010000 r6:c1696360 r5:c15a62e0 r4:c1c5fde0
[<c02f1230>] (do_page_fault+0x0/0x244) from [<c02ea284>] (do_DataAbort+0x3c/0xa0)
[<c02ea248>] (do_DataAbort+0x0/0xa0) from [<c02eae00>] (ret_from_exception+0x0/0x10)
Exception stack(0xc1683fb0 to 0xc1683ff8)
3fa0:                                     0009eab0 00000025 bec6acb4 ffffffff 
3fc0: fbad2084 00000000 bec6acb4 00000000 00000000 00000000 000bd9f8 bec6ac9c 
3fe0: bec6acb4 bec6a728 000673e0 0004e000 60000010 ffffffff     

I tried addr2line to see where it crashed, but it pointed to crtstuff.c. crtstuff.c is not part of our app; it's related to GCC, I think.

I feared my executable was corrupted, so I ran a diff between the file on NAND and the file on my PC. There were differences, which shouldn't happen. Plus, in almost all of the differences the NAND copy contained 0x00 instead of the value it should contain.

What I really want to know is: how can a userland app get corrupted, other than through hardware failure?
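A minimal sketch of that kind of comparison, assuming both copies are small enough to read into memory. The function names and the file paths in the usage note are hypothetical and not taken from the original app; the sketch only counts how many differing bytes read as 0x00 in the suspect copy.

```python
def diff_bytes(good: bytes, suspect: bytes):
    """Return (offset, expected, actual) for every byte that differs.

    Only the overlapping prefix is compared; a length mismatch is
    reported separately by summarize().
    """
    return [(i, g, s) for i, (g, s) in enumerate(zip(good, suspect)) if g != s]


def summarize(good: bytes, suspect: bytes):
    """Count differing bytes and how many of them are 0x00 in the suspect copy."""
    diffs = diff_bytes(good, suspect)
    zeroed = sum(1 for _, _, actual in diffs if actual == 0x00)
    return {
        "differing": len(diffs),
        "zeroed": zeroed,
        "length_delta": len(suspect) - len(good),
    }


def compare_files(good_path: str, suspect_path: str):
    """Read both files in binary mode and summarize their differences."""
    with open(good_path, "rb") as f:
        good = f.read()
    with open(suspect_path, "rb") as f:
        suspect = f.read()
    return summarize(good, suspect)
```

For example, `compare_files("app.from_pc", "app.from_nand")` (both paths are placeholders) would return a dictionary in which `zeroed` close to `differing` matches the pattern described above.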

Cause: The NAND flash was always writeable, so we hypothesized that the corruption happened when power went out while something was being written to flash.

Solution: We moved our FS to RAM and mount part of the NAND partition as writeable only when there is a need to write something. NAND write protect is controlled via a hardware pin, which enables writes only when there is a write request from the app.
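The "writeable only when needed" idea can be sketched roughly as follows. This is an illustration under assumptions, not the code actually used on the device: it assumes the `mount` command on the target accepts `-o remount,rw` and `-o remount,ro`, that the process has the privileges to remount, and it leaves out the hardware write-protect pin entirely. The mount point `/mnt/data` and the helper names are hypothetical.

```python
import contextlib
import os
import subprocess


def _run_mount(args):
    """Invoke the system mount command; raises if it exits non-zero."""
    subprocess.run(["mount", *args], check=True)


@contextlib.contextmanager
def writable(mount_point, run=_run_mount):
    """Remount mount_point read-write for the duration of the with-block.

    `run` is injectable so the logic can be exercised without root
    privileges or a real flash partition.
    """
    run(["-o", "remount,rw", mount_point])
    try:
        yield
    finally:
        # Flush pending writes before going back to read-only.
        os.sync()
        run(["-o", "remount,ro", mount_point])
```

Usage would look like `with writable("/mnt/data"): save_settings()`, where `save_settings` stands in for whatever the app needs to persist. Keeping the read-write window as short as possible narrows, but does not eliminate, the time in which a power cut can hit an in-progress write.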

Muhammad Ali
  • What filesystem do you have on this NAND flash? – sawdust Dec 21 '13 at 20:25
  • Are you using MTD or UBI? What kind of ECC? Have you actually been able to rule out HW as the possible cause? – sawdust Dec 24 '13 at 09:09
  • we are using MTD, and I haven't worked out what kind of ECC, i didnt change any settings during compilation so It would be safe to assume whichever is default... About hardware as the possible cause , No we haven't ruled that out... its still under investigation. But it doesn't appear to happen on test-table :( and why corrupt only my single executable... all other files are in mint condition. – Muhammad Ali Dec 28 '13 at 14:16
  • Are you sure you are compiling your executable for arm? Have you linked the right library version to your program? Sorry to comment on a super old question! you probably arn't using ARM9 anymore... – mjz19910 Oct 21 '17 at 12:07
  • it is a very old post but since its coming in resolved i have updated it with cause and solution – Muhammad Ali Oct 30 '17 at 06:29

0 Answers