Random HardFaults - STM32F4 - FreeRTOS

Question

I have a board with an STM32F4 running FreeRTOS (3 tasks), and I'm getting a HardFault every 15-50 minutes of use.

My hardware: 3 Encoders, 6 Analog In, 10 digital in and 3 PWM output for DC motors.

At first I thought it was a stack overflow, so I called uxTaskGetStackHighWaterMark() for each task and confirmed that this is not the cause.

I then implemented some HardFault handlers:

void HardFault_Handler(void)
{
    __asm volatile
    (
            " tst lr, #4                                                \n"
            " ite eq                                                    \n"
            " mrseq r0, msp                                             \n"
            " mrsne r0, psp                                             \n"
            " ldr r1, [r0, #24]                                         \n"
            " ldr r2, handler2_address_const                            \n"
            " bx r2                                                     \n"
            " handler2_address_const: .word prvGetRegistersFromStack    \n"
    );
}

void prvGetRegistersFromStack( uint32_t *pulFaultStackAddress )
{
    volatile uint32_t CFSRValue = SCB->CFSR;
    volatile uint32_t HFSRValue = SCB->HFSR;
    char stepError [100] = "";
    if ((HFSRValue & (1 << 30)) != 0) {
        CFSRValue >>= 16;
        if((CFSRValue & (1 << 9)) != 0) strcpy(stepError," Divide by zero");
        if((CFSRValue & (1 << 8)) != 0) strcpy(stepError," Unaligned access");
        if((CFSRValue & (1 << 3)) != 0) strcpy(stepError," No coprocessor UsageFault" );
        if((CFSRValue & (1 << 2)) != 0) strcpy(stepError," Invalid PC load UsageFault");
        if((CFSRValue & (1 << 1)) != 0) strcpy(stepError," Invalid state");
        if((CFSRValue & (1 << 0)) != 0) strcpy(stepError," Undefined instruction");
    }

    volatile uint32_t r0;
    volatile uint32_t r1;
    volatile uint32_t r2;
    volatile uint32_t r3;
    volatile uint32_t r12;
    volatile uint32_t lr; /* Link register. */
    volatile uint32_t pc; /* Program counter. */
    volatile uint32_t psr;/* Program status register. */

    r0 = pulFaultStackAddress[ 0 ];
    r1 = pulFaultStackAddress[ 1 ];
    r2 = pulFaultStackAddress[ 2 ];
    r3 = pulFaultStackAddress[ 3 ];

    r12 = pulFaultStackAddress[ 4 ];
    lr = pulFaultStackAddress[ 5 ]; 
    pc = pulFaultStackAddress[ 6 ];
    psr = pulFaultStackAddress[ 7 ];
    GPIO_WriteLed(0,1);
    for(int i=0;i<=10;i++)
    {
        PWM_Change_DutyCycle(i,0);
    }
    for(;;);
}

With this implementation I got the following results (each one was a HardFault, and sometimes the PC was 0), which appear very random to me:

1- if((CFSRValue & (1 << 1)) != 0) strcpy(stepError," Invalid state"); pc=0

2- if((CFSRValue & (1 << 0)) != 0) strcpy(stepError," Undefined instruction");
    0800807d: ...IncrementTick+252   ldr     r3, [r7, #8]   -  pc=134250621 - lr=2779096485


3-  if((CFSRValue & (1 << 8)) != 0) strcpy(stepError," Unaligned access");
    0800d63b: MX_ADC1_Init+290       ldr     r3, [pc, #240]  ; (0x800d72c <MX_ADC1_Init+532>)


4-  if((CFSRValue & (1 << 1)) != 0) strcpy(stepError," Invalid state");
        addr 0

5-080124c9: SysTick_Handler+8      bl      0x80072cc <osSystickHandler>


6-  if((CFSRValue & (1 << 0)) != 0) strcpy(stepError," Undefined instruction");
    08012521: SysTick_Handler+8      bl      0x80072cc <osSystickHandler>

Regards,

Do you think posting the exception handler here will be more beneficial than posting the code generating the exception? — Eugene Sh., Dec 13 '17 at 18:01
I posted the handler so you know how the values (CFSR, pc, etc.) are obtained, and maybe someone can point out a better way to get more details about the exception. If you read my question, each exception occurs in a different part of the code. osSystickHandler, IncrementTick, etc. are FreeRTOS defaults; I didn't touch them. — Matheus Stumpf, Dec 13 '17 at 18:32
If it is not stack overflow, then it could be a write through a stray pointer, or a buffer overflow, or something else. Impossible to say without seeing the code. — user58697, Dec 13 '17 at 21:45
The application code alone has more than 5000 lines. :/ I'm trying something here that may help: if I unplug my DC motors (PWM is still sent, but there are no encoder pulses), it takes much longer to crash (it has been 5 hours with no exception so far). Maybe something in context switching? — Matheus Stumpf, Dec 13 '17 at 21:59
I'm afraid only you can find the source of the HF. It is not possible to debug complex code remotely without access to your machine and hardware. — 0___________, Dec 14 '17 at 02:10
@PeterJ_01, can you point me to something I can do to find a connection between the HFs? Maybe some register that I'm not watching, or some other HF handler to get more information? Thanks! — Matheus Stumpf, Dec 14 '17 at 10:24
You need to find what has caused the HF and where. As I wrote, finding the problem can be a painful, difficult and boring process. I do it in my projects sometimes as well (probably like most uC coders :)). — 0___________, Dec 14 '17 at 10:28
As far as I see it, some of the 6 instances given in your report list are the same/equivalent: No. 1/4 and no. 5/6. — HelpingHand, May 15 '20 at 11:19

score 0 · Answer 1 · answered May 15 '20 at 11:40

The question doesn't state it explicitly, but as I understand it, this thread is not about where the hard faults are coming from but about whether the testing approach shown was OK up to this point, and what else can be done to locate the error.
The question is pretty old now, but this board is meant to help others who have the same problem, so let's read the Q&A with a general interest in such problems.

In order to trace back the problem, the following strategies can help:

If you can apply tracing hardware (because the HW target supports it and you have enough of the expensive equipment...), please use it: An off-chip ETM trace and a classical break point in your hard fault handler, and your search might be over 50 minutes later.

I guess the conditions aren't fulfilled in the present case. Still, there are professional projects where designing another PCB and buying a good debug/trace adapter is cheaper than having some developers search for weeks. Maybe the STM32 eval boards with full JTAG/TPIU access are a partial solution for you...
There are quite a few error models where the addresses the hard fault handler reports to you have nothing to do with the source of the error. Still, you might get some useful idea by checking (using the memory map) to which function or variable/buffer the addresses belong. Modify the environment of the error by putting unused "spacer" buffers (one or few words may be enough) between the modules and re-run the test. If you write some magic pattern into these unused regions, you can monitor them for corruption and use them as "canaries" to detect in which context the error happens.
If this doesn't help, deactivate different components of the software step by step, re-run, and check when the hard faults vanish. If you don't have one already, you may need an automated endurance-testing environment so that your effort (and searching time) doesn't explode.
As far as I know, all STM32F4 have a memory-protection unit. Can you activate it?
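
The "spacer" buffer idea above can be sketched roughly as follows. This is only a minimal illustration, not code from the question: the names guarded_adc_buffer_t, guard_init, guard_check, GUARD_MAGIC and ADC_CHANNELS are all hypothetical, and the wrapped buffer is an assumed stand-in for whatever module buffer you suspect. The guard words are placed inside the same struct as the buffer because struct members are laid out in declaration order, whereas the placement of separate global variables is up to the linker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_MAGIC  0xA5A5C3C3u /* arbitrary pattern, unlikely to appear by chance */
#define ADC_CHANNELS 6u          /* hypothetical: one slot per analog input */

/* Hypothetical module buffer wrapped in two guard words. Members are laid out
   in declaration order, so a write past the end of samples[] lands in
   guard_after and a write before its start lands in guard_before. */
typedef struct {
    volatile uint32_t guard_before;
    uint16_t          samples[ADC_CHANNELS];
    volatile uint32_t guard_after;
} guarded_adc_buffer_t;

typedef enum {
    GUARD_OK       = 0,
    GUARD_UNDERRUN = 1, /* guard_before was overwritten */
    GUARD_OVERRUN  = 2  /* guard_after was overwritten */
} guard_status_t;

/* Write the magic pattern into both guard words. Call once at start-up. */
void guard_init(guarded_adc_buffer_t *buf)
{
    assert(buf != NULL);
    buf->guard_before = GUARD_MAGIC;
    buf->guard_after  = GUARD_MAGIC;
}

/* Report whether either guard word no longer holds the magic pattern. */
guard_status_t guard_check(const guarded_adc_buffer_t *buf)
{
    assert(buf != NULL);
    if (buf->guard_before != GUARD_MAGIC) {
        return GUARD_UNDERRUN;
    }
    if (buf->guard_after != GUARD_MAGIC) {
        return GUARD_OVERRUN;
    }
    return GUARD_OK;
}
```

Calling guard_check() regularly, for example from a periodic task or at the entry and exit of suspect functions, and stopping in the debugger as soon as it returns something other than GUARD_OK, can narrow down the context in which the corruption happens well before the hard fault itself fires.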
