PIC32 speed : Optimizing c code

Question

I'm looking for suggestions to optimize my code. It is simple, but it needs to be fast, and by fast I mean less than 250 ns.
My first version was slow, about 1000 ns. After some work it is down to about 550 ns, but I believe it can be made faster and I don't know how.
I am using a PIC32 with an 80 MHz system clock.
My code:

void main()
{
    unsigned long int arr_1[4095]; 
    unsigned long int arr_2[4095]; 

    //here I assign arr_1 and arr_2 values
    //...
    //...

    TRISC = 0;
    TRISD = 0;

    while(1){
         LATC = arr_1[PORTE];
         LATD = arr_2[PORTE];
    }

}

As you can see, the job is very simple; the only problem is the speed.
I looked at the assembly listing to see how many instructions there are, but I don't know assembly language well enough to optimize it.

;main.c, 14 ::      LATC = arr_1[PORTE];
0x9D000064  0x27A30000  ADDIU   R3, SP, 0
0x9D000068  0x3C1EBF88  LUI R30, 49032
0x9D00006C  0x8FC26110  LW  R2, 24848(R30)
0x9D000070  0x00021080  SLL R2, R2, 2
0x9D000074  0x00621021  ADDU    R2, R3, R2
0x9D000078  0x8C420000  LW  R2, 0(R2)
0x9D00007C  0x3C1EBF88  LUI R30, 49032
0x9D000080  0xAFC260A0  SW  R2, 24736(R30)
;main.c, 15 ::      LATD = arr_2[PORTE];
0x9D000084  0x27A33FFC  ADDIU   R3, SP, 16380
0x9D000088  0x3C1EBF88  LUI R30, 49032
0x9D00008C  0x8FC26110  LW  R2, 24848(R30)
0x9D000090  0x00021080  SLL R2, R2, 2
0x9D000094  0x00621021  ADDU    R2, R3, R2
0x9D000098  0x8C420000  LW  R2, 0(R2)
0x9D00009C  0x3C1EBF88  LUI R30, 49032
;main.c, 16 ::      }
0x9D0000A0  0x0B400019  J   L_main0
0x9D0000A4  0xAFC260E0  SW  R2, 24800(R30)

Any suggestions to optimize my code?

Edit:
* PORTE, LATC and LATD are I/O-mapped registers.
* The goal of the code is to change the LATC and LATD registers as fast as possible when PORTE changes (so PORTE is an input and LATC and LATD are outputs); the output depends on the value of PORTE.

the code never exits, so actual running time is 'forever'. Exactly what were you measuring? Most likely, the majority of the running time (except for the 'forever' loop) is in setting the array values and you did not post that important detail. — user3629249, Nov 01 '15 at 18:43
How are you compiling/linking the code? Are you using any optimization parameters? (for gcc, suggest `-O3`) — user3629249, Nov 01 '15 at 18:44
You probably need to be clear about what PORTE, LATC and LATD are - especially if they are `volatile` or I/O mapped. — Clifford, Nov 01 '15 at 19:07
@user3629249 my measure is the time it takes to changes LATC and LATD when PORTE is changed . — Hisoka Hunter, Nov 01 '15 at 19:11
@HisokaHunter : I am sure they are, but you should make it clear by editing the question rather than assuming everyone will read the comments - they won't. — Clifford, Nov 01 '15 at 19:20

Clifford · Answer 1 · 2015-11-02T13:39:24.190

A potential limiting factor is that since PORTE, LATC and LATD are not regular memory but rather I/O registers, it is possible that the I/O bus speed is lower than the memory bus speed and that the processor inserts wait-states between accesses. That may or may not be the case for PIC32, but it is a general point that you need to consider for any architecture.
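To put the target in perspective, here is a rough cycle budget using only figures from the question: the 80 MHz clock, the 250 ns target and the 550 ns measurement. The helper name `cycles_in_ns` is made up for this sketch. The listed loop body is 17 instructions (counting the jump and the store in its delay slot), so even at one cycle per instruction there is very little headroom, and any wait state on an instruction fetch or a peripheral access eats into it.

```c
#include <stdint.h>

/* Number of whole CPU cycles that fit into a time window.
 * Hypothetical helper for illustration; not part of any PIC32 library. */
static uint32_t cycles_in_ns(uint32_t clock_hz, uint32_t window_ns)
{
    /* cycles = frequency * time; the 64-bit intermediate avoids overflow. */
    return (uint32_t)(((uint64_t)clock_hz * window_ns) / 1000000000u);
}
```

With these inputs, `cycles_in_ns(80000000u, 250u)` is 20 cycles and `cycles_in_ns(80000000u, 550u)` is 44 cycles. How the measured 550 ns maps onto loop passes is not stated in the question, so treat the comparison with the 17 instructions as indicative only.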

If the I/O bus is not the limitation, then first of all: have you enabled compiler optimisations? For micro-optimisations like this, that is usually your best bet. This code looks trivial to optimise, but the assembly listing does not appear to reflect that; for example, it reloads the same upper address into R30 before every register access (although I am no MIPS assembly expert; the compiler's optimiser, however, is).

Since I/O registers are volatile, the optimiser may be prevented from optimising the loop body significantly. But because they are volatile, the code is probably also unsafe: it is possible (and indeed likely) for PORTE to change value between the assignments to LATC and LATD, which may not be what you intend. If that is the case, the code should be changed as follows:

int porte_value_latch = 0 ;
for(;;)
{
     // Get a non-volatile copy of PORTE.
     porte_value_latch = PORTE ;  

     // Write LATC/D with a consistent PORTE value that 
     // won't change between assignments, and does not need 
     // to be read from memory or I/O.
     LATC = arr_1[porte_value_latch] ;
     LATD = arr_2[porte_value_latch] ;
}

which is then both safe and potentially faster since the volatile PORTE is only read once, and the porte_value_latch value can be retained in a temporary register for both array accesses rather than read from memory each time. The optimiser will almost certainly optimise it to a register access even if regular compilation does not.
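The consistency point can be checked on a desktop machine with a small simulation. This is only a sketch: `sim_read_porte`, `sim_input`, `sim_latc` and `sim_latd` are hypothetical stand-ins for the real registers, and the input is made to change on every read, which is the worst case for the two-read version.

```c
#include <stdint.h>

/* Small stand-in lookup tables (the real ones are far larger). */
static const uint32_t arr_1[4] = {0x10, 0x11, 0x12, 0x13};
static const uint32_t arr_2[4] = {0x20, 0x21, 0x22, 0x23};

/* Simulated output latches and input port (assumptions, not real SFRs). */
static uint32_t sim_latc, sim_latd;
static uint32_t sim_input;

/* Each read returns the current input, then the input changes,
 * mimicking an asynchronous signal that moves between two reads. */
static uint32_t sim_read_porte(void)
{
    uint32_t value = sim_input;
    sim_input = (sim_input + 1u) & 3u;
    return value;
}

/* Original style: the port is read once per output. */
static void update_two_reads(void)
{
    sim_latc = arr_1[sim_read_porte()];
    sim_latd = arr_2[sim_read_porte()];
}

/* Latched style: one read, both outputs use the same index. */
static void update_latched(void)
{
    uint32_t index = sim_read_porte();
    sim_latc = arr_1[index];
    sim_latd = arr_2[index];
}
```

Starting from `sim_input = 0`, `update_two_reads()` leaves the outputs built from indices 0 and 1, while `update_latched()` builds both from index 0. Whether that mismatch matters depends on the application.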

The use of for(;;) rather than while(1) probably makes little difference, but some compilers issue a warning for invariant while expressions, yet will accept the for(;;) idiom quietly. You have not included the assembly for line 13, so it is not possible to determine what your compiler generated for it.

A further possibility for optimisation may be available if LATC and LATD are located at adjacent addresses, in which case you might use a single array of type unsigned long long int in order to write both locations in a single assignment. Of course the 64-bit access is still non-atomic, but the compiler may generate more efficient code in any case. It also neatly avoids the need for the porte_value_latch variable, as there would then be only one reference to PORTE. However, if LATC and LATD must be written in a specific order, you lose that level of control. The loop would look like:

for(;;)
{
    LATCD = arr_1_2[PORTE] ;
}

Where the address of LATCD is the low-order address of the adjacent LATC and LATD registers, and has type unsigned long long int . If LATC has the lower address then:

#define LATCD (*(volatile unsigned long long int *)&LATC)

so that writing to LATCD writes to both LATC and LATD. You then have to combine arr_1 and arr_2 into a single array of unsigned long long with the appropriate word order, so that each element contains both the C and D values.
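A minimal sketch of building the combined table, assuming a little-endian target (which the PIC32 is) with LATC at the lower address, so the LATC value goes in the low 32 bits. The names `pack_cd` and `build_combined` are invented for this example. Note also that in the listing in the question the two stores use offsets 24736 and 24800 from the same base, 64 bytes apart, which suggests the two registers are not adjacent on that part; check the data sheet before relying on this method.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one LATC value (low word) and one LATD value (high word).
 * The word order is an assumption: little-endian, LATC at the lower address. */
static uint64_t pack_cd(uint32_t c_value, uint32_t d_value)
{
    return ((uint64_t)d_value << 32) | c_value;
}

/* Build the combined lookup table once, outside the time-critical loop. */
static void build_combined(const uint32_t *arr_1, const uint32_t *arr_2,
                           uint64_t *arr_1_2, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        arr_1_2[i] = pack_cd(arr_1[i], arr_2[i]);
    }
}
```

The table costs twice the memory per entry of either original array, but the loop then needs a single index calculation and a single table read per pass.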

Another suggestion: Configure the hardware to read PORTE to a single location using DMA triggered from a clock signal at >=4MHz. The loop would then not need to read PORTE at all but rather read the DMA memory location which may or may not be faster. You could also set up the DMA to write LATC/LATD from a memory location so that the loop performs no I/O at all. That method would also allow the "adjacent memory" method to work even if LATC and LATD are not actually adjacent.

Ultimately if the issue is only down to the compiler's code generation, then implementing the loop in in-line assembler and hand optimising it may make sense.

Firstly, thanks for your answer. Secondly, in your code, using a variable as a latch could make the timing slower for LATD if PORTE changes just after the latch (I verified that). However, I like the idea of adjacent addresses for LATC and LATD (I should investigate). — Hisoka Hunter, Nov 01 '15 at 19:37
@HisokaHunter : My intention was to make the code safer - whether it is faster or not would need testing. It assumes that LATC and LATD must be written with a consistent index value. If that is not the case then the only benefit is the single PORTE read. You should test it, as it depends what the limiting factor is. — Clifford, Nov 01 '15 at 19:43
I think it's safe for two reasons: the order of switching between LATD and LATC is not important, and reading PORTE is an atomic operation, so PORTE can change at any moment without a problem. — Hisoka Hunter, Nov 01 '15 at 19:47
@HisokaHunter : You know your application, the point about data consistency remains an important consideration in the general case perhaps. — Clifford, Nov 02 '15 at 13:20
@HisokaHunter : I don't think the latency between PORTE changing and LATD being set is particularly valid. If it changes immediately after the first read, setting LATD is still delayed by the length of time it takes to set LATC and read PORTE a second time - the worst-case latency is likely to be longer, even if the best-case is shorter - it makes it less deterministic. Remember the change in PORTE is entirely asynchronous to the reading of PORTE in any case. The worst case occurs when PORTE changes just after it was read, and the best case when it is read just before. — Clifford, Nov 02 '15 at 13:21
I also thought about using DMA; in fact I already tried it, but it didn't work because I am not familiar with it. I searched for examples, but it just didn't work for me, so I looked for alternative solutions. However, it seems to be the best solution on paper, so I will try again to implement it. Thanks — Hisoka Hunter, Nov 03 '15 at 08:01
May also help: http://blog.flyingpic24.com/2009/03/18/testing-the-pic32-io-speed/ — Clifford, Nov 03 '15 at 09:53
I already did some of what the topic said for optimizing PIC32 , but the maximum I get of toggling is 13 MHz so I will try the rest to arrive to 40 MHz . thanks — Hisoka Hunter, Nov 03 '15 at 10:36
