This code sample takes an arbitrarily large number of bytes as inputs, one at the time, and maps them to a table with 32 values [0,31]. To simplify the example I used mod 32. The >> 3 operation is equivalent to division by 8. In case it is not clear, the "q" loop counts to 5 because numbers from 0 to 31 only require 5-bits of resolution.
Can this be improved in terms of speed?
uint8_t sto[10000];
uint8_t srt[32] = {<32 values in total>};
uint8_t t,q
uint64_t loop,cr = 0;
for(loop = 0; loop < 10000; loop++) {
t = srt[loop % 32];
for(q = 0; q < 5; q++) {
if(t & (0x1 << (4 - q)))
sto[cr >> 3] |= (0x1 << (7 - (cr % 8)));
cr++;
}
}