7

I need to implement a left bit shift of a 16-byte array in JavaCard, as fast as possible.

I tried this code:

private static final void rotateLeft(final byte[] output, final byte[] input) {
    short carry = 0;
    short i = (short) 16;
    do {
        --i;
        carry = (short) ((input[i] << 1) | carry);
        output[i] = (byte) carry;
        carry = (short) ((carry >> 8) & 1);
    } while (i > 0);
}

Any ideas how to improve the performance? I was thinking about some Util.getShort(...) and Util.setShort(...) magic, but I did not manage to make it work faster than the implementation above.

This is one part of the CMAC subkey computation and it is done quite often, unfortunately. In case you know some faster way to compute the CMAC subkeys (both subkeys in one loop or something like that), please let me know.
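
For context, the rest of the subkey derivation is the standard one from RFC 4493 / NIST SP 800-38B: shift L left by one bit and XOR the constant 0x87 into the last byte whenever the top bit of the source was set, then repeat once more for the second subkey. A rough sketch using the routine above (the buffer names l, k1 and k2 are placeholders):

private static void deriveSubkeys(final byte[] l, final byte[] k1, final byte[] k2) {
    rotateLeft(k1, l);                      // K1 = L << 1
    if (l[0] < 0) {                         // top bit of L was set...
        k1[15] = (byte) (k1[15] ^ 0x87);    // ...so XOR the constant Rb
    }
    rotateLeft(k2, k1);                     // K2 = K1 << 1
    if (k1[0] < 0) {
        k2[15] = (byte) (k2[15] ^ 0x87);
    }
}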

vojta
  • 5,591
  • 2
  • 24
  • 64
  • I gather JavaCard is interpreted? If so then I would recommend that you take a look at the generated byte-code and optimize with the available instruction set in mind. For instance I suspect ints may be preferable to shorts, and that loop unrolling may gain you a few cycles. Beyond that I suspect that you're going to be doing more than a single extended-precision arithmetic operation so it's probably wise to switch to a wider integer early on for faster processing and convert back to an 8-bit array at the end. – doynax May 20 '15 at 13:19
  • @doynax there is no `int` or `long` in JavaCard... `byte` and `short` is all you have. – vojta May 20 '15 at 13:21
  • Sorry about that, sounds like a particularly crippling environment to work with. My point still stands though, keep an eye on the generated byte-code to ensure that the compiler doesn't decide to generate unnecessary `i2s` instructions on the intermediate short-of-int-but-not-really results. – doynax May 20 '15 at 13:36
  • @doynax Yes, JavaCard is a nightmare. Thanks, I will study my bytecode. – vojta May 20 '15 at 13:43
  • I took a quick peek at the specification and while you may not have a full `BigInteger` library there does seem to be a slimmed-down `BigNumber` version. Perhaps a multiplication by two might be faster if it is hand optimized? – doynax May 20 '15 at 13:49
  • @doynax Unfortunately, not all real-world cards support `BigNumber`. My cards don't... – vojta May 21 '15 at 06:58
  • 2
    Well, losing the loop would surely make it faster. You know in advance you have 16 bytes. :-) – Shuckey May 21 '15 at 08:48
  • @Shuckey Yes, how could I miss something that obvious! Thanks! – vojta May 21 '15 at 08:49
  • 1
    I've created multiple implementations to perform *any kind of rotate* relatively quickly [here](https://stackoverflow.com/a/52564557/589259). It is also able to rotate an array *in place* (and without a temporary array), which is rather trickier. The 64 bit implementation at the bottom can be very easily extended to 128 bit, of course: just add a bit set to 1 at the left of the mask used to get the `byteRot` variable. – Maarten Bodewes Sep 29 '18 at 18:46
  • BTW, that looks like a pretty fast implementation; I don't think that you can get much faster. Yes, you can unroll the loop, but the accepted answer performs multiple array accesses for the same location and the other uses `getShort` and `setShort`, which are method calls - and those are much slower than anything your original code seems to be doing. It **is** possible to replace the do/while loop with a bounded for loop, which may prevent unnecessary branching. You could also use the input array to receive the output, an easy change (a sketch of that variant follows these comments). – Maarten Bodewes Sep 29 '18 at 19:25
  • You can lose that final `& 1`, because `carry` can't be greater than `0x1FF`. – TonyK Oct 20 '18 at 01:24
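
Picking up the suggestion from the comment above, a minimal sketch (untested) of a bounded loop that shifts a 16-byte buffer in place:

private static void shiftLeftInPlace(final byte[] buf) {
    short carry = 0;
    // bounded loop over all 16 bytes, least significant byte first
    for (short i = 15; i >= 0; i--) {
        final short cur = (short) ((buf[i] << 1) | carry);
        buf[i] = (byte) cur;
        carry = (short) ((cur >> 8) & 1);
    }
}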

3 Answers

4

When it comes to speed, a hard-coded version for the known length is the fastest (but ugly). If you need to shift by more than one bit, be sure to update the code accordingly.

output[0] = (byte)((byte)(input[0] << 1) | (byte)((input[1] >> 7) & 1));
output[1] = (byte)((byte)(input[1] << 1) | (byte)((input[2] >> 7) & 1));
output[2] = (byte)((byte)(input[2] << 1) | (byte)((input[3] >> 7) & 1));
output[3] = (byte)((byte)(input[3] << 1) | (byte)((input[4] >> 7) & 1));
output[4] = (byte)((byte)(input[4] << 1) | (byte)((input[5] >> 7) & 1));
output[5] = (byte)((byte)(input[5] << 1) | (byte)((input[6] >> 7) & 1));
output[6] = (byte)((byte)(input[6] << 1) | (byte)((input[7] >> 7) & 1));
output[7] = (byte)((byte)(input[7] << 1) | (byte)((input[8] >> 7) & 1));
output[8] = (byte)((byte)(input[8] << 1) | (byte)((input[9] >> 7) & 1));
output[9] = (byte)((byte)(input[9] << 1) | (byte)((input[10] >> 7) & 1));
output[10] = (byte)((byte)(input[10] << 1) | (byte)((input[11] >> 7) & 1));
output[11] = (byte)((byte)(input[11] << 1) | (byte)((input[12] >> 7) & 1));
output[12] = (byte)((byte)(input[12] << 1) | (byte)((input[13] >> 7) & 1));
output[13] = (byte)((byte)(input[13] << 1) | (byte)((input[14] >> 7) & 1));
output[14] = (byte)((byte)(input[14] << 1) | (byte)((input[15] >> 7) & 1));
output[15] = (byte)(input[15] << 1);

And use a RAM (transient) byte array!
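
For example, allocating such buffers once (e.g. in the applet constructor) with javacard.framework.JCSystem; the field names input and output are just placeholders:

// RAM (transient) buffers; CLEAR_ON_RESET would also work here.
input = JCSystem.makeTransientByteArray((short) 16, JCSystem.CLEAR_ON_DESELECT);
output = JCSystem.makeTransientByteArray((short) 16, JCSystem.CLEAR_ON_DESELECT);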

David
  • 3,957
  • 2
  • 28
  • 52
  • Thank you! However, I think you use too much casting... You could cast the final result only and keep int as long as possible... – vojta May 21 '15 at 10:53
  • 1
    Just to note: surprisingly, loop unrolling is not the fastest in general, even when it comes to low-level assembler instructions. – Anton Samsonov May 22 '15 at 18:13
  • @AntonSamsonov How is that possible? – vojta May 22 '15 at 18:17
  • 3
    @vojta That's because of sophisticated out-of-order execution and branch prediction, both of which benefit from jumping backwards, especially if the loop entry is aligned on a cache line boundary and the body is not large, as the CPU doesn't need to analyze your code again. This may get even more complicated with a high-level VM, such as Java or .NET, where some recognizable patterns may translate to optimal native instructions (SSE, etc.), while manually unrolled code is likely to be preserved as-is. Therefore, no judgement should be made without proper benchmarking on the exact target platform. – Anton Samsonov May 23 '15 at 09:46
  • 1
    @AntonSamsonov And how much of that sophisticated out-of-order execution and branch prediction do you expect on Java Card? I'd completely ignore your remark unless it can be proven that a normal loop is faster than the unrolled loop. By which I'm not saying that David's implementation is necessarily very fast, by the way - in general you want to avoid spurious array accesses, and that's certainly not done here - every byte location is visited *twice*!!! – Maarten Bodewes Sep 29 '18 at 18:51
3

This is the fastest algorithm to rotate by an arbitrary number of bits I could come up with (I rotate an array of 8 bytes; you can easily adapt it to 16 bytes instead):

Use EEPROM to store a masking table for your shifts. The mask is just an increasing run of 1s from the right:

final static byte[] ROTL_MASK = {
    (byte) 0x00, //shift 0: 00000000 //this one is never used, we don't do shift 0.
    (byte) 0x01, //shift 1: 00000001
    (byte) 0x03, //shift 2: 00000011
    (byte) 0x07, //shift 3: 00000111
    (byte) 0x0F, //shift 4: 00001111
    (byte) 0x1F, //shift 5: 00011111
    (byte) 0x3F, //shift 6: 00111111
    (byte) 0x7F  //shift 7: 01111111
};

Then first use Util.arrayCopyNonAtomic to quickly rotate by whole bytes when the shift is 8 or more:

final static byte BITS = 8;
//swap whole bytes:
Util.arrayCopyNonAtomic(in, (short) (shift/BITS), out, (short) 0, (short) (8-(shift/BITS)));
Util.arrayCopyNonAtomic(in, (short) 0, out, (short) (8-(shift/BITS)), (short) (shift/BITS));
shift %= BITS; //now we need to shift only up to 8 remaining bits

if (shift > 0) {
    //apply masks
    byte mask = ROTL_MASK[shift];
    byte comp = (byte) (8 - shift);

    //rotate using masks
    out[8] = in[0]; // out[8] is any auxiliary variable, careful with bounds!
    out[0] = (byte)((byte)(in[0] << shift) | (byte)((in[1] >> comp) & mask));
    out[1] = (byte)((byte)(in[1] << shift) | (byte)((in[2] >> comp) & mask));
    out[2] = (byte)((byte)(in[2] << shift) | (byte)((in[3] >> comp) & mask));
    out[3] = (byte)((byte)(in[3] << shift) | (byte)((in[4] >> comp) & mask));
    out[4] = (byte)((byte)(in[4] << shift) | (byte)((in[5] >> comp) & mask));
    out[5] = (byte)((byte)(in[5] << shift) | (byte)((in[6] >> comp) & mask));
    out[6] = (byte)((byte)(in[6] << shift) | (byte)((in[7] >> comp) & mask));
    out[7] = (byte)((byte)(in[7] << shift) | (byte)((in[8] >> comp) & mask));
}

You can additionally remove the mask variable and use a direct reference to the table instead.
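
For example, the first line of the unrolled body then becomes:

out[0] = (byte)((byte)(in[0] << shift) | (byte)((in[1] >> comp) & ROTL_MASK[shift]));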

Using this rather than a naive implementation of bit-wise rotation proved to be about 450-500% faster.

MiragePV
  • 65
  • 6
2

It might help to cache CMAC subkeys when signing repeatedly using the same key (i.e. the same DESFire EV1 session key). The subkeys are always the same for the given key.

I think David's answer could be even faster if it used two local variables to cache the values read twice from the same offset of the input array (from my observations on JCOP, array access is quite expensive even for transient arrays).
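
For illustration, a sketch of that idea applied to the 1-bit left shift from the question (each input byte is loaded exactly once, alternating between two locals; untested):

// Each input byte is read once and carried over in a local (a or b).
byte a = input[0];
byte b;
b = input[1];  output[0]  = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
a = input[2];  output[1]  = (byte)((byte)(b << 1) | (byte)((a >> 7) & 1));
b = input[3];  output[2]  = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
a = input[4];  output[3]  = (byte)((byte)(b << 1) | (byte)((a >> 7) & 1));
b = input[5];  output[4]  = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
a = input[6];  output[5]  = (byte)((byte)(b << 1) | (byte)((a >> 7) & 1));
b = input[7];  output[6]  = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
a = input[8];  output[7]  = (byte)((byte)(b << 1) | (byte)((a >> 7) & 1));
b = input[9];  output[8]  = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
a = input[10]; output[9]  = (byte)((byte)(b << 1) | (byte)((a >> 7) & 1));
b = input[11]; output[10] = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
a = input[12]; output[11] = (byte)((byte)(b << 1) | (byte)((a >> 7) & 1));
b = input[13]; output[12] = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
a = input[14]; output[13] = (byte)((byte)(b << 1) | (byte)((a >> 7) & 1));
b = input[15]; output[14] = (byte)((byte)(a << 1) | (byte)((b >> 7) & 1));
output[15] = (byte)(b << 1);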

EDIT: I can provide the following implementation, which does a 4-bit right shift using shorts (a 32-bit int variant, for cards supporting it, would be even faster):

short pom = 0;   // low nibble of the previous short, already shifted to the top (X000)
short pom2;      // the short just loaded from the source
short pom3;      // the loaded short shifted right by 4 bits (0XXX)
short curOffset = PARAMS_TRACK2_OFFSET;
while (curOffset < 16) {
    pom2 = Util.getShort(mem_PARAMS, curOffset);
    pom3 = (short) (pom2 >>> 4);
    // Util.setShort returns the offset just past the written short (curOffset + 2)
    curOffset = Util.setShort(mem_RAM, curOffset, (short) (pom | pom3));
    pom = (short) (pom2 << 12);
}

Beware: this code assumes the same offsets in the source and the destination.

You can unroll this loop and use constant parameters if desired.

vlp
  • 7,811
  • 2
  • 23
  • 51