Another question seems to be about exactly this, but it is for Java.
Even when the shift is done (wrongly) on a per-byte basis (local), it is still slower than shifting across all bytes at once (glob) through a conversion to an integer:
In [394]: def local(ts):
     ...:     return [((ts[i] << 1) | ts[i] >> 31) ^ ks[i] for i in range(16)]
     ...:
In [395]: def glob(bs):
     ...:     n = int.from_bytes(bs, 'big')
     ...:     return (((n & (2**127-1)) << 1) | (n >> 127)).to_bytes(16, 'big')
     ...:
In [396]: %timeit local(a)
2.34 µs ± 5.38 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [397]: %timeit glob(a)
647 ns ± 0.493 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [398]: a
Out[398]: b'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
In [411]: %timeit [bin(x)[1:]+bin(x)[0:1] for x in a]
3.94 µs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
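For reference, here is a minimal sketch of what a correct rotation could look like in both styles, assuming the input is a 16-byte block interpreted as big-endian. The names rotl128 and rotl128_bytewise are hypothetical and only for illustration. They are not the functions timed above, so the timings do not apply to them.

```python
def rotl128(bs: bytes, k: int = 1) -> bytes:
    """Rotate a 16-byte big-endian block left by k bits via one big integer."""
    if len(bs) != 16:
        raise ValueError("expected exactly 16 bytes")
    k %= 128
    n = int.from_bytes(bs, "big")
    mask = (1 << 128) - 1
    # Bits shifted past position 127 are dropped by the mask and
    # re-enter at the bottom through the right shift.
    rotated = ((n << k) | (n >> (128 - k))) & mask
    return rotated.to_bytes(16, "big")


def rotl128_bytewise(bs: bytes) -> bytes:
    """Rotate left by one bit, byte by byte, carrying bits between bytes."""
    if len(bs) != 16:
        raise ValueError("expected exactly 16 bytes")
    out = bytearray(16)
    for i in range(16):
        # The top bit of the next (less significant) byte becomes this
        # byte's low bit; the last byte wraps around to byte 0.
        carry = bs[(i + 1) % 16] >> 7
        out[i] = ((bs[i] << 1) & 0xFF) | carry
    return bytes(out)
```

Both functions should agree for a one-bit rotation. The byte-wise version is mainly useful for seeing where the carry goes, since it needs a Python-level loop over all 16 bytes.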