
Here's my problem: I am writing an AES-NI implementation for a library, and I am stuck on the decryption of a 256-bit block. Here's what I know. The 128-bit block works perfectly. The encryption of a 256-bit block matches a proven Rijndael implementation. The expanded key also matches that implementation (allowing for the little-endian byte order). The routine uses a blend mask and a shuffle mask to compensate for the offset column shuffle of a 256-bit block; the decryption shuffle mask is the inverse of the one used to encrypt the block. This has also been tested and appears to be working fine. Here is the encrypt function:

void Encrypt32(const std::vector<byte> &Input, const size_t InOffset, std::vector<byte> &Output, const size_t OutOffset)
{
    const size_t LRD = m_expKey.size() - 3;
    size_t keyCtr = 0;
    __m128i RIJNDAEL256_MASK = { 0,1,6,7,4,5,10,11,8,9,14,15,12,13,2,3 };
    __m128i BLEND_MASK = _mm_set_epi32(0x80000000, 0x80800000, 0x80800000, 0x80808000);
    __m128i block1 = _mm_loadu_si128((const __m128i*)(const void*)&Input[InOffset]);
    __m128i block2 = _mm_loadu_si128((const __m128i*)(const void*)&Input[InOffset + 16]);
    __m128i temp1, temp2;

    block1 = _mm_xor_si128(block1, m_expKey[keyCtr]);
    block2 = _mm_xor_si128(block2, m_expKey[++keyCtr]);

    while (keyCtr != LRD)
    {
        temp1 = _mm_blendv_epi8(block1, block2, BLEND_MASK);    // combine 2 blocks
        temp2 = _mm_blendv_epi8(block2, block1, BLEND_MASK);
        temp1 = _mm_shuffle_epi8(temp1, RIJNDAEL256_MASK);      // shuffle
        temp2 = _mm_shuffle_epi8(temp2, RIJNDAEL256_MASK);
        block1 = _mm_aesenc_si128(temp1, m_expKey[++keyCtr]);   // encrypt
        block2 = _mm_aesenc_si128(temp2, m_expKey[++keyCtr]);
    }

    temp1 = _mm_blendv_epi8(block1, block2, BLEND_MASK);
    temp2 = _mm_blendv_epi8(block2, block1, BLEND_MASK);
    temp1 = _mm_shuffle_epi8(temp1, RIJNDAEL256_MASK);
    temp2 = _mm_shuffle_epi8(temp2, RIJNDAEL256_MASK);
    block1 = _mm_aesenclast_si128(temp1, m_expKey[++keyCtr]);
    block2 = _mm_aesenclast_si128(temp2, m_expKey[++keyCtr]);

    _mm_storeu_si128((__m128i*)(void*)&Output[OutOffset], block1);
    _mm_storeu_si128((__m128i*)(void*)&Output[OutOffset + 16], block2);
}

This is the inverse transform:

void Decrypt32(const std::vector<byte> &Input, const size_t InOffset, std::vector<byte> &Output, const size_t OutOffset)
{
    const size_t LRD = m_expKey.size() - 3;
    __m128i RIJNDAELINV_MASK = { 0,1,14,15,4,5,2,3,8,9,6,7,12,13,10,11 };
    __m128i BLEND_MASK = _mm_set_epi32(0x80000000, 0x80800000, 0x80800000, 0x80808000);
    __m128i block1 = _mm_loadu_si128((const __m128i*)(const void*)&Input[InOffset]);
    __m128i block2 = _mm_loadu_si128((const __m128i*)(const void*)&Input[InOffset + 16]);
    __m128i temp1, temp2;
    size_t keyCtr = 0;

    block1 = _mm_xor_si128(block1, m_expKey[keyCtr]);
    block2 = _mm_xor_si128(block2, m_expKey[++keyCtr]);

    while (keyCtr != LRD)
    {
        temp1 = _mm_aesdec_si128(block1, m_expKey[++keyCtr]);   // decrypt
        temp2 = _mm_aesdec_si128(block2, m_expKey[++keyCtr]);
        temp1 = _mm_shuffle_epi8(temp1, RIJNDAELINV_MASK);      // shuffle
        temp2 = _mm_shuffle_epi8(temp2, RIJNDAELINV_MASK);
        block1 = _mm_blendv_epi8(temp1, temp2, BLEND_MASK);     // combine
        block2 = _mm_blendv_epi8(temp2, temp1, BLEND_MASK);
    }

    temp1 = _mm_aesdeclast_si128(block1, m_expKey[++keyCtr]);
    temp2 = _mm_aesdeclast_si128(block2, m_expKey[++keyCtr]);
    temp1 = _mm_shuffle_epi8(temp1, RIJNDAELINV_MASK);
    temp2 = _mm_shuffle_epi8(temp2, RIJNDAELINV_MASK);
    block1 = _mm_blendv_epi8(temp1, temp2, BLEND_MASK);
    block2 = _mm_blendv_epi8(temp2, temp1, BLEND_MASK);

    _mm_storeu_si128((__m128i*)(void*)&Output[OutOffset], block1);
    _mm_storeu_si128((__m128i*)(void*)&Output[OutOffset + 16], block2);
}
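
For reference, the claim that the decryption masks invert the encryption masks can be checked without any intrinsics. The following is only a minimal scalar sketch: `blendv`, `shuffle`, and `masks_round_trip` are hypothetical stand-ins written for this check, modelling `_mm_blendv_epi8` and `_mm_shuffle_epi8` byte by byte, and the blend mask is written out in memory order on the assumption of a little-endian `_mm_set_epi32` layout.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;

// Mask values copied from Encrypt32/Decrypt32, byte 0 first.
constexpr Block kFwdShuffle = {0, 1, 6, 7, 4, 5, 10, 11, 8, 9, 14, 15, 12, 13, 2, 3};
constexpr Block kInvShuffle = {0, 1, 14, 15, 4, 5, 2, 3, 8, 9, 6, 7, 12, 13, 10, 11};
// _mm_set_epi32(0x80000000, 0x80800000, 0x80800000, 0x80808000) in memory order
// (assumption: little-endian layout, lowest dword first).
constexpr Block kBlend = {0x00, 0x80, 0x80, 0x80, 0x00, 0x00, 0x80, 0x80,
                          0x00, 0x00, 0x80, 0x80, 0x00, 0x00, 0x00, 0x80};

// Scalar model of _mm_blendv_epi8(a, b, mask): take b where the mask byte's high bit is set.
Block blendv(const Block& a, const Block& b, const Block& mask) {
    Block out{};
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = (mask[i] & 0x80) ? b[i] : a[i];
    }
    return out;
}

// Scalar model of _mm_shuffle_epi8(a, idx): out[i] = a[idx[i] & 15], or 0 if the high bit is set.
Block shuffle(const Block& a, const Block& idx) {
    Block out{};
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = (idx[i] & 0x80) ? 0 : a[idx[i] & 0x0F];
    }
    return out;
}

// Applies the encrypt-side permutation, then the decrypt-side one, to two
// blocks of distinct bytes and reports whether the originals come back.
bool masks_round_trip() {
    Block b1{}, b2{};
    for (std::size_t i = 0; i < 16; ++i) {
        b1[i] = static_cast<std::uint8_t>(i);
        b2[i] = static_cast<std::uint8_t>(16 + i);
    }
    // Same order as Encrypt32: blend, then shuffle.
    const Block t1 = shuffle(blendv(b1, b2, kBlend), kFwdShuffle);
    const Block t2 = shuffle(blendv(b2, b1, kBlend), kFwdShuffle);
    // Same order as Decrypt32: shuffle, then blend.
    const Block u1 = shuffle(t1, kInvShuffle);
    const Block u2 = shuffle(t2, kInvShuffle);
    return blendv(u1, u2, kBlend) == b1 && blendv(u2, u1, kBlend) == b2;
}
```

A `true` result from `masks_round_trip()` only shows that the byte permutation itself is inverted. It says nothing about where that permutation sits relative to the steps inside `_mm_aesdec_si128` (InvShiftRows, InvSubBytes, InvMixColumns, then the round-key XOR), nor about whether `m_expKey` holds a key schedule prepared for decryption, so those would still need checking separately.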

I've been debugging this for hours and just can't spot the problem. Can anyone see why this wouldn't work? I can post the code to git if that would help.

JGU
  • AES has only one block size: 128 bits. Rijndael supports other block sizes such as 256 bits, but note that much less review has been done on the larger block sizes and they may even be less secure. – zaph May 25 '16 at 18:39
  • @zaph - The reason for the reduction in security is that a wider block generates twice as many round keys from the same fixed input key, but that is not the issue with this code. – JGU May 25 '16 at 18:44
  • I've voted this up as it is an OK question. But block ciphers combined with a secure *mode of operation* should be enough to encrypt 256 bit blocks. I don't completely see the point in using Rijndael-256 instead (then again, there are reasons for using 256 bit block ciphers, so OK). – Maarten Bodewes May 25 '16 at 23:14
  • The best thing to do here is debug or print out the intermediate values and compare them with an implementation that does already perform Rijndael-256. – Maarten Bodewes May 26 '16 at 16:04
  • @Maarten - I've scrapped 256 for now. I'll revisit it later (frustrated). The problem with that approach, though, is that Intel NI is LE and Rijndael is BE, so you either have to transpose each round's output or write Rijndael as LE to compare (best choice). – JGU May 26 '16 at 17:58
  • @Maarten - Right now I'm trying to figure out a speed variance: whether it's a bug, or hyperthreading/overclock/memory creating the variance (it runs between 1800 and 4700 MB/s on my i7-6700T, mostly high but with dips, but 4+ GB/s seems too high?). – JGU May 26 '16 at 18:38
  • It can be pretty fast, I would not be too surprised. Send me an i7-6700T and I'll be happy to test. – Maarten Bodewes May 26 '16 at 18:48
  • Just passed 10000 loops of AESAVS so.. pretty fast. Gotta love those intrinsics.. – JGU May 26 '16 at 18:56
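
The suggestion in the comments to print intermediate values and compare them against a reference could be sketched along these lines. Both helpers are hypothetical and not part of the library in the question: `to_hex` formats raw state bytes in memory order, and `swap_words` reverses the bytes inside each 32-bit word. Whether a word-wise swap is the right conversion for a particular reference implementation is an assumption that would need to be verified.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: hex-encode `len` bytes in memory order.
std::string to_hex(const std::uint8_t* data, std::size_t len) {
    static const char kDigits[] = "0123456789abcdef";
    std::string out;
    out.reserve(len * 2);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(kDigits[data[i] >> 4]);
        out.push_back(kDigits[data[i] & 0x0F]);
    }
    return out;
}

// Hypothetical helper: reverse the bytes inside each 32-bit word.
// Any trailing bytes beyond a multiple of 4 are copied unchanged.
std::vector<std::uint8_t> swap_words(const std::uint8_t* data, std::size_t len) {
    std::vector<std::uint8_t> out(data, data + len);
    for (std::size_t i = 0; i + 4 <= len; i += 4) {
        out[i] = data[i + 3];
        out[i + 1] = data[i + 2];
        out[i + 2] = data[i + 1];
        out[i + 3] = data[i];
    }
    return out;
}
```

In the routines above, each `__m128i` would first be written to a 16-byte buffer with `_mm_storeu_si128`, and that buffer passed to `to_hex` after every round, so the two implementations can be diffed round by round as text.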
