Why does the following code

var s = "2I==";
var b = Convert.FromBase64String(s);
var new_s = Convert.ToBase64String(b);

end up with new_s being 2A==?

s was originally a longer string (96 chars) but I couldn't include it because it is a secret key.

eggbert
  • I suspect it's because `2I==` is effectively providing some bits for the padding value. Will analyze... – Jon Skeet Mar 26 '15 at 20:13

2 Answers

"2I==" represents the numbers 54, 8, (padding x2), as per Wikipedia.

In other words, the bits represented are:

110110 001000 XXXXXX XXXXXX

(Where X represents an "I don't care, it's from padding")

However, because the padding indicates that there's only one byte of information here, the last 4 bits from the second character are irrelevant. As ever, we can reformat the 4 pieces of 6-bit information into 3 pieces of 8-bit information, at which point it becomes clearer:

11011000 1000XXXX XXXXXXXX

You can see that the second byte must be padding, as some of its bits come from a padding character. So only the first character and the top two bits of the second character are relevant - it decodes to just the single byte 0b11011000.
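The regrouping can also be checked by hand. This is only a minimal sketch, assuming the standard Base64 alphabet from RFC 4648; the names Alphabet, first, second, decoded and discarded are illustrative and not part of any library.

```csharp
using System;

// Standard Base64 alphabet (RFC 4648); the index of a character is its 6-bit value.
const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int first = Alphabet.IndexOf('2');   // 54 -> 110110
int second = Alphabet.IndexOf('I');  // 8  -> 001000

// With two '=' characters there is only one byte of data:
// all six bits of the first symbol plus the top two bits of the second.
byte decoded = (byte)((first << 2) | (second >> 4));

// The low four bits of the second symbol are pad bits and get thrown away.
int discarded = second & 0x0F;

Console.WriteLine(Convert.ToString(first, 2).PadLeft(6, '0'));      // 110110
Console.WriteLine(Convert.ToString(second, 2).PadLeft(6, '0'));     // 001000
Console.WriteLine($"0x{decoded:X2}");                               // 0xD8
Console.WriteLine(Convert.ToString(discarded, 2).PadLeft(4, '0'));  // 1000
```

The non-zero value in discarded is exactly the information that is lost on the round trip.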

Now when you encode 0b11011000, you know that you'll have two padding characters, and the first character must be '2' (to represent bits '110110') but the second character can be any character whose first two bits represent '00'. It just happens that Convert.ToBase64String uses 'A', which has 0 bits for the irrelevant parts.
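To see that the second character only matters through its top two bits, a quick experiment along these lines may help. It is a sketch that relies on Convert.FromBase64String accepting non-zero pad bits, which is the behaviour reported in the question; the chosen sample characters are arbitrary.

```csharp
using System;

// 'A' (000000), 'I' (001000) and 'P' (001111) all start with the bits 00;
// 'Q' (010000) is the first character whose top two bits differ.
foreach (char c in "AIPQ")
{
    string input = "2" + c + "==";
    byte[] bytes = Convert.FromBase64String(input);
    string roundTripped = Convert.ToBase64String(bytes);
    Console.WriteLine($"{input} -> 0x{bytes[0]:X2} -> {roundTripped}");
}
// Output:
// 2A== -> 0xD8 -> 2A==
// 2I== -> 0xD8 -> 2A==
// 2P== -> 0xD8 -> 2A==
// 2Q== -> 0xD9 -> 2Q==
```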

The question in my mind is why an encoder would choose to use 'I' instead of 'A'. I don't think it's invalid to do this in Base64, but it's an odd choice.

Jon Skeet

Jon Skeet has provided a good explanation for the observed behaviour. However, under most definitions of Base64, your input string would be considered invalid. The standards contain the following text:

When fewer than 24 input bits are available in an input group, bits with value zero are added (on the right) to form an integral number of 6-bit groups.

  • RFC 1421: Privacy Enhancement for Internet Electronic Mail
  • RFC 2045: Multipurpose Internet Mail Extensions (MIME)
  • RFC 4648: The Base16, Base32, and Base64 Data Encodings

RFC 4648 puts further emphasis on this:

The padding step in base 64 and base 32 encoding can, if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders [...] Decoders MAY chose to reject an encoding if the pad bits have not been set to zero.
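Convert.FromBase64String does not appear to reject such input, so if you want the stricter behaviour the RFC permits, one possible approach is to re-encode the decoded bytes and compare. This is a sketch only; IsCanonicalBase64 is a hypothetical helper, not a framework method, and it will also report false for otherwise valid input that contains whitespace or line breaks.

```csharp
using System;

// Hypothetical helper: true only if 's' is exactly what the encoder
// would produce for the bytes it decodes to (so pad bits must be zero).
static bool IsCanonicalBase64(string s)
{
    try
    {
        return Convert.ToBase64String(Convert.FromBase64String(s)) == s;
    }
    catch (FormatException)
    {
        // Not decodable as Base64 at all.
        return false;
    }
}

Console.WriteLine(IsCanonicalBase64("2A=="));  // True
Console.WriteLine(IsCanonicalBase64("2I=="));  // False
```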

We can assume that your original input consisted of a single byte having value 216 (0xD8). In binary:

11011000

This needed to be split up into 6-bit groups:

110110 00

And, per the quoted definition above, the last group needed to be padded with zeros:

110110 000000

Per the Base64 alphabet, 110110 (decimal: 54) maps to the character 2, whilst 000000 (decimal: 0) maps to the character A. Adding the = padding to get a 24-bit group, the final result will be 2A==. This is the only valid encoding for your original input.
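The same steps can be written out in code. This is a minimal sketch for the single-byte case only, assuming the RFC 4648 alphabet; Alphabet, firstIndex, secondIndex and encoded are illustrative names.

```csharp
using System;

// Standard Base64 alphabet (RFC 4648).
const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

byte value = 0xD8;  // 11011000

// First 6-bit group: the top six bits of the byte.
int firstIndex = value >> 2;             // 110110 = 54

// Second group: the remaining two bits, padded on the right with four zero bits.
int secondIndex = (value & 0x03) << 4;   // 000000 = 0

// Two '=' characters complete the four-character output group.
string encoded = $"{Alphabet[firstIndex]}{Alphabet[secondIndex]}==";

Console.WriteLine(encoded);                                  // 2A==
Console.WriteLine(Convert.ToBase64String(new[] { value }));  // 2A==
```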

Douglas
  • +1 - This shows why the range from `2A==` to `2P==` will generate the same value `2A==`. Then once `Q` is reached, the least significant bit of the two-bit value will increment and you would get `2Q==` – SwDevMan81 Mar 26 '15 at 21:08