I have a byte array, which is the hash of a file. It is produced with MessageDigest, so there is padding. I then make a shorthash, which is just the first two bytes of the hash, like this:

 byte[] shorthash = new byte[2];
 System.arraycopy(hash, 0, shorthash, 0, 2);

To make it readable for the user and to save it in a DB, I'm converting it to String with a Base64 Encoder:

Base64.getUrlEncoder().encodeToString(hash); //Same for shorthash

What I don't understand is:

  1. Why is the String representing my shorthash four characters long? I thought a char was one or two bytes, so since I'm copying only two bytes, I shouldn't have more than two chars, right?

  2. Why isn't my shorthash String the same as the start of the hash String?

For example, I get:

Hash: LE5D8vCsMp3Lcf-RBwBRbO1v4soGq7BBZ9kB_2SJnGY=
Shorthash: Rak=

You can see the = at the end of each. It presumably comes from the MessageDigest padding, so it is normal for the hash, but why is it there for the shorthash? It should be the FIRST two bytes, and the = is at the end!

Moreover, since I wanted to get rid of this padding, I tried this:

String finalHash = Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
byte[] shorthash = new byte[2];
System.arraycopy(finalHash.getBytes(), 0, shorthash, 0, 2);
String finalShorthash = Base64.getUrlEncoder().encodeToString(shorthash);

I didn't want to copy the String directly, since I'm not really sure what two bytes of a String would be.

With this, the = is gone from my hash, but not from my shorthash. I guess I need to add the "withoutPadding" option for my shorthash too, but I don't understand why, since it's a copy of my hash, which shouldn't have padding anymore. Unless the padding is gone only from the String representation and not from the bytes behind it?

Can someone explain this behavior? Does it come from the conversion between byte[] and String?

Mark Rotteveel
Ablia
  • Base64 needs one output character (one byte) to encode 6 bits, so it is to be expected that a Base64-encoded string is longer than the initial byte sequence. – tkausl Sep 01 '18 at 10:03
  • Can you just convert your array of two bytes into a hex string? That would be more readable than a Base64 string. – Maurizio Ricci Sep 01 '18 at 10:06
  • Thanks for the review. Yeah, I guess I could do it in hex; Base64 is just the first solution I found. I can do that, but I still want to understand the strange behavior here. @tkausl gave the answer to the first problem, thank you! – Ablia Sep 01 '18 at 10:19
  • Base64 is padded up to multiples of four (although in some base64 schemes, padding is optional). This padding has nothing to do with any padding in a hashing scheme. – Mark Rotteveel Sep 01 '18 at 10:20

1 Answer

"Why is the String representing my shorthash four characters long?"

Because you Base64-encoded it. Each Base64 digit represents exactly 6 bits of data. You have 16 bits. 2 digits are not enough (only 12 bits), so you need 3 digits to represent those bits; the third carries the remaining 4 bits. The 4th character (=) is padding, because Base64 output is usually normalized to a multiple of 4 characters.
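As a minimal sketch of the point above: the class name ShortHashDemo and its helper methods are made up for illustration, and the two bytes 0x2C 0x4E are simply what the first characters ("LE5D") of the hash string in the question decode to. Two bytes come out as three data characters plus one =, and the = disappears only when the encoder that produces that particular string is built with withoutPadding().

```java
import java.util.Base64;

public class ShortHashDemo {
    // URL-safe Base64, keeping the '=' padding (the encoder's default).
    static String padded(byte[] data) {
        return Base64.getUrlEncoder().encodeToString(data);
    }

    // Same alphabet, but the trailing '=' characters are left out.
    static String unpadded(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    public static void main(String[] args) {
        // Assumed first two bytes of the digest (decoded from "LE5D...").
        byte[] shortHash = {0x2C, 0x4E};

        System.out.println(padded(shortHash));   // 16 bits -> 3 digits + 1 pad: LE4=
        System.out.println(unpadded(shortHash)); // same 3 digits, no pad: LE4
    }
}
```

The third character is 4 here rather than 5 as in the full hash, because in the full hash that digit also holds two bits of the third byte, whereas here those bits are filled with zeros.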

Max Vollmer
  • And similarly the 32-byte hash (probably SHA-256 or SHA3-256?) is 256 bits and 256/6 > 42 so it requires 43 digits plus 1 pad char in base64 – dave_thompson_085 Sep 01 '18 at 15:48
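The arithmetic in the answer and the comment above can be checked with a short sketch. The class Base64Length and its two helper methods are hypothetical names introduced here, and the all-zero array only stands in for a real 32-byte digest, since the output length depends on the input length alone.

```java
import java.util.Base64;

public class Base64Length {
    // Characters needed to carry n bytes at 6 bits per character: ceil(8n / 6).
    static int unpaddedLength(int n) {
        return (8 * n + 5) / 6;
    }

    // Padded output is rounded up to whole 4-character groups: 4 * ceil(n / 3).
    static int paddedLength(int n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] digest = new byte[32]; // stand-in for a 256-bit digest (all zeros)

        String padded = Base64.getUrlEncoder().encodeToString(digest);
        String unpadded = Base64.getUrlEncoder().withoutPadding().encodeToString(digest);

        // Compare the encoder's actual output lengths with the formulas.
        System.out.println(padded.length() + " vs " + paddedLength(32));
        System.out.println(unpadded.length() + " vs " + unpaddedLength(32));
    }
}
```

For 32 bytes the formulas give 44 characters with padding and 43 without; for 2 bytes they give 4 and 3.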