Trouble comparing Java strings (of different encoding)

Question

I'm writing EXIF metadata to a JPEG using Apache Commons Imaging (Sanselan), and, at least in the 0.97 release of Sanselan, there were some bugs related to charset/encoding. The EXIF 2.2 standard requires the content of the UserComment tag (a field of type UNDEFINED) to be prefixed with an 8-byte ASCII "signature" that identifies the encoding of the content that follows. UserComment is the tag I'm writing to.

Windows expects the content to be encoded in UTF-16, so the bytes written to the JPEG must consist of (single-byte) ASCII characters followed by (double-byte) Unicode characters. Furthermore, although UserComment doesn't seem to require it, I notice that the content is often "null-padded" to an even length.

Here's the code I'm using to create and write the tag:

String textToSet = "Test";
byte[] ASCIIMarker = new byte[]{ 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 }; // spells out "UNICODE"
byte[] comment = textToSet.getBytes("UnicodeLittle"); 

// pad with \0 if (total) length is odd (or is \0 byte automatically added by arraycopy?)
int pad = (ASCIIMarker.length + comment.length) % 2;

byte[] bytesComment = new byte[ASCIIMarker.length + comment.length + pad];
System.arraycopy(ASCIIMarker, 0, bytesComment, 0, ASCIIMarker.length);
System.arraycopy(comment, 0, bytesComment, ASCIIMarker.length, comment.length);
if (pad > 0) bytesComment[bytesComment.length-1] = 0x00;

TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
        TiffFieldTypeConstants.FIELD_TYPE_UNDEFINED, bytesComment.length - pad, bytesComment);

Then when I read the tag back from the JPEG, I do the following:

String textRead = null;  // initialized so the null check below compiles
TiffField field = jpegMetadata.findEXIFValue(TiffConstants.EXIF_TAG_USER_COMMENT);
if (field != null) {
    textRead = new String(field.getByteArrayValue(), "UnicodeLittle");
}

What confuses me is this: The bytes written to the JPEG are prefixed with 8 ASCII bytes, which obviously need to be "stripped off" in order to compare what was written to what was read:

if (textRead != null) {
  if (textToSet.equals(textRead)) {  // expecting this to FAIL
    System.out.println("Equal");
  } else {
    System.out.println("Not equal");
    if (textToSet.equals(textRead.substring(5))) {  // this works
      System.out.println("Equal after all...");
    }
  }
}

But why substring(5), as opposed to... substring(8)? If it were 4, I might think that 4 double-byte (UTF-16) characters total 8 bytes, but it only works if I strip off 5 characters. Is this an indication that I'm not creating the payload (byte array bytesComment) properly?

PS! I will update to Apache Commons Imaging RC 1.0, which came out in 2016 and hopefully has fixed these bugs, but I'd still like to understand why this works once I've gotten this far with 0.97 :-)

According to http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html , "UnicodeLittle" assumes a byte order mark (BOM, or U+FEFF) is present at the start of the text. Try using `StandardCharsets.UTF_16LE` instead. — VGR, Apr 06 '17 at 17:49
Thanks, that makes sense and solves my problem. `substring(4)` now works as expected. I'd mark your solution as "accepted" if it were possible. — joakimk, Apr 06 '17 at 20:09
@joakimk feel free to answer (and accept) your own question so that it does not remain unanswered — Piro, Apr 12 '18 at 06:04
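
To illustrate the point made in the comments, here is a minimal sketch (not part of the original thread). It assumes the JVM provides the legacy "UnicodeLittle" charset, which writes a 2-byte BOM (FF FE) before the text. When that BOM sits after the 8-byte marker, it decodes as one extra character, which would explain why 5 characters had to be skipped instead of 4. The class name `UserCommentEncodingDemo` and the helpers `encode` and `decode` are hypothetical names chosen for illustration; they are not part of Sanselan or Commons Imaging.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UserCommentEncodingDemo {
    // 8-byte EXIF character-code marker: "UNICODE" followed by a NUL byte.
    static final byte[] MARKER = { 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 };

    // Hypothetical helper: marker followed by the text as UTF-16LE without a BOM.
    static byte[] encode(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = Arrays.copyOf(MARKER, MARKER.length + body.length);
        System.arraycopy(body, 0, out, MARKER.length, body.length);
        return out;
    }

    // Hypothetical helper: skip the marker at the byte level, then decode the rest.
    static String decode(byte[] payload) {
        return new String(payload, MARKER.length,
                payload.length - MARKER.length, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: "UnicodeLittle" is available and prepends a BOM when encoding.
        byte[] withBom = "Test".getBytes("UnicodeLittle");
        byte[] withoutBom = "Test".getBytes(StandardCharsets.UTF_16LE);
        System.out.println("UnicodeLittle bytes: " + withBom.length);
        System.out.println("UTF_16LE bytes:      " + withoutBom.length);

        // Round trip without any substring() arithmetic on the decoded string.
        System.out.println(decode(encode("Test")));
    }
}
```

Stripping the marker from the byte array before decoding, rather than calling `substring` on the decoded string, avoids having to reason about how many characters the 8 marker bytes turn into.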
