I'm writing EXIF metadata to a JPEG using Apache Commons Imaging (Sanselan), and, at least in the 0.97 release of Sanselan, there were some bugs related to charset/encoding. The EXIF 2.2 standard requires that the content of certain fields of type UNDEFINED be prefixed with an 8-byte ASCII "signature" describing the encoding of the content that follows. The field/tag I'm writing to is the UserComment EXIF tag.
Windows expects the content to be encoded in UTF-16, so the bytes written to the JPEG must consist of the (single-byte) ASCII signature followed by (double-byte) Unicode characters. Furthermore, although UserComment doesn't seem to require it, I notice that the content is often "null-padded" to an even length.
Here's the code I'm using to create and write the tag:
String textToSet = "Test";
byte[] ASCIIMarker = new byte[]{ 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 }; // spells out "UNICODE"
byte[] comment = textToSet.getBytes("UnicodeLittle");
// pad with \0 if (total) length is odd (or is \0 byte automatically added by arraycopy?)
int pad = (ASCIIMarker.length + comment.length) % 2;
byte[] bytesComment = new byte[ASCIIMarker.length + comment.length + pad];
System.arraycopy(ASCIIMarker, 0, bytesComment, 0, ASCIIMarker.length);
System.arraycopy(comment, 0, bytesComment, ASCIIMarker.length, comment.length);
if (pad > 0) bytesComment[bytesComment.length-1] = 0x00;
TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
TiffFieldTypeConstants.FIELD_TYPE_UNDEFINED, bytesComment.length - pad, bytesComment);
Then when I read the tag back from the JPEG, I do the following:
String textRead = null; // initialised so the null check further down compiles
TiffField field = jpegMetadata.findEXIFValue(TiffConstants.EXIF_TAG_USER_COMMENT);
if (field != null) {
    textRead = new String(field.getByteArrayValue(), "UnicodeLittle");
}
What confuses me is this: The bytes written to the JPEG are prefixed with 8 ASCII bytes, which obviously need to be "stripped off" in order to compare what was written to what was read:
if (textRead != null) {
    if (textToSet.equals(textRead)) { // expecting this to FAIL
        System.out.println("Equal");
    } else {
        System.out.println("Not equal");
        if (textToSet.equals(textRead.substring(5))) { // this works
            System.out.println("Equal after all...");
        }
    }
}
But why substring(5), as opposed to... substring(8)? If it were 4, I might think that 4 double-byte (UTF-16) symbols total 8 bytes, but it only works if I strip off 5 characters. Is this an indication that I'm not creating the payload (byte array bytesComment) properly?
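To take the library out of the picture, here is a minimal sketch of the same round trip using only the JDK. It assumes Sanselan hands back the stored bytes unchanged, and the class and method names (UserCommentRoundTrip, buildPayload, decodeAll, decodeAfterMarker) are made up for this post. It dumps every decoded character so the offset can be inspected directly.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper class for this post; not part of Sanselan / Commons Imaging.
class UserCommentRoundTrip {
    // Same 8-byte marker as in the code above: "UNICODE" followed by a NUL byte.
    static final byte[] MARKER = { 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 };

    /** Concatenates the marker and the encoded text (padding left out for brevity). */
    static byte[] buildPayload(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        byte[] payload = Arrays.copyOf(MARKER, MARKER.length + encoded.length);
        System.arraycopy(encoded, 0, payload, MARKER.length, encoded.length);
        return payload;
    }

    /** Decodes the whole payload, marker included, like the reading code above. */
    static String decodeAll(byte[] payload, Charset charset) {
        return new String(payload, charset);
    }

    /** Skips the marker at the byte level, then decodes only the remaining bytes. */
    static String decodeAfterMarker(byte[] payload, Charset charset) {
        return new String(payload, MARKER.length, payload.length - MARKER.length, charset);
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("UnicodeLittle"); // same charset name as in the question
        byte[] payload = buildPayload("Test", cs);
        System.out.println("payload bytes: " + payload.length);

        String all = decodeAll(payload, cs);
        for (int i = 0; i < all.length(); i++) {
            System.out.printf("char %d = U+%04X%n", i, (int) all.charAt(i));
        }
        System.out.println("after marker: " + decodeAfterMarker(payload, cs));
    }
}
```

Since substring() counts UTF-16 chars rather than bytes, the 8 marker bytes decode to 4 chars, so the dump should make it visible whether an extra char sits between the marker and the text.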
PS! I will update to Apache Commons Imaging RC 1.0, which came out in 2016 and hopefully has fixed these bugs, but I'd still like to understand why this works once I've gotten this far with 0.97 :-)