
I'm storing strings in TIFF headers using JAI. Some strings contain characters whose value is greater than 127 (e.g. 'é' is 233).

When I open the resulting TIFF file with a hex editor, I can see the byte 233, but when I read the string back through JAI with TIFFField.getAsString(), I get a replacement character (Unicode U+FFFD) instead. I have checked the TIFF 6.0 specification, but it only mentions "7-bit ASCII".

I would like to tell JAI to use the ISO-8859-1 charset to decode strings. Is that possible? I haven't found anything in the (old) javadoc. As a last resort, I could URL-encode the strings, but I would rather avoid that.

JosefZ
Matthieu
  • What is the type of the TIFF tags you are writing? If the tags are specified as ASCII, there really is no other encoding available (although I've seen software write UTF8 regardless). – Harald K Sep 02 '13 at 17:41
  • @haraldK Yes, it's TIFF_ASCII. JAI takes care of the writing through Java String (which is Unicode) but if I write 'é', I get '?' when reading back. – Matthieu Sep 02 '13 at 23:03

1 Answer


By the specification, a TIFF tag defined as ASCII is only allowed to contain plain 7-bit ASCII.

Unfortunately, this isn't very useful in the real world (where not all of us speak English), so a lot of software writes UTF-8 or even ISO-8859-x encoded strings into these fields, in violation of the spec, which allows no other encoding in an ASCII tag.

JAI, being very strict in reading, probably decodes the string as plain ASCII, and since 'é' isn't part of that charset, it replaces it with the Unicode replacement character.
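To illustrate the effect (this is not JAI's actual code, only a sketch of what a strict ASCII decoder does, using plain JDK classes), decoding the byte 0xE9 as US-ASCII yields U+FFFD, while decoding the same byte as ISO-8859-1 yields 'é'. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not touch JAI at all.
class AsciiDecodeDemo {

    // Strict 7-bit decoding: bytes above 0x7F are malformed input and the
    // String constructor substitutes U+FFFD for each of them.
    static String decodeAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    // ISO-8859-1 maps every byte 0x00..0xFF to the code point of the same value.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 }; // "café" as stored in the file
        System.out.println(decodeAscii(raw).equals("caf\uFFFD"));  // true
        System.out.println(decodeLatin1(raw).equals("caf\u00E9")); // true
    }
}
```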

Your best bet is to do one of the following:

  • If allowed by the tag, use BYTE or UNDEFINED instead of ASCII, together with an encoding of your choice
  • If possible, use a different tag to write your value (one that allows BYTE or UNDEFINED values, again with an encoding of your choice)
  • If neither of the above is possible, get to the actual bytes and decode them yourself, or use a different library to parse the TIFF structure
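A minimal sketch of the "bytes plus your own encoding" approach, using only the JDK. It assumes you store the value in a BYTE or UNDEFINED field as ISO-8859-1 with a trailing NUL (a convention chosen here for the example, mirroring ASCII fields), and that your library hands you the raw bytes of the field on reading. How you obtain those bytes is outside the sketch, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for storing Latin-1 text in a BYTE/UNDEFINED field.
class Latin1Field {

    // Encode the text as ISO-8859-1 and append a NUL terminator.
    // Characters outside Latin-1 are replaced with '?' by getBytes().
    static byte[] toFieldBytes(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.copyOf(encoded, encoded.length + 1); // last byte is 0
    }

    // Decode raw field bytes as ISO-8859-1, ignoring trailing NUL bytes.
    static String fromFieldBytes(byte[] raw) {
        int end = raw.length;
        while (end > 0 && raw[end - 1] == 0) {
            end--;
        }
        return new String(raw, 0, end, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] stored = toFieldBytes("caf\u00E9");
        System.out.println(stored.length);                               // 5
        System.out.println(fromFieldBytes(stored).equals("caf\u00E9")); // true
    }
}
```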
Harald K
  • I was afraid you would say that ;) I'd rather not use a different library to parse TIFF, unless you know one which can handle multipage JPEG-in-TIFF? Is it possible with JAI to get the actual bytes of the field to decode it directly? – Matthieu Sep 03 '13 at 13:41
  • Not sure if it suits your needs, but I am developing a pure Java [TIFF plugin for ImageIO](https://github.com/haraldk/TwelveMonkeys/tree/master/imageio/imageio-tiff) that should support multipage JPEG encoded TIFF files (both old and new flavors). Feel free to give it a try. Independent of that plugin, there's also a [TIFF/EXIF parser](https://github.com/haraldk/TwelveMonkeys/tree/master/imageio/imageio-metadata/src/main/java/com/twelvemonkeys/imageio/metadata/exif) you could use to read the tags. Don't know if JAI lets you access the actual bytes, sorry. – Harald K Sep 03 '13 at 19:07
  • Thanks, I'll try to find time to give it a try. In the meantime, I'll just URL-encode my strings before storing them. It seems it would have the least impact on both size and code. – Matthieu Sep 04 '13 at 12:54
  • If you control both reading and writing, the URL-encoding trick should be pretty safe. – Harald K Sep 04 '13 at 13:09
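For reference, the URL-encoding workaround discussed in the comments could look roughly like this. It is a sketch using the JDK's URLEncoder and URLDecoder with UTF-8 (the choice of UTF-8 is an assumption, any charset works as long as both sides agree). The result is pure 7-bit ASCII, so it fits a TIFF_ASCII field. The class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper: makes a string 7-bit clean before it goes into an ASCII tag.
class UrlEncodedTag {

    // Percent-encode everything outside the URL-safe ASCII set (space becomes '+').
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Reverse of encode(): apply to the string read back from the tag.
    static String decode(String stored) {
        try {
            return URLDecoder.decode(stored, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String stored = encode("caf\u00E9 au lait");
        System.out.println(stored); // caf%C3%A9+au+lait
        System.out.println(decode(stored).equals("caf\u00E9 au lait")); // true
    }
}
```

Each non-ASCII character costs a few extra bytes (six for 'é' in UTF-8), which is usually negligible for short header strings.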