
My Android app uses an open-source library that only accepts text data in ISO-8859-1 encoding. I have a few users from Eastern Europe who would like to enter cp1251-encoded text. This seems to be a limitation of the open-source library, as Java is fully capable of supporting these encodings as well as Unicode.

One option could be to modify the open-source library to support multiple character sets. Alternatively, would it be possible to convert cp1251 to ISO-8859-1 and then back again? Since they are both 8-bit encodings, it seems like you would be storing the same amount of data at the byte level. However, when the open-source library loads the byte data into a string with ISO-8859-1 encoding, any byte value not present in ISO-8859-1 would likely throw an exception.

I'm not a character set expert, but the fact that I can't find code samples doing this conversion leads me to believe it won't work, at least not reliably.

sashoalm
ktambascio
  • ISO 8559-1 doesn't exist. You probably meant [ISO 8859-1](http://en.wikipedia.org/wiki/ISO_8859-1)? – BalusC Jan 08 '13 at 01:48
  • Any byte sequence is valid in ISO-8859-1. This all depends on what the library does: why does it take bytes rather than strings? What is the library called? – Esailija Jan 08 '13 at 15:12
  • BalusC - thanks for pointing that out...fixed now. – ktambascio Jan 08 '13 at 18:52
  • Esailija - the library is Sanselan from Apache Commons. https://svn.apache.org/repos/asf/commons/proper/imaging/trunk/src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcParser.java The code explicitly reads a byte buffer stored in the JPG and converts it to a string using ISO-8859-1 encoding (the same is true in reverse). I suspect that, since the JPG storage is just a bag of bytes, I can probably modify the library to store strings in any encoding. – ktambascio Jan 08 '13 at 18:54
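
The point raised in the comments (every byte value is valid in ISO-8859-1) can be checked with a small experiment. The following is a minimal sketch, not anything taken from Sanselan: the class and method names are hypothetical, and it assumes the runtime provides a `windows-1251` charset. It encodes text as CP1251, reads those bytes as if they were ISO-8859-1, and then reverses the process.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Assumption: the runtime ships a windows-1251 (CP1251) charset.
    static final Charset CP1251 = Charset.forName("windows-1251");

    /** Encodes text as CP1251, then reads those bytes as if they were ISO-8859-1. */
    static String wrapAsLatin1(String text) {
        byte[] cp1251Bytes = text.getBytes(CP1251);
        return new String(cp1251Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Reverses wrapAsLatin1: recovers the raw bytes, then decodes them as CP1251. */
    static String unwrapFromLatin1(String carrier) {
        byte[] cp1251Bytes = carrier.getBytes(StandardCharsets.ISO_8859_1);
        return new String(cp1251Bytes, CP1251);
    }

    public static void main(String[] args) {
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442"; // "Привет"
        String carrier = wrapAsLatin1(original);      // unreadable as text, but no bytes are lost
        String restored = unwrapFromLatin1(carrier);
        System.out.println(original.equals(restored));
    }
}
```

The intermediate string is not meaningful text (the Cyrillic letters show up as accented Latin characters), so this only helps if the library passes the string through untouched and every reader of the stored data knows to decode it as CP1251.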

1 Answer


You are correct that this won't work well. Most of the non-ASCII characters in CP1251 are not present in ISO-8859-1. (CP1251 is used for Eastern European languages and consists largely of Cyrillic characters; ISO-8859-1 is Western European and contains a mix of accented Latin characters, punctuation, and symbols.) A few characters are represented in both, but there are so few, almost all of them punctuation, that it probably won't do you any good.
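
To illustrate the character-level mismatch described above, here is a minimal sketch (class and method names are hypothetical) that counts how many characters of a string ISO-8859-1 cannot represent, and shows what happens when Java is asked to encode them anyway. `String.getBytes(Charset)` does not throw for unmappable characters; it substitutes the charset's replacement byte, which for ISO-8859-1 is `?`.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class LossyConversion {
    /** Counts the characters in text that ISO-8859-1 cannot represent. */
    static int countUnmappable(String text) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!latin1.canEncode(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /** Encodes text to ISO-8859-1 and decodes it again; unmappable characters come back as '?'. */
    static String transcode(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"; // "Привет"
        System.out.println(countUnmappable(cyrillic)); // all six letters are unmappable
        System.out.println(transcode(cyrillic));       // the original letters are gone
    }
}
```

Once the letters have been replaced there is no way to recover them, which is why converting the characters themselves (as opposed to carrying the raw bytes) is a dead end for Cyrillic text.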

  • Well, to be complete, both 8859-1 and 1251 are 8-bit encodings, and both contain the entirety of ASCII, so they share that, at least. :) – Michael Petrotta Jan 08 '13 at 01:53
  • Well... yes. But if the text were just ASCII, encodings wouldn't be an issue. :) –  Jan 08 '13 at 03:39