Converting between ISO-8559-1 and cp1251

Question

My Android app uses an open-source library that only accepts text data in ISO-8859-1 encoding. I have a few users from Eastern Europe who would like to enter cp1251-encoded text. This seems to be a limitation of the open-source library, since Java fully supports both of these encodings as well as the Unicode encodings.

One option could be to modify the open-source library to support multiple character sets. Alternatively, would it be possible to convert cp1251 to ISO-8859-1 and then back again? Since both are 8-bit encodings, it seems you would be storing the same amount of data at the byte level. However, when the open-source library loads the byte data into a string with ISO-8859-1 encoding, I would expect any byte value not present in ISO-8859-1 to throw an exception.

I'm not a character set expert, but the fact that I can't find code samples doing this conversion leads me to believe it won't work, at least not reliably.

ISO 8559-1 doesn't exist. You probably meant [ISO 8859-1](http://en.wikipedia.org/wiki/ISO_8859-1)? — BalusC, Jan 08 '13 at 01:48
Any byte sequence is valid in ISO-8859-1. This all depends on what the library does. Why does it take bytes instead of strings? What is the library called? — Esailija, Jan 08 '13 at 15:12
Esailija - the library is Sanselan from Apache Commons. https://svn.apache.org/repos/asf/commons/proper/imaging/trunk/src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcParser.java The code explicitly reads a byte buffer stored in the JPG and converts it to a string using ISO-8859-1 encoding (the same is true in reverse). I suspect that, since the JPG storage is just a bag of bytes, I can probably modify the library to store strings in any encoding. — ktambascio, Jan 08 '13 at 18:54
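To illustrate Esailija's point that any byte sequence is valid in ISO-8859-1, here is a minimal sketch of carrying cp1251 bytes through an ISO-8859-1 string and getting them back. It assumes the library does nothing to the text beyond decoding and re-encoding it as ISO-8859-1, and that the JVM provides the `windows-1251` charset. The class and method names are made up for the example and are not part of Sanselan.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class Latin1RoundTrip {
    // Assumption: the runtime ships the windows-1251 (cp1251) charset.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Encode the text as cp1251, then reinterpret those bytes as ISO-8859-1.
    // Each byte 0x00-0xFF maps to the char U+0000-U+00FF, so no byte is lost.
    // Characters that cp1251 cannot represent are replaced during encoding.
    static String toLatin1Carrier(String text) {
        byte[] cp1251Bytes = text.getBytes(CP1251);
        return new String(cp1251Bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverse step: recover the raw bytes and decode them as cp1251.
    static String fromLatin1Carrier(String carrier) {
        byte[] rawBytes = carrier.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, CP1251);
    }

    public static void main(String[] args) {
        // Cyrillic sample text, written with Unicode escapes so the source
        // file encoding does not matter.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";
        String carrier = toLatin1Carrier(original);
        String restored = fromLatin1Carrier(carrier);
        System.out.println(restored.equals(original));
    }
}
```

The carrier string looks like gibberish if displayed, and any other software that reads the same field as ISO-8859-1 will show that gibberish, so this only helps when the same app writes and reads the data.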

score 3 · Accepted Answer · answered Jan 08 '13 at 01:49


You are correct that this won't work very well at all. Most of the non-ASCII characters in CP1251 are not present in ISO-8859-1. (CP1251 is an Eastern European encoding that consists mostly of Cyrillic letters; ISO-8859-1 is Western European and contains a mix of accented Latin characters, punctuation, and symbols.) A few characters are represented in both, but they are so few, and almost all of them punctuation, that it probably won't do you any good.
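As a rough way to check the overlap yourself, the sketch below decodes every high byte (0x80 to 0xFF) as cp1251 and keeps the characters that ISO-8859-1 can also encode. The class and method names are made up for the example, and it assumes the JVM provides the `windows-1251` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class SharedCharacters {
    // Collects the non-ASCII cp1251 characters that also exist in ISO-8859-1.
    static String sharedNonAscii() {
        Charset cp1251 = Charset.forName("windows-1251");
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder shared = new StringBuilder();
        for (int b = 0x80; b <= 0xFF; b++) {
            // Decode a single byte; unmapped bytes come back as U+FFFD.
            char c = new String(new byte[] {(byte) b}, cp1251).charAt(0);
            if (c != '\uFFFD' && latin1.canEncode(c)) {
                shared.append(c);
            }
        }
        return shared.toString();
    }

    // True if every character of the text is representable in ISO-8859-1.
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(sharedNonAscii());
        // A Cyrillic letter (U+041F) is not representable in ISO-8859-1.
        System.out.println(fitsInLatin1("\u041f"));
    }
}
```

A check like `fitsInLatin1` could be used to detect text that would be mangled by a real character-level conversion before handing it to the library.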


Well, to be complete, both 8859-1 and 1251 are 8-bit encodings, and both contain the entirety of ASCII, so they share that, at least. :) – Michael Petrotta Jan 08 '13 at 01:53
Well... yes. But if the text were just ASCII, encodings wouldn't be an issue. :) – Jan 08 '13 at 03:39
