
I have trouble converting an email attachment (a simple text file in windows-1251 encoding with Latin and Cyrillic characters) to a String; that is, I have a problem converting the Cyrillic part. I receive the attachment as a base64-encoded String like this:

Base64-encoded email attachment

Original file

So when I try to decode it, I get "?" instead of the Cyrillic characters.

How can I get the right Cyrillic (Russian) characters instead of "?"?

I've already tried the following code with all available encodings, but nothing helps me get correct Russian characters:

    import sun.misc.BASE64Decoder;   // internal JDK API (pre-Java 8 style)
    import java.nio.charset.Charset;

    BASE64Decoder dec = new BASE64Decoder();

    // decode the attachment and print it in every charset the JVM knows about
    for (String key : Charset.availableCharsets().keySet()) {
        System.out.println("K=" + key + " Value:" +
                           Charset.availableCharsets().get(key));
        try {
            System.out.println(new String(dec.decodeBuffer(encoded), key));
        } catch (Exception e) {
            continue;   // skip charsets that fail to decode
        }
    }
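For completeness, here is the same check written with the public java.util.Base64 API (available since Java 8) instead of the internal sun.misc class; using the MIME decoder is an assumption based on the attachment arriving with the usual line breaks of mail-encoded base64:

    import java.nio.charset.Charset;
    import java.util.Base64;

    // decode the attachment once; the MIME decoder tolerates the CR/LF line breaks
    // that mail clients insert into base64 bodies
    byte[] bytes = Base64.getMimeDecoder().decode(encoded);

    // print the same bytes interpreted in every charset the JVM supports
    for (Charset cs : Charset.availableCharsets().values()) {
        System.out.println(cs.name() + ": " + new String(bytes, cs));
    }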

Thank you in advance.

user3283133
  • Are you positive this actually IS a Cyrillic string and not some binary data (a PEM certificate; its MIME type in the details section is application/octet-stream)? – Cromax Feb 07 '14 at 09:43
  • Yes, I get a PEM file; it is a request for a digital certificate. But it is possible to read it with Notepad, so I would like to get its content in Java. Maybe you know a library for such tasks? – user3283133 Feb 07 '14 at 13:06
  • But I bet you read it in Notepad in its Base64 form, don't you? Use some Base64 decoder tool and see what you get then. – Cromax Feb 07 '14 at 13:18
  • I just tried to parse it as an X509Certificate. I got an object from the binary data and called certx509.getSubjectDN(), but it also returns a string like this: EMAILADDRESS=kokleva@klg.grinn-corp.ru, CN=Êîêëåâà Òàòüÿíà Âèêòîðîâíà, T=Áóõãàëòåð ïî çàðïëàòå – user3283133 Feb 07 '14 at 13:18
  • It looks like the issuer didn't use UTF-8 but an ISO-8859-5/KOI8-R charset for the CN and T parts... Well, you could take those parts, call `getBytes("ISO-8859-1")` on them, and then use the resulting arrays to create strings with the "ISO-8859-5" encoding. But there's a chance it got distorted... – Cromax Feb 07 '14 at 13:30
  • That didn't help: System.out.println(new String(((X500Name)certx509.getSubjectDN()).getCommonName().getBytes("ISO-8859-1"), "ISO-8859-5")); gives me ??????? ??????? ?????????? – user3283133 Feb 07 '14 at 14:35
  • Does that mean I can't get the correct String, or are there any other options? – user3283133 Feb 07 '14 at 14:37
  • Hm, strange, I got some output (Ъюъыхтр врђќџэр Тшъђю№ютэр), but it looks broken too. Where does the output go in your case? Is your console able to display this charset? If that doesn't work, I don't know how to help you further. You could probably contact the issuer and try to obtain another certificate, but in the UTF-8 charset. – Cromax Feb 07 '14 at 14:49
  • I try to print it to the console. I think my console is not able to display this encoding, but I also tried printing it to a plain text file, and there the string comes out correctly. But I need to get this string inside a SOA BPEL application, and there I can't get the right string. How can I change the encoding in Java code? – user3283133 Feb 09 '14 at 14:12

1 Answer


I am not very familiar with BPEL and the protocols it uses. If you communicate between nodes using some binary protocol, then you must 1) ensure the client and receiver use the same charset, and 2) convert the Java string into the proper bytes in that encoding. Java stores strings internally in UTF-16. So when you execute String correct = new String(commonName.getBytes("ISO-8859-1"), "ISO-8859-5"), you get the correct string in UTF-16. Then you need to export it to bytes in the requested encoding, e.g. byte[] buff = correct.getBytes("UTF-8"), assuming the encoding you use between nodes is UTF-8. If the encoding happens to be different, then you must make sure it actually supports Cyrillic characters (e.g. ISO-8859-1 does not).
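A minimal sketch of that round trip; which charsets the certificate and the transport actually use is an assumption here, ISO-8859-1, ISO-8859-5 and UTF-8 are just the candidates discussed above:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Java read the CN as if it were ISO-8859-1, so this recovers the original byte values
    byte[] raw = commonName.getBytes(StandardCharsets.ISO_8859_1);

    // reinterpret those bytes in the charset the issuer really used (ISO-8859-5 assumed here)
    String correct = new String(raw, Charset.forName("ISO-8859-5"));

    // export the string as bytes in the encoding agreed between the nodes (UTF-8 assumed)
    byte[] buff = correct.getBytes(StandardCharsets.UTF_8);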

If you use XML for data exchange, make sure it declares a suitable encoding in <?xml encoding="UTF-8"?>. Then you don't need to play with bytes at all; you just need to "import" the string correctly (see the correct variable above). Writing to XML converts characters automatically, but the declared encoding must support the characters you want to write. So if you set encoding="ISO-8859-1", you will get those question marks again.
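For example, if you assemble the XML yourself in Java, you can set the output encoding explicitly with the standard javax.xml.transform API; the element name and document layout below are only illustrative:

    import java.io.StringWriter;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    // (inside a method that declares the checked XML exceptions)
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element subject = doc.createElement("subject");    // illustrative element name
    subject.setTextContent(correct);                    // the correctly decoded string from above
    doc.appendChild(subject);

    Transformer t = TransformerFactory.newInstance().newTransformer();
    t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");  // drives <?xml ... encoding="UTF-8"?>
    StringWriter out = new StringWriter();
    t.transform(new DOMSource(doc), new StreamResult(out));
    System.out.println(out);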

Cromax
  • Let me explain my case in more detail. I get a base64-encoded file (PEM) in my BPEL process. Then I decode this Base64 string to a decoded string (bytes). Then I get an instance of X509Certificate from those bytes and try to retrieve the CN (Common Name) from it. The CN is the Cyrillic surname of the owner. So I think this CN is a windows-1251 string, from which I want to get a String that is correct for Java (UTF-16). I tried: – user3283133 Feb 10 '14 at 07:46
  • new String(((X500Name)certx509.getSubjectDN()).getCommonName().getBytes("Cp1251"), "UTF-8"); But this doesn't help. If I get the String without any re-decoding, ((X500Name)certx509.getSubjectDN()).getCommonName(), the result is: Êîêëåâà Òàòüÿíà Âèêòîðîâíà – user3283133 Feb 10 '14 at 07:52
  • What you did is `new String(cn.getBytes("Cp1251"), "UTF-8")`, but it looks like it's not UTF-8 (as the output shows) but some other encoding (presumably ISO-8859-5). What we want to achieve in this step is to get the **original** sequence of bytes that was used for the CN when the certificate was generated. Consider this: your system's default encoding is, let's say, E5. On the other hand, the default encoding of the PEM file is E1. What's important is that both are 8-bit encodings (one byte per character). So if the user enters (in E5) the character 0xD4 (д), it also goes as 0xD4 into the CN string, but since that string has the E1 encoding, – Cromax Feb 10 '14 at 09:06
  • it is recognized as the character Ô. Both share the same byte code, but they represent different glyphs. And because strings in Java are internally represented as UTF-16 (two bytes per character), you run into problems: the character `Ô` in UTF-16 has code 0x00D4, but the character `д` has code 0x0434. So when you call `getBytes` on the CN, you are translating UTF-16 two-byte sequences into one-byte sequences in the encoding you specify. As the CN was read as E1, the byte 0xD4 was translated to the 0x00D4 sequence. On the other hand, if you were able to specify the encoding before letting Java read the bytes, – Cromax Feb 10 '14 at 09:48
  • you could set the input encoding to E5, and the one-byte sequence would be translated to 0x0434 in Java's internal UTF-16 format. So what we need is to get those original bytes back and recreate the UTF-16 string with the proper input encoding. For this we need to know which encoding Java used to read the original bytes (presumably ISO-8859-1) and use it for `getBytes` again. Then we need to know which encoding was used to write the CN string into the PEM file (presumably ISO-8859-5) and use it to read the regained bytes again and translate them properly into UTF-16. – Cromax Feb 10 '14 at 10:09
  • So that's what we do: cnb = cn:UTF-16->ISO-8859-1, cn = cnb:ISO-8859-5->UTF-16 (a sketch of this round trip follows the comment thread). Also, you wrote that you decode the BASE64 string to a decoded *string (bytes)* --- why do you decode it as a string and not as bytes in the first place? – Cromax Feb 10 '14 at 10:10
  • Thanks, Cromax, for your assistance: 1) I decode the Base64 String as a String because all of this happens inside a SOA application with an XML-driven configuration, so I can manipulate only Strings and XML. There is a chance to use Java embedding, but it has a very poor API, so I use the standard oracle.soa.common.util.Base64Decoder. – user3283133 Feb 10 '14 at 14:56
  • I'm a little confused about how I can decode this String in Java: 1) How can I know the Java encoding? Is this the environment encoding? Here it is: LC_ALL=en_US.UTF-8, System.getProperty("file.encoding") = UTF-8, Charset.defaultCharset() = UTF-8. The PEM file was encoded in Windows-1251. Can you explain in more detail how I can get the correct String? – user3283133 Feb 10 '14 at 15:01
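Putting the thread together, here is a sketch of the round trip described in the comments above. Whether the CN was written in ISO-8859-5 or windows-1251 is an assumption you need to verify with the issuer, and commonNameFromCertificate is just a placeholder for the value getCommonName() already gives you:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // placeholder for the garbled value returned by getCommonName(),
    // e.g. "Êîêëåâà Òàòüÿíà Âèêòîðîâíà"
    String cn = commonNameFromCertificate;

    // step 1: recover the original single-byte values (Java read them as ISO-8859-1)
    byte[] original = cn.getBytes(StandardCharsets.ISO_8859_1);

    // step 2: reinterpret those bytes in the charset the issuer actually used
    String iso5   = new String(original, Charset.forName("ISO-8859-5"));
    String cp1251 = new String(original, Charset.forName("windows-1251"));

    // print both to something that can display Cyrillic (e.g. a UTF-8 file, not a limited console)
    System.out.println("ISO-8859-5:   " + iso5);
    System.out.println("windows-1251: " + cp1251);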