I need to encode a String that contains foreign characters (e.g. letters with accents) as UCS-2, and I have the following piece of code, which works for plain English letters.

String encodeAsUCS2(String test) throws UnsupportedEncodingException {
    // Encode the string as big-endian UTF-16 bytes
    byte[] bytes = test.getBytes("UTF-16BE");

    // Render each byte as two uppercase hex digits
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
        sb.append(String.format("%02X", b));
    }

    return sb.toString();
}

That outputs the hexadecimal sequence of the UCS-2/UTF-16 bytes,

e.g. hello = 00680065006C006C006F

It runs into an issue with accented/foreign letters: each one comes out as FFFD. That is the replacement character (U+FFFD) from the Unicode Specials block, which is used when a system cannot decode a stream of data to a correct symbol.

Is there any workaround for this?

gio10245
  • UCS2 is the same as UTF-16, which is how Java stores strings in memory. There is no conversion needed. Both are an encoding needed when converting to **bytes**, and that conversion you're doing in the first line. – Andreas Oct 16 '15 at 09:56
  • I want to encode the string. eg. for hello I want 00680065006C006C006F. It's just not working for foreign characters – gio10245 Oct 16 '15 at 09:59
  • So what you *really* want, is to encode the string to a hexadecimal sequence of UCS2/UTF16 bytes. Please clarify question. – Andreas Oct 16 '15 at 10:01
  • Yeah encode not convert, already edited based on your original response. – gio10245 Oct 16 '15 at 10:02
  • It works fine for normal english characters but runs into problems with letters that have accents/foreign characters – gio10245 Oct 16 '15 at 10:04
  • Could you add a few letters where it doesn't work? – laune Oct 16 '15 at 10:06
  • I tried with `helloÁáÂâÄäḁ` and it came out fine: `00680065006C006C006F00C100E100C200E200C400E41E01` – Andreas Oct 16 '15 at 10:08
  • Strange, for `helloÁáÂâÄä` I get `00680065006C006C006FFFFDFFFDFFFDFFFDFFFDFFFD` – gio10245 Oct 16 '15 at 10:13
  • So where did you place that text? In a Java string as `"helloÁáÂâÄäḁ"`? If so, what encoding did you save the .java file with? And did you remember to compile with that encoding as well? --- When I pasted `ḁ` into Eclipse, it forced me to save as UTF-8, and it remembers that and compiles correctly, but `javac` would have to be told. – Andreas Oct 16 '15 at 10:16
  • I am running a Junit test to test the encoding `@Test public void testEncodeAsUcx2_string2() throws Exception { String encoded = sendRequestTransformer.encodeAsUCS2("helloÁáÂâÄä"); Assert.assertEquals("00680065006C006C006F00C100E100C200E200C400E4", encoded); }` – gio10245 Oct 16 '15 at 10:19
  • So, Java string. And your answer to my two encoding questions are?? --- To get around .java file encoding issues, you can enter the same string using unicode escapes: `"hello\u00C1\u00E1\u00C2\u00E2\u00C4\u00E4\u1E01"` – Andreas Oct 16 '15 at 10:20
  • IntelliJ says the file encoding is windows-1252. I am not compiling, just running JUnit tests. – gio10245 Oct 16 '15 at 10:27
  • Some compiler will have to compile. Check the setting of the encoding of the source files. – laune Oct 16 '15 at 10:31
  • Also, please run a `String s= "helloÁáÂâÄä"; for( int i = ...) printf( "%04x", s.charAt(i) )` in the JUnit test where you call encodeAsUCS2 and show us the result. – laune Oct 16 '15 at 10:44
  • I don't have IntelliJ, but if I set Eclipse's Text file encoding to US-ASCII then I get the FFFD, which makes sense since codepoints > 0x7F aren't available in US-ASCII - hence the replacement. – laune Oct 16 '15 at 10:50
  • My code is working now and I have no idea why. Git sees no changes and the code works... strange... – gio10245 Oct 16 '15 at 11:18
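
The comments point towards a source-file encoding mismatch, although the thread never confirms that this was the cause. A minimal sketch of both ideas, under stated assumptions: the class name `Ucs2EncodingDemo` and the helper `decodeWithWrongCharset` are hypothetical, `StandardCharsets.UTF_16BE` is swapped in for the `"UTF-16BE"` name so that no checked exception is needed, and ISO-8859-1 stands in for windows-1252 (the two produce the same bytes for these six accented letters). It shows that a literal written with unicode escapes encodes as expected, and that text saved in one encoding but read as UTF-8 picks up U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class Ucs2EncodingDemo {

    // Same technique as the method in the question: UTF-16BE bytes as uppercase hex.
    static String encodeAsUCS2(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Hypothetical helper: simulates text saved as a single-byte encoding
    // (ISO-8859-1 here, as a stand-in for windows-1252) and read back as UTF-8.
    static String decodeWithWrongCharset(String text) {
        byte[] saved = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(saved, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source pure ASCII, so the file encoding cannot matter.
        String literal = "hello\u00C1\u00E1\u00C2\u00E2\u00C4\u00E4";

        // prints 00680065006C006C006F00C100E100C200E200C400E4
        System.out.println(encodeAsUCS2(literal));

        // The accented letters are not valid UTF-8 byte sequences on their own,
        // so the decoder substitutes U+FFFD and the hex output contains FFFD.
        System.out.println(encodeAsUCS2(decodeWithWrongCharset(literal)));
    }
}
```

If the second line of output resembles what the failing test produced, the encoding the compiler or IDE uses for the .java file is worth checking; with `javac` it can be set explicitly using the `-encoding` option.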

0 Answers