Java String to UCS2 encoding for Letters with Accents

Question

I need to encode a String that contains non-English characters, e.g. letters with accents, as UCS-2. I have the following piece of code, which works for plain English letters.

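String encodeAsUCS2(String test) throws UnsupportedEncodingException {
    // Encode the string as big-endian UTF-16 bytes (two bytes per BMP character)
    byte[] bytes = test.getBytes("UTF-16BE");

    // Render each byte as two uppercase hex digits
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
        sb.append(String.format("%02X", b));
    }

    return sb.toString();
}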

That outputs the hexadecimal sequence of the UCS-2/UTF-16 bytes,

e.g. hello = 00680065006C006C006F

It runs into a problem with accented letters and other non-English characters: each one comes out as FFFD. U+FFFD is the replacement character from the Unicode Specials block, which is used when a system cannot decode a stream of data into a correct symbol.

Is there a workaround for this?

UCS-2 is the same as UTF-16 for characters in the Basic Multilingual Plane, and UTF-16 is how Java stores strings in memory, so no conversion is needed there. Both are encodings, which only matter when converting to **bytes**, and you are already doing that conversion in the first line. — Andreas, Oct 16 '15 at 09:56
I want to encode the string, e.g. for hello I want 00680065006C006C006F. It's just not working for foreign characters. — gio10245, Oct 16 '15 at 09:59
So what you *really* want is to encode the string to a hexadecimal sequence of UCS-2/UTF-16 bytes. Please clarify the question. — Andreas, Oct 16 '15 at 10:01
Yes, encode, not convert. I have already edited the question based on your original response. — gio10245, Oct 16 '15 at 10:02
It works fine for normal English characters but runs into problems with letters that have accents/foreign characters. — gio10245, Oct 16 '15 at 10:04
I tried with `helloÁáÂâÄäḁ` and it came out fine: `00680065006C006C006F00C100E100C200E200C400E41E01` — Andreas, Oct 16 '15 at 10:08
Strange, for `helloÁáÂâÄä` I get `00680065006C006C006FFFFDFFFDFFFDFFFDFFFDFFFD` — gio10245, Oct 16 '15 at 10:13
So where did you place that text? In a Java string as `"helloÁáÂâÄäḁ"`? If so, what encoding did you save the .java file with? And did you remember to compile with that encoding as well? --- When I pasted `ḁ` into Eclipse, it forced me to save as UTF-8, and it remembers that and compiles correctly, but `javac` would have to be told. — Andreas, Oct 16 '15 at 10:16
I am running a Junit test to test the encoding `@Test public void testEncodeAsUcx2_string2() throws Exception { String encoded = sendRequestTransformer.encodeAsUCS2("helloÁáÂâÄä"); Assert.assertEquals("00680065006C006C006F00C100E100C200E200C400E4", encoded); }` — gio10245, Oct 16 '15 at 10:19
So, a Java string. And your answers to my two encoding questions are? --- To get around .java file encoding issues, you can enter the same string using Unicode escapes: `"hello\u00C1\u00E1\u00C2\u00E2\u00C4\u00E4\u1E01"` — Andreas, Oct 16 '15 at 10:20
IntelliJ says the file encoding is windows-1252. I am not compiling, just running JUnit tests. — gio10245, Oct 16 '15 at 10:27
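To make the Unicode-escape suggestion concrete, here is a minimal sketch. It assumes the same hex-encoding logic as the question; the class name `Ucs2EscapeDemo` is made up for illustration, and `StandardCharsets.UTF_16BE` is swapped in for the charset name only to avoid the checked exception. Because the literal is pure ASCII, the result should not depend on how the .java file is saved or compiled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
class Ucs2EscapeDemo {

    // Same logic as encodeAsUCS2 in the question: UTF-16BE bytes rendered as hex.
    static String encodeAsUCS2(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file pure ASCII, so the
        // source-file encoding cannot corrupt the accented letters.
        String text = "hello\u00C1\u00E1\u00C2\u00E2\u00C4\u00E4\u1E01";
        // prints 00680065006C006C006F00C100E100C200E200C400E41E01
        System.out.println(encodeAsUCS2(text));
    }
}
```

If the literal characters have to stay in the source, the alternative is to make sure the compiler reads the file with the encoding it was saved in, for example via `javac -encoding UTF-8` or the equivalent IDE or build-tool setting.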
Some compiler will have to compile. Check the setting of the encoding of the source files. — laune, Oct 16 '15 at 10:31
Also, please run a `String s= "helloÁáÂâÄä"; for( int i = ...) printf( "%04x", s.charAt(i) )` in the JUnit test where you call encodeAsUCS2 and show us the result. — laune, Oct 16 '15 at 10:44
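A runnable version of that diagnostic might look like the sketch below. The class and method names are hypothetical. One detail to watch: `%x` does not accept a `char` argument in Java, so each character is cast to `int` first. If the dump already shows `fffd`, the string was damaged before `encodeAsUCS2` ever saw it.

```java
// Hypothetical helper for inspecting what a string literal really contains.
class CharDumpDemo {

    // Returns each UTF-16 code unit of s as four hex digits, separated by spaces.
    static String dumpCodeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // The cast is required: passing a char to %x throws
            // IllegalFormatConversionException.
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escapes are used here so the demo itself is immune to
        // source-file encoding problems.
        // prints 0068 0069 00c1
        System.out.println(dumpCodeUnits("hi\u00C1"));
    }
}
```

In the failing JUnit test, the literal with the real accented characters would be passed instead, so that the dump shows what the compiler actually produced.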
I don't have IntelliJ, but if I set Eclipse's text file encoding to US-ASCII then I get the FFFD, which makes sense since code points > 0x7F aren't available in US-ASCII, hence the replacement. — laune, Oct 16 '15 at 10:50
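The effect described here can be simulated at runtime. The sketch below is only an illustration of a charset mismatch, not a confirmed diagnosis of the asker's setup: it assumes the text was saved as windows-1252 bytes and then decoded as US-ASCII, and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of "saved in one encoding, read in another".
class CharsetMismatchDemo {

    // UTF-16BE bytes of text rendered as uppercase hex, as in the question.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Encodes text the way an editor set to windows-1252 would save it,
    // then decodes those bytes as US-ASCII. Every byte above 0x7F has no
    // US-ASCII meaning and is replaced with U+FFFD.
    static String misread(String text) {
        byte[] saved = text.getBytes(Charset.forName("windows-1252"));
        return new String(saved, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "hello\u00C1\u00E1\u00C2\u00E2\u00C4\u00E4";
        // prints 00680065006C006C006FFFFDFFFDFFFDFFFDFFFDFFFD
        System.out.println(toHex(misread(original)));
    }
}
```

The output matches the hex string reported earlier in the thread, which is consistent with the accented letters being lost when the source was read, before the encoding method ran.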
My code is working now and I have no idea why. Git sees no changes and the code works... strange... — gio10245, Oct 16 '15 at 11:18
