
I'm trying to pass a byte array containing arbitrary data, with values ranging from 0 to 255 per element.

I have to pass it into JavaScript, so I convert it into a String, but some characters get lost and are replaced with the 0x3F question mark.

What's the proper charset that supports all 8-bit symbols for transfer to JavaScript?

public String base64Decode(String s) {
  // ... lots of stuff transforming the String into a byte array.

  // Some example bytes shown here.
  byte[] destArray = {(byte) 0xf3, (byte) 0xc3, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00,
                      (byte) 0xc3, (byte) 0x63, (byte) 0x2d, 0x00, 0x00, 0x00, 0x00, 0x00,
                      (byte) 0xe0, (byte) 0x9d, (byte) 0xea};
  System.out.println(new String(destArray, Charset.forName("UTF-8")));
  // The extra new String(...) wrapper was redundant; one conversion is enough.
  return new String(destArray, Charset.forName("UTF-8"));
}

I redirect the System.out.println output into a file using a batch script:

java Test > out.bin

Then I compare byte by byte to see what is lost.
To sum it up, 0x9D becomes 0x3F, which is wrong.
There are probably others too, but I didn't check the whole file; it's over 2 MB in size.

The default new String(destArray) does a better job but still misses a few characters.

SSpoke
  • Uuuh, you seem to be mixing a lot of things. So, first of all: what is the source of the data, and do you know the encoding used by this source? – fge Mar 15 '14 at 22:31
  • Why not using base 64 encoding? It is the usual way of dealing with binary data. – SJuan76 Mar 15 '14 at 22:34
  • @SJuan76 I'm decoding Base64 from JavaScript's base64-encoded file. The JavaScript base64 encoders take 5 minutes to complete, whereas Java takes 2-3 seconds. Also, I'm pretty sure the `byte[]`'s are converted to `unsigned bytes` under the hood when passed into `new String(...)`, or else I would have more problems. – SSpoke Mar 15 '14 at 22:45
  • @fge the source encoding is supported by any charset, I believe; it's just `A-Z,a-z,0-9,+,/,=`, that's about it. – SSpoke Mar 15 '14 at 22:46
  • Even then, if the source is encoded with EBCDIC, you won't get a reliable result if you read that source using ASCII encoding ;) It is therefore of utmost importance to know the source encoding – fge Mar 15 '14 at 22:48
  • Also, strings in JavaScript are UTF-8 encoded, aren't they? If you send along a plain JSON String, it should be readable as is as a JavaScript string – fge Mar 15 '14 at 22:49
  • I outputted the first 20 bytes or so, which are in the example above, and they match 1 to 1 with the real file decoded by JavaScript's slow base64 system. – SSpoke Mar 15 '14 at 22:49

2 Answers


You can use ISO-8859-1.

However, it's an ugly hack that should only be used if something really prevents you from using correct datatypes (i.e. using byte[] for binary data).

Common sense says base64 is a way to represent binary data as ASCII strings, so base64Decode() should take a String and return a byte[].
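A minimal sketch of why ISO-8859-1 round-trips losslessly (assuming Java 7+ for StandardCharsets; the class name here is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Iso88591RoundTrip {
    public static void main(String[] args) {
        byte[] original = {(byte) 0xf3, (byte) 0xc3, 0x00, (byte) 0x9d, (byte) 0xea};

        // ISO-8859-1 maps every byte 0x00-0xFF to exactly one char,
        // so decoding and re-encoding loses nothing.
        String asString = new String(original, StandardCharsets.ISO_8859_1);
        byte[] restored = asString.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(original, restored)); // prints "true"
    }
}
```

The same bytes would not survive a UTF-8 round trip, because 0x9D on its own is not a valid UTF-8 sequence.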

axtavt
  • Ya, but I still have to convert the `byte[]` somehow to a JavaScript string in the end, since I can't find anything else in JavaScript which supports this; maybe I could use TypedArrays, I have to check it out. Also, I can't just use Java `byte[]`'s in JavaScript directly, because that would create a bigger problem with this library called msgpack I use, since it only uses JavaScript types. – SSpoke Mar 15 '14 at 22:50
  • Same issue. All `0x9D, 0x83, 0x88, 0x89, 0x99`'s are replaced with `0x3F`'s, probably anything over 0x83. Could it be `System.out.println` doing this? – SSpoke Mar 15 '14 at 23:01
  • If you want to represent arbitrary binary data as a js string, you need to decide how exactly it should be represented. And yes, this trick with encoding works if you need `byte[]` -> `String` -> `byte[]` conversion, not when you want to output the `String` using `println()`. – axtavt Mar 16 '14 at 08:09
  • Yup, `System.out.println` shows faulty results; I can't rely on it. The result is correct in JavaScript, though: passing a String with `ISO-8859-1`, now everything works properly, thank you. – SSpoke Mar 16 '14 at 16:30

You cannot just blindly use any charset you want. Strings in Java and JavaScript use UTF-16. Once you have decoded the base64 data into a byte array, you have to know the exact charset those bytes actually represent so they can be converted to UTF-16 correctly without losing any data. You have to know the charset that was used when the data was base64 encoded. If you do not know the exact charset, you are left with heuristic analysis or just plain guessing, and both are not reliable enough. Either both parties must agree on a common charset ahead of time, or else the charset needs to be exchanged along with the base64 data.
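To illustrate the data loss described above, here is a small sketch (assuming Java 7+ for StandardCharsets; the class name is illustrative). Decoding the byte 0x9D as UTF-8 yields the U+FFFD replacement character, after which the original byte is unrecoverable:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Loss {
    public static void main(String[] args) {
        byte[] original = {(byte) 0x9d}; // not a valid UTF-8 sequence on its own

        // Invalid sequences are substituted with U+FFFD during decoding...
        String decoded = new String(original, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(0)); // prints 65533 (U+FFFD)

        // ...so re-encoding produces the 3-byte UTF-8 form of U+FFFD,
        // not the original single byte.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(reencoded.length); // prints 3
    }
}
```

The question marks (0x3F) the questioner sees come from the encoding step: when a replacement character is later written out through a charset that cannot represent it, it is substituted with '?'.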

Remy Lebeau
  • In JavaScript I use `String.charCodeAt(i)` to read the byte data; that's the problem, since without `ISO-8859-1` it reads 1 byte as 2 bytes, so I have to make sure every character is separated from the others. – SSpoke Mar 16 '14 at 16:29
  • Java(script) strings do not contain 8bit byte octets, they contain 16bit UTF-16 codeunits. Big difference. If you create a string from a byte array using ISO-8859-1, the string will contain 16bit representations of the original 8bit byte values, since ISO-8859-1 maps byte octets 0x00-0xFF to Unicode codepoints U+0000-U+00FF as-is. – Remy Lebeau Mar 16 '14 at 17:39
  • I don't know what happened, but it worked. I'm working on loading my GameBoy ROMs using JavaScript, and it has to read byte by byte for each piece of information; `charCodeAt(#)` works perfectly. – SSpoke Mar 17 '14 at 03:59