UTF-8 string to ordinal value: Java equivalent for Python output

Question

I have the feeling this is most likely a duplicate, but I'm unable to find it.

NOTE: My Python knowledge is very limited, so I'm not 100% sure how strings, bytes, and encodings work in Python. My knowledge of encodings in general is also not great.

Let's say we have the string "Aä$$€h". It contains three different ordinary ASCII characters (A$h), and two non-ASCII characters (ä€). In Python we have the following code:

# coding: utf-8
# Python 2: iterating an encoded str yields 1-character strings, hence ord()
input = u'Aä$$€h'
print [ord(c) for c in input.encode('utf-8')]
# Grouped per character:
print [[ord(x) for x in c.encode('utf-8')] for c in input]

Which will output:

[65, 195, 164, 36, 36, 226, 130, 172, 104]
[[65], [195, 164], [36], [36], [226, 130, 172], [104]]

Now I'm looking for a Java equivalent giving this same integer-array. I know all Strings in Java are by default encoded with UTF-16, and only byte-arrays can have an actual encoding. I thought the following code would give the result I expected:

String input = "Aä$$€h";
byte[] byteArray = input.getBytes(java.nio.charset.StandardCharsets.UTF_8);
System.out.println(java.util.Arrays.toString(byteArray));

But unfortunately it gives the following result instead:

[65, -61, -92, 36, 36, -30, -126, -84, 104]

I'm not sure where these negative values are coming from.

So my question is mostly this:

Given a String in Java containing non-ASCII characters (e.g. "Aä$$€h"), output the ordinal values of its UTF-8 bytes, the same way Python's ord function does for a UTF-8 encoded byte. That we already have a Java String is a precondition for this question.

"all Strings in Java are _by default_ encoded with UTF-16": From the API perspective (esp. `.length`), there is no other option. — Tom Blodget, Feb 05 '19 at 04:56

score 3 · Accepted Answer · answered Feb 04 '19 at 15:24

Java's byte type is signed; that is where the negative numbers are coming from. Bit-wise, the values are the same in both languages; only the way they are represented differs. You can get the same representation as in Python by using Byte.toUnsignedInt():

String input = "Aä$$€h";
byte[] byteArray = input.getBytes(java.nio.charset.StandardCharsets.UTF_8);
int[] ints = new int[byteArray.length];
for(int i = 0; i < ints.length; i++) {
    ints[i] = Byte.toUnsignedInt(byteArray[i]);
}
System.out.println(java.util.Arrays.toString(ints));

Which prints:

[65, 195, 164, 36, 36, 226, 130, 172, 104]
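
As a rough sketch of the same idea (not part of the original answer): masking each byte with 0xFF should give the same 0..255 value as Byte.toUnsignedInt, and splitting the string by code point before encoding reproduces the grouped output from the question. The class and method names below are made up for illustration. The string literal uses Unicode escapes so the result does not depend on the source file encoding, and grouping by code point is an assumption about what "per character" should mean.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, named for illustration only.
class Utf8Ordinals {

    // Flat array of the string's UTF-8 byte values, each in the range 0..255.
    public static int[] utf8Ordinals(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] result = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            // The byte is sign-extended to int, then the mask keeps only
            // the low 8 bits, e.g. -61 becomes 195.
            result[i] = bytes[i] & 0xFF;
        }
        return result;
    }

    // One int[] per Unicode code point (assumed meaning of "per character").
    public static List<int[]> utf8OrdinalsPerCodePoint(String s) {
        List<int[]> groups = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            // 1 char for BMP code points, 2 for surrogate pairs.
            int charCount = Character.charCount(codePoint);
            groups.add(utf8Ordinals(s.substring(i, i + charCount)));
            i += charCount;
        }
        return groups;
    }

    public static void main(String[] args) {
        String input = "A\u00e4$$\u20ach"; // "Aä$$€h"
        System.out.println(Arrays.toString(utf8Ordinals(input)));
        System.out.println(
                Arrays.deepToString(utf8OrdinalsPerCodePoint(input).toArray()));
    }
}
```

For the example input this prints the same two lines as the Python code in the question: the flat list, then the list grouped per character.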

Ah, so that was the difference here. Thanks! I will accept your answer in a few minutes when I can. — Kevin Cruijssen, Feb 04 '19 at 15:27
