I am trying to convert codepoints from one charset to another in Java.

For example, the character ř is 248 in windows-1250 and 345 in Unicode.

So I have a source charset, a source code point, and a target charset, and I want to calculate the target code point.

This may sound easy because windows-1250 is a single-byte charset, but I want it to work with any charset, such as GB2312.

I guess it can be done somehow with the Charset class, but it seems to convert only bytes, not actual code points.

    Charset sourceCharset = Charset.forName("GB2312");
    int sourceCodePoint = 45257; // 吧 Chinese character
    Charset targetCharset = Charset.forName("UTF-8");
    int targetCodePoint = ...; // ???

I checked the Charset class for code point-related methods, but there are only decode and encode, which work with bytes. I tried googling for something related, but without success.

Thanks in advance for any help.

1 Answer

In Java, at least, there is no notion of code points for character sets other than Unicode. You have to convert the integer to a byte array and then decode those bytes into a Unicode string.

    Charset sourceCharset = Charset.forName("windows-1250");                
    int sourceCodePoint = 248; // ř
    byte[] bytes = {(byte)sourceCodePoint};
    String targetString = new String(bytes, sourceCharset);
    int targetCodePoint = targetString.codePointAt(0);
    System.out.println("targetString = " + targetString);
    System.out.println("targetCodePoint = " + targetCodePoint);

Output:

    targetString = ř
    targetCodePoint = 345

Chinese characters in GB2312 are represented by 2 bytes, so you need to store them in a byte array of length 2.

    Charset sourceCharset = Charset.forName("GB2312");                
    int sourceCodePoint = 45257; // 吧 chinese character
    byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();
    String targetString = new String(bytes, sourceCharset);
    int targetCodePoint = targetString.codePointAt(0);
    System.out.println("targetString = " + targetString);
    System.out.println("targetCodePoint = " + targetCodePoint);

Output:

    targetString = 吧
    targetCodePoint = 21543
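
The two snippets above can be folded into one helper that derives the byte length from the integer instead of hard-coding 1 or 2. This is only a minimal sketch: the names `CodePointConverter`, `toBytes`, `toUnicode` and `fromUnicode` are made up for illustration, and it assumes the "code point" of a non-Unicode charset is simply its byte sequence read as a big-endian integer with no leading zero byte.

```java
import java.nio.charset.Charset;

class CodePointConverter {

    // Splits the integer into big-endian bytes, using as few bytes as possible
    // (assumption: the charset never needs a leading 0x00 byte).
    static byte[] toBytes(int code) {
        int length = 1;
        for (int rest = code >>> 8; rest != 0; rest >>>= 8) {
            length++;
        }
        byte[] bytes = new byte[length];
        for (int i = length - 1; i >= 0; i--) {
            bytes[i] = (byte) code;
            code >>>= 8;
        }
        return bytes;
    }

    // Charset-specific code -> Unicode code point.
    // Bytes that cannot be decoded come back as U+FFFD.
    static int toUnicode(int code, Charset source) {
        return new String(toBytes(code), source).codePointAt(0);
    }

    // Unicode code point -> bytes in the target charset, packed big-endian into an int.
    // Only meaningful when the encoded form is at most 4 bytes long.
    static int fromUnicode(int codePoint, Charset target) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(target);
        int code = 0;
        for (byte b : bytes) {
            code = (code << 8) | (b & 0xFF);
        }
        return code;
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250");
        Charset gb2312 = Charset.forName("GB2312");
        System.out.println(toUnicode(248, cp1250));    // ř, same value as the first example
        System.out.println(toUnicode(45257, gb2312));  // 吧, same value as the second example
        System.out.println(fromUnicode(toUnicode(248, cp1250), gb2312)); // windows-1250 -> GB2312
    }
}
```

Going from one legacy charset to another is then `fromUnicode(toUnicode(code, source), target)`. Note that `getBytes` silently substitutes the charset's replacement byte (usually `?`) when the target charset has no mapping for the character, so you may want a `CharsetEncoder` with `CodingErrorAction.REPORT` if that matters.
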
  • Thank you. Is there a way to tell how many bytes a specific charset uses, or whether it is variable length? UTF-8 is variable length, for example. Are there any other charsets that are variable length too? My point is that I probably need to make a list of charsets that are single byte, two byte (which directly store the code point), and other special cases, to make universal conversion of code points possible. – Jindra Petřík Nov 12 '22 at 04:26
  • For more information on a character set, you should check the specification for that character set. I don't understand what you mean by code points outside of Unicode. However, I think it's more or less correct to interpret codes above 255 as double-byte characters. –  Nov 12 '22 at 05:25
  • Since you started with nio, you can get there in a similar way and slightly more directly with `ByteBuffer bb = ByteBuffer.allocate(2).putChar((char)sourceCodePoint).rewind(); String targetString = sourceCharset.newDecoder().decode(bb).toString();` – g00se Nov 12 '22 at 18:05
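
On the question in the comments about telling single-byte charsets from multi-byte ones: one possible starting point is `CharsetEncoder.maxBytesPerChar()`, which reports an upper bound on the bytes produced per UTF-16 `char`. The sketch below is an assumption-laden heuristic, not a classification; the class and method names (`CharsetWidth`, `isSingleByte`, `maxBytes`) are hypothetical. In particular it cannot distinguish a fixed two-byte charset from a variable one: Java's GB2312 still encodes ASCII characters as a single byte, for example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetWidth {

    // Upper bound on bytes per UTF-16 char, or -1 for decode-only charsets.
    static float maxBytes(Charset charset) {
        if (!charset.canEncode()) {
            return -1f;
        }
        return charset.newEncoder().maxBytesPerChar();
    }

    // Heuristic: a charset whose encoder never emits more than one byte per char.
    static boolean isSingleByte(Charset charset) {
        return maxBytes(charset) == 1.0f;
    }

    public static void main(String[] args) {
        Charset[] samples = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            Charset.forName("windows-1250"),
            Charset.forName("GB2312"),
        };
        for (Charset cs : samples) {
            System.out.println(cs.name() + ": max bytes per char = " + maxBytes(cs)
                    + ", single byte = " + isSingleByte(cs));
        }
    }
}
```

For stateful encodings the reported bound may also include escape sequences, so the specification of each charset remains the authoritative source.
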