
I am processing a text that I then have to link to files. The text contains ä as two code points (97 + 776), but the file system has the file name written with ä as a single code point (228). Is there a way to convert 97 + 776 to 228? I believe these should be surrogate pairs and that the text is UTF-8 encoded. I've tried getBytes with UTF-16 and other encodings, but nothing worked. I can't even paste the two-code-point character here correctly: it gets processed into the single character, but the hex representation is still "61 cc 88". What exactly is this "ä"?

machekj
  • *I can't even paste the 2 code points char here correctly - it gets processed to the single char* You can ;) That's exactly what I get when copy/pasting from this page – g00se Jun 24 '23 at 08:46

2 Answers


The one with two codepoints isn't a surrogate pair, but rather an "a" with a combining diacritic "¨", resulting in the same visual appearance (in fonts that support it) as the precomposed (= character and diacritic in one) character "ä".
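To see the difference concretely, here is a minimal sketch that dumps the code points and UTF-8 bytes of both forms. The class and method names (CodePointDump, codePoints, hexUtf8) are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class CodePointDump {
    // Lists the code points of a string as U+XXXX values.
    public static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Lists the UTF-8 bytes of a string as lowercase hex.
    public static String hexUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "a\u0308";  // "a" + combining diaeresis
        String precomposed = "\u00E4";  // single precomposed code point
        System.out.println(codePoints(decomposed) + " -> " + hexUtf8(decomposed));   // U+0061 U+0308 -> 61 cc 88
        System.out.println(codePoints(precomposed) + " -> " + hexUtf8(precomposed)); // U+00E4 -> c3 a4
    }
}
```

The "61 cc 88" from the question is simply the UTF-8 encoding of U+0061 followed by U+0308. Both code points are below U+10000, so no surrogate pairs are involved.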

To convert between the two you need to normalize the text. Java's built-in class java.text.Normalizer should help you with that; have a look at https://stackoverflow.com/a/58403649/12344762 for more information.

linux_user36

The character "ä" can be represented in Unicode in two different ways:

  • As a single character with the code point 228 (U+00E4), the precomposed form.
  • As a sequence of two characters: "a" with the code point 97 (U+0061) followed by a combining diaeresis with the code point 776 (U+0308), the decomposed form.

Both representations are valid and canonically equivalent, but they are different code point sequences, so a plain string or byte comparison treats them as different. If your file system represents "ä" using the single-character form (code point 228) and your text uses the two-character form (97 + 776), you will need to convert the text to match the file system's representation.

To convert the two-character representation (97 + 776) to the single-character representation (228), you can use normalization functions provided by programming languages or libraries that support Unicode manipulation. One common normalization form is Unicode Normalization Form C (NFC).

Here's an example in Java using the java.text.Normalizer class to perform the normalization:

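import java.text.Normalizer;

public class NormalizeExample {
    public static void main(String[] args) {
        String text = "a\u0308";  // Two-character representation (97 + 776)
        String normalizedText = Normalizer.normalize(text, Normalizer.Form.NFC);
        System.out.println(normalizedText);  // Prints ä (U+00E4) if the console encoding supports it
    }
}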

In this example, the Normalizer.normalize method with Normalizer.Form.NFC as the argument converts the two-character representation to the single character "ä" (U+00E4).

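Note that NFC does more than compose this one pair: it also puts combining marks into canonical order and replaces some characters with their canonical equivalents. It is therefore good practice to normalize both sides to the same form before comparing text, for example the name taken from the text and the name read from the file system.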
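As a rough sketch of that idea applied to the original problem, the following normalizes both the name from the text and each candidate file name before comparing them. The class NameMatcher, its methods, and the sample file names are hypothetical; in practice the candidates would come from a directory listing.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class NameMatcher {
    // Normalizes to NFC so that canonically equivalent strings compare equal.
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Returns the first candidate that is canonically equivalent to the wanted name.
    public static Optional<String> findMatch(String wanted, List<String> candidates) {
        String key = nfc(wanted);
        return candidates.stream()
                .filter(c -> nfc(c).equals(key))
                .findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical file names; the second uses the precomposed U+00E4.
        List<String> onDisk = Arrays.asList("readme.txt", "M\u00E4rz.txt");
        String fromText = "Ma\u0308rz.txt"; // decomposed form, as in the processed text

        System.out.println(fromText.equals(onDisk.get(1)));          // false
        System.out.println(findMatch(fromText, onDisk).isPresent()); // true
    }
}
```

Normalizing both sides, rather than only the text, means the match does not depend on which form the file system happens to use.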


Pallav Khare