Conversion of Japanese "semi-voice" character

Question

I was trying to compare two Spark DataFrames that contain Japanese characters, and some characters look the same but are actually different to the program, such as プ vs プ

If you run them through a UTF-8 encoder:

プ utf-8 = \xE3\x83\x97

プ utf-8 = \xE3\x83\x95\xE3\x82\x9A

It looks like フ (\xE3\x83\x95) + the little circle semi-voiced sign (\xE3\x82\x9A) = プ

What is this difference called, and is there any way to convert between the two forms in Java/Scala?

Thank you.

score 3 · Answer 1 · answered Oct 09 '20 at 23:11

プ aka \xE3\x83\x97 (UTF-8) is \u30d7 aka 'KATAKANA LETTER PU' (U+30D7).

プ aka \xE3\x83\x95\xE3\x82\x9A (UTF-8) is \u30d5\u309a aka 'KATAKANA LETTER HU' (U+30D5) and 'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK' (U+309A).

As you can see, the second is a base character followed by a combining character. This is similar to how diacritical marks (accent marks) are handled for Latin characters, e.g. how ñ = n + ̃ aka \u00f1 = \u006e + \u0303.
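To check which form a given string actually holds, it can help to dump its code points and UTF-8 bytes. The following is a minimal sketch; the class name `CodePointDump` and its two helper methods are made up for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class (name chosen for this example only).
class CodePointDump {
    // Returns the code points of s as "U+XXXX" tokens separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Returns the UTF-8 bytes of s, each written as a backslash-x hex escape.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u30d7";       // single code point
        String decomposed = "\u30d5\u309a";  // base letter + combining mark

        System.out.println(codePoints(precomposed) + "  " + utf8Hex(precomposed));
        System.out.println(codePoints(decomposed) + "  " + utf8Hex(decomposed));

        // The two strings render alike but are not equal as Java strings.
        System.out.println(precomposed.equals(decomposed)); // prints false
    }
}
```

The byte sequences printed here match the ones quoted in the question, which confirms that the two look-alike characters are one code point in the first case and two in the second.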

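The two forms are called precomposed and decomposed, and converting between them is Unicode normalization. You can convert between the 2 forms using the java.text.Normalizer class: Normalizer.Form.NFC composes to the single-character form, and Normalizer.Form.NFD decomposes to base character plus combining mark. See: javadoc.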
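A minimal sketch of the conversion with `java.text.Normalizer` might look like this. The class name `KanaNormalize` and the helpers `compose`, `decompose` and `sameText` are hypothetical names for this example; only `Normalizer.normalize` and `Normalizer.Form` come from the JDK.

```java
import java.text.Normalizer;

// Hypothetical helper class (name chosen for this example only).
class KanaNormalize {
    // NFC: canonical decomposition followed by canonical composition.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // NFD: canonical decomposition into base character + combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compares two strings after bringing both to the same form (NFC here).
    static boolean sameText(String a, String b) {
        return compose(a).equals(compose(b));
    }

    public static void main(String[] args) {
        String precomposed = "\u30d7";       // KATAKANA LETTER PU
        String decomposed = "\u30d5\u309a";  // KATAKANA LETTER HU + combining mark

        System.out.println(compose(decomposed).equals(precomposed));   // prints true
        System.out.println(decompose(precomposed).equals(decomposed)); // prints true
        System.out.println(sameText(precomposed, decomposed));         // prints true
    }
}
```

For the DataFrame comparison in the question, one plausible approach is to normalize both columns to the same form (for example through a UDF that wraps `Normalizer.normalize`) before comparing; that wiring is an assumption and is not shown here. The same class is callable from Scala unchanged.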

See also: The Java™ Tutorials - Normalizing Text.
See also: Combining accent and character into one character in java 7
