1

I want to compare a portion of a string (i.e. a single character) against a Chinese character. I assumed that, because of the Unicode encoding, it counts as two characters, so I'm looping through the string in increments of two. Now I've hit a roadblock: I'm trying to detect the '兒' character, but equals() doesn't match it. What am I missing? This is the code snippet:

for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {

   // Account for 'r' like in dianr/huir
   if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {

Also, feel free to suggest a more elegant way to parse this ...

[UPDATE] Some pics from the debugger, showing that it doesn't match even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy-and-paste issue (unless the Unicode gets lost along the way).

[screenshot]

[screenshot]

Oh, dang, apparently simply copying and pasting does not work:

[screenshot]

Mairyu
  • 3
    "I assume due to the Unicode encoding it counts as two characters" Well, why assume? `"兒".toCharArray().length` tells you for definite. – Andy Turner Jul 16 '17 at 00:11
  • 1
    `兒` is [Unicode Han Character 'son, child, oneself; final part' (U+5152)](http://www.fileformat.info/info/unicode/char/5152/index.htm), i.e. one UTF-16 `char` only, so your assumption is wrong. – Andreas Jul 16 '17 at 00:23
  • okay, bad phrasing, it definitely is 2 chars, I just meant I assume it's 2 chars because it's unicode. The script works fine for the tone coloring I do, it just fails matching. If I go in the debugger and check the (...) in the ``if`` it comes back as 'false' – Mairyu Jul 16 '17 at 01:57
  • Of course the if condition evaluates to false. Your substring evaluates to a two-character string, while the string literal you pass to equals ("兒") is one character long. This can never be true because their lengths differ. You should call substring with two consecutive indices. Then it should work. – Björn Zurmaar Jul 16 '17 at 08:21
  • I think you're missing the point. The last pic is unexpected, i.e. the single character should evaluate to a length of 2 (Unicode char); I'm just showing that it unexpectedly does not. Now it's obvious why the ``equals()`` fails, but I'm still stumped why the same character shows a length of 2 in the substr (if I only take one, it shows nothing; it needs 2, as confirmed by others) and a length of 1 in the direct quote. – Mairyu Jul 16 '17 at 16:04
  • No, it shouldn't have a length of 2 characters as multiple persons now pointed out. It's UTF-16 encoded 0x5152. If you don't believe it, check it with the [Character.charCount](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#charCount-int-) method. – Björn Zurmaar Jul 16 '17 at 17:51
  • But then why does my substr show a char count of 2? You can check the first pic (red circle). If I do the substr on a single char, I get empty output; if I do 2, I get the '兒'. I'm just trying to reconcile this. – Mairyu Jul 17 '17 at 20:06

3 Answers

0

Use CharSequence.codePoints(), which returns a stream of the code points, rather than having to deal with chars:

tmpChar.codePoints().forEach(c -> {
  if (c == '兒') {
    // ...
  }
});

(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ })).
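
For anyone who wants to try this outside the original program, here is a minimal, self-contained sketch of the same idea. The class name `CodePointMatch`, the helper `countCodePoint` and the sample string are made up for illustration; the character is written as a Unicode escape so the result does not depend on the source file encoding.

```java
class CodePointMatch {
    // Hypothetical helper: counts how often the code point `target` occurs in `text`.
    static long countCodePoint(String text, int target) {
        return text.codePoints().filter(cp -> cp == target).count();
    }

    public static void main(String[] args) {
        // Assumed sample input: two words ending in 兒 (U+5152).
        String sample = "dian\u5152 hui\u5152";
        System.out.println(countCodePoint(sample, 0x5152)); // prints 2
    }
}
```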

Andy Turner
  • 2
    `兒` is either one character only, in which case your code compiles but using code points is unnecessary, or `兒` is two surrogate characters, in which case `'兒'` would fail to compile. I mean, using `codePoints()` is fine and all, but it is not really an answer to anything regarding this question. – Andreas Jul 16 '17 at 00:28
0

Either work with chars, treating 兒 as a substring:

String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
    int position2 = position + "兒".length();
    s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
    // At position i there is a 兒.
}

Or use code points, where 兒 would be a single code point. As that is not really easier, a variable-length substring seems fine.
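
The fragments above can be put together into something runnable, roughly like this. It is only a sketch: the class `ErhuaMarker` and its methods are hypothetical names, and replacing the match with a marker string is just one assumed use of the found position.

```java
class ErhuaMarker {
    static final String ER = "\u5152"; // 兒, written as an escape to avoid encoding issues

    // Replaces the first occurrence of 兒 with `marker`; returns s unchanged if absent.
    static String markFirst(String s, String marker) {
        int position = s.indexOf(ER);
        if (position == -1) {
            return s;
        }
        // ER.length() is used instead of a hard-coded 1 or 2.
        return s.substring(0, position) + marker + s.substring(position + ER.length());
    }

    // True if 兒 starts at char index i of s.
    static boolean hasErAt(String s, int i) {
        return s.startsWith(ER, i);
    }

    public static void main(String[] args) {
        System.out.println(markFirst("dian" + ER, "*")); // prints dian*
        System.out.println(hasErAt("dian" + ER, 4));     // prints true
    }
}
```

Because the length comes from the literal itself, the same code keeps working whether the character occupies one char or two.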

Joop Eggen
0
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {

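is your problem. 兒 is only one UTF-16 code unit (a single Java char), so a two-char substring can never equal it. Java strings use UTF-16, and many Chinese characters can be represented in one code unit. However, other characters (those outside the Basic Multilingual Plane) take two code units.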
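
A quick way to check this for yourself is sketched below. The class and helper names are invented for the example, and U+20000 is simply an assumed example of a character that needs a surrogate pair.

```java
class CodeUnitCheck {
    // Number of UTF-16 code units (Java chars) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String er = "\u5152"; // 兒, U+5152
        // U+20000 lies outside the Basic Multilingual Plane, so it needs two chars.
        String supplementary = new String(Character.toChars(0x20000));

        System.out.println(codeUnits(er));             // prints 1
        System.out.println(codeUnits(supplementary));  // prints 2
        System.out.println(codePoints(supplementary)); // prints 1
        // A one-char substring is what matches the one-char literal.
        System.out.println(er.equals(("x" + er).substring(1, 2))); // prints true
    }
}
```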

The String class offers a variety of APIs for coping with this.

As offered in another answer, obtaining the IntStream from codePoints() gives you a 32-bit code point for each character. You can compare that to the code point value of the character you are looking for.
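
If you need the positions as well, an index-based loop with `String.codePointAt` and `Character.charCount` is one option. This is a sketch with made-up names (`CodePointScan`, `indicesOf`), not the asker's actual loop; it advances by one or two chars depending on the character instead of a fixed step of two.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointScan {
    // Returns the char indices at which the code point `target` starts in `text`.
    static List<Integer> indicesOf(String text, int target) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp == target) {
                hits.add(i);
            }
            // Advance by 1 for BMP characters, by 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(indicesOf("dian\u5152", 0x5152)); // prints [4]
    }
}
```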

Or, you can use the ICU4J library with a richer set of facilities for all of this.

bmargulies