I'm trying to use regular expressions on strings that include Unicode characters such as '℃' and 'Ω'. The Pattern/Matcher finds the expression I'm looking for, but the Matcher returns the wrong position for the start of the match, so all the Matcher groups are incorrect in such cases.
public static void main(String[] args) {
    String test = "±1℃ ±5% 3kΩ";
    for (int index = 0; index < test.length(); index++)
        System.out.println(" char at " + index + ": " + test.charAt(index) + " \\u" +
                Integer.toHexString(test.charAt(index) | 0x10000).substring(1));
    Pattern pattern = Pattern.compile("(?<number>[0-9]*(\\.[0-9]+)?)(?<multiplier>[KM])?Ω");
    Matcher matcher = pattern.matcher(test);
    if (matcher.find()) {
        System.out.println("info: " + matcher.start());
        System.out.println("found \"" + matcher.group("number") + "\" \"" +
                matcher.group("multiplier") + "\" in \"" + test + "\"");
    }
}
Since the Matcher does find the sequence, I expect Matcher.group("number") to return "3" and Matcher.group("multiplier") to return "k".
So the last print should yield:
found "3" "k"
Instead I get:
info: 14
found "" "null"
The "info" line gives a hint: the matcher thinks the match started at position 14.
But the for loop, which prints the character at each position, outputs:
char at 0: Â \u00c2
char at 1: ± \u00b1
char at 2: 1 \u0031
char at 3: â \u00e2
char at 4: „ \u201e
char at 5: ƒ \u0192
char at 6:   \u0020
char at 7: Â \u00c2
char at 8: ± \u00b1
char at 9: 5 \u0035
char at 10: % \u0025
char at 11:   \u0020
char at 12: 3 \u0033
char at 13: k \u006b
char at 14: Î \u00ce
char at 15: © \u00a9
From this listing, the position where the match should really start is 12 (the '3').
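One thing the listing itself suggests: the string literal has 11 characters as typed, yet the loop reports 16. A possible way to check where the extra characters come from is sketched below. It assumes the source file was saved as UTF-8 but read by the compiler as windows-1252, which is only a guess based on the code points shown (\u201e and \u0192 are what windows-1252 produces for bytes 0x84 and 0x83). The class and method names are made up for illustration, and the string is written with Unicode escapes so the sketch does not depend on the file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeCheck {
    // Encode the text as UTF-8, then decode those bytes as windows-1252,
    // imitating a UTF-8 source file that is read with the wrong charset.
    // (That this is what happened in the question is an assumption.)
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The test string from the question, spelled with escapes: 11 chars.
        String intended = "\u00b11\u2103 \u00b15% 3k\u03a9";
        String mangled = misdecode(intended);

        System.out.println(intended.length() + " -> " + mangled.length());
        for (int i = 0; i < mangled.length(); i++) {
            System.out.println(String.format("char at %d: \\u%04x", i, (int) mangled.charAt(i)));
        }
    }
}
```

The first line printed is `11 -> 16`, and the code points that follow are the same as in the listing above. If that assumption holds, both the test string and the Ω inside the pattern were mangled the same way, which would explain why the Matcher still finds something, just at index 14.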
Why does the regular expression Pattern/Matcher find the match but calculate the location for the group() methods incorrectly?
What can I do to the string to convert it into some magical encoding that will work, or what can I do to the Pattern or Matcher to get them to produce the expected results?
String encodings... ugh.
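For comparison, here is a minimal sketch of one thing that could be tried. It makes two changes to the code above, both of which are assumptions about what is intended rather than a confirmed fix: the non-ASCII characters are written as Unicode escapes so the result does not depend on how the source file is decoded, and the multiplier class is widened from [KM] to [kKM], because the test string contains a lowercase 'k' that [KM] cannot match. With the original [KM], the only way the pattern can match is an empty number and no multiplier directly before the Ω. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedPatternSketch {
    // Pattern from the question with two illustrative changes:
    // the ohm sign is given as an escape, and the multiplier class
    // also accepts a lowercase k (an assumption about the intent).
    static final Pattern PATTERN = Pattern.compile(
            "(?<number>[0-9]*(\\.[0-9]+)?)(?<multiplier>[kKM])?\u03a9");

    // Returns the match start and both named groups, or "no match".
    static String describe(String text) {
        Matcher matcher = PATTERN.matcher(text);
        if (!matcher.find()) {
            return "no match";
        }
        return matcher.start() + ": \"" + matcher.group("number") + "\" \""
                + matcher.group("multiplier") + "\"";
    }

    public static void main(String[] args) {
        // Same test string as in the question, spelled with escapes.
        String test = "\u00b11\u2103 \u00b15% 3k\u03a9";
        System.out.println(describe(test)); // prints: 8: "3" "k"
    }
}
```

The start index is 8 here rather than 12 because the correctly decoded string has 11 characters, not 16. Keeping the literal characters in the source and compiling with an explicit encoding, for example `javac -encoding UTF-8`, would be an alternative to the escapes.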