Why does Java's regex Pattern/Matcher miscount group positions in strings with unicode

Question

I'm trying to use regular expressions on strings that include Unicode characters such as '±' and 'Ω'. The Pattern/Matcher finds the expression I'm looking for, but the Matcher returns the wrong position for the start of the match, so all the Matcher groups are incorrect as well.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexTest {  // class name is arbitrary
        public static void main(String[] args) {
            String test = "±1℃ ±5% 3kΩ";
            // Dump every char with its index and its UTF-16 code unit in hex.
            for (int index = 0; index < test.length(); index++)
                System.out.println(" char at " + index + ": " + test.charAt(index) + " \\u" +
                        Integer.toHexString(test.charAt(index) | 0x10000).substring(1));
            Pattern pattern = Pattern.compile("(?<number>[0-9]*(\\.[0-9]+)?)(?<multiplier>[KM])?Ω");
            Matcher matcher = pattern.matcher(test);
            if (matcher.find()) {
                System.out.println("info: " + matcher.start());
                System.out.println("found \"" + matcher.group("number") + "\" \"" +
                        matcher.group("multiplier") + "\" in \"" + test + "\"");
            }
        }
    }

Since the Matcher does find the sequence, I expect matcher.group("number") to return "3" and matcher.group("multiplier") to return "k".

So the last print should yield:

found "3" "k"

instead I get:

info: 14
found "" "null"

The "info" line gives a hint. The matcher thinks the match started at position 14.

But the for loop prints each charAt position, and its output is:

 char at 0: � \u00c2
 char at 1: � \u00b1
 char at 2: 1 \u0031
 char at 3: � \u00e2
 char at 4: � \u201e
 char at 5: � \u0192
 char at 6:   \u0020
 char at 7: � \u00c2
 char at 8: � \u00b1
 char at 9: 5 \u0035
 char at 10: % \u0025
 char at 11:   \u0020
 char at 12: 3 \u0033
 char at 13: k \u006b
 char at 14: � \u00ce
 char at 15: � \u00a9

In that output we see that the match should really start at position 12 (the '3').

Why is the regular expression Pattern/Matcher finding the match but calculating the location for the group() methods incorrectly?

What can I do to the string to convert it into some magical encoding that will work, or what can I do to the Pattern or Matcher to get them to produce the expected results?

String encodings... ugh.

My suggestion would be to check the regex in https://regex101.com/ and validate whether the regex and test case work there. It seems the special characters might need some type of encoding. – Avishek Bhattacharya, Feb 06 '23 at 03:12
You definitely have a string encoding problem. Your source file is a UTF-8 file, but the compiler thought it was a windows-12nn or ISO 8859-n file. If you're compiling on the command line, pass `-encoding UTF-8` when compiling. If you're using an IDE, the project properties should have a place where you can specify the encoding/charset of source files. – VGR, Mar 07 '23 at 16:44
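To illustrate the encoding comment above, here is a minimal sketch that reproduces the index shift seen in the question's charAt dump. It assumes the wrong charset was windows-1252 (consistent with the `\u201e` and `\u0192` values in the dump, but still an assumption), and the class and method names are hypothetical. The string literal uses Unicode escapes so that the sketch itself does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Re-decode the UTF-8 bytes of s as windows-1252, mimicking a compiler
    // that reads a UTF-8 source file with the wrong charset (assumed charset).
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Escaped form of the test string from the question: ±1℃ ±5% 3kΩ
        String intended = "\u00b11\u2103 \u00b15% 3k\u03a9";
        String garbled = misdecode(intended);
        System.out.println(intended.length() + " chars, '3' at " + intended.indexOf('3'));
        System.out.println(garbled.length() + " chars, '3' at " + garbled.indexOf('3'));
    }
}
```

Run as is, this prints `11 chars, '3' at 8` and then `16 chars, '3' at 12`: every multi-byte character becomes two or three chars, which is why the '3' shows up at index 12 and the (mis-decoded) Ω at index 14 in the question.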

score 0 · Answer 1 · answered Feb 06 '23 at 03:15


And it turns out I'm an idiot.

The test string has a lower case 'k' and the Pattern only allowed for uppercase in '[KM]'. I believe my question still has some merit to some people, as in this case find() should have failed and returned false because 3kΩ should not match "[0-9][KM]?Ω".

Anyways, if I change to [KkMm] (it's resistance, so 'm' is generally different from 'M', whereas 'K' vs 'k' is less different) then it seems to pull out the groups for me correctly. It's like the groups know they didn't match correctly but find() said YES!! anyways.
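A minimal sketch of that change, assuming the string reaches the JVM correctly encoded (the literals use Unicode escapes to sidestep source-encoding issues). The class and method names are hypothetical; only the pattern comes from the question, with `[KM]` widened to `[KkMm]`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiplierFix {
    // Pattern from the question, except the multiplier class also accepts lower case.
    static final Pattern FIXED = Pattern.compile(
            "(?<number>[0-9]*(\\.[0-9]+)?)(?<multiplier>[KkMm])?\u03a9");

    // Returns "start:number:multiplier" for the first match, or null if there is none.
    static String describe(String input) {
        Matcher m = FIXED.matcher(input);
        if (!m.find()) return null;
        return m.start() + ":" + m.group("number") + ":" + m.group("multiplier");
    }

    public static void main(String[] args) {
        // Escaped form of the test string: ±1℃ ±5% 3kΩ
        System.out.println(describe("\u00b11\u2103 \u00b15% 3k\u03a9"));
    }
}
```

With a correctly decoded string this prints `8:3:k`, that is, the match starts at the '3' and both groups are populated.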


Jeffrey Wiegley


`find` returns true because it is possible for the `number` group to match an empty string - `[0-9]*` can match 0 characters, and the rest is optional. Then the entire `multiplier` group is optional too. So the entire regex just matches the `Ω`. – Sweeper Feb 06 '23 at 03:18
Yep. I have to work on a better number regex that can handle things like ".375" and "2.". Thank you. – Jeffrey Wiegley Feb 07 '23 at 05:31
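The two comments above can be sketched together. `ORIGINAL` is the pattern from the question and shows the empty-number match Sweeper describes. `STRICT` is a hypothetical alternative that is not from the thread: it requires at least one digit while still accepting forms like ".375" and "2.". The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResistanceRegexDemo {
    // Pattern from the question: every part before the ohm sign may match nothing.
    static final Pattern ORIGINAL = Pattern.compile(
            "(?<number>[0-9]*(\\.[0-9]+)?)(?<multiplier>[KM])?\u03a9");

    // Hypothetical stricter pattern (an assumption, not from the thread): the
    // number needs at least one digit, as in "3", "2.", "2.5" or ".375".
    static final Pattern STRICT = Pattern.compile(
            "(?<number>[0-9]+\\.?[0-9]*|\\.[0-9]+)(?<multiplier>[KkMm])?\u03a9");

    // Returns "start|number|multiplier" for the first match, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (!m.find()) return null;
        return m.start() + "|" + m.group("number") + "|" + m.group("multiplier");
    }

    public static void main(String[] args) {
        String test = "\u00b11\u2103 \u00b15% 3k\u03a9"; // ±1℃ ±5% 3kΩ
        System.out.println(firstMatch(ORIGINAL, test));          // 10||null
        System.out.println(firstMatch(STRICT, test));            // 8|3|k
        System.out.println(firstMatch(STRICT, ".375M\u03a9"));   // 0|.375|M
        System.out.println(firstMatch(STRICT, "2.\u03a9"));      // 0|2.|null
        System.out.println(firstMatch(STRICT, "\u03a9"));        // null
    }
}
```

The first line of output shows the original pattern matching only the Ω at index 10, with an empty `number` and a null `multiplier`. With the stricter pattern, a bare Ω no longer matches, so find() returns false instead of reporting an empty match.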
