
I'm trying to split my string using a regex. The result should include zero-length matches before and after every delimiter. For example, if the delimiter is ^ and my string is ^^^, I expect to get 4 zero-length groups. I cannot simply use regex = "([^\\^]*)" because it produces an extra zero-length match after every true match between delimiters. So I decided to match the non-delimiter symbols that follow either the beginning of the line or a delimiter. It works perfectly on https://regex101.com/ (sorry, I couldn't find a share option on that site to share my example), but in IntelliJ IDEA it skips one match.

So, now my code is:

final String regex = "(^|\\^)([^\\^]*)";
final String string = "^^^^";

final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);

while (matcher.find()) 
    System.out.println("[" + matcher.start(2) + "-" + matcher.end(2) + "]: \"" + matcher.group(2) + "\"");

I expect 5 empty-string matches, but I get only 4:

[0-0]: ""
[2-2]: ""
[3-3]: ""
[4-4]: ""

Why does it skip the [1-1] match, and how can I fix it?

00vk
  • Do you need the indices? It seems you may get what you need using `split`, `String[] result = string.split("\\^", -1);` – Wiktor Stribiżew Sep 06 '18 at 09:26
  • What about regex = "([^\\^]*)"? That doesn't skip [1-1]. – Ralf Renz Sep 06 '18 at 09:28
  • @RalfRenz or you may mean `(^|\\^)([^\\^]*?)` – SL5net Sep 06 '18 at 09:34
  • regex101 does not seem to support Java regexes, so I wouldn't trust it. If you increase the number of `^` in the text what is the result? Does it only skip `1-1` or does it skip one match every 3, 4 characters? It really seems weird as a result, but unfortunately zero-width matches are often buggy or behave strangely in many regex implementations (e.g. the python `regex` module [not the stdlib `re`] has 2 different modes and one of the big differences is handling of zero-width matches). If you could find a way to avoid using zero-width matches it would probably make things easier. – Giacomo Alzetta Sep 06 '18 at 09:36
  • @GiacomoAlzetta It always skips only the second one. So does it seem that my regex is correct but doesn't work because of a bug? – 00vk Sep 06 '18 at 09:51
  • @VictorKondratiev What answer do you expect? A solution or an explanation? Does my top comment solution work for you? – Wiktor Stribiżew Sep 06 '18 at 09:51
  • @WiktorStribiżew I think your solution will help me. But also I want to be sure that my effort to make a regex was in the right direction. I had thought before that it may be incorrect to use `(^|\\^)` or something – 00vk Sep 06 '18 at 09:55

1 Answer


Your regex matches either the start of the string or a ^ (capturing that into Group 1) and then any 0+ chars other than ^ into Group 2. When the first match is found (at the start of the string), Group 1 holds an empty string (as it matched the start of the string) and Group 2 also holds an empty string (as the first char is ^, and [^^]* can match an empty string before a non-matching char). The whole match is zero-length, so the regex engine advances the regex index by one before searching again. After the first match, the index has therefore moved from the start of the string to the position after the first ^. At that position the start anchor can no longer match, so the second match consumes the second ^ into Group 1 and captures the empty string after it, at [2-2]. Hence, the first ^ is never matched as a delimiter, and the empty [1-1] field that follows it is skipped.
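The skipped position can be reproduced in a few lines. The sketch below runs the pattern from the question next to a lookbehind variant that only asserts the delimiter instead of consuming it. The lookbehind variant is an assumption of this sketch, not something proposed in the thread, and the names CaretFields, spans, CONSUMING and LOOKBEHIND are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaretFields {
    // Pattern from the question: group 1 consumes the delimiter (or matches the start).
    static final Pattern CONSUMING = Pattern.compile("(^|\\^)([^\\^]*)");

    // Hypothetical variant: the delimiter (or start) is only asserted by a
    // lookbehind, so a zero-length match never "uses up" a delimiter.
    static final Pattern LOOKBEHIND = Pattern.compile("(?<=^|\\^)[^\\^]*");

    // Collects the "start-end" span of the given group for every match.
    static List<String> spans(Pattern pattern, int group, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.start(group) + "-" + m.end(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Reproduces the output from the question: [0-0, 2-2, 3-3, 4-4]
        System.out.println(spans(CONSUMING, 2, "^^^^"));
        // Lookbehind variant: [0-0, 1-1, 2-2, 3-3, 4-4]
        System.out.println(spans(LOOKBEHIND, 0, "^^^^"));
    }
}
```

This variant may be worth considering only when the match offsets are actually needed; otherwise a plain split is simpler.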

The solution is a simple split one:

String[] result = string.split("\\^", -1);

The second argument is the limit. A negative limit makes the method keep all trailing empty strings in the resulting array instead of discarding them.
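To illustrate the effect of the limit argument, here is a small sketch comparing the one-argument split with the -1 variant. The class and method names (SplitLimitDemo, keepAll, dropTrailing) and the second sample input are hypothetical and chosen only for this illustration.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Negative limit: trailing empty strings are kept.
    static String[] keepAll(String input) {
        return input.split("\\^", -1);
    }

    // One-argument form (limit 0): trailing empty strings are removed.
    static String[] dropTrailing(String input) {
        return input.split("\\^");
    }

    public static void main(String[] args) {
        System.out.println(keepAll("^^^^").length);       // 5
        System.out.println(dropTrailing("^^^^").length);  // 0
        System.out.println(Arrays.toString(keepAll("a^b^^")));       // [a, b, , ]
        System.out.println(Arrays.toString(dropTrailing("a^b^^")));  // [a, b]
    }
}
```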

See a Java demo:

String str = "^^^^";
String[] result = str.split("\\^", -1);
System.out.println("Number of items: " + result.length);
for (String s: result) {
    System.out.println("\"" + s + "\"");
}

Output:

Number of items: 5
""
""
""
""
""
Wiktor Stribiżew