
While trying to match all multiline comments in a Java source file, I run into a StackOverflow() error. It happens when the matched comment is fairly large. I've narrowed the limit down to roughly 2500 characters, though this may be specific to my environment.

I'm using the following expression to match the comments:

/<comment:((\/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*+\/))+>/mi

Is there some limit to the size of the match I should be aware of, or is there a flaw in my regex?

My stacktrace is:

|project://Sevo1/src/Volume.rsc|(985,32,<53,12>,<53,44>): StackOverflow()
    at countLines(|project://Sevo1/src/Volume.rsc|(985,33,<53,12>,<53,45>))
    at $root$(|prompt:///|(0,73,<1,0>,<1,73>))
Ruben Steins

1 Answer


Your regex is not optimal: it contains a *-quantified capturing group whose alternatives can match at the same position in the string. [^*] matches any character except *, which includes line breaks, and [\r\n] then matches line breaks as well. The chunks of text matched per iteration are also mostly one character long (except for the * runs matched with (\*+([^*\/]|[\r\n]))), and the regex engine does not seem to cope well with that here.
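The overlap can be checked with a small Java sketch. It assumes Rascal's regex matching behaves like java.util.regex, which is an assumption and not something confirmed here. The class and method names are made up for illustration. The sketch checks that [^*] already accepts a line break, and probes the question's pattern on a long comment, reporting a StackOverflowError instead of letting it escape. Whether the error actually occurs depends on the JVM's thread stack size, so no particular outcome is claimed for the long input.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class OverlapProbe {
    // The question's pattern without the Rascal <comment:...> binding and flags.
    static final Pattern ORIGINAL = Pattern.compile(
        "(/\\*([^*]|[\\r\\n]|(\\*+([^*/]|[\\r\\n])))*\\*+/)+");

    // True if the given character class matches a single "\n".
    static boolean acceptsNewline(String charClass) {
        return Pattern.compile(charClass).matcher("\n").matches();
    }

    // Runs find() and reports the outcome as a string, including a stack overflow.
    static String probe(Pattern pattern, String input) {
        try {
            return pattern.matcher(input).find() ? "matched" : "no match";
        } catch (StackOverflowError e) {
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        // Both alternatives accept a line break, so they overlap.
        System.out.println(acceptsNewline("[^*]"));
        System.out.println(acceptsNewline("[\\r\\n]"));

        // A comment with a 100,000-character body (size chosen arbitrarily).
        char[] body = new char[100_000];
        Arrays.fill(body, 'x');
        String longComment = "/*" + new String(body) + "*/";
        System.out.println(probe(ORIGINAL, longComment));
    }
}
```

Because the group consumes one character per iteration, the number of iterations grows with the comment length, which is where a recursive matcher can run out of stack.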

Nested quantifiers are only good when each iteration matches a longer chunk in one go. Rewrite the pattern as

/<comment:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/>/

and it will be more efficient.

Details

  • \/\* - a /* substring
  • [^*]*\*+ - 0+ characters other than * followed by one or more literal *
  • (?:[^\/*][^*]*\*+)* - 0+ sequences of:
    • [^\/*][^*]*\*+ - a character other than / or * (matched with [^/*]) followed by 0+ non-asterisk characters ([^*]*) followed by one or more asterisks (\*+)
  • \/ - closing /
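As a rough illustration, the rewritten pattern can be tried in plain Java. This is a sketch under the assumption that java.util.regex is a fair stand-in for Rascal's matcher. The class and method names are hypothetical, and the Rascal <comment:...> binding is dropped because Java has no equivalent syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnrolledCommentMatcher {
    // /* , then non-asterisks and a run of *, repeated until a / follows the run.
    static final Pattern COMMENT =
        Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    // Returns every /* ... */ comment found in the source text, in order.
    static List<String> findComments(String source) {
        List<String> comments = new ArrayList<>();
        Matcher m = COMMENT.matcher(source);
        while (m.find()) {
            comments.add(m.group());
        }
        return comments;
    }

    public static void main(String[] args) {
        String source = "int a; /* one */ int b; /** two\n * lines **/ // x";
        for (String comment : findComments(source)) {
            System.out.println(comment);
        }
    }
}
```

Here each pass through the group consumes everything up to the next run of asterisks, so the number of iterations follows the number of asterisk runs in the comment and not its length in characters.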
Wiktor Stribiżew
  • So there probably is no limit to the size of the match itself. The problem lies in a sub-optimal pattern that does too much backtracking causing a StackOverflow error. Is that a proper summary? – Ruben Steins Dec 07 '17 at 13:51
  • It would help to see a stack trace to make sure this is indeed the problem. The improved regex will make the problem appear much later, but there could still be a Rascal-specific mapping issue between Java regexes and the translation to the Rascal virtual machine/interpreter. – Jurgen Vinju Dec 07 '17 at 14:08
  • @RubenSteins Well, the size of the match is certainly limited by the size of the variable type that holds the match. However, I cannot imagine you have comments that are that long. – Wiktor Stribiżew Dec 07 '17 at 14:10
  • @jurgenv I've added the stack trace (at least I think it's the stack trace) to the original question. Not sure if there's any way to get a more elaborate stack trace. – Ruben Steins Dec 07 '17 at 14:23
  • @WiktorStribiżew I agree. The `str` type is probably big enough to hold more than 2500 characters, which is what the limit was with my initial pattern. – Ruben Steins Dec 07 '17 at 14:24
  • @RubenSteins Thanks. Indeed, the Java stack seems to be missing. I think it's lost to the eternal bitfields. If you have a code example and a case that triggers it, please send an email and I can use the JVM debugger to inspect the stack. Cheers! – Jurgen Vinju Dec 07 '17 at 15:23
  • Indeed `str` will hold any string that fits on the JVM's heap. – Jurgen Vinju Dec 07 '17 at 15:23