How do you find an overlapping match with a variable-length prefix using regex?

Question

I am trying to match tags like TODO inside comments in some code files using a regular expression. Consider for example the following file:

foo bar # TODO
bar foo quux   # TODO bar # TODO foo quux
quux # foo ## TODO foo # bar #  TODO quux
'# TODO\'' # TODO

Note that there may be multiple tags on one line, as long as each one is preceded by #, so lines two and three should each match twice. Furthermore, the prefix before the first # (the actual code) may have arbitrary length; the same applies to what comes after each TODO. In addition, there may be substrings like # TODO that are not comments (see line four; it should match only once, at the # TODO at the end).

I have searched here on Stack Overflow and on other sites, but nothing seemed to address a problem where you have multiple overlapping matches and a variable-length prefix before those matches. I assume that the problem lies mainly in trying to use positive lookaheads/lookbehinds in conjunction with a context:

(?=#\s*TODO[^#]*) does not work since it matches line four twice. This is why I say overlapping: It seems that you have to take the structure of the prefix into account when matching.
I can match the prefix (the actual code plus any comments without a tag) with ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)* so that I get line four right, but this is a variable-length match, so using a positive lookbehind like (?<=^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*)(#\s*TODO[^#]*) will result in an error on every regex engine as far as I know (and even if it worked, it would only match the first # TODO anyway).
Matching the prefix and then using a positive lookahead like ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*(?=(#\s*TODO[^#]*)(#\s*(?!TODO)[^#]*)*) does not work either since it matches only one occurrence of # TODO.
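
To make the first failure concrete, here is a minimal sketch in Python's `re` module (not ripgrep/PCRE2; the names `SAMPLE`, `NAIVE` and `count_naive` are made up for illustration). It counts the zero-width matches of the context-free lookahead on each sample line:

```python
import re

# The four sample lines from the question.
SAMPLE = [
    "foo bar # TODO",
    "bar foo quux   # TODO bar # TODO foo quux",
    "quux # foo ## TODO foo # bar #  TODO quux",
    r"'# TODO\'' # TODO",
]

# The context-free lookahead: it knows nothing about string literals.
NAIVE = re.compile(r"(?=#\s*TODO[^#]*)")

def count_naive(line):
    """Count the positions at which the naive lookahead succeeds."""
    return sum(1 for _ in NAIVE.finditer(line))

counts = [count_naive(line) for line in SAMPLE]
print(counts)  # [1, 2, 2, 2]
```

The last line is reported twice because the `# TODO` inside the string literal is counted as well; the desired result would be one match there.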

To explain: \\. matches an escaped character and [^'\\]* anything that is not an escape character and not a string delimiter, so '[^'\\]*(\\.[^'\\]*)*' matches any string literal. Using [^#']* outside of that string literal part means: Match anything that does not start a string or a comment, so the code part of a line is ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*. A comment segment that does not contain a tag can be found with #\s*(?!TODO)[^#]*, so the whole prefix can be matched with ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*.
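
The decomposition above can be sketched in Python's `re` module as follows (an illustration only, not the ripgrep invocation; all names are hypothetical). One deliberate deviation from the pattern as written: the whitespace is moved inside the lookahead, `#(?!\s*TODO)[^#]*`, because with `#\s*(?!TODO)` a backtracking engine can shrink `\s*` to zero width, after which the lookahead sees a space instead of `TODO` and succeeds, so the tagged comment would be swallowed into the prefix.

```python
import re

# Building blocks, named after the explanation in the text.
STRING = r"'[^'\\]*(?:\\.[^'\\]*)*'"       # one single-quoted string literal
CODE = rf"^[^#']*(?:{STRING}[^#']*)*"      # code part: no comment started yet
# Deviation from the question: \s* sits inside the lookahead so that
# backtracking cannot sneak past a tagged comment.
UNTAGGED = r"(?:#(?!\s*TODO)[^#]*)*"       # comment segments without a tag
PREFIX = re.compile(CODE + UNTAGGED)

SAMPLE = [
    "foo bar # TODO",
    "bar foo quux   # TODO bar # TODO foo quux",
    "quux # foo ## TODO foo # bar #  TODO quux",
    r"'# TODO\'' # TODO",
]

def rest_after_prefix(line):
    """Return what is left of the line once the prefix has been consumed."""
    m = PREFIX.match(line)  # always matches, possibly with an empty prefix
    return line[m.end():]

rests = [rest_after_prefix(line) for line in SAMPLE]
for rest in rests:
    print(repr(rest))
```

On the sample lines the remainder always starts at the first real tagged comment (for line four, the `# TODO` after the string literal), which also shows the limitation: the prefix only gets you to the first tag on a line.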

I use ripgrep, so this applies to PCRE/PCRE2 regular expressions. I would, however, be interested in whether there is a solution in any regex dialect.

I know that I can match each line that has at least one correct match and post-process the results in some scripting language to extract each TODO from the lines, but I would like to know if it is possible to do this regex-only.
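
For comparison, the scripted fallback could look roughly like the sketch below. It is written for Python's `re` module, which to my knowledge has no lookbehind of variable length and no backtracking verbs, so it uses a different trick: string literals and tagged comments are matched in one alternation, and only the capturing group for the comment is kept. The names are hypothetical, and only single-quoted strings are handled, as in the sample file.

```python
import re

# A complete single-quoted string literal (consumed, never reported).
STRING = r"'[^'\\\n]*(?:\\.[^'\\\n]*)*'"
# A tagged comment segment, up to the next '#' or the end of the line.
TODO = r"#[ \t]*TODO[^#\n]*"
# Strings come first in the alternation so that they are skipped as a whole;
# only the tagged comment is wrapped in a capturing group.
TOKEN = re.compile(rf"{STRING}|({TODO})")

def find_todos(text):
    """Return every tagged comment segment that lies outside a string."""
    return [m.group(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

SAMPLE = [
    "foo bar # TODO",
    "bar foo quux   # TODO bar # TODO foo quux",
    "quux # foo ## TODO foo # bar #  TODO quux",
    r"'# TODO\'' # TODO",
]

per_line = [len(find_todos(line)) for line in SAMPLE]
print(per_line)  # [1, 2, 2, 1]
```

Because `finditer` resumes after each match, a string literal that has been consumed can never contribute a `# TODO` of its own. This is still post-processing (the filter on `group(1)`), not a regex-only solution.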

Perhaps like this `"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'(*SKIP)(*F)|#+[^\r\n#]*TODO[^\r\n#]*` https://regex101.com/r/49NPut/1 — The fourth bird, Feb 06 '21 at 20:47
Is this the question du jour? https://stackoverflow.com/q/66079810/9473764 — Nick, Feb 06 '21 at 22:21
@Thefourthbird Almost! Thanks a lot, you pointed me in the right direction. I didn't know about `(*SKIP)` and `(*FAIL)` yet. This is the regex that works for me: `"[^"\\\n\r]*(?:\\(?:$(*SKIP)(*FAIL)|.)[^"\\\n\r]*)*(?:$|")(*SKIP)(*FAIL)|'[^'\\\n\r]*(?:\\(?:$(*SKIP)(*FAIL)|.)[^'\\\n\r]*)*(?:$|')(*SKIP)(*FAIL)|#[\t\f\v ]*TODO[^\n\r#]*` https://regex101.com/r/kMiZtU/1 This is also safe when there are syntactically wrong string literals in the code. — julianbetz, Feb 07 '21 at 00:49
@Thefourthbird If you'd like, you can add that as an answer and I would gladly accept it. Or should I answer my own question? What is the common way to go about this here? — julianbetz, Feb 07 '21 at 00:58
@Nick Yes, that's strange. I didn't see that question before. I essentially tackle the same problem, but on Python and Makefile and with a different regex engine. — julianbetz, Feb 07 '21 at 01:06
@julianbetz Perhaps you could shorten the pattern to a single `(*SKIP)(*FAIL)` using `(?:"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"|'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'|["'][^"'\r\n]*$)(*SKIP)(*F)|#\h*TODO[^\r\n#]*` See https://regex101.com/r/mwl1HZ/1 — The fourth bird, Feb 07 '21 at 16:10
@Thefourthbird Yes, that's good. But we have to check escaping with backslash in broken strings as well, otherwise we get matches inside: `(?:"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*(?:"|\\?$)|'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*(?:'|\\?$))(*SKIP)(*F)|#\h*TODO[^\r\n#]*` https://regex101.com/r/RZFPE0/1 — julianbetz, Feb 07 '21 at 18:38
