
I am creating a syntax highlight file for a language and I have everything mapped out and working with one exception.

I cannot come up with a regex that will match the following conditions for a specific line comment style.

If the first non white-space character is an asterisk (*) the line is considered a comment.

I have created many samples that work in regexr but it never captures in vscode.

For example, regexr is cool with this: ^(?:\s*)\*+(?:.*)?\n

So I convert it into the proper format for the tmlanguage.json file: ^(?:\\s*)\\*+(?:.*)?\\n
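One way to rule out an escaping mistake is to decode the JSON string and compare it with the pattern that worked in regexr. A minimal Python sketch, assuming the rule is stored under a `match` key (the one-key object here is illustrative, not the actual grammar file):

```python
import json

# The rule exactly as it would be typed in the tmLanguage.json file.
# A raw string keeps the doubled backslashes intact for the JSON parser.
json_source = r'{"match": "^(?:\\s*)\\*+(?:.*)?\\n"}'

# The pattern the regex engine receives once the JSON has been decoded.
decoded = json.loads(json_source)["match"]

# The pattern that worked in regexr.
expected = r"^(?:\s*)\*+(?:.*)?\n"

print(decoded == expected)
```

If the two strings are equal, the escaping is fine and the problem probably lies elsewhere, for example in another rule of the grammar.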

But it is not capturing properly: if the first character of the line is an *, it does not match, but if the line starts with a whitespace character followed by an *, it does.

I suck at formatting on stackoverflow, so <tab> represents a chr(9) tab character; any other leading whitespace is a space.

It should work in these cases:

*******************************
  *****************************
<tab>*************************
* comment
  * comment
<tab>* comment

But it shouldn't work in these cases:
string *******************************
  string ***************************** string
<tab>string *************************
x *= 3

I am guessing that either the anchor ^ isn't working in my regex or I am escaping something incorrectly.

Any advice?

Please see sample image attached: screenshot

  • Are there any other rules in your grammar that are getting applied in the bad case? Does the [scope inspector](https://code.visualstudio.com/blogs/2017/02/08/syntax-highlighting-optimizations#_new-textmate-scope-inspector-widget) show anything interesting? – Matt Bierner Aug 28 '17 at 02:00
  • @MattBierner, you hit the nail on the head. I removed all other rules from the file and it worked properly which led me down the road to track down the issue. I had a regex for a multiplier that was catching incorrectly on line 1. – H.Richardson Aug 28 '17 at 14:35
  • Your example `^(?:\\s*)\\*+(?:.*)?\\n` works for me. If the issue is because of your other rules, please update the question or write an answer so as to conclude this question. – colinfang Nov 26 '17 at 03:38

2 Answers


I don't know the regex engine you're using. I'm just going to give you some
general tips on how it should be done.

  • First off, if you're reading a string with more than one newline in it,
    the anchor ^, in an engine's default state, means Beginning of String (BOS).

What you want in this case is Multi-Line-Mode. This makes the anchor ^ match at the Beginning of Line (BOL) as well as the BOS.

  • Second, you don't need those non capture groups (?:\s*) (?:.*), they encapsulate single constructs.

  • Third, it is redundant to make a group optional when its enclosed contents are optional (?:.*)?

  • Fourth, you don't need the newline \n construct at the end, since it should not be highlighted anyway, and it might not be present on the last line of text.
    In that case the pattern would fail to match.


So, putting it all together, the modified regex would be (?m)^\s*\*.*

Explained

 (?m)     # Inline modifier: Multi-line mode
 ^        # Beginning of line
 \s*      # Zero or more whitespace characters
 \*       # A required asterisk
 .*       # Optional rest of the line (non-newline characters)

Note that you could put a single capture group around the data
if you need to reference it in a replace (?m)^(\s*\*.*)

Also, the language you're using should have a way to specify options when compiling the regex. If the engine doesn't accept the inline modifier (?m), take it out and specify that option when compiling the regex.
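As a rough check, the modified regex can be run against the sample lines from the question. This is only a sketch in Python's `re` module, which is not the engine VS Code uses, so it shows the pattern's logic rather than the editor's behavior:

```python
import re

# Sample lines from the question; "\t" stands in for <tab>.
LINES = [
    "*******************************",
    "  *****************************",
    "\t*************************",
    "* comment",
    "  * comment",
    "\t* comment",
    "string *******************************",
    "  string ***************************** string",
    "\tstring *************************",
    "x *= 3",
]

# (?m) makes ^ match at the start of every line, not just the string.
COMMENT = re.compile(r"(?m)^\s*\*.*")

# Join the lines into one multi-line string and collect every match.
comments = COMMENT.findall("\n".join(LINES))

for text in comments:
    print(repr(text))
```

Only the first six lines come back as comments. One caveat: `\s` also matches a newline, so a blank line directly above a comment would be folded into the match; `[ \t]*` in place of `\s*` avoids that.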

  • Thank you sln, that is very helpful, but does not solve the issue I am having in the system I am working on. Your answer attempt has been very helpful to me in general though. – H.Richardson Aug 27 '17 at 09:48
  • VS Code's language highlighting engine (based on TextMate) does not support multiline regular expressions. I believe `^` should always match the start of the line – Matt Bierner Aug 28 '17 at 01:57
  • To you guys... I find it hard to believe there is no _Multiline_ option for the anchors `^$`. It doesn't matter anyway, there is a workaround, the hard way: `(?:^|(?<=\n))(\s*\*.*)` Or, if it doesn't support assertions, you'd have to actually consume the previous newline `(?:^|\n)(\s*\*.*)` and it becomes part of the match. –  Aug 28 '17 at 22:12

Apparently VS Code's syntax highlighter is single-line. No matter how much I tried matching regexes that span several lines, they never worked.

Second, if you're designing a language, I suggest not using an arithmetic operator for comments.

Third, apparently you can match newlines in the begin and end attributes. You can try it there.
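To illustrate what "single-line" means here, this is a rough Python sketch of a highlighter that feeds the regex one line at a time. It is an assumption about the general approach, not VS Code's actual tokenizer, and `comment_lines` is a made-up helper name:

```python
import re

# No (?m) needed: each line is matched on its own, so ^ is always
# the start of the line being examined.
COMMENT = re.compile(r"^\s*\*.*")

def comment_lines(text: str) -> list[int]:
    """Return the 1-based numbers of lines whose first
    non-whitespace character is an asterisk."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        if COMMENT.match(line):
            hits.append(number)
    return hits

sample = "* first\nx *= 3\n\t* second"
print(comment_lines(sample))

# A pattern that needs text from two lines can never match when the
# input arrives one line at a time.
two_line = re.compile(r"first\nx")
spans = any(two_line.search(line) for line in sample.split("\n"))
print(spans)
```

In this sketch lines 1 and 3 are reported as comments, `x *= 3` is not, and the two-line pattern finds nothing because no single line contains a newline.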

Anatoly