Regular expression for syntax highlighting attributes in HTML tag

Question

I'm working on regular expressions for syntax highlighting in a Sublime/TextMate language file. The rule needs to "begin" on a non-self-closing HTML tag and "end" on the matching closing tag:

begin: (<)([a-zA-Z0-9:.]+)[^/>]*(>)
end: (</)(\2)([^>]*>)

So far, so good: I can capture the tag name, and the end pattern matches it, so the appropriate patterns get applied to the area between the tags.

jsx-tag-area:
    begin: (<)([a-zA-Z0-9:.]+)[^/>]*>
    beginCaptures:
      '1': {name: punctuation.definition.tag.begin.jsx}
      '2': {name: entity.name.tag.jsx}
    end: (</)(\2)([^>]*>)
    endCaptures:
      '1': {name: punctuation.definition.tag.begin.jsx}
      '2': {name: entity.name.tag.jsx}
      '3': {name: punctuation.definition.tag.end.jsx}
    name: jsx.tag-area.jsx
    patterns:
    - {include: '#jsx'}
    - {include: '#jsx-evaluated-code'}
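
To illustrate what the \2 backreference in the end pattern is doing, here is a rough JavaScript sketch. It is not how Sublime evaluates grammars internally: it simply matches the begin pattern, then builds the end pattern from the captured tag name. The function name findTagArea and the sample input are made up for this illustration, and nested tags of the same name are not handled.

```javascript
// The begin pattern from the question.
const begin = /(<)([a-zA-Z0-9:.]+)[^/>]*(>)/;

// Hypothetical helper: find the first tag area in `src`.
function findTagArea(src) {
  const open = begin.exec(src);
  if (!open) return null;

  // Tag names may contain "." or ":", so escape regex metacharacters
  // before reusing the name (this plays the role of \2).
  const name = open[2].replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const end = new RegExp('(</)(' + name + ')([^>]*>)');

  // Search for the closing tag only after the opening tag.
  const rest = src.slice(open.index + open[0].length);
  const close = end.exec(rest);
  if (!close) return null;

  return { tag: open[2], inner: rest.slice(0, close.index) };
}

console.log(findTagArea('<div class="a">hi {x}</div>'));
```

A self-closing tag such as `<br/>` never matches the begin pattern, because `[^/>]*` cannot step over the `/`.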

Now I'd also like to capture zero or more of the HTML attributes in the opening tag so that I can highlight them.

So if the tag were <div attr="Something" data-attr="test" data-foo>

It would be able to match on attr, data-attr, and data-foo, as well as the < and div

Something like (this is very rough):

(<)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*[^/>]*(>)

It doesn't need to be perfect, it's just for some syntax highlighting, but I was having a hard time figuring out how to achieve multiple capture groups within the tag, whether I should be using look-around, etc, or whether this is even possible with a single expression.
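
For what it's worth, the difficulty with a single expression can be reproduced in JavaScript: when a capture group sits inside a repeated group, only its last iteration is kept. This is a minimal sketch using a simplified attribute pattern (an assumption, not the pattern above), and it assumes Sublime's regex engine treats repeated groups the same way.

```javascript
const tag = '<div attr="Something" data-attr="test" data-foo>';

// One capture group for the attribute name, repeated with *.
const re = /<([a-zA-Z0-9:.]+)(?:\s+([0-9a-zA-Z_-]+)(?:="[^"]*")?)*\s*>/;
const result = tag.match(re);

// Group 1 is the tag name; group 2 holds only the LAST attribute name,
// because each iteration overwrites the previous capture.
console.log(result[1], result[2]);
```

So the whole tag matches, but attr and data-attr are lost from the captures, which is why the answers below either repeat the attribute group a fixed number of times or parse the attributes in a second pass.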

Edit: here are more details about the specific case / question - https://github.com/reactjs/sublime-react/issues/18

This probably won't work very well if you're trying to capture an arbitrary amount of attributes. If it's a variable amount of attributes the regex is going to be very messy and unreadable. [This is how ugly it looks capturing two attributes](http://regex101.com/r/nB2lL9/3) — skamazin, Aug 04 '14 at 17:11
You've had a look at [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348/1048572)? — Bergi, Sep 08 '14 at 20:26
Yes of course :) I'm not trying to faithfully parse the html, I'm trying to roughly pattern match it... take a look at the use case https://github.com/reactjs/sublime-react/issues/18 — tgriesser, Sep 08 '14 at 20:30
Also, the issue is half with the actual matching and half with how it should actually work based on Sublime's syntax highlighting rules (or if I'm going about this the wrong way) — tgriesser, Sep 08 '14 at 20:33
It's a shame I can't really play with this one... From [the tutorial](http://docs.sublimetext.info/en/latest/extensibility/syntaxdefs.html#begin-end-rules) it looks like you can use `"include": "$self"` for recursive matching, which is very cute. Can it also be used for a specific group? For example: match `<[Tag][All Attributes]>`...`[Tag]>`, and then use another rule to parse `[All Attributes]`? — Kobi, Sep 10 '14 at 05:11
I don't know what jsx is but have you checked http://examples.oreilly.com/0636920023630/Regex_Cookbook_2_Code_Samples.html in case you can translate their examples into this? (search for "HTML tags (strict)") — alexandroid, Sep 10 '14 at 06:28
@Kobi that link/explanation is exactly what I was looking for but was having the hardest time finding it. If you want to open an answer I'll award you some points. — tgriesser, Sep 10 '14 at 19:49

Oscar Hermosilla · Answer 1 · 2014-09-14T09:15:49.897

I may have found a possible solution.

It is not perfect because, as @skamazin said in the comments, a single regex cannot capture an arbitrary number of attributes: you have to repeat the attribute pattern once for every attribute you want to allow, which caps the number of attributes.

The regex is pretty scary, but it may work for your goal. It could probably be simplified a bit, and you may have to adjust some things.

For only one attribute it looks like this:

(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))

For more attributes, append this optional group once for each additional attribute you want to allow:

(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?

So, for example, to allow at most 3 attributes, the regex becomes:

(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?
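
Rather than pasting the optional group by hand, the capped pattern can be assembled programmatically. The sketch below is JavaScript, not Sublime grammar syntax; buildTagRegex and the constant names are hypothetical, and it assumes an engine with lookbehind support (recent Node.js or browsers). The pieces are copied from the 3-attribute regex above.

```javascript
// Pieces copied from the 3-attribute regex: the tag start, the required
// first attribute, and the optional unit repeated for each extra attribute.
const TAG_START = '(<)([a-zA-Z0-9:.]+)';
const FIRST_ATTR = '(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))';
const NEXT_ATTR = '(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?';

// Hypothetical helper: build a regex that allows up to maxAttrs attributes.
function buildTagRegex(maxAttrs) {
  const extra = Math.max(0, maxAttrs - 1);
  return new RegExp(TAG_START + FIRST_ATTR + NEXT_ATTR.repeat(extra));
}

const threeAttrs = buildTagRegex(3);
const captured = threeAttrs.exec('<div attr="Something" data-attr="test" data-foo>');

// Groups: "<", tag name, then one group per attribute name.
console.log(captured.slice(1));
```

Note that `[^ >]+` stops at the first space, so a quoted value containing spaces would need a different value pattern.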

Tell me if it suits you and if you need further details.

Well the g modifier is only for the demo to see several scenarios but I guess that it won't be necessary if the original regex he posted was working already (<)([a-zA-Z0-9:.]+)[^/>]*(>). — Oscar Hermosilla, Sep 10 '14 at 09:41

score 0 · Answer 2 · edited May 23 '17 at 12:08

I'm unfamiliar with sublimetext or react-jsx but this to me sounds like a case of "Regex is your tool, not your solution."

A solution that uses regex as a tool for this would be something like this JsFiddle (note that the regex is slightly obfuscated because of HTML entities like &gt; for > etc.)

Code that does the actual replacing:

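// blabla holds the HTML-escaped source text.
// c = comment, t = tag, a = {...} block; only one of them is defined per match.
blabla.replace(/(&lt;!--(?:[^-]|-(?!-&gt;))*--&gt;)|(&lt;(?:(?!&gt;).)+&gt;)|(\{[^\}]+\})/g, function(m, c, t, a) {
    if (c!=undefined)
        return '<span class="comment">' + c + '</span>';
    if (t!=undefined)
        // Second pass: wrap each attribute name inside the tag.
        return '<span class="tag">' + t.replace(/ [a-z_-]+=?/ig, '<span class="attr">$&</span>') + '</span>';
    if (a!=undefined)
        // Second pass: wrap single-quoted strings inside the block.
        return a.replace(/'[^']+'/g, '<span class="quoted">$&</span>');
});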

So here I'm first capturing the separate types of groups, following this general pattern adapted for this use case of HTML with curly-brace blocks. Each capture is passed to a function that determines which type it is and then replaces subgroups within it using that type's own .replace() call.

There's really no other reliable way to do this. I can't tell you how this translates to your environment but maybe this is of help.

score 0 · Answer 3 · answered Sep 11 '14 at 13:43

Regex alone doesn't seem to be good enough, but since you're working with sublime's scripting here, there's a way to simplify both the code and the process. Keep in mind, I'm a vim user and not familiar with sublime's internals - also, I usually work with javascript regexes, not PCREs (which seems to be the format used by sublime, or closest thereof).

The idea is as follows:

use a regex to get the tag, attributes (in a string) and contents of the tag
use capture groups to do further processing and matching if necessary

In this case, I made this regex:

<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?

It starts by finding an opening tag and creates a capture group for the tag name. If it finds a space, it goes on to match the bulk of the attributes. Inside the \"...\" pattern I could have used \"[^\"]*?\" to match only non-quote characters, but I purposely allow any character, so the match runs on to the closing quote that precedes the > and takes in all the attributes at once; we can process them later. It then matches any text between the tags and finally the closing tag.

It creates 4 capture groups:

tag name
attribute string
tag contents
closing tag

As you can see in this demo, if there is no closing tag we get no capture group for it, and the same goes for attributes, but we always get a capture group for the contents of the tag. In general this can be a problem, since we can't assume that a captured feature will always be in the same group. It isn't a problem here: in the conflict case where there are no attributes and no content, the 2nd capture group is empty, so we can just assume it means no attributes, and the lack of a 3rd group speaks for itself. If there's nothing to parse, nothing can be parsed wrongly.

Now to parse the attributes, we can simply do it with:

([a-z]+=\"[^\"]*?\")

Demo here. This gives us each attribute exactly. If Sublime's scripting lets you get this far, it would certainly allow further processing if necessary. You can of course always use something like this:

(([a-z]+)=\"([^\"]*?)\")

which will provide capture groups for the attribute as a whole and its name and value separately.

Using this approach, you should be able to parse the tags well enough for highlighting in 2-3 passes and send off the contents for highlighting to whatever highlighter you want (or just highlight it as plaintext in whatever fancy way you want).
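
The two passes can be sketched in JavaScript as follows. This only illustrates the idea with the regexes from this answer; parseTag and the sample input are hypothetical, and how the passes would map onto Sublime's grammar rules is left open.

```javascript
// Pass 1: tag name, attribute string, contents, closing tag.
const tagRe = /<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?/;
// Pass 2: each attribute as a whole, plus its name and value.
const attrRe = /(([a-z]+)=\"([^\"]*?)\")/g;

// Hypothetical helper combining both passes.
function parseTag(src) {
  const m = tagRe.exec(src);
  if (!m) return null;

  // Group 2 is undefined when the tag has no attributes.
  const [, name, attrString = '', contents] = m;

  const attrs = [...attrString.matchAll(attrRe)].map(function (a) {
    return { name: a[2], value: a[3] };
  });
  return { name: name, attrs: attrs, contents: contents };
}

console.log(parseTag('<div id="main" class="box">hello world</div>'));
```

Because the attribute name is `[a-z]+`, hyphenated names such as data-attr would need a wider character class.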

score 0 · Answer 4 · edited Jul 07 '18 at 11:49

Your own regex was quite helpful in answering your question.

This seems to work well for me:

/(:?<|<\/)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*[^/>]*(:?>|\/>)/g

The / at the beginning and end are just the delimiters that regex literals usually require. The g at the end stands for global, so the regex matches every occurrence rather than only the first.
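
As a quick check, the pattern can be run in JavaScript against the tag from the question (the sample string with a closing tag is an assumption for illustration). In this sketch it matches both the opening and the closing tag, but group 3 holds only the first attribute name: the repeated group stops at the first quoted value, and `[^/>]*` consumes the rest of the tag.

```javascript
const tagRe = /(:?<|<\/)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*[^/>]*(:?>|\/>)/g;
const sample = '<div attr="Something" data-attr="test" data-foo></div>';

// Collect the capture groups of every match (the g flag allows matchAll).
const found = [...sample.matchAll(tagRe)].map(function (m) {
  return m.slice(1);
});

// One entry per matched tag: [opening punctuation, tag name, attribute, closing punctuation].
console.log(found);
```

So the expression is enough to pick out the tag punctuation and name, while highlighting every attribute would still need a separate pass over the tag's text.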

A good tool I use to figure out what I am doing wrong with my regex is: http://regexr.com/

Hope this helps!
