
Here's HTML text I want to get attributes from without using DOM APIs:

     <div
      blah lorem
      foo-bar
            multi-line="
             foo
              bar
            "
          df"234   Yeah,that-is-an-attribute-too!_And-so-is-this-one!
                bar=" asdf"
                bar=  zxcv 
            foo=asdf
                aa=df-bar=()
                a-b=df-b"ar=()
                ac=df"-bar=()
                ad=df-bar=()
     ></div>
             

This needs to run in both Node and browsers, and using a regex keeps it small and lean compared to importing a DOMParser implementation in Node.

And here's the regex I have so far:

/(?:\s(?:[^'"/\s><]+?)[\s/>])|(?:\S+(?:\s*=\s*(?:(?:(['"])[\s\S]*?\1|([^\s>]+))|(?:[^'"\s>]+))))/g

It almost works. Sample:

const re = /(?:\s(?:[^'"/\s><]+?)[\s/>])|(?:\S+(?:\s*=\s*(?:(?:(['"])[\s\S]*?\1|([^\s>]+))|(?:[^'"\s>]+))))/g

const html = `
     <div
      blah lorem
      foo-bar
            multi-line="
             foo
              bar
            "
          df"234   Yeah,that-is-an-attribute-too!_And-so-is-this-one!
                bar=" asdf"
                bar=  zxcv 
            foo=asdf
                aa=df-bar=()
                a-b=df-b"ar=()
                ac=df"-bar=()
                ad=df-bar=()
     ></div>
             
`

const result = html.match(re).map(s => s.trim())

console.log(result)

Explore live here:
https://regexr.com/6p82g or https://regex101.com/r/1zOh1S/1

It is not picking up the lorem Boolean attribute, and the bar= zxcv attribute is erroneously being detected as two attributes.

If you delete the first part, (?:\s(?:[^'"/\s><]+?)[\s/>])|, then it almost works too and it selects all attributes except the boolean attributes (without =):
https://regexr.com/6p82j or https://regex101.com/r/iLOVpv/1

How can we make this pick all the attributes up correctly?


Aside: This question is not a duplicate of "RegEx match open tags except XHTML self-contained tags", whose answer is "do not use regexes for HTML because you can't". That answer does not solve this question, which specifically requires a regex. I've found the solution and posted it below.

trusktr

1 Answer

  • Replace the first [^'"/\s><] with [^/\s><=] so that the first alternative no longer matches the names of attributes that have values, and so that it also matches Boolean attributes with quotes in their names, such as foo"bar or foo'bar (those are totally valid).
  • Wrap [\s/>] in a positive lookahead, (?=[\s/>]), to exclude the delimiter from the actual match. The whitespace after one Boolean attribute is then no longer consumed, so it is still available as the leading \s of the next one. This fixes back-to-back Boolean attributes (for example, lorem) not being included.
  • Replace \S+ with \S+? so that, once capturing groups are added, an attribute like ad=df-bar=() is detected as name ad with value df-bar=() instead of name ad=df-bar with value ().
  • Delete the |(?:[^'"\s>]+) near the end, which doesn't do anything. This also lets us remove one non-capturing group wrapper (the remaining ones are kept to express intent).
  • Finally, update the groups so that we can capture the needed values.
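The lookahead and lazy-quantifier changes above can be seen in isolation with a small sketch. The regexes here are simplified fragments written for illustration, not the full expression, and the names `consuming`, `lookahead`, `greedy`, and `lazy` are made up for this example.

```javascript
// Fragment of the first alternative WITHOUT the lookahead: the delimiter
// after the name is consumed, so the space after "blah" is used up and
// "lorem" has no leading whitespace left to match against.
const consuming = /\s[^/\s><=]+?[\s/>]/g

// Same fragment WITH the lookahead: the delimiter is checked, not consumed.
const lookahead = /\s[^/\s><=]+?(?=[\s/>])/g

const sample = '<div blah lorem>'
const consumed = sample.match(consuming).map(s => s.trim())
const peeked = sample.match(lookahead).map(s => s.trim())

console.log(consumed) // [ 'blah' ]
console.log(peeked)   // [ 'blah', 'lorem' ]

// Greedy vs. lazy attribute name: greedy \S+ backtracks only to the LAST "=",
// while lazy \S+? stops at the FIRST "=".
const greedy = /(\S+)\s*=\s*([^\s>]+)/
const lazy = /(\S+?)\s*=\s*([^\s>]+)/

const g = 'ad=df-bar=()'.match(greedy)
const z = 'ad=df-bar=()'.match(lazy)

console.log(g[1], g[2]) // ad=df-bar ()
console.log(z[1], z[2]) // ad df-bar=()
```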

The final regex is:

/(?:\s([^/\s><=]+?)(?=[\s/>]))|(?:(\S+?)(?:\s*=\s*(?:(['"])([\s\S]*?)\3|([^\s>]+))))/g

Sample:

const re = /(?:\s([^/\s><=]+?)(?=[\s/>]))|(?:(\S+?)(?:\s*=\s*(?:(['"])([\s\S]*?)\3|([^\s>]+))))/g
//               ^ capture group 1: boolean attribute name (attributes without values)
//                                           ^ capture group 2: non-boolean attribute name
//                                                                    ^ capture group 4: non-boolean attribute value with quotes
//                                                                                 ^ capture group 5: non-boolean attribute value without quotes

const html = `
     <div
      blah lorem
      foo-bar
            multi-line="
             foo
              bar
            "
          df"234   Yeah,that-is-an-attribute-too!_And-so-is-this-one!
                bar=" asdf"
                bar=  zxcv 
            foo=asdf
                aa=df-bar=()
                a-b=df-b"ar=()
                ac=df"-bar=()
                ad=df-bar=()
     ></div>
             
`

const result = Array.from(html.matchAll(re))

for (let i = 0, l = result.length; i < l; i += 1) {
  const match = result[i]
  console.log('name: "' + (match[1] || match[2]) + '", value: "' + (match[4] || match[5] || "") + '"')
}

Explore live: https://regexr.com/6p8p0
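For reuse, the loop above could be wrapped in a small function. This is only a sketch: `parseAttributes` is a hypothetical name (not part of any library), and it assumes the input is a single opening tag, since the regex is not aware of surrounding context.

```javascript
// Hypothetical helper: returns [{ name, value }] for one opening tag.
// Assumes `tagSource` contains a single opening tag and nothing else.
function parseAttributes(tagSource) {
  const re = /(?:\s([^/\s><=]+?)(?=[\s/>]))|(?:(\S+?)(?:\s*=\s*(?:(['"])([\s\S]*?)\3|([^\s>]+))))/g
  const attributes = []

  for (const match of tagSource.matchAll(re)) {
    // Group 1: Boolean attribute name; group 2: name of an attribute with a value.
    const name = match[1] ?? match[2]
    // Group 4: quoted value; group 5: unquoted value; Boolean attributes get "".
    const value = match[4] ?? match[5] ?? ''
    attributes.push({ name, value })
  }

  return attributes
}

const attrs = parseAttributes('<div blah lorem bar=" asdf" foo=asdf ad=df-bar=()>')
console.log(attrs)
```

Using `??` instead of `||` keeps the intent explicit: an unmatched group is `undefined`, while an empty quoted value is a legitimate empty string.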

Peter Mortensen
trusktr
  • Still fails ` ` and ` ` (and probably many others). It also doesn't convert HTML entities. Coming up with your own expression is probably not something you want. There might be some already out there that have been tested for ages, but the best option is probably to use a complete DOM parser; HTML is a real beast. – Kaiido Jul 08 '22 at 02:13
  • And it's also obviously context unaware, so in ` – Kaiido Jul 08 '22 at 02:31
  • @Kaiido Thanks for the insight. In my case I'm running only on a single opening tag every time (no children involved). I'll state that in the question. – trusktr Jul 08 '22 at 09:07