Here's HTML text I want to get attributes from without using DOM APIs:
<div
blah lorem
foo-bar
multi-line="
foo
bar
"
df"234 Yeah,that-is-an-attribute-too!_And-so-is-this-one!
bar=" asdf"
bar= zxcv
foo=asdf
aa=df-bar=()
a-b=df-b"ar=()
ac=df"-bar=()
ad=df-bar=()
></div>
This needs to run in both Node and browsers, and a regex keeps it small and lean compared to importing a DOMParser implementation in Node.
And here's the regex I have so far:
/(?:\s(?:[^'"/\s><]+?)[\s/>])|(?:\S+(?:\s*=\s*(?:(?:(['"])[\s\S]*?\1|([^\s>]+))|(?:[^'"\s>]+))))/g
It almost works. Sample:
const re = /(?:\s(?:[^'"/\s><]+?)[\s/>])|(?:\S+(?:\s*=\s*(?:(?:(['"])[\s\S]*?\1|([^\s>]+))|(?:[^'"\s>]+))))/g
const html = `
<div
blah lorem
foo-bar
multi-line="
foo
bar
"
df"234 Yeah,that-is-an-attribute-too!_And-so-is-this-one!
bar=" asdf"
bar= zxcv
foo=asdf
aa=df-bar=()
a-b=df-b"ar=()
ac=df"-bar=()
ad=df-bar=()
></div>
`
const result = html.match(re).map(s => s.trim())
console.log(result)
Explore live here:
https://regexr.com/6p82g or https://regex101.com/r/1zOh1S/1
It is not picking up the lorem boolean attribute, and the bar= zxcv attribute is erroneously being detected as two attributes.
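A likely reason lorem is missed, shown here as a minimal sketch: the first alternative ends in [\s/>], which consumes the whitespace after blah, so no leading \s is left for lorem to match. The snippet isolates that first alternative on a shortened input. The lookahead variant only illustrates the effect and is not a complete fix; the names consuming, lookahead, and text are made up for this example.

```javascript
// First alternative of the regex above, on its own.
const consuming = /\s(?:[^'"/\s><]+?)[\s/>]/g
// Same pattern, but the trailing delimiter is only asserted, not consumed.
const lookahead = /\s(?:[^'"/\s><]+?)(?=[\s/>])/g

// Shortened stand-in for the "blah lorem" line of the sample.
const text = ' blah lorem '

console.log(text.match(consuming)) // [ ' blah ' ]
console.log(text.match(lookahead)) // [ ' blah', ' lorem' ]
```

Note that this alternative also allows = inside the name, so a token such as bar= followed by whitespace can satisfy it on its own.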
If you delete the first part, (?:\s(?:[^'"/\s><]+?)[\s/>])|, then it almost works too: it selects all attributes except the boolean attributes (those without =):
https://regexr.com/6p82j or https://regex101.com/r/iLOVpv/1
How can we make this pick all the attributes up correctly?
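One direction that might work, as a sketch rather than a definitive answer: match an attribute name first and make the whole = value part optional, so boolean attributes and valued attributes go through a single pattern. The helper name parseAttributes and the pattern attrRe are hypothetical, and the sketch assumes the text between the tag name and the closing > has already been isolated and that no > appears inside a value.

```javascript
// Hypothetical helper, not from the original post. It expects only the text
// between the tag name and the closing ">" of an opening tag.
function parseAttributes(attrText) {
  // A name, then optionally "=" and a double-quoted, single-quoted,
  // or unquoted value. Whitespace around "=" is allowed.
  const attrRe = /([^\s=>\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?/g
  const attrs = []
  for (const m of attrText.matchAll(attrRe)) {
    // At most one of groups 2-4 is set; none is set for boolean attributes.
    const value = [m[2], m[3], m[4]].find(v => v !== undefined)
    attrs.push({ name: m[1], value: value === undefined ? null : value })
  }
  return attrs
}

const html = `<div
blah lorem
foo-bar
multi-line="
foo
bar
"
df"234 Yeah,that-is-an-attribute-too!_And-so-is-this-one!
bar=" asdf"
bar= zxcv
foo=asdf
aa=df-bar=()
a-b=df-b"ar=()
ac=df"-bar=()
ad=df-bar=()
></div>`

// Assumption: the first ">" ends the opening tag, which holds for this sample.
const attrText = html.slice(html.indexOf('<div') + 4, html.indexOf('>'))
const attrs = parseAttributes(attrText)

console.log(attrs.length) // 13
console.log(attrs.map(a => a.name))
```

In this sketch a stray quote inside a name, as in df"234, is simply kept as part of the name, and bar= zxcv comes out as one attribute with the value zxcv. A > inside a quoted value or a self-closing / would need extra handling.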
Aside: This question is not a duplicate of "RegEx match open tags except XHTML self-contained tags", whose answer is "do not use regexes for HTML because you can't". That answer does not solve this question, which specifically requires a regex solution. I've found the solution and posted it below.