This is my best RegEx to extract properties in HTML Tag:
# Trim the match inside of the quotes (single or double)
(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2
# Without trim
(\S+)\s*=\s*([']|["])([\W\w]*?)\2
Pros:
- You are able to trim the content inside of quotes.
- Match all the special ASCII characters inside of the quotes.
- If you have title="You're mine" the RegEx does not broken
Cons:
- It returns 3 groups; first the property then the quote ("|') and at the end the property inside of the quotes i.e.:
<div title="You're">
the result is Group 1: title, Group 2: ", Group 3: You're.
This is the online RegEx example:
https://regex101.com/r/aVz4uG/13
I normally use this RegEx to extract the HTML Tags:
I recommend this if you don't use a tag type like <div
, <span
, etc.
<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
For example:
<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">
This is the online RegEx example:
https://regex101.com/r/aVz4uG/15
The bug in this RegEx is:
<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
In this tag:
<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>
Returns <div '>
but it should not return any match:
Match: <div '>
To "solve" this remove the [^/]+?
pattern:
<div(?:\".*?\"|'.*?'|.*?)*?>
The answer #317081 is good but it not match properly with these cases:
<div id="a"> # It returns "a instead of a
<div style=""> # It doesn't match instead of return only an empty property
<div title = "c"> # It not recognize the space between the equal (=)
This is the improvement:
(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?
vs
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Avoid the spaces between equal signal:
(\S+)\s*=\s*((?:...
Change the last + and . for:
|[>"']))?[^"']*)["']?
This is the online RegEx example:
https://regex101.com/r/aVz4uG/8