I need to get all HTML tags, without their attributes, from a string. I tried the regex < *([^/][^ ]*).*?> but it still matches each tag together with its attributes.

Can anyone help me find a regex that does this?

Example:

From <html><head></head><body class="body"><a href="abc.html"></a></body>, I want to get <html><head></head><body><a></a></body>.

I would also like a regex that gets only the tag names, so that the result is:

html head head body a a body

Thanks all.

Nho Huynh
  • 1
    possible duplicate of [Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la) – unwind Feb 21 '14 at 11:27
  • 2
    It's not a duplicate. You can't parse the tree structure of HTML/XML with regular expressions. However, you can tokenize it with regular expressions, and that's all that is needed here. – Stefan Haustein Feb 21 '14 at 13:50

1 Answer

While it's not a good idea in general to try to parse HTML with a regular expression, in this case it works.

Try the following replacement (the slashes inside the pattern are escaped because / is also the delimiter):

s/<( *\w+)( [^>\/]+)?(\/?)>/<$1$3>/g

This matches the opening angle bracket, then captures optional spaces and the word characters ([A-Za-z0-9_]) that follow, which form the tag name. If a space comes next, followed by any characters that are neither a slash nor a closing angle bracket, it matches those as the attributes. Finally, it captures an optional slash and matches the closing angle bracket.

It replaces this with an opening angle bracket, the captured tag, the captured optional slash, and a closing angle bracket.

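This assumes there are no opening or closing angle brackets that are not part of a tag. Because the attribute part excludes slashes, a tag whose attribute values contain a slash (for example href="/abc.html") does not match and is left unchanged. Closing tags such as </body> do not match either, which is fine because they carry no attributes.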
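
For anyone who wants to try this outside Perl or sed, here is a minimal sketch in Python. It is a translation of the substitution above rather than part of the original answer: \1 and \3 stand in for $1 and $3. The function names strip_attributes and tag_names are made up for illustration, and the second pattern (for the tag names only) is an assumption about what the question wants, with the same limitations as the first.

```python
import re

# Same pattern as the substitution above; Python needs no escaping of "/".
ATTR_STRIP = re.compile(r"<( *\w+)( [^>/]+)?(/?)>")

# Hypothetical pattern for the second request: "<", an optional slash,
# optional whitespace, then the tag name as word characters.
TAG_NAME = re.compile(r"</?\s*(\w+)")


def strip_attributes(html):
    """Return html with attributes removed from tags the pattern matches."""
    # Group 1 is the tag name, group 3 the optional self-closing slash.
    return ATTR_STRIP.sub(r"<\1\3>", html)


def tag_names(html):
    """Return the names of opening and closing tags in document order."""
    return TAG_NAME.findall(html)


sample = '<html><head></head><body class="body"><a href="abc.html"></a></body>'
print(strip_attributes(sample))     # <html><head></head><body><a></a></body>
print(" ".join(tag_names(sample)))  # html head head body a a body
```

Like the substitution itself, this only tokenizes the string. It does not check nesting, and it will be confused by angle brackets in text or attribute values.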

SQB