HTML tag regex doesn't work

Question

Why doesn't this code return ""? What regex should I use to remove all tags from an HTML file?

x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");

Thanks!

[Only Chuck Norris can parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Reimeus, Dec 22 '14 at 20:20
Then what should I use? I don't really want to parse it, I'm not interested in the content of the HTML tags, I want to remove all tags from a HTML file. — yonutix, Dec 22 '14 at 20:22
@CosminMihai Very bad solution, as you should use a HTML parser for that, but did you try with `<.+?>`. [DEMO](http://ideone.com/DLyZ4j) — BackSlash, Dec 22 '14 at 20:26
I want to remove the HTML tags, not to parse them, I'm trying to obtain raw data, please read the question I want to "REPLACE" with "" meaning that I want to delete them — yonutix, Dec 22 '14 at 20:28
@CosminMihai A HTML parser is able to remove all tags in a much cleaner way than with regexes. — BackSlash, Dec 22 '14 at 20:29

score 4 · Answer 1 · answered Dec 22 '14 at 20:28

> I want to remove the HTML tags

You could simply use an HTML parsing library such as jsoup. Here is an example:

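import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parse the markup, then extract only the text content.
Document doc = Jsoup.parse("<html><h3><a href=\"#\">current community</a></h3></html>");
System.out.println(doc.text());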

Output:

current community

Reimeus

Thanks, I will try this solution later. For the moment I need something quick to use, and I think jsoup needs to be downloaded. – yonutix Dec 22 '14 at 20:34

score 3 · Answer 2 · answered Dec 22 '14 at 20:47

I will agree with everyone else that attempting to use a regex to parse HTML is a bad idea. (And I think that's true even if all you're doing is removing the tags; things like comments and !CDATA will complicate any attempt at a simple solution.) However, I think it's useful to explain why your solution didn't produce the results you expected (because this applies to other situations where regexes are more appropriate).

By default, the * and + quantifiers are greedy, which means they will match as many characters as they can. Thus, in your example:

x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");

I think this is what you meant:

String x = "<h3><a href=\"#\">current community</a></h3>";
x = x.replaceAll("<.*>", "");

When the matching engine searches for your pattern, it finds < as the first character of x. Then it looks for a sequence of zero or more characters that can be anything, followed by >. But since it's a greedy quantifier, if there's a choice of more than one > it can pick, it will pick the one that makes .* match the longest possible string. In your case, that means that it will pick the > which is the last character of x. The effect is that the entire string is replaced by "".
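To see the greedy behaviour for yourself, here is a minimal sketch. The class and method names (`GreedyDemo`, `stripGreedy`) are made up for illustration; only `String.replaceAll` comes from the JDK.

```java
public class GreedyDemo {
    // Greedy: .* extends to the LAST '>' it can reach, so a single match
    // can swallow several tags plus all the text between them.
    static String stripGreedy(String html) {
        return html.replaceAll("<.*>", "");
    }

    public static void main(String[] args) {
        String x = "<h3><a href=\"#\">current community</a></h3>";
        // The one match runs from the first '<' to the final '>'.
        System.out.println("[" + stripGreedy(x) + "]");                 // prints "[]"
        // Text between the first '<' and the last '>' is lost as well.
        System.out.println("[" + stripGreedy("a <b>bold</b> c") + "]"); // prints "[a  c]"
    }
}
```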

To make it match the smallest possible string, add ? to make it a "reluctant quantifier":

x = x.replaceAll("<.*?>", "");

Another solution is to tell the matcher not to include > when matching "any character":

x = x.replaceAll("<[^>]*>", "");

[^>] means "match any character except >". For HTML/XML/SGML, the regex I would choose is neither of the above, since you shouldn't use regular expressions to parse complex structures like that.
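A small sketch comparing the two patterns, with made-up names (`TagStripDemo`, `stripReluctant`, `stripNegated`). It also shows two cases where they differ or fail, which is part of why a parser is the safer choice: by default `.` does not match line terminators in Java, and neither pattern understands a `>` inside a quoted attribute value.

```java
public class TagStripDemo {
    // Reluctant quantifier: stop at the first '>' that completes a match.
    static String stripReluctant(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Negated character class: consume anything that is not '>'.
    static String stripNegated(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String x = "<h3><a href=\"#\">current community</a></h3>";
        System.out.println(stripReluctant(x)); // prints "current community"
        System.out.println(stripNegated(x));   // prints "current community"

        // A tag split across lines: '.' skips line terminators by default,
        // so only the negated class removes the opening tag here.
        String multiline = "<a\nhref=\"#\">link</a>";
        System.out.println(stripNegated(multiline)); // prints "link"

        // A '>' inside an attribute value ends the match too early for both.
        System.out.println(stripNegated("<a title=\"a>b\">x</a>")); // prints b">x
    }
}
```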

score 2 · Accepted Answer · 2014-12-22T20:41:52.610

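Disclaimer: You shouldn't use regex to parse HTML.

But if you insist, try a regex like this one. It is written in expanded form, so the whitespace is only there for readability and the pattern needs a free-spacing mode (in Java, Pattern.COMMENTS):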

 <
 (?:
      (?:
           /? 
           \w+ 
           \s* 
           /? 
      )
   |  
      (?:
           \w+ 
           \s+ 
           (?:
                (?:
                     (?: " [\S\s]*? " )
                  |  (?: ' [\S\s]*? ' )
                )
             |  (?: [^>]*? )
           )+
           \s* 
           /? 
      )
   |  
      \?
      [\S\s]*? 
      \?
   |  
      (?:
           !
           (?:
                (?:
                     DOCTYPE
                     [\S\s]*? 
                )
             |  (?:
                     \[CDATA\[
                     [\S\s]*? 
                     \]\]
                )
             |  (?:
                     --
                     [\S\s]*? 
                     --
                )
             |  (?:
                     ATTLIST
                     [\S\s]*? 
                )
             |  (?:
                     ENTITY
                     [\S\s]*? 
                )
             |  (?:
                     ELEMENT
                     [\S\s]*? 
                )
           )
      )
 )
 >
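A pattern laid out like this only works when the engine ignores whitespace inside the pattern. In Java that means compiling with `Pattern.COMMENTS`. The sketch below is not the full pattern above: it is a cut-down, hypothetical variant that handles only comments and ordinary element tags, and the names `VerboseTagRegex`, `TAG` and `strip` are invented for the example.

```java
import java.util.regex.Pattern;

public class VerboseTagRegex {
    // Simplified subset for illustration: comments and plain element tags only.
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // '#' starts a comment that runs to the end of the line.
    static final Pattern TAG = Pattern.compile(
        "<\n" +
        "(?:\n" +
        "     !-- [\\S\\s]*? --      # comment body, matched reluctantly\n" +
        "  |  /? \\w+                # tag name, optionally a closing tag\n" +
        "     (?: \\s+ (?: \" [^\"]* \" | ' [^']* ' | [^>\"'] )* )?  # attributes\n" +
        "     \\s* /?                # optional self-closing slash\n" +
        ")\n" +
        ">",
        Pattern.COMMENTS);

    static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<h3><a href=\"#\">current community</a></h3>"));
        // Quoted attribute values and comments may contain '>' here.
        System.out.println(strip("<a title=\"a>b\">x</a><!-- a > b -->"));
    }
}
```

Matching quoted attribute values as a unit is what lets a `>` inside quotes survive, but this is still pattern matching and not parsing, so malformed markup can defeat it.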

@CosminMihai: OK, added DOCTYPE. It's a little XML-ish. – Dec 22 '14 at 20:43
