1

Why doesn't this code return ""? What regex should I use to remove all tags from an HTML file?

x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");

Thanks!

yonutix
  • [Only Chuck Norris can parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Reimeus Dec 22 '14 at 20:20
  • Seriously, don't parse HTML with regex – keyser Dec 22 '14 at 20:20
  • Then what should I use? I don't really want to parse it. I'm not interested in the content of the HTML tags; I want to remove all tags from an HTML file. – yonutix Dec 22 '14 at 20:22
  • What are you trying to do? –  Dec 22 '14 at 20:23
  • I want to remove the HTML tags in order to obtain the text – yonutix Dec 22 '14 at 20:24
  • @CosminMihai This is a very bad solution, since you should use an HTML parser for that, but did you try `<.+?>`? [DEMO](http://ideone.com/DLyZ4j) – BackSlash Dec 22 '14 at 20:26
  • I want to remove the HTML tags, not parse them. I'm trying to obtain the raw data. Please read the question: I want to "REPLACE" them with "", meaning that I want to delete them. – yonutix Dec 22 '14 at 20:28
  • @CosminMihai An HTML parser is able to remove all tags in a much cleaner way than regexes can. – BackSlash Dec 22 '14 at 20:29

3 Answers

4

I want to remove the HTML tags

You could simply use an HTML parsing library such as Jsoup. Here is an example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc =
     Jsoup.parse("<html><h3><a href=\"#\">current community</a></h3></html>");
System.out.println(doc.text());

Output:

current community
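
If adding a third-party jar is not possible right away, the JDK itself ships a lenient HTML parser under `javax.swing.text.html`. What follows is only a sketch of that alternative, not something from the Jsoup example: the class name `JdkTagStripper` and the method `textOf` are made up for illustration, and the Swing parser was written for old HTML (3.2), so it may handle modern markup differently from Jsoup or a browser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class JdkTagStripper {
    // Hypothetical helper: collects only the text nodes reported by the JDK's HTML parser.
    static String textOf(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data found between tags.
                text.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            textOf("<html><h3><a href=\"#\">current community</a></h3></html>"));
    }
}
```

Text from separate elements is appended without any separator here, so a real use would probably want to insert spaces or newlines between chunks.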
Reimeus
  • Thanks, I will try this solution later, for the moment I need something fast to use, Jsoup needs to be downloaded I think. – yonutix Dec 22 '14 at 20:34
3

I agree with everyone else that attempting to use a regex to parse HTML is a bad idea. (And I think that's true even if all you're doing is removing the tags; things like comments and CDATA sections will complicate any attempt at a simple solution.) However, I think it's useful to explain why your solution didn't produce the results you expected, because this applies to other situations where regexes are more appropriate.

By default, the * and + quantifiers are greedy, which means they match as many characters as they can. Consider your example:

x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");

I think this is what you meant:

String x = "<h3><a href=\"#\">current community</a></h3>";
x = x.replaceAll("<.*>", "");

When the matching engine searches for your pattern, it finds < as the first character of x. Then it looks for a sequence of zero or more characters, each of which can be anything (except a line terminator, by default), followed by >. But since * is a greedy quantifier, if there is more than one > it could pick, it picks the one that makes .* match the longest possible string. In your case, that is the > that is the last character of x, so the entire string is matched and replaced by "".
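
To see the greedy behaviour directly, here is a small sketch that asks the matcher what `<.*>` actually matched. The class name `GreedyDemo` and the helper `firstMatch` are illustrative names, not anything from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String x = "<h3><a href=\"#\">current community</a></h3>";

        // The greedy .* runs on to the last '>', so the one and only match is the whole string.
        System.out.println(x.equals(firstMatch("<.*>", x)));     // true

        // Replacing that single match with "" therefore leaves nothing behind.
        System.out.println(x.replaceAll("<.*>", "").isEmpty());  // true
    }
}
```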

To make it match the smallest possible string, add ? to make it a "reluctant quantifier":

x = x.replaceAll("<.*?>", "");

Another solution is to tell the matcher not to include > when matching "any character":

x = x.replaceAll("<[^>]*>", "");

[^>] means "match any character except >". For HTML/XML/SGML, though, the regex I would choose is neither of the above, since you shouldn't use regular expressions to parse complex structures like that.
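
As a rough illustration of how the two patterns behave, and of why neither is safe on real markup, the sketch below runs both on three inputs. The class and method names are invented for this example, and the second and third inputs are my own test strings rather than anything from the question.

```java
public class TagPatternComparison {
    static String stripReluctant(String s) { return s.replaceAll("<.*?>", ""); }
    static String stripNegated(String s)   { return s.replaceAll("<[^>]*>", ""); }

    public static void main(String[] args) {
        // On the string from the question, both patterns leave just the text.
        String simple = "<h3><a href=\"#\">current community</a></h3>";
        System.out.println(stripReluctant(simple)); // current community
        System.out.println(stripNegated(simple));   // current community

        // '.' does not match '\n' by default, so the reluctant pattern misses a tag
        // that spans two lines, while [^>] matches the newline and removes it.
        String multiLine = "<a\nhref=\"#\">link</a>";
        System.out.println(stripReluctant(multiLine).equals("<a\nhref=\"#\">link")); // true
        System.out.println(stripNegated(multiLine).equals("link"));                  // true

        // A '>' inside an attribute value ends the match too early for both patterns.
        String tricky = "<img alt=\"a > b\">text";
        System.out.println(stripReluctant(tricky).equals(" b\">text")); // true
        System.out.println(stripNegated(tricky).equals(" b\">text"));   // true
    }
}
```

Compiling the reluctant pattern with `Pattern.DOTALL` would fix the multi-line case, but not the attribute case.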

ajb
2

Disclaimer: You shouldn't use regex to parse html.

But, if you insist, try this:

Find: "<(?:(?:/?\\w+\\s*/?)|(?:\\w+\\s+(?:(?:(?:\"[\\S\\s]*?\")|(?:'[\\S\\s]*?'))|(?:[^>]*?))+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
Replace: ""

 <
 (?:
      (?:
           /? 
           \w+ 
           \s* 
           /? 
      )
   |  
      (?:
           \w+ 
           \s+ 
           (?:
                (?:
                     (?: " [\S\s]*? " )
                  |  (?: ' [\S\s]*? ' )
                )
             |  (?: [^>]*? )
           )+
           \s* 
           /? 
      )
   |  
      \?
      [\S\s]*? 
      \?
   |  
      (?:
           !
           (?:
                (?:
                     DOCTYPE
                     [\S\s]*? 
                )
             |  (?:
                     \[CDATA\[
                     [\S\s]*? 
                     \]\]
                )
             |  (?:
                     --
                     [\S\s]*? 
                     --
                )
             |  (?:
                     ATTLIST
                     [\S\s]*? 
                )
             |  (?:
                     ENTITY
                     [\S\s]*? 
                )
             |  (?:
                     ELEMENT
                     [\S\s]*? 
                )
           )
      )
 )
 >
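
For anyone who wants to try the pattern, here is a minimal sketch of how it might be used from Java. The regex is the Find string above, split across several literals only for readability; the class name `TagRegexDemo`, the method `stripTags`, and the sample inputs are hypothetical. The nested quantifiers may backtrack heavily on malformed input, so treat this as an experiment rather than a production-ready solution.

```java
import java.util.regex.Pattern;

public class TagRegexDemo {
    // The Find pattern from above, compiled once and reused.
    private static final Pattern TAG = Pattern.compile(
        "<(?:(?:/?\\w+\\s*/?)"
        + "|(?:\\w+\\s+(?:(?:(?:\"[\\S\\s]*?\")|(?:'[\\S\\s]*?'))|(?:[^>]*?))+\\s*/?)"
        + "|\\?[\\S\\s]*?\\?"
        + "|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)"
        + "|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>");

    // Replaces every match of the pattern with the empty string.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The string from the question.
        System.out.println(stripTags("<h3><a href=\"#\">current community</a></h3>")); // current community

        // A comment containing '>' is consumed as a whole by the "--" alternative.
        System.out.println(stripTags("<!-- a > b -->text")); // text

        // A self-closing tag is handled by the first alternative.
        System.out.println(stripTags("line<br/>break")); // linebreak
    }
}
```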