How to get text without HTML tags from a known HTML element

Question

I need to retrieve the contents (a string without HTML tags) of the HTML element <div class='important-contents'>...</div> from an HTML string.

Currently I can extract all the text using the following code.

  string htmlString= "<html>...</html>";
  Regex regex = new Regex("\\<[^\\>]*\\>");
  return regex.Replace(htmlString, String.Empty);

How do I restrict this to the contents of the element with the important-contents class?

I don't think regex is the best route here, there are html classes from memory that can get contents of the tags... — Austin T French, Apr 15 '15 at 14:13
Please make sure to read first 20+ answers to [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/) to construct reasonable RegEx. — Alexei Levenkov, Apr 15 '15 at 14:13
@AustinFrench obviously it is not. But OP should already know that (having good number of HTML and C# questions and answers). So while something like HtmlAgilityPack is definitely good approach, I think this should stay as an exercise in creating regular expressions... — Alexei Levenkov, Apr 15 '15 at 14:16
Thank you men, I have heard of `HtmlAgilityPack` but I wonder if including a library only to get a content inside a div is heavy! And I confirm that my html file is well known and will contain a div with a known class... — Bellash, Apr 15 '15 at 14:24
If the only requirement for the content is to be inside some arbitrary HTML tag with the `important-contents` class name, a RegEx is a terrible idea. However, if it's always going to be a `div` like the OP wrote, it should be easy for the correct RegEx to handle. **But** the OP never said he wanted to use a RegEx for it... — Zohar Peled, Apr 15 '15 at 14:24
Why not traverse the html script via nodes like one would do with XML. http://stackoverflow.com/questions/1157258/find-specific-data-in-html-with-htmlelementcollection-and-webbrowser and https://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement(v=vs.110).aspx — James Shaw, Apr 15 '15 at 14:38

Wiktor Stribiżew · Accepted Answer · 2015-04-15T14:48:49.040

You can match what is inside the DIV tag using this regex, which relies on a variable-width look-behind (supported by the .NET regex engine):

(?s)(?<=<div\s[^>]*?class=["']?important-contents["']?[^>]*?>).*?(?=</div>)

Then you can use this regex to remove all tags inside the matched DIV contents:

</?[^>]+>

To remove <script> tags that may find their way to the DIV tag, let's introduce another step:

(?s)<script[^>]*?>.*?</script>

I do not know of a way to match discontinuous texts, so it can only be done in {2,} steps.

DISCLAIMER: if you have "malformed" HTML, you can get weird results, or no match at all.

Sample code:

var div_rgx = new Regex(@"(?si)(?<=<div\s[^>]*?class=[""']?important-contents[""']?[^>]*?>).*?(?=</div>)");
var tag_rgx = new Regex(@"</?[^>]+>");
var script_rgx = new Regex(@"(?s)<script[^>]*?>.*?</script>");
var txt = "<html>\r\n<body>\r\n<div class='important-contents'>\r\n<script>function getV(str) { return 0; }</script>\r\n<span>My <i>text</i><font face=\"Verdana\">.</font></span>\r\n</div>\r\n</body>\r\n</html>";
var result = div_rgx.Match(txt);
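if (result.Success)
{
    // A variable declaration cannot be the sole statement of an if without braces.
    // Strip <script> blocks first, then the remaining tags.
    var final = tag_rgx.Replace(script_rgx.Replace(result.Value, string.Empty), string.Empty).Trim();
}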

Output:

My text.

What if the div has a script block inside? (And I've seen this type of horrible HTML in my days. It's not just to trip you up; things like this exist all over the internet.) — Zohar Peled, Apr 15 '15 at 14:41
Thanks @stribizhev, but this shows only the contents that are not inside any HTML tag. For example, for `Hello World` it returns `Hello` when it should return `Hello World`. In addition, after `if (condition)` there must be a `{` when the first line declares a variable. I edited your code as follows: `if (result.Success){ var final =tag_rgx.Replace(script_rgx.Replace(result.Value, string.Empty),string.Empty).Trim();}` — Bellash, Apr 16 '15 at 10:37

score 0 · Answer 2 · answered Apr 15 '15 at 14:35

Use `'important-contents'>` as an anchor that is matched but not captured, then consume all text until a `<` is hit, for example:

(?:'important-contents'>)(?<Content>[^<]+)

In the above I have placed all of the contents into a Named Match Capture Group named "Content" for easier extraction.
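
Reading the named group from C# could look like the following minimal sketch. The class and method names (`ContentGroupDemo`, `ExtractContent`) are made up for illustration, and the character class is written as `[^<]+` so that matching stops at the next `<`, as the description says.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentGroupDemo
{
    // Returns the text that follows 'important-contents'> up to the next '<',
    // or null when the anchor is not found. (Hypothetical helper, not from the answer.)
    public static string ExtractContent(string html)
    {
        var match = Regex.Match(html, @"(?:'important-contents'>)(?<Content>[^<]+)");
        return match.Success ? match.Groups["Content"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Prints "Hello": the match ends at the first inner tag.
        Console.WriteLine(ExtractContent("<div class='important-contents'>Hello <b>World</b></div>"));
    }
}
```

Note that this only captures the text before the first nested tag, so it fits only when the div is known to contain plain text.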

score 0 · Answer 3 · edited May 23 '17 at 12:14

First of all, a regex cannot extract a string without HTML tags in the general case, because the HTML grammar is not regular. You have two choices:

Use a full HTML parser and work with the DOM (see, for example, the answers to "What is the best way to parse html in C#?").

Accept some trade-offs, for example that <div class='important-contents'> will not contain inner HTML tags. With that restriction, the solution may look like this:

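// Lazy match so each div is captured separately; assumes no nested tags inside the div.
var regex = "<div class='important-contents'>(?<important>.*?)</div>";
// Singleline lets '.' match line breaks inside the div.
MatchCollection matches = Regex.Matches(htmlString, regex, RegexOptions.Singleline);
foreach (Match m in matches)
{
    Console.WriteLine(m.Groups["important"].Value);
}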
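
The first choice (a full parser) might look roughly like this with HtmlAgilityPack, the library mentioned in the comments on the question. This is a sketch, not a definitive implementation: it assumes the HtmlAgilityPack NuGet package is installed, and the names `ImportantContents` and `GetText` are invented for the example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base class library

public static class ImportantContents
{
    // Returns the tag-free text of the first div carrying the important-contents
    // class, or null if there is no such div. (Hypothetical helper.)
    public static string GetText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match the class as a whole token, so class="a important-contents b" also works.
        var div = doc.DocumentNode.SelectSingleNode(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' important-contents ')]");
        if (div == null)
            return null;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = div.SelectNodes(".//script");
        if (scripts != null)
        {
            // Copy to a list before removing nodes from the tree.
            foreach (var script in scripts.ToList())
                script.Remove();
        }

        // InnerText concatenates the text nodes; DeEntitize decodes entities such as &amp;.
        return HtmlEntity.DeEntitize(div.InnerText).Trim();
    }
}
```

Unlike the regex versions, this does not depend on the quoting style of the class attribute and it handles nested divs, at the cost of an extra dependency.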
