0

I need to retrieve the contents (as a string without HTML tags) of the HTML element <div class='important-contents'>...</div> from an HTML string.

Currently I can extract all of the text using the following code.

  string htmlString= "<html>...</html>";
  Regex regex = new Regex("\\<[^\\>]*\\>");
  return regex.Replace(htmlString, String.Empty); 

How do I restrict this to the contents of the element with the important-contents class?

Bellash
  • I don't think regex is the best route here, there are html classes from memory that can get contents of the tags... – Austin T French Apr 15 '15 at 14:13
  • Please make sure to read first 20+ answers to [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/) to construct reasonable RegEx. – Alexei Levenkov Apr 15 '15 at 14:13
  • 2
    @AustinFrench obviously it is not. But OP should already know that (having good number of HTML and C# questions and answers). So while something like HtmlAgilityPack is definitely good approach, I think this should stay as an exercise in creating regular expressions... – Alexei Levenkov Apr 15 '15 at 14:16
  • Thank you all. I have heard of `HtmlAgilityPack`, but including a library only to get the content of one div seems heavy! And I confirm that my HTML file has a known structure and will contain a div with a known class... – Bellash Apr 15 '15 at 14:24
  • If the only requirement for the content is to be inside some arbitrary HTML tag with the `important-contents` class name, a RegEx is a terrible idea. However, if it's always going to be a `div` like the OP wrote, it should be easy for the correct RegEx to handle. **But** the OP never said he wanted to use a RegEx for it... – Zohar Peled Apr 15 '15 at 14:24
  • Why not traverse the html script via nodes like one would do with XML. http://stackoverflow.com/questions/1157258/find-specific-data-in-html-with-htmlelementcollection-and-webbrowser and https://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement(v=vs.110).aspx – James Shaw Apr 15 '15 at 14:38

3 Answers

1

You can match what is inside the DIV tag using this regex, which relies on a variable-width lookbehind (supported by the .NET regex engine):

(?s)(?<=<div\s[^>]*?class=["']?important-contents["']?[^>]*?>).*?(?=</div>)

Then, to strip all tags from the matched DIV contents, you can use this regex:

</?[^>]+>

To remove <script> blocks that may appear inside the DIV, add another step, applied before the tag-stripping one:

(?s)<script[^>]*?>.*?</script>

I do not know of a way to match discontinuous texts, so it can only be done in {2,} steps.

DISCLAIMER: if you have "malformed" HTML, you can get weird results, or no match at all.

Sample code:

using System.Text.RegularExpressions;
// Content between the opening tag of the target div and the first </div>.
var div_rgx = new Regex(@"(?si)(?<=<div\s[^>]*?class=[""']?important-contents[""']?[^>]*?>).*?(?=</div>)");
var tag_rgx = new Regex(@"</?[^>]+>");                       // any opening or closing tag
var script_rgx = new Regex(@"(?s)<script[^>]*?>.*?</script>"); // whole <script> blocks
var txt = "<html>\r\n<body>\r\n<div class='important-contents'>\r\n<script>function getV(str) { return 0; }</script>\r\n<span>My <i>text</i><font face=\"Verdana\">.</font></span>\r\n</div>\r\n</body>\r\n</html>";
var result = div_rgx.Match(txt);
if (result.Success)
{
    // Remove script blocks first, then strip the remaining tags.
    var final = tag_rgx.Replace(script_rgx.Replace(result.Value, string.Empty), string.Empty).Trim();
}

Output:

My text.

Wiktor Stribiżew
  • what if the div has a script block inside? (And I've seen this type of horrible HTML in my days, it's not just to tackle you, things like this exists all over the internet) – Zohar Peled Apr 15 '15 at 14:41
  • OK, I added this step to the answer. – Wiktor Stribiżew Apr 15 '15 at 14:48
  • Thanks @stribizhev, but this shows only the contents that are not inside any HTML tag. For example, when the input is `
    Hello World
    ` it returns `Hello` while it should return `Hello World`. In addition, a `{` is needed after `if(condition)` when the first line declares a variable. I edited your code as follows: `if (result.Success){ var final =tag_rgx.Replace(script_rgx.Replace(result.Value, string.Empty),string.Empty).Trim();}`
    – Bellash Apr 16 '15 at 10:37
0

Use "'important-contents'>" as an anchor that is matched but not captured, then consume all text until a < is hit, such as

(?:'important-contents'\>)(?<Content>[^<]+)

In the above I have placed all of the contents into a named capture group called "Content" for easier extraction.
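As a rough illustration of reading that named group from C#, here is a minimal sketch. The class and method names (`ContentGroupDemo`, `ExtractContent`) are hypothetical, and the character class is written as `[^<]` so that the group stops at the next `<`, as the description says.

```csharp
using System;
using System.Text.RegularExpressions;

static class ContentGroupDemo
{
    // Hypothetical helper: returns the text that follows 'important-contents'>
    // up to (not including) the next '<', or an empty string if there is no match.
    public static string ExtractContent(string html)
    {
        Match m = Regex.Match(html, @"(?:'important-contents'>)(?<Content>[^<]+)");
        return m.Success ? m.Groups["Content"].Value : string.Empty;
    }

    static void Main()
    {
        // prints "Hello World"
        Console.WriteLine(ExtractContent("<div class='important-contents'>Hello World</div>"));
    }
}
```

Note that this only captures the text up to the first inner tag, so it is suitable only when the div holds plain text.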

ΩmegaMan
  • 29,542
  • 12
  • 100
  • 122
0

First of all, a regex cannot, in the general case, extract a string without HTML tags, because the HTML grammar is not regular. You have two choices:

  1. Use a full HTML parser and work with the DOM (see, for example, the answers to "What is the best way to parse html in C#?").
  2. Accept some trade-offs; for example, assume that <div class='important-contents'> contains no inner HTML tags. With that restriction, the solution may look like this:

    // Lazy quantifier: stop at the first </div> rather than the last one on the line.
    var regex = "<div class='important-contents'>(?<important>.*?)</div>";
    MatchCollection matches = Regex.Matches(htmlString, regex);
    foreach(Match m in matches){
        Console.WriteLine(m.Groups["important"].ToString());
    }
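
For the first choice, the DOM-style traversal can be sketched with the standard library alone, under a strong assumption: the markup is well-formed enough to parse as XML. This is a stand-in using `System.Xml.Linq`, not a real HTML parser; for arbitrary HTML, a dedicated parser such as the HtmlAgilityPack mentioned in the comments is the more robust route. The names `ImportantContentsReader` and `GetText` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ImportantContentsReader
{
    // Assumption: 'markup' is well-formed XML-compatible HTML.
    // XDocument.Parse throws an XmlException on anything else.
    public static string GetText(string markup)
    {
        XDocument doc = XDocument.Parse(markup);

        // First <div> whose class attribute contains the token "important-contents".
        XElement div = doc.Descendants("div")
            .FirstOrDefault(e => ((string)e.Attribute("class") ?? "")
                .Split(' ')
                .Contains("important-contents"));
        if (div == null) return string.Empty;

        // Drop <script> elements so their code is not counted as text.
        div.Descendants("script").Remove();

        // Value concatenates the text of all remaining descendant nodes.
        return div.Value.Trim();
    }

    static void Main()
    {
        string markup =
            "<html><body><div class='important-contents'>" +
            "<script>function getV(str) { return 0; }</script>" +
            "<span>My <i>text</i><font face=\"Verdana\">.</font></span>" +
            "</div></body></html>";
        Console.WriteLine(GetText(markup)); // prints "My text."
    }
}
```

Because the element tree is walked rather than pattern-matched, nested tags and nested divs inside the target element do not need special handling here.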
    
Community
Ilia Maskov