I have the following HTML. I have tried many regexes to remove the hyperlink text that sits inside ul and li tags only, but I have not found one that removes the a tag's text. Whenever an a tag appears inside a ul and an li tag, I want to replace the a tag's text with an empty string.

<ul id="foot.dir" class="content" >
 <li><a href="http://www.citysearch.com/aboutcitysearch/about_us"  name="search_grid.footer.1.aboutCs" rel="nofollow" id="foot.dir.about">About</a></li>
 <li><a href="http://www.citysearch.com/mobile-application" name="search_grid.footer.1.mobile" id="foot.dir.apps">Apps</a></li>
</ul>

I have tried this regex, but it is not working. Here, input is a string that contains the HTML.

input = Regex.Replace(input, @"<ul[^>]*?><li><a[^>]*?>(?<option>.*?)</ul></li></a>", string.Empty);

Please help me out. Thank you.

Waseem Fastian

2 Answers

Regex is a poor choice for parsing HTML, in particular HTML that is not consistent.

I suggest using the HTML Agility Pack to parse and change the HTML.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

The source download comes with a number of sample projects showing how to use the library.

Oded
  • +1 Totally agree. On a side note, if you can guarantee the HTML is XHTML, then you might also be able to use XDocument et al. It does depend on what types of character references are used (XHTML has its own, of course), but I've found that to be incredibly simple. – Andras Zoltan Nov 30 '12 at 11:51
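
The XDocument idea from this comment could look roughly like the sketch below. It assumes the input is well-formed XML with no default XHTML namespace (with one, the element names would need an XNamespace prefix) and uses only named entities that XML itself defines. The class and method names are hypothetical, chosen just for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlAnchorCleaner
{
    // Empties every <a> that is a direct child of an <li> directly inside a <ul>.
    // Assumption: the markup parses as XML and has no default namespace.
    public static string ClearAnchorText(string xhtml)
    {
        // PreserveWhitespace keeps the original indentation and line breaks.
        XDocument doc = XDocument.Parse(xhtml, LoadOptions.PreserveWhitespace);

        var anchors = doc.Descendants("ul")
                         .Elements("li")
                         .Elements("a")
                         .ToList(); // materialise the query before mutating the tree

        foreach (XElement a in anchors)
        {
            a.Value = string.Empty; // drops the content, keeps the tag and its attributes
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        string input =
            "<ul id=\"foot.dir\" class=\"content\">" +
            "<li><a href=\"http://www.citysearch.com/mobile-application\" id=\"foot.dir.apps\">Apps</a></li>" +
            "</ul>";
        Console.WriteLine(ClearAnchorText(input));
    }
}
```

XDocument.Parse throws an XmlException on markup that is not well-formed, so for arbitrary "real world" HTML the HTML Agility Pack remains the safer route.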

Regex is not a good choice for parsing HTML files.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack instead.

Regex is meant for regular languages.

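You can use this code to remove the anchors with HtmlAgilityPack

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// Select only anchors that sit directly inside an li within a ul.
// SelectNodes returns null when nothing matches.
var anchors = doc.DocumentNode.SelectNodes("//ul/li/a");
if (anchors != null)
{
    foreach (var item in anchors)
    {
        item.ParentNode.RemoveChild(item); // removes the anchor tag and its text
    }
}
// Don't forget to save, e.g. with doc.Save(...)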

I want to remove the tag text using regex only.

Regex.Replace(input,@"(?<=<li[^>]*>)\s*<a.*?(?=</li>)","",RegexOptions.Singleline);
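
If the goal is to blank only the link text and leave the a tags themselves in place, a variant along the following lines might do it. This is only a sketch under several assumptions: the a tag directly follows its li (whitespace aside), the link text contains no nested tags, and no attribute value contains a ">" character. It also does not verify that the li sits inside a ul rather than an ol. The class and method names are hypothetical. It relies on .NET's support for variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorTextRegex
{
    // Lookbehind: an opening <li ...> followed by an opening <a ...>.
    // Match:      the link text itself (anything up to the next '<').
    // Lookahead:  the closing </a> followed by the closing </li>.
    private static readonly Regex AnchorText = new Regex(
        @"(?<=<li\b[^>]*>\s*<a\b[^>]*>)[^<]*(?=</a>\s*</li>)",
        RegexOptions.IgnoreCase);

    public static string ClearAnchorText(string html)
    {
        return AnchorText.Replace(html, string.Empty);
    }

    public static void Main()
    {
        string input =
            "<ul id=\"foot.dir\" class=\"content\" >\n" +
            " <li><a href=\"http://www.citysearch.com/mobile-application\" id=\"foot.dir.apps\">Apps</a></li>\n" +
            "</ul>";
        Console.WriteLine(ClearAnchorText(input));
    }
}
```

Anchors outside an li are left untouched because the lookbehind requires the opening li tag immediately before the a tag.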
Anirudha