I am working with XML documents in C#.

<data>
    <single>
        <p xmlns="http://www.w3.org/1999/xhtml">
            <strong>Hi hello bbvahvgxvzhavxhgsavxv</strong>
        </p>
        <p xmlns="http://www.w3.org/1999/xhtml">
            <strong>dmcdnsbcdbn</strong>
        </p>
    </single>
    <single>
        <div xmlns="http://www.w3.org/1999/xhtml">
            <strong>Hi hello bbvahvgxvzhavxhgsavxv</strong>
        </div>
        <span xmlns="http://www.w3.org/1999/xhtml">
            <strong>dmcdnsbcdbn</strong>
        </span>
    </single>
</data>

I want to remove all the <p>, <div>, and <span> tags but keep their contents.

Output needed:

<data>
    <single>
        <strong>Hi hello bbvahvgxvzhavxhgsavxv</strong>
        <strong>dmcdnsbcdbn</strong>
    </single>
    <single>
        <strong>Hi hello bbvahvgxvzhavxhgsavxv</strong>
        <strong>dmcdnsbcdbn</strong>
    </single>
</data>

Can anyone suggest how to do this in C# using XmlDocument?

Filburt
Patan

2 Answers

Using HtmlAgilityPack, it can be done as:

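// Requires "using System.Linq;" for ToList().
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(xml);

// For each <strong>, remove its parent wrapper (<p>, <div>, or <span>).
// The second argument (true) keeps the wrapper's children in its place.
doc.DocumentNode
    .Descendants("strong")
    .ToList().ForEach(n => n.ParentNode.ParentNode.RemoveChild(n.ParentNode, true));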

var newXml = doc.DocumentNode.InnerHtml;
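
Since the question asks for XmlDocument specifically, the same unwrapping idea can be sketched with the DOM alone. This is only a minimal sketch, not a tested drop-in: the class name TagUnwrapper and the method Unwrap are hypothetical, and it assumes the root element itself is never one of the wrappers.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, named for illustration only.
static class TagUnwrapper
{
    // Replaces every <p>, <div>, and <span> element with its own children.
    // Assumption: the document's root element is not one of these wrappers.
    public static void Unwrap(XmlDocument doc)
    {
        var names = new HashSet<string> { "p", "div", "span" };

        // Collect the matches first, so the tree is not modified
        // while the XPath result is still being enumerated.
        var wrappers = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes("//*"))
        {
            // LocalName ignores the XHTML namespace declared on the wrappers.
            if (names.Contains(node.LocalName))
                wrappers.Add(node);
        }

        foreach (XmlNode wrapper in wrappers)
        {
            XmlNode parent = wrapper.ParentNode;

            // InsertBefore detaches the child from the wrapper and
            // re-attaches it just ahead of the wrapper, preserving order.
            while (wrapper.HasChildNodes)
                parent.InsertBefore(wrapper.FirstChild, wrapper);

            parent.RemoveChild(wrapper);
        }
    }
}
```

Calling TagUnwrapper.Unwrap(xmlDoc) after xmlDoc.LoadXml(...) modifies the document in place. One caveat: the moved <strong> elements still belong to the XHTML namespace, so the serialized output carries an xmlns attribute on each of them. Producing the exact output shown in the question would mean recreating those elements without a namespace.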
L.B

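This can be done with a fairly simple Regex.

// OuterXml keeps the root <data> element, so the result still has a single root for LoadXml.
string tmp = xmlDoc.DocumentElement.OuterXml;

// [^>]* stops at the end of each tag; \b keeps "<p" from matching longer names such as <pre>.
tmp = Regex.Replace(tmp, @"</?(p|div|span)\b[^>]*>", "");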

XmlDocument newDoc = new XmlDocument();
newDoc.LoadXml(tmp);

This preserves the data (everything between the tags) but removes the tags themselves. NOTE: this can leave a lot of stray whitespace in the document, but it should still be usable.

After running this statement on the example you gave, this was the output:

<data>
    <single>

            <strong>Hi hello bbvahvgxvzhavxhgsavxv</strong>


            <strong>dmcdnsbcdbn</strong>

    </single>
    <single>

            <strong>Hi hello bbvahvgxvzhavxhgsavxv</strong>


            <strong>dmcdnsbcdbn</strong>

    </single>
</data>

I'm not sure whether that is acceptable to you. If not, you might want to run a .Trim(), or a second Regex that removes the whitespace between tags, on the string before attempting to load it.

The Regex pattern for that would be

tmp = Regex.Replace(tmp, @">\s+<", "><");

Use "\s+" rather than " *" so that line breaks and tabs left over between tags are removed as well. Do not use ".*" here: it would also match, and delete, the text between an opening and a closing tag.
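
Putting the steps together, the whole approach might look roughly like the sketch below. The class name RegexTagStripper and the method Strip are hypothetical, chosen only for illustration. The tag pattern uses [^>]* so that a match cannot run past the end of a single tag, and it assumes the wrapper tags never carry attribute values containing ">".

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, named for illustration only.
static class RegexTagStripper
{
    // Returns a new document with <p>, <div>, and <span> tags removed
    // and the whitespace between adjacent tags collapsed.
    public static XmlDocument Strip(XmlDocument xmlDoc)
    {
        // OuterXml includes the root element, so the string stays a
        // single-rooted document that LoadXml can parse again.
        string tmp = xmlDoc.DocumentElement.OuterXml;

        // Remove opening and closing wrapper tags, leaving their content.
        tmp = Regex.Replace(tmp, @"</?(p|div|span)\b[^>]*>", "");

        // Drop whitespace (spaces, tabs, line breaks) left between tags.
        tmp = Regex.Replace(tmp, @">\s+<", "><");

        var newDoc = new XmlDocument();
        newDoc.LoadXml(tmp);
        return newDoc;
    }
}
```

Usage would be something like XmlDocument cleaned = RegexTagStripper.Strip(xmlDoc);. Because this works on the serialized string rather than the node tree, it is best treated as a quick fix for input whose shape is known, such as the example in the question.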

Nevyn