I really hope someone can help me with this issue. The solution should be in C#.

I have an XML file of 36 MB with 900k lines. Some nodes contain a lot of HTML markup, including invalid markup like:

<Obs><p>
<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p>

I've tried different ways to clean this file, but only one of them gets the job done. However, because it runs inside a web application, it blocks the application, takes around 6 minutes to finish, and consumes around 450 MB of memory.

Since the file is not valid XML, I cannot use XmlTextReader. I also tried XSLT, based on "Strip HTML-like characters (not markup) from XML with XSLT?", but strangely I ran into problems with HTML entities as well.

The approach that worked (with some tweaks) is the one described at http://www.codeproject.com/Articles/19652/HTML-Tag-Stripper

Thanks

Edit:

Following Kevin's suggestion, I'm trying to build a solution using HTML Agility Pack, at least to run some benchmarks. I'm stuck, however. Imagine the following XML node:

<Obs><p> I WANT THIS TEXT<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p></Obs>

How can I strip the tags inside the "Obs" tag while keeping the "Obs" tag itself and the text "I WANT THIS TEXT"? Basically this:

<Obs>I WANT THIS TEXT</Obs>

For now, this is the code I have:

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(text);
        Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
        while (nodes.Count > 0)
        {
            HtmlNode node = nodes.Dequeue();
            HtmlNode parentNode = node.ParentNode;

            HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");

            if (childNodes != null)
            {
                foreach (HtmlNode child in childNodes)
                {
                    if (child.Name != "obs")
                    {
                        nodes.Enqueue(child);
                    }
                    else
                    {
                        childNodes = child.SelectNodes("//p|//jantes");
                        foreach (HtmlNode nodeToStrip in childNodes)
                            nodeToStrip.ParentNode.RemoveChild(nodeToStrip);
                    }
                }
            }
        }
        string s = doc.DocumentNode.InnerHtml;

Thanks :)

EDIT 2

OK, I was able to complete the task. However, it takes far too long: about 3 hours, consuming 800 MB of memory.

I still need help!

Here is the code; it might help someone.

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(text);
        Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
        while (nodes.Count > 0)
        {
            HtmlNode node = nodes.Dequeue();
            HtmlNode parentNode = node.ParentNode;

            HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");

            if (childNodes != null)
            {
                foreach (HtmlNode child in childNodes)
                {
                    if (child.Name != "obs")
                    {
                        nodes.Enqueue(child);
                    }
                    else
                    {
                        childNodes = child.SelectNodes("//p|//jantes");
                        if (childNodes != null)
                        {
                            foreach (HtmlNode nodeToStrip in childNodes)
                            {
                                var replacement = doc.CreateTextNode(nodeToStrip.InnerText);
                                nodeToStrip.ParentNode.ReplaceChild(replacement, nodeToStrip);
                            }
                        }
                    }
                }
            }
        }
        string s = doc.DocumentNode.InnerHtml;
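
For comparison, here is a minimal sketch of the same transformation without building a DOM at all. It is not the Html Agility Pack approach above but a plain regex pass using only System.Text.RegularExpressions. The class and method names (ObsCleaner, StripTagsInsideObs) are made up for illustration. The sketch assumes that Obs elements carry no attributes, are never nested, and that no tag contains a literal ">" inside an attribute value. It does not decode HTML entities. Note also that in the code above an XPath beginning with "//" is evaluated from the document root rather than from the current obs node, which may account for part of the running time.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ObsCleaner
{
    // Matches one <Obs>...</Obs> element (assumed: no attributes, no nesting).
    // Singleline lets '.' match line breaks, so content may span several lines.
    private static readonly Regex ObsElement = new Regex(
        @"<Obs>(.*?)</Obs>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Matches anything that looks like a tag: '<', any run of non-'>' characters, '>'.
    private static readonly Regex AnyTag = new Regex(
        @"<[^>]*>",
        RegexOptions.Compiled);

    // Removes every tag found inside each <Obs> element and trims the remaining text.
    // Everything outside <Obs> elements is left untouched.
    public static string StripTagsInsideObs(string xml)
    {
        return ObsElement.Replace(xml, match =>
            "<Obs>" + AnyTag.Replace(match.Groups[1].Value, string.Empty).Trim() + "</Obs>");
    }
}
```

With the file contents already in the variable text, as in the code above, the call would be ObsCleaner.StripTagsInsideObs(text). The whole file is still held in memory as a string, so this is only a starting point for benchmarking, not a streaming solution.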
blindado

1 Answer

Have you tried Html Agility Pack? Among its claims:

  • the parser is very tolerant with "real world" malformed HTML
  • you can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it
Kevin Collins
  • Didn't like Html Agility Pack for the purpose. It's very powerful though. But it took me 4 hours to "clean" the file :( – blindado Apr 27 '13 at 08:45