I really hope someone can help me with this issue. The solution should be in C#.
I have an XML file of 36 MB with 900k lines. Some nodes contain a lot of HTML markup, including invalid markup like:
<Obs><p>
<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p>
I've tried different ways to clean this file, but only one was able to complete the task. However, since it runs inside a web application, it blocks the application, takes around 6 minutes to finish, and consumes around 450 MB of memory.
As this file is invalid XML, I cannot use XmlTextReader. Using XSLT, based on "Strip HTML-like characters (not markup) from XML with XSLT?", I also ran into problems with HTML entities.
The process that worked (with some tweaks) is the one described at http://www.codeproject.com/Articles/19652/HTML-Tag-Stripper
Thanks
Edit:
Following Kevin's suggestions, I'm trying to build a solution using HTML Agility Pack, at least to do some benchmarks. I'm stuck, however. Imagine the following XML node:
<Obs><p> I WANT THIS TEXT<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p></Obs>
How can I strip the tags inside the "Obs" tag, keep the "Obs" tag itself, and also keep the text "I WANT THIS TEXT"? Basically this:
<Obs>I WANT THIS TEXT</Obs>
For now, this is the code I have:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
    HtmlNode node = nodes.Dequeue();
    HtmlNode parentNode = node.ParentNode;
    HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
    if (childNodes != null)
    {
        foreach (HtmlNode child in childNodes)
        {
            if (child.Name != "obs")
            {
                nodes.Enqueue(child);
            }
            else
            {
                childNodes = child.SelectNodes("//p|//jantes");
                foreach (HtmlNode nodeToStrip in childNodes)
                    nodeToStrip.ParentNode.RemoveChild(nodeToStrip);
            }
        }
    }
}
string s = doc.DocumentNode.InnerHtml;
Thanks :)
EDIT 2
OK, I was able to complete the task. However, it takes far too long: about 3 hours, consuming 800 MB of memory.
I still need help!
Here is the code; it might help someone.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
    HtmlNode node = nodes.Dequeue();
    HtmlNode parentNode = node.ParentNode;
    HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
    if (childNodes != null)
    {
        foreach (HtmlNode child in childNodes)
        {
            if (child.Name != "obs")
            {
                nodes.Enqueue(child);
            }
            else
            {
                childNodes = child.SelectNodes("//p|//jantes");
                if (childNodes != null)
                {
                    foreach (HtmlNode nodeToStrip in childNodes)
                    {
                        var replacement = doc.CreateTextNode(nodeToStrip.InnerText);
                        nodeToStrip.ParentNode.ReplaceChild(replacement, nodeToStrip);
                    }
                }
            }
        }
    }
}
string s = doc.DocumentNode.InnerHtml;
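For benchmarking against the code above, a much lighter alternative might be a plain regex pass that only touches the contents of each Obs element, without building a DOM. The following is only a sketch under several assumptions: `ObsCleaner` and `StripInsideObs` are made-up names, Obs elements are assumed to be never nested and to carry no attributes, and no attribute value is assumed to contain a `>` character. It has not been measured against the real 36 MB file.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ObsCleaner
{
    // Matches one <Obs>...</Obs> block (assumption: no nesting, no attributes on Obs).
    private static readonly Regex ObsBlock = new Regex(
        @"<Obs>(?<body>.*?)</Obs>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Matches anything that looks like a tag, including the malformed attributes.
    // Assumption: attribute values never contain '>'.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>", RegexOptions.Compiled);

    public static string StripInsideObs(string xml)
    {
        return ObsBlock.Replace(xml, match =>
        {
            // Drop every tag inside the Obs body and keep only the trimmed text.
            string text = AnyTag.Replace(match.Groups["body"].Value, string.Empty).Trim();
            return "<Obs>" + text + "</Obs>";
        });
    }
}
```

Calling `ObsCleaner.StripInsideObs(text)` on the node shown earlier should leave only the text between the Obs tags. Markup outside Obs elements is left untouched, and HTML entities are not decoded, so that part would still need separate handling.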