I'm trying to sanitize/modify an HTML tree by building the mshtml DOM and then traversing it. The problem is that this takes a very long time: up to 3-4 seconds for a standard Twitter newsletter or something comparable.
After a profiling session I was able to pinpoint the hotspot:
    private void AddAttributes(IHTMLDOMNode node)
    {
        string nodeName = node.nodeName;
        var attributes = (IHTMLAttributeCollection) node.attributes;
        int length = attributes.length;
        for (int i = 0; i < length; i++)
        {
            // problem line
            IHTMLDOMAttribute attribute = attributes.item(i) as IHTMLDOMAttribute;
            string attributeName = attribute.nodeName;
            // do some work
            ...
        }
    }
Casting to IHTMLDOMAttribute takes 75% of the time (in comparison, the whole DOM creation takes only ~3%).
Profiler's output for AddAttributes:
    Function body: 0.3%
    Called functions:
        DoCLRToCOMCall: 41.5%
        JITutil_ChkCastAny: 27.2%
        ?InterfaceMarshaler_ConvertToManaged...: 10%
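To cross-check those numbers outside the profiler, the COM call and the cast on the problem line could be timed separately. Below is a minimal sketch that compiles on .NET 3.5; `CastTiming` and `Measure` are hypothetical names introduced only for this illustration, and the `fetch` delegate is assumed to wrap whatever call returns the raw attribute object.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of mshtml or of the original code.
static class CastTiming
{
    // Calls fetch(i) for i in [0, count) and accumulates the time spent
    // fetching the raw object separately from the time spent on the "as" cast.
    public static void Measure<T>(int count, Func<int, object> fetch,
                                  out TimeSpan fetchTime, out TimeSpan castTime)
        where T : class
    {
        var fetchWatch = new Stopwatch();
        var castWatch = new Stopwatch();

        for (int i = 0; i < count; i++)
        {
            fetchWatch.Start();
            object raw = fetch(i);      // the COM call, no cast yet
            fetchWatch.Stop();

            castWatch.Start();
            T typed = raw as T;         // the cast under suspicion
            castWatch.Stop();

            GC.KeepAlive(typed);        // keep the cast result referenced
        }

        fetchTime = fetchWatch.Elapsed;
        castTime = castWatch.Elapsed;
    }
}
```

Inside AddAttributes it would be called roughly as `CastTiming.Measure<IHTMLDOMAttribute>(length, i => attributes.item(i), out fetchTime, out castTime);`, using the same `item` call as in the loop above. Starting and stopping a Stopwatch per iteration adds its own overhead, so the two totals are only a rough indication of how the cost is split.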
What can I do to improve the performance in this case?
I've been here: HTML Traversal is very slow. It looks similar, but we are stuck with .NET 3.5, so dynamics are out of the question. There are several other reports of similar problems on the internet, but there's no clear answer, only hints at marshalling problems.
HTML Agility Pack, while being much faster, is incapable of parsing CSS attributes, which is crucial for us.