I'm trying to sanitize/modify an HTML tree by building the mshtml DOM and then traversing it. The problem is that this takes a very long time: up to 3-4 seconds for a standard Twitter newsletter or something comparable.

After a profiling session I was able to pinpoint the hotspot:

private void AddAttributes(IHTMLDOMNode node)
{
    string nodeName = node.nodeName;
    var attributes = (IHTMLAttributeCollection) node.attributes;
    int length = attributes.length;
    for (int i = 0; i < length; i++)
    {
        //problem line
        IHTMLDOMAttribute attribute = attributes.item(i) as IHTMLDOMAttribute;

        string attributeName = attribute.nodeName;

        //do some work
        ...
    }
}

The cast to IHTMLDOMAttribute takes 75% of the time (by comparison, creating the whole DOM takes only ~3%).

Profiler output for AddAttributes:

Function body: 0.3%
Called functions:
    DoCLRToCOMCall: 41.5%
    JITutil_ChkCastAny: 27.2%
    ?InterfaceMarshaler_ConvertToManaged...: 10%

What can I do to improve the performance in this case?

I have looked at the question "HTML Traversal is very slow", which looks similar, but we are stuck with .NET 3.5, so `dynamic` is out of the question. There are several other reports of similar problems online, but no clear answer, only hints at marshalling problems.

HTML Agility Pack, while much faster, cannot parse CSS attributes, which is crucial for us.

Anatoly Sazanov

  • *Try* moving all the code to JavaScript - if the slowness comes from COM/C# interop, you may get better performance... – Alexei Levenkov Feb 27 '15 at 16:25
  • We need to parse HTML from emails, and the entire project is C#-based. Some performance loss due to interop is expected, but this is far more than expected. – Anatoly Sazanov Feb 28 '15 at 17:43
  • My suggestion was obviously unclear: try moving the code that changes/cleans the HTML/CSS over to JavaScript, and keep whatever browser hosting you have in C#. – Alexei Levenkov Feb 28 '15 at 21:42
  • Try [`TreeWalker`](https://msdn.microsoft.com/en-us/library/ie/ff974360%28v=vs.85%29.aspx) and see if you do any better. – noseratio Mar 01 '15 at 04:07
  • Well, we did something close to that. We ended up moving the problematic class to a C++ helper library, which essentially reduced marshalling to an exchange of strings. Still, it would be nice to know whether there is an alternative answer. – Anatoly Sazanov Mar 03 '15 at 07:47
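
For illustration, the "move the traversal to JavaScript" idea from the comments could look roughly like the sketch below. It is only a sketch under assumptions, not what the asker actually did: `collectAttributes` is a hypothetical name, it assumes the document is hosted so that script can run at all, and it assumes `JSON` is available (it is not in older IE document modes). The point is that the C# side would receive one string per document instead of making one COM call per attribute.

```javascript
// Hypothetical helper: walk the tree under `root` and collect every
// explicitly specified attribute, so the host only has to marshal a
// single string back instead of one object per attribute.
function collectAttributes(root) {
    var result = [];
    var stack = [root];
    while (stack.length > 0) {
        var node = stack.pop();
        if (node.nodeType === 1) { // 1 = element node
            var attrs = {};
            var list = node.attributes;
            for (var i = 0; i < list.length; i++) {
                var attr = list[i];
                // Older IE versions list every possible attribute;
                // `specified` narrows that to the ones actually set.
                if (attr.specified === false) { continue; }
                attrs[attr.nodeName] = attr.nodeValue;
            }
            result.push({ tag: node.nodeName, attributes: attrs });
        }
        // Push children in reverse so they are visited in document order.
        var children = node.childNodes || [];
        for (var j = children.length - 1; j >= 0; j--) {
            stack.push(children[j]);
        }
    }
    return JSON.stringify(result);
}
```

The host would then parse the returned string on the C# side. How the script gets invoked from the host is deliberately left out here, since it depends on how the document is loaded.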
