I'm experiencing a very strange issue. So, the background is that we have a mapping between Word ContentControl and a custom object we use to store some information related to the content inside that control. We use a SortedList<ContentControl, OurCustomObject> to maintain this mapping. The SortedList part is useful to be able to find the next/previous content control, as well as to be able to quickly access the object associated with a content control.

To set this up, we do something like the following:

var dictOfObjs = Globals.ThisAddIn.Application.ActiveDocument.ContentControls
    .Cast<ContentControl>()
    .ToDictionary(key => key, elem => new OurCustomObject(elem));
var comparer = Comparer<ContentControl>
    .Create((x, y) => x.Range.Start.CompareTo(y.Range.Start));
var list = new SortedList<ContentControl, OurCustomObject>(dictOfObjs, comparer);

This seemed to work pretty well, but I recently tried it on a document with ~5000 content controls, and it slowed to an absolute crawl (3+ minutes to instantiate the SortedList).

So that's strange enough, but even more strangeness was yet to come. I added some logging to figure out what was going on, and found that logging the start of each ContentControl before using them in the list sped it up by a factor of ~60. (Yes, ADDING logging sped it up!). Here is the much faster code:

var dictOfObjs = Globals.ThisAddIn.Application.ActiveDocument.ContentControls
    .Cast<ContentControl>()
    .ToDictionary(key => key, elem => new OurCustomObject(elem));

foreach (var pair in dictOfObjs)
{
    _logger.Debug("Start: " + pair.Key.Range.Start);
}

var comparer = Comparer<ContentControl>
    .Create((x, y) => x.Range.Start.CompareTo(y.Range.Start));
var list = new SortedList<ContentControl, OurCustomObject>(dictOfObjs, comparer);

The constructor for SortedList calls Array.Sort<TKey, TValue>(keys, values, comparer); on the keys and values of the dictionary. I can't figure out why accessing the Range objects in a loop beforehand would speed it up. Maybe it has something to do with the order in which they are accessed? The foreach loop accesses them in the order they appear in the document, while Array.Sort hops around all over the place.

Edit: When I say SortedList, I mean System.Collections.Generic.SortedList<TKey, TValue>. Here is the code for the constructor I'm using:

public SortedList(IDictionary<TKey, TValue> dictionary, IComparer<TKey> comparer) 
    : this((dictionary != null ? dictionary.Count : 0), comparer) {
    if (dictionary==null)
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.dictionary);

    dictionary.Keys.CopyTo(keys, 0);
    dictionary.Values.CopyTo(values, 0);
    Array.Sort<TKey, TValue>(keys, values, comparer);
    _size = dictionary.Count;            
}
Zout
  • Could you provide the SortedList constructor code? If that's where the slow-down is happening, that's what we need to see... A known issue in Word, fwiw, is that it's often faster to loop using for + index instead of foreach. I think it has something to do with how Word has to "track" where the objects are, rather than simply being able to pick it up using the index. – Cindy Meister Apr 05 '16 at 17:44
  • Sure - I added the code to my question. The line that takes so long is the one I mentioned earlier, the call to `Array.Sort`. My understanding is that `Array.Sort` will use the comparer to sort the CCs based on their range start. I just don't understand why accessing `Range.Start` suddenly becomes so slow once there are a sufficient number of CCs in the document, and yet can be sped up by iterating over them beforehand. Very confusing to my poor brain! – Zout Apr 05 '16 at 18:11

1 Answer

Besides the performance issue, I think your solution will fail over time unless the document is static. Range.Start positions shift as content is added to or removed from the document.

To prove this, add the following little macro and run it from VBA:

Sub testccstart()

    Dim cc As ContentControl
    Set cc = ActiveDocument.ContentControls.Add(wdContentControlRichText)

    MsgBox cc.Range.Start

    ActiveDocument.Range(0).InsertBefore "Blablabla"

    MsgBox cc.Range.Start

End Sub

You'll notice that the Range.Start of the content control shifted from 1 to 10 as soon as the text was inserted at the start of the document. So every edit to the document requires you to rebuild any list based on Range.Start.

The answer to your question, though, is that adding the logging triggered what is called 'lazy loading' (loading things only when they are actually accessed). There are all sorts of optimizations within Office that are not always obvious (for instance, accessing Excel ranges is often much faster using arrays).

I tested a bit and wonder if this might be a solution for you:

 var dictOfObjs = document.ContentControls
                        .Cast<ContentControl>()
                        .ToDictionary(key => key.Range.Start, elem => new OurCustomObject(elem));

 var comparer = Comparer<int>.Create((x, y) => x.CompareTo(y));

 var list = new SortedList<int, OurCustomObject>(dictOfObjs, comparer);

Instead of using the content control as the key (I guess you already store the content control in OurCustomObject?), use its start position as the key, assuming each content control has a unique start position. This brought my processing of 1600 items down from 48 seconds to 5.
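
To show why reading the start position once per control can matter, here is a minimal sketch that does not touch Word at all. FakeControl and SortDemo are hypothetical names made up for this illustration; FakeControl only counts how often its Start property is read, standing in for a property that is expensive to query. The sketch compares sorting with a comparer that reads the property on every comparison against reading each key once and sorting the cached integers. It makes no claim about what Word does internally, and a cached snapshot like this goes stale as soon as the document changes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a content control: counts every read of Start.
public sealed class FakeControl
{
    public static int Reads;
    private readonly int _start;
    public FakeControl(int start) { _start = start; }
    public int Start { get { Reads++; return _start; } }
}

public static class SortDemo
{
    // Sorts with a comparer that reads Start on both sides of every comparison.
    public static int ReadsWhenComparingLive(FakeControl[] controls)
    {
        FakeControl.Reads = 0;
        var copy = (FakeControl[])controls.Clone();
        Array.Sort(copy, Comparer<FakeControl>.Create(
            (x, y) => x.Start.CompareTo(y.Start)));
        return FakeControl.Reads;
    }

    // Reads each key exactly once, then sorts the keys and items together.
    public static FakeControl[] SortWithCachedKeys(FakeControl[] controls, out int reads)
    {
        FakeControl.Reads = 0;
        var items = (FakeControl[])controls.Clone();
        int[] keys = items.Select(c => c.Start).ToArray();
        Array.Sort(keys, items); // reorders items to follow the sorted keys
        reads = FakeControl.Reads;
        return items;
    }

    public static void Main()
    {
        // 5000 controls in shuffled order, with unique start positions.
        var rng = new Random(42);
        var controls = Enumerable.Range(0, 5000)
            .OrderBy(_ => rng.Next())
            .Select(i => new FakeControl(i))
            .ToArray();

        int liveReads = ReadsWhenComparingLive(controls);
        int cachedReads;
        SortWithCachedKeys(controls, out cachedReads);

        Console.WriteLine("Reads with live comparer: " + liveReads);
        Console.WriteLine("Reads with cached keys:   " + cachedReads);
    }
}
```

With cached keys the property is read once per control, while the live comparer reads it twice per comparison, so the gap widens as the number of controls grows.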

Maarten van Stam
  • I was worried about this as well (that the ordering would be broken as the document content changes) - but it's not actually a problem, since adding or removing text at a point will shift the range starts of all following content controls forward or backwards, respectively. – Zout Apr 05 '16 at 12:14
  • True, but won't be if someone -moves- the text. Also, I believe you are missing the Content Controls in the Headers and Footers of the document and/or Content Controls in shapes. Read about it here: http://stackoverflow.com/questions/4605179/how-to-get-the-list-of-all-content-controls-in-the-document Be very careful using Content Control objects, things can go tricky if not tested in detail ;-) – Maarten van Stam Apr 05 '16 at 13:27
  • The solution for moving was to use the ContentControlBeforeDelete and ContentControlAfterAdd events (which are called when the content control is moved) to ensure the SortedList stays ordered. I didn't realize you could add CCs to shapes - interesting! I don't think that is an issue, though - the content being wrapped by the CCs is somewhat standardized and I believe will always be in the normal part of the document. About the lazy loading, do you have any idea if there is a way to force it to occur? The logging "workaround" seems terribly hacky to me. Thanks for your time, btw. – Zout Apr 05 '16 at 13:36
  • I did some testing, maybe you can check if I'm correct. See edit in my answer. I thought it would be ok to use the Range.Start as Key value thinking that each Content Control has its own starting position, don't you think? That speeded up my test significantly ... – Maarten van Stam Apr 05 '16 at 15:21
  • Hmm... I think that will cause the problem you noticed at first, where updates to the document cause the order to be invalidated. It works right now because the Range objects' start positions get updated with document changes, but if it was a plain integer, it'd be left with the old value. The confusing part of this situation is that this operation only takes a few seconds (or less) up to a certain threshold, after which it takes several minutes.. unless something happens with lazy loading or caching or some other Office magic. Ideally I'd like that magic to happen every time! – Zout Apr 05 '16 at 16:34
  • Yes, this alternative, quicker option is a snapshot of the Range.Start positions. You indicated the -order- of the controls would not change and that the list is 'rescanned' after adding or deleting a Content Control. Did that change? The delay when storing the -referenced- Content Controls comes from having to recalculate a Range.Start position, if it was not already calculated, just before the value is queried. One way or another, Range.Starts are dynamic and the position pointers are -very- volatile, hence the risk I warned about. I'll try to think about other alternatives as it sure is an interesting Use Case. :-) – Maarten van Stam Apr 06 '16 at 07:55
  • Ah, I guess I didn't specify what I use those events for. I remove the CC from the SortedList in the BeforeDelete event, and add it to the list in the AfterAdd event. This makes sure that the order is maintained even if a CC is moved, since it will be removed and then re-added at its proper position. – Zout Apr 06 '16 at 16:45