How can I improve the performance of retrieving values from SharedStringTable in OpenXml Excel spreadsheet tools?

Question

I'm using DocumentFormat.OpenXml to read an Excel spreadsheet. I have a performance bottleneck with the code used to look up the cell value from the SharedStringTable object (it seems to be some sort of lookup table for cell values):

var returnValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;

I've created a dictionary to ensure I only retrieve a value once:

if (dictionary.ContainsKey(parsedValue))
{
    return dictionary[parsedValue];
}

var fetchedValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;
dictionary.Add(parsedValue, fetchedValue);
return fetchedValue;

This has cut down the performance time by almost 50%. However my metrics indicate that it still takes 208 seconds for the line of code fetching the value from the SharedStringTable object to execute 123,951 times. Is there any other way of optimising this operation?

score 8 · Accepted Answer · answered Feb 28 '17 at 17:12

I would read the whole shared string table into your dictionary in one go rather than looking up each value as required. This will allow you to move through the file in order and stash the values ready for a hashed lookup which will be more efficient than scanning the SST for each value you require.

Running something like the following at the start of your process will allow you to access each value using dictionary[parsedValue].

private static void LoadDictionary()
{
    int i = 0;

    foreach (var ss in sharedStringTablePart.SharedStringTable.ChildElements)
    {
        dictionary.Add(i++, ss.InnerText);
    }
}

If your file is very large, you might see some gains using a SAX approach to read the file rather than the DOM approach above:

private static void LoadDictionarySax()
{
    using (OpenXmlReader reader = OpenXmlReader.Create(sharedStringTablePart))
    {
        int i = 0;
        while (reader.Read())
        {
            if (reader.ElementType == typeof(SharedStringItem))
            {
                SharedStringItem ssi = (SharedStringItem)reader.LoadCurrentElement();
                dictionary.Add(i++, ssi.Text != null ? ssi.Text.Text : string.Empty);
            }
        }
    }
}

On my machine, using a file with 60000 rows and 2 columns it was around 300 times quicker using the LoadDictionary method above instead of the GetValue method from your question. The LoadDictionarySax method gave similar performance but on a larger file (100000 rows with 10 columns) the SAX approach was around 25% faster than the LoadDictionary method. On an even larger file (100000 rows, 26 columns), the LoadDictionary method threw an out of memory exception but the LoadDictionarySax worked without issue.

I really like this approach. In LoadDictionary() how do you ensure the cell value (which is numeric and the lookup value) maps to the dictionary key accurately? Is the assumption that the enumerator on sharedStringTablePart.SharedStringTable.ChildElements will return them in numerical order starting with 0 a safe one? — Neil Humby, Mar 07 '17 at 15:16
I'm glad you like it. Yes, that's exactly right, they will be read in order. Out of interest, how much has it improved your performance by if you don't mind me asking? — petelids, Mar 07 '17 at 15:19
Hugely, from 208 seconds to 0.9 seconds. Really appreciate you sharing this tip, thank you! — Neil Humby, Mar 09 '17 at 11:06

How can I improve the performance of retrieving values from SharedStringTable in OpenXml Excel spreadsheet tools?

1 Answers1

Linked