11

In Excel, they 'compress' strings to a numerical mapping (though I'm not sure that the word compress is correct in this case). Here is an example shown below:

enter image description here

While this helps to reduce the overall filesize and memory footprint, how then does Excel do sorting on a string field? Would every single string need to go through the lookup mapping: and if so, wouldn't that greatly increase the cost of / slowing down doing a sort on a string field (what if there were 1M values, 1M key lookups wouldn't be trivial). Two questions on this:

  1. Are shared strings used within the Excel application itself, or only when saving the data?
  2. What would be an example algorithm to sort on the field then? Any language is fine (c, c#, c++, python).
David542
  • 104,438
  • 178
  • 489
  • 842
  • I'll be interested in a knowledgeable answer to this as well. I can only guess that it has something to do with memory caching but can easily be wrong. – PeterT Jan 15 '20 at 19:48
  • I think the fact that this mapping exists in the physical XML representation of a document is independent on how Excel internally represents the data at runtime. I would believe that it is more computationally efficient to represent columns of data in a raw way (though this might be done in many ways). – alxrcs Jan 15 '20 at 19:59
  • @alxrcs are there any documents or books that go into the internals of Excel, similar to something like this for SQLServer? https://www.amazon.com/Pro-Server-Internals-Dmitri-Korotkevitch/dp/1430259620, or is it basically a black box outside of the ms team? – David542 Jan 15 '20 at 20:28
  • Not sure, sorry. You can find online some specifications for the file formats, but I don't think details on Excel runtime internals are that easy to find. – alxrcs Jan 19 '20 at 16:50
  • Anyway, from your second question I suspect you're more interested in the theory than in the Excel specifics, is that right? – alxrcs Jan 19 '20 at 16:51
  • @alxrcs well, I'd prefer the excel specifics if possible, but without that, then yes I suppose a theory. – David542 Jan 20 '20 at 01:21
  • It looks like Excel is using a similar trick like in .NET where it is called String.Interning. At runtime you deal with pointers to strings (no index or table lookup) but during loading you load the string value only once from the file. This approach saves memory because the same string is existing only once in memory and not million times but at the same time it will slow down loading and saving considerably because whenever a shared reference is read/written to the file Excel keeps track if a string is already in the shared string table or if a new string index needs to be allocated. – Alois Kraus Jan 26 '20 at 22:02
  • @AloisKraus -- how can you tell? Are you looking at something code-level or just inferring from the question? Also, what do you mean by 'saving'? What if you don't save, but do something like a sort-operation -- would that also slow things down? – David542 Jan 26 '20 at 23:50
  • @AloisKraus also, if it's just storing a pointer, how would you sort it? Wouldn't dereferencing `*item` 1M times add a lot of overhead? – David542 Jan 27 '20 at 01:13
  • @David542: I am inferring that from the question. There are not so many efficient in memory storage possibilities. If you want to spare even more memory you can use an integer index to a pointer array. But that would limit your unique string count to 4 billion. Using 64 bit integers would be identical to a pointer which is the reason why I am sure that this is not the case. To really measure it you can break with a debugger during the sort operation or profile it to check what the hottest code areas are and deduce from that what Excel actually uses. What would you do with that information? – Alois Kraus Jan 27 '20 at 14:48

2 Answers2

1

I can't find how exactly Excel stores cells with SharedStringTable elements in-memory at runtime, but storing them as an index of the item in SharedStringTable requires just one extra dereference to access them, assuming that the elements are stored as an array. So my guess is that this is how it is done. That is the simplest way and the only way to make it faster is having runtime representation of SharedStringTable already sorted by elements. In such case sorting by an index is equivalent to sorting by the value. That approach, however, makes the insertion operation costly as when a new string is inserted into the middle of the table all the indexes larger than it should be incremented and the number of such cells in the document can be very large, up to all the cells referring to SharedStringTable.

If the cells contain indexes same as in the file, here is how one would sort the cells represented by columnValue vector based on the strings they are pointing to stored in the sharedStrings vector (in C++ since you said there is no difference) at a cost of 2 extra dereferences per comparison operation:

// sort indexes from columnValue based on comparing values in sharedStrings
sort(columnValue.begin(), columnValue.end(), 
     [&sharedStrings](size_t i1, size_t i2){return sharedStrings[i1] < sharedStrings[i2];});

It wasn't in the OP, but the reverse SharedStringTable lookup operation is slow and caching elements into a dictionary helps.

isp-zax
  • 3,833
  • 13
  • 21
1

Microsoft Excel Shared Strings Table

Shared strings table is and Open XML standard, as defined by ISO standard - ISO/IEC 29500-1:2016(E)

Official definition of Shared strings (cited from ISO document)

Shared String Table

String values may be stored directly inside spreadsheet cell elements; however, storing the same value inside multiple cell elements can result in very large worksheet Parts, possibly resulting in performance degradation. The Shared String Table is an indexed list of string values, shared across the workbook, which allows implementations to store values only once.

ISO standard on Shared Strings can be downloaded from

https://standards.iso.org/ittf/PubliclyAvailableStandards/c071691_ISO_IEC_29500-1_2016.zip

Answers to the questions on this topic

Question 1: Are shared strings used within the Excel application itself, or only when saving the data?

Answer: Shared strings are used by Excel only at the time of saving the document, I.E., only for the purpose of storing the spreadsheet as a file on storage.

However, when the file is opened for display, the cells are populated with actual string values pulled from the shared strings table.

-

Question 2: What would be an example algorithm to sort on the field then? Any language is fine (c, c#, c++, python).

Answer: For an application like Excel, I guess that a special proprietary variation of Quick sort is the most likely algorithm to be used for sorting on string values.

Excel has a limit of 1,048,576 rows. For this size, Quick sort is definitely a winner. Quick sort can produce very efficient result for data set of this magnitude.

Here is the link to the implementation of Quick Sort in C++ for sorting strings:

http://www.cplusplus.com/forum/beginner/101599/

Gopinath
  • 4,066
  • 1
  • 14
  • 16
  • 2
    quick sort would be on the string itself, you'd need to dereference a pointer or do a lookup map a million times though, no? I think this answer is basically just saying "Yes, it does Shared Strings. Here is how to do a sort without shared strings". – David542 Jan 27 '20 at 01:15
  • 2
    Shared strings table is used only for storing the file content to disk. The ISO standard does not specify how the cells must be populated when the application is open. If cells are populated with copy of string value extracted from shared strings table, then the dereferencing can be avoided. – Gopinath Jan 27 '20 at 01:17
  • 1
    I see. Yes, my main point of interest here was how it's handled in-memory, outside of the to/from-storage aspect. Do you have any insight into that part of it? – David542 Jan 27 '20 at 01:45
  • In excel sorting, user must specify the sort order as a list of columns (Example: Sort by Column A, Then by B, Then by C, Then by D ). Suppose that the column A contains duplicate strings. While sorting, all the rows with same value for column A will be sorted on values of 'Column B'. If cells of B also contain duplicate values, then sorting will be done on Column C... so on till the column with unique values is found. If none of the columns has unique values, then the rows will be skipped. – Gopinath Jan 27 '20 at 09:36