I'm trying to understand the sorting algorithm behind SQL ORDER BY
clauses in the case that the properties are indexed.
Secondary indices are usually implemented as B+ Trees with a combined key, consisting of the indexed value and the associated primary key. For example, an index on first name may look like this:
Key | Value |
---|---|
John.id4 | null |
John.id5 | null |
Jane.id16 | null |
... | .... |
The task that sorting needs to perform is: given a set of IDs and a list of sort commands (consisting of column
and ASC/DESC
), sort the IDs.
If we only want to sort by a single column (e.g. ORDER BY FirstName
), the algorithm is easy:
- Iterate over the secondary index.
- If the ID part of the
Key
occurs in the input set, remove it from the set and add it to the (sorted) output list - Stop if the input set becomes empty or the index has reached its end, whichever occurs first
- Return the output list.
But how does the same thing work if we have multiple sortings? For example, the clause ORDER BY FirstName ASC LastName ASC
? The main issue is of course that we cannot simply tie-break between two IDs simply by looking them up in the second index, because it's sorted by index value, not by primary key. We will have to minimize the number of scans per index as much as possible.
How do big databases, such as PostGreSQL or MySQL solve this issue?