1

I'm trying to understand the sorting algorithm behind SQL ORDER BY clauses in the case that the properties are indexed.

Secondary indices are usually implemented as B+ Trees with a combined key, consisting of the indexed value and the associated primary key. For example, an index on first name may look like this:

Key Value
John.id4 null
John.id5 null
Jane.id16 null
... ....

The task that sorting needs to perform is: given a set of IDs and a list of sort commands (consisting of column and ASC/DESC), sort the IDs.

If we only want to sort by a single column (e.g. ORDER BY FirstName), the algorithm is easy:

  • Iterate over the secondary index.
  • If the ID part of the Key occurs in the input set, remove it from the set and add it to the (sorted) output list
  • Stop if the input set becomes empty or the index has reached its end, whichever occurs first
  • Return the output list.

But how does the same thing work if we have multiple sortings? For example, the clause ORDER BY FirstName ASC LastName ASC? The main issue is of course that we cannot simply tie-break between two IDs simply by looking them up in the second index, because it's sorted by index value, not by primary key. We will have to minimize the number of scans per index as much as possible.

How do big databases, such as PostGreSQL or MySQL solve this issue?

Martin Häusler
  • 6,544
  • 8
  • 39
  • 66
  • How vendors achieve that is commercially confidential, so you aren't going to find out here. Different vendors won't necessarily use the same method; different queries might use different methods, depending on which is most efficient for the vendors structure; the same query might use different methods at different runs, depending on the relative cardinalities of the various keys. If the `WHERE` is filtering by columns nothing to do with the `ORDER BY`, probably any indexes are no help. At worst: `FULL TABLE SCAN` to grab the records into a cache; old-fashioned sort to `ORDER` them. – AntC Aug 20 '21 at 21:56
  • Why do you want to find out/what difference does it make to you? The only thing you might observe is different response times for the 'same' query over the 'same' dataset. And you'll need millions of rows and plenty of runs to see any patterns of difference. – AntC Aug 20 '21 at 21:59
  • Why: I need to implement something similar and I don't want to reinvent the wheel. Proprietary: there are plenty of open source databases out there and they all need to solve this problem. I was hoping that a contributor may have an answer. – Martin Häusler Aug 21 '21 at 08:30
  • No you don't need to implement something similar. It'll be orders-of-magnitude more costly and time consuming to develop; and a maintenance nightmare. What you are proposing is exactly reinventing the wheel, even if you copy what a wheelwright has already done. Go out and buy a DBMS off the shelf. – AntC Aug 21 '21 at 10:13
  • 1
    1) PostGres is not SQL compliant, it is not "big", it is just heavily marketed freeware. 2) Real SQL platforms (Sybase; DB2; MS; Informix) solved this in the 1980's. 3) If you tighten up the terms in your Q, I can answer it. All indices are B-Trees, the index record is a KeyValue (whatever the index is created upon, which may be multiple columns) plus a location (page id, row-in-page). "Secondary" in meaningless. Replace the "null" with something meaningful. – PerformanceDBA Aug 22 '21 at 10:26

0 Answers0