
I'm attempting to optimize a rather large query that has 3 nested subqueries (like Russian dolls). The query itself is generated by South from a Django project, and I freely admit I'm no expert at SQL optimization. My strategy thus far is to start with the innermost query and work my way outwards.


Thus, the first and innermost query is

SELECT
  DISTINCT ON (quote_id) quote_id,
  MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id, created_at
ORDER BY quote_id, created_at DESC;

The EXPLAIN ANALYZE for the above is

 Unique  (cost=535905.10..610867.38 rows=3331657 width=12) (actual time=4364.469..7587.242 rows=1462625 loops=1)
   ->  GroupAggregate  (cost=535905.10..602538.24 rows=3331657 width=12) (actual time=4364.467..6996.550 rows=3331656 loops=1)
         Group Key: quote_id, created_at
         ->  Sort  (cost=535905.10..544234.24 rows=3331657 width=12) (actual time=4364.460..5574.351 rows=3331657 loops=1)
               Sort Key: quote_id, created_at
               Sort Method: external merge  Disk: 84648kB
               ->  Seq Scan on billing_pricequotestatus  (cost=0.00..61080.57 rows=3331657 width=12) (actual time=0.013..854.722 rows=3331657 loops=1)
 Planning time: 0.107 ms
 Execution time: 7759.317 ms
(9 rows)
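
Note the Sort node spills to disk (external merge: 84648kB). One experiment I could run (256MB is just a guess at a value, not a recommendation) is raising work_mem for the session to see whether the sort stays in memory:

    -- Session-level only; 256MB is an arbitrary example value
    SET work_mem = '256MB';

    EXPLAIN ANALYZE
    SELECT
      DISTINCT ON (quote_id) quote_id,
      MAX(created_at) AS max_created_at
    FROM billing_pricequotestatus
    GROUP BY quote_id, created_at
    ORDER BY quote_id, created_at DESC;

    RESET work_mem;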

Table structure is

                                    Table "public.billing_pricequotestatus"
   Column   |           Type           |                               Modifiers
------------+--------------------------+-----------------------------------------------------------------------
 id         | integer                  | not null default nextval('billing_pricequotestatus_id_seq'::regclass)
 created_at | timestamp with time zone | not null
 updated_at | timestamp with time zone | not null
 notes      | text                     | not null
 name       | character varying(20)    | not null
 quote_id   | integer                  | not null
Indexes:
    "billing_pricequotestatus_pkey" PRIMARY KEY, btree (id)
    "billing_pricequotestatus_quote_id" btree (quote_id)
    "status_timestamp_idx" btree (quote_id, created_at)
Foreign-key constraints:
    "quote_id_refs_id_2b0d5331de8d31b7" FOREIGN KEY (quote_id) REFERENCES billing_pricequote(id) DEFERRABLE INITIALLY DEFERRED

I've tried http://explain.depesz.com/, but I'm not entirely sure how to derive next steps from the report. I've also found an article suggesting that the ORDER BY clause can be removed if the SELECT returns the rows in order anyway, which I think might be the case here? I'm unsure how to tell.

If I remove the ORDER BY clause, that shaves off ~3410 ms, but I feel this should be faster (if I do a straight SELECT with no aggregate function, DISTINCT, or ordering, my baseline time appears to be 832.427 ms). I've seen several other SO posts regarding tables 10x the size of mine getting 3-5x better performance with the right indexes. I know it's not always an apples-to-apples comparison, but I'm hoping for some insight anyway.
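
From what I've read, a recursive CTE can emulate a "loose index scan" that hops between distinct quote_ids via the existing indexes instead of reading every row. A sketch of that approach (untested against my data; the CTE name is my own):

    WITH RECURSIVE quotes AS (
      -- Anchor: the smallest quote_id in the table
      SELECT min(quote_id) AS quote_id
      FROM billing_pricequotestatus
      UNION ALL
      -- Recursive step: hop to the next distinct quote_id
      SELECT (SELECT min(quote_id)
              FROM billing_pricequotestatus
              WHERE quote_id > quotes.quote_id)
      FROM quotes
      WHERE quotes.quote_id IS NOT NULL
    )
    SELECT
      quote_id,
      -- Per-quote max can be answered from status_timestamp_idx
      (SELECT max(created_at)
       FROM billing_pricequotestatus b
       WHERE b.quote_id = quotes.quote_id) AS max_created_at
    FROM quotes
    WHERE quote_id IS NOT NULL;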

neezer
    If the data needs to be ordered in a certain way then you should always have an `ORDER BY`. Assuming that the SQL engine will order things for you without it for whatever reason is a big mistake. That said, you should remove the `ORDER BY` because ordering in a subquery makes no sense. – Tom H Mar 04 '16 at 16:06
  • Actually this query does not make any sense: you're grouping by quote_id and by created_at, while at the same time trying to get unique quote_ids with the maximal date. I cannot imagine what the results would be; probably you need only `SELECT quote_id, MAX(created_at) AS max_created_at FROM billing_pricequotestatus GROUP BY quote_id` – Iłya Bursov Mar 04 '16 at 16:09
  • @TomH "That said, you should remove the ORDER BY because ordering in a subquery makes no sense." Good to know, thanks! – neezer Mar 04 '16 at 16:46
  • `ORDER BY` is required for `DISTINCT ON`, so the general rule that ORDER BY is not needed in a subquery has exceptions (it's also good, if not required, to have ORDER BY if you have LIMIT in a subquery). The problem of course here is that the MAX() and GROUP BY cancel out with the DISTINCT ON. You don't need both. Either GROUP BY or DISTINCT ON. – ypercubeᵀᴹ Mar 04 '16 at 20:41

2 Answers


That's confusing: you create a distinct list of quote_id, created_at, and then apply a MAX(created_at) plus DISTINCT ON (quote_id)?

This should return the same result:

SELECT
  quote_id,
  MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id
dnoeth
  • That does indeed produce the same output and reduces execution time by 2 seconds, leaving it at 4-5 seconds. Anyway to optimize further? – neezer Mar 04 '16 at 16:39
  • @neezer: Does it use the existing index `"status_timestamp_idx" btree (quote_id, created_at)`? – dnoeth Mar 04 '16 at 20:58
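
One way to answer that question is to compare plans with sequential scans penalized (enable_seqscan is a diagnostic knob only, not something to set in production):

    SET enable_seqscan = off;  -- diagnostic only: makes the planner avoid seq scans

    EXPLAIN ANALYZE
    SELECT quote_id, MAX(created_at) AS max_created_at
    FROM billing_pricequotestatus
    GROUP BY quote_id;

    RESET enable_seqscan;

If the plan now shows an Index Only Scan on `status_timestamp_idx` and runs faster, the planner's cost estimates (or stale statistics) are steering it toward the sequential scan.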

It looks like a mix of two different solutions to the same problem: get the max of created_at for each distinct quote_id.

1)

SELECT
  quote_id,
  MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id

2)

SELECT
  distinct on (quote_id) quote_id,
  created_at
FROM billing_pricequotestatus
ORDER BY quote_id, created_at DESC

Probably something is wrong with the query generator.
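
If you keep form 2), note that the existing `status_timestamp_idx` is `(quote_id, created_at)` ascending, which doesn't match the mixed `ORDER BY quote_id, created_at DESC`. An index matching that sort order (the index name here is made up) would let Postgres read the rows pre-sorted:

    -- Hypothetical index; column order/direction matches
    -- ORDER BY quote_id, created_at DESC
    CREATE INDEX billing_pqs_quote_created_desc_idx
      ON billing_pricequotestatus (quote_id, created_at DESC);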

Abelisto