40

Say I have a table order as

id | clientid | type | amount | itemid | date
---|----------|------|--------|--------|-----------
23 | 258      | B    | 150    | 14     | 2012-04-03
24 | 258      | S    | 69     | 14     | 2012-04-03
25 | 301      | S    | 10     | 20     | 2012-04-03
26 | 327      | B    | 54     | 156    | 2012-04-04
  • clientid is a foreign-key back to the client table
  • itemid is a foreign key back to an item table
  • type is only B or S
  • amount is an integer

and a table processed as

id | orderid | processed | date
---|---------|-----------|---------
41 | 23      | true      | 2012-04-03
42 | 24      | true      | 2012-04-03
43 | 25      | false     | <NULL>
44 | 26      | true      | 2012-04-05     

I need to get all the rows from order that for the same clientid on the same date have opposing type values. Keep in mind type can only have one of two values - B or S. In the example above this would be rows 23 and 24.

The other constraint is that the corresponding row in processed must be true for the orderid.

My query so far

SELECT c1.clientid,
       c1.date,
       c1.type,
       c1.itemid,
       c1.amount,
       c2.date,
       c2.type,
       c2.itemid,
       c2.amount

FROM   order c1
INNER JOIN order c2 ON c1.itemid    =  c2.itemid AND
                       c1.date      =  c2.date   AND
                       c1.clientid  =  c2.clientid AND
                       c1.type     <>  c2.type AND
                       c1.id        <  c2.id

INNER JOIN processed p1 ON p1.orderid   =  c1.id AND
                         p1.processed =  true
INNER JOIN processed p2 ON p2.orderid   =  c2.id AND
                         p2.processed =  true

QUESTION: Keeping the processed = true as part of the join clause is slowing the query down. If I move it to the WHERE clause then the performance is much better. This has piqued my interest and I'd like to know why.

The primary keys and respective foreign key columns are indexed while the value columns (value, processed etc) aren't.

Disclaimer: I have inherited this DB structure and the performance difference is roughly 6 seconds.

ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
Insectatorious
  • 1,305
  • 3
  • 14
  • 29
  • 3
    Can you show the execution plan (ideally using `explain analyze`) for both statements? This does sound like a bug though. You might want to upload the execution plan to http://explain.depesz.com instead of posting it inline. –  Jun 01 '12 at 11:07
  • 2
    does replacing `c1.type <> c2.type` with `c1.type = 'b' and c2.type = 's'` improve things at all? – SingleNegationElimination Jun 01 '12 at 11:54
  • 1
    @TokenMacGuy Semantically, would that not be different ie, only when and 's' comes after a 'b'? c1.id < c2.id. – Insectatorious Jun 01 '12 at 12:24
  • 1
    @a_horse_with_no_name this will be a bit tricky as I have simplified the table structure to post the question....lemme see what I can do. – Insectatorious Jun 01 '12 at 12:25
  • 1
    @Insectatorious: Is that `c1.orderid` supposed to be `c1.id` ? – ypercubeᵀᴹ Jun 01 '12 at 12:31
  • 1
    If the only indexes you have on the tables, are the ones you mention, you should see performance gain by adding indexes for this query. I'm no expert on Postgres but for `order`, a compound index on `(itemid, date, clientid, type)` seems appropriate. – ypercubeᵀᴹ Jun 01 '12 at 12:36
  • 2
    You may be able to see that they are equivalent, but the database is not likely to know that they can only occur in one order. – SingleNegationElimination Jun 01 '12 at 12:37
  • 1
    +1 for @TokenMacGuy's comments, too. Equality (`=`) conditions usually perform better than inequality ones (`<` or `<=`). Even if they are supposed to give the same result. – ypercubeᵀᴹ Jun 01 '12 at 12:39
  • 1
    @ypercube It's `c1.id` . @TokenMacGuy So lemme make sure I get this right, doing `c1.type = 'b' and c2.type = 's'` would produce the same result as `c1.type = 's' and c2.type = 'b'`? – Insectatorious Jun 01 '12 at 12:41
  • 1
    You should also consider adding an index on `(processed, orderid)`. Or even better, delete all the rows with `processed = False` from the processed table (provided that you are allowed to change the db design). – ypercubeᵀᴹ Jun 01 '12 at 12:45
  • 2
    @Insectatorious: To answer your question to @Token: No, but `(c1.type = 'b' and c2.type = 's') OR (c1.type = 's' and c2.type = 'b')` may be faster than `c1.type <> c2.type`. – ypercubeᵀᴹ Jun 01 '12 at 12:47
  • @ypercube Aha. I see. That makes sense. I *can* make changes to the db design but I would have to justify the change to the db admin first; so as long as we don't have a chain of several of these in a row that might cause a page to time-out I'm going to leave things as they are. Cheers! – Insectatorious Jun 01 '12 at 12:50
  • 1
    I meant: if all the rows with `processed = False` have no other meaningful data, it would make sense to delete them and **also drop the `processed` column**. This may of course mean, other changes that will need to be made into INSERT/DELETE/UPDATE procedures you have, besides the SELECT queries. – ypercubeᵀᴹ Jun 01 '12 at 12:55
  • 1
    @ypercube I see what you mean now. This would essentially change the logic of every query that's dependent on `processed` as it (the logic) would go from checking the value of the processed column to checking if the row exists ie, absence equals an order that has not been processed. – Insectatorious Jun 01 '12 at 13:00
  • 1
    @Insectatorious: Have you tried adding the `(processed, orderid)` index? – ypercubeᵀᴹ Jun 13 '12 at 16:06
  • 1
    @ypercube `orderid` is indexed. `processed` isn't. Going to the solution of absence equaling a pending order surely it shouldn't matter if processed in indexed? – Insectatorious Jun 13 '12 at 16:19
  • 1
    No, if you remove the `processed = TRUE` condidtions, of course it shouldn't. I meant for these versions of the query. – ypercubeᵀᴹ Jun 13 '12 at 16:23

1 Answers1

20

The reason that you're seeing a difference is due to the execution plan that the planner is putting together, this is obviously different depending on the query (arguably, it should be optimising the 2 queries to be the same and this may be a bug). This means that the planner thinks it has to work in a particular way to get to the result in each statement.

When you do it within the JOIN, the planner will probably have to select from the table, filter by the "True" part, then join the result sets. I would imagine this is a large table, and therefore a lot of data to look through, and it can't use the indexes as efficiently.

I suspect that if you do it in a WHERE clause, the planner is choosing a route that is more efficient (ie. either index based, or pre filtered dataset).

You could probably make the join work as fast (if not faster) by adding an index on the two columns (not sure if included columns and multiple column indexes are supported on Postgres yet).

In short, the planner is the problem it is choosing 2 different routes to get to the result sets, and one of those is not as efficient as the other. It's impossible for us to know what the reasons are without the full table information and the EXPLAIN ANALYZE information.

If you want specifics on why your specific query is doing this, you'll need to provide more information. However the reason is the planner choosing different routes.

Additional Reading Material:

http://www.postgresql.org/docs/current/static/explicit-joins.html

Just skimmed, seems that the postgres planner doesn't re-order joins to optimise it. try changing the order of the joins in your statement to see if you then get the same performance... just a thought.

Martin
  • 2,180
  • 4
  • 21
  • 41
  • Right....makes sense...the trouble is I've simplified the tables and their respective structures to post this question..I'll try and get the `explain analyse` – Insectatorious Jun 01 '12 at 12:27
  • 2
    You do not **force** the query planner by putting conditions in the `ON` or the `WHERE` clause. A decent optimizer/query planner should be able to identify both versions as equivalent (when they are) and choose from various execution plans. – ypercubeᵀᴹ Jun 01 '12 at 13:29
  • @ypercube Optimizer would normally push them down in as low as possible to reduce the cardinality as soon as possible, but obviously that is not good when it results in a table op instead of an index op. And then perhaps it's not smart enough to pull it up and use it later when the working set is smaller. What's most interesting is that the optimizer doesn't push around the clauses in the WHERE version to be the same. – Cade Roux Jun 01 '12 at 13:40
  • 1
    @CadeRoux: Yeah but I think Postgres is mature enough to do that. What may confuse the optimizer is that it has to join 4 tables (so quite a lot of plans there) and only a few indexes. If there were useful indexes, I think it would choose same plans in both cases. – ypercubeᵀᴹ Jun 01 '12 at 13:52
  • Maybe "Force" isn't the right word, however, the concept is correct. Maybe "Tell" is the word, but this is meant to be descriptive to people who are not familiar with planners. By doing what he's doing (JOIN vs WHERE) the planner is taking another path, and therefore there is a difference in performance. – Martin Jun 01 '12 at 13:56
  • "Force" is not the right word and neither is "have to". It's not *"Tell"* either. An SQL statement/query does not tell the DBMS how to do something, it tells what to do. If one wants to give hints of force the optimizer to do things a specific way, there are ways too (that vary from DBMS to DBMS). But the query has no such hints. – ypercubeᵀᴹ Jun 01 '12 at 15:09
  • @Martin: What you are right, is that the optimizer is choosing different (execution) paths/plans. – ypercubeᵀᴹ Jun 01 '12 at 15:10
  • We could go into technical detail about how planners work, however, that is way beyond the scope of the question. The answer gives the OP, and someone not familar with planners, the ability to understand why it is happening without providing meaningless information. – Martin Jun 01 '12 at 15:51
  • @ypercube: I've updated the answer to move away from those words, I think that covers the question that was asked, do you agree? – Martin Jun 01 '12 at 15:56