-1

I am writing a distributed SQL query planner(Query Engine). Data will be fetched from RDBMS(PostgreSQL) nodes involving network I/O.

I want to optimize JOIN queries.

Logical Order of Execution is:

  1. Do JOIN(make use of ON clause)
  2. Apply WHERE clause on the joined result.

I was thinking about applying Filter(WHERE clause specific to a table) first itself, and then do join. In what cases would that result in wrong results?


Example:

SELECT * 
FROM tableA 
LEFT JOIN tableB ON(tableA.col1 = tableB.col1) 
LEFT JOIN tableC ON(tableB.col2 = tableC.col1)
WHERE tableA.colY < 100 AND tableB.colX > 50 

Logical Execution:

  1. joinResult = (tableA left join tableB ON() ) left join tableC ON()
  2. Filter joinResult using given WHERE clause.

Proposed Execution:

  1. filteredA = tableA WHERE tableA.colY < 100

    filteredB = tableB WHERE tableB.colX > 50

  2. Result = (filteredA left join filteredB ON(..))left join tableC ON(..)

Can I optimize any query like this? That is filtering the table first and then applying join above that.

Edit: Some people are confusing and talking about this specific example. I am not talking about this specific example query, I am writing a query planner and I want to handle all type of queries

Please note that, each of the tables is sharded and stored in different machines, and the current execution model is to fetch each of the tables and then do join locally. So if I apply the WHERE filter before fetching, it would be better.

Insaf K
  • 336
  • 4
  • 8
  • 4
    Your `where` clause is turning the first `left join` into an inner join. – Gordon Linoff May 13 '20 at 11:00
  • It was just an example, the WHERE clause may not be there at all. Anyway I've edited it – Insaf K May 13 '20 at 11:02
  • `Logical Order of Execution is:` Wrong. There is no *logical order*. Anything goes, as long as the result is correct. – wildplasser May 13 '20 at 11:11
  • @wildplasser FROM, ON, JOIN, WHERE and so on... Source: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-transact-sql?view=sql-server-ver15#logical-processing-order-of-the-select-statement – Insaf K May 13 '20 at 11:15
  • 1
    @Insaf: that's the order of *parsing* of the statement, not the actual **execution** order –  May 13 '20 at 11:19
  • @a_horse_with_no_name I agree it's not the actual order of execution. Query planner or optimizer can change the order. In fact, what I actually wanted to ask was, if I change the order, that is filtering that table(ie., applying WHERE clause) and then doing JOIN would produce right result or not. I'll be fetching the table first, and then doing JOIN locally, each table are partitioned and stored in different nodes – Insaf K May 13 '20 at 11:23
  • There are *some* transformations on the query tree that can be proven to generate the same result. One example is your `AND tableB.colX > 50` , which is detected by most optimisers, turning that part of the query into a straight plain JOIN. – wildplasser May 13 '20 at 11:32
  • 1
    If you are writing your own DBMS, then what does this have to do with PostgreSQL? – jjanes May 13 '20 at 14:56

1 Answers1

0

This is actually a complex topic.

We can filter the table in some cases. We can also reorder outer joins and then push the filter quals inside.

I was going through a research paper regarding this, but I haven't completed it yet(may not complete it also).

So for now, for those who are looking for answers, you could probably go through this research paper particularly section 2.2. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.2531&rep=rep1&type=pdf


For now I'm relying on PostgreSQL's planner and taking its output and reconstructing the query for my requirements.

Insaf K
  • 336
  • 4
  • 8