6

I have 2 tables, "transaksi" and "buku". "transaksi" has around ~250k rows, and buku has around ~170k rows. Both tables have column called "k999a", and both tables use no indexes. Now I check these 2 statements.

Statement 1:

explain select k999a from transaksi where k999a not in (select k999a from buku);

Statement 1 outputs:

 Seq Scan on transaksi  (cost=0.00..721109017.46 rows=125426 width=9)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..5321.60 rows=171040 width=8)
           ->  Seq Scan on buku  (cost=0.00..3797.40 rows=171040 width=8)

Statement 2:

explain select k999a from transaksi where k999a in (select k999a from buku);

Statement 2 outputs:

Hash Semi Join  (cost=6604.40..22664.82 rows=250853 width=9)
   Hash Cond: (transaksi.k999a = buku.k999a)
   ->  Seq Scan on transaksi  (cost=0.00..6356.53 rows=250853 width=9)
   ->  Hash  (cost=3797.40..3797.40 rows=171040 width=8)
         ->  Seq Scan on buku  (cost=0.00..3797.40 rows=171040 width=8)

Why in the NOT IN query, postgresql does loop join, making the query takes a long time?

PS: postgresql version 9.6.1 on windows 10

aaron
  • 257
  • 6
  • 15
  • 1
    why no indexes? – Paul Maxwell Oct 04 '18 at 03:36
  • I don't know why Postgres chose to hash `buku` in one case, and materialize `buku` in memory in the other case. The bottom line is that if you really wanted these queries to run fast, you'd index `k999a` in the `buku` table. – Tim Biegeleisen Oct 04 '18 at 03:48
  • 1
    @TimBiegeleisen What difference make an index on buku? He is doing a full scan of that table anyway. – Juan Carlos Oropeza Oct 04 '18 at 03:58
  • @JuanCarlosOropeza Who says we do a full scan on that table if there is a B-tree built around `k999a`? Are you saying Postgres would not take advantage of this index, if it existed? – Tim Biegeleisen Oct 04 '18 at 03:59
  • 1
    Im saying on first select `(select k999a from buku)` you get a materialize table and you need the whole table to check `NOT IN` so in that case an index doesnt help. On the second one even when also a select looks like planner do a `SEMI JOIN` in that case the index would help – Juan Carlos Oropeza Oct 04 '18 at 04:02
  • @Used_By_Already a dummy table for learning, wanna check the entire plan 1st – aaron Oct 04 '18 at 12:15
  • 1
    @JuanCarlosOropeza just tried using index on buku.k999a, no difference on plan – aaron Oct 04 '18 at 12:17

1 Answers1

11

This is to be expected. You may get better performance using WHERE NOT EXISTS instead:

SELECT k999a
FROM transaksi
WHERE NOT EXISTS (
    SELECT 1 FROM buku WHERE buku.k999a = transaksi.k999a LIMIT 1
);

Here is a good explanation as to why for each of the methods: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/

Nick
  • 7,103
  • 2
  • 21
  • 43
  • @TimBiegeleisen - I can tell you that is certainly not the case in PostgreSQL. `EXISTS` and `IN` are often the same, but not true for anti-joins – Nick Oct 04 '18 at 03:44
  • @TimBiegeleisen - it is possible only when subselect result should not be NULL. PostgreSQL cannot to ensure this behave, and then `NOT IN` cannot be translated to antijoin. Other databases maybe can ensure this behave or their implementation is not semantically correct. – Pavel Stehule Oct 04 '18 at 03:46
  • @Nicarus so basically, the problem is caused by the implementation of NOT IN by postgres? at least that's what I get from the link you gave – aaron Oct 04 '18 at 12:27