
The query:

SELECT "replays_game".*
FROM "replays_game"
INNER JOIN
 "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 50027

If I set enable_seqscan = off, then it does the fast thing, which is:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..27349.80 rows=3395 width=72) (actual time=28.726..65.056 rows=3398 loops=1)
   ->  Index Scan using replays_playeringame_player_id on replays_playeringame  (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.019..2.412 rows=3398 loops=1)
         Index Cond: (player_id = 50027)
   ->  Index Scan using replays_game_pkey on replays_game  (cost=0.00..5.41 rows=1 width=72) (actual time=0.017..0.017 rows=1 loops=3398)
         Index Cond: (id = replays_playeringame.game_id)
 Total runtime: 65.437 ms

But without the dreaded enable_seqscan = off hack, it chooses to do a slower thing:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=7330.18..18145.24 rows=3395 width=72) (actual time=92.380..535.422 rows=3398 loops=1)
   Hash Cond: (replays_playeringame.game_id = replays_game.id)
   ->  Index Scan using replays_playeringame_player_id on replays_playeringame  (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.020..2.899 rows=3398 loops=1)
         Index Cond: (player_id = 50027)
   ->  Hash  (cost=3668.08..3668.08 rows=151208 width=72) (actual time=90.842..90.842 rows=151208 loops=1)
         Buckets: 1024  Batches: 32 (originally 16)  Memory Usage: 1025kB
         ->  Seq Scan on replays_game  (cost=0.00..3668.08 rows=151208 width=72) (actual time=0.020..29.061 rows=151208 loops=1)
 Total runtime: 535.821 ms

Here are the relevant indexes:

Index "public.replays_game_pkey"
 Column |  Type   | Definition
--------+---------+------------
 id     | integer | id
primary key, btree, for table "public.replays_game"

Index "public.replays_playeringame_player_id"
  Column   |  Type   | Definition
-----------+---------+------------
 player_id | integer | player_id
btree, for table "public.replays_playeringame"

So my question is, what am I doing wrong that Postgres is mis-estimating the relative costs of the two ways of joining? I see in the cost estimates that it thinks the hash-join will be faster. And its estimate of the cost of the index-join is off by a factor of 500.

How can I give Postgres more of a clue? I did run a VACUUM ANALYZE immediately before running all of the above.

Interestingly, if I run this query for a player with a smaller number of games, Postgres chooses to do the index-scan + nested-loop. So something about the large number of games tickles this undesired behavior where the relative estimated cost is out of line with the actual cost.

Finally, should I be using Postgres at all? I don't wish to become an expert in database tuning, so I'm looking for a database that will perform reasonably well with a conscientious developer's level of attention, as opposed to a dedicated DBA. I am afraid that if I stick with Postgres I will have a steady stream of issues like this that will force me to become a Postgres expert, and perhaps another DB will be more forgiving of a more casual approach.


A Postgres expert (RhodiumToad) reviewed my full database settings (http://pastebin.com/77QuiQSp) and recommended set cpu_tuple_cost = 0.1. That gave a dramatic speedup: http://pastebin.com/nTHvSHVd
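
For reference, applying that recommendation is just a settings change; a minimal sketch of how it can be done (session-level for a quick test, or per-database so it persists; "mydb" below is a placeholder for your database name, and putting the line in postgresql.conf works as well):

-- try it out in the current session only
SET cpu_tuple_cost = 0.1;

-- or make it stick for new connections to one database ("mydb" is a placeholder)
ALTER DATABASE mydb SET cpu_tuple_cost = 0.1;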

Alternatively, switching to MySQL also solved the problem pretty nicely. I have a default installation of MySQL and Postgres on my OS X box, and MySQL is 2x faster, comparing queries that are "warmed up" by repeatedly executing the query. On "cold" queries, i.e. the first time a given query is executed, MySQL is 5 to 150 times faster. The performance of cold queries is pretty important for my particular application.

The big question, as far as I'm concerned, is still outstanding -- will Postgres require more fiddling and configuration to run well than MySQL? For example, consider that none of the suggestions offered by the commenters here worked.

dsjoerg
  • I take it `player_id` is not a unique column in the `replays_playeringame` table? What's the distribution of values like? – Neil May 17 '12 at 21:12
  • What are the settings for {work_mem, shared_buffers, effective_cache_size} ? How many NULL values are there for "public.replays_playeringame_player_id" ? How many unique FK's ? (oh, that's what Neil said) – wildplasser May 17 '12 at 21:12
  • @Neil, player_id is not unique within replays_playeringame. replays_playeringame is a many-to-many join table, with indexed fields player_id and game_id. The distribution of values is uneven -- the most active player is in 3,000 games, while the median player is probably in 2 games. – dsjoerg May 17 '12 at 21:19
  • Temporary conclusion: you have the default buffer settings. The working set fits in core, but the DBMS does not know this. (In both cases the estimates are by far larger than the actual measured times.) – wildplasser May 17 '12 at 21:21
  • @wildplasser, the settings are { 1MB, 24MB, 128MB } respectively. No idea if those are reasonable for my system, those were chosen somehow when I installed postgres on my dev box. There are 0 NULL values in replays_playeringame.player_id, if that's what you're asking. And there are 159k distinct values for player_id in replays_playeringame. – dsjoerg May 17 '12 at 21:22
  • 128M is small (for a current machine) for effective_cache_size. Try setting it to (total_memory - memory_used_by_all_processes). You did not alter the other cost constants? Also note that the row estimates are almost correct. – wildplasser May 17 '12 at 21:28
  • @wildplasser, the estimates are using an arbitrary unit, not in time. – sayap May 17 '12 at 22:33
  • "*I'm afraid I will have a steady stream of issues like this*": you will have that with **any** database. Don't believe anyone who claims that a DBMS does not need attention or that it always runs every statement with the ultimately best execution plan. – May 17 '12 at 22:34
  • @sayap: wildplasser was referring to the **row** estimates not the cost. –  May 17 '12 at 22:39
  • @a_horse_with_no_name: wildplasser mentioned "in bost cases the estimates are by far larger than the actual measured times", like comparing one to the other. – sayap May 17 '12 at 22:43
  • @a_horse_with_no_name: correct. OMG they just started to tune the page fetching costs! BTW: for SSD or NAS, random_cost *could* be set equal to sequential cost, depending on the phase of the moon. – wildplasser May 17 '12 at 22:44

4 Answers


My guess is that you are using the default random_page_cost = 4, which is way too high, making index scans look too costly to the planner.

I tried to reconstruct the two tables with this script:

CREATE TABLE replays_game (
    id integer NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE replays_playeringame (
    player_id integer NOT NULL,
    game_id integer NOT NULL,
    PRIMARY KEY (player_id, game_id),
    CONSTRAINT replays_playeringame_game_fkey
        FOREIGN KEY (game_id) REFERENCES replays_game (id)
);

CREATE INDEX ix_replays_playeringame_game_id
    ON replays_playeringame (game_id);

-- 150k games
INSERT INTO replays_game
SELECT generate_series(1, 150000);

-- ~150k players, ~2 games each
INSERT INTO replays_playeringame
select trunc(random() * 149999 + 1), generate_series(1, 150000);

INSERT INTO replays_playeringame
SELECT *
FROM
    (
        SELECT
            trunc(random() * 149999 + 1) as player_id,
            generate_series(1, 150000) as game_id
    ) AS t
WHERE
    NOT EXISTS (
        SELECT 1
        FROM replays_playeringame
        WHERE
            t.player_id = replays_playeringame.player_id
            AND t.game_id = replays_playeringame.game_id
    )
;

-- the heavy player with 3000 games
INSERT INTO replays_playeringame
select 999999, generate_series(1, 3000);

With the default value of 4:

game=# set random_page_cost = 4;
SET
game=# explain analyse SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1483.54..4802.54 rows=3000 width=4) (actual time=3.640..110.212 rows=3000 loops=1)
   Hash Cond: (replays_game.id = replays_playeringame.game_id)
   ->  Seq Scan on replays_game  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.012..34.261 rows=150000 loops=1)
   ->  Hash  (cost=1446.04..1446.04 rows=3000 width=4) (actual time=3.598..3.598 rows=3000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 106kB
         ->  Bitmap Heap Scan on replays_playeringame  (cost=67.54..1446.04 rows=3000 width=4) (actual time=0.586..2.041 rows=3000 loops=1)
               Recheck Cond: (player_id = 999999)
               ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..66.79 rows=3000 width=0) (actual time=0.560..0.560 rows=3000 loops=1)
                     Index Cond: (player_id = 999999)
 Total runtime: 110.621 ms

After lowering it to 2:

game=# set random_page_cost = 2;
SET
game=# explain analyse SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=45.52..4444.86 rows=3000 width=4) (actual time=0.418..27.741 rows=3000 loops=1)
   ->  Bitmap Heap Scan on replays_playeringame  (cost=45.52..1424.02 rows=3000 width=4) (actual time=0.406..1.502 rows=3000 loops=1)
         Recheck Cond: (player_id = 999999)
         ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..44.77 rows=3000 width=0) (actual time=0.388..0.388 rows=3000 loops=1)
               Index Cond: (player_id = 999999)
   ->  Index Scan using replays_game_pkey on replays_game  (cost=0.00..0.99 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3000)
         Index Cond: (id = replays_playeringame.game_id)
 Total runtime: 28.542 ms
(8 rows)

If using SSD, I would lower it further to 1.1.
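
If you want the lower value to stick rather than setting it per session, one way (just a sketch; "game" is the test database from the script above, and editing random_page_cost in postgresql.conf works too):

SET random_page_cost = 1.1;                      -- current session only, handy for experimenting
ALTER DATABASE game SET random_page_cost = 1.1;  -- persists for new connections to that database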

As for your last question, I really think you should stick with PostgreSQL. I have experience with both PostgreSQL and MSSQL, and I need to put in triple the effort into the latter for it to perform half as well as the former.

sayap
  • If anyone is interested, I once encountered a similar issue, posted to the postgresql mailing list, and a long discussion was spawned: http://www.mail-archive.com/pgsql-performance@postgresql.org/msg41005.html – sayap May 17 '12 at 22:46
  • that discussion was effectively throttled by Tom Lane stating that effective_cache_size also affects the decision to use index scans. BTW: nice test setup. But I am afraid your strategy succeeds for the wrong reasons. – wildplasser May 17 '12 at 23:13
  • Also note that the ratio between estimated and actual time is still wrong by a factor of ~40. You are just lucky because your OS has more buffers than the optimiser assumed. (This will change once the tables get bigger or the memory pressure gets higher.) – wildplasser May 17 '12 at 23:31
  • @sayap I am indeed running with the default random_page_cost of 4. However, do you agree with the assertion [here](http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server) that "this is not where you should start to search for plan problems... Instead, you should start by [doing] all the things gone over above"? – dsjoerg May 18 '12 at 00:26
  • @dsjoerg: Sure, you should also tune the other settings mentioned in the wiki page. random_page_cost was indeed the last thing I changed when I encountered the similar issue last time :) – sayap May 18 '12 at 04:19
  • @wildplasser: There is no estimated time in the EXPLAIN ANALYSE output. – sayap May 18 '12 at 04:25
  • I ran some tests, setting random_page_cost to {4,2,1}. Results were indeed like this. **Until I scaled up the data** by a factor of ten (both total size and the hard-hitter): now the chosen plan was actually the best, even for random_page_cost=4. My explanation is that the original query executes fully in core, and the disk-I/O cost factors lose their grip on reality for such a small footprint. (They do affect the query plan, obviously.) BTW: setting work_mem to a lower value has a similar effect on the generated plan. A lower fraction of hard-hitters yields better plans. – wildplasser May 18 '12 at 12:17
  • I set random_page_cost to 2, and the chosen plan was still not the actual optimal plan. I happen to also have a second database (production!) that already had random_page_cost set to 2, and it was also running with the suboptimal plan. – dsjoerg May 18 '12 at 16:59

I ran sayap's testbed code (thanks!), with the following modifications:

  • code is run four times with random_page_cost set to 8, 4, 2, 1, in that order, roughly as sketched below (the rpc=8 run is only intended to prime the disk-buffer-cache)
  • The test is repeated with a reduced (1/2, 1/4, 1/8) fraction of the hard-hitters (respectively 3K, 1K5, 750, and 375 hard-hitters); the rest of the records is kept unchanged.
  • These 4*4 tests are repeated with a lower setting (64K, the minimum) for work_mem.
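
The per-run knob changes are nothing more exotic than session-level SETs before each EXPLAIN ANALYSE; roughly this pattern (a sketch of the idea, not the exact harness; 999999 is the hard-hitter from sayap's script):

SET random_page_cost = 8;    -- then 4, 2, 1 on the following runs
SET work_mem = '16MB';       -- '64kB' (the minimum) for the low-work_mem series
EXPLAIN ANALYSE
SELECT g.*
FROM replays_game g
JOIN replays_playeringame pig ON g.id = pig.game_id
WHERE pig.player_id = 999999;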

After this run, I did the same run, but scaled up tenfold: with 1M5 records (30K hard-hitters).

Currently, I am running the same test with a hundred-fold scale-up, but the initialisation is rather slow...

Results

The entries in the cells are the total time in msec, plus a string that denotes the chosen query plan (only a handful of plans occur).

Original 3K / 150K  work_mem=16M

rpc     |       3K      |       1K5     |       750     |       375
--------+---------------+---------------+---------------+------------
8*      | 50.8  H.BBi.HS| 44.3  H.BBi.HS| 38.5  H.BBi.HS| 41.0  H.BBi.HS
4       | 43.6  H.BBi.HS| 48.6  H.BBi.HS| 4.34  NBBi    | 1.33  NBBi
2       | 6.92  NBBi    | 3.51  NBBi    | 4.61  NBBi    | 1.24  NBBi
1       | 6.43  NII     | 3.49  NII     | 4.19  NII     | 1.18  NII


Original 3K / 150K work_mem=64K

rpc     |       3K      |       1K5     |       750     |       375
--------+---------------+---------------+---------------+------------
8*      | 74.2  H.BBi.HS| 69.6  NBBi    | 62.4  H.BBi.HS| 66.9  H.BBi.HS
4       | 6.67  NBBi    | 8.53  NBBi    | 1.91  NBBi    | 2.32  NBBi
2       | 6.66  NBBi    | 3.6   NBBi    | 1.77  NBBi    | 0.93  NBBi
1       | 7.81  NII     | 3.26  NII     | 1.67  NII     | 0.86  NII


Scaled 10*: 30K / 1M5  work_mem=16M

rpc     |       30K     |       15K     |       7k5     |       3k75
--------+---------------+---------------+---------------+------------
8*      | 623   H.BBi.HS| 556   H.BBi.HS| 531   H.BBi.HS| 14.9  NBBi
4       | 56.4  M.I.sBBi| 54.3  NBBi    | 27.1  NBBi    | 19.1  NBBi
2       | 71.0  NBBi    | 18.9  NBBi    | 9.7   NBBi    | 9.7   NBBi
1       | 79.0  NII     | 35.7  NII     | 17.7  NII     | 9.3   NII


Scaled 10*: 30K / 1M5  work_mem=64K

rpc     |       30K     |       15K     |       7k5     |       3k75
--------+---------------+---------------+---------------+------------
8*      | 729   H.BBi.HS| 722   H.BBi.HS| 723   H.BBi.HS| 19.6  NBBi
4       | 55.5  M.I.sBBi| 41.5  NBBi    | 19.3  NBBi    | 13.3  NBBi
2       | 70.5  NBBi    | 41.0  NBBi    | 26.3  NBBi    | 10.7  NBBi
1       | 69.7  NII     | 38.5  NII     | 20.0  NII     | 9.0   NII

Scaled 100*: 300K / 15M  work_mem=16M

rpc     |       300k    |       150K    |       75k     |       37k5
--------+---------------+---------------+---------------+---------------
8*      |7314   H.BBi.HS|9422   H.BBi.HS|6175   H.BBi.HS| 122   N.BBi.I
4       | 569   M.I.sBBi| 199   M.I.sBBi| 142   M.I.sBBi| 105   N.BBi.I
2       | 527   M.I.sBBi| 372   N.BBi.I | 198   N.BBi.I | 110   N.BBi.I
1       | 694   NII     | 362   NII     | 190   NII     | 107   NII

Scaled 100*: 300K / 15M  work_mem=64K

rpc     |       300k    |       150k    |       75k     |       37k5
--------+---------------+---------------+---------------+------------
8*      |22800 H.BBi.HS |21920 H.BBi.HS | 20630 N.BBi.I |19669  H.BBi.HS
4       |22095 H.BBi.HS |  284 M.I.msBBi| 205   B.BBi.I |  116  N.BBi.I
2       |  528 M.I.msBBi|  399  N.BBi.I | 211   N.BBi.I |  110  N.BBi.I
1       |  718 NII      |  364  NII     | 200   NII     |  105  NII

[8*] Note: the RandomPageCost=8 runs were only intended as a prerun to prime the disk buffer cache; the results should be ignored.

Legend for node types:
N := Nested loop
M := Merge join
H := Hash (or Hash join)
B := Bitmap heap scan
Bi := Bitmap index scan
S := Seq scan
s := sort
m := materialise

Preliminary conclusion:

  • "the working set" for the original query is too small: all of it fits in core, resulting in the cost of page fetches to be grossly overestimated. Setting RPC to 2 (or 1) "solves" this problem, but once the query is scaled-up, the page-costs become dominant, and RPC=4 becomes comparable or even better.

  • Setting work_mem to a lower value is another way to make the optimiser shift to index-scans (instead of hash+bitmap-scans). The differences I found are smaller than what Sayap reported. Maybe I have more effective_cache_size, or he forgot to prime the cache?

  • The optimiser is known to have problems with "skewed" distributions (and "skewed" or "peaked" multidimensional distributions). The test runs with 1/4 and 1/8 of the initial 3K/150K hard-hitters show that this effect vanishes once the "peak" flattens out.
  • Something happens at the 2% boundary: the 3000/150000 case generates different (worse) plans than those with <2% hard-hitters. Could this be the granularity of the histograms?
wildplasser

This is an old post, but quite helpful; I just encountered a similar issue.

Here is my finding so far. Given that there are 151208 rows in replays_game, the average cost of hitting an item is about log(151208) = 12. Since there are 3395 records in replays_playeringame after filtering, the average cost is 12 * 3395, which is rather high. Also, the planner overestimated the page cost: it assumes all rows are randomly distributed, which they are not. If that were true, a seq scan would indeed be much better. So basically, the query plan is trying to avoid the worst-case scenario.

@dsjoerg's problem is that there is no index on replays_playeringame(game_id). An index scan would always be used if there were an index on replays_playeringame(game_id): the cost of scanning the index would become 3395 + 12 (or something close to that).

@Neil suggested an index on (player_id, game_id), which is close but not exact. The right index to have is either (game_id) or (game_id, player_id).
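
In other words, something along these lines (index names are arbitrary):

-- single-column variant
CREATE INDEX replays_playeringame_game_id
    ON replays_playeringame (game_id);

-- or the two-column variant
CREATE INDEX replays_playeringame_game_player
    ON replays_playeringame (game_id, player_id);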

Gang Liang

You might get a better execution plan by using a multicolumn (player_id, game_id) index on the replays_playeringame table. This avoids having to use a random page seek to look up the game id(s) for the player id.
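
A sketch of the suggested index (the name is arbitrary):

CREATE INDEX replays_playeringame_player_game
    ON replays_playeringame (player_id, game_id);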

Neil