
I am trying to test the efficiency of different partitioning schemes in PostgreSQL. I created several databases, each with one table but a different partition size, and then populated hundreds of millions of records into each.

Now I would like to run some queries on all of them and compare the results. The issue is that I get different actual times when I run the same query on the same database.

For example, I executed `EXPLAIN (buffers, analyze, verbose) SELECT count(*) FROM testdata WHERE dt = '2017-06-01 08:50:00'` the first time:

                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=356199.96..356199.97 rows=1 width=8) (actual time=155757.190..155757.190 rows=1 loops=1)
  Output: count(*)
  Buffers: shared read=290555
  ->  Append  (cost=0.00..354506.51 rows=677381 width=0) (actual time=165.565..155583.815 rows=665275 loops=1)
        Buffers: shared read=290555
        ->  Seq Scan on public.testdata  (cost=0.00..0.00 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
              Filter: (testdata.dt = '2017-06-01 08:50:00'::timestamp without time zone)
        ->  Bitmap Heap Scan on public.testdata_2017_06_01  (cost=12682.26..354506.51 rows=677380 width=0) (actual time=165.562..155319.677 rows=665275 loops=1)
              Recheck Cond: (testdata_2017_06_01.dt = '2017-06-01 08:50:00'::timestamp without time zone)
              Rows Removed by Index Recheck: 31006916
              Heap Blocks: exact=25283 lossy=263451
              Buffers: shared read=290555
              ->  Bitmap Index Scan on testdata_2017_06_01_dt_idx  (cost=0.00..12512.91 rows=677380 width=0) (actual time=159.597..159.597 rows=665275 loops=1)
                    Index Cond: (testdata_2017_06_01.dt = '2017-06-01 08:50:00'::timestamp without time zone)
                    Buffers: shared read=1821
Planning time: 75.419 ms
Execution time: 155757.564 ms
(17 rows)

and the second time:

                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=356199.96..356199.97 rows=1 width=8) (actual time=18871.164..18871.164 rows=1 loops=1)
  Output: count(*)
  Buffers: shared read=290555
  ->  Append  (cost=0.00..354506.51 rows=677381 width=0) (actual time=148.208..18731.027 rows=665275 loops=1)
        Buffers: shared read=290555
        ->  Seq Scan on public.testdata  (cost=0.00..0.00 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
              Filter: (testdata.dt = '2017-06-01 08:50:00'::timestamp without time zone)
        ->  Bitmap Heap Scan on public.testdata_2017_06_01  (cost=12682.26..354506.51 rows=677380 width=0) (actual time=148.204..18489.171 rows=665275 loops=1)
              Recheck Cond: (testdata_2017_06_01.dt = '2017-06-01 08:50:00'::timestamp without time zone)
              Rows Removed by Index Recheck: 31006916
              Heap Blocks: exact=25283 lossy=263451
              Buffers: shared read=290555
              ->  Bitmap Index Scan on testdata_2017_06_01_dt_idx  (cost=0.00..12512.91 rows=677380 width=0) (actual time=143.119..143.119 rows=665275 loops=1)
                    Index Cond: (testdata_2017_06_01.dt = '2017-06-01 08:50:00'::timestamp without time zone)
                    Buffers: shared read=1821
Planning time: 2.105 ms
Execution time: 18871.482 ms
(17 rows)

The difference is huge (155 seconds versus 18 seconds), and my first thought was that the result was cached after the first run. But neither explain above displays any hit values in the buffers sections. Also, the estimated costs of both executions are the same. Why, then, is the actual time different? Is there some service work, such as creating a query plan, that takes such a significant amount of time?

If I execute the same query for another date, the situation repeats. If PostgreSQL caches the query plan, will it be regenerated for every new parameter value? That does not make sense to me, because then it would effectively work like caching of query results.
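
(For reference, one way to check whether plan caching plays a role at all: plain SQL statements are planned on every execution, while a PREPAREd statement can reuse its plan across parameter values. The statement name count_dt below is just an illustration, not something from my setup.)

PREPARE count_dt(timestamp) AS
    SELECT count(*) FROM testdata WHERE dt = $1;

-- Each EXECUTE reuses the prepared statement; comparing its timing with the plain
-- query shows how much of the difference could possibly come from planning.
EXECUTE count_dt('2017-06-01 08:50:00');
EXECUTE count_dt('2017-06-02 08:50:00');

DEALLOCATE count_dt;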

A side question: what is a better way to get a pure measurement of query execution time, so that I can compare different partitioning schemes? In MySQL I could use SELECT SQL_NO_CACHE to disable any caching for a query. In PostgreSQL there is DISCARD ALL, I think, but it does not help me with the explains above.
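
(For reference, this is how DISCARD ALL would be used; as far as I understand it only resets session-local state, not shared_buffers or the OS page cache, which would explain why it does not help with the explains above. The pg_prewarm part is only an idea for making runs comparable; the extension is standard contrib, and the object names are the ones from my setup.)

-- Resets prepared statements, temporary tables, session GUCs, etc.,
-- but does NOT evict pages from shared_buffers or the OS page cache:
DISCARD ALL;

-- Alternative idea: compare all partitioning schemes from an identical warm state
-- by explicitly loading the relevant partition and its index into shared_buffers first.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('testdata_2017_06_01');
SELECT pg_prewarm('testdata_2017_06_01_dt_idx');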

Update #1:

Definition of the main testdata table:

=> \d+ testdata;
                                    Table "public.testdata"
   Column   |            Type             |   Modifiers   | Storage  | Stats target | Description
------------+-----------------------------+---------------+----------+--------------+-------------
 param1id   | integer                     | not null      | plain    |              |
 param2id   | integer                     |               | plain    |              |
 param3id   | integer                     |               | plain    |              |
 devid      | integer                     |               | plain    |              |
 nameid     | integer                     |               | plain    |              |
 dt         | timestamp without time zone | default now() | plain    |              |
 value      | double precision            |               | plain    |              |
 svalue     | character varying(255)      |               | extended |              |
Triggers:
    testdata_partition_insert_trigger BEFORE INSERT ON testdata FOR EACH ROW EXECUTE PROCEDURE create_partition_and_insert()
Child tables: testdata_2017_06_01,
              testdata_2017_06_02,
              ... other child tables, one per day ...
              testdata_2017_06_30

Definition of the child table testdata_2017_06_01:

\d+ testdata_2017_06_01;
                               Table "public.testdata_2017_06_01"
   Column   |            Type             |   Modifiers   | Storage  | Stats target | Description
------------+-----------------------------+---------------+----------+--------------+-------------
 param1id   | integer                     | not null      | plain    |              |
 param2id   | integer                     |               | plain    |              |
 param3id   | integer                     |               | plain    |              |
 devid      | integer                     |               | plain    |              |
 nameid     | integer                     |               | plain    |              |
 dt         | timestamp without time zone | default now() | plain    |              |
 value      | double precision            |               | plain    |              |
 svalue     | character varying(255)      |               | extended |              |
Indexes:
    "testdata_2017_06_01_devid_idx" btree (devid)
    "testdata_2017_06_01_dt_idx" btree (dt)
    "testdata_2017_06_01_nameid_idx" btree (nameid)
Check constraints:
    "partition_check" CHECK (dt >= '2017-06-01 00:00:00'::timestamp without time zone AND dt <= '2017-06-01 23:59:59'::timestamp without time zone)
Inherits: testdata
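
For context, this is roughly how such a daily child table can be set up with inheritance. The actual body of create_partition_and_insert() is not included here, so the trigger function below is only a hypothetical sketch of an insert-routing trigger, not my real one:

-- Child table for one day, inheriting all columns from testdata
CREATE TABLE testdata_2017_06_02 (
    CONSTRAINT partition_check CHECK (
        dt >= '2017-06-02 00:00:00' AND dt <= '2017-06-02 23:59:59')
) INHERITS (testdata);

CREATE INDEX ON testdata_2017_06_02 (dt);
CREATE INDEX ON testdata_2017_06_02 (devid);
CREATE INDEX ON testdata_2017_06_02 (nameid);

-- Hypothetical routing trigger: the real create_partition_and_insert() may differ,
-- e.g. it may also create a missing partition on the fly.
CREATE OR REPLACE FUNCTION create_partition_and_insert() RETURNS trigger AS $$
BEGIN
    EXECUTE format('INSERT INTO %I SELECT ($1).*',
                   'testdata_' || to_char(NEW.dt, 'YYYY_MM_DD'))
    USING NEW;
    RETURN NULL;  -- the row goes into the child table, not into testdata itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER testdata_partition_insert_trigger
    BEFORE INSERT ON testdata
    FOR EACH ROW EXECUTE PROCEDURE create_partition_and_insert();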

Update #2:

After turning on track_io_timing, following @rvs's advice, I rebooted the instance and the picture changed significantly. First, freeable memory increased from 6 GB to almost 8 GB (screenshot). Second, the planner started using an Index Only Scan instead of a Bitmap Heap Scan. Third, all queries now take much less time (about 2.4 ms for the first query and 0.041 ms when it hits the cache, against the initial 155 and 18 seconds). And last, queries started hitting the cache (in the examples above, all explains displayed only read buffers).

                                                                               QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.59..4.60 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=4
   ->  Append  (cost=0.00..4.58 rows=2 width=0) (actual time=0.012..0.012 rows=0 loops=1)
         Buffers: shared hit=4
         ->  Seq Scan on public.testdata  (cost=0.00..0.00 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               Filter: (testdata.dt = '2017-06-15 18:10:10'::timestamp without time zone)
         ->  Index Only Scan using testdata_2017_06_15_dt_idx on public.testdata_2017_06_15  (cost=0.56..4.58 rows=1 width=0) (actual time=0.009..0.009 rows=0 loops=1)
               Index Cond: (testdata_2017_06_15.dt = '2017-06-15 18:10:10'::timestamp without time zone)
               Heap Fetches: 0
               Buffers: shared hit=4
 Planning time: 0.888 ms
 Execution time: 0.044 ms
(13 rows)

Did the huge number of data inserts overflow all the available RAM (8 GB)? We inserted one partition with 40 million records and then copied it to the other partitions. Should we expect something like this in production (we are going to insert about 1k records per second with bulk inserts)? It is quite frightening if we need to restart the database instance from time to time.

Thank you for helping!

dmkov
  • Good question. Does the fact that most of the difference is in the `Bitmap Heap Scan` phase raise any suspicion? – joanolo Jul 01 '17 at 10:50
  • And when you submit the same query a third time, it will again take 18 seconds? – wildplasser Jul 01 '17 at 11:56
  • @wildplasser every subsequent execution takes less time until it settles at about 5 seconds. A new query fairly consistently starts at 150-160 seconds, then it may drop to 5 seconds on the second execution, or take two or three executions to get there (for example: first 150 seconds, second 30, third 15, and fourth the final 5 seconds). After that, every subsequent execution takes the same 5 seconds, until I change the WHERE parameter to another date. – dmkov Jul 01 '17 at 15:08
  • @joanolo, good point. The inner `Bitmap Index Scan` takes about the same time in both cases, but the first `Bitmap Heap Scan` spends significantly more time processing the data. Is there any way I can look more deeply into what happens at this step? – dmkov Jul 01 '17 at 15:18
  • Is your machine being used by other processes that might be interfering with the database? Is it a VM? Can you reproduce the same scenario on a laptop with no other processes playing around? – joanolo Jul 01 '17 at 15:44
  • @joanolo I use Amazon RDS instances (with only one request at a time from me), so that is probably not the issue – dmkov Jul 01 '17 at 15:51
  • Can you provide the definition of your table `testdata`? It looks like you have it time-partitioned with inheritance. Are you using `brin` indexes? How many rows are in the specific table `testdata_2017_06_01`? I am trying to reproduce your setup, but I get timings so different that my comparison doesn't make sense. – joanolo Jul 01 '17 at 16:10
  • @joanolo, you are right, I am testing time-based partitions with 40 million records in each partition. I updated the original post with the table definitions (they are btree indexes, not brin). Here is a SQL dump if you would like to create them: https://paste.ofcode.org/FdNu4VGpnAS54fkacVAtVn – dmkov Jul 01 '17 at 17:56

1 Answer


What's the amount of data and how much RAM do you have?

Most likely your subsequent queries are hitting the cache (PostgreSQL will use the OS-level cache as well). Try with set track_io_timing = true in your session and see what I/O timings Postgres reports for you.
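
For example (a sketch using the query from the question; the I/O Timings lines only show up when track_io_timing is enabled and the node actually had to read blocks):

SET track_io_timing = on;  -- may require superuser or, on RDS, the DB parameter group

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT count(*) FROM testdata WHERE dt = '2017-06-01 08:50:00';

-- Nodes that read blocks now also report something like:
--   Buffers: shared read=290555
--   I/O Timings: read=123456.789
-- If the reported read time is small even though many blocks were read, those reads
-- were served from the OS page cache; if it is large, they were real disk reads.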

The estimates are likely off due to an incorrect setting of effective_cache_size.
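
For instance (a sketch; the 6GB value is only an illustration based on the 8GB instance mentioned in the question, roughly following the usual 50-75%-of-RAM rule of thumb):

SHOW effective_cache_size;

-- effective_cache_size allocates nothing; it only tells the planner how much of the
-- data it may assume is cached (shared_buffers plus the OS page cache).
ALTER SYSTEM SET effective_cache_size = '6GB';
SELECT pg_reload_conf();
-- On Amazon RDS this is changed through the DB parameter group instead of ALTER SYSTEM.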

To answer your side question: do not look for 'pure' time. Try to replicate the real scenario. Will this query be run repeatedly? Will other queries run in parallel? Always run everything multiple times. Cache sizing is an important part of performance tuning.

rvs
  • Thank you for the reply! It seems the issue really is somewhere with RAM (please look at **Update #2** in the initial post). I enabled track_io_timing, but now the queries hit the cache, so I did not get any I/O timings on the second execution. As for your comment about 'pure' time, I completely agree with it, but for now I would like to test different partitioning approaches on 4 databases with the same amount of data. So I really want to compare timings at a very low level first, and then tune the selected option with cache size and other parameters. – dmkov Jul 03 '17 at 10:36