2

I am trying to determine why a fairly simple aggregate query is taking so long to perform on a single table. The table is called plots, and it is [id, device_id, time, ...] There are two indices, UNIQUE(id) and UNIQUE(device_id, time).

The query is simply:

   SELECT device_id, MIN(time) 
   FROM plots
   GROUP BY device_id

To me, this should be very fast, but it is taking 3+ minutes. The table has ~45 million rows, divided roughly equally among 1200 or so device_id's.

EXPLAIN for the query:

Finalize GroupAggregate  (cost=1502955.41..1503055.97 rows=906 width=12)
   Group Key: device_id
   ->  Gather Merge  (cost=1502955.41..1503052.35 rows=906 width=12)
         Workers Planned: 1
         ->  Sort  (cost=1501955.41..1501955.86 rows=906 width=12)
               Sort Key: device_id
               ->  Partial HashAggregate  (cost=1501943.79..1501946.51 rows=906 width=12)
                     Group Key: device_id
                     ->   Parallel Seq Scan on plots  (cost=0.00..1476417.34 rows=25526447 width=12)

EXPLAIN for query with a where device_id = xxx:

GroupAggregate  (cost=398.86..78038.77 rows=906 width=12)
   Group Key: device_id
   ->  Bitmap Heap Scan on plots  (cost=398.86..77992.99 rows=43065 width=12)
         Recheck Cond: (device_id = 6780)
         ->  Bitmap Index Scan on index_plots_on_device_id_and_time  (cost=0.00..396.71 rows=43065 width=0)
               Index Cond: (device_id = 6780)

I have done VACUUM (FULL, ANALYZE) as well as REINDEX DATABASE.

I have also tried doing partition queries to accomplish the same.

Any pointers on making this faster? Or am I just boned on the table size. It seems like it should be fine with the index though. Maybe I am missing something...

EDIT / UPDATE:

The problem seems to be resolved at this point, though I am not sure why. I have dropped and rebuilt the index many times, and suddenly the query is only taking ~7 seconds, which is acceptable. Of note, this morning I dropped the index and created a new one with the reverse column order (time, device_id) and I was surprised to see good results. I then reverted to the previous index, and the results were improved further. I will refork the production database and try to retrace my steps and post an update. Should I be worried about the query planner wonking out in the future?

Current EXPLAIN with analysis (as requested):

Finalize GroupAggregate  (cost=1000.12..480787.58 rows=905 width=12) (actual time=36.299..7530.403 rows=916 loops=1)
   Group Key: device_id
   Buffers: shared hit=135087 read=40325
   I/O Timings: read=138.419
   ->  Gather Merge  (cost=1000.12..480783.96 rows=905 width=12) (actual time=36.226..7552.052 rows=1829 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         Buffers: shared hit=509502 read=160807
         I/O Timings: read=639.797
         ->  Partial GroupAggregate  (cost=0.11..479687.58 rows=905 width=12) (actual time=15.779..5026.094 rows=914 loops=2)
               Group Key: device_id
               Buffers: shared hit=509502 read=160807
               I/O Timings: read=639.797
               ->  Parallel Index Only Scan using index_plots_time_and_device_id on plots  (cost=0.11..454158.41 rows=25526447 width=12) (actual time=0.033..2999.764 rows=21697480 loops=2)
                     Heap Fetches: 0
                     Buffers: shared hit=509502 read=160807
                     I/O Timings: read=639.797
 Planning Time: 0.092 ms
 Execution Time: 7554.100 ms
(19 rows)
valenumr
  • 85
  • 5
  • Please **[edit]** your question and add the [execution plan](https://www.postgresql.org/docs/current/static/using-explain.html) generated using **`explain (analyze, buffers, format text)`** (_not_ just a "simple" explain). An aggregate on just 45 million rows shouldn't take minutes. –  May 19 '20 at 07:52

1 Answers1

0

Approach 1:

You can try to remove your UNIQUE to an index on your database. CREATE UNIQUE INDEX and CREATE INDEX have different behaviors. I believe that you can get benefits from CREATE INDEX.

Approach 2:

You can create a materialized view. If you can get some delay on your information, you can do the following:

CREATE MATERIALIZED VIEW myreport AS 
    SELECT device_id, 
           MIN(time) AS mintime 
      FROM plots
  GROUP BY device_id

CREATE INDEX myreport_device_id ON myreport(device_id);

Also, you need to remember to regularly do:

REFRESH MATERIALIZED VIEW CONCURRENTLY myreport;

And less regularly do:

VACUUM ANALYZE myreport
William Prigol Lopes
  • 1,803
  • 14
  • 31