Group by too slow on Amazon RDS Postgres

Question

I am running Postgres 9.4.4 on an Amazon RDS db.r3.4xlarge instance - 16CPUs, 122GB Memory. I recently came across one of the queries which needed a fairly straight forward aggregation on a large table (~270 million records). The query takes over 5 hours to execute.

The joining column and the grouping column on the large table have indexes defined. I have tried experimenting with the work_mem and temp_buffers by setting each to 1GB but it dint help much.

Here's the query and the execution plan. Any leads will be highly appreciated.

explain SELECT
largetable.column_group,
MAX(largetable.event_captured_dt) AS last_open_date,
.....   

FROM largetable

LEFT JOIN smalltable
ON smalltable.column_b = largetable.column_a

WHERE largetable.column_group IS NOT NULL

GROUP BY largetable.column_group

Here is the execution plan -

GroupAggregate  (cost=699299968.28..954348399.96 rows=685311 width=38)
  Group Key: largetable.column_group
  ->  Sort  (cost=699299968.28..707801354.23 rows=3400554381 width=38)
        Sort Key: largetable.column_group
        ->  Merge Left Join  (cost=25512.78..67955201.22 rows=3400554381 width=38)
              Merge Cond: (largetable.column_a = smalltable.column_b)
              ->  Index Scan using xcrmstg_largetable_launch_id on largetable  (cost=0.57..16241746.24 rows=271850823 width=34)
                    Filter: (column_a IS NOT NULL)
              ->  Sort  (cost=25512.21..26127.21 rows=246000 width=4)
                    Sort Key: smalltable.column_b
                    ->  Seq Scan on smalltable  (cost=0.00..3485.00 rows=246000 width=4)

score 1 · Answer 1 · answered Nov 30 '15 at 01:20

1

You say the joining key and the grouping key on the large table are indexed, but you don't mention the joining key on the small table.

The merges and sorts are a big source of slowness. However, I'm also worried that you're returning ~700,000 rows of data. Is that really useful to you? What's the situation where you need to return that much data, but a 5 hour wait is too long? If you don't need all that data coming out, then filtering as early as possible is by far and away the largest speed gain you'll realize.

answered Nov 30 '15 at 01:20

Nathaniel Waisbrot

23,261
7
71
99

Thanks for your inputs.. I did try indexing the smaller table as well but again, it dint help much. Also about filtering rows early, unfortunately we need to generate whole of this dataset everyday. It seemed to be working fine (less than 10 mins) when we were using Redshift to do the same task. However, due to some technical requirements we needed to shift to RDS. – Vaidhyanathan Srikumar Subrama Nov 30 '15 at 06:24
That makes sense. Redshift would have sharded your data and you'd be doing multiple smaller joins in parallel. Maybe you should try to shift those technical requirements. After you've spend 10 minutes running the query in RS, you've got another 290 minutes to work before you're as slow as your current solution. – Nathaniel Waisbrot Dec 01 '15 at 12:59

Group by too slow on Amazon RDS Postgres

1 Answers1