Why Select queries on partitioned tables take lock and get stuck

Question

We designed a database so that it can accepts lots of data. For that we used partitioned table quite a lot (because the way we handle trafic information in database can take advantage of the partitionning system).

to be more precise, we have table with partition, and partition that also have partition (with 4 levels)

 main table
   -> sub tables partitioned by row 1 (list)
        -> sub tables partitioned by row 2 (list)
             ...

There are 4 main partitionned tables. Each one has from 40 to 120 (sub) partitions behing.

The query that take lock and is locked by others is a SELECT that work on these 4 tables, joined. (So counting partitions it work over about 250 tables)

We had no problem until now, maybe due to trafic increase. Now Select queries that use these tables, that normally are executed in 20ms, can wait up to 10seconds , locked and waiting. When requesting pg_stat_activity I see that these queries are :

wait_event_type : LWLock
wait_event : lock_manager

I asked dev team and also confirmed in reading logs (primary and replica), there were nothing else running except select and insert / update queries on these tables.

These queries, select queries, are running on the replica servers.

I try to find on the internet before but everything I find is : yes there are exclusive lock on partitioned table , but it's when there are some operations like drop, attach / dettach partitions.. And it's not happening on my server while there is a problem.

Server is version 12.4, running on AWS Aurora.

What can make these queries locked and waiting for this LWLock ? What could be my options to improve this behaviour ? (help my queries no being locked...)

EDIT : Adding some details I ve been asked or that could be interesting :

number of connections :
- usually : 10 connections opened by seconds
- in peak (when the problem appear) : from 40 to 100 connections opened by second
- during the problem, the numbre of opened connection vary from 100 to 200.
size of the database : 30 Gb - currently lots of partitions are empty.

What is the number of active database connections when that happens? — Laurenz Albe, Mar 24 '21 at 16:39
yup, sorry, here are some figures.. when ok : there are only ~10 connections by second. When it begins to lock, there are about 30 to 90 connection by second.. when there is the pb, so, there are between 100 and 200 connection on the database. About the query, it makes a join on 4 partitionned table. Each partitionned table has 40 to 120 partitions. Currently most of them are empty. The DB is only 30 GB large for now. — Enialis, Mar 25 '21 at 14:44

score 1 · Answer 1 · answered Mar 25 '21 at 15:23

1

You are probably suffering from internal contention on database resources caused by too many connections that all compete to use the same shared data structures. It is very hard to pinpoint the exact resource with the little information provided, but the high number of connections is a strong indication.

What you need is a connecrion pool that maintains a small number of persistent database connections. That will at the same time reduce the problematic contention and do away with the performance wasted on opening lots of short-lived database connections. Your overall throughput will increase.

If your application has no connection pool built in, use pgBouncer.

answered Mar 25 '21 at 15:23

Laurenz Albe

209,280
17
206
263

thanks @Laurenz Albe - I add one and it's better now... I set pgBouncer because I used it before. Though, it's not working with multiple slaves (for loadbalancing...) do you know which one I could use for this usage ? – Enialis Aug 09 '21 at 16:06
You could experiment with pgPool, but I recommend sticking with pgBouncer and using dedicated load balancing software like HAProxy. – Laurenz Albe Aug 15 '21 at 14:14

Why Select queries on partitioned tables take lock and get stuck

1 Answers1