We designed a database so that it can accepts lots of data. For that we used partitioned table quite a lot (because the way we handle trafic information in database can take advantage of the partitionning system).
to be more precise, we have table with partition, and partition that also have partition (with 4 levels)
main table
-> sub tables partitioned by row 1 (list)
-> sub tables partitioned by row 2 (list)
...
There are 4 main partitionned tables. Each one has from 40 to 120 (sub) partitions behing.
The query that take lock and is locked by others is a SELECT that work on these 4 tables, joined. (So counting partitions it work over about 250 tables)
We had no problem until now, maybe due to trafic increase. Now Select queries that use these tables, that normally are executed in 20ms, can wait up to 10seconds , locked and waiting. When requesting pg_stat_activity I see that these queries are :
- wait_event_type : LWLock
- wait_event : lock_manager
I asked dev team and also confirmed in reading logs (primary and replica), there were nothing else running except select and insert / update queries on these tables.
These queries, select queries, are running on the replica servers.
I try to find on the internet before but everything I find is : yes there are exclusive lock on partitioned table , but it's when there are some operations like drop, attach / dettach partitions.. And it's not happening on my server while there is a problem.
Server is version 12.4, running on AWS Aurora.
What can make these queries locked and waiting for this LWLock ? What could be my options to improve this behaviour ? (help my queries no being locked...)
EDIT : Adding some details I ve been asked or that could be interesting :
- number of connections :
- usually : 10 connections opened by seconds
- in peak (when the problem appear) : from 40 to 100 connections opened by second
- during the problem, the numbre of opened connection vary from 100 to 200.
- size of the database : 30 Gb - currently lots of partitions are empty.