Long running PSQL queries timing out randomly - cannot analyse

Question

I have some PSQL queries running on RDS. 90% of the time these queries will run fine. However occasionally these queries will randomly timeout and not execute. I have enabled logging and auto_explain however auto_explain will only log query plans for queries that complete. If I increase the statement_timeout the queries will still continue to timeout at random intervals with no explanation.

Has anyone seen this issue before or have any idea how to analyse queries that do not complete?

Have you tried writing logs inbetween the statements and write them to a seperate table to find out where the timeout appears? — Jakob, Jan 23 '18 at 12:58
Hi, I have logs that tell me what statement causes the timeout, when the timeout occurs, and how long the statement took before it timed out. However beyond this I cannot get a query plan or any other information that may explain why the statement will timeout sometimes but not all the timeout. The timeouts also take significantly longer than the average time for the statement. — Angelis, Jan 23 '18 at 13:01
What are the queries doing? Are other databases involved, or just local? — Jakob, Jan 23 '18 at 13:09
The query retrieves information from one table that has an index on it, the table is in a database that contains other tables but the query doesn't use any joins on it. The table is also quite small with ~360k rows. — Angelis, Jan 23 '18 at 13:10
We had several problems which could be related with it. First - check "keepalive" settings on database and on OS. Values can help to explain if "keepalive" mechanism is responsible for canceling those sessions. But if so it does not mean it is bad. Second - is there some correlation between timeouted sessions and resources (average load on CPUs, memory) being used - e.g. sessions timeout only during high average load etc? Third - if you increase work_mem for these sessions does it help to lower number of timeouted sessions? — JosMac, Jan 23 '18 at 13:16
Set `log_lock_waits = on`, maybe locks are blocking your queries. — Laurenz Albe, Jan 23 '18 at 13:32
What is leading you to conclude that the queries are timing out? Where exactly is the error being raised, and what does it say? — Nick Barnes, Jan 23 '18 at 14:03
Have checked for locks, the query only reads from the table and does not write to it. We get an error thrown that says ```ERROR: canceling statement due to statement timeout```, this happens directly after the query (we log exactly which query times out). We have multiple queries running one after the other and it seems to be random as to which query will time out. — Angelis, Jan 23 '18 at 14:39
Have also tried to load auto_explain for any long running queries however that does not seem to be functioning correctly on AWS. — Angelis, Jan 23 '18 at 14:59
Were you able to solve the issue? We're running into a similar issue and we're still stuck - https://stackoverflow.com/questions/62557194/is-autovacuum-vacuum-the-reason-why-this-postgresql-update-query-occasionally/62558429?noredirect=1#comment110759452_62558429 — John Kenn, Aug 04 '20 at 12:14

Long running PSQL queries timing out randomly - cannot analyse

0 Answers0