
Hi. Our PostgreSQL 12 instance gradually spiked to 100% CPU. The app team had been looking into the issue for about 60 minutes, and two different queries were consuming high CPU. The tables involved were small, so I just ran ANALYZE on all 3 involved tables to fix any underlying statistics issue, without first checking the query plans. Right after the ANALYZE run, the issue was fixed.
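For reference, the fix was presumably equivalent to re-sampling statistics on the three tables listed in the table details below:

-- Re-sample planner statistics on the tables involved in the slow queries.
ANALYZE resource_type;
ANALYZE roles;
ANALYZE user_role_mapping;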

Now we have tried to recreate the issue from a backup taken a few hours before the issue occurred, but there the query plan is the same before and after ANALYZE. We are unable to find the root cause (RCA).
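One way to narrow the RCA is to compare the planner statistics and analyze timestamps on the restored copy with those on production. A minimal sketch, assuming the standard pg_stat_user_tables and pg_stats catalog views (note that on a read replica the analyze timestamps may not be meaningful):

-- When were the tables last (auto)analyzed, and how many rows changed since?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('resource_type', 'roles', 'user_role_mapping');

-- What the planner currently believes about the filtered columns.
SELECT tablename, attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'user_role_mapping'
  AND attname IN ('user_profile_id', 'service_name', 'status');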

System details: vCPU: 8, RAM: 61 GB, Mode: slave/read replica

Table details: resource_type: 174 rows, roles: 290026 rows, user_role_mapping: 3332118 rows

Query with plan from the prod copy machine (for RCA purposes):

EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, COSTS TRUE, BUFFERS TRUE, TIMING TRUE)
SELECT *
FROM user_role_mapping u
WHERE u.user_profile_id = '44130444'
  AND u.service_name = 'GLOBAL'
  AND u.status = 'ACTIVE'
  AND EXISTS (SELECT * FROM roles r WHERE r.id = u.role_id AND r.role_name = 'Owner')
  AND EXISTS (SELECT * FROM resource_type_dummy r WHERE r.id = u.resource_type_id AND r.type = 'HOTEL');

EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, COSTS TRUE, BUFFERS TRUE, TIMING TRUE)
SELECT *
FROM user_role_mapping u
WHERE u.user_profile_id = 'x1'
  AND u.service_name = 'GLOBAL'
  AND u.status = 'ACTIVE'
  AND EXISTS (SELECT * FROM roles r WHERE r.id = u.role_id AND r.role_name = 'Owner')
  AND EXISTS (SELECT * FROM resource_type r WHERE r.id = u.resource_type_id AND r.type = 'HOTEL');

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=1.00..424.35 rows=1 width=361) (actual time=368.512..369.536 rows=1 loops=1)
  Output: u.id, u.status, u.created_by, u.created_at, u.modified_by, u.last_modified_at, u.service_name, u.crs_id, u.identity_id, u.role_id, u.resource_id, u.resource_type_id, u.user_profile_id
  Inner Unique: true
  Buffers: shared hit=3 read=9
  I/O Timings: read=369.368
  ->  Nested Loop  (cost=0.57..415.82 rows=1 width=361) (actual time=281.798..282.821 rows=1 loops=1)
        Output: u.id, u.status, u.created_by, u.created_at, u.modified_by, u.last_modified_at, u.service_name, u.crs_id, u.identity_id, u.role_id, u.resource_id, u.resource_type_id, u.user_profile_id
        Join Filter: (u.resource_type_id = r_1.id)
        Buffers: shared hit=1 read=7
        I/O Timings: read=282.696
        ->  Index Scan using type_index_resource_type on public.resource_type r_1  (cost=0.14..8.16 rows=1 width=16) (actual time=0.054..0.055 rows=1 loops=1)
              Output: r_1.id, r_1.status, r_1.created_by, r_1.created_at, r_1.modified_by, r_1.last_modified_at, r_1.service_name, r_1.crs_id, r_1.type, r_1.actions
              Index Cond: ((r_1.type)::text = 'HOTEL'::text)
              Buffers: shared hit=1 read=1
              I/O Timings: read=0.016
        ->  Index Scan using user_profile_id_index_user_role_mapping on public.user_role_mapping u  (cost=0.43..407.19 rows=37 width=361) (actual time=281.739..282.759 rows=1 loops=1)
              Output: u.id, u.status, u.created_by, u.created_at, u.modified_by, u.last_modified_at, u.service_name, u.crs_id, u.identity_id, u.role_id, u.resource_id, u.resource_type_id, u.user_profile_id
              Index Cond: ((u.user_profile_id)::text = 'x1'::text)
              Filter: (((u.service_name)::text = 'GLOBAL'::text) AND ((u.status)::text = 'ACTIVE'::text))
              Rows Removed by Filter: 2
              Buffers: shared read=6
              I/O Timings: read=282.680
  ->  Index Scan using pk_roles on public.roles r  (cost=0.42..8.44 rows=1 width=16) (actual time=86.708..86.708 rows=1 loops=1)
        Output: r.id, r.status, r.created_by, r.created_at, r.modified_by, r.last_modified_at, r.service_name, r.crs_id, r.role_name, r.team_name, r.access_category, r.crs_team_id, r.actions
        Index Cond: (r.id = u.role_id)
        Filter: ((r.role_name)::text = 'Owner'::text)
        Buffers: shared hit=2 read=2
        I/O Timings: read=86.672
Planning Time: 0.439 ms
Execution Time: 369.570 ms
(30 rows)

EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, COSTS TRUE, BUFFERS TRUE, TIMING TRUE)
SELECT CAST(urm.id AS VARCHAR), urm.status, urm.created_by, urm.created_at,
       CAST(urm.modified_by AS VARCHAR), urm.last_modified_at, urm.service_name,
       urm.crs_id, urm.identity_id, CAST(urm.role_id AS VARCHAR), urm.resource_id,
       CAST(urm.resource_type_id AS VARCHAR), urm.user_profile_id,
       r.role_name role_name, rt.type resource_type_name
FROM user_role_mapping urm
INNER JOIN (SELECT id, role_name FROM roles r WHERE r.role_name IN ('xa', 'xb', 'xc', 'xd')) r
        ON urm.role_id = r.id
INNER JOIN (SELECT id, rt.type FROM resource_type rt WHERE rt.type = 'experiment') rt
        ON urm.resource_type_id = rt.id
WHERE (urm.user_profile_id IN ('11', '12'))
  AND urm.service_name = 'CRS'
  AND urm.status = 'ACTIVE';


Nested Loop  (cost=1.00..830.89 rows=1 width=271) (actual time=240.293..240.294 rows=0 loops=1)
  Output: (urm.id)::character varying, urm.status, urm.created_by, urm.created_at, (urm.modified_by)::character varying, urm.last_modified_at, urm.service_name, urm.crs_id, urm.identity_id, (urm.role_id)::character varying, urm.resource_id, (urm.resource_type_id)::character varying, urm.user_profile_id, r.role_name, rt.type
  Inner Unique: true
  Buffers: shared hit=8 read=5
  I/O Timings: read=240.145
  ->  Nested Loop  (cost=0.57..822.43 rows=1 width=379) (actual time=240.292..240.293 rows=0 loops=1)
        Output: urm.id, urm.status, urm.created_by, urm.created_at, urm.modified_by, urm.last_modified_at, urm.service_name, urm.crs_id, urm.identity_id, urm.role_id, urm.resource_id, urm.resource_type_id, urm.user_profile_id, rt.type
        Join Filter: (urm.resource_type_id = rt.id)
        Rows Removed by Join Filter: 2
        Buffers: shared hit=8 read=5
        I/O Timings: read=240.145
        ->  Index Scan using type_index_resource_type on public.resource_type rt  (cost=0.14..8.16 rows=1 width=34) (actual time=0.018..0.019 rows=1 loops=1)
              Output: rt.id, rt.status, rt.created_by, rt.created_at, rt.modified_by, rt.last_modified_at, rt.service_name, rt.crs_id, rt.type, rt.actions
              Index Cond: ((rt.type)::text = 'experiment'::text)
              Buffers: shared hit=2
        ->  Index Scan using user_profile_id_index_user_role_mapping on public.user_role_mapping urm  (cost=0.43..814.21 rows=4 width=361) (actual time=138.694..240.267 rows=2 loops=1)
              Output: urm.id, urm.status, urm.created_by, urm.created_at, urm.modified_by, urm.last_modified_at, urm.service_name, urm.crs_id, urm.identity_id, urm.role_id, urm.resource_id, urm.resource_type_id, urm.user_profile_id
              Index Cond: ((urm.user_profile_id)::text = ANY ('{11,12}'::text[]))
              Filter: (((urm.service_name)::text = 'CRS'::text) AND ((urm.status)::text = 'ACTIVE'::text))
              Buffers: shared hit=6 read=5
              I/O Timings: read=240.145
  ->  Index Scan using pk_roles on public.roles r  (cost=0.42..8.45 rows=1 width=32) (never executed)
        Output: r.id, r.status, r.created_by, r.created_at, r.modified_by, r.last_modified_at, r.service_name, r.crs_id, r.role_name, r.team_name, r.access_category, r.crs_team_id, r.actions
        Index Cond: (r.id = urm.role_id)
        Filter: ((r.role_name)::text = ANY ('{xa, xb, xc, xd}'::text[]))
Planning Time: 0.860 ms
Execution Time: 240.358 ms
(27 rows)

Any other directions we could look in?

Monika Yadav
  • If it is not reproducible, you are out of luck. Next time, take `EXPLAIN (ANALYZE, BUFFERS)` of the query before fixing the problem. – Laurenz Albe Feb 23 '21 at 06:57
  • "Now we tried to recreate the issue from backup few hours older than issue time" That makes sense if the issue is easy to spot ,then a back up from before it was spotted would probably not display the problem. Do you have a wal archive so that you can PITR to a later time? If no, maybe set one up, for next time something like that happens. – jjanes Feb 23 '21 at 14:51
  • Maybe an autoanalyze just happened to get a bad sample, leading to the problem, and then another analyze fixed it. On your test copy of the system, you could just retry ANALYZE then the query in a loop to see if you can occasionally get the bad case (a sketch of such a loop follows after these comments). – jjanes Feb 23 '21 at 14:56
  • @LaurenzAlbe: Tried multiple EXPLAINs but no luck. In the current query plans for query 1 and query 2, do you see any scope for improvement? In query 2 there is Filter: ((r.role_name)::text = ANY ('{xa, xb, xc, xd}'::text[])). Even in cases where there are approx 20000 records for a role_name, this filter is applied instead of using an index. I believe the effect of this filter would have tripled during the issue time, though no increase in disk IOPS was noted. Is there anything we can do here? I believe most bad-plan issues can be fixed with the right query and indexing. – Monika Yadav Feb 23 '21 at 15:18
  • @jjanes: please check my above comment. – Monika Yadav Feb 23 '21 at 15:19
  • @LaurenzAlbe: Also, before analyze automatically ran on production, approx 20,000 records in the roles table were updated for one column value, but that updated column was/is not part of the query or any index. – Monika Yadav Feb 23 '21 at 16:31
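A minimal sketch of the retry loop jjanes suggests, using the first query from the post on the restored copy; the file name is hypothetical. Save it as a script, run it repeatedly (e.g. from a shell loop), and diff the captured plans to catch an occasional bad sample:

-- repro.sql (hypothetical file name): re-sample statistics, then capture
-- the plan of the problem query. Run repeatedly and compare the plans.
ANALYZE resource_type;
ANALYZE roles;
ANALYZE user_role_mapping;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM user_role_mapping u
WHERE u.user_profile_id = 'x1'
  AND u.service_name = 'GLOBAL'
  AND u.status = 'ACTIVE'
  AND EXISTS (SELECT * FROM roles r WHERE r.id = u.role_id AND r.role_name = 'Owner')
  AND EXISTS (SELECT * FROM resource_type r WHERE r.id = u.resource_type_id AND r.type = 'HOTEL');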

0 Answers