On the latest Ceph release, 17.2.6 quincy (stable), the radosgw process crashes persistently on every RGW we run. I have two RGWs, and they crash simultaneously even under minimal load on the servers, while the radosgw process constantly consumes ~100%. We run the RGWs on "CentOS Linux release 8.5.2111" and "AlmaLinux release 8.8 (Sapphire Caracal)", so I do not think the problem is tied to a particular server or operating system. Example logs:
-34> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 1 ====== starting new request req=0x7f6e8a39c710 =====
-33> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 2 req 12755806709950828565 0.000000000s initializing for trans_id = tx00000b105ba9e9a29d015-0064a3fd8d-5c7a0-eu-west-1
-32> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 2 req 12755806709950828565 0.000000000s getting op 0
-31> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 2 req 12755806709950828565 0.000000000s s3:get_obj verifying requester
-30> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj normalizing buckets and tenants
-29> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj init permissions
-28> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj recalculating target
-27> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj reading permissions
-26> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 0 req 12755806709950828565 0.002999922s s3:get_obj WARNING: couldn't find acl header for object, generating default
-25> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj init op
-24> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj verifying op mask
-23> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj verifying op permissions
-22> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0) mask=49
-21> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for uid=6016-5
-20> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Found permission: 15
-19> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=1 mask=49
-18> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
-17> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=2 mask=49
-16> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
-15> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0), owner=6016-5, perm=1
-14> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj verifying op params
-13> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj pre-executing
-12> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj check rate limiting
-11> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj executing
-10> 2023-07-04T11:07:57.236+0000 7f6eee4e6700 -1 *** Caught signal (Aborted) **
in thread 7f6eee4e6700 thread_name:radosgw
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f70c9e8fcf0]
2: gsignal()
3: abort()
4: /lib64/libstdc++.so.6(+0x9009b) [0x7f70c8e7b09b]
5: /lib64/libstdc++.so.6(+0x9653c) [0x7f70c8e8153c]
6: /lib64/libstdc++.so.6(+0x95559) [0x7f70c8e80559]
7: __gxx_personality_v0()
8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f70c885fb03]
9: _Unwind_Resume()
10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]
11: /lib64/libradosgw.so.2(+0x63048a) [0x7f70cc46b48a]
12: /lib64/libstdc++.so.6(+0xc2b13) [0x7f70c8eadb13]
13: /lib64/libpthread.so.0(+0x81ca) [0x7f70c9e851ca]
14: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-9> 2023-07-04T11:07:57.306+0000 7f706c7e2700 5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
-8> 2023-07-04T11:07:57.306+0000 7f706c7e2700 5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
-7> 2023-07-04T11:07:57.314+0000 7f709f26f700 5 RGW-SYNC:data:sync:shard[120]: failed to take lease
-6> 2023-07-04T11:07:57.347+0000 7f6f795fc700 2 req 6774614862144446470 0.191995010s s3:put_obj completing
-5> 2023-07-04T11:07:57.348+0000 7f6f795fc700 2 req 6774614862144446470 0.192994997s s3:put_obj op status=0
-4> 2023-07-04T11:07:57.348+0000 7f6f795fc700 2 req 6774614862144446470 0.192994997s s3:put_obj http status=200
-3> 2023-07-04T11:07:57.348+0000 7f6f795fc700 1 ====== req done req=0x7f6e8a41d710 op status=0 http_status=200 latency=0.192994997s ======
-2> 2023-07-04T11:07:57.348+0000 7f6f795fc700 1 beast: 0x7f6e8a41d710: [IPv6 address] - 6016-5 [04/Jul/2023:11:07:57.155 +0000] "PUT /owncloud-prod/urn%3Aoid%3A2376416?partNumber=11&uploadId=2~sX-2sT0iBoilw73U4ziIIXNCeOPgniT HTTP/1.1" 200 5242880 - "aws-sdk-php/3.134.8 Guzzle/5.3.1 curl/7.29.0 PHP/7.4.24" - latency=0.192994997s
-1> 2023-07-04T11:07:57.392+0000 7f70a4a7a700 10 monclient: tick
0> 2023-07-04T11:07:57.765+0000 7f709f26f700 5 RGW-SYNC:data:sync:shard[119]: failed to take lease
On the second RGW:
-34> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 1 ====== starting new request req=0x7f6e8a39c710 =====
-33> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 2 req 12755806709950828565 0.000000000s initializing for trans_id = tx00000b105ba9e9a29d015-0064a3fd8d-5c7a0-eu-west-1
-32> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 2 req 12755806709950828565 0.000000000s getting op 0
-31> 2023-07-04T11:07:57.228+0000 7f6eee4e6700 2 req 12755806709950828565 0.000000000s s3:get_obj verifying requester
-30> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj normalizing buckets and tenants
-29> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj init permissions
-28> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj recalculating target
-27> 2023-07-04T11:07:57.229+0000 7f6eee4e6700 2 req 12755806709950828565 0.000999974s s3:get_obj reading permissions
-26> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 0 req 12755806709950828565 0.002999922s s3:get_obj WARNING: couldn't find acl header for object, generating default
-25> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj init op
-24> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj verifying op mask
-23> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj verifying op permissions
-22> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0) mask=49
-21> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for uid=6016-5
-20> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Found permission: 15
-19> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=1 mask=49
-18> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
-17> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=2 mask=49
-16> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
-15> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 5 req 12755806709950828565 0.002999922s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0), owner=6016-5, perm=1
-14> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj verifying op params
-13> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj pre-executing
-12> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj check rate limiting
-11> 2023-07-04T11:07:57.231+0000 7f6eee4e6700 2 req 12755806709950828565 0.002999922s s3:get_obj executing
-10> 2023-07-04T11:07:57.236+0000 7f6eee4e6700 -1 *** Caught signal (Aborted) **
in thread 7f6eee4e6700 thread_name:radosgw
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f70c9e8fcf0]
2: gsignal()
3: abort()
4: /lib64/libstdc++.so.6(+0x9009b) [0x7f70c8e7b09b]
5: /lib64/libstdc++.so.6(+0x9653c) [0x7f70c8e8153c]
6: /lib64/libstdc++.so.6(+0x95559) [0x7f70c8e80559]
7: __gxx_personality_v0()
8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f70c885fb03]
9: _Unwind_Resume()
10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]
11: /lib64/libradosgw.so.2(+0x63048a) [0x7f70cc46b48a]
12: /lib64/libstdc++.so.6(+0xc2b13) [0x7f70c8eadb13]
13: /lib64/libpthread.so.0(+0x81ca) [0x7f70c9e851ca]
14: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-9> 2023-07-04T11:07:57.306+0000 7f706c7e2700 5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
-8> 2023-07-04T11:07:57.306+0000 7f706c7e2700 5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
-7> 2023-07-04T11:07:57.314+0000 7f709f26f700 5 RGW-SYNC:data:sync:shard[120]: failed to take lease
-6> 2023-07-04T11:07:57.347+0000 7f6f795fc700 2 req 6774614862144446470 0.191995010s s3:put_obj completing
-5> 2023-07-04T11:07:57.348+0000 7f6f795fc700 2 req 6774614862144446470 0.192994997s s3:put_obj op status=0
-4> 2023-07-04T11:07:57.348+0000 7f6f795fc700 2 req 6774614862144446470 0.192994997s s3:put_obj http status=200
-3> 2023-07-04T11:07:57.348+0000 7f6f795fc700 1 ====== req done req=0x7f6e8a41d710 op status=0 http_status=200 latency=0.192994997s ======
-2> 2023-07-04T11:07:57.348+0000 7f6f795fc700 1 beast: 0x7f6e8a41d710: [IPv6 address] - 6016-5 [04/Jul/2023:11:07:57.155 +0000] "PUT /owncloud-prod/urn%3Aoid%3A2376416?partNumber=11&uploadId=2~sX-2sT0iBoilw73U4ziIIXNCeOPgniT HTTP/1.1" 200 5242880 - "aws-sdk-php/3.134.8 Guzzle/5.3.1 curl/7.29.0 PHP/7.4.24" - latency=0.192994997s
-1> 2023-07-04T11:07:57.392+0000 7f70a4a7a700 10 monclient: tick
0> 2023-07-04T11:07:57.765+0000 7f709f26f700 5 RGW-SYNC:data:sync:shard[119]: failed to take lease
We are running the latest version, 17.2.6, on all MDS, MGR, MON, OSD, and RGW nodes. I tried raising and lowering rgw_thread_pool_size from its default, but that did not help. Can you help me resolve this error?
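
In case it helps with reading the backtrace: frames 10 and 11 are unresolved offsets inside libradosgw.so.2. Below is a minimal sketch in Python that pulls such frames out of a pasted dump and prints an addr2line command for each one. The function names are my own and not part of Ceph. It assumes binutils' addr2line is installed and that debug symbols matching the 17.2.6 build are available; without them addr2line will most likely just print "??".

```python
import re

# Matches unresolved frames such as:
#   10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]
# Frames that already carry a symbol name (e.g. "9: _Unwind_Resume()") do not match.
FRAME_RE = re.compile(
    r"^\s*(\d+):\s+(\S+)\(\+(0x[0-9a-fA-F]+)\)\s+\[0x[0-9a-fA-F]+\]\s*$"
)


def unresolved_frames(dump_text):
    """Return (frame_number, library_path, offset) for each unresolved frame."""
    frames = []
    for line in dump_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def addr2line_commands(frames, only="libradosgw"):
    """Build addr2line command lines for frames whose library path contains `only`."""
    return [
        f"addr2line -f -C -e {lib} {offset}"
        for _, lib, offset in frames
        if only in lib
    ]


# A few frames copied from the dump above.
sample = """\
 9: _Unwind_Resume()
 10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]
 11: /lib64/libradosgw.so.2(+0x63048a) [0x7f70cc46b48a]
 12: /lib64/libstdc++.so.6(+0xc2b13) [0x7f70c8eadb13]
"""

for command in addr2line_commands(unresolved_frames(sample)):
    print(command)
```

The printed commands are meant to be run on an RGW host with the same package build. The output for frames 10 and 11 should name the radosgw functions that were on the stack when the exception was raised.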