This may be more related to the Ceph cluster than to the Python library boto3. When an OSD goes down, the whole cluster responds 502 (Bad Gateway) to our S3 clients (boto3, s3cmd, rclone, aws-cli).
The big picture of my cluster:
- 3 Rados Gateway nodes, each running Nginx with fastcgi_pass to civetweb; a ceph-monitor runs on the same servers.
- 5 OSD servers with 3 OSDs each.
Everything works fine until one OSD goes down. The cluster immediately enters WARNING status and starts remapping the PGs to the other OSDs (I'm using replica 3 for the data pool). But while the cluster is recovering, it responds 502 to all S3 clients, even for something as simple as listing buckets.
self.resource = boto3.setup_default_session(**credentials)
for bucket in self.resource.buckets.all():
yield bucket
for bucket in self.resource.buckets.all():
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/boto3/resources/collection.py", line 83, in __iter__
for page in self.pages():
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/boto3/resources/collection.py", line 161, in pages
pages = [getattr(client, self._py_operation_name)(**params)]
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/botocore/client.py", line 324, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/botocore/client.py", line 622, in _make_api_call
raise error_class(parsed_response, operation_name)
ClientError: An error occurred (502) when calling the ListBuckets operation (reached max retries: 4): Bad Gateway
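The traceback shows that botocore's built-in retries (4 of them) were already exhausted. As a client-side mitigation while the gateway issue persists, one option is to treat 502-style errors during recovery as transient and add an extra retry-with-backoff layer on top. This is only a sketch: `retry_on_bad_gateway` is a hypothetical helper of my own, and the string-based error check stands in for matching `botocore.exceptions.ClientError` properly in real code.

```python
import time

def retry_on_bad_gateway(operation, attempts=5, base_delay=1.0):
    """Call `operation` and retry on transient gateway-style failures.

    `operation` is any zero-argument callable, e.g. a lambda wrapping
    `list(resource.buckets.all())`. The exception filter below is a
    sketch; real code should catch botocore.exceptions.ClientError and
    inspect its response metadata instead of the message string.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            message = str(exc)
            # Only retry errors that look like gateway failures (502/503/504).
            if not any(code in message for code in ("502", "503", "504")):
                raise
            # Out of attempts: re-raise the last error.
            if attempt == attempts - 1:
                raise
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** attempt))
```

Usage would be something like `retry_on_bad_gateway(lambda: list(self.resource.buckets.all()))`. This only papers over the symptom, though; it does not explain why the gateways return 502 at all during recovery.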
I expected Ceph to cope with a down OSD, since there are still 2 replicas of each object, but instead the whole service goes down.
Do you have any idea what is happening here?