This may be more related to the Ceph cluster than to the Python library boto3. When an OSD goes down, the whole cluster responds 502 (Bad Gateway) to our S3 clients (boto3, s3cmd, rclone, aws-cli).
The big picture of my cluster:
- 3 Rados Gateway nodes, each running Nginx with fastcgi_pass to civetweb; a ceph-monitor runs on the same servers.
- 5 OSD servers with 3 OSDs each.
Everything works fine until one OSD goes down. The cluster immediately enters WARNING status and starts remapping the PGs to the other OSDs (I'm using replica 3 for the data pool). But while the cluster is recovering, it responds 502 to all S3 clients, even for something as simple as listing buckets.
self.resource = boto3.setup_default_session(**credentials)
for bucket in self.resource.buckets.all():
yield bucket
for bucket in self.resource.buckets.all():
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/boto3/resources/collection.py", line 83, in __iter__
for page in self.pages():
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/boto3/resources/collection.py", line 161, in pages
pages = [getattr(client, self._py_operation_name)(**params)]
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/botocore/client.py", line 324, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/venv/azionmanager/lib/python2.7/site-packages/botocore/client.py", line 622, in _make_api_call
raise error_class(parsed_response, operation_name)
ClientError: An error occurred (502) when calling the ListBuckets operation (reached max retries: 4): Bad Gateway
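The traceback shows that botocore's built-in retries (4 of them) were already exhausted. As a client-side mitigation while the gateway issue persists, one option is to treat 502-style errors during recovery as transient and add an extra retry-with-backoff layer on top. This is only a sketch: `retry_on_bad_gateway` is a hypothetical helper of my own, and the string-based error check stands in for matching `botocore.exceptions.ClientError` properly in real code.

```python
import time

def retry_on_bad_gateway(operation, attempts=5, base_delay=1.0):
    """Call `operation` and retry on transient gateway-style failures.

    `operation` is any zero-argument callable, e.g. a lambda wrapping
    `list(resource.buckets.all())`. The exception filter below is a
    sketch; real code should catch botocore.exceptions.ClientError and
    inspect its response metadata instead of the message string.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            message = str(exc)
            # Only retry errors that look like gateway failures (502/503/504).
            if not any(code in message for code in ("502", "503", "504")):
                raise
            # Out of attempts: re-raise the last error.
            if attempt == attempts - 1:
                raise
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** attempt))
```

Usage would be something like `retry_on_bad_gateway(lambda: list(self.resource.buckets.all()))`. This only papers over the symptom, though; it does not explain why the gateways return 502 at all during recovery.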
I expected Ceph to cope with a down OSD, since there are still 2 replicas of each object, but instead the whole service goes down.
Do you have any idea what is happening here?