
I'm trying to handle Couchbase bootstrap failure gracefully and not fail the application startup. The idea is to use "Couchbase as a service", so that if I can't connect to it, I should still be able to return a degraded response. I've been able to somewhat achieve this by using the Couchbase async API; RxJava FTW.

The problem is that when the server is down, the Couchbase Java client goes crazy and keeps trying to connect to it. From what I can see, the class that does this is ConfigEndpoint, and there's no limit on how many times it tries before giving up. This floods the logs with java.net.ConnectException: Connection refused errors. What I'd like is for it to try a few times and then stop.

Got any ideas that can help?

Edit:

Here's a sample app.

Steps to reproduce the problem:

  1. svn export https://github.com/asarkar/spring/trunk/beer-demo.
  2. From the beer-demo directory, run ./gradlew bootRun. Wait for the application to start up.
  3. From another console, run curl -H "Accept: application/json" "http://localhost:8080/beers". The client request is going to time out due to the failure to connect to Couchbase, but the Couchbase client is going to keep flooding the console.
Abhijit Sarkar
  • Just talked with a colleague about this and one thing is ambiguous. Are you saying the bootstrap fails at startup, or that it fails later when there's a failure in the cluster? We think it should fail nearly immediately if it can't connect at startup. Maybe a small sample app would help. – Matt Ingenthron Sep 01 '17 at 19:22
  • @MattIngenthron See edit with sample app. – Abhijit Sarkar Sep 04 '17 at 09:00
  • It seems from the code that you have your own implementation of spring-data-couchbase. Is there a particular reason that the existing spring-data-couchbase cannot be used? I believe it's not working because of issues in your implementation of the Spring integration. – subhashni Sep 15 '17 at 21:19
  • @subhashni I already answered your question on your fake answer which got deleted. – Abhijit Sarkar Sep 16 '17 at 01:38
  • "spring data cb can't handle bootstrap failure. And yours is not an answer to my question" - This was your answer. It was marked for deletion based on report that it was a fake answer. Thanks. – subhashni Sep 16 '17 at 02:03
  • @subhashni which part of my previous answer can you not understand? – Abhijit Sarkar Sep 16 '17 at 03:08
  • You had complained about spring-data-couchbase, which is not even used in your project. I'd asked if there was a reason for not using it and implementing your own instead. – subhashni Sep 16 '17 at 03:10
  • @subhashni The keyword here is "I had". There's no mention of spring data in my question now. I started with spring data, but it proved inadequate for my purposes. Now the problem is what the question says it is. – Abhijit Sarkar Sep 16 '17 at 03:25
  • Without getting into the details of your specific project, it would be best to try this sample CB Spring project and see if you can reproduce the issue: https://github.com/couchbaselabs/try-cb-java. I don't see the bootstrap issue you had mentioned elsewhere. – subhashni Sep 16 '17 at 03:29
  • @subhashni I don't give a crap about some hello world project someone has done. If you've a comment based on the project I've linked here, feel free to voice it. Otherwise, thanks for your time, move on. – Abhijit Sarkar Sep 16 '17 at 03:31
  • Sure, I just tried to point where the issue lies. – subhashni Sep 16 '17 at 03:35

3 Answers


The reason we choose to have the client continue connecting is that Couchbase is typically deployed in high-availability clustered situations. Most people who run our SDK want it to keep trying to work. We do it pretty intelligently, I think, in that we do an exponential backoff and have tuneables so it's reasonable out of the box and can be adjusted to your environment.

As to what you're trying to do, one of the tuneables is related to retry. With adjustment of the timeout value and the retry, you can have the client referenceable by the application and simply fast fail if it can't service the request.
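To make the "fast fail" idea concrete, here is a minimal sketch that uses only the JDK (Java 9+ for `completeOnTimeout`). It does not touch the Couchbase SDK's own timeout or retry settings; `FastFail` and `callWithTimeout` are hypothetical names, and the supplier stands in for whatever call the application makes to the data store.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class FastFail {
    /**
     * Runs a (hypothetical) data-store call on the common pool and stops
     * waiting after the timeout. Returns the fallback if the call is too
     * slow or throws, so the caller is never blocked for long.
     */
    static <T> T callWithTimeout(Supplier<T> call, Duration timeout, T fallback) {
        return CompletableFuture.supplyAsync(call)
                // Complete with the fallback if no result arrives in time.
                .completeOnTimeout(fallback, timeout.toMillis(), TimeUnit.MILLISECONDS)
                // Treat any failure of the call as "service unavailable".
                .exceptionally(e -> fallback)
                .join();
    }
}
```

Note that the timed-out call is not cancelled here; it keeps running in the background, so this only bounds how long the request waits, not what the client does underneath.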

The other option is that we do have a way to let your application know what node would handle the request (or null if the bootstrap hasn't been done) and you can use this to implement circuit breaker like functionality. For a future release, we're looking to add circuit breakers directly to the SDK.
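For readers unfamiliar with the pattern, a circuit breaker can be sketched in plain Java as below. This is an illustration under assumptions, not the SDK's API: `CircuitBreaker`, its thresholds, and the injected clock are all hypothetical, and the sketch counts failed calls rather than asking the SDK which node would handle a request.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class CircuitBreaker {
    private final int failureThreshold;  // consecutive failures before opening
    private final long openMillis;       // how long to stay open
    private final LongSupplier clock;    // injectable time source (millis)
    private int consecutiveFailures;
    private long openedAt;
    private boolean open;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Runs the action unless the breaker is open; never throws to the caller. */
    synchronized <T> T call(Supplier<T> action, T fallback) {
        if (open && clock.getAsLong() - openedAt < openMillis) {
            return fallback; // open: skip the call entirely
        }
        try {
            T result = action.get(); // closed, or a trial call after the cool-down
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true; // (re)open and restart the cool-down
                openedAt = clock.getAsLong();
            }
            return fallback;
        }
    }
}
```

The method is synchronized for simplicity, which serializes calls; a production breaker would use finer-grained state. In an application the clock would typically be `System::currentTimeMillis`.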

All of that said, these are not the normal path, as the intent is that your Couchbase Cluster is up, running and accessible most of the time. Failures trigger failovers through auto-failover, which brings things back to availability. By design, Couchbase trades off some availability for consistency of the data being accessed, and offers replica reads from exception handlers and other intentionally stale reads that you can opt into if you need them.

Hope that helps and glad to get any feedback on what you think we should do differently.

Matt Ingenthron
  • Thanks for your response. I edited my question to remove the "crappy design" part: in hindsight, everything is easy, but good software design is iterative and based on feedback. FWIW, I've made a small contribution to the CB codebase in the past :) That said, if I understood your suggestion, you are suggesting tuning `Reliability Options#retryStrategy` and `Timeout Options#socketConnect`? Based on this [post](https://forums.couchbase.com/t/what-is-best-way-to-stop-the-connection-retry-in-java-sdk-2-1-4/4998), the former property wouldn't help, and I've in fact tried that without success. – Abhijit Sarkar Sep 01 '17 at 05:24
  • I'm considering giving up on the Java client and making straight REST calls. – Abhijit Sarkar Sep 01 '17 at 05:26
  • Thanks for the edit, will see if I can get a better description or example here for you. – Matt Ingenthron Sep 01 '17 at 18:10
  • I posted a sample app a couple of days ago and edited my question to provide more info. – Abhijit Sarkar Sep 07 '17 at 17:38
  • Will try to have a look today-- we were going through releases last week so my time has been a bit oversubscribed. – Matt Ingenthron Sep 11 '17 at 16:19
  • Owing to some time constraints, going to ask a colleague to have a look. Apologies, haven't forgotten! – Matt Ingenthron Sep 13 '17 at 00:36

Solved this issue myself. The client I designed handles the following use cases:

  1. The client startup must be resilient to CB failure/unavailability.
  2. The client must not fail the request, but return a degraded response instead, if CB is not available.
  3. The client must reconnect should a CB failover happen.
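As a rough illustration of how those three requirements can fit together (this is not the implementation from the blog post, and it does not use the Couchbase SDK), here is a minimal sketch. `LazyConnection`, the `connector` supplier, and the connection type `C` are hypothetical stand-ins.

```java
import java.util.function.Function;
import java.util.function.Supplier;

final class LazyConnection<C> {
    // Hypothetical: opens a connection, throwing a RuntimeException on failure.
    private final Supplier<C> connector;
    private volatile C connection;

    /** Does not connect, so application startup cannot fail here. */
    LazyConnection(Supplier<C> connector) {
        this.connector = connector;
    }

    /** Never throws: returns the fallback if connecting or querying fails. */
    <T> T query(Function<C, T> query, T fallback) {
        try {
            C c = connection;
            if (c == null) {
                synchronized (this) {
                    if (connection == null) {
                        connection = connector.get(); // first use, or reconnect
                    }
                    c = connection;
                }
            }
            return query.apply(c);
        } catch (RuntimeException e) {
            connection = null; // drop it so a later request reconnects
            return fallback;   // degraded response instead of an error
        }
    }
}
```

This sketch attempts a reconnect on every request while the server is down; a real client would also rate-limit those attempts and close the dropped connection.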

I've created a blog post here. I understand it's preferable to copy-paste rather than linking to an external URL, but the content is too big for an SO answer.

Abhijit Sarkar

Start a separate thread that calls ping every 10 or 20 seconds. Once CB is down, ping will start failing. Add a check such as: "if ping fails 5-6 times in a row, close all the CB connections/resources".
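A minimal sketch of that idea, assuming a hypothetical `ping` check and a hypothetical `onGiveUp` callback that closes the connections (neither is a Couchbase API), could look like this:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class PingMonitor {
    private final BooleanSupplier ping;       // hypothetical health check
    private final int maxConsecutiveFailures;
    private final Runnable onGiveUp;          // hypothetical: close connections
    private int failures;
    private boolean gaveUp;

    PingMonitor(BooleanSupplier ping, int maxConsecutiveFailures, Runnable onGiveUp) {
        this.ping = ping;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.onGiveUp = onGiveUp;
    }

    /** Runs one check; returns true once the monitor has given up. */
    synchronized boolean check() {
        if (gaveUp) {
            return true;
        }
        boolean ok;
        try {
            ok = ping.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // a throwing ping counts as a failure
        }
        failures = ok ? 0 : failures + 1; // any success resets the streak
        if (failures >= maxConsecutiveFailures) {
            gaveUp = true;
            onGiveUp.run();
        }
        return gaveUp;
    }

    /** Schedules check() with a fixed delay and stops after giving up. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(() -> {
            if (check()) {
                ses.shutdown();
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

Whether closing the resources actually stops the client's own reconnect logging depends on the client and is not shown by this sketch.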

Vivek Rai
  • wow, an answer after 5.5 years! how does creating a busy loop help in preventing CB client from spamming the logs? The problem is much deeper than that, I recommend you read the blog post I linked in my answer. – Abhijit Sarkar Jan 28 '23 at 05:11