
I'm running a Cloud SQL proxy sidecar alongside my Node.js API service.

It appears to work great, except that approximately 1% of my API requests come back with an error indicating that the DB connection failed with:

`connect ECONNREFUSED 127.0.0.1:3306`

My backend logs show that this was thrown from my ORM when it attempted to connect to the DB.

Sidecar logs show nothing, and the CloudSQL instance in question shows nothing out of the ordinary (17/4000 connections, <1% CPU usage, 1.5/3.5GiB memory usage, <100KiB ingress/egress per time slice on 6 hour window).

What might be causing this?

Edit: additional information:

All my pods have been up for many hours with 0 restarts, so the intermittent failure isn't a transient startup failure.

Logs show that this has been occurring intermittently for the past 30 days.

  • Is it possible this is happening at startup and the NodeJS pod is starting up so fast and trying to connect before the cloudsql proxy pod has completely started? – neildo Aug 06 '19 at 00:27
  • No; I've added information to the question regarding this. –  Aug 06 '19 at 00:42
  • Has this been happening consistently, or did it start recently? Any number of things could cause some connections to be dropped or refused, and a small rate of such failures is to be expected. For perspective, the Service Level Agreement states that the instance would be considered “down” if the error rate reached 20%. Given how far the current error rate is from that, I see no reason to worry at the moment. – JKleinne Aug 06 '19 at 03:06
  • We have around 20 apps (mostly Java) using Cloud SQL proxy sidecars and we don’t see that issue. Are you using a database connection pool? I’ve seen connection issues caused by connection pools and idle connections being closed, but that usually manifests as a connection reset error. – neildo Aug 07 '19 at 01:32

1 Answer


Here are a few reasons that can cause a Cloud SQL instance to become inaccessible:

1) Connection failure between your instance and the agents Cloud SQL uses to monitor the health of your instance
2) Synchronization of operations between your instance and the Cloud SQL service
3) Underprovisioning of resources, such as CPU cores, RAM, and/or storage, to your Cloud SQL instance (see Cloud SQL's Operational Guidelines [1] for additional information).

Since connections can be dropped for several reasons (many of which depend on the specifics of your project's implementation and environment), diagnosing abnormal connection rejections is difficult. Additionally, Cloud SQL continuously monitors for any issues that can make an instance inaccessible and automatically takes action to resolve them.

Under normal circumstances, the error rate will not go away entirely, but it should stay at a very low level [2]. There are, of course, some conditions that can make it worse, including production issues and certain combinations of operations.

In any case, the recommendation under such circumstances is to implement a retry strategy with exponential backoff for reconnecting to the instance. Some client libraries already have supporting code in place, but that depends on exactly what you're using.
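A minimal sketch of such a retry wrapper in Node.js, assuming the failure surfaces as an error whose `code` property is `ECONNREFUSED` (as in the message quoted in the question). The names `withRetry` and `backoffDelayMs`, the delay values, and the `connectToDb` call in the usage comment are all illustrative and not part of any client library. If your ORM wraps the underlying socket error, the `isRetryable` check would need to be adapted.

```javascript
// Delay before retry number `attempt` (0-based): exponential growth capped at
// maxDelayMs, scaled by a random factor ("full jitter") so that many clients
// do not retry in lockstep. `random` is injectable to keep the function testable.
function backoffDelayMs(attempt, baseDelayMs, maxDelayMs, random = Math.random) {
  const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return random() * cap;
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumption: a refused connection is reported as err.code === 'ECONNREFUSED'.
const isRetryable = (err) => Boolean(err) && err.code === 'ECONNREFUSED';

// Runs `operation` (an async function) and retries it on retryable errors,
// up to `retries` additional attempts. Any other error is rethrown immediately.
async function withRetry(operation, options = {}) {
  const {
    retries = 5,
    baseDelayMs = 100,
    maxDelayMs = 2000,
    sleep = defaultSleep,
  } = options;

  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      if (!isRetryable(err) || attempt >= retries) {
        throw err;
      }
      await sleep(backoffDelayMs(attempt, baseDelayMs, maxDelayMs));
    }
  }
}

// Usage (connectToDb is a hypothetical stand-in for your ORM's connect call):
//   const connection = await withRetry(() => connectToDb());
```

Retrying only on `ECONNREFUSED` keeps genuine errors, such as bad credentials or malformed queries, from being masked by the retry loop.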

[1] https://cloud.google.com/sql/docs/mysql/operational-guidelines
[2] https://cloud.google.com/sql/sla

JKleinne
  • Not relevant to this particular issue. The connection to Cloud SQL isn't the problem. They are using a Cloud SQL proxy, which runs as a sidecar in Kubernetes and proxies connections from the application container to Cloud SQL. The connection from the proxy to Cloud SQL is fine, but the connection from the application to the proxy has issues. – neildo Aug 14 '19 at 16:17
  • `My backend logs show that this was thrown from my ORM when it attempted to connect to the DB.` Is this the case? @bdares Could you please clarify? – JKleinne Aug 15 '19 at 01:37