
We are building an application that runs heavy backend tasks (task queues), and in each task we are doing I/O against Google Cloud SQL.

GAE has a limit of 12 concurrent connections (I'm not sure whether this is the issue; I saw it at https://stackoverflow.com/a/26155819/687692):

"Each App Engine instance running in a Standard environment or using Standard-compatible APIs cannot have more than 12 concurrent connections to a Google Cloud SQL instance." - https://cloud.google.com/sql/faq

Most of my backend tasks (100-500 tasks per second) are failing because of this issue.

[Screenshot: failing tasks]

I also checked the active connections for the last 4 days: I don't see the connection count going above 12 at any point.

[Screenshot: last 4 days - GCS]

So, what approach should I take to fix this? Connection pooling (how do I do that on GAE with Google Cloud SQL)? Or some other fix?

Any help or guidance is very much appreciated. Let me know if anyone needs more information.

Thanks,

Love Sharma
  • The error you mention does usually indicate that the 12 connection limit has been reached. Note that the limit is 12 concurrent connections per /App Engine instance/, not for the whole application. Do you know how many task queue requests each App Engine instance in your application is handling? Would setting a max_concurrent_requests limit work for you, as per the linked post? (Note that setting a lower max_concurrent_requests value may cause more App Engine instances to be spun up to handle requests.) – Vadim Mar 06 '16 at 21:11
  • max_concurrent_requests is for front-end instances. My task queue generally runs 100-500 tasks per second. (I hope these are not huge numbers for big projects.) – Love Sharma Mar 06 '16 at 23:28
  • max_concurrent_requests applies to any module using automatic scaling. Can you describe your setup a bit more? Are you using backends (https://cloud.google.com/appengine/docs/python/backends/) or a separate module for your task handlers? How many tasks is each instance handling per second (not overall number of tasks)? – Vadim Mar 07 '16 at 07:45
  • I am using the Task Queue API (https://cloud.google.com/appengine/articles/deferred), handled by multiple queues - I have around 10 queues for different types of tasks, each handled differently. Each queue has a limit of 500/s (bucket 500) - I guess this is the max. One example:

    ```
    - name: reminder
      rate: 500/s
      bucket_size: 500
      retry_parameters:
        task_retry_limit: 2
    ```

    I have a cron job which adds tasks to these queues - almost 500 tasks. – Love Sharma Mar 07 '16 at 11:21
  • What kind of scaling are you using for the module that is handling the tasks? I wrote a small [load test app](https://github.com/laixer/cloudsql-appengine-python-taskload) that schedules 10,000 tasks that perform some database work (and hold the connection open for a short time to increase contention). The test is able to finish with 0 errors. I'm interested to know if there's anything different in your setup. – Vadim Mar 08 '16 at 04:24
  • Wow, thanks for doing the POC, much appreciated. I am doing the same thing. Let me go through your solution and apply it to my entire project. I guess closing the connection is the issue; I will update the status ASAP. Thanks a lot for helping me. – Love Sharma Mar 09 '16 at 00:20
  • And maybe it's relevant that I have 10 queues, all running concurrently, and each queue also runs concurrent tasks. Not sure whether that gives you the required info; I thought I'd share my implementation. – Love Sharma Mar 09 '16 at 00:22
  • I believe the concurrency is limited per instance so the number of queues shouldn't matter, but I'll update the test app so I can use it as a reference. – Vadim Mar 09 '16 at 01:24
  • I updated the app to schedule concurrently on 10 queues. I don't see any connection issues. – Vadim Mar 09 '16 at 09:08
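For reference, the max_concurrent_requests setting discussed in these comments goes in the module's configuration file. A minimal sketch (the module name and the value of 8 are assumptions, not taken from the question):

```
# app.yaml for the module handling task queue requests (illustrative)
module: taskworker
runtime: python27
api_version: 1
threadsafe: true

automatic_scaling:
  max_concurrent_requests: 8  # keep per-instance concurrency below the 12-connection Cloud SQL limit
```

Capping per-instance concurrency trades throughput per instance for more instances, as Vadim notes in the first comment.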

1 Answer


It's not likely that you would exceed the 12 connection limit with the standard Python App Engine scaling settings if the connections are handled properly.

To demonstrate, I have created a small application that schedules many tasks, with each task acquiring a database connection and doing some work. I am able to run this test without hitting connection issues.

One thing worth checking is that you are not leaking any connections (i.e., failing to close the connection in some code paths, or when exceptions happen).

For MySQLdb, you can guarantee you are not leaking connections by using closing from contextlib:

from contextlib import closing

import MySQLdb

def getDbConnection():
    return MySQLdb.connect(unix_socket='/cloudsql/instance_name',
                           db='db', user='user', charset='utf8')

with closing(getDbConnection()) as db:
    # do stuff; the connection is guaranteed to be closed when the
    # block exits, even if an exception is raised inside it
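To see why this guards against leaks, here is a minimal, self-contained sketch (using a stub connection object in place of a real MySQLdb connection, so it runs anywhere) showing that closing still closes the connection when the block raises:

```python
from contextlib import closing

class StubConnection:
    """Stand-in for a MySQLdb connection; records whether close() was called."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

conn = StubConnection()
try:
    with closing(conn) as db:
        raise RuntimeError("simulated query failure")
except RuntimeError:
    pass

print(conn.closed)  # True: closed despite the exception
```

Without the with block, the exception would skip any close() call placed after the query, and the connection would stay open until garbage collection.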
Vadim
  • It seems like this fixed my issue, but now instead I get a new error: **OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 38")**. Is this related? – Love Sharma Mar 14 '16 at 05:54
  • How often does it occur? – Vadim Mar 16 '16 at 06:35
  • Once to three times a day. – Love Sharma Mar 16 '16 at 14:05
  • Does it coincide with higher traffic? Is it possible you are hitting the overall concurrent limit? https://cloud.google.com/sql/pricing#v1-pricing – Vadim Mar 17 '16 at 02:44
  • Yes, maybe - that's why I have asked the Cloud SQL team to increase the concurrent limit. Any suggestion to improve it? Or should I contact them? – Love Sharma Mar 17 '16 at 05:31
  • For First Generation instances, the concurrent connection limit depends on the tier of your instance. The link in my previous comment lists the limits by tier. If you're only seeing this error a few times a day, it may not be worth upgrading to a higher tier. I can suggest a few alternative solutions: 1) Do nothing and let the task queue retry the failed tasks. Task queues have a retry mechanism because the real world is not perfect and things occasionally fail. 2) Catch the error connecting to the database, sleep for a short time, and retry. It's likely the second attempt will succeed. – Vadim Mar 17 '16 at 05:54
  • 3) Use basic scaling for the module that is handling the tasks from the queue and apply an appropriate max_instances setting to control the maximum concurrency. If you're aiming for the best throughput (2) is probably the best option. – Vadim Mar 17 '16 at 05:56
  • I have added a retry mechanism (using decorators) for the execute methods. It seems that fixed it. I will monitor it and update you. Thanks a lot for your time, very much appreciated. – Love Sharma Mar 17 '16 at 13:00
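A retry decorator along the lines the last comment describes might look like the following. This is a hypothetical sketch: the names, attempt count, and delay are assumptions, and a generic ConnectionError stands in for MySQLdb's OperationalError so the snippet runs without a database:

```python
import functools
import time

def retry(exceptions, attempts=3, delay=0.1):
    """Retry the wrapped function on the given exceptions, sleeping between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts - 1:
                        raise  # out of attempts: re-raise the last error
                    time.sleep(delay)
        return wrapper
    return decorator

# Demo: a function that fails twice before succeeding,
# standing in for a flaky database call.
calls = {"n": 0}

@retry(ConnectionError, attempts=3, delay=0.01)
def flaky_execute():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Lost connection to MySQL server")
    return "ok"

print(flaky_execute())  # ok
```

In a real handler you would pass MySQLdb.OperationalError as the exception class and wrap the execute methods, as the commenter did.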