
I have been experiencing a strange problem for the last two weeks. I have a system running on GAE Python (server side) with 100 restaurants and 1,000 users that had been working without problems. Suddenly, starting two weeks ago, every day at rush hour the tasks in the task queue began to suffer a long delay before starting: two weeks ago it was only 1 or 2 seconds, now it is 15 to 60 seconds, which hurts the user experience and usability. I have had to modify all the code that uses task queues, replacing them with asynchronous urlfetch requests that do not wait for the RPC (I am testing this with some customers, with success).

The worst part is that adding tasks to the queues causes a deadline exceeded 123 error at rush hour (more than 100 QPS), and I lose between 50 and 1,000 requests every day (out of 300,000 a day that complete without problems). The tasks and my processes are very fast: they last only from 50 ms to 3 seconds, not more. Yet I see a lot of them sitting for 60000 ms and more in "LIMBO", never executed and cancelled without even starting (I have a logging.debug message at the very beginning of every task/process, and it never gets executed).

I have 2 idle instances and all the settings needed to add instances without restriction when pending latency is more than 500 ms. The start time of my instances is only 1 second; there are no special processes at boot. I have 6 modules, with separate modules for the tasks, and the problem affects the module that calls task.add to add the task to the bucket (not the module that executes the task).

I made all the changes proposed in this forum and in the Google documentation to avoid datastore contention, I deactivated the logs, I use memcache heavily, and I changed the F1 instances to F2, and this error continues. And it APPEARED TWO WEEKS AGO. My app has been running for a year and a half without problems, and suddenly this problem appeared.

Has anyone experienced the same problem, and if so, do you have a recommendation? Please note that my code was working fine for a year and this problem only started two weeks ago. The number of users is growing, but not by much: two weeks ago there were 850 users and now there are 1,000, so I do not think it is a problem of scale. My processes are very efficient and quick. I have been programming in GAE Python for 3 years and have 30 years of experience in IT; to me this is very strange and could be related to platform changes.

This is my standard module.yaml config:

runtime: python27
api_version: 1
instance_class: F2
threadsafe: true

automatic_scaling:
  min_idle_instances: 2
  max_idle_instances: automatic
  min_pending_latency: 10ms
  max_pending_latency: 500ms
  max_concurrent_requests: 20

This is the task queue config (I have 10 queues, each serving 10 restaurants):

- name: TaskRegOr00  
  rate: 10/s  
  bucket_size: 100
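
As a rough sanity check on these queue settings, the sketch below models a single queue as a simple token bucket using the `rate` and `bucket_size` values shown above. It is only an illustration: the function name `simulate_queue`, the one-second time step, and the assumption that tasks are spread evenly across the 10 queues are all hypothetical, and the real App Engine scheduler is more involved than this. Under those assumptions, 10 queues at 10/s give a sustained dispatch rate of about 100 tasks per second in total, so if tasks are added faster than that, a backlog would build up once the buckets drain.

```python
def simulate_queue(arrivals_per_sec, rate=10.0, bucket_size=100.0, seconds=60):
    """Model one push queue as a token bucket, advancing in 1-second steps.

    Hypothetical simplification: one token lets one task start, the bucket
    refills at `rate` tokens per second and holds at most `bucket_size`.
    Returns the number of tasks still waiting at the end of each second.
    """
    tokens = bucket_size  # assume the bucket starts full
    backlog = 0.0
    history = []
    for _ in range(seconds):
        tokens = min(bucket_size, tokens + rate)  # refill, capped at bucket size
        backlog += arrivals_per_sec               # newly added tasks
        dispatched = min(backlog, tokens)         # start as many as tokens allow
        tokens -= dispatched
        backlog -= dispatched
        history.append(backlog)
    return history


# Assumed load: 120 tasks/s spread evenly over 10 queues = 12 tasks/s per queue.
over_limit = simulate_queue(arrivals_per_sec=12)
# Assumed load: 80 tasks/s over 10 queues = 8 tasks/s per queue.
under_limit = simulate_queue(arrivals_per_sec=8)

print(over_limit[-1])    # 30.0 tasks waiting after 60 s in this model
print(max(under_limit))  # 0.0, the queue keeps up
```

In this model the full bucket absorbs the excess for roughly 45 seconds, after which the backlog grows by 2 tasks per second per queue. That alone would not account for errors when adding tasks, so it only helps rule the queue rates in or out as a contributing factor.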
Rene Marty
  • I'm not the only one with this problem... check this: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/google-appengine/K4X4m4ZmfWA/-tqna9Y0CgAJ – Rene Marty Dec 02 '15 at 21:01

1 Answer


This seems to be a temporary issue with GAE; see this incident status: https://status.cloud.google.com/incident/appengine/15024?_ga=1.267668750.1284093861.1444800865

Filip Nilsson