
After a few days of using an App Engine app to stream data from Unity to Pub/Sub, I've encountered a huge number of 502 errors, which increased the number of created instances and, because of that, pushed the app over its daily spending limit.

Instance increase

Python script I'm using: https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/appengine/standard_python37/pubsub

All the errors are HTTP POST requests with latency greater than 30 s, and the log message is:

This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

I've been recommended to look into readiness checks and liveness checks, but those are only available in the Flexible environment, not in Standard. What could cause these errors and how can I avoid them?

Log example:

    {
      httpRequest: {
        status: 502
      }
      insertId: "5e2bf548000bc56fe5d27eec"
      labels: {
        clone_id: "00c61b117c17d8ad1566cc1788d5763552dbe5ace0c77a3523e4047bdc431b306a3840a5168f"
      }
      logName: "projects/x/logs/appengine.googleapis.com%2Frequest_log"
      operation: {
        first: true
        id: "5e2bf52700ff069deff4c6f3400001737e626f6f74796661726d6c697665000133000100"
        last: true
        producer: "appengine.googleapis.com/request_id"
      }
      protoPayload: {
        @type: "type.googleapis.com/google.appengine.logging.v1.RequestLog"
        appEngineRelease: "1.9.71"
        appId: "s~x"
        cost: 3.2186e-8
        endTime: "2020-01-25T07:59:04.771274Z"
        finished: true
        first: true
        host: "x.appspot.com"
        httpVersion: "HTTP/1.1"
        instanceId: "00c61b117c17d8ad1566cc1788d5763552dbe5ace0c77a3523e4047bdc431b306a3840a5168f"
        instanceIndex: -1
        ip: "x"
        latency: "33.337627s"
        line: [
          0: {
            logMessage: "This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application."
            severity: "INFO"
            time: "2020-01-25T07:59:04.770902Z"
          }
        ]
        megaCycles: "4387"
        method: "POST"
        pendingTime: "0.005315345s"
        requestId: "5e2bf52700ff069deff4c6f3400001737e626f6f74796661726d6c697665000133000100"
        resource: "/"
        responseSize: "288"
        startTime: "2020-01-25T07:58:31.433647Z"
        status: 502
        traceId: "2ee7b458fad62565bd5ed150e67647fc"
        traceSampled: true
        urlMapEntry: "auto"
        userAgent: "Dalvik/2.1.0 (Linux; U; Android 5.1.1; SM-T360 Build/LMY47X) x SDK"
        versionId: "3"
        wasLoadingRequest: true
      }
      receiveTimestamp: "2020-01-25T07:59:04.772448343Z"
      resource: {
        labels: {
          module_id: "default"
          project_id: "x"
          version_id: "3"
          zone: "us12"
        }
        type: "gae_app"
      }
      severity: "INFO"
      timestamp: "2020-01-25T07:58:31.433647Z"
      trace: "projects/x/traces/2ee7b458fad62565bd5ed150e67647fc"
      traceSampled: true
    }

App.yaml:

    runtime: python37
    entrypoint: gunicorn -b :$PORT main:app

    #[START env]
    env_variables:
        PUBSUB_TOPIC: xx
        # This token is used to verify that requests originate from your
        # application. It can be any sufficiently random string.
        PUBSUB_VERIFICATION_TOKEN: xx
    #[END env]
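Since the failing requests are all loading requests (`wasLoadingRequest: true`), one mitigation is to keep an instance resident and enable warmup requests so new instances initialize before serving user traffic. A sketch of the extra `app.yaml` settings this would need (the `min_instances`/`max_instances` values here are illustrative, not recommendations):

```
inbound_services:
- warmup

automatic_scaling:
  min_instances: 1    # keep one instance warm at all times
  max_instances: 10   # cap scaling so billing stays under the spending limit
```

With `inbound_services: warmup` set, App Engine Standard sends a `GET /_ah/warmup` request to each new instance before routing real traffic to it.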

Requirements.txt:

    Flask==1.0.2
    google-api-python-client==1.7.8
    google-auth==1.6.3
    google-cloud-pubsub==0.40.0
    gunicorn==19.7.1

Note: After a few days the increased number of instances is no longer visible on the App Engine dashboard, even though the errors are still visible in the logs.

Hurikan
  • Do you see gunicorn timeouts in your logs as well? I think it might be that when a new instance starts, it takes too long and the gunicorn worker times out, leading to a 502 error. You can try to increase the [default 30-second timeout](http://docs.gunicorn.org/en/latest/settings.html#timeout) with `entrypoint: gunicorn --timeout 90 -b :$PORT main:app` and see if this helps. – Emil Gi Jan 29 '20 at 10:56
  • I can't see any gunicorn timeouts in the logs. After the few logs with the 502 error I posted, there are only 500 errors with the message `Request was aborted after waiting too long to attempt to service your request.` Thank you for your suggestion. I'll test it out. It might take a few days, since this error only started showing after a few days of the app being in production. – Hurikan Jan 29 '20 at 14:02
  • After additional research I don't think this has to do with gunicorn timeouts. There are no matching log entries, and the code is too simple to take 30 s anyway. I would recommend opening a ticket with support for more visibility if the issue persists. – Emil Gi Jan 29 '20 at 16:49
  • My plan is to try changing the timeout anyway, since all the error logs show a latency at least slightly over 30 s. I'll also try setting `min_instances` to 1 so that at least one instance is active at all times, and add a warmup for new instances so that they are ready to accept data. I'll test it next week so that I can make the changes when necessary. I'll let you know if anything helped. – Hurikan Jan 30 '20 at 07:55
  • @EmilGi Just to report on my attempts to solve this. 1. I've tried raising the gunicorn timeout. As you said, that didn't solve the problem. At the same time I set `max_instances` so that the instance count doesn't grow to abnormal values. 2. I've also tried setting `warmup` along with `min_idle_instances` and `min_instances`. In addition, I raised `instance_class` from the default (F1) to F2 because of some 500 errors (5-10 per hour) for using too much memory. Nothing really helped to solve this. Thank you for your help anyway. – Hurikan Feb 07 '20 at 09:42
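For reference, the warmup approach mentioned in the comments needs a handler in the Flask app itself. A minimal sketch (assuming the app is the Flask app from the linked sample; App Engine Standard calls `/_ah/warmup` on new instances when `inbound_services: warmup` is enabled in `app.yaml`):

```python
from flask import Flask

app = Flask(__name__)

@app.route('/_ah/warmup')
def warmup():
    # App Engine hits this route on a freshly started instance, before
    # routing user traffic to it. Do any expensive initialization here
    # (e.g. creating the Pub/Sub publisher client) so real requests
    # don't pay the cold-start cost.
    return '', 200
```

This doesn't remove loading requests entirely, but it moves the slow initialization off the user-facing request path.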

0 Answers