
Below you can see the setup that I currently have. A Django app creates a set of requests as Celery tasks, and load is balanced across the gRPC server pods using Istio. A Python script processes each request and returns the result. Everything runs on AWS EKS, and HPA and cluster autoscaling are also active.

[Architecture diagram: Django/Celery tasks → Istio load balancing → gRPC server pods on EKS]
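
The Celery-to-gRPC path looks roughly like the sketch below (the task name, channel address, and the generated my_service_pb2 / my_service_pb2_grpc modules are placeholders, not the real code):

    import grpc
    from celery import shared_task

    # Placeholder names: my_service_pb2 / my_service_pb2_grpc stand in for the
    # modules generated from the actual .proto file.
    import my_service_pb2
    import my_service_pb2_grpc

    @shared_task
    def process_request(payload: bytes) -> str:
        # Each task opens a channel to the Istio-load-balanced gRPC Service and
        # blocks until the Python processing script returns a result.
        with grpc.insecure_channel("grpc-server.default.svc.cluster.local:50051") as channel:
            stub = my_service_pb2_grpc.ProcessorStub(channel)
            response = stub.Process(my_service_pb2.ProcessRequest(data=payload), timeout=300)
        return response.result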

The Python script is a CPU-intensive process, and depending on the request that Django sends, the CPU and memory usage of the script varies a lot. From visual inspection, each request can take anything between:

  • Best case (more common) -> ~100Mi memory, 100m CPU -> the Python script takes a few seconds to process

to

  • Worst case (less common) -> ~1000Mi memory, 10,000m CPU (10 cores) -> the Python script takes up to 3-4 minutes to process

Here are the current resources used for the gRPC server, which runs on a c5.2xlarge instance:

resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 14
    memory: 16384Mi

Also, the gRPC server uses a ThreadPoolExecutor with max_workers=16, which means it can serve up to 16 requests concurrently.
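
For reference, the server's worker pool is set up roughly like this (the servicer registration and port are placeholders; only max_workers=16 is the real value):

    from concurrent import futures
    import grpc

    def serve():
        # max_workers=16 caps how many requests a single pod handles concurrently.
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=16))
        # add_ProcessorServicer_to_server(ProcessorServicer(), server)  # placeholder servicer
        server.add_insecure_port("[::]:50051")
        server.start()
        server.wait_for_termination()

    if __name__ == "__main__":
        serve()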

The issue is that I'm trying to use the least amount of resources while at the same time making sure no request takes more than X minutes/seconds.

Scenarios that I can think of:

  1. Using the same resources as defined above and setting max_workers=1. This way I'm sure that each pod only processes one request at a time, and I can more or less guarantee how long the worst case takes to process. However, it would be super expensive and probably not very scalable.
  2. Using the same resources as defined above but setting max_workers=16 or higher. In this case, even though each pod takes up a lot of memory and CPU, at least each gRPC server can handle multiple requests at the same time. The issue is: what if a few of the worst-case requests hit the same pod? Then it would take a long time to process those requests.
  3. Setting max_workers=1 and changing the resources to something like the block below. This way each pod still processes only one request at a time and requests only the minimum resources, but it can burst up to the limit for the rare worst cases. I guess it's not good practice for limits and requests to be that different, though.
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 100m
    memory: 100Mi

I'd be grateful if you could take a look at the scenarios above. Any and all thoughts are highly appreciated.

Thanks

James