
I am running predictions on an AWS GPU instance, g4dn.4xlarge (16 GB GPU memory, 64 GB CPU memory), deployed with Kubernetes and Docker.

Tested with (CUDA 10.1 + onnxruntime-gpu==1.4.0) and (CUDA 10.2 + onnxruntime-gpu==1.6.0); same error in both cases.
The models are customised for our purposes, so I can't share the weights.
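
For context, the prediction path in each pod looks roughly like the sketch below. This is a minimal sketch only: the model path and input handling are placeholders (I can't share the real models), and the providers argument may differ slightly between onnxruntime versions.

    import numpy as np
    import onnxruntime as ort

    # One session per pod/process; this is where the GPU memory gets blocked.
    session = ort.InferenceSession(
        "model.onnx",                                   # placeholder path
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )

    def predict(batch: np.ndarray) -> np.ndarray:
        input_name = session.get_inputs()[0].name
        return session.run(None, {input_name: batch})[0]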

The problem:

I am getting a CUDA OOM (out of memory) error:

Error: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'Conv_16' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:298 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 33554432

After some backtracking:

Using nvidia-smi and GPU memory profiling, I found that from the first prediction onward, each model keeps a constant amount of GPU memory blocked: at minimum ~1.8 GB, and ~3 GB for some models (I assume it is held per process). Releasing the memory doesn't make sense, because the next prediction will block the same amount again.
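
One thing I have been experimenting with (not sure it is the right fix) is capping the BFC arena that ONNX Runtime grows per process. Newer onnxruntime-gpu releases (roughly 1.8+, i.e. newer than the versions I tested above) accept CUDA provider options such as gpu_mem_limit and arena_extend_strategy. A sketch, with an assumed 2 GB cap:

    import onnxruntime as ort

    # Cap the CUDA BFC arena for this process at ~2 GB and grow it only by
    # what each request actually needs, instead of doubling.
    cuda_options = {
        "device_id": 0,
        "gpu_mem_limit": 2 * 1024 * 1024 * 1024,    # assumed 2 GB cap
        "arena_extend_strategy": "kSameAsRequested",
    }

    session = ort.InferenceSession(
        "model.onnx",                                # placeholder path
        providers=[("CUDAExecutionProvider", cuda_options),
                   "CPUExecutionProvider"],
    )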

My understanding:

At peak we scale up to 22 pods, and every pod initializes its own model load, so every pod blocks 1.8-3 GB of GPU memory while pointing at the same single GPU instance with 16 GB of GPU memory. With 22 pods, OOM is therefore expected.
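
Back-of-the-envelope arithmetic behind that expectation, using the ~1.8-3 GB per-pod figures measured above:

    gpu_memory_gb = 16          # g4dn.4xlarge, single GPU
    per_pod_gb = (1.8, 3.0)     # measured blocked memory per model load

    for gb in per_pod_gb:
        max_pods = int(gpu_memory_gb // gb)
        print(f"{gb} GB per pod -> at most {max_pods} pods before OOM")
    # 1.8 GB per pod -> at most 8 pods before OOM
    # 3.0 GB per pod -> at most 5 pods before OOM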

What is confusing:

The CUDA message above reports OOM, but GPU profiling shows memory utilisation never exceeding 50%, although SM (streaming multiprocessor) utilisation hits 100% at peak (when pods are scaled to 22). Image attached for reference. From my research I understand that SM utilisation has nothing to do with OOM and that CUDA schedules the SMs efficiently. So why am I getting a CUDA OOM error if only 50% of the memory is utilised?
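
To double-check that 50% figure, I have been logging total and per-process GPU memory with pynvml instead of eyeballing nvidia-smi, since a sampled utilisation graph can miss short spikes at the moment an allocation fails. A rough sketch (assumes the pynvml / nvidia-ml-py package is installed):

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    try:
        while True:
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"total used: {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB")
            # One entry per ORT worker process holding an arena on this GPU.
            for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                if p.usedGpuMemory is not None:
                    print(f"  pid {p.pid}: {p.usedGpuMemory / 2**20:.0f} MiB")
            time.sleep(1)
    finally:
        pynvml.nvmlShutdown()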

Ruled out:

I have ruled out a memory leak in the model, as it runs without OOM errors when the load is low.

Why GPU and not CPU for prediction:

We want faster predictions. The same workload runs on CPU without any error, even under high load.

What I am looking for:

  • A way to scale AWS GPU instances based on GPU memory. If OOM is the cause, scaling on GPU memory should solve the problem, but I can't find such an option.
  • Help understanding the CUDA message: why OOM when memory is still available?
  • Very hypothetically: is there a way, by design or via Kubernetes, to create a singleton object for a particular model load, so that scaled-up pods reuse that loaded model for prediction rather than each creating a new server (a rough sketch of the idea follows below)? But that would defeat the point of using Kubernetes for availability and scalability.
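
For that third bullet, the rough idea would be: one process owns the InferenceSession, and scaled-up pods call it over HTTP instead of each loading their own copy. A hypothetical sketch only; the model path, port, and JSON payload format are all made up:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    import numpy as np
    import onnxruntime as ort

    # Loaded exactly once, when this single "model server" pod starts.
    SESSION = ort.InferenceSession("model.onnx",
                                   providers=["CUDAExecutionProvider"])
    INPUT_NAME = SESSION.get_inputs()[0].name

    class PredictHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Hypothetical payload: {"input": [[...], ...]} as JSON.
            body = self.rfile.read(int(self.headers["Content-Length"]))
            batch = np.asarray(json.loads(body)["input"], dtype=np.float32)
            output = SESSION.run(None, {INPUT_NAME: batch})[0]

            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps({"output": output.tolist()}).encode())

    if __name__ == "__main__":
        # Scaled-up "worker" pods would POST here instead of loading the model.
        HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()

I realise this just concentrates the load in one process and gives up the per-pod isolation that Kubernetes provides, which is exactly the trade-off I mention in the bullet above.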

[Attached image: GPU profiling showing memory utilisation below 50% while SM utilisation peaks at 100%]
