
I am following the TensorFlow Object Detection API tutorial to train a Faster R-CNN model on my own dataset on Google Cloud, but the following out-of-memory error keeps occurring:

The replica master 0 ran out-of-memory and exited with a non-zero status of 247.

And according to the logs, a non-zero exit status of -9 was returned. As described in the official documentation, an exit code of -9 means the training job may be using more memory than it was allocated.

However, the memory utilization is lower than 0.2. So why am I having a memory problem? If it helps, the memory utilization graph is here.

MrAlias

3 Answers


The memory utilization graph is an average across all workers. In the case of an out of memory error, it's also not guaranteed that the final data points are successfully exported (e.g., a huge sudden spike in memory). We are taking steps to make the memory utilization graphs more useful.

If you are using the Master to also do evaluation (as exemplified in most of the samples), then the Master uses ~2x the RAM relative to a normal worker. You might consider using the large_model machine type.
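As a sketch, a minimal config.yaml requesting the large_model machine type for the master could look like the following (the runtimeVersion is an assumption; it should match the version your job actually uses):

 trainingInput:
   scaleTier: CUSTOM
   # large_model: a machine with a large amount of memory,
   # suited to jobs whose master needs extra RAM
   masterType: large_model
   runtimeVersion: "1.4"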

rhaertel80
  • I used the large_model machine type after getting the same error, but it failed after the same number of training steps as the STANDARD_1 scale tier. Do you know what might be the fix to that? – tzharg Jan 07 '18 at 13:12
  • @rhaertel80, Does increasing number of workers solve this problem, or we have to use the larger model machine type as you said? – Khanh Le Jan 11 '18 at 16:21

Looking at your error, it seems that your ML code is consuming more memory than it was originally allocated.

Try a machine type that gives you more memory, such as "large_model" or "complex_model_l". Use a config.yaml to define it as follows:

 trainingInput:
   scaleTier: CUSTOM
   # 'large_model' for a big model with lots of data
   masterType: large_model
   runtimeVersion: "1.4"
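Assuming that file is saved as config.yaml, you would pass it when submitting the training job. The job name, paths, bucket, and region below are placeholders for illustration; the module and flag names follow the Object Detection API's train.py from the tutorial:

 gcloud ml-engine jobs submit training my_job \
     --config config.yaml \
     --module-name object_detection.train \
     --package-path object_detection \
     --region us-central1 \
     -- \
     --train_dir=gs://my-bucket/train \
     --pipeline_config_path=gs://my-bucket/faster_rcnn.config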

There is a similar question, Google Cloud machine learning out of memory. Please refer to that link for the actual solution.

Hafizur Rahman

The running_pets tutorial uses the BASIC_GPU tier, so the GPU may have run out of memory. The graphs on ML Engine currently only show CPU memory utilization.

If this is the case, changing your tier to one with larger or more GPUs should solve the problem. Here is some information about the different tiers. On the same page, you will find an example YAML file showing how to configure it.
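For instance, a custom tier with a multi-GPU machine could be sketched like this (the machine type name comes from the ML Engine scale-tier documentation; the GPU count comment reflects the legacy machine types and may change):

 trainingInput:
   scaleTier: CUSTOM
   # complex_model_m_gpu: a machine with multiple K80 GPUs,
   # giving more GPU memory than the single GPU in BASIC_GPU
   masterType: complex_model_m_gpu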

Hafplo