
I have searched a lot here but unfortunately could not find an answer.

I am running TensorFlow 1.3 (installed via pip on macOS) on my local machine, and have created a model using the provided "ssd_mobilenet_v1_coco" checkpoints.

I managed to train locally and on the ML-Engine (Runtime 1.2), and successfully deployed my SavedModel to the ML-Engine.

Local predictions (command below) work fine and I get the model's results:

gcloud ml-engine local predict --model-dir=... --json-instances=request.json

Contents of request.json: {"inputs": [[[242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 23]]]}

However, when deploying the model and trying to run remote predictions on the ML-Engine with the command below:

gcloud ml-engine predict --model "testModel" --json-instances=request.json

(using the same request.json file as before)

I get this error:

{
  "error": "Prediction failed: Exception during model execution: AbortionError(code=StatusCode.INVALID_ARGUMENT, details=\"NodeDef mentions attr 'data_format' not in Op<name=DepthwiseConv2dNative; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE]; attr=strides:list(int); attr=padding:string,allowed=[\"SAME\", \"VALID\"]>; NodeDef: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)\n\t [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)]]\")"
}

I saw something similar here: https://github.com/tensorflow/models/issues/1581

It suggests the problem is with the "data_format" parameter. Unfortunately I could not use that solution, since I am already on TensorFlow 1.3.

It also seems that it might be a problem with MobilenetV1: https://github.com/tensorflow/models/issues/2153

Any ideas?

  • How did you train locally and successfully deploy your SavedModel to the ML-Engine? This seems to imply that you used TensorFlow 1.3 for training, and then version 1.2 for predictions. – George Sep 26 '17 at 19:33
  • Hi George! Thank you for the comment! I have indeed used TF 1.3 for training and maybe that is the case. But how can I be using 1.2 for predictions? Can I set that in the gcloud tool or in the web interface? – Victor Torres Sep 27 '17 at 08:06
  • You may use version 1.2 of TF locally, for model training purposes, in place of the current TF1.3. – George Sep 27 '17 at 19:40
  • Thanks again for the comments George! In the end my team and I have decided to use TensorFlow Serving on a dedicated server to serve the predictions. It is working well so far with the same models that were failing on the ML-Engine. But I hope someone with similar problems can find this thread and try your suggestion out. I was also rather disappointed at how difficult it was to get support from Google's side for this (through GCP) =( – Victor Torres Sep 29 '17 at 03:29

2 Answers


I had a similar issue. It is caused by a mismatch between the TensorFlow versions used for training and inference. I solved it by using TensorFlow 1.4 for both training and inference.
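
As a quick way to check for such a mismatch, you can print the TensorFlow version in your local training environment and, if needed, pin it to match the serving runtime. A minimal sketch, assuming a standard pip install (the 1.4.0 pin is only an example):

# Print the TensorFlow version in the current environment
python -c "import tensorflow as tf; print(tf.__version__)"

# Pin the environment to a release matching the serving runtime
pip install tensorflow==1.4.0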

Please refer to this answer.

– Vikas
  • Thank you very much! For the project I am working on we have decided not to use GCP-ML because of this, but I will definitely check it out. Since I believe your answer should fix this, I will mark it as solved. I'm glad the TF team worked this out =D – Victor Torres Nov 27 '17 at 08:21
  • Is this issue resolved in TensorFlow version 1.9? I tried to run a prediction in Cloud ML and still got the same error. – Madhi Aug 22 '18 at 10:52

If you're wondering how to ensure that your model version is running the TensorFlow version you need, first have a look at this model versions list page.

You need to know which ML version supports the TensorFlow version that you need (a quick way to inspect already-deployed versions is sketched right after this list). At the time of writing:

  • ML version 1.4 supports TensorFlow 1.4.0 and 1.4.1
  • ML version 1.2 supports TensorFlow 1.2.0
  • ML version 1.0 supports TensorFlow 1.0.1
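
If you already have versions deployed, you can check which runtime they are using before creating a new one. A quick sketch with the same placeholder names used below; the describe output should include a runtimeVersion field:

# List the versions deployed under a model
gcloud ml-engine versions list --model=<Name of the model>

# Show the details of one version, including its runtime
gcloud ml-engine versions describe <version name> --model=<Name of the model>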

Now that you know which runtime version you require, you need to create a new version of your model, like so:

gcloud ml-engine versions create <version name> \
--model=<Name of the model> \
--origin=<Model bucket link. It starts with gs://...> \
--runtime-version=1.4

In my case, I needed to predict using Tensorflow 1.4.1, so I used the runtime version 1.4.
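
Once the version is created, you can also send prediction requests to it explicitly, rather than relying on the model's default version. A minimal sketch reusing the placeholder names from above:

gcloud ml-engine predict --model=<Name of the model> \
--version=<version name> \
--json-instances=request.json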

Refer to this official MNIST tutorial page, as well as this ML Versioning Page.

– wcyn