
I have searched a lot here but unfortunately could not find an answer.

I am running TensorFlow 1.3 (installed via pip on macOS) on my local machine, and have created a model using the provided "ssd_mobilenet_v1_coco" checkpoints.

I managed to train locally and on the ML-Engine (Runtime 1.2), and successfully deployed my SavedModel to the ML-Engine.

Local predictions (command below) work fine and I get the model's results:

gcloud ml-engine local predict --model-dir=... --json-instances=request.json

Contents of request.json: {"inputs": [[[242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 23]]]}

However, when deploying the model and trying to run remote predictions on the ML-Engine with the command below:

gcloud ml-engine predict --model "testModel" --json-instances=request.json

(using the same request.json file as before)

I get this error:

{
  "error": "Prediction failed: Exception during model execution: AbortionError(code=StatusCode.INVALID_ARGUMENT, details=\"NodeDef mentions attr 'data_format' not in Op<name=DepthwiseConv2dNative; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE]; attr=strides:list(int); attr=padding:string,allowed=[\"SAME\", \"VALID\"]>; NodeDef: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)\n\t [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)]]\")"
}

I saw something similar here: https://github.com/tensorflow/models/issues/1581

It suggests the problem is with the "data_format" parameter. Unfortunately I could not use that solution, since I am already on TensorFlow 1.3.

It also seems that it might be a problem with MobilenetV1: https://github.com/tensorflow/models/issues/2153

Any ideas?

  • How did you train locally and successfully deploy your SavedModel to the ML-Engine? This seems to imply that you used TensorFlow 1.3 for training, and then version 1.2 for predictions. – George Sep 26 '17 at 19:33
  • Hi George! Thank you for the comment! I have indeed used TF 1.3 for training and maybe that is the case. But how can I be using 1.2 for predictions? Can I set that in the gcloud tool or in the web interface? – Victor Torres Sep 27 '17 at 08:06
  • You may use version 1.2 of TF locally, for model training purposes, in place of the current TF1.3. – George Sep 27 '17 at 19:40
  • Thanks again for the comments George! In the end my team and I have decided to use TensorFlow Serving on a dedicated server to serve the predictions. It is working well so far with the same models that were failing on the ML-Engine. But I hope someone with similar problems can find this thread and try your suggestion out. I was also rather disappointed at how difficult it was to get support from Google's side for this (through GCP) =( – Victor Torres Sep 29 '17 at 03:29

2 Answers


I had a similar issue. It is caused by a mismatch between the TensorFlow versions used for training and inference. I solved it by using TensorFlow 1.4 for both training and inference.
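
As a quick way to check for such a mismatch, you can print the TensorFlow version in your local training environment and, if needed, pin it to match the serving runtime. A minimal sketch, assuming a standard pip install (the 1.4.0 pin is only an example):

# Print the TensorFlow version in the current environment
python -c "import tensorflow as tf; print(tf.__version__)"

# Pin the environment to a release matching the serving runtime
pip install tensorflow==1.4.0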

Please refer to this answer.

– Vikas
  • Thank you very much! For the project I am working on we have decided not to use GCP-ML because of this, but I will definitely check it out. Since I believe your answer should fix this, I will mark it as solved. I'm glad the TF team worked this out =D – Victor Torres Nov 27 '17 at 08:21
  • Is this issue resolved in TensorFlow version 1.9? I tried to run a prediction in Cloud ML and still got the same error. – Madhi Aug 22 '18 at 10:52

If you're wondering how to ensure that your model version is running the TensorFlow version you need, first have a look at this model versions list page.

You need to know which ML version supports the TensorFlow version that you need (a quick way to inspect already-deployed versions is sketched right after this list). At the time of writing:

  • ML version 1.4 supports TensorFlow 1.4.0 and 1.4.1
  • ML version 1.2 supports TensorFlow 1.2.0
  • ML version 1.0 supports TensorFlow 1.0.1
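
If you already have versions deployed, you can check which runtime they are using before creating a new one. A quick sketch with the same placeholder names used below; the describe output should include a runtimeVersion field:

# List the versions deployed under a model
gcloud ml-engine versions list --model=<Name of the model>

# Show the details of one version, including its runtime
gcloud ml-engine versions describe <version name> --model=<Name of the model>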

Now that you know which runtime version you require, you need to create a new version of your model, like so:

gcloud ml-engine versions create <version name> \
--model=<Name of the model> \
--origin=<Model bucket link. It starts with gs://...> \
--runtime-version=1.4

In my case, I needed to predict using Tensorflow 1.4.1, so I used the runtime version 1.4.
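
Once the version is created, you can also send prediction requests to it explicitly, rather than relying on the model's default version. A minimal sketch reusing the placeholder names from above:

gcloud ml-engine predict --model=<Name of the model> \
--version=<version name> \
--json-instances=request.json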

Refer to this official MNIST tutorial page, as well as this ML Versioning Page.

– wcyn