I was able to solve this by setting a signature over the input IDs and the attention mask, as shown below. This is a simple implementation that uses a fixed input shape for the SavedModel and requires you to pad the inputs to the expected length of 384. I have seen implementations that build custom signatures and model wrappers to match arbitrary input shapes, but the simple case below was enough for what I wanted to accomplish: serving a Hugging Face model via TF Serving. If anyone has better examples, or ways to extend this, please post them for future reference.
# create a callable from the model's forward pass
import tensorflow as tf
from transformers import TFDistilBertForQuestionAnswering

distilbert = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
callable = tf.function(distilbert.call)
Calling get_concrete_function trace-compiles the model's TensorFlow operations for an input signature composed of two tensors of shape [None, 384]: the first is the input IDs and the second the attention mask.
concrete_function = callable.get_concrete_function([
    tf.TensorSpec([None, 384], tf.int32, name="input_ids"),
    tf.TensorSpec([None, 384], tf.int32, name="attention_mask")
])
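Before saving, you can inspect what was actually traced:
print(concrete_function.structured_input_signature)
print(concrete_function.structured_outputs)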
Save the model with that signature:
# stored model path for TF Serve (1 = version 1) --> '/path/to/my/model/distilbert_qa/1/'
distilbert_qa_save_path = 'path_to_model'
tf.saved_model.save(distilbert, distilbert_qa_save_path, signatures=concrete_function)
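Before pointing TF Serving at it, it's worth reloading the SavedModel and invoking the signature directly; the all-zeros inputs here are just a quick smoke test:
# reload the export and call the serving signature with dummy inputs
reloaded = tf.saved_model.load(distilbert_qa_save_path)
infer = reloaded.signatures["serving_default"]
dummy = tf.zeros([1, 384], dtype=tf.int32)
print(infer(input_ids=dummy, attention_mask=dummy))  # dict with output_0 / output_1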
Check that the SavedModel contains the correct signature:
saved_model_cli show --dir 'path_to_model' --tag_set serve --signature_def serving_default
The output should look like this:
The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 384)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 384)
      name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_0'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 384)
      name: StatefulPartitionedCall:0
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 384)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
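If you would rather see meaningful names here than output_0/output_1, one way to extend this is to wrap the model in a tf.function that returns a dict keyed by name. A rough sketch (named_serving_fn is my own name; it assumes a transformers version whose TF models return output objects with start_logits/end_logits), which would replace the earlier export:
@tf.function(input_signature=[
    tf.TensorSpec([None, 384], tf.int32, name="input_ids"),
    tf.TensorSpec([None, 384], tf.int32, name="attention_mask"),
])
def named_serving_fn(input_ids, attention_mask):
    output = distilbert(input_ids=input_ids, attention_mask=attention_mask)
    # returning a dict gives the signature readable output names
    return {"start_logits": output.start_logits, "end_logits": output.end_logits}

tf.saved_model.save(distilbert, distilbert_qa_save_path, signatures=named_serving_fn)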
TEST MODEL:
from transformers import DistilBertTokenizer

# use the tokenizer that matches the fine-tuned checkpoint
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
question, text = "Who was Benjamin?", "Benjamin was a silly dog."
input_dict = tokenizer(question, text, return_tensors='tf')
outputs = distilbert(input_dict)
# newer transformers versions return an output object rather than a tuple
start_scores, end_scores = outputs.start_logits, outputs.end_logits
all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
answer = ' '.join(all_tokens[int(tf.math.argmax(start_scores, 1)[0]) : int(tf.math.argmax(end_scores, 1)[0]) + 1])
print(answer)
FOR TF SERVING (in Colab), which was my original goal:
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
!apt update
!apt-get install tensorflow-model-server
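A quick way to confirm the install before launching the server:
!tensorflow_model_server --version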
import os
# MODEL_DIR is the base (versions) directory --> '/path/to/my/model/distilbert_qa/'
# version 1 was saved under --> '/path/to/my/model/distilbert_qa/1/'
MODEL_DIR = 'path_to_model'
os.environ["MODEL_DIR"] = os.path.abspath(MODEL_DIR)
%%bash --bg
nohup tensorflow_model_server --rest_api_port=8501 --model_name=my_model --model_base_path="${MODEL_DIR}" >server.log 2>&1
!tail server.log
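TF Serving also exposes a model-status REST endpoint, which is handy for confirming the model actually loaded before posting predictions:
import requests
# expect "state": "AVAILABLE" once version 1 has loaded
print(requests.get('http://localhost:8501/v1/models/my_model').json())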
MAKE A POST REQUEST:
!pip install -q requests
import json
import requests
import numpy as np
max_length = 384  # must match the fixed sequence length in the model signature
question, text = "Who was Benjamin?", "Benjamin was a good boy."
# padding='max_length' pads the input to the expected length (otherwise the server returns an incompatible-shapes error)
input_dict = tokenizer(question, text, return_tensors='tf', padding='max_length', max_length=max_length)
input_ids = input_dict["input_ids"].numpy().tolist()[0]
att_mask = input_dict["attention_mask"].numpy().tolist()[0]
features = [{'input_ids': input_ids, 'attention_mask': att_mask}]
data = json.dumps({ "signature_name": "serving_default", "instances": features})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/my_model:predict', data=data, headers=headers)
print(json_response)
predictions = json.loads(json_response.text)['predictions']
all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
start_idx = int(np.argmax(predictions[0]['output_0']))  # output_0 holds the start logits
end_idx = int(np.argmax(predictions[0]['output_1']))    # output_1 holds the end logits
answer = ' '.join(all_tokens[start_idx : end_idx + 1])
print(answer)
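One nice consequence of the [None, 384] signature is that the batch dimension is unconstrained, so you can send several question/context pairs in one request by appending more feature dicts to instances. A sketch reusing the names above:
# batch multiple question/context pairs into a single predict request
pairs = [("Who was Benjamin?", "Benjamin was a good boy."),
         ("What was Benjamin?", "Benjamin was a silly dog.")]
features = []
for q, t in pairs:
    enc = tokenizer(q, t, return_tensors='tf', padding='max_length', max_length=max_length)
    features.append({'input_ids': enc['input_ids'].numpy().tolist()[0],
                     'attention_mask': enc['attention_mask'].numpy().tolist()[0]})
data = json.dumps({"signature_name": "serving_default", "instances": features})
response = requests.post('http://localhost:8501/v1/models/my_model:predict',
                         data=data, headers=headers)
predictions = json.loads(response.text)['predictions']  # one entry per pair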