I would like to use the Helsinki-NLP/opus-mt-de-en model from HuggingFace to translate text. This works fine with the HuggingFace Inference API or a Transformers pipeline, e.g.:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model = ORTModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")

onnx_translation = pipeline("translation_de_to_en", model=model, tokenizer=tokenizer)

result = onnx_translation("Dies ist ein Test!")
result # Prints: "[{'translation_text': 'This is a test!'}]"

However, I need to use the ONNX Runtime for this as part of a project. I was able to successfully export the model to ONNX format, but I get the following output when I decode the output of the InferenceSession:

<unk> <unk> <unk> <unk> <unk>.<unk> <unk> <unk>,<unk>,<unk>,.<unk> <unk>,<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.<unk> <unk> <unk>,<unk> <unk> <unk>. the,<unk>,,.<unk> <unk> <unk>,<unk>, the<unk> <unk> <unk> <unk> <unk>.<unk> <unk> <unk> <unk> <unk>.<unk> <unk> <unk> <unk>.<unk> <unk> <unk> <unk> the.. in<unk> <unk>.<unk> <unk>,<unk> <unk> <unk> the<unk> <unk> <unk> <unk>.<unk> the<unk> <unk> the<unk> <unk> <unk>.<unk> <unk> <unk> <unk> <unk>,<unk> in<unk> the<unk>,<unk>,<unk> <unk> in, the in<unk> <unk> <unk> s<unk>. the.<unk> <unk>, in,<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>. a,,<unk> <unk> <unk>.<unk>.<unk> <unk>.<unk> is<unk>,<unk> in,,<unk> <unk> the<unk> the the. in<unk> <unk> <unk>,<unk> <unk> <unk> <unk>. in<unk>,,,<unk>.<unk> <unk>. of<unk> in<unk>.<unk>,<unk> <unk> <unk> the<unk> <unk> <unk> <unk> <unk> die,.<unk>,<unk> <unk> <unk> die,<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> the<unk>.<unk> <unk> <unk> <unk> <unk> of.<unk> <unk> <unk>.<unk> in<unk>, the<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.<unk> <unk>,.<unk> <unk>,<unk>,<unk> <unk>,<unk> <unk>,<unk> <unk> <unk>,<unk>,<unk> <unk> <unk>,<unk>.<unk> of<unk>.<unk> of, the<unk> the.<unk> <unk>

This is the relevant code (without the ONNX export):

from transformers import AutoTokenizer

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
# Encode input
encoded_input = tokenizer("Dies ist ein Test!")
# Create model input dictionary
model_input = {
    'input_ids': [encoded_input.input_ids],
    'attention_mask': [encoded_input.attention_mask],
    'decoder_input_ids': [encoded_input.input_ids],
    'decoder_attention_mask': [encoded_input.attention_mask]
}
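# Note: 'session' is the onnxruntime.InferenceSession created in the omitted export step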
# Run inference
output = session.run(['last_hidden_state'], model_input)
last_hidden_state = output[0][0][0]
# Decode output
decoded_output = tokenizer.decode(last_hidden_state, skip_special_tokens=True)
decoded_output # Expected value: "This is a test!"

The complete code can be found in this Colab Notebook as a reproducible example.

I don't have much experience with ONNX and the MarianMT models yet. What am I doing wrong and how can I decode the text correctly?

RGe

1 Answer

The ONNX model you have exported does not contain the HuggingFace generation internals that produce the full translation in a single call. Running your ONNX model essentially just executes the encoder-decoder pair once. Getting the full translation, however, is an iterative process: tokens are predicted one at a time and fed back into the decoder. So to get the full translation with ONNX Runtime, you have to write this loop yourself.

Also, note that the output node of your ONNX model is called last_hidden_state. This is the raw output of the decoder, a feature vector used to predict token probabilities. Adding the flag --feature=seq2seq-lm when exporting includes the final linear layer in the model, so the output is a tensor of logits over the vocabulary.
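
For reference, the flag belongs to the transformers.onnx export CLI; under that assumption, the export might look roughly like this (the output directory is illustrative):

python -m transformers.onnx --model=Helsinki-NLP/opus-mt-de-en --feature=seq2seq-lm onnx/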

To summarize:

  1. Export the model with the correct flags
  2. Look at how transformers generates translations (see the generate() sketch below)
  3. Find the correct values for decoder_input_ids and decoder_attention_mask
  4. Write the inference loop for your ONNX model
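
For comparison, this is roughly what transformers does for you when you call generate() on the plain PyTorch model; a minimal sketch with greedy defaults:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")

inputs = tokenizer("Dies ist ein Test!", return_tensors="pt")
# generate() runs the token-by-token decoding loop internally
output_ids = model.generate(**inputs)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # This is a test!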

EDIT: Minimal working example

import onnxruntime as rt
import numpy as np

# Model exported with the --feature=seq2seq-lm flag
session = rt.InferenceSession("model.onnx")

for node in session.get_inputs():
    print(node.name, node.shape, node.type)
# >>> input_ids ['batch', 'encoder_sequence'] tensor(int64)
# >>> attention_mask ['batch', 'encoder_sequence'] tensor(int64)
# >>> decoder_input_ids ['batch', 'decoder_sequence'] tensor(int64)
# >>> decoder_attention_mask ['batch', 'decoder_sequence'] tensor(int64)

for node in session.get_outputs():
    print(node.name, node.shape, node.type)
# >>> logits ['batch', 'decoder_sequence', 58101] tensor(float)
# >>> onnx::MatMul_1036 ['Addonnx::MatMul_1036_dim_0', 'Addonnx::MatMul_1036_dim_1', 512] tensor(float)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
encoded_input = tokenizer("Dies ist ein Test!")

# Avoid passing Python lists to InferenceSession.run; create
# explicitly typed and shaped numpy arrays instead
input_ids = np.array(encoded_input.input_ids).astype(np.int64).reshape(1, -1)
attention_mask = np.array(encoded_input.attention_mask).astype(np.int64).reshape(1, -1)

# 58100 is the <pad> token. Initially, the decoder input has no words
decoder_input_ids = np.full((1, 6), 58100, dtype=np.int64)

# Attention mask for the decoder, initially we are predicting the first word
decoder_attention_mask = np.array([[1, 0, 0, 0, 0, 0]], dtype=np.int64)

model_input = {
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'decoder_input_ids': decoder_input_ids,
    'decoder_attention_mask': decoder_attention_mask,
}

# Run inference loop
for i in range(1, 6):
    # Token logits for each position in the decoder sequence
    logits = session.run(None, model_input)[0]
    # Extract the tokens with the highest probability
    tokens = logits.argmax(axis=2)[0]

    # Update the decoder inputs for the next step
    # Add the word we just predicted to input ids
    model_input["decoder_input_ids"][0, i] = tokens[i]
    # Update attention mask to match the current position
    model_input["decoder_attention_mask"][0, i] = 1

predicted_sequence = model_input["decoder_input_ids"].reshape(-1)
decoded_output = tokenizer.decode(predicted_sequence, skip_special_tokens=True)
print(decoded_output) # This is a test!

Note that this code is wrong in several ways; please don't use it as is. You would need to add extra padding to the decoder input, since you don't know in advance how long the translation is going to be. You would also want to check whether the predicted token is the end-of-sentence token, in which case the model considers the translation done and you can stop the inference.
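
To illustrate both points, here is a rough sketch of a greedy loop that grows the decoder input one token at a time and stops at end-of-sentence. It reuses session, tokenizer, input_ids and attention_mask from above; the maximum length of 64 is an arbitrary cap, not something the model prescribes:

import numpy as np

eos_token_id = tokenizer.eos_token_id
pad_token_id = 58100  # decoder start token for this model

# Start with just the decoder start token and grow the sequence each step
decoder_input_ids = np.array([[pad_token_id]], dtype=np.int64)

for _ in range(64):  # arbitrary maximum length
    logits = session.run(None, {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'decoder_input_ids': decoder_input_ids,
        'decoder_attention_mask': np.ones_like(decoder_input_ids),
    })[0]
    # The logits at the last position predict the next token
    next_token = logits[0, -1].argmax()
    decoder_input_ids = np.concatenate([decoder_input_ids, [[next_token]]], axis=1)
    if next_token == eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))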

I am by no means an expert on HuggingFace, but I'm fairly certain that they provide helper functions for constructing the decoder inputs.

simeonovich
  • Thank you for your answer. Do you have a working example that I can use as a guide? Unfortunately, I am not getting anywhere with points 3 and 4. – RGe Apr 30 '23 at 08:46
  • Added a minimal working example based on your code. – simeonovich May 02 '23 at 11:36