
I'm trying to convert a Hugging Face model into ONNX so I can use it in BigQuery ML, which allows importing ONNX models. However, the transformers tokenizer is never included in the model.

How do I export a model WITH tokenizer into a single ONNX file?

Here's what I tried:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import sys
# model_name is being passed in from the command line
model_name = sys.argv[1]

# Load tokenizer and PyTorch weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# replace slashes with underscores
model_name_underscored = model_name.replace("/", "_")
# Save to disk
tokenizer.save_pretrained(model_name_underscored)
pt_model.save_pretrained(model_name_underscored)

And then

python3 -m transformers.onnx --model=./$model_name_underscored ${model_name_underscored}_onnx/

After importing it to BigQuery ML I can see that the model expects tokenized input, not plain text.
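
For reference, this is roughly what that tokenized input looks like (distilbert-base-uncased-finetuned-sst-2-english is just an example checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
encoded = tokenizer("Just a sample", return_tensors="np")

# The exported model expects these integer arrays, not the raw string
print(encoded["input_ids"])       # e.g. [[ 101, 2074, ...,  102]]
print(encoded["attention_mask"])  # e.g. [[1, 1, ..., 1]]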

Alternatively, how do I use AutoTokenizer within BigQuery ML so that the model's output is identical to what the Python script produces?

stkvtflw

3 Answers

0

Try something like

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

dummy_model_input = tokenizer("Just a sample", return_tensors="pt")
# note: some tokenizers also return token_type_ids; add it to input_names if so

torch.onnx.export(
    model,
    tuple(dummy_model_input.values()),
    f="model.onnx",
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'},
                  'attention_mask': {0: 'batch_size', 1: 'sequence'},
                  'logits': {0: 'batch_size', 1: 'sequence'}},
    do_constant_folding=True,
    opset_version=13,
)

Or have a look at the transformers.onnx package, or at the Optimum library.
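
For example, a minimal sketch of the Optimum route (assuming optimum with the onnxruntime extra is installed; depending on your Optimum version the keyword is export=True or from_transformers=True):

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
ort_model.save_pretrained("onnx_model/")  # writes model.onnx plus config

# the tokenizer is still saved as separate files, not inside the .onnx graph
AutoTokenizer.from_pretrained(model_name).save_pretrained("onnx_model/")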

Silverstorm
0

Short answer: what you are trying to achieve might be impossible.

Long answer:

Depending on the exact tokenizer you are using, you might be able to produce a single ONNX file using the onnxruntime-extensions library. However, this is unlikely to solve your problem. The ONNX standard does not include any operators for string manipulation, which makes in-graph tokenization impossible. While onnxruntime-extensions provides the missing operators via its Custom Ops API, these operators are most likely not part of the runtime used by BigQuery, so it won't be able to consume such a model.
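
For illustration, here is a minimal sketch of that approach using the gen_processing_models helper from onnxruntime-extensions (whether it works depends on the tokenizer type, and the resulting graph only runs where the custom-ops library can be registered):

import onnxruntime as ort
from onnxruntime_extensions import gen_processing_models, get_library_path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Generate an ONNX graph that performs the tokenization itself
# (only works for tokenizer types that onnxruntime-extensions supports)
pre_model, _ = gen_processing_models(tokenizer, pre_kwargs={})

# Running it requires registering the custom-ops library with the session;
# a managed runtime like BigQuery ML gives you no way to do this
opts = ort.SessionOptions()
opts.register_custom_ops_library(get_library_path())
session = ort.InferenceSession(pre_model.SerializeToString(), opts)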

Therefore, using a model with custom operators in BigQuery is a bigger challenge than creating such a model. You could still try though.

-2

You need to make sure that both the model and the tokenizer are exported. Here's one way to export a Hugging Face model to ONNX and save the tokenizer alongside it:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and the tokenizer
model_name = "your-model-name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Export the model to ONNX
dummy_input = tokenizer("dummy input", return_tensors="pt")
torch.onnx.export(model, tuple(dummy_input.values()), "model.onnx",
                  opset_version=12,
                  input_names=["input_ids", "attention_mask"],
                  output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence"},
                                "attention_mask": {0: "batch_size", 1: "sequence"},
                                "logits": {0: "batch_size"}})

# Save the tokenizer files next to the model (they are separate files,
# not embedded in the .onnx graph)
tokenizer.save_pretrained("tokenizer/")

This saves the model as "model.onnx" and the tokenizer files next to it; the tokenizer itself is not embedded in the ONNX graph.

You could then try the TRANSFORM clause in BigQuery ML to apply tokenization to your input data (my_tokenizer_func below is a placeholder for a user-defined function you would have to provide):

CREATE MODEL my_model
TRANSFORM(
  input_text,
  (SELECT input_ids FROM UNNEST(my_tokenizer_func([input_text]))) AS input_ids
)
OPTIONS(
  model_type='ONNX',
  input_label_cols=['input_text'],
  output_label_cols=['input_ids'],
  model_path='gs://my-bucket/model.onnx'
)
AS
SELECT input_text, label FROM my_table;
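
Once the model is imported, predictions go through ML.PREDICT. A sketch with the Python client, assuming a hypothetical dataset my_dataset and an input table whose columns match the model's input names:

from google.cloud import bigquery

client = bigquery.Client()

# my_dataset, my_model and tokenized_inputs are placeholders; the selected
# columns must match the ONNX model's declared inputs
sql = """
SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.my_model`,
  (SELECT input_ids, attention_mask FROM `my_dataset.tokenized_inputs`)
)
"""
for row in client.query(sql).result():
    print(dict(row))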
Caio Victor