
I have trained a model for text classification using huggingface/transformers, then exported it using the built-in ONNX functionality.

Now I'd like to use it for inference on millions of texts (around 100 million sentences). My idea is to put all the texts in a Spark DataFrame, bundle the .onnx model into a Spark UDF, and run inference that way on a Spark cluster, roughly as in the sketch below.
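A minimal sketch of what I have in mind, using onnxruntime inside an iterator-style pandas UDF (the model path, tokenizer name, and ONNX input names are placeholders for my actual setup):

```python
from typing import Iterator

import numpy as np
import onnxruntime as ort
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType
from transformers import AutoTokenizer

MODEL_PATH = "/mnt/models/classifier.onnx"   # placeholder; must be reachable from every worker
TOKENIZER_NAME = "distilbert-base-uncased"   # placeholder

@pandas_udf(LongType())
def classify(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Iterator-of-Series UDF (Spark 3.0+): the session and tokenizer are
    # created once per task, then reused for every batch in the partition.
    session = ort.InferenceSession(MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    for texts in batches:
        enc = tokenizer(texts.tolist(), padding=True, truncation=True,
                        return_tensors="np")
        # Input names must match those chosen at ONNX export time.
        logits = session.run(None, {"input_ids": enc["input_ids"],
                                    "attention_mask": enc["attention_mask"]})[0]
        yield pd.Series(np.argmax(logits, axis=1))

# df is the DataFrame holding the sentences in a "text" column.
df = df.withColumn("label", classify("text"))
```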

Is there a better way of doing this? Am I doing things "the right way"?

Contestosis
  • I have edited my answer to also include the MLflow approach. Note that I have not used it myself, but according to their documentation, it should also be doable. I hope that either of the solutions would help you solve your issue. – Arda Aytekin Aug 25 '22 at 08:25

1 Answer


I am not sure whether you are aware of SynapseML, or allowed to use it given its requirements (cf. "SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+," as of today, per the landing page), but SynapseML does have support for ONNX inference on Spark. This could probably be the cleanest solution for you.
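I have not run this myself, but based on the ONNXModel example in the SynapseML docs, a sketch could look like the following. The model path and the tensor/column names are assumptions; also note that ONNXModel consumes numeric tensor columns, so your texts would need to be tokenized into such columns beforehand:

```python
from synapse.ml.onnx import ONNXModel

# Read the exported model's bytes so each executor gets a copy.
with open("/mnt/models/classifier.onnx", "rb") as f:  # placeholder path
    model_payload = f.read()

onnx_ml = (
    ONNXModel()
    .setModelPayload(model_payload)
    .setDeviceType("CPU")
    # feedDict maps ONNX input node names -> DataFrame column names;
    # fetchDict maps output column names -> ONNX output node names.
    # The tensor names here assume a typical transformers export.
    .setFeedDict({"input_ids": "input_ids", "attention_mask": "attention_mask"})
    .setFetchDict({"logits": "logits"})
    .setMiniBatchSize(64)
)

# df must already contain the tokenized "input_ids"/"attention_mask" columns.
scored = onnx_ml.transform(df)
```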

EDIT. Also, MLflow has support for exporting a python_function model as an Apache Spark UDF. With MLflow, you save your model in, say, the ONNX format, log/register the model via mlflow.onnx.log_model, and later retrieve it in the mlflow.pyfunc.spark_udf call via its model URI, e.g., models:/<model-name>/<model-version>.
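An untested sketch of that flow, based on the MLflow docs (the model name "text-classifier", the version, and the column names are placeholders):

```python
import mlflow
import mlflow.onnx
import onnx
from pyspark.sql import SparkSession

# Log and register the exported model once.
onnx_model = onnx.load("classifier.onnx")  # placeholder path
with mlflow.start_run():
    mlflow.onnx.log_model(onnx_model, artifact_path="model",
                          registered_model_name="text-classifier")

# Later, on the cluster: wrap the registered model as a Spark UDF.
spark = SparkSession.builder.getOrCreate()
classify = mlflow.pyfunc.spark_udf(spark, model_uri="models:/text-classifier/1")

# The pyfunc wrapper runs the raw ONNX graph, so the input columns must
# already hold the numeric tensors (e.g. token ids), not the raw text;
# result_type may also need adjusting to match the model's output.
scored = df.withColumn("prediction", classify("input_ids", "attention_mask"))
```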

Arda Aytekin