6

Currently the Helsinki-NLP/opus-mt-es-en model takes around 1.5 seconds per inference with Transformers. How can that be reduced? Also, when trying to convert it to ONNX Runtime, I get this error:

ValueError: Unrecognized configuration class <class 'transformers.models.marian.configuration_marian.MarianConfig'> for this kind of AutoModel: AutoModel. Model type should be one of RetriBertConfig, MT5Config, T5Config, DistilBertConfig, AlbertConfig, CamembertConfig, XLMRobertaConfig, BartConfig, LongformerConfig, RobertaConfig, LayoutLMConfig, SqueezeBertConfig, BertConfig, OpenAIGPTConfig, GPT2Config, MobileBertConfig, TransfoXLConfig, XLNetConfig, FlaubertConfig, FSMTConfig, XLMConfig, CTRLConfig, ElectraConfig, ReformerConfig, FunnelConfig, LxmertConfig, BertGenerationConfig, DebertaConfig, DPRConfig, XLMProphetNetConfig, ProphetNetConfig, MPNetConfig, TapasConfig.

Is it possible to convert this model to ONNX Runtime?

sambit9238

2 Answers

5

The OPUS models are originally trained with Marian, a highly optimized machine translation toolkit written entirely in C++. Unlike PyTorch, it does not have the ambition to be a general deep learning toolkit, so it can focus on MT efficiency. The Marian configurations and instructions on how to download the models are at https://github.com/Helsinki-NLP/OPUS-MT.

The OPUS-MT models for Hugging Face's Transformers are converted from the original Marian models and are meant more for prototyping and analyzing the models than for using them for translation in a production-like setup.

Running the models in Marian will certainly be much faster than in Python, and it is certainly much easier than hacking Transformers to run with ONNX Runtime. Marian also offers further tricks to speed up translation, e.g., model quantization, which however comes at the expense of translation quality.
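Marian's quantization is configured in the Marian toolkit itself; if you stay in PyTorch, a rough analogue (not mentioned in the original answer) is dynamic quantization of the converted model. A minimal sketch, with the same quality/speed trade-off caveat:

import torch
from transformers import MarianMTModel

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-es-en")
# Replace the linear layers with int8 versions for CPU inference; this
# usually lowers latency at some cost in translation quality.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)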

With both Marian and Transformers, you can speed things up if you use a GPU or if you narrow the beam width during decoding (the num_beams argument of the generate method in Transformers), as sketched below.
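For reference, a minimal Transformers sketch (not part of the original answer) that loads the model directly and narrows the beam with num_beams; the example sentence is only illustrative:

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-es-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer(["Hola, ¿cómo estás?"], return_tensors="pt", padding=True)
# A narrower beam trades a little translation quality for decoding speed.
outputs = model.generate(**inputs, num_beams=2)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))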

Jindřich
  • Any other thoughts on reducing translation time? I saw about a 10% reduction in translation time when moving `num_beams` from 4 to 2, for reference. I've yet to test a GPU, but I am under the impression that's only useful for batch processing, with a minor reduction in translation time itself. – Kevin Danikowski Aug 16 '21 at 01:04
0

One way to speed up the translations is to indicate (when possible) the source language:

After importing the library, create the model like this:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')  # downloads the relevant opus-mt model per language pair on demand

then (if possible) provide the source language like this:

translated_word = model.translate("Coucou!", source_lang="fr", target_lang="en" )
print(translated_word)  # Hello!

This gives better translation results (for short sentences) and is faster than not providing the source language:

translated_word = model.translate("Coucou!", target_lang="en")
print(translated_word)  # He's gone!

More details are on the official page: https://github.com/UKPLab/EasyNMT
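If you have many sentences, translate also accepts a list (as shown in the EasyNMT README), so you can batch them in a single call; a small sketch reusing the model above:

sentences = ["Coucou!", "Comment ça va?"]
translations = model.translate(sentences, source_lang="fr", target_lang="en")
print(translations)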

Enjoy