
I have a situation where I am trying to use pre-trained Hugging Face models to translate a pandas column of text from Dutch to English. My input is simple:

Dutch_text             
Hallo, het gaat goed
Hallo, ik ben niet in orde
Stackoverflow is nuttig

I am using the code below to translate the above column, and I want to store the result in a new column, ENG_Text. So the output will look like this:

ENG_Text             
Hello, I am good
Hi, I'm not okay
Stackoverflow is helpful

The code that I am using is as follows:

#https://huggingface.co/Helsinki-NLP for other pretrained models 
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
input_1 = df['Dutch_text']
input_ids = tokenizer("translate English to Dutch: "+input_1, return_tensors="pt").input_ids # Batch size 1
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

Any help would be appreciated!

Django0602
  • Could you say more about what errors you are seeing and/or where the output is not as you expect? That may help others identify where the problem is. – Matthew Cox Dec 28 '20 at 21:36

1 Answer


This is not how an MT model is supposed to be used. It is not a GPT-like model where you experiment to see whether it can understand an instruction. It is a translation model that can only translate, so there is no need to add the instruction "translate English to Dutch". (And don't you want to translate the other way round?)

Also, the translation models are trained to translate sentence by sentence. If you concatenate all sentences from the column, it will be treated as a single sentence. You need to either:

  1. Iterate over the column and translate each sentence independently (a rough sketch of this approach follows this list).

  2. Split the column into batches, so you can parallelize the translation. Note that in that case, you need to pad the sentences in the batches to have the same length. The easiest way to do it is by using the batch_encode_plus method of the tokenizer.
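
For the iteration approach (option 1), a rough sketch could look like the following. It assumes the DataFrame and column names from the question (df, Dutch_text, ENG_Text); note there is no instruction prefix, the model just receives the raw Dutch sentence.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-nl-en")

def translate(sentence):
    # one Dutch sentence in, one English sentence out; no prompt needed
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

df['ENG_Text'] = df['Dutch_text'].apply(translate)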

Jindřich
  • You are correct that I need to iterate over each sentence separately, and I will take care of that, but can you explain a little about how I should translate then? My requirement is to translate multiple languages to English only, for example Dutch to English and German to English. How should I tweak the instruction then? Is there a separate model I can use to achieve this? – Django0602 Dec 29 '20 at 12:39
  • There is no instruction. For translation from German to English, you have to use a different model, for instance `Helsinki-NLP/opus-mt-de-en`. – Jindřich Dec 29 '20 at 15:08
  • That's correct, I am using that model now as I am trying to translate text from German to English. But how do I apply that model directly to each sentence? And if I am applying it to each sentence, do I need to pad the sequences to make the text lengths equal? – Django0602 Dec 29 '20 at 15:10
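
For the batched variant discussed in these comments (option 2 in the answer), a minimal sketch might look like the following. The column name German_text and the batch size are assumptions, not taken from the question; passing padding=True to the tokenizer pads every sentence in a batch to the same length, so no manual padding is needed.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")

sentences = df['German_text'].tolist()
translations = []
batch_size = 16
for i in range(0, len(sentences), batch_size):
    batch = sentences[i:i + batch_size]
    # padding=True pads shorter sentences in the batch to the longest one
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs)
    translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

df['ENG_Text'] = translations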