Do huggingface translation models support separate vocabulary for source and target?

Question

Every example I've looked at so far seems to use a shared vocabulary between source and target languages, and I'm wondering if that is a hard-coded constraint of the Huggingface models, or my misunderstanding, or I've just not looked in the right place yet?

To take a random example, when I look at the files here, https://huggingface.co/Helsinki-NLP/opus-mt-en-zls/tree/main, I see separate "spm" (SentencePiece model) files for the source and target languages, and they are of different sizes (792kb vs. 850kb). But there is only a single "vocab.json" file, and the config.json file only mentions a single "vocab_size": 57680.

I've also been experimenting, e.g. tokenizer(inputs, text_target=inputs, return_tensors="pt"). If source and target used different vocabularies I would expect the returned input_ids and labels to use different numbers. But for every model I've tried so far the numbers are identical (NO, my mistake - see update below).

Can a Huggingface tokenizer even support two vocabularies? If not then a model would need two tokenizers, which seems to clash with the way AutoTokenizer works.

UPDATE

Here is a test script to show the above model is actually using two spm vocabs with AutoTokenizer.


from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'Helsinki-NLP/opus-mt-en-zls'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = ['Filter all items from same host']
targets = ['Filtriraj sve stavke s istog hosta']

x=tokenizer(inputs, text_target=targets, return_tensors="pt")
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))

print("\nGiving inputs on both sides")
x=tokenizer(inputs, text_target=inputs, return_tensors="pt")
print(x)  ## Expecting to see different numbers if they use different vocabs
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))

print("\nGiving targets on both sides")
x=tokenizer(targets, text_target=targets, return_tensors="pt")  ## Expecting to see different numbers if they use different vocabs
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))

print(model)

The output is:

{'input_ids': tensor([[10373,    90,  8255,    98,   605,  6276,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638,  1392,  7636,   386, 35861,    95,  2130,   218,  6276,    27,
             0]])}
▁Filter all▁items from same host</s>
Filtriraj sve stavke s istog hosta</s>

Giving inputs on both sides
{'input_ids': tensor([[10373,    90,  8255,    98,   605,  6276,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638,   911,    90,  3188,     7,    98,   605,  6276,     0]])}
▁Filter all▁items from same host</s>
Filter all items from same host</s>

Giving targets on both sides
{'input_ids': tensor([[11638,  1392,  7636,    95,   120,   914,   465,   478,    95,    29,
            25,   897,  6276,    27,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638,  1392,  7636,   386, 35861,    95,  2130,   218,  6276,    27,
             0]])}
Filtriraj sve stavke s istog hosta</s>
Filtriraj sve stavke s istog hosta</s>

When I pass the same string (English or Croatian) as both input and target, the input_ids and labels differ slightly, showing that different tokenizers are involved. You can then see that the differing ids sometimes decode back to an identical string, sometimes not.
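To make that observation concrete, here is a minimal toy sketch of how two different segmentation models can feed one shared id table. It is an illustration under stated assumptions, not the Marian tokenizer's actual code: greedy longest-match stands in for SentencePiece, and the piece sets, ids, and the helper names segment, encode and decode are all made up.

```python
# Toy illustration only: greedy longest-match segmentation stands in for
# SentencePiece, and every piece set and id below is hypothetical.

def segment(text, pieces):
    """Split text into the longest known pieces, scanning left to right."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in pieces:
                out.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no piece covers {text[i:]!r}")
    return out

# Two different segmentation models (think source.spm and target.spm).
source_pieces = {"host", "ho", "st", "a", "h", "o", "s", "t"}
target_pieces = {"ho", "sta", "h", "o", "s", "t", "a"}

# One shared piece-to-id table (think vocab.json) covering both piece sets.
shared_vocab = {p: i for i, p in enumerate(sorted(source_pieces | target_pieces))}
id_to_piece = {i: p for p, i in shared_vocab.items()}

def encode(text, pieces):
    """Segment with one model, then look the pieces up in the shared table."""
    return [shared_vocab[p] for p in segment(text, pieces)]

def decode(ids):
    """Map ids back to pieces via the shared table and join them."""
    return "".join(id_to_piece[i] for i in ids)

src_ids = encode("hosta", source_pieces)  # segments as "host" + "a"
tgt_ids = encode("hosta", target_pieces)  # segments as "ho" + "sta"

print(src_ids, tgt_ids)                   # different id sequences
print(decode(src_ids), decode(tgt_ids))   # both decode to the same string
```

In this toy the ids differ only because the text was cut into different pieces. Every id still indexes the same table, which would be consistent with a single vocab.json and a single vocab_size.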

But when I print out the model we see it is actually a shared vocabulary, which makes the two spm models a bit pointless.

 (encoder): MarianEncoder(
   (embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
 (decoder): MarianDecoder(
   (embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
(lm_head): Linear(in_features=512, out_features=57680, bias=False)

I haven't got as far as finding out whether a non-shared vocabulary is possible, but I have yet to see evidence of one.

I assume this is not about HuggingFace, which is just a model hosting zoo, but about OPUS-MT, and specifically about the framework used by OPUS-MT. Should we edit the title accordingly? — Adam Bittlingmayer, Dec 08 '22 at 19:59
@AdamBittlingmayer No, it is specifically a question about Huggingface and trying to understand how its tokenizers and AutoTokenizer works for translation in the case of separate vocabularies. The random example I chose just happened to be under opus-mt. — Darren Cook, Dec 08 '22 at 20:33
By "huggingface" I am referring to "huggingface transformers" (https://huggingface.co/docs/transformers/index) and huggingface-tokenizers (the two tags I chose). As a company/community they do have their fingers in other pies, of course. — Darren Cook, Dec 08 '22 at 20:35

Adam Bittlingmayer · Answer 1 · 2022-12-09T20:44:36.630


For Marian-based models, HuggingFace now supports separate vocabularies for source and target, but some models may not, especially older models.

(As you know, OPUS-MT models are based on MarianMT. The MarianMT framework supports it.)

Before https://github.com/huggingface/transformers/pull/15831, HuggingFace used a shared vocabulary file for Marian.

This PR updates the Marian model:

To allow not sharing embeddings between encoder and decoder.

Allow tying only decoder embeddings with lm_head.

Separate two vocabs in tokenizer for src and tgt language

...

share_encoder_decoder_embeddings: to indicate if emb should be shared or not

So models produced with earlier versions of the framework, or with that parameter set to true, only have one shared vocabulary file for source and target.
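As a rough sketch of what separate vocabularies change, the toy below gives each side its own id table. The pieces, ids, and the helper names encode and decode are hypothetical and are not taken from any real model or from the transformers API. The point is only that an id is then meaningful solely relative to the side it belongs to.

```python
# Toy illustration only: hypothetical pieces and ids, not from any real model.
source_vocab = {"</s>": 0, "host": 1, "same": 2, "from": 3}
target_vocab = {"</s>": 0, "hosta": 1, "istog": 2, "s": 3}

def encode(pieces, vocab):
    """Look up each piece in the given table and append the end-of-sentence id."""
    return [vocab[p] for p in pieces] + [vocab["</s>"]]

def decode(ids, vocab):
    """Map ids back to pieces using the given table."""
    id_to_piece = {i: p for p, i in vocab.items()}
    return [id_to_piece[i] for i in ids]

input_ids = encode(["from", "same", "host"], source_vocab)
labels = encode(["s", "istog", "hosta"], target_vocab)

print(input_ids, labels)             # identical numbers, different meanings
print(decode(labels, target_vocab))  # correct: decoded with the target table
print(decode(labels, source_vocab))  # wrong table: yields source-side pieces
```

With a shared table, one id always denotes one piece, so comparing input_ids with labels is meaningful. With separate tables, identical numbers on the two sides say nothing about whether the underlying pieces match.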

edited Dec 09 '22 at 20:44

answered Dec 08 '22 at 20:08



Thanks for the link to that pull request. It led to https://github.com/huggingface/transformers/pull/16049 and #16050 which seem to be suggesting that different vocabs are on the edges of what Huggingface transformers supports? At least, as of April 2022, when both tasks went quiet. – Darren Cook Dec 08 '22 at 20:42
@DarrenCook There is another comment here: https://github.com/huggingface/transformers/issues/15982#issuecomment-1065072152 – Adam Bittlingmayer Dec 09 '22 at 08:35

Note that this is specific to Marian and does not generalize to other models that are accessible through `transformers`. Implementation details like these are not streamlined across the library - that would be quite hard to do. – Bram Vanroy Dec 09 '22 at 13:52
@BramVanroy Good point, I updated the answer to be clearer about that, feel free to edit. – Adam Bittlingmayer Dec 09 '22 at 20:45
