I want to align source and target sentences in a multilingual translation setting.

Conceptually, I want to do something like the following for an example English source sentence and a German target sentence:

0   1   2   3    4       5   6      7
i   saw the man  walking on  the    street  
ich sah den mann auf     der straße gehen

Word-level alignment would be: 0-0 1-1 2-2 3-3 4-7 5-4 6-5 7-6

Or in the case of different lengths between source and target sentence:

0  1   2    3         4   5  6        7   8    9
it is  a    different way of saying   the same thing
es ist eine andere    art ,  dasselbe zu  sagen

Word-level alignment should be something like: 0-0 1-1 2-2 3-3 4-4 5-5 6-[7,8] 7-6 8-6 9-6

What's the best way to achieve this? Thanks for any suggestions!

Lena

1 Answer

Depending on your efficiency requirements, there are several tools you can use. FastAlign (fast_align) is an older but very fast tool. It has to be trained on parallel data first, and pre-trained models do not seem to be available.
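
As a rough sketch of the data handling around FastAlign (not the tool itself): as far as I know, it reads one tokenized sentence pair per line, with the two sides separated by ` ||| `, and writes one line of `i-j` index pairs per sentence pair, which is the same notation as in the question. The helper names below are made up for illustration, and the alignment string is the one from the question rather than real tool output.

```python
def to_fast_align_line(src_tokens, tgt_tokens):
    """Join one tokenized sentence pair into a 'source ||| target' line."""
    return " ".join(src_tokens) + " ||| " + " ".join(tgt_tokens)


def parse_alignment_line(line):
    """Parse a line such as '0-0 1-1 4-7' into (src_index, tgt_index) pairs."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


src = "i saw the man walking on the street".split()
tgt = "ich sah den mann auf der straße gehen".split()

# One line of the training/input file.
print(to_fast_align_line(src, tgt))

# Alignment copied from the question, standing in for the tool's output.
for s, t in parse_alignment_line("0-0 1-1 2-2 3-3 4-7 5-4 6-5 7-6"):
    print(src[s], "->", tgt[t])
```

One-to-many links such as `6-[7,8]` would simply show up as several pairs sharing a source index (`6-7 6-8`) in this representation.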

A very accurate tool based on pre-trained multilingual transformers is SimAlign. It is unsupervised and works out of the box for more than 100 languages, but it is quite computationally demanding.
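
To give an idea of how alignments can be read off embeddings without any training, here is a minimal sketch of mutual-argmax matching, which as far as I know is one of the matching methods SimAlign offers. It is not SimAlign's code: the three-dimensional vectors are invented toy values standing in for contextual embeddings from a multilingual model, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Link i-j only if j is the best match for i AND i is the best match for j."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)           # best target for i
        best_i = max(range(len(sim)), key=lambda k: sim[k][j])  # best source for j
        if best_i == i:
            pairs.append((i, j))
    return pairs


# Toy vectors, invented for this example (not real model output).
src_words = ["the", "man", "walking"]
tgt_words = ["der", "gehen", "mann"]
src_vecs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.2, 1.0]]
tgt_vecs = [[0.9, 0.2, 0.0], [0.1, 0.1, 0.9], [0.0, 1.0, 0.2]]

for i, j in mutual_argmax_align(src_vecs, tgt_vecs):
    print(src_words[i], "->", tgt_words[j])
```

Because a link needs agreement in both directions, this kind of matching tends to favor precision; words without a mutual best match are left unaligned.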

Even better results can be achieved with AwesomeAlign. It builds on the same idea as SimAlign, but it also allows further training on parallel data.

(Your examples are English-German; plenty of parallel English-German data is available on the Hugging Face Hub or in the OPUS project.)

Jindřich