Sentence segmentation and alignment in a noisy text corpus

Question

I have a parallel corpus which contains about 100,000 aligned paragraphs in Arabic and Persian.

My corpus is noisy: the paragraphs are incomplete translations of each other (i.e., parts of the Arabic paragraphs are not translated into Persian, and the punctuation marks do not match either).

To split the paragraphs into sentences, I used the punctuation marks, but the resulting sentence counts do not match.
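
For illustration, punctuation-based splitting along these lines might look like the sketch below. The delimiter set and the name `split_sentences` are assumptions made for the example, not the code actually run on the corpus.

```python
import re

# Assumed delimiter set (illustrative only): Latin . ! ? plus the
# Arabic question mark (U+061F) and the Arabic full stop (U+06D4).
SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")


def split_sentences(paragraph):
    """Split a paragraph at whitespace that follows sentence-final punctuation."""
    parts = SENTENCE_END.split(paragraph.strip())
    # Drop empty strings, which appear when the paragraph is blank.
    return [p for p in parts if p]
```

Because the split depends only on punctuation, a paragraph and its translation yield different sentence counts whenever one side punctuates differently or leaves text untranslated.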

Then I used Microsoft Aligner to align the sentences, but the results contain many errors.

How can I segment and align the sentences of this corpus?

score 0 · Answer 1 · answered Feb 06 '13 at 09:47

You've used the Giza++ tag in your question: did you look at using the alignment tools from there? The other option that I know quite a few people use is Moses, which is a fully featured statistical MT package, but I believe you can invoke the alignment models in isolation if this is really all you want.

Ben Allison

Giza++ is used for word alignment, not for sentence alignment. The Moses toolkit already contains Giza++. – jvdbogae Mar 25 '15 at 12:34
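
To make the distinction concrete, sentence alignment pairs whole sentences rather than words. The sketch below is a simplified length-based dynamic program in the spirit of Gale and Church's approach, not the algorithm of any tool named in this thread. The function name, the cost function, and both penalty values are assumptions chosen for illustration; a real aligner would use a calibrated statistical cost.

```python
def align_by_length(src, tgt, skip_penalty=0.8, merge_penalty=0.1):
    """Align two sentence lists using character lengths only.

    Returns a list of (src_indices, tgt_indices) pairs. Allowed pairings
    are 1-1, 1-0, 0-1, 2-1 and 1-2. The cost is a crude relative length
    mismatch, an assumption made for this sketch.
    """
    def mismatch(s_len, t_len):
        # Relative length difference, between 0 and 1.
        return abs(s_len - t_len) / max(s_len, t_len, 1)

    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    # One side has no counterpart (untranslated sentence).
                    step = skip_penalty
                else:
                    s_len = sum(len(s) for s in src[i:ni])
                    t_len = sum(len(t) for t in tgt[j:nj])
                    step = mismatch(s_len, t_len)
                    if di + dj > 2:
                        # Slightly discourage merging two sentences into one.
                        step += merge_penalty
                if best[i][j] + step < best[ni][nj]:
                    best[ni][nj] = best[i][j] + step
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the pairings.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

The 1-0 and 0-1 pairings are what let such a scheme tolerate untranslated sentences, which matters for a corpus like the one described in the question. Length alone is a weak signal on very noisy data, so in practice it is usually combined with lexical evidence.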
