I have a parallel corpus which contains about 100,000 aligned paragraphs in Arabic and Persian.
The corpus is noisy: the paragraphs are incomplete translations of each other (i.e., parts of the Arabic paragraphs are not translated into Persian, and the punctuation marks do not match either).
To split the paragraphs into sentences, I used the punctuation marks, but the resulting sentence counts on the two sides do not match.
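A minimal sketch of this kind of punctuation-based splitting, in Python, to show where the mismatch comes from. The function name `split_sentences` and the exact set of sentence-final marks are assumptions for illustration, not the code actually used:

```python
import re

# Sentence-final marks (an assumed set): Latin . ! ? plus the Arabic
# question mark (U+061F) and the Arabic full stop (U+06D4).
_SENT_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")

def split_sentences(paragraph):
    """Split a paragraph at whitespace that follows a sentence-final mark."""
    parts = _SENT_END.split(paragraph.strip())
    return [p for p in parts if p]

# Toy example: the Persian side uses a comma where the Arabic side
# uses a full stop, so the two sides yield different sentence counts.
arabic = "جملة أولى. جملة ثانية. جملة ثالثة."
persian = "جمله اول، جمله دوم. جمله سوم."
print(len(split_sentences(arabic)), len(split_sentences(persian)))
```

Because the split depends only on punctuation, any difference in how the two sides are punctuated changes the sentence counts, even when the content is fully translated.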
Then I used Microsoft Aligner to align the sentences, but the output contains many alignment errors.
How can I segment and align the sentences of this corpus?