Which among the following would be a better dataset for training and tuning Moses?

Question

I am trying to build a Tamil-English translation system using Moses. My data source for the parallel corpus is https://github.com/joshua-decoder/indian-parallel-corpora/tree/master/ta-en. The dict files are approximately 70k lines long, the training files approximately 30k lines, and the others 2-3k lines each. Which of these files are the better choices for training and for tuning?

Currently, I'm using the training files for training and the test files for tuning. Is there a better combination?

score 0 · Accepted Answer · answered Aug 28 '14 at 12:57

The tuning data is typically much smaller than the training data. I would advise you to merge the data you have into a single corpus, then hold out about 1000 sentences from this corpus for tuning and maybe 3000 for development/testing, training on the rest.
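As a rough illustration of that split, here is a minimal Python sketch. It assumes the merged corpus is already loaded as two aligned lists of sentences (one Tamil, one English); the function name `split_parallel`, the default sizes, and the seed are all hypothetical choices for the example, not part of Moses. Shuffling the pairs together keeps each Tamil sentence aligned with its English translation.

```python
import random


def split_parallel(src, tgt, n_tune=1000, n_test=3000, seed=13):
    """Shuffle aligned sentence pairs and split them into train/tune/test.

    src and tgt are lists of sentences where src[i] translates to tgt[i].
    The sizes follow the rough figures suggested above; adjust as needed.
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of lines")

    pairs = list(zip(src, tgt))
    if len(pairs) <= n_tune + n_test:
        raise ValueError("corpus too small for the requested split")

    # Shuffle pairs (not the two sides separately) so alignment is preserved.
    random.Random(seed).shuffle(pairs)

    tune = pairs[:n_tune]
    test = pairs[n_tune:n_tune + n_test]
    train = pairs[n_tune + n_test:]
    return train, tune, test
```

Each returned list holds `(source, target)` tuples, which can then be written out as one file per language and per split, since Moses expects the two sides of a corpus in separate, line-aligned files.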

Pierre Lison
