WMT'15 newstest dataset: .sgm formatting

Question

Which scripts are used, and how, to convert the WMT newstest datasets from the .sgm format to plain text (like the Europarl dataset)?

For example, the newstest archive available at: http://www.statmt.org/wmt15/test.tgz

contains (when extracted) files such as newstest2015-ende-ref.de.sgm

How do I convert such a file into the same layout as the Europarl dataset, where each line holds one sentence with no markup?

Note:

I have found a script in the Moses directory (linked from the WMT site) called wrap-xml.perl. The test section of the site mentions that it is used to convert to the .sgm format, but the script itself contains no documentation (and I have no experience with Perl).

SGM files are a little irritating so we created this: https://github.com/alvations/warmth/blob/master/wmt_metric_task_data_indices.py#L177 — alvas, Jun 13 '16 at 11:16
BTW, the WMT test sets should not be from Europarl. Possibly, if you're looking for Europarl, this is what you're looking for http://opus.lingfil.uu.se/Europarl.php , e.g. plaintext format: http://opus.lingfil.uu.se/download.php?f=Europarl/de-en.txt.zip . Painstakingly compiled by Jorg Tiedemann =) — alvas, Jun 13 '16 at 11:19
@alvas thanks for your response. I am looking for the WMT'15 data to compare with Luong/Chung's results on machine translation. I tried running the script you supplied by commenting out the code at the bottom and using range(14, 15), but got an error about the missing `'metric_data/WMT14/references/newstest2014-ref.hi-en'`. That file is not in the .sgm format, is it? How does the script handle the .sgm format? — Alexander R Johansen, Jun 13 '16 at 12:21
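
For the .sgm handling asked about above, here is a minimal sketch in Python rather than Perl. It assumes the WMT files keep each sentence inside a `<seg id="...">...</seg>` element and escape characters such as `&` as `&amp;`. The function name `sgm_to_lines` is made up for this illustration and is not part of Moses or of the script linked above. It uses a regular expression instead of an XML parser, since .sgm files are not guaranteed to be well-formed XML.

```python
import html
import re
import sys

# Matches one <seg ...>text</seg> element.
# Assumption: <seg> elements are never nested inside each other.
SEG_RE = re.compile(r"<seg[^>]*>(.*?)</seg>", re.IGNORECASE | re.DOTALL)


def sgm_to_lines(sgm_text):
    """Return the text of every <seg> element, one string per segment."""
    lines = []
    for match in SEG_RE.finditer(sgm_text):
        # Turn entities such as &amp; back into plain characters.
        text = html.unescape(match.group(1))
        # Collapse line breaks and repeated spaces so each segment is one line.
        lines.append(" ".join(text.split()))
    return lines


# Command-line use: python sgm_to_txt.py input.sgm output.txt
# (the script file name is only an example)
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as src:
        sentences = sgm_to_lines(src.read())
    with open(sys.argv[2], "w", encoding="utf-8") as dst:
        dst.write("\n".join(sentences) + "\n")
```

Running it on newstest2015-ende-ref.de.sgm should give one sentence per line, in file order, which is the layout of the Europarl plain-text files. Moses also appears to ship a script for this direction, scripts/ems/support/input-from-sgm.perl, which reads the .sgm on stdin and writes plain text to stdout; check that it is present in your checkout before relying on it.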
