Which scripts are used, and how, to convert the WMT newstest datasets from the .sgm format to plain text (like the Europarl dataset)?
For example, the newstest dataset downloaded from http://www.statmt.org/wmt15/test.tgz
contains (when extracted) files such as newstest2015-ende-ref.de.sgm.
How do I convert such a file to the same layout as the Europarl dataset, where each line is one sentence with no markup?
Note:
I have found a script in the Moses directory (linked from the WMT site) called wrap-xml.perl. The test section mentions that it is used to convert to the .sgm format, but the script itself contains no documentation (and I am clueless in Perl).
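
To make the goal concrete, here is a rough Python sketch of the transformation I am after. It is not one of the official WMT or Moses scripts. It assumes that every sentence in the .sgm file is wrapped as <seg id="N">...</seg> and that everything outside those tags can be dropped. The function name sgm_to_lines and the sample text are made up for illustration.

```python
import html
import re

# Assumption: each sentence is wrapped as <seg id="N">...</seg>.
SEG_RE = re.compile(r"<seg\b[^>]*>(.*?)</seg>", re.DOTALL | re.IGNORECASE)


def sgm_to_lines(sgm_text):
    """Return the text of every <seg> element, one string per sentence."""
    lines = []
    for match in SEG_RE.finditer(sgm_text):
        # Decode entities such as &amp; back to plain characters.
        sentence = html.unescape(match.group(1))
        # Collapse line breaks and repeated spaces inside a segment.
        lines.append(" ".join(sentence.split()))
    return lines


# Made-up sample that only imitates the layout of a newstest .sgm file.
sample = """<refset setid="example">
<doc docid="example-doc">
<p>
<seg id="1">Erster Satz.</seg>
<seg id="2">Tom &amp; Jerry.</seg>
</p>
</doc>
</refset>"""

# One sentence per line, with the markup removed.
print("\n".join(sgm_to_lines(sample)))
```

For a real file I would read it with open(path, encoding="utf-8"), pass the contents to sgm_to_lines, and write the returned strings to a text file, one per line. A regular expression is used instead of an XML parser because I am not sure the .sgm files are always well-formed XML.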