1

I was trying to make a baseline MT system. Just for checking How it works I made Source (S) and Target (T) language corpus of just 2000 sentences. The very first step is to prepare the data for Machine Translation (MT) system. In this step first we have to perform tokenization as mentioned here Baseline SMT. I've used this code:

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< ~/corpus/training/news-commentary-v8.fr-en.en    \
> ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< ~/corpus/training/news-commentary-v8.fr-en.fr    \
> ~/corpus/news-commentary-v8.fr-en.tok.fr

( say S = French & T = English)

I checked after 2 hours it was still running. I got curious since it was not expected. Then I tried with just ten sentences. To my surprise, it's been 30 minutes and it is still running.

Did I do anything wrong?

PS: OS = Ubuntu 14.04.5 LTS Sony ultrabook No dual boot.

ObiWan
  • 196
  • 1
  • 12

1 Answers1

2

Please Follow bellow steps ;

git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder

git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make

mkdir tools
cp giza-pp/GIZA++-v2/GIZA++ giza-pp/GIZA++-v2/snt2cooc.out giza-pp/mkcls-v2/mkcls tools

scripts/tokenizer/tokenizer.perl -l fr < ~/corpus/training/news-commentary-v8.fr-en.fr > ~/corpus/news-commentary-v8.fr-en.tok.fr