Tokenizer in moses-SMT system stuck even with 10 sentences

Question

I was trying to build a baseline machine translation (MT) system. Just to check how it works, I made a source (S) and target (T) language corpus of just 2000 sentences. The very first step is to prepare the data for the MT system, and that starts with tokenization, as mentioned in Baseline SMT. I used these commands:

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< ~/corpus/training/news-commentary-v8.fr-en.en    \
> ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< ~/corpus/training/news-commentary-v8.fr-en.fr    \
> ~/corpus/news-commentary-v8.fr-en.tok.fr

(Say S = French and T = English.)

When I checked after 2 hours, it was still running. That was not expected, so I tried again with just ten sentences. To my surprise, it has been 30 minutes and it is still running.

Did I do anything wrong?

PS: OS = Ubuntu 14.04.5 LTS on a Sony ultrabook, no dual boot.

Oh. I forgot to mention that. I've made edits. Please check. — ObiWan, Oct 18 '16 at 17:39
Yes! I did press enter. I re-compiled the package. Now it's working!!! — ObiWan, Nov 01 '16 at 12:25

score 2 · Answer 1 · answered Aug 13 '20 at 05:52

Please follow the steps below:

git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder

git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make
# go back to the mosesdecoder directory so the relative paths in the next steps resolve
cd ..

mkdir tools
cp giza-pp/GIZA++-v2/GIZA++ giza-pp/GIZA++-v2/snt2cooc.out giza-pp/mkcls-v2/mkcls tools

scripts/tokenizer/tokenizer.perl -l fr < ~/corpus/training/news-commentary-v8.fr-en.fr > ~/corpus/news-commentary-v8.fr-en.tok.fr
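
If the tokenizer still seems to hang after this, it may help to run it on a handful of lines with a time limit, so a stuck run fails quickly instead of sitting there for hours. The following is only a rough sketch: `check_tokenizer` is a hypothetical helper name, the 10-line sample and the 60-second limit are arbitrary choices, and it assumes GNU `timeout` (part of coreutils on Ubuntu) is available.

```shell
#!/usr/bin/env bash
# Sanity-check a tokenizer run on a small sample, with a time limit.
# check_tokenizer is a hypothetical helper; the 10-line sample and the
# 60-second limit are assumptions, adjust them to your setup.

# Usage: check_tokenizer TOKENIZER LANG INPUT
check_tokenizer() {
  local tokenizer="$1" lang="$2" input="$3"

  # Report a missing or empty input file up front.
  if [ ! -s "$input" ]; then
    echo "input file missing or empty: $input" >&2
    return 2
  fi

  # head feeds only the first 10 lines on stdin; timeout stops the run
  # after 60 seconds and then exits with status 124.
  head -n 10 "$input" | timeout 60 "$tokenizer" -l "$lang"
}
```

For example, `check_tokenizer ~/mosesdecoder/scripts/tokenizer/tokenizer.perl en ~/corpus/training/news-commentary-v8.fr-en.en` should print the tokenized sample almost immediately. An exit status of 124 would mean the run hit the time limit, which points at the installation or the way the command is invoked rather than at the corpus size.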
