
I am using NLTK's WordPunctTokenizer to tokenize this sentence:

في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء

My code is:

import nltk

sentence = " في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"
wordsArray = nltk.tokenize.wordpunct_tokenize(sentence)
print(" ".join(wordsArray))

I noticed that the printed output is the same as the input sentence, so why use the tokenizer at all? Also, would there be any difference in building a machine translation system (MOSES) from tokenized files rather than plain text files?

heidi
    It's printing the input because you joined the tokens back together. You would tokenize when you want to work with the words individually. – Robert Harvey Jul 18 '13 at 16:31
  • You might want to edit this question to emphasise the MT part of your question, if that's the most important part, or set up a second question to ask about using tokenized vs. untokenized texts in MT in general. – dmh Jul 19 '13 at 15:28

1 Answer

The output of the tokeniser is a list of tokens (wordsArray). Your code then joins those tokens back into a single string with:

print " ".join(wordsArray)

Replace this with:

print(wordsArray)
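and you will see that the tokeniser has split punctuation away from the words. WordPunctTokenizer is essentially a regular-expression tokeniser using the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), so a stdlib-only sketch of what it does, using an English example for readability:

```python
import re

def wordpunct(text):
    # Same pattern NLTK's WordPunctTokenizer uses:
    # alternating runs of word characters and non-word, non-space characters
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = wordpunct("Wait... it's gone!")
print(tokens)  # ['Wait', '...', 'it', "'", 's', 'gone', '!']
```

Note that "Wait..." becomes two tokens, 'Wait' and '...'. Joining the tokens with spaces looks almost identical to the input, which is why your print statement seemed to show no change; the tokens themselves are separated.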

Your second question, regarding MOSES, is not clear; please edit it to be more specific.

dkar