
I am using NLTK's WordPunctTokenizer to tokenize this sentence:

في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء

My code is:

import nltk

sentence = " في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"
wordsArray = nltk.tokenize.wordpunct_tokenize(sentence)
print(" ".join(wordsArray))

I noticed that the printed output is the same as the input sentence, so why use the tokenizer at all? Also, would there be any difference in building a machine translation system (MOSES) from tokenized files rather than plain text files?

heidi
    It's printing the input because you joined the tokens back together. You would tokenize when you want to work with the words individually. – Robert Harvey Jul 18 '13 at 16:31
  • You might want to edit this question to emphasise the MT part of your question, if that's the most important part, or set up a second question to ask about using tokenized vs. untokenized texts in MT in general. – dmh Jul 19 '13 at 15:28

1 Answer

The output of the tokeniser is a list of tokens (wordsArray). Your code then joins those tokens back into a single string with:

print " ".join(wordsArray)

Replace this with:

print(wordsArray)
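and you will see that the tokeniser has split punctuation away from the words. WordPunctTokenizer is essentially a regular-expression tokeniser using the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), so a stdlib-only sketch of what it does, using an English example for readability:

```python
import re

def wordpunct(text):
    # Same pattern NLTK's WordPunctTokenizer uses:
    # alternating runs of word characters and non-word, non-space characters
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = wordpunct("Wait... it's gone!")
print(tokens)  # ['Wait', '...', 'it', "'", 's', 'gone', '!']
```

Note that "Wait..." becomes two tokens, 'Wait' and '...'. Joining the tokens with spaces looks almost identical to the input, which is why your print statement seemed to show no change; the tokens themselves are separated.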

Your second question, regarding MOSES, is not clear; please edit it to be more specific.

dkar