NLTK available languages for word tokenization and sentence tokenization

Question

I need to know which languages NLTK can tokenize. I think the language is set like this:

import nltk.data
lang = "WHATEVER_LANGUAGE"
tokenizer = nltk.data.load('nltk:tokenizers/punkt/'+lang+'.pickle')
text = "something in some specified whatever language"
tokenizer.tokenize(text)

I want to know which languages this works for, but I couldn't find that information in the NLTK documentation.

score 2 · Answer 1 · answered Sep 29 '22 at 19:32

See this answer to a similar question: https://stackoverflow.com/a/71069740/11551168

The list of the languages supported by the NLTK tokenizer is as follows:

'czech'
'danish'
'dutch'
'english'
'estonian'
'finnish'
'french'
'german'
'greek'
'italian'
'norwegian'
'polish'
'portuguese'
'russian'
'slovene'
'spanish'
'swedish'
'turkish'

These names correspond to the pickle files stored in C:\Users\XXX\AppData\Roaming\nltk_data\tokenizers\punkt (on Windows). The name is what you pass as the language keyword argument when tokenizing, e.g.

nltk.word_tokenize(text, language='italian')
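
If you want to check which languages are actually installed on your machine rather than rely on a static list, you can list the pickle files in that punkt directory yourself. This is only a minimal sketch using the standard library: the helper names installed_punkt_languages and check_language are made up for illustration (they are not part of NLTK), and it assumes the one-pickle-per-language layout described above.

```python
from pathlib import Path


def installed_punkt_languages(punkt_dir):
    """Return the sorted language names that have a .pickle file in punkt_dir.

    Hypothetical helper, not an NLTK function. Only the top level of the
    directory is scanned, so subdirectories and files such as README are ignored.
    """
    return sorted(p.stem for p in Path(punkt_dir).glob("*.pickle"))


def check_language(language, punkt_dir):
    """Return the lowercased language name, or raise ValueError if no pickle exists for it."""
    available = installed_punkt_languages(punkt_dir)
    name = language.lower()
    if name not in available:
        raise ValueError(
            f"{language!r} not found; available: {', '.join(available)}"
        )
    return name
```

You would call installed_punkt_languages with the path to your own tokenizers\punkt folder and pass the checked name on as language=... to the tokenizer. The location of that folder differs per system, and other NLTK versions may package the Punkt data differently, so treat the result as a description of your local install only.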
