
I need to perform named entity extraction on text in multiple languages: Spanish, Portuguese, Greek, Czech, and Chinese.

Is there a list somewhere of all the languages supported by these two functions? And is there a way to use other corpora so that these languages can be included?

Michael
  • https://stackoverflow.com/questions/41070105/pos-for-languages-other-than-english has some pointers for getting POS tagger support for languages other than English. – tripleee Apr 21 '23 at 03:52

2 Answers


The list of the languages supported by the NLTK tokenizer is as follows:

  • 'czech'
  • 'danish'
  • 'dutch'
  • 'english'
  • 'estonian'
  • 'finnish'
  • 'french'
  • 'german'
  • 'greek'
  • 'italian'
  • 'norwegian'
  • 'polish'
  • 'portuguese'
  • 'russian'
  • 'slovene'
  • 'spanish'
  • 'swedish'
  • 'turkish'

It corresponds to the pickled models stored in C:\Users\XXX\AppData\Roaming\nltk_data\tokenizers\punkt (on Windows). This is the value you pass as the language keyword argument when tokenizing, e.g.

nltk.word_tokenize(text, language='italian')
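The mapping from the language argument to the punkt pickle files can be sketched as follows. This is an illustrative helper, not part of NLTK's API; the nltk_data directory name is the conventional default location:

```python
import os

# Languages shipped with the punkt sentence tokenizer models,
# matching the list above.
PUNKT_LANGUAGES = {
    'czech', 'danish', 'dutch', 'english', 'estonian', 'finnish',
    'french', 'german', 'greek', 'italian', 'norwegian', 'polish',
    'portuguese', 'russian', 'slovene', 'spanish', 'swedish', 'turkish',
}

def punkt_pickle_path(language, nltk_data_dir='nltk_data'):
    """Return the relative path of the punkt model that
    nltk.word_tokenize(text, language=...) would load, or raise
    ValueError for an unsupported language."""
    if language not in PUNKT_LANGUAGES:
        raise ValueError('No punkt model for %r' % language)
    return os.path.join(nltk_data_dir, 'tokenizers', 'punkt',
                        language + '.pickle')

print(punkt_pickle_path('italian'))
```

Note that Chinese is not in the list: punkt is trained on languages that separate words with spaces, so a language like Chinese would need a different tokenizer entirely.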

By default, both functions only support English text. This isn't really spelled out in the documentation, but you can see it by looking at the source code:

  1. The pos_tag() function loads a tagger from this file: 'taggers/maxent_treebank_pos_tagger/english.pickle'. (see here)

  2. The word_tokenize() function uses the Treebank tokenizer which uses regular expressions to tokenize text as in the (English) Penn Treebank Corpus. (see here)
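The regex-based approach of point 2 can be illustrated with a toy tokenizer. This is a simplified sketch of the idea, not NLTK's actual TreebankWordTokenizer rule set, which applies a much longer list of patterns:

```python
import re

def toy_treebank_tokenize(text):
    """Toy, Treebank-inspired tokenizer: split off punctuation and
    common English contractions with regular expressions."""
    # Isolate sentence punctuation with surrounding spaces.
    text = re.sub(r"([,.;:?!()\"])", r" \1 ", text)
    # Split off English contraction suffixes (e.g. don't -> do n't).
    text = re.sub(r"(\w)('s|n't|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    return text.split()

print(toy_treebank_tokenize("Don't panic, it's fine."))
# → ['Do', "n't", 'panic', ',', 'it', "'s", 'fine', '.']
```

Because the rules are English-specific (contractions, English punctuation conventions), the default word_tokenize behavior only really makes sense for English input.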

Suzana
  • But word_tokenize works well with Western languages. Does that mean I can use it for any language except those that don't separate words with spaces? – Michael Mar 01 '13 at 10:28
  • They explain the algorithm [here](http://www.cis.upenn.edu/~treebank/tokenization.html). As Western languages usually separate words with spaces, it should work fine. I've only tried it for German, and it looked okay. – Suzana Mar 01 '13 at 15:15