
I need to perform named entity extraction on text in multiple languages: Spanish, Portuguese, Greek, Czech, and Chinese.

Is there a list somewhere of all the languages supported by these two functions? And is there a way to use other corpora so that these languages can be included?

Michael
  • https://stackoverflow.com/questions/41070105/pos-for-languages-other-than-english has some pointers for getting POS tagger support for languages other than English. – tripleee Apr 21 '23 at 03:52

2 Answers


The list of the languages supported by the NLTK tokenizer is as follows:

  • 'czech'
  • 'danish'
  • 'dutch'
  • 'english'
  • 'estonian'
  • 'finnish'
  • 'french'
  • 'german'
  • 'greek'
  • 'italian'
  • 'norwegian'
  • 'polish'
  • 'portuguese'
  • 'russian'
  • 'slovene'
  • 'spanish'
  • 'swedish'
  • 'turkish'

It corresponds to the pickled models stored in C:\Users\XXX\AppData\Roaming\nltk_data\tokenizers\punkt (on Windows). This is the value you pass as the language keyword argument when tokenizing, e.g.

nltk.word_tokenize(text, language='italian')
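The mapping from the language argument to the punkt pickle files can be sketched as follows. This is an illustrative helper, not part of NLTK's API; the nltk_data directory name is the conventional default location:

```python
import os

# Languages shipped with the punkt sentence tokenizer models,
# matching the list above.
PUNKT_LANGUAGES = {
    'czech', 'danish', 'dutch', 'english', 'estonian', 'finnish',
    'french', 'german', 'greek', 'italian', 'norwegian', 'polish',
    'portuguese', 'russian', 'slovene', 'spanish', 'swedish', 'turkish',
}

def punkt_pickle_path(language, nltk_data_dir='nltk_data'):
    """Return the relative path of the punkt model that
    nltk.word_tokenize(text, language=...) would load, or raise
    ValueError for an unsupported language."""
    if language not in PUNKT_LANGUAGES:
        raise ValueError('No punkt model for %r' % language)
    return os.path.join(nltk_data_dir, 'tokenizers', 'punkt',
                        language + '.pickle')

print(punkt_pickle_path('italian'))
```

Note that Chinese is not in the list: punkt is trained on languages that separate words with spaces, so a language like Chinese would need a different tokenizer entirely.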

By default, both functions only support English text. This isn't really spelled out in the documentation, but you can see it by looking at the source code:

  1. The pos_tag() function loads a tagger from this file: 'taggers/maxent_treebank_pos_tagger/english.pickle'. (see here)

  2. The word_tokenize() function uses the Treebank tokenizer which uses regular expressions to tokenize text as in the (English) Penn Treebank Corpus. (see here)
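The regex-based approach of point 2 can be illustrated with a toy tokenizer. This is a simplified sketch of the idea, not NLTK's actual TreebankWordTokenizer rule set, which applies a much longer list of patterns:

```python
import re

def toy_treebank_tokenize(text):
    """Toy, Treebank-inspired tokenizer: split off punctuation and
    common English contractions with regular expressions."""
    # Isolate sentence punctuation with surrounding spaces.
    text = re.sub(r"([,.;:?!()\"])", r" \1 ", text)
    # Split off English contraction suffixes (e.g. don't -> do n't).
    text = re.sub(r"(\w)('s|n't|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    return text.split()

print(toy_treebank_tokenize("Don't panic, it's fine."))
# → ['Do', "n't", 'panic', ',', 'it', "'s", 'fine', '.']
```

Because the rules are English-specific (contractions, English punctuation conventions), the default word_tokenize behavior only really makes sense for English input.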

Suzana
  • But word_tokenize works well with Western languages. Does that mean I can use it for any language except those that don't separate words with spaces? – Michael Mar 01 '13 at 10:28
  • They explain the algorithm [here](http://www.cis.upenn.edu/~treebank/tokenization.html). As Western languages usually separate words with spaces, it should work fine. I've only tried it for German, and it looked okay. – Suzana Mar 01 '13 at 15:15