How can I extract lemma with pl196x from nltk for Polish in Python?

Question

I wrote a basic program and I want to extract lemmas for Polish words, which is very important because Polish is a highly inflected language.

I wrote a simple program that loads the data, but I do not know how to convert a word into its lemma:

from nltk import corpus

pl = corpus.pl196x
print(dir(pl))  # list the attributes and methods of the corpus reader
print(next(iter(pl.tagged_words())))  # show the first tagged word

For example, I want lemmatization like this (ignore the fact that some lemmatization can be ambiguous; that is normal in Polish):

kot, kota, kota, kotu, kotem, kocie, kocie == kot (singular masculine)
kotka, kotki, kotkę, kotce, kotką, kotce, kotka == kot (singular feminine!)
kociątko, kociątka, kociątko, kociątku, kociątkiem, kociątku, kociątko == kot (singular neuter)
etc. (plural masculine, plural feminine, plural neuter)

How can I use pl196x, or NLTK in general, to do this?

The source data for the corpus contains these inflections and lemmas, so it should be possible, but how do I access them?

<w id="pu147125" lemma="kot" ana="SSNA---------P">kot</w>
<w id="pr021633" lemma="kot" ana="SSAA---------P">kota</w>
etc.

score 6 · Answer 1 · answered Jun 28 '15 at 20:56

I don't know if it is possible in NLTK itself. When I had a similar problem, I used Morfeusz (http://sgjp.pl/morfeusz/). Although I have only used the C version of the library, there is a Python binding for it (http://jwilk.net/software/python-morfeusz).
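
Separately from Morfeusz, and only as a rough sketch: since the `<w>` elements quoted in the question already carry a `lemma` attribute, one possible fallback is to read them with the standard library instead of going through NLTK. The `<s>` wrapper, the `SAMPLE` string, and the `build_lemma_index` helper below are hypothetical names made up for illustration; the real corpus files may use a different root element or XML namespaces, so the tag lookup would likely need adjusting.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: the <s> root is invented for this sketch;
# the two <w> elements are copied from the question.
SAMPLE = """<s>
<w id="pu147125" lemma="kot" ana="SSNA---------P">kot</w>
<w id="pr021633" lemma="kot" ana="SSAA---------P">kota</w>
</s>"""


def build_lemma_index(xml_text):
    """Map each lowercased surface form to the set of lemmas seen for it."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    # Assumes un-namespaced <w> elements, as in the question's excerpt.
    for w in root.iter("w"):
        lemma = w.get("lemma")
        form = (w.text or "").strip().lower()
        if lemma and form:
            index[form].add(lemma)
    return dict(index)


index = build_lemma_index(SAMPLE)
print(index.get("kota"))  # lemmas recorded for the form "kota"
```

A set is kept per form because, as the question notes, a single Polish form can map to more than one lemma. A lookup built this way only covers forms that actually occur in the corpus, so a morphological analyzer such as Morfeusz remains the more general option.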

jaboja


Looks like that is the best idea if the new version does not support it yet. – Chameleon Jul 20 '15 at 12:07
