I am working with Python and would like to reduce certain words, mainly nationality adjectives and demonyms, to the country they refer to. Some examples of what I need:
- Spanish should give me Spain.
- English should give me England.
- American should give me America.
- Nigerian should give me Nigeria.
- Greeks (plural) should give me Greece.
- Puerto Ricans (plural) should give me Puerto Rico.
- Portuguese should give me Portugal.
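To make the target concrete, here is a small sketch of the behaviour I am after, written as a check that any candidate function could be run against. The names `EXPECTED`, `failures` and `unchanged` are ones I made up for illustration; they are not part of NLTK or any other library.

```python
# Expected word -> country pairs, copied from the examples above.
EXPECTED = {
    "Spanish": "Spain",
    "English": "England",
    "American": "America",
    "Nigerian": "Nigeria",
    "Greeks": "Greece",
    "Puerto Ricans": "Puerto Rico",
    "Portuguese": "Portugal",
}


def failures(candidate):
    """Return (word, got, wanted) for every example the candidate gets wrong."""
    results = []
    for word, country in EXPECTED.items():
        got = candidate(word)
        if got != country:
            results.append((word, got, country))
    return results


def unchanged(word):
    """Hypothetical baseline that leaves the token as it is."""
    return word


# The baseline gets every example wrong, so all seven pairs are reported.
print(failures(unchanged))
```

A solution would be any function for which `failures(...)` returns an empty list.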
I have experimented a bit with the Porter, Lancaster and Snowball stemmers from the NLTK module. Porter and Snowball do not change the tokens at all, while Lancaster is too aggressive. For example, the Lancaster stem of American is "Am", which is badly butchered. I have also tried the WordNet lemmatizer, with no success.
Is there a way to get the above results, even if it only works for countries?