Smart stemming/lemmatizing in Python for Nationalities

Question

I am working with Python, and I would like to find the roots of some words that mainly refer to countries. Some examples that demonstrate what I need are:

Spanish should give me Spain.
English should give me England.
American should give me America.
Nigerian should give me Nigeria.
Greeks (plural) should give me Greece.
Puerto Ricans (plural) should give me Puerto Rico.
Portuguese should give me Portugal.

I have experimented a bit with the Porter, Lancaster and Snowball stemmers of the NLTK module, but Porter and Snowball do not change the tokens at all, while Lancaster is too aggressive. For example, the Lancaster stem of American is "Am", which is pretty badly butchered. I have also played around with the WordNet lemmatizer, with no success.

Is there a way to get the above results, even if it only works for countries?

Have a look at [this comprehensive list](https://en.wikipedia.org/wiki/Demonym) on Wikipedia. – lenz Feb 03 '17 at 21:47
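A lookup table built from such a list could be used roughly as follows. This is only a minimal sketch: the `DEMONYMS` dictionary and the `demonym_to_country` helper are hypothetical names, the entries are typed in by hand from the examples in the question rather than scraped from Wikipedia, and the plural handling is a naive "strip a trailing s" rule.

```python
# Hypothetical hand-built table; in practice it would be filled from the
# Wikipedia demonym list rather than typed in by hand.
DEMONYMS = {
    "spanish": "Spain",
    "english": "England",
    "american": "America",
    "nigerian": "Nigeria",
    "greek": "Greece",
    "puerto rican": "Puerto Rico",
    "portuguese": "Portugal",
}


def demonym_to_country(word):
    """Return the country for a demonym, or None if it is not in the table."""
    key = " ".join(word.lower().split())  # normalise case and whitespace
    if key in DEMONYMS:
        return DEMONYMS[key]
    # Naive plural handling: "Greeks" -> "greek", "Puerto Ricans" -> "puerto rican".
    if key.endswith("s") and key[:-1] in DEMONYMS:
        return DEMONYMS[key[:-1]]
    return None
```

With this table, `demonym_to_country("Greeks")` looks up "greek" after the plural is stripped, and any word missing from the table yields `None` instead of a guessed stem.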

score 0 · Accepted Answer · answered Feb 03 '17 at 15:51

You might want to check out Unicode's CLDR (Common Locale Data Repository): http://cldr.unicode.org/

It has lists of territories and languages that might be useful, as you could map them together using their standard codes: ISO 639 for languages (en, de, fr, etc.) and ISO 3166 for territories, which share the same letters in many cases.

Here's a useful JSON repository:

https://github.com/unicode-cldr/cldr-localenames-full/tree/master/main/en

Check out the territories.json and languages.json files there.
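As a rough sketch of how those two files might be combined: the code below assumes each file keeps its names under `main` → `en` → `localeDisplayNames`, and it pairs a language with a territory only when the upper-cased language code happens to equal a territory code. The helper names and the tiny inline excerpts are hypothetical and are there only so the example runs without downloading anything.

```python
import json


def load_display_names(path, section):
    """Read one CLDR JSON file and return its code -> English name table.

    Assumes the layout main -> en -> localeDisplayNames -> <section>,
    where section is "territories" or "languages".
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data["main"]["en"]["localeDisplayNames"][section]


def language_to_territory(languages, territories):
    """Pair language names with the territory whose code has the same letters."""
    pairs = {}
    for code, language_name in languages.items():
        territory_name = territories.get(code.upper())
        if territory_name is not None:
            pairs[language_name] = territory_name
    return pairs


# Tiny hand-written excerpts standing in for the real files.
sample_languages = {"es": "Spanish", "pt": "Portuguese", "en": "English", "sv": "Swedish"}
sample_territories = {"ES": "Spain", "PT": "Portugal", "GB": "United Kingdom", "SV": "El Salvador"}

pairs = language_to_territory(sample_languages, sample_territories)
```

On this sample, Spanish and Portuguese pair up as hoped, English finds no territory at all (there is no "EN" code), and Swedish is wrongly paired with El Salvador because "sv" and "SV" merely share letters. Any real use would need a sanity check on the pairs.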

PrettyHands

I think the OP is talking about country adjectives (like "Spanish wine"), not languages – which don't have such a nice 1:1 mapping... (many countries for the same language, and many languages for the same country) – lenz Feb 03 '17 at 21:49
I agree, but many country adjectives do map quite nicely to language names, and the ones that don't could mostly be discounted by checking their similarity with Levenshtein distance and falling back to (for example) a more suffix-based approach if they're too dissimilar. – PrettyHands Feb 03 '17 at 23:32
But using the Wikipedia list is a better way to go :) – PrettyHands Feb 03 '17 at 23:33
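
For completeness, the Levenshtein check with a suffix fallback mentioned in these comments might be sketched as below. The suffix list, the distance threshold and the function names are all assumptions made for illustration and would need tuning on real data. The fallback returns only a bare stem, which would still have to be matched against actual country names.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical suffix list and threshold; both are guesses, not tuned values.
SUFFIXES = ("ian", "ish", "ese", "an")
MAX_DISTANCE = 4


def strip_suffix(adjective):
    """Fallback: drop the first matching suffix, e.g. 'Swedish' -> 'Swed'."""
    for suffix in SUFFIXES:
        if adjective.lower().endswith(suffix):
            return adjective[: -len(suffix)]
    return adjective


def resolve(adjective, candidate):
    """Keep the candidate country if it is close enough, else fall back."""
    if levenshtein(adjective.lower(), candidate.lower()) <= MAX_DISTANCE:
        return candidate
    return strip_suffix(adjective)
```

Here `resolve("Spanish", "Spain")` keeps the candidate because the two strings are only three edits apart, while a mismatched pair such as "Swedish" and "El Salvador" is too dissimilar and falls through to the suffix rule.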
