Before getting to the answer, it helps to know how the WordNet interface in NLTK works; see http://www.nltk.org/howto/wordnet.html
WordNet is indexed by concepts (synsets), each of which can be represented by different words and carries semantic information about that concept. The WordNet interface in NLTK lets you look up the concepts that a word can represent, e.g.:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> for ss in wn.synsets('dog'):
...     print(ss, ss.definition())
...
Synset('dog.n.01') a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Synset('frump.n.01') a dull unattractive unpleasant girl or woman
Synset('dog.n.03') informal term for a man
Synset('cad.n.01') someone who is morally reprehensible
Synset('frank.n.02') a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
Synset('pawl.n.01') a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
Synset('andiron.n.01') metal supports for logs in a fireplace
Synset('chase.v.01') go after with the intent to catch
To access all synsets in wordnet:
wn.all_synsets()
For each synset, there are different methods you can call to look up information about it, e.g.:
>>> ss = wn.synsets('dog')[0] # First synset for the word 'dog'
>>> ss.definition()
'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
>>> ss.hypernyms()
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
>>> ss.hyponyms()
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
>>> ss.name()
'dog.n.01'
>>> ss.lemma_names() # Other words that can represent this concept.
['dog', 'domestic_dog', 'Canis_familiaris']
So you can do it with a one-liner, although it's not very readable:
sorted(ss.name() for ss in wn.all_synsets() if len(ss.name())>18)
But note that this only gives you the names under which the synsets are indexed, not all the lemma names. Also, you're including the POS tag and the index ID (i.e. the .s.01 in the synset's indexed name absorbefacient.s.01) when you check for len(ss.name()) > 18.
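To see how much the suffix inflates the length, here is a minimal sketch in plain Python (no NLTK needed). The helper split_synset_name is a hypothetical name introduced only for this illustration, and it assumes the indexed name always has the form lemma.pos.nn:

```python
def split_synset_name(name):
    """Split an indexed name like 'lemma.pos.nn' into its three parts.

    Hypothetical helper for illustration. It splits from the right so that
    only the last two dots are treated as separators.
    """
    lemma, pos, sense = name.rsplit('.', 2)
    return lemma, pos, sense

name = 'absorbefacient.s.01'
lemma, pos, sense = split_synset_name(name)

print(len(name), len(name) > 18)    # length of the full indexed name
print(len(lemma), len(lemma) > 18)  # length of the lemma part alone
```

The full name passes the > 18 check only because of the five extra characters in .s.01; the lemma part on its own does not.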
So what you need is lemma_names() instead of name():
>>> from itertools import chain
>>> sorted(lemma for lemma in chain(*(ss.lemma_names() for ss in wn.all_synsets())) if len(lemma) > 18)
Alternatively, you can check the length while collecting the lemmas, before chaining and sorting them:
>>> sorted(chain(*([lemma for lemma in ss.lemma_names() if len(lemma)>18] for ss in wn.all_synsets())))
Note: By iterating through the synsets and calling lemma_names(), you will get duplicates (the same lemma can belong to several synsets), and you will also get some lemma names with an initial capital alongside others without.
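If the duplicates and the mixed capitalization matter, one possible way to deal with them is to lowercase each lemma and collect the results in a set. This is only a sketch: long_unique_lemmas is a hypothetical helper, and the sample list is made-up input standing in for the output of the lemma_names() loop, not real WordNet output.

```python
def long_unique_lemmas(lemma_names, min_len=18):
    """Return the sorted, lowercased, de-duplicated lemmas longer than min_len.

    Hypothetical helper for illustration; lowercasing is a design choice
    and discards the capitalization of proper nouns.
    """
    seen = set()
    for lemma in lemma_names:
        if len(lemma) > min_len:
            seen.add(lemma.lower())
    return sorted(seen)

# Made-up sample input with a duplicate and a caps-initial variant.
sample = [
    'electroencephalograph',
    'Electroencephalograph',
    'electroencephalograph',
    'dog',
    'Canis_familiaris',
]

result = long_unique_lemmas(sample)
print(result)
```

With NLTK available, the same helper could be fed the chained lemma_names() generator from the one-liners above instead of the sample list.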
And of course, you don't need to go through all that trouble, since there's a built-in method:
>>> sorted(lemma for lemma in wn.all_lemma_names() if len(lemma) > 18)