I need to identify complex words from a .txt file. I am trying to use nltk, but no such module exists on my system. Complex words are words in the text that contain more than two syllables.
-
Have you done anything to try to identify complex words? Do you have an example TXT file or sample code? What constitutes a complex word? – Daniel Gale Aug 30 '18 at 18:11
-
Welcome to Stack Overflow. What do you mean by "complex words"? Please see [ask]; the clearer and more complete your question is, the more likely you'll get a helpful response. – ChrisGPT was on strike Aug 30 '18 at 18:12
-
Can you post a sample of this text file? I'm not sure what you mean by "complex words". – nnyby Aug 30 '18 at 18:12
-
See [this previous answer](https://stackoverflow.com/a/405179/9518258). Your best bet is to either implement the algorithm described in the dissertation linked to there, or to use a dictionary which includes this kind of metadata. – 0xdd Aug 30 '18 at 18:23
-
A better question: `What is a "complex word"`? See https://arxiv.org/abs/1804.09132 =) – alvas Aug 31 '18 at 01:32
1 Answer
I would use Pyphen. This module has a Pyphen class used for hyphenation. One of its methods, positions(), returns the positions in a word where it can be split:
>>> from pyphen import Pyphen
>>> p = Pyphen(lang='en_US')
>>> p.positions('exclamation')
[2, 5, 7]
If the word "exclamation" can be split in three places, it has four syllables, so to find complex words (more than two syllables) you just need to keep the words with more than one split position.
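Putting it together, here is a minimal sketch of that filter; the file name input.txt and the simple regex tokenizer are assumptions, not something from the question:

>>> import re
>>> from pyphen import Pyphen
>>> p = Pyphen(lang='en_US')
>>> text = open('input.txt').read()          # read the whole file
>>> words = re.findall(r"[A-Za-z']+", text)  # crude word tokenizer
>>> # more than one split position means more than two syllables
>>> complex_words = {w for w in words if len(p.positions(w.lower())) > 1}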
. . .
But I noticed you tagged this as an nltk question. I'm not experienced with NLTK myself, but the question suggested by @Jules has a nice suggestion for this: use the cmudict module. It gives you a list of pronunciations of a word in American English:
>>> from nltk.corpus import cmudict
>>> d = cmudict.dict()
>>> pronunciations = d['exasperation']
>>> pronunciations
[['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']]
Luckily, our first word has only one pronunciation. It is represented as a list of strings, each one representing a phoneme:
>>> phonemes = pronunciations[0]
>>> phonemes
['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']
Note that vowel phonemes have a number at the end, indicating stress:
Vowels are marked for stress (1=primary, 2=secondary, 0=no stress). E.g.: NATURAL 1 N AE1 CH ER0 AH0 L
So, we just need to count the number of phonemes with digits at the end:
>>> vowels = [ph for ph in phonemes if ph[-1].isdigit()]
>>> vowels
['EH2', 'AE2', 'ER0', 'EY1', 'AH0']
>>> len(vowels)
5
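So a word is complex when this count is greater than two. As a rough sketch of the same filter with cmudict (again assuming an input.txt file and a naive tokenizer; words missing from the dictionary are skipped, and the longest pronunciation is used):

>>> import re
>>> from nltk.corpus import cmudict
>>> d = cmudict.dict()
>>> def syllable_count(word):
...     # count vowel phonemes (the ones ending in a stress digit), per pronunciation
...     return max(sum(ph[-1].isdigit() for ph in pron) for pron in d[word])
...
>>> words = re.findall(r"[A-Za-z']+", open('input.txt').read())
>>> complex_words = {w for w in words if w.lower() in d and syllable_count(w.lower()) > 2}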
. . .
Not sure which is the best option, but I guess you can work your problem out from here.
