I need to identify complex words from a .txt file. I am trying to use nltk, but no such module exists on my system. Complex words are words in the text that contain more than two syllables.
-
Have you done anything to try to identify complex words? Do you have an example TXT file or sample code? What constitutes a complex word? – Daniel Gale Aug 30 '18 at 18:11
-
Welcome to Stack Overflow. What do you mean by "complex words"? Please see [ask]; the clearer and more complete your question is, the more likely you'll get a helpful response. – ChrisGPT was on strike Aug 30 '18 at 18:12
-
Can you post a sample of this text file? I'm not sure what you mean by "complex words". – nnyby Aug 30 '18 at 18:12
-
See [this previous answer](https://stackoverflow.com/a/405179/9518258). Your best bet is to either implement the algorithm described in the dissertation linked to there, or to use a dictionary which includes this kind of metadata. – 0xdd Aug 30 '18 at 18:23
-
A better question: `What is a "complex word"`? See https://arxiv.org/abs/1804.09132 =) – alvas Aug 31 '18 at 01:32
1 Answer
I would use Pyphen. This module has a Pyphen class used for hyphenation. One of its methods, positions(), returns the positions in a word where it can be split:
>>> from pyphen import Pyphen
>>> p = Pyphen(lang='en_US')
>>> p.positions('exclamation')
[2, 5, 7]
If the word "exclamation" can be split in three places, it has four syllables, so to find complex words (more than two syllables) you just need to keep the words with more than one split position.
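Putting it together, here is a minimal sketch of that filter; the file name input.txt and the simple regex tokenizer are assumptions, not something from the question:

>>> import re
>>> from pyphen import Pyphen
>>> p = Pyphen(lang='en_US')
>>> text = open('input.txt').read()          # read the whole file
>>> words = re.findall(r"[A-Za-z']+", text)  # crude word tokenizer
>>> # more than one split position means more than two syllables
>>> complex_words = {w for w in words if len(p.positions(w.lower())) > 1}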
. . .
But I noticed you tagged this as an nltk question. I'm not experienced with NLTK myself, but the question suggested by @Jules has a nice suggestion for this: use the cmudict module. It gives you a list of pronunciations of a word in American English:
>>> from nltk.corpus import cmudict
>>> d = cmudict.dict()
>>> pronunciations = d['exasperation']
>>> pronunciations
[['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']]
Luckily, our first word has only one pronunciation. It is represented as a list of strings, each one representing a phoneme:
>>> phonemes = pronunciations[0]
>>> phonemes
['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']
Note that vowel phonemes have a number at the end, indicating stress:
Vowels are marked for stress (1=primary, 2=secondary, 0=no stress). E.g.: NATURAL 1 N AE1 CH ER0 AH0 L
So, we just need to count the number of phonemes with digits at the end:
>>> vowels = [ph for ph in phonemes if ph[-1].isdigit()]
>>> vowels
['EH2', 'AE2', 'ER0', 'EY1', 'AH0']
>>> len(vowels)
5
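So a word is complex when this count is greater than two. As a rough sketch of the same filter with cmudict (again assuming an input.txt file and a naive tokenizer; words missing from the dictionary are skipped, and the longest pronunciation is used):

>>> import re
>>> from nltk.corpus import cmudict
>>> d = cmudict.dict()
>>> def syllable_count(word):
...     # count vowel phonemes (the ones ending in a stress digit), per pronunciation
...     return max(sum(ph[-1].isdigit() for ph in pron) for pron in d[word])
...
>>> words = re.findall(r"[A-Za-z']+", open('input.txt').read())
>>> complex_words = {w for w in words if w.lower() in d and syllable_count(w.lower()) > 2}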
. . .
Not sure which is the best option, but I guess you can work your problem out from here.
